Rare Words, Part 3: Recent Methods to Improve Learned Representations
By Nick McKenna • September, 2018
In this series of posts we discovered that word occurrences in natural language follow a heavily skewed, Zipf-like (power-law) distribution: a handful of words occur very frequently, yet most words occur very infrequently. We ran an experiment to highlight this phenomenon and showed that we can dramatically reduce the number of words we learn representations for with little to no impact on task performance (though this depends on the task). We then explored the semantics of our learned word vectors and compared models. We discovered that infrequent words tend to associate closely with more frequent words even when they’re less related semantically. This is because our models don’t account well for infrequent words: we have comparatively few examples to learn from, and we don’t extract enough from each example to overcome that limitation.
In this post we’ll review recent research on vector representation learning, focusing on techniques designed to improve the representation quality of rare words, or that at least provide this as a side effect. We’ll cover three kinds of methods: increasing attentiveness to rare words, sharing learned information between words, and improving overall representation quality.
In the model we used in our experiments, we constructed a mechanism which “read” reviews in order to categorize them as positive or negative. It did this by reading each review one sentence at a time, and each sentence one word at a time. We learned word embedding vectors through a basic version of gradient descent[1], which means each word was attended to and updated in proportion to how frequently it occurred in the corpus. This is the issue: we can’t learn enough about infrequent words by observing them only as they come up. Most words, in fact, are used only once in the whole corpus.
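To make this concrete, here’s a minimal sketch (not our actual model) of why plain gradient descent short-changes rare words: an embedding row only receives an update on the steps where its word appears, so update counts simply mirror corpus frequency. The corpus, dimensions, and gradients below are toy placeholders.

```python
import numpy as np
from collections import Counter

corpus = "the movie was great the plot was thin the acting was sublime".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(len(vocab), 8))  # one 8-d vector per word
updates = Counter()

learning_rate = 0.01
for word in corpus:                      # "read" the corpus one token at a time
    i = word_to_id[word]
    fake_gradient = rng.normal(size=8)   # stand-in for the real task gradient
    embeddings[i] -= learning_rate * fake_gradient
    updates[word] += 1                   # this row is touched exactly once per occurrence

print(updates.most_common())  # frequent words ("the", "was") get many updates; the rest get one
```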
Let’s now look at two examples that transform the architecture of the learning phase to give more attention to infrequent words.
GloVe[2] is a recent algorithm for generating word vectors. It’s based on co-occurrences of words over the entire training corpus, so it makes good use of global statistical information. The foundation of the algorithm is the co-occurrence matrix, where each cell counts how often word j appears in the context of word i. GloVe fits an embedding matrix to these co-occurrence statistics in a way that promotes meaningful substructure between similar words, which results in great performance on word analogy tasks.
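As a rough illustration, here is the GloVe objective written out as a small Python function: a weighted least-squares fit of word-vector dot products to log co-occurrence counts. Variable names and the loop structure are my own simplification; see the paper [2] for the real formulation and training procedure.

```python
import numpy as np

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    """X[i, j] = co-occurrence count of word j in the context of word i."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):                      # only pairs that actually co-occur
        weight = min((X[i, j] / x_max) ** alpha, 1.0)     # weighting function from the paper
        error = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        loss += weight * error ** 2
    return loss
```

The weighting term caps how much very frequent pairs can dominate the loss, while still down-weighting co-occurrences so rare that they’re likely noise.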
What’s interesting here is that an infrequent word may occur only once in the corpus, yet it is recorded in the context of every one of its neighboring words. Because GloVe makes pairwise comparisons between a word and each of its context words, infrequent words get more exposure during training than a single appearance would suggest. Frequent words don’t get the same relative boost: many of their context words repeat, so the extra exposure is small compared to their total number of appearances.
The Skip-gram model, while older than GloVe, still provides great insight. There are many variations of the model, but a popular addition called “Negative Sampling”[3] (a simplified form of Noise Contrastive Estimation) is used in the objective function to make training tractable. In the abstract, the algorithm tries to predict each input word’s context words out of the entire vocabulary. Much like our model, skip-gram scans the corpus one word at a time; however, each prediction involves scoring other vocabulary words as candidates. Done naively this is computationally infeasible: the model would have to compare every input word against every word in the vocabulary, which is extremely slow and mostly wasted effort.
Negative sampling (following the Noise Contrastive Estimation idea), however, selects only a few negative example words at each step. The goal of the model is thus to separate the positive examples (signal) from the negative ones (noise). In doing so the model attends to and updates the noise words as well as the target words, since both appear in the loss function. These noise words are drawn from a distribution close to the corpus’s unigram counts (in practice, unigram counts raised to the 3/4 power), so frequent words are still sampled as noise more often than rare ones. Even so, negative sampling has been shown to moderately improve the quality of learned vectors for infrequent words, since they get additional exposure in the learning process.
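Here is a small sketch of what that objective looks like for a single (input, context) pair, assuming we already have the relevant vectors in hand. The function names and shapes are illustrative, not from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_ctx, v_noise):
    """v_in: input word vector; v_ctx: true context vector; v_noise: (k, d) noise vectors."""
    positive = np.log(sigmoid(v_in @ v_ctx))            # pull the true pair together
    negative = np.log(sigmoid(-v_noise @ v_in)).sum()   # push the k noise words away
    return -(positive + negative)                       # minimized during training

# The k noise vectors are drawn from the smoothed unigram distribution described above,
# so only a handful of comparisons are made per step instead of one per vocabulary word.
```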
We’ve seen several methods of rebalancing the model to attend more to infrequent words. Let’s now take a look at a different approach to modeling word representations. Instead of compiling an isolated representation for every word in the vocabulary, recent research has successfully modeled words using smaller, atomic units: character n-grams (substrings)[4][5]. FastText[6] is a good example of this approach. The algorithm slices each vocabulary word into character n-grams and learns a vector for each one. These representations are shared, so learning one word can help learn other words. For instance, the word “predict” yields several substrings, including “pre”. Other words that contain “pre” (e.g. “prevent”, “prevalent”, “predicate”) will share the same representation for that substring. A word’s vector is then a combination of its substring vectors.
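A minimal sketch of the idea, with vectors randomly initialized purely for illustration (real fastText also hashes n-grams into a fixed-size table and includes the whole word as its own token):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    padded = f"<{word}>"                       # boundary markers, as in the paper
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

rng = np.random.default_rng(0)
ngram_vectors = {}                             # shared table: "pre" is one entry for all words

def word_vector(word, dim=8):
    grams = char_ngrams(word)
    for g in grams:
        ngram_vectors.setdefault(g, rng.normal(scale=0.1, size=dim))
    return sum(ngram_vectors[g] for g in grams)

print(char_ngrams("predict") & char_ngrams("prevent"))   # shared substrings like "<pre"
oov = word_vector("preconception")   # an out-of-vocabulary word still maps onto shared n-gram vectors
```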
This method has had a big impact on learned representations for uncommon words, since those words can efficiently reuse what was learned from common ones. Not only that, but even unseen words (“out-of-vocabulary” words) benefit from this technique if they contain previously learned n-grams.
Another interesting thing about this model is that substring slices are coarsely extracted (a sliding window captures both meaningful word roots and odd character combinations that cross morpheme boundaries), yet they yield a pleasantly intuitive result. Frequently, the most impactful substring vector in a word corresponds to a whole morpheme rather than an arbitrary n-gram. In the “prevent” example above, “pre” is likely more impactful than, say, “reven”. This is a good indication that the model has learned to compose word roots in a generalizable way.
Finally, it’s worth addressing the actual contents of the vector representations we learn. While most current research in augmenting representations focuses on overall performance and not specifically on rare word performance, it’s reasonable to expect that enriching the word vector space may impact the quality of rare word representations.
Let’s look at two recent papers which aim to enrich the learning process: CoVe and ELMo. The CoVe method[7] (short for “Context Vectors”) runs words through a pre-trained machine translation encoder (the encoder half of an encoder-decoder architecture[8]) to add contextual information to the learned word vectors. The pre-trained encoder contributes a layer of semantics from a task-specific model that a generic model might not be able to acquire on its own, effectively transferring (translating?) what was learned in one domain into a more generalizable set of universal embeddings.
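A rough sketch of the mechanics, assuming PyTorch: the paper feeds GloVe vectors through the two-layer BiLSTM encoder of a trained translation model and concatenates the result with the original GloVe vectors. The encoder below is randomly initialized just to show the shapes; CoVe’s benefit comes entirely from loading weights trained on translation data.

```python
import torch
import torch.nn as nn

glove_dim, hidden_dim, seq_len = 300, 300, 7
mt_encoder = nn.LSTM(glove_dim, hidden_dim, num_layers=2,
                     bidirectional=True, batch_first=True)   # stand-in for the pre-trained MT encoder

glove_vectors = torch.randn(1, seq_len, glove_dim)           # one sentence of GloVe embeddings
cove_vectors, _ = mt_encoder(glove_vectors)                  # (1, seq_len, 2 * hidden_dim)
combined = torch.cat([glove_vectors, cove_vectors], dim=-1)  # what downstream tasks consume
```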
ELMo[9] also seeks to add extra context to word embeddings. Instead of borrowing an encoder trained for a different task, ELMo builds a deep network of recurrent layers (a bidirectional language model) and forms embeddings by combining the outputs of each layer. Similar to how different layers of a convolutional network capture different semantics in an image, the lower layers here capture syntactic information and the higher layers capture semantic information. And since each word is processed in the context of its sentence, it also benefits from the syntax and semantics of its neighbors. It’s not far-fetched to imagine that rare words could benefit from these multiple streams of contextual information versus the single stream in our own model.
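Sketching just the combination step (with random placeholders standing in for the bidirectional language model’s activations), an ELMo-style embedding is a learned, softmax-weighted sum over the per-layer hidden states, scaled by a learned scalar:

```python
import numpy as np

num_layers, seq_len, dim = 3, 7, 1024
layer_outputs = np.random.randn(num_layers, seq_len, dim)   # placeholder biLM activations per layer

s = np.random.randn(num_layers)                 # per-layer scores, learned with the downstream task
weights = np.exp(s) / np.exp(s).sum()           # softmax-normalized layer weights
gamma = 1.0                                     # learned scalar that scales the whole vector

elmo_embeddings = gamma * np.tensordot(weights, layer_outputs, axes=1)   # (seq_len, dim)
```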
[1] https://en.wikipedia.org/wiki/Gradient_descent
[2] https://nlp.stanford.edu/projects/glove/
[3] https://arxiv.org/abs/1310.4546
[4] https://en.wikipedia.org/wiki/N-gram
[5] https://arxiv.org/abs/1607.04606
[6] https://github.com/facebookresearch/fastText
[7] https://arxiv.org/abs/1708.00107
[8] https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning
[9] https://allennlp.org/elmo