Rare Words, Part 2: Word Vector Semantics with Restricted Vocabulary

By Nick McKenna • August, 2018


In the last post we explored the first-order entropy of natural language. We created several machine learning models, each with a different limit on vocabulary size. In this post, we’ll explore the semantic effects of this restriction on the word vectors that each model learned.

As a refresher, in that experiment we found that word occurrences in natural language follow a highly skewed distribution: most words appear only rarely, while a small number are used constantly. We exploited this property by cutting infrequently used words out of our models for the task of sentiment analysis. In the most restricted version, we saved 90% of the space needed to represent the vocabulary while performing about the same on the task.

Since we are exploring word semantics, a natural first step is to compare the learned representations of words to each other, using human standards to judge similarity. Later we’ll explore the learning process itself by looking deeper into the numerical representations and comparing models.

Similarity Between Learned Words

The models simultaneously learned to classify movie reviews as positive or negative (sentiment analysis[1]) while learning vector representations of the words themselves. As a first step, let’s ask the fully trained models for their “idea” of words similar to a target word. We’ll retrieve the 30 words with the highest cosine similarity, i.e. the neighboring words whose vectors point in the most similar directions in the embedding space. Worth noting: since all our words are embedded in the same vector space and we are using a simple similarity function, the results may feel “unfiltered.” Words may be picked without regard for part of speech, tense, or other syntactic information, which can look strange since we’re used to thesauruses pre-filtering results according to this metadata. Emphasis is my own: I’ve bolded words which I personally believe are similar or strongly related to “movie”.
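
To make the lookup concrete, here is a minimal sketch of the nearest-neighbor query in Python. It assumes an embeddings matrix of shape (vocab_size, dim) holding a model’s learned word vectors, plus word_to_idx and idx_to_word lookups; these names are placeholders for illustration, not the original code.

    import numpy as np

    def most_similar(word, embeddings, word_to_idx, idx_to_word, top_n=30):
        """Return the top_n words with the highest cosine similarity to `word`."""
        target = embeddings[word_to_idx[word]]
        # Normalize every vector so a dot product equals cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.clip(norms, 1e-8, None)
        sims = unit @ (target / np.linalg.norm(target))
        # Exclude the target word itself, then take the highest-scoring rows.
        sims[word_to_idx[word]] = -np.inf
        best = np.argsort(-sims)[:top_n]
        return [(idx_to_word[i], float(sims[i])) for i in best]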

Target word: “movie”

6,000 Word Vocabulary

stole, these, championship, kate, peck, suggests, personal, between, shoes, six, portray, jenny, on, belief, guilty, mars, hard, fairly, robot, angle, home, nuclear, clad, apart, spite, cute, wayne, compassion, reynolds, football

8,000 Word Vocabulary

is, behavior, decent, titanic, jo, relentlessly, usa, bosses, murdering, high, near, exquisite, observe, handled, combination, pretty, bacon, latin, suggests, hippie, sebastian, republic, chooses, carpenter, omar, rabbit, honesty, akin, cameo, acted

10,000 Word Vocabulary

bats, producing, religion, disappears, plots, operatic, drops, displayed, mamet, gesture, joining, support, press, handicapped, this, capsule, serving, douglas, horror, role, basic, imagine, hires, autobiography, courageous, result, signature, trap, hoover, less

Full Vocabulary*

1969, potato, enchanted, overstatement, bimbo, account, discussion, fleet, aloofness, sawdust, aborted, kaplan, loopy, incendiary, standup, santo, bee, coordinate, firstly, ornithochirus, gap, pointe, seventy, deadbeat, floating, mcphillips, replete, crossings, vereen, uncontrollably

* All restricted vocabulary models use word vectors from epoch 5, whereas the full vocabulary model uses epoch 2, because performance peaked then during testing (see original experiment).

We used “movie” here because we trained on the IMDB movie review dataset, so we should expect good results from a thematically similar target word. It’s worth noting that our models were optimized for a “coarse-grained” binary classification task, so the resulting vector representations (and thus the similar words) may not be as nuanced as those from models optimized for word similarity, such as word2vec[2] or GloVe[3].

In this example, it looks as if the reduced vocabularies produce more relevant words, and there may be a “goldilocks” vocabulary size somewhere around 10,000 words (remember that the full vocabulary contains about 101,000 unique words).

Frequent Words vs. Infrequent Words

Let’s think more about why some models appear to do better than others. Keeping with the theme of “rare words”, let’s run this experiment again but instead of simply reporting similar words, let’s quantify result words by their occurrence frequency in the corpus. We’ll plot the average frequency of words similar to a target word, over 15 training epochs for each model. Let’s compare target words with similar semantics in English but which have different frequencies in the corpus.
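
Below is a rough sketch of how this measurement could be computed, assuming one saved embedding matrix per training epoch (embeddings_by_epoch, a list of arrays) and a word_counts dictionary mapping each word to its corpus count; most_similar is the helper sketched earlier. All of these names are illustrative.

    def mean_neighbor_frequency(word, embeddings_by_epoch, word_to_idx,
                                idx_to_word, word_counts, top_n=30):
        """Average corpus frequency of the target word's nearest neighbors,
        computed separately for each training epoch."""
        means = []
        for emb in embeddings_by_epoch:
            neighbors = most_similar(word, emb, word_to_idx, idx_to_word, top_n)
            freqs = [word_counts[w] for w, _ in neighbors]
            means.append(sum(freqs) / len(freqs))
        return means  # one value per epoch, ready to plot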

First, we’ll look at “movie” (44,047 occurrences), “cinema” (1,494), and “theater” (828).

While it’s hard to say from these plots alone how the semantics compare between the reduced models, there does seem to be a noticeable difference between the reduced vocabularies and the full vocabulary. Interestingly, the full vocabulary model associates low-frequency words with the target word regardless of the target word’s own frequency, whereas the reduced models associate higher-frequency words as the frequency of the target word goes up.

Thinking back to the “similar words” produced by the full vocabulary model for the word “movie” above, it’s not surprising to find out that “aloofness”, “incendiary”, and “ornithochirus” have low frequencies in the corpus. These are rare words in everyday English, and may be even more rare in a movie database like IMDB.

To further illustrate this point, see the appendix below for more comparison plots of (“girl”, “woman”, “lady”) and (“man”, “guy”, “boy”).

Frequency Floor

At this point you might be wondering, “Wait, aren’t we removing low-frequency words from the reduced models? Maybe those models are simply unable to pick words with frequencies as low as in the full model?” It turns out that even the most restricted model (6,000 words) contains words with frequencies as low as 65 in the above example, so this isn’t the case. In fact, for higher-frequency target words the restricted models almost never pick similar words at the model’s “frequency floor.” The full vocabulary model, on the other hand, commonly does, picking words that occur only once in the whole corpus.

Do Frequent Words Stand Out?

So why does the full vocabulary model associate high-frequency words with lower-quality, low-frequency words? We can narrow this down to two possible phenomena in the full vocabulary model. Remember, to define “similarity” we take the closest neighbor words in the embedding vector space.

  1. Higher-frequency words end up farther apart from each other
  2. Lower-frequency words end up closer to high-frequency words

To test this, we’ll look at the top 2,000 most frequent words (these are the same regardless of the model). For these top words, we’ll find the average standard deviation across dimensions in the embedding space. This will give us a simplified measure of spread.

If the distributions of frequent words look similar between the reduced models and the full model, we may reason that (2) is correct. However, if the spread is significantly different in the full vocabulary than in the reduced ones, we may reason that (1) may be correct (and (2) may still be as well).
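
One way to read this measure: for each embedding dimension, take the standard deviation of the chosen words’ values, then average over the dimensions. A hedged sketch, reusing the placeholder names from above:

    import numpy as np

    def average_spread(words, embeddings, word_to_idx):
        """Mean per-dimension standard deviation over the given set of words."""
        vectors = np.stack([embeddings[word_to_idx[w]] for w in words])
        return float(np.std(vectors, axis=0).mean())

    # Top 2,000 most frequent words (the same set for every model).
    top_words = sorted(word_counts, key=word_counts.get, reverse=True)[:2000]
    print(average_spread(top_words, embeddings, word_to_idx))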

Top Word Spread

6,000 Word Vocabulary: 0.0962

8,000 Word Vocabulary: 0.1040

10,000 Word Vocabulary: 0.0949

Full Vocabulary (101,100 words): 0.0981

From this data it seems the full vocabulary model learns high-frequency word representations with roughly the same spread as the reduced models. You can imagine this scenario as a forest with tall trees (our high-frequency words) and underbrush (low-frequency words). Since the trees aren’t spaced any farther apart, there must simply be more underbrush surrounding each tree, which causes the model to commonly select low-frequency words as similar to high-frequency words.

Do Infrequent Words Stand Out?

Let’s focus again on the low end of the word frequency spectrum. We know from testing that low-frequency words often appear as “similar” to unrelated words (a sign that they haven’t been learned well). We can run a similar test of spread as above to get a sense of what is happening with these words numerically. Let’s find the same average deviation for each model like we did above, but for the bottom 20% of words (by frequency).
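
A sketch of the low-frequency slice, again with the hypothetical word_counts dictionary and the average_spread helper from above:

    def bottom_fraction_spread(fraction, embeddings, word_to_idx, word_counts):
        """Average per-dimension spread over the least frequent `fraction` of words."""
        ranked = sorted(word_counts, key=word_counts.get)  # least frequent first
        subset = ranked[:int(fraction * len(ranked))]
        return average_spread(subset, embeddings, word_to_idx)

    print(bottom_fraction_spread(0.20, embeddings, word_to_idx, word_counts))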

Bottom 20% Word Spread

6,000 Word Vocabulary: 0.0885

8,000 Word Vocabulary: 0.0930

10,000 Word Vocabulary: 0.0849

Full Vocabulary: 0.0523

We can see clearly that the most infrequent words in the full vocabulary are spread much less widely than those in the reduced vocabularies. What’s more, as we raise the threshold (from 20% to 30%, 50%, and so on) the spread increases; at 100% (all words included), the full vocabulary spread is 0.0722. It seems words with greater frequencies are drawn from a more spread-out distribution.
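
The threshold sweep mentioned above is then a short loop over the same hypothetical helper:

    for fraction in (0.2, 0.3, 0.5, 1.0):
        print(fraction, bottom_fraction_spread(fraction, embeddings, word_to_idx, word_counts))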

Conclusion

Through some careful analysis we have learned something about how the full vocabulary model behaves compared to the reduced models. It appears that in the full model, low-frequency words are distributed densely amongst high-frequency words. While this in itself isn’t bad, it leads the model to select poorly learned words when asked for words similar to a target. As a stopgap we can simply remove these infrequent words, as we do in the reduced models, but a better solution would be to improve the learned representations of these words instead. That would allow us to keep the full vocabulary and still make quality inferences about our data.

We also observed that low-frequency words have significantly smaller spread than frequent words. We can’t say concretely that this factor defines whether a word has been learned or not, but it does appear to be linked. We might be able to compare a word’s vector representation to these prior distributions for use in a classifier.

Appendix

Plots of “man” (5,979 occurrences), “guy” (3,036), and “boy” (1,560):

Plots of “girl” (2,854 occurrences), “woman” (2,796), and “lady” (848):


[1] https://en.wikipedia.org/wiki/Sentiment_analysis

[2] https://code.google.com/archive/p/word2vec/

[3] https://nlp.stanford.edu/projects/glove/