Rare Words, Part 1: Discovering Order in Natural Language
Leveraging hidden distributional organization to improve vocabulary-based tasks
By Nick McKenna • August, 2018
Language is one of the most powerful communication tools we have. Although expressive, language is nuanced and ambiguous, which is part of what makes it so difficult for us to create computer models that master it. In this post we’ll explore some of the challenges when using natural language data in a machine learning setting. We’ll use a data-driven approach based on information theory, which will help us understand fundamental properties of communicating through language.
I’ll assume you have basic knowledge of machine learning, but will try to explain things in an approachable way. I’ll pose a task as motivation for this analysis in which we’ll use a machine learning model and explore optimization methods. I will give some basic details of the model, with full specifics in an appendix at the end. However, we will be focusing on data and not the model itself. It’s safe to think of it as a black box.
Our problem is to examine a movie review and classify it as a positive or negative review (sentiment analysis[1]).
Inside the black box, our "very sophisticated" model will scan the words in each review, perform calculations with hidden parameters, and then make a decision. During training, it will also learn a representation for each word to be used in those calculations.
We’re using the IMDB dataset[2] of movie reviews for our data. We’ll score the model on its accuracy: the percentage of reviews it classifies correctly.
If our model learns to tell the difference between a good review and a bad one, we expect its accuracy to approach 100% (it almost always predicts a given review’s sentiment correctly). If it fails to learn, it will average around 50%, not 0%: our data has a 50/50 split between positive and negative examples, so a model that hasn’t learned to tell the difference performs no better than picking randomly between the two choices.
We’ll run this model over the data for 5 epochs (an epoch is one complete pass over the entire training set) and report the model accuracy by the end of each epoch.
| Epoch | Training Accuracy |
| --- | --- |
| 1 | 60.59% |
| 2 | 89.87% |
| 3 | 95.30% |
| 4 | 97.84% |
| 5 | 98.83% |

Testing Accuracy: 83.19%
At first, things seem great: the model begins to learn about the training data and significantly improves its accuracy by the end of each training epoch. Then things go wrong. After training, we run the model on an as-yet unseen dataset, the test set. The test set demonstrates how the model performs on novel data, and is ultimately the only metric we care about. The results of this trial aren’t so good.
Though the training accuracy seemed near-perfect, the testing accuracy was much lower, which indicates that our model overfit during training. Overfitting occurs when a model learns too much about its training data, learning more than generalizable patterns and beginning to memorize the errors or “noise” in the input signal. This results in poor test performance, because the test set will likely have different “noise” than the training set.[3]
There are many ways to lessen overfitting. Since our text corpus is large, we may get big improvements simply by being smarter about how we use it. We’d like to use what we discover about the data to reduce the model’s variance and improve movie review classification.
Our data is natural language, so let’s learn a bit about the words in this corpus and their distribution, which we hope will give insight to the overall dataset.
First, we’ll identify the size of our vocabulary (the number of unique words in the corpus). By my count, IMDB contains over 101,000 unique words. However you count it, the size of this vocabulary is quite high! It’s amazing that we can learn so many different words and their meanings. Or can we?
Since we have our explorer hats on, let’s pose a hypothesis: the vocabulary is perfectly learned by our model and is used to full effect in the task. To test this theory we’ll do a simple experiment. We’ll run the model again, but this time we’ll cut down our vocabulary size to 20% of the total (roughly 20,000 unique words). When we see an “unknown” word (outside the 20,000 word vocabulary we just defined), we will treat it as a special, default word type. This process is also called “UNK’ing”, because the word type is conventionally set to “UNK”. If our hypothesis is correct we should expect to see worse results than before, which would indicate that a larger vocabulary (such as the full vocabulary) is useful in the discrimination task.
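To make this preprocessing concrete, here is a minimal sketch of UNK’ing in Python, assuming the reviews have already been tokenized into lists of words (the helper names are illustrative, not the exact code used for these experiments):

```python
UNK = "UNK"          # the special default word type
VOCAB_SIZE = 20_000  # roughly 20% of the ~101,000 unique words

def build_vocab(tokenized_reviews, vocab_size=VOCAB_SIZE):
    """Keep the first `vocab_size` unique words encountered in the corpus."""
    vocab = set()
    for review in tokenized_reviews:
        for word in review:
            vocab.add(word)
            if len(vocab) == vocab_size:
                return vocab
    return vocab

def unk_review(review, vocab):
    """Replace any word outside the vocabulary with the UNK type."""
    return [word if word in vocab else UNK for word in review]
```

Every review still has a token at every position, but the model now only has to learn representations for 20,000 word types (plus UNK). With the reduced vocabulary in place, we train again: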
| Epoch | Training Accuracy |
| --- | --- |
| 1 | 64.93% |
| 2 | 89.46% |
| 3 | 93.80% |
| 4 | 96.10% |
| 5 | 97.58% |

Testing Accuracy: 84.46%
Ironically, by using less data our model performs no worse at discriminating reviews (in fact, it may do slightly better). This confirms that our model was previously overfitting: it tried to learn more about the training data than was needed to capture the generalizable patterns, and ended up memorizing noise as well.
We can reframe this statement to get a different view of the situation: perhaps the model tried to learn too many parameters about the data, or perhaps there isn’t enough information in our data to learn so many parameters in the model. Despite having such a large corpus, we are facing an issue of data sparsity.
How can this be? To illustrate the issue, let’s look at a graph. Below are the top 600 most frequently used words out of all the reviews. Can you guess which word is used the most (almost 700,000 times)?
We can see that after the first few hundred most common words, the frequency of a word in our corpus drops toward 1. Now consider that 100,400 more unique words were cropped out of this graph! It’s easier to see why our model may have had trouble learning the data — so many of the words are observed only a few times each.
We can see this issue has two sides. We have many words that are superfluous in our task, but perhaps they are superfluous because we can’t learn enough about them to make them useful.
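A frequency table like the one below is cheap to compute once the corpus is tokenized. A minimal sketch (again assuming tokenized reviews; exact counts depend on the tokenization used):

```python
from collections import Counter

def frequency_table(tokenized_reviews):
    """Count how many times each word type appears across all reviews."""
    counts = Counter()
    for review in tokenized_reviews:
        counts.update(review)
    return counts

# counts.most_common(5) gives the five rows of the table below, and
# sum(1 for c in counts.values() if c <= 5) shows how many words are barely seen at all.
```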
Frequency Table: IMDB Reviews (Top 5 Words)

| Rank | Word | Frequency |
| --- | --- | --- |
| 1 | the | 667,993 |
| 2 | and | 324,441 |
| 3 | a | 323,030 |
| 4 | of | 289,410 |
| 5 | to | 268,125 |
This effect is called “Zipf’s Law.” It states that in natural language a small set of common words appears vastly more often than the rest, often by orders of magnitude. Specifically, a word’s frequency is “inversely proportional to its rank in the frequency table.”[4] This is a natural property of language which sheds some light on the structure of efficient communication.
Notice how the top 5 words in the frequency table are “function words.” They hold little meaning themselves, but serve to organize other “content words” in a sentence.[5] These words are used to create structure, which is necessary for communicating complex ideas. In fact, though they are frequently repeated there is an optimal amount of repetition in natural language which allows a speaker to communicate complex ideas with minimal effort.[6]
If we convert our frequency graph to log-log scale, a linear trend line appears. This is (a small portion of) the classic Zipf distribution. Interestingly, many natural languages, such as Russian, Arabic, and written Chinese, share this distribution of vocabulary in everyday communication.[7] In fact, several animal communication systems also show Zipfian distributions. The bottlenose dolphin is one such example, whose diverse vocabulary of whistles has been analyzed and compared to human speech.[8] Maybe we should translate these IMDB reviews to dolphin, so they can help us decide on a movie to watch!
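If you’d like to reproduce the log-log view yourself, a sketch along these lines works, reusing the `counts` Counter from earlier (matplotlib is assumed here; it is not part of the original experiment code). The dashed reference line is the idealized Zipf curve, f(rank) = f(1) / rank, which the empirical frequencies should roughly parallel:

```python
import matplotlib.pyplot as plt

def plot_zipf(counts):
    """Plot word frequency against frequency rank on log-log axes."""
    freqs = sorted(counts.values(), reverse=True)   # frequency by rank
    ranks = range(1, len(freqs) + 1)
    ideal = [freqs[0] / r for r in ranks]           # idealized Zipf curve

    plt.loglog(ranks, freqs, label="IMDB word frequencies")
    plt.loglog(ranks, ideal, linestyle="--", label="Zipf reference (1/rank)")
    plt.xlabel("Frequency rank (log scale)")
    plt.ylabel("Frequency (log scale)")
    plt.legend()
    plt.show()
```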
The analysis we just did comes from information theory, and in this case we learned about the organization or “entropy” of our corpus text. Let’s bring this back to our model. Since we now know most of the IMDB vocabulary is used infrequently, we can make smarter decisions about which words to keep in our model vocabulary and which ones to throw away.
In our last iteration, we simply kept the first 20% of unique words seen in the corpus as our vocabulary — let’s now select the top 10% across the whole dataset instead, keeping only the most-used words. What happens when we add these further constraints, but focus the model on only high-use words?
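The only change to the preprocessing is how the vocabulary is selected: instead of the first 20% of unique words encountered, we keep the most frequent 10%. A sketch, reusing the frequency counts and the `unk_review` helper from earlier:

```python
def top_k_vocab(counts, k=10_000):
    """Keep the k most frequent word types as the model vocabulary."""
    return {word for word, _count in counts.most_common(k)}

# vocab = top_k_vocab(counts, k=10_000)   # roughly 10% of the unique words
# reviews = [unk_review(review, vocab) for review in tokenized_reviews]
```

Retraining with this frequency-ranked vocabulary gives: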
| Epoch | Training Accuracy |
| --- | --- |
| 1 | 66.40% |
| 2 | 89.56% |
| 3 | 93.20% |
| 4 | 94.80% |
| 5 | 96.61% |

Testing Accuracy: 85.67%
Our model seems to have improved again. Of interest here is that despite keeping only 10% of our vocabulary, we still kept 94.7% of the corpus text (meaning the other 90% of unique words account for only 5.3% of all the word occurrences in the corpus).
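The 94.7% figure is easy to verify: count what fraction of all word occurrences (tokens) in the corpus fall inside the reduced vocabulary. A sketch, assuming the `counts` and `top_k_vocab` helpers above:

```python
def token_coverage(counts, vocab):
    """Fraction of all word occurrences covered by the given vocabulary."""
    total_tokens = sum(counts.values())
    covered = sum(freq for word, freq in counts.items() if word in vocab)
    return covered / total_tokens

# token_coverage(counts, top_k_vocab(counts, 10_000))  -> about 0.947 for this corpus
```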
At this point, we have more questions than answers. By cutting out low-use words, it seems we are able to eliminate noise and focus the model only on what is essential for the task. However, we can’t fully understand what’s happening with only a few trials. We need more data.
We have mounting evidence that reducing the vocabulary helps in the discrimination task. Let’s test this idea further so we can better understand it.
We’ll run the experiment again with different model configurations. We’ll train one model with a full vocabulary, one with a vocabulary of the top 10,000 most frequent words, one with the top 8,000 words, and one with the top 6,000. For robustness, we’ll run 10 trials for each version. Additionally, we’ll partition out a small chunk of our test data to validate the model after each training epoch. This will act as a “mini” test to give us a sense for how the model would perform on the full test set at each point in the training.
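As a sketch of this protocol (not the actual experiment code), the structure looks like the following; `load_data`, `build_model`, `train_one_epoch`, and `evaluate` are placeholders for whatever preprocessing and training code you already have:

```python
import statistics

VOCAB_SIZES = [None, 10_000, 8_000, 6_000]   # None means the full vocabulary
NUM_TRIALS = 10
NUM_EPOCHS = 5

results = {}
for vocab_size in VOCAB_SIZES:
    val_curves = []
    for trial in range(NUM_TRIALS):
        train, val, test = load_data(vocab_size=vocab_size)   # placeholder helpers
        model = build_model()
        curve = []
        for epoch in range(NUM_EPOCHS):
            train_one_epoch(model, train)        # one full pass over the training set
            curve.append(evaluate(model, val))   # validation accuracy after this epoch
        val_curves.append(curve)
    # Per-epoch mean and standard deviation across the 10 trials
    # (this is what the error bars below represent).
    results[vocab_size] = [
        (statistics.mean(vals), statistics.stdev(vals))
        for vals in zip(*val_curves)
    ]
```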
We see above that the full vocabulary helps our model learn best during training. Not so surprising given what we’ve seen so far. In the other models, we erase the words that are not in our vocabulary, which prevents the model from learning optimally during training. However, there’s more to the story. Let’s look at our validation accuracy after each epoch. Remember, the validation data is not used for training, only testing. We run the validation test after each epoch.
The error bars here represent one standard deviation from the mean of all 10 trials. For clarity, let’s remove the error bars so we can zoom in.
The tables have turned! We see our reduced models reaching peaks about as high as the full vocabulary model.
Looking closer, the full vocabulary model tops out at epoch 2, after which its validation accuracy declines. This is further evidence that the original model overfit to the training data: as we trained more and more, it got better on the training data and worse on held-out data.
To play fair, let’s imagine we use “early stopping”, which is a technique that works exactly as it sounds. We’ll “stop” each model at its best validation accuracy and compare only these values.
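In this offline comparison, early stopping just amounts to picking each model’s best validation epoch after the fact. A minimal sketch, using the hypothetical `results` structure from the experiment loop above:

```python
def best_validation_accuracy(results):
    """For each vocabulary size, find the epoch with the highest mean validation accuracy."""
    best = {}
    for vocab_size, curve in results.items():
        means = [mean for mean, _stdev in curve]
        best_epoch = max(range(len(means)), key=means.__getitem__)
        best[vocab_size] = (best_epoch + 1, means[best_epoch])   # report epochs as 1-indexed
    return best
```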
Even in this case, the reduced models perform better than the full model, if only slightly. Finally, let’s review test results from the full testing phase (which is four times as large as the validation test data). There’s little reason to consider the results from the full vocabulary model since we now know that it was significantly crippled by overfitting during training.
From these results, the smaller 6,000 word vocabulary appears to be the optimal choice among the models tested for this task.
By leveraging structural knowledge of natural language we have successfully created a modest but effective improvement in classifier performance for sentiment analysis.
Most significantly, we are able to save at least 90% of the space used for storing word representations, since we can discard the rare words without hurting model performance on the task. This is a space-time tradeoff: we can cut these model parameters at the expense of doing more training.
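As a rough back-of-the-envelope check (assuming 32-bit floats and the 20-dimensional word embeddings described in the appendix; these are assumptions about storage, not measurements), the embedding table alone shrinks from roughly 8 MB to roughly half a megabyte:

```python
EMBEDDING_DIM = 20     # word embedding size from the appendix
BYTES_PER_FLOAT = 4    # assuming 32-bit floats

def embedding_table_megabytes(vocab_size, dim=EMBEDDING_DIM):
    """Approximate storage for the word-embedding matrix alone."""
    return vocab_size * dim * BYTES_PER_FLOAT / 1e6

# embedding_table_megabytes(101_000)  -> ~8.1 MB for the full vocabulary
# embedding_table_megabytes(6_000)    -> ~0.5 MB for the 6,000-word vocabulary
```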
We are able to separate signal from noise using properties about our corpus discovered in first-order entropy analysis (Zipf’s Law). In doing so, our classifier learns to better discriminate reviews by focusing on generalizable patterns in the input data. While this technique may be useful in other tasks, it’s likely to be less helpful in more general language understanding tasks which require a more robust vocabulary.
Though we found significant space savings through this analysis, we still have many open questions.
Beyond these questions, there are several tangible strategies we may apply to improve task performance. For the model itself, we may try regularization of model parameters and adding dropout. Both of these techniques are useful for reducing overfitting by helping the model learn more generalizable information about the data. Additionally, we can do further entropic analysis on our data. Zipf’s Law, or “first-order entropy”, only scratches the surface of language structure. We may consider using higher-order Shannon entropies[9] to identify longer and more meaningful structural patterns in the data, which we might use to make smarter decisions in our classifier. Who knows what deeper patterns we can learn and exploit?
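For the curious, first-order Shannon entropy, and its higher-order extensions over word n-grams, can be estimated directly from the frequency counts we already have. A sketch using the usual maximum-likelihood estimate of the word probabilities (a rough estimate, not a rigorous one):

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_bigram_entropy(tokenized_reviews):
    """Second-order entropy: uncertainty of a word given the previous word,
    computed as H(w1, w2) - H(w1) over adjacent word pairs."""
    bigrams = Counter()
    for review in tokenized_reviews:
        bigrams.update(zip(review, review[1:]))
    first_words = Counter()
    for (w1, _w2), count in bigrams.items():
        first_words[w1] += count
    return shannon_entropy(bigrams) - shannon_entropy(first_words)
```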
The sentiment analysis model is based on a neural architecture using two LSTM layers. Conceptually, the network reads in a movie review one sentence at a time, one word at a time. The first LSTM reads in each word and computes an embedding for the sentence, and the second LSTM reads sentence embeddings and computes an utterance embedding for the review. This utterance embedding is then fed into a binary classifier to produce our final sentiment prediction.
During training, word embeddings are learned through network updates. We use an Adam optimizer with learning rate 0.002 and batch size of 128 reviews.
We learn dense word embeddings with 20 dimensions. We compute sentence embeddings with 50 dimensions and utterance embeddings with 100 dimensions.
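For readers who want something concrete, here is one way this architecture could be sketched in Keras. The layer sizes, optimizer settings, and batch size follow the description above; the framework choice, padding lengths, and vocabulary handling are assumptions, not the original implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 6_000     # reduced vocabulary; extra ids reserved for padding and UNK
MAX_SENTENCES = 20     # assumed padding limit: sentences per review
MAX_WORDS = 40         # assumed padding limit: words per sentence

# Sentence encoder: a sequence of word ids -> a 50-dimensional sentence embedding.
word_ids = layers.Input(shape=(MAX_WORDS,), dtype="int32")
word_emb = layers.Embedding(VOCAB_SIZE + 2, 20)(word_ids)   # 20-d word embeddings
sentence_emb = layers.LSTM(50)(word_emb)
sentence_encoder = tf.keras.Model(word_ids, sentence_emb)

# Review encoder: a sequence of sentences -> a 100-d utterance embedding -> sentiment.
review_ids = layers.Input(shape=(MAX_SENTENCES, MAX_WORDS), dtype="int32")
sentence_seq = layers.TimeDistributed(sentence_encoder)(review_ids)
utterance_emb = layers.LSTM(100)(sentence_seq)
prediction = layers.Dense(1, activation="sigmoid")(utterance_emb)   # positive vs. negative

model = tf.keras.Model(review_ids, prediction)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=128, epochs=5, validation_data=(x_val, y_val))
```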
[1] https://en.wikipedia.org/wiki/Sentiment_analysis
[2] http://ai.stanford.edu/~amaas/data/sentiment/
[3] https://en.wikipedia.org/wiki/Overfitting
[4] https://en.wikipedia.org/wiki/Zipf%27s_law
[5] http://www.psych.nyu.edu/pylkkanen/Neural_Bases/13_Function_Words.pdf
[6] http://www.eve.ucdavis.edu/gpatricelli/McCowan%20et%20al%201999.pdf
[7] http://www.eve.ucdavis.edu/gpatricelli/McCowan%20et%20al%201999.pdf
[8] https://www.sciencedirect.com/science/article/pii/S2405722316301177?via%3Dihub
[9] http://www.eve.ucdavis.edu/gpatricelli/McCowan%20et%20al%201999.pdf