CIS 700-003: Big Data Analytics
Spring 2017
Due April 14, 2017 by 10pm
For this assignment, we will focus on experimenting with different classifiers and combinations of classifiers, initially using SciKit-Learn and subsequently looking at TensorFlow. The application we’ve chosen is one in text processing, namely spam detection for SMS messages. Can we build a classifier to predict spam vs ham?
As always, you should first open Jupyter and a Terminal, then clone the Homework 5 Bitbucket repository:
git clone https://upenn-cis@bitbucket.org/pennbigdataanalytics/hw5.git
Then go into the hw5 directory in Jupyter.
We are going to try several different classifiers on the task of classifying spam (for SMS messages). Open the Jupyter notebook SpamClassifier.
As a first step, run the Cell that starts with “! pip install”. By default your Jupyter install has Scikit-Learn 0.17; this will upgrade it to 0.18.1 in order to provide the latest functionality (specifically, the ability to automatically create test and training sets with model_selection.train_test_split). Next, go to the Kernel menu and select Restart to restart Python.
Next, you’ll see a TODO question where you’ll need to do a bit of simple data wrangling. As mentioned in the notebook, you’ll want to drop the fields other than ‘class’ and ‘sms’.
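A minimal sketch, assuming the DataFrame is named sms_df (as in the steps that follow):

# Keep only the 'class' and 'sms' columns, dropping everything else
sms_df = sms_df[['class', 'sms']]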
After you’ve done this, step through the Step 1.1 Results cells to see how the input data divides between ‘spam’ texts and ‘ham’ (non-spam) texts. You should note that the spam has certain terms, e.g., ‘winner’, that don’t appear as frequently in “ham.”
Recall from Homework 2 that we used document vectors, stop words, and stemming to analyze document data. For SMS messages, people often use shorter words, and stemming is generally considered unnecessary. So, while we want to build document vectors and drop stop words, we’ll skip stemming.
Fortunately, SciKit-Learn has a handy CountVectorizer that builds a (sparse) matrix of word counts, and can even drop stop words for you. See the SciKit-Learn documentation on CountVectorizer for details.
Under “Step 1.2. Vectorizing the Text”, import the CountVectorizer, then call it with:
CountVectorizer(decode_error = 'ignore', stop_words = 'english')
And finally, apply it to your SMS DataFrame to produce a feature vector per item:
X = {my_vec}.fit_transform(sms_df['sms'].values)
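Putting Step 1.2 together, a sketch with the {my_vec} placeholder named my_vec:

from sklearn.feature_extraction.text import CountVectorizer

# Skip undecodable bytes and drop English stop words
my_vec = CountVectorizer(decode_error='ignore', stop_words='english')

# One sparse row of word counts per SMS message
X = my_vec.fit_transform(sms_df['sms'].values)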
Let’s replace every URL pattern with the text token “_url_” and every number with the token “_number_”. We’ve given you a helper package called regularize with two functions, regularize_urls and regularize_numbers, to do this.
You’ll need to import regularize_urls and regularize_numbers from regularize. Then you can call either of these functions on a DataFrame column, such as sms_df['sms']. Replace the SMS text with the results of regularizing.
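A minimal sketch, assuming each helper returns the transformed column (see the provided regularize package for the exact behavior):

from regularize import regularize_urls, regularize_numbers

# Replace URLs, then numbers, storing the regularized text back in the column
sms_df['sms'] = regularize_urls(sms_df['sms'])
sms_df['sms'] = regularize_numbers(sms_df['sms'])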
Data Check for Step 1.2.2. Below the Cell saying “Step 1.2.2 Results,” re-run the CountVectorizer, re-apply fit_transform on sms_df, and re-compute and output the top-30 spam terms.
If all goes well, you should see both _number_ and _url_ quite high in the list!
Currently we have a very large number of features, namely all of the words that aren’t stop words. Let’s do dimensionality reduction by keeping only the words that frequently occur in either spam or ham. Recall that we just recomputed the top-30 spam words.
Find the Cell under Step 1.3. Compute the top-30 ham words, then create a list of “vocabulary” words from the combination of the spam + ham words. Create a relevant_vec using CountVectorizer with just those words.
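A sketch of this step, where top_spam_words and top_ham_words are hypothetical names for your two top-30 lists:

# Combined spam + ham vocabulary, with duplicates removed
vocabulary = list(set(top_spam_words) | set(top_ham_words))

# A new vectorizer restricted to just those words
relevant_vec = CountVectorizer(decode_error='ignore', stop_words='english',
                               vocabulary=vocabulary)
X = relevant_vec.fit_transform(sms_df['sms'].values)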
When you have this, step through the next Cell, which will add an sms length feature (normalized by the maximum length), and which will create training and test sets for you.
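The provided Cell does this for you; it amounts to something like the following sketch (defer to the notebook’s own code for the exact details):

import numpy as np
from sklearn import model_selection

# Dense feature matrix from the restricted vocabulary
features = X.toarray()

# Append each message's length, normalized by the maximum length
lengths = sms_df['sms'].str.len().values.astype(float)
features = np.column_stack((features, lengths / lengths.max()))

# Create the training and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    features, sms_df['class'].values, random_state=42)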
Go to the Cell marked Step 1.4 Classifier Evaluation. You’ll see a code cell that creates a list, followed by a “placeholder” cell that creates a depth-2 decision tree, and records its parameters and performance in the list. The classifier_results list should consist of dictionaries with fields Classifier, Depth (only populated for Decision Trees), and Score.
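A sketch of the placeholder’s pattern; the 'DecTree' label is an assumption, chosen to match the naming convention in Step 2.0:

from sklearn.tree import DecisionTreeClassifier

classifier_results = []

# Depth-2 decision tree: train, score on the test set, record the result
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)
classifier_results.append({'Classifier': 'DecTree',
                           'Depth': 2,
                           'Score': dt.score(X_test, y_test)})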
Your task is to flesh out this Cell (and/or subsequent Cells) to compare:
(2 different classifiers)
Data Check for Step 1.4. Below the Cell saying “Step 1.4 Results,” re-run the Cell to print the list of classifier results in a DataFrame. Be sure that your particular classifier entries are named as above, and beware of adding duplicate rows from multiple runs of your Cells.
Now let’s look at whether ensembles are useful in improving accuracy. Try each of the following methods, setting n_estimators=31 and random_state=314 for the ensemble method.
(2 different classifiers)
As above, add rows to the classifier_results list (you don’t need to clear it). This time you don’t need to add a Depth field in the dictionary, but you should add a Count that (where applicable) contains the number of classifiers in the ensemble. Use the following names for ensembles + classifiers: RandomForest, Bag-DecTree, Bag-LogReg-L1, Bag-LogReg-L2, Bag-SVM, Boost-DecTree, Boost-LogReg-L1, Boost-LogReg-L2, Boost-SVM.
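A sketch of the first two entries; the remaining Bag-* and Boost-* entries follow the same pattern with different base estimators (e.g., via AdaBoostClassifier for boosting):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random forest of 31 trees
rf = RandomForestClassifier(n_estimators=31, random_state=314)
rf.fit(X_train, y_train)
classifier_results.append({'Classifier': 'RandomForest', 'Count': 31,
                           'Score': rf.score(X_test, y_test)})

# Bagging over decision trees
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=31, random_state=314)
bag.fit(X_train, y_train)
classifier_results.append({'Classifier': 'Bag-DecTree', 'Count': 31,
                           'Score': bag.score(X_test, y_test)})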
Data Check for Step 2.0. Below the Cell saying “Step 2.0 Results,” re-run the Cell to print the list of classifier results in a DataFrame. Be sure that your particular classifier entries are named as above, and beware of adding duplicate rows from multiple runs of your Cells. Note, as a sanity check, that the scores of the random forest and bagging classifiers should all be above 0.90.
Let’s continue building upon our spam classifier, this time using neural networks -- both Perceptrons and feed-forward networks.
Try the following kinds of classifiers over the data:
Again, add performance information to classifier_results. You should ignore the Depth and Count fields, but should add a Hidden field with a tuple containing the hidden layer sizes (e.g., (10, 10, 10)). From the above, you should see some (not huge, but notable) differences in performance over the test set, even as you add or widen hidden layers.
You should be able to observe whether (1) making the hidden layers wider helps, and (2) adding more layers helps.
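A sketch of the pattern; the 'Perceptron' and 'MLP' labels and the layer sizes are illustrative (MLPClassifier is among the functionality new in Scikit-Learn 0.18):

from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# Single-layer perceptron
p = Perceptron()
p.fit(X_train, y_train)
classifier_results.append({'Classifier': 'Perceptron',
                           'Score': p.score(X_test, y_test)})

# Feed-forward network with three hidden layers of 10 units each
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10))
mlp.fit(X_train, y_train)
classifier_results.append({'Classifier': 'MLP', 'Hidden': (10, 10, 10),
                           'Score': mlp.score(X_test, y_test)})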
Data Check for Step 3.0. Below the Cell saying “Step 3.0 Results,” re-run the Cell to print the list of classifier results in a DataFrame. Be sure that your particular classifier entries are named as above, and beware of adding duplicate rows from multiple runs of your Cells.
For this step, you will use Google’s TensorFlow. Open a new Terminal and enter:
docker run -v $HOME/Jupyter:/notebooks/work -it -p 8008:8888 tensorflow/tensorflow:latest-py3
Eventually it will print something like:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=f3ce5484437e06452d939c4e79cf02806cd7857ca6213aca
Select this URL, copy it, and paste it into your browser, but replace the “8888” (the port inside the container) with “8008” (the port it is mapped to on your host). Now you should have two different Docker containers running -- the “normal” one on port 8888, and one for TensorFlow on port 8008.
If you want to see and control your Docker instances, you can run “docker ps” to list them. The commands “docker pause” and “docker unpause”, given a Container ID, will suspend and resume a container. Additionally, you can hit Ctrl-C in the Terminal that ran the original Docker container to end the job. Please make sure all of your files were saved into the host machine’s Jupyter directory first, as under some conditions you may lose state in the container!
For this part of the assignment, we’ll build off our solution to Parts 1-3, which will allow you to focus on how TensorFlow differs from SciKit-Learn.
For the data loading and initial wrangling, we will use the same components as in the previous questions. To do this, copy your solution from SpamClassifier.ipynb to a new notebook, TensorFlow. Open this new notebook.
You may delete all of the Cells from Step 1.4 onwards -- these are what you’ll replace with TensorFlow code. Alternatively, you can just leave them and add more Cells at the bottom.
Divide your original data into test and training sets using model_selection.train_test_split with random_state=42. This should be the same as Step 1.3 of your spam classifier, except that you will need to convert the labels from strings to ints.
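A minimal sketch, assuming features is the same matrix you built in Step 1.3; the ham -> 0, spam -> 1 mapping is an assumption:

from sklearn import model_selection

# Convert string labels to ints (here: ham -> 0, spam -> 1)
labels = (sms_df['class'] == 'spam').astype(int).values

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    features, labels, random_state=42)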
Define TensorFlow columns (features) for each of the top-occurring keys in spam and ham. Also add an additional column for the length. Store these columns in a list.
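A sketch using the tf.contrib.layers API (current as of TensorFlow 1.x; module paths may differ in other versions), assuming vocabulary is your Step 1.3 word list:

import tensorflow as tf

# One real-valued feature column per vocabulary word, plus one for length
feature_columns = [tf.contrib.layers.real_valued_column(word)
                   for word in vocabulary]
feature_columns.append(tf.contrib.layers.real_valued_column('length'))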
Create a function input_fn that takes parameters x (a 2D np array of features) and y (a 1D np array of labels). It should create a tensor for each column of the 2D array x -- that is, one tensor per feature. The function should return a tuple: a dictionary mapping feature names to the tensors created from the columns, and a tensor created from the label array y.
Create a function test_input_fn that takes no arguments, but returns the output of passing in the test set and labels to input_fn. Create a similar function train_input_fn that does the same thing except passes in the training set and labels.
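A sketch of the three functions, keyed by the same feature names as the columns above:

def input_fn(x, y):
    # One constant tensor per feature (i.e., per column of x)
    names = vocabulary + ['length']
    features = {name: tf.constant(x[:, i]) for i, name in enumerate(names)}
    return features, tf.constant(y)

def train_input_fn():
    return input_fn(X_train, y_train)

def test_input_fn():
    return input_fn(X_test, y_test)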
Call tf.set_random_seed(42) to make the computations below a bit more deterministic.
Data Check for Step 4.3 part 1. Create a DNNClassifier with two hidden layers of 5 units each, and run for 1000 steps. Create a Markdown Cell saying “Step 4.3.1 Results.” Run the fit operation over the training data and the evaluate operation over the test data. Sort the results of the evaluate operation by key, and output the keys. Note the accuracy. A sketch of using the DNNClassifier appears below.
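A minimal sketch using the tf.contrib.learn API (TensorFlow 1.x era), assuming the feature_columns list and input functions from the previous steps:

# Two hidden layers of 5 units each
dnn = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                     hidden_units=[5, 5], n_classes=2)
dnn.fit(input_fn=train_input_fn, steps=1000)

# steps=1 evaluates our constant-tensor test set exactly once
results = dnn.evaluate(input_fn=test_input_fn, steps=1)
for key in sorted(results):
    print(key, results[key])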
Data Check for Step 4.3 part 2. Create a LinearClassifier and run for 1000 steps. Create a Markdown Cell saying “Step 4.3.2 Results.” Run the fit operation over the training data and the evaluate operation over the test data. Sort the results of the evaluate operation by key, and output the keys and their values. Note the accuracy.
Between the two, which is more accurate?
Please sanity-check that your Jupyter notebooks contain both code and corresponding data. Add the notebook files to hw5.zip using the zip command at the Terminal, much as you did for HW0 and HW1. The notebooks should be SpamClassifier.ipynb and TensorFlow.ipynb.
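For example, from your Jupyter folder:

zip hw5.zip SpamClassifier.ipynb TensorFlow.ipynb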
Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or GMail account. At this point the system should know you are in the appropriate course. Select CIS 700-003 Homework 5 and upload hw5.zip from your Jupyter folder, typically found under /Users/{myid}.
If you check on the submission site after a few minutes, you should see whether your submission passed validation. You may resubmit as necessary.