CIS 700-003: Big Data Analytics
Spring 2017
Due April 26, 2017 by 10pm
For this assignment, we will focus on analyzing and understanding time-varying or time series data, and brushing up a bit on tuning and data visualization. In general, for time series data we will be doing comparative testing of data collected over different time segments, to characterize it, learn when an event is happening, or make predictions. At times we’ll want to visualize what is going on.
As always, you should first open Jupyter and a Terminal, then clone the Homework 6 Bitbucket repository:
git clone https://upenn-cis@bitbucket.org/pennbigdataanalytics/hw6.git
Then go into the hw6 directory in (your original, as opposed to the TensorFlow version of) Jupyter.
For our first foray into time series data, we’ll take real data from the Penn/Mayo Clinic Seizure Detection Challenge on Kaggle, which we have mentioned in class. Quoting from the challenge site:
“Of the more than two million Americans who suffer from recurrent, spontaneous epileptic seizures, 500,000 continue to experience seizures despite multiple attempts to control the seizures with medication. For these patients responsive neurostimulation represents a possible therapy capable of aborting seizures before they affect a patient's normal activities.
In order for a responsive neurostimulation device to successfully stop seizures, a seizure must be detected and electrical stimulation applied as early as possible. A seizure that builds and generalizes beyond its area of origin will be very difficult to abort via neurostimulation. Current seizure detection algorithms in commercial responsive neurostimulation devices are tuned to be hypersensitive, and their high false positive rate results in unnecessary stimulation.
In addition, physicians and researchers working in epilepsy must often review large quantities of continuous EEG data to identify seizures, which in some patients may be quite subtle. Automated algorithms to detect seizures in large EEG datasets with low false positive and false negative rates would greatly assist clinical care and basic research.”
The goal of the Kaggle competition was to train a classifier to do the job of the physician, namely to look at the segments of EEG data acquired across time and recognize when these indicate a seizure. The figure above shows a typical visualization of a segment of EEG data. The portion highlighted in red indicates a seizure, sometimes called an “ictal state.” The blue portion of the lines represents “normal” activity, sometimes called “inter-ictal” activity since it is between seizures. You should observe that there are visually distinct characteristics between the signals. Part of the seizure detection contest involved finding good features. For this assignment you’ll benefit from others’ hard work!
Intracranial EEG was recorded from dogs with naturally occurring epilepsy, using an “ambulatory monitoring system.” EEG was sampled at 399 Hz from 16 electrodes implanted within the brain, and recorded voltages were referenced to the group average (i.e., we have stored the difference between each channel and the mean value across all channels). We can see in Figure 1 the overall setup.
… Fortunately, given that we will be pointing you at the right features, you can actually do this assignment without being deeply familiar with EEG and neuroscience! You’ll rely on your understanding of how to train classifiers to work well over features.
Open a Terminal in Jupyter. cd into hw6, then
mkdir clips
mkdir clips/Dog_1
cd clips/Dog_1
unzip ../../clips.zip
Open the Epilepsy notebook in Jupyter.
Based on insights from the winner of the contest, we will focus on extracting the low-frequency signal components, from 1 through 47 Hz, using Fast Fourier Transforms. The signal level at each frequency band, for each channel, will form a feature.
We’ve provided several helper functions. First, based on code from one of the contestants, we have provided code to load the data. The data is broken into time segments (“windows”) of 400 samples taken at 399 Hz, from each of 16 channels (electrodes recording voltage levels from the brain). Each segment is either a seizure segment (its filename is Dog_1_ictal_segment_*.mat) or a non-seizure segment (its filename is Dog_1_interictal_segment_*.mat).
If you run the Jupyter notebook, you’ll see how we initially load all of the clips. These load into a “Panel”, which essentially works like a dictionary of DataFrames. Calling keys() on the Panel will give you a list of dictionary keys. Each key is in fact a source file corresponding to a particular time segment from the recording; its name will include “interictal” if it’s seizure-free and “ictal” if it includes a seizure.
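For instance, assuming the loading code assigns the Panel to a variable named clips (a hypothetical name here, not necessarily the notebook's), you could look at it like this:

# Minimal sketch; `clips` stands in for whatever the provided loading code returns.
print(len(clips.keys()))         # number of time segments loaded
first_key = list(clips.keys())[0]
print(first_key)                 # a filename containing "ictal" or "interictal"
segment = clips[first_key]       # a DataFrame holding that segment's EEG samples
print(segment.shape)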
You’ll also see a preview of three of the EEG channels, which looks like this:
And ultimately you’ll see the definition of a simple function fft that, when applied to a 2D array, separates out the amplitudes for the 1 Hz frequency band, 2 Hz, … up to 47 Hz. These amplitudes, for each of our 16 channels, will be the features we’ll train on. You’ll see an example of the FFT’s output in the notebook.
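If you are curious what such a function might look like, here is a minimal sketch (not necessarily the notebook’s exact implementation; the name fft_features and the array orientation are assumptions):

import numpy as np

def fft_features(segment, low=1, high=47):
    # Rough sketch: FFT amplitudes in the 1-47 Hz bands, per channel.
    # `segment` is assumed to be shaped (channels, samples); with ~400 samples
    # spanning about one second, FFT bin k corresponds roughly to k Hz.
    spectrum = np.absolute(np.fft.rfft(segment, axis=1))
    return spectrum[:, low:high + 1]   # keep bins 1..47 for every channel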
Go to the part of the notebook labeled “Step 1.3.”
First, create a feature array X and a label vector y by iterating through all of the time segments (see the sketch after these steps).
Next, use train_test_split to separate the data into training and test sets. Use random_state=42 and test_size=0.3.
Now use your knowledge of classifiers and tuners to produce the most accurate classifier you can, when run over the test set. Recall how to tune hyperparameters, look at validation and training curves, use cross-validation, plot ROC curves, and so on. Recall that ensembles typically do better than individual classifiers.
Output the predictions of the best classifier under Step 1.4 Best Result. See if you can get above 99% accuracy!
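Putting the steps together, a minimal end-to-end sketch might look like the following; clips and fft refer to the objects defined earlier in the notebook, and the choice of a grid-searched random forest is only one possibility, not the required solution:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 1.3: build the feature matrix and labels from the Panel of clips.
X, y = [], []
for key in clips.keys():
    features = fft(clips[key])      # per-channel 1-47 Hz amplitudes; exact call
                                    # depends on the notebook's fft definition
    X.append(np.ravel(features))    # one flat feature vector per segment
    # Check 'interictal' first, since 'ictal' is a substring of 'interictal'.
    y.append(0 if 'interictal' in key else 1)   # 1 = seizure, 0 = non-seizure
X, y = np.array(X), np.array(y)

# Hold out 30% of the segments for testing, as specified above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.3)

# One reasonable starting point: a tuned random forest (try other ensembles too).
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {'n_estimators': [100, 300], 'max_depth': [None, 10]},
                    cv=5)
grid.fit(X_train, y_train)

# Step 1.4: predictions of the best classifier on the held-out test set.
y_pred = grid.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))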
For our second task, we’ll be looking at data that is both geospatial and time-varying: in this case, earthquakes across the world. Rather than trying to build a classifier here, we’ll be trying to understand the data using data visualization as well as some reasoning about temporal behavior.
Open the notebook Earthquakes, which has some initial setup code that will install an add-on to matplotlib called basemap. Basemap can plot maps of the world and overlay data on top of them! (Unfortunately, one of the cooler aspects of Basemap, the ability to plot the map 3D-style with surface textures, is broken in the current release, so we’ll have to plot with a “flat” map.)
Run the early Cells and validate that you have reasonable-looking content in quake_df. You’ll note that it is indeed time series data, as there are reports of earthquakes by location (latitude + longitude) as well as date.
Our first task will be to do just a bit of cleaning and aggregation of the data. First, use the pivot_table function on the DataFrame to look at what values of Type exist:
quake_df.pivot_table(index='Type', values='Magnitude', aggfunc=len)
You might be surprised what’s there besides earthquakes!
Second, we will want to count how many different days earthquakes occur in the same place, within a specified time window. Since the lat/lon coordinates are recorded to at least 3 decimal places, we’ll want a coarser-grained notion of “location” to group by. We’ll take a very simple approach: multiply latitude and longitude each by 100 (shifting left by 2 decimal digits) and convert to integer (truncating any remaining digits). For example, a latitude value of 39.11 would become 3911 in the new column.
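As a rough sketch of what this looks like in pandas (keeping only rows whose Type is 'Earthquake', and assuming the columns are named Latitude and Longitude):

# Rough sketch; the Type filter and the Latitude/Longitude column names are
# assumptions based on the dataset.
quake_df = quake_df[quake_df['Type'] == 'Earthquake'].copy()
quake_df['near_lat'] = (quake_df['Latitude'] * 100).astype(int)   # truncate
quake_df['near_lon'] = (quake_df['Longitude'] * 100).astype(int)  # truncate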
Under the Markdown that says “Step 2.1 Earthquakes Only, truncate at 2 decimal places,” fill in the Cell to keep only the earthquakes and to add the truncated near_lat and near_lon location columns described above.
In the next step, we’ll use the Cell under “Step 2.2 Plot on Map.” Initially, this code plots a map with coastlines, continents, and countries. Then it takes each (near_lat, near_lon), converts this back to a latitude and longitude, and plots a point on the map indicating an earthquake. With the sample code, you don’t get a sense of how frequently earthquakes have occurred in each region -- just whether they have occurred.
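One common way to convey frequency is to scale each marker by the number of quakes at its truncated location; here is a rough sketch (not the required solution, and the styling choices are assumptions):

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Count quakes per truncated location, then scale the markers by that count.
counts = quake_df.groupby(['near_lat', 'near_lon']).size()

m = Basemap()          # default 'cyl' projection covering the whole world
m.drawcoastlines()
m.fillcontinents()
m.drawcountries()

lats = counts.index.get_level_values('near_lat').values / 100.0
lons = counts.index.get_level_values('near_lon').values / 100.0
x, y = m(lons, lats)   # convert back to map (projection) coordinates
m.scatter(x, y, s=5 * counts.values, c=counts.values, cmap='Reds', alpha=0.7)
plt.show()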
Modify the code in the Cell so that the plot also conveys how frequently earthquakes have occurred in each region, not just whether they have occurred.
You should get something that looks roughly like this:
In the next step, we want to look at temporal behavior: specifically, where have earthquakes occurred repeatedly within a one-month time frame?
For this one, we’ll need to look at combinations of rows within quake_df where (1) the location is the same, and (2) the dates of the two quakes are within a month of each other. Recall that you can use merge to join DataFrames together, even a DataFrame with itself. Recall also that by default, merge matches rows on the values of specific columns, then renames the remaining columns from the left and right (with _x and _y suffixes) so you can distinguish them. Thus your output should have fields like Date_x, Depth_x, Date_y, etc.
You’ll also need to use the relativedelta function to compute dates. If you use:
from datetime import date
from dateutil.relativedelta import relativedelta

date(2002, 1, 1) + relativedelta(months=1)
You’ll get February 1st of 2002. You can similarly use relativedelta to add a month to the Date, Date_x, or Date_y fields in your DataFrames.
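Putting these pieces together, here is a rough sketch of the overall shape of this step (a self-merge on the truncated location, keeping pairs of distinct quakes no more than a month apart; everything other than quake_df, near_lat, and near_lon is an assumption):

import pandas as pd
from dateutil.relativedelta import relativedelta

# Make sure Date holds real datetimes before comparing (it may load as strings).
quake_df['Date'] = pd.to_datetime(quake_df['Date'])

# Self-merge: pair up quakes that share the same truncated location.
pairs = quake_df.merge(quake_df, on=['near_lat', 'near_lon'])

# Keep pairs of distinct quakes where the second falls within a month of the first.
one_month_later = pairs['Date_x'].apply(lambda d: d + relativedelta(months=1))
within_month = pairs[(pairs['Date_x'] < pairs['Date_y']) &
                     (pairs['Date_y'] <= one_month_later)]

# Locations that saw more than one quake within a one-month window.
repeat_locations = within_month[['near_lat', 'near_lon']].drop_duplicates()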
In the Cell under “Step 2.3 Find Locations with Multiple Quakes in a Month,” fill in code that finds those locations using the approach described above.
Please sanity-check that your Jupyter notebooks contain both code and corresponding data. Add the notebook files to hw6.zip using the zip command at the Terminal, much as you did for HW0 and HW1. The notebooks should be Epilepsy.ipynb and Earthquakes.ipynb.
Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or GMail account. At this point the system should know you are in the appropriate course. Select CIS 700-003 Homework 6 and upload hw6.zip from your Jupyter folder, typically found under /Users/{myid}.
If you check on the submission site after a few minutes, you should see whether your submission passed validation. You may resubmit as necessary.