CIS 545: Big Data Analytics

Fall 2018

Homework 2: Big Data and Graph Data

Due October 15, 2018 by 10pm

For this assignment, we will focus on graph data. You saw an instance of this in Homework 1: the airline flight network is actually a graph, but we did only limited kinds of computation over it. Many real-world datasets are, or can be modeled by, graphs (or trees, which are special cases of graphs). Examples include:

This is the second of four assignments in the course. Each consists of a Basic component, to be done by everyone, and an Advanced component, to be done by students who wish to do 3 homeworks and a project. Please see the separate steps below for the Advanced component.

The “Basic” Assignment

For this assignment, we will perform a few common operations on graphs. In the next assignment, when we have the power of matrices, we will do some further computation over the same graph data. (It’s very common to encode graph connectivity through an adjacency matrix, which we’ll discuss in lecture.)
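
As a rough illustration of the adjacency-matrix idea mentioned above, here is a minimal sketch in plain Python. It is not part of the homework skeleton: the function name adjacency_matrix and the sample flight edges are hypothetical, chosen only to echo the flight network from Homework 1. Entry [i][j] is 1 when there is an edge from node i to node j.

```python
def adjacency_matrix(edges, directed=True):
    """Build a dense adjacency matrix from a list of (source, target) edges."""
    # Give each distinct node a row/column index, in sorted order.
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}

    # Start with an all-zero square matrix, one row and column per node.
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
        if not directed:
            # An undirected edge is recorded in both directions.
            matrix[index[dst]][index[src]] = 1
    return nodes, matrix


# Hypothetical sample data: a tiny directed flight network.
flights = [("PHL", "JFK"), ("JFK", "LAX"), ("LAX", "PHL"), ("PHL", "LAX")]
nodes, matrix = adjacency_matrix(flights)

# Each row sum is that node's out-degree (number of outgoing edges).
out_degree = {node: sum(row) for node, row in zip(nodes, matrix)}
```

A dense matrix like this only suits small graphs. For a graph the size of the homework dataset, an edge list or a sparse representation would likely be the more practical choice.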

To start, go to Jupyter Notebook in your web browser (http://localhost:8888/tree with the big token as before).  Click on your work directory, then New|Terminal.  Run:

git clone https://bitbucket.org/pennbigdataanalytics/hw2.git

to get your initial data sets and skeleton notebook with test cases.  

What to Work on

The basic Homework 2 has only one notebook, Homework 2.ipynb. However, you should run through Steps 2.1 and 2.2 as early as possible to make sure you can (1) connect to a simple version of Spark in your Docker container [this won’t be fast but will let you play with Spark], and (2) download a 2.5+GB dataset.

The Data You’ll be Using

The data files come from the Yelp data posted on Kaggle.  This data has some quirks and dirty aspects -- some of which we have cleaned for you, and some of which you’ll need to clean yourself in the Homework.

Submitting Homework 2

Once your Jupyter notebook is sanity-checked and passes all tests, go into your work/hw2 directory on Jupyter Notebook. Run zip hw2.zip Homework*.ipynb.

Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or Gmail account. At this point the system should know you are in the appropriate course. Select CIS 545 Homework 2 and upload hw2.zip from your Jupyter/hw2 folder, typically found under /Users/{myid}.