CIS 545: Big Data Analytics

Fall 2018

Homework 0: Getting Started

Due September 10, 2018 by 10pm

For this initial assignment, our primary goals are to get you set up to run the infrastructure we’ll need for this class and to familiarize you with how to submit your assignments.  That software infrastructure includes Jupyter, Python, and Spark.  To do this, we’ll be using Docker, a “container” manager that lets you pull down and run different software components.  Docker will manage your development tools and environment.

At times we will also give you the option of using Microsoft Azure Notebooks, a cloud-hosted version of Jupyter, but you should familiarize yourself with both options.

1 Getting the Necessary Software

1.1 Installing Docker

Your first task will be to install Docker itself.  Please see the setup instructions below, which depend on your operating system.  If, during download of the installer from Docker, you have an option to choose between the “Stable” and “Beta” versions, please stick with Stable!

As you follow the instructions, note that you don’t need to validate with docker-compose (which we won’t be using for this class). 

1.2 Installing Jupyter on Docker

Initially Launching Jupyter and Sharing a Directory

Launch your operating system’s command line: on Mac OS and Linux this is “Terminal” and on Windows it’s “Command Prompt”.  Then type in the commands below for your platform.  In this document, we’ll use userid to refer to your user login ID on your local machine.

For Windows 10 Pro/Education (where %USERPROFILE% is \Users\{userid}):

mkdir %USERPROFILE%\Jupyter

echo "Test" > %USERPROFILE%\Jupyter\test.txt

docker run -v %USERPROFILE%\Jupyter:/home/jovyan/work -it -p 8888:8888 jupyter/all-spark-notebook


For Windows 10 Home you’ll need to first open the Command Prompt and run:

vboxmanage sharedfolder add default --name "%USERPROFILE%\Jupyter" --hostpath "%USERPROFILE%\Jupyter" --automount

For Mac/Linux, or Windows 10 Home under Docker Quickstart Terminal (where $HOME is your home directory, e.g., /Users/{userid} on a Mac):

mkdir ~/Jupyter

echo "Test" > $HOME/Jupyter/test.txt

docker run -v $HOME/Jupyter:/home/jovyan/work -it -p 8888:8888 jupyter/all-spark-notebook

These commands will (1) create a directory for your Jupyter environment and files in /Users/{userid}, containing a small test file, and (2) download and run a Docker image containing Python and Jupyter (formerly IPython) as well as a local version of Spark.  (If you want to dive deeply into the Docker-Jupyter environment, you can get details here.)  For Linux, you may need to add the parameter “--net=host” to the docker run command line.
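To see how the pieces of the docker run command fit together, here is a minimal Python sketch that only assembles the same argument list; it does not start Docker.  The function name docker_run_args and its port and net_host parameters are illustrative assumptions of ours, while the flags and the image name are the ones shown above.

```python
from pathlib import Path

# Image name taken from the docker run command in this handout.
IMAGE = "jupyter/all-spark-notebook"


def docker_run_args(host_dir, port=8888, net_host=False):
    """Build the argument list for the docker run command (hypothetical helper)."""
    args = [
        "docker", "run",
        # -v host_path:container_path shares host_dir as "work" in the container
        "-v", f"{host_dir}:/home/jovyan/work",
        # -it keeps the container attached to your terminal so you can see its log
        "-it",
        # -p host_port:container_port exposes Jupyter's port 8888 on the host
        "-p", f"{port}:8888",
    ]
    if net_host:
        # Optional parameter mentioned above for Linux
        args.append("--net=host")
    args.append(IMAGE)
    return args


# Print the command for a Jupyter folder in the current user's home directory.
print(" ".join(docker_run_args(Path.home() / "Jupyter")))
```

Printing the list this way is only for inspection; if your home directory contains spaces, the joined string would need quoting before you paste it into a shell.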

Once Docker says it is ready, it should give a message like:

 Copy/paste this URL into your browser when you connect for the first time,

    to login with a token:

        http://localhost:8888/?token=9ed4c2dad760cbde0215a0ee7784adf1d416c1ff4d9068eb

Select the URL (starting with http://) and copy it.  Open up your Web browser and paste it into the URL bar.  You should see a screen like the one below.  Click on the “work” directory.

Verify that test.txt exists.  This file was created in your $HOME/Jupyter directory on your host machine, and it needs to be there to confirm that your Jupyter instance can share files with the host.
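Once you are able to run notebook cells, the same check can be made in code.  The following is a minimal sketch, assuming the shared folder is mounted at /home/jovyan/work as in the docker run command above; the helper name shared_file_contents is hypothetical.

```python
from pathlib import Path
from typing import Optional


def shared_file_contents(directory, name="test.txt") -> Optional[str]:
    """Return the stripped text of directory/name, or None if the file is missing."""
    path = Path(directory) / name
    if not path.is_file():
        return None
    return path.read_text().strip()


# Inside the container the shared folder is /home/jovyan/work
# (the right-hand side of the -v option used earlier).
print(shared_file_contents("/home/jovyan/work"))
```

If folder sharing works, this prints the text that the echo command wrote on your host machine; if it prints None, the container cannot see your host directory.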

If You Need a Password or Token

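Follow the instructions above: copy the token from the URL that Docker printed (the value after “token=”) and enter it when prompted.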

If It Didn’t Work

Check the URL you were given.  If it says something like “http://(eabacdef or 127.0.0.1):8888/…”, replace the whole parenthesized part, including the parentheses, with 127.0.0.1 or localhost.

If you are on Windows 10 Home or otherwise using Docker Toolbox, you may need to replace 127.0.0.1:8888 with 192.168.99.100:8888.  If that still doesn’t work, running docker-machine ip default might tell you a different address to use.
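The URL fixes described above are simple string substitutions.  As a rough sketch (the function names fix_jupyter_url and token_of are our own, and the sample URL is the one from this handout), they could be written as:

```python
import re
from urllib.parse import parse_qs, urlsplit


def fix_jupyter_url(url, host="localhost"):
    """Replace a parenthesized host list such as '(eabacdef or 127.0.0.1)' with host."""
    return re.sub(r"\([^)]*\)", host, url, count=1)


def token_of(url):
    """Return the token query parameter of a Jupyter URL, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("token")
    return values[0] if values else None


printed = ("http://(eabacdef or 127.0.0.1):8888/"
           "?token=9ed4c2dad760cbde0215a0ee7784adf1d416c1ff4d9068eb")

# Default case: use localhost
fixed = fix_jupyter_url(printed)
print(fixed)

# Docker Toolbox case: use the virtual machine's address instead
print(fix_jupyter_url(printed, host="192.168.99.100"))

# The token on its own, in case a login page asks for it
print(token_of(fixed))
```

This is only meant to make the substitution explicit; doing the same edit by hand in the browser's URL bar works just as well.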

1.3 Connecting to Jupyter after a Reboot

At times you’ll need to stop your Docker instance, e.g., after rebooting.  If you reboot and repeat the steps in Section 1.2, you’ll end up creating another container with Jupyter, which can be very wasteful.  Instead you can relaunch and reconnect to your existing container via Kitematic.

Mac OS X: Run Kitematic. You can skip the registration with Docker Hub.

Windows 10: Run Kitematic.  

If Kitematic complains about an ENOENT error, click on “Use Virtualbox.”  You can skip the registration with Docker Hub, since you won’t be publishing any containers.  

1.  Once you are at the main page, look on the left side, where you’ll see a list of the containers on your machine.

 

2.  Click on the container and click on the Start button. This will start your Jupyter Notebook.

Inside the log window you will eventually see something like:

 Copy/paste this URL into your browser when you connect for the first time,

    to login with a token:

        http://localhost:8888/?token=9ed4c2dad760cbde0215a0ee7784adf1d416c1ff4d9068eb

Click on the Web Preview icon on the right to launch it in a browser.  If you see a screen like the one below, then you are ready to go, and you can skip to the “Your Data” section below!  If not, you should follow the instructions below.

If You Need a Password or Token

If, instead, you see a page asking for a password or token, use the token value generated in the Kitematic container terminal, e.g., the part after “token=” in:

 http://localhost:8888/?token=9ed4c2dad760cbde0215a0ee7784adf1d416c1ff4d9068eb

Your Data

Please consistently work in the “work” directory in Jupyter -- this corresponds to the Jupyter directory on your host machine.  Your files should be saved there, and you’ll be able to back up and retrieve them even if Docker crashes.

2 Creating and Visualizing Data

2.1 A First, Really Simple Program

Make sure you are in the work directory.

From the browser’s view of Jupyter, click on “New”:

Jupyter is running inside the Docker container.  You can create text files or folders, but more commonly, you’ll want to create Notebooks in Python 3.  (If you really need to get under the covers, you can open a Terminal window to it; you’ll see that by default you get a user called jovyan, which astute readers will note is a bad pun on Jupyter.)

For now, click on Python 3, and type the following into the In [1] Cell:

import pyspark

sc = pyspark.SparkContext('local[*]')

rdd = sc.parallelize(range(1000))

rdd.takeSample(False, 5, seed=314)

Click on the “Run” button or type [Shift]-[Enter].  Now wait a few moments.  You should ultimately get a result that looks like the following (probably with different values):

Congratulations, you have just run a very simple Spark program that created a vector of 1000 numbers in parallel, then sampled 5 of them without replacement!
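If “sampling without replacement” is unfamiliar, the following plain-Python sketch shows the same idea without Spark.  It is only an analogue: the helper name sample_without_replacement is ours, and because Python’s random module is a different generator from Spark’s, the values will not match those from rdd.takeSample even with the same seed.

```python
import random


def sample_without_replacement(n, k, seed):
    """Pick k distinct values from 0..n-1, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return rng.sample(range(n), k)


picked = sample_without_replacement(1000, 5, seed=314)
print(picked)

# "Without replacement" means no value can be drawn twice.
print(len(set(picked)) == len(picked))
```

The first argument of takeSample in the cell above (False) plays the same role: it tells Spark not to sample with replacement.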

2.2 Something a Little More Fun

In the next Cell, where it says In [2], type in:

# We’ll be using Matplotlib to plot a visualization

%matplotlib inline

import matplotlib.pyplot as plt

import numpy as np

# Sample 100 values from the RDD

y = np.array(rdd.takeSample(False, 100, 1))

# Create an array with the indices

x = np.array(range(len(y)))

# Create a plot with a title, X and Y axis labels, etc

plt.title(str(len(y)) + ' random samples from the RDD')

plt.xlabel('Sample no')

plt.ylabel('Value')

plt.figtext(0.995, 0.01, 'CIS 545 student', ha='right', va='bottom')

# Scatter plot that fits within the box

plt.scatter(x, y)

plt.tight_layout()

# Now fit a trend line to the data and plot it over the scatter plot

m, c = np.polyfit(x, y, 1)

plt.plot(x, m*x + c)

# Save the SVG

plt.savefig('hw0.svg')

Press the “Run” button on the toolbar, and in a few moments you should see a scatter plot with a trend line.  Now replace the figure text (which says “CIS 545 student”) with your PennKey (eniac login ID).  You can then select the In [2] cell and click the “Run” button again to update the figure.
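The trend line in the cell above comes from np.polyfit(x, y, 1), which fits a straight line y = m*x + c by least squares.  As a rough illustration of what that computes, here is a sketch of the textbook closed-form formulas in plain Python; the function name fit_line is ours, and NumPy’s internal method may differ even though the result for a straight-line fit should agree up to rounding.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for the line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Four points that lie exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
m, c = fit_line(xs, ys)
print(m, c)
```

Because the sample points lie exactly on y = 2x + 1, the fit recovers a slope of 2 and an intercept of 1.  With the random sample from the RDD, expect a slope near zero, since the sample order carries no trend.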

Now click at the very top of the window, next to the Jupyter logo, where it says “Untitled.”  Replace “Untitled” with “HW0.”  Click on the left-most icon under the “File” menu (the old-school floppy disk) to save.  You should see a brief message saying “Checkpoint created” with a timestamp.

3 Submitting Your Homework

3.1 Creating a Submission File

Now you will need to create a Zip file.  For Windows, you may need to install Zip for Windows or 7-Zip.

If your Docker instance is fully configured to share folders, your files will be in the Jupyter directory on your host machine, and you should be able to do the following.

For Mac/Linux:

cd ~/Jupyter

zip hw0.zip HW0.ipynb hw0.svg

For Windows:

cd %USERPROFILE%\Jupyter

zip hw0.zip HW0.ipynb hw0.svg

Depending on your setup, you may need to first download zip from here.

It’s possible the above didn’t work, which means your Docker instance isn’t configured to share folders.  This is not a good state to leave things in, so please see a TA for help.  But meanwhile you can open each item (HW0.ipynb and hw0.svg) in Jupyter and download each file separately through your browser.  (Download the svg file directly, and download the notebook as a Notebook.)  Then, from your downloads folder:

For Mac/Linux:

cd ~/Downloads

zip hw0.zip HW0.ipynb hw0.svg

For Windows:

cd %USERPROFILE%\Downloads

zip hw0.zip HW0.ipynb hw0.svg
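If no zip program is available on your machine, Python’s standard zipfile module can build an equivalent archive.  This is a minimal sketch rather than an official part of the submission process: the helper name make_submission is hypothetical, and it assumes both files sit in the same folder.

```python
import zipfile
from pathlib import Path


def make_submission(folder, names=("HW0.ipynb", "hw0.svg"), archive="hw0.zip"):
    """Zip the named files from folder into folder/archive and return its path."""
    folder = Path(folder)
    missing = [n for n in names if not (folder / n).is_file()]
    if missing:
        raise FileNotFoundError("missing files: " + ", ".join(missing))
    target = folder / archive
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for n in names:
            # arcname stores the bare file name, as the zip command above does
            zf.write(folder / n, arcname=n)
    return target


# Example call, assuming the files are in the Jupyter folder in your home directory:
# make_submission(Path.home() / "Jupyter")
```

The example call is commented out so that the cell does nothing until you point it at the folder that actually holds your notebook and SVG.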

3.2 Submitting Homework 0

Your first submission should be easy.  

Go to the submission site, and click on the Google icon.  Log in using your Google@SEAS account if at all possible; if you aren’t an Engineering student, use your Gmail account.

Please go to Settings and set your Student ID to your PennID (the numeric identifier for your account).  You should now have an account for uploading homework and accessing your grades.

Click on the Courses icon at the top, then select CIS 545 and Save.  Select Homework 0 and upload hw0.zip, which is in your Jupyter folder under /Users/{userid}.

You should see a message on the submission site notifying you about whether your submission passed validation.  You may resubmit as necessary.