CIS 700-003: Big Data Analytics
Spring 2017
Due February 1, 2017 by 10pm
Deadline extended to February 6, 2017 by 10pm
Note: this is an archived version of a homework. Find a more recent version at www.cis.upenn.edu/~cis545.
Big data analytics often involves combining data from multiple sources, possibly from multiple different organizations, in heterogeneous formats. For this assignment, our primary goals are to get you comfortable with importing, saving, and combining data -- largely using Pandas DataFrames as the basic abstraction. You’ll read data from files, address issues with missing values, save the data in a database, combine the data, and generate visualizations.
The Task. Most of you likely were on a plane over the break, and chances are that at least one of your flights got delayed. Did you ever wonder how well different airlines do? We’ll answer those questions in this assignment! (Caveat: with apologies to international travelers, most of the data is only available for US cities and routes!)
Terminology. We’ll generally use field, column, and attribute interchangeably to mean a named column in a DataFrame. We’ll also generally assume that table, DataFrame, and relation mean the same thing.
Submission of Your Homework. Section 7 below describes how you’ll submit to the Web site. We expect you to implement your solution to the homework in a series of Jupyter notebooks with specified names. In each Jupyter notebook, you’ll include both the code and the output as produced by Jupyter. For each step we’ll ask you to provide a label so we know what Step you are doing, and then we’ll ask you to output content in a DataFrame, a visualization, or both. (Directions will appear in a framed box such as this one.) Please note that components of your submissions will be automatically checked, so please follow directions about field names, etc. very carefully. When you are asked to create a label and a series of cells, please create them consecutively and in the same order!
We’ll be bringing together data from OpenFlights.org with data from the US government Bureau of Transportation at http://www.transtats.bts.gov. To start, open your Terminal or Command Prompt and cd to Jupyter. Run:
git clone https://bitbucket.org/pennbigdataanalytics/hw1.git
to get your initial data sets and some “helper” code.
The data files, whose contents are described in the provided notebook Dataset Descriptions, are:
Go to your copy of Jupyter running on Docker, by opening your web browser to http://localhost:8888. Create a Jupyter Notebook, using Python 3, in the hw1 folder that was created by git above. Name the notebook “Load DataFrames”. Please be precise with the naming, as we’ll have scripts that look for this file upon submission.
Create a Cell where you read the above files using Pandas’ read_csv function. For the first 3 sources you’ll need to give column names to the data, based on the Dataset Descriptions and some hints below. For the “on time performance” data you can read the file and have Pandas use the header names already in the file. Below, we use the variable names airports_df, airlines_df, routes_df, and performance_df to refer to the DataFrames. The “_df” suffix is there because Python is dynamically typed, so code is clearer when the type is apparent from the variable name.
Loading the performance DataFrame will generate a warning about mixed types. We don’t actually care about all of the columns here. By deleting columns from the DataFrame, update performance_df to only have the following fields (these were named in the CSV file header, thus don’t follow Python naming conventions): ['Year','Month','DayofMonth','AirlineID','Carrier','FlightNum','Origin','Dest','ArrDelayMinutes','Cancelled']
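For example, loading one of the OpenFlights files and the on-time performance data might look like the sketch below (the file names and the airlines column names here are illustrative assumptions -- take the actual names from the Dataset Descriptions notebook):

import pandas as pd

# Assumed file name and column names, for illustration only
airlines_df = pd.read_csv('airlines.dat',
                          names=['airline_id', 'name', 'alias', 'iata',
                                 'icao', 'callsign', 'country', 'active'])

# The on-time performance file has a header row, so Pandas can name the columns itself
performance_df = pd.read_csv('On_Time_Performance.csv')   # assumed file name
performance_df = performance_df[['Year', 'Month', 'DayofMonth', 'AirlineID', 'Carrier',
                                 'FlightNum', 'Origin', 'Dest', 'ArrDelayMinutes', 'Cancelled']]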
Data Check for Step 1. Let’s now create some Cells to get a quick look at what we have loaded (and so we can check your work!):
Nulls. You should see for airlines a variety of entries that say “NaN” (not a number), which represents “unknown” or null information. In fact, if you look closely there are also other values representing “unknown” such as “\N” and even “-”. We’ll have to regularize all of this later!
Schemas. OK, you’ve loaded the DataFrames. You can get the schemas -- the names and types of the columns -- of the DataFrames by the dtypes property. Use airlines_df.dtypes and routes_df.dtypes to take a look at the schemas. Now compare the types of routes_df.airline_id and airlines_df.airline_id. You should see that one is “int64” and the other is “object.”
Why is this? Pandas automatically infers the types based on what it reads from the CSV files. Unfortunately, things like “NaN” are actually floating-point (fractional) numbers and “\N” is a string. If a column has multiple kinds of values, Pandas will consider it to be an “object.” This will interfere with how we combine tables later, so we need to “clean” the data.
We are going to clean the airlines_df, airports_df, and routes_df DataFrames. First, in a new Cell, let’s define a “helper” function.
# Replace NaNs with blanks if the column is a string, so we don’t get
# objects in the column
def fillna_col(series):
    if series.dtype == object:
        return series.fillna('')
    else:
        return series
Additionally, define a second function called nullify that takes a single parameter: if the parameter’s value is “\N”, it returns NaN; otherwise it returns the parameter unchanged.
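A minimal sketch of such a function (np.nan comes from NumPy, which you may need to import):

import numpy as np

def nullify(value):
    # OpenFlights uses the two-character marker \N for unknown values;
    # note the escaped backslash in the Python string literal
    if value == '\\N':
        return np.nan
    else:
        return value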
Regularizing and removing nulls. Next, we’ll need to use two functions to apply nullify and fillna_col to our DataFrames.
You’ll want to update your DataFrames using all of the above functions. Think carefully about the order in which to apply these.
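The two Pandas methods you will most likely want are applymap, which applies a function to every value of a DataFrame, and apply, which applies a function to each column. Purely as an illustration of the mechanics (some_df is a placeholder -- deciding which DataFrames to update, and in what order to apply the functions, is part of the exercise):

cleaned_df = some_df.applymap(nullify)     # apply nullify to every individual value
cleaned_df = cleaned_df.apply(fillna_col)  # apply fillna_col to each column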
Changing column types. After all of this, routes_df.airline_id will only have integers, but will still have its existing type of object. Later we’ll need it to be an integer, so that it can be compared directly with the entries from airlines_df and airports_df. Let’s convert it to integer via:
routes_df['airline_id'] = routes_df['airline_id'].astype(int)
Repeat the same process for the source and destination airport IDs.
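For example (a sketch; substitute whatever column names you gave the source and destination airport IDs, e.g. source_airport_id and dest_airport_id):

routes_df['source_airport_id'] = routes_df['source_airport_id'].astype(int)
routes_df['dest_airport_id'] = routes_df['dest_airport_id'].astype(int)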
Data Check for Step 2. Let’s now create some Cells to get a quick look at the effects of our “cleaning” operations:
Now let’s actually save the data in a persistent way, specifically using a relational database. For simplicity we’ll use SQLite here, but we could alternatively use a DBMS such as MySQL or PostgreSQL on the cloud (or in another Docker container).
Create a Cell that starts with the following two lines, which you’ll extend momentarily:
import sqlite3
engine = sqlite3.connect('HW1_DB')
This establishes a connection to an SQLite database, which will be written to the file HW1_DB in your HW1 Jupyter directory.
Next, in this Cell you should save each of your DataFrames (airlines_df, airports_df, performance_df, routes_df) to the database. To do this, call the to_sql method on each DataFrame (making sure not to save the index column). Give it a table name matching the DataFrame name, and set the flag if_exists='replace' in case you want to run this multiple times.
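For example, for one of the DataFrames (a sketch; repeat the same call for the other three, changing the table name accordingly):

airlines_df.to_sql('airlines_df', engine, index=False, if_exists='replace')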
Once this is all done, you can save the “Load DataFrames” Jupyter notebook for submission.
Create a new Python 3 notebook called “Wrangling”. Start off by connecting to the database (as above) and importing pandas as pd, then load back your first 3 DataFrames (not counting performance, which would take far too long to load) from SQL, using the syntax:
dataframe = pd.read_sql('select * from table_name', engine)
Next create a Cell that calls airports_df.info(), airlines_df.info(), routes_df.info().
Observe that this tells you the data types on the various DataFrame elements.
Data Check for Steps 3-4. Create some Cells to get a quick look at what we have loaded:
Before we use data, it’s essential to understand that data is seldom “pristine” and perfect. Human data entry errors, incompleteness, and measurement error (and even flipped bits due to gamma rays!) all find their way into real-world datasets.
We would expect certain characteristics, such as that every airport has flights, and that every airport is both a source and a destination. (Otherwise we’d have a situation where some airports pile up airplanes, and others have no planes!)
Take a look at our data and see:
We’ll revisit these questions momentarily when we have more detail. Meanwhile...
Data Check for Step 4.1. Let’s create a structure to report part of what we’ve discovered.
How “skewed” are cities with respect to incoming routes?
“Query” for Step 4.2.
You should get graphs with distributions resembling the figures below. What is notable about the distribution? [Note that these figures aren’t actually the best way to represent the information -- but we’ll defer a more detailed discussion of this to another assignment whose goal isn’t to focus on the mechanics of data wrangling!]
Now let’s combine information from the different sources!
The first question is which airports are in the data as destinations, but not as sources.
“Query” for Step 4.3.1.
You’ll need to do a merge (join) with the original airports_df DataFrame. Note that dataframe.merge() can be called with a second DataFrame, as well as the columns to match (left_on, right_on).
Next let’s consider what destination airports are most-trafficked, by name. Recall that we have the frequencies of arrivals for each airport by its ID. You may also find the .reset_index() method useful.
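One possible starting point is sketched below (it assumes routes_df has a dest_airport_id column and airports_df has airport_id and name columns -- adjust to the names you chose):

# Count arriving routes per destination airport ID
arrivals_df = routes_df['dest_airport_id'].value_counts().reset_index()
arrivals_df.columns = ['airport_id', 'num_arriving_routes']

# Attach airport names and list the busiest destinations first
busiest_df = arrivals_df.merge(airports_df, on='airport_id')
busiest_df = busiest_df[['name', 'num_arriving_routes']].sort_values('num_arriving_routes',
                                                                     ascending=False)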
“Query” for Step 4.3.2.
Data wrangling typically refers to the process of importing, remapping, and converting data. On the other hand, data integration typically refers to linking data across heterogeneous sources. We have an integration problem (in its simplest form) when we want to combine the on-time information (what was originally performance_df, and is now in the SQL database) and the route information (routes_df).
Now we’ll make use of the on-time performance table. Recall that we put this into an SQL table, and that for performance reasons we’d prefer not to bring the whole thing into memory. Let’s get a random sample of the table into a DataFrame. You can get a sample of x rows from table by:
df = pd.read_sql_query('SELECT * FROM table ORDER BY RANDOM() LIMIT x', con=engine)
Replace x and table to get a 1000-element sample of the performance table; call this performance_s_df.
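Concretely (a sketch, assuming the on-time data was saved under the table name performance_df, matching the DataFrame name):

performance_s_df = pd.read_sql_query(
    'SELECT * FROM performance_df ORDER BY RANDOM() LIMIT 1000', con=engine)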
Now let’s see if we can actually detect what fields (columns) “match” between routes_df and performance_s_df (for the sample) DataFrames.
To do this, let’s note that we only care about the “categorical” (string-valued) columns. What we want is, e.g., to find fields that have lots of common values (for instance, airline abbreviations or airport codes). To measure this in a principled way, we’ll use a measure called the Jaccard distance (or Jaccard index or Jaccard measure). The Jaccard measure of the similarity between two sets A and B is simply J(A, B) = |A ∩ B| / |A ∪ B|, i.e., the number of values the two sets have in common, divided by the total number of distinct values across both.
Let’s compute the Jaccard distances between (the values of) all pairs of attributes in routes_df and performance_s_df. This will tell us which attributes include similar values -- and thus which might represent the same concept.
Create a Cell with an algorithm that iterates over all routes_df keys (columns) and all performance_s_df keys. In a nested map {r → {p → distance}}, the algorithm should store the Jaccard distance between the values in routes column r to the values in performance column p (for distances that are non-zero and only for non-numeric columns). Remember, since there may be duplicates, you’ll need to convert columns into sets of unique values through drop_duplicates().
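One possible implementation is sketched below (it skips numeric columns, keeps only non-zero scores, and assumes routes_df and performance_s_df are already loaded):

matches = {}
for r in routes_df.keys():
    if routes_df[r].dtype != object:            # only categorical (string) columns
        continue
    matches[r] = {}
    r_vals = set(routes_df[r].drop_duplicates())
    for p in performance_s_df.keys():
        if performance_s_df[p].dtype != object:
            continue
        p_vals = set(performance_s_df[p].drop_duplicates())
        union = r_vals | p_vals
        score = len(r_vals & p_vals) / len(union) if union else 0.0
        if score > 0:
            matches[r][p] = score
matches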
Data Check for Step 5.1. Let’s look at the output of your Jaccard similarity-based attribute matching algorithm.
Your output should look something like:
{'airline_iata': {'Carrier': 0.04...},
'code_share': {},
...}
This map should help you decide which pairs of columns (one in each DataFrame) are “semantically equivalent.” You should be able to find a very clear match between non-numeric codes for airlines in the two DataFrames. The origin and destination codes will also have corresponding pairs (but if you only look at the Jaccard distances, there will be some ambiguity, since both origin and destination cities have values from the same domain; you will need to use your own knowledge of the semantics of the fields to disambiguate which pairs are correct).
Now that you know how routes_df and performance_s_df data relate, let’s see how the routes and flights actually relate. Note that each data source (and thus DataFrame) is in fact incomplete. For this step we’ll want to only consider the flight information and route information that “intersects.”
Create a joint DataFrame flights_s_df by merging the two DataFrames on the fields that correspond to flight information. Use routes_df.merge as well as left_on and right_on. Note that flights_s_df only represents a sample of all flights, since performance_s_df only contains a sample of 1000 entries. However, it’s also worth noting that OpenFlights only has information about certain routes, and we want to focus on the flights corresponding to routes in OpenFlights.
Hint: you’ll need to pass lists for left_on and right_on in order to merge on the 3 fields between routes_df and performance_s_df that you found to “match” in the previous part. These 3 fields should effectively correspond to flights in performance_s_df.
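For illustration, the merge might look something like this (assuming, hypothetically, that your routes columns are named airline_iata, source_airport, and dest_airport -- use the names you actually chose and the matches you found in Step 5.1):

flights_s_df = routes_df.merge(performance_s_df,
                               left_on=['airline_iata', 'source_airport', 'dest_airport'],
                               right_on=['Carrier', 'Origin', 'Dest'])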
Nonetheless we’ll use this sample to figure out how to combine DataFrames. Later, we’ll make this scale further by rewriting our computations in SQL.
“Query” for Step 5.2.1.
Now let’s repeat the above, but only for flights that met a frustrating fate.
“Query” for Step 5.2.2.
“Query” for Step 5.2.3.
What if we re-examine the above question, but by airline instead of by city?
“Query” for Step 5.2.4.
Create a new Jupyter notebook titled “Advanced.” Initialize it by loading the airlines_df, airports_df, and routes_df from the SQL database, as before. This time, we want to look at the flights more comprehensively -- and at this point we will have exceeded the point where Python DataFrames are efficient in handling the data. We’ll instead make heavier use of the SQL database.
First, we will go beyond the 1 month’s data that currently exists, instead expanding to 3 months’ flight data.
The fastest way to read a remote CSV file into a DataFrame is to call read_csv with the URL. Create a DataFrame aug_flights_df in this manner using flight data from August at URL http://big.dataanalytics.education/data/Flights-Aug-2016.csv . Now write the DataFrame to an SQL table flights.
Repeat this for …-Sep-2016.csv and …-Oct-2016.csv. Each time, write using if_exists='append' so you can keep the existing contents.
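For example, for the August data (a sketch; September and October follow the same pattern with their respective URLs, and for the very first write you may prefer if_exists='replace' so that re-running the notebook does not duplicate rows):

aug_flights_df = pd.read_csv('http://big.dataanalytics.education/data/Flights-Aug-2016.csv')
aug_flights_df.to_sql('flights', engine, index=False, if_exists='append')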
An alternate way to get this data (which is slightly slower, but that can be used to access Web data that requires querying Web forms) uses a few additional Python packages that should be imported: requests and StringIO (located in the io package). We illustrate below how to obtain the flight data from August in this manner.
You can use the get call in the requests Python package to pull data from the Web. You can use this to retrieve http://big.dataanalytics.education/data/Flights-Aug-2016.csv into a variable aug_response. The result (as a large string) will be available as aug_response.text (Note: trying to display the whole string in your Jupyter notebook may cause it to hang).
Let’s use another tool, StringIO (located in the io package), to treat a string as if it is the contents of a file. Call:
aug_csv = io.StringIO(aug_response.text)
to “wrap” the CSV text. Now create a DataFrame aug_flights_df using read_csv over the wrapped file. Project it to only include the fields:
'Year','Month','DayofMonth','Carrier','FlightNum','Origin',
'Dest','DepTime','ArrTime','ArrDelayMinutes','Cancelled'
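Putting the pieces of this alternate approach together (a sketch):

import io
import requests

aug_response = requests.get('http://big.dataanalytics.education/data/Flights-Aug-2016.csv')
aug_csv = io.StringIO(aug_response.text)   # treat the downloaded string as a file
aug_flights_df = pd.read_csv(aug_csv)
aug_flights_df = aug_flights_df[['Year', 'Month', 'DayofMonth', 'Carrier', 'FlightNum',
                                 'Origin', 'Dest', 'DepTime', 'ArrTime',
                                 'ArrDelayMinutes', 'Cancelled']]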
Great, now let’s see how much data we have in the SQL database.
Use (a modified version of) the Pandas command:
df = pd.read_sql(query, db_connection)
with the appropriate query to get a count on the number of rows in the flights table.
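For instance (a sketch):

count_df = pd.read_sql('SELECT COUNT(*) AS num_flights FROM flights', engine)
count_df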
Data Check for Step 6.1.
It should be a bit more than 1.4 million rows.
Recall that you can call the info() or describe() functions on the DataFrame to get its fields, types, and information about values.
In this part of the assignment, you’ll investigate the correspondence between DataFrame and SQL abstractions. Recall that in 5.2 you used Python’s DataFrame merge command to merge two dataframes:
output_df = left_df.merge(right_df, left_on=['left_col1', 'left_col2', ...],
                          right_on=['right_col1', 'right_col2', ...])
We can do the same thing between left and right SQL tables using:
output_df = pd.read_sql("""SELECT * FROM left_table JOIN right_table
                           ON left_col1 = right_col1 AND left_col2 = right_col2""",
                        db_connection)
If you only want a subset of the attributes (columns) to show up in the DataFrame, replace “*” with a list of column names.
Also, if you want to compute aggregates, you can do:
output_df = pd.read_sql("""SELECT col_g_1, col_g_2, agg1(col_a_1), agg2(col_a_2)
                           FROM table
                           GROUP BY col_g_1, col_g_2""",
                        db_connection)
Now we’ll use these basic abstractions to look in more detail at the bigger flight dataset, repeating some of the questions of Step 5.2 above.
For all of the steps in this section, be sure that you are using SQL queries to retrieve the data.
As in 5.2.1, we have two different data sources with flight-related information: the routes from OpenFlights.org and the flights from the US government. For this part of the assignment (all of Step 6.2), we are going to focus solely on the flights that correspond to routes in OpenFlights -- i.e., those that match known routes.
“Query” 6.2.1.
To filter individual rows from a table, add a predicate of the form:
SELECT … FROM …
WHERE condition_on_attributes
GROUP BY …
In a Cell, create a new SQL query based on the one in 6.2.1, but with a WHERE constraint conditioning on the arrival delay/cancellation.
“Query” 6.2.2.
To filter results from a grouping, by some property like the count, add a HAVING clause:
SELECT … FROM …
GROUP BY …
HAVING agg-expression
In a Cell, create a new SQL query based on the one in 6.2.2, but with a HAVING constraint specifying that there must be at least 1000 flights that were delayed by 30 minutes or more, or cancelled.
Data Check and Visualization for Step 6.2.3.
Your bar chart distribution should end up resembling something like this:
Our last step will be to perform a shortest-path computation over the flight data. For each fastest direct flight between a pair of cities (C1, C2) on the same day, we want to find pairs of flights that are shorter (measured from initial departure time to final arrival time) and get us to the same destination. Assume that it takes one hour to get off one flight and make it to another flight before it departs (hint: take a look at how times are encoded in the DepTime and ArrTime fields before you assume that adding 60 gives you an hour!).
The next flight from the stopover must be on the same day. For simplicity you may ignore flights that wrap to the next day. For performance reasons limit your computation to data for September 2016. [Note that we are quite performance-bottlenecked in this part of the assignment -- later we’ll see how to handle large-scale graph traversals more efficiently.]
First create a “base case” where you compute, and store in an SQL table, the earliest arrival for each departure time for each pair of cities on the same day. Make sure that all flights in the table are along a valid route by joining with the routes table (as in previous parts of this assignment). Put these into a table called shortest. You should use SQL’s min() aggregate function to find the earliest ArrTime. Be sure to drop entries with null dates, times, sources, or destinations. In SQL, you can output a field named f with literal value v using:
SELECT v AS f … FROM ...
And you can also replace v with an expression (such as (w-3)) instead of a variable.
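A sketch of what the base case might look like is below. It assumes the routes were saved under the table name routes_df with columns airline_iata, source_airport, and dest_airport, and it restricts the data to September 2016; substitute your own table and column names, and the join condition you used in Step 6.2:

shortest_df = pd.read_sql("""
    SELECT f.Year, f.Month, f.DayofMonth, f.Origin, f.Dest, f.DepTime,
           MIN(f.ArrTime) AS ArrTime
    FROM flights f JOIN routes_df r
      ON f.Carrier = r.airline_iata
     AND f.Origin = r.source_airport
     AND f.Dest = r.dest_airport
    WHERE f.Year = 2016 AND f.Month = 9
      AND f.DepTime IS NOT NULL AND f.ArrTime IS NOT NULL
      AND f.Origin IS NOT NULL AND f.Dest IS NOT NULL
    GROUP BY f.Year, f.Month, f.DayofMonth, f.Origin, f.Dest, f.DepTime
""", engine)
shortest_df.to_sql('shortest', engine, index=False, if_exists='replace')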
As a second step, you will want to create an index over the shortest table to speed up computation. You’ll probably want to do:
engine.execute('CREATE INDEX my_index ON shortest(Year, Month, DayofMonth, Origin, Dest)')
Next, find pairs of flights (2-hop paths) that leave at the same time, but arrive earlier at the same destination on the same day. Make sure to ignore flights that arrive on the following day. You will probably want to make use of (1) variables referring to different iterators (aliases) over the SQL table, and (2) the EXISTS predicate in SQL, as in the use of s1, s2, s3 in:
SELECT v1,v2 …
FROM shortest AS s1 JOIN shortest AS s2 ON s1… = s2…
WHERE EXISTS
(SELECT v1 FROM shortest AS s3 WHERE s3… = s1… AND ...)
Note: “AS” is optional. The query will work exactly the same if “AS” is omitted, but adding “AS” makes the query more explicit.
Hint: You can get the right query behavior (and keep the computation from going on forever) by essentially limiting the 2-hop flights: check whether there EXISTS a corresponding direct flight (same day, route, and departure time) in the shortest (1-hop) table that takes longer.
Query for Step 6.3.1.
Now take the results from the above and use grouping to compute the overall shortest 1- or 2-hop flights from each origin to each destination, for each day and departure time. You may use either Python DataFrames or SQL (by putting the above results into an SQL table and then posing another query over that).
Query for Step 6.3.2.
ADVANCED EXTRA CREDIT. Compute the transitive closure of shortest paths, where we see if a 2-stop, 3-stop, etc. flight is shorter in getting from one city to the next.
Extra credit data check.
For submission, we will expect you to submit your Jupyter notebooks with both the code and the output results in each Cell. We’ll be trying to give you rough feedback by auto-validating that the data looks OK. Before submitting, we recommend you go to the Cell menu in Jupyter and select Run All, then check that every cell has output that makes sense.
Once your Jupyter notebooks are sanity-checked, add the notebook files to hw1.zip using the zip command at the Terminal, much as you did for HW0. The notebooks should be Load DataFrames.ipynb, Wrangling.ipynb, and Advanced.ipynb.
Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or GMail account. At this point the system should know you are in the appropriate course. Select CIS 700-003 Homework 1 and upload hw1.zip from your Jupyter folder, typically found under /Users/{myid}.
If you check on the submission site after a few minutes, you should see whether your submission passed validation. You may resubmit as necessary.