CIS 455/555: Internet and Web Systems
Spring 2019
Homework 2: Web Crawling and Stream Processing
Milestone 1 due March 18, 2019, at 10:00pm EDT
Milestone 2 due March 29, 2019, at 10:00pm EDT
In this assignment, you will explore several more Web technologies, and continue to build components useful towards your course project, by building a topic-specific Web crawler. A topic-specific crawler looks for documents or data matching certain phrases or content in the header or body.
This assignment will entail, in Milestone 1:
In Milestone 2:
The resulting application will be very versatile, able to filter documents into categories with specified keywords. Assignment 2 builds on your application server from Assignment 1, if that server is ready to go. If you are not confident that your application server is working well, please use the Spark Framework (http://www.sparkjava.com/) for your application instead. (Using your own server will earn you +5% extra credit. You are allowed to make fixes to it as necessary.)
We recommend that you continue using Eclipse Che and Docker as with HW0 and HW1. (If necessary, you may use Eclipse or IntelliJ to edit your code while running it from within Che.) You should fork and then clone the framework code for HW2 using the same process as for HW0 and HW1 (fork from ssh://git@bitbucket.org/upenn-cis555/555-hw2.git to your own private repository, then clone from your repository to Eclipse Che). You should, of course, commit regularly as you make changes so you can revert, and you should periodically push to your own repository on Bitbucket in case your computer crashes.
Initially you will be using the Spark Framework for this assignment, but for extra credit you can run it using your own HW1 framework (see Section 6).
Carefully read the entire assignment (both milestones) from front to back and make a list of the features you need to implement. There are many important details that are easily overlooked! Spend at least some time thinking about the design of your solution. What classes will you need? How many threads will there be? What will their interfaces look like? Which data structures need synchronization? And so on.
We strongly recommend that you regularly check the discussions on Piazza for clarifications and solutions to common problems.
For the first milestone, your task is to crawl and store Web documents. These will ultimately be fed into a pattern matching engine for Milestone 2.
In preparation for Milestone 2, we will have a Web interface for login. Your main class should be called edu.upenn.cis455.crawler.WebInterface and it should register routes for various behaviors. We have given you a partial implementation of the login handler so you can get started.
You’ll see in the provided code how Filters allow you to make decisions about whether the user’s request should proceed, or the user should be redirected to the login form. So some of the above should already be present.
Beyond the above, you should build additional routes to support the following functions:
Note that you may take advantage of Sessions and the various other capabilities of the Spark Framework and/or the ones you developed in Homework 1 Milestone 2. Some of the above functionality is provided, so please look carefully to understand what isn’t yet there.
We will use Berkeley DB Java Edition (http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html), which may be downloaded freely from their website, to implement a disk-backed data store. Berkeley DB is a popular embedded database, and it is relatively easy to use as a key-value store; there is ample documentation and reference information available on its Web site, e.g., the Getting Started Guide. Berkeley DB has several different ways of storing and retrieving data; perhaps the simplest is through the Java Collections interface.
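As a starting point, here is a minimal sketch of opening an environment and exposing one database as a java.util.Map through the Collections API. The directory, database names, and class name are placeholders; your own storage wrapper will likely look different.

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.bind.serial.SerialBinding;
import com.sleepycat.bind.serial.StoredClassCatalog;
import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.collections.StoredMap;

public class StorageSketch {
    public static void main(String[] args) {
        File dir = new File("./bdb-storage");   // placeholder; use the directory given on the command line
        dir.mkdirs();

        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database catalogDb = env.openDatabase(null, "classCatalog", dbCfg);
        StoredClassCatalog catalog = new StoredClassCatalog(catalogDb);

        // Expose the "documents" database as a Map from URL to page contents
        Database docDb = env.openDatabase(null, "documents", dbCfg);
        StoredMap<String, String> documents = new StoredMap<>(
                docDb,
                TupleBinding.getPrimitiveBinding(String.class),  // key binding
                new SerialBinding<>(catalog, String.class),      // value binding
                true);                                           // writes allowed

        documents.put("http://example.com/", "<html>...</html>");
        System.out.println(documents.get("http://example.com/"));

        docDb.close();
        catalogDb.close();
        env.close();
    }
}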
Your store will hold (at least):
Typically you will create a series of objects representing your data, stored in the model directory. We have started things by giving you a really simple User class. The User password is currently stored in memory in clear text; it should instead be saved using SHA-256 hashing. No cleartext passwords should ever be stored.
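A minimal sketch of such a hash helper, using the standard java.security.MessageDigest API (the class name is just illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class PasswordHasher {
    // Returns the lowercase hex SHA-256 digest of the given password
    public static String sha256(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("SHA-256 not available", e);
        }
    }
}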
The WebInterface program, when run from the command line, should take as its first argument a path to the BerkeleyDB data storage directory, and as its second argument a path to your static HTML files. The BerkeleyDB directory should be created if it does not already exist.
Your web crawler will initially be a Java application that can be run from mvn exec:java@crawler and from the command line, as edu.upenn.cis455.crawler.Crawler. It will take the following command-line arguments (in this specific order, and the first three are required):
It is intended that the crawler be run periodically, either by hand or from an automated tool like the cron command. There is therefore no need to build a connection from the Web interface to the crawler, except that the two will share a common database. Note also that BerkeleyDB does not like to share database instances across concurrent programs.
The crawler traverses links in HTML documents. You can extract these using an HTML parser, such as JSoup (https://jsoup.org/, included with your Maven package), or simply by searching the HTML document for occurrences of the pattern href="URL" and its subtle variations. We have provided a simple extractor for hrefs, which might be adequate for your purposes and will at least provide a baseline.
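For instance, a minimal sketch of link extraction with JSoup (the class name is illustrative; the provided href extractor may already be sufficient for you):

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static List<String> extractLinks(String html, String baseUrl) {
        List<String> urls = new ArrayList<>();
        Document doc = Jsoup.parse(html, baseUrl);   // baseUrl lets JSoup resolve relative hrefs
        for (Element anchor : doc.select("a[href]")) {
            urls.add(anchor.attr("abs:href"));       // the "abs:" prefix returns the absolute URL
        }
        return urls;
    }
}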
If a link points to another HTML document, it should be retrieved and scanned for links as well. If it points to an XML or RSS document, it should be retrieved and parsed as well. Don't bother crawling images, or trying to extract links from XML files. All retrieved HTML and XML documents should be stored in the BerkeleyDB database (so that the crawler does not have to retrieve them again if they do not change before the next crawl). The crawler must be careful not to search the same page multiple times during a given crawl, and it should exit when it has no more pages to crawl. You’ll need to understand what parts of functionality are provided and where you need to supplement.
Redundant documents and cyclic crawls: Your crawler should compute an MD5 hash of every document that is fetched, and store this in a “content seen” table in BerkeleyDB. If you crawl a document with a matching hash during the same crawler run, you should not index it or traverse its outgoing links.
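A sketch of this check, with an in-memory Set standing in for the BerkeleyDB "content seen" table (the class and method names are illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;

public class ContentSeen {
    // Returns true if this document body has not been seen before during the current crawl
    public static boolean isNewContent(Set<String> contentSeen, String documentBody) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(documentBody.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String hash = hex.toString();
        if (contentSeen.contains(hash)) {
            return false;            // exact duplicate: do not index it or follow its links
        }
        contentSeen.add(hash);       // remember the fingerprint for the rest of this crawl
        return true;
    }
}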
When your crawler is processing a new HTML or XML page, it should print a short status report to the Apache Log4J logger, using the “info” status level. Example: "http://xyz.com/index.html: Downloading" (if the page is actually downloaded) or "http://abc.com/def.html: Not modified" (if the page is not downloaded because it has not changed).
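For example, assuming the Log4j 2 API (with Log4j 1.x you would obtain the logger from org.apache.log4j.Logger instead):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CrawlerLogging {
    private static final Logger logger = LogManager.getLogger(CrawlerLogging.class);

    void reportDownloading(String url) {
        logger.info(url + ": Downloading");
    }

    void reportNotModified(String url) {
        logger.info(url + ": Not modified");
    }
}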
Your crawler must be a considerate Web citizen. First, it must respect the robots.txt file, as described in A Standard for Robot Exclusion (http://www.robotstxt.org/robotstxt.html). It must support the Crawl-Delay directive (see http://en.wikipedia.org/wiki/Robots.txt) and "User-agent: *", but it need not support wildcards in the Disallow: paths. Second, it must always send a HEAD request first to determine the type and size of a file on a Web server. If the file has the type text/html or one of the XML MIME types:
and if the file is at most as large as the specified maximum size, then the crawler should retrieve the file and process it; otherwise it should ignore it and move on. For more details on XML media types, see RFC 3023 (http://www.ietf.org/rfc/rfc3023.txt). Your crawler should also not retrieve the file if it has not been modified since the last time it was crawled, but it should still process unchanged files (i.e., match them against XPaths and extract links from them) using the copy in its local database.
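A hedged sketch of the HEAD check with HttpURLConnection follows; the size limit and the exact set of accepted XML MIME types are placeholders that you should wire to your crawler's command-line arguments and to RFC 3023.

import java.net.HttpURLConnection;
import java.net.URL;

public class HeadCheck {
    // Placeholder limit; wire this to the maximum-size argument your crawler takes
    private static final long MAX_SIZE_BYTES = 1_000_000L;

    public static boolean worthDownloading(String urlString) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setRequestProperty("User-Agent", "cis455crawler");   // required for every request (see below)
        conn.connect();

        String type = conn.getContentType();          // may include a "; charset=..." suffix
        long length = conn.getContentLengthLong();    // -1 if the server omits Content-Length
        conn.disconnect();

        // text/html or an XML type; consult RFC 3023 for the full set of XML media types
        boolean okType = type != null
                && (type.startsWith("text/html")
                    || type.startsWith("text/xml")
                    || type.startsWith("application/xml")
                    || type.contains("+xml"));
        // This sketch conservatively skips files whose size the server does not report
        return okType && length >= 0 && length <= MAX_SIZE_BYTES;
    }
}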
We have given you some “helper” classes in the crawler/info package, which might be useful to store information about URLs and robots.txt.
Certain web content, such as the papers in ACM's Digital Library, normally costs money to download but is free from Penn's campus network. If your crawler accidentally downloads a lot of this content, this will cause a lot of trouble. To prevent this, you must send the header User-Agent: cis455crawler in each request.
You must develop at least two JUnit tests for the storage system and two more for the crawler.
You should next add a Route to enable retrieval of documents from the BDBstore, using the following form:
localhost:8080/lookup?url=…
that takes a GET request with a URL as parameter url, and returns the stored document corresponding to that URL. Think of this as the equivalent of Google’s cache. We will use this for testing.
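A minimal sketch of such a route with the Spark Framework, where documents is a hypothetical Map view over your BerkeleyDB document table:

import static spark.Spark.get;
import java.util.Map;

public class LookupRoute {
    public static void register(final Map<String, String> documents) {
        get("/lookup", (request, response) -> {
            String url = request.queryParams("url");
            String body = (url == null) ? null : documents.get(url);
            if (body == null) {
                response.status(404);
                return "Not found";
            }
            return body;     // return the cached copy of the page
        });
    }
}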
Submit using OpenSubmit. Note that your user interface and crawler must use the particular paths and package names specified.
The next milestone will consist of an evaluator for streams of document content. You will extend your Milestone 1 project to run on a stream processing engine that enables multithreaded execution. You’ll also extend your application and storage system to enable users to register “subscription channels” with the system. Finally, you’ll build a stream processing component that checks documents against the various “channels” and outputs results per user.
Your Milestone 1 project had a simple execution model, in which you controlled the execution of the crawler, presumably in a crawler loop. Now we want to break it into smaller units of work that can be parallelized.
To do this, we’ll be using a CIS 455/555-custom emulation of the Apache Storm stream engine, which we call StormLite (it should show up in your HW2 repo already). Please see the document on StormLite and see TestWordCount as an example of a multithreaded stream job. Storm has spouts that produce data one result at a time, and bolts that process data one result at a time. As with our emulation of the Spark Framework, you should be able to use examples of Apache Storm code to understand how StormLite works.
You should refactor your Milestone 1 code so that the crawl task runs within StormLite, as follows. Note that StormLite supports multiple worker threads, but you can control the number of “executors” (start with 1).
Based on the sample code in test.upenn.cis.stormlite, the StormLite document, and (in fact) the documentation on Storm available from Stack Overflow and the Web, you should be able to assemble a stream dataflow like the one illustrated above.
You can assume a single crawler queue bolt, but should look at how the fieldGrouping, allGrouping, and shuffleGrouping specifiers allow you to specify how data gets distributed when there are multiple “executors” such as multiple copies of the crawler, parser, etc.
For Milestone 2, you will also enhance the Web application to support the following functions for logged in users.
Now that you have users and HTML or XML, we want to “connect” users with “interesting” content. To do this, any logged in user will be able to create channels. A channel is a named pattern describing a class of documents. An example of a channel definition would be
sports : /rss/channel/title[contains(text(),"sports")]
and you can see an example of content that would match the channel at: http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml
Assume that channels and their names are global.
You should implement an interface to create a channel, as a GET call:
localhost:8080/create/{name}?xpath={xpath-pattern}
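A minimal sketch with the Spark Framework (channels is a hypothetical Map from channel name to XPath backed by your storage layer; session and ownership checks are omitted here):

import static spark.Spark.get;
import java.util.Map;

public class CreateChannelRoute {
    public static void register(final Map<String, String> channels) {
        get("/create/:name", (request, response) -> {
            String name = request.params(":name");    // Spark's :name captures the {name} path segment
            String xpath = request.queryParams("xpath");
            if (name == null || xpath == null) {
                response.status(400);
                return "Missing channel name or xpath";
            }
            channels.put(name, xpath);
            return "Channel " + name + " created";
        });
    }
}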
As before, you should have a login/registration form at localhost:8080/register.jsp. Once a user is logged in, you should have a “home page” at localhost:8080/.
Obviously you’ll need to add some logic to the BerkeleyDB storage system to store user subscriptions and to store which documents correspond to a channel (see Section 3.2 for how this will be populated). How you implement most of the functionality of the Web interface is entirely up to you; we are just constraining the URL interfaces. In order to make things consistent across assignments, we are specifying how the channel must be displayed by the “show” request.
We expect this application to run on your application server from the first assignment. If you did not complete the first assignment, or for some other reason do not want to continue to use the application server that you wrote, you may use Jetty (http://www.eclipse.org/jetty/) to test your application. There will not be any penalties for using Jetty or Tomcat.
You need to write a class edu.upenn.cis455.xpathengine.XPathEngineImpl that implements edu.upenn.cis455.xpathengine.XPathEngine (included with the code in Bitbucket), and evaluates a set of XPath expressions on the specified HTML or XML document. Once you’ve tested that individually, you’ll incorporate it into (call it from) a StormLite bolt. We will be focusing only on elements, sub-elements, and text nodes.
The implementation object (instance of XPathEngineImpl) should be created via the XPathEngineFactory. The setXPaths method gives the class a number of XPaths to evaluate. The isValid(i) method should return false if the ith XPath was invalid, and true otherwise. You should implement the evaluateEvent() method:
To make things simpler, we are supporting a very limited subset of XPath, as specified by the following grammar (modulo white space, which your engine should ignore):
XPath → (/ step)+
step → nodename ([ test ])*
test → text() = "..."
→ contains(text(), "...")
where nodename is a valid XML identifier, and "..." indicates a quoted string. This means that a query like /db/record/name[text() = "Alice"] is valid. Recall that if two separate bracketed conditions are imposed at the same step in a query, both must apply for a node to be in the answer set.
Below are some examples of valid XPaths that you need to support (not an exhaustive list):
/foo/bar/xyz
/xyz/abc[contains(text(),"someSubstring")]
/a/b/c[text()="theEntireText"]
/d/e/f/foo[text()="something"]/bar
/a/b/c[text() = "whiteSpacesShouldNotMatter"]
You should be able to think about these kinds of XPaths as regular expressions over open and close events.
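As one illustration, a syntax check for this grammar can be written as a regular expression. The sketch below is only a validator (it does not evaluate anything), and the class name and the simplified nodename pattern are illustrative, not part of the required interface:

import java.util.regex.Pattern;

public class XPathValidator {
    // test = [ text()="..." ] or [ contains(text(),"...") ], with optional whitespace
    private static final String TEST =
        "\\[\\s*(text\\(\\)\\s*=\\s*\"[^\"]*\"|contains\\(\\s*text\\(\\)\\s*,\\s*\"[^\"]*\"\\s*\\))\\s*\\]";
    // step = nodename followed by zero or more tests; the nodename pattern is a simplification
    // of "valid XML identifier" (no namespaces, etc.)
    private static final String STEP = "[A-Za-z_][\\w.-]*\\s*(" + TEST + "\\s*)*";
    private static final Pattern XPATH = Pattern.compile("(/\\s*" + STEP + ")+");

    public static boolean isValid(String xpath) {
        return xpath != null && XPATH.matcher(xpath.trim()).matches();
    }
}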
The stream of OccurrenceEvents will be coming from a separate parser bolt (in our standard architecture; you can diverge from this if you prefer). You will probably want to create a test bolt when developing the XPath engine. The easiest HTML/XML parser to use in Java is probably a DOM (Document Object Model) parser, e.g., the one from JSoup. Such a parser builds an in-memory data structure holding an entire HTML or XML document. From there, it is easy to walk the tree and output events. You can also look into SAX parsers.
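A hedged sketch of such a tree walk with JSoup's XML parser, printing open/text/close events rather than emitting OccurrenceEvent tuples:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class EventWalker {
    // Recursively emit open/text/close "events" for each node in the parsed tree
    public static void walk(Node node) {
        if (node instanceof TextNode) {
            String text = ((TextNode) node).text().trim();
            if (!text.isEmpty()) {
                System.out.println("TEXT  " + text);
            }
            return;
        }
        if (node instanceof Element) {
            Element el = (Element) node;
            System.out.println("OPEN  " + el.tagName());
            for (Node child : el.childNodes()) {
                walk(child);
            }
            System.out.println("CLOSE " + el.tagName());
        }
    }

    public static void main(String[] args) {
        // Parser.xmlParser() avoids JSoup wrapping the content in <html><body> tags
        Document doc = Jsoup.parse(
                "<rss><channel><title>sports news</title></channel></rss>", "", Parser.xmlParser());
        for (Node child : doc.childNodes()) {
            walk(child);
        }
    }
}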
Once your XPath engine works over individual documents, you’ll want to write a StormLite bolt whose execute() method instantiates the XPath engine for a given input document (passed in as a tuple), looks up all of the channels defined in the BerkeleyDB database, and for each document that matches an XPath for a channel, records the document as a match to the channel. Subsequently, the Web application interface will be able to show the documents as matches.
To test the XPath engine, you are to supply XPath rules to find RSS 2.0 documents which contain items whose title or description contains the character strings war or peace. (More than one XPath rule is OK.) This (list of) XPath(s) should be in a file called warandpeace.xp.
In addition, you must implement at least 5 unit tests, using the JUnit package (see the section on Testing below for helpful Web page references). JUnit provides automated test scaffolding for your code: you can set up a set of basic objects for all of the test cases, then have the test cases run one-by-one.
Your JUnit test suite should instantiate any necessary objects with some test data (e.g., parse a given HTML or XML document or build a DOM tree), then run a series of unit tests that validate your Web application and your XPath matcher. In total you must have at least 5 unit tests (perhaps each designed to exercise some particular functionality) and at least one must be for the Web application and one for the XPath evaluator.
Your solution must meet the following requirements (please read carefully!):
Reminder: All the code you submit (other than the dependencies on the JSoup/JTidy/TagSoup parser, the standard Java libraries, and any code we have provided) must have been written by you personally, and you may not collaborate with anyone else on this assignment. Copying code from the web is considered plagiarism.
We have implemented a small sandbox for you to test your code on. It runs on machines in Penn Engineering, so it will be fast to access, and it will not contain any links out of itself. The start URL of the sandbox is https://dbappserv.cis.upenn.edu/crawltest.html. There should be adequate XML and HTML documents there to test your XPath matching.
In order to encourage modularization and test-driven development, you will be required to code test cases using the JUnit package (http://www.junit.org/), a framework that allows you to write and run tests over your modules. A single test case consists of a class of methods, each of which (usually) tests one of your source classes and its methods. A test suite is a class that allows you to run all your test cases as a single program. You can get more information here: http://www.onjava.com/pub/a/onjava/2004/02/04/juie.html.
For Milestone 1, you must include 5 test cases and for Milestone 2, a test suite consisting of these 5 and at least 2 more for each new component. If your test suite uses any files (e.g., test inputs), please put them into your project folder and use a relative path, so your tests will run correctly on the graders' machines.
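As an illustration, here is one JUnit 4 test for the hypothetical PasswordHasher helper sketched in the storage section; the expected value is the published SHA-256 test vector for the string "abc".

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PasswordHasherTest {
    @Test
    public void sha256MatchesKnownVector() {
        assertEquals(
            "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
            PasswordHasher.sha256("abc"));
    }
}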
There are several enhancements you can add to your assignment for extra credit. In all cases, if you implement an improved component, you do not need to implement the simpler version described above; however, your improved component must still pass our test suite for the basic version, and you will lose points if it does not. A safer approach is to implement the basic version first and to extend it later; if you choose to do this, and if you submit both versions, please document in your README file how to enable the extra functionality.
Rather than using Spark Framework, get your Homework 2 working on your Homework 1 Milestone 2 framework. From your Terminal, after 555-hw2 is pulled to your local repository, also run “git clone” (with your HW1 repo) to pull your Homework 1 Milestone 2 code. Then go into the appropriate subdirectory for HW1 and run the following to put it into a local Maven repository:
mvn clean install
Then modify your Homework 2 pom.xml to add:
<dependency>
  <groupId>edu.upenn.cis.cis455</groupId>
  <artifactId>homework-1</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
and of course, remove Spark Framework and change the imports from spark.Spark.* to the appropriate imports for your HW1.
For 15% extra credit, provide a Web interface for the crawler. An admin user (not all users) should be able to start the crawler at a specific page, set crawler parameters, stop the crawler if it is running, and display statistics about the crawler's execution, such as
This will entail using some sort of communication between processes, potentially through the storage system or via an HTTP request.
For 5% extra credit, allow users to choose which channels to subscribe to from a list of available channels. Add a list of the channels to which they're subscribed, in addition to all channels. Add a “subscription” interface at: localhost:8080/subscribe?channel={name}
Here, when a user logs in, you should only present content from the subscribed channels, as opposed to all channels.