Contextual Advertising with Hadoop/MapReduce

This is a pair programming assignment. If you are on a team, you and your partner should do the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Switch who is driving every so often; you should each spend approximately 50% of the time driving.

(Update on 3/3: I fixed the filenames below.)

We will run the click rate projects in the following way:

hadoop com.sun.tools.javac.Main ClickRate.java
jar cf clickrate.jar ClickRate*.class
hadoop jar clickrate.jar ClickRate impressions clicks output

The first line compiles your code; the second packages the compiled classes into the file clickrate.jar. The last line runs your program with input folders impressions and clicks and output folder output.

Introduction

The goal of this project is to continue to become familiar with Hadoop/MapReduce, a popular model for programming distributed systems that was developed at Google, with a widely used open-source implementation released by Apache. MapReduce is designed to solve a large class of problems in situations where the amount of data to be processed is extremely large. The idea is that such problems can be boiled down to a map step, which turns each element of the data into a key/value pair, and a reduce step, which combines elements with the same key. For example, we can count the number of times each word appears in a text by (1) mapping each word to the key/value pair (word, 1) and (2) summing the values for each word.
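To make the pattern concrete, here is a minimal sketch of that word count in Hadoop's Java API, closely following the standard Hadoop tutorial example. The class names are illustrative, and you would still need a driver class to configure and launch the job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the 1s emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}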

Using the Cluster

Here is a link to the in-class lab that explains how to run jobs on the lab machines and on the Hadoop cluster: Hadoop Lab.

Part 1: Warmup Exercise - Inverted Index

An inverted index is a mapping from words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, given the following two documents:

Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

Doc2: Buffalo are mammals.

we could construct the following inverted file index:

Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2

Your goal is to build an inverted index of words to the documents which contain them. You can try this on the files in the dataset located in /Accounts/courses/cs348w16/gutenberg on the lab machines. You can also download this data set here.

Your end result should be something of the form: (word, docid[]).
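If you are unsure how to structure this, here is one possible sketch, under the assumption that the input file's name serves as the document identifier. The class names are illustrative, and you would still need a driver class to configure the job.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Map step: emit (word, docid) for every token, using the input
    // file's name as the document identifier.
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();
        private Text docId = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName =
                ((FileSplit) context.getInputSplit()).getPath().getName();
            docId.set(fileName);
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, docId);
            }
        }
    }

    // Reduce step: collect the distinct documents containing each word.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new HashSet<String>();
            for (Text val : values) {
                docs.add(val.toString());
            }
            context.write(key, new Text(docs.toString()));
        }
    }
}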

Part 2: Building a Summary Table

Suppose we work for an internet advertising company. We want to better target our ads to users, based on prior data. That is, given an advertising context, we would like to predict which of our available advertisements is most likely to result in a click.

The ad serving machines produce two types of log files: impression logs and click logs. Every time we display an advertisement to a customer, we add an entry to the impression log. Every time a customer clicks on an advertisement, we add an entry to the click log.

For this assignment, we will look at one particular feature: the page URL on which we will be showing the ad. This is called the "referrer" in the impression logs. DO NOT use just the hostname! Given a page URL and an ad id, we wish to determine the click-through rate: the fraction of impressions with the desired page URL and ad id that were clicked. For example, if an ad was shown 40 times on a given page and clicked 6 times, its click-through rate is 6/40 = 0.15. For consistency's sake, you should output a double between 0 and 1.

Your goal is to build a summary table of click-through rates, which could later be queried by ad serving machines to determine the best ad to display. Logically, this table is a sparse matrix whose axes are page URL and ad id; each entry is the fraction of times that ad was clicked when shown on that page.

Please do this on the files in the dataset located here. These files were generated by merging a large number of small files. If you want to work with the set of smaller files (e.g. to create a smaller data set for testing), they are here. Do not download either of these tarballs into your lab home directory. The files are available on the lab machines in /Accounts/courses/cs348, in the folders impressions, clicks, impressions_merged, and clicks_merged. I am posting them here so you can download them onto your personal machine if you wish.

If you wish to download these files to an AWS cluster that you have created with StarCluster, first ssh into the master node of the cluster. Once you have done so, you can grab a file you want with the wget command. For example:

wget http://www.cs.carleton.edu/faculty/dmusican/cs348w16/hadoop/click_through_data_merged.tar.gz

If you do choose to download them to your own computer or to the AWS cluster (not your lab account), make sure to put the file in a directory of its own. On any UNIX system, you can then use the tar command to expand the file, for example:

tar xvfz click_through_data_merged.tar.gz

You'll have to do a little more than simply tokenize the text on whitespace. The log files are stored in JSON format, with one JSON object per line. You should use a JSON library, such as json-simple. You'll have to add the jar file for your JSON library to your HADOOP_CLASSPATH before compiling your program. (Go back and look at the initial Hadoop lab; configuring HADOOP_CLASSPATH was one of the first things we did.) For example, after downloading json_simple-1.1.1.jar from the json-simple website, I could update my HADOOP_CLASSPATH as follows:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/Accounts/dmusican/json-simple-1.1.1.jar

… where the path would be adjusted to wherever I actually placed the jar file.
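Once the jar is on your classpath, parsing a log line with json-simple looks roughly like the sketch below. The field names "referrer" and "adId" are assumptions for illustration; check the actual records in the logs for the exact keys.

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Inside a mapper's map() method, where each input line (value) is
// one JSON object.
JSONParser parser = new JSONParser();
try {
    JSONObject record = (JSONObject) parser.parse(value.toString());
    String referrer = (String) record.get("referrer");  // hypothetical key
    String adId = (String) record.get("adId");          // hypothetical key
    // ... emit a key/value pair built from these fields ...
} catch (ParseException e) {
    // Skip malformed lines rather than crashing the whole job.
}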

You should produce output in the following format:

[page_url, ad_id] click_rate

This can be achieved by making the key the string [page_url, ad_id], and the value click_rate.
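As a rough sketch of that final step in a reducer (pageUrl, adId, clickCount, and impressionCount are hypothetical variables holding your tallied values; Text and DoubleWritable come from org.apache.hadoop.io):

// Emit the click rate for one (page URL, ad id) pair as a double in [0, 1].
double clickRate = (double) clickCount / impressionCount;
context.write(new Text("[" + pageUrl + ", " + adId + "]"),
        new DoubleWritable(clickRate));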

Submission of part 2: two portions

Part 2a: Some form of analysis on the click/impressions data

This is an intermediate submission to demonstrate that you are making progress on this project. Submit some form of working Hadoop code that does some analysis on the click/impressions data; precisely what sort of analysis is up to you. Do something. The key requirements are that your code uses the correct data for the assignment, and that it successfully runs and produces some sort of reasonable output.

Your submission should follow the guidelines in the submission section below. In your README.txt, describe briefly what your current version of the code does.

Part 2b: Final submission

This is where you submit the code that actually works as specified above.

Bonus Work

If you've finished the project, try implementing these extensions for a bonus point each:

  1. Write a query program on top of your feature index, which accepts a user-specified page URL and returns the ID of the ad to show.
  2. There are other features besides page URL which could help predict a click. Include both user agent and IP address as features in your table. You might consider adding prefixes to the features to distinguish the different types. (If you implement this feature, please include it as a separate file for grading purposes.)

Submission

  • A zip file containing your code, including Java files and any necessary libraries.
  • A README.txt file, containing detailed instructions on how to compile and run your code. Also indicate which bonus work, if any, you attempted.
  • Citations in a text file credits.txt.

Some last tips

You might find this sample code useful; it shows how to apply different mappers to different inputs.
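As a rough idea of what that looks like, Hadoop's MultipleInputs class lets a single job route each input folder to its own mapper. In this sketch, ImpressionMapper and ClickMapper are hypothetical classes you would write:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// In the driver, after creating the Job: assign a mapper per input folder.
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, ImpressionMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, ClickMapper.class);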

Author: This project was created by Andrew Lorenzen and Jeannie Albrecht for CS 339 at Williams College. The wording in some sections is partially due to Dan Grossman. Further modifications were made by Dave Musicant.

Created: 2016-03-07 Mon 12:02
