CS 348 Project: Where are the People?

Preliminaries: Partners and Code
Introduction
The Assignment
The Graphical Interface
Write-Up Questions
Bonus
Submission

Preliminaries: Partners and Code

This is a pair programming assignment. If you are on a team, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Switch drivers every so often; you should each spend approximately 50% of the time driving.

Please note that the final submission requires you to include a file credits.txt that lists any people or websites you consulted in the course of this project. It is probably best to update this file as you go instead of trying to remember everything at the last minute.

The provided code can be downloaded as a zip file from the course Moodle page.

You do not need to look at any of the files that implement the optional graphical interface: InteractionPane.java, MapPane.java, USMaps.java, USMap.jpg, contUSmap.jpg.
You do need to understand these files: CensusData.java, CensusGroup.java, Pair.java, PopulationQuery.java, Rectangle.java, CenPop2010.txt. Javadoc for these files is posted here (the documentation is simply generated from the provided files — this is just for your convenience).

Introduction

The availability of electronic data is revolutionizing how governments, businesses, and organizations make decisions. But the idea of collecting demographic data is not new. For example, the United States Constitution has required since 1789 that a census be performed every 10 years. In this project, you will process some data from the 2010 census in order to efficiently answer certain queries about population density. These queries will ask for the population in some rectangular area of the country. The input consists of "only" around 220,000 data points, so any desktop computer has plenty of memory. On the other hand, this small size makes using parallelism less compelling (but nonetheless required and educational).

You will implement the desired functionality in several ways that vary in their simplicity and efficiency. Some of the ways will require fork-join parallelism and, in particular, Java's ForkJoin Framework. Others are entirely sequential. The last (not necessarily best) approach uses explicit threads, a shared data structure, and lock-based synchronization.

A final portion of this project involves comparing execution times for different approaches and parameter settings. You will want to write scripts to collect timing data for you, and you will want to use a machine that has at least 4 processors.

The Assignment

Overview of what your program will do

The file CenPop2010.txt (distributed with the project files) contains real data published by the U.S. Census Bureau. The data divides the U.S. into 220,333 geographic areas called "census-block-groups" and reports for each such group the population in 2010 and the latitude/longitude of the group. It actually reports the average latitude/longitude of the people in the group, but that will not concern us: just assume everyone in the group lived on top of each other at this single point.

Given this data, we can imagine the entire U.S. as a giant rectangle bounded by the minimum and maximum latitude/longitude of all the census-block-groups. Most of this rectangle will not have any population:

The rectangle includes all of Alaska, Hawaii, and Puerto Rico and therefore, since it is a rectangle, a lot of ocean and Canada that have no U.S. population.
The continental U.S. is not a rectangle. For example, Maine is well East of Florida, adding more ocean.

Note that the provided code reads in the input data and changes the latitude for each census group. That is because the Earth is spherical but our grid is rectangular. Our code uses the Mercator Projection to map a portion of a sphere onto a rectangle. It stretches latitudes more as you move North. You do not have to understand this except to know that the latitudes you will compute with are not the latitudes in the input file. If you find it helpful to do so, you can change the code to disable this projection during your testing.
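
For intuition only, here is a minimal sketch of the textbook Mercator formula for the vertical coordinate. The class and method names are hypothetical, and the provided code may use a different but related form or scaling, so treat this as an illustration of why latitudes stretch toward the North rather than as a description of the provided implementation.

```java
final class MercatorSketch {
    /**
     * Textbook Mercator y-coordinate for a latitude given in degrees.
     * Assumption: this is the standard formula, not necessarily the
     * exact expression used by the provided code.
     */
    static double mercatorY(double latitudeDegrees) {
        double phi = Math.toRadians(latitudeDegrees);
        return Math.log(Math.tan(Math.PI / 4 + phi / 2));
    }
}
```

With this formula, one degree of latitude near 60 degrees North covers more vertical distance on the map than one degree near 30 degrees North, which is the stretching described above.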

We can next imagine answering queries related to areas inside the U.S.:

For some rectangle inside the U.S. rectangle, what is the 2010 census population total?
For some rectangle inside the U.S. rectangle, what percentage of the total 2010 census U.S. population is in it?

Such questions can reveal that population density varies dramatically in different regions, which explains, for example, how a presidential candidate can win despite losing the states that account for most of the country's geographic area. By supporting only rectangles as queries, we can answer queries more quickly.

Your program will first process the data to find the four corners of the rectangle containing the United States. Some versions of the program will then further preprocess the data to build a data structure that can efficiently answer the queries described above. The program will then prompt the user for such queries and answer them until the user chooses to quit. For testing and timing purposes, you may also wish to provide an alternative where queries are read from a second file. We also provide you a graphical interface that makes asking queries more fun.

Provided code

The provided code will take care of parsing the input file (sequentially), performing the Mercator Projection, and putting the data you need in a large array. The provided code uses float instead of double since the former is precise enough for the purpose of representing latitude/longitude and takes only half the space.

The provided code includes a main function in the PopulationQuery class that takes four command-line arguments:

The file containing the input data.
x, the number of columns in the grid for queries.
y, the number of rows in the grid for queries.
One of -v1, -v2, -v3, -v4, -v5 corresponding to which version of your implementation to use.

Suppose the values for x and y are 100 and 50. That would mean we want to think of the rectangle containing the entire U.S. as being a grid with 100 columns (the x-axis) numbered 1 through 100 from West to East and 50 rows (the y-axis) numbered 1 through 50 from South to North. (Note we choose to be "user friendly" by not using zero-based indexing.) So the grid would have 5000 little rectangles in it. Larger x and y will let us answer queries more precisely but will require more time and/or space.

The program answers queries about the census data. A query describes a rectangle within the U.S. using the grid. It is simply four numbers:

The westernmost column that is part of the rectangle; error if this is less than 1 or greater than x.
The southernmost row that is part of the rectangle; error if this is less than 1 or greater than y.
The easternmost column that is part of the rectangle; error if this is less than the westernmost column (equal is okay) or greater than x.
The northernmost row that is part of the rectangle; error if this is less than the southernmost row (equal is okay) or greater than y.
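
The four error conditions above can be checked in one place. The following is a minimal sketch, assuming a hypothetical helper class QueryCheck that is not part of the provided code; how you report the error to the user is up to you.

```java
final class QueryCheck {
    /**
     * Returns true when (west, south, east, north) describes a legal
     * query rectangle on a grid with x columns and y rows (1-based).
     */
    static boolean isValid(int west, int south, int east, int north, int x, int y) {
        return west >= 1 && west <= x      // westernmost column in range
            && south >= 1 && south <= y    // southernmost row in range
            && east >= west && east <= x   // east may equal west
            && north >= south && north <= y; // north may equal south
    }
}
```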

main then repeatedly prints a single one-line prompt asking for these four numbers, reads them in, and outputs two numbers:

The total population in the queried rectangle.
The percentage of the total U.S. population in the queried rectangle.

Here is an example of running main, querying the population of Alaska, the entire US, and Puerto Rico, in that order.

$ java PopulationQuery CenPop2010.txt 108 149 -v1
Query? (west south east north | quit) 1 81 44 149
Query population:     710231
Percent of total:       0.23%
Query? (west south east north | quit) 1 1 108 149
Query population:  312471327
Percent of total:     100.00%
Query? (west south east north | quit) 1 1 108 2
Query population:    3725789
Percent of total:       1.19%
Query? (west south east north | quit) quit

What you need to implement

The provided main function works by calling two functions that you need to implement:

public void preprocess(int x, int y, int versionNum);
public Pair<Integer, Float> singleInteraction(int w, int s, int e, int n);

The preprocess method assumes that the input file has already been parsed into an array. This method performs any necessary preprocessing for the given version of your implementation. We'll go into much more detail on each version in a moment, but, for example, for versions 1 and 2 you will need to find the minimum and maximum latitude and longitude. The arguments to the preprocess method are x and y, the number of columns and rows in the map grid, respectively, and versionNum, an integer from 1 to 5 representing the version of the implementation you should be using. The preprocess method may be called more than once; each time you should restart processing from the parsed data array.

The singleInteraction method performs a single query using the implementation version most recently passed to preprocess. The arguments to the singleInteraction method are the four numbers described earlier: the westernmost column, southernmost row, easternmost column, and northernmost row of the query rectangle. This method should determine the population of the query rectangle and its percentage of the total U.S. population, and return those two pieces of information as a Pair.

Five different implementations

You will implement 5 versions of your program. There are significant opportunities to share code among the different versions, so do so if you can!

Version 1: Simple and Sequential

Before processing any queries, process the data to find the four corners of the U.S. rectangle using a sequential O(n) algorithm where n is the number of census-block-groups. Then for each query do another sequential O(n) traversal to answer the query (determining for each census-block-group whether or not it is in the query rectangle). The simplest and most reusable approach for each census-block-group is probably to first compute what grid position it is in and then see if this grid position is in the query rectangle.

You will need to determine within which grid rectangle each census-block-group lies. This requires computing the minimum and maximum latitude and longitude over all the census-block-groups. Note that smaller latitudes are farther South and smaller longitudes are farther West. Also note all longitudes are negative, but this should not cause any problems.

In the case that a census-block-group falls exactly on the border of more than one grid position, tie-break by assigning it to the North and/or East as needed.
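
One possible way to compute a grid position along a single axis is sketched below. The class GridLocator and its method are hypothetical names, and the sketch assumes max is strictly greater than min. Taking the floor sends a point that lies exactly on an interior border to the North or East cell, and the final clamp keeps a point on the far edge of the country inside the last cell.

```java
final class GridLocator {
    /**
     * Maps a coordinate to a 1-based cell index along one axis.
     * Use (longitude, minLon, maxLon, x) for the column and
     * (latitude, minLat, maxLat, y) for the row.
     * Assumes max > min.
     */
    static int cell(float value, float min, float max, int cells) {
        double fraction = ((double) value - min) / ((double) max - min);
        int index = (int) Math.floor(fraction * cells) + 1; // border ties go North/East
        return Math.min(index, cells); // the maximum itself belongs to the last cell
    }
}
```

For example, with longitudes from -100 to -60 and 4 columns, a longitude of exactly -90 sits on the border between columns 1 and 2 and is assigned to column 2.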

Version 2: Simple and Parallel

This version is the same as version 1 except both the initial corner-finding and the traversal for each query should use the ForkJoin Framework effectively. The work will remain O(n), but the span should lower to O(log n). Finding the corners should require only one data traversal, and each query should require only one additional data traversal.
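
As a rough illustration of the corner-finding step, here is a minimal fork-join sketch. It assumes the latitudes and longitudes have been copied into two plain float arrays, which is a simplification of the provided data classes, and the names CornerTask and CUTOFF are hypothetical. The cutoff value is a placeholder; choosing a good one is part of the experiments.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Returns {minLat, maxLat, minLon, maxLon} for the range [lo, hi). */
final class CornerTask extends RecursiveTask<float[]> {
    static final int CUTOFF = 1000; // placeholder sequential cutoff
    private final float[] lat, lon;
    private final int lo, hi;

    CornerTask(float[] lat, float[] lon, int lo, int hi) {
        this.lat = lat; this.lon = lon; this.lo = lo; this.hi = hi;
    }

    @Override
    protected float[] compute() {
        if (hi - lo <= CUTOFF) { // small enough: one sequential pass
            float minLat = Float.POSITIVE_INFINITY, maxLat = Float.NEGATIVE_INFINITY;
            float minLon = Float.POSITIVE_INFINITY, maxLon = Float.NEGATIVE_INFINITY;
            for (int i = lo; i < hi; i++) {
                minLat = Math.min(minLat, lat[i]); maxLat = Math.max(maxLat, lat[i]);
                minLon = Math.min(minLon, lon[i]); maxLon = Math.max(maxLon, lon[i]);
            }
            return new float[] {minLat, maxLat, minLon, maxLon};
        }
        int mid = lo + (hi - lo) / 2;
        CornerTask left = new CornerTask(lat, lon, lo, mid);
        CornerTask right = new CornerTask(lat, lon, mid, hi);
        left.fork();                 // run the left half asynchronously
        float[] r = right.compute(); // do the right half in this thread
        float[] l = left.join();
        return new float[] {
            Math.min(l[0], r[0]), Math.max(l[1], r[1]),
            Math.min(l[2], r[2]), Math.max(l[3], r[3])
        };
    }

    static float[] corners(float[] lat, float[] lon) {
        return ForkJoinPool.commonPool().invoke(new CornerTask(lat, lon, 0, lat.length));
    }
}
```

All four extremes are found in a single traversal, and the per-query population sum can follow the same divide-and-combine shape with addition in place of min and max.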

Version 3: Smarter and Sequential

This version will, like version 1, not use any parallelism, but it will perform additional preprocessing so that each query can be answered in O(1) time. This involves two additional steps:

First create a grid of size x*y (use an array of arrays) where each element is an int that will hold the total population for that grid position. Recall x and y are the command-line arguments for the grid size. Compute the grid using a single traversal over the input data.
Now modify the grid so that instead of each grid element holding the total for that position, it instead holds the total for all positions that are neither farther East nor farther South. In other words, grid element g stores the total population in the rectangle whose upper-left is the North-West corner of the country and the lower-right corner is g. This can be done in time O(x*y) but you need to be careful about the order you process the elements. Keep reading…

For example, suppose after step 1 we have this grid:

0   11   1  9
1   7    4  3
2   2    0  0
9   1    1  1

Then step 2 would update the grid to be:

0   11  12  21
1   19  24  36
3   23  28  40
12  33  39  52

There is an arithmetic trick to completing the second step in a single pass over the grid. Suppose our grid positions are labeled starting from (1,1) in the bottom-left corner. (You can implement it differently, but this is how queries are given.) So our grid is:

(1,4)  (2,4)  (3,4)  (4,4)
(1,3)  (2,3)  (3,3)  (4,3)
(1,2)  (2,2)  (3,2)  (4,2)
(1,1)  (2,1)  (3,1)  (4,1)

Now, using standard Java array notation with i as the column and j as the row, notice that after step 2, for any element not on the left or top edge: grid[i][j] = orig + grid[i-1][j] + grid[i][j+1] - grid[i-1][j+1], where orig is grid[i][j] after step 1. The subtraction is needed because the region above and to the left is counted by both the left neighbor and the upper neighbor. So you can do all of step 2 in O(x*y) by simply proceeding one row at a time from top to bottom, or one column at a time from left to right, or any number of other ways. The key is that you update (i - 1, j), (i, j + 1), and (i - 1, j + 1) before (i, j).

Given this unusual grid, we can use a similar trick to answer queries in O(1) time. Remember a query gives us the corners of the query rectangle. In our example above, suppose the query rectangle has corners (3, 3), (4, 3), (3, 2), and (4, 2). The initial grid would give us the answer 7, but we would have to do work proportional to the size of the query rectangle (small in this case, potentially large in general). After the second step, we can instead get 7 as 40 - 21 - 23 + 11. In general, the trick is to:

Take the value in the bottom-right corner of the query rectangle.
Subtract the value just above the top-right corner of the query rectangle (or 0 if that is outside the grid).
Subtract the value just left of the bottom-left corner of the query rectangle (or 0 if that is outside the grid).
Add the value just above and to the left of the upper-left corner of the query rectangle (or 0 if that is outside the grid).

Notice this is O(1) work. Draw a picture or two to convince yourself this works.
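
The single-pass update and the four-corner query can be sketched together as follows. This is one possible layout, not the required one: the class PrefixGrid is a hypothetical name, and the grid is padded with an extra all-zero column at index 0 and an extra all-zero row at index y + 1 so that "or 0 if that is outside the grid" needs no special cases.

```java
final class PrefixGrid {
    /**
     * Builds a padded grid[column][row] (1-based, row 1 in the South)
     * from rows listed north to south, as in the pictures in the text.
     */
    static int[][] fromPicture(int[][] rowsNorthToSouth) {
        int y = rowsNorthToSouth.length, x = rowsNorthToSouth[0].length;
        int[][] grid = new int[x + 1][y + 2]; // column 0 and row y+1 stay zero
        for (int r = 0; r < y; r++)
            for (int c = 0; c < x; c++)
                grid[c + 1][y - r] = rowsNorthToSouth[r][c];
        return grid;
    }

    /** Step 2: top row first, left to right, so every neighbor is already updated. */
    static void accumulate(int[][] grid, int x, int y) {
        for (int j = y; j >= 1; j--)
            for (int i = 1; i <= x; i++)
                grid[i][j] += grid[i - 1][j] + grid[i][j + 1] - grid[i - 1][j + 1];
    }

    /** O(1) population of the rectangle, using the four values described above. */
    static int query(int[][] grid, int west, int south, int east, int north) {
        return grid[east][south]          // bottom-right corner
             - grid[east][north + 1]      // just above the top-right corner
             - grid[west - 1][south]      // just left of the bottom-left corner
             + grid[west - 1][north + 1]; // above and left of the upper-left corner
    }
}
```

Building the example grid from step 1 with fromPicture, calling accumulate(grid, 4, 4), and then calling query(grid, 3, 2, 4, 3) reproduces the computation 40 - 21 - 23 + 11 from the text.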

Note: A simpler approach to answering queries in O(1) time would be to pre-compute the answer to every possible query. But that would take O(x²y²) space and preprocessing time.

Version 4: Smarter and Parallel

As in version 2, the initial corner finding should be done in parallel. As in version 3, you should create the grid that allows O(1) queries. The first step of building the grid should be done in parallel using the ForkJoin Framework. The second step should remain sequential; just use the code you wrote in version 3. Parallelizing it is bonus work.

To parallelize the first grid-building step, you will need each parallel subproblem to return a grid. To combine the results from two subproblems, you will need to add the contents of one grid to the other. The grids may be small enough that doing this sequentially is okay, and doing so is sufficient for this project. It is arguable that larger grids might benefit here from yet another ForkJoin computation, and you may optionally add this if you wish. (If you try this, to test that this works correctly, you might need to set a sequential-cutoff lower than your final setting.)

Note that your ForkJoin tasks will need several values that are the same for all tasks: the input array, the grid size, and the overall corners. Rather than passing many unchanging arguments in every constructor call, it is cleaner and probably faster to pass an object that has fields for all these unchanging values.
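
A minimal sketch of this structure is shown below, under several assumptions: the data has been copied into plain arrays, GridContext and GridTask are hypothetical names, and the two grids are combined sequentially. Each task receives only the shared context object and its own index range.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical holder for the values that are the same for every task. */
final class GridContext {
    final float[] lat, lon;
    final int[] pop;
    final float minLat, maxLat, minLon, maxLon;
    final int x, y, cutoff;

    GridContext(float[] lat, float[] lon, int[] pop, float minLat, float maxLat,
                float minLon, float maxLon, int x, int y, int cutoff) {
        this.lat = lat; this.lon = lon; this.pop = pop;
        this.minLat = minLat; this.maxLat = maxLat;
        this.minLon = minLon; this.maxLon = maxLon;
        this.x = x; this.y = y; this.cutoff = cutoff;
    }

    /** 1-based cell index along one axis; the far edge is clamped into the last cell. */
    static int cell(float v, float min, float max, int cells) {
        int index = (int) Math.floor(((double) v - min) / ((double) max - min) * cells) + 1;
        return Math.min(index, cells);
    }
}

/** Each subproblem returns its own grid[column][row]; index 0 is unused. */
final class GridTask extends RecursiveTask<int[][]> {
    private final GridContext ctx;
    private final int lo, hi;

    GridTask(GridContext ctx, int lo, int hi) { this.ctx = ctx; this.lo = lo; this.hi = hi; }

    @Override
    protected int[][] compute() {
        if (hi - lo <= ctx.cutoff) {
            int[][] grid = new int[ctx.x + 1][ctx.y + 1];
            for (int i = lo; i < hi; i++) {
                int c = GridContext.cell(ctx.lon[i], ctx.minLon, ctx.maxLon, ctx.x);
                int r = GridContext.cell(ctx.lat[i], ctx.minLat, ctx.maxLat, ctx.y);
                grid[c][r] += ctx.pop[i];
            }
            return grid;
        }
        int mid = lo + (hi - lo) / 2;
        GridTask left = new GridTask(ctx, lo, mid);
        GridTask right = new GridTask(ctx, mid, hi);
        left.fork();
        int[][] r = right.compute();
        int[][] l = left.join();
        for (int i = 1; i <= ctx.x; i++)      // sequential combine: add one
            for (int j = 1; j <= ctx.y; j++)  // grid into the other
                l[i][j] += r[i][j];
        return l;
    }

    static int[][] build(GridContext ctx) {
        return ForkJoinPool.commonPool().invoke(new GridTask(ctx, 0, ctx.pop.length));
    }
}
```

Note that every leaf allocates a full grid, so a very low cutoff combined with a large grid makes allocation and combining dominate the running time.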

Version 5: Smarter and Lock-Based

Version 4 may suffer from allocating a lot of extra grids during the first grid-building step. An alternative is to have just one shared grid that different threads add to as they process different census-block-groups. But to avoid losing any of the data, that means grid elements need to be protected by locks. To allow simultaneous updates to distinct grid elements, each element should have a different lock.

In version 5, you will implement this strategy. You should not use the ForkJoin Framework; it is not designed to allow synchronization operations inside of it other than join. Instead you will need to take the "old-fashioned" approach of using explicit threads. It is okay to set the number of threads to be a static constant, such as 4.

How you manage locks is up to you. You could have the grid store objects and lock those, or you could have a separate grid of just locks. Note that after the first grid-building step, you will not need to acquire locks anymore (use join to make sure the grid-building threads are done!).

Note you do not need to re-implement the code for finding corners of the country. Use the ForkJoin Framework code from versions 2 and 4. You also do not need to re-implement the second grid-building step. You are just re-implementing the first grid-building step using Java threads, a shared data structure, and locks.
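
One way to arrange the shared grid is sketched below, using a separate grid of lock objects. The class name LockedGridBuilder is hypothetical, and for brevity the sketch assumes the 1-based column and row of every census-block-group have already been computed into int arrays, which is a simplification.

```java
final class LockedGridBuilder {
    static final int NUM_THREADS = 4; // static constant, as the text allows

    /** Builds grid[column][row] (index 0 unused) with one lock per element. */
    static int[][] build(int[] col, int[] row, int[] pop, int x, int y) {
        int[][] grid = new int[x + 1][y + 1];
        Object[][] locks = new Object[x + 1][y + 1];
        for (int i = 1; i <= x; i++)
            for (int j = 1; j <= y; j++)
                locks[i][j] = new Object();

        int n = pop.length;
        Thread[] workers = new Thread[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; t++) {
            // Each thread handles one contiguous slice of the input.
            final int lo = (int) ((long) n * t / NUM_THREADS);
            final int hi = (int) ((long) n * (t + 1) / NUM_THREADS);
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    synchronized (locks[col[i]][row[i]]) { // protect just this element
                        grid[col[i]][row[i]] += pop[i];
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join(); // after join, no locks are needed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while building grid", e);
        }
        return grid;
    }
}
```

Because each element has its own lock, two threads block each other only when they update the same grid position at the same time.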

Experimentation

The write-up requires you to measure the performance (running time) of various implementations with different parameter settings. To report interesting results properly, you should use a machine with at least four processors and report relevant machine characteristics. The lab machines should work, although you should make sure no one else is using the machine for experiments at the same time as you. Open a terminal and enter w; that will show you if anyone is logged in remotely.

You will also need to report interesting numbers more relevant to long-running programs. In particular you need to:

Not measure time parsing the data out of the input file.
Allow the Java Virtual Machine and the ForkJoin Framework to "warm up" before taking measurements. The easiest way to do this is to put the initial processing in a loop and not count the first few loop iterations, which are probably slower. While this is wasted work for your program, (a) you should do this only for timing experiments and (b) this may give a better sense of how your program would behave if run many times, run for a long time, or run on more data.
Try to avoid having your experiments be interrupted by Java's garbage collector. You can do this by running System.gc() just before starting your experiment.
Write extra code to perform the experiments, but this code should not interfere with your working program. You do not want to sit entering queries manually in order to collect timing data.
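
The guidelines above can be combined into a small timing helper. This is a minimal sketch with hypothetical names (Timing, averageMillis); the number of warm-up runs and trials is yours to choose, and System.gc() is only a request to the JVM, so it reduces but does not eliminate garbage-collection interference.

```java
final class Timing {
    /**
     * Runs the work a few times without timing it (warm-up), requests a
     * garbage collection, then returns the average time per timed run in
     * milliseconds.
     */
    static double averageMillis(Runnable work, int warmups, int trials) {
        for (int i = 0; i < warmups; i++) {
            work.run(); // not counted: lets the JVM and ForkJoin pool warm up
        }
        System.gc(); // a hint only; the JVM may ignore it
        long start = System.nanoTime();
        for (int i = 0; i < trials; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / 1e6 / trials;
    }
}
```

The work passed in should start from the already-parsed data array, so that file parsing is excluded from the measurement.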

I also got much more worthwhile timing numbers by concatenating the census data file to itself a bunch of times. This effectively made it larger, which increased the benefit of parallel computation. You shouldn't do this for the final submission that requires correct answers, but you could do so for timing measurements.

For guidelines on what experiments to run, see the Write-Up Questions. Note you may not have the time or resources to experiment with every combination of every parameter; you will need to choose wisely to reach appropriate conclusions in an effective way.

The Graphical Interface

The provided graphical user interface (GUI) for the program is fun (we hope), easy to use, and useful for checking your program against some geographical intuition (e.g., nobody lives in the ocean and many people live in Southern California).

The GUI presents a map of the U.S. as a background image with a grid overlaid on it. You can select consecutive grid squares to highlight arbitrary rectangles over the map. When you select run, the GUI will invoke your solution code with the selected rectangle and display the result.

To run the GUI, run the main method of the class USMaps.

In the GUI, you can "zoom in" to the continental U.S. When zoomed, keep in mind two things:

Zooming means the entire grid is not shown. For example, if the grid has 12 rows and 23 columns, zooming will show 6 rows (with most of the bottom one not shown) and 13 columns. So even selecting all the visible grid rectangles will not select all the actual grid rectangles.
If you select partly-viewable grid rectangles on the edge, this selects the entire grid rectangle, including the population for the census-block-groups in the not-visible portion of the grid rectangle. For small grid sizes, this can include portions of Puerto Rico and for very small (silly) grid sizes, it can include Alaska or Hawaii.

The GUI expects that you have implemented preprocess and singleInteraction, as discussed above.

Write-Up Questions

Turn in a write-up answering the following questions. Note there is a fair amount of data collection for comparing timing, so do not wait until the last minute.

How did you test your program? What parts did you test in isolation and how? What smaller inputs did you create so that you could check your answers? What boundary cases did you consider?
For finding the corners of the United States and for the first grid-building step, you implemented parallel algorithms using Java's ForkJoin Framework. You should be able to vary the number of tasks created by modifying the sequential cutoff. Perform experiments to determine the optimal number of tasks. Note that if the sequential cut-off is high enough to reduce the number of tasks to one, then you should see performance close to the sequential algorithms, but evaluate this claim empirically.
Compare the performance of version 4 to version 5 as the size of the grid changes. Intuitively, which version is better for small grids and which version for large grids? Does the experimental data validate this hypothesis? Produce and interpret an appropriate graph or graphs to reach your conclusion.
Compare the performance of version 1 to version 3 and version 2 to version 4 as the number of queries changes. That is, how many queries are necessary before the pre-processing is worth it? Produce and interpret an appropriate graph or graphs to reach your conclusion. Note you should time the actual code answering the query, not including the time for a (very slow) human to enter the query.
What bonus projects, if any, did you implement?

Bonus

You may implement one or both of the following extensions, for one bonus point each. Do these in separate files so that we can still grade your regular version.

Parallel prefix: In version 4, the second step of grid-building is still entirely sequential, running in time O(x*y). We can use parallel prefix computations to improve this — the most straightforward approach involves two different parallel-prefix computations where the second uses the result of the first. Implement this so that the span of this grid-building step is O(log x + log y). Run experiments to determine the effect of this change on running time.
Earth curvature: Our program uses the Mercator Projection, which badly distorts locations at high latitude. Either use a better projection (read about map projections on your own) or properly treat the Earth as a sphere. Adjust how queries are handled appropriately and document your approach.

Submission

Submit all your new code files, including any additional Java files you created for testing, and all provided files, to Moodle as a zip file. Your code should be easy to compile and run, so include instructions on how to compile and run if you have done anything unusual. There are separate Moodle assignments for each subproject, so make sure to submit accordingly. As always, make sure your code is properly documented.

The grader for this course is a student taking the course. Please make sure your submission, other than the single file I will describe in a moment, is anonymous. In particular, your name(s) should not appear in your write-up or in any of your Java code.

You should include a single file credits.txt that contains the following information:

Your name and your partner's name (if applicable).
A list of people, books, websites or other sources you used in the course of completing this project. Include anyone or anything except your partner, me, and the course materials. In particular, if you had more than a passing interaction with another student, or if you used code from a website as a starting point, list that information.

To recap, you should submit a single zip file, which contains the following. It should all be anonymized with the exception of credits.txt:

Any Java code you created or modified and any other relevant files (e.g. scripts for testing).
Your write-up — PDF, doc, txt, etc.
Citations in a text file credits.txt.