Query Optimization, Team Part

This assignment involves the use of a very large database, and we've only been granted permission to do it on two particular machines in the department. You are welcome to do this on your own computer if you are inclined to learn how to install PostgreSQL, etc., though it is undoubtedly easier just to use our department machines. If you do choose to do this from home, you'll have to figure out how to adapt the directions (grabbing files, etc.) yourself. You're on your own if you do this: I'm only guaranteeing support in the lab. Please work in a team if you are at all willing; it will help cut down the bottleneck on the machines.

As soon as possible, coordinate with your teammate and have one of you sign up for a time slot.

Setup

Find machine cmc307-2 or cmc307-3 in CMC 307 and log in. Change your current directory to the course directory; at a terminal window, type

cd /Accounts/courses/cs334/data
Start up PostgreSQL. The user name is "cs334". Mike Tie has asked me not to put the password in the clear on the web, to protect us from attackers, so I have posted it on Moodle instead. To start up PostgreSQL, type
psql -U cs334
and enter the password you'll find on the Moodle page. Once you've managed to get into PostgreSQL, drop all tables and indexes that previous students might have left behind. You can see all tables with the "\dt" command, and all indexes with the "\di" command. To drop a table or an index, use commands like these:
DROP TABLE link;
DROP INDEX index1;
Once you have dropped all the tables and indexes, restore the data for this assignment from backup by typing
\i pgcrawler.dump

This should take about 3-5 minutes, so get familiar with the rest of the assignment while it runs. You'll see "CREATE TABLE," "ALTER TABLE," and "NOTICE" messages dumped to the screen; that's fine. If you haven't correctly dropped all your tables before invoking this script, however, you might run into errors.

Getting to it

The data you see was obtained by a Carleton comps group that built a search engine (à la Google) a few years ago. It is what they gathered from a crawl of Carleton's intranet at the time. The data may be incomplete, and it is most certainly out of date, but it represents a neat snapshot of Carleton's web presence at that moment.

First, get familiar with the data. We will not use all of the tables that are here; rather, we will focus on the tables word, word_to_url, and url. (Drop any other tables that you see; they are not relevant for this assignment.) Look at the columns that each table has (the PostgreSQL command "\d" is good here). You might also want to print out the first 10 records of each just to see them; you can do so by typing:

SELECT * FROM whatever LIMIT 10;
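
Similarly, "\d" followed by a table name shows that table's columns; for example:

\d word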

The following instructions have specific tasks mixed in. Turn in whatever the tasks ask for (see "What to Turn In" below).

Task 1: Draw an E-R diagram showing the entities and relationships that these three tables model. It should be a fairly straightforward picture. Indicate one-to-many, many-to-many, or one-to-one relationships as they exist by using arrows when appropriate.

The key purpose of this data was to power a search engine. In other words, for a particular word that someone was searching for, the goal was to find all URLs (web pages) that contained that word.
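
To make that concrete, here is a rough sketch of the kind of three-way join involved. The column names (id, word, word_id, url_id, url) are my guesses, not something this handout specifies; check the real names with "\d" before using anything like this:

SELECT url.url
FROM word, word_to_url, url
WHERE word.word = 'something'
  AND word_to_url.word_id = word.id
  AND word_to_url.url_id = url.id;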

Write an SQL query to print out all web pages that contain the word "computer". (In reality, this query would be constructed behind the scenes by the search engine's web front end.) Run the query and observe how long it takes.

Task 2: Include here your SQL query, and your estimate of how long the query took. How does this compare to what you would expect from a good search engine?

Use the EXPLAIN command (see the individual part of this assignment for more details) to obtain the query evaluation plan.
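
For example, prefixing a query with EXPLAIN makes PostgreSQL print the plan it would use rather than the query results:

EXPLAIN SELECT * FROM url LIMIT 10;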

Task 3: Transcribe the above plan into a relational algebra query evaluation tree, similar to the ones your textbook uses. If the dataset were much bigger (i.e., on the scale of the Internet), how would you expect your query time to change? Answer quantitatively, using your best guesstimate of the size of the Internet relative to this dataset.


Editorial notice: In various comps projects in the past, it is at about this point that students have come to their comps adviser (sometimes me, sometimes others) wielding pitchforks and torches, bellowing that the hardware they have been given is not powerful enough to accomplish their task.

You should be able to do something that will make this query, and all others like it (e.g., for any other word one might look up), produce output almost instantly. You might need to look through the PostgreSQL documentation for the exact syntax, but there is a very simple and straightforward thing you can do to dramatically speed up this query. Do it.
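
If you just want a reminder of what PostgreSQL DDL looks like, statements of the following general shape are documented in the manual; the names here are placeholders, and deciding what (if anything) to apply this to is your job:

CREATE INDEX some_index ON some_table (some_column);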

Task 4: What did you do to make your query run so much more quickly? Indicate the precise syntax that you typed into PostgreSQL to make it happen. How fast would you estimate that your query would now run, on the scale of the entire Internet?

When done, drop all of your tables and indexes. Your fellow students will thank you.

What to Turn In

Submit your answers to the above tasks electronically. For the drawing, use a tool such as Dia, or draw it on paper and scan it. Have fun!