CS 257 Assignment

One of the up-coming phases of the web application project will be the design of a database, and insertion of your data into the database. This lab exercise is intended to introduce you to some important database concepts, and to guide you through the first steps of using the PostgreSQL database management system. The exercises here depend on my "books and authors" example.

Part 1: psql, tables, and a little SQL

Login to perlman.mathcs.carleton.edu:
ssh yourusername@perlman.mathcs.carleton.edu
You will need to do some of your work in the special directory that has been set aside for you:
cd /Accounts/courses/cs257/jondich/web-f2018/yourusername
Here is the raw data we started with a couple weeks ago: books-original.csv. This file consists of lines like this:

All Clear,2010,Connie Willis (1945-) And Then There Were None,1939,Agatha Christie (1890-1976) Beloved,1987,Toni Morrison (1931-) Blackout,2010,Connie Willis (1945-) Good Omens,1990,Neil Gaiman (1960-) and Terry Pratchett (1948-2015) ...

There are a couple important things to take note of here. First, these data are organized more for human readability than for computer readability. (Writing software to parse the author fields, for example, would be easier if the author's birth and death dates were given their own fields instead of being embedded in the author field.) And second, there's a lot of duplication of data. There are several lines that include "Connie Willis (1945-)", for example, and similarly for any author with multiple books. We're going to want to convert this data into something more suitable for computer manipulation, and also eliminate duplication in our database while still preserving the relationships between books and authors.
Back before the term, Mike Tie sent you an email with information about your database account on perlman.mathcs.carleton.edu. Your database user name and the name of your database are both your normal Carleton user name. Your password for the database was in the email. Go dig through your email and retrieve that password.
To set up and use our database, we will be using the PostgreSQL database management system (often referred to as just "postgres"). To get direct access to our postgres databases, we will use the interactive command-line client psql. Try it out while logged into perlman:
psql
or if that doesn't work,
psql -U yourusername yourusername
That's a capital U, by the way. The "-U yourusername" tells psql which postgres user you are, and the second "yourusername" specifies the name of the database you want to use. If all goes well, you should see a prompt like this:
yourusername=>
Try typing a few commands. "\?" will print a description of possible commands (use the spacebar to scroll forward through the help text, and q to get out of the help text). "\l" (that's a a lowercase ELL, not a ONE) will list all the postgres databases installed on perlman. "\d" will show you all your database's tables (of which you should have none at the moment). And, of course, it's nice to know "\q" to quit.
Our books/authors database will consist of three tables. The first table will be called authors. Each row of the authors table will represent a single author, much in the same way each instance of a Java class called Author would represent a single author.

Let's create the authors table. Type or paste the following CREATE TABLE statement at the psql prompt, and hit return. Note that you can type this one line at a time, and the statement will only get executed once you have typed the semi-colon followed by a return.
CREATE TABLE authors ( id SERIAL, last_name text, first_name text, birth_year integer, death_year integer );
Notice a couple things about this CREATE TABLE statement.
- CREATE TABLE etc. is a statement in the Structured Query Language, which is virtually always abbreviated SQL, and pronounced either "ESS-CUE-ELL" or "Sequel". SQL statements and queries are distinct from the psql directives like \d and \q, but you can type either SQL or psql directives at the psql prompt.
- The type SERIAL is an integer that postgres will automatically initialize and increment for you. So every time you add a new author to the table, it will get the next integer id available.
- When you executed the CREATE TABLE statement, you should have seen this:
  NOTICE: CREATE TABLE will create implicit sequence "authors_id_seq" for serial column "authors.id" CREATE TABLE
  This is psql's way of telling you that it implemented SERIAL by creating a table-like object known as a "sequence" to keep track of the next available id value.
- A table has "fields" or "columns". These often play the same role as instance variables in classes.
- SQL has the notion of a NULL value, which indicates "this field has no value at all". Note that NULL in this context is not the same as zero. For the books/authors example, we're going to use NULL in the death_year field to indicate that the author is still alive.
Type "\d" at the psql prompt. You should see information about both the authors table and the corresponding id sequence.

Type "\d authors" to see the structure of the authors table.

Type the SQL query:
SELECT * FROM authors;
to see that you don't have any authors in the table yet.
Pick a favorite book written by somebody other than the authors in my data. I'll use "Pattern Recognition" (2003) by William Gibson as my example. Let's add Gibson to our authors table:
INSERT INTO authors (last_name, first_name, birth_year, death_year) VALUES ('Gibson', 'William', 1948, NULL);
Now try the SELECT query from above again. Do you see your author in the table? Here's what I see:
id | last_name | first_name | birth_year | death_year ----+-----------+------------+------------+------------ 1 | Gibson | William | 1948 | (1 row)
Go through the last few steps again to create the books table and and a book or two by your author:
CREATE TABLE books ( id SERIAL, title text, publication_year integer );
followed by some suitable INSERT statements, one for each book.
At this point, we have an authors table, a books table, one author, and one book. But we have not added any information to the database to indicate that our author wrote our book. Here's how we do that:
CREATE TABLE books_authors ( book_id integer, author_id integer );
and
INSERT INTO books_authors (book_id, author_id) VALUES (1, 1);
Note that this is a pretty trivial case. We have one author (William Gibson, id=1) and one book (Pattern Recognition, id=1), so adding (1, 1) to the books_authors table just says that author #1 wrote book #1. As we add more authors and more of their books, of course, the (book_id, author_id) pairs will be more interesting. We might see, for example, rows in the books_authors table like (3, 5) and (17, 5), which would say that author #5 wrote both book #3 and book #17. Or (3, 5) and (3, 12), which would say that book #3 was co-written by author #5 and author #12.
Got it? An authors table giving information about each author, a books table giving information about each book, and a books_authors table showing the authorship relationship between books and authors.

Now let's BURN IT ALL DOWN and start over:

DROP TABLE authors; DROP TABLE books; DROP TABLE books_authors;
Go back and redo the three CREATE TABLE statements from earlier in the lab to create three empty tables. Then quit psql (\q) for now.

Part 2: raw data to table data

You will start with your data in some form or other. Maybe CSV, maybe JSON, maybe a table online that you can copy and paste into a spreadsheet, etc. Regardless, it's unlikely to come separated into exactly the tables you want from the start. (E.g. Downloadable data seldom includes linking-table data like our books_authors table.)

The goal of this section is to convert our raw books-original.csv into three separate CSV files, structured so as to match our postgres tables exactly. Starting with

All Clear,2010,Connie Willis (1945-) And Then There Were None,1939,Agatha Christie (1890-1976) Beloved,1987,Toni Morrison (1931-) Blackout,2010,Connie Willis (1945-) Good Omens,1990,Neil Gaiman (1960-) and Terry Pratchett (1948-2015) ...

we want to create three CSV files. books.csv should look like this:

1,All Clear,2010 2,And Then There Were None,1939 3,Beloved,1987 4,Blackout,2010 5,Good Omens,1990 ...

authors.csv should look like this:

1,Willis,Connie,1945,NULL 2,Christie,Agatha,1890,1976 3,Morrison,Toni,1931,NULL 4,Gaiman,Neil,1960,NULL 5,Pratchett,Terry,1948,2015 ...

and books_authors like this:

1,1 2,2 3,3 4,1 5,4 5,5 ...

Python and its csv, re, and json modules can be a very handy tool for doing this kind of thing.

Grab books-original.csv and books_and_authors_converter.py.
Open up books_and_authors_converter.py and read through it. Here are a few things to pay attention to.
- As you learned during the books assignments, the csv module gives us a really easy way to move between lists of strings (which I typically call "row" in my code on the theory that CSV files are a lot like tables with rows and columns) and CSV files.
- Because the lines in a CSV file are treated as lists of strings, you sometimes need to convert between string and numeric data via typecasting—but not in a conversion like this one, where we're going from CSV to CSV.
- Iterating over a CSV reader object using a for-loop gives you one line/row at a time, which is just a Python list. Easy to work with.
- This code uses regular expressions to parse the birth and death dates of the authors. Learning about regular expressions is great, and I have some resources available for doing so, but you can also extract the birth and death dates using normal Python string methods. If you need help figuring out how to parse raw data like this, I'm happy to help.
- Do you understand how I've created the books_authors.csv file? Automating the creation of linking tables like this is one of the trickiest parts of this data conversion process, so pay attention to how I've done it in this (admittedly pretty simple) case.
Run books_and_authors_converter.py on books-original.csv. What's the output? Does the output look OK?
Did it handle properly the fact that "Good Omens" has two authors? (Hint: no.) If you're planning to run this converter a lot of times or over a very large dataset, then you would want to fix the converter. But in this situation, you could alternatively just fix the output files by hand. Fixing books_authors.csv is a little bit tricky, but if you pay attention, you can get it right.
If you've been doing all this Python work on perlman, great—the files are already on perlman. If not, then you should copy the three new .csv files over to perlman using scp, sftp, or something like that.
cd to whatever directory has the three .csv files, launch psql, and then type:
\copy authors from 'authors.csv' DELIMITER ',' CSV NULL AS 'NULL'
Try this SQL query:
SELECT * FROM authors;
Do you see the list of authors? (I hope so!)
Repeat the \copy and SELECT steps for books.csv and books_authors.csv.
Once you're confident that all three tables have the right data in them, we can do something cool. We can combine these tables to search the data in lots of interesting ways. SQL has enormous power and a ton of features, and we'll barely scratch the surface in this class. But here are a few example queries that you might find interesting. Try to guess what they do before pasting the into psql. Then see if you were right.

SELECT authors.first_name, authors.last_name, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id;

SELECT books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id AND authors.last_name = 'Willis';

SELECT authors.first_name, authors.last_name, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id AND authors.last_name LIKE 'B%' ORDER BY authors.last_name, authors.first_name;

SELECT authors.first_name, authors.last_name, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id ORDER BY authors.last_name;

Part 3: data conversion from JSON to CSV

You can do something very similar to Part 2 with raw data that comes in JSON form.

Here's my books.json raw data, and my json_to_tables.py conversion program. Feel free to play around with this version of the data conversion problem, adapt my code as you see fit, and ask lots of questions.

CS 257: Software Design

PostgreSQL lab

Part 1: psql, tables, and a little SQL

Part 2: raw data to table data

Part 3: data conversion from JSON to CSV