CS 334 MongoDB Lab

Clicker evaluations

This first part has nothing to do with MongoDB… the friendly folks in Carleton academic technology support who supplied me with clickers would like you to do an evaluation on them. This isn't directly an evaluation on me, or on the course, but more on how the clickers themselves integrated. Please fill out this evaluation form before moving on to the lab. If you're in a pair, each of you should do this; one of you should just turn around and chat with neighbors while your partner fills it out, then swap. I'm assuming that this evaluation doesn't require significant privacy. If I'm wrong about that, and you'd prefer to fill it out on your own later, you can do so; the link will remain active.

MongoDB lab!

This is a lab to help you get started in MongoDB. The idea is essentially to give you a sense of what it is like to work with MongoDB. As you go through the steps, ask questions as they come up! There is nothing you need to turn in, but please take seriously the various questions and exercises throughout. This lab is brand new, so it may occasionally be buggy (though I hope not); ask for help if something doesn't seem to be working right.

Start up MongoDB

The first thing you'll need to do is create a directory to store your MongoDB data, and then start up a MongoDB server.

  1. At a terminal prompt, navigate to the /tmp directory by typing cd /tmp. This is important, as you'll be downloading some sizable data, and Mike Tie has begged us not to clog up our server by storing this in your home directories. By putting it in /tmp, it just lives on the local computer you're logged into, and will be automatically cleared off at some point, presumably at the next reboot.
  2. In the /tmp directory, issue the command mkdir dbusername (where username is your Carleton username) to make a directory to store your mongo data. (You can actually call this directory anything you want so long as you're consistent in your usage of it.)
  3. Start up MongoDB by issuing this command:

    /usr/local/mongodb/bin/mongod --dbpath /tmp/dbusername
    
  4. You'll know MongoDB started up successfully if you see a bunch of text fly by from MongoDB indicating status information from the server, ending with a line that says something like "waiting for connections on port 27017". If you're having trouble getting the server started, ask for help.
  5. At this point, you'll want to just leave this terminal window alone. Shrink the size of the window to make it reasonably short, and move it off to a corner of your screen. You want to be able to see messages if they appear there, but you won't be actively working with this window.

Import sample data

The MongoDB project has some sample data you can import to play with; you'll do so in this step.

  1. Open up a new terminal window or tab (while leaving the one with the server running open), and again navigate to the /tmp directory with cd /tmp.
  2. Make sure you're in the /tmp directory before doing this next step! Issue this command (copy and paste it), which will download a sample dataset from MongoDB:

    wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
    
  3. From the terminal window, while in the /tmp directory, import the data with the following command:

    /usr/local/mongodb/bin/mongoimport --db test --collection restaurants --drop --file /tmp/primer-dataset.json
    

    If this works, you should see a message in your terminal window indicating that 25359 documents have been imported.
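
If the import reports a different number, it can help to check the downloaded file itself. Here is a minimal sketch, assuming the file holds one JSON document per line (the usual layout for mongoimport input); the function name count_documents_in_file is made up for this example.

```python
import json


def count_documents_in_file(path):
    """Count the lines in `path` that parse as a JSON object.

    Assumption: one JSON document per line. A corrupt line raises
    ValueError, which is itself a useful signal that the download broke.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            if isinstance(doc, dict):
                count += 1
    return count


# Example use (not run here, since it needs the downloaded file):
#   print(count_documents_in_file("/tmp/primer-dataset.json"))
```

If the download was complete, the count printed for /tmp/primer-dataset.json should match the number of documents that mongoimport reports.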

Finding and querying data

Visit this tutorial on finding and querying data. Work through the examples one-by-one and see if you can get them to work. Note that we have installed the pymongo library under Python 3, so you'll need to execute your Python programs as python3 programs.

This tutorial doesn't require much thought: it's pretty much copying and pasting code and seeing it work. Therefore, I want you to add the following updates:

  1. The tutorial has Python code to retrieve data from the database and print it, but when you print it, it displays as a raw dictionary. This is a mess to read. Modify the printing code to make the output readable.
  2. Some of the query syntax is pretty bizarre if you're used to looking at SQL. Focus on the "greater than operator" as a specific example. The people who built MongoDB were smart people who thought hard about what they did. Why does the syntax look like that? Talk with your partner or the people next to you as to why they might have made this choice.
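
For the printing cleanup, here is one possible sketch, not the only reasonable answer. It assumes the restaurant documents have name, cuisine, borough, address, and grades fields; the sample document is made up for illustration, and format_restaurant is a hypothetical helper name.

```python
def format_restaurant(doc):
    """Return a short, human-readable summary of one restaurant document."""
    address = doc.get("address", {})
    lines = [
        f"{doc.get('name', '(no name)')} ({doc.get('cuisine', '?')}, {doc.get('borough', '?')})",
        f"  {address.get('building', '')} {address.get('street', '')}, {address.get('zipcode', '')}",
    ]
    for grade in doc.get("grades", []):
        lines.append(f"  grade {grade.get('grade')} (score {grade.get('score')})")
    return "\n".join(lines)


# A made-up document with the same shape as the ones in the collection.
sample = {
    "name": "Example Diner",
    "cuisine": "American",
    "borough": "Manhattan",
    "address": {"building": "123", "street": "Main Street", "zipcode": "10001"},
    "grades": [{"grade": "A", "score": 9}],
}

print(format_restaurant(sample))
```

In your own program, you would call format_restaurant on each document that comes back from the cursor instead of printing the raw dictionary.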

Once you have finished the above tutorial, work through this tutorial on aggregation. Again:

  1. Clean up the printing code as you go.
  2. The aggregate query syntax is arguably even stranger than in the earlier exercise. Talk with your partner or the people next to you. What is the query saying? Why is the syntax designed this way? You might want to try to look up more detail in the MongoDB manual.
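
One way to read an aggregate query is as a list of stages that documents flow through in order. The sketch below pairs a small pipeline with a plain-Python function that does the same work on a list of dictionaries. The pipeline and the sample data are illustrative choices of mine rather than the tutorial's, and the function names are hypothetical.

```python
from collections import Counter

# Each stage is a dictionary; documents pass through the stages in order.
pipeline = [
    {"$match": {"borough": "Queens"}},
    {"$group": {"_id": "$cuisine", "count": {"$sum": 1}}},
]


def count_by_cuisine_in_python(docs, borough):
    """Plain-Python equivalent of the pipeline above, for comparison."""
    matched = [d for d in docs if d.get("borough") == borough]  # the $match stage
    counts = Counter(d.get("cuisine") for d in matched)         # $group with $sum: 1
    return [{"_id": cuisine, "count": n} for cuisine, n in counts.items()]


def count_by_cuisine_in_mongo():
    """Run the pipeline against the lab's test.restaurants collection."""
    # Imported here so the pure-Python half works without a running server.
    from pymongo import MongoClient

    client = MongoClient()
    return list(client.test.restaurants.aggregate(pipeline))


# Made-up sample data for the pure-Python version.
docs = [
    {"borough": "Queens", "cuisine": "Thai"},
    {"borough": "Queens", "cuisine": "Thai"},
    {"borough": "Queens", "cuisine": "Greek"},
    {"borough": "Bronx", "cuisine": "Thai"},
]
print(count_by_cuisine_in_python(docs, "Queens"))
```

Comparing the two versions side by side may help with the "why is the syntax designed this way" question: the pipeline is ordinary data that can be built, stored, and sent to the server, whereas the Python version is code that has to run where the data lives.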

Inserting data

Work through this tutorial on inserting data. Additionally:

  1. See if there's anything you can do to clean up the syntax. That inserted record in the tutorial is a mess. Can you break it down into subdictionaries to make your code more readable?
  2. Can you use the query syntax you learned above to find the record you inserted?
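
As a rough sketch of the subdictionary idea, the record can be assembled from named pieces before it is inserted. All of the values below are made up for illustration (they are not the tutorial's record), and insert_and_find is a hypothetical helper name.

```python
from datetime import datetime

# Build the nested pieces first, each under a descriptive name.
address = {
    "building": "100",
    "street": "Example Avenue",
    "zipcode": "10001",
    "coord": [-73.99, 40.75],
}
grades = [
    {"date": datetime(2014, 10, 1), "grade": "A", "score": 11},
    {"date": datetime(2014, 1, 16), "grade": "B", "score": 17},
]

# The top-level document now reads like an outline of the record.
restaurant = {
    "name": "Example Trattoria",
    "borough": "Manhattan",
    "cuisine": "Italian",
    "address": address,
    "grades": grades,
    "restaurant_id": "99999999",
}


def insert_and_find(doc):
    """Insert `doc` into test.restaurants, then fetch it back by its _id."""
    # Imported here so the dictionaries above can be inspected without a server.
    from pymongo import MongoClient

    collection = MongoClient().test.restaurants
    result = collection.insert_one(doc)
    return collection.find_one({"_id": result.inserted_id})
```

Calling insert_and_find(restaurant) while your server is running should hand back the same record. Querying on a field such as name or restaurant_id, using the syntax from the earlier tutorial, is another way to find it.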

Building and utilizing an index

Go back again and find the query in the "Finding and querying data" section that found all restaurants in the borough of Manhattan. Before you build an index, let's look at the query plan that MongoDB uses. (This is the same idea as using EXPLAIN in PostgreSQL.) Here is documentation on how to get "explain" output from MongoDB. The short version is that once you have your cursor object, you issue the following command:

    print(cursor.explain())

The output is pretty dense, but take a look at it and see what you can get out of it. This documentation on "explain" results might help a little. After you've discussed it, copy and paste the output to a file so you have it somewhere for comparison purposes.

Next, look at this documentation on how to build an index, and build an index on borough. Then run your query again, looking at the "explain" output. Compare it to what you had previously. How is it different? Can you find evidence that it is using the index?
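
To make the before-and-after comparison easier, here is a hedged sketch that pulls just the stage names out of a query plan. It assumes the explain output has a queryPlanner.winningPlan entry whose stages nest through inputStage; the exact layout varies across MongoDB versions, so treat that as an assumption to check against your own output. The two sample plans are hand-written and simplified, not real server output, and the function names are hypothetical.

```python
def plan_stages(plan):
    """Return the stage names in a plan, outermost first."""
    stages = []
    while plan is not None:
        stages.append(plan.get("stage"))
        plan = plan.get("inputStage")  # None once we reach the innermost stage
    return stages


def explain_borough_query(build_index=False):
    """Explain the Manhattan query, optionally building the index first."""
    # Imported here so plan_stages can be tried without a running server.
    from pymongo import MongoClient

    collection = MongoClient().test.restaurants
    if build_index:
        collection.create_index("borough")
    explanation = collection.find({"borough": "Manhattan"}).explain()
    # Assumption: this key layout matches your server version.
    return plan_stages(explanation["queryPlanner"]["winningPlan"])


# Hand-written, simplified plans showing the shapes to look for.
plan_without_index = {"stage": "COLLSCAN"}
plan_with_index = {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "borough_1"},
}

print(plan_stages(plan_without_index))
print(plan_stages(plan_with_index))
```

A collection scan (COLLSCAN) means every document was examined; an index scan (IXSCAN) feeding a FETCH is the kind of evidence that the index is being used.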

Running a MongoDB cluster

One of the major benefits MongoDB offers is the ability to run a database cluster across multiple computers. In this portion of the lab, you'll create a cluster and watch data replicate. MongoDB supports two different kinds of distribution: replicas and shards. Replicas are used to have copies of the data on multiple servers, so that different users can access different replicas. Shards have different portions of the database itself on different computers, again to achieve performance gains. Replicas are easier to set up than shards, so we'll set up replicas in this lab.

In practice, replicas are typically run on separate computers that communicate with each other. For purposes of this lab, you'll run all three replicas on your lab computer, each running in a different directory. That's because a firewall in place on campus prevents us from having the replicas actually work across lab computers. The good news is that the experience you'll have in setting up the replicas is exactly the same as it would be across multiple computers. The only difference is that all of the replicas are on the same disk drive, which means that there isn't an actual performance gain, but everything else will look the same.

  1. Shut down the MongoDB server that you started up at the beginning of the lab: find the terminal window where you left it running, and issue a ctrl-c on the keyboard. Then restart the server as a replica server. To do this, issue the following command in your server terminal window:

    /usr/local/mongodb/bin/mongod --dbpath /tmp/dbusername --replSet "rs0" --port 27017
    

    This starts up the server again, but associated with a replica set we're calling rs0, and listens for connections on the default port of 27017.

  2. In a second terminal window, navigate to the /tmp directory. Create a new database directory, and start up a second server on another port. For example:

    cd /tmp
    mkdir dbusername2
    /usr/local/mongodb/bin/mongod --dbpath /tmp/dbusername2 --replSet "rs0" --port 27018
    

    Shrink this terminal window and move it out of the way somewhere.

  3. Do it all again a third time, in yet another window:

    cd /tmp
    mkdir dbusername3
    /usr/local/mongodb/bin/mongod --dbpath /tmp/dbusername3 --replSet "rs0" --port 27019
    
  4. In yet another terminal window, start up the MongoDB shell for the first server by running the following command:

    /usr/local/mongodb/bin/mongo --port 27017
    
  5. Once the MongoDB shell has started, issue the following commands to initialize the replica set and to see its configuration state. With these and the commands that follow, watch the output that gets dumped out in the terminal windows you have left running with the servers! It's fun to see them respond.

    rs.initiate()
    rs.conf()
    
  6. Then add the remaining replica members to the set. To do so, enter the following two commands into the MongoDB shell, but change the number of the computer to match the one you're on:

    rs.add("cmc306-02.mathcs.carleton.edu:27018")
    rs.add("cmc306-02.mathcs.carleton.edu:27019")
    

    This should get the whole replica set up and going.

  7. Write Python code to add a new record to the primary replica. For me, the primary ended up being the first one (port 27017). I suppose it's possible that you might end up with a different one; MongoDB holds an election among the members to choose a single primary. If that's the case, you'll need to modify the line in your Python program that looks like this:

    client = MongoClient()
    

    to instead look like this:

    client = MongoClient('localhost',27018)  ### or, 27019 if that's the port where the primary is
    
  8. Write code to query the record you have just inserted to see if it is there.
  9. Now, we're going to kill the primary server, and show that the data did in fact replicate. Go to the terminal window for whichever server ended up being the primary. Again, that's likely the first one you started. In that server terminal window, issue a ctrl-c. Watch what happens in the server windows for the other servers. They'll continue to complain that the first server is down, but they'll also pick a new primary.
  10. Write Python code to again query for the record you added, now going up against one of the other replicas. You should be able to find the record, even though the server you originally added it to is down.
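
An alternative to hard-coding one port is to give the client all three members and let it find the current primary on its own. This is a sketch under assumptions: it uses the standard replicaSet option in a MongoDB connection string, it assumes the three ports and the rs0 name from the steps above, and the function names are made up for this example.

```python
def replica_set_uri(ports, host="localhost", name="rs0"):
    """Build a connection string that lists every replica set member."""
    hosts = ",".join(f"{host}:{port}" for port in ports)
    return f"mongodb://{hosts}/?replicaSet={name}"


def find_after_failover(query):
    """Look up one record through whichever member is currently primary."""
    # Imported here so replica_set_uri can be tried without a running server.
    from pymongo import MongoClient

    client = MongoClient(replica_set_uri([27017, 27018, 27019]))
    doc = client.test.restaurants.find_one(query)
    # client.primary is the (host, port) of the current primary, or None.
    return client.primary, doc


print(replica_set_uri([27017, 27018, 27019]))
```

With a client built this way, the same program should keep working after you kill the primary, since the client discovers the newly elected primary from the remaining members.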

Next steps

I'll admit I haven't a clue how long the above will take. If you haven't finished, that's fine; you've still gotten an introduction to MongoDB. If you have finished and have time to spare, go back and look at our old SQL lab involving course enrollments. Think about what a reasonable structure for this data would be in MongoDB, given how it is most commonly used. Insert some sample data, and see if you can write some of the queries that we did.