Wikipedia Demographic Analysis

Wikipedia Data Analysis

Dave Musicant, Carleton College

Wikipedia is an amazing and bizarre community-driven project where anyone can make edits to almost any article. If you haven't ever tried to edit a Wikipedia article before, you should do so -- it's fun and interesting.

Wikipedia is rapidly becoming the default and free encyclopedia of the world. This is amazingly cool, but also scary considering some of the flaws present in Wikipedia. There are dramatic inequities amongst the Wikipedia contributor community, and in the choices that they make. One example in particular is that only 13% of contributors are women. Is this a problem? Some might argue that the gender of an author doesn't matter if the content is fine. What can we say about "female" vs. "male" content in Wikipedia?

For this project, you'll reproduce research that looked at English Wikipedia articles of interest to women and articles of interest to men, and compared the lengths of those articles³. (Disclosure: the creator of this assignment was one of the authors of that paper.) Research has shown that article length correlates with article quality¹,². While it is not a perfect predictor, the length of an article is a good proxy for how good the article is.

More specifically, for this project, you will attempt to determine if Wikipedia articles of interest to men and women are of considerably different lengths.

The data

Determining the (approximate) length of a Wikipedia article is easy. That will be the last step of the work that you do. The challenging part is to determine, in a systematic and reproducible way, which articles are more interesting to males vs. females. The obvious thing to do is to get this data from Wikipedia somehow, except that this is hard. Most Wikipedia editors do not supply their gender (this info in a user profile is optional), so there may be a strong "self-selection" bias: those who supply gender info may differ from those who do not.

Instead, you will use gender information from MovieLens, which is a free online movie recommendation site. Over 80% of the users in MovieLens report their gender (unlike Wikipedia, where only 2.8% of contributors report their gender). While it is possible that there is bias or inaccuracy in the MovieLens gender data, it seems much less likely than in Wikipedia.

Specifically, you will use MovieLens data to identify which movies should be of strongest interest to women, and which movies should be of strongest interest to men. You'll then compare the average lengths of those Wikipedia articles to look for a difference.
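The final comparison described above can be sketched in a few lines. This is only a minimal illustration: the function name `compare_mean_lengths` is hypothetical, and the article lengths in the usage note are made-up placeholders rather than real measurements.

```python
from statistics import mean


def compare_mean_lengths(female_lengths, male_lengths):
    """Return (female mean, male mean, female mean minus male mean).

    Each argument is a non-empty list of article lengths, one per movie.
    """
    female_mean = mean(female_lengths)
    male_mean = mean(male_lengths)
    return female_mean, male_mean, female_mean - male_mean
```

For example, `compare_mean_lengths([10, 20], [30, 50])` gives means of 15 and 40 and a difference of -25. A difference in means alone does not say whether the gap is meaningful, so you may also want to look at the spread of lengths within each list.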

Your task

Your job is to:

Download the most recent MovieLens dataset with demographic information. The data is a little on the old side; alas, more recent releases by the project have not included demographic information on the users. (For the actual research study, one of the co-authors was on the MovieLens team, and had access to more recent data behind the scenes.)
The data that you download has a README file within it. After you unzip the data, read the README file to learn about how the data is stored.
Write a program to read the three files (movies.dat, ratings.dat, and users.dat) into Python. Think very carefully about how to store the information. For example, if you'll want to look up information in movies.dat by a movie id, you'll want to put the data into a dictionary keyed on movie id. Don't just start coding here; read through the rest of the assignment first, and think about what your algorithm will look like. You'll be looping through one set of data, and doing lookups on others. What are you looping over, and what are you looking up? You want to store your data appropriately to make this fast.
Produce a list of the 20 most "male" movies and the 20 most "female" movies. Figuring out how to measure the genderedness of a movie is part of your task, and is not clear cut. Should you use the average rating by people from each gender, which measures how a movie was liked by each gender? Or should you use how often a movie was rated by each gender, regardless of whether the rating was positive or negative? This would measure what movies each gender chose to watch, regardless of the opinion they formed. For the top 20 female movies (and ditto for male), should you choose the movies that score the highest on whichever metric you choose for "femaleness"? Or should you choose the movies that have the highest difference between the female scores and the male scores? You'll need to argue the technique you choose. You might want to try more than one.
Once you have chosen your two lists of 20 movies, measure the length of the English Wikipedia article for each. This is hard to completely automate because the names of the movies in the MovieLens dataset don't precisely match the names in Wikipedia. You'll have to manually search Wikipedia to find the name of the Wikipedia article that matches each movie. Once you've done this, you can use or modify this program I wrote to measure the lengths of a series of Wikipedia articles.
Summarize your results in a way that is meaningful. Submit a short paper (perhaps 3 pages or so, including tables of data or graphs) describing how you approached what you did, and what you learned. You can use whatever software you like to create this document, but you should submit it as a PDF. This is good practice for transmitting electronic work: sending a word processor document (such as a Microsoft Word file) does not guarantee that your reader will see the layout in the same way that you do.
You should ultimately submit both your Python program(s) and your paper.
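One way to organize the "loop over one file, look up in the others" idea from the steps above is sketched here. It is not the intended solution, only an illustration under stated assumptions: the `::`-separated field layouts (`MovieID::Title::Genres`, `UserID::Gender::...`, `UserID::MovieID::Rating::Timestamp`) should be checked against the README, all function names are hypothetical, and "female share of raters" is just one of the possible genderedness measures discussed above, not a recommended one.

```python
from collections import defaultdict


def load_movies(lines):
    """Return {movie_id: title}. Assumes 'MovieID::Title::Genres' lines."""
    movies = {}
    for line in lines:
        fields = line.rstrip("\n").split("::")
        movies[int(fields[0])] = fields[1]
    return movies


def load_users(lines):
    """Return {user_id: gender}. Assumes 'UserID::Gender::...' lines."""
    users = {}
    for line in lines:
        fields = line.rstrip("\n").split("::")
        users[int(fields[0])] = fields[1]
    return users


def tally_raters(rating_lines, users):
    """Return {movie_id: {'F': n, 'M': n}}.

    Loops over the ratings (the largest file) once and looks up each
    rater's gender in the users dictionary.
    Assumes 'UserID::MovieID::Rating::Timestamp' lines.
    """
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for line in rating_lines:
        fields = line.rstrip("\n").split("::")
        user_id, movie_id = int(fields[0]), int(fields[1])
        counts[movie_id][users[user_id]] += 1
    return dict(counts)


def female_share(tally):
    """Fraction of a movie's raters who are female (one possible metric)."""
    return tally["F"] / (tally["F"] + tally["M"])


def rank_by_female_share(counts, min_raters=1):
    """Return movie ids sorted from highest to lowest female share.

    Movies with fewer than min_raters total raters are skipped, since a
    share computed from a handful of ratings is mostly noise.
    """
    eligible = [m for m, t in counts.items() if t["F"] + t["M"] >= min_raters]
    return sorted(eligible, key=lambda m: female_share(counts[m]), reverse=True)
```

Each loader accepts any iterable of lines, so an open file object works directly, for example `load_movies(open("movies.dat", encoding="latin-1"))`; the encoding here is an assumption that you should confirm against the data. The `min_raters` threshold is also a judgment call that your paper would need to justify.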

Parts 1 and 2

In order to get you started on this assignment, there are actually two submissions you'll need to make. Part 2 is the final project, as described above. For Part 1, submit Python code which determines (and prints out) the number of males and the number of females, separately, that rated the movies "Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These numbers in particular will help determine if you are on the right track.
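The Part 1 lookup can be sketched as follows, assuming you have already built two dictionaries of your own: a hypothetical `movies` mapping movie id to title, and a hypothetical `counts` mapping movie id to per-gender rater tallies. The function name and the data structure are illustrative choices, not requirements.

```python
def raters_by_gender(titles, movies, counts):
    """Return {title: (males, females)} for each requested title.

    movies maps movie id -> title; counts maps movie id -> {'F': n, 'M': n}.
    A movie with no ratings at all is reported as (0, 0).
    """
    # Invert the id -> title dictionary so titles can be looked up directly.
    id_by_title = {title: movie_id for movie_id, title in movies.items()}
    result = {}
    for title in titles:
        tally = counts.get(id_by_title[title], {"F": 0, "M": 0})
        result[title] = (tally["M"], tally["F"])
    return result
```

With placeholder data such as `movies = {7: "Example Movie A (2000)"}` and `counts = {7: {"F": 3, "M": 5}}`, calling `raters_by_gender(["Example Movie A (2000)"], movies, counts)` returns `{"Example Movie A (2000)": (5, 3)}`. Note that the title must match the MovieLens title string exactly, including the year in parentheses.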

Closing notes

There are undoubtedly other ways of solving this problem by using other tools. The point of this assignment, however, is to learn how to use programming with dictionaries and other structures in the context of a hopefully interesting problem. Don't do this assignment via some other magical tool. Spreadsheets have some really neat tricks that might make this program mostly unnecessary; but they would fail if the dataset had 10 million rows.
A research paper such as this one, in computer science, typically describes in detail the data used and the approach taken, and provides an analysis of the results. It does not include low-level details of the program itself, like "I looped over the data, and incremented a count of the number of movies that males watched." If you'd like to see what an actual research paper looks like, here is the actual gender paper on Wikipedia on which this assignment was based. This is considerably longer than the paper you're being asked to write, and is at a considerably higher level; it was written by computer science faculty and graduate students. Still, it might be interesting to look at, and at least give you a rough sense of what the important parts of a paper such as this one might be.
Addendum to the above point: don't let the above paper squelch your creativity. Don't use it as a source on how to make specific decisions on how to measure things. There are many decisions made that were judgment calls; use your own judgment on those matters, rather than simply mimicking the choices that the paper made.

References

¹J. E. Blumenstock. Size matters: Word count as a measure of quality on Wikipedia. In Proc. WWW 2008. ACM.

²T. Wöhner and R. Peters. Assessing the quality of Wikipedia articles with lifecycle based metrics. In Proc. WikiSym 2009, New York, NY. ACM.

³Shyong (Tony) K. Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, David R. Musicant, Loren Terveen, and John Riedl. WP:Clubhouse?: An exploration of Wikipedia's gender imbalance. In Proc. WikiSym 2011, New York, NY. ACM.