Wikipedia Data Analysis
Dave Musicant, Carleton College
Wikipedia is an amazing and bizarre community-driven project where
anyone can make edits to almost any article. If you haven't ever
tried to edit a Wikipedia article before, you should do so -- it's fun
and interesting.
Wikipedia is rapidly becoming the default and free encyclopedia of the
world. This is amazingly cool, but also scary considering some of the flaws
present in Wikipedia. There are dramatic inequities amongst the Wikipedia
contributor community, and in the choices that they make. One example in
particular is that only 13% of contributors are women. Is this a problem? Some
might argue that the gender of an author doesn't matter if the content is
fine. What can we say about "female" vs. "male" content in Wikipedia?
For this project, you'll reproduce research that looked at English Wikipedia
articles of interest to women and articles of interest to men, and compared
the lengths of those articles [3]. (Disclosure: the creator of this assignment was
one of the authors of that paper.) Research has shown that article length
correlates with article quality [1, 2]. While it is not a perfect
predictor, using the length of an article is a good proxy for estimating how
good an article is.
More specifically, for this project, you will attempt to determine if
Wikipedia articles of interest to men and women are of considerably different
lengths.
The data
Determining the (approximate) length of a Wikipedia article is easy. That
will be the last step of the work that you do. The challenging part is to
determine which articles are more interesting to males vs. females in a
systematic and reproducible way. The obvious thing to do is to get this data
from Wikipedia somehow, except that this is hard. Most Wikipedia editors do not
supply their gender (this info in a user profile is optional), so there may be a
strong "self-selection" bias among those who supply gender info and those who
do not.
Instead, you will use gender information from MovieLens, which is a free
online movie recommendation site. Over 80% of the users in MovieLens report
their gender (unlike Wikipedia, where only 2.8% of contributors report their
gender). While it is possible that there is bias or inaccuracy in the MovieLens
gender data, this seems much less likely than in Wikipedia.
Specifically, you will use MovieLens data to identify which movies should be
of strongest interest to women, and which movies should be of strongest interest
to men. You'll then compare the average lengths of the Wikipedia articles about
those movies to look for a difference.
Your task
Your job is to:
- Download the
most recent MovieLens dataset with demographic information. The data is a
little on the old side; alas, more recent releases by the project have not
included demographic information on the users. (For the actual research study,
one of the co-authors was on the MovieLens team, and
had access to more recent data behind the scenes.)
- The data that you download has a README file within it. After you unzip
the data, read the README file to learn about how the data is stored.
- Write a program to read the three files (movies.dat, ratings.dat,
and users.dat) into Python. Think very carefully about how to store the
information. For example, if you'll want to look up information in movies.dat
by a movie id, you'll want to put the data into a dictionary keyed on movie
id. Don't just start coding here; read through the rest of the assignment
first, and think about what your algorithm will look like. You'll be looping
through one set of data, and doing lookups on others. What are you looping
over, and what are you looking up? You want to store your data appropriately
to make this fast.
- Produce a list of the 20 most "male" movies and the 20 most "female"
movies. Figuring out how to measure the genderedness of a movie is part of
your task, and is not clear cut. Should you use the average rating by people
from each gender, which measures how a movie was liked by each gender? Or
should you use how often a movie was rated by each gender, regardless of
whether the rating was positive or negative? This would measure what movies
each gender chose to watch, regardless of the opinion they formed. For the top
20 female movies (and ditto for male), should you choose the movies that score
the highest on whichever metric you choose for "femaleness"? Or should you
choose the movies that have the highest difference between the female scores
and the male scores? You'll need to argue the technique you choose. You might
want to try more than one.
- Once you have chosen your two lists of 20 movies, measure the length of
the English Wikipedia article for each. This is hard to completely automate
because the names of the movies in the MovieLens dataset don't precisely match
the names in Wikipedia. You'll have to manually search Wikipedia to find
the names of the Wikipedia articles that match each movie. Once you've done
this, you can use or modify this program I
wrote to measure the lengths of a series of Wikipedia articles.
- Summarize your results in a way that is meaningful. Submit a short paper
(perhaps 3 pages or so, including tables of data or graphs) describing how you
approached what you did, and what you learned. You can use whatever software
you like to create this document, but you should submit it as a PDF. This is
good practice for transmitting electronic work: sending word processor
documents (such as Microsoft Word files) does not guarantee that your reader
will see the layout in the same way that you do.
- You should ultimately submit both your Python program(s) and your paper.
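To make the "what are you looping over, and what are you looking up?" question concrete, here is a minimal sketch of one way the pieces could fit together. It is not the required design. It assumes each line is a record whose fields are separated by "::", in the order shown in the comments; check that against the README before relying on it. The sample rows, ids, and titles are made up for illustration, and female_share is only one of the candidate metrics discussed above.

```python
from collections import defaultdict

# Made-up sample rows in the ASSUMED layout (verify against the README):
#   users:   user id :: gender :: (other fields)
#   movies:  movie id :: title :: (other fields)
#   ratings: user id :: movie id :: rating :: timestamp
USERS = """1::F::25::3::55057
2::M::35::7::55057
3::F::18::4::55057
4::M::45::1::55057"""
MOVIES = """10::Movie A (1990)::Drama
20::Movie B (1995)::Action"""
RATINGS = """1::10::5::978300760
3::10::4::978300761
2::10::2::978300762
2::20::5::978300763
4::20::4::978300764
1::20::3::978300765"""


def parse(text):
    """Split each non-empty line into its '::'-separated fields."""
    return [line.split("::") for line in text.splitlines() if line.strip()]


# Dictionaries keyed on id turn every lookup into a single fast step.
gender_of = {user_id: gender for user_id, gender, *_ in parse(USERS)}
title_of = {movie_id: title for movie_id, title, *_ in parse(MOVIES)}

# Loop over the ratings and look up each rater's gender as you go.
counts = defaultdict(lambda: {"F": 0, "M": 0})
for user_id, movie_id, _rating, _timestamp in parse(RATINGS):
    counts[movie_id][gender_of[user_id]] += 1


def female_share(movie_id):
    """One possible metric: the fraction of a movie's raters who are female."""
    c = counts[movie_id]
    return c["F"] / (c["F"] + c["M"])


# Most "female" first under this metric; slice the list to keep a top N.
ranked = sorted(counts, key=female_share, reverse=True)
for movie_id in ranked:
    print(title_of[movie_id], counts[movie_id], round(female_share(movie_id), 2))
```

With real data you would read the lines from the three files instead of from strings. Trying a different metric (average rating by gender, or a female-minus-male difference) only means changing the tally and the function passed as key.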
Parts 1 and 2
In order to get you started on this assignment, there are actually two
submissions you'll need to make. Part 2 is the final project, as described
above. For Part 1, submit Python code which determines (and prints out) the
number of males and the number of females, separately, that rated the movies
"Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These
numbers in particular will help determine if you are on the right
track.
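A minimal sketch of the kind of counting Part 1 asks for is shown here on made-up data, so the numbers it prints say nothing about the real answers. The "::" separator and field order are assumptions to verify against the README, and the titles and ids are placeholders. It also assumes each user rates a given movie at most once, so counting ratings is the same as counting people.

```python
from collections import Counter

# Made-up rows in the ASSUMED '::'-separated layout; titles and ids are placeholders.
users = ["1::F::25::3::55057", "2::M::35::7::55057", "3::M::18::4::55057"]
movies = ["10::Movie A (1990)::Drama", "20::Movie B (1995)::Comedy"]
ratings = ["1::10::5::0", "2::10::3::0", "3::10::4::0", "1::20::2::0"]

# user id -> gender
gender_of = {}
for line in users:
    user_id, gender = line.split("::")[:2]
    gender_of[user_id] = gender

# Invert the movie table so a title can be turned into its id.
id_of_title = {}
for line in movies:
    movie_id, title = line.split("::")[:2]
    id_of_title[title] = movie_id

wanted = ["Movie A (1990)", "Movie B (1995)"]
wanted_ids = {id_of_title[title] for title in wanted}

# Tally (movie id, gender) pairs, but only for the movies of interest.
tally = Counter()
for line in ratings:
    user_id, movie_id = line.split("::")[:2]
    if movie_id in wanted_ids:
        tally[(movie_id, gender_of[user_id])] += 1

for title in wanted:
    movie_id = id_of_title[title]
    print(f"{title}: {tally[(movie_id, 'M')]} male, {tally[(movie_id, 'F')]} female")
```

A Counter returns 0 for a key it has never seen, so a movie with no raters of one gender prints a zero rather than raising an error.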
Closing notes
- There are undoubtedly other ways of solving this problem by using other
tools. The point of this assignment, however, is to learn how to use
programming with dictionaries and other structures in the context of a hopefully
interesting problem. Don't do this assignment via some other magical tool. Spreadsheets
have some really neat tricks that might make this program mostly
unnecessary, but they would fail if the dataset had 10 million rows.
- A research paper such as this one, in computer science, is typically written
to describe in detail the data used, the approach taken, and an analysis
of the results. It does not include low-level details of the program itself,
like "I looped over the data, and incremented a count of the number of movies
that males watched." If you'd like to see what an actual research paper looks
like, here is the actual
gender paper on Wikipedia on which this assignment was based. This is
considerably longer than the paper you're being asked to write, and is at a
considerably higher level; it was written by computer science faculty and
graduate students. Still, it might be interesting to look at, and at least
give you a rough sense of what the important parts of a paper such as this one
might be.
- Addendum to the above point: don't let the above paper squelch your
creativity. Don't use it as a source on how to make specific decisions on how
to measure things. There are many decisions made that were judgment calls;
use your own judgment on those matters, rather than simply mimicking the
choices that the paper made.
References
[1] J. E. Blumenstock. Size matters: Word count as a measure of
quality on Wikipedia. In Proc. WWW 2008. ACM.
[2] T. Wöhner and R. Peters. Assessing the quality of Wikipedia
articles with lifecycle based metrics. In Proc. WikiSym 2009,
New York, NY. ACM.
[3] Shyong (Tony) K. Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen,
David R. Musicant, Loren Terveen, and John Riedl. WP:Clubhouse?: An
exploration of Wikipedia's gender imbalance. In Proc. WikiSym 2011,
New York, NY. ACM.