Anonymization

Our initial project was to develop a course recommender using historical transcript data provided by the registrar. When we received these data, they did not contain obvious personal identifiers such as name or student ID, but they did contain the complete schedules and grade histories of several thousand Carleton students. The sensitivity of this information made privacy a critical issue. Even without a name or student ID, it would still be possible to identify a student from their course history, and in some cases (e.g. special majors or triple majors) it might be possible to make such an identification from the student's major alone. To address this issue we examined several mechanisms for anonymizing the dataset without rendering it useless for collaborative filtering.

k-Anonymity

In order to address the problem of anonymization, Latanya Sweeney proposed a framework called k-anonymity [4]. Suppose that the data are stored in a table where each row is a user and each column is a course. The table is said to be k-anonymous if every combination of attribute values is shared by at least k rows. In other words, our table is k-anonymous if every combination of courses and grades is shared by at least k students.
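
To make the definition concrete, here is a minimal Python sketch of the check (our own illustration, not part of the recommender system): assuming each row is a tuple of attribute values, a table is k-anonymous when every distinct tuple occurs at least k times.

    from collections import Counter

    def is_k_anonymous(rows, k):
        """Return True if every combination of attribute values in `rows`
        appears in at least k rows."""
        counts = Counter(tuple(row) for row in rows)
        return all(count >= k for count in counts.values())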

There are two primary techniques for making a table k-anonymous [3]. Suppression involves deleting entries in the table until k-anonymity is achieved, while generalization consists of modifying entries so that they are less specific. For example, a zip code might be generalized by removing one of its digits. Meyerson and Williams showed that finding the minimum number of entries to suppress in order to make a table k-anonymous is NP-hard [1], and most recent work has focused on developing efficient but sub-optimal algorithms for ensuring k-anonymity.
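
A toy illustration of these two operations in Python; the masking rule and function names are ours, chosen for this example rather than taken from any particular anonymization library.

    def generalize_zip(zip_code, digits_to_keep):
        """Generalize a zip code by blanking out its trailing digits."""
        return zip_code[:digits_to_keep] + "*" * (len(zip_code) - digits_to_keep)

    def suppress(_value):
        """Suppress an entry entirely."""
        return "*"

    # generalize_zip("43983", 3) -> "439**";  suppress("Musser") -> "*"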

Gender   Dorm     Zip Code   Major
M        Musser   43983      CS
F        Burton   54293      CS
F        Myers    43923      Math
M        Nourse   40231      CS

Table 1: A table of data. Note that this table is not k-anonymous for any k > 1 because every row is unique.

Gender   Dorm     Zip Code   Major
M        *        4****      CS
F        *        *          Science
F        *        *          Science
M        *        4****      CS

Gender   Dorm     Zip Code   Major
*        *        439**      Science
*        *        *****      CS
*        *        439**      Science
*        *        *****      CS

Table 2: Two distinct k-anonymizations (here k = 2) of the table above.
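
Using the is_k_anonymous sketch above, we can verify that, for example, the second anonymization in Table 2 is 2-anonymous but not 3-anonymous.

    table2_right = [
        ("*", "*", "439**", "Science"),
        ("*", "*", "*****", "CS"),
        ("*", "*", "439**", "Science"),
        ("*", "*", "*****", "CS"),
    ]
    assert is_k_anonymous(table2_right, 2)      # every row appears at least twice
    assert not is_k_anonymous(table2_right, 3)  # but not three times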

Our experiments with k-anonymization revealed a limitation of current algorithms. Existing research on k-anonymity has focused on anonymizing a single table in which each row represents an individual and each column is a distinct attribute. For transcript data, however, each user is identified by a set of courses rather than a tuple of fixed length. Students take different numbers of courses, which adds an extra dimension to the problem: the concept of a column is not well defined, and it is not clear which attributes to generalize or suppress.

One possible solution is to create a table with a column for every possible course, so that each cell, c(i,j), contains a binary value indicating whether user i has taken course j. After this transformation, every row has the same length, and a traditional k-anonymization algorithm can be used to find an approximate solution. This approach works, but it has several major limitations. The first is performance: greatly increasing the number of columns slows the algorithm down. The approximation algorithm proposed by Meyerson and Williams is already cubic in the number of rows [1], and increasing the time it takes to process each row can make the problem infeasible for large datasets. Even more problematic than speed, however, is that it is impossible to generalize attributes when every value is binary. This is especially troubling for collaborative filtering algorithms, which require measuring the similarity between users. If one student has taken Databases (CS 347 at Carleton) and another has taken Data Mining (CS 377), we would like to be able to say that the two students are somewhat similar, since both have taken upper-level computer science courses; but to anonymize these two rows one would have to suppress both attributes.
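
A sketch of this transformation in Python, using hypothetical course histories rather than the real transcript data:

    # Hypothetical course histories; the real dataset is not shown here.
    students = {
        "s1": {"CS 111", "CS 347", "MATH 232"},
        "s2": {"CS 111", "CS 377"},
        "s3": {"CS 347", "MATH 232"},
    }

    all_courses = sorted(set().union(*students.values()))

    # One fixed-length 0/1 row per student, one column per possible course.
    binary_rows = {
        sid: tuple(int(course in taken) for course in all_courses)
        for sid, taken in students.items()
    }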

Solving this problem would require several changes to existing k-anonymity algorithms. Rather than putting each course in a separate column, a better approach would be to compare the sets of courses taken by users and determine which pairs of courses can be generalized so as to minimize the number of necessary changes. In this case, the two courses described above might both be generalized to "CS 300-level" rather than being deleted. For large datasets it is not feasible to do this exhaustively, but it might be possible to use a heuristic to find pairs of students who are roughly similar and then compare only those students exhaustively. This remains an interesting open research problem.
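
For illustration, one such generalization rule might map a course code to its department and hundred-level. The rule below assumes course codes of the form "DEPT number"; it is a sketch of the idea, not a finished solution.

    def generalize_course(course):
        """Map a course code to its department and hundred-level,
        e.g. 'CS 347' and 'CS 377' both become 'CS 300-level'."""
        dept, number = course.split()
        return f"{dept} {(int(number) // 100) * 100}-level"

    assert generalize_course("CS 347") == generalize_course("CS 377") == "CS 300-level"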

Perturbation

While developing a k-anonymous collaborative filtering system is still an open problem, there are other techniques we can use to help protect the privacy of sensitive user data. One of the simplest methods is to randomly perturb the data. In their paper "SVD-Based Collaborative Filtering with Privacy," Huseyin Polat and Wenliang Du perturbed their dataset by adding to each rating a random number drawn from a Gaussian distribution [2]. Because the distribution has mean 0, each individual rating is obscured, but the average ratings for each user and movie should remain roughly constant. In fact, Polat and Du find only a small difference in error between the perturbed and unperturbed data. Our own experiments have corroborated this. (show results)
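
A minimal sketch of this style of perturbation (not Polat and Du's exact procedure), applied to a made-up user-by-item rating matrix:

    import numpy as np

    def perturb(ratings, sigma, seed=None):
        """Add zero-mean Gaussian noise with standard deviation sigma to every rating."""
        rng = np.random.default_rng(seed)
        return ratings + rng.normal(loc=0.0, scale=sigma, size=ratings.shape)

    # Individual entries change, but row and column means stay roughly the same in expectation.
    ratings = np.array([[4.0, 3.5, 2.0],
                        [1.0, 5.0, 4.5]])
    noisy = perturb(ratings, sigma=1.0, seed=0)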

Perturbation prevents a malicious party from determining the exact value of a specific rating, but it does not necessarily provide anonymity for the user. In the transcript dataset, for instance, we can easily perturb the grades, but it is not clear how to perturb the set of courses a student took, so a student could still be identified by their schedule. There is also a question of magnitude: unless we choose a Gaussian distribution with a very high standard deviation, it would still be possible to distinguish between a student who got mostly As and a student who got mostly Fs, and with sensitive data this may not be acceptable. Perturbation provides some privacy, especially from non-malicious users, but it does not guarantee the strict anonymity of a scheme like k-anonymity.

References

[1] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the Symposium on Principles of Database Systems, pages 223-228, 2004.

[2] H. Polat and W. Du. SVD-based collaborative filtering with privacy. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 791-795, New York, NY, USA, 2005. ACM Press.

[3] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):571-588, 2002.

[4] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.