General Collaborative Filtering Algorithm Ideas

Recommender systems can be present in all sorts of systems and situations, and thus can be implemented in many different ways. Here is an overview of the methods of implementation, which will help with understanding what we did for our comps project.

Grand Underlying Assumption of Collaborative Filtering

There is one important assumption underlying all of collaborative filtering, which is this: users who have similar preferences in the past are likely to have similar preferences in the future. It is this assumption which allows us to take a user's history and extrapolate into their future and predict items which they might enjoy.

It's pretty obvious why this can be assumed - we can see that in human nature there are people who are like each other and people who are not. And the people who are like each other tend to, of course, be similar. For recommender systems, we just look at the similarity in their preferences on a particular item.

Like all assumptions (especially those relating to extrapolation), it is not necessarily true all the time. However, for the purposes of a recommender system, it is true often enough. The partial correctness of the assumption can be seen in our results - if the assumption was without merit, we could do no better than wild guessing, but we obviously get results better than random prediction.

Explicit vs. Implicit Data Collection

In order to make any recommendations, the system has to collect data. The ultimate goal of collection the data is to get an idea of user preferences, which can later be used to make predictions on future user preferences.

There are two ways to gather the data. The first method is to ask for explicit ratings from a user, typically on a concrete rating scale (such as rating a movie from one to five stars). The second is to gather data implicitly as the user is in the domain of the system - that is, to log the actions of a user on the site.

Explicit data gathering is easy to work with. Assumedly, the ratings that a user provides can be directly interpreted as the user's preferences, making it easier to make extrapolations from data to predict future ratings. However, the drawback with explicit data is that it puts the responsibility of data collection on the user, who may not want to take time to enter ratings.

On the other hand, implicit data is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with. The goal is to convert user behavior into user preferences, but it requires getting over one hurdle: how exactly does one infer preference based on actions in a system? This can be a difficult question to answer.

Of course, these two methods of gathering data are not mutually exclusive. A combination of the two have the possibility for the best overall results - one could gain the advantages of explicit voting when the user chooses to rate items, and could still make recommendations when the user does not rate items by implicitly collecting data.

There are additional small difficulties related data gathering:

Data collection can only record the actions of a user, and knows nothing of the actual person behind the computer - it is near impossible to properly a user who actually happens to be two people using the same computer.
Explicit votes may still not be an accurate representation of the true preferences of a user if the person does not know him or herself well enough. Certain studies have shown that people can rate items differently for indiscernible reasons.
Implicit data collection can involve some privacy issues; any system that would make recommendations must avoid overstepping its bounds.

Passive vs. Active Filtering

Once you've gathered your data, there are two basic ways of filtering through it to make predictions. The most basic method is passive filtering, which simply uses data aggregates to make predictions (such as the average rating for an item). The more advanced method is to use active filtering, which uses patterns in user history to make predictions. An example of this would be finding similar users to the current user and using their history to predict a rating.

The distinction between these two methods is subtle, but basically comes down to whether or not the recommendations are user-specific or not. In passive filtering, every user will be given the same predictions for a particular item. In active filtering, the system takes into account your specific history in order to make a recommendation. To put this distinction in a solid example, the news site Digg.com uses passive filtering, showing all users the articles which have received the most votes, whereas the online sales site Amazon.com uses active filtering, trying to recommend products based upon a user's specific actions.

Ultimately, active filtering is what most people mean when they talk about collaborative filtering. Though passive filtering has very useful and practical applications, a personal recommendation system can only be implemented using active filtering.

User-centric vs. Item-centric Filtering

All recommender systems must decide whether or not it will attempt to see patterns between users or between items. A user-centric system will find similarities between users then use the similar users' preferences to predict ratings. An alternative to this is an item-centric system which will attempt to find the relationships between items, and make predictions then only based on a user's preferences and these relationships.

It is not necessary that a recommender system focus only on users or items, but most typically only find similarities between users or similarities between items and not both.

Memory-Based vs. Model-Based Algorithms

One big distinction between algorithms is that of memory-based algorithms and model-based algorithms. The basic difference is that memory-based algorithms uses all the data all the time to make predictions, whereas model-based algorithms use the data to learn/train a model which can later be used to make predictions. This means that the memory-based algorithms generally should have all data in memory, whereas model-based can make fast predictions using less data than the original (once you build the model).