Many of us see recommender systems as mysterious entities that seem to know our thoughts. Just think of Netflix’s recommendation engine, which suggests movies, or Amazon, which suggests what products we should buy. Since their inception, these tools have been refined continuously to improve the user experience. Although many of them are very complex systems, the fundamental idea behind them remains very simple.
What Is a Recommender System?
Recommender systems are a subclass of information filtering systems that present users with items they might be interested in, based on their preferences and behavior. They seek to predict how much you will appreciate an item and suggest the ones you are most likely to enjoy.
How to Create a Recommender System
Although there are many techniques to set up a recommender system, I chose to present three of the simplest and most widely used: first collaborative filtering, then content-based systems, and finally knowledge-based systems. For each system, I will explain its weaknesses and potential pitfalls, and how to circumvent them. A complete implementation of a recommender system awaits you at the very end.
The first technique, which is still among the simplest and most efficient, is collaborative filtering. This three-step process begins by collecting user information, then builds a matrix to calculate associations, and finally makes a recommendation with a fairly high level of confidence. The technique is divided into two main categories: one based on users and one based on the items that make up the environment.
User-Based Collaborative Filtering
The idea behind user-based collaborative filtering is to find users with similar tastes to our target user. If Jean-Pierre and Jason have rated several films in a similar way in the past, then we consider those two as similar users and we can use the ratings of Jean-Pierre to predict the unknown ratings of Jason. For example, if Jean-Pierre enjoyed The Return of the Jedi and The Empire Strikes Back, and Jason also enjoyed The Return of the Jedi, then The Empire Strikes Back would be a great suggestion for Jason. Generally, you only need a small number of users similar to Jason to predict his evaluations.
In a table where each row is a user and each column represents a movie, simply find the similarities between the rows in the matrix to find similar users.
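To make the row comparison concrete, here is a minimal sketch in plain Python. The ratings are made up for illustration, and the Pearson correlation is one common choice of similarity measure that I am assuming here; the helper names (`pearson`, `similar_users`, `predict`) are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: one "row" per user, one key per movie.
ratings = {
    "Jean-Pierre": {"The Return of the Jedi": 5, "The Empire Strikes Back": 5,
                    "Titanic": 1, "Alien": 4},
    "Jason": {"The Return of the Jedi": 5, "Titanic": 2, "Alien": 4},
    "Sophie": {"The Return of the Jedi": 1, "The Empire Strikes Back": 2,
               "Titanic": 5, "Alien": 2},
}

def pearson(a, b):
    """Pearson correlation over the movies that both users rated."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return 0.0
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def similar_users(user):
    """Other users, ranked from most to least similar to `user`."""
    scores = {other: pearson(ratings[user], r)
              for other, r in ratings.items() if other != user}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def predict(user, movie):
    """Similarity-weighted average rating from positively correlated users."""
    weighted = total = 0.0
    for other, sim in similar_users(user):
        if sim > 0 and movie in ratings[other]:
            weighted += sim * ratings[other][movie]
            total += sim
    return weighted / total if total else None

nearest = similar_users("Jason")[0][0]
prediction = predict("Jason", "The Empire Strikes Back")
```

With this toy data, Jean-Pierre comes out as Jason's nearest neighbor, so his rating drives the prediction for the movie Jason has not seen yet.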
There are, however, some issues associated with this type of implementation:
- User preferences change over time, which can lead to many outdated suggestions.
- The higher the number of users, the longer it takes to generate recommendations.
- User-based filtering is vulnerable to shilling attacks, in which malicious people submit fake ratings to game the system and make specific products rank higher than others.
Item-Based Collaborative Filtering
The process is simple. The resemblance of two items is calculated based on the ratings users have given them. Let’s meet Jean-Pierre and Jason once more, both of whom have enjoyed The Return of the Jedi and The Empire Strikes Back. We can, therefore, deduce that the majority of users who enjoyed the first movie may also appreciate the second. So, it would be relevant to suggest The Empire Strikes Back to Larry, who loved The Return of the Jedi.
Therefore, the resemblance is calculated between the columns rather than the rows of the user/movie matrix described above. Item-based collaborative filtering is often favored because it mitigates the disadvantages of user-based filtering. First, the items in the system (movies in this case) do not change over time, so suggestions stay relevant longer. In addition, there are typically fewer items than there are users, which reduces processing time. Finally, these systems are much harder to cheat, since no user can alter the items themselves.
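Here is a small sketch of the column-wise comparison, again in plain Python with made-up ratings. I am assuming cosine similarity this time, to show that the similarity measure is a design choice; the function names are hypothetical.

```python
from math import sqrt

# Hypothetical user -> {movie: rating} data, for illustration only.
ratings = {
    "Jean-Pierre": {"The Return of the Jedi": 5, "The Empire Strikes Back": 5, "Titanic": 2},
    "Jason": {"The Return of the Jedi": 4, "The Empire Strikes Back": 5, "Titanic": 1},
    "Sophie": {"The Return of the Jedi": 1, "The Empire Strikes Back": 2, "Titanic": 5},
    "Larry": {"The Return of the Jedi": 5},
}

def item_column(movie):
    """The matrix column for `movie`: {user: rating} for everyone who rated it."""
    return {user: r[movie] for user, r in ratings.items() if movie in r}

def item_similarity(movie_a, movie_b):
    """Cosine similarity between two movie columns over their common raters."""
    col_a, col_b = item_column(movie_a), item_column(movie_b)
    common = set(col_a) & set(col_b)
    if not common:
        return 0.0
    dot = sum(col_a[u] * col_b[u] for u in common)
    norm_a = sqrt(sum(col_a[u] ** 2 for u in common))
    norm_b = sqrt(sum(col_b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def most_similar(movie, exclude=()):
    """Every other movie, ranked by similarity to `movie`."""
    titles = {m for r in ratings.values() for m in r}
    scores = {m: item_similarity(movie, m)
              for m in titles if m != movie and m not in exclude}
    return sorted(scores, key=scores.get, reverse=True)

# Larry loved The Return of the Jedi: what resembles it that he has not rated?
suggestions = most_similar("The Return of the Jedi", exclude=ratings["Larry"])
```

With these toy numbers, The Empire Strikes Back ranks first for Larry, ahead of Titanic.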
Content-Based Recommender System
In content-based recommendation systems, the descriptive attributes of the elements are used to formulate recommendations. The term “content” refers to these descriptions. For example, looking at Sophie’s listening history, the system notices that she seems to enjoy the country genre. Consequently, the system can recommend titles of the same or of a similar genre. More complex systems are able to detect relationships amongst multiple attributes and thus produce suggestions of higher quality. For instance, the Music Genome Project categorizes each song in its database according to 450 different attributes. This project is what powers music recommendations on Pandora.
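As a rough illustration, the sketch below builds a taste profile from the attributes of the songs a user has listened to, then scores unseen songs against that profile. The catalogue, the attribute names and the scoring rule (average profile weight per attribute) are all assumptions of mine, not how Pandora does it.

```python
from collections import Counter

# Hypothetical catalogue: each song is described by a set of attributes.
catalogue = {
    "country_ballad": {"country", "acoustic", "female vocals"},
    "country_classic": {"country", "brass", "male vocals"},
    "country_folk": {"country", "folk", "acoustic", "male vocals"},
    "grunge_anthem": {"grunge", "electric guitar", "male vocals"},
    "folk_tune": {"folk", "acoustic", "male vocals"},
}

def build_profile(history):
    """Count how often each attribute appears in the listening history."""
    profile = Counter()
    for song in history:
        profile.update(catalogue[song])
    return profile

def recommend(history, top_n=2):
    """Rank unheard songs by the average profile weight of their attributes."""
    profile = build_profile(history)
    scores = {}
    for song, attributes in catalogue.items():
        if song in history:
            continue
        scores[song] = sum(profile[a] for a in attributes) / len(attributes)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sophie_history = ["country_ballad", "country_classic"]
picks = recommend(sophie_history)
```

Because Sophie's history is dominated by the "country" attribute, the remaining country song comes out on top, followed by the song that shares the most secondary attributes.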
Knowledge-Based Recommender System
Knowledge-based recommendation systems are particularly useful in a context where items are rarely purchased. Examples include items such as houses, cars, financial services, and even expensive luxury goods. In such cases, the recommendation process often suffers from a lack of ratings for the products. Knowledge-based systems do not use ratings to make recommendations. Rather, the recommendation process is performed on the basis of similarities between customer requirements and item descriptions, or on the use of constraints specifying user requirements. This makes this type of system unique, since it allows users to explicitly specify what they want. When constraints apply, they are mostly defined by experts in the field and are known from the beginning. For example, when a user clearly specifies that they are looking for a home within a given price range, the system must take this specification into account.
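The constraint-based variant can be sketched as a simple filter. The `Home` class, the listings and the `recommend_homes` function below are hypothetical; a real system would typically also rank results by how well they match softer preferences.

```python
from dataclasses import dataclass

@dataclass
class Home:
    name: str
    price: int
    bedrooms: int
    city: str

# Hypothetical listings; in a real system these come from the item catalogue.
listings = [
    Home("Downtown condo", 320_000, 2, "Quebec City"),
    Home("Suburban bungalow", 410_000, 3, "Quebec City"),
    Home("Lakeside cottage", 275_000, 2, "Magog"),
    Home("Family house", 540_000, 4, "Quebec City"),
]

def recommend_homes(homes, min_price, max_price, min_bedrooms=0, city=None):
    """Keep only the homes that satisfy every explicit user requirement."""
    matches = [
        h for h in homes
        if min_price <= h.price <= max_price
        and h.bedrooms >= min_bedrooms
        and (city is None or h.city == city)
    ]
    # Cheapest first; the ordering rule is an arbitrary choice for this sketch.
    return sorted(matches, key=lambda h: h.price)

results = recommend_homes(listings, 300_000, 450_000,
                          min_bedrooms=2, city="Quebec City")
```

Note that no rating is involved anywhere: the recommendation depends only on the stated requirements and the item descriptions.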
Cold-Start Problem in Recommender Systems
One of the major problems in recommender systems is that the number of initially available ratings is relatively small. What can we do when a new user has not yet rated any movies, or when a new movie is added to the system? In such cases, it is more difficult to apply traditional models of collaborative filtering. While content-based and knowledge-based methods are more robust than collaborative models in the presence of cold starts, the content or knowledge may not always be available either. As a result, a number of approaches, such as hybrid systems, have been designed to address this problem.
Hybrid Recommender Systems
Note that the different types of systems presented so far all have strengths and weaknesses and base their suggestions on various data points. Some recommendation systems, such as those based on knowledge, are most effective in cold start environments where the amount of data is limited. Other systems, such as collaborative methods, are more effective when a lot of data is available. In many cases, where the data is diversified, we have the flexibility to use multiple methods for the same task. We can, therefore, combine the suggestions of several techniques to improve the quality of the system as a whole. Many combination techniques have been explored, including:
- Weighted: A different weight is given to the recommendations of each technique used to favor some of them.
- Mixed: Recommendations from several techniques are presented together in a single list, without favoring any of them.
- Augmented: Suggestions from one system are used as input for the next, and so on until the last one.
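- Switching: The system chooses one technique according to the situation, for example depending on how much data is available about the user.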
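Two of these combinations can be sketched in a few lines. The scores below are invented, and the switching rule (fall back to content-based scores when the user has fewer than `min_ratings` ratings) is only one possible criterion that I am assuming for the example.

```python
# Hypothetical scores produced by two independent recommenders.
collaborative_scores = {"movie_a": 0.9, "movie_b": 0.4, "movie_c": 0.1}
content_scores = {"movie_a": 0.2, "movie_b": 0.8, "movie_d": 0.6}

def weighted_hybrid(score_sets, weights):
    """Weighted: sum each item's scores, scaled by the weight of its source."""
    combined = {}
    for scores, weight in zip(score_sets, weights):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    return sorted(combined, key=combined.get, reverse=True)

def switching_hybrid(n_user_ratings, collaborative, content, min_ratings=5):
    """Switching: use collaborative scores only when the user has enough ratings."""
    chosen = collaborative if n_user_ratings >= min_ratings else content
    return sorted(chosen, key=chosen.get, reverse=True)

weighted = weighted_hybrid([collaborative_scores, content_scores], [0.7, 0.3])
new_user = switching_hybrid(0, collaborative_scores, content_scores)
regular_user = switching_hybrid(40, collaborative_scores, content_scores)
```

The switching version is also a simple way to soften the cold-start problem: a brand-new user still gets content-based suggestions until enough ratings have been collected.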
One of the most famous examples of a hybrid system emerged from the Netflix Prize, a coding competition that ran from 2006 to 2009. The goal was to improve the accuracy of Cinematch, Netflix’s movie recommendation system, by at least 10%. The BellKor’s Pragmatic Chaos team won the one-million-dollar prize with a solution that combined more than a hundred different algorithms and managed to improve Cinematch’s suggestions by 10.06%. In case you were wondering, accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings; the competition measured it with the root-mean-square error (RMSE).
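The Netflix Prize scored entries with the root-mean-square error (RMSE) between predicted and actual ratings, so a lower value is better. The sketch below shows the calculation and the relative improvement over a baseline; the ratings are invented and the function names are mine.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root-mean-square error between two equally long lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE reduction compared with the baseline, in percent."""
    return 100 * (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical ratings, for illustration only.
actual = [4, 3, 5, 2]
baseline_predictions = [3, 3, 4, 3]
new_predictions = [4, 3, 4, 2]

baseline_error = rmse(baseline_predictions, actual)
new_error = rmse(new_predictions, actual)
gain = improvement_percent(baseline_error, new_error)
```

A 10% gain in this sense means that the new system's RMSE is 10% lower than the baseline's.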
What About AI?
Recommender systems are commonly used in an artificial intelligence context. Their ability to provide insight, predict events, and highlight correlations is typically what makes them useful in AI. Conversely, machine learning techniques are commonly used to implement recommender systems. For example, at Arcbees, we’ve successfully managed to build a movie rating prediction system using a neural network and data from IMDb. Neural networks can quickly perform complex tasks and easily manipulate big data. By using a list of movies as an input and comparing the output with the user’s rating, the network can learn on its own the rule to predict future ratings for a specific user.
Throughout my readings, I noticed two great tips that kept coming up among the experts in the field. The first is to base recommendations on items that users pay for. When a user is willing to pay, you can be assured that the rating he or she gives will be much more relevant and accurate. The second is that combining a greater number of algorithms generally pays off more than refining a single one. The Netflix Prize is a good example.
Implementing an Item-Based Recommender System
The following code demonstrates how easy and quick it is to implement an item-based collaborative filtering recommendation system. The language used is Python, and I use the pandas and NumPy libraries, which are among the most popular in the field. The data are film ratings from the MovieLens dataset.
Step 1: Finding Similar Movies
- Read the data:
import pandas as pd

# File names and column layout assume the MovieLens 100K files (u.data, u.item).
ratings_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=ratings_cols, usecols=range(3))
movies_cols = ['movie_id', 'title']
# u.item is not UTF-8 encoded, so the encoding is set explicitly.
movies = pd.read_csv('u.item', sep='|', names=movies_cols, usecols=range(2), encoding='ISO-8859-1')
# Attach the title to every rating (joins on the shared movie_id column).
ratings = pd.merge(ratings, movies)
- Build the users × movies matrix:
movieRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
- Choose a movie and generate a similarity score (correlation) between this movie and all others:
starWarsRatings = movieRatings['Star Wars (1977)']
similarMovies = movieRatings.corrwith(starWarsRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
- Remove unpopular movies to avoid inappropriate suggestions:
# movieStats was not defined in the original listing; this reconstruction
# (number of ratings and mean rating per title) is an assumption.
movieStats = ratings.groupby('title')['rating'].agg(['size', 'mean'])
ratingsCount = 100
popularMovies = movieStats['size'] >= ratingsCount
movieStats[popularMovies].sort_values('mean', ascending=False)[:15]
- Extract popular movies that are similar to our target one:
df = movieStats[popularMovies].join(similarMovies.rename('similarity'))
df.sort_values('similarity', ascending=False)[:15]
Step 2: Make Recommendations to a User Based on All Their Ratings
- Generate similarity score between each pair of movies and keep only popular ones:
userRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
- Generate recommendations for each movie seen and rated by our user (here we chose user zero):
# The user ID 0 must exist in your ratings file; the stock MovieLens 100K IDs
# start at 1, so substitute an ID that is present in your data if needed.
myRatings = userRatings.loc[0].dropna()
candidates = []
for title, rating in myRatings.items():
    # Retrieve similar movies to this one that I rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well I rated this movie
    sims = sims * rating
    # Add the score to the list of similarity candidates
    candidates.append(sims)
simCandidates = pd.concat(candidates)
simCandidates.sort_values(inplace=True, ascending=False)
- Sum up scores of identical movies:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace=True, ascending=False)
- Keep only movies that have not yet been viewed by the user:
# errors='ignore' skips rated movies that never made it into the candidates.
filteredSims = simCandidates.drop(myRatings.index, errors='ignore')
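If you want to see what Step 2 computes without downloading the dataset, here is a toy version in plain Python. The similarity values are invented stand-ins for the correlation matrix, and `recommend` is a hypothetical helper that scales, sums and filters in the same order as the steps above.

```python
from collections import defaultdict

# Hypothetical item-to-item similarities, standing in for the correlation matrix.
similarity = {
    "Star Wars (1977)": {
        "The Empire Strikes Back": 0.75,
        "The Return of the Jedi": 0.67,
        "Titanic": 0.02,
    },
    "Gone with the Wind": {"Titanic": 0.5, "Casablanca": 0.4},
}
my_ratings = {"Star Wars (1977)": 5, "Gone with the Wind": 1}

def recommend(my_ratings, similarity):
    """Scale similarities by my ratings, sum per movie, drop what I have seen."""
    scores = defaultdict(float)
    for movie, rating in my_ratings.items():
        for other, sim in similarity.get(movie, {}).items():
            scores[other] += sim * rating  # sum up scores of identical movies
    for seen in my_ratings:
        scores.pop(seen, None)  # keep only movies not yet rated
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

recommendations = recommend(my_ratings, similarity)
top_title = recommendations[0][0]
```

Movies similar to a highly rated film get a large boost, while movies similar to a poorly rated one barely move, which is exactly the effect of multiplying each similarity by the rating.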
How to Go Further?
In the example above, pandas and a single machine's CPU were sufficient to process the MovieLens dataset. Larger datasets, however, can take much longer to process. In that case, you will probably want to turn to solutions such as Spark or MapReduce, which distribute the computation across several machines.
I hope I have succeeded in helping you see that there is nothing complicated in implementing a simple and effective recommender system. Do not hesitate to comment and ask your questions!