The Evolution of Recommendation Systems
Recommendation systems seek to predict the 'rating' or 'preference' that a user would give to an item (such as music, books, or movies) or social element (e.g. people or groups) they had not yet considered. Some techniques work well with explicit ratings e.g. movies, music while others do well with implicit ratings e.g. page views, clicks, etc.
Collaborative filtering (CF) methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities, or preferences -- and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself.
First, there was 'Taste', a simple collaborative filtering engine that could predict what a user would like next -- be it a movie, book, or a product. Later, Taste was incorporated into Apache Mahout. As of v0.7, Mahout provides the standalone version of the original item-item, user-user and slope-one recommender as well as the distributed versions of item-item CF.
The following images demonstrate how Mahout's recommendation engine work:
Another approach is to use association rule mining (or market basket analysis) to compute interesting recommendations. But this technique does not generate personalized recommendations. Some researchers have combined ARM and CF to provide personalized recommendations.
A very successful approach to this problem are the so called latent factor models using Matrix Factorization.
In contrast to CF algorithms, this technique builds a model during the learning phase, by recognizing patterns in the training data. After the learning phase, they use the model to predict ratings of given queries. Many recent achievement of precise prediction used this approach. Specifically, these methods factorize the rating matrix into two low-rank matrices: user profile and item profile i.e. model the users and items as points in a k-dimensional feature space. An unknown rating can than simply be estimated by taking the dot product between the corresponding user and item feature vectors. They tend to take longer time compared to CF, but claimed to achieve better accuracy.
Mathematically spoken, decompose A into two other matrices U and M whose combination is a good approximation of A.
Few published techniques in matrix factorization include:
- Regularized SVD - Regularized SVD (Singular Value Decomposition) minimizes squared error between actual ratings and predicted estimations for all available votes. In order to control overfitting issue, it adds regularization terms both for user and item profiles. For minimization process, it uses gradient descent. With a proper choice of parameters, this algorithm is known to achieve good accuracy.
- Non-negative Matrix Factorization (NMF) - NMF also factorizes the rating matrix into user and item profiles, but it has one more restriction: both low-rank profile matrices should have only positive values in them. This method uses multiplicative update rules for minimizing Euclidean distance or Kullback-Leibler divergence between the actual ratings and estimation.
- Probabilistic Matrix Factorization (PMF) - PMF adopts a probabilistic linear model with Gaussian observation noise for representing latent features both for users and items.
- Bayesian Probabilistic Matrix Factorization (BPMF) - This algorithm applies a fully Bayesian treatment of PMF model in which model capacity is controlled automatically by integrating over all model parameters and hyperparameters.
- Non-linear Probabilistic Matrix Factorization (NLPMF) - This algorithm develops a non-linear probabilistic matrix factorization using Gaussian process latent variable models. The model is optimized using stochastic gradient descent method, allowing to apply Gaussian processes to data sets with millions of observations without approximate methods.
Mahout uses Alternating Least Squares with Weighted Lambda-Regularization to find the decomposition, which is an iterative algorithm and currently show poor performance on Hadoop caused by the enormous overhead of scheduling and check-pointing each single iteration.
The next wave of innovative techniques were brought out by KDD 2011 competition, similar in nature to the Netflix competition but offered a fraction of the prize money! The winners used a combination of techniques (ensemble) including ALS, KNN, SGD, SVD++. More on it in next blog article.
Finally, whatever algorithm you choose, error analysis can provide lot of insights into an algorithm's behavior!