Background of Collaborative Filtering with Mahout
In order to set up Apache Mahout, a Java library for scalable machine learning algorithms on Hadoop, in the architecture of Mario's fabulous online shop for pizza, pasta and co (see the blog post Building an Online-Recommendation Engine with MongoDB and Mahout), we'd like to know which recommendation strategy is best for our so far fictional use case: computing real-time recommendations for 32 products and 101 users. With such a small amount of data we could also use other tools, e.g. Weka, but an actual online shop would generate far more data than we simulate here, which is why we choose Apache Mahout. Before we dive into coding details, let's have a look at what Mahout's collaborative filtering actually does.
In order to keep the recommendation logic transferable to other businesses' use cases, we opt for collaborative filtering: a technique that produces recommendations based solely on users' preferences for products, without using product features or user properties. Collaborative filtering can be user-based or item-based. User-based recommendation promotes products to a user that were bought by users similar to her.
Item-based recommendation proposes products that are similar to the ones the user already buys.
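To make the user-based idea concrete, here is a minimal plain-Java sketch (not Mahout's actual implementation, whose Taste API wraps this logic in classes such as GenericUserBasedRecommender): find the most similar user by comparing purchase-count vectors, then recommend what that user bought and the target user didn't. The data and class name are made up for illustration.

```java
import java.util.*;

public class UserBasedDemo {
    // Cosine similarity between two preference vectors (purchase counts).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Recommend products the most similar user bought that the target user has not.
    static List<Integer> recommend(double[][] prefs, int target) {
        int best = -1;
        double bestSim = -1;
        for (int u = 0; u < prefs.length; u++) {
            if (u == target) continue;
            double sim = cosine(prefs[target], prefs[u]);
            if (sim > bestSim) { bestSim = sim; best = u; }
        }
        List<Integer> recs = new ArrayList<>();
        for (int p = 0; p < prefs[0].length; p++) {
            if (prefs[target][p] == 0 && prefs[best][p] > 0) recs.add(p);
        }
        return recs;
    }

    public static void main(String[] args) {
        // Rows = users, columns = products, values = purchase counts.
        double[][] prefs = {
            {3, 0, 2, 0},   // user 0
            {3, 1, 2, 1},   // user 1 (similar taste to user 0)
            {0, 5, 0, 4},   // user 2
        };
        // User 1 is most similar to user 0, so products 1 and 3 get recommended.
        System.out.println(recommend(prefs, 0)); // prints [1, 3]
    }
}
```

Item-based recommendation works analogously, except similarity is computed between the columns (products) of the preference matrix instead of the rows (users).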
User-Item Preferences and Similarity
Alright, but what does similar mean in this context? In collaborative filtering, similarity between users (for user-based recommendations) or items (for item-based recommendations) is computed from the user-item preferences only. We use the number of times a user bought a product as a proxy for the user's preference. It's not a perfect proxy, but it does the trick and it's easy to gather. One could also use the number of clicks or views, or a combination of those.
Based on these user-item preferences, we can use the Euclidean distance or the Pearson correlation to determine the similarity between users or items (products). Under the Euclidean distance, two users are similar if the distance between their preference vectors, projected into a Cartesian coordinate system, is small. The Pearson correlation (computed on demeaned user-item preferences) coincides with the cosine of the angle between the preference vectors. That is, two users are similar if the angle between their preference vectors is small, or, formulated in terms of correlation, if they rate the same products high and other products low, intuitively spoken.
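The two measures can be sketched in a few lines of plain Java (Mahout ships its own versions as EuclideanDistanceSimilarity and PearsonCorrelationSimilarity; the code below is only an illustration of the math, with made-up toy data). Note how Pearson, by demeaning first, rewards users whose preferences rise and fall together even when one buys more overall:

```java
import java.util.*;

public class SimilarityDemo {
    // Euclidean similarity: large when the distance between preference vectors is small.
    static double euclideanSimilarity(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Pearson correlation: the cosine of the angle between the demeaned vectors.
    static double pearson(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0);
        double mb = Arrays.stream(b).average().orElse(0);
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - ma, db = b[i] - mb;
            dot += da * db;
            na += da * da;
            nb += db * db;
        }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        double[] u1 = {5, 3, 1};
        double[] u2 = {4, 2, 0};   // same taste as u1, just fewer purchases overall
        double[] u3 = {1, 3, 5};   // opposite taste
        System.out.printf("euclidean(u1,u2) = %.3f%n", euclideanSimilarity(u1, u2));
        System.out.printf("pearson(u1,u2)   = %.3f%n", pearson(u1, u2)); // 1.000: perfectly correlated
        System.out.printf("pearson(u1,u3)   = %.3f%n", pearson(u1, u3)); // -1.000: anti-correlated
    }
}
```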
However, user-item preferences can be (intentionally) reduced to pure associations, i.e. the user either buys or doesn't buy the product (or views or doesn't view it, etc.). In this case, similarities between users or items can be computed with the Tanimoto coefficient or the log-likelihood ratio. Both measure how likely, or unlikely, it is that two users share associations with some items but not with others.
The Tanimoto similarity between two users is computed as the number of products the two users have in common, divided by the total number of products they bought (or clicked or viewed) overall.
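In other words, it is the size of the intersection of the two users' product sets divided by the size of their union. A minimal plain-Java sketch (Mahout's own implementation is TanimotoCoefficientSimilarity; the product names here are made up):

```java
import java.util.*;

public class TanimotoDemo {
    // Tanimoto coefficient: |intersection| / |union| of the two users' product sets.
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> mario = new HashSet<>(Arrays.asList("margherita", "funghi", "penne"));
        Set<String> luigi = new HashSet<>(Arrays.asList("margherita", "funghi", "lasagne"));
        // 2 products in common, 4 products overall -> 0.5
        System.out.println(tanimoto(mario, luigi)); // prints 0.5
    }
}
```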
This isn't a detailed description of similarity measures, and it doesn't need to be one: even with a full grasp of the concepts and computational details, in the end one would probably still prefer a data-driven decision when choosing between them for the particular use case at hand.
So Mario decided to implement all of the recommenders mentioned above, that is, user- and item-based recommendation, each combined with one of the four similarity measures, plus the Slope One recommender, which doesn't need any similarity measure as input at all. Once all nine Mahout recommendation strategies are implemented, he wants to evaluate and compare them.
Stay tuned for the coding details of how to integrate the open source recommendation framework Mahout into Mario’s online shop.
Please feel free to attend our talk "Building an Online-Recommendation Engine with MongoDB" at the free GOTO NoSQL Munich – part II in Munich on April 9, 2013, to get a live and comprehensive presentation of our online-recommendation engine. Furthermore, we would love to meet you at the NoSQL Roadshow Munich 2013, a great place to learn more about NoSQL and Big Data technologies. To get a 30% discount, please use the comSysto code COMSYSTO30.
Published at DZone with permission of comSysto GmbH, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.