High-Quality Recommendation Systems with Elasticsearch: Part I

We argue that recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance.” The only difference is whether a keyword query is provided. Read this article to find out more.

Doug Turnbull

Updated Sep. 18, 16 · Opinion

Like (7)

Save

11.2K Views

Did you know that technical books can have spoilers? Spoiler alert: we close our book, Relevant Search, making an argument that search engine technology is a great platform for delivering recommendations. We argue that recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance.” The only difference is whether a keyword query is provided.

We argue that search engines provide the ideal framework for implementing easy-to-tune recommendations. In fact, I’ll be talking about this idea at Lucene Revolution in not too long (see also my coauthor’s talk about this subject at Elastic{On}).

(Note: While it’s not required reading, definitely check out my previous article on basket analysis for more information.)

Aggregations: Find What People Who Like This Movie Also Like

Let’s dig in! You’re running a movie website. You’ve got a set of users and you want to know what to recommend to them. One idea you might have is to index each user as a document, like so (movies_liked here is setup as a keyword analyzed field, which I’m omitting):

PUT recs/user/1
{    "movies_liked": ["Forrest Gump", "Terminator", "Rambo", "Rocky", "Good Will Hunting"]}

PUT recs/user/2
{    "movies_liked": ["Forrest Gump", "Terminator", "Rocky IV", "Rocky", "Rocky II", "Predator"]}

PUT recs/user/3
{    "movies_liked": ["Forrest Gump", "The Thin Red Line", "Good Will Hunting", "Rocky II", "Predator", "Batman"]}

PUT recs/user/4
{    "movies_liked": ["Forrest Gump", "Something about Mary", "Sixteen Candles"]}

We want to perform recommendations for users that like Terminator. In other words, we want to know what users who like Terminator also like. This is very basic “basket analysis” – it’s the fundamental building block of recommendations. The name comes from the idea of looking at user’s shopping baskets and finding statistically interesting relationships. Famously, grocery chains learned that people who bought diapers also frequently bought beer. Insights like this can help extremely valuable to users and businesses. Here, we’ll focus in on Terminator to find what’s statistically interesting to Terminator viewers.

With documents that store user history, how do we figure out what viewers of Terminator would like? The first idea that might come to you is to run a terms aggregation. A terms aggregation is a traditional facet. It gives a raw count of the terms for a field in the current search results. In a typical application, terms would probably be movie IDs, but we’ll use titles here. We’ll search for “Terminator” and aggregate on the “movies_liked” field to get a raw count of the movies that occur frequently alongside Terminator, like so:

The end result, just focusing on the aggregations, is the following breakdown:

"movies_like_terminator": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [{
        "key": "Forrest Gump",
        "doc_count": 2
    }, {
        "key": "Rocky",
        "doc_count": 2
    }, {
        "key": "Terminator",
        "doc_count": 2
    }, {
        "key": "Good Will Hunting",
        "doc_count": 1
    }
}

If you scan the user records above, you’ll see users that like Terminator also coincide with two users that \like Forrest Gump, 2 users that like Rocky, and so on. As an aside, if we repeat this for all of your favorite movies – searching also for “Terminator” OR “Predator” OR “Say Anything” and all your favorite movies, then you’ll get collaborative filtering: issue a search to find the users that like the same movies you do. Then you’ll examine the raw counts to see movies you haven’t watched that co-occur frequently with movies you like.

But is this raw count approach a good approach? We discussed both in Relevant Search and in the basket analysis blog article that raw co-occurrence counts are really poor ways of doing recommendations. Raw counts prioritize what’s globally popular over what’s interesting here. In this example, for instance, everyone likes Forrest Gump. So if you used this as a method for recommendations, every user will be recommended Forrest Gump. We refer to this problem as the “Oprah Book Club” problem – a bias towards cross cutting popularity that isn’t particularly useful.

What’s more interesting here is the relative count of movies in the Terminator results. For example, look at Rocky. Every user that likes Rocky also likes Terminator – they have exactly 100% overlap! Another way of saying this is Rocky occurs 100% here in the foreground, and only 50% globally: in the background. This seems like a great recommendation for users that like Terminator!

Measuring Meaningful User-Item Connections with Significant Terms

To transcend the problems with raw counts, Significant Terms Aggregation comes in handy. This aggregation measures the kind of statistically significant relationships we need to deliver meaningful recommendations. Its job is to calculate not raw counts, but specifically the statistical significance of terms in the current results when compared to the background corups. In my basket analysis article, we discussed the pros and cons of different ways of scoring significance. Let’s explore what significant terms does.

But let’s take it out for a spin before thinking about it too critically:

POST recs / user / _search {
    "query": {
        "match": {
            "movies_liked": "Terminator"
        }
    },
    "aggregations": {
        "movies_like_terminator": {
            "significant_terms": {
                "field": "movies_liked",
                "min_doc_count": 1
            }
        }
    }
}

The only thing that’s different above is we use a significant_terms calculation. We set the min_doc_count to 1 to work well in our really puny data set.

And indeed, this solves the immediate problem, we see that our recommendations appear more appropriate, with Forrest Gump nowhere to be seen:

 "buckets": [{
    "key": "Rocky",
    "doc_count": 2,
    "score": 1,
    "bg_count": 2
}, {
    "key": "Terminator",
    "doc_count": 2,
    "score": 1,
    "bg_count": 2
}, {
    "key": "Rambo",
    "doc_count": 1,
    "score": 0.5,
    "bg_count": 1
 }, {
    "key": "Rocky IV",
    "doc_count": 1,
    "score": 0.5,
    "bg_count": 1
 }]

Thinking Critically About JLH Significance Scoring

That’s it, right? No need to dive further? Not quite: you’ll need to understand how this scoring works in the context of your data. To build really good recommendation systems you need to be able to think more deeply about the scoring used. How did significant terms arrive at this ranking? As we saw in the basket analysis article, different forms of significance scoring have their own pros and cons. Choosing the wrong method can have disastrous consequences for the quality of recommendations.

I’m not going to take apart every form of significant terms scoring in this article (there are many options), but let’s dive into the default method to teach ourselves how to think through these problems. The default is called JLH. It multiplies two values

(foregroundPercentage / backgroundPercentage) * 
   (foregroundPercentage - backgroundPercentage)

“Foreground” here means the percentage this item occurs in the current search results (the users retrieved that like our item). For example, above Rambo’s foreground percentage in Terminator movies was 100%. The background percentage is the global percentage in the whole collection. For Rambo above, it’s 50%.

JLH Scoring on Extremely Common Items

How would you go about evaluating the appropriateness of this scoring to your recommendations use case? Let’s think of a couple of hypothetical scenarios. These let us take apart the scoring mechanics around some imaginary scenarios. Then you can think critically about how realistically they might play out in your real-world data. While we continue to use movies, I’m only playing with hypothetical scenarios here: don’t mistake this for an analysis of JLH scoring’s appropriateness against a real data set like Netflix or Movielens.

The first scenario, which we’ve already discussed is the movie everyone likes. In our data set, this is Forrest Gump. 99.999% of users like Forrest Gump.

Indeed such an extremely popular movie doesn’t fare particularly well with JLH scoring. With the (foreground/Background) term, the best Forrest Gump could achieve would be 100 / 99.999, barely > 1. Similarly, (foreground - background) or (100 - 99.999) would get just below one. As you’ll see below, this is a pretty low number for JLH.

But is this a fair scenario? I would argue it’s rare in most domains that an item would be liked by close to 100% of the users. It’s more likely that a much smaller number has a measurable preference for a film. For example, sure perhaps everyone likes Forrest Gump, but most methods of detecting a “like”, such as an explicit rating or an implicit watch/click, will not show 100% of the users interacting with Forrest Gump. Sure, I like Forrest Gump, but last time I watched it was probably 10 years ago. It’s a bit of a passive preference. I’ve seen it, I’m unlikely to be excited when I see it show up on Netflix.

A more likely scenario is the most popular movie or show is something that, say, 20% of the users like. How does JLH scoring fare here? If this movie, let’s say Good Will Hunting, is liked 100% by this set of users, then the highest score would be (100 / 20) * (100 - 20) or 5 * 80 or 400. Much higher than the perhaps unrealistic Forrest Gump scenario.

Items of Average Popularity

Most movies show single digit percentage of users that like the movie. Let’s move on to an average movie: Rocky IV, liked by 4% of users.

How well does Rocky IV fare with JLH? Well, repeating the JLH movie, if 100% of the foreground users like Rocky, we get (100 / 4) * (100 - 4) or 25 * 96 = 2400. Wow, that’s significantly more than the likelihood of being recommended Good Will Hunting!

More “Normal” Deviations

With a 400 best possible score for popular Good Will Hunting and a 2,400 for average popularity Rocky IV, it seems like JLH is just terrible for recommendations, right? Not quite. Well let’s think about the likelihood of these “best case” scenarios. How reasonable is it for 100% of users that like one item to like another item? How likely is it that the foreground probability of liking an item deviates so dramatically from the background preferences?

Well, that depends heavily on the relative distribution of item preferences in your data. One could argue, for example, that if you like Rocky III there’s a very high probability you’ll also like Rocky IV. Perhaps close to 100% overlap.

However, if your data tends to not change as much between the background and foreground data sets, perhaps this isn’t as big of a deal. For example, let’s consider cases where the expected value of the foreground is based on a normal distribution, centered around the background percentage. Let’s consider a hypothetical constant change between background and foreground is a change of about 75% (Rocky IV could be 7+/-3%; Good Will Hunting could be 20 +/- 15%). In these perhaps more realistic scenarios, we would expect scores:

Rocky IV best score: (7/4) * (7-4) = 5.25
Good Will Hunting best score: (35 / 20) * (35 - 20) = 26.25

In other words, an equal proportionate shift rewards the popular movie much more than the one of middling popularity. The first term (the division) will be equivalent between the two movies. The other term (the subtraction) however has the potential for providing a much higher multiple for the popular movie given the shift.

Why might this be? As the creator of JLH, Mark Harwood, points out this is precisely what JLH is designed to do. JLH reflects patterns seen in many recommendation data sets: popular items don’t change dramatically in popularity. Even when scoped to other seemingly related items they deviate far less from the background popularity. We mentioned that everyone purchases eggs in the basket analysis article. If almost everyone buys eggs, scoping egg purchases to other omelet ingredients probably shows only a small increase in the total proportion of egg purchases. But we still want reasonable recommendations for these omelet chefs: so even mild increases in egg popularity should be seen as significant.

Given this, perhaps it’s more typical that Good Will Hunting only deviates +/- 25% from the background. It just so happens, this makes Good Will Hunting’s highest score to be the same as less popular Rocky IV:

(25 / 20) * (25 - 20) = 6.25

In other words, if we examine, say, Sense and Sensibility, Good Will Hunting increasing 25% in popularity scores as highly as our hypothetical Rocky IV increasing 75%.

Final Thoughts on JLH Significance Scoring for Recommendations

Given what we’ve just seen, JLH assumes less popular items are more likely to increase their foreground probability than background percentage.

Typical foreground percentages tend to stay fairly close to background percentage, as in you won’t suddenly see a 100% overlap between two items.
The foreground percentage for popular items deviates much less than the foreground percentage of less popular items.

The most important takeaway is that this is dictated by the statistical patterns in YOUR data. You need to evaluate typical patterns in your data. When you focus on typical preferences what’s the relationship between foreground and background percentages? This is the hard, domain specific work you need to pursue to build a great recommendation system.

But there’s one nefarious case we haven’t yet considered. What about extremely rare movies that happen to overlap? We saw in the basket analysis article that this was the Achilles heel for the Jaccard similarity. So let’s consider one such scenario, involving two movies that just so happened to be liked very rarely and by exactly one user:

Two films Dumb Movie and Weird Art Film each occur twice. And it just so happens that Dumb Movie and Weird Art Film were both liked by one user that happens to like obscure and/or bad movies:

PUT recs / user / 2124 {
    "movies_liked": ["Weird Art Film", "Forrest Gump"]
}
PUT recs / user / 2125 {
    "movies_liked": ["Weird Art Film", "Dumb Movie"]
}
PUT recs / user / 2126 {
    "movies_liked": ["Dumb Movie", "Rambo"]
}

The background percentage of these movies is very low. Perhaps 0.001%. When recommending movies to people that like Weird Art Film, the foreground percentage of “Dumb Movie” becomes 50%. Suddenly very common! How does this score?

(50 / 0.001) * (50 * 0.001) = 2499950.0

Eeegads! That’s a high score.

Suffice it to say, just as discussed in Ted Dunning’s paper on Surprise and Coincidence, this result is unreasonably high. We only have a very small handful of users that like these movies - probably not enough to make statistically robust conclusions about significance. You can see this if you consider how far the foreground and background percentages are from each other. If the data shows +/- 25%, we’d expect this rare movie’s foreground percentage to be some value under 0.001 to just above 0.001. Definitely not 50. This value probably defies our beliefs about the typical patterns we see in our data.

So in addition to the recommendations above, there’s another important criterium to evaluate when applying any form of significance scoring:

Find a reasonable lower bound of number of likes that begins to reflect typical statistical significance between foreground and background relationships. If it's more typical for a fg percentage to be +/- 3% centered around the background, but one rare movie shows wild deviations, consider it an outlier and set a reasonable minimum.

(It’s worth noting a high lower bound in the minimum document frequency is a good idea anyway for performance, lest you run all these expensive calculations on every spurious term that occurs).

The big takeaway is it’s your data that matters with any method of measuring significance. What’s a typical relationship? And what on its face appears to wildly defy expectations?

Be sure to check out Part II, coming soon!

Data (computing) IT Elasticsearch

Published at DZone with permission of Doug Turnbull, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending