Here is Part II of Doug Turnbull's article about high-quality recommendation systems with Elasticsearch. Did you miss Part I? Read it here.
Towards a Bayesian Approach
I want to keep exploring ways of scoring co-occurrences in search engines in future articles. In all this talk about foreground and background percentages, I can’t help but think about Bayes formula:
P(A | B) = P(B | A) * P(A) / P(B)
We’re trying to understand the probability of liking one item given the probability of liking another item. That’s exactly what P(A | B) expresses. Scoped to all the users that like B (perhaps “Terminator”) what is the probability of liking some A (perhaps “Rambo”).
P(Rambo | Terminator) = P(Terminator | Rambo) * P(Rambo) / P(Terminator)
Well, P(Rambo | Terminator) corresponds to the foreground percentage. While P(Rambo) corresponds to the background percentage. In other words:
foregroundRambo = ( P(Terminator | Rambo) / P(Terminator) ) * backgroundRambo
I’ve alluded to performing your own data analysis work. Don’t just try to find this relationship between foreground and background probabilities. But try to figure out a typical relationship between foreground and background percentages. You can then use the typical general relationship to evaluate scoring methods, and find a good lower bound where you feel confident in the statistical patterns in the scoring. If we think about Bayes formula a bit more generally, we see there’s a common transformation happening. Getting away from one example, we can think of functions that return probabilities (ie probability distributions) to get at these relationships:
ForegroundDistribution(item A, item B) = transformation(item A, item B) * background( item B)
I haven’t finished mapping this out in my head yet, but this idea is at the root of Bayes Formula. The “background” distribution seems to correspond to a Bayesian Prior. It says, with no evidence, what sort of typical background probability distribution can we expect. In the basket analysis article, we argued for a power-law or zipfian distribution for buying patterns. The “transformation” in classic Bayesian analysis corresponds to the likelihood function. Here, it’s the relatively likelihood of A, given some other B. You can imagine this as outputting a constant that’s going to shrink or grow the background probability based on the relative co-occurence of A and B. Perhaps it spits out a 1.5 for some A and B pair – growing the foreground percentage. Or spits out a 0.5, shrinking it.
The “transformation” function you can imagine in three dimensions. The x and y-axis corresponding to items a and b, with the z-axis corresponding to how much you’re going to multiply background to get foreground. You can sample P(B | A) (Terminator scoped to Rambo) by looking up Rambo users and counting the raw count of Terminator users. You can similarly look up P(Rambo) by snagging the total count of Rambo users. From these values, you can compute a point-wise value for transformation.
You probably won’t be able to drive this transformation into a formula, but you can use your data to draw a graph. At any given item pairing, for example, Terminator and Rambo, you can compute this value. You need not compute this for every pairing. Remember: your goal is to figure out what’s typical to guide your evaluation, not find every possible value.
If you sample this transformation, and values tend to stay between 0.9 and 1.1, you know that your foreground and background percentages don’t deviate a great deal. If popular things grow or shrink less than unpopular things, then JLH might work well for you. If there are wild swings that go against this pattern, you might have a sense that something like JLH might not be appropriate for your data.
I’m actually really excited to pursue these ideas further (and if you play with them, let me know what happens). For example, this isn’t particularly classical Bayesian analysis, where we’re trying to find our confidence in some model parameters (such as the bias of a coin, or weight of die, or distribution of topics in a corpus). Instead, it’s more like a regression where we’re trying to determine the relationship between two values (items A and B) and a transformation. Perhaps something like a loess regression that’s not parameterized (i.e., doesn’t assume the underlying data fits exactly one formula). Perhaps what I really ought to be doing is trying to do Bayesian analysis on the transformation process to find latent parameters in the relationship between arbitrary A and B items. I’d love to hear your ideas! or seek me out for a free lunch and learn.
Practical Elasticsearch and Data Modeling Considerations
With an aggregation approach, we’re left with a couple of practical considerations for building great recommendations. It’s worth noting them here as areas for further investigation.
We can’t use a significance score from the aggregations alongside typical search scoring, so we can’t directly bias by factors like release date with just one query. We can’t simply do “recommendations” alongside “search” to do personalized search. Unfortunately, aggregations and search live in two different universes.
One way to get around this is by also using the sampler aggregation. The sampler aggregation performs aggregations over the top N most relevant documents. So we could, for example, prioritize recent movies in our search results, then treat the top 100 search results as the “foreground” set. This provides an indirect relevance influence over the significance calculation.
Also, perhaps there’s a way to integrate significance scoring with the regular query process. I’ve been playing with using More Like This or locality-sensitive hashing via the new Simhash capabilities which can occur in the normal query process. But I have significant reservations about these methods, which I won’t get into here (I apologize, this article is probably the root for 10 other articles :-).)
Graph-Based Recommendations with Elastic Graph
Elastic has released an interesting graph product that does the sampling and significance aggregation alongside an ability to navigate to secondary and tertiary significance. For example, in addition to the exercise above, what if we could generate recommendations, notice which genres or taste profiles are significant to me, then find what’s significant across those taste profiles? This is a kind of graph navigation that can provide a lot of value to users and dramatically simplify the creation of recommendation systems.
We’re actually building some interesting demos using Elastic’s graph product for recommendations that go beyond what’s in this article. If you’re interested in seeing a proof of concept of what we’re doing with Elastic’s graph product, feel free to . I’d be happy to run you through the ideas and proof-of-concept, even as we’re actively developing it. We’d love to hear your feedback!
Search Engines Are the Future of Recommendations
Open source search engines like Solr and Elasticsearch made search extremely simple to implement. Recommendation systems still require integrating multiple distributed systems, learning R, and hiring a huge team of data scientists. It sounds extremely hard. You can stand up reasonable search in a few days, but recommendation systems require a great deal more effort.
We’re pushing to change that. In the same way that Solr and Elasticsearch enable pretty good search out of the box, we want to help build recommendation systems with a far lower total cost of ownership than the current regime. Search engines are the ideal place to implement a recommendation system explicitly because they’re easier to maintain, simpler to tune, and more straightforward to deploy than juggling dozens of machine learning libraries and distributed systems.
If you’re interested in exploring how Solr or Elasticsearch can simplify your development of a recommendation system, and we’d love to explore this with you. And everyone gets a free hour of consulting through our Lunch and Learn program!
Special thanks to Mark Harwood at Elastic for his help reviewing this article.