# Optimizing Recommendation Systems: Exploration and Exploitation

"Cold Start" is a common problem that happens quite frequently in
recommendation systems. When a new item enters, there is no prior
history that the recommendation system can use. Since recommendation is
an optimization engine that recommends items that match best with
the user's interests, without prior statistics, the new item will
hardly be picked up as recommendations, and hence continuously lacking the
necessary statistics that the recommendation system can use.

One example is movie recommendations where a movie site recommends movies
to users based on their past viewing history. When a new movie arrives
the market, there aren't enough viewing statistics about the movie and
therefore the new movie will not have a strong match score and won't be
picked as a recommendation. Because we never learn from those that we
haven't recommended, the new movies will continuously not have any
statistics and therefore will never be picked in future recommendations.

Another cold start example is online Ad serving, when a new Ad enters the Ad repository.

### Multilevel Granularity Prediction

One solution to the cold-start problem is to leverage existing items that are "SIMILAR" to the new item; "similarity" is based on content attributes (e.g. actors, genres). Notice that here we are using a coarser level of granularity (group of similar items) for prediction, which can be less accurate than a fine-grain model that use view statistics history for prediction.In other words, we can make recommendation based on two models of different granularity. A fine-grain model based on instance-specific history data is preferred because it usually has higher accuracy. For cold-start problem where the new items don't have history data available, we will fall back to use the coarse-grain model based on other similar items to predict user's interests on the new items.

A common approach is to combine both models of different granularity using different weights where the weights depends on the confidence level of the fine-grain model. For new items, the fine-grain model will have low confidence, and therefore gives more weights to the coarse-grain model.

However, in case we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good prediction, we have to use the fine-grain model to predict. But how can we build up the instance-specific history for the fine-grain model when we are not sure if the new items are good recommendation for the user?

### Optimization under Uncertainty

The core of our problem is that we need to optimize under uncertainties. We have two approaches- Exploitation: Make the most optimal choice based on current data. Because of uncertainty (high variation), the current data may deviate from its true expected value, and we may end up picking a non-optimal choice.
- Exploration: Make a random choice or choices that we haven't made before. The goal is to gather more data points and reduce the uncertainty. This may waste our cycles of picking the optimal choice.

Unfortunately, you don't know the underlying probability, and each has only a limited number of rounds to play. How would you choose which bandit to play to maximize the total number of rounds you win?

Our strategy should strike a good balance between exploiting and exploring. To measure how good the strategy is, there is a concept of "regret," which is the ratio of the two elements.

- Value you obtain by following Batch Optimal strategy (after you have done batch analysis and have a clear picture in the underlying probability distribution)
- Value you obtain by following the strategy

### Epsilon-Greedy Strategy

In the**"epsilon-greedy"**strategy, at every play we throw a dice between explore and exploit.

With probability p(t) = k/t (where k is a constant and t is the number of tries so far). We pick a bandit randomly with equal chance (regardless of whether the bandit has been picked in the past). And with probability 1 - p(t), we pick the bandit that has the highest probability of winning based on past statistics.

Epsilon-greedy has the desirable property of spending more time to explore initially and gradually reduce the portion as time passes. However, it doesn't have a smooth transition between explore and exploit. Also, while it explores, it picks each bandit uniformly without giving more weight to the unexplored bandits. While it exploits, it doesn't consider the confidence of probability estimation.

### Upper Confidence Bound: UCB

In the more sophisticated UCB strategy, each bandit is associated with an estimated mean with a confidence interval. In every play, we choose the bandit whose upper confidence bound (ie: mean + standard deviation) is the largest.Initially each bandit has a zero mean and a large confidence interval. As time goes, we estimated the mean p[i] of bandit i based on how many time it wins since we play the bandit i. We also adjust the confidence interval (reducing deviation) as we play the bandit.

e.g., standard deviation is (p.(1-p)/n)^0.5

Notice that the UCB model can be used in a more general online machine learning setting. We require that the machine learning model be able to output its estimation based on a confidence parameter. As a concrete example, let's say a user is visiting our movie site and we want to recommend a movie to the user based on a bunch of input features (e.g., user feature, query feature ... etc.).

We can do a first round selection (based on information retrieval technique) to identify the movie candidate based on relevancy (i.e., user's viewing history or user search query). For each movie candidate, we can invoke the ML model to estimated interest level, as well as the 68% confidence boundary (the confidence level is arbitrary and needs to be hand-tuned, 68% is roughly one standard deviation of a Gaussian distribution). We then combine them by adding the 68% confidence range as an offset to its estimation and recommend the movie that has the highest resulting value.

After recommendation, we monitor whether the user clicks on it, views it ... etc. and the response will be fed back to our ML model as new training data. Our ML model is an online learning setting and will update the model with this new training data. Over time, the 68% confidence range will be reduced as more and more data is gathered.

### Relationship with A/B Testing

For most websites, we run experiments continuously to improve user experience by trying out different layouts, or to improve user's engagement by recommending different types of contents, or by trying out different things. In general, we have an objective function that defines what aspects we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configuration to see which one will maximize our objective function.When the number of experiments (combinations of different configuration) is small, then A/B is exploration mainly. In a typical setting, we use the old user experience as a control experiment and use the new user experience as a treatment. The goal is to test if the treatment causes any significant improvement from the control. A certain percentage of production users (typically 5 - 10%) will be directed to the new experience and we measured whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense. Such splitting is typically done by hashing the user id (or browser cookies), and based on the range of the hash code falls to determine whether the user should get the new experience. This hashing is consistent (same user will hash into the same bucket in subsequent request) and so the user will get the whole new user experience when visiting the web site.

When the number of experiments is large and new experiments comes out dynamically and unpredictable, traditional A/B testing model described above will not be able to keep track of all pairs of control and treatment combination. In the case, we need to use the dynamic exploration/exploitation mechanism to find out the best user experience.

Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from. Throughout the process, A/B test framework will explore/exploit among different user experiences to optimize for the objective function. At any time, we can query the A/B testing framework to find out the latest statistics of each user experience. This provides a much better way to look at large number of experiment results at the same time.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}