Structured Learning and Imitation Learning
Let's take a look at structured learning as well as imitation learning and also discover the need for structured learning.
Join the DZone community and get the full member experience.Join For Free
In a classical prediction use case, the predicted output is either a number (for regression) or category (for classification). A set of training data (x, y) where x is the input and y is the labeled output is provided to train a parameterized predictive model.
- The model is characterized by a set of parameters w
- Given an input x, for the model predicts y_hat = f(x; w) for regression, or the model predicts the probability of each possible class for classification
- Define a Lost function L(y, y_hat) for regression, or L(y, P(y=a | x), P(y=b | x) ...), find the parameters w to minimize L
This problem is typically viewed as an optimization problem and uses a gradient descent approach to solve it.
Need for Structured Learning
In some cases, y is not as simple as a number or a class. For example:
- For machine translation, x is a sentence in English, and y is a translated sentence in French
- For self-driving-vehicle, x is a camera image, and y is the control action on the steering wheel, brake, gas pedal
In these cases, the output y can be viewed as an object. But wait, can we break down the object as multiple numbers/categories and use the classical regression/classification approach to solve it? Not quite, because the lost function cannot be just formulated as the summation of loss of individual components. For example, two French sentences using different words may still be a very good translation of the same English sentence. So we need to generalize a bit more and introduce the concept of an object and compatibility here.
The prediction problem can be generalized as: Given input x, finding an object y such that y is the "most compatible" with x. The compatibility is a parameterized function that we are going to learn from the training data.
- The compatibility function is defined as F(x, y; w)
- During the training phase, we tune parameter w such that for every sample in training data, F(x, y; w) is the maximum. In other words, F(x, Y=y; w) > F(x, Y=other_val; w)
Notice that the training process is different from classical ML in the following way.
- There are two optimization loop here. a) Given parameter w, find y_opt that maximize F(x,y,w). b) Given Lost = gap between F(x,y,w) and F(x,y_opt,w), find w that minimize the gap.
- It turns out the first optimization is solved in a problem specific way while the second optimization can be solved by classical gradient descent approach.
Rather than trying to exactly match y_hat to y in the training data, structured learning enables us to learn a more abstract relationship (ie: compatibility) between x and y so that we can output other equally good y even it is not the same as the y in the training data. This more-generalized form is very powerful when we don't have a lot of training data. The downside of structured learning is it is compute intensive because at the inference phase, it needs to solve an optimization problem, which is typically expensive.
In a typical setting of Reinforcement Learning, a digital agent observe the state from the environment, formulate its policy to determine an action that it believes will maximize its accumulative future reward, take the action, get the reward from the environment, and transition to the next state. Reinforcement learning is about how the agent optimizes its policy along its experience when interacting with the environment.
Basically, reinforcement learning is based on a trial-and-error approach to learning the lesson. This approach can be very costly because, during the trial process, a serious mistake can be made (imagine if we use reinforcement learning to learn a self-driving car, we may crash many cars before we learn something meaningful). In practice, people rely on a simulator to mimic the environment. However, coming up with a good simulator is not easy because it requires a very deep understanding of how the actual environment behaves. This is one of the limitations that restricts reinforcement learning to be broadly applied.
Another important design consideration is how the reward is assigned. Of course, we can use the actual reward from the environment to train the policy, but this is usually very inefficient. Imagine when playing the chess game; we only get the reward at the end when we win/lose the game. Propagating this reward all the way to each move will be very inefficient. To make the learning faster, we usually use a technique called "reward shaping" to basically assign some artificial reward along the trajectory to bias the agent towards certain desirable actions (based on domain knowledge).
One special form of reward shaping is "imitation learning" where we assign intermediate reward based on how the action "similar" to when an expert does in real-life circumstances. Let's say we collect a set of observations that the expert is taking action y in state x, and try to learn a model that the agent will bias to take action y when seeing state x. But wait, does it sound like a supervised learning problem? Can we just train a prediction model from x to y and then we are done?
Unfortunately, it is not as simple. Expert data is typically very sparse and expensive to get, meaning we usually don't have too many data from the expert. Imagine in a self-driving program, if we want to learn how to react when the car is almost crash, we may not find any situation in the expert's observation because the expert may not run into such dangerous situation at all. On the other hand, we don't need to copy exactly when the expert did in every situation, we just need to copy those that are relevant to the situation.
"Inverse Reinforcement Learning" comes to the rescue. Basically, it cuts off the reward from the environment and replaces it with a "reward estimator function," which is trained from a set of expert behavior, assuming that expert behavior will achieve the highest reward.
The underlying algorithm of inverse reinforcement learning is based on the "structured learning" algorithm. In this case, the x is the start state, y is the output of the trajectory of the expert, which is basically the training data. And y_opt is the output of the trajectory based on the agent policy, which is learned from the reward function using Reinforcement Learning algorithm. The compatibility function is basically our reward function because we assume expert behavior achieves the highest reward.
Then we bring it to the structured learning algorithm as shown below:
The agent still needs to interact with the environment (or simulator) to get its trajectory, but the environment only needs to determine the next state, but not the reward. Again, there are two nested optimization loops in the algorithm
- Given a reward function (characterized by w), use classical RL to learn the optimal policy
- Use the optimal policy to interact with the environment to collect the total reward of each episode, then adjust the reward function parameter w such that the expert behavior always gets the highest total reward.
Published at DZone with permission of Ricky Ho, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.