For a classification problem (let's say output is R, G, or B), how do we predict?

There are two formats in which we can report our prediction:

- **Output the single most probable label.** For example, output "B" if P(B) > P(R) and P(B) > P(G).
- **Output a probability estimate for each label.** For example, R = 0.2, G = 0.3, B = 0.4.

But if we look at a regression problem (let's say we output a numeric value `v`), most regression models output only a single value (the one that minimizes the RMSE). In this article, we will look at some use cases in which outputting a probability density function is much preferred.

## Predict the Event Occurrence Time

As an illustrative example, we want to predict when a student will finish her work, given that she has already spent some time `s` on it. In other words, we want to estimate `E[t | t > s]`, where `t` is a random variable representing the total duration and `s` is the elapsed time so far.

Estimating `t` is generally hard if the model only outputs an expectation. Notice that the model sees the same set of features, except that the elapsed time changes continuously as time passes.

Let's look at how we can train a prediction model that outputs a density distribution. Let's say our raw data has the following schema:

**[feature, duration]**

- f1, 13.30
- f2, 14.15
- f3, 15.35
- f4, 15.42

Take a look at the range (i.e., the min and max) of the output value. We transform the data into training data with the following schema:

**[feature, dur<13, dur<14, dur<15, dur<16]**

- f1, 0, 1, 1, 1
- f2, 0, 0, 1, 1
- f3, 0, 0, 0, 1
- f4, 0, 0, 0, 1
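The transformation above can be sketched in a few lines of Python. The thresholds and toy rows are taken from the tables; the feature placeholders `f1`…`f4` are just labels standing in for real feature vectors:

```python
# Turn each raw [feature, duration] row into one binary label per
# threshold: the label is 1 iff the duration falls below that threshold.
thresholds = [13, 14, 15, 16]
raw = [("f1", 13.30), ("f2", 14.15), ("f3", 15.35), ("f4", 15.42)]

def to_binary_labels(duration, thresholds):
    """One 0/1 label per threshold: 1 iff duration < threshold."""
    return [int(duration < th) for th in thresholds]

training = {feat: to_binary_labels(dur, thresholds) for feat, dur in raw}
```

Each column of `training` then becomes the target of one binary classifier.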

After that, we train four classification models:

- feature, dur<13
- feature, dur<14
- feature, dur<15
- feature, dur<16

Given a new observation with the corresponding features, we can invoke these four models to obtain the cumulative probability at each threshold (each model outputs the probability of its binary class). If we want the probability density, we simply take the differences between consecutive cumulative probabilities (i.e., discrete differentiation of the cumulative distribution).
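As a minimal sketch of that differencing step, assume the four models have produced the cumulative probabilities below (made-up numbers, not real model outputs):

```python
# Hypothetical outputs of the four classifiers: P(dur < threshold | feature).
cdf = {13: 0.10, 14: 0.35, 15: 0.70, 16: 1.00}

def density_from_cdf(cdf):
    """Difference consecutive cumulative probabilities into bucket masses."""
    density, prev = {}, 0.0
    for t in sorted(cdf):
        density[t] = cdf[t] - prev  # mass of the bucket ending at threshold t
        prev = cdf[t]
    return density

density = density_from_cdf(cdf)
```

The resulting masses sum to the last cumulative value, so they form a proper distribution whenever the final classifier outputs 1.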

At this point, we can output a probability distribution for any given input feature.

Now, we can easily estimate the remaining time from the expected time in the region of the distribution where `t > s`. As time passes, we just slide the cutoff at `s` continuously and recalculate the expectation. We don't need to re-execute the prediction model unless the input features have changed.
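That sliding recalculation can be sketched as follows. The bucket midpoints and masses below are illustrative assumptions, not real model outputs:

```python
# E[t | t > s]: drop the buckets the elapsed time has already ruled out,
# renormalize the surviving mass, and take the expectation. As s grows,
# only this cheap sum is recomputed; the model is not re-executed.
def expected_total_time(density, s):
    surviving = [(t, p) for t, p in density if t > s]
    mass = sum(p for _, p in surviving)
    if mass == 0:
        raise ValueError("elapsed time exceeds the modeled range")
    return sum(t * p for t, p in surviving) / mass

# Bucket midpoints paired with their probability mass (illustrative).
density = [(12.5, 0.10), (13.5, 0.25), (14.5, 0.35), (15.5, 0.30)]
```

For example, `expected_total_time(density, 0)` is the unconditional expectation, while `expected_total_time(density, 14)` conditions on 14 time units having already elapsed, and is necessarily larger.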

## Predict Cancellation Before Commitment

As an illustrative example, let's say a restaurant customer has reserved a table at 8:00 p.m. The time now is 7:55 p.m. and the customer still hasn't arrived. What is the chance of a no-show?

Now, given a customer (with feature `x`), and given that the customer still hasn't arrived at the current time, predict the probability that this customer will eventually show up.

Let's say our raw data has the following schema, where arrival is measured in minutes relative to the reservation time (negative values mean the customer arrived early, and infinity marks a no-show):

**[feature, arrival]**

- f1, -15.42
- f2, -15.35
- f3, -14.15
- f4, -13.30
- f5, infinity
- f6, infinity

We transform into the training data of the following schema:

**[feature, arr<-16, arr<-15, arr<-14, arr<-13]**

- f1, 0, 1, 1, 1
- f2, 0, 1, 1, 1
- f3, 0, 0, 1, 1
- f4, 0, 0, 0, 1
- f5, 0, 0, 0, 0
- f6, 0, 0, 0, 0
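This transformation is the same as before, except that `float("inf")` stands for a customer who never arrived, so every threshold label comes out 0. A minimal sketch, using the rows from the tables:

```python
# Binary threshold labels for the no-show data. An arrival of infinity
# is never below any finite threshold, so a no-show maps to all zeros.
thresholds = [-16, -15, -14, -13]
raw = [("f1", -15.42), ("f2", -15.35), ("f3", -14.15),
       ("f4", -13.30), ("f5", float("inf")), ("f6", float("inf"))]

labels = {feat: [int(arr < th) for th in thresholds] for feat, arr in raw}
```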

After that, we train four classification models:

- feature, arr<-16
- feature, arr<-15
- feature, arr<-14
- feature, arr<-13

Notice that `P(arr < 0)` can be smaller than 1 because the customer can be a no-show.
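Under these assumptions we can also answer the original question: what is the chance of a no-show, given that the customer hasn't arrived yet? The two probabilities fed in below are hypothetical classifier outputs, `P(arr < now | x)` and `P(arr < 0 | x)`:

```python
# P(no-show | not arrived by now), via the surviving probability mass:
# of the customers not here yet, some will still arrive before the
# reservation; the rest of that mass is the no-shows.
def p_no_show_given_not_arrived(p_arrived_by_now, p_show_eventually):
    not_arrived = 1.0 - p_arrived_by_now                  # not here yet
    still_coming = p_show_eventually - p_arrived_by_now   # will still arrive
    return (not_arrived - still_coming) / not_arrived

# E.g., with P(arr < now | x) = 0.70 and P(arr < 0 | x) = 0.85,
# half of the remaining 0.30 mass is no-show mass.
risk = p_no_show_given_not_arrived(0.70, 0.85)
```

As the clock runs and `p_arrived_by_now` rises toward `p_show_eventually`, this ratio rises toward 1, which matches the intuition that a later and later non-arrival is worse and worse news.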

In this post, we've discussed some use cases where we need a regression model to output not just a point prediction but also a probability density distribution. We also illustrated how to build such a prediction model.
