# Foundations of Machine Learning: Part 4

### Explore penalization and variable selection.


*This post is the eighth one of our series on the history and foundations of econometric and machine learning models. The first four were on econometric techniques. Part 7 is online here.*

## Penalization and Variable Selection

One important concept in econometrics is Ockham's razor — also known as the law of parsimony (*lex parsimoniae*) — which can be related to abductive reasoning.

Akaike's criterion was based on a penalized likelihood, taking into account the complexity of the model (the number of explanatory variables retained). If, in econometrics, it is customary to maximize the likelihood (to build an asymptotically unbiased estimator) and to judge the quality of the model ex-post by penalizing the likelihood, the strategy here will be to penalize ex-ante in the objective function, even if it means building a biased estimator. Typically, we will build

$$\widehat{\beta} \in \underset{\beta}{\operatorname{argmin}}\left\{ \ell(y, X\beta) + \lambda\,\text{penalty}(\beta) \right\},$$

where the penalty function will often be a norm ∥⋅∥ chosen *a priori*, with a penalty parameter λ (we find, in a way, the distinction between AIC and BIC if the penalty function is the complexity of the model, i.e. the number of explanatory variables retained). In the case of the ℓ2 norm, we obtain the ridge estimator, and for the ℓ1 norm, the lasso estimator ("Least Absolute Shrinkage and Selection Operator"). The penalty previously used involved the number of degrees of freedom of the model, so it may seem surprising to use ∥β∥ℓ2 as in ridge regression. However, we can envisage a Bayesian vision of this penalty. It should be recalled that in a Bayesian model,

$$\pi(\theta\mid y) \propto f(y\mid\theta)\,\pi(\theta),$$

or, taking logarithms,

$$\log \pi(\theta\mid y) = \log f(y\mid\theta) + \log \pi(\theta) + \text{cst}.$$

In a Gaussian linear model, if we assume that the *a priori* law of θ is a centred Gaussian distribution, we find a penalty based on a quadratic form of the components of θ.
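To make this Bayesian reading explicit, here is a short derivation (a sketch; σ² denotes the noise variance and τ² the prior variance, neither of which is named in the text above):

```latex
% Gaussian linear model with a centred Gaussian prior on beta
y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 \mathbb{I}),
\qquad
\beta \sim \mathcal{N}(0, \tau^2 \mathbb{I})

% Log-posterior = log-likelihood + log-prior (up to a constant)
\log \pi(\beta \mid y)
= -\frac{1}{2\sigma^2}\,\|y - X\beta\|_{\ell_2}^2
  -\frac{1}{2\tau^2}\,\|\beta\|_{\ell_2}^2 + \text{cst}

% Maximizing the posterior (MAP) is therefore the ridge program,
% with lambda = sigma^2 / tau^2
\widehat{\beta}^{\text{map}}
= \underset{\beta}{\operatorname{argmin}}
  \left\{ \|y - X\beta\|_{\ell_2}^2
          + \frac{\sigma^2}{\tau^2}\,\|\beta\|_{\ell_2}^2 \right\}
```

The maximum a posteriori estimator thus coincides with the ridge estimator, the penalty parameter being the noise-to-prior variance ratio.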

Before going back in detail to these two estimators, obtained using the ℓ1 or ℓ2 norm, let us return for a moment to a very similar problem: the best choice of explanatory variables. Classically (and this will be even more true in large dimension), we can have a large number of explanatory variables, p, but many are just noise, in the sense that βj = 0 for a large number of j's. Let S denote the set of (really) relevant covariates, with s = #S. If we denote by XS the matrix composed of the relevant variables (in columns), then we assume that the real model is of the form y = XSβS + ε. Intuitively, an interesting estimator would then be

$$\widehat{\beta}_S = (X_S^T X_S)^{-1} X_S^T y,$$

but this estimator is only theoretical, because the set S is unknown here. This estimator can actually be seen as the oracle estimator mentioned above. One may then be tempted to solve

$$\widehat{\beta} \in \underset{\beta}{\operatorname{argmin}}\big\{\ell(y, X\beta)\big\} \quad \text{subject to} \quad \|\beta\|_{\ell_0} \le s.$$

This problem was introduced by Foster & George (1994) using the ℓ0 notation. More precisely, let us define here the following three norms, for a ∈ R^d:

$$\|a\|_{\ell_0} = \sum_{i=1}^{d} \mathbf{1}(a_i \neq 0), \qquad \|a\|_{\ell_1} = \sum_{i=1}^{d} |a_i|, \qquad \|a\|_{\ell_2} = \left(\sum_{i=1}^{d} a_i^2\right)^{1/2}.$$
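As a quick numerical illustration of these three norms (a minimal sketch in NumPy; the vector is arbitrary):

```python
import numpy as np

a = np.array([0.0, -3.0, 0.0, 4.0])

l0 = np.count_nonzero(a)       # l0 "norm": number of non-zero coordinates
l1 = np.abs(a).sum()           # l1 norm: sum of absolute values
l2 = np.sqrt((a ** 2).sum())   # l2 norm: Euclidean length

print(l0, l1, l2)  # 2 7.0 5.0
```

Note that ℓ0 is not a true norm (it is not homogeneous), which is precisely why the associated optimization problem behaves so differently.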

**Table 1**: Constrained optimization and regularization.

| | constrained form | penalized form |
|---|---|---|
| (ℓ0) | $\min_\beta\{\ell(y,X\beta)\}$ s.t. $\Vert\beta\Vert_{\ell_0}\le s$ | $\min_\beta\{\ell(y,X\beta)+\lambda\Vert\beta\Vert_{\ell_0}\}$ |
| (ℓ1) | $\min_\beta\{\ell(y,X\beta)\}$ s.t. $\Vert\beta\Vert_{\ell_1}\le s$ | $\min_\beta\{\ell(y,X\beta)+\lambda\Vert\beta\Vert_{\ell_1}\}$ |
| (ℓ2) | $\min_\beta\{\ell(y,X\beta)\}$ s.t. $\Vert\beta\Vert_{\ell_2}\le s$ | $\min_\beta\{\ell(y,X\beta)+\lambda\Vert\beta\Vert_{\ell_2}^2\}$ |

Let us consider the optimization problems in Table 1. If we consider the classical problem where the quadratic norm is used for ℓ, the two problems of equation (ℓ1) of Table 1 are equivalent, in the sense that, for any solution (β⋆, s) to the left problem, there is a λ⋆ such that (β⋆, λ⋆) is a solution of the right problem, and vice versa. The result is also true for the problems (ℓ2). These are indeed convex problems. On the other hand, the two problems (ℓ0) are not equivalent: if, for (β⋆, λ⋆) solution of the right problem, there is an s⋆ such that β⋆ is a solution of the left problem, the reverse is not true. More generally, if you want to use an ℓp norm, sparsity is obtained when p ≤ 1, whereas you need p ≥ 1 to have the convexity of the optimization program.

One may be tempted to solve the penalized program (ℓ0) directly, as suggested by Foster & George (1994). Numerically, it is a complex combinatorial problem in large dimension (Natarajan (1995) notes that it is an NP-hard problem), but it is possible to show that, if λ∼σ²log(p), the risk of the resulting estimator is comparable to that of the oracle estimator. Observe that, in this case,

$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \widehat{\beta}_j^{\text{ols}}\,\mathbf{1}\big(j \in \mathcal{S}_\lambda(\widehat{\beta})\big),$$

where $\mathcal{S}_\lambda(\widehat{\beta})$ refers to the set of non-zero coordinates when solving (ℓ0).

The problem (ℓ2) is strictly convex if ℓ is the quadratic norm; in other words, the ridge estimator is always well defined, with, in addition, an explicit form:

$$\widehat{\beta}_\lambda^{\text{ridge}} = (X^T X + \lambda\mathbb{I})^{-1} X^T y.$$

Therefore, it can be deduced that

$$\mathbb{E}\big[\widehat{\beta}_\lambda^{\text{ridge}}\big] = (X^T X + \lambda\mathbb{I})^{-1} X^T X\, \beta$$

and

$$\text{Var}\big[\widehat{\beta}_\lambda^{\text{ridge}}\big] = \sigma^2 (X^T X + \lambda\mathbb{I})^{-1} X^T X\, (X^T X + \lambda\mathbb{I})^{-1}.$$

With a matrix of orthonormal explanatory variables (i.e. XᵀX = 𝕀), the expressions can be simplified:

$$\widehat{\beta}_\lambda^{\text{ridge}} = \frac{\widehat{\beta}^{\text{ols}}}{1+\lambda}, \qquad \mathbb{E}\big[\widehat{\beta}_\lambda^{\text{ridge}}\big] = \frac{\beta}{1+\lambda}, \qquad \text{Var}\big[\widehat{\beta}_\lambda^{\text{ridge}}\big] = \frac{\sigma^2}{(1+\lambda)^2}\,\mathbb{I}.$$
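A small numerical check of this closed form (a sketch in NumPy; the data are simulated, and the columns of X are orthonormalized with a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 100, 5, 2.0

# Simulate a design matrix and orthonormalize its columns, so that X^T X = I
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Explicit ridge estimator: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Under orthonormality, this is the OLS estimator shrunk by 1/(1 + lambda)
beta_ols = X.T @ y  # since X^T X = I
print(np.allclose(beta_ridge, beta_ols / (1 + lam)))  # True
```

The shrinkage factor 1/(1 + λ) acts uniformly on every coefficient, which is why ridge never sets a coefficient exactly to zero.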

Observe that

$$\text{mse}\big[\widehat{\beta}_\lambda^{\text{ridge}}\big] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\,\beta^T\beta,$$

and because this quantity is minimized where its derivative with respect to λ vanishes, we obtain an optimal value for λ:

$$\lambda^\star = \frac{p\,\sigma^2}{\beta^T\beta}.$$
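This optimal value can be checked with a quick grid search on the mean squared error (a sketch; the values of p, σ², and β are arbitrary):

```python
import numpy as np

p, sigma2 = 5, 1.0
beta = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
btb = beta @ beta  # beta^T beta

# mse(lambda) under an orthonormal design:
# p sigma^2 / (1+lambda)^2 + lambda^2 beta'beta / (1+lambda)^2
lam_grid = np.linspace(0.01, 5.0, 100000)
mse = (p * sigma2 + lam_grid ** 2 * btb) / (1 + lam_grid) ** 2

lam_star = p * sigma2 / btb  # closed-form optimum
print(abs(lam_grid[np.argmin(mse)] - lam_star) < 1e-3)  # True
```

The grid minimizer matches the closed-form λ⋆ = pσ²/βᵀβ up to the grid resolution.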

On the other hand, if ℓ is no longer the quadratic norm but the ℓ1 norm, the problem (ℓ1) is not always strictly convex, and in particular, the optimum is not always unique (for example, if XᵀX is singular). But if the problem is strictly convex, then the predictions Xβ̂ will be unique. It should also be noted that any two solutions are necessarily consistent in terms of the signs of the coefficients: it is not possible to have βj < 0 for one solution and βj > 0 for another. From a heuristic point of view, the program (ℓ1) is interesting because it allows us to obtain, in many cases, a corner solution, which corresponds to solving a problem of type (ℓ0), as shown visually in Figure 2.

**Figure 2**: Penalization based on norms ℓ0, ℓ1, and ℓ2 (from Hastie *et al.* (2016)).

Let us consider a very simple model, yi = xiβ + εi, with an ℓ1 penalty and an ℓ2 loss function. The problem (ℓ1) then becomes

$$\min_\beta\left\{ \sum_{i=1}^n (y_i - x_i\beta)^2 + \lambda |\beta| \right\}.$$

The first-order condition is then

$$2\beta\, x^T x - 2 y^T x \pm \lambda = 0,$$

and the sign of the last term depends on the sign of β. Suppose that the least squares estimator (obtained by setting λ = 0) is (strictly) positive, i.e. yᵀx > 0. If λ is not too large, we can imagine that β̂ has the same sign as the least squares estimator, and therefore the condition becomes

$$2\beta\, x^T x - 2 y^T x + \lambda = 0,$$

and the solution is

$$\widehat{\beta}_\lambda = \frac{2\, y^T x - \lambda}{2\, x^T x}.$$

By increasing λ, there will be a point at which β̂λ = 0 (namely when λ = 2yᵀx).

If we increase λ a little more, β̂λ does not become negative, because, in this case, the last term of the first-order condition changes sign, and we try to solve

$$2\beta\, x^T x - 2 y^T x - \lambda = 0,$$

whose solution is then

$$\widehat{\beta}_\lambda = \frac{2\, y^T x + \lambda}{2\, x^T x}.$$

But this solution is positive (we assumed yᵀx > 0), so it is not possible to have a negative solution and yᵀx > 0 at the same time. Also, for λ ≥ 2yᵀx,

$$\widehat{\beta}_\lambda = 0,$$
which is then a corner solution. Things are of course more complicated in larger dimensions (Tibshirani & Wasserman (2016) revisit the geometry of the solutions at length), but as Candès & Plan (2009) note, under minimal assumptions guaranteeing that the predictors are not strongly correlated, the lasso obtains a quadratic error almost as good as if we had an oracle providing perfect information on the set of βj's that are not zero. With some additional technical hypotheses, it can be shown that this estimator is "sparsistent", in the sense that the support of β̂λ is that of β; in other words, the lasso has made it possible to select variables (more discussion of this point can be found in Hastie et al. (2016)).
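The one-dimensional derivation above can be checked numerically; here is a minimal sketch in NumPy (the data are simulated, and `lasso_1d` implements the soft-thresholding closed form obtained from the first-order condition):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)  # true beta = 2, so y'x > 0

def lasso_1d(x, y, lam):
    """Closed-form minimizer of sum_i (y_i - x_i b)^2 + lam * |b|."""
    xty, xtx = x @ y, x @ x
    return np.sign(xty) * max(abs(xty) - lam / 2.0, 0.0) / xtx

def brute_force(x, y, lam, grid=np.linspace(-4, 4, 80001)):
    """Minimize the same objective on a fine grid, for comparison."""
    obj = y @ y - 2.0 * grid * (x @ y) + grid ** 2 * (x @ x) + lam * np.abs(grid)
    return grid[np.argmin(obj)]

print(abs(lasso_1d(x, y, 5.0) - brute_force(x, y, 5.0)) < 1e-3)  # True
print(lasso_1d(x, y, 2.0 * (x @ y) + 1.0))  # 0.0: corner solution once lambda >= 2 y'x
```

The corner solution appears exactly at λ = 2yᵀx, as in the derivation above.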

More generally, it can be shown that β̂λ is a biased estimator, but it may have a sufficiently low variance that its mean squared error is lower than that of least squares. To compare the three techniques relative to the least squares estimator (obtained when λ = 0), if we assume that the explanatory variables are orthonormal, then

$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \widehat{\beta}_j^{\text{ols}}\,\mathbf{1}\big(|\widehat{\beta}_j^{\text{ols}}| > \sqrt{\lambda}\big), \qquad \widehat{\beta}_{\lambda,j}^{\text{ridge}} = \frac{\widehat{\beta}_j^{\text{ols}}}{1+\lambda},$$

and

$$\widehat{\beta}_{\lambda,j}^{\text{lasso}} = \text{sign}\big(\widehat{\beta}_j^{\text{ols}}\big)\,\big(|\widehat{\beta}_j^{\text{ols}}| - \lambda/2\big)_+.$$
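Under the orthonormal assumption, the three transformations of the OLS coefficients (hard thresholding, proportional shrinkage, and soft thresholding) are easy to compare numerically (a sketch; the grid of OLS values is arbitrary):

```python
import numpy as np

lam = 1.0
b_ols = np.linspace(-3, 3, 7)  # hypothetical OLS coefficients

# Best subset (l0): hard thresholding at sqrt(lambda)
b_sub = b_ols * (np.abs(b_ols) > np.sqrt(lam))

# Ridge (l2): proportional shrinkage
b_ridge = b_ols / (1 + lam)

# Lasso (l1): soft thresholding at lambda / 2
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

print(b_sub)    # small coefficients are set exactly to 0, large ones kept as-is
print(b_ridge)  # every coefficient is shrunk, none is exactly 0
print(b_lasso)  # every coefficient is shrunk, the small ones exactly to 0
```

This reproduces the qualitative picture of Figure 3: subset selection keeps or kills, ridge shrinks everything, and the lasso does both.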

**Figure 3**: Penalization based on norms (from Hastie *et al.* (2016)).
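Finally, the variable-selection behaviour discussed above can be illustrated with a minimal lasso solver by cyclic coordinate descent (a sketch, not the author's code; the objective is ½‖y − Xβ‖² + λ‖β‖₁, and the data are simulated with a sparse β):

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual without feature j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return b

rng = np.random.default_rng(7)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]  # only 3 relevant covariates, the rest are noise
y = X @ beta + rng.normal(size=n)

b_hat = lasso_cd(X, y, lam=60.0)
print(np.count_nonzero(b_hat), "non-zero coefficients out of", p)
```

At the optimum, the KKT conditions guarantee |xⱼᵀ(y − Xβ̂)| ≤ λ for every coordinate, and coefficients of irrelevant covariates are typically set exactly to zero, which is the selection property discussed above.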

*To be continued with probably a final post this week (references are online here)...*

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

