Foundations of Machine Learning: Part 4
Explore penalization and variable selection.
This post is the eighth one of our series on the history and foundations of econometric and machine learning models. The first four were on econometrics techniques. Part 7 is online here.
Penalization and Variable Selection
One important concept in econometrics is Ockham's razor — also known as the law of parsimony (lex parsimoniae) — which can be related to abductive reasoning.
Akaike's criterion was based on a penalty of the likelihood taking into account the complexity of the model (the number of explanatory variables retained). If, in econometrics, it is customary to maximize the likelihood (to build an asymptotically unbiased estimator) and to judge the quality of the model ex-post by penalizing the likelihood, the strategy here will be to penalize ex-ante in the objective function, even if it means building a biased estimator. Typically, we will build:

\[
(\widehat{\beta}_{0,\lambda},\widehat{\beta}_{\lambda})=\underset{(\beta_0,\beta)}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i,\beta_0+\mathbf{x}_i^\top\beta)+\lambda\,\text{penalty}(\beta)\right\}
\]
where the penalty function will often be a norm ∥⋅∥ chosen a priori, and a penalty parameter λ (we find, in a way, the distinction between AIC and BIC if the penalty function is the complexity of the model, i.e. the number of explanatory variables retained). In the case of the \(\ell_2\) norm, we find the ridge estimator, and for the \(\ell_1\) norm, we find the lasso estimator ("Least Absolute Shrinkage and Selection Operator"). The penalty previously used involved the number of degrees of freedom of the model, so it may seem surprising to use the \(\ell_2\) norm as in ridge regression. However, we can envisage a Bayesian vision of this penalty. It should be recalled that in a Bayesian model:

\[
\underbrace{\pi(\theta\mid\mathbf{y})}_{\text{posterior}} \;\propto\; \underbrace{f(\mathbf{y}\mid\theta)}_{\text{likelihood}}\cdot\underbrace{\pi(\theta)}_{\text{prior}}
\]
In a Gaussian linear model, if we assume that the a priori law of \(\beta\) follows a centred Gaussian distribution with variance \(\tau^2\), we find a penalty based on a quadratic form of the components of \(\beta\): maximizing the log-posterior amounts to minimizing \(\|\mathbf{y}-X\beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_2}^2\) with \(\lambda=\sigma^2/\tau^2\).
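This Bayesian reading can be checked numerically: under a centred Gaussian prior with variance τ², the ridge estimator with λ = σ²/τ² coincides with the posterior mode. A minimal sketch with simulated data, using only NumPy (all names and numerical values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
sigma, tau = 1.0, 0.5              # noise and prior standard deviations (assumed)
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

# Ridge estimator with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode: minimize the negative log-posterior
# ||y - X b||^2 / (2 sigma^2) + ||b||^2 / (2 tau^2) by gradient descent
b = np.zeros(p)
for _ in range(20000):
    grad = -X.T @ (y - X @ b) / sigma**2 + b / tau**2
    b -= 1e-4 * grad

print(np.allclose(beta_ridge, b, atol=1e-6))  # True: the two coincide
```

Note that a stronger prior (smaller τ) translates into a heavier penalty λ.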
Before going back in detail to these two estimators, obtained using the \(\ell_1\) or \(\ell_2\) norm, let us return for a moment to a very similar problem: the best choice of explanatory variables. Classically (and this will be even more true in large dimension), we can have a large number of explanatory variables, p, but many are just noise, in the sense that \(\beta_j\approx 0\) for a large number of j. Let s be the number of (really) relevant covariates, \(s=\#\mathcal{S}\) with \(\mathcal{S}=\{j=1,\dots,p:\beta_j\neq 0\}\), and \(s\ll p\).
If we note \(X_{\mathcal{S}}\) the matrix composed of the relevant variables (in columns), then we assume that the real model is of the form \(\mathbf{y}=X_{\mathcal{S}}\beta_{\mathcal{S}}+\varepsilon\). Intuitively, an interesting estimator would then be \(\widehat{\beta}_{\mathcal{S}}=(X_{\mathcal{S}}^\top X_{\mathcal{S}})^{-1}X_{\mathcal{S}}^\top\mathbf{y}\),
but this estimator is only theoretical because the set \(\mathcal{S}\) is unknown here. This estimator can actually be seen as the oracle estimator mentioned above. One may then be tempted to solve

\[
\widehat{\beta}=\underset{\beta:\|\beta\|_{\ell_0}\leq s}{\text{argmin}}\left\{\|\mathbf{y}-X\beta\|_{\ell_2}^2\right\}.
\]
This problem was introduced by Foster & George (1994) using the \(\ell_0\) notation. More precisely, let us define here the following three norms, where \(\mathbf{a}\in\mathbb{R}^d\):

\[
\|\mathbf{a}\|_{\ell_0}=\sum_{j=1}^d \mathbf{1}(a_j\neq 0),\qquad \|\mathbf{a}\|_{\ell_1}=\sum_{j=1}^d |a_j|,\qquad \|\mathbf{a}\|_{\ell_2}=\left(\sum_{j=1}^d a_j^2\right)^{1/2}.
\]
Table 1: Constrained optimization and regularization.

| | Constrained optimization | Regularization |
|---|---|---|
| (ℓ0) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2\}\) s.t. \(\|\beta\|_{\ell_0}\leq s\) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_0}\}\) |
| (ℓ1) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2\}\) s.t. \(\|\beta\|_{\ell_1}\leq s\) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_1}\}\) |
| (ℓ2) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2\}\) s.t. \(\|\beta\|_{\ell_2}\leq s\) | \(\min\{\|\mathbf{y}-X\beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_2}^2\}\) |
Let us consider the optimization problems in Table 1, where each row gives a constrained problem (left) and its regularized counterpart (right). If we consider the classical problem where the quadratic norm is used, the two problems of the (ℓ2) row of Table 1 are equivalent, in the sense that, for any solution \((\beta^\star,s)\) to the left problem, there is a \(\lambda\) such that \(\beta^\star\) is the solution of the right problem; and vice versa. The result is also true for problems (ℓ1). These are indeed convex problems. On the other hand, the two problems (ℓ0) are not equivalent: if, for \((\beta^\star,\lambda)\) solution of the right problem, there is an s such that \(\beta^\star\) is solution of the left problem, the reverse is not true. More generally, if you want to use an \(\ell_p\) norm, sparsity is obtained if \(p\leq 1\), whereas you need \(p\geq 1\) to have the convexity of the optimization program.
One may be tempted to solve the penalized program (ℓ0) directly, as suggested by Foster & George (1994). Numerically, it is a complex combinatorial problem in large dimension (Natarajan (1995) notes that it is an NP-hard problem), but it is possible to show that, if λ is suitably chosen, the resulting estimator has a risk comparable to that of the oracle estimator. Observe that in this case the solution coincides with the least squares estimator computed on the selected variables,

\[
\widehat{\beta}_{\mathcal{S}_\lambda}=(X_{\mathcal{S}_\lambda}^\top X_{\mathcal{S}_\lambda})^{-1}X_{\mathcal{S}_\lambda}^\top\mathbf{y},
\]

where \(\mathcal{S}_\lambda\) refers to all non-zero coordinates when solving (ℓ0).
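To make the combinatorial nature of (ℓ0) concrete, here is a brute-force sketch that scans all \(2^p\) subsets, which is feasible only for very small p (the data-generating process and parameter values below are hypothetical):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, s = 50, 8, 2                    # small p: brute force scans 2^p subsets
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.0]                # only s covariates are really relevant
y = X @ beta + 0.1 * rng.normal(size=n)

lam = 1.0
best_crit, best_subset = np.sum(y**2), ()   # empty model as baseline
for k in range(1, p + 1):
    for S in combinations(range(p), k):
        Xs = X[:, list(S)]
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ b)**2)
        crit = rss + lam * k          # penalized criterion: RSS + lambda * ||beta||_0
        if crit < best_crit:
            best_crit, best_subset = crit, S

print(best_subset)  # the true support (0, 1) should be recovered
```

Each candidate subset requires its own least squares fit, which is exactly why the exhaustive search becomes intractable as p grows.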
When the penalty is the quadratic \(\ell_2\) norm, the ridge estimator is always well defined, with in addition an explicit form: the objective is strictly convex as soon as \(\lambda>0\), and

\[
\widehat{\beta}_\lambda=(X^\top X+\lambda\mathbb{I})^{-1}X^\top\mathbf{y}.
\]

Therefore, it can be deduced that

\[
\mathbb{E}[\widehat{\beta}_\lambda]=(X^\top X+\lambda\mathbb{I})^{-1}X^\top X\beta \quad\text{and}\quad \text{Var}[\widehat{\beta}_\lambda]=\sigma^2(X^\top X+\lambda\mathbb{I})^{-1}X^\top X(X^\top X+\lambda\mathbb{I})^{-1}.
\]
With a matrix of orthonormal explanatory variables (i.e. \(X^\top X=\mathbb{I}\)), the expressions can be simplified:

\[
\widehat{\beta}_\lambda=\frac{\widehat{\beta}^{\text{ols}}}{1+\lambda},\qquad \mathbb{E}[\widehat{\beta}_\lambda]=\frac{\beta}{1+\lambda},\qquad \text{Var}[\widehat{\beta}_\lambda]=\frac{\sigma^2}{(1+\lambda)^2}\mathbb{I},
\]

and by minimizing the mean squared error we obtain an optimal value for the penalty parameter, \(\lambda^\star=p\sigma^2/\beta^\top\beta\).
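The closed form of the ridge estimator and its orthonormal simplification can be verified numerically. A sketch in which the design is made orthonormal via a QR decomposition (all numerical values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
# build an orthonormal design (X^T X = I) via a QR decomposition
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta = np.array([2.0, -1.0, 0.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=n)

lam = 0.7
beta_ols = X.T @ y                    # OLS estimator, since X^T X = I
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# under orthonormality, ridge is just a shrunken OLS estimator
print(np.allclose(beta_ridge, beta_ols / (1 + lam)))  # True
```

Every coefficient is shrunk by the same factor 1/(1+λ), which is why ridge never sets a coefficient exactly to zero.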
On the other hand, if the penalty is no longer the quadratic norm but the \(\ell_1\) norm, the problem (ℓ1) is not always strictly convex, and in particular, the optimum is not always unique (for example if the design matrix is not of full rank). But if it is strictly convex, then predictions \(X\widehat{\beta}\) will be unique. It should also be noted that two solutions are necessarily consistent in terms of sign of coefficients: it is not possible to have \(\widehat{\beta}_j>0\) for one solution and \(\widehat{\beta}_j<0\) for another. From a heuristic point of view, the program (ℓ1) is interesting because it allows us to obtain, in many cases, a corner solution, which corresponds to a problem resolution of type (ℓ0), as shown visually in Figure 2.
Figure 2: Penalization based on the \(\ell_1\) and \(\ell_2\) norms (from Hastie et al. (2016)).
Let us consider a very simple model: \(y_i=x_i\beta+\varepsilon\), with an \(\ell_1\) penalty and a quadratic loss function. The problem (ℓ1) then becomes

\[
\min\left\{\sum_{i=1}^n (y_i-x_i\beta)^2+\lambda|\beta|\right\}.
\]

The first order condition is then

\[
-2\mathbf{x}^\top\mathbf{y}+2\mathbf{x}^\top\mathbf{x}\,\beta\pm\lambda=0,
\]

and the sign of the last term depends on the sign of β. Suppose that the least squares estimator (obtained by setting λ = 0) is (strictly) positive, i.e. \(\mathbf{x}^\top\mathbf{y}>0\). If λ is not too big, we can imagine that β is of the same sign as the least squares estimator, and therefore the condition becomes

\[
-2\mathbf{x}^\top\mathbf{y}+2\mathbf{x}^\top\mathbf{x}\,\beta+\lambda=0,
\]

and the solution is

\[
\widehat{\beta}_\lambda=\frac{2\mathbf{x}^\top\mathbf{y}-\lambda}{2\mathbf{x}^\top\mathbf{x}}.
\]

By increasing λ, there will be a point where \(\lambda=2\mathbf{x}^\top\mathbf{y}\) and \(\widehat{\beta}_\lambda=0\). If we increase λ a little more, \(\widehat{\beta}_\lambda\) does not become negative because, in this case, the last term of the first order condition changes, and we try to solve

\[
-2\mathbf{x}^\top\mathbf{y}+2\mathbf{x}^\top\mathbf{x}\,\beta-\lambda=0,
\]

whose solution is then

\[
\widehat{\beta}_\lambda=\frac{2\mathbf{x}^\top\mathbf{y}+\lambda}{2\mathbf{x}^\top\mathbf{x}}.
\]

But this solution is positive (we assumed \(\mathbf{x}^\top\mathbf{y}>0\)), while it was supposed to solve the β < 0 branch of the condition: both cannot hold at the same time. So, beyond \(\lambda=2\mathbf{x}^\top\mathbf{y}\), we simply have \(\widehat{\beta}_\lambda=0\),
which is then a corner solution. Things are of course more complicated in larger dimensions (Tibshirani & Wasserman (2016) go back at length on the geometry of the solutions) but, as Candès & Plan (2009) note, under minimal assumptions guaranteeing that the predictors are not strongly correlated, the lasso obtains a quadratic error almost as good as if we had an oracle providing perfect information on the set of \(\beta_j\)'s that are not zero. With some additional technical hypotheses, it can be shown that this estimator is "sparsistent" in the sense that the support of \(\widehat{\beta}\) is that of β; in other words, the lasso has made it possible to select variables (more discussions on this point can be found in Hastie et al. (2016)).
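The one-dimensional derivation above amounts to soft-thresholding the least squares estimator at \(\lambda/(2\mathbf{x}^\top\mathbf{x})\). A sketch checking that closed form against a grid search over the objective (simulated data; all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 0.3 * x + 0.5 * rng.normal(size=n)

def lasso_1d(x, y, lam):
    # closed form of min_b sum (y_i - x_i b)^2 + lam |b|: soft-thresholding
    b_ols = (x @ y) / (x @ x)
    return np.sign(b_ols) * max(abs(b_ols) - lam / (2 * (x @ x)), 0.0)

lam = 5.0
grid = np.linspace(-1.0, 1.0, 20001)
# the objective, expanded: y'y - 2 b x'y + b^2 x'x + lam |b|
obj = y @ y - 2 * grid * (x @ y) + grid**2 * (x @ x) + lam * np.abs(grid)
b_grid = grid[np.argmin(obj)]

print(abs(lasso_1d(x, y, lam) - b_grid) < 1e-3)  # True: both agree
```

For λ large enough (here, beyond \(2\mathbf{x}^\top\mathbf{y}\)), the closed form returns exactly zero, the corner solution of the text.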
More generally, it can be shown that \(\widehat{\beta}_\lambda\) is a biased estimator, but may be of sufficiently low variance that the mean square error is lower than with least squares. To compare the three techniques relative to the least squares estimator (obtained when λ = 0), if we assume that the explanatory variables are orthonormal, then

\[
\widehat{\beta}_j^{\text{subset}}=\widehat{\beta}_j^{\text{ols}}\,\mathbf{1}(|\widehat{\beta}_j^{\text{ols}}|\geq b),\qquad \widehat{\beta}_j^{\text{ridge}}=\frac{\widehat{\beta}_j^{\text{ols}}}{1+\lambda},\qquad \widehat{\beta}_j^{\text{lasso}}=\text{sign}(\widehat{\beta}_j^{\text{ols}})\left(|\widehat{\beta}_j^{\text{ols}}|-\frac{\lambda}{2}\right)_+,
\]

for some threshold b in the best subset case.
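Under an orthonormal design, the three techniques reduce to simple transformations of the OLS coefficients: hard thresholding (best subset, at some cutoff), proportional shrinkage (ridge), and soft thresholding at λ/2 (lasso, with the loss normalization used earlier). A toy sketch with hypothetical coefficient values:

```python
import numpy as np

b_ols = np.array([-3.0, -1.0, -0.2, 0.2, 1.0, 3.0])  # hypothetical OLS coefficients
lam = 1.0

# best subset (hard thresholding): keep coefficients above a cutoff (here 0.9)
subset = b_ols * (np.abs(b_ols) >= 0.9)
# ridge: proportional shrinkage of every coefficient
ridge = b_ols / (1 + lam)
# lasso: soft thresholding at lambda / 2
lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

for row in zip(b_ols, subset, ridge, lasso):
    print("ols {:+.2f}  subset {:+.2f}  ridge {:+.2f}  lasso {:+.2f}".format(*row))
```

Hard and soft thresholding both produce exact zeros (hence variable selection), while ridge merely shrinks every coefficient towards zero.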
Figure 3: Penalization based on norms (from Hastie et al. (2016)).

To be continued with probably a final post this week (references are online here)...
Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.