
Key Terms to Know: Regression Analysis

To decipher the results of a regression analysis, you also need to understand the lingo. This article introduces some of the most common terms used in regression analysis.

For a veteran analyst, these terms might be a "duh" scenario. However, I strongly feel that many newcomers are searching the web for exactly this information.

So, here is a comprehensive list of regression analysis terms:

Estimator: A formula or algorithm for generating estimates of parameters, given relevant data.

Bias: An estimate is unbiased if its expectation equals the value of the parameter being estimated; otherwise it is biased.

Efficiency: An estimator A is more efficient than an estimator B if A has a smaller sampling variance — that is, if the particular values generated by A are more tightly clustered around their expectation.
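
To make the idea of efficiency concrete, here is a minimal simulation sketch. It assumes normally distributed data and compares two estimators of the center of the distribution, the sample mean and the sample median. The function name `sampling_variance` and the sample sizes are my own choices for illustration, not part of the definition above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sampling_variance(estimator, n=25, reps=4000):
    """Variance of an estimator's values across many simulated samples."""
    estimates = []
    for _ in range(reps):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.pvariance(estimates)

# Both estimators target the same parameter (the center, 0.0),
# so the one with the smaller sampling variance is the more efficient.
var_mean = sampling_variance(statistics.fmean)
var_median = sampling_variance(statistics.median)

print(f"sampling variance of the mean:   {var_mean:.4f}")
print(f"sampling variance of the median: {var_median:.4f}")
```

For normal data, the mean's estimates should come out more tightly clustered than the median's, which is what "more efficient" means here.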

Consistency: An estimator is consistent if the estimates it produces converge on the true parameter value as the sample size increases without limit. Consider an estimator that produces estimates $\hat{\theta}$ of some parameter $\theta$, and let $\epsilon$ denote a small number. If the estimator is consistent, we can make the probability $P(|\hat{\theta} - \theta| < \epsilon)$ as close to 1.0 as we like, for $\epsilon$ as small as we like, by drawing a sufficiently large sample. Note that a biased estimator may nonetheless be consistent if the bias tends to zero in the limit. Conversely, an unbiased estimator may be inconsistent if its sampling variance fails to shrink appropriately as the sample size increases.
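
The point that a biased estimator can still be consistent is easier to see with numbers. The sketch below assumes standard normal data (true variance 1.0) and compares two variance estimators: dividing the sum of squared deviations by n (biased) and by n - 1 (unbiased). The function name and the sample sizes are hypothetical choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def average_variance_estimates(n, reps=2000):
    """Average the divide-by-n and divide-by-(n-1) variance estimates
    over many samples of size n drawn from N(0, 1)."""
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(reps):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        total_div_n += ss / n
        total_div_n_minus_1 += ss / (n - 1)
    return total_div_n / reps, total_div_n_minus_1 / reps

results = {n: average_variance_estimates(n) for n in (5, 50, 500)}
for n, (biased, unbiased) in results.items():
    print(f"n={n:4d}  divide-by-n: {biased:.3f}  divide-by-(n-1): {unbiased:.3f}")
```

The divide-by-n average should sit below 1.0 in small samples and approach 1.0 as n grows. Note that this only illustrates the shrinking bias; consistency also requires the spread of the estimates to shrink.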

Standard error of the Regression (SER): An estimate of the standard deviation of the error term in a regression model.

R-squared: A standardized measure of the goodness of fit for a regression model.

Standard error of regression coefficient: An estimate of the standard deviation of the sampling distribution for the coefficient in question.
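
The three quantities above (SER, R-squared, and the standard error of a coefficient) can be computed by hand for a one-regressor model. This is a minimal sketch using the usual textbook formulas for simple OLS; the data are made up purely for illustration.

```python
import math

# Hypothetical toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS estimates for y = intercept + slope * x + error
slope = sxy / sxx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in residuals)       # sum of squared residuals
sst = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares

ser = math.sqrt(ssr / (n - 2))   # n - 2: two parameters were estimated
r_squared = 1.0 - ssr / sst      # share of variation in y that is fitted
se_slope = ser / math.sqrt(sxx)  # standard error of the slope estimate

print(f"slope={slope:.4f}  SER={ser:.4f}  R^2={r_squared:.4f}  se(slope)={se_slope:.4f}")
```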

P-value: The probability, supposing the null hypothesis to be true, of drawing sample data that are as adverse to the null as the data actually drawn, or more so. When a small p-value is found, the two possibilities are that we happened to draw a low-probability unrepresentative sample or that the null hypothesis is in fact false.

Significance level: For a hypothesis test, this is the smallest p-value for which we will not reject the null hypothesis. If we choose a significance level of 1%, we're saying that we'll reject the null if and only if the p-value for the test is less than 0.01. The significance level is also the probability of making a type 1 error (that is, rejecting a true null hypothesis).

T-test: The t-test (or z-test, which is the same thing asymptotically) is a common test for the null hypothesis that a particular regression parameter, $\beta_i$, has some specific value (commonly zero, but generically $\beta_{H_0}$).
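
Here is a small sketch that ties the p-value, the significance level, and the test statistic together. It uses the normal (z) approximation rather than the exact t distribution, so treat it as a large-sample illustration. The function name `z_test` and the input numbers are hypothetical.

```python
from statistics import NormalDist

def z_test(beta_hat, se, beta_h0=0.0, alpha=0.05):
    """Two-sided test of H0: beta = beta_h0 using the normal approximation.

    Returns the test statistic, the p-value, and whether H0 is rejected
    at significance level alpha.
    """
    z = (beta_hat - beta_h0) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical estimate and standard error, tested at the 1% level.
z, p, reject = z_test(beta_hat=1.8, se=0.6, alpha=0.01)
print(f"z={z:.2f}  p-value={p:.4f}  reject H0 at 1%: {reject}")
```

With an estimate three standard errors away from zero, the p-value falls below 0.01, so the null is rejected at the 1% level.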

F-test: A common procedure for jointly testing a set of linear restrictions on a regression model.
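
For reference, one standard textbook form of the F statistic compares the fit of the model with and without the restrictions imposed. The notation is my own and is not from the definition above: $SSR_r$ and $SSR_u$ are the sums of squared residuals of the restricted and unrestricted models, $q$ is the number of restrictions, $n$ is the number of observations, and $k$ is the number of parameters in the unrestricted model.

```latex
F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k)}
```

Under the null hypothesis and the classical assumptions (normal, homoskedastic errors), this statistic follows an $F(q, n - k)$ distribution. A large value means the restrictions make the fit noticeably worse.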

Multicollinearity: A situation where there is a high degree of correlation among the independent variables in a regression model or, more generally, where some of the Xs are close to being linear combinations of other Xs. Symptoms include large standard errors and the inability to produce precise parameter estimates. This is not a serious problem if one is primarily interested in forecasting; it is a problem if one is trying to estimate causal influences.
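
One common way to quantify multicollinearity is the variance inflation factor (VIF), which is not mentioned in the definition above but follows from it. The sketch below assumes a model with just two regressors, where the VIF reduces to 1 / (1 - r^2) with r the correlation between them. The data are simulated and all names are hypothetical.

```python
import math
import random

random.seed(2)  # fixed seed so the run is repeatable

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_regressors(x1, x2):
    """Variance inflation factor when the model has exactly two regressors."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [random.gauss(0.0, 1.0) for _ in range(2000)]
x2_unrelated = [random.gauss(0.0, 1.0) for _ in x1]
x2_collinear = [a + 0.1 * random.gauss(0.0, 1.0) for a in x1]  # nearly a copy of x1

vif_unrelated = vif_two_regressors(x1, x2_unrelated)
vif_collinear = vif_two_regressors(x1, x2_collinear)
print(f"VIF, unrelated regressors: {vif_unrelated:.2f}")
print(f"VIF, collinear regressors: {vif_collinear:.2f}")
```

The VIF multiplies the sampling variance of a coefficient, so its square root is the factor by which the standard error grows. That is the "large standard errors" symptom in numbers.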

Omitted variable bias: Bias in the estimation of regression parameters that arises when a relevant independent variable is omitted from a model and the omitted variable is correlated with one or more of the included variables.
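
A quick simulation shows the bias at work. This sketch assumes a true model y = 1 + 2*x1 + 3*x2 + noise, with x2 correlated with x1 (x2 = 0.5*x1 + noise). All coefficients and names are made up for illustration. Regressing y on x1 alone should push the slope toward 2 + 3 * 0.5 = 3.5 rather than the true 2.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

n = 20000
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0.0, 1.0) for a in x1]  # correlated with x1
y = [1.0 + 2.0 * a + 3.0 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

short_slope = ols_slope(x1, y)  # x2 is omitted from this regression
print(f"true effect of x1: 2.0, estimate with x2 omitted: {short_slope:.3f}")
```

The gap between the estimate and 2.0 is the omitted variable bias: the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one.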

Log variables: A common transformation, which permits the estimation of a nonlinear model using OLS, is to substitute the natural log of a variable for the level of that variable. This can be done for the dependent variable and/or one or more independent variables. A key point to remember about logs is that for small changes, the change in the log of a variable is a good approximation to the proportional change in the variable itself. For example, if log(y) changes by 0.04, y changes by about 4%.
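
The "small changes" caveat is worth checking numerically. This short sketch compares the approximate proportional change (the change in the log itself) with the exact one, exp(change) - 1. The function name is hypothetical.

```python
import math

def pct_change_from_log_change(delta_log):
    """Exact proportional change in y implied by a change in log(y)."""
    return math.exp(delta_log) - 1.0

for delta in (0.01, 0.04, 0.10, 0.50):
    exact = pct_change_from_log_change(delta)
    print(f"change in log(y) = {delta:.2f} -> approx {delta:.1%}, exact {exact:.2%}")
```

The approximation is very close at 0.04 and gets visibly worse as the change in the log grows.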

Quadratic terms: Another common transformation. When both $x_i$ and $x_i^2$ are included as regressors, it is important to remember that the estimated effect of $x_i$ on y is given by the derivative of the regression equation with respect to $x_i$. If the coefficient on $x_i$ is $\beta$ and the coefficient on $x_i^2$ is $\gamma$, the derivative is $\beta + 2\gamma x_i$.
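
Because the effect depends on where you evaluate it, a tiny helper makes the formula easier to use. This is a sketch with hypothetical coefficients (think of log wages regressed on experience and experience squared); the function names are my own.

```python
def marginal_effect(beta, gamma, x):
    """Derivative of beta*x + gamma*x**2 with respect to x."""
    return beta + 2.0 * gamma * x

def turning_point(beta, gamma):
    """Value of x at which the marginal effect is zero (requires gamma != 0)."""
    return -beta / (2.0 * gamma)

# Hypothetical estimates: positive linear term, small negative quadratic term.
beta, gamma = 0.08, -0.001
for x in (0, 10, 20, 40):
    print(f"x={x:2d}  marginal effect={marginal_effect(beta, gamma, x):+.3f}")
print(f"effect changes sign at x = {turning_point(beta, gamma):.1f}")
```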

Interaction terms: Pairwise products of the "original" independent variables. The inclusion of interaction terms in a regression allows for the possibility that the degree to which $x_i$ affects y depends on the value of some other variable $x_j$. In other words, $x_j$ modulates the effect of $x_i$ on y. For example, the effect of experience ($x_i$) on wages might depend on the gender ($x_j$) of the worker.
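
As with quadratic terms, the effect of one variable in a model with an interaction is read off the derivative. The equation below is a sketch in my own notation (the coefficients $\beta_0$ through $\beta_3$ and the error term $u$ are assumptions for illustration):

```latex
y = \beta_0 + \beta_1 x_i + \beta_2 x_j + \beta_3 x_i x_j + u
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = \beta_1 + \beta_3 x_j
```

If $x_j$ is a 0/1 dummy, the effect of $x_i$ is $\beta_1$ for one group and $\beta_1 + \beta_3$ for the other.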

Topics:
data science, regression analysis, data analytics, ai

Published at DZone with permission of
