Clarifying Statistics Terminology

In this article, learn about some interesting statistical terms that tend to cause confusion, explained in a more lucid manner.

Statisticians say the weirdest things. At least, that is how it can feel if you are not well-versed in statistical terminology. Statistics also has an unhelpful tendency to use words whose meanings change with the context in which they are used. Most of the time, this causes a lot of heartburn and confusion for not-so-technically-sound analysts.

Hence, I decided to write an article around some interesting statistical terms that I have heard and used in the past two decades and have also explained to various audiences ranging from college graduates at various national and international universities to corporate head honchos.

Heterogeneity: In statistics, this means that your populations or samples show widely varying results on the traits being measured.

Heteroscedasticity: This refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. Its opposite, homoscedasticity (equal variability), is one of the most commonly cited assumptions of parametric analyses such as linear regression, so heteroscedasticity is a violation of that assumption.
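
As a rough illustration of heteroscedasticity, the sketch below generates made-up data in which the noise around a trend line grows with the predictor, then compares the spread of the residuals for small and large predictor values. The trend (y = 2x), the noise scale, and the seed are all assumptions chosen for the example, not values from any real data set.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: the noise around the trend y = 2x grows with x.
xs = [i / 10 for i in range(1, 201)]  # 0.1, 0.2, ..., 20.0
ys = [2.0 * x + rng.gauss(0, 0.5 * x) for x in xs]

# Residuals around the (known, assumed) trend line.
residuals = [y - 2.0 * x for x, y in zip(xs, ys)]

# Compare the spread of residuals for the lower and upper half of x.
half = len(xs) // 2
spread_low = statistics.pstdev(residuals[:half])   # small x values
spread_high = statistics.pstdev(residuals[half:])  # large x values

print(f"residual spread for small x: {spread_low:.2f}")
print(f"residual spread for large x: {spread_high:.2f}")
```

If the two spreads were roughly equal, the data would look homoscedastic. Here the spread for large x should come out clearly larger, which is the unequal variability the definition describes.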

Homogeneity: This term is the opposite of heterogeneity; it means that your populations and samples have similar traits. Homogeneous samples are usually small and are made up of similar cases.

Microdata: Individual response data obtained from surveys and censuses. These are data points directly observed or collected from a specific unit of observation. Also known as raw data. ICPSR is an excellent resource for obtaining microdata files.

Data point or datum: Singular of data. Refers to a single point of data. Example: The amount of aviation gasoline consumed by the transportation sector in the U.S. in 2012.

Quantitative data/variables: Information that can be handled numerically. Example: Spending by US consumers on personal care products and services.

Qualitative data/variables: Information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, etc., may collect qualitative data. Often, however, some element of the results obtained via qualitative research can be handled numerically, e.g. how many observations were made or how many interviews were conducted. Example: Periods when the US was in vs. was not in a recession. The quality of being in a recession is assigned a value of .01 and not in a recession .0, which makes it possible to display as a chart.

Indicator: Typically used as a synonym for a statistic that describes something about the socioeconomic environment of a society, e.g. per capita income, unemployment rate, or median years of education.

Statistic: A number that describes some characteristic or status of a variable, e.g. a count or a percentage. Example: Total non-farm job starts in August 2014.

Statistics: Numerical summaries of data that has been analyzed in some way. Example: Ranking of airlines by percentage of flights arriving on time at Huntsville International Airport in Alabama in 2013.
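
To make the step from microdata to a statistic to statistics concrete, here is a minimal sketch. The flight records and airline names "A" and "B" are hypothetical, invented purely for illustration; the helper `on_time_percentage` is likewise just a name chosen for this example.

```python
# Hypothetical microdata: one record per flight (made-up values).
flights = [
    {"airline": "A", "on_time": True},
    {"airline": "A", "on_time": False},
    {"airline": "A", "on_time": True},
    {"airline": "A", "on_time": True},
    {"airline": "B", "on_time": True},
    {"airline": "B", "on_time": False},
]


def on_time_percentage(records, airline):
    """Percentage of one airline's flights that arrived on time."""
    mine = [r for r in records if r["airline"] == airline]
    return 100.0 * sum(r["on_time"] for r in mine) / len(mine)


# A single statistic: a count.
total_flights = len(flights)

# Statistics: summaries produced by analyzing the data further.
percentages = {a: on_time_percentage(flights, a) for a in ("A", "B")}
ranking = sorted(percentages, key=percentages.get, reverse=True)

print(total_flights, percentages, ranking)
```

Each dictionary in `flights` plays the role of a data point, `total_flights` is a statistic, and the ranked percentages are the kind of analyzed summary the "Statistics" entry refers to.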

Time series data: Any data arranged in chronological order. Example: Gross Domestic Product of Greece, 2000-2013.

Variable: Any finding that can change or vary. Examples include anything that can be measured, such as the number of logging operations in Alabama.

Numerical variable: Usually refers to a variable whose possible values are numbers. Example: Bank prime loan rate.

Categorical variable: A variable that distinguishes among subjects by putting them in categories (e.g. gender). Sometimes loosely called a discrete variable; when the categories have no natural order, it is also called a nominal variable. Example: Female infant mortality rate of Belarus (the mortality rate is numerical; the age/gender characteristic is categorical).
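
The difference between the two kinds of variable shows up in what you can do with them: categories can be counted, while numerical values can be averaged. The sketch below assumes a small set of made-up records, each pairing a category label with a numerical value; none of the numbers come from real data.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: (categorical value, numerical value).
records = [
    ("female", 4.0),
    ("male", 6.0),
    ("female", 5.0),
    ("male", 8.0),
]

# Categorical variable: the sensible summary is a count per category.
category_counts = Counter(category for category, _ in records)

# Numerical variable: it can be averaged, here within each category.
values_by_category = defaultdict(list)
for category, value in records:
    values_by_category[category].append(value)

mean_by_category = {c: mean(v) for c, v in values_by_category.items()}

print(category_counts)
print(mean_by_category)
```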

Time series: A set of measures of a single variable recorded over a period of time. Example: Hourly mean earnings of civilian workers — mining management, professional, and related workers.
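
A time series is mostly a matter of ordering: the same measurements become a series once they are arranged chronologically. The following sketch assumes three made-up yearly observations of a single variable that were recorded out of order; the dates and values are hypothetical.

```python
from datetime import date

# Hypothetical observations of one variable, recorded out of order.
observations = [
    (date(2002, 1, 1), 105.0),
    (date(2000, 1, 1), 100.0),
    (date(2001, 1, 1), 102.0),
]

# Arranging the observations chronologically yields the time series.
series = sorted(observations, key=lambda obs: obs[0])

# Period-over-period change only makes sense once the order is fixed.
changes = [
    (later[0], later[1] - earlier[1])
    for earlier, later in zip(series, series[1:])
]

for when, delta in changes:
    print(when.year, delta)
```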

Alpha-beta conundrum: There are so many meanings for these two statistical terms that one can get confused in no time. So, let's understand the meaning of these two statistics in various contexts.

  • Hypothesis testing:

    • Alpha error: The probability of a Type I error in a hypothesis test: rejecting a null hypothesis that is actually true, and so incorrectly claiming statistical significance.

    • Beta error: The probability of a Type II error in a hypothesis test: failing to reject a null hypothesis that is actually false, and so incorrectly concluding no statistical significance. (Power is 1 - beta.)

  • Regression coefficients:

    • In almost all textbooks and software packages, the population regression coefficients are denoted by beta. Like all population parameters, they are theoretical; we don't know what they are. The regression coefficients we estimate from our sample are statistical estimates of those parameter values. Most parameters are denoted with Greek letters and statistics with the corresponding Latin letters, so the sample estimate of beta is often written as b.

  • Cronbach's alpha:

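    • This is another totally different use of alpha, AKA coefficient alpha, which measures the internal-consistency reliability of a scale, i.e. how closely its items move together. It does not by itself show that the scale is correct (valid).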
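
The three meanings above can be kept apart more easily once each is computed. The sketch below is illustrative only: it assumes a one-sided z-test with known standard deviation for the hypothesis-testing case, a simple one-predictor least-squares fit for the regression case, and the standard variance-based formula for Cronbach's alpha. All data values and the function names (`z_test_beta`, `least_squares_slope`, `cronbach_alpha`) are made up for this example.

```python
from statistics import NormalDist, mean, variance


# 1) Hypothesis testing: alpha is chosen, beta follows from the design.
def z_test_beta(alpha, effect, sigma, n):
    """Type II error rate of a one-sided z-test for a true shift `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # rejection threshold
    shift = effect * n ** 0.5 / sigma         # mean of z under the alternative
    return NormalDist().cdf(z_crit - shift)


alpha = 0.05
beta_error = z_test_beta(alpha, effect=0.5, sigma=1.0, n=25)
power = 1 - beta_error


# 2) Regression: b is the sample estimate of the unknown population beta.
def least_squares_slope(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up sample
b = least_squares_slope(xs, ys)


# 3) Cronbach's alpha: internal consistency of a multi-item scale.
def cronbach_alpha(items):
    """`items` holds one list of respondent scores per scale item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


items = [
    [2, 3, 4, 5],  # item 1, four hypothetical respondents
    [3, 3, 5, 5],  # item 2
    [2, 4, 4, 6],  # item 3
]
scale_alpha = cronbach_alpha(items)

print(f"beta error: {beta_error:.3f}, power: {power:.3f}")
print(f"estimated slope b: {b:.2f}")
print(f"Cronbach's alpha: {scale_alpha:.3f}")
```

Note that the `alpha` passed to `z_test_beta` and the value returned by `cronbach_alpha` share nothing but a name: the first is a chosen error rate, the second a reliability coefficient computed from item variances.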

I hope that you all have enjoyed reading this article and would like to share some interesting terminologies and statistical terms with me, as well!

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
