The Journey of Probability and Statistics (With an Excel Example)
Need a refresher on stats and probability? A data analyst goes over the basics of population and sample sizes, before showing how to use Excel to analyze data.
Join the DZone community and get the full member experience.
Join For FreeImgaine simply flipping a coin in the air. Can our lives be defined by such simple actions? According statistics, the answer is “Yes.” In my school years, I always thought, “why do we have to flip a coin to learn statistics? What is the relation with real life?” No one said that this was a simple real-life simulation, basically. How come could this happen? Think about your life. When you apply for a job, there are only two outcomes: you will get the job or you won’t. Another example, pulling a ball from an bin. A company has many employees. In a specific moment, one employee could leave the company and a new employee could start a new job. This can be a simulation of replacing a ball in a bin. Up to this point, the probability is in the focus of all the simple events. Back to student question in this writing, no one flips a coin or pulls out a ball from a bin. These are just basic simulations of our basic lives.
Now, the main question: is it possible to define all such events? The answer is very simple, "No." Because there is no way to know all the events which are in your focus. You cannot reach all the people searching for jobs and estimate if they would be accepted for the job or not. You have to be satisfied with the people you can actually reach. That is called the “Sample.” What is all the mass if we can define? It is called the “Population”, too. We conclude that, in the statistics world, the realities are “Sample” and the dreams are the “Population.”
This can be illustrated in the below figure.
By using statistical methods, we gain insights into the “Population” by means of the “Sample,” because we know the average value of the “Sample” and we can calculate the proportion of the expected events in the “Sample” set. The first step to statistics journey starts here.
Then, let’s continue with the “Sample” set. So far, I have tried to say that we can predict the population parameter with sample statistics. What does that mean? If we have some sample data from a population, we can predict some other sample values by using that set of sample data. This is called “simple linear regression.”
A more formal definition can be found here: “Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.”
Simple Linear Regression
As I said above, simple linear regression gives the statistical relationship between response and predictor. It is not a deterministic relationship like Celsius and Fahrenheit conversion. A deterministic relationship is given simply by the following equation which has no conversion error between the two dependent and independent variables:
For a statistical relationship, an example can be given as the relationship between “avg. temperature” and “avg. rainy days,” which will be used in this article. The small dataset is given in the below table. There are 12 samples which represent the average temperature for each month, respectively, and average rainy days in a similar way.
From this data, we can try to predict the average rainy days (y) for other possible values of the average temperature (x). However, prediction always means we'll make some errors. If there is a statistical relationship between two variables, there has to be a “prediction error.” A prediction error in simple linear regression is the difference between the predicted value and the actual value, as shown below:
We have to minimize the predicted error value, to get the “Best Fitted Line” between the variables. Here comes the main question for the simple linear regression: “What is the best fitting line between the response and predictor variable?” Why do we need to minimize the error? The answer can be seen from the following equation which defines the simple linear regression line.
To find the best fitting line by using the given formula, we have to calculate the values for b0 and b1 that make the sum of squared prediction errors the smallest that they can be. For simplicity, I will not go through the formula calculations here. However, as can be seen, the main idea is that, since we do not everything there is to know about the population, the accuracy of estimating new values depends on getting the smallest error from the sample data. Of course, you do not have to all the calculations manually. There are a lot of statistical programs, including Excel, that you can use to calculate the errors. But it is important to know what the calculated values are, how to use them to control the model, and how one should interpret them.
Stastical Jargon
SST | Total Sum of Squares |
SSE | Error Sum Of Squares |
SSR | Regression Sum of Squares |
R-Squared | Coefficient of Determination |
S | Standard Error of Estimate |
R- Squared: The percentage of the variation in the dependent variable which is explained by the line with the least squares is known as the R-squared value. With one independent variable, a high R-squared value indicates a better fit. For better estimation accuracy, a standard error (S) should be taken into consideration.
Standard Error (S) of Regression (Estimates): It measures the spread of the points around the least-squares line. For getting a normal distribution, 68% of sample values should fall into one margin of standard error based on the regression of the least-square line and approximately 95% of the values should be within two margins of standard error of the regression value. This can be illustrated with the following graph (from wikipedia standard deviation page);
A Simple Example With Excel
Now let's look at a simple exercise with a very simple data set. We'll use the data set given above which includes average temperature and number of rainy days.
Step 1: Put the data in an Excel sheet. The first column should be "Avg. Temperature (°C)" for the x-axis values, and the second column should be "Avg. Rainy Days." Then select both of them with the cursor, as shown in the figure below. From the "Insert" menu, choose "Scatter" plot from among the graph types. You will get the graph which can be seen in figure.
Step 2: To add a trend line, simply choose a data point on the graph. Right-click on the data point and choose "Add Trendline" from the pop-up menu:
Step 3: The format options for the trendline can be seen in the below figure. Choose the "Linear" option, "Display equation on chartm" and "Display R-squared value on chart" from the options.
Step 4: You will get the below graph:
Step 5: The equation gives the "Simple Linear Regression" best fit line for this sample set. By using this formula, you can get the errors (residuals) for the sample data set:
Step 6: To calculate the standard errors of regression use the STEXY(known Ys, known Xs) function in Excel. For the "known Ys" parameter, select the "Avg. Rainy Days" column and for the "known Xs" parameter, select the "Avg. Temperature (°C)." When you run the formula you will get the value for SER (standard errors for regression): 2.341960944.
This value shows that 83% of the points are within one SER of the least-squares line. For the R-squared value, "0.6014," the linear relationship explains 60 percent of the variation in "Avg. Rainy Days."
This is just the start of a journey...
Opinions expressed by DZone contributors are their own.
Comments