DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Data
  4. The Journey of Probability and Statistics (With an Excel Example)

The Journey of Probability and Statistics (With an Excel Example)

Need a refresher on stats and probability? A data analyst goes over the basics of population and sample sizes, before showing how to use Excel to analyze data.

Zehra Can user avatar by
Zehra Can
·
Nov. 15, 18 · Tutorial
Like (11)
Save
Tweet
Share
8.84K Views

Join the DZone community and get the full member experience.

Join For Free

Imgaine simply flipping a coin in the air. Can our lives be defined by such simple actions? According statistics, the answer is “Yes.” In my school years, I always thought, “why do we have to flip a coin to learn statistics? What is the relation with real life?” No one said that this was a simple real-life simulation, basically. How come could this happen? Think about your life. When you apply for a job, there are only two outcomes: you will get the job or you won’t. Another example, pulling a ball from an bin. A company has many employees. In a specific moment, one employee could leave the company and a new employee could start a new job. This can be a simulation of replacing a ball in a bin. Up to this point, the probability is in the focus of all the simple events. Back to student question in this writing, no one flips a coin or pulls out a ball from a bin. These are just basic simulations of our basic lives.

Now, the main question: is it possible to define all such events? The answer is very simple, "No." Because there is no way to know all the events which are in your focus. You cannot reach all the people searching for jobs and estimate if they would be accepted for the job or not. You have to be satisfied with the people  you can actually reach. That is called the “Sample.” What is all the mass if we can define? It is called the “Population”, too. We conclude that, in the statistics world, the realities are “Sample” and the dreams are the “Population.”

This can be illustrated in the below figure. 

Image title

By using statistical methods, we gain insights into the “Population” by means of the “Sample,” because we know the average value of the “Sample” and we can calculate the proportion of the expected events in the “Sample” set. The first step to statistics journey starts here. 

Then, let’s continue with the “Sample” set. So far, I have tried to say that we can predict the population parameter with sample statistics. What does that mean? If we have some sample data from a population, we can predict some other sample values by using that set of sample data. This is called “simple linear regression.”

A more formal definition can be found here: “Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.”

Simple Linear Regression

As I said above, simple linear regression gives the statistical relationship between response and predictor. It is not a deterministic relationship like Celsius and Fahrenheit conversion. A deterministic relationship is given simply by the following equation which has no conversion error between the two dependent and independent variables:

Image title

For a statistical relationship, an example can be given as the relationship between “avg. temperature” and “avg. rainy days,” which will be used in this article. The small dataset is given in the below table. There are 12 samples which represent the average temperature for each month, respectively, and average rainy days in a similar way.

Image title

From this data, we can try to predict the average rainy days (y) for other possible values of the average temperature (x). However, prediction always means we'll make some errors. If there is a statistical relationship between two variables, there has to be a “prediction error.” A prediction error in simple linear regression is the difference between the predicted value and the actual value, as shown below:

Image title

We have to minimize the predicted error value, to get the “Best Fitted Line” between the variables. Here comes the main question for the simple linear regression: “What is the best fitting line between the response and predictor variable?” Why do we need to minimize the error? The answer can be seen from the following equation which defines the simple linear regression line.

Image title

To find the best fitting line by using the given formula, we have to calculate the values for b0 and b1 that make the sum of squared prediction errors the smallest that they can be. For simplicity, I will not go through the formula calculations here. However, as can be seen, the main idea is that, since we do not everything there is to know about the population, the accuracy of estimating new values depends on getting the smallest error from the sample data. Of course, you do not have to all the calculations manually. There are a lot of statistical programs, including Excel, that you can use to calculate the errors. But it is important to know what the calculated values are, how to use them to control the model, and how one should interpret them.  

Stastical Jargon

SST Total Sum of Squares
SSE Error Sum Of Squares
SSR Regression Sum of Squares
R-Squared Coefficient of Determination
S Standard Error of Estimate

R- Squared: The percentage of the variation in the dependent variable which is explained by the line with the least squares is known as the R-squared value. With one independent variable, a high R-squared value indicates a better fit. For better estimation accuracy, a standard error (S) should be taken into consideration.

Standard Error (S) of Regression (Estimates): It measures the spread of the points around the least-squares line. For getting a normal distribution, 68% of sample values should fall into one margin of standard error based on the regression of the least-square line and approximately 95% of the values should be within two margins of standard error of the regression value. This can be illustrated with the following graph (from wikipedia standard deviation page);

Image title

A Simple Example With Excel

Now let's look at a simple exercise with a very simple data set. We'll use the data set given above which includes average temperature and number of rainy days. 

Step 1: Put the data in an Excel sheet. The first column should be "Avg. Temperature (°C)" for the x-axis values, and the second column should be "Avg. Rainy Days." Then select both of them with the cursor, as shown in the figure below. From the "Insert" menu, choose "Scatter" plot from among the graph types. You will get the graph which can be seen in figure.

Image title

Step 2: To add a trend line, simply choose a data point on the graph. Right-click on the data point and choose "Add Trendline" from the pop-up menu:

Image title

Step 3: The format options for the trendline can be seen in the below figure. Choose the "Linear" option, "Display equation on chartm" and "Display R-squared value on chart" from the options.

Image title


Step 4: You will get the below graph:

Image title

Step 5: The equation gives the "Simple Linear Regression" best fit line for this sample set. By using this formula, you can get the errors (residuals) for the sample data set:

Image title

Step 6: To calculate the standard errors of regression use the STEXY(known Ys, known Xs) function in Excel. For the "known Ys" parameter, select the "Avg. Rainy Days" column and for the "known Xs" parameter, select the "Avg. Temperature (°C)." When you run the formula you will get the value for SER (standard errors for regression): 2.341960944.

This value shows that 83% of the points are within one SER of the least-squares line. For the R-squared value, "0.6014," the linear relationship explains 60 percent of the variation in "Avg. Rainy Days."

This is just the start of a journey... 

Statistics Linear regression Data (computing) Least squares

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Unlocking the Power of Polymorphism in JavaScript: A Deep Dive
  • How To Generate Code Coverage Report Using JaCoCo-Maven Plugin
  • Promises, Thenables, and Lazy-Evaluation: What, Why, How
  • Core Machine Learning Metrics

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: