DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Smart Routing Using AI for Efficient Logistics and Green Solutions
  • Predicting Traffic Volume With Artificial Intelligence and Machine Learning
  • A Guide to Regression Analysis Forecasting in Python
  • Inventory Predictions With Databricks

Trending

  • The End of “Good Enough Agile”
  • MCP Servers: The Technical Debt That Is Coming
  • Event Driven Architecture (EDA) - Optimizer or Complicator
  • GitHub Copilot's New AI Coding Agent Saves Developers Time – And Requires Their Oversight

Customer Churn Analysis: Using Logistic Regression to Predict At-Risk Customers

For predicting a discrete variable, logistic regression is your friend. Let's learn why linear regression won't work as we build a simple customer churn model.

By 
Sunil Kappal user avatar
Sunil Kappal
·
Dec. 14, 18 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
17.8K Views

Join the DZone community and get the full member experience.

Join For Free

We all know that linear regression routines are pretty straightforward and easy to understand. Where it clearly states that the value of an independent variable increases by 1 point, the dependent variable increases by b units.

However, when it comes to predicting a discrete variable — for example, whether a customer will stay with a service provider or not, or whether it will rain or not — logistic regression will come into play. There are not many distinct values, and the outcome can only be a 1 or a 0.

In this article, we will learn how to build a simple customer churn model within Excel, and we will use Solver to optimize this model by reducing the cross entropy error.

Before we get into the details of logistic regression, let's understand why linear regression will not work when we have to predict a discrete outcome based on the following limitations of linear regression:

  • Linear regression assumes a linear relationship between each variable. However, if the customer is paying less for a particular service compared to a competitor, the chances of the customer leaving the service provider will vary exponentially.
  • Linear regression assumes that the probability increases proportionately as the independent variable increases.

Solution: Sigmoid Curve to the Rescue

As stated above, the major problem with linear regression is its assumption of a linear relationship among variables, which is very rare in practice.

In order to solve this (and a few other) limitation problems, we will explore the sigmoid curve. In layman's language, we can say that the sigmoid function (curve) helps to return the probability value, which can then be mapped to two or more discrete classes. Whereas, in general, linear regression does not tell us the probability of an event happening after certain range.

Based on the above explanation, we can say that the sigmoid curve can better explain discrete phenomena than linear regression.

Just to keep things simple, I will move on to the core topic of this article, i.e., how to use logistic regression to retain customers. For this, we will work to understand logistic regression and see how it can be implemented.

As we know, linear regression assumes a linear relation between dependent and independent variables. It is expressed as Y = x+b*X. Logistic regression moves away from the notion of linear relation by applying the sigmoid curve.

The above notation clearly show how logistic regression uses independent variables, which is the same as linear regression. At the same time, however, it also passes those variables through a sigmoid activation to bound the outputs in between 0 or a 1.

Now, to see how the output changes in a logistic regression, let's look under the hood of a logistic regression equation with the help of an example:


  • If X = 0, the value of Y = 1/(1 + exp(-(2))) = 0.88
  • If X is increased by 2 units (that is, X = 2), the value of Y is Y = 1/(1 + exp(-(2+3*2))) = 1/(1+exp(-(5))) = 0.993

It is quite clear when the value of Y changed from 0.88 to 0.99, as X changed from 0 to 2. Similarly, if X were -2, Y would have been at 0.017. Therefore, the impact on Y of a unit change in X depends on the equation. The value 0.88 when X = 0 can be interpreted as the probability. For example, on average in 88% cases, the value of Y is 1 when X = 0.

Logistic Regression in Action

To see the above equation and routine in action, we will go through the exercise of building a logistic regression equation in Excel. The challenge is to be able to predict whether a customer will stay with a service provider or not, based on a few variables listed in the below table.

Steps to Follow

  1. Initialize the weights of independent variables to random values (let's assume 1 each).
  2. Once the weights and the bias are initialized, we'll estimate the output value (the probability of a customer leaving = 1 or staying =- 0) by applying sigmoid activation on the multivariate linear regression of independent variables.

The table below contains information about the (a+b*X) part of the sigmoid curve and the final sigmoid activation value:

The formula for obtaining the values in the table above is given in the table below:

The if condition in the preceding sigmoid activation col. is used only because Excel has limitations in calculating any value >exp 500 — hence the clipping.

Error Estimation

In linear regression, we consider least squares (the squared difference) between the actual and the predicted value to estimate the overall error. In logistic regression, we will use a different error metric called cross entropy.

Cross entropy is a measure of difference between two different distributions — actual and predicted distribution.

Let's look at the two cost functions (least squares method and entropy cost) in cases where the dependent variable (Customer Churn) is 1:

The above table clearly shows that the cross entropy method penalizes heavily for high prediction error compared to the squared error: lower error values have similar loss in both the squared error and the cross entropy error, but where there is a high difference, cross entropy penalizes heavily. So it is a good idea to stick with the cross entropy error as the error metric when predicting discrete variables.

Now that we have set up our problem, let us vary the parameters in such a way that the overall error is reduced. This step can be performed by gradient descent, which can be done by using the Solver functionality in excel.

How do you perform logistic regression using Solver?

Note: To improve the model performance and reduce the error, it will be a good idea to bring in only those variables which are statistically significant. For this activity, one can follow different approaches that are out of the scope of this article.

For training and consulting, visit my website.
Linear regression

Published at DZone with permission of Sunil Kappal, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Smart Routing Using AI for Efficient Logistics and Green Solutions
  • Predicting Traffic Volume With Artificial Intelligence and Machine Learning
  • A Guide to Regression Analysis Forecasting in Python
  • Inventory Predictions With Databricks

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!