Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

A Guide for Customer Retention Analysis With SQL

DZone's Guide to

A Guide for Customer Retention Analysis With SQL

See how to make basic customer retention analysis, build customer retention over time, deal with new vs. existing customers retention curves, and calculate retention analysis in cohorts.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Whether you’re selling groceries, financial services, or gym memberships, successful recruitment of new customers is only truly successful if they return to buy from you again. The metric that reflects this is called retention and the approach we use is customer retention analysis. It’s one of the key metrics that influence revenue. When your customers’ retention is law, you’ll spend all of the income from your business on marketing.

At the same time, retention is easy to improve if you can calculate it the right way using SQL and your database. In this post, we’ll guide you step by step on how to make basic customer retention analysis, how to build customer retention over time, how to deal with new vs. existing customers retention curves, and how to calculate retention analysis in cohorts.

Basic Customer Retention Curves

Customer retention curves are essential to any business looking to understand its clients and will go a long way towards explaining other things like sales figures or the impact of marketing initiatives. They are an easy way to visualize a key interaction between customers and the business, which is to say, whether or not customers return — and at what rate — after the first visit.

The first step to building a customer retention curve is to identify those who visited your business during the reference period, what I will call p1. It is important that the length of the period chosen is a reasonable one, and reflects the expected frequency of visits.

Different types of businesses are going to expect their customers to return at different rates:

  • A coffee shop may choose to use an expected frequency of visits of once a week.
  • A supermarket may choose a longer period, perhaps two weeks or one month.

In the following example, I will use a month and assume that we are looking at customer retention of customers who visited in January 2016 over the following year.

As previously stated, the first step is to identify the original pool of customers:

January_pool AS 
( 
                SELECT DISTINCT cust_id 
                FROM            dataset 
                WHERE           month(transaction_date)=1 
                AND             year(transaction_date)=2016)

Then, we look at how those customers behaved over time. For example, how many of them returned per month over the rest of the year?

SELECT   Year(transaction_date), 
         Month(transaction_date), 
         count (distinct cust_id) AS number 
FROM     dataset 
WHERE    year(transaction_date)=2016 
AND      cust_id IN january_pool 
GROUP BY 1, 
         2

As you can see, the original SELECT function is included in this second step.

If we had 1,000 unique customers in January, we could expect our results to look something like this:

The resulting graph would then look like this:

Data visualized with Statsbot


Evolution of customer retention over time

What is described above is obviously only the first step, as we would also like to see whether there are any trends in customer retention, i.e. are we getting any better at it?

So, one idea we might have is to say: of those who came in January, how many returned in February? Of those who came in February, how many returned in March? And other one-month intervals.

So, then we need to set up an iterative model, which can be built in a few simple steps. First, we need to create a table where each user’s visits are logged by month, allowing for the possibility that these will have occurred over multiple years since whenever our business started operations. I have assumed here that the start date is the year 2000, but you can adjust this as necessary.

Visit_log AS
SELECT   cust_id, 
         datediff(month, ‘2000-01-01’, transaction_date) AS visit_month 
FROM     dataset 
GROUP BY 1, 
         2 
ORDER BY 1, 
         2

This will give us a view that looks like this:

We then need to reorganize this information in order to identify the time lapse between each visit. So, for each person and for each month, see when the next visit is.

Time_lapse AS
    SELECT cust_id,
           visit_month lead(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month)
    FROM visit_log

We then need to calculate the time gaps between visits:

Time_diff_calculated AS
    SELECT cust_id,
           visit_month,
           lead,
           lead — visit_month AS time_diff
    FROM time_lapse

Now, a small reminder of what customer retention analysis measures: it is the proportion of customers who return after x lag of time. So, what we want to do is compare the number of customers visiting in a given month to how many of those return the next month. We also want to define those who return after a certain absence, and those who don’t return at all. In order to do that, we need to categorize the customers depending on their visit pattern.

Custs_categorized AS
SELECT cust_id,
       visit_month,
       CASE
             WHEN time_diff=1 THEN ‘retained’,
             WHEN time_diff>1 THEN ‘lagger’,
             WHEN time_diff IS NULL THEN ‘lost’
       END AS cust_type
FROM time_diff_calculated

This will allow us, in a final step, to establish a count of the number of customers who visited in a given month, and how many of those return the next month.

SELECT visit_month,
       count(cust_id where cust_type=’retained’)/count(cust_id) AS retention
FROM custs_categorized
GROUP BY 1

This gives us, month by month, the proportion of customers who returned.

Data visualized with Statsbot


Other Techniques for Customer Retention

There are, of course, other ways to think about customer retention analysis. In particular, we might want to look at how to deal with returning customers: if in p1 there are 100 customers and in p2 80 of them return, in p3 do we want to keep the original 100 or look only at the 80?

In the above example, I talked only about period-to-period retention, but I also classified those who were “laggers,” or who took more than one month to return. A business may want, through targeted marketing efforts, to reward customers who keep a monthly visit pattern, or perhaps, to encourage those who visit less than once a month to come back more often.

So, another way to look at it would be to look at what proportion of our visitors in any given month are retained, how many are returning, and how many are new. In this case, we would want to change our perspective slightly, and look not at the next visit, but at the previous one.

Time_lapse_2 AS 
  SELECT   cust_id, 
           Visit_month,
lag(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month)
  FROM     visit_log

Time_diff_calculated_2 AS 
  SELECT cust_id, 
         visit_month, 
         lag, 
         visit_month - lag AS time_diff 
  FROM   time_lapse_2


Custs_categorized AS
SELECT cust_id, 
       visit_month, 
       CASE 
              WHEN time_diff=1 THEN ‘retained’, 
              WHEN time_diff>1 THEN ‘returning’, 
              WHEN time_diff IS NULL THEN ‘new’ 
       END AS cust_type 
FROM   time_diff_calculated_2

So then, it’s simple to count the number of each type of customer in any given month.

SELECT visit_month, 
       cust_type, 
       Count(cust_id) 
FROM   custs_categorized 
GROUP  BY 1, 
          2 

With a simple export and graph, you will get something that looks like this:

Data visualized with Statsbot

Cohort Tables

A popular way to analyze customer retention is through the use of cohorts, i.e. defining each user by their first visit and then seeing how they return over time.

Ideally, our end product will be a table like this:

Well, actually, ideally, our table will show that the number of new users is increasing every month and that retention is improving over time. But you get the idea!

In order to build this, we need first to establish the number of new users for each month, and the amount of time that they stick around. Then, getting the percentages will be easy.

Our first subquery will classify each customer according to their initial visit and the length of time that they are retained.

first_visit AS 
  SELECT   cust_id, 
           min(visit_month)   AS first_month 
   FROM     visit_log 
  GROUP BY 1

By month, we can then calculate the number of new users.

New_users AS 
  SELECT   first_month, 
           count(DISTINCT cust_id) AS new_users 
  FROM     first_visit 
  GROUP BY 1

We also want to get a picture of whether a new user in, say, Month 1, returned in both Month 2 and 3 or only in Month 3. So, we’ll create a subquery that allows us to track the visit pattern of each customer.

SELECT * 
FROM   visit_log 
       LEFT JOIN visit_log AS visit_tracker 
              ON visit_log.cust_id = visit_tracker.cust_id 
                 AND visit_log.visit_month < visit_tracker.visit_month

Then, we’ll combine these to get a lovely final product. For each month a new user is recruited, we’ll have the size of their group, as well as, for each subsequent month, the percentage of those users who returned. Just like the table above!

In order to do that, we run the following script:

SELECT    first_month, 
          new_users, 
          retention_month, 
          retained, 
          retention_percent 
FROM      ( 
                 SELECT first_month, 
                        new_users, 
                        Count(DISTINCT visit_tracker.cust_id) AS retained count(DISTINCT visit)tracker.cust_id)/new_users AS retention_percent
FROM      first_visit 
LEFT JOIN visit_log AS visit_tracker 
ON        visit_log.visit_month < visit_tracker.visit_month 
LEFT JOIN new_users 
ON        new_users.first_month=first_visit.first_month 
GROUP BY  1, 
          2)

This would give us a table that looks like this:

We can then do a pivot table in Excel (or similar) in order to get the table above or use a business intelligence tool such as Statsbot to get data right from our database without coding SQL.

Interpretation

Producing such a table is, of course, only the first step — and a pretty useless one at that if we don’t know how to interpret the numbers.

Taking our original table again, I’ll run through the basics of cohort analysis.

First of all, we’re going to look at the number of new users per month. Are we managing to recruit a healthy number of clients for our business? Here, we can see that we are recruiting between 500 and 600 new clients per month. Is this what we had aimed to do? This is a discussion that should be had with the marketing team.

Then is the number of new clients per month rising? If not, why? Are there peaks and troughs? So, for example, we can see that there were only 504 new customers in Period 6, but 583 in Period 5 and 598 in Period 7. These are huge swings in the numbers. If there is something to explain it, like we’re a business that caters to school groups and there was a week of school vacation during that period, fine. But we need to have an explanation.

After that, the actual retention percentages. One business objective is going to be to improve our retention over time, i.e. to have an upward trend as we go down the lines. So, if we retain a higher percentage of new users who joined in Period 2 than those who joined in Period 1, we are on track.

As we see in the table above, this is what happened: after 1 month, 63% of new users from Period 1 returned, but 68% of new users from Period 2 returned. So we’re doing well! Basically, our customer experience should be getting better over time, so that people are more likely to come back.

After that, another metric we will look at is the rate at which we lose people. It is expected that over time, retention will fall. This is the reason that we always need to keep recruiting! But if we can slow customer attrition so that the fall between each period gets smaller, then we can spend more resources on serving client needs, and not just on trying to find new clients. This, in turn, will help retention, as clients will like that we are paying more attention to them.

Frequency of Visits

One element that I have neglected until now is the frequency of visits, which is the other side of customer retention analysis. It is generally accepted that a customer who remains loyal to your business is worth much more than a new customer, or even one whose engagement is patchy (i.e. returning customers).

Adding a count of visits per customer, per month, or average lag time will complete the picture of how customers are interacting with your business. You can also build cohorts with the number of customers, revenue from returning customers, etc.

Final Thoughts

Customer retention analysis will add depth to any business analysis, and allow decision makers to track not only the success of their recruitment strategy but also how well they are performing in terms of customer experience. If your clients don’t come back, then there’s something to be worked on, either in terms of the quality of your product or in terms of the relationship that you have with your customers. Retention analysis allows for easy flagging of this type of issue.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
data analytics ,sql ,retention ,tutorial ,big data ,metrics

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}