{{announcement.body}}
{{announcement.title}}

How to Analyze Time Series COVID-19 Data With SQL Window Functions

DZone 's Guide to

How to Analyze Time Series COVID-19 Data With SQL Window Functions

In this article, see a tutorial that explains how to analyze time series COVID-19 data with SQL window functions.

· Database Zone ·
Free Resource

The COVID-19 pandemic really changed the way we live. No matter where you are, you're being impacted by this unpredictable virus. We do not know what the future holds, but we do believe in science and in better days to come.

One thing that makes me as a data analyst particularly happy is that we have publicly available data. We can analyze that data and learn something from it. If you want to explore COVID-19 with the help of modern SQL constructions, keep reading — this article is intended for you.

Publicly Available COVID-19 Data

Today's news is filled with COVID-19 information. No matter which site you visit, the coronavirus topic is inevitable. This is not surprising; our situation is totally extraordinary and we all hope to return to normal someday. Until that "normal" appears, the coronavirus will stay a central topic of interest.

For data analysts, the good news is that coronavirus-related time series data is publicly available and updated daily. We can download the data sets and draw our own conclusions from analyzing them. There are a lot of different tools and technologies available to make such an analysis. In this article, I will show how you can analyze the data with the help of advanced SQL techniques. We are going to use data published by Johns Hopkins University on their GitHub page.

Before we start a short analysis, let's look at the data in the database.

Finding the Daily Number of Confirmed Cases

We'll use the confirmed_covid table in this article. It contains the total number of confirmed coronavirus cases per day for each country and province/state. The first date in the table is 22 January 2020. The data comes from Johns Hopkins University's official GitHub page, which is refreshed daily. The data has been transformed with R code and loaded into this MySQL table. The R code is available on my GitHub page.

Here is how the data looks for Croatia (one of the affected countries) during March 2020:

country province_state date confirmed_day
Croatia
2020-03-01 1
Croatia
2020-03-02 0
Croatia
2020-03-03 2
Croatia
2020-03-04 1
Croatia
2020-03-05 0
Croatia
2020-03-06 1
Croatia
2020-03-07 1
Croatia
2020-03-08 0
Croatia
2020-03-09 0
Croatia
2020-03-10 2
Croatia
2020-03-11 5
Croatia
2020-03-12 0
Croatia
2020-03-13 13
Croatia
2020-03-14 6
Croatia
2020-03-15 11
Croatia
2020-03-16 8
Croatia
2020-03-17 8
Croatia
2020-03-18 16
Croatia
2020-03-19 24
Croatia
2020-03-20 23
Croatia
2020-03-21 78
Croatia
2020-03-22 48
Croatia
2020-03-23 61
Croatia
2020-03-24 67
Croatia
2020-03-25 60
Croatia
2020-03-26 53
Croatia
2020-03-27 91
Croatia
2020-03-28 71
Croatia
2020-03-29 56
Croatia
2020-03-30 77

There are four columns — country, province_state, date, and confirmed_day. The confirmed_day column stores the number of patients that tested positive for COVID-19 on a specific day (the column date). For example, on 29 March, Croatia had 56 new confirmed cases. On 30 March, it had 77 new cases. Note that the province_state column is blank for Croatia, as Croatia is reporting on a country level. This is not the case for China, Canada, or similar countries; for these, the province_state value will be defined.

SQL Constructions for Analyzing COVID-19 Time Series Data


In this section, we are going to use several advanced modern SQL techniques to get some additional insights from the confirmed_covid table.

So, we have a number of newly confirmed cases on a daily basis. Now we'll use SQL to get insights like the ones you see on popular news portals, such as:

  • The total number of confirmed cases to date for each country and province.
  • A higher-level summary of confirmed cases.
  • The cumulative sum of confirmed cases.
  • The percent change (compared to the previous day) of confirmed cases.
  • Which countries had the highest number of confirmed cases in one particular day.

All answers will be given with SQL; We'll use window functions and common table expressions  – AKA modern SQL techniques. We'll show that SQL can be used in complex time series analysis. With the right techniques, some facts can be extracted easily, using only a few lines of code.

Total Number of Confirmed Cases on a Country and Province Level

If you are an advanced beginner in SQL (as most of us are), you are probably familiar with traditional SQL statements like SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. If we want to calculate the total number of confirmed cases on a country and province level, we can use an ordinary GROUP BY clause:

SQL
 







So, when using GROUP BY clause data are divided into groups. Rows with the same GROUP BY column (country and province_state in our example) are put into one group. For each group, the total sum is calculated and returned as a result. Rows are collapsed, that is, merged; for each group, one row is displayed as the result.

From our original table, one row will be returned for each country and province/state. Here is how the result is going to be displayed for China and Colombia:

After GROUP BY is executed, Colombia is presented as one row (provinces are not defined for Colombia). On the other hand, for China we have several rows displayed after the GROUP BY is executed; this is because each province_state value's total sum is displayed.

This simple report is produced with a simple GROUP BY statement. If you want to learn more about GROUP BY aggregations, I suggest that you try the interactive online course SQL basics on learnsql.com.

If we want to go one step further, we could also display a higher-level summary – the total confirmed cases on a country level and the total number of confirmed cases worldwide. Let's see how we can do that by using the ROLLUP clause.

Creating a Country-Level Summary of Confirmed Cases With ROLLUP

We can add a higher-level summary to this output . If you want to see all three totals (by province, country, and total worldwide) you could write these three statements:

SQL
 




xxxxxxxxxx
1


 
1
SELECT country, province_state, SUM(confirmed_day) AS total_confirmed
2
FROM confirmed_covid
3
GROUP BY country,province_state;



SQL
 




xxxxxxxxxx
1


 
1
SELECT country, province_state, SUM(confirmed_day) AS total_confirmed
2
FROM confirmed_covid
3
GROUP BY country;


SQL
 




xxxxxxxxxx
1


1
SELECT country, province_state, SUM(confirmed_day) AS total_confirmed
2
FROM confirmed_covid;



The first result would give you the total sum of confirmed cases on a country and province/state level, a second one on the country level, and a third one worldwide.

However, you can write simpler code by using the ROLLUP() clause:

SQL
 




xxxxxxxxxx
1


 
1
SELECT country, province_state, sum(confirmed_day) AS total_confirmed
2
FROM confirmed_covid
3
GROUP BY ROLLUP(country, province_state);



Using group by ROLLUP(country, province_state), SQL will generate all three GROUP BY clauses and the output will look like this:

As you can see in this output, there are some rows where province_state is marked NULL. There is also a row where both country and province_state are NULL. Those are higher-level summaries. All rows where province_state is marked NULL are rows that represent the total sum of confirmed cases per country. The row where both country and province_state are NULL is the total sum of confirmed cases worldwide.

ROLLUP() is neat because you are getting everything that you need in one output, using one simple line of code.

Calculating a Running Total With OVER and PARTITION BY

Currently in confirmed_covid, we have each day's total number of confirmed COVID19 cases for each country and province. In the last two examples, we saw how to get the total number of confirmed cases on the country and province/state levels. But what about running totals? In our next example, we'll write a SQL statement that will tell us how many total confirmed COVID-19 cases we had on a specific day on the country and province/state level.

Here's the code:

SQL
 




xxxxxxxxxx
1
13
9


 
1
SELECT
2
  *,
3
  SUM(confirmed_day) OVER(
4
    PARTITION BY country, province_state
5
    ORDER BY date
6
  ) AS running_total
7
FROM confirmed_covid;



A running total (or cumulative sum) is calculated in SQL using aggregate window functions. Like the aggregate function used with the GROUP BY clause that we saw earlier, a window function also performs calculations across a set of rows. However, this result is not presented as a single output row per group, i.e., rows are not collapsed in the result table. Instead, each row from the input table is returned and the resulting value is assigned to each row in the table.

A window function is denoted by the OVER clause in combination with the PARTITION BY and ORDER BY keywords. Here is the short description of what these keywords do:

  • PARTITION BY tells us how rows are grouped into logical chunks/groups. In our example, rows are grouped on the country and province_state level.
  • ORDER BY sets up the order of the rows. In our example, all rows are sorted by the date column.
  • ROW or RANGE is not explicitly defined in this example, but these keywords signify that each window frame inside one partition comprises all rows from the start of the partition to the current row (for each row, the total sum is calculated as the sum of the confirmed_day values from the first row in the partition until the current row). With this SQL statement (OVER in combination with PARTITION BY and ORDER BY), we set up a running total of confirmed cases on the country and province/state levels.

Let's use an example to explain this. Here is what we got after running the SQL statement for Croatia:


On 25 February, Croatia had its first COVID-19 patient. The next day, there were two more patients, which brought the total to three confirmed cases on that day (26 February). On 5 March, Croatia had 10 confirmed cases in total; that's from the beginning of the pandemic (even though Croatia didn't detect a new case of COVID-19 on that day). A new case arrived on 6 March, and on that day Croatia had 10 + 1 = 11 detected cases.

For each day, we now can detect how many confirmed cases we had in total and how that number changed through time. Keep in mind that we made this calculation for each country and province/state separately, which means that running totals are available for each country in the confirmed_covid table.

Calculating the Daily Percent Change in Confirmed Cases

Let's now calculate the percent change in new confirmed cases on a daily basis. For example, if today we have 10 new confirmed cases and yesterday we had five, our percent change is calculated like this:

(10–5)/5 * 100 = 100%

This means that we had a 100% increase in the number of daily confirmed cases.

To calculate this number, we need to assign to each day in our table a value for the cases detected the previous day. We can do this using the LAG() window function:

SQL
 




xxxxxxxxxx
1
13
9


 
1
SELECT
2
  *,
3
  LAG(confirmed_day) OVER(
4
    PARTITION BY country, province_state
5
    ORDER BY date)
6
  AS confirmed_previous_day
7
FROM confirmed_covid;


Again, rows are grouped on the country and province level and ordered by the date column. After running this, SQL displays all rows from the confirmed_covid table plus the additional info contained in confirmed_previous_day (i.e., the confirmed_day value from the previous row). Here are several rows for Croatia:

Each row now contains the number of confirmed new cases from the previous day. When we have this information, we can easily calculate the percent change using the formula:

(confirmed_day - confirmed_previous_day)/ confirmed_previous_day * 100

Below is the complete SQL statement. Note that we are using common table expressions because we want to divide this relatively long query into several smaller pieces:

SQL
 







Here is a short explanation of what's going on:

  • The WITH statement tells SQL that we are creating a temporary named result set. This is also called a common table expression or CTE. A temporary result set is the result of a query that is written inside brackets (). In our example, we are using nested CTEs, i.e., we are creating two temporary result sets named confirmed_lag (assigns the previous day's value to the current row) and confirmed_percent_change (uses the confirmed_lag result set and calculates the percent change for each day).
  • In the end, a second temporary result set is used. Then, the final SELECT statement displays the output:
  • SQL
     




    xxxxxxxxxx
    1
    17
    9


     
    1
    SELECT
    2
      *,
    3
      CASE
    4
        WHEN percent_change > 0 THEN 'increase'
    5
        WHEN percent_change = 0 THEN 'no change'
    6
        ELSE 'decrease'
    7
      END AS trend
    8
    FROM confirmed_percent_change
    9
    WHERE country = 'Croatia';
    4
        WHEN percent_change > 0 THEN 'increase'


The result after running this nested CTE looks like this:

country province_state date confirmed_day confirmed_previous_day percent_change trend
Croatia
2020-01-22 0 NULL 0 no change
...
... ... ... ... ...
Croatia
2020-03-22 48 78 -38 decrease
Croatia
2020-03-23 61 48 27 increase
Croatia
2020-03-24 67 61 10 increase
Croatia
2020-03-25 60 67 -10 decrease
Croatia
2020-03-26 53 60 -12 decrease
Croatia
2020-03-27 91 53 72 increase
Croatia
2020-03-28 71 91 -22 decrease
Croatia
2020-03-29 56 71 -21 decrease
Croatia
2020-03-30 77 56 38 increase

Common table expressions are really useful when you want to organize your code in a more readable way, as they allow long queries to be broken into smaller logical units.

Using RANK to Find the Highest Number of Confirmed Cases

The last thing that we are going to do is display the countries with the highest number of confirmed cases on a specific day from 20 March to 30 March.

The idea is to rank countries by the column confirmed_day. We want the country with the highest number of confirmed cases on a specific day to have rank 1, the second-highest to have rank 2, and so on)

Here is the query, followed by a short explanation:

SQL
 







  • We are using a WITH statement. This creates a temporary result set named highest_no_of_confirmed.
  • A SELECT statement creates a temporary result set named highest_no_of_confirmed using a RANK() window function. For each day, countries and provinces are ranked based on the confirmed_day column. This time, our rows are partitioned by the date column. Rows are ordered by the total number of new confirmed cases on a specific day. The country with the highest number of confirmed cases in a day is marked with 1, the second-highest with 2, and so on.
  • We are taking only the countries ranking 1 in the highest_no_of_confirmed result set. And we are looking only at ten days in March.

Here is the result:

country province_state date confirmed_day highest_confirmed
Italy
2020-03-20 5986 1
Italy
2020-03-21 6557 1
US
2020-03-22 7787 1
US
2020-03-23 10571 1
US
2020-03-24 9893 1
US
2020-03-25 12038 1
US
2020-03-26 18058 1
US
2020-03-27 17821 1
US
2020-03-28 19821 1
US
2020-03-29 19408 1
US
2020-03-30 20921 1

We can see that Italy and the USA were hit hard in March.

Summary

In this blog post, I wanted to show how easy it can be to analyze publicly available COVID-19 time series data with the help of SQL window functions, common table expressions, and GROUP BY extensions. My idea was to show you that SQL is more than just SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. There are modern SQL techniques that help us perform complex analyses.

You can continue to play with the data provided by Johns Hopkins. Other files are also available on my GitHub page

Topics:
analyze data, covid-19, database, sql, time series, tutorial, window functions

Published at DZone with permission of Marija Ilić . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}