How to Analyze Time Series COVID-19 Data With SQL Window Functions
How to Analyze Time Series COVID-19 Data With SQL Window Functions
In this article, see a tutorial that explains how to analyze time series COVID-19 data with SQL window functions.
Join the DZone community and get the full member experience.Join For Free
The COVID-19 pandemic really changed the way we live. No matter where you are, you're being impacted by this unpredictable virus. We do not know what the future holds, but we do believe in science and in better days to come.
One thing that makes me as a data analyst particularly happy is that we have publicly available data. We can analyze that data and learn something from it. If you want to explore COVID-19 with the help of modern SQL constructions, keep reading — this article is intended for you.
Publicly Available COVID-19 Data
Today's news is filled with COVID-19 information. No matter which site you visit, the coronavirus topic is inevitable. This is not surprising; our situation is totally extraordinary and we all hope to return to normal someday. Until that "normal" appears, the coronavirus will stay a central topic of interest.
For data analysts, the good news is that coronavirus-related time series data is publicly available and updated daily. We can download the data sets and draw our own conclusions from analyzing them. There are a lot of different tools and technologies available to make such an analysis. In this article, I will show how you can analyze the data with the help of advanced SQL techniques. We are going to use data published by Johns Hopkins University on their GitHub page.
Before we start a short analysis, let's look at the data in the database.
Finding the Daily Number of Confirmed Cases
We'll use the confirmed_covid table in this article. It contains the total number of confirmed coronavirus cases per day for each country and province/state. The first date in the table is 22 January 2020. The data comes from Johns Hopkins University's official GitHub page, which is refreshed daily. The data has been transformed with R code and loaded into this MySQL table. The R code is available on my GitHub page.
Here is how the data looks for Croatia (one of the affected countries) during March 2020:
There are four columns —
confirmed_day column stores the number of patients that tested positive for COVID-19 on a specific day (the column date). For example, on 29 March, Croatia had 56 new confirmed cases. On 30 March, it had 77 new cases. Note that the
province_state column is blank for Croatia, as Croatia is reporting on a country level. This is not the case for China, Canada, or similar countries; for these, the
province_state value will be defined.
SQL Constructions for Analyzing COVID-19 Time Series Data
In this section, we are going to use several advanced modern SQL techniques to get some additional insights from the
So, we have a number of newly confirmed cases on a daily basis. Now we'll use SQL to get insights like the ones you see on popular news portals, such as:
- The total number of confirmed cases to date for each country and province.
- A higher-level summary of confirmed cases.
- The cumulative sum of confirmed cases.
- The percent change (compared to the previous day) of confirmed cases.
- Which countries had the highest number of confirmed cases in one particular day.
All answers will be given with SQL; We'll use window functions and common table expressions – AKA modern SQL techniques. We'll show that SQL can be used in complex time series analysis. With the right techniques, some facts can be extracted easily, using only a few lines of code.
Total Number of Confirmed Cases on a Country and Province Level
If you are an advanced beginner in SQL (as most of us are), you are probably familiar with traditional SQL statements like
ORDER BY. If we want to calculate the total number of confirmed cases on a country and province level, we can use an ordinary
GROUP BY clause:
So, when using
GROUP BY clause data are divided into groups. Rows with the same
GROUP BY column (
province_state in our example) are put into one group. For each group, the total sum is calculated and returned as a result. Rows are collapsed, that is, merged; for each group, one row is displayed as the result.
From our original table, one row will be returned for each country and province/state. Here is how the result is going to be displayed for China and Colombia:
GROUP BY is executed, Colombia is presented as one row (provinces are not defined for Colombia). On the other hand, for China we have several rows displayed after the
GROUP BY is executed; this is because each
province_state value's total sum is displayed.
This simple report is produced with a simple
GROUP BY statement. If you want to learn more about
GROUP BY aggregations, I suggest that you try the interactive online course SQL basics on learnsql.com.
If we want to go one step further, we could also display a higher-level summary – the total confirmed cases on a country level and the total number of confirmed cases worldwide. Let's see how we can do that by using the
Creating a Country-Level Summary of Confirmed Cases With
We can add a higher-level summary to this output . If you want to see all three totals (by province, country, and total worldwide) you could write these three statements:
The first result would give you the total sum of confirmed cases on a country and province/state level, a second one on the country level, and a third one worldwide.
However, you can write simpler code by using the
Using group by
ROLLUP(country, province_state), SQL will generate all three
GROUP BY clauses and the output will look like this:
As you can see in this output, there are some rows where
province_state is marked
NULL. There is also a row where both country and
NULL. Those are higher-level summaries. All rows where
province_state is marked
NULL are rows that represent the total sum of confirmed cases per country. The row where both country and
NULL is the total sum of confirmed cases worldwide.
ROLLUP() is neat because you are getting everything that you need in one output, using one simple line of code.
Calculating a Running Total With
confirmed_covid, we have each day's total number of confirmed COVID19 cases for each country and province. In the last two examples, we saw how to get the total number of confirmed cases on the country and province/state levels. But what about running totals? In our next example, we'll write a SQL statement that will tell us how many total confirmed COVID-19 cases we had on a specific day on the country and province/state level.
Here's the code:
A running total (or cumulative sum) is calculated in SQL using aggregate window functions. Like the aggregate function used with the GROUP BY clause that we saw earlier, a window function also performs calculations across a set of rows. However, this result is not presented as a single output row per group, i.e., rows are not collapsed in the result table. Instead, each row from the input table is returned and the resulting value is assigned to each row in the table.
A window function is denoted by the
OVER clause in combination with the
PARTITION BY and
ORDER BY keywords. Here is the short description of what these keywords do:
PARTITION BYtells us how rows are grouped into logical chunks/groups. In our example, rows are grouped on the country and
ORDER BYsets up the order of the rows. In our example, all rows are sorted by the date column.
RANGEis not explicitly defined in this example, but these keywords signify that each window frame inside one partition comprises all rows from the start of the partition to the current row (for each row, the total sum is calculated as the sum of the
confirmed_dayvalues from the first row in the partition until the current row). With this SQL statement (
OVERin combination with
ORDER BY), we set up a running total of confirmed cases on the country and province/state levels.
Let's use an example to explain this. Here is what we got after running the SQL statement for Croatia:
On 25 February, Croatia had its first COVID-19 patient. The next day, there were two more patients, which brought the total to three confirmed cases on that day (26 February). On 5 March, Croatia had 10 confirmed cases in total; that's from the beginning of the pandemic (even though Croatia didn't detect a new case of COVID-19 on that day). A new case arrived on 6 March, and on that day Croatia had 10 + 1 = 11 detected cases.
For each day, we now can detect how many confirmed cases we had in total and how that number changed through time. Keep in mind that we made this calculation for each country and province/state separately, which means that running totals are available for each country in the confirmed_covid table.
Calculating the Daily Percent Change in Confirmed Cases
Let's now calculate the percent change in new confirmed cases on a daily basis. For example, if today we have 10 new confirmed cases and yesterday we had five, our percent change is calculated like this:
(10–5)/5 * 100 = 100%
This means that we had a 100% increase in the number of daily confirmed cases.
To calculate this number, we need to assign to each day in our table a value for the cases detected the previous day. We can do this using the
LAG() window function:
Again, rows are grouped on the country and province level and ordered by the date column. After running this, SQL displays all rows from the
confirmed_covid table plus the additional info contained in
confirmed_previous_day (i.e., the
confirmed_day value from the previous row). Here are several rows for Croatia:
Each row now contains the number of confirmed new cases from the previous day. When we have this information, we can easily calculate the percent change using the formula:
(confirmed_day - confirmed_previous_day)/ confirmed_previous_day * 100
Below is the complete SQL statement. Note that we are using common table expressions because we want to divide this relatively long query into several smaller pieces:
Here is a short explanation of what's going on:
WITHstatement tells SQL that we are creating a temporary named result set. This is also called a common table expression or CTE. A temporary result set is the result of a query that is written inside brackets
(). In our example, we are using nested CTEs, i.e., we are creating two temporary result sets named
confirmed_lag(assigns the previous day's value to the current row) and
confirmed_lagresult set and calculates the percent change for each day).
- In the end, a second temporary result set is used. Then, the final SELECT statement displays the output:
The result after running this nested CTE looks like this:
Common table expressions are really useful when you want to organize your code in a more readable way, as they allow long queries to be broken into smaller logical units.
Using RANK to Find the Highest Number of Confirmed Cases
The last thing that we are going to do is display the countries with the highest number of confirmed cases on a specific day from 20 March to 30 March.
The idea is to rank countries by the column confirmed_day. We want the country with the highest number of confirmed cases on a specific day to have rank 1, the second-highest to have rank 2, and so on)
Here is the query, followed by a short explanation:
- We are using a
WITHstatement. This creates a temporary result set named
SELECTstatement creates a temporary result set named
RANK()window function. For each day, countries and provinces are ranked based on the
confirmed_daycolumn. This time, our rows are partitioned by the date column. Rows are ordered by the total number of new confirmed cases on a specific day. The country with the highest number of confirmed cases in a day is marked with 1, the second-highest with 2, and so on.
- We are taking only the countries ranking 1 in the
highest_no_of_confirmedresult set. And we are looking only at ten days in March.
Here is the result:
We can see that Italy and the USA were hit hard in March.
In this blog post, I wanted to show how easy it can be to analyze publicly available COVID-19 time series data with the help of SQL window functions, common table expressions, and
GROUP BY extensions. My idea was to show you that SQL is more than just
ORDER BY. There are modern SQL techniques that help us perform complex analyses.
You can continue to play with the data provided by Johns Hopkins. Other files are also available on my GitHub page.
Published at DZone with permission of Marija Ilić . See the original article here.
Opinions expressed by DZone contributors are their own.