Time Series Data Analysis Tutorial With Pandas
Time Series Data Analysis Tutorial With Pandas
Check out Google trends data of keywords "diet" and "gym" and looked cursorily at "finance" to see how they vary over time.
Join the DZone community and get the full member experience.
Join For FreeBias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Read how Alegion's Chief Data Scientist discusses the source of most headlines about AI failures here.
In this tutorial, we will analyze Google trends data for keywords diet, gym, and finance.
Importing Data
Download it from https://trends.google.com/trends/explore?date=all&q=diet,gym,finance
#import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('data/multiTimeline.csv',skiprows=1)
# top 5 rows
df.head()
Month 
diet: (Worldwide) 
gym: (Worldwide) 
finance: (Worldwide) 

0  200401 
100 
31  48 
1  200402 
75  26  49 
2  200403 
67  24  47 
3  200404 
70  22  48 
4  200405 
72  22  43 
#change column name
df.columns = ['month', 'diet', 'gym', 'finance']
df.head()
month 
diet 
gym 
finance 

0  200401 
100  31  48 
1  200402 
75  26  49 
2  200403 
67  24  47 
3  200404 
70  22  48 
4  200405 
72  22  43 
Next, we will change the "month" column into a DateTime data type and make it the index of the DataFrame.
Use pandas .to_datetime() to convert the "month" column in to a DateTime.
df.month = pd.to_datetime(df.month)
# change to index
df.set_index('month',inplace=True)
# results
df.head()
diet 
gym 
finance 

month 

20040101 
100  31  48 
20040201 
75  26  49 
20040301 
67  24  47 
20040401 
70  22  48 
20040501 
72  22  43 
Now we will plot data as 3 line plots on a single figure (one for each column, namely, "diet," "gym," and "finance").
df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
Identifying Trends in Time Series Data
There are many ways of identifying trends in time series. One popular way is by taking a rolling average, which means for each time point, we take the average of the points on either side of it. Note that the number of points is specified by a window size, which you need to choose.
Here we check out this rolling average of "diet" using the builtin pandas methods. When it comes to determining the window size, it makes sense to first try out one of twelve months, as you're talking about yearly seasonality.
diet = df[['diet']]
diet.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=30);
Same way we can also plot the rolling average of "gym" with the same window size as the "diet" data.
gym = df[['gym']]
gym.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
Compare Gym and Diet Trends
Plotting the trends of "gym" and "diet" on a single figure.
df_dg = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)
df_dg.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
df_dg that has two columns with the rolling average of "diet" and "gym." We used the pd.concat() function, which takes a list of the columns as a first argument, and since we want to concatenate them as columns, we also added the axis argument, which is set to 1.
The above plot shows an increase in gym trend compared to diet as years go on, and we can see that diet potentially has some form of seasonality, where gym is increasing.
Seasonal Patterns in Time Series Data
One way to think about the seasonal components to the time series of data is to remove the trend from a time series so that you can more easily investigate seasonality. To remove the trend, you can subtract the trend you computed above (rolling mean) from the original signal. This, however, will be dependent on how many data points you averaged over.
Another way to remove the trend is called "differencing," where we look at the difference between successive data points (called "firstorder differencing" because we only looking at the difference between one data point and the one before it).
FirstOrder Differencing
You can use pandas and the diff() and plot() methods to compute and plot the first order difference of the "diet" series.
diet.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
In the above plot, we have removed much of the trend and can really see the peaks in January every year. Each January, there is a huge spike of 20 or more percent of the highest search item.
Periodicity and Autocorrelation
A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Autocorrelation is correlation within a dataset and can indicate a trend. For example, if we have a lag of one period, we can check if the previous value influences the current value. For that to be true, the autocorrelation value has to be pretty high.
We will again plot all your time series to remind yourself of what they look like.
df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
After that, compute the correlation coefficients of all of these time series with the help of .corr()
df.corr()
diet 
gym 
finance 

diet 
1.000000 
0.100764 
0.034639 
gym 
0.100764 
1.000000 
0.284279 
finance 
0.034639 
0.284279 
1.000000 
What this tells us:
If we focus on "diet" and "gym," they are negatively correlated. That's very interesting! Remember that we have a seasonal and a trend component. From the correlation coefficient, "diet" and "gym" are negatively correlated. However, from looking at the times series, it looks as though their seasonal components would be positively correlated and their trends negatively correlated.
Now, we plot the firstorder differences of these time series and then compute the correlation of those because that will be the correlation of the seasonal components, approximately. Remember that removing the trend may reveal correlation in seasonality.
Start off by plotting the firstorder differences with the help of .diff() and .plot()
df.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
We see that "diet" and "gym" are incredibly correlated once you remove the trend. Now, you'll compute the correlation coefficients of the firstorder differences of these time series.
df.diff().corr()
diet 
gym 
finance 

diet 
1.000000 
0.758707 
0.373828 
gym 
0.758707 
1.000000 
0.301111 
finance 
0.373828 
0.301111 
1.000000 
Now, we can see that with the seasonal component, "diet" and "gym" are highly correlated, with a coefficient of 0.76.
Conclusion
In this part we have covered a lot of ground! You checked out Google trends data of keywords "diet" and "gym" and looked cursorily at "finance" to see how they vary over time. We covered concepts such as seasonality, trends, and correlation.
Your machine learning project needs enormous amounts of training data to get to a productionready confidence level. Get a checklist approach to assembling the combination of technology, workforce and project management skills you’ll need to prepare your own training data.
Opinions expressed by DZone contributors are their own.
{{ parent.title  parent.header.title}}
{{ parent.tldr }}
{{ parent.linkDescription }}
{{ parent.urlSource.name }}