
Time Series Data Analysis Tutorial With Pandas


We check out Google Trends data for the keywords "diet" and "gym," and look cursorily at "finance," to see how they vary over time.


In this tutorial, we will analyze Google Trends data for the keywords "diet," "gym," and "finance."

Importing Data

Download the data as a CSV file from https://trends.google.com/trends/explore?date=all&q=diet,gym,finance

 # import packages
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 %matplotlib inline

 # skip the first line of the file, which is not part of the table
 df = pd.read_csv('data/multiTimeline.csv', skiprows=1)

 # top 5 rows
 df.head()


      Month  diet: (Worldwide)  gym: (Worldwide)  finance: (Worldwide)
 0  2004-01                100                31                    48
 1  2004-02                 75                26                    49
 2  2004-03                 67                24                    47
 3  2004-04                 70                22                    48
 4  2004-05                 72                22                    43
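To see what skiprows=1 does without downloading anything, here is a minimal sketch that reads an in-memory stand-in for the file. The preamble line and the three data rows are made up for illustration; the real export may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: one leading line that is
# not part of the table, followed by the CSV header and a few rows.
raw = """Some preamble line that is not part of the table
Month,diet: (Worldwide),gym: (Worldwide),finance: (Worldwide)
2004-01,100,31,48
2004-02,75,26,49
2004-03,67,24,47
"""

# skiprows=1 drops the first line, so the "Month,..." line becomes the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=1)

print(sample.columns.tolist())
print(sample.shape)
```

Without skiprows=1, pandas would try to use the preamble line as the header, and the column names would come out wrong.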


 # change column names
 df.columns = ['month', 'diet', 'gym', 'finance']
 df.head()

      month  diet  gym  finance
 0  2004-01   100   31       48
 1  2004-02    75   26       49
 2  2004-03    67   24       47
 3  2004-04    70   22       48
 4  2004-05    72   22       43


Next, we will change the "month" column into a DateTime data type and make it the index of the DataFrame.

Use pd.to_datetime() to convert the "month" column into a DateTime.

 df.month = pd.to_datetime(df.month) 

 # change to index 
 df.set_index('month',inplace=True) 

 # results 
 df.head() 

             diet  gym  finance
 month
 2004-01-01   100   31       48
 2004-02-01    75   26       49
 2004-03-01    67   24       47
 2004-04-01    70   22       48
 2004-05-01    72   22       43
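The conversion can be tried on a tiny frame first. This is a sketch on made-up rows; the explicit format argument is an addition that is not in the code above, used here so the parsing does not depend on format inference.

```python
import pandas as pd

# Made-up rows in the same 'YYYY-MM' shape as the "month" column above.
demo = pd.DataFrame({'month': ['2004-01', '2004-02', '2004-03'],
                     'diet': [100, 75, 67]})

# Each 'YYYY-MM' string becomes a timestamp at the first day of that month.
demo['month'] = pd.to_datetime(demo['month'], format='%Y-%m')
demo = demo.set_index('month')

print(demo.index.dtype)
# With a DatetimeIndex, rows can be selected by a partial date string.
print(demo.loc['2004'])
```

Having the dates as the index is what lets the later plots put time on the x-axis automatically.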

Now we will plot data as 3 line plots on a single figure (one for each column, namely, "diet," "gym," and "finance").

 df.plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: line plots of "diet," "gym," and "finance" over time]

There are many ways of identifying trends in time series. One popular way is by taking a rolling average, which means that, for each time point, we take the average of a window of neighboring points. The number of points is specified by a window size, which you need to choose. Note that pandas' rolling() uses a trailing window by default (the current point and the ones before it); pass center=True to average over the points on either side instead.

Here we check out the rolling average of "diet" using the built-in pandas methods. When it comes to determining the window size, it makes sense to first try a window of twelve months, since we are dealing with yearly seasonality.

 diet = df[['diet']] 
 diet.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=30); 

[Figure: 12-month rolling average of "diet"]
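To make the window mechanics concrete, here is a minimal sketch on a made-up six-point series with a window of 3 (chosen only to keep the output small). It compares the default trailing window with a centered one.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], dtype=float)

# Default: each value is the mean of the current point and the 2 before it,
# so the first two entries are NaN (not enough history yet).
trailing = s.rolling(3).mean()

# center=True: each value is the mean of the point and one neighbor on
# either side, so the first and last entries are NaN.
centered = s.rolling(3, center=True).mean()

print(pd.DataFrame({'value': s, 'trailing': trailing, 'centered': centered}))
```

The same logic explains why a 12-month trailing average has no values for the first 11 months of the data.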

In the same way, we can plot the rolling average of "gym" with the same window size as for the "diet" data.

 gym = df[['gym']] 
 gym.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: 12-month rolling average of "gym"]


Next, we plot the trends of "gym" and "diet" on a single figure.

 df_dg = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1) 
 df_dg.plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: 12-month rolling averages of "diet" and "gym"]

Here, df_dg is a DataFrame with two columns: the rolling averages of "diet" and "gym." We built it with the pd.concat() function, which takes a list of the objects to combine as its first argument; since we want to concatenate them as columns, we also set the axis argument to 1.

The plot above shows "gym" trending upward relative to "diet" as the years go on. We can also see that "diet" potentially has some form of seasonality, whereas "gym" is mostly just increasing.
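The effect of the axis argument is easiest to see on two small frames that share an index. The values below are illustrative only.

```python
import pandas as pd

idx = pd.date_range('2004-01-01', periods=4, freq='MS')  # month starts
a = pd.DataFrame({'diet': [100, 75, 67, 70]}, index=idx)
b = pd.DataFrame({'gym': [31, 26, 24, 22]}, index=idx)

# axis=1 aligns rows on the index and places the frames next to each other.
side_by_side = pd.concat([a, b], axis=1)

# axis=0 (the default) appends rows instead; missing cells become NaN.
stacked = pd.concat([a, b], axis=0)

print(side_by_side.shape)
print(stacked.shape)
```

For plotting several series against the same dates, the axis=1 form is the one that gives one column per series.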

Seasonal Patterns in Time Series Data

One way to study the seasonal component of a time series is to remove the trend, so that the seasonality is easier to investigate. To remove the trend, you can subtract the trend you computed above (the rolling mean) from the original signal. The result, however, depends on how many data points you averaged over.
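As a rough sketch of this subtraction, the example below builds a synthetic series (a made-up linear trend plus a 12-month sine cycle, not the Google Trends data) and removes a trailing 12-month rolling mean from it.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
idx = pd.date_range('2004-01-01', periods=48, freq='MS')

# Synthetic signal: linear trend plus a yearly cycle (both invented here).
seasonal = pd.Series(10 * np.sin(2 * np.pi * t / 12), index=idx)
signal = 0.5 * t + seasonal

# Trailing 12-month mean as the trend estimate, as in the plots above.
trend = signal.rolling(12).mean()
detrended = (signal - trend).dropna()

# What is left is the seasonal cycle shifted by a constant, because the
# trailing mean lags behind a rising trend.
offset = detrended - seasonal.loc[detrended.index]
print(offset.round(6).unique())
```

On this synthetic signal the cycle comes back cleanly because a 12-point window spans exactly one period; a different window size would leave part of the cycle in the trend estimate.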

Another way to remove the trend is called "differencing," where we look at the difference between successive data points (called "first-order differencing" because we are only looking at the difference between one data point and the one before it).

First-Order Differencing

You can use the pandas diff() and plot() methods to compute and plot the first-order difference of the "diet" series.

 diet.diff().plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: first-order difference of "diet"]

In the above plot, we have removed much of the trend and can really see the peaks in January every year. Each January, there is a huge spike of 20 points or more on the 0 to 100 scale, that is, 20 percent or more of the peak search interest.
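What diff() computes can be checked by hand on a few numbers. This sketch uses a short made-up series, plus a pure linear trend to show why differencing removes it.

```python
import pandas as pd

s = pd.Series([100, 75, 67, 70, 72])

# diff() returns s[t] - s[t-1]; the first entry has no predecessor, so it is NaN.
first_diff = s.diff()

# The same computation written out with shift().
by_hand = s - s.shift(1)

# A pure linear trend turns into a constant under first-order differencing.
line = pd.Series([0, 2, 4, 6, 8])
line_diff = line.diff()

print(first_diff.tolist())
print(line_diff.tolist())
```

Anything that changes by the same amount each step is flattened to a constant, while the month-to-month jumps remain visible.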

Periodicity and Autocorrelation

A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.

Autocorrelation is the correlation of a time series with a lagged copy of itself, and it can indicate a trend or periodicity. For example, with a lag of one period, we can check whether the previous value is related to the current value. For that to be the case, the autocorrelation value has to be fairly high.

We will again plot all the time series to remind ourselves of what they look like.
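A quick way to get a feel for this is pandas' Series.autocorr() on a synthetic series with a known period. The sine wave below is an assumption made for illustration, not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Synthetic series that repeats exactly every 12 steps.
t = np.arange(48)
s = pd.Series(np.sin(2 * np.pi * t / 12))

# At a lag of one full period the series lines up with itself.
lag_12 = s.autocorr(lag=12)

# At half a period the series is its own mirror image.
lag_6 = s.autocorr(lag=6)

print(lag_12, lag_6)
```

On real monthly data, a clearly positive autocorrelation at lag 12 would be one hint of yearly periodicity.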

 df.plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: line plots of "diet," "gym," and "finance" over time]

After that, we compute the correlation coefficients of all of these time series with the help of .corr():

 df.corr()

 

               diet       gym   finance
 diet      1.000000 -0.100764 -0.034639
 gym      -0.100764  1.000000 -0.284279
 finance  -0.034639 -0.284279  1.000000

What this tells us:

If we focus on "diet" and "gym," the coefficient says they are slightly negatively correlated (-0.10). That's very interesting! Remember that each series has a seasonal and a trend component. From looking at the time series, it looks as though their seasonal components would be positively correlated and their trends negatively correlated, and the single coefficient mixes the two effects.

Now, we plot the first-order differences of these time series and then compute their correlation, because that will be, approximately, the correlation of the seasonal components. Remember that removing the trend may reveal correlation in seasonality.

Start off by plotting the first-order differences with the help of .diff() and .plot().

 df.diff().plot(figsize=(20,10), linewidth=5, fontsize=20) 
 plt.xlabel('Year', fontsize=20); 

[Figure: first-order differences of "diet," "gym," and "finance"]

We see that "diet" and "gym" move closely together once the trend is removed. Now, we compute the correlation coefficients of the first-order differences of these time series.

 df.diff().corr()

 



               diet       gym   finance
 diet      1.000000  0.758707  0.373828
 gym       0.758707  1.000000  0.301111
 finance   0.373828  0.301111  1.000000

Now we can see that the seasonal components of "diet" and "gym" are highly correlated, with a coefficient of 0.76.
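The sign flip between the raw and the differenced correlation can be reproduced on synthetic data. In this sketch, the two made-up series share the same yearly cycle but have opposite linear trends; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

t = np.arange(120)
cycle = 5 * np.sin(2 * np.pi * t / 12)   # shared seasonal component

toy = pd.DataFrame({'falling': -0.5 * t + cycle,   # downward trend
                    'rising': 0.5 * t + cycle})    # upward trend

# On the raw series the opposite trends dominate the coefficient.
raw_corr = toy.corr().loc['falling', 'rising']

# Differencing turns each trend into a constant, leaving the shared cycle.
diff_corr = toy.diff().corr().loc['falling', 'rising']

print(raw_corr, diff_corr)
```

The raw coefficient comes out negative and the differenced one close to 1, which mirrors the pattern seen for "diet" and "gym" in a much cleaner setting.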

Conclusion

In this tutorial, we have covered a lot of ground! We checked out Google Trends data for the keywords "diet" and "gym," and looked cursorily at "finance," to see how they vary over time. We covered concepts such as seasonality, trends, and correlation.
