Data Chef ETL Battles: Energy Consumption Time Series
Data Chef ETL Battles: Energy Consumption Time Series
Both of our data chefs wrangled incredibly interesting features from the raw data ingredients — but only one reigned champion!
Join the DZone community and get the full member experience.Join For Free
Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Read how Alegion's Chief Data Scientist discusses the source of most headlines about AI failures here.
It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.
Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!
Ingredient Theme: Energy Consumption Time Series
Let’s talk today about electricity and its consumption. One of the hardest problems in the energy industry is matching supply and demand. On one hand, the over-production of energy can be a waste of resources; on the other hand, underproduction can leave people without the basic commodities of modern life. The prediction of the electrical energy demand at each point in time is, therefore, a very important chapter in data analytics.
For this reason, a couple of years ago, energy companies started to monitor the electricity consumption of each household, store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation (CER).
The Smart Metering Electricity Customer Behaviour Trials (CBTs) took place during 2009 and 2010 with over 5,000 Irish homes and businesses participating. The purpose of the trials was to assess the impact on consumers’ electricity consumption in order to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy business customers who participated in the trials, had an electricity smart meter installed in their homes or on their premises and agreed to take part in research to help establish how smart metering can help shape energy usage behaviors across a variety of demographics, lifestyles, and home sizes. The trials produced positive results. The reports are available from CER (Commission for Energy Regulation) along with further information on the Smart Metering Project. In order to get a copy of the data set, fill out this request form and email it to .
The data set is just a very long time series. One column covers the smart meter ID, one column the time, and one column the amount of electricity used in the previous 30 minutes. The time is expressed in the number of minutes from 01.01.2009 : 00.00 and has to be transformed back to one of the classic date/time formats, like for example dd.MM.yyyy : HH.mm. The original sampling rate, at which the used energy is measured, is every 30 minutes.
The first data transformations, common to all data chefs, involve the date/time conversion and the extraction of the year, month, the day of the month, the day of the week, hour, and minute from the raw data.
Topic: Energy consumption time series.
Challenge: From time series to behavioral measures and seasonality.
Methods: Aggregations at multiple levels, correlation.
Data manipulation nodes: GroupBy, Pivoting, Linear Correlation, Lag Column.
What can we do, in general, with a time series? Often, the final goal is the prediction of future values based on current and past values. But just how much past? Also, a time series can follow very different shapes. Has the shape any meaning? Can we summarize the time series evolution, by describing the electricity related habits of the household? Is there any seasonality that we can take into account? Is it possible to predict future values for groups of similar time series? In this case, how do we measure similarity across time series? Well, let’s start this challenge and let’s see what our data chefs have prepared for today’s data!
Data Chef Haruto: Behavior Measures Over Time
Haruto has decided to remain in the time space and to analyze the electrical behavior of the energy consumers, as measured by their smart meters. In particular, he explored the energy consumption on weekends and business days, on each day of the week, on each hour of the day, and for different time frames during the day.
In order to do that, he calculated first the average energy consumption by week day, by the hour, by day times, and by weekends vs. business days. Average values already show who is using the largest amount of energy. Then he transformed such average values into percentages, to understand when each entity uses how much energy.
In Figure 1, the workflow upper branch — embedded in the “Usage Measures” square — is from Data Chef Haruto.
Figure 1: Final Workflow 03_ETL_Energy_autocorr_stats. The upper part named "Usage Measures" describes the entity’s energy consumption behavior. The lower part labeled "Auto-correlation Matrix" calculates the autocorrelation matrix of the energy consumption time series for a selected meter ID.
The first two metanodes, named “Daily Values” and “Hourly Values,” calculate (Figure 2):
- The average daily/hourly energy usage by meter ID (GroupBy node).
- The average energy usage by meter ID vs. day of week/hour of day (Pivoting node).
- The average energy usage during weekends vs. business days/day time frames (Rule Engine + Pivoting node).
After that, a series of Math Formula nodes in the metanodes named “Intra-day segments (%)” and “Week Day (%)” put the average values into context, by reporting them as the percentage of energy used during intra-day segments and during week days.
Figure 2: Content of the “Daily Values” metanode to calculate the daily energy consumption by meter ID in average, in average per day of the week, in average over weekends and business days.
Data Chef Momoka: Auto-Correlation Matrix
Momoka decided to look for seasonality patterns and for that to check each time series autocorrelation matrix.
In Figure 1, the workflow lower branch — embedded in the “Auto-correlation Matrix” square — is the result of Data Chef Momoka’s work.
First, the data is shaped as a pivoting table with average energy consumption of meter ID vs. date and hour. The metanode “Pivoting”, indeed, produces the energy consumption time series for all meter IDs, sampled every hour, and sorted by time. The subsequent metanode, named “Select Meter ID,” allows selecting one-time series only through its meter ID value.
In order to calculate the autocorrelation matrix, we need:
- normalized values for a meaningful comparison of the correlation indexes.
- past values to calculate the correlation of the current sample with its past N samples.
In metanode “Normalize & Lag,” time series values are then being normalized into [0,1] and N past samples are introduced. Normalization is achieved with a Normalizer node, while the N past samples are produced by a Lag Column node. The Lag Column node makes N copies of the selected column and shifts its values of 1, 2, …, N steps forward. If the column values were sorted by time, this means that we would attach the N past samples of the time series to the current one.
The autocorrelation matrix of the current samples with their past N samples is then calculated using a Linear Correlation node. The correlation matrix will show a few highly correlated columns, like for example x(t) and x(t-2). In particular, if the autocorrelation function shows local maxima at recurrent steps in the past, like at x(t) and x(t- i*24) with i= 1,2, .., this might be a sign of a seasonality pattern.
The “Find Seasonality” metanode searches for such local maxima in the correlation functions. It detects the smallest seasonality period as the position of the first local maximum of the correlation function’s first derivative (Figure 3).
Figure 3. Content of “Find Seasonality” metanode, which finds the local maxima in the auto-correlation function through its first derivative values.
Let’s now see the final results on a specific smart meter. The jury randomly chose meter ID 1038.
According to the behavioral features provided by data chef Haruto, the entity connected to meter ID 1038 uses 232 kW/day on average, more or less the same amount every day of the week, with not much difference between weekends and business days. Moving to the hour scale, meter ID 1038 uses ~10 kW/hour on average, most of it during the day and roughly equally distributed over morning and afternoon.
Indeed, the line plot provided by data chef Momoka for the energy usage time series of meter ID 1038 (Figure 4) shows a cyclical trend day vs. night, where the kWs used during the day are definitely dominant. The plot also shows no difference of electricity usage across week days.
Figure 4. Line Plot of energy consumption time-series for meter ID 1038. Notice the day/night rhythm.
This cyclic trend justifies the auto-correlation based findings of data chef Momoka. The signal autocorrelation map (Fig. 5) shows a 24-hour cycle, with local maxima in the auto-correlation function at x(t) and x(t-24), x(t) and x(t-48), x(t), and x(t-72) and so on. The smallest seasonality period was calculated to be 24 hours.
Note: The stronger the cyclic behavior of the autocorrelation matrix, the more meaningful the seasonality pattern. In Figure 5, the time-series seasonality is clearly visible through the cyclic trend of its auto-correlation matrix.
Figure 5: Autocorrelation matrix of energy consumption time-series for meter ID 1038. You can see the cyclic trend of the auto-correlation matrix and the auto-correlation local maxima at -24, -48, and so on.
Note: The example workflow on the EXAMPLES server works only on a subset of the original dataset. This is because the original dataset must be obtained by filling the request form and emailing it to . Therefore, the auto-correlation map and in general all other numbers shown in this post will be different when produced by the example workflow on the reduced dataset!
We have reached the end of this competition. Congratulations to both our data chefs for wrangling such interesting features from the raw data ingredients! Oishii!
Published at DZone with permission of Rosaria Silipo , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.