Data Wrangling and Visualization on a Future-Proof Platform

Data wrangling is difficult for a single person with a single tool, and advanced visualization is just as difficult without the right tools. Read this article to see why Hortonworks Hadoop + Trifacta + Tableau make an excellent solution.

It sounds like the Wild West, and when it comes to data, sometimes it looks like it too. The concept of “wrangling” brings to mind a lone cowboy on horseback, rounding up a herd of cattle. Back in the day, a really good wrangler could take charge of livestock, guiding them in the right direction and organizing them along the way.

We can make the same analogy for data wrangling. Rather than a lone cowboy, visualize a frantic business analyst, faced with disparate data sources from multiple organizations, and a need to organize, index, and query against a variety of elements. Cows may be fickle and need encouragement, but at times the work we do in Microsoft Excel seems to take as much effort as cajoling them or cracking a whip!

You don’t have to look far before you find a compelling need for three things:

  • A future-proof platform to collect and store all data elements from multiple sources;
  • A data wrangling tool to wrestle and control all these data sources; and
  • An ironclad visualization tool to intelligently display the resulting data.

These requirements exist in all industries, but they are especially pronounced in the Consumer Packaged Goods (CPG) industry.

It’s a Data Problem

PepsiCo is a leader in this industry and, like other CPG companies, has distinct challenges associated with managing supply chain and product demand. CPG organizations rely on special relationships with retailers to predict and manage that demand. This collaboration provides unique insight into the forecast and replenishment of standard goods. It can make a difference in everything from planning “buy one get one” promotions to minimizing the risk of retailers having empty shelves when consumers arrive at the store to purchase the promoted item.

This process is called Collaborative Planning, Forecasting, and Replenishment (CPFR), and it requires data from all participants. CPG data covering UPC details, shelf life, and size supports shipping algorithms and determines how much space is required on the trucks delivering product. Data from retailers contains store codes, quantity on hand, and Point of Sale (POS) data. Additional data such as weather, events, and driver scores can be added as well to optimize delivery routes and manage issues in the supply chain.

The CPFR process is data-intensive. To be successful, it requires a future-proof data platform that truly supports all data. Whether it is structured data created by an application or data warehouse, or unstructured data collected from social media, syndicated sources, or other services, all data is combined to provide a unique view into this complex business process.

Can My Job Title Be “Data Wrangler”?

The volume of this data is overwhelming, and so is the process of managing it. Combining multiple sources that use different ID systems can be very manual and resource-intensive. At Strata in 2015, Matt Derda of PepsiCo shared how they leveraged a series of macros in Microsoft Access to convert customer data into Excel, which then fed a series of queries on their internal servers. Hours and days were spent simply preparing the data. According to our partner Trifacta, over 80% of overall effort is spent preparing data rather than on the true objective, analysis.
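To make the "different ID systems" problem concrete, here is a minimal Python sketch of joining retailer records to CPG product records through a cross-reference table. It is only an illustration, not PepsiCo's or Trifacta's actual method: the field names (`store_code`, `retailer_sku`, `on_hand`), the sample values, and the `combine_sources` helper are all hypothetical.

```python
# Hypothetical sample data; none of these names or values come from the article.
RETAILER_ROWS = [
    {"store_code": "S001", "retailer_sku": "R-1001", "on_hand": 40},
    {"store_code": "S001", "retailer_sku": "R-1002", "on_hand": 5},
    {"store_code": "S002", "retailer_sku": "R-9999", "on_hand": 7},
]

# Assumed cross-reference from the retailer's SKU to the manufacturer's UPC.
SKU_TO_UPC = {"R-1001": "000000000017", "R-1002": "000000000024"}

# Assumed CPG-side product master, keyed by UPC.
PRODUCTS = {
    "000000000017": {"product": "Sample Cola 12-pack", "shelf_life_days": 90},
    "000000000024": {"product": "Sample Chips 10 oz", "shelf_life_days": 60},
}


def combine_sources(retailer_rows, sku_to_upc, products):
    """Join retailer rows to product details; collect rows that cannot be matched."""
    matched, unmatched = [], []
    for row in retailer_rows:
        upc = sku_to_upc.get(row["retailer_sku"])
        product = products.get(upc)
        if product is None:
            # No mapping or no product record: keep the row for manual review.
            unmatched.append(row)
        else:
            matched.append({**row, "upc": upc, **product})
    return matched, unmatched


if __name__ == "__main__":
    matched, unmatched = combine_sources(RETAILER_ROWS, SKU_TO_UPC, PRODUCTS)
    for row in matched:
        print(row["store_code"], row["upc"], row["product"], row["on_hand"])
    print("needs review:", [row["retailer_sku"] for row in unmatched])
```

Keeping the unmatched rows in their own list, rather than silently dropping them, mirrors the manual review step that makes this kind of preparation so time-consuming.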

A Rich Blend of Partnerships Providing a Powerful Solution

This complex business problem for CPFR was solved at PepsiCo, which used Hortonworks Hadoop to store all collected data, Trifacta for data wrangling, and Tableau to deliver rich visualizations.

With this blend of technology, PepsiCo gets faster access to reports and truly supports the “Collaborative” in its CPFR process.

They can easily and quickly import customer-provided data, combine it with internal product data, and enrich it with social media, sentiment analysis, and other unstructured data points. This combined data set is assembled very quickly, using intuitive and approachable scripting logic that visualizes the data components and flags potential data errors based on bad or missing characters.
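As a rough idea of what flagging "bad or missing characters" can look like, the Python sketch below checks a single identifier field. It is an assumption-laden illustration and not how Trifacta works internally: the `find_field_issues` helper, the 12-digit expected length, and the sample values are all hypothetical.

```python
def find_field_issues(value, expected_length=12):
    """Return a list of problems found in a numeric identifier field.

    The 12-character default is an assumption chosen for illustration.
    """
    text = value.strip() if value is not None else ""
    if not text:
        return ["missing"]

    issues = []
    # Accept only ASCII digits, so look-alike letters such as "O" are caught.
    if not all(ch in "0123456789" for ch in text):
        issues.append("non-digit characters")
    if len(text) != expected_length:
        issues.append("unexpected length")
    return issues


if __name__ == "__main__":
    # Hypothetical sample values, including a letter "O" typed in place of a zero.
    for sample in ["000000000017", "00000O000017", "12345", "", None]:
        print(repr(sample), find_field_issues(sample))
```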

CPG Companies and Their Customers’ Customers

A consumer goods company manages customers at two levels: its primary customers, the retailers, and the end consumers. As a result, a CPG organization requires data at multiple levels to make business decisions.

At times, based on this CPFR process, the business decision may actually be to recommend reducing order volumes. If a retailer is ordering 500 cases per week and only sells 100 cases per week, the right recommendation is a reduced order count to prevent spoilage at the end of the year. This is a byproduct of a truly collaborative relationship within the customer demand chain.
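The arithmetic behind such a recommendation can be sketched in a few lines of Python. This is a hypothetical rule for illustration only: the `recommend_weekly_order` helper and its 20% buffer are assumptions, not a rule described by PepsiCo.

```python
def recommend_weekly_order(ordered_cases, sold_cases, buffer_percent=20):
    """Suggest a weekly order: observed sales plus a buffer, never above the current order.

    The buffer_percent default is an assumed value, not taken from the article.
    """
    if ordered_cases < 0 or sold_cases < 0 or buffer_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    target = (sold_cases * (100 + buffer_percent) + 99) // 100
    return min(ordered_cases, target)


if __name__ == "__main__":
    current_order, weekly_sales = 500, 100  # figures from the example above
    suggested = recommend_weekly_order(current_order, weekly_sales)
    print("suggested order:", suggested)
    print("weekly excess avoided:", current_order - suggested)
```

With the 500 and 100 cases from the example and the assumed 20% buffer, the sketch suggests 120 cases per week. A real recommendation would also weigh shelf life, promotions, and seasonality.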

Published at DZone with permission of Eric Thorsen. See the original article here.
