Ad-Hoc Data Wrangling vs. Production Data Engineering and Integration
The tools that are designed for ad-hoc self-service data analytics aren't the same tools that you want to think about for operationally running that environment at scale and vice versa.
Join the DZone community and get the full member experience.Join For Free
Self-service data analytics and data wrangling have been all the rage for the past few years. The idea that citizen data scientists and citizen data analysts, if just given the right tools, can be freed from the shackles of IT to do their own data analysis projects is very attractive. They can access data in a data lake, and they can use ad-hoc data integration and analytics tools to manipulate that data in a spreadsheet-like interface that does smart things like normalize data formats, propose join conditions, and make all kinds of other smart recommendations to help the user figure out how to integrate data in new and interesting ways. And then, once they have the “recipe” that they want to apply to integrate the data, putting that recipe into a repeatable data pipeline is an exercise that is left to the user.
This is where production data engineering and integration come into play. For ad-hoc analytics, you can use a sample dataset. For production, you have to be able to ingest the entire dataset, which means you have to parallelize the loading of the data so it can fit within an SLA window. For ad-hoc analytics, you can just load the data once; for production analytics, you have to be able to handle incremental loading of data on an ongoing basis as new data comes into the system. This means you have to deal with change data capture and slowly changing dimensions. These are non-trivial technical problems in the Hadoop world because merging and syncing new data, as well as tracking history, is not something that Hadoop currently handle.
For ad-hoc analysis, you don’t have to worry about how your data pipeline may interact with other data pipelines. Who cares if you happen to load a dataset from a relational database into the data-lake and you happen to do it several times? But in a production environment, if you happen to have five data pipelines, all of which load the same table into the data lake, your operational environment better be smart enough to realize that it is loading the same table five times and either just load it once or know to only load the most recent changes.
For ad-hoc analytics, you don’t worry about restarting the data pipeline if it fails. You just manually restart it. For production, your production system better be smart enough to know that something failed, try to restart it, and inform the right people that it failed in the first place.
And lastly, when you are doing ad-hoc data analysis, you don’t need to worry about tracking who made changes to the analytic recipe. In general, one data scientist is working on a problem and once they find the recipe, they publish that recipe to be put into production. However, once you switch to a production environment, you want to know how often the pipeline was run, how long each run took, and who made changes to the pipeline logic (if any changes were made). Governance of the process becomes a big issue in an operational environment.
The bottom line is that the tools that are designed for ad-hoc self-service data analytics are not the same tools that you want to think about for operationally running that environment at scale and vice versa. They are different problems that have very different feature set priorities. While there is some overlap in that they both allow you to ingest and transform data, the similarities tend to stop there. So make sure that you think about your needs and requirements for both environments — because those requirements are significantly different.
Opinions expressed by DZone contributors are their own.
Revolutionizing Algorithmic Trading: The Power of Reinforcement Learning
Tech Hiring: Trends, Predictions, and Strategies for Success
Avoiding Pitfalls With Java Optional: Common Mistakes and How To Fix Them [Video]
AI and Cybersecurity Protecting Against Emerging Threats