Self-Service Data Prep: Current State and Toolsets
Business users have always thought they could do data prep by themselves. As new tools in this area started to emerge, this belief truly came to fruition.
Self-service is a new buzzword in analytics. As technology became smarter, it enabled newer, exciting, easy-to-use, visual, guided approaches to self-service. This fueled customer expectations in all areas of analytics, which are now starting to be fulfilled by these enhanced capabilities. Currently, the most common self-service tool in analytics is Tableau, which creates data visualizations. As analysts became more comfortable with self-service in the data visualization space, they started looking for self-service tools in other areas of analytics, including data transformation. With this merging of requirements and capabilities, a new area of analytics emerged, focusing exclusively on self-service data preparation (also called data wrangling or data munging).
Why Users Need Self-Service for Data Prep
IT has long been seen as an impediment to changes requested by the business. But IT cannot be faulted, as due process must be followed for stable delivery. A simple change to a report, such as adding a new attribute, will likely require a timeline that seems like an eternity to a business user.
Business users have always thought they could make such changes themselves. As new tools in this area started to emerge, this belief really came to fruition. Data prep tools such as Trifacta and Paxata put the power to transform data directly in the hands of business users.
Defining Self-Service Data Prep Capabilities
Most of these tools are built on top of Hadoop and big data ecosystem technologies. As they sit on top of a data lake, they either pull data into their own cluster or execute jobs directly on data already in the lake. The major capabilities required of these tools are the following.
Ease of Use
This is the basic capability for any self-service tool. Hence, these tools are designed from the ground up for ease of use: each complex transformation can be achieved easily, is visually guided, and takes only a couple of clicks.
This is a major jump from earlier approaches to data prep, such as hand-written Python scripts, which have a steep learning curve and are time-consuming. Still, Python remains very popular with data scientists.
Data profiling is represented using visual charts to help users understand the data better, and transformations are guided visually so that tasks are easy to complete. This keeps the learning curve minimal.
Automated profiling helps users understand the data before starting a transformation: profiling runs on the data automatically, and the resulting visual profile provides data quality and data distribution information. Profiling can run on the full dataset, but if the dataset is very large, automated sampling is also provided.
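The profiling-with-sampling idea above can be sketched in a few lines of pandas. This is a simplified illustration, not any vendor's implementation; the `sample_threshold` cutoff and the chosen statistics are assumptions for the example:

```python
import pandas as pd

def profile(df: pd.DataFrame, sample_threshold: int = 100_000) -> dict:
    """Build a simple column-level profile; sample first if the data is large."""
    if len(df) > sample_threshold:
        # Automated sampling, as done on very large datasets.
        df = df.sample(n=sample_threshold, random_state=0)
    report = {}
    for col in df.columns:
        s = df[col]
        stats = {
            "dtype": str(s.dtype),
            "nulls": int(s.isna().sum()),   # data quality signal
            "distinct": int(s.nunique()),   # distribution signal
        }
        if pd.api.types.is_numeric_dtype(s):
            stats["min"], stats["max"] = s.min(), s.max()
        report[col] = stats
    return report

df = pd.DataFrame({"age": [34, 29, None, 41], "city": ["NY", "SF", "NY", None]})
print(profile(df))
```

A real tool would render these numbers as the visual charts described above rather than a dictionary.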
Leveraging Hadoop Ecosystem
Most of these tools use a compute engine such as Hadoop MapReduce, Spark, or another big data ecosystem engine. Each data transformation job runs on the data "in place" instead of pulling it into a separate production server.
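The "in place" idea can be illustrated with a plain SQL source: push the predicate into the query so only matching rows ever leave the store, instead of pulling everything out and filtering client-side. This is a minimal stdlib sketch of the principle, not how any particular tool implements it:

```python
import sqlite3

# An in-memory database stands in for the data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, "ok" if i % 2 else "err") for i in range(10)])

# Pull-then-filter: every row crosses the wire; filtering happens client-side.
pulled = [r for r in conn.execute("SELECT id, status FROM events")
          if r[1] == "err"]

# In-place: the predicate is pushed down, so only matching rows move.
in_place = list(conn.execute(
    "SELECT id, status FROM events WHERE status = 'err'"))

print(len(pulled), len(in_place))
```

Both produce the same rows, but the second approach does the work where the data lives, which is exactly what running on a cluster engine buys you at scale.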
How do these tools integrate with secure services on a data lake? What kind of authentication and authorization services do they provide for a big data environment? Tools should support Active Directory integration as well as many different types of users.
Data prep tools can provide support for advanced analytics, integrating with R, SAS, and similar environments, as well as supporting predictive and spatial analytics. The depth of integration with these advanced analytical tools varies from product to product.
There are multiple connectivity requirements for the source, target, and execution on RDBMS, FTP, Hive, HDFS, HBase, etc.
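As a minimal illustration of source-to-target connectivity, the sketch below extracts rows from an RDBMS (SQLite standing in for any JDBC-style source) and lands them as a CSV target. Real tools wrap many such connectors (FTP, Hive, HDFS, HBase, etc.) behind one interface; this example only shows the shape of one path:

```python
import csv
import os
import sqlite3
import tempfile

def extract_to_csv(conn: sqlite3.Connection, query: str, out_path: str) -> int:
    """Pull rows from an RDBMS source and land them as a CSV target."""
    cur = conn.execute(query)
    headers = [d[0] for d in cur.description]  # column names from the cursor
    rows = cur.fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)

# Demo with an in-memory database standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])
out = os.path.join(tempfile.mkdtemp(), "orders.csv")
print(extract_to_csv(conn, "SELECT * FROM orders", out))
```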
ETL vs. Data Prep: Will ETL Survive?
Both transform data, but ETL requires heavy IT involvement. A self-service data prep environment needs to be set up only once by IT; afterward, business teams can transform and load data independently. ETL required developers, and in turn deep technical knowledge. Data prep tools are user-friendly and can be used with little technical knowledge.
Data prep is currently used more for ad-hoc exploration, and ETL for more standard processes. But as organizations move toward Hadoop/big data environments (data lakes) for both ad-hoc and standard analytics, ETL will start to become irrelevant and data prep will become more pervasive.
Current Set of Tools
The current set of data prep tools includes Trifacta, Paxata, Alteryx, Platfora, Kinesis, and more.
There is another set of tools that are more comprehensive in their approach: data lake management tools. They start with data ingestion and manage data all the way to the end of the data chain (publishing). They create automated profiles but do not match data prep tools in complex transformations or ease of use. The most common of them are Alation, Waterline, Podium Data, and Datameer.
Tips for Selecting a Tool
Tool selection depends on your requirements, including the infrastructure available and the overall technology landscape. Below are some tips for selecting a self-service data prep tool:
- Check each of the capabilities mentioned above to confirm which tool fulfills most of the requirements.
- Trifacta uses different engines while running the transformation based on data size, while Paxata runs everything on a Spark engine regardless of data size.
- Alteryx, by default, pulls the data onto its own cluster to run the transformation, while also supporting in-database (in-cluster) execution on Hadoop.
- Trifacta and Paxata are better than others in terms of ease of use and guided visual transformations.
- Software support and licensing are other important considerations for choosing a tool.
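The size-based engine selection mentioned above (the behavior attributed to Trifacta) can be sketched as a simple dispatch. The threshold and engine names here are illustrative assumptions, not actual product defaults:

```python
def choose_engine(row_count: int, small_limit: int = 1_000_000) -> str:
    """Pick an execution engine by dataset size (threshold is illustrative)."""
    if row_count <= small_limit:
        return "in-memory"  # small data: run locally for quick feedback
    return "spark"          # large data: push the job to the cluster engine

print(choose_engine(500), choose_engine(50_000_000))
```

A Paxata-style design, by contrast, would return the same engine regardless of size.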
Cloud vs. On-Premises: Which Is Better?
Most tools can be used either on-premises or in the cloud. A few are cloud-native, like AWS Kinesis Analytics. The decision between cloud and on-premises depends on the overall landscape of the current stack being used.
In terms of capability and support, little changes with this decision, as most tools provide the same functionality in the cloud and on-premises.
Can It Work on Real-Time Data?
AWS Kinesis does support working on real-time data streams. It has limited capability as a data prep tool, but when working on real-time data, it may still be your best bet.
This new area of analytics is here to stay, and grow! As more and more organizations adopt Hadoop/big data ecosystem technologies, self-service data prep will become ever more pervasive. It supports both types of enterprise transformation requirements: ad hoc, case-by-case exploration, which can also feed advanced analytics such as machine learning, and standard transformation processes. As expected, data prep tools, new and existing, will continue enhancing their capabilities both horizontally and vertically.