
Transforming ETL for the Data-Driven Age


In this post, we look at the growth of data lake technology tools and how they're changing to meet the challenges of data-centric organizations.


Are ETL Tools Still Relevant?

This question faces data-centric organizations and even ETL vendors themselves: will these tools be able to survive the ever-changing data landscape? Let's first understand the genesis of ETL, which originated in the data warehousing world. It had a steep learning curve for developers, but it provided many benefits, like distributed processing, maintainability, and being largely UI-based instead of script-based.

The changing data transformation process and terminology for the data-driven age can be summed up in the table below:

| DW (reporting-focused) | Data-Driven (analytics-focused) |
|---|---|
| ETL (Extract, Transform, and Load): the flow is tightly coupled to how data is handled; real-time data is not considered. | Data Pipeline: loosely coupled in terms of how data is handled; includes real-time data; can be ELT or ETL. |
| Extract and Load: data is selected from a particular source and loaded into a different environment, such as an RDBMS. | Ingestion: no selection of data; the full dataset is dumped into the data lake. |
| Transformation: data is transformed using ETL tools. | Standardized Transformation: data is transformed using big data tools/technologies. Ad-Hoc Transformation: self-service data prep tools are used for ad-hoc transformations. |
| Standard ETL Processes: data quality, security, metadata management, governance, etc. | Standard Data Processes: data quality, security, metadata management, governance, etc. (still relevant). |


Coupling may be an old concept in programming, but it is still a relatively new one when it comes to how data is handled: as the table above shows, ETL flows are tightly coupled, whereas data pipelines are loosely coupled. The loosely coupled approach has drawbacks of its own, like the creation of data swamps full of dark data.
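To make the contrast concrete, here is a minimal sketch in Python (the lake path and cleanup rule are hypothetical) of a loosely coupled pipeline: ingestion dumps raw data into the lake without knowing who will consume it, and transformation runs separately, on its own schedule.

```python
import json
import pathlib
from datetime import datetime, timezone

LAKE = pathlib.Path("/data/lake")  # hypothetical data lake root

def ingest(source_name, records):
    """Producer side: dump raw records into the lake as-is.

    It knows nothing about downstream consumers, so it can change freely.
    """
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = LAKE / "raw" / source_name / f"{ts}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in records))
    return path

def transform(source_name):
    """Consumer side: runs on its own schedule against whatever raw data exists."""
    rows = []
    for f in sorted((LAKE / "raw" / source_name).glob("*.jsonl")):
        for line in f.read_text().splitlines():
            rec = json.loads(line)
            rec["amount"] = float(rec.get("amount", 0))  # an example cleanup rule
            rows.append(rec)
    return rows
```

Either side can now evolve independently; in a tightly coupled ETL flow, the extract, transform, and load steps would all live in one job and change together.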

Standardized transformations are still relevant, and ETL processes can still be followed for them. But for entirely new concepts like self-service data preparation, the old processes and practices cannot be used. Standard ETL processes like data quality, security, metadata management, and governance also remain relevant for data-driven organizations.

Data Lake Impact

Big data shook ETL, as it undermined ETL's core value proposition. ETL tools must start supporting big data ecosystem technologies while reinventing themselves.

Below are some of the ways in which big data impacted ETL:

  1. ETL is still relevant in environments that use a DW. Currently, DWs and data lakes complement each other by extending and improving the architecture, but this may not remain the case, as new use cases are all being built on data lakes.
  2. Standard transformations were implemented using an ETL tool/engine for processing and an RDBMS for storage. Data lakes are used for both processing and storage and, in comparison, provide a single platform that is easier and cheaper to use.
  3. Data lakes extend analytics beyond standardized ETL: they enable ingestion first and data preparation afterwards, oriented towards self-service and ad-hoc use, which is not possible with ETL.
  4. Data lakes came to be used for data landing/staging/archiving, a storage role that even an RDBMS could not handle. This forced a rethink of how ETL tools were implemented.
  5. ETL was never meant for unstructured environments, but big data platforms enable the storage of semi-structured and unstructured data, making ETL a poor fit for such data. ELT is the way forward here, as the sketch after this list shows.
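As a minimal ELT sketch (assuming PySpark; the paths and the clickstream example are hypothetical): semi-structured JSON lands in the lake untouched, and transformation happens afterwards, inside the lake's own engine, only for the consumers that need it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: land the raw, semi-structured events in the lake as-is.
raw = spark.read.json("/landing/clickstream/*.json")       # hypothetical source
raw.write.mode("append").parquet("/lake/raw/clickstream")  # no transformation yet

# Transform: later, and only for the consumers that need it.
events = spark.read.parquet("/lake/raw/clickstream")
daily = (events
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("day", "page")
         .count())
daily.write.mode("overwrite").parquet("/lake/curated/daily_page_views")
```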

Legacy ETL approaches have started to lose relevance in the new data-driven world. As new architectures and technologies emerge from big data, ETL tools need to support new approaches to stay relevant. The shift towards Hadoop and other open architectures meant that legacy ETL vendors were losing ground.

Reinventing ETL - Options

What are the options for vendors to reinvent themselves and stay relevant? Let's look at them below:

1. Open Source-Based Execution

Proprietary technologies for data processing and storage are losing relevance. ETL vendors should be able to support all the open source execution engines (Spark, MapReduce, etc.) as well as Hadoop storage.
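One hedged sketch of what such support could look like (not any particular vendor's API): the transformation is kept declarative as SQL, so an ETL tool could hand it to an open source engine such as Spark for execution instead of a proprietary runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-engine-sketch").getOrCreate()

# The transformation stays declarative: an ETL tool would only generate this
# SQL, while Spark (or another open engine) executes it on the cluster.
spark.read.parquet("/lake/raw/orders").createOrReplaceTempView("orders")

result = spark.sql("""
    SELECT customer_id,
           SUM(amount) AS total_spend,
           COUNT(*)    AS order_count
    FROM orders
    WHERE status = 'COMPLETED'
    GROUP BY customer_id
""")
result.write.mode("overwrite").parquet("/lake/curated/customer_spend")
```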

2. Cloud-Centric

Being merely cloud-capable is not good enough; ETL tools should support cloud-native architectures alongside their on-premise versions. New cloud-native ETL tools like SnapLogic, Informatica Cloud, and Talend Integration Cloud provide an integration Platform-as-a-Service (iPaaS) that resolves many infrastructure challenges, though these tools still have ETL limitations and are not as self-service-enabled as emerging tools. More focus on self-service and ML would let these tools support ad-hoc work and self-learning, keeping them relevant in the new age.

3. Data Prep in the Mix

ETL is a developer-focused data transformation tool, while data prep is a self-service-focused one. As we move towards greater use of data lakes for analytics, for both ad-hoc and standard processes, ETL will start to become irrelevant as self-service becomes more pervasive. The two should merge into a single category of data transformation tools that can handle both standard and ad-hoc transformations.

4. AI/ML Focused

AI/ML is an enabler: it enhances data engineers' and developers' ability to complete their jobs easily and quickly by automating many processes. This may include automatic suggestions for datasets, their transforms, and rules, which were not previously possible. The result is a collaboration between AI algorithms and data workers: the AI learns whenever a suggestion is accepted and tunes its classification and transformation suggestions accordingly.

Thus, AI will keep impacting many parts of the data architecture, including self-learning algorithms for data classification, data modeling, data storage, etc. ETL tools need to support AI solutions; some vendors have started to provide AI functionality, but it is still far from being a standard solution.
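A toy sketch of that feedback loop (entirely hypothetical, not any vendor's implementation): the engine suggests a transform for a column, and every accept/reject decision updates the scores that drive the next suggestion.

```python
from collections import defaultdict

class TransformSuggester:
    """Suggest a transform per column and learn from accept/reject feedback."""

    def __init__(self):
        # score[(column_hint, transform)] grows when users accept a suggestion
        self.score = defaultdict(float)
        self.transforms = ["trim_whitespace", "parse_date", "mask_pii", "cast_numeric"]

    def _hint(self, column_name):
        # Crude feature: the last token of the name ("billing_date" -> "date").
        return column_name.lower().rsplit("_", 1)[-1]

    def suggest(self, column_name):
        hint = self._hint(column_name)
        return max(self.transforms, key=lambda t: self.score[(hint, t)])

    def feedback(self, column_name, transform, accepted):
        # Accepted suggestions reinforce the pairing; rejections dampen it.
        self.score[(self._hint(column_name), transform)] += 1.0 if accepted else -0.5

# Usage: the suggester drifts towards what data workers actually accept.
s = TransformSuggester()
s.feedback("signup_date", "parse_date", accepted=True)
s.feedback("order_date", "parse_date", accepted=True)
print(s.suggest("cancel_date"))  # -> parse_date
```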

5. Self-Service Design Capability

ETL tools should start supporting the creation of self-service-based designs/flows, both by enhancing existing tools and by providing new tools for such designs. This will help organizations build new self-service use cases.

6. Real-Time Support

Real-time support should be provided via open source technologies, with appropriate changes to the architecture of existing tools, or with new tools created for this purpose. Real-time support will let a tool cover all big data use cases.
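For example, a minimal real-time ingestion flow built entirely on open source components might look like this (a sketch assuming a Kafka topic named orders and the Spark-Kafka connector on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()

schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType()))

# Continuously ingest events from Kafka and land them in the lake as Parquet.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (stream.writeStream
         .format("parquet")
         .option("path", "/lake/raw/orders")
         .option("checkpointLocation", "/lake/_checkpoints/orders")
         .start())
query.awaitTermination()
```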

7. Big Data Quality 

There are still no ETL tools that can enhance the quality of large amounts of data. A few can profile big data, but there is no rule-based engine to support such execution at scale. ETL vendors should focus on this critical area to be able to compete with new platform-based tools on Hadoop. Data prep can provide support to some degree, but it cannot be industrialized for such use cases.
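A minimal sketch of what such a rule-based engine could look like on Spark (the rules and dataset are invented for illustration): rules are declared as boolean column expressions and evaluated in a single pass over the data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
df = spark.read.parquet("/lake/raw/customers")  # hypothetical dataset

# Each rule is a name plus a boolean column expression flagging violations.
rules = {
    "null_email":      F.col("email").isNull(),
    "negative_age":    F.col("age") < 0,
    "bad_country_len": F.length("country_code") != 2,
}

# Evaluate all rules in a single aggregation pass over the data.
counts = df.agg(*[
    F.sum(F.when(cond, 1).otherwise(0)).alias(name)
    for name, cond in rules.items()
]).first().asDict()

total = df.count()
for name, violations in counts.items():
    print(f"{name}: {violations} violations ({violations / total:.1%})")
```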

8. Matching and Merging Support on Big Data

Matching and merging support for the data ingested into data lakes sits somewhere in the grey area between MDM and ETL, and it needs to be provided. This is, again, a critical area, and by using ML techniques, vendors could provision it readily.
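As a simple illustration of the matching half (a sketch using plain edit distance rather than a trained ML model; the table and column names are hypothetical), candidate duplicates can be paired by comparing normalized names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("match-sketch").getOrCreate()

customers = (spark.read.parquet("/lake/raw/customers")
             .withColumn("norm_name", F.lower(F.trim(F.col("name")))))

a = customers.alias("a")
b = customers.alias("b")

# Self-join: pair records whose normalized names are within edit distance 2.
# (A full self-join is quadratic; real systems block on a cheap key first.)
candidates = (a.join(b, F.col("a.id") < F.col("b.id"))
              .where(F.levenshtein("a.norm_name", "b.norm_name") <= 2)
              .select(F.col("a.id").alias("id_left"),
                      F.col("b.id").alias("id_right"),
                      F.col("a.norm_name").alias("name_left"),
                      F.col("b.norm_name").alias("name_right")))

candidates.show()  # pairs feed a merge/survivorship step downstream
```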

9. Unified Metadata Catalog Support

A data-driven world requires organizations to have access to a catalog of all their data. As ETL tools are already repositories of metadata, they should be able to support this requirement: a catalog that is automatically populated, with data automatically categorized/tagged, plus search capabilities and crowd/expert ratings.
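A stripped-down sketch of automatic catalog population (the lake layout and tagging heuristics are invented for illustration): walk the lake, record each dataset's schema, and derive simple tags from column names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-sketch").getOrCreate()

DATASETS = ["/lake/raw/customers", "/lake/raw/orders"]  # hypothetical layout
PII_HINTS = {"email", "phone", "ssn", "name"}

catalog = []
for path in DATASETS:
    df = spark.read.parquet(path)
    columns = [(f.name, f.dataType.simpleString()) for f in df.schema.fields]
    tags = sorted({"pii" for c, _ in columns if c.lower() in PII_HINTS})
    catalog.append({"path": path, "columns": columns, "tags": tags})

# A trivial search over the auto-populated entries.
def search(term):
    return [e for e in catalog
            if term in e["path"] or any(term == c for c, _ in e["columns"])]

print(search("email"))
```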

10. Reusability-Centric Data Lake Design 

ETL tools should, by design, support reusable components so that a few jobs can serve many flows. Such reuse has been worked on for a long time, but the emphasis should now shift to supporting data lake technologies.
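A small sketch of the idea (dataset names and parameters are hypothetical): one parameterized job definition reused across many lake datasets, instead of a hand-built job per source.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reuse-sketch").getOrCreate()

def standardize(df: DataFrame, dedupe_key: str, date_col: str) -> DataFrame:
    """One reusable component: trim strings, parse the date, drop duplicates."""
    for c, t in df.dtypes:
        if t == "string":
            df = df.withColumn(c, F.trim(F.col(c)))
    return (df.withColumn(date_col, F.to_date(date_col))
              .dropDuplicates([dedupe_key]))

# The same job definition serves any number of lake datasets.
for name, key, date_col in [("customers", "customer_id", "signup_date"),
                            ("orders", "order_id", "order_date")]:
    df = spark.read.parquet(f"/lake/raw/{name}")
    standardize(df, key, date_col).write.mode("overwrite") \
        .parquet(f"/lake/curated/{name}")
```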

Conclusion

As the data-driven age demands relentless support for more data, better insights, and lower costs, ETL tools need to reinvent themselves, and big data-native technologies will form the future of such tools. ETL may be fading as a label, but the knowledge that created ETL as a category in data management still provides the base for any data transformation activity. ETL vendors like Talend, Informatica, etc. have recognized these challenges and have created new products and enhanced existing ones, some specifically for big data and the cloud.


