Two years ago, Doug Henschen, Executive Editor of InformationWeek, wrote an interesting article called “Big Data Debate: End Near for ETL?” which covers a debate over whether Hadoop will replace ETL. In the article, he quotes Phil Shelley, the former Chief Technology Officer of Sears Holdings and CEO of Metascale, who argues that ETL will become obsolete. Shelley states,
“The growth of ETL has been alarming, as data volumes escalate year after year. Companies have significant investment in people, skills, software and hardware to do nothing but ETL. Some consider ETL to be a bottleneck in IT operations: ETL takes time as, by definition, data has to be moved. Reading from one system, copying over a network and writing all take time — ever growing blocks of time, causing latency in the data before it can be used. ETL is expensive in terms of people, software licensing and hardware. ETL is a non-value-added activity too, as the data is unusable until it lands in the destination system.”
Shelley made some very good points on why “ETL’s Days Are Numbered.” He also talked about how Hadoop will eventually replace ETL because it is a data hub that can store, transform, and use data without ETL tools. Shelley described how it works in the bullets below:
- Systems generate data, just as they always have.
- As near to real-time as possible, data is loaded into Hadoop — yes, this is still “E” from traditional ETL, but that is where the similarity ends.
- Now we can aggregate, sort, transform and analyze the data inside Hadoop. This is the “T” and the “L” from traditional ETL.
- Data latency is reduced to minutes instead of hours because the data never leaves Hadoop. There is no network copying time, no licenses for ETL software and no additional ETL hardware.
- Now the data can be consumed in place without moving it. There are a number of graphical analytics and reporting options that consume data without moving large amounts of it out of Hadoop.
- Some subsets of data do have to be moved out of Hadoop into other systems, for specific purposes. However, with a strong and coherent enterprise data architecture, this can be managed to be the exception.
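The pattern in the bullets above is often called ELT: extract once into the data hub, then transform and consume in place. As a minimal sketch (plain Python standing in for a Hadoop cluster, with illustrative record layouts and function names), the flow looks like this:

```python
# A minimal sketch of the ELT pattern Shelley describes: data is
# extracted once into the "hub", then transformed and consumed in
# place. Plain Python stands in for Hadoop; all names are illustrative.

raw_events = [  # systems generate data, just as they always have
    {"store": "A", "sku": "123", "qty": 2},
    {"store": "A", "sku": "456", "qty": 1},
    {"store": "B", "sku": "123", "qty": 5},
]

hub = []  # stands in for HDFS: the single landing zone

def extract(events):
    """The only data movement: load raw data into the hub (the 'E')."""
    hub.extend(events)

def transform_in_place():
    """Aggregate inside the hub (the 'T' and 'L'); nothing leaves it."""
    totals = {}
    for e in hub:
        totals[e["store"]] = totals.get(e["store"], 0) + e["qty"]
    return totals

extract(raw_events)
print(transform_in_place())  # {'A': 3, 'B': 5}
```

The point of the sketch is the ordering: the transform runs where the data already lives, so there is no second network copy and no separate ETL tier.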
Now, two years later, let’s take a look at what has happened to ETL.
In a VentureBeat article called “The state of big data in 2014 (chart),” Matt Turck presents and describes his chart of the Big Data landscape and how it has changed over the last two years. He remarks on what has happened with infrastructure: “Hadoop seems to have solidified its position as the cornerstone of the entire ecosystem, but there are still a number of competing distributions — this will probably need to evolve.”
In another recent article, “More data means more data scientists, & they need more data tools,” Jordan Novet refers to a book written by Tom Davenport called “Big Data @ Work: Dispelling the Myths, Uncovering the Opportunities.” Novet says that Davenport believes the industry needs more data integration tools for transferring and moving data, essentially removing a great deal of the “dirty work” involved in migrating it.
So is ETL on life support? No, far from it. Shelley was correct in his belief that Hadoop would prosper. However, he was wrong about the “death of ETL”: more and more ETL tools have emerged, and existing ones have evolved. ETL tools without the weaknesses Shelley cites, such as slow speed and high cost, are now easy to find. Common in the market are user-friendly ETL tools with automation features that cut the time spent coding. Change data capture (CDC) tools are now available that migrate only changed data, shaving off processing time. And data replication tools that can do the work at “real-time” speeds are flooding the market.
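The change-data-capture idea is simple at its core: instead of re-copying a whole table, only rows modified since the last sync are migrated. A minimal sketch, with an assumed table layout, timestamp column, and high-water-mark logic (real CDC tools typically read the database transaction log instead):

```python
# A minimal sketch of change data capture (CDC): migrate only rows
# changed since the last sync. The table, "updated_at" field, and
# high-water-mark comparison are illustrative assumptions.

rows = [
    {"id": 1, "name": "widget", "updated_at": 100},
    {"id": 2, "name": "gadget", "updated_at": 205},
    {"id": 3, "name": "gizmo",  "updated_at": 310},
]

def capture_changes(table, last_sync):
    """Return only the rows changed after the last sync (the delta)."""
    return [r for r in table if r["updated_at"] > last_sync]

delta = capture_changes(rows, last_sync=200)
print([r["id"] for r in delta])  # [2, 3]
```

Moving only the delta is what shaves off the processing time mentioned above: the cost of each sync scales with how much changed, not with the size of the table.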
Why are ETL tools evolving and flooding the market? My take is that these tools are in demand by corporations that don’t want to replace all of their current systems for something like Hadoop because of the costs and efforts of such an endeavor. Instead, they are buying other less costly solutions such as add-ons for their enterprise systems that can do what Hadoop can do. What’s your take?