2018 ETL Tools Comparison

Today's need for advanced data analytics requires a modern approach to data integration. A modern platform, built in and for the cloud, can give you an edge.

Extract, transform, and load (ETL) tools enable organizations to make their data accessible, meaningful, and usable across disparate data systems. Typically, companies first realize a need for ETL tools when they learn the cost and complexity of trying to code and build an in-house solution.
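
To make the three steps concrete, here is a minimal, self-contained sketch in Python. It is purely illustrative (it uses in-memory SQLite databases for both source and target) and is not tied to any of the products discussed below.

```python
# Minimal, illustrative ETL sketch, not tied to any specific vendor tool.
# sqlite3 (standard library) stands in for the real source system and warehouse.
import sqlite3

def extract(source_conn):
    """Pull raw rows from the source system."""
    return source_conn.execute("SELECT id, email, amount FROM orders").fetchall()

def transform(rows):
    """Clean and reshape the data: normalize emails, convert cents to dollars."""
    return [(row_id, email.strip().lower(), amount / 100.0)
            for row_id, email, amount in rows]

def load(target_conn, rows):
    """Write the transformed rows into the warehouse table."""
    target_conn.executemany(
        "INSERT INTO orders_clean (id, email, amount_usd) VALUES (?, ?, ?)", rows)
    target_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount INTEGER)")
    source.execute("INSERT INTO orders VALUES (1, ' Ada@Example.COM ', 1999)")

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE orders_clean (id INTEGER, email TEXT, amount_usd REAL)")

    load(target, transform(extract(source)))
    print(target.execute("SELECT * FROM orders_clean").fetchall())
```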

When it comes to choosing the right ETL tool, you have several options. You can try to assemble open-source ETL tools to deliver a solution. This approach can work for some situations, but companies often find themselves needing more — more functionality/features, more flexibility, and more support.

The next option is to go with an incumbent provider: a solution that deals well with today's popular data sources and streams. Incumbent providers offer the stability and comfort of a big or well-known brand.

The third category of ETL tool is the modern ETL platform. These are often cloud-based solutions and offer end-to-end support for ETL of data from an existing data source to a cloud data warehouse. They're also built to support the ever-growing list of web-based data streams.

For this post, we'll dive into the world of incumbent ETL tools — the usual suspects, the advantages, and the drawbacks — and then finish up with a quick look at the modern ETL platforms.

Incumbent ETL Tools Overview

Incumbent ETL tools make up the majority of the ETL tool market, and that stands to reason: they've been around the longest, and many were designed by very large companies (Microsoft, IBM, etc.), so the installed customer base was already substantial.

Some of these offerings are suites of tools used together, customized to solve particular problems. And because many companies have their data stored in legacy, monolithic databases and systems, these vendors are well-positioned to provide tools to migrate that data and to support the existing batch-processing approach.

Popular Incumbent ETL Tools

This is not a complete list, but it does cover the major offerings.

IBM InfoSphere Information Server

IBM InfoSphere Information Server is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions (Server Edition, Enterprise Edition, and MVS Edition).

Informatica PowerCenter

Informatica PowerCenter is the general name for an ETL product suite including the PowerCenter Client Tools, Server, and Repository.

Data is stored in the repository where it is accessed by the client tools and the server. Actions are executed on the server, which connects to sources and targets to fetch the data, apply all transformations, and load the data into target systems.

iWay Software

Information Builders' iWay Integration Suite provides both application and data integration capabilities. Customers use them to manage both structured and unstructured information. The suite includes iWay DataMigrator, iWay Service Manager, and iWay Universal Adapter Framework.

Microsoft SQL Server Integration Services

Microsoft SQL Server Integration Services (SSIS) is a platform for building high-performance data integration solutions, including ETL packages for data warehousing.

OpenText

The OpenText Integration Center is an integration platform that gives organizations the ability to extract, enhance, transform, integrate, and migrate data and content from one or many repositories to any new destination.

Oracle GoldenGate

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments.

Pervasive Software

Pervasive's Data Integrator platform is an enterprise data integration software solution that enables companies to build connections between any kind of data source and application. Data Integrator supports real-time integration scenarios.

Pitney Bowes Software

Pitney Bowes offers a large suite of tools and solutions targeted around data integration. Sagent Data Flow is a flexible integration engine that collates data from disparate sources and provides a comprehensive set of data transformation tools to enhance its business value.

SAP Business Objects Data Services

Formerly Business Objects Data Integrator, SAP BusinessObjects Data Services (BODS) is an ETL tool used for data integration, data quality, data profiling, and data processing. It allows you to integrate and transform trusted data into data warehouse systems for analytical reporting.

SAS Data Management

Built on the SAS platform, SAS Data Management is SAS's entry into the ETL tool market. The platform consists of a large suite (20+) of SAS tools and services.

Sun Java Composite Application Platform Suite

Sun's ETL and data integration tools are part of the larger Java Composite Application Platform Suite (CAPS). CAPS, or Java CAPS, is a standards-based Enterprise Service Bus software suite from Oracle Corporation. Java CAPS has several components that help integrate existing applications and deliver new business services in a service-oriented architecture environment.

Sybase

Sybase ETL includes Sybase ETL Development and Sybase ETL Server.

Sybase ETL Development is a GUI tool for creating and designing data transformation projects and jobs. This tool provides a complete simulation and debugging environment designed to speed the development of ETL transformation flows. Sybase ETL Development includes an ETL Development Server that controls the actual processing, such as connecting to databases and executing procedures.

Sybase ETL Server is a scalable and distributed grid engine that connects to data sources and extracts and loads data to data targets using transformation flows (designed using Sybase ETL Development).

SyncSort

SyncSort Cloud Solutions accesses and integrates data from various sources and facilitates moving that data to cloud repositories.

Limitations of Incumbent ETL Tools

The biggest limitation of incumbent tools is that they were designed to work in batch: gather some data, upload it, gather more data, upload it, etc. Batch loading of data works in some situations; however, there are issues with a batch-only approach.

Batch data transformation tools can be hard to implement for cross-platform data sources, especially where Change Data Capture (CDC) is involved. When something goes wrong with your batch data upload, you need to track down the problem, troubleshoot, and re-submit the job quickly. This kind of error handling is crucial, as lost data can be a huge issue in cases where you have, for example, surpassed your 24-hour allotment of API calls in the data warehouse, or where the incoming data gets backed up and CDC information is lost or overwritten.
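
To illustrate why that bookkeeping matters, here is a rough sketch of the incremental-batch pattern: the job tracks a watermark so the next run knows where to resume, retries the load on failure, and only advances the watermark after a successful load. The function names `fetch_changes` and `load_batch` are hypothetical placeholders, not any vendor's API.

```python
# Illustrative batch-with-watermark loop; fetch_changes() and load_batch()
# are hypothetical placeholders for the source and warehouse calls.
import json
import time

STATE_FILE = "watermark.json"

def read_watermark():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_synced_at"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00Z"

def write_watermark(ts):
    with open(STATE_FILE, "w") as f:
        json.dump({"last_synced_at": ts}, f)

def run_batch(fetch_changes, load_batch, retries=3):
    """Pull rows changed since the last run; retry the load before giving up."""
    since = read_watermark()
    rows, new_watermark = fetch_changes(since)   # CDC-style incremental extract
    for attempt in range(1, retries + 1):
        try:
            load_batch(rows)
            write_watermark(new_watermark)       # advance only after a successful load
            return
        except Exception:
            if attempt == retries:
                raise                            # surface the failure for re-submission
            time.sleep(2 ** attempt)             # back off and retry
```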

And what about the ever-growing number of streaming and other types of data sources? They are not a good fit for toolsets designed and built around batch processing, especially with today's demands that the freshest data be available as quickly as possible.

Modern ETL Tools Overview

The modern suite of ETL tools was built with real-time, streaming data processing and the cloud in mind. These latest entries were born to integrate well with advanced cloud data warehouses and to support the ever-growing number of data sources and streams.

Today's trends continue to point to the cloud, and moving IT and ETL to the cloud only makes sense. Cloud-based ETL services are the natural next step. They support the same batch model as their predecessors, but they are taking ETL to the next stage, often offering support for real-time data, intelligent schema detection, and more.
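
As a rough illustration of what "intelligent schema detection" involves, the toy sketch below infers warehouse column types from a handful of sample records. The type mapping is an assumption for demonstration, not any vendor's actual logic.

```python
# Toy schema inference from sample records; the type mapping is illustrative only.
def infer_schema(records):
    """Map each field to a warehouse-style column type based on observed values."""
    def sql_type(value):
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE PRECISION"
        return "VARCHAR"

    schema = {}
    for record in records:
        for field, value in record.items():
            seen = schema.get(field)
            new = sql_type(value)
            # Widen to VARCHAR if the same field shows conflicting types.
            schema[field] = new if seen in (None, new) else "VARCHAR"
    return schema

print(infer_schema([{"id": 1, "price": 9.99}, {"id": 2, "price": "n/a"}]))
# {'id': 'BIGINT', 'price': 'VARCHAR'}
```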

Modern demands on ETL processes render the batch approach nearly obsolete. Gone are the days of nightly financial or inventory updates, as companies and their customers demand the freshest data. Companies keeping up with the ever-growing list of data streams need real-time ETL processing.

And with the need for real-time data access comes a fundamental change in architecture. Today's model is based on stream processing and distributed message queues such as Kafka. Modern approaches from companies like Alooma and others incorporate these new technologies to offer SaaS platforms and on-prem solutions. As part of the stream, modern ETL platforms offer differing levels of transformation, from almost none (instead, transformation happens in the data warehouse, after loading, AKA ELT) to full control via code (Python, Java, etc.).
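
As a sketch of what that stream-based architecture looks like in practice, here is a minimal consumer built on the open-source kafka-python client: it reads events from a Kafka topic, applies a light transformation in Python, and writes each record to the warehouse before committing the offset. The topic name, broker address, and warehouse writer are assumptions for illustration.

```python
# Sketch of a stream-based ETL consumer using the kafka-python package
# (pip install kafka-python); the topic name and warehouse writer are assumptions.
import json
from kafka import KafkaConsumer

def load_to_warehouse(row):
    """Stub standing in for the real warehouse writer (e.g. an INSERT or COPY)."""
    print("loading", row)

def transform(event):
    """Light in-stream cleanup; heavier reshaping can be deferred to the warehouse (ELT)."""
    event["email"] = event.get("email", "").strip().lower()
    return event

consumer = KafkaConsumer(
    "orders",                                   # assumed source topic
    bootstrap_servers="localhost:9092",         # assumed broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=False,
)

for message in consumer:
    load_to_warehouse(transform(message.value))
    consumer.commit()                           # commit the offset only after a successful load
```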

The last piece of the puzzle is data integrity. What happens if part of the process lags behind or fails? What happens to the data traveling through the pipeline? Any truly modern ETL platform needs to have a robust safety net built in for error handling and reporting.
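
One common way to build that safety net, sketched below under the same Kafka assumptions as above, is to park any record that fails to load on a dedicated error topic along with the error message, so it can be inspected and restreamed later rather than lost.

```python
# Illustrative safety net: records that fail to load are parked on an error topic
# (with the error attached) so they can be inspected and restreamed later.
# Topic names and the kafka-python producer setup are assumptions.
import json
from kafka import KafkaProducer

error_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

def load_with_safety_net(event, load_fn):
    """Try to load one event; on failure, keep the record instead of losing it."""
    try:
        load_fn(event)
    except Exception as exc:
        error_producer.send("orders_errors", {"event": event, "error": str(exc)})
```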

Popular Modern ETL Platforms and Tools

Here's a list of the most common modern ETL platforms and tools.

Alooma

Alooma is an enterprise data pipeline platform built for the cloud. Alooma provides data teams a modern, scalable cloud-based ETL solution, bringing together data from any data source into any data warehouse, all in real time.

  • Error handling: Handling, monitoring/reporting, restreaming

  • Transformation: ETL with support for Python transforms (see the sketch below)
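
As a rough idea of what an event-level Python transform looks like, here is an illustrative sketch; the exact function signature and hooks a given platform expects may differ, so treat this as the general shape rather than documented API.

```python
# Rough sketch of an event-level Python transform; the signature and field names
# are assumptions for illustration, not a specific platform's documented API.
def transform(event):
    """Receive one event as a dict; return the (possibly modified) dict, or None to drop it."""
    if event.get("type") == "heartbeat":
        return None                                   # filter out noise events
    event["email"] = event.get("email", "").strip().lower()
    event["amount_usd"] = event.get("amount_cents", 0) / 100.0
    return event
```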

Confluent

Confluent is a full-scale data streaming platform based on Apache Kafka, capable of publish/subscribe messaging as well as storing and processing data within the stream. Confluent offers an open-source version of its platform.

  • Error handling: Monitoring only

  • Transformation: ETL, Kafka Streams API

Fivetran

Fivetran is a SaaS data integration tool that extracts data from different cloud services, databases, and business intelligence (BI) tools and loads it into a data warehouse.

  • Error handling: Monitoring only

  • Transformation: ELT, limited

FlyData

FlyData is a SaaS data migration tool that enables management of the data load process from MySQL, PostgreSQL, MariaDB, Percona, and logs in CSV/TSV/JSON to an Amazon Redshift data warehouse.

  • Error handling: Yes, buffered

  • Transformation: ELT, limited

Matillion

Matillion offers cloud data integration ETL tools built specifically for Amazon Redshift, Google BigQuery, and Snowflake.

  • Error handling: Supported via code, not built-in

  • Transformation: ETL, graphical builder

SnapLogic

SnapLogic provides data integration platform-as-a-service tools for connecting cloud data sources, SaaS applications, and on-prem business software applications.

  • Error handling: Supported, but not built-in

  • Transformation: ETL, graphical builder

Stitch Data

Stitch is a cloud-first, developer-focused tool for rapidly moving data.

  • Error handling: Manual, records are logged in rejection table

  • Transformation: Quasi-ETL, limited

StreamSets

StreamSets is a cloud-native collection of products to control data drift: the problem of changes in data, data sources, data infrastructure, and data processing.

  • Error handling: Yes (error record handling)

  • Transformation: ETL, code, and GUI

Striim

Striim (pronounced "stream") is a real-time streaming analytics and data integration platform.

  • Error handling: Monitoring only

  • Transformation: ETL, built-in and Java functions

Wrapping Up

Today's need for advanced data analytics requires a modern approach to data integration. Whether you're looking to incorporate data from databases, streaming services, files, or other sources, choosing the right toolset is critical. A modern platform, built in and for the cloud, can give your business the edge you need.
