
Real-Time Streaming ETL With SnappyData


This article introduces the rationale for real-time streaming ETL and the advantages of SnappyData's open source approach to it.


ETL has always been a big part of prepping data for analytics. Its origins go back to the development of relational databases, whose schemas were normalized to optimize the reads and writes of transactional systems. Once data rolled off a transactional system, it had to be reshaped into a form more suitable for read-only analytics. The ETL process became popular in the 1970s. Extraction pulls data from the source data store; transformation converts it to the format or structure needed for querying and analysis; loading moves the transformed data into its final analytics destination, e.g. an operational data store, data mart, or data warehouse. These processes were batch-oriented, complex, and held together with chewing gum and baling wire. Latency was never a consideration because the processes were monolithic and ran entirely offline.

As database products proliferated, so did the ETL products that serviced the market. Depending on whether you were an Oracle, IBM, or Microsoft shop, you had a choice of ETL products, but the product stack remained roughly the same, and the cost, complexity, and latency of doing ETL remained largely constant. Vendors like Talend and Informatica strove to provide connectors to a wide variety of sources. All of these products essentially run batch jobs and are not suitable for real-time ETL. They were born in an era when distributed computing was not the norm, so they were not built from the ground up to be horizontally scalable. The expectation was batch processing: the focus was on getting the job done, and specialized knowledge was considered a prerequisite for using these tools.

The diagram below shows how the ETL process, as executed today, is inherently a batch process that is complex and high-latency and involves multiple user-kernel crossings.

There is clearly a need to reduce the complexity of operating and monitoring ETL processes; above all, getting the data to its analytic destination in a timely manner, without errors or data loss, is paramount.

With the advent of big data streaming products like Spark, Samza, and Flink, there has been renewed interest in doing ETL, this time without the complexity or latency, and in making it real time. This is where customers asked us to craft a solution that was less messy, used standard interfaces, cut down latency, and allowed real-time transformation of data, enabling quicker business insights closer to where the events occurred.

Introducing SnappyData Agile ETL:

Doing real-time streaming ETL from operational stores requires a system that has three important capabilities (a minimal code sketch follows the list).

  • The ability to read data from relational and non-relational data sources without impacting the throughput of those systems. This can typically be done using change data capture (CDC) mechanisms. Given the myriad data sources from which data can be read, the ability to capture changes at the source is critical.
  • The ability to stream the data into a streaming engine and apply rich transformations on it so as to make it consumable for analytics in the target system.
  • The ability to write transformed data out to target systems using standard connectors which can do bulk writes. Messaging providers can also be used to route the data to their final destination.
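To make the first two capabilities concrete, here is a minimal sketch using plain Spark Structured Streaming, which SnappyData embeds. The broker address, topic name, and Debezium-style event schema are all assumptions for illustration, not part of SnappyData's API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructType}

object CdcIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-ingest").getOrCreate()

    // Hypothetical Debezium-style change event: operation code, source table,
    // and the row image after the change.
    val changeSchema = new StructType()
      .add("op", StringType)                          // c = insert, u = update, d = delete
      .add("table", StringType)
      .add("after", MapType(StringType, StringType))

    // Capability 1: read change events off Kafka without touching the source database.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker address
      .option("subscribe", "orders-cdc")                // assumed CDC topic
      .load()

    // Capability 2: transform the stream into an analytics-friendly shape.
    val changes = raw
      .select(from_json(col("value").cast("string"), changeSchema).as("evt"))
      .select("evt.*")
      .filter(col("op").isin("c", "u")) // this sketch keeps inserts and updates only

    changes.printSchema() // the write side (capability 3) is sketched further below
  }
}
```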

In addition to this, the system needs to have a metadata repository that is updateable using standard interfaces like SQL (or Spark).
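As a sketch of what that might look like, the snippet below creates and updates a hypothetical metadata table through SnappyData's SnappySession (its Spark session extension). The table name, columns, and DDL details are illustrative assumptions, not a prescribed schema:

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

object MetadataRepo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-metadata").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // Hypothetical metadata table mapping each source table to its target
    // table and the named transform to apply in between.
    snappy.sql(
      """CREATE TABLE IF NOT EXISTS etl_metadata (
        |  source_table VARCHAR(128),
        |  target_table VARCHAR(128),
        |  transform_name VARCHAR(128)
        |) USING row""".stripMargin)

    // The same table is equally reachable with plain SQL over JDBC or ODBC.
    snappy.sql("INSERT INTO etl_metadata VALUES ('orders', 'fact_orders', 'flatten_v1')")
  }
}
```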

SnappyData offers a robust in-memory database accessible via ODBC, JDBC, and REST interfaces. Data stored in SnappyData can be persisted in the cluster or on S3, Azure Storage, etc. SnappyData offers in-database streaming, which allows streaming applications to read data from sources like SQL Server, Oracle, and MySQL in a highly parallel manner. Data sources can also be HDFS or anything that writes to Kafka, since SnappyData treats Kafka as a streaming source. The streaming application performs rich transformations on the stream to convert the data to the format in which the target system can consume it, using the SnappyData repository to store the metadata needed to complete these transforms. Once completed, it uses Spark connectors to write data out to any target that has a Spark connector defined. These writes can be partitioned across the SnappyData cluster, leading to high throughput and low end-to-end latency for the ETL. Most importantly, the system is scalable end to end: new sources can be added by configuring new streaming applications to run in the cluster (source scalability), the writes to the target system are horizontally scalable (sink scalability), and the cluster metadata can grow without bounds without increasing the cost of doing stream transformations.
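The write side of such a pipeline might look like the following sketch, which uses Spark's generic foreachBatch sink (a Spark 2.4+ API; older Spark builds would use a different sink) together with the built-in JDBC connector. The target URL, table, and credentials are placeholders, and a real deployment would use whatever Spark connector the target defines:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object TargetWriter {
  // Push each micro-batch to the target through Spark's built-in JDBC
  // connector; any sink with a Spark connector could be swapped in here.
  def writeToTarget(changes: DataFrame): Unit = {
    changes.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .mode(SaveMode.Append)
          .format("jdbc")
          .option("url", "jdbc:postgresql://warehouse:5432/analytics") // placeholder target
          .option("dbtable", "fact_orders")                            // placeholder table
          .option("user", "etl")
          .option("password", sys.env.getOrElse("ETL_PASSWORD", ""))
          .save()
      }
      .option("checkpointLocation", "/tmp/etl-checkpoints") // enables restart bookkeeping
      .start()
      .awaitTermination()
  }
}
```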

With SnappyData, real-time streaming ETL becomes a robust, easy-to-use system that leverages all the work that has been done in Spark.

Lower Complexity:

In most enterprises, DBAs would not want to install any process on the database machines beyond turning on CDC or enabling supplemental logging, which allows transaction-log-reader approaches to CDC. Alternative approaches that require CDC updates to be written out to a message bus are more intrusive and require adding more infrastructure to mission-critical database machines. As an example, a leading ETL product that works with SAP Oracle databases requires log reader and listener processes on the database machine to read the database logs, which adds monitoring requirements for those components and consumes CPU, memory, and disk (all precious resources) on the database machine. Our approach reads CDC data off the machine and directly into the stream, leading to fewer moving parts in the ETL process overall.

Security:

For most data sources, SnappyData integrates with the database's existing security mechanisms, reusing existing security measures rather than opening up a new access point into the database.

No Sink Connector Processes:

The stream application transforms the data and microbatches the changes out to the target system directly using Spark connectors. The sink system does not need to pull in the changes or have additional machinery installed on it, which leads to lower operational complexity and better performance. The transformed payload can be emitted to messaging providers as well for consumption by multiple endpoints.
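As an illustration of that messaging path, the sketch below publishes the transformed payload to a Kafka topic using Spark's built-in Kafka sink; the broker address and topic name are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct, to_json}

object MessageFanout {
  // Publish the transformed payload to a Kafka topic so any number of
  // downstream consumers can route it onward; broker and topic are placeholders.
  def publish(transformed: DataFrame): Unit = {
    transformed
      .select(to_json(struct(col("*"))).as("value")) // Kafka sink expects a "value" column
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "orders-transformed")
      .option("checkpointLocation", "/tmp/fanout-checkpoints")
      .start()
  }
}
```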

Scalable Metadata Storage:

The SnappyData ETL engine is built into an in-memory datastore that offers a scalable metadata repository, persisted and reloaded on startup. SnappyData also offers robust monitoring capabilities that make streaming applications easy to manage.

Performance:

Eliminating additional processes avoids unnecessary user-kernel crossings, which cuts CPU costs, improves throughput, and improves system availability. Processing streams across partitions allows a massive degree of parallelism, which improves performance significantly.

Studio Driven Configuration (Coming Soon):

SnappyData offers the Snappy Agile Studio, which lets end users configure sources and targets and store the metadata needed to complete the transformations.

This article introduced the rationale for real-time streaming ETL and the advantages of the SnappyData approach. In future articles, we will explore the nuts and bolts of how we autoscale the solution, how we make the “Transform” layer pluggable, and how checkpointing and resumability are built into the solution.

