
Converting Apache Sqoop Commands Into StreamSets Data Collector Pipelines


Learn how StreamSets Data Collector can be used to modernize Sqoop workloads, and where Sqoop functionality differs from SDC.


When it comes to loading data into Apache Hadoop™, Apache Sqoop™ is the de facto choice for bulk loads from leading relational databases. After entering the Apache Incubator in 2011, it quickly saw widespread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.
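For context, a typical Sqoop bulk load looks something like the following; the connection string, credentials, table, and paths are illustrative:

```bash
# Bulk-load a relational table into HDFS with Sqoop.
# The hostname, database, table, and paths are made up for this example.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl \
  --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```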


In StreamSets Data Collector (SDC) 2.7, we added capabilities that enable SDC to behave in a manner almost identical to Sqoop. Now, customers can use SDC to modernize Sqoop-like workloads, performing the same load functions while gaining the ease of use and flexibility that SDC delivers.

In addition to adding Sqoop-like capabilities, we've also added an importer tool that can automatically convert existing Sqoop commands into equivalent SDC pipelines. Check out this short video for a demonstration. The tool is distributed as a Python package and can be installed by running pip3 install StreamSets.
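As a minimal sketch, installation looks like the following; the converter's exact command name and flags are not spelled out in this article, so the invocation shown in the comment is hypothetical:

```bash
# Install the importer tool (as described above).
pip3 install StreamSets

# Hypothetical invocation -- the real CLI name and flags may differ;
# see the demo video for exact usage:
# sqoop-import-tool "sqoop import --connect jdbc:mysql://db.example.com/sales --table orders"
```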

There are several ways in which SDC can modernize Sqoop workloads, namely:

  • Converting Sqoop commands into SDC pipelines is a great way to start. More often than not, customers have to perform entire data dumps each day or build custom scripting on top of Sqoop due to its poor support for incremental imports and lack of data drift handling. By converting Sqoop commands to SDC pipelines, incremental imports are automated, reducing the need for daily dumps and removing most, if not all, of the custom scripting previously needed to work around this problem (a sketch of this kind of wrapper script follows this list).
  • Sqoop's ability to do incremental imports is limited, as it has no proper method for persisting the "last offset." While there is a "Sqoop metastore," it is not supported in production by any Hadoop vendor due to various limitations (e.g. no backup, no security story, and it doesn't work from Oozie, which is the most common scheduler for Sqoop).
  • Sqoop is not data drift-aware, so StreamSets can help. Even though Sqoop can create an Apache Hive or Apache Impala (incubating) table for a user, it can't synchronize the schema when data drift occurs, whereas drift synchronization is a key feature of SDC. A typical workaround is to drop the Hive table at the beginning of the Sqoop import and recreate it from scratch, which is clearly not ideal and adds to the overall processing time.
  • Since Sqoop has limited extensibility, most customers end up building various scripts around it to run actions before or after the Sqoop command. For the most common use cases, such as running JDBC or Hive queries, SDC includes standalone executors that can be plugged into start/stop event handling. For more advanced use cases, a Shell executor can consolidate custom scripting in a single place (i.e., an SDC pipeline).
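To make the scripting burden concrete, here is a minimal sketch of the kind of wrapper many teams write around Sqoop to persist the last offset by hand; every hostname, path, table, and column in it is illustrative:

```bash
#!/usr/bin/env bash
# Illustrative wrapper of the kind an SDC pipeline replaces; the
# connection string, table, column, and paths are made up.
set -euo pipefail

OFFSET_FILE=/var/lib/etl/orders.last_value   # hand-rolled offset store
LAST_VALUE=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)

# Incremental import of rows newer than the saved high-water mark.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl \
  --password-file /user/etl/.pw \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value "$LAST_VALUE" \
  --target-dir /data/raw/orders

# Persist the new high-water mark for the next run. This fragile state
# (no backup, no locking) is exactly what SDC tracks automatically.
NEW_VALUE=$(mysql -h db.example.com -N -B -e "SELECT MAX(order_id) FROM orders" sales)
echo "$NEW_VALUE" > "$OFFSET_FILE"
```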

Since Sqoop has been around longer than StreamSets, in some cases its functionality varies enough from SDC's that we recommend you reach out and ask us for best practices. Specifically:

  • Most Sqoop connectors are JDBC-based, as are the connectors in SDC. There are, however, a few exceptions where Sqoop offers so-called "direct" connectors. The most common ones, for Oracle and Teradata, are highly optimized. At this time, SDC does not match Sqoop's performance for this type of connectivity, so in situations where direct connectivity is important, Sqoop remains the preferred method (see the example below).
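For reference, Sqoop's direct mode is enabled with its --direct flag; the connection details below are illustrative:

```bash
# Sqoop import using the optimized "direct" Oracle connector;
# the JDBC URL, credentials, table, and target path are made up.
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@//db.example.com:1521/ORCL \
  --username etl \
  --password-file /user/etl/.pw \
  --table SALES.ORDERS \
  --target-dir /data/raw/orders
```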

In summary, if you're loading data into Hadoop with StreamSets, you no longer need to use a separate tool for bulk data transfer if you don't want to. Not only does StreamSets give you the same great functionality, it also enables drift detection and integration with StreamSets Dataflow Performance Manager, making it possible to run complex pipelines at scale with operational oversight and confidence. Check out this video and tell us what you think.


