Adaptive Data Integration and Operations on Oracle Cloud Using StreamSets
A data expert discusses how this partnership will be useful for continuous data flows built with StreamSets's DataOps platform.
StreamSets is pleased to announce a new partnership with Oracle Cloud Infrastructure (OCI). As enterprises move their big data workloads to the cloud, it becomes imperative that their Data Operations are more resilient and adaptive to continue to serve the business's needs. This is why StreamSets Data Collector™ is now easily deployable on OCI.
What led us to this point? There are fundamental questions, such as 'What good is an Enterprise Data Hub (EDH) without the most current data?' and 'What good is the EDH without lots of data sources feeding it?' These lead to follow-up questions: 'How do you manage data engineering as quickly as software development in a fast-paced DevOps world?' and 'How do you manage change data capture (CDC) from Oracle, streaming log files, and batch SFTP dumps without using large and confusing toolsets?'
To answer all of these questions, StreamSets has created the first complete DataOps (DevOps for data integration) platform to complement the fail-fast world of DevOps toolsets that are commonly found in places like a cloud-based EDH deployment. Running StreamSets in the Oracle Cloud to support a Cloudera Enterprise Data Hub (EDH) provides an excellent example of DevOps being applied to data to harness the value of a big data project.
Before we get to what this example looks like and how the pieces operate so well together, it might be helpful to explain why this unlikely trio would be assembled in the first place, starting with the question, 'Why the Oracle Cloud to run Cloudera?'
As OCI becomes more popular, a wider range of use cases presents itself, and we see Hadoop deployments becoming a great fit for OCI. This is because the Oracle Cloud has a few significant tricks up its sleeve that are unique to a second-generation cloud provider. The old saying, "Pioneers get the arrows and the settlers get the land," turns out to apply to cloud computing as well. First, there are some serious performance incentives: OCI's combination of bare metal compute and 50TB of local NVMe storage per node (or one petabyte of block storage per node) offers about 40% faster performance than traditional cloud VMs, and OCI is the only cloud provider that offers a guaranteed 25Gbps connection between any two nodes (SLA here). Second, OCI incorporates Oracle's Identity and Access Management (IAM) suite and the unique use of 'compartments' (essentially sub-clouds for greater security and billing control that scale across regions). Finally, the unique partnership between Oracle and Cloudera is an added bonus, specifically the cloud portion of this partnership: their ongoing support for a repository of Terraform scripts that enable the rapid, supported start-up and management of a large number of nodes for development or production.
The importance of Terraform, and of Oracle's and Cloudera's ongoing support for their openly available scripts to rapidly provision environments, cannot be overstated. Terraform gives users the ability to declaratively create immutable infrastructure and is fundamentally different from what you might find in a procedural, agent-based configuration management tool like Chef or Puppet. For those unfamiliar with Terraform, it is an open-source tool with a high-level configuration language that can create and execute plans to build a potentially unlimited amount of infrastructure via APIs in any popular cloud or on-premises environment. Using the Terraform scripts supported by Oracle and Cloudera, deploying a high-performance N-node EDH becomes as simple as making whatever changes to the scripts you deem necessary and entering "terraform init && terraform plan && terraform apply" at the command line.
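For teams that would rather drive the chained command above from a script, the same fail-fast sequence can be sketched in Python. This is only an illustration, not part of the Oracle and Cloudera scripts: the function names `run_step` and `provision` and the `workdir` parameter are hypothetical, and the sketch assumes the `terraform` binary is on the PATH and that `workdir` contains the Terraform scripts.

```python
import subprocess
from typing import Callable, List, Sequence

# The three steps named in the text, in the order `&&` would run them.
TERRAFORM_STEPS: List[List[str]] = [
    ["terraform", "init"],
    ["terraform", "plan"],
    ["terraform", "apply"],
]


def run_step(cmd: Sequence[str], workdir: str) -> int:
    """Run one command inside workdir and return its exit code."""
    return subprocess.run(list(cmd), cwd=workdir).returncode


def provision(
    workdir: str,
    runner: Callable[[Sequence[str], str], int] = run_step,
) -> List[List[str]]:
    """Run the steps in order and stop at the first failure, like `&&`.

    Returns the list of commands that completed successfully.
    `runner` is a hypothetical hook so the sequence can be exercised
    without a real Terraform installation.
    """
    completed: List[List[str]] = []
    for cmd in TERRAFORM_STEPS:
        if runner(cmd, workdir) != 0:
            break  # a non-zero exit code skips the remaining steps
        completed.append(cmd)
    return completed
```

Calling `provision("path/to/scripts")` with the default runner would invoke Terraform itself; `terraform apply` still asks for interactive approval, because the sketch does not pass any auto-approve option.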
Now that we understand how easy it is to provision and deploy those environments, the next issue is how to move data into and out of the EDH, and how to ingest change data capture or streaming web logs to keep the EDH current.
The answer to that question lies in the value of the partnership between Cloudera and StreamSets. StreamSets makes data ingestion and data movement easy via its DataOps platform. Tools like StreamSets Data Collector (data execution plane) and StreamSets Control Hub™ (control plane) work in tandem so an organization can centrally develop data pipelines and automate a distributed implementation of the same pipelines inside or outside the Hadoop cluster. Additionally, tools like StreamSets Data Protector™ and StreamSets Dataflow Performance Manager™ discover and protect sensitive data in-stream and provide service level agreements around streaming data availability and/or quality. Brought together, these tools allow for rapid iteration on data movement in a secure, predictable, and scalable way, which ensures the ongoing value of the EDH to business users.
Experience the speed and power of DevOps plus DataOps using this repository as a packaged offering of StreamSets, Oracle, and Cloudera. Once you have provisioned your EDH cluster with StreamSets via Terraform on the Oracle Cloud, the next step can be an adventure of your choosing! You can create data pipelines as a microservice, stream CDC logs to your EDH, or even stream data from Salesforce APIs for visualization in Minecraft. The world of DataOps awaits your exploration!
This is the first of a three-part series. Next, we will do a deep dive into how we got StreamSets up and running on OCI and the value of Terraform, and finally we will look at how StreamSets and Cloudera perform on bare metal OCI versus other cloud vendors. In the meantime, you can read Oracle's view on the integration here.
Published at DZone with permission of Clarke Patterson, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.