The Tools We Used to Build Our ETL Pipeline Platform
Learn about the toolchain for one ETL pipeline using Stitch that can help you configure data sources and monitor each part of the pipeline.
Join the DZone community and get the full member experience.Join For Free
Stitch is a cloud-based platform for ETL – extract, transform, and load. More than a thousand companies use Stitch to move billions of records every day from SaaS applications and databases into data warehouses and data lakes, where it can be used for analysis, reporting, or training machine learning models. We thought people would be interested in seeing how we architected a cloud-based service that can handle billions of rows of data daily.
The Stitch data pipeline first saw the light of day as a proof-of-concept component extracted from a monolithic PHP platform built by RJMetrics, the company Stitch was spun out of in 2016. When we broke out Stitch as a separate platform, we didn't want to use PHP – it doesn't have great support for datatypes and functions for data processing, and its performance isn't optimized for that use case.
We wrote the first Stitch POC code in Clojure. Few members of our team had Clojure experience at the time, but we thought it would help us build a scalable, maintainable platform. The productivity of REPL-based programming resonated enough that it became our default development language. Today we have 17 Clojure microservices (and a number of seasoned Clojure engineers), and PHP is not a part of our stack.
Just what's in the Stitch stack? And what tools do we use to build and maintain it? Let's go backstage to see what powers our cloud data pipeline.
The Front End
Stitch’s front end lets users configure data sources and destinations and monitor the status of each part of the pipeline. Since we support more than 70 data sources, and we expect that number to grow, we built reusable interface components for configuring each source. Adding a new source to the front end is as simple as declaring some information about its parameters. We've invested a lot of effort in making sure our front end helps users understand how Stitch is working, how much data they're moving, and how fresh their data is.
The Data Pipeline
That front-end code communicates with the back-end services through REST APIs. The majority of our Clojure microservices are simple web services that wrap a transactional database with CRUD operations and a little bit of business logic. We use both MySQL and Postgres for transactional data persistence, having transitioned from the former to the latter for newer services and to take advantage of new features coming out of the Postgres community. We expect our scale requirements for these transactional services to remain small, and Amazon Relational Data Service (RDS) makes it easy for us to operate either.
Most of our Clojure best practices can be summed up by the phrase "keep it simple." We avoid complex web frameworks in favor of using the Ring library to build web service routes, and we prefer sending SQL directly to the JDBC library rather than using a complicated object-relational mapper (ORM) or a domain-specific language (DSL). We also steer clear of Clojure libraries that are wrappers on top of Java libraries, since they can cause friction if they aren’t comprehensive, and Clojure’s interoperability features make it easy to interface with Java code directly. For testing, we find that the core.test library that comes with Clojure is all we need.
If you're interested in how we've architected the data flow through Stitch, read A trip through Stitch's data pipeline, which talks about how Stitch integrations take rows and events from data sources and pipe them to destinations. Behind the scenes, we take advantage of Apache Kafka queuing and encrypted Amazon S3 buckets.
All of the tools we've talked about run in a development environment in a VirtualBox VM that's configured with the same Chef code we use in production, to ensure consistency between the two environments. The majority of the team uses Emacs as an IDE, along with the CIDER package to connect over the network to the Clojure REPL of our applications. Most of our engineers' screens are tiled with tmux windows for editing and monitoring different services in their development environment. We use GitHub to host our code repositories, and CircleCI runs our test suites automatically before any code is merged.
Stitch's production environment runs entirely on AWS. All of our transactional databases are run with RDS, and we rely on S3 for data persistence in various stages of our pipeline. We use Redshift as our data warehouse.
The majority of our services run on stateless EC2 instances managed by AWS OpsWorks. We configure OpsWorks stacks and layers in Terraform with all the Chef recipes, network, load balancer, and hardware configuration that a service requires, then use Jenkins along with a custom script for continuous integration and continuous delivery (CI/CD) to provision instances in those layers with a specific code release. Jenkins also handles our dozens of daily code deploys by provisioning new instances and swapping them into load balancers or otherwise activating them.
We recently introduced Kubernetes into our infrastructure to run the scheduled jobs that execute Singer code to extract data from various sources. Kubernetes has proven to be a good fit for this task, and its stability, strong community, and helpful tooling have made it easy for us to incorporate it into our operations. While we don’t have any plans to move our entire infrastructure to Kubernetes, we expect to expand our use of it as a scheduled job runner.
We use Datadog to monitor our application and infrastructure. It connects to Slack and PagerDuty to alert us in the event of a problem. We use an ELK stack for aggregating logs, with the Amazon Elasticsearch Service managing the search cluster.
So there you have it – a quick tour of the tools and processes we use to build and manage Stitch. If you have any tips for us, we'd love to hear from you.
Opinions expressed by DZone contributors are their own.
Exploring the Capabilities of eBPF
Java Concurrency: Condition
Grow Your Skills With Low-Code Automation Tools
4 Expert Tips for High Availability and Disaster Recovery of Your Cloud Deployment