Solving Data Integration at Stitch Fix
A robust data integration solution is important. See how Stitch Fix structured their Data Highway solution and logging infrastructure.
Join the DZone community and get the full member experience.Join For Free
About a year ago, my colleagues and I began work on vastly improving the way data is collected and transported at Stitch Fix. We are members of the infrastructure cohort of the Data Platform team — the engineers that support Stitch Fix’s army of data scientists — and the solution we built was a self-service data integration platform tailored to the needs of the data scientists. It was a total redesign of much of our data infrastructure, and it is revolutionizing the way data scientists interact with our data. This article is about:
Why data integration is an important capability that data-driven companies should be thinking about.
How our platform solved our data integration needs.
An overview of how we built the platform itself.
Before delving in, I’ll define data integration (sometimes known as data flow) as the mechanisms that transport data from A to B, such as batch load jobs, REST APIs, message buses, or even rsync. This definition of data integration encompasses many things that are not often considered data integration mechanisms, but by defining this broad category of fundamentally similar things, it’s possible to devise singular solutions to solve many disparate problems. More on this later.
From the outset of the project, one requirement was very clear: fix the logging, as in, make it easier to collect and use our log data. How did we start with the requirement of improving our logging and end up with a data integration platform? To answer this, I defer to Jay Kreps. In his masterpiece, I 3 Logs, he talks in depth about logs and their myriad of uses, specifically:
Data integration: Making all of an organization’s data easily available in all its storage and processing systems.
Real-time data processing: Computing derived data streams.
Distributed system design: How practical systems can be simplified with a log-centric design.
Centralized logging infrastructure is crucial for achieving all of this. However, logging infrastructure alone is not useful to data scientists —deploying a Kafka cluster and calling it a day would do them little good. After many discussions, we decided to build new logging infrastructure and build a data integration platform on top of it to deliver the most value to the data scientists.
By this point, we knew what we wanted to build. We still had a pile of technologies to research and months of work ahead of us, but we had decided on a name: The Data Highway.
Designing the Data Highway
First, we decided on three goals for the Data Highway:
Simplify, scale up, and centralize the logging infrastructure.
Provide a robust, flexible, and extensible data integration platform.
Allow for self-service and ease of use for data scientists
I’ll delve into each goal to clarify what we wanted to accomplish.
Standardize the Logging Infrastructure
My team had already experimented with a few different approaches to logging. The logging infrastructure we were currently running was built using AWS Kinesis and Lambda functions, and it was developing some issues: it was difficult to integrate with non-AWS products, the lambda functions were proliferating and becoming too burdensome to maintain, we were running into throttling issues with Kinesis, and the system was too complex to allow for self-service for the data scientists. Furthermore, various ad hoc solutions to logging had sprouted up across the org, and we had no best practices around how to log. Replacing the legacy logging infrastructure and cleaning up logging tech debt were priority number one.
Extensible Data Integration
We wanted to make it trivial to get data to and from any application or system within Stitch Fix. How? By providing a centralized, standardized, robust, and extensible mechanism for transporting data from A to B.
To illustrate how this would work, first imagine you are a data scientist and you need data from the customer-facing landing page. Without a standard way to get that data, you’d probably build a one-off solution to transport the data. As companies grow, these one-off solutions proliferate. Eventually, this becomes a tangled mess that is exponentially more burdensome to maintain as more and more systems and applications are developed within the company.
The way out of this mire is to implement a centralized pipeline for transporting data. In this world, anything that produces or consumes data simply integrates with the main pipeline. This solution involves fewer connectors, and since each connector resembles one another, it’s possible to develop an extensible abstraction layer to handle the integrations with the central pipeline.
This abstraction layer is what we strove to build with our data integration platform. Each connector would run together in the same infrastructure. They’d share common code and would be deployed, monitored, and maintained together. When new kinds of systems are rolled out, only one kind of connector would need to be developed to integrate the new system with the rest of the company. This cohesion of data makes it possible to have a high-level view of all the data within a company and makes mechanisms like data validation, stripping PII, streaming analytics, and real-time transforms much easier to build. This is tremendously valuable, particularly for data-driven companies.
Self-Service and Easy to Use
This is one of the central tenets of the data org at Stitch Fix. Our data scientists are wholly responsible for doing everything from data collection to cleaning, ETL, analysis, visualization, and reporting. We data platform engineers are never directly involved in that work. Instead, we focus on developing tooling to streamline and abstract the engineering elements of the data scientists’ work. We are like a miniature AWS. Our products are finely crafted data science tools, and our customers areStitch Fix’s data scientists. This affords the data scientists complete autonomy to deliver impactful projects, and the data science teams can scale and grow without being limited by the cycles of the engineers. This relationship is harmonious and enables the modestly sized 20+ person data platform team to support all 80+ data scientists with relative ease.
So, to summarize, the Data Highway needed to be a centralized data integration platform backed by resilient logging infrastructure. It also needed to be simple and 100% self-service. Lastly, it needed to be robust, so the data scientists couldn’t break it. This sounds daunting, but it was surprisingly achievable thanks to recent advancements in logging and data integration infrastructure.
The Data Highway would have three major components:
The underlying logging infrastructure.
The abstraction layer to create the data integration platform.
The interface that data scientists would interact with.
I’ll describe, on a high level, how each component is implemented and briefly discuss why we chose the technologies that we did.
The Logging Infrastructure
Currently, the most frequently adopted solutions for logging infrastructure are Kafka, Kinesis, and Google PubSub. We ruled out GooglePubSub because integrating our AWS-based stack with a GCP product would be too costly. That left Kinesis vs. Kafka.
In the end, we chose Kafka despite the up-front cost of deploying and maintaining our own Kafka infrastructure. Although the ease of simply clicking a few buttons in Kinesis was tempting, the scales tipped in favor of Kafka because 1) we wanted something that would integrate well with AWS products and non-AWS products alike, 2) we wanted to improve our involvement in the open-source community, 3) we wanted fine-grained control over our deployment, and 4) as far as data infrastructure goes,Kafka is quite mild-mannered, so we were optimistic that the maintenance burden would be manageable.
The Data Integration Layer
For the data integration layer on top of Kafka, we considered a few options, including Apache Beam, Storm, Logstash, and Kafka Connect.
Apache Beam is a Swiss army knife for building data pipelines, and it has a feature called Pipeline I/O, which comes with a suite of built-in “transforms.” This was close to what we wanted, but Beam, at the time, was still a very young project, and it is more oriented towards transformations and stream processing. The data integration features seemed like an add-ons feature.
Apache Storm was a viable option, but it would be tricky to build the self-service interface on top of it, since the data flow pipelines have to be written as code, compiled, and deployed. A system that relies on configuration as code would be better.
Logstash is handy because it has so many integrations — more than any other option we looked at. It’s lightweight and easy to deploy. For many years, Logstash lacked important resiliency features, such as support for distributed clusters of workers and disk-based buffering of events. As of version 6.0, both of these features are now implemented, so Logstash could work well for a data integration layer. Its main drawback for our use case is that the data flows are configured using static files written inLogstash’s (somewhat finicky) DSL. This would make it challenging to build the self-service interface.
Finally, we investigated Kafka Connect. Kafka Connect’s goal is to make the integration of systems as easy, resilient, and flexible as possible.It is built using the standard Kafka consumer and producer, so it has auto load balancing, it’s simple to adjust processing capacity, and it has strong delivery guarantees. It can run in a distributed cluster — so failovers are seamless. It is stateless because it stores all its state in Kafka itself. There is a large array of open-source connectors that can be downloaded and plugged into your deployment. If you have special requirements for a connector, the API for writing your own connectors is simple and expressive, making it straightforward to write custom integrations.The most compelling feature is Kafka Connect’s REST interface for configuring and managing connectors. This provided an easy way for us to build our self-service interface. So, we chose Kafka Connect to power the data integration layer, and it has worked well for us.
There were still a few gaps left to fill for this layer of the Data Highway. Kafka Connect works great when it is pulling data from other systems. Some of our data is push-based, however, like data sent using HTTP and syslog. So, we wrote lightweight services dedicated to ingesting that data. Also, some of the open-source Kafka Connect connectors didn’t quite fit our use cases, so we wrote a couple of our own custom connectors.
One of the more challenging aspects of this piece of the Data Highway was monitoring and operational visibility. To solve this, we built a mechanism that sends special tracer bullet messages into every source that the Data Highway ingests data from. The tracer bullets then flow to every system the Data Highway can write data to. The tracer bullets flow continuously, and alerts fire when they are delayed. The tracer bullets provide a good high-level overview of the health of the system and are particularly helpful in our staging environment.
The Interface for Data Scientists
The last thing to build was the Data Manager and Visualizer (a.k.a. the DMV, pun intended), which would be the self-service interface for data scientists to interact with the platform. We have a skilled frontend engineer on the data platform who is dedicated to building GUIs for our tools. Working together, we built the DMV, which shows every topic in the Data Highway, displays sample events per topic, and provides web forms for setting up new sources of data or configuring new sinks for the data to flow to.
In theory, the DMV could directly call the Kafka Connect APIs, but we chose to build a small abstraction layer, called CalTrans, in between KafkaConnect and the DMV. CalTrans converts simple and expressive RESTrequests into the bulky requests that Kafka Connect needs. CalTranshandles environment-specific default configs and hardcodes things that make sense to be configurable within Kafka Connect but that are static for our use cases. Lastly, CalTrans can talk to other systems besides Kafka Connect, so if some validation needs to be performed or some kind of miscellaneous administrative task needs to be completed upon configuring connectors, CalTrans performs that logic.
Hopefully, this has piqued your interest in data integration and given you some ideas for how to build your own data integration platform. Since building this system, I’ve developed an intense curiosity for these kinds of standardized integration systems and I see them everywhere now: USB ports, power outlets, airlines, DNS, train tracks, even systems of measurement. They are all simple, often regulated standardizations that enable many disparate things to interoperate seamlessly. Imagine how much potential could be unlocked, particularly by the hands of data scientists, if all the data a company had was as convenient to access as a power outlet? I believe proper data integration is the secret to unlocking this potential, and at Stitch Fix, this revolution is coming to fruition.
Published at DZone with permission of Liz Bennett. See the original article here.
Opinions expressed by DZone contributors are their own.