Why You Should NOT Build Your Data Pipeline on Top of Singer
Can we leverage Singer to programmatically send data from any of their supported data sources (taps) to any of their supported data destinations?
Singer.io is an open-source CLI tool that makes it easy to pipe data from one tool to another. At Airbyte, we spent time determining if we could leverage Singer to programmatically send data from any of their supported data sources (taps) to any of their supported data destinations (targets).
For the sake of this article, let’s say we are trying to build a tool that can do the following:
Run any Singer tap or target
Provide a UI for configuring and running those taps and targets
Count the number of records synced in each run
In the context of these goals, being able to use Singer programmatically means writing a program that can, for any integration:
Provide a UI with instructions on what information a user needs to input to configure that integration (e.g., host, password).
Take those user-provided values and execute each integration.
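As a rough illustration of the record-counting goal, here is a minimal sketch. It relies on the fact that Singer taps write newline-delimited JSON messages to stdout and that data rows arrive as messages with "type" set to "RECORD" and a "stream" name. The function name count_records and the pass-through design are our own assumptions for illustration, not part of Singer.

```python
import json
from typing import Dict, Iterable, Iterator


def count_records(lines: Iterable[str], counts: Dict[str, int]) -> Iterator[str]:
    """Pass tap output through unchanged while counting RECORD messages per stream.

    Hypothetical helper: it would sit between a tap's stdout and a target's stdin.
    """
    for line in lines:
        stripped = line.strip()
        if stripped:
            message = json.loads(stripped)
            if message.get("type") == "RECORD":
                stream = message.get("stream", "<unknown>")
                counts[stream] = counts.get(stream, 0) + 1
        # Forward every line untouched so the target sees exactly what the tap sent.
        yield line
```

Because the generator forwards each line as-is, the counts could be collected without changing what the target receives.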
We know that these requirements are not the use case that Singer sets out to solve, but we wanted to see if we could leverage Singer to bootstrap building such a tool. Sure enough, we ran into some “gotchas” along the way. These gotchas illustrate some of the core primitives that a programmatic data integration tool requires.
Integrations Do Not Declare Their Configurations
The Singer protocol does not specify how an integration should declare the inputs it requires. This means that to use most Singer taps, you need to scour the entire implementation to figure out which properties it uses; depending on the integration's complexity, this can be pretty painful.
Some integrations help out by specifying what the configuration should look like in a readme or a sample config. Even these lead to headaches. They often list the fields that need to be passed in but do not explain what the fields mean, what format they take, or how to find the values (good luck trying to find all the information you need to configure your Google Ads integration!). In other cases, they list only a subset, and you have to discover the rest by reading the integration's code (e.g., tap-salesforce didn’t mention is_sandbox in the docs. UPDATE: someone has now added this field in the readme with this PR).
These taps are great, and we have happily used all of them. But because they do not specify what is required to configure them, they can’t be used programmatically. For example, our program needs to know that the Postgres tap requires the fields hostname and port. Without this specification, the program cannot determine how to build a valid configuration for an integration. This gap is expensive to shim because it requires engineering work for every single integration!
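To make the missing primitive concrete, here is a minimal sketch of what a machine-readable configuration declaration could look like. Singer defines nothing like this; the POSTGRES_SPEC structure, its field descriptions, and the validate_config helper are all hypothetical, and only the hostname and port fields come from the example above.

```python
from typing import Any, Dict, List

# Hypothetical declaration for a Postgres tap. Singer itself has no such spec.
POSTGRES_SPEC: Dict[str, Any] = {
    "required": ["hostname", "port"],
    "properties": {
        "hostname": {"type": str, "description": "Host name of the database server"},
        "port": {"type": int, "description": "Port the database listens on"},
        "password": {"type": str, "description": "Database password"},
    },
}


def validate_config(spec: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    errors: List[str] = []
    for name in spec["required"]:
        if name not in config:
            errors.append(f"missing required field: {name}")
    for name, value in config.items():
        prop = spec["properties"].get(name)
        if prop is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, prop["type"]):
            errors.append(f"{name}: expected {prop['type'].__name__}")
    return errors
```

With a declaration like this, a UI could render one input per property, show its description, and reject an invalid configuration before the integration ever runs.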
No Way To Tell Which Singer Feature Is Compatible With Which Integration
Singer has excellent documentation around its core protocol. It also does a nice job defining the suite of special metadata that it supports. When you start actually using Singer, however, mapping these primitives onto your integrations is difficult. For example, “replication-method” sets whether all the data from the source should be replicated (“full_table”) or just the new or updated data (“incremental”). What is unclear is which taps actually support “incremental,” “full_table,” or both.
Taps do not advertise, in a programmatically consumable way, which of these replication methods they support. Some of them mention it in their documentation, but ultimately that’s insufficient for the type of tool we want to build. So what happens when you request “incremental” from a source that only supports “full_table”? The behavior is undefined. Some taps will throw an error; some will do a full refresh. Either way, from the point of view of the UI-based tool that we are trying to build, this isn’t really usable.
The problem only gets hairier for some of the more niche metadata (e.g., “view-key-properties”). You either need to read the source or try it out and see if the configuration works. This problem is adjacent to the previous section's configuration problem and, similarly, requires a shim for every integration.
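A sketch of the kind of declaration that would remove the guesswork: each integration states which replication methods it supports, and the tool rejects an unsupported request before running anything. The tap names, the DECLARED_METHODS table, and the helper are hypothetical; the method strings follow the spelling used in this article.

```python
from typing import Dict, Set

# Hypothetical capability declarations. Singer taps do not publish these today.
DECLARED_METHODS: Dict[str, Set[str]] = {
    "tap-example-a": {"full_table"},
    "tap-example-b": {"full_table", "incremental"},
}


class UnsupportedReplicationMethod(ValueError):
    """Raised when a tap is asked for a replication method it did not declare."""


def check_replication_method(tap: str, requested: str) -> str:
    """Return the requested method if the tap declares it; otherwise fail up front."""
    supported = DECLARED_METHODS.get(tap)
    if supported is None:
        raise KeyError(f"{tap} has not declared its replication methods")
    if requested not in supported:
        raise UnsupportedReplicationMethod(
            f"{tap} does not support {requested!r}; it declares {sorted(supported)}"
        )
    return requested
```

The point of the design is that the failure becomes defined and happens before the sync starts, instead of depending on how each tap happens to react.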
Singer’s Own Secret Menu
If you’re from the West Coast, you might be familiar with how In-N-Out Burger popularized the “secret” menu in fast food chains. While charming at a drive-thru, secret menus can ruin your data integration.
The Singer protocol has some secret menu items of its own. For example, we were parsing each message that a tap emitted as JSON and checking it against the schema declared in the Singer docs. We wanted to understand exactly which messages were being sent between taps and targets, so we failed loudly whenever anything was sent that did not match the documented message types. Then we started getting errors on “ActivateVersionMessage.” After spelunking in the source code for a bit, we found that this message type has existed in Singer as an experimental feature since 2017. A handful of the official Singer taps use it, but there’s no guidance on what you’re supposed to do with it (I suspect it is a feature used internally at Stitch, the paid, managed solution from the creators of Singer). If you’re building something programmatic on top of Singer, your choice is to filter it out or let it pass and hope that stuff…just works, I guess?
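The two choices can be sketched roughly as follows. The documented message types (RECORD, SCHEMA, STATE) come from the Singer spec; the function name, the strict flag, and the exception class are our own assumptions for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Message types documented in the Singer spec.
DOCUMENTED_TYPES = {"RECORD", "SCHEMA", "STATE"}


class UnknownMessageType(ValueError):
    """Raised in strict mode when a tap emits an undocumented message type."""


def parse_messages(lines: Iterable[str], strict: bool = True) -> Iterator[Dict[str, Any]]:
    """Parse newline-delimited JSON from a tap.

    strict=True fails loudly on undocumented types; strict=False filters them out.
    """
    for line in lines:
        if not line.strip():
            continue
        message = json.loads(line)
        message_type = message.get("type")
        if message_type in DOCUMENTED_TYPES:
            yield message
        elif strict:
            raise UnknownMessageType(f"undocumented message type: {message_type!r}")
        # In lenient mode the unknown message is dropped without comment.
```

Neither branch is satisfying: strict mode breaks on taps that use the experimental message, and lenient mode discards something whose meaning we never learned.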
Handling this one case is not the end of the world, but it leaves you feeling uncertain what else is lurking in the protocol that might not play well with your system.
So to answer our original question, can we reasonably stretch Singer to meet our product requirements? The answer is no. Doing so would require writing custom shims for every single Singer tap and target. Since a data integration tool always needs to scale to more integrations, any work that has to be repeated per integration is very expensive.
The Singer protocol is underspecified for this use case. This realization makes sense because, ultimately, this is not the use case the protocol is trying to solve. Achieving these requirements depends on integrations declaring much more information about how they are configured and which features they support. We are tackling this problem at Airbyte, so if you are looking for an OSS solution that makes it easy to move your data into a warehouse, instead of trying to roll your own on top of Singer, come check us out!
This article is meant to be the first in a pair of articles. The second will explore the engineering journey we took to figure out where Singer should fit into our system.
Published at DZone with permission of John Lafleur. See the original article here.