Lessons Learned From Building Production-Scale Data Conversion Pipelines

Converting large-scale enterprise data between systems is less about perfection than about making the right tradeoffs and engineering for scale and flexibility.

Charles Wong

Sep. 25, 25 · Analysis

Likes (0)

Comment

Save

1.4K Views

Building production-scale data pipelines usually involves wrangling outputs from multiple legacy systems. Whether you’re trying to build out business intelligence use cases, handle a system migration, or lay the foundations for a new data warehouse, chances are high that you’ll have to normalize and integrate the outputs of multiple systems that were never designed to talk to one another.

Recently, we built a production-scale data pipeline converting one data set from one enterprise system (Health Information Exchanges) to be used as an input into another (a claims-powered risk stratification algorithm). Although these two formats fundamentally represented the same underlying event (clinical encounters), the two systems spoke completely different “languages” — different coding standards, field definitions, and expectations about what was required. The goal was not a one-off ETL script, but a reusable, production-ready pipeline that downstream applications could rely on.

Even though this project was healthcare-specific, what we learned applies to anyone building large-scale data conversion or migration projects, regardless of industry or vertical.

1. Embrace Intentional Imperfection

The first of many harsh truths we encountered: trying to map 100% of fields with complete losslessness is a moot cause.

Legacy enterprise systems evolve in silos, generally in incremental updates over decades. One might use free-text labels, while another requires enumerated codes to represent the same concept. Trying to build a lossless, universal mapping between different systems is not only impractical, it’s counterproductive.

Instead, in this project, we needed to decide where precision mattered — and where it didn’t. In our pipeline, one of the most important fields in our input system was diagnosis codes (e.g., indicating that a patient was diagnosed with lung cancer, or type II diabetes, for example). There was one problem — our input data source used a format to represent diagnoses that was less granular than what our output system demanded, which meant that if we wanted to produce a valid output, we’d need to make assumptions on which output diagnosis we needed to map the input to.

We ended up determining that usability was more important than precision - and mapped each higher-level diagnosis code to the most commonly observed granular counterpart. This was a necessarily imprecise assumption, yet a common problem and pattern that arises in many data conversion projects, even outside of healthcare. Think of mapping bills of materials codes in warehousing or inventory management, transportation codes in logistics systems, etc.

This approach won’t satisfy perfectionists, but it’s what made our output usable in production. One caveat — these approaches require that we acknowledge the imperfection of what we’re building, and document exactly which parts of our output had imprecision intentionally introduced — this would later help us contextualize the value of the output when put into downstream business logic.

As the famous statistician George Box once said: “All models are wrong, but some are useful.” — it wouldn’t be too far-fetched to apply this to the tradeoffs we have to make in data engineering, either.

2. Apply the 80/20 Rule Relentlessly

When performing migrations or converting data between systems, it’s tempting to aim for completeness: every field mapped, every edge case handled. But even in the rare cases where this is possible, this may not be a desirable approach, especially in a world of competing priorities and limited engineering resources.

We learned quickly that the most expedient and high-value way to run this project was to work backwards from what our business logic and the downstream system demanded. In hindsight, healthcare claims data was the perfect domain for us to learn this lesson. Claims data is sprawling — there are hundreds of fields representing different nuances of healthcare claims — procedure codes, billing codes, revenue codes, etc. This forced us to the negotiating table to prioritize which fields were actually going to be used, let alone important, in our downstream business logic. Because the downstream system was a machine learning model in itself, it also forced us to interrogate which fields actually carried any (literal) weight in making a prediction.

In our case, we learned that our downstream model only cared about a handful of critical fields. Rather than boil the ocean, we then doubled down on those and de-prioritized the rest. However, we didn’t just stop at the column-level. Once we identified the columns that mattered, we also realized quickly that a small minority of rows didn’t obey the purported schema:

CPT (procedure) codes that were supposed to be 5 digits long, but were sometimes much longer
Location codes that were zero-padded
Timestamp fields that weren’t encoded the same way (some were timezone-naive, some timezone-aware)

Within these inconsistencies, we had to reapply the same prioritization logic. In this sense, it meant figuring out which problems were worth solving and how much time and effort to put into solving each problem. For example, zero-padding was easy to solve with a substring call, but understanding when and how our procedure codes were arbitrarily longer than what they were supposed to be was not, especially when there was no discernible pattern, and this problem only affected <1% of the rows in question. We continued to make ruthless prioritization decisions, allowing us to deliver the project on time.

This follows the classic Pareto principle (also known as the 80/20 rule): 20% of the fields drive 80% of the value.

3. Engineer for Scale From Day One With Vectorization

Even with a narrowed scope, large-scale data pipelines often quickly reach production scale. In our case, our claims dataset very quickly reached millions of rows once we joined multiple tables. At this scale, logic that appears to run seamlessly on local machines with test datasets often ends up stuttering and staggering along in production.

An important practice that helped us scale from day one was always opting for vectorization over iteration where possible.

Row-wise operations (for loops, and row-wise functions like .apply()) are easy to understand and write, but end up being performance killers at scale. Instead, prioritize using vectorized dataframe operations or database joins where possible. We had the perfect example of this challenge when, as part of downstream business logic, we wanted to create a derived column on our claims dataset indicating how much time (in days) had passed between this claim and the previous claim for the same patient. While we could have written a row-wise function to search the dataframe for the previous claim for the same patient, this would have taken an enormous amount of time (and therefore compute) to run.

Instead, we opted for a smarter, more streamlined approach - we used a window function to number each patient’s claims, “lagged” the timestamp column for each patient by one position, and then just did a column-wise datediff between the two timestamp columns.

    Python
   
 

   import pandas as pd

# --- Example input, generalized from our claims dataset ---
df = pd.DataFrame({
    "patient_id": [101, 101, 102, 101, 102],
    "claim_id":   [  1,   2,   3,   4,   5],
    "claim_ts":   pd.to_datetime([
        "2024-01-10 09:00",
        "2024-01-12 15:00",
        "2024-01-11 08:00",
        "2024-02-01 10:30",
        "2024-01-20 12:00"
    ])
})

# --- Example vectorized window approach ---
# 1) Sort by patient + time (basically a window function for those more comfortable with SQL)
df = df.sort_values(["patient_id", "claim_ts"], kind="stable")

# 2) Number the claims per patient and "lag" to get previous timestamp via shift
df["claim_seq"] = df.groupby("patient_id").cumcount() + 1
df["prev_claim_ts"] = df.groupby("patient_id")["claim_ts"].shift(1)

# 3) Column-wise time difference in DAYS
df["delta_since_prev_days"] = (df["claim_ts"] - df["prev_claim_ts"]).dt.days

  

This may look like an implementation detail, but it represents an important design decision-making philosophy when it comes to building enterprise pipelines.

Conclusion

Building production-scale data conversion pipelines may not be glamorous work, but the messy tradeoff decisions and little data handling tricks represent foundational steps that are important for downstream business use cases. In sharing some of our lessons in the journey we took to convert HIE data into claims, I hope you can make use of some of these learnings in building your own enterprise data pipelines.

Data (computing) Pipeline (software) Production (computer science)

Opinions expressed by DZone contributors are their own.

Related

Trending