Best Practices for Building the Data Pipelines

Using best practices for building data pipelines will significantly improve data quality and reduce the risk of pipeline breakage.

By Priyanka Kadiyala · Dec. 17, 23 · Analysis


In my previous article, 'Data Validation to Improve Data Quality', I shared the importance of data quality and a checklist of validation rules to achieve it. Those validation rules alone, however, may not guarantee the best data quality. In this article, we focus on the best practices to employ while building data pipelines to ensure data quality.

1. Idempotency

A data pipeline should be built so that running it multiple times does not duplicate data. Likewise, when a failure is resolved and the pipeline is rerun, there should be no data loss or improper alterations. Most pipelines are automated and run on a fixed schedule. By capturing the logs of previous successful runs, such as the parameters passed (date range), the count of records inserted/modified/deleted, the timespan of the run, etc., the parameters of the next run can be set relative to the previous successful run. For example, if a pipeline runs every hour and a failure happens at 2 pm, the next run should automatically pick up the data from 1 pm onward, and the timeframe should not be incremented until the current run succeeds.
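As an illustration, here is a minimal Python sketch of a watermark-driven run. The state file and the caller-supplied extract/load functions are hypothetical placeholders; the key idea is that the watermark only advances after a successful load, so a failed run is retried over the same window, and an upsert-style load keeps reruns from duplicating rows.

# Minimal sketch of an idempotent, watermark-driven run. The state file,
# extract() and load() are hypothetical placeholders supplied by the caller.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

STATE_FILE = Path("pipeline_state.json")  # stores the last successful watermark

def load_watermark() -> datetime:
    if STATE_FILE.exists():
        return datetime.fromisoformat(json.loads(STATE_FILE.read_text())["watermark"])
    # First run: default to one hour ago
    return datetime.now(timezone.utc) - timedelta(hours=1)

def save_watermark(ts: datetime) -> None:
    STATE_FILE.write_text(json.dumps({"watermark": ts.isoformat()}))

def run_pipeline(extract, load) -> None:
    start = load_watermark()       # resume from the last successful run
    end = datetime.now(timezone.utc)
    records = extract(start, end)  # pull only the incremental window
    load(records)                  # upsert by key so reruns don't duplicate rows
    save_watermark(end)            # advance the watermark only after success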

2. Consistency

In some cases where data flows from an upstream to a downstream database, if the pipeline ran successfully but did not add/modify/delete any records, the next run should cover a bigger time frame that includes the previous run's window, to avoid any data loss. This helps maintain consistency between the source and target databases when data lands in the source with some delay. Continuing the example above, if a pipeline ran successfully at 2 pm and did not add/modify/delete any records, the next run at 3 pm should fetch the data from 1 pm-3 pm instead of 2 pm-3 pm.
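A small sketch of how the extraction window could be widened, assuming a hypothetical run_history list that records each past run's window start and the number of records it changed; consecutive empty runs stretch the next window backwards.

# Sketch of widening the extraction window over empty runs; run_history is a
# hypothetical list of past run metadata, newest entry last.
from datetime import datetime, timedelta, timezone

def next_window(run_history: list[dict], interval: timedelta = timedelta(hours=1)):
    """Return (start, end) for the next run, stretching back over empty runs."""
    end = datetime.now(timezone.utc)
    start = end - interval
    # Walk backwards over consecutive runs that moved no records
    for run in reversed(run_history):
        if run["records_changed"] == 0:
            start = run["window_start"]
        else:
            break
    return start, end

# Example: the previous run covered 1 pm-2 pm and changed nothing,
# so the next run starts from 1 pm instead of 2 pm.
history = [{"window_start": datetime(2023, 12, 17, 13, tzinfo=timezone.utc),
            "records_changed": 0}]
print(next_window(history))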

3. Concurrency

If a pipeline is scheduled to run frequently within a short timeframe and the previous run takes longer than usual to finish, the next scheduled run might be triggered while the previous one is still in progress. This causes performance bottlenecks and inconsistent data. To prevent concurrent runs, the pipeline should check whether the previous run is still in progress and raise an exception or exit gracefully if a parallel run is detected. If there are dependencies between pipelines, they can be managed using Directed Acyclic Graphs (DAGs).
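One simple way to guard against overlapping runs is an atomic lock file, as in the sketch below; the lock path and the run() callable are placeholders, and an orchestrator would typically provide this guard for you.

# Minimal sketch of preventing overlapping runs with a lock file; the lock
# path and the run() callable are placeholders for a real pipeline.
import os
import sys

LOCK_FILE = "/tmp/my_pipeline.lock"

def run_exclusively(run) -> None:
    try:
        # O_EXCL makes creation atomic: it fails if another run holds the lock
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("Previous run still in progress; exiting gracefully.")
    try:
        os.write(fd, str(os.getpid()).encode())
        run()                    # do the actual pipeline work
    finally:
        os.close(fd)
        os.remove(LOCK_FILE)     # release the lock even if run() raised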

4. Schema Evolution

As source systems continue to evolve with changing requirements or software/hardware updates, the schema is subject to change, which might cause the pipeline to write data with inconsistent data types or to add or modify fields. To avoid pipeline breaks or data loss, it is a good strategy to compare the source and target schemas and, if there is a mismatch, add logic to handle it. Another option is to adopt a schema-on-read approach instead of schema-on-write. Modern tools like Upsolver SQLake allow the pipeline to adapt dynamically to schema evolution.
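The mismatch check could look something like the sketch below, where the source and target schemas are represented as hypothetical {column: type} dictionaries and the output is a list of actions to apply to the target.

# Hedged sketch of a schema check before loading; the schema dictionaries and
# the ALTER-style action strings are illustrative assumptions.
def reconcile_schemas(source: dict[str, str], target: dict[str, str]) -> list[str]:
    """Return the actions needed to keep the target schema in sync."""
    actions = []
    for column, dtype in source.items():
        if column not in target:
            actions.append(f"ADD COLUMN {column} {dtype}")
        elif target[column] != dtype:
            actions.append(f"ALTER COLUMN {column} TYPE {dtype}")
    # Columns dropped upstream are kept in the target to avoid data loss
    return actions

source = {"id": "BIGINT", "name": "TEXT", "created_at": "TIMESTAMP"}
target = {"id": "BIGINT", "name": "VARCHAR(50)"}
print(reconcile_schemas(source, target))
# ['ALTER COLUMN name TYPE TEXT', 'ADD COLUMN created_at TIMESTAMP']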

5. Logging and Performance Monitoring

If there are hundreds or thousands of data pipelines, it is not feasible to monitor every single pipeline every day. Using tools to log and monitor performance metrics in real time, and setting up alerts and notifications, helps foresee issues and resolve them on time. This also helps address issues related to abnormally high or low data volumes, latency, throughput, resource consumption, performance degradation, and error rates, all of which eventually impact data quality.
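A lightweight sketch of emitting per-run metrics with Python's standard logging module; the metric names and the minimum-volume threshold are illustrative assumptions that an external monitoring tool could alert on.

# Sketch of logging per-run metrics; run() is assumed to return the number of
# rows processed, and expected_min_rows is an illustrative alert threshold.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def run_with_metrics(run, expected_min_rows: int = 1000):
    start = time.monotonic()
    rows = run()
    duration = time.monotonic() - start
    log.info("rows=%d duration_s=%.1f throughput=%.1f rows/s",
             rows, duration, rows / duration if duration else 0.0)
    if rows < expected_min_rows:
        log.warning("Abnormally low volume: %d rows (expected >= %d)",
                    rows, expected_min_rows)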

6. Timeout and Retry Mechanism

If the pipeline makes API calls or sends and receives requests over the network, there can be issues such as slow or dropped connections, packet loss, etc. Adding a timeout period for each request and a retry mechanism with sensible limits helps keep the pipeline from hanging in a never-ending state.
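For example, here is a sketch of a per-request timeout plus bounded retries with exponential backoff, assuming the third-party requests library is available; the URL, retry count, and backoff schedule are illustrative.

# Minimal sketch of a timeout plus bounded retries with exponential backoff.
import time
import requests

def fetch_with_retry(url: str, retries: int = 3, timeout_s: float = 10.0):
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)  # fail fast on slow connections
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            if attempt == retries:
                raise                                    # give up after the last attempt
            wait = 2 ** attempt                          # exponential backoff: 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)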

7. Validation

Validation plays a key role in measuring data quality by verifying that the data meets predefined rules and standards. Incorporating validation rules into the data pipeline at each stage, such as extraction, transformation, and loading, ensures integrity, reliability, and consistency and enhances data quality.
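A row-level validation sketch is shown below; the fields and rules are examples for illustration, not the checklist from the earlier article.

# Sketch of row-level validation rules applied before loading; the field names
# and rules are hypothetical examples.
from datetime import datetime

RULES = {
    "id":         lambda v: isinstance(v, int) and v > 0,
    "email":      lambda v: isinstance(v, str) and "@" in v,
    "created_at": lambda v: isinstance(v, datetime),
}

def validate(record: dict) -> list[str]:
    """Return the names of fields that violate a rule (empty list = valid)."""
    return [field for field, rule in RULES.items()
            if field not in record or not rule(record[field])]

bad = validate({"id": -1, "email": "not-an-email"})
print(bad)  # ['id', 'email', 'created_at']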

8. Error Handling and Testing

Error handling can be done by anticipating exceptions, potential failure scenarios, and edge cases that could cause the pipeline to break and handling them in the pipeline to avoid breakage. Another important phase of building a data pipeline is testing. A series of tests, such as unit tests, integration tests, load tests, etc., can be performed to ensure all blocks of the pipeline work as expected and to give an idea of the data volume limits.
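An illustrative unit test for a hypothetical transform step, runnable with pytest; it covers a happy path and an edge case where the input is empty.

# Illustrative unit tests for a hypothetical transform step (run with pytest).
import pytest

def normalize_amount(raw: str) -> float:
    """Transform step: strip currency symbols and convert to float."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    if not cleaned:
        raise ValueError("empty amount")
    return float(cleaned)

def test_normalize_amount_happy_path():
    assert normalize_amount("$1,234.50") == 1234.50

def test_normalize_amount_edge_case():
    with pytest.raises(ValueError):
        normalize_amount("   ")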

Data pipelines, whether batch or streaming, can be built using different languages and tools, and there is a vast set of tools offering different capabilities. It is a good idea to analyze the complete requirements of your use case, along with the functionalities and limitations each tool offers, and choose the right platform based on your needs. Regardless of the choice, the best practices mentioned above come in handy when building, monitoring, and maintaining data pipelines.

Tags: Data, Data Pipelines, Data Quality, Data Validation, Testing Data Pipelines, Batch Pipelines, Streaming Pipelines, Data Consistency, Data Reliability, Data Integrity, Data Scalability, ETL, ELT, Data Schema, Idempotency, Logging, Performance Monitoring.


Opinions expressed by DZone contributors are their own.
