Essential Guidelines for Building Optimized ETL Data Pipelines in the Cloud With Azure Data Factory

Discover how to optimize Azure Data Factory pipelines by selecting the right activities and managing resources for reliable data processing.

By Aravind Nuthalapati · Aug. 14, 24 · Analysis

When you build ETL data pipelines with Azure Data Factory (ADF) to process large volumes of data from different sources, you will often run into performance and design challenges. This article is a guide to building high-performance ETL pipelines that are both efficient and scalable.

Below are the major guidelines to consider when building optimized ETL data pipelines in ADF:

Designing Pipelines

Activity Type Selection

ADF provides many activity types for orchestrating data, so choose the ones that best fit your business requirement. For example, when copying data from an on-premises Oracle source to an on-premises SQL Server sink with many tables, a combination of Lookup, ForEach, and Copy activities gives an optimal pipeline design, as sketched below.
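
As a rough illustration of this pattern (the dataset names OracleTableDs and SqlServerTableDs, the tableName parameter, and the Oracle query are placeholders, not part of the original article), a Lookup feeding a ForEach that wraps a Copy activity looks roughly like this in pipeline JSON:

```json
{
  "name": "CopyOracleToSqlServer",
  "properties": {
    "activities": [
      {
        "name": "LookupTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "OracleSource",
            "oracleReaderQuery": "SELECT table_name FROM user_tables"
          },
          "dataset": {
            "referenceName": "OracleTableDs",
            "type": "DatasetReference",
            "parameters": { "tableName": "user_tables" }
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupTableList').output.value",
            "type": "Expression"
          },
          "batchCount": 4,
          "activities": [
            {
              "name": "CopyOneTable",
              "type": "Copy",
              "inputs": [
                {
                  "referenceName": "OracleTableDs",
                  "type": "DatasetReference",
                  "parameters": { "tableName": "@item().TABLE_NAME" }
                }
              ],
              "outputs": [
                {
                  "referenceName": "SqlServerTableDs",
                  "type": "DatasetReference",
                  "parameters": { "tableName": "@item().TABLE_NAME" }
                }
              ],
              "typeProperties": {
                "source": { "type": "OracleSource" },
                "sink": { "type": "SqlServerSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

Here batchCount caps how many per-table copies run in parallel inside the ForEach; tune it to what your source and sink can absorb.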

Linked Services and Datasets

A linked service is a connection to a data source that is created once and reused across multiple pipelines within the same data factory. For easy maintenance, create one linked service per source.

Similarly, datasets sit on top of linked services and describe the data to fetch from the source. Ideally, create a single parameterized dataset per linked service and reuse it across all pipelines in the same data factory; create separate datasets only when the source data formats differ. A sketch of both resources follows.
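
As a hedged sketch of this setup (names such as OnPremSqlServerLs, OnPremShir, and SqlServerTableDs are illustrative placeholders, and exact type properties vary by connector version), the linked service is defined once:

```json
{
  "name": "OnPremSqlServerLs",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Data Source=onprem-sql01;Initial Catalog=staging;Integrated Security=True;"
    },
    "connectVia": {
      "referenceName": "OnPremShir",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

A single dataset then references that linked service and exposes a tableName parameter, so one definition serves every table and every pipeline:

```json
{
  "name": "SqlServerTableDs",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": {
      "referenceName": "OnPremSqlServerLs",
      "type": "LinkedServiceReference"
    },
    "parameters": { "tableName": { "type": "string" } },
    "typeProperties": {
      "tableName": {
        "value": "@dataset().tableName",
        "type": "Expression"
      }
    }
  }
}
```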

Resiliency and Reliability

Pipelines with numerous activities are hard to troubleshoot when a failure occurs. To improve reliability, split complex logic into multiple smaller pipelines and chain them together with dependencies. In addition, intermittent network issues can cause pipeline failures; to avoid manual intervention, enable a retry policy at the activity level so a failed activity reruns automatically, as in the sketch below.
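
A retry policy is set per activity in its policy block; the values below are illustrative, not prescriptive:

```json
{
  "name": "CopyOneTable",
  "type": "Copy",
  "policy": {
    "retry": 3,
    "retryIntervalInSeconds": 120,
    "timeout": "0.02:00:00"
  },
  "typeProperties": {
    "source": { "type": "OracleSource" },
    "sink": { "type": "SqlServerSink" }
  }
}
```

With this policy, a transient failure triggers up to three automatic reruns at two-minute intervals before the activity is marked as failed.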

Scheduling Using Triggers

In a big data environment, you often need to develop and schedule many pipelines that process large volumes of data in batches. To optimize resource utilization, stagger pipeline schedules where possible. When business requirements demand processing within a strict SLA, scale up the Integration Runtime (IR) and the source/sink resources instead so that capacity is used effectively. A trigger sketch follows.
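
Staggering schedules comes down to giving each pipeline's trigger a different start time and recurrence. A minimal schedule trigger sketch (the trigger and pipeline names are placeholders) looks like this:

```json
{
  "name": "NightlyCopyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-08-01T01:30:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyOracleToSqlServer",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```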

Expressions and Variables

Leverage expressions and variables in pipelines to implement dynamic content that adapts to changing conditions without manual intervention and to branch workflows, which optimizes pipeline and resource usage across activities. For example, expressions in If Condition and Switch activities direct execution based on conditions, while variables capture values produced by Lookup and ForEach activities and pass them to subsequent activities, as in the sketch below.
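
As a hedged example (it assumes a pipeline variable named loadDate and a preceding Lookup activity named LookupRowCount, neither of which appears in the original article), an If Condition activity can branch on a Lookup result and set a variable for downstream activities:

```json
{
  "name": "LoadOnlyWhenDataExists",
  "type": "IfCondition",
  "dependsOn": [
    { "activity": "LookupRowCount", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "expression": {
      "value": "@greater(int(activity('LookupRowCount').output.firstRow.row_count), 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "SetLoadDate",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "loadDate",
          "value": "@formatDateTime(utcnow(), 'yyyy-MM-dd')"
        }
      }
    ]
  }
}
```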

Performance Optimization Techniques

Using Azure Integration Runtime 

The Azure Integration Runtime (IR) provides a compute environment for your activities that is fully managed by ADF. Transferring large amounts of data requires more compute power, which you adjust through Data Integration Units (DIUs) in the Copy activity settings; allocating more DIUs can significantly reduce activity runtime, as in the sketch below.
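
A hedged example of raising DIUs on a cloud-to-cloud copy (the source and sink types are illustrative; DIUs only apply to copies that run on the Azure IR):

```json
{
  "name": "CopyLargeBlobToSql",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true }
    },
    "sink": { "type": "AzureSqlSink" },
    "dataIntegrationUnits": 32
  }
}
```

Left unset, ADF picks the DIU count automatically; pinning a higher value such as 32 trades cost for shorter runtime on large copies.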

Using Self-Hosted Integration Runtime (SHIR)

A SHIR lets you use your own compute resources for activity execution and is mostly used to connect to on-premises sources. It supports a maximum of four nodes. For large workloads, use high-specification machines across all nodes (ideally the same size and in the same location) to avoid performance bottlenecks such as low throughput and high queue times.

Network Throughput

Transferring huge volumes of data through pipelines requires correspondingly high network throughput. For instance, when connecting on-premises data sources to cloud destinations, consider a dedicated network connection (such as Azure ExpressRoute) for efficient, high-throughput data transfer.

Parallel Reads and Writes

Copying large datasets without parallel reads and writes enabled causes delays in execution. Choose an appropriate partition option and partitioning column in the source and sink settings to enable parallel reads and writes, as in the sketch below.
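
As a sketch (the partition column, bounds, and batch size are placeholders), enabling a dynamic-range partition on the Oracle source together with parallel copies looks like this:

```json
{
  "name": "CopyPartitionedTable",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "OracleSource",
      "partitionOption": "DynamicRange",
      "partitionSettings": {
        "partitionColumnName": "ORDER_ID",
        "partitionLowerBound": "1",
        "partitionUpperBound": "1000000"
      }
    },
    "sink": {
      "type": "SqlServerSink",
      "writeBatchSize": 10000
    },
    "parallelCopies": 8
  }
}
```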

Advanced Techniques

Data Transformations Using External Compute

When transforming data that is stored externally, it is more efficient to push the work down to that external compute. For instance, a Databricks Notebook activity runs the processing on the Databricks cluster itself, which reduces data movement and improves performance; see the sketch below.
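
A hedged sketch of such a push-down (the DatabricksLs linked service, the notebook path, and the run_date parameter are placeholders):

```json
{
  "name": "TransformInDatabricks",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "DatabricksLs",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/etl/transform_orders",
    "baseParameters": {
      "run_date": "@formatDateTime(utcnow(), 'yyyy-MM-dd')"
    }
  }
}
```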

Cost Optimization

Pipeline cost correlates strongly with the number of activities and transformations and with each activity's runtime, and the majority of the cost comes from Azure IR utilization, so applying the optimization techniques in this guide directly lowers spend. When running data flow activities in parallel or back to back, configure the Azure IR with a time-to-live (TTL) setting so the underlying cluster is reused rather than re-provisioned for every run, as in the sketch below.
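
A hedged sketch of an Azure IR dedicated to data flows with a ten-minute TTL (the name, core count, and TTL value are illustrative):

```json
{
  "name": "DataFlowAzureIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```

With the TTL set, consecutive data flow runs reuse the warm cluster instead of paying the startup cost each time.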

Security Considerations

During pipeline execution, data movement typically occurs over public networks. To enhance security, enable ADF managed private endpoints (which require activities to run on a managed virtual network Integration Runtime) so that connections to your sources and sinks stay private; a sketch follows.
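
Assuming the factory uses a managed virtual network IR, a managed private endpoint to a storage account is a small resource of its own; the resource ID below is a placeholder:

```json
{
  "name": "BlobStorageMpe",
  "properties": {
    "privateLinkResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
    "groupId": "blob"
  }
}
```

The endpoint must still be approved on the storage account side before traffic flows through it.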

Operational Best Practices

Data Lineage

Maintaining an audit trail of where each record came from as data moves through pipelines is an important best practice. Integrating ADF with Microsoft Purview lets you track data lineage and document the origin of data in a data catalog for future reference.

Fault Tolerance

During pipeline execution, data compatibility issues can arise and cause activities to fail. Implement fault tolerance by logging incompatible records so they can be corrected and reprocessed later while compatible data continues to load. This is configured at the Copy activity level by redirecting incompatible rows to a storage account, as in the sketch below.
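
A hedged sketch of that configuration (the error-log linked service and path are placeholders): incompatible rows are skipped and written to blob storage while the rest of the copy continues.

```json
{
  "name": "CopyWithFaultTolerance",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "OracleSource" },
    "sink": { "type": "SqlServerSink" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
      "linkedServiceName": {
        "referenceName": "ErrorLogStorageLs",
        "type": "LinkedServiceReference"
      },
      "path": "copy-errors/incompatible-rows"
    }
  }
}
```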

Backup and Restore Pipelines

To safeguard your ADF pipelines, integrate the data factory with GitHub or Azure DevOps. This ensures that pipelines can be restored from source control if the ADF resource is accidentally deleted.

Note that resources configured as sources or sinks in the pipelines must be restored separately; Git integration restores only the pipeline definitions, not the underlying resources. For instance, if an Azure Blob Storage account used as a source is deleted, it has to be recovered on its own. The Git configuration itself lives on the factory resource, as sketched below.
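
Git integration is usually set up through ADF Studio, but the resulting configuration is stored on the factory resource; a hedged sketch with placeholder account, repository, and branch names:

```json
{
  "name": "my-data-factory",
  "properties": {
    "repoConfiguration": {
      "type": "FactoryGitHubConfiguration",
      "accountName": "my-github-org",
      "repositoryName": "adf-pipelines",
      "collaborationBranch": "main",
      "rootFolder": "/"
    }
  }
}
```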

Monitoring and Alerts

Setting up monitoring for pipeline resources is critical for proactively catching and preventing issues before they escalate. Monitor pipeline runs and integration runtime utilization, and set up alerts that notify you by email or phone.

Conclusion

Many components of a data pipeline need to be considered and planned before development begins. These best practices help you design each component correctly so that your pipelines are high-performing, reliable, cost-effective, and scalable enough to handle growing business needs.

Tags: Extract, Transform, Load (ETL) · Azure · Cloud Optimization · Pipeline (Software)

Opinions expressed by DZone contributors are their own.
