Top Methods to Improve ETL Performance Using SSIS

Improve ETL performance in SSIS with parallel extraction, optimized transformations, and proper configuration of concurrency, batch sizes, and data types.

By DZone Editorial · Feb. 27, 2025 · Tutorial


Extract, transform, and load (ETL) is the backbone of many data warehouses. In the data warehouse world, data is managed through the ETL process, which consists of three steps: extract—pulling or acquiring data from sources, transform—converting data into the required format, and load—pushing data to the destination, typically a data warehouse or data mart.

SQL Server Integration Services (SSIS) is an ETL tool widely used for developing and managing enterprise data warehouses. Given that data warehouses handle large volumes of data, performance optimization is a key challenge for architects and DBAs.

ETL Improvement Considerations

Today, we will discuss how you can improve ETL performance or design a high-performing ETL system using SSIS. The ten methods below fall into two categories: first, SSIS package design-time considerations, and second, configuring the property values of components within the SSIS package.

SSIS Package Design-Time Considerations

Extract Data in Parallel

SSIS allows data extraction in parallel using Sequence Containers in control flow. By designing a package to pull data from non-dependent tables or files simultaneously, you can significantly reduce overall ETL execution time.

Extract Required Data

Pull only the required set of data from any table or file. Avoid the tendency to retrieve all available data from the source just because you might need it in the future—it consumes network bandwidth, system resources (I/O and CPU), extra storage, and degrades overall ETL performance. If your ETL system is highly dynamic and requirements frequently change, consider other design approaches, such as metadata-driven ETL, rather than pulling everything at once.
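
As a minimal sketch (the table, columns, and watermark parameter are hypothetical), the SQL command of an OLE DB Source following this advice selects only the columns the ETL actually uses and restricts rows to an incremental window, rather than running SELECT * over the whole table:

    -- Pull only the required columns, restricted to recently changed rows.
    SELECT  s.SalesOrderID,
            s.OrderDate,
            s.CustomerID,
            s.TotalDue
    FROM    dbo.SalesOrderHeader AS s
    WHERE   s.ModifiedDate >= ?;  -- watermark parameter mapped in the OLE DB Source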

Avoid the Use of Asynchronous Transformation Components

SSIS is a powerful tool with a variety of transformation components for handling complex tasks during ETL execution. However, improper use of these components can significantly impact performance. SSIS offers two types of transformation components: synchronous and asynchronous.

  • Synchronous transformations process each row as it arrives and pass it directly to the next component or destination. Because every output row corresponds to one input row, they reuse the already-allocated buffer memory and need no additional buffers. Components like Lookup, Derived Column, and Data Conversion fall into this category.
  • Asynchronous transformations must first accumulate rows in buffer memory before they can produce output; a Sort, for example, cannot emit any rows until it has received the entire dataset. While the data accumulates, rows are held in memory and blocked from moving downstream, which is why these are known as blocking transformations. To complete the task, the SSIS data flow pipeline engine allocates extra buffer memory, adding overhead to the ETL system. Components like Sort, Aggregate, Merge, and Merge Join fall into this category.

Overall, you should avoid asynchronous transformations. However, if you have no other choice, you must be aware of how to manage the available property values of these components. We’ll discuss them later in this article.
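
One common alternative, sketched below with hypothetical table and column names, is to push the sort or aggregation down into the source query, where the database engine can use its indexes, instead of using the blocking Sort or Aggregate components. If a downstream component such as Merge Join needs sorted input, you can mark the source output as already sorted (the IsSorted and SortKeyPosition properties in the Advanced Editor) rather than adding a Sort transformation.

    -- Sort and aggregate at the source instead of inside the data flow.
    SELECT   o.CustomerID,
             SUM(o.TotalDue) AS TotalDue
    FROM     dbo.SalesOrderHeader AS o
    GROUP BY o.CustomerID
    ORDER BY o.CustomerID;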

Make Optimum Use of Events in Event Handlers

To track package execution progress or take other appropriate actions on specific events, SSIS provides a set of event handlers. While events are useful, excessive use can add unnecessary overhead to ETL execution. Therefore, it’s important to carefully evaluate their necessity before enabling them in an SSIS package.

Consider the Destination Table Schema When Working With Large Data Volumes

You should think twice when pulling large volumes of data from the source and loading it into a data warehouse or data mart. You may see performance issues when executing a high volume of insert, update, and delete (DML) operations, especially if the destination table has clustered or non-clustered indexes: every DML operation must also maintain those indexes, and the resulting page splits and extra I/O further degrade ETL performance.

If ETL performance issues arise from a high volume of DML operations on an indexed table, consider modifying the ETL design. One common approach is to drop the existing indexes before execution and re-create them after the process completes; depending on the scenario, other solutions may be more effective.
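
A minimal sketch of the drop-and-recreate pattern, assuming a hypothetical fact table dbo.FactSales with a non-clustered index; the two statements would typically run in Execute SQL Tasks placed before and after the data flow task:

    -- Before the data flow: drop the index so the load avoids per-row index maintenance.
    DROP INDEX IX_FactSales_OrderDate ON dbo.FactSales;

    -- ... the data flow task loads dbo.FactSales here ...

    -- After the data flow: rebuild the index in a single pass over the loaded data.
    CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
        ON dbo.FactSales (OrderDateKey);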

Configure Component Properties

Control Parallel Task Execution

Configure the MaxConcurrentExecutables and EngineThreads properties to control how many tasks run in parallel. MaxConcurrentExecutables is a package-level property with a default value of -1, meaning the maximum number of concurrent tasks equals the number of processors on the machine plus two.

[Image: Package]

EngineThreads is a data flow task-level property with a default value of 10; it specifies the total number of threads that can be created to execute the data flow task.

[Image: Data flow task]

You can adjust the default values of these properties based on ETL requirements and available system resources.

Configure the Data Access Mode Option in the OLE DB Destination

In the SSIS data flow task, the OLE DB Destination offers multiple options for inserting data into the destination table. The "Table or view" option inserts one row at a time, whereas the "Table or view - fast load" option uses a bulk insert, which significantly improves performance over the other methods.

Once you choose the "fast load" option, it provides greater control over the destination table's behavior during a data push operation. You can configure options such as Keep Identity, Keep Nulls, Table Lock, and Check Constraints to optimize performance and maintain data integrity.

[Image: OLE DB Destination Editor]

It’s highly recommended to use the fast load option when pushing data into the destination table to improve ETL performance.
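
Under the hood, the fast load path performs a bulk insert, so a rough T-SQL analogy (the table and file names are hypothetical) shows how the Keep Identity, Keep Nulls, Table Lock, and Check Constraints options map onto bulk-load flags:

    -- Rough analogy of the fast load option (names hypothetical):
    -- KEEPIDENTITY ~ Keep Identity, KEEPNULLS ~ Keep Nulls,
    -- TABLOCK ~ Table Lock, CHECK_CONSTRAINTS ~ Check Constraints.
    BULK INSERT dbo.FactSales
    FROM 'C:\staging\fact_sales.dat'
    WITH (KEEPIDENTITY, KEEPNULLS, TABLOCK, CHECK_CONSTRAINTS);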

Configure Rows per Batch and Maximum Insert Commit Size in the OLE DB Destination

These two settings are crucial for managing tempdb and transaction log usage. With the default values, all data is pushed in a single batch and a single transaction, which leads to excessive tempdb and transaction log use and can degrade ETL performance by consuming excessive memory and disk storage.

[Image: OLE DB Destination Editor]

To improve ETL performance, you can set a positive integer value for both properties based on the anticipated data volume. This will divide the data into multiple batches, allowing each batch to be committed separately to the destination table. This approach helps reduce excessive tempdb and transaction log usage, ultimately improving ETL performance.
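
Continuing the earlier BULK INSERT analogy (table and file names still hypothetical), the Maximum Insert Commit Size setting behaves like BATCHSIZE, committing the load in chunks instead of one giant transaction, while Rows per Batch corresponds to the ROWS_PER_BATCH hint that tells the engine roughly how many rows to expect:

    -- Commit every 100,000 rows instead of holding one huge transaction open,
    -- keeping tempdb and transaction log usage bounded.
    BULK INSERT dbo.FactSales
    FROM 'C:\staging\fact_sales.dat'
    WITH (TABLOCK, BATCHSIZE = 100000);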

Use the SQL Server Destination in a Data Flow Task

When pushing data into a local SQL Server database, it is highly recommended to use the SQL Server Destination to improve ETL performance. This component leverages SQL Server's built-in bulk insert capability and performs better than the other destination options, though it requires the package to run on the same machine as the target database. It also allows data transformation before loading and provides control over triggers, enabling or disabling them as needed to reduce ETL overhead.

[Image: SQL Server Destination Data Flow Component]

Avoid Implicit Typecasting

When data comes from a flat file, the Flat File Connection Manager treats every column as a string (DT_STR) by default, including numeric columns. Since SSIS uses buffer memory to store data and apply transformations before loading it into the destination table, storing numeric values as strings increases buffer memory usage and reduces ETL performance.

To improve ETL performance, convert numeric columns to the appropriate data types as early as possible and avoid implicit conversion, for example on the Advanced page of the Flat File Connection Manager or with a Data Conversion transformation. This helps the SSIS engine accommodate more rows in a single buffer.
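
As a small illustration (table and column names are hypothetical), compare an all-string staging table with a properly typed one: an integer held as a 50-character string can occupy more than ten times the buffer space of the 4-byte DT_I4 it converts to, so far fewer rows fit into each buffer.

    -- Every column as a string: each numeric value costs up to 50 bytes per row.
    CREATE TABLE dbo.StageSales_AsStrings (
        OrderID   VARCHAR(50),
        Quantity  VARCHAR(50),
        UnitPrice VARCHAR(50)
    );

    -- Properly typed: OrderID and Quantity map to DT_I4 (4 bytes each).
    CREATE TABLE dbo.StageSales_Typed (
        OrderID   INT,
        Quantity  INT,
        UnitPrice DECIMAL(10, 2)
    );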

Summary of ETL Performance Improvements

In this article, we explored ten practical ways to control and improve ETL performance in SSIS. These are common techniques; other methods may apply in specific scenarios. Categorizing the strategies helps you decide how to tackle a performance problem: if you are in the design phase of a data warehouse, focus on both categories, but if you are supporting a legacy system, start with the second category, since component properties can be tuned without redesigning packages.
