Incorporating ETL Tools Into Your Data Warehousing Strategy
Incorporating ETL Tools Into Your Data Warehousing Strategy
In this post, a big data expert discusses the use of ETL tools to help data teams better work with and manage their data warehouses.
Join the DZone community and get the full member experience.Join For Free
The Architect’s Guide to Big Data Application Performance. Get the Guide.
Managing a data warehouse isn't just about managing a data warehouse, if we may sound so trite. There's actually a lot more to consider. For example, how data gets into your data warehouse is a whole process unto itself — specifically, what happens to your data while it's in motion and the forms it must take to become usable.
And that's where ETL tools come in.
ETL - extract, transform, load - is the common paradigm by which data from multiple systems - typically developed and supported by different vendors, departments or stakeholders - is combined to a single database, data store, or warehouse for legacy storage or analytics.
Extraction is the process by which data is extracted from various data sources. Transformation involves transforming the data for storage in proper format for query and analysis. Finally, loading occurs when the transformed data is loaded into the target database, data store, data mart, or warehouse. In essence, ETL is what preps your data to make it meaningful and accessible for analysis.
Many companies blessed with in-house IT teams often think about foregoing an automated ETL tool in favor of coming up with their own solution instead. They may think they can rely on their IT staff to manually code a "pipeline" solution, only to find out later that the pipeline can't handle increasingly large or complex data loads. Or if it can, it can only do so via more expense, time, and resources.
The sacrifice? Time, accuracy, and performance. So organizations must ask: what's the point of striving for business intelligence if we're wasting hours and days to get it, can't be 100% sure of our results, and feel like the whole machine could crumble in a strong breeze?
On the other hand, an automated ETL tool is built to be fast, accurate, high-performing, scalable, and secure. More importantly, it performs a necessary function that shouldn't be the sole responsibility of overworked or under-trained IT teams to oversee, especially when so much is riding on your data warehouse and the critical answers your company is seeking from it.
The fact is, no matter how skilled your IT team may be, expanding data needs will continually pose challenges for any organization, stressing employees, equipment, and budgets, and wasting valuable time on custom, manual configurations just to keep up. So let's dive into the different automated ETL tools that are available and how you can decide which one is right for you.
Types of ETL Tools
Here's a look at the specific types of ETL tools and what they can do for your organization:
Batch processing tools: Incumbent batch processing tools consolidate your data during off-hours when there's less competition for compute power. For types of data that are less dependent on speed (think quarterly or monthly calculations, like tax or salary reporting), these tools prep data without affecting performance elsewhere.
Open source tools: Like nearly all open source solutions, open source ETL comes ready-made, can be easily integrated with other systems, and is especially attractive to companies with restrictive infrastructure budgets. And because of the collaborative nature of open source development, you can bank on a level of accountability, flexibility, and "the latest everything" that may be missing in part with other solutions.
Cloud-based tools: Though batch processing is generally the domain of on-premise data warehouses, new batch processing tools are now available in the cloud. They offer the same benefits of those old legacy systems but with today's cloud advantages, such as real-time support, built-in security and compliance, and intelligent schema detection.
Real-time tools: These days, most companies use a multitude of modern applications that demand real-time information. Real-time ETL tools use an entirely different model from the other options, one based on distributed message queues - communication between decoupled, or independent, applications - and stream processing, or a continuous data stream. The net result is that companies can query and get answers fast, and not just when it's convenient for the system.
These categories are not strictly defined, however. For example, there are open source or cloud-based tools that can handle real-time information, etc.
Which ETL Tool Is Right for You?
While most, if not all, of the above tools can serve your organization well in some capacity, each is geared to best fit certain requirements:
- Incumbent batch: Best for organizations that want to use on-premise technology and/or existing vendors, and have less of a concern about real-time data processing.
- Open source: Best for organizations comfortable with maintaining and operating open source technology, or who want to build an ETL solution themselves using open source technology.
- Cloud-based: Best for organizations that prefer tools built and delivered in the cloud and are interested in keeping costs low by not having to purchase or maintain equipment.
- Real-time: Best for organizations seeking a modern solution to handle large volumes of data or streaming data, scale operations up or down as needed, and process events in real time.
Data Warehousing and ETL: An Unbeatable Match
Today's business intelligence requirements make big demands on every organization. Your unique data sources, structure, and collection shouldn't be subject to gaps in processing that inhibit flow and end-user analysis.
Published at DZone with permission of Garrett Alley , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.