Emerging Trends Of ETL: Big Data and Beyond
Emerging Trends Of ETL: Big Data and Beyond
As big data continues to grow, many have concerns that the conventional process of extract, transform, and load (ETL) is applicable. Are these concerns valid?
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Amidst the analysis of driving voluminous data, along with analytics challenges, there are concerns about whether the conventional process of extract, transform, and load (ETL) is applicable.
ETL tools quickly "intrude" across mobile apps and web applications, as they can access data very efficiently. Eventually, ETL applications will accumulate industry standards and gain power.
Let's discuss practically something rather new — something that offers an approach to easily build sensible, adaptable data models that dynamize your data warehouse: the data vault!
Enterprise data warehouse (EDW) systems intent to sustain an authentic business intelligence (BI) for the data-driven enterprise. Companies must acknowledge critical metrics which are deep-rooted in this significant and dynamic data.
Challenges of ETL
Following is a list of the top five challenges that ETL faces due to traditional data modeling:
- When upstream business flows or business rules of EDW/BI systems adapt change.
- Analysis of business data in terms of volume, velocity, variety, visualization, veracity, and value (6 Vs) — big data with technical realities to affirm competitive edge.
- Designing, creating, delivering, and sustaining vigorous, accessible, and large powerful storage EDW/BI systems for intelligent adoption has become complex.
- To tailor the data to meet the needs of the business with its core business domain value without failing in giving simple solutions for the definite needs of the business.
- Lack of flexibility to adopt new unpredicted and unplanned sources with or without the impact of the upstream process.
Now. it's time to emerge with much focused, all time solution for all the potential challenges depicted above.
A data vault is a methodology with hybrid data modeling.
As per Dan Linstedt, it is detailed as follows:
"A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise."
It's elegant, easy, and simple to execute. It is established on a set of too many structures, and with auditable rules. By exploiting the data vault principles, your project will undoubtedly gratify auditability, scalability, and flexibility.
The following stipulated standards will help you to build a data vault:
- Determine the business keys and hubs
- Confirm the relationships between the business keys, links
- Specify the description around the business keys, satellites
- Connect standalone factors like time dimension attributes and code/descriptions for decoding in data marts
- Integrate for query optimization, append performance tables like bridge tables and point-in-time (PD) structure model tables.
Building a data vault is simple as you go; ultimately, it will mutate the conventional methods generally used in enterprise integration architectures. The model is built in a way that it can be efficiently extended whenever required.
Data vault modeling + architecture + methodology provides solutions for the challenges depicted above.
"Business agility is the ability to improve from continuous change."
Let's see how the data vault can adapt the change.
With the partying of business keys (as they are static) in the data vault and the associations between the business keys from their descriptive attributes, the data vault can address the problem of change in the environment.
Crafting these keys as the structural backbone of a data warehouse, all the associated data can be organized around them. These hubs (business keys), links (associations), and SAT (descriptive attributes) yield an extremely adaptable data structure while sustaining an immense degree of data integrity. Specific links are like synapses (vectors in the opposite direction). They can be created or dropped whenever the business relationships are bound to change automatically by transforming the data model as needed without any impact to the existing data structures.
Let's see how the data vault engulfs the ETL challenge of big data.
How Data Vaults Solve the ETL Challenge of Big Data
The data vault blends consistent integration of big data technologies along with modeling, methodology, architecture, and outstanding practices. With the adoption of very large voluminous data, data can easily be blended into a data vault data model to incorporate adopting products like Hadoop, MongoDB, and various other NoSQL varieties. Eradicating the cleansing specifications of a star schema design, the data vault triumphs over huge data sets by reducing exhaustion and sustaining correlated insertions which impact the potential of big data systems.
Data vaults also decode the challenge of complexity through simplification. Let's see how.
Designing a competent and dynamic data vault model can be done instantaneously once you know the core of the three table types: hub, satellite, and link. Determining the business keys and specifying the hubs is invariably the perfect thing to kick-off. "Hub-satellites" simulate source table columns that can change, and links connect them. It is also feasible to have link-satellite tables.
Once the data vault data model is done, the next uphill task is to build the data integration process through ETL (i.e. to populate data into target systems from source systems). So, with the data vault design, you can connect the data-driven enterprise and enable data integration.
ETL, with its simplified development process, decreases the total cost of open-platform. ETL can certainly be used to populate and maintain a robust EDW system built upon a data vault model.
Overcoming Challenges Using Data Vaults
This can be achieved through any prominent ETL tools available in the market.
The data vault typically specifies the outlook/values of an enterprise that analyze and details the business domain and relationships bounded within the vicinity. Accomplishing business rules must ensue before populating a star schema. Through data vaults, you can drive the business rules downstream after EDW incorporation/ingestion. Another data vault philosophy is that any data is significant, even if it is irrelevant. The theory of the data vault is to ingest source data of any type (good or bad).
This data model is designed eminently to resolve and meet the absolute needs of the latest EDW/BI systems.
A data vault is flexible enough to adopt new unpredicted and unplanned sources without impacting the existing data model.
The data vault methodology is based on SEI/CMMI Level-5 processes and practices and comprises of various components constituting with outstanding features of Six Sigma, Total Quality Management (TQM), and SDLC (Agile). Data vault projects have short and considerable release cycles usually adopting the repeatable, defined, manageable, consistent and optimized projects expected at CMMI Level 5. While adding new data sources, similar business keys and new hubs, satellites, and links can be added and then can be linked to the existing data vault structures without any impact with underlying model.
Testing a Data Vault: ETL/Data Warehouse Pursuit
Unlike in non-data vault ETL programs, general testing strategies are best suited for data vault adopted programs. However, by using raw Data Vault loads we can moderate transformations to a minimum level in the entire ETL process through permissible load errors.
ETL/data warehouse testing should emphasize on:
- Data quality of source systems
- Data integration
- Performance/scalability/upgrade issues of BI reports
Following are five prominent proposals to execute tests for a data vault/ETL/DWH project to adhere the above baseline pointers:
- Design a small static test database derived from the actual data to run the tests quickly so that expected results can be identified in the earlier stage.
- Do early execution of system testing to ensure the connecting boxes in the ETL interface.
- Use test automation tools to:
- Set up the test environment
- Restore a virtual environment
- Update static data to validate complex data validations
- Execute the ETLs, capture logs and validate bad/rejected/valid data flows
- Engage business users while creating and deriving the real data to a small test database to ensure the data profiling and data quality.
- Simulate the test environment as if it's the production environment to cut down on cost issues.
Exploiting various innovative methodologies to visualize business trends coupled with having substantial evidence will do wonders in ETL/big data engagements.
Although it is important to discuss the ETL trends that solve challenges, it is not enough. We need to reflect upon and extend how we can develop automation solutions to create test data for any ETL requirement using component libraries, RowGen, etc.
Published at DZone with permission of Sree Lakshmi , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.