Data Harmonization 101: Making Sweet Music for BI Users
Read this article if you want a crash course on how to make data harmonization music to your business intelligence ears.
As an end-user of business intelligence or analytics, how much should you have to know about preparing the data to get your actionable business insights? Ideally, you should be able to use different datasets as and when you need them. This dream world is possible, but only with suitable data harmonization. Disparate data sources and their structures must be made compatible with one another, to allow them to be correctly joined and analyzed together. Proper harmonization is also vital to get meaningful, consistent results.
Compatibility between data sources rarely exists at the start. Different people or organizations collect data in the way that suits them and usually don’t know about any specific requirements you have. Consequently, the harmonization of data from different sources requires work from you, from your IT department, or from some other resource.
Let’s use this opportunity as a crash course on how to make data harmonization music to your ears, shall we?
The Basics of Data Harmonization
Essentially, there are three operations in the data harmonization process.
The data must be extracted (for instance, from IT systems, machines, or websites). It must be transformed to make it compatible with other data. It must also be loaded into a repository, such as a separate database, to be available to authorized users.
Traditionally, these three operations have been performed in the order “extract-transform-load” (ETL) to build a data warehouse that stores the different datasets. This approach, however, has several disadvantages.
First, the manual ETL process is resource-intensive: it demands processing power, memory, and storage space. Second, the transformation imposes a specific structure on the data that may limit the options for analyzing it afterward. Data warehousing also needs specialist IT skills, meaning that end-users are obliged to go back to the IT department to add new data sources or change the way they are transformed.
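The extract-transform-load order described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two sources (`CRM_ROWS`, `SHOP_ROWS`), their field names, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical raw records from two sources with incompatible conventions.
CRM_ROWS = [{"customer": "Acme", "amount_usd": "1,200.50", "date": "03/15/2023"}]
SHOP_ROWS = [{"cust_name": "acme ", "total_cents": 99900, "day": "2023-03-16"}]


def extract():
    """Extract: collect raw rows, tagged with the source they came from."""
    return [("crm", r) for r in CRM_ROWS] + [("shop", r) for r in SHOP_ROWS]


def transform(tagged_rows):
    """Transform: map every row to one schema (customer, amount, ISO date)."""
    out = []
    for source, row in tagged_rows:
        if source == "crm":
            month, day, year = row["date"].split("/")
            out.append((
                row["customer"].strip().lower(),
                float(row["amount_usd"].replace(",", "")),
                f"{year}-{month}-{day}",
            ))
        else:
            out.append((
                row["cust_name"].strip().lower(),
                row["total_cents"] / 100,
                row["day"],
            ))
    return out


def load(rows):
    """Load: write the already-harmonized rows into the warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


warehouse = load(transform(extract()))
etl_total = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
).fetchall()
print(etl_total)
```

Note that only the transformed shape reaches the warehouse. Any detail dropped in `transform` is no longer available for later analysis, which is the structural limitation mentioned above.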
Another approach is the data lake, which changes the order of the three steps above to make them “extract-load-transform” (ELT).
Data is extracted from different sources and loaded “as is” into the data lake. Users, including IT data warehouse specialists, are then free to transform and harmonize copies of the datasets as they wish, while the datasets in the data lake keep their original format. This makes it easier to store both structured and unstructured data. It also preserves the full detail, or “granularity,” of the data, so the options for analysis are not limited.
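For contrast, here is a minimal sketch of the extract-load-transform order. It assumes a hypothetical `raw_events` table that stores each record untouched as JSON text, with an in-memory SQLite database standing in for the data lake. The field names are illustrative only.

```python
import json
import sqlite3

lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_events (source TEXT, payload TEXT)")

# Load: store each record exactly as the (hypothetical) source delivered it.
raw_records = [
    ("crm", {"customer": "Acme", "amount_usd": "1,200.50"}),
    ("shop", {"cust_name": "acme ", "total_cents": 99900}),
]
lake.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(source, json.dumps(payload)) for source, payload in raw_records],
)


def harmonized_view(conn):
    """Transform on read: build a harmonized copy; raw rows are not modified."""
    rows = []
    query = "SELECT source, payload FROM raw_events ORDER BY rowid"
    for source, payload in conn.execute(query):
        rec = json.loads(payload)
        if source == "crm":
            rows.append((
                rec["customer"].strip().lower(),
                float(rec["amount_usd"].replace(",", "")),
            ))
        elif source == "shop":
            rows.append((
                rec["cust_name"].strip().lower(),
                rec["total_cents"] / 100,
            ))
    return rows


view = harmonized_view(lake)
raw_count = lake.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(view, raw_count)
```

Because the raw payloads stay in place, a different user could write a second view function with different harmonization rules without reloading anything.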
Dimensional Modeling to Help Business Users
As part of data harmonization, data models show the relationships between different pieces of data or variables. When systems are built for enterprise operations or business transactions, data modeling is an important part of the design process to ensure system reliability and performance. However, there is often a gap between these operational data models and the kinds of analysis and reporting that business users find the most useful.
Dimensional data modeling helps bridge the gap by organizing extracted data in terms of “facts” and “dimensions.” The facts represent events or entities, for instance, sales data. The dimensions group the facts together, for example, by location or product. This makes it easier for business users to “slice and dice” data to answer questions like, “How many units of product ‘A’ did our sales location ‘B’ sell over the last 12 months?”
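The facts-and-dimensions idea can be sketched as a small star schema. The table names (`fact_sales`, `dim_product`, `dim_location`), the sample rows, and the fixed date window are assumptions made for illustration, with in-memory SQLite standing in for a warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    sale_date   TEXT,     -- ISO 8601 dates compare correctly as text
    units       INTEGER
);
INSERT INTO dim_product  VALUES (1, 'A'), (2, 'C');
INSERT INTO dim_location VALUES (10, 'B'), (20, 'D');
INSERT INTO fact_sales VALUES
    (1, 10, '2023-02-01', 5),
    (1, 10, '2023-09-15', 7),
    (1, 20, '2023-09-15', 4),
    (2, 10, '2023-10-01', 9),
    (1, 10, '2022-01-10', 3);
""")


def units_sold(conn, product, location, start, end):
    """Sum fact rows, sliced by two dimensions and a half-open date range."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.units), 0)
        FROM fact_sales AS f
        JOIN dim_product  AS p ON p.product_id  = f.product_id
        JOIN dim_location AS l ON l.location_id = f.location_id
        WHERE p.name = ? AND l.name = ?
          AND f.sale_date >= ? AND f.sale_date < ?
        """,
        (product, location, start, end),
    ).fetchone()
    return row[0]


# "How many units of product A did location B sell in this 12-month window?"
answer = units_sold(db, "A", "B", "2023-01-01", "2024-01-01")
print(answer)
```

Changing the question, for example to another product or location, only changes the arguments, not the model.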
Some IT departments offer their end-users data warehouses or “data marts” (subsets of data warehouses for specific departments), with harmonization and dimensional models included. Yet end-users must still go back to the IT department to get changes made. On the other hand, a modern, single-stack BI platform removes this dependence on the IT department, while offering usability and flexibility. End-users can then get their business insights rapidly and easily, changing and adding data sources at will, while a platform such as Sisense takes care of the data harmonization.
Data Harmonization in Big Data Analytics
Businesses today are increasingly looking to big data to help them perform better and gain competitive advantage. Big data goes beyond the traditional, structured data generated by IT systems for, say, production or order handling. It can include structured and unstructured data (text, audio, video, social media, and so on) in huge volumes, with many varieties, and arriving at high velocity.
For data harmonization in big data analytics, the variety of data sources can be tackled by offering suitable data connectors. Each data connector contains the information needed to connect to and query a specific data source. The volume and velocity of big data feeds, however, require other tactics. They challenge traditional approaches that were not designed with big data in mind, forcing them either to increase computing resources (more expensive) or to limit analytics to smaller subsets of data (sacrificing data granularity).
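One way to picture a data connector is as a small object that knows how to reach one kind of source and return rows in a common shape. The `Connector`, `CsvConnector`, and `SqliteConnector` classes below are hypothetical and are not the API of any particular BI product. They are a sketch of the general idea using only the Python standard library.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical common interface: every source yields a list of dicts."""

    @abstractmethod
    def fetch(self):
        """Return all rows from the source as a list of dicts."""


class CsvConnector(Connector):
    def __init__(self, text):
        self.text = text  # CSV content; a real connector might hold a path

    def fetch(self):
        return list(csv.DictReader(io.StringIO(self.text)))


class SqliteConnector(Connector):
    def __init__(self, conn, query):
        self.conn, self.query = conn, query

    def fetch(self):
        cur = self.conn.execute(self.query)
        cols = [c[0] for c in cur.description]
        return [dict(zip(cols, row)) for row in cur]


orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "EU"), (2, "US")])

connectors = [
    CsvConnector("id,region\n3,APAC\n"),
    SqliteConnector(orders_db, "SELECT id, region FROM orders ORDER BY id"),
]
rows = [row for c in connectors for row in c.fetch()]
print(rows)
```

The CSV source returns `id` as a string while the database returns an integer, which is a small reminder that connecting to a source is separate from harmonizing what it returns.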
You can avoid both these issues by changing the way in which data is processed. Sisense, for example, uses the caching possibilities of today’s microprocessors, including the ones used in commodity computing hardware. This in-chip technology offers the advantage of being able to work with basic or “quick and dirty” data models, significantly reducing the need for upfront data preparation and modeling. At the same time, it produces data analytics results at lightning-fast speeds and lets users ask all kinds of ad hoc questions of their data, not just the predefined questions that typically limit traditional analytics systems.
Multi-Source Data Analytics
Business users frequently want to combine multiple sources of data to get a complete view of their customers, operations, or enterprise. The data must then be brought together from disparate systems, where and when it is needed.
In traditional approaches, this harmonization of multiple data sources requires a specialist, such as a database administrator (DBA), to “pre-join” data tables from the different sources to make sure processing will be fast enough. Once again, this limits the analytics possibilities for end-users, preventing them from asking ad hoc questions or being creative with their data. By comparison, built-in connectors and in-chip technology allow end-users to access a wide variety of data sources simultaneously and explore existing and new data as they want.
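To make the multi-source case concrete, here is a minimal sketch of joining two sources at query time rather than pre-joining them. The datasets, field names, and the `normalize_key` rule are hypothetical. The sketch says nothing about how in-chip processing works and only shows the harmonization step that a join depends on.

```python
from collections import defaultdict

# Two hypothetical sources that identify the same customers differently.
crm_customers = [
    {"customer_id": "C-001", "name": "Acme"},
    {"customer_id": "C-002", "name": "Globex"},
]
web_orders = [
    {"cust": "c-001", "total": 120.0},
    {"cust": "C-002 ", "total": 80.0},
    {"cust": "c-001", "total": 30.0},
]


def normalize_key(value):
    """Harmonize join keys: trim whitespace and ignore letter case."""
    return value.strip().upper()


def revenue_by_customer(customers, orders):
    """Join the sources on the harmonized key and total the order values."""
    totals = defaultdict(float)
    for order in orders:
        totals[normalize_key(order["cust"])] += order["total"]
    return {
        c["name"]: totals.get(normalize_key(c["customer_id"]), 0.0)
        for c in customers
    }


report = revenue_by_customer(crm_customers, web_orders)
print(report)
```

Without `normalize_key`, none of the order rows would match a CRM customer, and the report would silently show zero revenue.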
In Tune and in Step With Business Users’ BI Needs
With the right end-to-end BI and analytics stack, your end-users get to conduct the entire orchestra from data source to analytics insight, as well as improvising on different themes by asking new or random BI questions.
Self-service BI also gives the IT department breathing space, no longer weighed down by incessant user demands to revamp data sources, queries, and reports. With even more advanced platforms, your tool can automatically learn via harmonization to continually improve performance, making your life and the life of your colleagues even better.
Sounds pretty great to us.
Published at DZone with permission of Gur Tirosh , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.