Add Schema as Needed, Not in Advance
Big Data is changing how data warehouses are created and updated. Find out why and how this is changing the way data engineers peform their analyses.
Join the DZone community and get the full member experience.Join For Free
The first three steps of building a traditional data warehouse are to 1) gather reporting requirements, 2) identify source data, and 3) design a data model, also known as a schema, to hold the data in a predictable structure for analysis.
The big data and data lake revolutions have radically changed that approach. Now people are gathering data first and using a “come as you are” approach for the data model. Basically sourcing and dumping potentially interesting data, as-is, into a big data repository or cloud file store. The analytical and reporting requirements then generally come next, as people (or machines) try to find something useful to do with the data that they have assembled, or try to use it to answer actual business questions. In this new world, the “data modeling” step is largely ignored, or deferred until later, using a “schema on read” approach.
Sometimes this works. Often the data tables in modern data lakes are wide, meaning they have lots of columns in each table — potentially every column you’d need to do a certain type of analysis. With such a setup, users can explore a lot of data all from a single table, which eliminates the need to do much or any data modeling, as no joins are required between tables.
But often analysis needs to go beyond a single table, which means we need to understand the relationships between data columns across various tables in a single database or data lake, or across datasets that are in completely different data stores. In this case, the analytical tool needs to have some understanding of the schema of the data.
In traditional BI, the schema is a relatively rigid structure that is usually defined before any reports or dashboards are created. Creating and changing the schema often requires an architect-level skill set, and possibly the editing of complex XML-like files. These tasks are way beyond the abilities of most typical business users.
Our philosophy at Zoomdata is to get out of the user’s way — just let them see and interact with the data. Once data has been through data quality and ETL steps, to ensure initial cleanliness and consistency, we believe that it is ready for users to start to explore. Also a lot of data these days is machine generated (as opposed to being entered into systems by people manually), and it’s not a sure thing that all machine-generated data is necessarily of great value to analyze. So by just showing people their data, initially in a simple grid or a few graphs, the user can get a sense of what they have to work with. As they identify the parts that are potentially of value, they can then start to sculpt the simple raw data into a more refined product that presents and answers new questions and to gain insight into their business.
That sculpting process usually involves three steps, which may happen in any order and can happen multiple times at different levels. The first sculpting step is to surf between data attributes, to look at the data from different perspectives or dimensions, such as geography, type of customer, or whatever attributes or collection of attributes are in the data that are of interest. The second is to filter the data, to remove parts of the data that are not of interest and zero-in on the important aspects. The third is to explore the data over time. To look at the filtered data across time, to see how things vary, and determine the level of time granularity and period-to-period comparisons that best get at the issue or insight at hand. By using these three approaches a user can narrow down to a set of data that can then be visualized to start to tell a story.
No predefined “schema” is required to do any of that. The first leap to something that resembles a traditional data warehouse schema is when we start to look at data that needs to be joined or coalesced across different sources. Zoomdata considers a source to be a table of data in a database, an index in a search system like Elasticsearch or Solr, or a collection in a document store like MongoDB. Once we start to analyze data from more than one source, we need to understand the relationships between the sources.
These are not just database joins or foreign key relationships. We don’t assume the sources are in the same underlying datastore. They might be, and when they are, we can push some more of the work down to that store, but we’ve learned that these days what’s exposed via one database one day might be exposed by other means tomorrow. Some of the most powerful data analyses involve including data from completely disparate data systems and engines that will never be co-located due to technical, organizational or regulatory constraints.
We want to learn how these sources interrelate from the end user. The motivated end user who wants to analyze data across sources is the perfect person to teach us how the data aligns. We don’t want to have to have an architect define it in advance like most data virtualization systems and BI tools require. We want the end user to tell us because they are motivated. They have a problem to solve and can tell us how they want it solved. We can learn from them, on the fly, as they teach the system how cross-source data aligns.
And if you have enough users doing this, you can learn a lot, really fast. And you can apply those learnings to other users automatically. What we learn from one user about how two or more sources align can immediately be used to help out hundreds more. We can intuit and derive the schema simply by learning from the actions of the end users. It’s a self-learning, continuously-evolving, bottoms-up as opposed to top-down BI metadata layer.
Given how fast data is changing, new sources are appearing, data schemas are changing, fields are being added and removed, the approach we’re discussing is the only possible way to stay on top of it. The sources of data and the structure of the data is changing almost as quickly as the data itself. Think of it as the first derivative of the data.
So we can learn from the users. We can learn the schema as discussed above. But we can learn about the users and their interests. We can learn what certain types of users find interesting. And from the data itself, we can learn what is normal and what is unusual. The holy grail of the next generation of BI is to use deep learning AI techniques to match up interesting insights from the data itself with the users who are most likely to care about and take action based on those insights. We need to tell the right person at the right time things that they will find interesting and actionable. We need to tell them why they are happening. And, eventually, once enough historical insights and action pairs are available for advanced machine learning algorithms to learn from, we may be able to even suggest what they should do about these anomalies or insights.
In future posts we will think more about how this learning process works. We will talk more about proactively generating insights and getting them to the right users. And we will talk about making recommendations and actionability — to eventually be able to tell users what they should probably do, and give them the right tools to immediately take action. While the first thing we are automatically learning from our users is about the schema of their data, that is only the beginning.
Published at DZone with permission of Justin Langseth, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.