In this big data world, a major goal for businesses is to maximize the value of all their customer data. In this article, I explain why businesses need to integrate their data silos to build better models, and how machine learning can help them uncover insights from the integrated data.
The Value of Data Is Insight
The goal of analytics is to “find patterns” in data. These patterns take the form of statistical relationships among the variables in your data. For example, marketing executives want to know which marketing pieces improve customer buying behavior. The marketing executives then use these patterns — statistical relationships — to build predictive models that help them identify which marketing piece has the greatest lift on customer loyalty.
Our ability to find patterns in data is limited by the number of variables to which we have access. When you analyze a single data set, the breadth of your insights is restricted to the variables housed in that data set. If your data are restricted to, say, attitudinal metrics from customer surveys, you have no way of learning how customer attitudes affect customer loyalty behaviors. Without a link between customers’ attitudes and their behaviors, you cannot draw any conclusions about how satisfaction with the customer experience drives customer loyalty behaviors.
Two Dimensions of Your Data
You can describe the size of data sets along two dimensions: 1) the sample size (number of entities in the data set) and 2) the number of variables (number of facts about each entity). Figure 1 illustrates different data sets and how they fall along these two size-related dimensions (you can see an interactive graphic version here).
For data sets in the upper left quadrant of Figure 1, we know a lot of facts about a few people. Data sets about the human genome are good examples of this type. For data sets in the lower right quadrant, we know a few facts about a lot of people (e.g., the U.S. Census). Data silos in business are good examples of this type.
Mapping and understanding all human genes allows for deep personalization in healthcare through focused drug treatments (e.g., pharmacogenomics) and risk assessment of genetic disorders (e.g., genetic counseling, genetic testing). The human genome project allows healthcare professionals to move beyond the “one size fits all” approach to a more tailored approach to addressing the healthcare needs of a particular patient.
The Need for Integrating Data Silos
In business, most customer data are housed in separate data silos. While each data silo contains important pieces of information about your customers, if you don’t connect those pieces across those different data silos, you’re only seeing parts of the entire customer puzzle.
Check out this TED talk by Tim Berners-Lee on open data that illustrates the value of merging/mashing disparate data sources together. Only by merging different data sources together can new discoveries be made — discoveries that are simply not possible if you analyze individual data silos alone.
Siloed data sets prevent business leaders from gaining a complete understanding of their customers. In this scenario, analytics can only be conducted within one data silo at a time, which restricts the set of information (i.e., variables) that can be used to describe a given phenomenon. As a result, your analytic models are likely underspecified (not using the complete set of useful predictors), which decreases their predictive power and increases their error. The bottom line is that you are not able to make the best predictions about your customers because you don’t have all the necessary information about them.
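To make the cost of an underspecified model concrete, here is a minimal sketch in Python. All customer values and variable names (`satisfaction`, `support_tickets`, `annual_spend`) are hypothetical and chosen only for illustration. It compares the prediction error of a model that sees one silo with a model that sees two. The two-predictor fit uses a two-step shortcut (the Frisch-Waugh-Lovell result) that gives the same errors as an ordinary least-squares fit on both predictors.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Errors left over after predicting y from x with a fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def mean_squared_error(errors):
    return sum(e * e for e in errors) / len(errors)


# Hypothetical values for eight customers (illustration only).
satisfaction = [3, 4, 2, 5, 1, 4, 3, 5]          # survey silo
support_tickets = [2, 0, 3, 1, 4, 2, 1, 0]       # service silo
annual_spend = [20, 29, 11, 33, 3, 23, 23, 34]   # billing silo

# Model 1: only the survey silo is available.
errors_single = residuals(satisfaction, annual_spend)

# Model 2: both predictors. Regressing what satisfaction leaves unexplained
# in spend on what it leaves unexplained in tickets yields the same errors
# as a least-squares fit on both predictors at once.
tickets_unexplained = residuals(satisfaction, support_tickets)
errors_both = residuals(tickets_unexplained, errors_single)

mse_single = mean_squared_error(errors_single)
mse_both = mean_squared_error(errors_both)
print(f"MSE, satisfaction only:      {mse_single:.2f}")
print(f"MSE, satisfaction + tickets: {mse_both:.2f}")
```

With these made-up numbers, the model that also sees the service silo has the lower error. On real data, the size of the improvement depends on how much new information the added variable carries.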
The integration of these disparate customer data silos helps your analytics team to identify the interrelationships among the different pieces of customer information, including their purchasing behavior, values, interests, attitudes about your brand, interactions with your brand, and more. Integrating information/facts about your customers allows you to gain an understanding of how all the variables work together (i.e. are related to each other), driving deeper customer insight about why customers churn, recommend you, and buy more from you.
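As a rough sketch of what integration means in practice, the Python snippet below combines three silos into one record per customer. The silo names, field names, customer IDs, and values are all hypothetical. It also assumes that every silo shares a common customer ID, which is often the hardest part to achieve in a real business.

```python
# Hypothetical silos keyed by customer ID (all names and values are made up).
survey_silo = {
    "C001": {"satisfaction": 4},
    "C002": {"satisfaction": 2},
    "C003": {"satisfaction": 5},
}
purchase_silo = {
    "C001": {"orders_last_year": 7},
    "C002": {"orders_last_year": 1},
    "C004": {"orders_last_year": 3},
}
support_silo = {
    "C001": {"support_tickets": 0},
    "C002": {"support_tickets": 5},
    "C003": {"support_tickets": 1},
}


def integrate(*silos):
    """Combine silos into one record per customer (an outer join on the ID)."""
    combined = {}
    for silo in silos:
        for customer_id, fields in silo.items():
            combined.setdefault(customer_id, {}).update(fields)
    return combined


customers = integrate(survey_silo, purchase_silo, support_silo)

# Only customers present in every silo have the full set of variables.
all_fields = {"satisfaction", "orders_last_year", "support_tickets"}
complete = {cid: rec for cid, rec in customers.items() if rec.keys() >= all_fields}

for cid, rec in sorted(complete.items()):
    print(cid, rec)
```

Only the customers who appear in all three silos end up with attitudes and behaviors in the same record, and only for them can you ask how satisfaction relates to purchasing or support activity.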
The bottom line: the total, integrated, unified data set is greater than the sum of its data silo parts. The key to discovering new insights is to connect the dots across your data silos.
After the data have been integrated, the next step involves analyzing the entire set of variables. However, with the integration of many data silos, including CRM systems, public data (e.g., weather), and inventory data, there is an explosion of possible analyses that you can run on the combined data set. For example, with 100 variables in your database, you would need to test 4,950 (roughly 5,000) unique pairs of variables to determine which are related to each other. The number of tests grows rapidly when you examine unique combinations of three or more variables: there are 161,700 possible three-variable combinations and about 3.9 million four-variable combinations.
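The size of this search space can be checked directly. The short Python sketch below counts the unique combinations of 2, 3, and 4 variables that can be drawn from 100 variables, using the standard library's `math.comb`.

```python
from math import comb

n_variables = 100

# Number of unique k-variable combinations that could be tested.
for k in (2, 3, 4):
    print(f"{k}-variable combinations: {comb(n_variables, k):,}")

# Output:
# 2-variable combinations: 4,950
# 3-variable combinations: 161,700
# 4-variable combinations: 3,921,225
```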
Because these integrated data sets are so large, both in the number of records (i.e., customers) and in the number of variables, data scientists are simply unable to sift through the sheer volume of data manually. Instead, to identify key variables and create predictive models, data scientists rely on the power of machine learning to quickly and accurately uncover the patterns (the relationships among variables) in their data.
Rather than relying on the human efforts of a single data scientist, companies can now apply machine learning. Machine learning uses statistics and math to allow computers to find hidden patterns among variables, and to make predictions from them, without being explicitly programmed where to look. Iterative in nature, machine learning algorithms continually learn from data. The more data they ingest, the better they get at finding connections among the variables and at generating models that efficiently describe how the underlying business process works.
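To give a feel for what "iterative" means here, the following Python sketch fits a simple linear model by gradient descent, one basic example of an algorithm that improves its predictions step by step. The customer values, the `loyalty` score, the learning rate, and the number of steps are all assumptions made for illustration. Production machine learning systems use far more sophisticated methods.

```python
# Hypothetical, already-scaled inputs for six customers (illustration only).
# Each row: (satisfaction, support_tickets); target: a loyalty score.
features = [(0.5, -0.2), (1.0, -1.0), (-0.5, 0.4),
            (1.5, -0.6), (-1.5, 1.2), (-1.0, 0.2)]
loyalty = [1.2, 2.9, -1.3, 3.5, -4.1, -2.2]


def predict(weights, bias, row):
    return bias + sum(w * x for w, x in zip(weights, row))


def loss(weights, bias):
    """Mean squared error of the current model over all customers."""
    errors = [predict(weights, bias, row) - y for row, y in zip(features, loyalty)]
    return sum(e * e for e in errors) / len(errors)


weights, bias = [0.0, 0.0], 0.0   # start with a model that knows nothing
learning_rate = 0.1               # assumed step size
history = [loss(weights, bias)]

for step in range(200):
    errors = [predict(weights, bias, row) - y for row, y in zip(features, loyalty)]
    n = len(errors)
    # Gradient of the mean squared error with respect to each parameter.
    grad_w = [2 / n * sum(e * row[j] for e, row in zip(errors, features))
              for j in range(2)]
    grad_b = 2 / n * sum(errors)
    # Move each parameter a small step in the direction that reduces error.
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b
    history.append(loss(weights, bias))

print(f"error before training: {history[0]:.3f}")
print(f"error after training:  {history[-1]:.3f}")
print("learned weights:", [round(w, 2) for w in weights])
```

Nobody tells the algorithm which variable matters. It discovers the weights on its own by repeatedly reducing its prediction error.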
In our case, we are interested in understanding the drivers behind customer loyalty behaviors. Based on math, statistics, and probability, algorithms find connections among variables that help optimize important organizational outcomes — in this case, customer loyalty. These algorithms can then be used to make predictions about a specific customer or customer group, providing insights to improve marketing, sales, and service functions that will increase business growth.
The bottom line: the application of machine learning to uncover insights is an automated, efficient way to find the important connections among your variables.
Your data are only as valuable as the insights you are able to extract from them. These insights are represented by relationships among variables in your data set. Sticking to a single data set (silo) as the sole data source limits your ability to uncover important insights about any phenomenon you study. In business, the practice of data science to find useful patterns in data relies on integrating data silos, allowing access to all the variables you have about your customers. In turn, businesses can leverage machine learning to quickly surface the insights from the integrated data sets, allowing them to create more accurate models about their customers. With machine learning advancements, the relationships people pursue (and uncover) are limited only by their imagination.