7 Data Modeling Mistakes That Will Sink Your Analysis
7 Data Modeling Mistakes That Will Sink Your Analysis
In this era of big data it's not unusual to find ourselves awash in a sea of data points. But remember, it is still possible for your analysis to sink without good models. Here are some common mistakes to avoid.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
You have a goal. You want to gain actionable insights from all of this data that you have been collecting. So, how do you make sure to model your data so that you can actually gain these insights and answer the business questions that you have? You plan. When the planning stage is skimped on, or skipped altogether, the result is horrible. The performance and analysis suffer, data integrity and security issues ensue, and the overhead for maintenance and development balloon to unnecessary levels.
In this article, we are going to take a look at some common mistakes that are made when modeling your data in preparation for analysis. If overlooked, these mistakes can undermine your analysis and prevent you from gaining the insights that you are after.
Let’s start with 3 common data modeling mistakes that should be avoided regardless of which tool or technology you use, and move on to 4 that are mostly problematic when working with OLAP cubes and legacy BI platforms (i.e. less of an issue for Sisense users).
Common Modeling Mistakes to Avoid
Starting Without a Clear Plan for Action
When it comes to modeling data resources for analytics, such as a data warehouse or Elasticube, it is vital to plan out what its goal will be. There are several reasons for this but the main theme is that you cannot efficiently use your analytics resources if you do not have a goal for them. Designing a data model that will allow business users to investigate areas such as web traffic and opt-in rates is far different than a model that is meant for analyzing product sales.
The best practice is to plan, design, and allocate resources for each area you intend to run analysis on. This should be done in the BI project planning stage and during comprehensive requirements elicitation. You will notice a marked improvement in performance, security, and feasibility when it comes to implementing changes to your analysis goals.
It is possible to include too much data into one resource. With legacy tools this could cause a slow-down in analysis and query time; but even with a platform such as Sisense, which is optimized for performance when working with big and disparate datasets, you would still want to be careful to avoid storage issues, data duplication, and unnecessary overhead. On the flip side, it is almost worse to not include all of the data necessary to answer the questions listed in your business goals.
This step in planning will allow for the identification of the overall goal of your analytics model and ensure that the correct data is included for each resource.
Inadequate Use of Surrogate Keys
When bringing data together from multiple sources for analysis, a popular strategy for ensuring that the data has unique identifiers is providing surrogate keys. However, it is not always necessary, or good practice, to opt for the use of surrogate keys. Often times, there are natural keys in the data (data that is a unique value) that work well instead. These are values such as customer ID’s, social security numbers, or composite keys already in use by the transactional data as primary keys that are stable enough to retain all of the characteristics that primary keys need.
We’ve written before about when to use surrogate keys, so for the purposes of this article we will suffice with presenting just a few points to keep in mind:
- Surrogate keys should not have a relationship to the data. That is, it should not be subject to business rules. These rules can change over time and render previously unique values non-unique.
- Primary keys should be rather compact. Large, complicated, composite keys of 3 or more fields can be cumbersome. If the natural key candidate is compact and stable as a unique value (see the previous point), there is probably no reason to add in a surrogate key.
- When using surrogate keys, plan for the system to always generate unique values using techniques like UUIDs, GUIDs, or MAX() + 1. This will ensure that any surrogate key is indeed unique.
- Surrogate keys exist to mark a row as unique without providing business context. That is the value they provide. They should not be used for queries that will show them to the end-user. If they are, you have now introduced a business context and a relationship to the data that should not exist. Rethink your model and queries.
Poor Naming Standards
Having poor naming standards can really throw a wrench into any data-related activity. This is a vital step in planning the data model for your analytics resource. Skipping this step can lead to a lot of unnecessary headaches and frustrations down the road. This is especially true when bringing data together from multiple sources.
One of the main underpinnings to data is consistency. This should extend to the names that we give our tables, columns, constraints, measures, etc. The benefits to following a standard naming convention become clear very rapidly. If you are trying to create queries for analysis but your tables and measures do not have any logic behind the way they are named, it will be difficult to follow. For instance, if you have these tables:
Production_Materials Costs productionMachines MaterialVendors
It can difficult, if not impossible, to know how these are labeled, or what they are, without looking them up every time. It is much easier to have consistency across the data model. This could look like:
Production_Materials Material_Vendors Production_Machines Material_Costs
This is a much better way to keep your data analysis on track by giving the data model some consistency.
There are many standard methodologies that are used in naming standards. It is relatively easy to pick one that works for your organization and implement it. There is no need to come up with your own unique naming convention. If you are the data architect creating the analytics framework for the first time, it is your duty to implement a standard for future analysts to follow. Failing to do so is a severe oversight.
Common Mistakes when Working with Legacy Tools
The following mistakes could be extremely costly when working with legacy BI tools or RAM-intensive in-memory systems, but should be less of an issue for Sisense users due to the guaranteed fast performance of its In-Chip data engine. For more about data engineering in Sisense, check out this TDWI report.
Wrong Levels of Granularity
In analytics, granularity refers to the level of detail that our data allows us to see. For instance, take our time and sales dimensions as an example during a product launch for an online business. The granularity would be how small of a time frame we can look at. Can we look at the number of transactions per hour, minute, or even per second? If we had the data to see how many transactions took place per second, transactions per second that would be the granularity. If the data that we had was hourly, then our granularity would be transactions per hour. This does not only apply to time dimensions, though. It can be things like cost per visit to a doctor’s office or customers per zip code as well.
This is an issue that needs a nice balance. Too much, and you have a ton of extraneous data that could be difficult to navigate for the end-user, which might be a business executive that is more interested in the big picture. Too little, and you cannot see the details that you need – and additionally, the performance of your business intelligence platform could suffer, particularly when working with legacy BI tools (it’s worth noting that this should not be an issue when using modern data analytics tools with strong data processing capabilities).
This is important to keep in mind, not only as you plan your analysis and model your data, but much earlier as you think of which data to capture and store as well. After all, you can’t analyze the data if you do not actually have it.
Not all measures are found in the hard data. There are many instances where derived, or calculated, values will be necessary for analysis. However, not planning for these derived values in your data model can be a costly mistake. When running analytical queries on legacy systems, making many calculations can be cumbersome for the infrastructure and slow down the entire process. Furthermore, it can cause some major inconsistencies.
As mentioned, this is less of an issue with Sisense since its In-Chip™ technology can perform these calculations ad-hoc, without eating up your computer’s RAM resources. However, if you are using legacy or in-memory systems, there are a few ways to deal with many derived values depending on the situation. When there are derived values that will constantly be referenced either for reporting or analysis purposes, it can be prudent to make these calculations during the ETL process and store them in advance. Also, if the calculated value is computationally intensive (i.e. complex formula), it may be a good idea to store the derived value.
Aside from when to store them versus not storing them, not planning for derived values will impair your analytics. There are metrics, especially in financial sectors, that the calculation to obtain that metric will be subjective. It depends on the firm and their methods as to which formula gets used for a particular metric. If a consistent formula is not used, this will create a data integrity and consistency issue that will undermine any analysis using it.
It is best to check with the proper business departments to make sure consistency is achieved. Planning derived values will help your analysis and improve results in the future.
Dimensional hierarchies can be intimidating for any analysis. Sometimes there is no clear way to identify whether or not a particular dimension is a part of a hierarchy. Other times business goals for the analysis muddles up the relationship. This is when further investigation is required. If the hierarchy is unclear, take the time to reach out to the business personnel that can clear up the data and how it is used. This can be problematic in large firms that are highly segregated. It can become difficult to correctly map your data model without consulting many departments.
If care isn’t taken to iron out the correct hierarchies, analysis can be severely hampered. One instance would be in creating a dashboard for upper management of a manufacturing firm. If the plant location hierarchy wasn’t mapped correctly, users would not be able to drill down into the data and identify problematic plants according to region or municipality. Of course, this is just one simple example. Incorrect dimensional hierarchies in your data model will also require overly complex queries to get the information needed for analysis, create attribute relationship inconsistencies, and even prevent some analysis from being performed altogether.
It is best to plan out your data model and correct hierarchies where needed before moving on with your analytics process.
Ignoring Small Data Sources
Inevitably, critical business data is stored in all kinds of places. It is stored in spreadsheets, databases, notebooks, mobile devices, memos, receipts, etc. Part of the challenge, and perhaps a goodly portion of it, for analytics, is gathering the data from all of these ‘information silos’. One big mistake that is made is drawing conclusions based on an incomplete dataset that ignores these smaller data sources.
This is doubly true for smaller companies that have yet to implement a robust infrastructure that connects its various departments and centralizes its data, or because joining multiple data sources is still an IT process and cannot be handled by the business users. Small business can grow at exponential rates, and keeping up with that for data collection is usually the last thing on anyone’s mind. So, by the time the firm wants to run analytics, it has many information silos that need to be included for accurate results.
For instance, that spreadsheet that the Lead Administrative Assistant uses for tracking clients and communications can be the crux of customer analysis. Leave that out and you may fail to realize important connections. It may be the only record that two customer names from two different sources are actually the same customer. Imagine the headache that could cause.
Bottom line, make sure to take the time to identify and collect the data from these data sources. It could make or break your analysis.
You have probably noticed that there is a common theme when it comes to modeling your data for analytics. Planning. Taking the time to plan the goals of your analytics and the merging of data from many sources. When done properly, modeling the data becomes much easier and the analytics generate more reliable, actionable insights from your data.
Your planning phase should include the points listed above. It should also be much longer than the actual modeling time. Taking the time to set some standards, collect all of the data, define the goals of various resources, and establish what values your analysis will need will save much headache, frustration, time and energy.
Want to learn about why Sisense makes data modeling a breeze? Check out the in-depth overview of Sisense technology in this report by TDWI, or watch our on-demand webinar on Demystifying Data Modeling. Ready to move to the next stage of analysis? Read about the 4 Principles of Effective Dashboard Design.
Published at DZone with permission of Eran Levy , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.