
Why All Your Data Should Be Raw

Data is necessary to grow any business — so stop wasting it. By keeping your data raw, you can ask any query you want without having to plan for it in advance.

By Archana Madhavan · Sep. 15, 17 · Opinion

Your company generates a ton of data — so much that it’s essential to pare it down and only store the most relevant stats, right?

Well, it was 1975 when data warehouses were developed. At that time, a gigabyte of storage cost $200,000. But today? About 2 cents.

With this low storage cost, companies can stop worrying about compression and start worrying about making sure they can fully understand their data. Heavily “cooking” (processing) data may have been necessary a few decades ago, but now, its few benefits are far outweighed by the advantages of keeping that data raw.

What Is Cooked Data?

“Cooked” data is data that has been taken from its raw format and processed, reorganized, or compressed. Traditionally, companies heavily cook their data in order to optimize storage space and query times. Three major ways to cook data are:

  1. Fitting data warehouses with compression schemas. One common schema is the star schema, which compresses data by taking information from an event and storing it in different dimension tables.

    [Image: star schema]

    When an event, such as a click, occurs, information like the timestamp and user ID is collected. In a star schema, this information is split up into pieces and stored in dimension tables (see the sketch after this list).

  2. Fitting tables with indices. Schemas are usually paired with indices, like bitmaps and B-trees, so information can be found again quickly.
  3. Only storing aggregates or subsets of the data. Companies may choose to store precomputed aggregates, like averages, or just pick a few dimensions of the data to store in an OLAP cube, instead of keeping the raw data.
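
As a rough illustration of the star-schema splitting described in item 1, here is a minimal sketch in Python, with hypothetical event fields and table names, of how a single raw click event gets "cooked" into a compact fact row plus dimension tables. Anything that never gets promoted to a dimension is gone for good.

```python
# A minimal sketch (hypothetical field and table names) of star-schema "cooking":
# one raw event becomes a small fact row that points into dimension tables.

raw_event = {
    "event_type": "click",
    "timestamp": "2017-09-15T12:34:56Z",
    "user_id": "u_123",
    "user_country": "US",
    "page_url": "https://example.com/pricing",
    "device": "mobile",
}

# Dimension tables: each distinct user/page/device is stored once, keyed by a surrogate ID.
dim_users, dim_pages, dim_devices = {}, {}, {}

def dim_key(table, value):
    """Return the surrogate key for `value`, inserting it if it is new."""
    if value not in table:
        table[value] = len(table) + 1
    return table[value]

def to_star_schema(event):
    """Split a raw event into a compact fact row referencing dimension tables."""
    return {
        "timestamp": event["timestamp"],
        "user_key": dim_key(dim_users, (event["user_id"], event["user_country"])),
        "page_key": dim_key(dim_pages, event["page_url"]),
        "device_key": dim_key(dim_devices, event["device"]),
    }

fact_clicks = [to_star_schema(raw_event)]
print(fact_clicks[0])  # e.g. {'timestamp': ..., 'user_key': 1, 'page_key': 1, 'device_key': 1}
```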

But these methods were created because they allowed the data to fit on a machine and allowed people to answer queries quickly — not because they actually made sense. Subtle bugs, like an email automator pulling information from the wrong table, are exceedingly difficult to find when data is processed like this.

And again, the motivation behind cooking data no longer exists as storage prices have dropped.

Better Understand Your Data by Keeping It Raw

The Sushi Principle says that raw data is better than cooked data because it keeps your data analysis fast, secure, and easily comprehensible. There are three steps you need to take to keep your data raw.

[Image: the Sushi Principle]

1. Use a Simple, Well-Tested Pipeline

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

  • Avoid flawed calculations. If you have thousands of machines running your pipeline in real time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.
  • Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.
  • Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.

Of course, there are a few circumstances where you will need business logic in your pipeline. Regulations may require you to purge old user accounts and drop IP addresses. But every time you think about pushing a piece of business logic into your pipeline, you need to consider the risks. We’re all still relatively bad at writing software; every complicated bit you add increases your chances of an error. And since storage is so much cheaper now, you have every incentive to just perform those calculations later.
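
To make the "simple, well-tested pipeline" idea concrete, here is a minimal sketch in Python, using hypothetical file paths and field names, of an ingest step that lands every event unchanged. The only business logic is the compliance example above, dropping the IP address; everything else is left for query time.

```python
import json
from datetime import datetime, timezone

RAW_LOG = "raw_events.jsonl"  # hypothetical append-only raw store

def ingest(event: dict) -> None:
    """Land the event as-is: no aggregation, no schema fitting, no fancy transforms."""
    record = dict(event)
    record.pop("ip_address", None)  # the one piece of business logic: drop IPs for compliance
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()
    with open(RAW_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage
ingest({"event_type": "click", "user_id": "u_123", "ip_address": "203.0.113.7",
        "page_url": "https://example.com/pricing"})
```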

2. Keep All of Your Original Data

Once you’ve gone through the trouble of collecting all your data, you shouldn’t toss out portions of it. With data storage costs so low, there’s no reason not to keep all of your data — and plenty of reasons to do so:

  • You can easily trace the lineage of any statistic. Imagine trying to figure out exactly how your daily active users (DAU) figure was calculated. If your stored data is in the same format it was generated in, you can simply ask the developer of the service that generated it what they meant. If your data is heavily processed, it’s much harder to backtrack through all the transformations to find the original values.
  • You can perform any query you want. The beauty of data is in how it can lead you to further questions. If the number of users subscribing through email is shockingly low, you’re going to want to look into the attributes of users who do actually sign up through that channel. You don’t lose any detail when you have all your data on hand, which means you can iterate on your questions at any time. If you’ve pared down your data to an OLAP cube, you can only measure the dimensions you already defined — everything else is lost.

[Image: Interana query builder]

  • You don’t have to waste time deciding what stats you want. If you decide to precompute stats, you’re going to need to spend a whole lot of time planning out what those will be — and even that’s no guarantee that you’ll have everything you need.

Keeping your original data cuts out unnecessary work so that you can get to the parts that actually add value. It removes the need for extensive up-front planning and for time spent tracing where your stats came from, leaving more time to fully explore your data.
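
As a rough illustration of "perform any query you want," the sketch below (hypothetical field names, reading the raw log from the earlier ingest sketch) answers a question nobody planned for when the data was collected: how many distinct users clicked a pricing page. With a precomputed OLAP cube, this only works if that exact dimension was chosen up front.

```python
import json

def query_raw(path, predicate):
    """Scan raw events and apply an arbitrary, after-the-fact predicate."""
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if predicate(event):
                yield event

# An ad hoc question no one planned for when the data was collected:
pricing_clickers = {
    e["user_id"]
    for e in query_raw("raw_events.jsonl",
                       lambda e: e.get("event_type") == "click"
                       and "pricing" in e.get("page_url", ""))
}
print(len(pricing_clickers), "distinct users clicked a pricing page")
```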

3. Summarize and Sample at Query Time

You may be tempted to summarize and sample your data early in the pipeline. The thinking goes: I’m going to have to do these things no matter what, so why not shrink my data and make it easier to process? But sampling and summarizing early on can harm the accuracy of your data. It’s much less risky to do both at query time:

  • You can ensure that your summary statistics aren’t skewed. If you calculate the average number of edits a Wikipedia user makes per week, that figure is going to be outrageously high unless you exclude bots. (Check that out yourself with our demo.) While this may seem like a mistake you’d never make, little things slip through the cracks all the time.
  • You can sample once you know who’s interesting. You can’t simply keep every 100th event that’s logged — that doesn’t give you a picture of how users, accounts, and devices are behaving. You need to sample by actor, not event. But you won’t know which actors will be interesting to look at before you’ve started coming up with queries. And the types of users you want to look at will change between queries.

[Image: sampling by actor]

  • You’ll get statistically significant results. Much of the time, you’re going to want to look into the behavior of small segments of your user population. But if you sample before query time, you may not have enough data on that small population to get statistically significant answers to your queries.

Yes, you will likely need to sample your data at some point to get answers to your queries quickly. But doing that sampling at query time ensures that you have the representative sample you need for every query.
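
To illustrate both points, the sketch below (hypothetical field names again, over the same raw event log) excludes bots at query time and samples by actor rather than by event: it hashes user IDs so that a consistent slice of users, with all of their events, ends up in the sample before anything is summarized.

```python
import hashlib
import json
from collections import defaultdict
from statistics import mean

def in_sample(user_id: str, rate: float = 0.01) -> bool:
    """Deterministically keep ~`rate` of users (all of their events), not 1-in-N events."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < rate * 10_000

edits_per_user = defaultdict(int)
with open("raw_events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("event_type") != "edit":
            continue
        if event.get("is_bot"):          # summarizing at query time makes bots easy to exclude
            continue
        if in_sample(event["user_id"]):  # sample by actor, keeping every event for sampled users
            edits_per_user[event["user_id"]] += 1

if edits_per_user:
    print("avg edits per sampled human user:", mean(edits_per_user.values()))
```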

Do Less to Your Data and Do More With It

Data is necessary to grow any business — so stop wasting it.

We believe data works best when you can iterate queries continuously instead of having to craft the perfect idea first; if you throw business logic into your pipeline, you lose this ability. By keeping your data raw, you can ask any query you want without having to plan for it in advance.


Published at DZone with permission of Archana Madhavan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
