
A Data Warehouse Alone Won’t Cut It for Modern Analytics

The data warehouse market has evolved since the creation of OLAP; now we need to evolve as well in order to meet the needs of modern analytics.

By David Wang · Jan. 06, 23 · Opinion


The world of data warehouses as we know it can be traced back to 1993, when Dr. E.F. Codd conceived the idea of OLAP. A marvel for its time, the then-new processing paradigm for decision support systems paved the way for the modern data warehouse: a purpose-built data system to process analytics queries and aggregations on large data sets. It was clear that having separate relational databases for transactions and analytics made sense.

Since OLAP came about, the data warehouse market has seen a steady evolution. In the last 20 years, the market adopted columnar storage with Vertica, moved to the cloud with Snowflake, and is now beginning to morph into lakehouses. Despite all this change in technology, these systems all address the same use case: the classic business intelligence and reporting workflow.

The workflow always starts the same way: a business user wants a better grasp of their metrics, so they contact a data analyst, who works with them to turn those questions into SQL queries. Once those are ironed out, the analyst uses a BI tool to run the queries against the data warehouse and notifies the business user once the report is done.

It is important to recognize that this is the standard workflow for every data warehouse, but there are a few things to take note of:

  1. The latency introduced when the data set being analyzed is large (greater than 1 TB or a few million rows) is meaningful. But this is typically acceptable for a report, as query time is less of a concern when the result is usually sitting in an inbox or cached in a dashboard.

  2. There is a users-performance-cost relationship similar to the ideal gas law, i.e., temperature increases as pressure increases. In the same way, an increase in the number of queries increases the response time (unless a high expense is paid to scale the warehouse). But this is not really an issue for the reporting workflow: the number of users initiating data requests is small, so the BI tool typically doesn't put a strain on warehouse performance.

  3. Data warehouses are made for batch-ingested historical data that is typically refreshed overnight so as not to impact query performance during the working day. This is acceptable when insights are based on long historical windows, but not when insights on operational, real-time data are also needed.

It's understandable to think of these as limitations of data warehouses, and in a sense they are; but data warehouses are purpose-built for the standard reporting workflow, so for that workflow they are acceptable trade-offs. These three points become extremely important, however, for an emerging new analytics use case: analytics applications.

You can see an example of this at Salesforce. They built an observability application that allows engineers, customer service representatives, product owners, and others to interactively query real-time application performance metrics ingested at a rate of billions to trillions of Kafka log lines a day.

You may be asking: Do they need performance at that scale? Does their team have unrestricted ability to query raw data without latency? Are they able to analyze streaming events as they are created, at enormous scale? The answer to all of those questions is yes!

The reality is that thousands of companies have come to the conclusion that there is a whole new generation of analytics beyond traditional BI. Their developers are creating analytics applications that provide operational visibility at scale, enable slice-and-dice at high volume, support real-time decisioning, and extend insights to customers.

What Are the Database Requirements for Analytics Applications?

When you’re building an analytics application but aren’t sure what database to use, you typically start with what you already know and go from there.

So you think about what you need. You need analytics capabilities to process terabytes of data for large aggregations and GROUP BY queries on high-dimensionality, high-cardinality data. That would lead you to the idea of a data warehouse.
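To make that concrete, here is a minimal sketch of the kind of query meant here: a big aggregation grouped by high-cardinality dimensions. The events table and every column name are hypothetical, and the statement is kept as a plain string so it could be handed to any warehouse's DB-API driver.

```python
# A hedged sketch of a typical warehouse-style aggregation.
# The "events" table and all column names are hypothetical.
AGGREGATION_SQL = """
SELECT
    country,                          -- GROUP BY dimension
    device_type,                      -- another dimension
    COUNT(DISTINCT user_id) AS users, -- high-cardinality aggregation
    SUM(revenue)            AS revenue
FROM events
WHERE event_time >= TIMESTAMP '2023-01-01 00:00:00'
GROUP BY country, device_type
ORDER BY revenue DESC
"""

if __name__ == "__main__":
    # Hand this to whatever warehouse driver you already use.
    print(AGGREGATION_SQL)
```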

Hold on, you also need application traits: sub-second responses at potentially thousands of queries per second, as well as high throughput of up to millions of real-time events ingested per second. That would lead you to the idea of a transactional database. But wait, that won't work for the analytics requirements you have.

Enter Apache Druid. 

This gap is why Druid was born. Druid was made for the intersection of analytics and applications: not a play at HTAP, but rather a real-time analytics database that provides the speed and scale needed for both.

Druid exists to handle complex queries on large data sets with the responsiveness expected of user-facing applications and automated pipelines. That is why Reddit chose Druid: with it, Reddit can process user data to serve real-time ads in less than 30 milliseconds.
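As a rough illustration of what querying Druid from an application looks like, here is a minimal sketch using Druid's HTTP SQL API. It assumes a local Druid router on its default port 8888 and a hypothetical app_metrics datasource; both are assumptions, not details from the Reddit story.

```python
# A minimal sketch of querying Apache Druid over its HTTP SQL API.
# The router address and the app_metrics datasource are assumptions.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

QUERY = """
SELECT service, COUNT(*) AS requests, AVG(latency_ms) AS avg_latency_ms
FROM app_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY service
ORDER BY requests DESC
LIMIT 10
"""

resp = requests.post(DRUID_SQL_URL, json={"query": QUERY}, timeout=10)
resp.raise_for_status()
for row in resp.json():  # Druid returns a JSON array of row objects
    print(row)
```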

Uniquely, Druid has also achieved high queries per second without taking a heavy toll on your wallet, thanks to its CPU-efficient architecture. This can be seen at Twitch, where engineers gladly run 70,000 queries per day with the help of Druid.

Apache Kafka and Amazon Kinesis pipelines also get a database built for data in motion, because Druid supports true stream ingestion: events are consumed directly from the stream and become queryable as they arrive. That is why the engineers at Confluent, the original creators of Apache Kafka, use Druid to power their observability application, which ingests more than 3 million events per second.
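To show what registering that kind of stream ingestion looks like, here is a hedged sketch of submitting a Kafka supervisor spec to Druid; once accepted, Druid consumes the topic continuously. The topic name, broker address, schema, and endpoint host are all hypothetical.

```python
# A hedged sketch of registering Kafka stream ingestion with Druid.
# Topic, broker address, schema, and endpoint host are assumptions.
import requests

SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"

spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "app-metrics",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "inputFormat": {"type": "json"},
            "useEarliestOffset": True,
        },
        "dataSchema": {
            "dataSource": "app_metrics",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": ["service", "endpoint", "status"]
            },
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(SUPERVISOR_URL, json=spec, timeout=10)
resp.raise_for_status()
print(resp.json())  # on success, Druid echoes the supervisor id
```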

New Analytics Use Case, Purpose-Built Open Source Database

The classic BI and reporting workflow isn't going anywhere. Business users will still get insights from data analysts, who will still write SQL queries for BI tools that query a data warehouse, and the whole data warehouse and lakehouse gang will continue to fight over that workload.

It is just important to remember that if you are looking to build analytics applications, there is a more interesting database to check out.

Apache Druid isn't a traditional OLAP database, because the developers at thousands of companies aren't using it for traditional OLAP queries. Follow the lead of Netflix, Confluent, and Target: they're building and investing in the emerging use case that sits at the intersection of analytics and applications, namely analytics applications.


Opinions expressed by DZone contributors are their own.
