DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Curious about the future of data-driven systems? Join our Data Engineering roundtable and learn how to build scalable data platforms.

Data Engineering: The industry has come a long way from organizing unstructured data to adopting today's modern data pipelines. See how.

Threat Detection: Learn core practices for managing security risks and vulnerabilities in your organization — don't regret those threats!

Managing API integrations: Assess your use case and needs — plus learn patterns for the design, build, and maintenance of your integrations.

Related

  • A Data Warehouse Alone Won’t Cut It for Modern Analytics
  • How to Generate Customer Success Analytics in Snowflake
  • Building a Slackbot to Answer Analytics Questions From Your Database
  • Nginx Log Analytics With AWS Athena and Cube.js

Trending

  • Jakarta WebSocket Essentials: A Guide to Full-Duplex Communication in Java
  • Reactive Kafka With Spring Boot
  • How Jenkins Can Stay Relevant in the Next Decade
  • Optimizing Vector Search Performance With Elasticsearch
  1. DZone
  2. Data Engineering
  3. Data
  4. Things to Consider When Scaling Analytics for High QPS

Things to Consider When Scaling Analytics for High QPS

For data engineers architecting applications, this post explores key technical considerations to support high QPS for analytics and the strengths of Apache Druid.

By 
David Wang user avatar
David Wang
·
Aug. 18, 23 · Analysis
Likes (1)
Comment
Save
Tweet
Share
3.7K Views

Join the DZone community and get the full member experience.

Join For Free

For some, the thought of analytics and high QPS (queries per second) together may seem unlikely. After all, we typically think of analytics as the occasional report and dashboard into business metrics.

But, analytics use cases are evolving with high volume, low latency queries on the rise. Companies like Confluent, Target, and Pinterest use analytics for much more than weekly executive summaries. They’re making analytics available across their organizations; their teams are exploring high-dimensional raw data in a free-flowing, ad-hoc nature; and they’re powering analytics applications and data products for 1000s to millions of external users and customers. 

In the era of analytics where query volume is the sine qua non “V” of data, how should we think about system architecture — what matters and why?  

High QPS or High Concurrency: Is There a Difference?

First, let’s clear up a common misconception between high QPS and high concurrency. While they are both related to the number of simultaneous queries, they are different metrics when it comes to database performance and system architecture. 

QPS is the key technical objective: process a specific number of queries in one second, start to finish. On the other hand, high concurrency is not so much a technical objective but an attribute of a system that can handle multiple simultaneous queries.

In other words, high concurrency is an aspect of system design that enables high QPS. But a database designed for high concurrency (i.e., most modern databases today) does not ensure high QPS.  There are many more factors at play here.

Factors that Make High QPS for Analytics Hard

The ability of an analytics database – actually, any database for that matter — to handle high QPS is a function of the type of queries and the amount of data that needs to be processed for each query and of course the architecture of the database itself.

Type of Queries:  Different types of queries have different levels of complexity, and some queries may require more processing power, memory, or disk I/O than others. For example, a simple query that selects a single row from a table can be processed quickly, while a complex query that performs aggregations will take longer to complete.  

Amount of Data: In addition to query complexity, the amount of data that needs to be processed for each query also plays a role in determining the QPS that a database can handle. Queries that involve large amounts of data require more time to scan and process, and require more computing resources than small data sets.

High QPS is less of an issue for transactional databases like MongoDB, Oracle, and MySQL as they’re dealing with simple CRUD operations on specific rows of data. But, high QPS for analytics queries on larger data sets is a much different animal. 

So then, let’s consider data warehouses — the classic analytics workhorse for complex queries — like Snowflake, BigQuery, and Redshift. The problem here is that these systems are built for infrequent queries like reports and static dashboards for executive status checks. They’re built to minimize cost per query and do things like separate compute and storage, but that comes at a cost to query latency, making them not ideal for high QPS requirements.  

And of course, pre-computing all your aggregations into cache is an option albeit a highly restrictive and labor intensive one.

This is why architecting high QPS for analytics is hard.

High QPS for Analytics Needs More than Hardware 

It’s easy to think that throwing more hardware at a problem will make it go away. After all, more resources — CPU, memory, and disk I/O — will increase the number of concurrent connections possible for a database but that alone doesn’t give the performance guarantee for sub-second queries…or the baggage that comes with scaling. More on that in a second. 

So solving for high QPS for analytics needs to take in more considerations. Let’s look at them.

Processing Efficiency

The key to designing for fast read-intensive queries lies in minimizing the amount of work and scanning required. And it requires a database with the smarts to know how to take a complex query and determine what to scan or not: don’t read the full data set if it can read the smallest index; don’t send data unnecessarily from server to server; don’t perform large complex joins at query-time if you can do it earlier. 

The ideal architecture accepts an analytics query and minimizes the amount of computation required by storing the data and querying it in a highly optimized way, even if querying petabytes of data. So while parallel processing is an obvious way to have a system designed for high concurrency, the system itself needs to have the smarts in its storage engine to read less data to begin with. 

Operational Burden

Virtually any database today can scale-out for concurrency. But that doesn’t mean it’ll be fun to do so!

Take for example, PostgreSQL: a classic, general purpose database that developers like to use for everything from transactional processing to analytics too. It’s natively designed as a scale-up architecture and it’s pretty fast for analytics queries when the data set and the concurrency is on the low end.  But as your application grows, what can be done?

This is generally what’s involved in configuring PostgreSQL and other OLTP databases for high concurrency: 

  1. Database design: ensure that the database schema and table structures are optimized — e.g., manual indexing, distribution, and normalization techniques. 
  2. Partitioning: divide larger tables into smaller tables based on a user-defined partition key, with manual sharding to horizontally partition across multiple instances
  3. Replicate the data: create data copies from the primary server to replicas to reduce the load on the primary server and increase concurrency. 
  4. Load balance: now with replicas, set up a load balancer to distribute incoming queries across multiple PostgreSQL instances with a connection pooling mechanism to handle the concurrent load.
  5. Query rewriting: analyze query plans and performance metrics to figure out how to optimize the queries
  6. Monitor…optimize…grow…rinse…and repeat

So while “anything is possible” might be true, there’s definitely pain when stretching a database beyond its core use.

Amount of Infrastructure

Why use 1000 vCPU when you can use 100? This might sound obvious, but it requires the right system architecture to minimize the processing horsepower required. So again, storage design, query efficiency, and resource management at scale all play a key role in meeting the performance SLA for high QPS and keeping the costs down. 

Apache Druid 

As we’ve seen here, achieving high QPS for analytics isn’t so easy.  There are shortcuts like pre-computing and caching aggregations, but that’s highly restrictive. 

That’s where the open-source database Apache Druid comes in. It’s a high performance, real-time analytics database specifically designed for applications that need ultra-low latency on high volume queries. For example, Target uses Druid for analytics applications that execute over 4 million queries per day with an average response of 300ms; Confluent powers a customer-facing application requiring 350 QPS on data streaming in at 5 million events per second. 

And it’s an economical solution too: for example, a performance benchmark comparing Google BigQuery to Apache Druid shows a 12X price for performance difference as Druid fundamentally uses less infrastructure under load. 

Analytics Database Queries per second Data (computing)

Published at DZone with permission of . See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • A Data Warehouse Alone Won’t Cut It for Modern Analytics
  • How to Generate Customer Success Analytics in Snowflake
  • Building a Slackbot to Answer Analytics Questions From Your Database
  • Nginx Log Analytics With AWS Athena and Cube.js

Partner Resources


Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: