DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
View Events Video Library
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Migrate, Modernize and Build Java Web Apps on Azure: This live workshop will cover methods to enhance Java application development workflow.

Modern Digital Website Security: Prepare to face any form of malicious web activity and enable your sites to optimally serve your customers.

Kubernetes in the Enterprise: The latest expert insights on scaling, serverless, Kubernetes-powered AI, cluster security, FinOps, and more.

A Guide to Continuous Integration and Deployment: Learn the fundamentals and understand the use of CI/CD in your apps.

Related

  • Migrating From ClickHouse to Apache Doris: What Happened?
  • Conversational Applications With Large Language Models Understanding the Sequence of User Inputs, Prompts, and Responses
  • Resilient Kafka Consumers With Reactor Kafka
  • Rails Asynchronous Processing

Trending

  • Scaling SRE Teams
  • CI/CD Metrics You Should Be Monitoring
  • How To Use KubeDB and Postgres Sidecar for Database Integrations in Kubernetes
  • Ways To Reduce JVM Docker Image Size
  1. DZone
  2. Data Engineering
  3. Databases
  4. Massive Parallel Query Log Processing With ClickHouse

Massive Parallel Query Log Processing With ClickHouse

Learn how ClickHouse uses multi-core parallel query processing to efficiently perform heavy analytical workloads and crunch log files.

Vadim Tkachenko user avatar by
Vadim Tkachenko
·
Sep. 15, 17 · Tutorial
Like (2)
Save
Tweet
Share
4.8K Views

Join the DZone community and get the full member experience.

Join For Free

In this blog, I'll look at how to use ClickHouse for parallel log processing.

Percona is seen primarily for our expertise in MySQL and MongoDB (at this time), but neither is quite suitable to perform heavy analytical workloads. There is a need to analyze data sets, and a very popular task is crunching log files. Below, I'll show how ClickHouse can be used to efficiently perform this task. ClickHouse is attractive because it has multi-core parallel query processing, and it can even execute a single query using multiple CPUs in the background.

I am going to check how ClickHouse utilizes multiple CPU cores and threads. I will use a server with two sockets, equipped with "Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz" in each. That gives a total of 28 CPU cores / 56 CPU threads.

To analyze workload, I'll use an Apache log file from one of Percona's servers. The log has 1.56 billion rows and uncompressed it takes 274G. When inserted into ClickHouse, the table on disk takes 9G.

How do we insert the data into ClickHouse? There is a lot of scripts to transform Apache log format to CSV, which ClickHouse can accept. As for the base, I used this one, and you can find my modification here.

The ClickHouse table definition:

CREATE TABLE default.apachelog ( remote_host String, user String, access_date Date, access_time DateTime, timezone String, request_method String, request_uri String, status UInt32, bytes UInt32, referer String, user_agent String) ENGINE = MergeTree(access_date, remote_host, 8192)

To test how ClickHouse scales on multiple CPU cores/threads, I will execute the same query by allocating from 1 to 56 CPU threads for ClickHouse processes. This can be done as:

ps -eLo cmd,tid | grep clickhouse-server | perl -pe 's/.* (d+)$/1/' | xargs -n 1 taskset -cp 0-$i

...where $i is (N CPUs-1).

We must also take into account that not all queries are equal. Some are easier to execute in parallel than others. So I will test three different queries. In the end, we can't get around Amdahl's Law!

The first query should be easy to execute in parallel:

select extract(request_uri,'(w+)$') p,sum(bytes) sm,count(*) c from apachelog group by p order by c desc limit 100

Speedup:

Image title

It's much more interesting to chart these results:

From the chart, we can see that the query scales linearly up to 28 cores. After that, it continues to scale up to 56 threads (but with a lesser slope). I think this is related to the CPU architecture (remember we have 28 physical cores and 56 CPU "threads"). Let's look at the results again. With one available CPU, the query took 823.6 seconds to execute. With all available CPUs, it took 23.6 second. So the total speedup is 34.9 times.

But let's consider a query that allows a lesser degree of parallelism. For example, this one:

select access_date c2, count(distinct request_uri) cnt from apachelog group by c2 order by c2 limit 300

This query uses aggregation that counts unique URIs, which I am sure limits the counting process to a single shared structure. So some part of the execution is limited to a single process. I won't show the full results for all 1 to 56 CPUs, but for one CPU, the execution time is 177.715 seconds, and for 56 CPUs the execution time is 11.564 seconds. The total speedup is 15.4 times.

The speedup chart looks like this:

As we suspected, this query allows less parallelism. What about even heavier queries? Let's consider this one:

SELECT y, request_uri, cnt FROM (SELECT access_date y, request_uri, count(*) AS cnt FROM apachelog GROUP BY y, request_uri ORDER BY y ASC ) ORDER BY y,cnt DESC LIMIT 1 BY y

In that query, we build a derived table (to resolve the subquery) and I expect it will limit the parallelism even further. And it does; with one CPU, the query takes 183.063 seconds to execute. With 56 CPUs, it takes 28.572 seconds. So the speedup is only 6.4 times.

The chart is:

Conclusion

ClickHouse can capably utilize multiple CPU cores available on the server, and query execution is not limited by a single CPU (like in MySQL). The degree of parallelism is defined by the complexity of the query, and in the best case scenario, we see linear scalability with the number of CPU cores. For the scaling on multiple servers, you can see my previous post.

However, if query execution is serial, it limits the speedup (as described in Amdahl's Law).

One example is a 1.5 billion record Apache log, and we can see that ClickHouse can execute complex analytical queries within tens of seconds.

Database ClickHouse Processing MASSIVE (software)

Published at DZone with permission of Vadim Tkachenko, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Migrating From ClickHouse to Apache Doris: What Happened?
  • Conversational Applications With Large Language Models Understanding the Sequence of User Inputs, Prompts, and Responses
  • Resilient Kafka Consumers With Reactor Kafka
  • Rails Asynchronous Processing

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: