
Fast Analytics With MariaDB ColumnStore

Start your analytics journey with the open-source MariaDB relational database and the ColumnStore storage engine on Docker.

By Alejandro Duarte · Jan. 19, 24 · Analysis

Slow query times in large datasets are a common headache in database management. MariaDB ColumnStore offers a neat way out of this. It's a columnar storage engine that significantly speeds up data analytics. Typically, you can improve query performance in relational databases by adding appropriate indexes. However, maintaining indexes is hard, especially with ad-hoc queries where you don't know in advance which indexes will be needed. ColumnStore eases this pain. It's as if you had an index on each column, but without the hassle of creating and updating them. The price to pay? Well, inserts are not as fast as with InnoDB, so this is not the best option for operational/transactional databases but rather for analytical ones. Bulk inserts are very fast, though.

There's plenty of online documentation about ColumnStore, so I won't go through all the details on how it works or how to deploy it on production. Instead, in this article, I'll show you how to try MariaDB ColumnStore on your computer using Docker.

Prerequisites

You'll need:

  • The mariadb command-line tool
  • Docker

Setting up MariaDB ColumnStore

  1. Run a container with MariaDB + ColumnStore:
Shell
 
docker run -d -p 3307:3306 -e PM1=mcs1 --hostname=mcs1 --name mcs1 mariadb/columnstore


This command runs a new Docker container using the official ColumnStore image, with several specified options:

  • docker run: Starts a new Docker container.
  • -d: Runs the container in detached mode (in the background).
  • -p 3307:3306: Maps port 3307 on the host (your computer) to port 3306 inside the container. This makes the database accessible on port 3307 of the host machine.
  • -e PM1=mcs1: The PM1 environment variable specifies the primary database node (mcs1).
  • --hostname=mcs1: Sets the hostname of the container to mcs1.
  • --name mcs1: Names the container mcs1.
  • mariadb/columnstore: Specifies the Docker image to use, in this case, an image for MariaDB with the ColumnStore storage engine.
  2. Provision ColumnStore:
Shell
 
docker exec -it mcs1 provision mcs1


The command docker exec is used to interact with a running Docker container. This is what each option does:

  • docker exec: Executes a command in a running container.
  • -it: This option ensures the command is run in interactive mode with a terminal.
  • mcs1 (first occurrence): This is the name of the Docker container in which the command is to be executed.
  • provision mcs1: This is the specific command being executed inside the container. provision is a script included in the Docker image that initializes and configures the MariaDB ColumnStore environment within the container. The argument mcs1 is passed to the provision command to specify the host for the MariaDB server within the Docker container.
  3. Connect to the MariaDB server using the default credentials defined in the MariaDB ColumnStore Docker image:
Shell
 
mariadb -h 127.0.0.1 -P 3307 -u admin -p'C0lumnStore!'


  4. Check that ColumnStore is available as a storage engine by running the following SQL statement:
SQL
 
SHOW ENGINES;
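
As an optional sanity check, you can also query information_schema instead of scanning the full SHOW ENGINES output (the engine is typically listed as Columnstore):
SQL
 
SELECT ENGINE, SUPPORT, COMMENT
FROM information_schema.ENGINES
WHERE ENGINE = 'Columnstore';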

Setting up a Demo Database

  1. Create the operations database and its InnoDB tables:
SQL
 
CREATE DATABASE operations;

CREATE TABLE operations.doctors(
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL CHECK(TRIM(name) != '')
) ENGINE=InnoDB;

CREATE TABLE operations.appointments(
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL CHECK(TRIM(name) != ''),
    phone_number VARCHAR(15) NOT NULL CHECK(phone_number RLIKE '[0-9]+'),
    email VARCHAR(254) NOT NULL CHECK(TRIM(email) != ''),
    time DATETIME NOT NULL,
    reason ENUM('Consultation', 'Follow-up', 'Preventive', 'Chronic') NOT NULL,
    status ENUM ('Scheduled', 'Canceled', 'Completed', 'No Show'),
    doctor_id BIGINT UNSIGNED NOT NULL,
    CONSTRAINT fk_appointments_doctors FOREIGN KEY (doctor_id) REFERENCES doctors(id)
) ENGINE=InnoDB;


  2. Create the analytics database and its ColumnStore table:
SQL
 
CREATE DATABASE analytics;

CREATE TABLE analytics.appointments(
    id BIGINT UNSIGNED NOT NULL,
    name VARCHAR(200) NOT NULL,
    phone_number VARCHAR(15) NOT NULL,
    email VARCHAR(254) NOT NULL,
    time DATETIME NOT NULL,
    reason VARCHAR(15) NOT NULL,
    status VARCHAR(10) NOT NULL,
    doctor_id BIGINT UNSIGNED NOT NULL
) ENGINE=ColumnStore;


You can use the same database (or schema, they are synonyms in MariaDB) for both the InnoDB and ColumnStore tables if you prefer. Use a different name for the ColumnStore table if you opt for this alternative.
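
For illustration, here is a minimal sketch of that alternative, using a hypothetical table name appointments_analytics. Note that, like the analytics table above, it replaces the ENUM types with VARCHAR and drops the CHECK and FOREIGN KEY constraints, since ColumnStore does not support them:
SQL
 
-- Hypothetical alternative: both engines side by side in the same schema,
-- with a different name for the ColumnStore copy.
CREATE TABLE operations.appointments_analytics(
    id BIGINT UNSIGNED NOT NULL,
    name VARCHAR(200) NOT NULL,
    phone_number VARCHAR(15) NOT NULL,
    email VARCHAR(254) NOT NULL,
    time DATETIME NOT NULL,
    reason VARCHAR(15) NOT NULL,
    status VARCHAR(10) NOT NULL,
    doctor_id BIGINT UNSIGNED NOT NULL
) ENGINE=ColumnStore;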

Inserting Demo Data

  1. Insert a few doctors:
SQL
 
INSERT INTO operations.doctors(name)
VALUES ('Maria'), ('John'), ('Jane');
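
To confirm the rows and the auto-generated IDs (used as doctor_id values later), you can optionally run:
SQL
 
SELECT id, name FROM operations.doctors;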


  2. Create a new file with the name test_data_insert.py with the following content:
Python
 
import random
import os
import subprocess
from datetime import datetime, timedelta

# Function to generate a random date within a given range
def random_date(start, end):
    return start + timedelta(days=random.randint(0, int((end - start).days)))

# Function to execute a given SQL command using MariaDB
def execute_sql(sql):
    # Write the SQL command to a temporary file
    with open("temp.sql", "w") as file:
        file.write(sql)
    # Execute the SQL command using the MariaDB client
    subprocess.run(["mariadb", "-h", "127.0.0.1", "-P", "3307", "-u", "admin", "-pC0lumnStore!", "-e", "source temp.sql"])
    # Remove the temporary file
    os.remove("temp.sql")

print("Generating and inserting data...")

# Total number of rows to be inserted
total_rows = 4000000
# Number of rows to insert in each batch
batch_size = 10000

# Possible values for the 'reason' column and their associated weights for random selection
reasons = ["Consultation", "Follow-up", "Preventive", "Chronic"]
reason_weights = [0.5, 0.15, 0.25, 0.1]

# Possible values for the 'status' column and their associated weights for random selection
statuses = ["Scheduled", "Canceled", "Completed", "No Show"]
status_weights = [0.1, 0.15, 0.7, 0.05]

# Possible values for the 'doctor_id' column and their associated weights for random selection
doctors = [1, 2, 3]
doctors_weights = [0.4, 0.35, 0.25]

# List of patient names
names = [f"Patient_{i}" for i in range(total_rows)]

# Insert data in batches
for batch_start in range(0, total_rows, batch_size):
    batch_values = []

    # Generate data for each row in the batch
    for i in range(batch_start, min(batch_start + batch_size, total_rows)):
        name = names[i]
        phone_number = f"{random.randint(100, 999)}-{random.randint(100, 999)}-{random.randint(1000, 9999)}"
        email = f"patient_{i}@example.com"
        time = random_date(datetime(2023, 1, 1), datetime(2024, 1, 1)).strftime("%Y-%m-%d %H:%M:%S")
        reason = random.choices(reasons, reason_weights)[0]
        status = random.choices(statuses, status_weights)[0]
        doctor_id = random.choices(doctors, doctors_weights)[0]

        # Append the generated row to the batch
        batch_values.append(f"('{name}', '{phone_number}', '{email}', '{time}', '{reason}', '{status}', {doctor_id})")

    # SQL command to insert the batch of data into the 'appointments' table
    sql = "USE operations;\nINSERT INTO appointments (name, phone_number, email, time, reason, status, doctor_id) VALUES " + ", ".join(batch_values) + ";"
    # Execute the SQL command
    execute_sql(sql)
    # Print progress
    print(f"Inserted up to row {min(batch_start + batch_size, total_rows)}")

print("Data insertion complete.")


  3. Insert 4 million appointments by running the Python script: python3 test_data_insert.py (an optional row-count check is shown below)
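
The script inserts the rows in batches and may take several minutes to complete. Once it finishes, you can optionally confirm that all 4,000,000 rows arrived:
SQL
 
SELECT COUNT(*) FROM operations.appointments;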
  4. Populate the ColumnStore table by connecting to the database and running:
SQL
 
INSERT INTO analytics.appointments (
    id,
    name,
    phone_number,
    email,
    time,
    reason,
    status,
    doctor_id
)
SELECT
    appointments.id,
    appointments.name,
    appointments.phone_number,
    appointments.email,
    appointments.time,
    appointments.reason,
    appointments.status,
    appointments.doctor_id
FROM operations.appointments;
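
To verify the copy, the count in analytics.appointments should match the 4,000,000 rows in the source table:
SQL
 
SELECT COUNT(*) FROM analytics.appointments;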


Run Cross-Engine SQL Queries

MariaDB ColumnStore is designed to run in a cluster of multiple servers, and that is where you see massive performance gains in analytical queries. However, you can also see it in action with the single-node setup of this article.

  1. Run the following query and note how long it takes to complete (make sure it queries the operations database):
SQL
 
SELECT doctors.name, status, COUNT(*) AS count
FROM operations.appointments -- use the InnoDB table
JOIN doctors ON doctor_id = doctors.id
WHERE status IN (
    'Scheduled',
    'Canceled',
    'Completed',
    'No Show'
)
GROUP BY doctors.name, status
ORDER BY doctors.name, status;


On my machine, it took around 3 seconds.

  2. Now modify the query to use the ColumnStore table instead (in the analytics database):
SQL
 
SELECT doctors.name, status, COUNT(*) AS count
FROM analytics.appointments -- use the ColumnStore table
JOIN doctors ON doctor_id = doctors.id
WHERE status IN (
    'Scheduled',
    'Canceled',
    'Completed',
    'No Show'
)
GROUP BY doctors.name, status
ORDER BY doctors.name, status;


It takes less than a second. Of course, in this simplistic example you could speed up the first query by adding an index, but imagine a situation in which you have hundreds of tables: managing indexes becomes harder and harder. ColumnStore removes this complexity.
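
The payoff shows when questions arrive that nobody planned an index for. For example, a hypothetical ad-hoc question such as "how are appointment reasons distributed per month?" runs straight against the ColumnStore table with no index work at all:
SQL
 
SELECT DATE_FORMAT(time, '%Y-%m') AS month,
       reason,
       COUNT(*) AS total
FROM analytics.appointments
GROUP BY month, reason
ORDER BY month, reason;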


Published at DZone with permission of Alejandro Duarte. See the original article here.

Opinions expressed by DZone contributors are their own.
