Taming Billions of Rows: How Metadata and SQL Can Replace Your ETL Pipeline

Managing time-series data is challenging. This article presents a metadata-driven aggregation approach that cuts storage by 10x and speeds up queries without ETL.

By Rishu Chadha · Jun. 10, 25 · Tutorial

Many enterprises that collect large volumes of time-series data from storage, virtualization, and cloud environments run into a familiar problem: retaining data long enough to preserve long-term insights without overwhelming storage and compute. To solve it, time-series analytics platforms need to handle billions of records efficiently while still delivering actionable insights.

The solution we will discuss here is to build a dynamic data aggregation engine directly into the platform. This article looks at a vendor-agnostic approach for aggregating, transforming, and purging time-series data effectively. The goal is to make growth easier to manage without sacrificing data quality or performance, while reducing the storage required.

Why We Needed Data Aggregation

This platform pulls in performance metrics from a variety of sources: storage arrays, fabrics, virtualization layers, and cloud endpoints. Each of these generates data at its own pace: some send metrics every 30 seconds, others every minute. Multiply that by thousands of devices or virtual resources, and the numbers add up fast, easily reaching billions of rows.

Retention policies help a bit by letting us delete old data after a certain point. But for tasks like trend analysis, capacity planning, and compliance reporting, we often need to hold on to historical data much longer. Deleting that old data isn’t always straightforward, either. In relational databases, running delete queries at scale can be a heavy lift on system resources. On top of that, storing high-resolution data for months or years eats up a lot of storage and can drag down query performance.

To tackle this, we built a data aggregation layer. By rolling up older data into larger time buckets, we can keep the important trends and insights without overwhelming storage or making queries painfully slow.

What Is Data Aggregation?

Our data aggregation logic focuses on taking high-frequency performance data, such as those 30-second metrics, and transforming it into more manageable summaries: 5-minute, 15-minute, and 30-minute averages, and so on. This approach reduces the total data volume while still retaining the important trends and patterns. There is no need to retain every granular detail indefinitely, and administrators can control how far the aggregation process goes, deciding on the right level for their needs.

Take 30-second metrics for 10,000 volumes: they produce about 28 million rows daily and around 10 billion rows annually, which highlights why aggregation is necessary. Storing all that raw data indefinitely is not practical, and it is expensive. Aggregation can reduce that volume by a factor of 10 or more, depending on the level of detail required.
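
The back-of-the-envelope math, assuming one row per volume per 30-second sample:

10,000 volumes × 2 samples per minute × 1,440 minutes per day ≈ 28.8 million rows per day
28.8 million rows per day × 365 days ≈ 10.5 billion rows per year

Rolling the same data up into 5-minute buckets (one row per volume every 5 minutes) leaves roughly 2.9 million rows per day, which is where the 10x reduction comes from before any further levels are applied.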

Different metrics can utilize different functions depending on the analysis being performed: AVG, MAX, SUM, or STDDEV.

For instance, IOPS might require MAX to capture spikes, while throughput might rely on SUM or AVG to display overall trends. Our aggregation framework is metadata-driven and fully integrated into the database, allowing you to adjust aggregation rules as needed—without having to modify application code. This ensures consistency and ease of management across the platform.

A New Approach: Metadata-Driven Aggregation Framework

At enterprise scale, time-series data can easily overwhelm traditional pipelines, making it tough to handle performance, retention, and queries effectively. To tackle this challenge, we built a metadata-driven aggregation framework that runs entirely within the database itself. This way, we avoid separate ETL processes or complicated scripts. Everything stays self-contained and easier to manage.

Here’s how this innovative framework works:

1. Multi-Level Aggregation Framework

We implemented a multi-tiered time bucket structure:

  • Level 0: Raw data (1 second to 5 minutes)
  • Levels 1–4: Aggregated levels (5, 15, 30, and 60 minutes)

Each level aggregates the previous one, reducing granularity but extending retention. This design lets us preserve valuable insights for longer periods while controlling storage growth. For example, critical storage subsystems might retain more granular metrics than lower-priority systems—giving teams the flexibility to balance cost and visibility.

2. Metadata-Driven Orchestration

Instead of hardcoding aggregation logic, we used metadata tables to define:

  • Source tables and columns to aggregate (every aggregated table has a column called AGGREGATION_INTERVAL_LEVEL)
  • User-defined aggregation functions like AVG, MAX, SUM—all easily updatable without code deployments
  • Retention periods and granularities for each level

This metadata-driven approach means DevOps teams can adjust retention and aggregation strategies on the fly, aligning with rapidly changing business requirements. For instance, adding a new performance metric or adjusting retention windows is as simple as updating a table.
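
To make that concrete, here is a minimal sketch of what such metadata tables could look like. The table and column names are illustrative, not the platform's actual schema:

-- Hypothetical metadata tables; names and columns are illustrative only.

-- Which columns of a source table are aggregated, and with which function.
CREATE TABLE agg_column_config (
    table_name    VARCHAR2(128),
    column_name   VARCHAR2(128),
    agg_function  VARCHAR2(30)      -- AVG, MAX, SUM, STDDEV, ...
);

-- Non-aggregated key columns that go into the GROUP BY for each table.
CREATE TABLE agg_key_config (
    table_name    VARCHAR2(128),
    column_name   VARCHAR2(128)
);

-- Granularity and retention for each aggregation level of each table.
CREATE TABLE agg_level_config (
    table_name      VARCHAR2(128),
    level_num       NUMBER,         -- 0 = raw, 1-4 = rollups
    bucket_minutes  NUMBER,         -- 5, 15, 30, 60
    retention_days  NUMBER          -- how long rows at this level are kept
);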

3. Retention-Aware Lifecycle

Unlike traditional ETL pipelines that just delete expired records, our system:

  • Aggregates before purging
  • Treats retention as a trigger for transformation
  • Supports a 5-tier lifecycle per subsystem

This ensures that even as old data is purged, key insights remain available—helping teams maintain visibility into long-term trends without overwhelming storage resources.

4. Partition-Based Optimization

Tables are interval-partitioned by timestamp and sub-partitioned by aggregation level, enabling:

  • Partition pruning during queries and inserts 
  • Lightning-fast DROP PARTITION operations
  • Granular level-based aggregation control

This architecture scales seamlessly to billions of rows and simplifies maintenance tasks like backups and archiving.
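
As a rough illustration, an interval-partitioned table with list sub-partitions on the aggregation level could be declared like this in Oracle (the column list is abbreviated and illustrative; the actual schema will differ):

-- Illustrative DDL only; daily interval partitions, one list sub-partition per level.
CREATE TABLE performance_log (
    storage_system_id           NUMBER,
    log_date                    DATE,
    total_io                    NUMBER,
    kbytes_read                 NUMBER,
    aggregation_interval_level  NUMBER DEFAULT 0
)
PARTITION BY RANGE (log_date) INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
SUBPARTITION BY LIST (aggregation_interval_level)
SUBPARTITION TEMPLATE (
    SUBPARTITION sp_raw VALUES (0),
    SUBPARTITION sp_5m  VALUES (1),
    SUBPARTITION sp_15m VALUES (2),
    SUBPARTITION sp_30m VALUES (3),
    SUBPARTITION sp_60m VALUES (4)
)
(PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01'));

With a layout like this, a rollup or report that filters on log_date and AGGREGATION_INTERVAL_LEVEL is pruned to a single sub-partition, and expired data is removed with ALTER TABLE ... DROP PARTITION rather than large DELETE statements.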

5. Embedded in Nightly Purge

Our aggregation engine is integrated with the system’s existing nightly purge schedule:

  • Aggregation levels are processed top-down (e.g., Level 0 → Level 1 → Level 2…)
  • Each level triggers INSERT-based rollups
  • Partitions are dropped only after successful aggregation

No external ETL tools or schedulers are required, ensuring consistency and simplifying operations.
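
A much-simplified sketch of how the nightly job could sequence those steps, assuming hypothetical helper procedures aggregate_level and drop_expired_partitions (this is not the platform's actual code):

-- Illustrative PL/SQL only; aggregate_level and drop_expired_partitions are hypothetical helpers.
BEGIN
    -- Process levels top-down: roll level N into level N + 1 before anything is purged.
    FOR lvl IN 0 .. 3 LOOP
        aggregate_level(p_table_name => 'PERFORMANCE_LOG',
                        p_from_level => lvl,
                        p_to_level   => lvl + 1);
    END LOOP;

    -- Partitions past their retention window are dropped only after the rollups succeed.
    drop_expired_partitions(p_table_name => 'PERFORMANCE_LOG');
END;
/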

6. Dynamic SQL Generation

Aggregation SQL is generated dynamically using metadata plus partition filters. For example:

INSERT INTO ' || tableName || ' (' || nonAggregateColumnNames || columnNames || 'AGGREGATION_INTERVAL_LEVEL) 
SELECT * 
FROM (
    SELECT ' || nonAggregateColumnNames || aggregateColumnNames || nextAggregationLevel || ' 
    FROM ' || tableName || CHR(10) ||
    ' PARTITION(' || partitionName || ')
    WHERE AGGREGATION_INTERVAL_LEVEL = ' || aggregationLevel || CHR(10) ||
    ' GROUP BY ' || nonAggregateColumnNames || CHR(10) || 
    ' FLOOR((log_date - TRUNC(ADD_MONTHS(SYSDATE, -12))) * 24 * (60/' || aggregationLevelMins || '))'
);

This logic adapts per subsystem and per partition.
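
In a PL/SQL implementation, a string assembled this way would typically be run with EXECUTE IMMEDIATE. A trivial, self-contained illustration of that execution step, using a simple count instead of the full INSERT:

-- Minimal illustration only; the real statement is the metadata-built INSERT shown above.
DECLARE
    v_sql   VARCHAR2(32767);
    v_count NUMBER;
BEGIN
    v_sql := 'SELECT COUNT(*) FROM aggregated_metrics_log'
          || ' WHERE aggregation_interval_level = :lvl';
    EXECUTE IMMEDIATE v_sql INTO v_count USING 1;
    DBMS_OUTPUT.PUT_LINE('Rows at level 1: ' || v_count);
END;
/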

Sample query: 

INSERT INTO aggregated_metrics_log (
    STORAGE_SYSTEM_ID,
    HOST_NAME,
    LOGICAL_UNIT_ID,
    LUN_NUMBER,
    PORT_NAME,
    QUEUE_LEN,
    VOLUME_NAME,
    WWN_NAME,
    io_read_per_sec_avg,
    io_read_per_sec_cur,
    io_read_per_sec_max,
    io_size_read_kb_avg,
    io_size_read_kb_cur,
    io_size_write_kb_avg,
    io_size_write_kb_cur,
    io_write_per_sec_avg,
    io_write_per_sec_cur,
    io_write_per_sec_max,
    kbytes_read_per_sec_avg,
    kbytes_read_per_sec_cur,
    kbytes_read_per_sec_max,
    kbytes_write_per_sec_avg,
    kbytes_write_per_sec_cur,
    kbytes_write_per_sec_max,
    log_date,
    service_time_read_ms_avg,
    service_time_read_ms_cur,
    service_time_write_ms_avg,
    service_time_write_ms_cur,
    AGGREGATION_INTERVAL_LEVEL
)
SELECT *
FROM (
    SELECT
        STORAGE_SYSTEM_ID,
        HOST_NAME,
        LOGICAL_UNIT_ID,
        LUN_NUMBER,
        PORT_NAME,
        QUEUE_LEN,
        VOLUME_NAME,
        WWN_NAME,
        MAX(io_read_per_sec_avg),
        MAX(io_read_per_sec_cur),
        MAX(io_read_per_sec_max),
        MAX(io_size_read_kb_avg),
        MAX(io_size_read_kb_cur),
        MAX(io_size_write_kb_avg),
        MAX(io_size_write_kb_cur),
        MAX(io_write_per_sec_avg),
        MAX(io_write_per_sec_cur),
        MAX(io_write_per_sec_max),
        MAX(kbytes_read_per_sec_avg),
        MAX(kbytes_read_per_sec_cur),
        MAX(kbytes_read_per_sec_max),
        MAX(kbytes_write_per_sec_avg),
        MAX(kbytes_write_per_sec_cur),
        MAX(kbytes_write_per_sec_max),
        MAX(log_date),
        MAX(service_time_read_ms_avg),
        MAX(service_time_read_ms_cur),
        MAX(service_time_write_ms_avg),
        MAX(service_time_write_ms_cur),
        2 AS AGGREGATION_INTERVAL_LEVEL
    FROM aggregated_metrics_log
    SUBPARTITION(SYS_SUBP187153)
    WHERE AGGREGATION_INTERVAL_LEVEL = 1
    GROUP BY
        STORAGE_SYSTEM_ID,
        HOST_NAME,
        LOGICAL_UNIT_ID,
        LUN_NUMBER,
        PORT_NAME,
        QUEUE_LEN,
        VOLUME_NAME,
        WWN_NAME,
        FLOOR((log_date - TRUNC(ADD_MONTHS(SYSDATE, -12))) * 24 * (60 / 15))
);

Enabling Aggregation: Pre-Requisites

  • Partitioning: Tables must be migrated to time-series-based partitioning if they are not already partitioned that way.
  • Metadata Setup: A script defines columns, functions, and levels.

Once these are in place, aggregation is triggered via the nightly purge process.
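
For example, a setup script could register a table's aggregation functions and levels along these lines (the metadata table names match the hypothetical sketch shown earlier, not the product's real schema):

-- Hypothetical setup script; metadata table names are illustrative only.
INSERT INTO agg_column_config (table_name, column_name, agg_function) VALUES ('PERFORMANCE_LOG', 'TOTAL_IO',    'MAX');
INSERT INTO agg_column_config (table_name, column_name, agg_function) VALUES ('PERFORMANCE_LOG', 'KBYTES_READ', 'AVG');
INSERT INTO agg_column_config (table_name, column_name, agg_function) VALUES ('PERFORMANCE_LOG', 'LOG_DATE',    'MAX');

INSERT INTO agg_level_config (table_name, level_num, bucket_minutes, retention_days) VALUES ('PERFORMANCE_LOG', 1,  5,  90);
INSERT INTO agg_level_config (table_name, level_num, bucket_minutes, retention_days) VALUES ('PERFORMANCE_LOG', 2, 15, 180);
INSERT INTO agg_level_config (table_name, level_num, bucket_minutes, retention_days) VALUES ('PERFORMANCE_LOG', 3, 30, 360);

COMMIT;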

Aggregation Functions (Example)

Table           | Column           | Description                | Dynamic Function
performance_log | cum_interval_sec | Time interval, in seconds  | AVG
                | kbytes_read      | Data read (KB)             | AVG
                | kbytes_written   | Data written (KB)          | AVG
                | log_date         | Timestamp of log entry     | MAX
                | read_hit_io      | Cache read hit rate        | AVG
                | read_io          | Read IOPS                  | AVG
                | total_io         | Total IOPS                 | MAX
                | write_io         | Write IOPS                 | AVG
...             | ...              | ...                        | ...


Aggregation Levels and Retention

Level | Granularity                | Retention Example
0     | Raw data (30s, 2 min, ...) | 45 days
1     | 5 minutes                  | 90 days
2     | 15 minutes                 | 180 days
3     | 30 minutes                 | 360 days
4     | 60 minutes                 | Based on table


Each level aggregates the one before it. For example, 10 × 30s entries become one 5-min summary at Level 1.

Embedded in the Purge Lifecycle

Aggregation is deeply integrated into the platform’s purge engine:

  • Runs nightly, before the previous level's partitions are dropped
  • Preserves long-term trends via rollups
  • Requires no external ETL tools
  • Maintains its own audit process

Example: Performance Aggregation

  • Source Table: performance_log
  • Collection Frequency: 30s
  • Daily Volume: 28M rows
  • Annual Volume: 10B rows
  • Post-Aggregation: 10x reduction

Competitive Comparison

Tool           | Metadata-Driven | Multi-Level | Retention-Aware | Partition Optimized
This Framework | ✅              | ✅          | ✅              | ✅
TimescaleDB    | ❌              | ❌          | ⚠️ Partial      | ✅
InfluxDB       | ❌              | ❌          | ✅              | ❌
Timestream     | ❌              | ❌          | ⚠️ Limited      | ❌
Splunk         | ❌              | ❌          | ⚠️ Partial      | ❌
Oracle Native  | ⚠️ Manual       | ❌          | ❌              | ✅

Benefits

  • Cross-RDBMS Friendly: This design pattern isn’t tied to any one database—it can be extended to Oracle, PostgreSQL, SQL Server, and others that support partitioning and basic aggregation functions.

    Example: A PostgreSQL table partitioned by timestamp can apply the same metadata-driven rollups using PL/pgSQL and dynamic SQL, with pruning handled by native partition-aware query planning, as sketched after this list.

  • Smaller Footprint: Reduces raw data up to 10x

  • Faster Queries: Pre-aggregated values improve reporting speed

  • Simpler Management: Less data to handle, more structure
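
For the PostgreSQL case mentioned above, a rough PL/pgSQL sketch of one metadata-driven rollup step might look like the following. The table and column names are hypothetical, the bucketing uses date_bin (PostgreSQL 14+), and in practice the per-column functions would come from the metadata rather than being hard-coded:

-- Illustrative PL/pgSQL sketch; table and column names are hypothetical.
CREATE OR REPLACE FUNCTION rollup_level(p_table text,
                                        p_from_level int,
                                        p_to_level int,
                                        p_bucket interval)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Roll rows at p_from_level up into p_bucket-sized buckets tagged p_to_level.
    EXECUTE format(
        'INSERT INTO %I (storage_system_id, log_date, total_io, kbytes_read, aggregation_interval_level)
         SELECT storage_system_id,
                date_bin(%L, log_date, TIMESTAMP ''2000-01-01''),
                MAX(total_io),
                AVG(kbytes_read),
                %s
         FROM %I
         WHERE aggregation_interval_level = %s
         GROUP BY storage_system_id, date_bin(%L, log_date, TIMESTAMP ''2000-01-01'')',
        p_table, p_bucket, p_to_level, p_table, p_from_level, p_bucket);
END;
$$;

-- Usage: roll 5-minute rows (level 1) up into 15-minute rows (level 2).
-- SELECT rollup_level('performance_log', 1, 2, INTERVAL '15 minutes');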

What's Next

Future plans may include:

  • AI-based analytics to remove anomalies before aggregation
  • End-to-end automation for setup

Conclusion

Traditional RDBMS setups struggle to keep up with time-series data at enterprise scale. This in-engine, metadata-driven aggregation framework transforms retention from a blunt purge into a smart transformation pipeline entirely within the database.

Start small: model your retention lifecycle, define aggregation logic via metadata, and embed it into your system’s purge workflow. At scale, it’s not just efficient—it’s essential.


Opinions expressed by DZone contributors are their own.
