Taming Billions of Rows: How Metadata and SQL Can Replace Your ETL Pipeline
Managing time-series data is challenging. This article presents a metadata-driven aggregation approach that cuts storage by 10x and speeds up queries without ETL.
Many enterprises that collect large volumes of time-series data from storage, virtualization, and cloud environments run into a familiar problem: retaining long-term data without overwhelming storage and compute. To solve this problem, time-series analytics platforms need to handle billions of records efficiently while still delivering actionable insights.
The solution we will discuss here is to build a dynamic data aggregation engine directly into the platform. This article looks at a vendor-agnostic approach for aggregating, transforming, and purging time-series data effectively. The goal is to make it easier to manage growth and reduce storage requirements without sacrificing data quality or performance.
Why We Needed Data Aggregation
This platform pulls in performance metrics from a variety of sources: storage arrays, fabrics, virtualization layers, and cloud endpoints. Each of these generates data at its own pace: some send metrics every 30 seconds, others every minute. Multiply that by thousands of devices or virtual resources, and the numbers add up fast, easily reaching billions of rows.
Retention policies help a bit by letting us delete old data after a certain point. But for tasks like trend analysis, capacity planning, and compliance reporting, we often need to hold on to historical data much longer. Deleting that old data isn’t always straightforward, either. In relational databases, running delete queries at scale can be a heavy lift on system resources. On top of that, storing high-resolution data for months or years eats up a lot of storage and can drag down query performance.
To tackle this, we built a data aggregation layer. By rolling up older data into larger time buckets, we can keep the important trends and insights without overwhelming storage or making queries painfully slow.
What is Data Aggregation?
Our data aggregation logic is focused on taking high-frequency performance data, such as those 30-second metrics, and transforming it into more manageable summaries: 5-minute, 15-minute, and 30-minute rollups, and so on. This approach reduces total data volume while still retaining the important trends and patterns. There is no need to keep every granular detail indefinitely, and administrators can control how far the aggregation process goes, deciding on the right level for their needs.
Take 30-second metrics for 10,000 volumes: at 2,880 samples per volume per day, that works out to about 28 million rows daily and around 10 billion rows annually, which highlights the necessity of aggregation. Storing all that raw data indefinitely is not practical, and it is expensive. Aggregation can reduce that volume by a factor of 10 or more, depending on the level of detail required.
Different metrics can utilize different functions depending on the analysis being performed: AVG, MAX, SUM, or STDDEV.
For instance, IOPS might require MAX to capture spikes, while throughput might rely on SUM or AVG to display overall trends. Our aggregation framework is metadata-driven and fully integrated into the database, allowing you to adjust aggregation rules as needed—without having to modify application code. This ensures consistency and ease of management across the platform.
A New Approach: Metadata-Driven Aggregation Framework
At enterprise scale, time-series data can easily overwhelm traditional pipelines, making it tough to handle performance, retention, and queries effectively. To tackle this challenge, we built a metadata-driven aggregation framework that runs entirely within the database itself. This way, we avoid separate ETL processes or complicated scripts. Everything stays self-contained and easier to manage.
Here’s how this innovative framework works:
1. Multi-Level Aggregation Framework
We implemented a multi-tiered time bucket structure:
- Level 0: Raw data (1 second to 5 minutes)
- Levels 1–4: Aggregated levels (5, 15, 30, and 60 minutes)
Each level aggregates the previous one, reducing granularity but extending retention. This design lets us preserve valuable insights for longer periods while controlling storage growth. For example, critical storage subsystems might retain more granular metrics than lower-priority systems—giving teams the flexibility to balance cost and visibility.
2. Metadata-Driven Orchestration
Instead of hardcoding aggregation logic, we used metadata tables to define:
- Source tables and columns to aggregate (every table has a column called AGGREGATION_INTERVAL_LEVEL)
- User-defined aggregation functions such as AVG, MAX, and SUM, all easily updatable without code deployments
- Retention periods and granularities for each level
This metadata-driven approach means DevOps teams can adjust retention and aggregation strategies on the fly, aligning with rapidly changing business requirements. For instance, adding a new performance metric or adjusting retention windows is as simple as updating a table.
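As an illustration, both of those changes can be single DML statements against the metadata tables. This is a minimal sketch, assuming metadata tables named aggregation_column_config and aggregation_level_config; the actual table and column names in your implementation may differ.
-- Hypothetical metadata tables; names are illustrative.
-- Register a new column of PERFORMANCE_LOG for rollups, with its aggregation function.
INSERT INTO aggregation_column_config (table_name, column_name, aggregate_function)
VALUES ('PERFORMANCE_LOG', 'WRITE_HIT_IO', 'AVG');

-- Extend the retention window for the 15-minute tier (Level 2) from 180 to 270 days.
UPDATE aggregation_level_config
   SET retention_days = 270
 WHERE table_name = 'PERFORMANCE_LOG'
   AND aggregation_level = 2;

COMMIT;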
3. Retention-Aware Lifecycle
Unlike traditional ETL pipelines that just delete expired records, our system:
- Aggregates before purging
- Treats retention as a trigger for transformation
- Supports a 5-tier lifecycle per subsystem
This ensures that even as old data is purged, key insights remain available—helping teams maintain visibility into long-term trends without overwhelming storage resources.
4. Partition-Based Optimization
Tables are interval-partitioned by timestamp and sub-partitioned by aggregation level, enabling:
- Partition pruning during queries and inserts
- Lightning-fast DROP PARTITION operations
- Granular level-based aggregation control
This architecture scales seamlessly to billions of rows and simplifies maintenance tasks like backups and archiving.
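To make the layout concrete, here is a minimal DDL sketch assuming Oracle interval-list composite partitioning; the column list is abbreviated, and the subpartition names and initial boundary are illustrative rather than the platform's actual schema.
-- Interval-partitioned by day on log_date, list-subpartitioned by aggregation level.
-- Column list abbreviated for brevity; names are illustrative.
CREATE TABLE aggregated_metrics_log (
  storage_system_id          NUMBER        NOT NULL,
  volume_name                VARCHAR2(128),
  io_read_per_sec_max        NUMBER,
  kbytes_read_per_sec_avg    NUMBER,
  log_date                   DATE          NOT NULL,
  aggregation_interval_level NUMBER(1)     NOT NULL
)
PARTITION BY RANGE (log_date) INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
SUBPARTITION BY LIST (aggregation_interval_level)
SUBPARTITION TEMPLATE (
  SUBPARTITION sp_raw   VALUES (0),
  SUBPARTITION sp_5min  VALUES (1),
  SUBPARTITION sp_15min VALUES (2),
  SUBPARTITION sp_30min VALUES (3),
  SUBPARTITION sp_60min VALUES (4)
)
(PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01'));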
5. Embedded in Nightly Purge
Our aggregation engine is integrated with the system’s existing nightly purge schedule:
- Aggregation levels are processed top-down (e.g., Level 0 → Level 1 → Level 2…)
- Each level triggers INSERT-based rollups
- Partitions are dropped only after successful aggregation
No external ETL tools or schedulers are required, ensuring consistency and simplifying operations.
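A simplified PL/SQL sketch of that nightly pass is shown below. Here, aggregation_level_config, expired_subpartitions_v, and build_rollup_sql are hypothetical stand-ins for the platform's metadata tables, retention views, and dynamic SQL generator (described in the next section), not its actual objects.
-- Simplified sketch of the nightly aggregate-then-purge pass.
-- aggregation_level_config, expired_subpartitions_v, and build_rollup_sql are
-- hypothetical stand-ins for the real metadata and SQL-generation logic.
DECLARE
  v_sql CLOB;
BEGIN
  FOR lvl IN (SELECT aggregation_level
                FROM aggregation_level_config
               ORDER BY aggregation_level) LOOP          -- process levels top-down
    FOR part IN (SELECT subpartition_name
                   FROM expired_subpartitions_v          -- subpartitions past this level's retention
                  WHERE aggregation_level = lvl.aggregation_level) LOOP
      -- 1. Roll the expiring data up into the next aggregation level.
      v_sql := build_rollup_sql('AGGREGATED_METRICS_LOG',
                                lvl.aggregation_level,
                                part.subpartition_name);
      EXECUTE IMMEDIATE v_sql;
      COMMIT;
      -- 2. Drop the source subpartition only after the rollup succeeds.
      EXECUTE IMMEDIATE 'ALTER TABLE aggregated_metrics_log DROP SUBPARTITION '
                        || part.subpartition_name;
    END LOOP;
  END LOOP;
END;
/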
6. Dynamic SQL Generation
Aggregation SQL is generated dynamically using metadata plus partition filters. For example:
INSERT INTO ' || tableName || ' (' || nonAggregateColumnNames || columnNames || 'AGGREGATION_INTERVAL_LEVEL)
SELECT *
FROM (
SELECT ' || nonAggregateColumnNames || aggregateColumnNames || nextAggregationLevel || '
FROM ' || tableName || ' SUBPARTITION(' || partitionName || ')
WHERE AGGREGATION_INTERVAL_LEVEL = ' || aggregationLevel || CHR(10) ||
' GROUP BY ' || nonAggregateColumnNames || CHR(10) ||
' FLOOR((log_date - TRUNC(ADD_MONTHS(SYSDATE, -12))) * 24 * (60 / ' || aggregationLevelMins || '))'
);
This logic adapts per subsystem and per partition.
Sample query:
INSERT INTO aggregated_metrics_log (
STORAGE_SYSTEM_ID,
HOST_NAME,
LOGICAL_UNIT_ID,
LUN_NUMBER,
PORT_NAME,
QUEUE_LEN,
VOLUME_NAME,
WWN_NAME,
io_read_per_sec_avg,
io_read_per_sec_cur,
io_read_per_sec_max,
io_size_read_kb_avg,
io_size_read_kb_cur,
io_size_write_kb_avg,
io_size_write_kb_cur,
io_write_per_sec_avg,
io_write_per_sec_cur,
io_write_per_sec_max,
kbytes_read_per_sec_avg,
kbytes_read_per_sec_cur,
kbytes_read_per_sec_max,
kbytes_write_per_sec_avg,
kbytes_write_per_sec_cur,
kbytes_write_per_sec_max,
log_date,
service_time_read_ms_avg,
service_time_read_ms_cur,
service_time_write_ms_avg,
service_time_write_ms_cur,
AGGREGATION_INTERVAL_LEVEL
)
SELECT *
FROM (
SELECT
STORAGE_SYSTEM_ID,
HOST_NAME,
LOGICAL_UNIT_ID,
LUN_NUMBER,
PORT_NAME,
QUEUE_LEN,
VOLUME_NAME,
WWN_NAME,
MAX(io_read_per_sec_avg),
MAX(io_read_per_sec_cur),
MAX(io_read_per_sec_max),
MAX(io_size_read_kb_avg),
MAX(io_size_read_kb_cur),
MAX(io_size_write_kb_avg),
MAX(io_size_write_kb_cur),
MAX(io_write_per_sec_avg),
MAX(io_write_per_sec_cur),
MAX(io_write_per_sec_max),
MAX(kbytes_read_per_sec_avg),
MAX(kbytes_read_per_sec_cur),
MAX(kbytes_read_per_sec_max),
MAX(kbytes_write_per_sec_avg),
MAX(kbytes_write_per_sec_cur),
MAX(kbytes_write_per_sec_max),
MAX(log_date),
MAX(service_time_read_ms_avg),
MAX(service_time_read_ms_cur),
MAX(service_time_write_ms_avg),
MAX(service_time_write_ms_cur),
2 AS AGGREGATION_INTERVAL_LEVEL
FROM aggregated_metrics_log
SUBPARTITION(SYS_SUBP187153)
WHERE AGGREGATION_INTERVAL_LEVEL = 1
GROUP BY
STORAGE_SYSTEM_ID,
HOST_NAME,
LOGICAL_UNIT_ID,
LUN_NUMBER,
PORT_NAME,
QUEUE_LEN,
VOLUME_NAME,
WWN_NAME,
FLOOR((log_date - TRUNC(ADD_MONTHS(SYSDATE, -12))) * 24 * (60 / 15))
);
Enabling Aggregation: Prerequisites
- Partitioning: Tables must be migrated to time-series-based partitioning if they are not already partitioned that way.
- Metadata Setup: A script defines columns, functions, and levels.
Once these are in place, aggregation is triggered via the nightly purge process.
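For illustration, the metadata behind that script could be modeled with two tables like the ones sketched below; the names and columns are assumptions for this article, not the platform's actual schema.
-- Illustrative shape of the metadata tables; names and columns are hypothetical.
CREATE TABLE aggregation_column_config (
  table_name         VARCHAR2(128) NOT NULL,
  column_name        VARCHAR2(128) NOT NULL,
  aggregate_function VARCHAR2(30)  NOT NULL,  -- AVG, MAX, SUM, STDDEV, ...
  CONSTRAINT pk_agg_col_cfg PRIMARY KEY (table_name, column_name)
);

CREATE TABLE aggregation_level_config (
  table_name        VARCHAR2(128) NOT NULL,
  aggregation_level NUMBER(1)     NOT NULL,   -- 0 = raw, 1-4 = rollup tiers
  interval_mins     NUMBER(4),                -- NULL for raw data
  retention_days    NUMBER(5),
  CONSTRAINT pk_agg_lvl_cfg PRIMARY KEY (table_name, aggregation_level)
);
The two reference tables below show the kind of rows such a setup script would load.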
Aggregation Functions (Example)
Table | Column | Description | Dynamic Function |
---|---|---|---|
performance_log | cum_interval_sec | Time interval, in seconds | AVG |
performance_log | kbytes_read | Data read | AVG |
performance_log | kbytes_written | Data written | AVG |
performance_log | log_date | Timestamp of log entry | MAX |
performance_log | read_hit_io | Cache read hit rate | AVG |
performance_log | read_io | Read IOPS | AVG |
performance_log | total_io | Total IOPS | MAX |
performance_log | write_io | Write IOPS | AVG |
... | ... | ... | ... |
Aggregation Levels and Retention
Level | Granularity | Retention Example |
---|---|---|
0 | Raw data (30s, 2min...) | 45 days |
1 | 5 mins | 90 days |
2 | 15 mins | 180 days |
3 | 30 mins | 360 days |
4 | 60 mins | Configured per table |
Each level aggregates the one before it. For example, 10 × 30s entries become one 5-min summary at Level 1.
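As a usage sketch against the sample table above, a six-month trend report can read the pre-aggregated 15-minute tier (Level 2) directly instead of scanning raw rows:
-- Read 15-minute rollups (Level 2) for a six-month trend; the level filter
-- lets the optimizer prune down to the Level 2 subpartitions.
SELECT volume_name,
       TRUNC(log_date)              AS report_day,
       MAX(io_read_per_sec_max)     AS peak_read_iops,
       AVG(kbytes_read_per_sec_avg) AS avg_read_kbps
FROM   aggregated_metrics_log
WHERE  aggregation_interval_level = 2
AND    log_date >= ADD_MONTHS(SYSDATE, -6)
GROUP BY volume_name, TRUNC(log_date)
ORDER BY volume_name, report_day;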
Embedded in the Purge Lifecycle
Aggregation is deeply integrated into the platform’s purge engine:
- Runs nightly, before partitions of the previous level are dropped
- Preserves long-term trends via rollups
- Requires no external ETL tools
- Maintains its own audit process
Example: Performance Aggregation
- Source Table: performance_log
- Collection Frequency: 30s
- Daily Volume: 28M rows
- Annual Volume: 10B rows
- Post-Aggregation: 10x reduction
Competitive Comparison
Tool | Metadata-Driven | Multi-Level | Retention-Aware | Partition Optimized |
---|---|---|---|---|
This Framework | ✅ | ✅ | ✅ | ✅ |
TimescaleDB | ❌ | ❌ | ⚠️ Partial | ✅ |
InfluxDB | ❌ | ❌ | ✅ | ❌ |
Timestream | ❌ | ❌ | ⚠️ Limited | ❌ |
Splunk | ❌ | ❌ | ⚠️ Partial | ❌ |
Oracle Native | ⚠️ Manual | ❌ | ❌ | ✅ |
Benefits
- Cross-RDBMS Friendly: This design pattern isn't tied to any one database; it can be extended to Oracle, PostgreSQL, SQL Server, and others that support partitioning and basic aggregation functions. For example, a PostgreSQL table partitioned by timestamp can apply the same metadata-driven rollups using PL/pgSQL and dynamic SQL, with pruning handled by native partition-aware query planning.
- Smaller Footprint: Reduces raw data volume by up to 10x
- Faster Queries: Pre-aggregated values improve reporting speed
- Simpler Management: Less data to handle, more structure
What's Next
Future plans may include:
- AI-based analytics for removing anomalies before aggregation
- End-to-end automation for setup
Conclusion
Traditional RDBMS setups struggle to keep up with time-series data at enterprise scale. This in-engine, metadata-driven aggregation framework transforms retention from a blunt purge into a smart transformation pipeline entirely within the database.
Start small: model your retention lifecycle, define aggregation logic via metadata, and embed it into your system’s purge workflow. At scale, it’s not just efficient—it’s essential.