Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research

Transformed 5-hour data loads into 1-2 minutes using Oracle's APPEND+PARALLEL+NOLOGGING, enabling researchers to go from 1-2 experiments/day to 2-3/hour.

Sanjay Mishra

Mar. 30, 26 · Analysis

Likes (0)

Comment

Save

2.0K Views

The Problem

Traditional INSERT for benchmark data loading:

Takes 5+ hours for 4M rows
Sequential execution
Normal logging and buffering
94% of experiment time wasted on data reload

The Solution: Three Techniques Combined

1. APPEND HINT

Tells Oracle: Skip normal buffering, write directly to disk.

Impact: ~10-20x speedup

2. PARALLEL EXECUTION

Tells Oracle: Use all CPU cores instead of sequential.

Impact: ~5-10x speedup

3. NOLOGGING MODE

Tells Oracle: No need to log test data changes.

Impact: ~3-5x speedup

Multiplication effect: 10-20x × 5-10x × 3-5x = 150-300x

Step 1: Create TPC-H Tables

    SQL
   
 

   CREATE TABLE LINEITEM (
    L_ORDERKEY NUMBER,
    L_PARTKEY NUMBER,
    L_SUPPKEY NUMBER,
    L_LINENUMBER NUMBER,
    L_QUANTITY NUMBER,
    L_EXTENDEDPRICE NUMBER,
    L_DISCOUNT NUMBER,
    L_TAX NUMBER,
    L_RETURNFLAG VARCHAR2(1),
    L_LINESTATUS VARCHAR2(1),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR2(25),
    L_SHIPMODE VARCHAR2(10),
    L_COMMENT VARCHAR2(44)
);

CREATE TABLE ORDERS (
    O_ORDERKEY NUMBER,
    O_CUSTKEY NUMBER,
    O_ORDERSTATUS VARCHAR2(1),
    O_TOTALPRICE NUMBER,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY VARCHAR2(15),
    O_CLERK VARCHAR2(15),
    O_SHIPPRIORITY NUMBER,
    O_COMMENT VARCHAR2(79)
);
  

Step 2: Generate and Load TPC-H Data (Python)

    Python
   
 

   from oracle_26ai_setup_load.oracle_connector import get_connection

conn = get_connection()
c = conn.cursor()

# Populate ORDERS (700K rows)
print("Populating ORDERS (700K)...")
for i in range(1, 700001):
    cust = ((i - 1) % 150000) + 1
    c.execute(f"""
        INSERT INTO ORDERS 
        VALUES ({i}, {cust}, 'O', {i*100}, SYSDATE - {i % 365}, 
                'PRIO{i % 5}', 'CLERK{i % 100}', {i % 2}, 'comment')
    """)
    if i % 500 == 0: conn.commit()
conn.commit()

# Populate LINEITEM (4M rows)
print("Populating LINEITEM (4M)...")
li = 1
for o in range(1, 700001):
    lines = (o % 7) + 1
    for l in range(1, lines + 1):
        part = ((li - 1) % 20000) + 1
        supp = ((li - 1) % 10000) + 1
        c.execute(f"""
            INSERT INTO LINEITEM 
            VALUES ({o}, {part}, {supp}, {l}, {(li % 50) + 1}, 
                    {(li*10) % 100000}, {(li % 10) * 0.1}, 
                    {(li % 10) * 0.05}, 'R', 'F', 
                    SYSDATE - {li % 365}, SYSDATE - {li % 300}, 
                    SYSDATE - {li % 200}, 'INST{li % 5}', 
                    'MODE{li % 5}', 'comment')
        """)
        li += 1
    if o % 50 == 0: conn.commit()
conn.commit()

# Verify
c.execute("SELECT COUNT(*) FROM LINEITEM")
print(f"LINEITEM rows: {c.fetchone()[0]:,}")  # Should show ~4,000,000
  

The Optimization: Before vs. After

Before (traditional):

    SQL
   
 

   -- Standard insert into 
-- a backup/test table
INSERT INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 5 hours 
to copy 4M rows
  

After (optimized):

    SQL
   
 

   -- Setup (one-time)
ALTER TABLE LINEITEM_TEST 
NOLOGGING;

-- Fast parallel copy
INSERT /*+ APPEND PARALLEL(8) */ 
INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 1-2 minutes 
to copy 4M rows
  

150-300x speedup achieved.

Timing Comparison (Per Experiment Cycle)

Phase	Traditional	Optimized
Run optimization code	10 min	10 min
Reset data (reload 4M)	5 hours	1-2 min
Run queries	10 min	10 min
TOTAL PER CYCLE	~5.3 hours	~21-22 minutes

Speedup: 15x faster per experiment.

What's the Bottleneck?

Here are some traditional INSERT bottlenecks:

Cache management: Normal insert uses buffering (slow) → APPEND removes buffering
Sequential processing: One-at-a-time rows → PARALLEL spreads across 8 cores
Redo logging: Every change gets logged → NOLOGGING skips logging for ephemeral test data

Key insight: None of these bottlenecks matters for research data (you reload it anyway). Remove all three = multiplicative speedup.

Infrastructure bottlenecks determine what research is feasible. Remove them, and rigorous research becomes practical.

Why the new setup works:

Technique	Problem It Solves	Impact
APPEND	Cache inefficiency during bulk writes	~10-20x speedup
PARALLEL	Sequential processing bottleneck	~5-10x speedup
NOLOGGING	Logging overhead on bulk operations	~3-5x speedup
All three combined	All bottlenecks removed	~150-300x speedup

Before and After: What Changes for Researchers

Old Paradigm (Data Loading Constrained)	New Paradigm (Fast Loading Enabled)
1-2 experiments per working day	2-3 experiments per working hour
100-experiment study = 50-100 hours of compute	100-experiment study = 1-2 hours of compute time
7-14 days of continuous compute time	Same working day
Limits research scope to what's feasible in weeks	Enables comprehensive research that was previously impractical

Translation: The difference between "I can run a few experiments" and "I can run rigorous, statistical validation studies."

Real-World Use Cases

ML/AI Database Optimization

Using reinforcement learning to optimize database configuration? Fast loading means I can run comprehensive parameter tuning instead of guessing.

Learned Index Structures

Testing neural network-based indexes? I need to run dozens of variants. Fast loading makes this feasible.

Automated Query Planning

Developing a learned query optimizer? I need hundreds of experiments. Fast loading is essential.

Benchmarking Database Improvements

Validating a new indexing strategy across multiple workloads? Fast loading lets me test rigorously instead of settling for "spot checks."

Why This Matters More For Your Research

Infrastructure optimization deserves recognition as a legitimate research contribution.

This isn't a micro-optimization. This isn't shaving seconds off a 30-second process. This is removing a constraint that determines what research is feasible.

When your bottleneck is data loading, you optimize around it:

You run fewer experiments
You compromise on statistical rigor
You avoid comprehensive validation

Remove the bottleneck, and suddenly rigorous research becomes practical.

So, what's next?

Try it on your own data. Test with your actual TPC-H setup.
Measure your improvement. Document your before/after times.
Scale your experiments. Run the comprehensive studies that were previously impractical.
Share results. If you're publishing research, this infrastructure improvement is worth documenting.

For the Skeptics: Why This Works

"Why haven't I seen this before?"

These techniques exist in Oracle's documentation but are scattered across different guides. The key insight is understanding that they compose multiplicatively—none creates a bottleneck for the others when all three are applied together.

"Is this risky?"

In production? Yes, you'd want logging and transaction safety. In research? No — your test data is ephemeral. You'll reload it anyway.

"What about other databases?"

PostgreSQL: COPY command is similarly fast; parallel loading is possible.
MySQL: Similar techniques available (but vary by storage engine).
SQL Server: Bulk insert with similar tricks.
Snowflake/BigQuery: Already optimized for bulk loads.

Conclusion

Data loading shouldn't be your research bottleneck. It doesn't have to be.

Using 25-year-old database capabilities in the right combination, I can transform experimental workflows from "heavily constrained" to "limited only by ambition."

For researchers doing ML/AI work on database optimization, this infrastructure fix is the difference between surface-level experiments and rigorous validation studies.

Try it. Measure it. Share it.

Resources

Code and reproduction materials: https://github.com/sanmish4ds/oracle-index-advisor
TPC-H Benchmark Guide
Oracle Database Performance Tuning Guide

Location intelligence Data (computing)

Opinions expressed by DZone contributors are their own.

Related

Trending