DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Why Round-Robin Won't Save You: Load Balancing Challenges in Data Streaming Services With Heterogeneous Traffic
  • Good Data, Bad Metric: A Mutation Testing Pattern for Analytics Engineering
  • A System Cannot Protect What It Does Not Understand
  • Using LLMs to Automate Data Cleaning and Transformation Pipelines

Trending

  • The Agentic Agile Office: Streamlining Enterprise Agile With Autonomous AI Agents
  • Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
  • A System Cannot Protect What It Does Not Understand
  • Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2
  1. DZone
  2. Data Engineering
  3. Data
  4. Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research

Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research

Transformed 5-hour data loads into 1-2 minutes using Oracle's APPEND+PARALLEL+NOLOGGING, enabling researchers to go from 1-2 experiments/day to 2-3/hour.

By 
Sanjay Mishra user avatar
Sanjay Mishra
·
Mar. 30, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
1.7K Views

Join the DZone community and get the full member experience.

Join For Free

The Problem

Traditional INSERT for benchmark data loading:

  • Takes 5+ hours for 4M rows
  • Sequential execution
  • Normal logging and buffering
  • 94% of experiment time wasted on data reload

The Solution: Three Techniques Combined

1. APPEND HINT

Tells Oracle: Skip normal buffering, write directly to disk.

Impact: ~10-20x speedup

2. PARALLEL EXECUTION

Tells Oracle: Use all CPU cores instead of sequential.

Impact: ~5-10x speedup

3. NOLOGGING MODE

Tells Oracle: No need to log test data changes.

Impact: ~3-5x speedup

Multiplication effect: 10-20x × 5-10x × 3-5x = 150-300x

Step 1: Create TPC-H Tables

SQL
 
CREATE TABLE LINEITEM (
    L_ORDERKEY NUMBER,
    L_PARTKEY NUMBER,
    L_SUPPKEY NUMBER,
    L_LINENUMBER NUMBER,
    L_QUANTITY NUMBER,
    L_EXTENDEDPRICE NUMBER,
    L_DISCOUNT NUMBER,
    L_TAX NUMBER,
    L_RETURNFLAG VARCHAR2(1),
    L_LINESTATUS VARCHAR2(1),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR2(25),
    L_SHIPMODE VARCHAR2(10),
    L_COMMENT VARCHAR2(44)
);

CREATE TABLE ORDERS (
    O_ORDERKEY NUMBER,
    O_CUSTKEY NUMBER,
    O_ORDERSTATUS VARCHAR2(1),
    O_TOTALPRICE NUMBER,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY VARCHAR2(15),
    O_CLERK VARCHAR2(15),
    O_SHIPPRIORITY NUMBER,
    O_COMMENT VARCHAR2(79)
);


Step 2: Generate and Load TPC-H Data (Python)

Python
 
from oracle_26ai_setup_load.oracle_connector import get_connection

conn = get_connection()
c = conn.cursor()

# Populate ORDERS (700K rows)
print("Populating ORDERS (700K)...")
for i in range(1, 700001):
    cust = ((i - 1) % 150000) + 1
    c.execute(f"""
        INSERT INTO ORDERS 
        VALUES ({i}, {cust}, 'O', {i*100}, SYSDATE - {i % 365}, 
                'PRIO{i % 5}', 'CLERK{i % 100}', {i % 2}, 'comment')
    """)
    if i % 500 == 0: conn.commit()
conn.commit()

# Populate LINEITEM (4M rows)
print("Populating LINEITEM (4M)...")
li = 1
for o in range(1, 700001):
    lines = (o % 7) + 1
    for l in range(1, lines + 1):
        part = ((li - 1) % 20000) + 1
        supp = ((li - 1) % 10000) + 1
        c.execute(f"""
            INSERT INTO LINEITEM 
            VALUES ({o}, {part}, {supp}, {l}, {(li % 50) + 1}, 
                    {(li*10) % 100000}, {(li % 10) * 0.1}, 
                    {(li % 10) * 0.05}, 'R', 'F', 
                    SYSDATE - {li % 365}, SYSDATE - {li % 300}, 
                    SYSDATE - {li % 200}, 'INST{li % 5}', 
                    'MODE{li % 5}', 'comment')
        """)
        li += 1
    if o % 50 == 0: conn.commit()
conn.commit()

# Verify
c.execute("SELECT COUNT(*) FROM LINEITEM")
print(f"LINEITEM rows: {c.fetchone()[0]:,}")  # Should show ~4,000,000


The Optimization: Before vs. After

Before (traditional):

SQL
 
-- Standard insert into 
-- a backup/test table
INSERT INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 5 hours 
to copy 4M rows


After (optimized):

SQL
 
-- Setup (one-time)
ALTER TABLE LINEITEM_TEST 
NOLOGGING;

-- Fast parallel copy
INSERT /*+ APPEND PARALLEL(8) */ 
INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 1-2 minutes 
to copy 4M rows


150-300x speedup achieved.

Timing Comparison (Per Experiment Cycle)

Phase Traditional Optimized
Run optimization code 10 min 10 min
Reset data (reload 4M) 5 hours 1-2 min
Run queries 10 min 10 min
TOTAL PER CYCLE ~5.3 hours ~21-22 minutes


Speedup: 15x faster per experiment.

What's the Bottleneck?

Here are some traditional INSERT bottlenecks:

  1. Cache management: Normal insert uses buffering (slow) → APPEND removes buffering
  2. Sequential processing: One-at-a-time rows → PARALLEL spreads across 8 cores
  3. Redo logging: Every change gets logged → NOLOGGING skips logging for ephemeral test data

Key insight: None of these bottlenecks matters for research data (you reload it anyway). Remove all three = multiplicative speedup.

Infrastructure bottlenecks determine what research is feasible. Remove them, and rigorous research becomes practical.

Why the new setup works:

Technique Problem It Solves Impact
APPEND Cache inefficiency during bulk writes ~10-20x speedup
PARALLEL Sequential processing bottleneck ~5-10x speedup
NOLOGGING Logging overhead on bulk operations ~3-5x speedup
All three combined All bottlenecks removed ~150-300x speedup


Before and After: What Changes for Researchers

Old Paradigm (Data Loading Constrained) New Paradigm (Fast Loading Enabled)
1-2 experiments per working day 2-3 experiments per working hour
100-experiment study = 50-100 hours of compute 100-experiment study = 1-2 hours of compute time
7-14 days of continuous compute time Same working day
Limits research scope to what's feasible in weeks Enables comprehensive research that was previously impractical


Translation: The difference between "I can run a few experiments" and "I can run rigorous, statistical validation studies."

Real-World Use Cases

ML/AI Database Optimization

Using reinforcement learning to optimize database configuration? Fast loading means I can run comprehensive parameter tuning instead of guessing.

Learned Index Structures

Testing neural network-based indexes? I need to run dozens of variants. Fast loading makes this feasible.

Automated Query Planning

Developing a learned query optimizer? I need hundreds of experiments. Fast loading is essential.

Benchmarking Database Improvements

Validating a new indexing strategy across multiple workloads? Fast loading lets me test rigorously instead of settling for "spot checks."

Why This Matters More For Your Research

Infrastructure optimization deserves recognition as a legitimate research contribution.

This isn't a micro-optimization. This isn't shaving seconds off a 30-second process. This is removing a constraint that determines what research is feasible.

When your bottleneck is data loading, you optimize around it:

  • You run fewer experiments
  • You compromise on statistical rigor
  • You avoid comprehensive validation

Remove the bottleneck, and suddenly rigorous research becomes practical.

So, what's next?

  1. Try it on your own data. Test with your actual TPC-H setup.
  2. Measure your improvement. Document your before/after times.
  3. Scale your experiments. Run the comprehensive studies that were previously impractical.
  4. Share results. If you're publishing research, this infrastructure improvement is worth documenting.

For the Skeptics: Why This Works

"Why haven't I seen this before?"

These techniques exist in Oracle's documentation but are scattered across different guides. The key insight is understanding that they compose multiplicatively—none creates a bottleneck for the others when all three are applied together.

"Is this risky?"

In production? Yes, you'd want logging and transaction safety. In research? No — your test data is ephemeral. You'll reload it anyway.

"What about other databases?"

  • PostgreSQL: COPY command is similarly fast; parallel loading is possible.
  • MySQL: Similar techniques available (but vary by storage engine).
  • SQL Server: Bulk insert with similar tricks.
  • Snowflake/BigQuery: Already optimized for bulk loads.

Conclusion

Data loading shouldn't be your research bottleneck. It doesn't have to be.

Using 25-year-old database capabilities in the right combination, I can transform experimental workflows from "heavily constrained" to "limited only by ambition."

For researchers doing ML/AI work on database optimization, this infrastructure fix is the difference between surface-level experiments and rigorous validation studies.

Try it. Measure it. Share it.

Resources

  • Code and reproduction materials: https://github.com/sanmish4ds/oracle-index-advisor
  • TPC-H Benchmark Guide
  • Oracle Database Performance Tuning Guide
Location intelligence Data (computing)

Opinions expressed by DZone contributors are their own.

Related

  • Why Round-Robin Won't Save You: Load Balancing Challenges in Data Streaming Services With Heterogeneous Traffic
  • Good Data, Bad Metric: A Mutation Testing Pattern for Analytics Engineering
  • A System Cannot Protect What It Does Not Understand
  • Using LLMs to Automate Data Cleaning and Transformation Pipelines

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook