Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research
Transformed 5-hour data loads into 1-2 minutes using Oracle's APPEND+PARALLEL+NOLOGGING, enabling researchers to go from 1-2 experiments/day to 2-3/hour.
Join the DZone community and get the full member experience.
Join For FreeThe Problem
Traditional INSERT for benchmark data loading:
- Takes 5+ hours for 4M rows
- Sequential execution
- Normal logging and buffering
- 94% of experiment time wasted on data reload
The Solution: Three Techniques Combined
1. APPEND HINT
Tells Oracle: Skip normal buffering, write directly to disk.
Impact: ~10-20x speedup
2. PARALLEL EXECUTION
Tells Oracle: Use all CPU cores instead of sequential.
Impact: ~5-10x speedup
3. NOLOGGING MODE
Tells Oracle: No need to log test data changes.
Impact: ~3-5x speedup
Multiplication effect: 10-20x × 5-10x × 3-5x = 150-300x
Step 1: Create TPC-H Tables
CREATE TABLE LINEITEM (
L_ORDERKEY NUMBER,
L_PARTKEY NUMBER,
L_SUPPKEY NUMBER,
L_LINENUMBER NUMBER,
L_QUANTITY NUMBER,
L_EXTENDEDPRICE NUMBER,
L_DISCOUNT NUMBER,
L_TAX NUMBER,
L_RETURNFLAG VARCHAR2(1),
L_LINESTATUS VARCHAR2(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT VARCHAR2(25),
L_SHIPMODE VARCHAR2(10),
L_COMMENT VARCHAR2(44)
);
CREATE TABLE ORDERS (
O_ORDERKEY NUMBER,
O_CUSTKEY NUMBER,
O_ORDERSTATUS VARCHAR2(1),
O_TOTALPRICE NUMBER,
O_ORDERDATE DATE,
O_ORDERPRIORITY VARCHAR2(15),
O_CLERK VARCHAR2(15),
O_SHIPPRIORITY NUMBER,
O_COMMENT VARCHAR2(79)
);
Step 2: Generate and Load TPC-H Data (Python)
from oracle_26ai_setup_load.oracle_connector import get_connection
conn = get_connection()
c = conn.cursor()
# Populate ORDERS (700K rows)
print("Populating ORDERS (700K)...")
for i in range(1, 700001):
cust = ((i - 1) % 150000) + 1
c.execute(f"""
INSERT INTO ORDERS
VALUES ({i}, {cust}, 'O', {i*100}, SYSDATE - {i % 365},
'PRIO{i % 5}', 'CLERK{i % 100}', {i % 2}, 'comment')
""")
if i % 500 == 0: conn.commit()
conn.commit()
# Populate LINEITEM (4M rows)
print("Populating LINEITEM (4M)...")
li = 1
for o in range(1, 700001):
lines = (o % 7) + 1
for l in range(1, lines + 1):
part = ((li - 1) % 20000) + 1
supp = ((li - 1) % 10000) + 1
c.execute(f"""
INSERT INTO LINEITEM
VALUES ({o}, {part}, {supp}, {l}, {(li % 50) + 1},
{(li*10) % 100000}, {(li % 10) * 0.1},
{(li % 10) * 0.05}, 'R', 'F',
SYSDATE - {li % 365}, SYSDATE - {li % 300},
SYSDATE - {li % 200}, 'INST{li % 5}',
'MODE{li % 5}', 'comment')
""")
li += 1
if o % 50 == 0: conn.commit()
conn.commit()
# Verify
c.execute("SELECT COUNT(*) FROM LINEITEM")
print(f"LINEITEM rows: {c.fetchone()[0]:,}") # Should show ~4,000,000
The Optimization: Before vs. After
Before (traditional):
-- Standard insert into
-- a backup/test table
INSERT INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;
Result: 5 hours
to copy 4M rows
After (optimized):
-- Setup (one-time)
ALTER TABLE LINEITEM_TEST
NOLOGGING;
-- Fast parallel copy
INSERT /*+ APPEND PARALLEL(8) */
INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;
Result: 1-2 minutes
to copy 4M rows
150-300x speedup achieved.
Timing Comparison (Per Experiment Cycle)
| Phase | Traditional | Optimized |
|---|---|---|
| Run optimization code | 10 min | 10 min |
| Reset data (reload 4M) | 5 hours | 1-2 min |
| Run queries | 10 min | 10 min |
| TOTAL PER CYCLE | ~5.3 hours | ~21-22 minutes |
Speedup: 15x faster per experiment.
What's the Bottleneck?
Here are some traditional INSERT bottlenecks:
- Cache management: Normal insert uses buffering (slow) → APPEND removes buffering
- Sequential processing: One-at-a-time rows → PARALLEL spreads across 8 cores
- Redo logging: Every change gets logged → NOLOGGING skips logging for ephemeral test data
Key insight: None of these bottlenecks matters for research data (you reload it anyway). Remove all three = multiplicative speedup.
Infrastructure bottlenecks determine what research is feasible. Remove them, and rigorous research becomes practical.
Why the new setup works:
| Technique | Problem It Solves | Impact |
|---|---|---|
| APPEND | Cache inefficiency during bulk writes | ~10-20x speedup |
| PARALLEL | Sequential processing bottleneck | ~5-10x speedup |
| NOLOGGING | Logging overhead on bulk operations | ~3-5x speedup |
| All three combined | All bottlenecks removed | ~150-300x speedup |
Before and After: What Changes for Researchers
| Old Paradigm (Data Loading Constrained) | New Paradigm (Fast Loading Enabled) |
|---|---|
| 1-2 experiments per working day | 2-3 experiments per working hour |
| 100-experiment study = 50-100 hours of compute | 100-experiment study = 1-2 hours of compute time |
| 7-14 days of continuous compute time | Same working day |
| Limits research scope to what's feasible in weeks | Enables comprehensive research that was previously impractical |
Translation: The difference between "I can run a few experiments" and "I can run rigorous, statistical validation studies."
Real-World Use Cases
ML/AI Database Optimization
Using reinforcement learning to optimize database configuration? Fast loading means I can run comprehensive parameter tuning instead of guessing.
Learned Index Structures
Testing neural network-based indexes? I need to run dozens of variants. Fast loading makes this feasible.
Automated Query Planning
Developing a learned query optimizer? I need hundreds of experiments. Fast loading is essential.
Benchmarking Database Improvements
Validating a new indexing strategy across multiple workloads? Fast loading lets me test rigorously instead of settling for "spot checks."
Why This Matters More For Your Research
Infrastructure optimization deserves recognition as a legitimate research contribution.
This isn't a micro-optimization. This isn't shaving seconds off a 30-second process. This is removing a constraint that determines what research is feasible.
When your bottleneck is data loading, you optimize around it:
- You run fewer experiments
- You compromise on statistical rigor
- You avoid comprehensive validation
Remove the bottleneck, and suddenly rigorous research becomes practical.
So, what's next?
- Try it on your own data. Test with your actual TPC-H setup.
- Measure your improvement. Document your before/after times.
- Scale your experiments. Run the comprehensive studies that were previously impractical.
- Share results. If you're publishing research, this infrastructure improvement is worth documenting.
For the Skeptics: Why This Works
"Why haven't I seen this before?"
These techniques exist in Oracle's documentation but are scattered across different guides. The key insight is understanding that they compose multiplicatively—none creates a bottleneck for the others when all three are applied together.
"Is this risky?"
In production? Yes, you'd want logging and transaction safety. In research? No — your test data is ephemeral. You'll reload it anyway.
"What about other databases?"
- PostgreSQL: COPY command is similarly fast; parallel loading is possible.
- MySQL: Similar techniques available (but vary by storage engine).
- SQL Server: Bulk insert with similar tricks.
- Snowflake/BigQuery: Already optimized for bulk loads.
Conclusion
Data loading shouldn't be your research bottleneck. It doesn't have to be.
Using 25-year-old database capabilities in the right combination, I can transform experimental workflows from "heavily constrained" to "limited only by ambition."
For researchers doing ML/AI work on database optimization, this infrastructure fix is the difference between surface-level experiments and rigorous validation studies.
Try it. Measure it. Share it.
Resources
- Code and reproduction materials: https://github.com/sanmish4ds/oracle-index-advisor
- TPC-H Benchmark Guide
- Oracle Database Performance Tuning Guide
Opinions expressed by DZone contributors are their own.
Comments