DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Why Your RAG Pipeline Will Fail Without an MCP Server
  • An AI-Driven Architecture for Autonomous Network Operations (NetOps)
  • AI RAG Architectures: Comprehensive Definitions and Real-World Examples
  • Building an Internal Document Search Tool with Retrieval-Augmented Generation (RAG)

Trending

  • 5 AI Security Incidents That Broke Things in Production (and What They Have in Common)
  • Jakarta EE 12: Entering the Data Age of Enterprise Java
  • RAG Is Not Enough: Advanced Retrieval Architectures Using Vertex AI Search on GCP
  • Architecting Zero-Trust AI Agents: How to Handle Data Safely
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Tutorial: RAG at Scale With Vector Databases vs Lakehouse Architectures

Tutorial: RAG at Scale With Vector Databases vs Lakehouse Architectures

Learn how to scale RAG pipelines by storing embeddings in vector databases vs. lakehouses, with hands-on examples and key trade-offs.

By 
Vivek Venkatesan user avatar
Vivek Venkatesan
·
Sep. 09, 25 · Tutorial
Likes (0)
Comment
Save
Tweet
Share
3.2K Views

Join the DZone community and get the full member experience.

Join For Free

Retrieval-augmented generation (RAG) is quickly becoming the standard enterprise pattern for deploying large language models (LLMs). Instead of relying solely on pretraining, RAG enriches prompts with fresh, domain-specific information. The result? More accurate answers, fewer hallucinations, and outputs that enterprises can trust.

But building RAG at enterprise scale is tricky. You’re not embedding a few PDFs anymore—you’re embedding billions of rows from databases, log streams, or knowledge repositories. That leads to a critical architectural question:

Where should your embeddings live?

Two main approaches dominate today’s landscape:

  1. Vector Databases (e.g., Pinecone, Weaviate, Milvus, Vespa) – purpose-built systems optimized for similarity search.
  2. Lakehouse + Vector Extensions (e.g., Snowflake Cortex, Databricks Vector Search, BigQuery, OpenSearch) – extending existing enterprise data platforms to handle embeddings.

This tutorial provides a hands-on walkthrough of both approaches with code, followed by trade-offs, hybrid strategies, and practical advice for engineers building RAG pipelines.

Step 1: Generating Embeddings

First, we need embeddings. Let’s assume we’re embedding text with OpenAI’s text-embedding-ada-002 model (though Hugging Face or Cohere work similarly).

Python
 
from openai import OpenAI
client = OpenAI(api_key="YOUR_API_KEY")

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="RAG pipelines scale with the right retrieval layer"
)
vector = response.data[0].embedding
print(len(vector))  # usually 1536 dimensions


Now we’ve got a 1536-dimensional vector representing our sentence. The next question is: where do we put it?

Step 2: Storing in a Vector Database

Vector databases are purpose-built for this exact use case. Here’s an example with Pinecone:

Python
 
import pinecone

# initialize connection
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("rag-demo")

# upsert vector
index.upsert([
    ("doc1", vector, {"source": "example"})
])

# similarity search
query = index.query(vector, top_k=3, include_metadata=True)
print(query)


Why This Works Well

  • Speed: Millisecond-level queries on billions of embeddings.
  • Flexibility: Built-in support for hybrid queries (keyword + vector).
  • Maturity: Libraries like Milvus and Vespa are tuned for ANN (Approximate Nearest Neighbor) search.

Where It Falls Short

  • New silo: Your vectors live outside your warehouse/lake.
  • Governance gaps: Fewer tools for lineage, access control, compliance.
  • Extra cost: Another system to scale and manage.

Step 3: Storing in a Lakehouse

Now let’s try Databricks with Delta Lake + Vector Search.

SQL
 
-- create a table for embeddings
CREATE TABLE rag_embeddings (
    id STRING,
    vector ARRAY<FLOAT>,
    source STRING
);

-- insert one embedding
INSERT INTO rag_embeddings VALUES (
    'doc1',
    [0.12, 0.55, -0.23, ...], -- your embedding array
    'example'
);

-- query with cosine similarity
SELECT id, source,
       vector_cosine_similarity(vector, ARRAY[0.12, 0.55, -0.23, ...]) AS score
FROM rag_embeddings
ORDER BY score DESC
LIMIT 3;


Why This Works Well

  • Unified storage: No duplication—embeddings live alongside structured and unstructured data.
  • Governance: Strong IAM, lineage, catalogs, and compliance baked in.
  • Easy joins: Combine similarity scores with transactions, profiles, or other attributes.

Where It Falls Short

  • Performance: Vector queries are slower than in purpose-built DBs.
  • Immaturity: Many lakehouse vector features are new and evolving.
  • Tuning pain: Engineers must balance cost vs. speed at scale.

Step 4: Comparison: Vector Database vs. Lakehouse

Here's how the two approaches stack up in practice:

Vector Database(e.g., Pinecone, Weaviate):

  • Query speed: Milliseconds, optimized for Approximate Nearest Neighbor (ANN) search.
  • Scale: Built to handle billions of embeddings.
  • Governance: Limited access control and compliance features compared to enterprise data platforms.
  • Data duplication: High, since embeddings live outside your core warehouse or lake.
  • Ecosystem: AI-first, feature-rich for similarity search but younger in enterprise maturity.

Lakehouse + Vector Extensions (e.g., Snowflake, Databricks):

  • Query speed: Slower—seconds rather than milliseconds, though improving with new features.
  • Scale: Growing, but not as fast as purpose-built vector databases.
  • Governance: Strong IAM, lineage, catalogs, and compliance controls.
  • Data duplication: Low, embeddings and structured/unstructured data live together.
  • Ecosystem: Enterprise-first, backed by stable vendors, well-suited for regulated industries.

Step 5: Hybrid Architectures

Most enterprises don’t choose one or the other—they combine both.

  • Low-latency layer: Vector DB (Pinecone/Weaviate) for semantic search in apps.
  • Governance layer: Lakehouse (Snowflake/Databricks) for storage, lineage, compliance, and integration with BI/ETL.
  • Pipelines: Sync embeddings between the two via batch or streaming jobs.

Think of it as speed in the vector DB, trust in the lakehouse.

Practical Use Case

Imagine a bank building a compliance chatbot:

  • Retrieval for analyst Q&A: Vector DB (fast, interactive).
  • Storage for audit/compliance: Lakehouse (secure, lineage, reproducible queries).
  • Sync every 24h: Batch jobs copy embeddings into both systems.

This hybrid gives both performance and trust, which is exactly what regulated industries need.

Step 6: Common Pitfalls (and How to Avoid Them)

  1. Bad chunking → Retrieval suffers more from chunk mistakes than from the storage choice. Keep chunks coherent; include titles/headers in each chunk.
  2. No metadata → You lose hybrid filtering (by access level, date, BU). Add metadata now; it’s hard to backfill.
  3. No lineage → Record embedding model/version, source doc hash, and timestamps. You’ll need it for audits and rebuilds.
  4. Ignoring security → Enforce row/column-level access and tenancy at the retrieval layer (filters) and at storage (RBAC).
  5. Cost surprises → Vector DB: storage + query costs; Lakehouse: compute + storage. Benchmark early with realistic loads.
  6. Assuming one size fits all → It rarely does. Prototype both paths against your latency SLOs and governance requirements.

Conclusion

RAG is about data engineering decisions at scale. The retrieval layer you choose defines whether your system is fast, trustworthy, and cost-effective.

  • Vector Databases: Best for low-latency, high-scale similarity search.
  • Lakehouses: Best for governance, lineage, and enterprise integration.
  • Hybrid: Often the pragmatic path forward.

As Lakehouses mature their vector capabilities, convergence is likely. Until then, choose based on your priorities:

  • Optimize for speed and experimentation → Vector DB.
  • Optimize for trust and compliance → Lakehouse.

The key takeaway: RAG at scale is less about the model and more about the data pipeline beneath it.

Data structure Database RAG large language model

Opinions expressed by DZone contributors are their own.

Related

  • Why Your RAG Pipeline Will Fail Without an MCP Server
  • An AI-Driven Architecture for Autonomous Network Operations (NetOps)
  • AI RAG Architectures: Comprehensive Definitions and Real-World Examples
  • Building an Internal Document Search Tool with Retrieval-Augmented Generation (RAG)

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook