Stop Running Two Data Systems for One Agent Query
Most RAG pipelines coordinate a vector database and a structured lakehouse that don't share a transaction model. Here's how to fix that with a unified approach.
Join the DZone community and get the full member experience.
Join For FreeIf you've deployed a retrieval-augmented generation (RAG) pipeline over enterprise data, you've probably ended up in the same place: Pinecone or Weaviate for embeddings, Delta Lake or Iceberg for structured data, and some custom middleware stitching them together that nobody fully owns.
This split made sense historically. Vector databases existed before lakehouse formats efficiently supported high-dimensional arrays. Standing up a Pinecone index was faster than waiting for your data platform to evolve. But what started as an expedient has become a structural problem for any team running agentic AI at scale.
The Coordination Tax
The moment an agent needs to combine semantic retrieval with structured filtering — find support tickets semantically similar to this new issue, but only from enterprise customers in EMEA, with a CSAT score below 3, created in the last 90 days — you're coordinating across two systems that don't share a transaction model.
In the two-system architecture, that's at minimum three operations: a vector search, a structured query, and a join that runs outside the query optimizer. No cost-based planning, no predicate pushdown across the boundary, no consistency guarantee between what the vector index returns and what's currently in the structured store.
For a human performing occasional ad hoc searches, the overhead is tolerable. For an agent issuing 20 hybrid queries in a reasoning chain, each waiting on that coordination overhead before proceeding, it compounds fast.
Embeddings Are Just Columns
The fix is conceptually simple: embeddings are vectors, vectors are fixed-dimension float arrays, and Parquet already handles array columns efficiently. A table that stores structured fields can also store an embedding column alongside them. The embedding lives next to the structured data, in the same file, under the same transaction model, updated in the same write operation.
Here's what the table schema looks like in practice:
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# Schema: structured fields + embedding column side by side
schema = pa.schema([
pa.field("ticket_id", pa.string()),
pa.field("customer_tier", pa.string()),
pa.field("region", pa.string()),
pa.field("csat_score", pa.float32()),
pa.field("created_at", pa.timestamp("us")),
pa.field("content", pa.string()),
pa.field("embedding", pa.list_(pa.float32(), 1536)), # fixed-dimension
])
Writing a record means embedding generation happens at write time, in the same transaction as the structured fields:
from openai import OpenAI
client = OpenAI()
def embed_and_write(tickets: list[dict], writer: pq.ParquetWriter):
embeddings = client.embeddings.create(
input=[t["content"] for t in tickets],
model="text-embedding-3-small"
).data
batch = pa.table({
"ticket_id": [t["ticket_id"] for t in tickets],
"customer_tier": [t["customer_tier"] for t in tickets],
"region": [t["region"] for t in tickets],
"csat_score": [t["csat_score"] for t in tickets],
"created_at": [t["created_at"] for t in tickets],
"content": [t["content"] for t in tickets],
"embedding": [e.embedding for e in embeddings],
}, schema=schema)
writer.write_table(batch)
Querying It
DuckDB supports array_cosine_similarity natively and can query Parquet files directly. A hybrid query — semantic retrieval with structured predicates — is one SQL statement:
import duckdb
import numpy as np
conn = duckdb.connect()
query_embedding = client.embeddings.create(
input=["payment gateway timeout on checkout"],
model="text-embedding-3-small"
).data[0].embedding
results = conn.execute("""
SELECT
ticket_id,
region,
csat_score,
content,
array_cosine_similarity(
embedding::FLOAT[1536],
$1::FLOAT[1536]
) AS similarity_score
FROM read_parquet('support_tickets/*.parquet')
WHERE
customer_tier = 'enterprise'
AND region = 'EMEA'
AND csat_score < 3.0
AND created_at > NOW() - INTERVAL 90 DAYS
ORDER BY similarity_score DESC
LIMIT 10
""", [query_embedding]).fetchdf()
The structured predicates run as pushdown filters before the similarity computation. No cross-system join. No middleware. The optimizer sees both predicates.
What You Get
Consistency by construction. When you write a record, the embedding is updated in the same transaction as the structured fields. There's no freshness lag between what the vector index returns and what the structured store contains. The record that appears in your similarity results is the record that exists in your lakehouse — not a version from the last index rebuild.
One access control surface. A record's visibility is governed by the lakehouse's access control system. In the two-system architecture, you maintain policies separately in the vector database and the structured store. They drift. A record deleted from the structured store for privacy reasons can persist in the vector index. In the unified model, one policy, one audit log.
No middleware to maintain. The code that stitches two systems together is always the least-maintained piece of the architecture. Removing it removes a failure mode.
The Performance Caveat
Full-table ANN search over Parquet is expensive without indexing. Two practical mitigations:
Pre-clustering: Segment the embedding space into k clusters with k-means at write time, store the cluster assignment as a regular column, and partition the table on it. At query time, a nearest-neighbor lookup against cluster centroids narrows the scan before touching the full embedding column. You trade a small amount of recall for significant scan reduction.
DuckDB + VSS extension: DuckDB's vector similarity search extension (currently experimental) adds HNSW index support directly over array columns. For production use cases, pair this with Delta Lake or Iceberg for transactional writes and DuckDB for query.
Practical Starting Point
If you're running a two-system architecture today, you don't need to migrate everything at once. Start with one table where the hybrid query latency is most painful. Add an embedding column. Write a DuckDB query that replaces the vector-search + structured-join combination. Measure the latency difference and the freshness improvement.
The hard part was always the middleware. Once the embedding lives in the table, it's just a column.
This article is part of a series on designing data infrastructure for agentic AI workloads. The author is an engineering manager focused on large-scale data infrastructure and AI systems.
Opinions expressed by DZone contributors are their own.
Comments