DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Essential Techniques for Production Vector Search Systems Part 1 - Hybrid Search
  • Beyond Keywords: Modernizing Enterprise Search with Vector Databases
  • From Keywords to Meaning: A Hands-On Tutorial With Sentence-Transformers for Semantic Search
  • How To Build an AI-Powered Search Bar With Vector Embeddings and OpenAI

Trending

  • No More Cheap Claude: 4 First Principles of Token Economics in 2026
  • Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
  • Stop Running Two Data Systems for One Agent Query
  • Run Gemma 4 on Your Laptop: A Hands-On Guide to Google's Latest Open Multimodal LLM
  1. DZone
  2. Data Engineering
  3. Data
  4. Implementing Vector Search in Databricks

Implementing Vector Search in Databricks

Vector search eases the natural language queries by numerical representations that can correlate meanings beyond literal words.

By 
Junaith Haja user avatar
Junaith Haja
·
Sep. 25, 25 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
3.1K Views

Join the DZone community and get the full member experience.

Join For Free

Search has always been at the heart of analytics. Whether you’re tracking down the right transaction, filtering a customer record, or pulling a specific review, the default approach has traditionally been keyword search.

Keyword search is simple and effective when you know exactly what you’re looking for, but it quickly falls apart when the language is messy, ambiguous, or when meaning matters more than exact words. That’s where vector search changes the game. Instead of matching literal keywords, vector search relies on embeddings — high-dimensional numeric representations of text, images, or other unstructured content — that capture semantic meaning. 

Two pieces of text that mean the same thing will have embeddings that are close to each other in vector space, even if they use completely different words.

For example:

  • A customer review that says “the delivery took forever” and another that says “shipping was slow” are semantically similar. A keyword-based search may miss that connection, but a vector search will surface both.

Until recently, implementing vector search meant stitching together multiple components: running embedding models in Python, storing vectors in an external vector database, syncing metadata, and writing custom APIs for retrieval. This was complex, brittle, and almost always lived outside the governed data warehouse, making governance, lineage, and compliance much harder.

Databricks brings native vector search into the Lakehouse. With a Python notebook and SQL commands, you can generate embeddings, build vector indexes, and perform semantic queries. 

In this article, we’ll walk through the step-by-step implementation of vector search using the Bakehouse dataset. See how to embed and run customer reviews by building a vector search endpoint, building a search index, and running semantic queries. By the end, you’ll understand not only how to use vector search in Databricks.

Context

Bakehouse is a sample dataset provided in Databricks for a fictional bakery. It receives a rapidly growing volume of free‑form customer reviews across many franchises and dates. Because feedback is written in varied phrasing (synonyms, slang, typos, multi‑lingual hints), analysts relying on simple keyword filters struggle to consistently find semantically similar complaints (e.g., “refund not received,” “money didn’t come back,” “return pending”). As a result, root‑cause patterns like late deliveries, freshness issues, or refund friction are detected late, triage is manual and time‑consuming, and franchise‑level accountability is uneven.

We need a governed, low‑latency semantic search capability over the reviews so stakeholders can ask questions in natural language and slice results by franchise and time — without building custom NLP pipelines per use case.

Scope and Constraints

  • Source of truth is bakehouse.reviews.customer_feedback (columns: new_id, review, franchiseID, review_date, plus an existing embedding).
  • Must operate within Unity Catalog governance; results should be queryable from Databricks SQL.
  • Latency targets appropriate for interactive analysis (top‑10 results in sub‑second to low‑second range depending on data scale).

Architecture Flow (High-Level)

Ingestion → Delta table (UC, CDF ON) →

 Vector Search Endpoint → Delta Sync Index (embeddings computed from review) → SQL vector_search() consumers

Core Components

  • Delta table bakehouse.reviews.customer_feedback with new_id (PK), review (STRING), franchiseID, review_date (DATE). 
  • Vector Search endpoint: managed compute that hosts one or more indexes. Choose Standard (low‑latency) or Storage‑optimized (billion‑scale, lower $/vector, higher latency).
  • Delta Sync Index (managed embeddings): computes embeddings from review using a hosted model (e.g., databricks-bge-large-en) and keeps in sync with table changes via CDF.
  • Query: Databricks SQL vector_search(index => ..., query_text => ..., num_results => N).

Implementation

The goal is to enable product and ops teams to type natural‑language queries like “refund not received” and instantly find semantically similar reviews, while keeping results governed, low‑latency, and always fresh.

Step 1: Enable Change Data Feed

Required for STANDARD Vector Search endpoints so the index can sync incrementally.

SQL
 
ALTER TABLE bakehouse.reviews.customer_feedback

SET TBLPROPERTIES (delta.enableChangeDataFeed = true);


Step 2: Create/Ensure Endpoint and Delta Sync Index (Managed Embeddings)

Create a notebook and run the code below:

Python
 
# If needed once per cluster:

%pip install -q databricks-vectorsearch

dbutils.library.restartPython()


from databricks.vector_search.client import VectorSearchClient


CATALOG = "bakehouse"

SCHEMA  = "reviews"

TABLE   = "customer_feedback"

SRC_FQN = f"{CATALOG}.{SCHEMA}.{TABLE}"


ENDPOINT_NAME = "bakehouse-vs-endpoint"

# Use a new index name so you can keep the old one around if you want

INDEX_FQN     = f"{CATALOG}.{SCHEMA}.customer_feedback_idx_v2"


PRIMARY_KEY   = "new_id"

TEXT_COL      = "review"


# Databricks-hosted embedding model endpoint (pay-per-token):

# See "Use foundation models" docs for supported endpoint names (e.g., databricks-bge-large-en / databricks-gte-large-en)

EMBEDDING_MODEL_ENDPOINT = "databricks-bge-large-en"


# 1) Prereq: Enable Change Data Feed on the source Delta table (required for STANDARD endpoints)

spark.sql(f"""

ALTER TABLE {SRC_FQN}

SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

""")


# 2) Create/ensure a Vector Search endpoint

vsc = VectorSearchClient()

try:

   vsc.get_endpoint(ENDPOINT_NAME)

   print(f"Endpoint exists: {ENDPOINT_NAME}")

except Exception:

   print(f"Creating endpoint: {ENDPOINT_NAME}")

   vsc.create_endpoint(name=ENDPOINT_NAME, endpoint_type="STANDARD")

   vsc.create_endpoint_and_wait(name=ENDPOINT_NAME)

   print("Endpoint ONLINE.")


# (Optional) delete an old index with the same name

# try: vsc.delete_index(endpoint_name=ENDPOINT_NAME, index_name=INDEX_FQN); print("Deleted existing index"); except: pass


# 3) Create a Delta Sync index that COMPUTES embeddings from text

# Key: use embedding_source_column + embedding_model_endpoint_name

print(f"Creating index with computed embeddings: {INDEX_FQN}")

idx = vsc.create_delta_sync_index_and_wait(

   endpoint_name=ENDPOINT_NAME,

   index_name=INDEX_FQN,

   primary_key=PRIMARY_KEY,

   source_table_name=SRC_FQN,

   pipeline_type="TRIGGERED",                 # or "TRIGGERED" if you prefer manual syncs

   embedding_source_column=TEXT_COL,           # text column to embed

   embedding_model_endpoint_name=EMBEDDING_MODEL_ENDPOINT,

   # Optional: write back computed vectors+payload to a UC table for inspection

   # sync_computed_embeddings=True

)
print("Index ONLINE:", INDEX_FQN)

Create/Ensure Endpoint and Delta Sync Index

The code takes about 3–5 minutes to run, and it:

  • Ensures a vector search endpoint exists
  • Creates a Delta Sync Index that computes embeddings from reviews using a Databricks embedding model
  • Uses TRIGGERED sync (manual) so you explicitly control refreshes

Step 3: Ensure Embedding and Vector Search Indexes Are Created

If you review the dataset bakehouse.reviews.customer_feedback, you will be able to see the newly created embedding column and index.

SQL
 
select * from bakehouse.reviews.customer_feedback limit 100;

Ensure Embedding and Vector Search Indexes Are Created


Step 4: Create Semantic Queries

1. Running a semantic query to check for refund issues:

SQL
 
SELECT new_id, review, franchiseID, review_date, score

FROM vector_search(

index => 'bakehouse.reviews.customer_feedback_idx_v2',

query_text => 'refund not received',

num_results => 10

)

ORDER BY score DESC;


2. Semantic query to check for delivery delays:

SQL
 
 SELECT new_id, review, franchiseID, review_date, search_score

 FROM vector_search(

   index        => 'bakehouse.reviews.customer_feedback_idx_v2',

   query_text   => 'delivery delays and late arrivals',

   num_results  => 50

 )

Semantic query to check for delivery delays

Through this implementation, we have turned the business teams into asking questions in natural language without the need for an ML pipeline.

Business can self-serve scenarios below:

  • Refunds: Regional manager types “refund not received” and filters by franchiseID → sees spike patterns by date without waiting for a new ML labeler.
  • Freshness: QA lead searches “stale bread, not fresh” → clusters issues to suppliers; opens targeted corrective actions.
  • Delivery: Ops analyst queries “arrived late / delivery delay” → validates routing changes before committing budget.

We implemented semantic search on Bakehouse customer reviews by pairing a Unity Catalog Delta table with a Databricks Vector Search endpoint and a Delta Sync Index that computes embeddings directly from the review column (CDF on). This lets analysts query in plain English and instantly surface meaning-matched reviews across franchises and dates. The managed index removes the need to build and maintain one-off ML classifiers per use case, keeps results fresh via continuous/triggered sync, and preserves governance under UC.

To evolve this into RAG, use the same index as the retriever (top-K reviews by vector_search) and pass those snippets to an LLM served on Databricks with instructions to answer only from the retrieved text and cite new_id (e.g., [12345]). This adds concise, grounded summaries and recommendations on top of the search layer without changing your data or governance model.

Data structure Semantic search DELTA (taxonomy)

Opinions expressed by DZone contributors are their own.

Related

  • Essential Techniques for Production Vector Search Systems Part 1 - Hybrid Search
  • Beyond Keywords: Modernizing Enterprise Search with Vector Databases
  • From Keywords to Meaning: A Hands-On Tutorial With Sentence-Transformers for Semantic Search
  • How To Build an AI-Powered Search Bar With Vector Embeddings and OpenAI

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook