
Hybrid Search Using Postgres DB

This article explains how to implement hybrid search (lexical and semantic) in a single Postgres database using full-text search and pgvector.

By Suraj Dharmapuram · Jan. 22, 25 · Tutorial

With traditional lexical (keyword-based) search, we find documents that contain the exact words we searched for. Keyword search excels at precision but struggles with alternate phrasing and natural language.

Semantic search addresses these limitations by capturing the intent behind documents and user queries. This is typically done by using vector embeddings to map documents and queries into a high-dimensional space and computing vector similarity to retrieve relevant results.

For many systems, a single search method falls short, leaving users with incomplete results. Combining the strengths of both methods lets us deliver a more effective search experience.

Keyword-based search is well supported by systems like Elasticsearch and Apache Solr. Semantic search typically requires a vector database, and a wide range of solutions exists. This post explains how we can support hybrid search, involving both lexical and semantic retrieval, using a single, familiar storage system: Postgres.

Let’s suppose we have the following table used by an application that allows users to search for products via keyword or natural language:

SQL
 
-- The vector type requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE products (
  id bigserial PRIMARY KEY,
  description VARCHAR(255),
  embedding vector(384)
);


The description column contains a text/natural-language description of the product. Postgres supports full-text search over this column out of the box; later, we will add a dedicated index to make those searches efficient.

The embedding column stores vector (float) representations of product descriptions, capturing semantic meaning rather than exact words. The pgvector extension brings the vector data type and vector distance metrics: L2, cosine, and inner (dot) product. There are several ways of generating embeddings, e.g., word-level embeddings such as Word2Vec, sentence/document embeddings such as SBERT, or embeddings from transformer-based models such as BERT.

For demonstration, we will insert the following data into the database:

SQL
 
INSERT INTO products (description) VALUES
('Organic Cotton Baby Onesie - Newborn Size, Blue'),
('Soft Crib Sheet for Newborn, Hypoallergenic'),
('Baby Monitor with Night Vision and Two-Way Audio'),
('Diaper Bag Backpack with Changing Pad - Unisex Design'),
('Stroller for Infants and Toddlers, Lightweight'),
('Car Seat for Newborn, Rear-Facing, Extra Safe'),
('Baby Food Maker, Steamer and Blender Combo'),
('Toddler Sippy Cup, Spill-Proof, BPA-Free'),
('Educational Toys for 6-Month-Old Baby, Colorful Blocks'),
('Baby Clothes Set - 3 Pack, Cotton, 0-3 Months'),
('High Chair for Baby, Adjustable Height, Easy to Clean'),
('Baby Carrier Wrap, Ergonomic Design for Newborns'),
('Nursing Pillow for Breastfeeding, Machine Washable Cover'),
('Baby Bath Tub, Non-Slip, for Newborn and Infant'),
('Baby Skincare Products - Lotion, Shampoo, Wash - Organic');


To populate the embedding column, I used a SentenceTransformer (aka SBERT) model and wrote the resulting vectors back to the database. The following Python code demonstrates this:

Python
 
import psycopg2
from sentence_transformers import SentenceTransformer

conn = psycopg2.connect("dbname=mydb")  # adjust connection parameters
cur = conn.cursor()

# Fetch (id, description) pairs to embed
cur.execute("SELECT id, description FROM products")
products = cur.fetchall()

descriptions = [product[1] for product in products]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(descriptions)

# Update the database with embeddings
for i, product in enumerate(products):
    product_id = product[0]
    # Convert the numpy vector to a list string like '[0.1, 0.2, ...]' for pgvector
    embedding_str = str(embeddings[i].tolist())
    cur.execute("UPDATE products SET embedding = %s WHERE id = %s",
                (embedding_str, product_id))

# Commit changes and close connection
conn.commit()


Full-Text Search 

Postgres provides extensive out-of-the-box support for keyword search. Suppose we want to find sleep accessories for a baby. We might search with the following query:

SQL
 
SELECT id, description
FROM products
WHERE description @@ to_tsquery('english', 'crib | baby | bed');


This returns, among other matches, the following product:

Plain Text
 
"Soft Crib Sheet for Newborn, Hypoallergenic"


Note: to_tsquery searches for lexemes (normalized keywords), so replacing newborn with newborns or babies returns the same result.

The above is, of course, a simple example, and Postgres’s full-text search functionality allows us several customizations, e.g., skip certain words, process synonyms, use sophisticated parsing, etc., by overriding the default text search config. 

Although these queries will work without an index, most applications find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index. The following code demonstrates how to create a GIN index (Generalized Inverted Index) over a precomputed tsvector column and use it for efficient search.

SQL
 
-- Create a tsvector column (you can add this to your existing table)
ALTER TABLE products ADD COLUMN description_tsv tsvector;

-- Populate the tsvector column from the description column
UPDATE products SET description_tsv = to_tsvector('english', description);

-- Create a GIN index on the tsvector column
CREATE INDEX idx_products_description_tsv ON products USING gin(description_tsv);

-- Queries should now filter on description_tsv (e.g., WHERE description_tsv @@ ...)
-- so the planner can use the index


Semantic Search Example

Let's now run a semantic search for our query intent, "baby sleeping accessories." To do this, we compute the query embedding (as above) and pick the most similar products by vector distance. The following code demonstrates this:

Python
 
# The query string
query_string = 'baby sleeping accessories'

# Generate embedding for the query string
query_embedding = model.encode(query_string).tolist()

# Note: <-> is pgvector's Euclidean (L2) distance operator (<=> is cosine distance).
# all-MiniLM-L6-v2 produces normalized embeddings, so both operators rank identically.
sql_query = """
SELECT id, description, (embedding <-> %s::vector) as similarity
FROM products
ORDER BY similarity
LIMIT 5;
"""

# Execute the query
cur.execute(sql_query, (query_embedding,))

# Fetch and print the results
results = cur.fetchall()
for result in results:
    product_id, description, similarity = result
    print(f"ID: {product_id}, Description: {description}, Similarity: {similarity}")

cur.close()
conn.close()


This gives us the following results:

Plain Text
 
ID: 12, Description: Baby Carrier Wrap, Ergonomic Design for Newborns, Similarity: 0.9956936200879117
ID: 2, Description: Soft Crib Sheet for Newborn, Hypoallergenic, Similarity: 1.0233573590998544
ID: 5, Description: Stroller for Infants and Toddlers, Lightweight, Similarity: 1.078171715208051
ID: 6, Description: Car Seat for Newborn, Rear-Facing, Extra Safe, Similarity: 1.08259154868697
ID: 3, Description: Baby Monitor with Night Vision and Two-Way Audio, Similarity: 1.0902734271784085


Along with each result, we also get back its vector distance (reported as similarity above; lower means more similar). As we can see, embedding search gives a richer set of results that nicely augments the keyword-based search.
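Since the query above orders by pgvector's `<->` (L2) operator while the discussion speaks of cosine distance, it is worth noting that for unit-length embeddings the two metrics produce the same ordering. A minimal NumPy sketch (synthetic vectors, no database needed; not part of the original pipeline) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(10, 384)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors, like SBERT output
q = vecs[0]

cos_dist = 1.0 - vecs @ q                    # what pgvector's <=> computes
l2_dist = np.linalg.norm(vecs - q, axis=1)   # what pgvector's <-> computes

# For unit vectors, l2 = sqrt(2 * cos_dist), a monotone map: identical ranking.
assert np.array_equal(np.argsort(l2_dist), np.argsort(cos_dist))
print("identical ordering under L2 and cosine distance")
```

This is why the example works despite the operator mismatch; with unnormalized embeddings, you should pick the operator that matches your index and intent.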

By default, pgvector performs exact nearest-neighbor search, which guarantees perfect recall. However, this approach becomes expensive as the dataset grows. We can add an index that trades recall for speed. One example is the IVFFlat (Inverted File with Flat Compression) index, which divides the vector space into clusters using k-means clustering. During a search, it identifies the clusters closest to the query vector and performs a linear scan within those selected clusters, computing exact distances between the query vector and the vectors they contain. The following code shows how to create such an index:

SQL
 
CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Here, lists is the number of clusters to create, and vector_cosine_ops selects the distance metric the index serves (use vector_l2_ops for Euclidean/L2 or vector_ip_ops for inner product). Note that the index is only used when the query orders by the matching operator (<=> for vector_cosine_ops, <-> for vector_l2_ops).
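To build intuition for the cluster-then-probe search described above, here is a small self-contained Python sketch (NumPy only, toy data; an illustration of the idea, not pgvector's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embedding table": 1000 unit vectors in 8 dimensions.
data = rng.normal(size=(1000, 8)).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)

# Train: a few rounds of k-means to pick the cluster centroids ("lists").
k = 10
centroids = data[rng.choice(len(data), k, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
    for c in range(k):
        if np.any(assign == c):
            centroids[c] = data[assign == c].mean(axis=0)

# Final assignment and inverted lists: cluster id -> row ids in that cluster.
assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
inverted_lists = {c: np.where(assign == c)[0] for c in range(k)}

def ivf_search(query, nprobe=3, top=5):
    """Probe the nprobe clusters nearest the query, then rank those rows exactly."""
    nearest_clusters = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_clusters])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:top]
    return candidates[order], dists[order]

ids, dists = ivf_search(data[0])
print(ids[0], float(dists[0]))  # row 0 is its own nearest neighbor, at distance 0.0
```

Scanning only nprobe of the k clusters is what makes the search sublinear, and also why recall can drop: a true neighbor may sit in an unprobed cluster.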


Fusion of Results 

The two methods described above excel in different scenarios and complement each other; combining their results yields a more robust ranking. Reciprocal Rank Fusion (RRF) is a method for combining multiple result sets, each with its own relevance indicator, into a single result set. RRF requires no tuning, and the relevance indicators do not have to be related to each other to achieve high-quality results. The core of RRF is captured in its formula:

Mathematica
 
RRF(d) = Σ (r ∈ R) 1 / (k + r(d))

Where:
- d is a document
- R is the set of rankers (retrievers)
- k is a constant (typically 60)
- r(d) is the rank of document d in ranker r


In our example, we'd do the following:

  1. For each result set, compute each product's reciprocal rank, 1 / (k + rank). The constant k prevents top-ranked products from dominating the final score and allows lower-ranked products to contribute meaningfully.
  2. Sum these reciprocal ranks across all result sets to get the final RRF score of a product.

For keyword search, Postgres provides the ranking function ts_rank (and some variants); ordering by it gives each product's rank within that result set. For semantic search, ordering by embedding distance gives the rank. The fusion itself can be implemented in SQL using a CTE for each search method and combining them at the end.
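The two steps above can also be sketched outside SQL. Here is a minimal Python implementation of RRF over two ranked id lists (the ids below are illustrative, not actual query output):

```python
from collections import defaultdict

def rrf_fuse(result_sets, k=60):
    """Fuse several ranked lists of ids via Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in result_sets:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest RRF score first; Python's sort is stable, so ties keep insertion order.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked product ids:
keyword_hits = [2, 10, 14]        # e.g., ordered by ts_rank
semantic_hits = [12, 2, 5, 6, 3]  # e.g., ordered by embedding distance

print(rrf_fuse([keyword_hits, semantic_hits]))  # [2, 12, 10, 14, 5, 6, 3]
```

Product 2 wins because it appears near the top of both lists, which is exactly the behavior we want from fusion.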

Further, we could also use an ML model to rerank the results after combining. Due to its high computational cost, ML model-based reranking is applied after the initial retrieval, which reduces the result set to a small set of promising candidates.

Conclusion 

With the components described above, we built an intelligent search pipeline that integrates: 

  • Full-text search for precise keyword matching
  • Vector search for semantic matching
  • Result fusion for combining results and reranking using ML

We accomplished this using a single database system where all the data is stored. By avoiding integration with separate search engines or vector databases, we eliminated the need for multiple tech stacks and reduced system complexity.
