Hybrid Search Using Postgres DB
This article explains how to implement hybrid search (lexical and semantic) in a single Postgres database using full-text search and pgvector.
With a traditional lexical-based (or keyword-based) search, we will find documents that contain the exact word we searched for. Keyword search excels at precision but struggles with alternate phrasing or natural language.
Semantic search addresses these limitations by capturing the intent behind documents and user queries. This is typically done by leveraging vector embeddings to map documents and queries into a high-dimensional space and computing vector similarity to retrieve relevant results.
For many systems, a single search method falls short, leaving users with incomplete results. Combining the strengths of both methods allows us to deliver a more effective search experience.
Keyword-based search is supported well by systems like Elasticsearch and Apache Solr. Semantic search typically requires a vector database, and a wide range of solutions exists. This post explains how we can support hybrid search, involving both lexical and semantic search, using a single, familiar storage system: Postgres.
Let’s suppose we have the following table used by an application that allows users to search for products via keyword or natural language:
CREATE TABLE products (
id bigserial PRIMARY KEY,
description VARCHAR(255),
embedding vector(384)
);
The description column contains a text/natural-language description of the product. Postgres supports full-text search over this column out of the box; as we will see later, we can also create an index on it to accelerate retrieval.
The embedding column stores a vector (float) representation of the product description, capturing its semantic meaning rather than its exact words. The pgvector extension brings to Postgres the vector data type and vector similarity metrics: L2, cosine, and dot product distances. There are several ways of generating embeddings, e.g., word-level embeddings such as Word2Vec, sentence/document embeddings such as SBERT, or embeddings from transformer-based models such as BERT.
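Note that pgvector must be enabled in the database before the vector type can be used; this is a one-time setup step:
-- Enable the pgvector extension (one-time setup per database)
CREATE EXTENSION IF NOT EXISTS vector;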
For demonstration, we will insert the following data into the database:
INSERT INTO products (description) VALUES
('Organic Cotton Baby Onesie - Newborn Size, Blue'),
('Soft Crib Sheet for Newborn, Hypoallergenic'),
('Baby Monitor with Night Vision and Two-Way Audio'),
('Diaper Bag Backpack with Changing Pad - Unisex Design'),
('Stroller for Infants and Toddlers, Lightweight'),
('Car Seat for Newborn, Rear-Facing, Extra Safe'),
('Baby Food Maker, Steamer and Blender Combo'),
('Toddler Sippy Cup, Spill-Proof, BPA-Free'),
('Educational Toys for 6-Month-Old Baby, Colorful Blocks'),
('Baby Clothes Set - 3 Pack, Cotton, 0-3 Months'),
('High Chair for Baby, Adjustable Height, Easy to Clean'),
('Baby Carrier Wrap, Ergonomic Design for Newborns'),
('Nursing Pillow for Breastfeeding, Machine Washable Cover'),
('Baby Bath Tub, Non-Slip, for Newborn and Infant'),
('Baby Skincare Products - Lotion, Shampoo, Wash - Organic');
For embeddings, I used a SentenceTransformer model (aka SBERT) to generate embeddings and then stored them in the database. The following Python code demonstrates this (the psycopg2 connection parameters are placeholders for your own):
import psycopg2
from sentence_transformers import SentenceTransformer

# Connect to the database and fetch the products to embed
conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("SELECT id, description FROM products")
products = cur.fetchall()

descriptions = [product[1] for product in products]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(descriptions)

# Update the database with embeddings
for i, product in enumerate(products):
    product_id = product[0]
    embedding = embeddings[i]
    # Convert the NumPy array to a list and take its string representation, e.g., '[0.1, 0.2, ...]'
    embedding_str = str(embedding.tolist())
    cur.execute("UPDATE products SET embedding = %s WHERE id = %s", (embedding_str, product_id))

# Commit changes
conn.commit()
Full-Text Search
Postgres provides extensive out-of-the-box support for keyword search. Let's say we want to search for sleep accessories for a baby. We might write the following query for keyword-based retrieval:
SELECT id, description
FROM products
WHERE description @@ to_tsquery('english', 'crib | baby | bed');
Since | denotes OR, this matches products whose descriptions contain any of the lexemes crib, baby, or bed, including:
"Soft Crib Sheet for Newborn, Hypoallergenic"
Note: to_tsquery searches for lexemes (normalized keywords), so replacing baby with babies returns the same results, since both words normalize to the same lexeme.
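We can ask Postgres to show this normalization directly:
-- Both inflections normalize to the same lexeme, 'babi'
SELECT to_tsquery('english', 'baby');
SELECT to_tsquery('english', 'babies');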
The above is, of course, a simple example; Postgres's full-text search functionality allows several customizations, e.g., skipping certain words, processing synonyms, and using sophisticated parsing, by overriding the default text search configuration.
Although these queries will work without an index, most applications find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index. The following code demonstrates how we can create a GIN (Generalized Inverted Index) index over a precomputed tsvector column and use it for efficient search.
-- Create a tsvector column (you can add this to your existing table)
ALTER TABLE products ADD COLUMN description_tsv tsvector;
-- Update the tsvector column with parsed data from the description column
UPDATE products SET description_tsv = to_tsvector('english', description);
-- Create a GIN index on the tsvector column
CREATE INDEX idx_products_description_tsv ON products USING gin(description_tsv);
Semantic Search Example
Let’s now try to execute a semantic search request for our query intent — "baby sleeping accessories." To do this, we compute the embedding (as above) and pick the most similar products by vector distance (in this case, cosine distance). The following code demonstrates this:
# The query string
query_string = 'baby sleeping accessories'
# Generate embedding for the query string
query_embedding = model.encode(query_string).tolist()
# Construct the SQL query using pgvector's cosine distance operator (<=>)
# Assuming you have an index that supports cosine distance (e.g., ivfflat with vector_cosine_ops)
sql_query = """
SELECT id, description, (embedding <=> %s::vector) AS distance
FROM products
ORDER BY distance
LIMIT 5;
"""
# Execute the query
cur.execute(sql_query, (query_embedding,))
# Fetch and print the results
results = cur.fetchall()
for result in results:
    product_id, description, distance = result
    print(f"ID: {product_id}, Description: {description}, Distance: {distance}")
cur.close()
conn.close()
This gives us the following results:
ID: 12, Description: Baby Carrier Wrap, Ergonomic Design for Newborns, Distance: 0.9956936200879117
ID: 2, Description: Soft Crib Sheet for Newborn, Hypoallergenic, Distance: 1.0233573590998544
ID: 5, Description: Stroller for Infants and Toddlers, Lightweight, Distance: 1.078171715208051
ID: 6, Description: Car Seat for Newborn, Rear-Facing, Extra Safe, Distance: 1.08259154868697
ID: 3, Description: Baby Monitor with Night Vision and Two-Way Audio, Distance: 1.0902734271784085
Along with each result, we also get back its cosine distance (lower means more similar). As we can see, embedding search gives us a richer set of results that nicely augments the keyword-based search.
By default, pgvector performs exact nearest neighbor search, which guarantees perfect recall. However, this becomes quite expensive as the dataset grows. We can add an index that trades off recall for speed. One example is the IVFFlat (Inverted File with Flat Compression) index in Postgres, which works by dividing the vector space into clusters using k-means clustering. During a search, it identifies the clusters closest to the query vector and performs a linear scan within those selected clusters, computing the exact distances between the query vector and the vectors in those clusters. The following statement shows how such an index can be created:
CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
- lists specifies the number of clusters to create.
- vector_cosine_ops specifies the distance metric we are using (cosine here; inner product and Euclidean/L2 operator classes are also available).
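At query time, we can trade speed for recall by raising the number of clusters pgvector probes (the default is 1):
-- Probe more clusters for better recall, at the cost of speed
SET ivfflat.probes = 10;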
Fusion of Results
The two methods described above excel in different scenarios and complement each other; combining their outputs yields more robust search results. Reciprocal Rank Fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results. The core of RRF is captured in its formula:
RRF(d) = Σ_{r ∈ R} 1 / (k + r(d))
Where:
- d is a document
- R is the set of rankers (retrievers)
- k is a constant (typically 60)
- r(d) is the rank of document d in ranker r
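For example, with k = 60, a product ranked 1st by keyword search and 3rd by semantic search would score 1/(60 + 1) + 1/(60 + 3) ≈ 0.0323.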
In our example, we’d do the following:
- Calculate each product's contribution from each result set by taking the reciprocal of its rank after adding the constant k. The constant prevents top-ranked products from dominating the final score and allows lower-ranked products to contribute meaningfully.
- Sum rank reciprocals from all result sets to get the final RRF score of a product.
For keyword search, Postgres provides a ranking function, ts_rank (and some variants), which can be used as the rank of a product inside its result set. For semantic search, we can use the embedding distance to rank products within theirs. The fusion can be implemented in SQL using a CTE for each search method and combining them at the end, as sketched below.
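Here is a minimal sketch of such a query, assuming the description_tsv and embedding columns from earlier; the %s placeholder takes the query embedding, the keyword query is the one used above, and k = 60:
-- Reciprocal Rank Fusion over the keyword and semantic result sets
WITH semantic AS (
    SELECT id, RANK() OVER (ORDER BY embedding <=> %s::vector) AS rank
    FROM products
    ORDER BY embedding <=> %s::vector
    LIMIT 20
),
keyword AS (
    SELECT id, RANK() OVER (ORDER BY ts_rank(description_tsv, query) DESC) AS rank
    FROM products, to_tsquery('english', 'crib | baby | bed') query
    WHERE description_tsv @@ query
    LIMIT 20
)
SELECT COALESCE(semantic.id, keyword.id) AS id,
       COALESCE(1.0 / (60 + semantic.rank), 0.0) +
       COALESCE(1.0 / (60 + keyword.rank), 0.0) AS rrf_score
FROM semantic
FULL OUTER JOIN keyword ON semantic.id = keyword.id
ORDER BY rrf_score DESC
LIMIT 5;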
Further, we could use an ML model to rerank the combined results. Because of its high computational cost, ML-model-based reranking is applied only after the initial retrieval has reduced the result set to a small number of promising candidates.
Conclusion
With the components described above, we built an intelligent search pipeline that integrates:
- Full-text search for precise keyword matching
- Vector search for semantic matching
- Result fusion for combining results and reranking using ML
We accomplished this using a single database system in which all the data is stored. By avoiding integration with separate search engines or databases, we eliminated the need for multiple tech stacks and reduced system complexity.