A Practical Guide to Semantic Caching With Redis LangCache

Learn how to use Redis LangCache to semantically cache LLM prompts and responses, reducing inference costs and improving performance.

Subhashini Raman

Josephine Eskaline Joyce

Jan. 06, 26 · Tutorial

A semantic cache differs from a traditional cache, which relies on exact matching of keys: it stores and retrieves data based on semantic similarity. Redis LangCache is a fully hosted semantic caching service that caches LLM prompts and responses by meaning, so similar prompts can reuse an existing response and LLM usage costs go down.

In this tutorial, we will build a simple application that uses LangCache to cache LLM queries, and then see whether fuzzy matching can be combined with it to improve the responses.

Step 1: Set Up Redis LangCache

If you do not have an account yet, create one at https://cloud.redis.io/. Once you have logged in:

Navigate to databases and create a new database (I used the free plan here).
Click on LangCache in the left menu and create an instance of the LangCache service. I used the "Quick service creation" option to create the LangCache service.
Copy the API key and keep it safe.
Click on the LangCache service you just created. When you click the "Connect" button, a quick connect guide will appear on the right with examples of how to connect to your LangCache instance.

Step 2: Create a Simple Python Script

Let's create a simple Python script that will first check the cache for a prompt. If a match is found, it returns the cached response. If not, it sends the prompt to the LLM, caches the response, and returns it.

Connect to LangCache:

     Python
    
 

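    from langcache import LangCache

    # Replace the placeholders with the values from the quick connect guide.
    lang_cache = LangCache(
        server_url="<LangCache service URL>",
        cache_id="<cache ID of the LangCache service you created>",
        api_key="<LangCache service API key>",
    )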
   

Search the cache before sending it to LLM:

     Python
    
    result = lang_cache.search(
        prompt=query,
        similarity_threshold=0.90,
    )

The similarity_threshold parameter determines how closely a cached prompt must match the query to count as a hit. A higher value means stricter matching.
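
To make the threshold concrete, here is a minimal sketch of how a similarity score might be compared against it. It assumes cosine similarity over embedding vectors, which is a common choice for semantic search; the article does not state which metric LangCache uses internally, and the three-dimensional vectors below are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, similarity_threshold=0.90):
    """A hit requires the score to reach the threshold (assumed rule)."""
    return cosine_similarity(query_vec, cached_vec) >= similarity_threshold


# Toy vectors for illustration only.
cached = [0.9, 0.1, 0.2]
close = [0.85, 0.15, 0.25]   # points in nearly the same direction
far = [0.1, 0.9, 0.3]        # points in a different direction

print(is_cache_hit(close, cached))
print(is_cache_hit(far, cached))
```

Raising similarity_threshold shrinks the set of vectors that count as "close enough", which is why a stricter value trades cache hits for accuracy.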

Handle a cache hit:

     Python
    
 

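    # This snippet runs inside a function, so `return` skips the LLM call.
    if result.data:  # non-empty when at least one entry met the threshold
        for entry in result.data:
            print("Cache Hit!")
            print("Cache Response:::")
            print(f"Prompt: {entry.prompt}")
            print(f"Response: {entry.response}")
            print(f"Score: {entry.similarity}")
            return  # stop after reporting the first match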
   

Handle cache miss and store response:

     Python
    
 

    import requests

    # Cache miss: call the LLM.
    # `url`, `payload`, and `headers` are defined elsewhere for your LLM provider.
    response = requests.post(url, json=payload, headers=headers)
    response_json = response.json()
    response_text = response_json["choices"][0]["message"]["content"]

    # Store the response from the LLM in LangCache.
    save_response = lang_cache.set(
        prompt=query,
        response=response_text,
    )
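
The snippets above can be tied together in one function. The following is a sketch, not the authors' implementation: `answer`, `InMemoryStandIn`, and `call_llm` are hypothetical names, and the stand-in class only does exact matching so that the flow can be run without a LangCache account. It mimics the `search` and `set` calls and the `result.data` shape used in this article, and it assumes the first entry in `result.data` is the one to return.

```python
from types import SimpleNamespace


def answer(query, cache, call_llm, similarity_threshold=0.90):
    """Return (response_text, source), where source is "cache" or "llm"."""
    result = cache.search(prompt=query, similarity_threshold=similarity_threshold)
    if result.data:
        # Cache hit: reuse the stored response and skip the LLM call.
        return result.data[0].response, "cache"
    # Cache miss: ask the LLM, then store the answer for next time.
    response_text = call_llm(query)
    cache.set(prompt=query, response=response_text)
    return response_text, "llm"


class InMemoryStandIn:
    """Hypothetical exact-match stand-in for local experiments only."""

    def __init__(self):
        self._store = {}

    def search(self, prompt, similarity_threshold=0.90):
        hits = []
        if prompt in self._store:
            hits.append(
                SimpleNamespace(
                    prompt=prompt, response=self._store[prompt], similarity=1.0
                )
            )
        return SimpleNamespace(data=hits)

    def set(self, prompt, response):
        self._store[prompt] = response
```

Passing the cache and the LLM call in as arguments keeps the flow easy to test: in the real script, `cache` would be the `lang_cache` object and `call_llm` a function wrapping the `requests.post` call.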
   

Benefits of Semantic Caching

For similar queries, responses are fetched from the cache, avoiding expensive LLM calls.
Faster response times.

Things to Watch Out For

Similarity threshold: Set it thoughtfully. Too high and you'll miss useful matches. Too low and you'll receive irrelevant results.
Accuracy: Even with an optimal threshold, results may not always be perfect.
Data privacy: In multi-tenant architectures, ensure proper data partitioning so users don’t see each other’s information.
Cache eviction: Know when and how to evict cache entries.
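
The last two points can be illustrated with a small in-memory sketch. It is not LangCache code: `PartitionedTTLCache` is a hypothetical class that uses exact prompt matching for brevity, keys every entry by tenant so one tenant never reads another's data, and evicts entries once a time-to-live has passed. How a hosted service exposes partitioning and expiry should be checked in its own documentation.

```python
import time


class PartitionedTTLCache:
    """Hypothetical sketch: per-tenant entries that expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # (tenant_id, prompt) -> (response, expires_at)

    def set(self, tenant_id, prompt, response):
        self._entries[(tenant_id, prompt)] = (response, self._clock() + self._ttl)

    def get(self, tenant_id, prompt):
        key = (tenant_id, prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict lazily when an expired entry is read
            return None
        return response
```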

Let’s Run the Code

First Query: (Cache Is Empty)

     Python
    
    Query: Brief history on Capital of France
    Response:
        Cache Miss!
        Redirecting to LLM

The query and the response are now stored in the cache.

Modified Query

     Python
    
 

    Query:  ----Brief history on Paris---
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: Brief history on Capital of France
        Score: 0.92440444
   

Even though the query changed, the cache returned a semantically similar result with a similarity score of 0.92.

Another Variation

     Python
    
 

    Query:  ----Brief history on France---
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: Brief history on Capital of France
        Score: 0.9121176
   

Oops! We asked for the history of France, but we got the history of the capital of France. Though Paris plays a major role in France's history, the context is different. One is a city, and the other is a country!

Tuning the Threshold

Let’s increase the threshold to 0.92 and clear the cache.

     Python
    
 

    Query:  ----Brief history on Capital of France---
    Response:
        Cache Miss!
        LLM Response: <llm response>
    #####################################

    Query:  ----Brief history on Paris---
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: Brief history on Capital of France
        Score: 0.92440444
    #####################################

    Query:  ----Brief history on France---
    Response:
        Cache Miss!
        LLM Response: <llm response>
   

It seems to be working better!

Performance Comparison

Let's compare the time it takes to query from the cache vs. querying from the LLM.

Semantic Cache Results — Sorted by Time

Query	Result	Matched Prompt	Similarity Score	Response Source	Time (seconds)
Brief history on the Capital of France	Cache Miss			LLM	0.8499
Brief history on Paris	Cache Hit	Brief history on the Capital of France	0.9244	Cache	0.2705
Brief history on France	Cache Miss			LLM	1.2543
Brief history on the Capital of France	Cache Hit	Brief history on the Capital of France	1.0	Cache	1.1139
Brief history on Paris	Cache Hit	Brief history on France	0.9386	Cache	0.2761
Brief history on France	Cache Hit	Brief history on France	1.0	Cache	0.2798
Brief history on the Capital of France	Cache Hit	Brief history on the Capital of France	1.0	Cache	1.0178
Brief history on Paris	Cache Hit	Brief history on France	0.9386	Cache	0.2806
Brief history on France	Cache Hit	Brief history on France	1.0	Cache	0.2778
How does Langcache work? explain…	Cache Hit	How does Langcache work? explain…	1.0	Cache	0.2930

Observations:

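Though there are some anomalies (two cache hits took over a second), responses from the cache are generally much faster: most cache hits returned in about 0.27 to 0.29 seconds, compared with 0.85 to 1.25 seconds for the LLM calls.
A high similarity score is required before a cached response is reused.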
For distinct answers, a higher similarity threshold is recommended.
Based on the query and business requirements, always tune and experiment, as results vary with embedding models, similarity thresholds, etc.
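
Timings like the ones in the table can be collected with a small helper. This is a sketch using only the standard library; `timed` and `median_seconds` are hypothetical names, and taking the median of several runs is one way to dampen outliers such as the slower cache hits seen above.

```python
import statistics
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def median_seconds(fn, *args, repeats=5, **kwargs):
    """Median elapsed time over several calls, to smooth out one-off spikes."""
    return statistics.median(
        timed(fn, *args, **kwargs)[1] for _ in range(repeats)
    )


# Example with the objects from this tutorial (not executed here):
# result, seconds = timed(lang_cache.search, prompt=query, similarity_threshold=0.92)
```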

One More Example

     Python
    
 

    Query:  ----How does Langcache work---
    Response:
        Cache Miss!
        LLM Response: <gives a 20-line response>
        Time to get response from LLM API: 0.8144 seconds
    ----#####################################---
    Query:  ----How does Langcache work---
    Response:
        Cache Hit!
        Time for cache lookup: 0.3162 seconds
        Cache Response:::
        Prompt: How does Langcache work
        Response: <gives the same 20-line response>
        Score: 1.0
   

All is well till now. Let's modify the query a bit!

     Python
    
 

    ----#####################################---
    Query:  ----How does Langcache work, explain in 5 lines---
    Response:
        Cache Hit!
        Time for cache lookup: 0.2719 seconds
        Cache Response:::
        Prompt: How does Langcache work
        Response: <gives the same 20-line response>
        Score: 0.9714471

    ----#####################################---
   

We asked for a 5-line response, but got the cached 20-line one. This highlights the importance of tuning and using attributes to scope responses.
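
The idea of scoping by attributes can be sketched with a small in-memory stand-in. Everything here is an assumption for illustration: `ScopedCache` and the `length` attribute are hypothetical, the application is assumed to derive the attribute itself (for example from a "5 lines" instruction), and `difflib` string similarity stands in for a real semantic score. LangCache's own attribute parameters are described in its documentation.

```python
from difflib import SequenceMatcher


class ScopedCache:
    """Hypothetical sketch: a match must share attributes AND be similar enough."""

    def __init__(self):
        self._entries = []  # list of (prompt, response, attributes)

    def set(self, prompt, response, attributes):
        self._entries.append((prompt, response, dict(attributes)))

    def search(self, prompt, attributes, similarity_threshold=0.90):
        best = None
        for cached_prompt, response, cached_attrs in self._entries:
            if cached_attrs != attributes:
                continue  # different scope: never reuse this entry
            # Placeholder score; a real semantic cache compares embeddings.
            score = SequenceMatcher(
                None, prompt.lower(), cached_prompt.lower()
            ).ratio()
            if score >= similarity_threshold and (best is None or score > best[1]):
                best = (response, score)
        return best  # (response, score) or None


cache = ScopedCache()
cache.set("How does Langcache work", "<20-line answer>", {"length": "long"})

# Same topic, but the caller wants a short answer: the long entry is skipped.
print(cache.search("How does Langcache work", {"length": "short"}))
print(cache.search("How does Langcache work", {"length": "long"}))
```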

Semantic Cache vs. Fuzzy Match

Fuzzy matching works based on approximate string matching. It works best for handling typos, spelling variants, and near-duplicate strings, whereas semantic match operates at the meaning and context levels.

Let's see the difference between them in action. Let's compare the semantic score (LangCache) with the fuzzy score (Ratcliff–Obershelp algorithm) when matching two strings.
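
The fuzzy side of this comparison can be sketched with Python's standard `difflib.SequenceMatcher`, whose matching is closely related to the Ratcliff–Obershelp approach. Lowercasing the strings first is an assumption; the article does not show how its fuzzy score was preprocessed, so the numbers printed by this sketch may differ from the ones reported below.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


cached_prompt = "How does Semantic cache work?"
queries = (
    "Does Semantic cache work?",
    "Will Semantic cache work?",
    "What does Semantic cache mean?",
)

for query in queries:
    print(f"{query!r}: {fuzzy_score(query, cached_prompt):.2f}")
```

Because the score depends only on shared characters, a query that rephrases the same question with different words scores lower here even when its meaning is unchanged.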

Querying for the first time:

    Plain Text
   
 

    Query: Does Semantic cache work?
    Response:
        Cache Miss!
        Cache Response:::
        Prompt: How does Semantic cache work?
        Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
  

Querying the same again — notice that the semantic score and fuzzy score are pretty close.

    Plain Text
   
 

    Query: How does Semantic cache work?
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: How does Semantic cache work?
        Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
        Semantic score (LangCache): 0.95
        Fuzzy match score: 0.93
  

Let's try with a few more variations:

    Plain Text
   
 

    Query: Will Semantic cache work?
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: How does Semantic cache work?
        Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
        Semantic score (LangCache): 0.94
        Fuzzy match score: 0.83
  

    Plain Text
   
 

    Query: What does Semantic cache mean?
    Response:
        Cache Hit!
        Cache Response:::
        Prompt: How does Semantic cache work?
        Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
        Semantic score (LangCache): 0.94
        Fuzzy match score: 0.79
  

As you can see, fuzzy match focuses on "looks like," whereas semantic matching focuses on "means like."

Complete implementation of the above is available here.

Fuzzy match can be combined with the semantic cache in a few ways:

Store the last 'n' prompts and do a fuzzy match on those. Use semantic caching only if there is no match found.
When a high similarity threshold is used for the semantic cache (e.g., > 0.95), more queries miss, and every miss stores a new prompt. This results in a lot of near-duplicates. We can use fuzzy matching to identify these near-duplicates and store only the prompts that are different.
If the caching layer contains a lot of near duplicates, we can use fuzzy match for compaction.
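
The first and third ideas in this list can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: `RecentPromptTier` and `compact` are hypothetical names, the 0.95 fuzzy threshold is an arbitrary starting point, and `difflib` provides the fuzzy score. On a `lookup` miss, the application would fall back to the semantic cache.

```python
from collections import deque
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class RecentPromptTier:
    """Keeps the last n (prompt, response) pairs for a cheap fuzzy check."""

    def __init__(self, n=50, fuzzy_threshold=0.95):
        self._recent = deque(maxlen=n)  # oldest pairs drop off automatically
        self._threshold = fuzzy_threshold

    def lookup(self, query):
        for prompt, response in self._recent:
            if fuzzy_score(query, prompt) >= self._threshold:
                return response
        return None  # caller falls back to the semantic cache

    def remember(self, prompt, response):
        self._recent.append((prompt, response))


def compact(prompts, fuzzy_threshold=0.95):
    """Keep a prompt only if it is not a near-duplicate of one already kept."""
    kept = []
    for prompt in prompts:
        if all(fuzzy_score(prompt, k) < fuzzy_threshold for k in kept):
            kept.append(prompt)
    return kept
```

The same `compact` check can also run before a new prompt is stored, which covers the second idea: a prompt that is a near-duplicate of an existing one is simply not added.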

Final Thoughts

When building applications with semantic caching, effective results depend on continuous testing and context-aware tuning. Similarity thresholds, prompt patterns, and cache scope should be adjusted based on workload behavior and accuracy requirements. Redis LangCache enables fine-grained control through attributes that partition and scope cached responses. Semantic caching becomes even more efficient when fuzzy matching logic is added, striking a balance between accuracy and increased cache hit rates. When combined, these methods can improve latency, lower LLM costs, and provide consistent results while maintaining accuracy and relevance.

Happy coding!

Opinions expressed by DZone contributors are their own.
