A Practical Guide to Semantic Caching With Redis LangCache

Learn how to use Redis LangCache to semantically cache LLM prompts and responses, reducing inference costs and improving performance.

By Subhashini Raman · Josephine Eskaline Joyce (DZone Core) · Jan. 06, 26 · Tutorial

A semantic cache differs from a traditional cache, which relies on exact key matching: it stores and retrieves data based on semantic similarity. Redis LangCache is a fully hosted semantic caching service that caches LLM prompts and responses, thereby reducing LLM usage costs.

In this tutorial, we will build a simple application that uses LangCache to cache LLM queries, and then look at whether fuzzy matching can be combined with it to improve the responses.

Step 1: Set Up Redis LangCache

If you do not have an account yet, create one at https://cloud.redis.io/. Once you have logged in:

  1. Navigate to databases and create a new database (I used the free plan here).
  2. Click on LangCache in the left menu and create an instance of the LangCache service. I used the "Quick service creation" option to create the LangCache service. 
  3. Copy the API key and keep it safe.
  4. Click on the LangCache service you just created. When you click the "Connect" button, a quick connect guide will appear on the right with examples of how to connect to your LangCache instance.

Step 2: Create a Simple Python Script

Let's create a simple Python script that will first check the cache for a prompt. If a match is found, it returns the cached response. If not, it sends the prompt to the LLM, caches the response, and returns it.

Connect to LangCache:

Python
 
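from langcache import LangCache  # Redis LangCache Python SDK

# Replace the placeholders with the values shown in the "Connect" guide.
lang_cache = LangCache(
    server_url="<LangCache Service URL>",
    cache_id="<Cache ID of the LangCache service you created>",
    api_key="<LangCache service API key>",
)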


Search the cache before sending the prompt to the LLM:

Python
 
result = lang_cache.search(
    prompt=query,
    similarity_threshold=0.90,
)


The similarity_threshold parameter determines how closely a cached prompt must match the query to count as a hit. A higher value means stricter matching.
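To make the effect of the threshold concrete, here is a minimal sketch that assumes similarity is measured as cosine similarity between embedding vectors. That is a common choice, but the metric LangCache uses is not stated here, and the two-dimensional vectors below are made up for illustration rather than produced by a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "not similar"
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, similarity_threshold):
    """A cached entry counts as a hit only if it clears the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= similarity_threshold


# Hypothetical toy embeddings (real ones have hundreds of dimensions).
cached_prompt_vec = [1.0, 0.0]
close_query_vec = [0.9, 0.1]
distant_query_vec = [0.5, 0.5]

print(is_cache_hit(close_query_vec, cached_prompt_vec, 0.90))
print(is_cache_hit(distant_query_vec, cached_prompt_vec, 0.90))
```

Raising the threshold shrinks the set of queries that can reuse a cached answer; lowering it widens that set.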

Handle a cache hit:

Python
 
# Inside the lookup function: return early on a cache hit.
if result and result.data:
    for entry in result.data:
        print("Cache Hit!")
        print("Cache Response:::")
        print(f"Prompt: {entry.prompt}")
        print(f"Response: {entry.response}")
        print(f"Score: {entry.similarity}")
        return


Handle cache miss and store response:

Python
 
import requests

# Cache miss: call the LLM (url, payload, and headers depend on your LLM provider)
response = requests.post(url, json=payload, headers=headers)
response_json = response.json()
response_text = response_json["choices"][0]["message"]["content"]

# Store the response from the LLM in LangCache
save_response = lang_cache.set(
    prompt=query,
    response=response_text,
)
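The four snippets can be assembled into a single lookup function. The following is a minimal sketch of that flow, not the authors' implementation: `InMemoryCache` is a hypothetical exact-match stand-in that mimics only the `search`/`set` calls and the `data`, `prompt`, `response`, and `similarity` fields used above, so the example runs without a LangCache account, and `call_llm` is a placeholder for your own LLM client.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    prompt: str
    response: str
    similarity: float


@dataclass
class SearchResult:
    data: List[Entry] = field(default_factory=list)


class InMemoryCache:
    """Hypothetical stand-in for a semantic cache client (exact match only)."""

    def __init__(self):
        self._store = {}

    def search(self, prompt, similarity_threshold=0.90):
        # A real semantic cache compares embeddings; this stub only
        # finds prompts that are identical to a stored one.
        if prompt in self._store:
            return SearchResult([Entry(prompt, self._store[prompt], 1.0)])
        return SearchResult()

    def set(self, prompt, response):
        self._store[prompt] = response


def answer(query, cache, call_llm, threshold=0.90):
    """Return (response_text, source), where source is 'cache' or 'llm'."""
    result = cache.search(prompt=query, similarity_threshold=threshold)
    if result and result.data:
        return result.data[0].response, "cache"  # cache hit

    response_text = call_llm(query)  # cache miss: ask the LLM
    cache.set(prompt=query, response=response_text)  # store for next time
    return response_text, "llm"
```

Passing the cache and the LLM call in as arguments keeps the flow easy to test with stubs before wiring in the real services.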


Benefits of Semantic Caching

  • For similar queries, responses are fetched from the cache, avoiding expensive LLM calls.
  • Faster response times.

Things to Watch Out For

  • Similarity threshold: Set it thoughtfully. Too high and you'll miss useful matches. Too low and you'll receive irrelevant results.
  • Accuracy: Even with an optimal threshold, results may not always be perfect.
  • Data privacy: In multi-tenant architectures, ensure proper data partitioning so users don’t see each other’s information.
  • Cache eviction: Know when and how to evict cache entries.

Let’s Run the Code

First Query: (Cache Is Empty)

Python
 
Query: Brief history on Capital of France
Response:
    Cache Miss!
    Redirecting to LLM


The query and the response are now stored in the cache.

Modified Query

Python
 
Query:  ----Brief history on Paris---
Response: 
    Cache Hit!
    Cache Response:::
    Prompt: Brief history on Capital of France
    Score: 0.92440444


Even though the query changed, the cache returned a semantically similar result with a similarity score of about 0.92.

Another Variation

Python
 
Query:  ----Brief history on France---
Response: 
    Cache Hit!
    Cache Response:::
    Prompt: Brief history on Capital of France
    Score: 0.9121176


Oops! We asked for the history of France, but we got the history of the capital of France. Though Paris plays a major role in France's history, the context is different. One is a city, and the other is a country!

Tuning the Threshold

Let’s increase the threshold to 0.92 and clear the cache.

Python
 
Query:  ----Brief history on Capital of France---
Response:
    Cache Miss!
    LLM Response: <llm response>
#####################################

Query:  ----Brief history on Paris---
Response:
    Cache Hit!
    Cache Response:::
        Prompt: Brief history on Capital of France
        Score: 0.92440444
#####################################

Query:  ----Brief history on France---
Response:
    Cache Miss!
    LLM Response: <llm response>


This works better: the Paris query still hits the cache, while the France query now misses and goes to the LLM.

Performance Comparison

Let's compare the time it takes to query from the cache vs. querying from the LLM.

Semantic Cache Results — Sorted by Time

| Query | Result | Matched Prompt | Similarity Score | Response Source | Time (seconds) |
|---|---|---|---|---|---|
| Brief history on the Capital of France | Cache Miss | | | LLM | 0.8499 |
| Brief history on Paris | Cache Hit | Brief history on the Capital of France | 0.9244 | Cache | 0.2705 |
| Brief history on France | Cache Miss | | | LLM | 1.2543 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.1139 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2761 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2798 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.0178 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2806 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2778 |
| How does Langcache work? explain… | Cache Hit | How does Langcache work? explain… | 1.0 | Cache | 0.2930 |

Observations:

  1. Apart from a few anomalies (two exact-match cache hits took about 1 second), responses from the cache are much faster: roughly 0.27 to 0.29 seconds, compared with 0.85 to 1.25 seconds from the LLM.
  2. A cached response is reused only when its similarity score meets the configured threshold.
  3. When similar-looking queries need distinct answers, a higher similarity threshold is recommended.
  4. Always tune and experiment based on the queries and business requirements, as results vary with the embedding model, the similarity threshold, and other settings.
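Timings like those in the table can be collected with a small wrapper around `time.perf_counter`. This is a sketch only: how the timings above were actually measured is not shown, and `timed` is a hypothetical helper name.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with any callable, for example a cache lookup or an LLM request.
total, seconds = timed(sum, [1, 2, 3])
print(f"result={total}, took {seconds:.4f} seconds")
```

Wrapping both the cache lookup and the LLM call with the same helper keeps the two measurements comparable.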

One More Example

Python
 
Query:  ----How does Langcache work---
Response:
    Cache Miss!
    LLM Response: < Gives a 20 lines response>
    Time to get response from LLM API: 0.8144 seconds
----#####################################---
Query:  ----How does Langcache work---
Response:  
    Cache Hit!
    Time for cache lookup: 0.3162 seconds
    Cache Response:::
    Prompt: How does Langcache work
    Response: < Gives the same 20 lines response>
    Score: 1.0


So far, so good. Let's modify the query a bit!

Python
 
----#####################################---
Query:  ----How does Langcache work, explain in 5 lines---
Response:
    Cache Hit!
    Time for cache lookup: 0.2719 seconds
    Cache Response:::
    Prompt: How does Langcache work
    Response: < Gives the same 20 lines response>
    Score: 0.9714471

----#####################################---


We asked for a 5-line response, but got the cached 20-line one. This highlights the importance of tuning and using attributes to scope responses.
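The idea of scoping by attributes can be illustrated with a small in-memory sketch. It is not the LangCache API: `ScopedCache`, the `length` attribute, and the use of `difflib` as a stand-in for embedding similarity are all assumptions made for illustration. The point is only that a lookup considers entries whose attributes match the request, so a "short answer" request never receives a cached "long answer".

```python
from difflib import SequenceMatcher


class ScopedCache:
    """Hypothetical cache whose lookups are restricted by attributes."""

    def __init__(self):
        self._entries = []  # (prompt, response, attributes)

    def set(self, prompt, response, attributes=None):
        self._entries.append((prompt, response, dict(attributes or {})))

    def search(self, prompt, similarity_threshold, attributes=None):
        """Return (score, response) for the best in-scope match, or None."""
        wanted = attributes or {}
        best = None
        for cached_prompt, response, attrs in self._entries:
            # Skip entries stored under a different scope.
            if any(attrs.get(key) != value for key, value in wanted.items()):
                continue
            # Stand-in similarity: string similarity, not embeddings.
            score = SequenceMatcher(
                None, prompt.lower(), cached_prompt.lower()
            ).ratio()
            if score >= similarity_threshold and (best is None or score > best[0]):
                best = (score, response)
        return best


cache = ScopedCache()
cache.set("How does Langcache work", "<20-line answer>", {"length": "long"})

# Same prompt, different scope: only the matching scope can hit.
print(cache.search("How does Langcache work", 0.9, {"length": "short"}))
print(cache.search("How does Langcache work", 0.9, {"length": "long"}))
```

How the application derives an attribute such as `length` from the user's request is left open here; it could come from a UI setting or from parsing the prompt.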

Semantic Cache vs. Fuzzy Match 

Fuzzy matching is based on approximate string matching. It works best for handling typos, spelling variants, and near-duplicate strings, whereas semantic matching operates at the level of meaning and context.

To see the difference in action, let's compare the semantic score (LangCache) with the fuzzy score (Ratcliff–Obershelp algorithm) when matching two strings.
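A fuzzy score of this kind can be computed with Python's standard library: `difflib.SequenceMatcher` is documented as using an approach similar to Ratcliff/Obershelp "gestalt pattern matching". The sketch below assumes case-insensitive comparison; the implementation behind the scores reported in this article is not specified, so the numbers it prints may differ from them.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


cached_prompt = "How does Semantic cache work?"
queries = [
    "How does Semantic cache work?",
    "Will Semantic cache work?",
    "What does Semantic cache mean?",
]

for query in queries:
    print(f"{query!r}: {fuzzy_score(query, cached_prompt):.2f}")
```

Because the score depends only on shared characters, rewording a question lowers it even when the meaning is unchanged.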

Querying for the first time:

Plain Text
 
Query: Does Semantic cache work?
Response:
    Cache Miss!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.


Querying the same again — notice that the semantic score and fuzzy score are pretty close.

Plain Text
 
Query: How does Semantic cache work?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.95
    Fuzzy match score : 0.93


Let's try with a few more variations:

Plain Text
 
Query: Will Semantic cache work?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.94
    Fuzzy match score : 0.83


Plain Text
 
Query: What does Semantic cache mean?
Response:
    Cache Hit!
    Cache Response:::
    Prompt: How does Semantic cache work?
    Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
    Semantic score (LangCache): 0.94
    Fuzzy match score : 0.79


As you can see, fuzzy match focuses on "looks like," whereas semantic matching focuses on "means like."

Complete implementation of the above is available here.

Fuzzy match can be combined with the semantic cache in a few ways:

  1. Store the last 'n' prompts and do a fuzzy match on those. Use semantic caching only if there is no match found. 
  2. When a high similarity threshold is used for the semantic cache (e.g., > 0.95), most slightly reworded prompts miss the cache, and every miss stores a new entry. This results in a lot of near-duplicates. We can use fuzzy match to identify these near-duplicates and store only the prompts that are different.
  3. If the caching layer contains a lot of near duplicates, we can use fuzzy match for compaction.
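The first option in the list above can be sketched as a small front layer. This is an illustration under stated assumptions, not part of LangCache: `FuzzyFrontCache`, `max_recent`, and `fuzzy_threshold` are hypothetical names, and `semantic_lookup` is any callable you supply that returns a cached response or `None` (for example, a thin wrapper around the semantic cache search).

```python
from collections import deque
from difflib import SequenceMatcher


class FuzzyFrontCache:
    """Fuzzy-match the last n prompts first; fall back to a semantic lookup."""

    def __init__(self, semantic_lookup, max_recent=50, fuzzy_threshold=0.9):
        self._recent = deque(maxlen=max_recent)  # (prompt, response) pairs
        self._semantic_lookup = semantic_lookup
        self._fuzzy_threshold = fuzzy_threshold

    def remember(self, prompt, response):
        """Record a prompt/response pair; the oldest pair drops off when full."""
        self._recent.append((prompt, response))

    def lookup(self, prompt):
        """Return (response, source): source is 'fuzzy', 'semantic', or 'miss'."""
        # Check the newest prompts first.
        for cached_prompt, response in reversed(self._recent):
            score = SequenceMatcher(
                None, prompt.lower(), cached_prompt.lower()
            ).ratio()
            if score >= self._fuzzy_threshold:
                return response, "fuzzy"

        response = self._semantic_lookup(prompt)
        if response is not None:
            return response, "semantic"
        return None, "miss"
```

The local fuzzy check handles typos and trivial rewordings without a network round trip, while the semantic lookup still covers queries that are worded differently but mean the same thing.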

Final Thoughts

When building applications with semantic caching, effective results depend on continuous testing and context-aware tuning. Similarity thresholds, prompt patterns, and cache scope should be adjusted based on workload behavior and accuracy requirements. Redis LangCache enables fine-grained control through attributes that partition and scope cached responses. Semantic caching becomes even more efficient when fuzzy matching logic is added, striking a balance between accuracy and increased cache hit rates. When combined, these methods can improve latency, lower LLM costs, and provide consistent results while maintaining accuracy and relevance.

Happy coding!
