A Practical Guide to Semantic Caching With Redis LangCache
Learn how to use Redis LangCache to semantically cache LLM prompts and responses, reducing inference costs and improving performance.
Semantic caching is a caching mechanism that stores and retrieves data based on semantic similarity, unlike traditional caching, which relies on exact key matching. Redis LangCache is a fully hosted semantic caching service that caches LLM prompts and responses semantically, thereby reducing LLM usage costs.
In this tutorial, we'll build a simple application that uses LangCache to cache LLM queries, and then see whether fuzzy matching can be combined with it to improve the responses.
Step 1: Set Up Redis LangCache
If you do not have an account yet, create one at https://cloud.redis.io/. Once you have logged in:
- Navigate to databases and create a new database (I used the free plan here).
- Click on LangCache in the left menu and create an instance of the LangCache service. I used the "Quick service creation" option to create the LangCache service.
- Copy the API key and keep it safe.
- Click on the LangCache service you just created. When you click the "Connect" button, a quick connect guide will appear on the right with examples of how to connect to your LangCache instance.
Step 2: Create a Simple Python Script
Let's create a simple Python script that will first check the cache for a prompt. If a match is found, it returns the cached response. If not, it sends the prompt to the LLM, caches the response, and returns it.
Connect to LangCache:
from langcache import LangCache

lang_cache = LangCache(
    server_url="<LangCache service URL>",
    cache_id="<cache ID of the LangCache service you created>",
    api_key="<LangCache service API key>",
)
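Hard-coding credentials in a script is risky, so one option is to read them from environment variables. The following is a minimal sketch; the variable names and the `load_langcache_settings` helper are chosen for this example and are not something LangCache requires.

```python
import os

# Environment variable names are assumptions made for this example.
REQUIRED_VARS = ("LANGCACHE_SERVER_URL", "LANGCACHE_CACHE_ID", "LANGCACHE_API_KEY")


def load_langcache_settings(env=None):
    """Return keyword arguments for LangCache(...), read from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "server_url": env["LANGCACHE_SERVER_URL"],
        "cache_id": env["LANGCACHE_CACHE_ID"],
        "api_key": env["LANGCACHE_API_KEY"],
    }
```

The returned dictionary can be unpacked into the constructor, for example `lang_cache = LangCache(**load_langcache_settings())`.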
Search the cache before sending the prompt to the LLM:
result = lang_cache.search(
    prompt=query,
    similarity_threshold=0.90,
)
This similarity_threshold determines how closely the prompt must match. A higher value means stricter matching.
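To build intuition for how a threshold gates a match, the sketch below scores toy vectors with cosine similarity and applies a cutoff. It is only an illustration: the three-dimensional vectors are made up, and how LangCache embeds prompts and computes its score internally is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, threshold=0.90):
    """A lookup counts as a hit only if the score reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Made-up embeddings: "close" points almost the same way as "cached", "far" does not.
cached = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.5, 0.5, 0.5]

print(is_cache_hit(close, cached))                   # hit at the default threshold
print(is_cache_hit(far, cached))                     # miss at the default threshold
print(is_cache_hit(close, cached, threshold=0.995))  # a stricter threshold rejects it
```

Raising the threshold turns the near match into a miss, which is the trade-off to tune for a given workload.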
Handle a cache hit:
# Treat the lookup as a hit only if at least one entry came back
if result and result.data:
    for entry in result.data:
        print("Cache Hit!")
        print("Cache Response:::")
        print(f"Prompt: {entry.prompt}")
        print(f"Response: {entry.response}")
        print(f"Score: {entry.similarity}")
    return
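Handle a cache miss and store the response:
# Cache miss: call the LLM (requires "import requests"; url, payload, and headers are defined elsewhere in the script)
response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop here on an HTTP error instead of caching a bad response
response_json = response.json()
response_text = response_json["choices"][0]["message"]["content"]
# Store the LLM response in LangCache for future lookups
save_response = lang_cache.set(
    prompt=query,
    response=response_text,
)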
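Put together, the hit and miss paths form a cache-aside flow. The sketch below shows that flow with stand-ins so it runs without any service: an exact-match dictionary replaces LangCache and `fake_llm` replaces the real LLM call. The function and parameter names are illustrative; in the real script, `cache_search` would wrap `lang_cache.search` and `cache_store` would wrap `lang_cache.set`.

```python
def answer(query, cache_search, cache_store, call_llm):
    """Cache-aside lookup: return (response, source)."""
    cached = cache_search(query)
    if cached is not None:
        return cached, "cache"
    response = call_llm(query)
    cache_store(query, response)  # make the response available to later lookups
    return response, "llm"


# Stand-ins for demonstration only.
store = {}
llm_calls = []


def fake_llm(query):
    llm_calls.append(query)
    return f"answer to: {query}"


first = answer("Brief history on Paris", store.get, store.__setitem__, fake_llm)
second = answer("Brief history on Paris", store.get, store.__setitem__, fake_llm)
print(first[1], second[1], len(llm_calls))  # llm cache 1
```

Passing the cache and LLM functions in as arguments keeps the flow easy to test before wiring in the real services.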
Benefits of Semantic Caching
- For similar queries, responses are fetched from the cache, avoiding expensive LLM calls.
- Faster response times.
Things to Watch Out For
- Similarity threshold: Set it thoughtfully. Too high and you'll miss useful matches. Too low and you'll receive irrelevant results.
- Accuracy: Even with an optimal threshold, results may not always be perfect.
- Data privacy: In multi-tenant architectures, ensure proper data partitioning so users don’t see each other’s information.
- Cache eviction: Know when and how to evict cache entries.
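The data privacy and eviction points can be illustrated with a toy cache that partitions entries by tenant and expires them after a fixed time. This is a sketch of the ideas only, using exact-match keys and a hypothetical `ScopedTTLCache` class; in LangCache, scoping is done with attributes, and the service documentation is the place to check for its expiry options.

```python
import time


class ScopedTTLCache:
    """Toy exact-match cache partitioned by tenant, with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (tenant_id, prompt) -> (expires_at, response)

    def set(self, tenant_id, prompt, response):
        self._entries[(tenant_id, prompt)] = (self._clock() + self._ttl, response)

    def get(self, tenant_id, prompt):
        key = (tenant_id, prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict the stale entry on read
            return None
        return response
```

Because the tenant ID is part of the key, one tenant can never read another tenant's cached response, and stale entries disappear once their time to live has passed.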
Let’s Run the Code
First Query: (Cache Is Empty)
Query : Brief history on Capital of France
Response:
Cache Miss!
Redirecting to LLM
The query and the response are now stored in the cache.
Modified Query
Query: ----Brief history on Paris---
Response:
Cache Hit!
Cache Response:::
Prompt: Brief history on Capital of France
Score: 0.92440444
Even though the query changed, the cache returned a semantically similar result with a similarity score of 0.92.
Another Variation
Query: ----Brief history on France---
Response:
Cache Hit!
Cache Response:::
Prompt: Brief history on Capital of France
Score: 0.9121176
Oops! We asked for the history of France, but we got the history of the capital of France. Though Paris plays a major role in France's history, the context is different. One is a city, and the other is a country!
Tuning the Threshold
Let’s increase the threshold to 0.92 and clear the cache.
Query: ----Brief history on Capital of France---
Response:
Cache Miss!
LLM Response: <llm response>
#####################################
Query: ----Brief history on Paris---
Response:
Cache Hit!
Cache Response:::
Prompt: Brief history on Capital of France
Score: 0.92440444
#####################################
Query: ----Brief history on France---
Response:
Cache Miss!
LLM Response: <llm response>
It seems to be working better!
Performance Comparison
Let's compare the time it takes to query from the cache vs. querying from the LLM.
Semantic Cache Results — Sorted by Time
| Query | Result | Matched Prompt | Similarity Score | Response Source | Time (seconds) |
|---|---|---|---|---|---|
| Brief history on the Capital of France | Cache Miss | | | LLM | 0.8499 |
| Brief history on Paris | Cache Hit | Brief history on the Capital of France | 0.9244 | Cache | 0.2705 |
| Brief history on France | Cache Miss | | | LLM | 1.2543 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.1139 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2761 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2798 |
| Brief history on the Capital of France | Cache Hit | Brief history on the Capital of France | 1.0 | Cache | 1.0178 |
| Brief history on Paris | Cache Hit | Brief history on France | 0.9386 | Cache | 0.2806 |
| Brief history on France | Cache Hit | Brief history on France | 1.0 | Cache | 0.2778 |
| How does Langcache work? explain… | Cache Hit | How does Langcache work? explain… | 1.0 | Cache | 0.2930 |
Observations:
- Despite a few anomalies (two cache hits took over a second), responses served from the cache are generally much faster than LLM calls.
- A cached response is reused only when the similarity score is high enough to meet the threshold.
- When queries call for distinct answers, a higher similarity threshold is recommended.
- Always tune and experiment against your own queries and business requirements, since results vary with the embedding model, similarity threshold, and other settings.
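Timings like the ones above can be collected with a small wrapper around `time.perf_counter`. This is a sketch of one way to measure them, not necessarily how the numbers in this article were produced; `slow_lookup` is a stand-in for a real cache or LLM call.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_lookup(query):
    time.sleep(0.05)  # simulated network latency
    return f"response for {query}"


result, elapsed = timed(slow_lookup, "Brief history on Paris")
print(f"Time to get response: {elapsed:.4f} seconds")
```

Single measurements are noisy, so repeating each query several times and comparing medians gives a fairer picture than one run.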
One More Example
Query: ----How does Langcache work---
Response:
Cache Miss!
LLM Response: < Gives a 20 lines response>
Time to get response from LLM API: 0.8144 seconds
----#####################################---
Query: ----How does Langcache work---
Response:
Cache Hit!
Time for cache lookup: 0.3162 seconds
Cache Response:::
Prompt: How does Langcache work
Response: < Gives the same 20 lines response>
Score: 1.0
So far, so good. Let's modify the query a bit!
----#####################################---
Query: ----How does Langcache work, explain in 5 lines---
Response:
Cache Hit!
Time for cache lookup: 0.2719 seconds
Cache Response:::
Prompt: How does Langcache work
Response: < Gives the same 20 lines response>
Score: 0.9714471
----#####################################---
We asked for a 5-line response, but got the cached 20-line one. This highlights the importance of tuning and using attributes to scope responses.
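One way to keep format instructions from being swallowed by the similarity match is to pull them out of the query and use them to scope the lookup. The sketch below extracts a length hint such as "in 5 lines" with a regular expression. The `split_length_hint` helper and its pattern are assumptions made for illustration and cover only this one phrasing; the extracted hint could then be supplied as a scoping attribute when searching and storing, assuming the cache has been set up with a matching attribute.

```python
import re

# Matches phrases like "in 5 lines", "in 100 words", "in 2 sentences".
_LENGTH_HINT = re.compile(r"\bin\s+(\d+)\s+(lines?|words?|sentences?)\b", re.IGNORECASE)


def split_length_hint(query):
    """Return (core_query, hint), where hint is e.g. '5 lines' or None."""
    match = _LENGTH_HINT.search(query)
    if match is None:
        return query.strip(), None
    # Normalize the unit to its plural form so the hint is a stable key.
    unit = match.group(2).lower().rstrip("s") + "s"
    core = (query[:match.start()] + query[match.end():]).strip(" ,.")
    return core, f"{match.group(1)} {unit}"


print(split_length_hint("How does Langcache work, explain in 5 lines"))
print(split_length_hint("How does Langcache work"))
```

With the hint kept separate, a request for a 5-line answer would no longer be served the cached 20-line response stored without a hint.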
Semantic Cache vs. Fuzzy Match
Fuzzy matching is approximate string matching. It works best for handling typos, spelling variants, and near-duplicate strings, whereas semantic matching operates at the level of meaning and context.
To see the difference in action, let's compare the semantic score (LangCache) with the fuzzy score (Ratcliff–Obershelp algorithm) when matching two strings.
Querying for the first time:
Query: Does Semantic cache work?
Response:
Cache Miss!
Cache Response:::
Prompt: How does Semantic cache work?
Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
Querying again: notice that the semantic score and the fuzzy score are pretty close.
Query: How does Semantic cache work?
Response:
Cache Hit!
Cache Response:::
Prompt: How does Semantic cache work?
Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
Semantic score (LangCache) : 0.95
Fuzzy match score : 0.93
Let's try with a few more variations:
Query: Will Semantic cache work?
Response:
Cache Hit!
Cache Response:::
Prompt: How does Semantic cache work?
Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
Semantic score (LangCache) : 0.94
Fuzzy match score : 0.83
Query: What does Semantic cache mean?
Response:
Cache Hit!
Cache Response:::
Prompt: How does Semantic cache work?
Response: Semantic caching stores query results along with their semantic descriptions, enabling new queries to be answered partially or fully by reusing cached data.
Semantic score (LangCache) : 0.94
Fuzzy match score : 0.79
As you can see, fuzzy match focuses on "looks like," whereas semantic matching focuses on "means like."
Complete implementation of the above is available here.
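For readers who want to experiment with a fuzzy score locally, Python's standard library offers `difflib.SequenceMatcher`, whose documentation describes its algorithm as closely related to Ratcliff–Obershelp. The sketch below is an assumption about how such a score could be computed, including the choice to lowercase both strings first; the exact implementation and preprocessing behind the scores reported above are not shown here, so the numbers may differ.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Similarity in [0, 1] based on matching character blocks, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


cached_prompt = "How does Semantic cache work?"
queries = (
    "Does Semantic cache work?",
    "Will Semantic cache work?",
    "What does Semantic cache mean?",
)
for query in queries:
    print(f"{query!r}: {fuzzy_score(query, cached_prompt):.2f}")
```

The score drops as more characters differ, regardless of whether the meaning of the question has changed.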
Fuzzy match can be combined with the semantic cache in a few ways:
- Store the last 'n' prompts and do a fuzzy match on those. Use semantic caching only if there is no match found.
- When a high similarity threshold is used for the semantic cache (e.g., > 0.95), most queries miss, and every miss caches a new prompt, which produces many near-duplicates. Fuzzy matching can identify these near-duplicates so that only distinct prompts are stored.
- If the caching layer contains a lot of near duplicates, we can use fuzzy match for compaction.
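The first two ideas in this list can be sketched with a small in-memory index of recent prompts. The `RecentPromptIndex` class, its default sizes, and the 0.9 fuzzy threshold are all assumptions made for illustration, and `difflib.SequenceMatcher` stands in for whichever fuzzy matcher you prefer.

```python
from collections import deque
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class RecentPromptIndex:
    """Keeps the last n (prompt, response) pairs for a cheap fuzzy pre-check."""

    def __init__(self, max_size=100, fuzzy_threshold=0.9):
        self._recent = deque(maxlen=max_size)  # oldest pairs drop off automatically
        self._threshold = fuzzy_threshold

    def lookup(self, query):
        """Return the best fuzzy match's response, or None if below the threshold."""
        best_score, best_response = 0.0, None
        for prompt, response in self._recent:
            score = fuzzy_score(query, prompt)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None

    def add_if_distinct(self, prompt, response):
        """Store the pair unless a near-duplicate prompt is already present."""
        if any(fuzzy_score(prompt, p) >= self._threshold for p, _ in self._recent):
            return False
        self._recent.append((prompt, response))
        return True
```

A `None` from `lookup` would fall through to the semantic cache, and `add_if_distinct` can guard writes so near-duplicate prompts are not stored again.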
Final Thoughts
When building applications with semantic caching, effective results depend on continuous testing and context-aware tuning. Similarity thresholds, prompt patterns, and cache scope should be adjusted based on workload behavior and accuracy requirements. Redis LangCache enables fine-grained control through attributes that partition and scope cached responses. Semantic caching becomes even more efficient when fuzzy matching logic is added, striking a balance between accuracy and increased cache hit rates. When combined, these methods can improve latency, lower LLM costs, and provide consistent results while maintaining accuracy and relevance.
Happy coding!