Build Multimodal RAG Apps With Amazon Bedrock and OpenSearch
Deploying a scalable Multimodal RAG application using Amazon Bedrock for embeddings and language models, and Amazon OpenSearch as a vector store.
Scenario
Customer support tickets with screenshots, technical documentation with diagrams, and a mountain of legacy PDFs — all containing valuable information, but impossible to query efficiently.
"There has to be a better way," I thought. That's when I dove headfirst into the world of multimodal retrieval-augmented generation (RAG).
The Multimodal Revelation
Like many developers, I had experimented with basic RAG systems that worked well enough for text. But our real-world data isn't just text. When a customer sends a screenshot of an error along with a description, or when our medical clients need to cross-reference radiology images with patient notes, text-only RAG falls short.
The revelation came when I realized we could generate embeddings for different types of data — text, images, and potentially audio — and use them together. Let me share what I learned by building a multimodal RAG system with AWS.
The AWS Services That Made It Possible
After several false starts with custom solutions, I landed on a combination that worked surprisingly well:
- Amazon Bedrock for both embeddings and LLMs
- Amazon OpenSearch as our vector database
- AWS Lambda and API Gateway for deployment
Why this stack? Honestly, I'd tried maintaining my own embedding models and it was a nightmare of GPU provisioning and version conflicts. Bedrock's fully managed approach meant I could focus on application logic instead of infrastructure.
How I Built It: The Architecture
Here's the approach that ultimately worked for us:
1. Data Ingestion Pipeline
The first challenge was converting our diverse data into embeddings. For text, the solution was straightforward:
import boto3
import json

# Bedrock runtime client used for both embeddings and text generation
bedrock = boto3.client(service_name='bedrock-runtime')

def get_text_embedding(text):
    # Titan Text Embeddings takes {"inputText": "..."} and returns a 1536-dimension vector
    response = bedrock.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId="amazon.titan-embed-text-v1",
        accept="application/json",
        contentType="application/json"
    )
    return json.loads(response.get('body').read())['embedding']
For images, though, I hit a roadblock. After experimenting with several approaches, I deployed a CLIP model on SageMaker to embed images into CLIP's own shared text-image space, stored in a separate field from the Titan text embeddings. This was tricky to get right: the two models produce vectors of different sizes, so the dimensions had to be carefully managed. A sketch of the endpoint call is below.
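For reference, the image-embedding call ends up looking something like this, assuming a SageMaker endpoint (here called clip-image-encoder, an illustrative name) that accepts raw image bytes and returns a JSON body with a 512-float embedding; your endpoint name and payload contract will differ:

import boto3
import json

# SageMaker runtime client for invoking the hosted CLIP endpoint
sagemaker_runtime = boto3.client('sagemaker-runtime')

def get_image_embedding(image_bytes):
    # Assumes a custom endpoint that accepts raw image bytes and returns
    # {"embedding": [...512 floats...]} -- an illustrative contract, not a fixed API
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName='clip-image-encoder',
        ContentType='application/x-image',
        Body=image_bytes
    )
    return json.loads(response['Body'].read())['embedding']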
2. Setting Up OpenSearch for Vector Search
The trickiest part was configuring OpenSearch correctly. After several failed attempts with incorrect dimensions, this index definition finally worked, with k-NN enabled at the index level, 1536 dimensions for the Titan text embeddings, and 512 for the CLIP image embeddings:
PUT /rag-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "text": {"type": "text"},
      "image_embedding": {
        "type": "knn_vector",
        "dimension": 512
      }
    }
  }
}
A word of caution: make sure your dimension values match your embedding models exactly, or you'll spend hours debugging cryptic errors as I did!
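A cheap safeguard is a quick sanity check that compares a freshly generated vector against the dimension declared in the mapping before you index anything; this snippet is just an illustration of that idea:

# Declared dimensions from the index mapping above
EXPECTED_DIMENSIONS = {"embedding": 1536, "image_embedding": 512}

def check_dimension(field_name, vector):
    # Fail fast instead of letting OpenSearch reject documents at index time
    expected = EXPECTED_DIMENSIONS[field_name]
    if len(vector) != expected:
        raise ValueError(
            f"{field_name}: got {len(vector)} dimensions, expected {expected}"
        )

check_dimension("embedding", get_text_embedding("smoke test"))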
3. The Retrieval Logic That Saved Our Project
The most elegant part of the solution was the retrieval function. Initially, I tried complicated hybrid approaches, but ended up with something simpler that worked better:
def retrieve_documents(query_text, query_image=None):
    # Embed the query text with Titan; image queries are handled in the sketch below
    query_embedding = get_text_embedding(query_text)
    knn_query = {
        "size": 5,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": 5
                }
            }
        }
    }
    # "opensearch" is an opensearch-py client connected to the domain
    results = opensearch.search(body=knn_query, index="rag-index")
    return results['hits']['hits']
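For queries that arrive with a screenshot, the natural extension is a second k-NN search against the image_embedding field, merged with the text results. Here is a rough sketch of that path, assuming the get_image_embedding helper from the SageMaker section; the score-based merge is simplified for illustration:

def retrieve_multimodal(query_text, query_image_bytes=None):
    hits = retrieve_documents(query_text)
    if query_image_bytes is not None:
        image_vector = get_image_embedding(query_image_bytes)
        image_query = {
            "size": 5,
            "query": {
                "knn": {
                    "image_embedding": {"vector": image_vector, "k": 5}
                }
            }
        }
        hits += opensearch.search(body=image_query, index="rag-index")['hits']['hits']
    # Deduplicate by document id, keeping the highest-scoring entry for each
    best = {}
    for hit in hits:
        doc_id = hit['_id']
        if doc_id not in best or hit['_score'] > best[doc_id]['_score']:
            best[doc_id] = hit
    return sorted(best.values(), key=lambda h: h['_score'], reverse=True)[:5]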
The real magic happened when combining this with Claude on Bedrock:
def generate_answer(query, context):
    # Claude v2's text-completion API expects the "\n\nHuman: ... \n\nAssistant:" prompt
    # format and requires max_tokens_to_sample in the request body
    prompt = f"\n\nHuman: Answer this query: {query} using the context:\n{context}\n\nAssistant:"
    response = bedrock.invoke_model(
        body=json.dumps({
            "prompt": prompt,
            "max_tokens_to_sample": 1024
        }),
        modelId="anthropic.claude-v2"
    )
    return json.loads(response.get('body').read())['completion']
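Putting the pieces together, a minimal end-to-end call might look like this, using the retrieve_multimodal sketch above (the sample question and the bare-bones context formatting are illustrative; in practice you would also pass source metadata with each chunk):

def answer_question(query_text, query_image_bytes=None):
    hits = retrieve_multimodal(query_text, query_image_bytes)
    # Concatenate the retrieved text chunks into a single context block
    context = "\n\n".join(hit['_source'].get('text', '') for hit in hits)
    return generate_answer(query_text, context)

print(answer_question("Why does the export job fail with a timeout error?"))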
Real-World Impact and Lessons Learned
When we deployed this to production, the results were immediate. Our customer support team, which previously struggled with "I can't describe this error, here's a screenshot" tickets, could now instantly retrieve similar past issues.
Some hard-earned lessons:
- Cost management is crucial. Embedding generation costs can add up quickly. We implemented a caching layer that reduced our API calls by 70% (see the sketch after this list).
- Start with text, then add images. Don't try to solve the multimodal problem all at once. Get the text working perfectly first.
- Latency matters. We initially put everything in Lambda, but for large embedding operations, we moved to dedicated EC2 instances with results cached in ElastiCache.
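The idea behind the caching layer is simple: hash the input text and look it up before calling Bedrock. A minimal in-process sketch of that idea, assuming the get_text_embedding function from earlier (in production the cache would live somewhere like ElastiCache rather than a local dict):

import hashlib

# In production this lives in ElastiCache; a dict keeps the sketch self-contained
_embedding_cache = {}

def get_text_embedding_cached(text):
    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = get_text_embedding(text)
    return _embedding_cache[key]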
The Future of Our Multimodal RAG System
I'm most excited about extending this approach to video content. We're experimenting with extracting key frames and generating embeddings that can help retrieve relevant video segments.
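One simple way to prototype this is to sample frames at a fixed interval with OpenCV and push each frame through the same image-embedding path as screenshots. The sketch below shows uniform sampling rather than true key-frame detection, and the one-frame-per-second rate is purely illustrative:

import cv2  # opencv-python

def extract_frames(video_path, every_n_seconds=1):
    # Sample one frame every N seconds and return them as JPEG bytes
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    frames, index = [], 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if index % int(fps * every_n_seconds) == 0:
            ok, buffer = cv2.imencode('.jpg', frame)
            if ok:
                frames.append(buffer.tobytes())
        index += 1
    capture.release()
    return frames

# Each frame can then be embedded with get_image_embedding() and indexed
# alongside a timestamp so retrieval can point back to the video segment.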
We're also looking at adding speech transcription to bring audio into our multimodal mix — imagine being able to search through customer calls alongside documentation and screenshots.
Try It Yourself
If you're facing similar challenges with disconnected data sources, I encourage you to experiment with multimodal RAG. Start small, perhaps with just text and a few test images. The AWS stack makes iteration relatively painless.