Build Multimodal RAG Apps With Amazon Bedrock and OpenSearch

Deploying a scalable Multimodal RAG application using Amazon Bedrock for embeddings and language models, and Amazon OpenSearch as a vector store.

By Mahesh Vaijainthymala Krishnamoorthy · Mar. 20, 25 · Tutorial

Scenario

Customer support tickets with screenshots, technical documentation with diagrams, and a mountain of legacy PDFs — all containing valuable information, but impossible to query efficiently.

"There has to be a better way," I thought. That's when I dove headfirst into the world of multimodal retrieval-augmented generation (RAG).

The Multimodal Revelation

Like many developers, I had experimented with basic RAG systems that worked well enough for text. But our real-world data isn't just text. When a customer sends a screenshot of an error along with a description, or when our medical clients need to cross-reference radiology images with patient notes, text-only RAG falls short.

The revelation came when I realized we could generate embeddings for different types of data — text, images, and potentially audio — and use them together. Let me share what I learned by building a multimodal RAG system with AWS.

The AWS Services That Made It Possible

After several false starts with custom solutions, I landed on a combination that worked surprisingly well:

  1. Amazon Bedrock for both embeddings and LLMs
  2. Amazon OpenSearch as our vector database
  3. AWS Lambda and API Gateway for deployment

Why this stack? Honestly, I'd tried maintaining my own embedding models and it was a nightmare of GPU provisioning and version conflicts. Bedrock's fully managed approach meant I could focus on application logic instead of infrastructure.

How I Built It: The Architecture

Here's the approach that ultimately worked for us:

1. Data Ingestion Pipeline

The first challenge was converting our diverse data into embeddings. For text, the solution was straightforward:

Python
 
import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime')

def get_text_embedding(text):
    response = bedrock.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId="amazon.titan-embed-text-v1",
        accept="application/json",
        contentType="application/json"
    )
    return json.loads(response.get('body').read())['embedding']


For images, though, I hit a roadblock. After experimenting with several approaches, I deployed a CLIP model on SageMaker to convert images into embeddings. This was tricky to get right: CLIP vectors have a different dimension (512) from the Titan text vectors (1536), so the two had to be stored and queried as separate fields.
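
The article does not show the image-embedding call, so here is a minimal sketch of what it might look like. Everything specific in it is an assumption: the endpoint name, the `application/x-image` content type, and the `{"embedding": [...]}` response shape all depend on how the CLIP container was packaged. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import json


def get_image_embedding(image_bytes, sagemaker_runtime, endpoint_name):
    """Send raw image bytes to a SageMaker endpoint and return its embedding.

    Assumptions (not from the article): the endpoint accepts
    'application/x-image' and replies with JSON shaped like
    {"embedding": [...]}. Adjust both to match your model container.
    """
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",
        Accept="application/json",
        Body=image_bytes,
    )
    # The response body is a stream, so read it before parsing.
    payload = json.loads(response["Body"].read())
    return payload["embedding"]
```

In a real deployment the client would typically come from `boto3.client("sagemaker-runtime")`, and `endpoint_name` would be whatever name the CLIP endpoint was deployed under.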

2. Setting Up OpenSearch for Vector Search

The trickiest part was configuring OpenSearch correctly. After several failed attempts with incorrect dimensions, this mapping finally worked:

JSON
 
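# dimension 1536 matches Titan text embeddings; 512 matches the CLIP image embeddings
# index.knn must be true for approximate k-NN search on knn_vector fields
PUT /rag-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "text": {"type": "text"},
      "image_embedding": {
        "type": "knn_vector",
        "dimension": 512
      }
    }
  }
}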


A word of caution: make sure your dimension values match your embedding models exactly, or you'll spend hours debugging cryptic errors as I did!
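
One way to catch a dimension mismatch before it reaches OpenSearch is to validate vectors while building each document. The sketch below is illustrative only: `build_index_document` and the two constants are hypothetical names, and the field names simply mirror the `rag-index` mapping above.

```python
TEXT_DIM = 1536   # must match the "embedding" field in the mapping
IMAGE_DIM = 512   # must match the "image_embedding" field in the mapping


def build_index_document(text, text_embedding, image_embedding=None):
    """Build a document for rag-index, failing fast on wrong vector sizes."""
    if len(text_embedding) != TEXT_DIM:
        raise ValueError(
            f"text embedding has {len(text_embedding)} dims, expected {TEXT_DIM}"
        )
    doc = {"text": text, "embedding": list(text_embedding)}

    # The image vector is optional: text-only documents simply omit the field.
    if image_embedding is not None:
        if len(image_embedding) != IMAGE_DIM:
            raise ValueError(
                f"image embedding has {len(image_embedding)} dims, expected {IMAGE_DIM}"
            )
        doc["image_embedding"] = list(image_embedding)
    return doc
```

A clear `ValueError` at ingestion time is usually easier to trace than an indexing failure reported by the cluster.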

3. The Retrieval Logic That Saved Our Project

The most elegant part of the solution was the retrieval function. Initially, I tried complicated hybrid approaches, but ended up with something simpler that worked better:

Python
 
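# `opensearch` is assumed to be an opensearch-py client created elsewhere.
def retrieve_documents(query_text, query_image=None):
    # Note: query_image is accepted but not used yet; only the text vector is searched.
    query_embedding = get_text_embedding(query_text)

    # Approximate k-NN search over the Titan text vectors in the "embedding" field
    knn_query = {
        "size": 5,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": 5
                }
            }
        }
    }

    results = opensearch.search(body=knn_query, index="rag-index")
    return results['hits']['hits']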
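
The function above only searches the text vectors, even though it accepts a `query_image` argument. A possible way to cover the image side is to build the same k-NN query against the `image_embedding` field. This is a sketch, not the article's implementation; `build_knn_query` is a hypothetical helper name.

```python
def build_knn_query(vector, field="embedding", k=5):
    """Return an OpenSearch k-NN query body for the given vector field."""
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": list(vector),
                    "k": k,
                }
            }
        },
    }
```

With an image vector in hand, `build_knn_query(image_vector, field="image_embedding")` produces a body that can be passed to the same `search` call. How to merge text and image hits into one ranking is a separate design decision that the article leaves open.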


The real magic happened when combining this with Claude on Bedrock:

Python
 
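def generate_answer(query, context):
    # Claude v2's text-completion API expects "\n\nHuman:" and "\n\nAssistant:" turns
    prompt = f"\n\nHuman: Answer this query: {query} using the context:\n{context}\n\nAssistant:"

    response = bedrock.invoke_model(
        body=json.dumps({
            "prompt": prompt,
            "max_tokens_to_sample": 500  # required by this API; 500 is an arbitrary choice
        }),
        modelId="anthropic.claude-v2",
        accept="application/json",
        contentType="application/json"
    )

    return json.loads(response.get('body').read())['completion']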
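
The retrieval step returns raw OpenSearch hits, so something has to turn them into the `context` string passed to the model. The article does not show that step. The sketch below is one plausible approach, with `build_context` and the `max_chars` budget as hypothetical choices; it assumes each hit carries its passage under `_source.text`, as in the mapping above.

```python
def build_context(hits, max_chars=4000):
    """Join the text of retrieved hits into one numbered context string."""
    parts = []
    used = 0
    for i, hit in enumerate(hits, start=1):
        text = hit.get("_source", {}).get("text", "")
        if not text:
            continue  # skip hits without a text field (e.g. image-only docs)
        snippet = f"[{i}] {text}"
        if used + len(snippet) > max_chars:
            break  # stop before exceeding the rough character budget
        parts.append(snippet)
        used += len(snippet)
    return "\n\n".join(parts)
```

Numbering the snippets by their rank makes it possible to ask the model to cite which passage an answer came from.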


Real-World Impact and Lessons Learned

When we deployed this to production, the results were immediate. Our customer support team, which previously struggled with "I can't describe this error, here's a screenshot" tickets, could now instantly retrieve similar past issues.

Some hard-earned lessons:

  1. Cost management is crucial. Embedding generation costs can add up quickly. We implemented a caching layer that reduced our API calls by 70%.
  2. Start with text, then add images. Don't try to solve the multimodal problem all at once. Get the text working perfectly first.
  3. Latency matters. We initially put everything in Lambda, but for large embedding operations, we moved to dedicated EC2 instances with results cached in ElastiCache.
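
The caching layer from the first lesson is not described in detail. As a rough illustration only, an in-memory version could key embeddings on a hash of the input text. `EmbeddingCache` is a hypothetical name, and a production setup would more likely use a shared store than a per-process dictionary.

```python
import hashlib


class EmbeddingCache:
    """Wrap an embedding function so repeated inputs skip the API call."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text):
        # Hash the text so long inputs do not become long dictionary keys.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        return vector
```

Wrapping the text embedding function, as in `EmbeddingCache(get_text_embedding)`, leaves the call sites unchanged apart from calling `cache.get(text)`. The `hits` and `misses` counters give a quick way to measure how many API calls the cache actually saves.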

The Future of Our Multimodal RAG System

I'm most excited about extending this approach to video content. We're experimenting with extracting key frames and generating embeddings that can help retrieve relevant video segments.

We're also looking at adding speech transcription to bring audio into our multimodal mix — imagine being able to search through customer calls alongside documentation and screenshots.

Try It Yourself

If you're facing similar challenges with disconnected data sources, I encourage you to experiment with multimodal RAG. Start small, perhaps with just text and a few test images. The AWS stack makes iteration relatively painless.

Further Reading

  • Amazon Bedrock Documentation
  • Hybrid Search in OpenSearch
  • Multimodal AI with CLIP