Building a Production-Ready Conversational AI Agent With Cloudflare Workers and AI Gateway
Build a serverless chatbot with Cloudflare Workers for backend logic, AI Gateway for model routing, and KV for conversation history.
Conversational AI is fundamentally transforming customer support, delivering instant, context-aware responses at a massive scale. With the global conversational AI market projected to reach $32.6 billion by 2030, growing at a CAGR of 23.6%, developers need efficient ways to deploy these powerful tools. Edge computing platforms like Cloudflare Workers and AI Gateway provide the ideal solution, enabling the deployment of low-latency, serverless AI agents without the complexity of managing infrastructure.
This tutorial provides a comprehensive guide to building a production-ready chatbot. We will use Cloudflare Workers for the serverless backend, AI Gateway to manage and route model inference, and Tailwind CSS for a clean, responsive frontend. The result is a context-aware customer support agent powered by a high-performance large language model like Llama 3.1.
Why Build Conversational AI on Cloudflare?
Deploying AI applications on Cloudflare's edge network offers several distinct advantages:
- Ultra-low latency: Cloudflare’s global network serves requests from a data center physically close to the user, which keeps network round-trip times low and helps the chat feel real-time. Total response time still depends on the model size and the length of the generated reply.
- Unified management with AI Gateway: AI Gateway acts as a central control plane for your AI applications. It routes requests to models on Workers AI (e.g., @cf/meta/llama-3.1-8b-instruct) as well as other providers, and adds caching, rate limiting, and detailed analytics out of the box.
- State-of-the-art models: Llama 3.1 instruct variants are tuned for multi-turn dialogue, which makes them a good fit for conversational tasks. Workers AI provides easy, pay-as-you-go access to these models.
- Effortless scalability: The serverless architecture of Cloudflare Workers scales automatically to handle millions of requests with zero cold starts, ensuring your application is always available and performant, regardless of traffic spikes.
Prerequisites
To successfully complete this tutorial, you will need the following:
- Cloudflare account: A free Cloudflare account is sufficient to start. Sign up at cloudflare.com and enable Workers AI.
- Wrangler CLI: The command-line interface for managing Cloudflare Workers. Install it via npm with npm install -g wrangler, and then authenticate with your account by running wrangler login.
- Knowledge: A basic understanding of JavaScript, REST APIs, and utility-first CSS (like Tailwind CSS).
- Tools: A code editor like VS Code and a modern web browser.
Understanding the Architecture
Our chatbot's architecture is simple, scalable, and powerful:
- Frontend client: A static HTML page styled with Tailwind CSS captures user input and communicates with our backend Worker via fetch requests.
- Cloudflare Worker: This serverless function is the core of our application. It receives requests from the client, manages the conversation's state, and securely calls the AI model.
- Cloudflare KV: A global, low-latency key-value data store used for session management. We will store each user's conversation history here to provide context for subsequent messages.
- AI Gateway: This service acts as a proxy between our Worker and the underlying AI model. It provides a stable endpoint and adds crucial observability, caching, and control features.
- Workers AI: The underlying inference platform that runs the Llama 3.1 model.

The client sends messages to the Cloudflare Worker. The Worker retrieves conversation history from KV, routes the request through the AI Gateway for Llama 3.1 inference, and stores the updated history back in KV.
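The request flow above can be sketched as a single function whose dependencies are passed in, which makes each hop explicit and lets you exercise the logic without deploying anything. This is a minimal sketch: `handleTurn`, `callModel`, and `createMemoryKV` are hypothetical names. In the real Worker, `kv` is the `CHAT_SESSIONS` binding and `callModel` is the request sent through AI Gateway.

```javascript
// Minimal sketch of one chat turn: KV read -> model call -> KV write.
// `kv` needs get(key) and put(key, value, options), the KV methods used in this tutorial.
// `callModel` is a hypothetical async function that takes an OpenAI-style
// messages array and resolves to the assistant's reply text.
async function handleTurn(kv, callModel, sessionId, message, maxMessages = 10) {
  const key = `chat:${sessionId}`;

  // 1. Load prior turns, stored as a JSON array of { role, content } objects
  const stored = await kv.get(key);
  const history = stored ? JSON.parse(stored) : [];

  // 2. Build the prompt: system message, prior turns, then the new user message
  const messages = [
    { role: 'system', content: 'You are a friendly and helpful customer support agent.' },
    ...history,
    { role: 'user', content: message },
  ];
  const reply = await callModel(messages);

  // 3. Append this turn, keep only the most recent messages, write back with a 1-hour TTL
  const updated = [
    ...history,
    { role: 'user', content: message },
    { role: 'assistant', content: reply },
  ].slice(-maxMessages);
  await kv.put(key, JSON.stringify(updated), { expirationTtl: 3600 });
  return reply;
}

// In-memory stand-in for a KV namespace, for local experiments only (TTLs are ignored)
function createMemoryKV() {
  const store = new Map();
  return {
    async get(key) { return store.has(key) ? store.get(key) : null; },
    async put(key, value) { store.set(key, value); },
  };
}
```

Injecting `kv` and `callModel` keeps the session logic independent of Cloudflare-specific objects, so the same function can run against the real bindings or against stubs.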
Implementation
Step 1: Set Up Your AI Gateway
First, we'll create an AI Gateway to manage access to our model.
- Log in to your Cloudflare Dashboard.
- Navigate to AI > AI Gateway.
- Click Create Gateway.
- Give it a memorable name (e.g., chat-gateway) and select Workers AI as the API Provider. Click Create.
- On the next screen, you will see your gateway's API endpoint. Note this down; it will look like this: https://gateway.ai.cloudflare.com/v1/<YOUR_ACCOUNT_ID>/<YOUR_GATEWAY_ID>/workers-ai
- Next, create an API token to authenticate requests to the gateway. Go to My Profile > API Tokens and click Create Token.
- Use the Create Custom Token template. Give the token a name (e.g., AI Gateway Read Token).
- Grant it the following permission: Account > Workers AI > Read. The gateway forwards this token to Workers AI, which checks it before running the model.
- Click Continue to summary, then Create Token. Copy the generated token and save it securely.
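To sanity-check the endpoint you noted, it can help to assemble the URL and request from their parts. This is a minimal sketch: `gatewayBaseUrl` and `buildGatewayRequest` are hypothetical helpers, the `/v1/chat/completions` suffix is the OpenAI-compatible Workers AI path that the Worker in Step 3 calls, and the account ID, gateway ID, and token are placeholders.

```javascript
// Build the AI Gateway URL for the Workers AI provider from its parts.
// Format shown in the dashboard: https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>/workers-ai
function gatewayBaseUrl(accountId, gatewayId) {
  if (!accountId || !gatewayId) {
    throw new Error('accountId and gatewayId are required');
  }
  return `https://gateway.ai.cloudflare.com/v1/${encodeURIComponent(accountId)}/${encodeURIComponent(gatewayId)}/workers-ai`;
}

// Assemble (but do not send) an OpenAI-compatible chat completion request through the gateway.
function buildGatewayRequest({ accountId, gatewayId, apiToken, model, messages }) {
  return {
    url: `${gatewayBaseUrl(accountId, gatewayId)}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

From Node.js 18 or later you could then send it with `const { url, init } = buildGatewayRequest({ ... }); const res = await fetch(url, init);` and confirm the request appears in the gateway's logs.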
Step 2: Set Up the Worker Project
Now, let's create the project for our backend logic.
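- Open your terminal and create a new Worker project. Current versions of Wrangler hand project scaffolding to the create-cloudflare CLI, which also installs dependencies; when prompted, choose a "Hello World" Worker written in JavaScript:
npm create cloudflare@latest -- conversational-ai-worker
cd conversational-ai-worker
- Create a KV namespace to store chat sessions. In your terminal, run:
npx wrangler kv namespace create CHAT_SESSIONS
This command will output the id for your new namespace. Copy it. (Older Wrangler releases use the colon form, wrangler kv:namespace create.)
- Open the wrangler.toml file and configure it to connect to your AI Gateway and KV namespace, replacing the placeholder values with your own. (If the scaffolding created a wrangler.jsonc file instead, express the same settings in JSON.) The API token is a credential, so it is stored as an encrypted secret in the next step rather than as a plain-text variable here:
name = "conversational-ai-worker"
main = "src/index.js"
compatibility_date = "2025-10-01"
# Binding for KV session storage
[[kv_namespaces]]
binding = "CHAT_SESSIONS"
id = "<YOUR_KV_NAMESPACE_ID>" # Paste the ID from the previous step here
# Non-secret configuration
[vars]
GATEWAY_URL = "https://gateway.ai.cloudflare.com/v1/<YOUR_ACCOUNT_ID>/<YOUR_GATEWAY_ID>/workers-ai"
MODEL = "@cf/meta/llama-3.1-8b-instruct"
- Store the API token you created in Step 1 as a secret. Wrangler prompts for the value, and the Worker reads it as env.API_TOKEN just like a plain variable. For local development with wrangler dev, put API_TOKEN=<YOUR_API_TOKEN> in a .dev.vars file instead.
npx wrangler secret put API_TOKEN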
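A misconfigured binding or a leftover placeholder usually surfaces later as a confusing runtime error. A small guard at the top of `fetch` can fail fast instead. This is a sketch: `checkEnv` is a hypothetical helper that checks the binding and variable names used in this tutorial and flags values still containing `<YOUR_` placeholders.

```javascript
// Return a list of configuration problems for the names used in this tutorial.
// An empty array means the Worker's environment looks usable.
function checkEnv(env) {
  const problems = [];

  // KV binding: must expose get/put
  if (!env.CHAT_SESSIONS || typeof env.CHAT_SESSIONS.get !== 'function') {
    problems.push('CHAT_SESSIONS KV binding is missing');
  }

  // Variables and the API_TOKEN secret: must be non-empty and not left as placeholders
  for (const name of ['GATEWAY_URL', 'MODEL', 'API_TOKEN']) {
    const value = env[name];
    if (typeof value !== 'string' || value.length === 0) {
      problems.push(`${name} is not set`);
    } else if (value.includes('<YOUR_')) {
      problems.push(`${name} still contains a placeholder`);
    }
  }
  return problems;
}
```

Inside the Worker, you might call it first: `const problems = checkEnv(env); if (problems.length) { console.error(problems); return new Response('Server misconfigured', { status: 500 }); }`. The details stay in the logs rather than being sent to the browser.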
Step 3: Build the Worker Logic
The Worker will handle chat requests, manage conversation history in KV, and call the AI model through the AI Gateway. Replace the contents of src/index.js with the following code:
// CORS headers so the static index.html page (opened from disk or served from another origin)
// can call this Worker. Replace '*' with your site's origin in production.
const CORS_HEADERS = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'POST, OPTIONS',
  'Access-Control-Allow-Headers': 'Content-Type',
};

// Return a JSON response that always carries the CORS headers
function jsonResponse(body, status = 200) {
  return Response.json(body, { status, headers: CORS_HEADERS });
}

export default {
  async fetch(request, env) {
    // Answer the browser's CORS preflight (sent because the client posts JSON)
    if (request.method === 'OPTIONS') {
      return new Response(null, { status: 204, headers: CORS_HEADERS });
    }
    // We only accept POST requests for this endpoint
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405, headers: CORS_HEADERS });
    }
    // Parse the incoming JSON body, rejecting malformed payloads with a 400
    let sessionId, message;
    try {
      ({ sessionId, message } = await request.json());
    } catch {
      return jsonResponse({ error: 'The request body must be valid JSON.' }, 400);
    }
    // Validate the required fields
    if (!sessionId || !message) {
      return jsonResponse({ error: 'The fields "sessionId" and "message" are required.' }, 400);
    }
    // Define a unique key for storing this session's history in KV
    const historyKey = `chat:${sessionId}`;
    let history = [];
    // Attempt to retrieve existing conversation history from KV
    try {
      const historyData = await env.CHAT_SESSIONS.get(historyKey);
      if (historyData) {
        history = JSON.parse(historyData);
      }
    } catch (error) {
      console.error('KV read error:', error);
      // Continue with empty history if KV read fails
    }
    // Construct the payload for the AI model
    const messages = [
      { role: 'system', content: 'You are a friendly and helpful customer support agent.' },
      ...history,
      { role: 'user', content: message }
    ];
    // Call the AI model via our AI Gateway (OpenAI-compatible Workers AI endpoint)
    try {
      const aiResponse = await fetch(`${env.GATEWAY_URL}/v1/chat/completions`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          model: env.MODEL,
          messages,
          max_tokens: 250, // Limit the response length
          temperature: 0.7, // Adjust for creativity vs. predictability
        }),
      });
      if (!aiResponse.ok) {
        throw new Error(`API error: ${aiResponse.status} ${aiResponse.statusText}`);
      }
      const data = await aiResponse.json();
      const reply = data.choices[0].message.content.trim();
      // Add the user's message and the AI's reply to the history
      history.push({ role: 'user', content: message });
      history.push({ role: 'assistant', content: reply });
      // Keep the history to a reasonable size (last 5 turns = 10 messages)
      if (history.length > 10) {
        history = history.slice(-10);
      }
      // Store the updated history back in KV, expiring after 1 hour (3600 seconds)
      await env.CHAT_SESSIONS.put(historyKey, JSON.stringify(history), { expirationTtl: 3600 });
      // Send the AI's reply back to the client
      return jsonResponse({ reply });
    } catch (error) {
      console.error('Inference error:', error);
      return jsonResponse({ error: 'Failed to process your message. Please try again.' }, 500);
    }
  }
};
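If you prefer not to manage an API token, Workers AI can also be called through its binding, with the gateway selected by ID in the third argument to `env.AI.run`. This is a minimal sketch, assuming an `[ai] binding = "AI"` entry in `wrangler.toml` (the same binding the RAG section below adds) and a gateway named `chat-gateway`; `generateReply` is a hypothetical helper that would take the place of the `fetch` to `GATEWAY_URL` above.

```javascript
// Call Workers AI through the AI binding, routing the request via an AI Gateway.
// Assumes `env.AI` is the Workers AI binding and a gateway with this ID exists.
async function generateReply(env, messages, gatewayId = 'chat-gateway') {
  const result = await env.AI.run(
    env.MODEL, // e.g. "@cf/meta/llama-3.1-8b-instruct"
    { messages, max_tokens: 250, temperature: 0.7 },
    { gateway: { id: gatewayId } } // send this request through the named AI Gateway
  );
  // Text-generation models return their output in the `response` field
  if (!result || typeof result.response !== 'string') {
    throw new Error('Unexpected Workers AI response shape');
  }
  return result.response.trim();
}
```

The trade-off: the binding avoids storing a token and building URLs by hand, while the REST call used above works from any HTTP client and makes the gateway endpoint explicit.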
Step 4: Deploy the Worker
With the logic in place, deploy your Worker to the Cloudflare network.
npx wrangler deploy
After a successful deployment, Wrangler will output your Worker's URL, which will look like https://conversational-ai-worker.<YOUR_SUBDOMAIN>.workers.dev. Copy this URL for the next step.
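Before wiring up the frontend, you can confirm the deployed Worker's input handling from Node.js 18 or later, which ships a global `fetch`. This is a sketch: `checkDeployment` is a hypothetical helper, and the expected status codes are the ones the Step 3 code returns for a GET request (405) and for a POST without `sessionId` (400). Neither probe reaches the model, so no inference is consumed.

```javascript
// Probe the deployed Worker's validation paths; returns a list of failed checks.
// `fetchImpl` defaults to the global fetch (Node.js 18+) and can be swapped for testing.
async function checkDeployment(workerUrl, fetchImpl = fetch) {
  const failures = [];

  // Non-POST requests should be rejected with 405
  const getRes = await fetchImpl(workerUrl, { method: 'GET' });
  if (getRes.status !== 405) {
    failures.push(`GET returned ${getRes.status}, expected 405`);
  }

  // A POST missing sessionId should be rejected with 400
  const badRes = await fetchImpl(workerUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: 'hello' }),
  });
  if (badRes.status !== 400) {
    failures.push(`POST without sessionId returned ${badRes.status}, expected 400`);
  }

  return failures;
}
```

For example, `checkDeployment('https://conversational-ai-worker.<YOUR_SUBDOMAIN>.workers.dev').then(console.log);` logs an empty array when both checks pass.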
Step 5: Create the Frontend Client
Now, create a simple HTML file for the user interface. Create a file named index.html in your project's root directory. This code uses Tailwind CSS via a CDN for rapid styling.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Customer Support Chatbot</title>
<script src="https://cdn.tailwindcss.com"></script>
</head>
<body class="font-sans max-w-2xl mx-auto p-4 sm:p-6 bg-gray-50 min-h-screen flex flex-col">
<header class="mb-4">
<h1 class="text-3xl font-bold text-gray-800">AI Customer Support</h1>
<p class="text-gray-600">Powered by Cloudflare Workers & Llama 3.1</p>
</header>
<main id="chatWindow" class="flex-1 border border-gray-300 h-96 overflow-y-auto p-4 mb-4 bg-white rounded-lg shadow-inner"></main>
<footer class="flex gap-2">
<input
id="userInput"
type="text"
placeholder="Type your message here..."
class="flex-1 p-3 border border-gray-300 rounded-md focus:outline-none focus:ring-2 focus:ring-blue-500 transition"
onkeydown="if (event.key === 'Enter') sendMessage()"
/>
<button
id="sendButton"
onclick="sendMessage()"
class="px-5 py-3 bg-blue-600 text-white font-semibold rounded-md hover:bg-blue-700 active:bg-blue-800 transition-colors disabled:bg-gray-400"
>
Send
</button>
</footer>
<script>
const WORKER_URL = 'https://conversational-ai-worker.<YOUR_SUBDOMAIN>.workers.dev'; // <-- REPLACE WITH YOUR WORKER URL
const sessionId = 'user_' + Math.random().toString(36).slice(2, 11);
const chatWindow = document.getElementById('chatWindow');
const userInput = document.getElementById('userInput');
const sendButton = document.getElementById('sendButton');
async function sendMessage() {
const message = userInput.value.trim();
if (!message) return;
addMessage(message, 'user');
userInput.value = '';
userInput.disabled = true;
sendButton.disabled = true;
try {
const response = await fetch(WORKER_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ sessionId, message })
});
const data = await response.json();
if (response.ok) {
addMessage(data.reply, 'bot');
} else {
addMessage(`Error: ${data.error || 'Unknown error'}`, 'bot');
}
} catch (error) {
addMessage('Network error. Could not connect to the AI agent.', 'bot');
} finally {
userInput.disabled = false;
sendButton.disabled = false;
userInput.focus();
}
}
function addMessage(text, sender) {
const messageDiv = document.createElement('div');
messageDiv.className = `p-3 my-2 rounded-lg max-w-xs md:max-w-md ${
sender === 'user'
? 'bg-blue-100 text-gray-800 self-end ml-auto'
: 'bg-gray-200 text-gray-800 self-start mr-auto'
}`;
messageDiv.textContent = text;
chatWindow.appendChild(messageDiv);
chatWindow.scrollTop = chatWindow.scrollHeight;
}
// Welcome message
addMessage("Hello! How can I assist you today?", "bot");
</script>
</body>
</html>
Important: Remember to replace the WORKER_URL placeholder in the script with the actual URL of your deployed Worker.
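As written, the page generates a new `sessionId` on every load, so refreshing the tab starts a fresh conversation. If you want history to survive a reload within the same tab, one option is to keep the ID in `sessionStorage`. This is a minimal sketch: `getOrCreateSessionId` is a hypothetical helper, and the storage object is passed in so the function can be tested outside a browser. Stored history still expires with the one-hour KV TTL.

```javascript
// Reuse a session ID stored under `key`, or create and store a new one.
// In the page, call it as getOrCreateSessionId(window.sessionStorage).
function getOrCreateSessionId(storage, key = 'chatSessionId') {
  let id = storage.getItem(key);
  if (!id) {
    // Same format as the tutorial's ID: "user_" followed by random base-36 characters
    id = 'user_' + Math.random().toString(36).slice(2, 11);
    storage.setItem(key, id);
  }
  return id;
}
```

In `index.html`, the `sessionId` line would then become `const sessionId = getOrCreateSessionId(window.sessionStorage);`.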
Step 6: Run and Test Your Chatbot
You can test the application by simply opening the index.html file in your web browser.
Start a conversation and test its context-awareness:
- User: "How do I reset my password?"
- User: "What is the status of my order #XYZ-98765?"
- User: (Follow-up) "Can you check its estimated delivery date?"
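The chatbot should refer back to order #XYZ-98765 when answering the follow-up, because the Worker sends the session history stored in Cloudflare KV with every request. The model has no access to a real order system, so it can acknowledge the order but cannot actually look up its delivery date.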
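To repeat this check without clicking through the UI, a short script can replay the same turns under one `sessionId`. This is a sketch for Node.js 18 or later: `replayConversation` is a hypothetical helper, and it only collects the replies, so judging whether the follow-up mentions the order number is still up to you.

```javascript
// Send each message in order under one sessionId so the Worker reuses the same KV history.
async function replayConversation(workerUrl, messages, fetchImpl = fetch) {
  const sessionId = 'test_' + Date.now();
  const transcript = [];
  for (const message of messages) {
    const res = await fetchImpl(workerUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ sessionId, message }),
    });
    const data = await res.json();
    if (!res.ok) {
      throw new Error(`Turn failed (${res.status}): ${data.error}`);
    }
    transcript.push({ user: message, bot: data.reply });
  }
  return transcript;
}
```

For example, pass your Worker URL and the three test messages above, then print the last entry's `bot` field to inspect the follow-up answer.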
Production Considerations
- Rate limiting and caching: In the AI Gateway dashboard, you can easily configure rate limits (e.g., 60 requests per minute) to prevent abuse and set caching rules to serve identical requests faster and at a lower cost.
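- Scaling session storage: Cloudflare KV is highly scalable but eventually consistent, and it supports roughly one write per second to the same key, so rapid back-to-back messages in one session can overwrite each other. For applications with very high write volumes or that require transactional guarantees, consider using Cloudflare Durable Objects for session management.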
- Performance and monitoring: Use the AI Gateway dashboard to monitor requests, errors, costs, and token usage. For more complex queries, you can switch to a larger model like @cf/meta/llama-3.1-70b-instruct by updating the MODEL variable in wrangler.toml.
- Frontend build process: For a production site, it's best to set up a proper build step for Tailwind CSS to purge unused styles and minimize the final CSS file, rather than using the CDN.
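Gateway caching can also be adjusted per request with headers. This is a minimal sketch, assuming the AI Gateway request headers `cf-aig-cache-ttl` and `cf-aig-skip-cache` from the AI Gateway caching documentation (verify the names against the current docs); `withCacheHeaders` is a hypothetical helper that merges them into the headers the Step 3 Worker already sends. Because every request carries the session's history, exact-match cache hits are most likely for opening questions, which is the policy sketched here.

```javascript
// Add AI Gateway cache-control headers to a copy of an existing headers object.
// Policy sketch: only the first turn of a conversation is cacheable; later turns skip the cache.
function withCacheHeaders(headers, { historyLength, ttlSeconds = 3600 }) {
  const out = { ...headers };
  if (historyLength === 0) {
    out['cf-aig-cache-ttl'] = String(ttlSeconds); // cache identical opening questions
  } else {
    out['cf-aig-skip-cache'] = 'true'; // history-dependent turns are unlikely to repeat
  }
  return out;
}
```

In the Worker, the `headers` object passed to `fetch` would become `withCacheHeaders({ ... }, { historyLength: history.length })`.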
Advanced Features: RAG and Sentiment Analysis
You can easily extend this foundation with more advanced capabilities.
Contextual Knowledge With Retrieval-Augmented Generation (RAG)
To make your bot answer questions based on your own documentation, you can integrate RAG using Cloudflare Vectorize.
- Add Vectorize and Workers AI bindings to wrangler.toml:
[[vectorize]]
binding = "VECTOR_INDEX"
index_name = "your-product-docs"
[ai]
binding = "AI"
- In your Worker, generate an embedding for the user's message, query your vector index for relevant context, and inject it into the prompt:
// In src/index.js, after building `messages` and before calling the AI Gateway
// bge-base-en-v1.5 returns { shape, data }, with one embedding vector per input string
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [message] });
const embedding = data[0];
// Vectorize returns { count, matches }; request metadata so each match carries its stored text
// ('all' is the Vectorize V2 option; V1 indexes used returnMetadata: true)
const { matches } = await env.VECTOR_INDEX.query(embedding, { topK: 3, returnMetadata: 'all' });
// Assumes each vector was indexed with its source text in metadata.text
const context = matches.map((m) => m.metadata.text).join('\n---\n');
const systemPrompt = `You are a helpful assistant. Answer the user's question based on the following context:\n${context}`;
messages[0].content = systemPrompt; // Replace the default system message with the grounded one
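The query above only returns useful context if the index already holds your documents, with their text stored in metadata. This is a minimal indexing sketch, assuming the same `AI` and `VECTOR_INDEX` bindings, an index with 768 dimensions to match `@cf/baai/bge-base-en-v1.5` (for example, created with `npx wrangler vectorize create your-product-docs --dimensions=768 --metric=cosine`), and a hypothetical `docs` array of `{ id, text }` objects. `indexDocuments` is a hypothetical helper; run it from a one-off admin route or script rather than on every chat request.

```javascript
// Embed documents with Workers AI and upsert them into Vectorize, storing the raw text
// in metadata.text so the chat Worker can read it back from query matches.
async function indexDocuments(env, docs, batchSize = 50) {
  let indexed = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);

    // One embedding per input string, returned in the same order as `text`
    const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: batch.map((doc) => doc.text),
    });

    // Vectorize expects { id, values, metadata } records; ids must be strings
    const vectors = batch.map((doc, j) => ({
      id: String(doc.id),
      values: data[j],
      metadata: { text: doc.text },
    }));
    await env.VECTOR_INDEX.upsert(vectors);
    indexed += vectors.length;
  }
  return indexed;
}
```

Batching keeps each embedding request modest; `upsert` overwrites vectors with the same ID, so rerunning the script after editing a document refreshes its entry instead of duplicating it.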
Conclusion
You have successfully built a scalable, low-latency, and context-aware conversational AI agent using Cloudflare Workers, AI Gateway, and KV. This architecture leverages the power of the edge to deliver a superior user experience while simplifying development and operations.
Key takeaways from this tutorial include:
- Edge inference: Workers AI brings model inference close to the user, drastically reducing latency.
- Simplified operations: AI Gateway centralizes routing, caching, rate limiting, and monitoring.
- Stateless with context: Cloudflare KV provides an easy and effective way to manage conversation history.
- Modern frontend: Tailwind CSS enables the rapid development of a clean, responsive user interface.
With the continuous evolution of Cloudflare's serverless platform, the possibilities for building intelligent, globally-distributed applications are more accessible than ever.