This guide explains agentic AI from first principles, starting with fundamental concepts and progressing through architecture design, implementation details, and complete working examples. By the end, readers can build production agent systems.

Traditional AI systems have limitations. They respond to single queries only: they process input and generate output, but do not maintain state between interactions. They cannot:

- Plan multi-step tasks
- Use external tools
- Learn from experience
- Remember past conversations
- Execute actions beyond text generation

Consider a traditional chatbot. A user asks about database performance. The chatbot generates a response from training data. When the user asks a follow-up question, the chatbot treats it as a new conversation. The chatbot cannot remember the previous question, query a database, execute actions, or plan a sequence of steps.

Agentic AI systems solve these problems. Agents:

- Maintain a persistent state across sessions
- Plan sequences of actions to achieve goals
- Use tools to interact with external systems
- Store memories for future reference
- Execute complex tasks autonomously
- Adapt behavior based on experience
- Remember past interactions
- Coordinate multiple steps

Consider an agentic system handling the same request. A user asks about database performance. The agent plans a sequence of steps:

1. Query the database documentation
2. Retrieve relevant articles
3. Analyze the information
4. Generate a comprehensive response

The agent stores important facts in memory. When the user asks a follow-up question, the agent retrieves relevant memories, uses context from previous interactions, and provides a coherent response.

What Is Agentic AI?

Agentic AI refers to systems that act autonomously to achieve goals. The term "agentic" refers to the capacity to act.
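The plan–execute–remember loop described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical placeholder functions, not the implementation of any particular framework: a real planner would call a language model, and a real executor would dispatch to registered tools.

```python
# Minimal agent loop sketch: plan steps for a goal, execute each step,
# and record observations so follow-up questions keep context.
# plan() and execute() are illustrative stand-ins.

def plan(goal):
    # A real planner would call a language model; here we hard-code
    # a two-step plan for illustration.
    return [("search_docs", goal), ("summarize", goal)]

def execute(step):
    tool, arg = step
    # A real executor would dispatch to a registered tool.
    return f"{tool} result for '{arg}'"

def run_agent(goal):
    state = {"goal": goal, "observations": []}
    for step in plan(goal):
        result = execute(step)
        state["observations"].append(result)  # observe and persist
    return state

state = run_agent("investigate slow database queries")
```

Because the observations persist in `state`, a follow-up turn can reuse them instead of starting from scratch, which is exactly what distinguishes the agentic flow from the stateless chatbot above.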
Agents:

- Receive goals from users
- Break goals into actionable steps
- Execute steps using available tools
- Observe results from actions
- Adjust plans based on outcomes
- Persist state across sessions
- Learn from experience

Understanding Agents from First Principles

Agents are software systems that exhibit autonomous behavior. They differ from traditional programs in key ways. The diagram illustrates the fundamental differences between traditional programs and agentic systems. Traditional programs follow fixed execution paths defined at development time, while agents adapt their behavior based on context and goals. This adaptive capability enables agents to handle situations that were not explicitly programmed, making them more flexible and powerful for complex, dynamic environments.

Comparison With Traditional AI Systems

Traditional AI systems:

- Process single requests
- Generate responses from training data
- Lack persistent memory
- Cannot execute actions
- Cannot plan multi-step tasks
- Cannot learn from interactions
- Treat each request independently
- Cannot coordinate multiple steps

Traditional systems have fixed behavior, follow predetermined patterns, cannot adapt to new situations, cannot use external tools, cannot remember past interactions, and cannot improve over time. Agentic systems change this paradigm fundamentally. Agents:

- Maintain a persistent state across sessions
- Plan sequences of actions dynamically
- Use tools to interact with external systems
- Store memories for future reference
- Execute complex tasks autonomously
- Adapt behavior based on experience
- Learn from successful patterns
- Improve performance over time

Agents vs. Chatbots: Fundamental Differences

Agents differ from chatbots in fundamental ways. Understanding these differences is essential for building agentic systems.
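The context-handling difference can be made concrete with a short sketch. Both the chatbot function and the agent class below are illustrative stand-ins, not real systems: the point is only that the chatbot forgets everything between calls, while the agent carries session history forward.

```python
# Stateless chatbot vs. stateful agent (minimal illustration).

def chatbot_reply(message):
    # Stateless: nothing survives between calls.
    return f"Answer to: {message}"

class Agent:
    def __init__(self):
        self.history = []  # persists across turns

    def reply(self, message):
        self.history.append(message)
        context = " | ".join(self.history)
        return f"Answer to: {message} (context: {context})"

agent = Agent()
first = agent.reply("Why is my query slow?")
second = agent.reply("How do I fix it?")  # follow-up keeps earlier context
```

The agent's second reply still contains the original question, so a follow-up like "How do I fix it?" remains interpretable; the chatbot's reply to the same follow-up would have no idea what "it" refers to.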
| Aspect | Chatbots | Agents |
| --- | --- | --- |
| Context handling | Respond to individual messages without context; each message is processed independently; context is lost between interactions | Maintain conversation context across sessions; previous interactions inform current responses; context accumulates over time; memory enables continuity |
| Action execution | Generate text only; cannot execute actions; cannot interact with systems; cannot query databases; cannot call APIs | Execute actions through tools; query databases; call APIs; run code; interact with systems |
| Memory | No memory beyond the current session; when the session ends, all context is lost | Store long-term memories in vector databases; memories persist across sessions; enable learning; improve responses over time |
| Adaptability | Follow fixed patterns; use predetermined templates; cannot adapt behavior | Adapt plans dynamically; adjust strategies based on results; optimize for success |
| Planning capabilities | No planning; respond immediately to each message; cannot break down complex tasks; cannot coordinate multiple steps | Generate multi-step plans; break complex goals into actionable sequences; handle conditional logic; manage step dependencies |
| Tool usage | Cannot use external tools; limited to text generation; no database access; no API integration | Access a comprehensive tool registry; execute SQL queries; make HTTP requests; run code snippets; interact with external systems |
| Multi-step tasks | Handle single-turn conversations only; cannot coordinate sequential actions; cannot manage task workflows | Execute complex multi-step workflows; coordinate sequential actions; manage task dependencies; handle parallel execution |
| Error handling | Limited error recovery; cannot retry failed operations; cannot adapt to failures | Robust error handling; automatic retry mechanisms; graceful failure recovery; adaptive error strategies |
| Learning ability | Static behavior; cannot learn from interactions; responses do not improve over time | Continuous learning; improve from experience; adapt based on feedback; optimize performance over time |
| State management | Stateless operation; no persistent state; cannot resume interrupted tasks | Persistent state management; track execution progress; resume interrupted tasks; maintain state across sessions |
| Personalization | Generic responses; no user-specific adaptation; same responses for all users | Personalized interactions; learn user preferences; adapt to individual needs; build user-specific knowledge |
| Integration capabilities | Limited integration; primarily text-based interfaces; minimal external system connectivity | Deep system integration; connect to databases; integrate with APIs; interact with cloud services; access file systems |
| Response quality | Template-based responses; limited depth; no fact verification; may provide outdated information | Context-aware responses; fact-checked answers; real-time information retrieval; comprehensive and accurate answers |
| Scalability | Limited scalability; each conversation is isolated; no shared knowledge; resource-intensive per conversation | Highly scalable; shared knowledge base; efficient resource usage; optimized for production workloads |
| User experience | Simple question-answer format; limited interactivity; no proactive assistance | Rich interactive experience; proactive assistance; guided workflows; comprehensive task completion |
| Cost efficiency | High per-conversation cost; no knowledge reuse; repeated processing of similar queries | Cost-efficient; knowledge reuse across sessions; optimized resource utilization; reduced redundant processing |

Core Components of Agentic Systems

Agentic systems include five core components. Each component serves specific functions. Understanding each element is essential for building agents.
- Planning system: Breaks goals into actionable steps, handles conditional logic for decision-making, manages step dependencies to ensure correct ordering, validates plan feasibility before execution, ranks plans by quality metrics, and selects optimal execution plans.
- Tool registry: Provides functions for external actions, validates tool calls against schemas, manages tool permissions for security, handles tool errors gracefully, retries failed operations when appropriate, and tracks tool usage for monitoring.
- Memory system: Stores and retrieves context efficiently, uses vector search for semantic retrieval, maintains long-term knowledge bases, ranks memories by relevance, filters memories by context, and updates memories based on new information.
- State machine: Manages execution flow systematically, tracks current state accurately, handles state transitions correctly, manages error recovery automatically, coordinates multi-step tasks effectively, and persists state for reliability.
- Runtime: Orchestrates all components seamlessly, coordinates execution across components, manages error recovery comprehensively, monitors performance continuously, logs activities for debugging, and provides observability for operations.

Agent Architecture

Agents follow a structured architecture. The architecture separates concerns: each component handles specific responsibilities, and components communicate through well-defined interfaces.

User interface: Sends messages to the agent. Messages include user queries and goals. The interface may be a web application or API, and messages are formatted as text or structured data.

The architecture diagram shows component relationships. User input flows to the runtime. The runtime queries the planner for execution plans. The planner generates steps and validates feasibility. Steps flow to the tool executor for action execution. Results flow to memory for storage. Memory feeds back to the planner for context.
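The component flow just described (runtime → planner → executor → memory) can be sketched as a few small classes. All of the classes below are illustrative stand-ins with hypothetical names, shown only to make the separation of concerns concrete; they are not the actual NeuronAgent components.

```python
# Component wiring sketch: the runtime asks the planner for steps,
# sends each step to the tool executor, and stores results in memory.

class Planner:
    def plan(self, goal):
        # A real planner would call a language model and validate steps.
        return [f"step for: {goal}"]

class ToolExecutor:
    def run(self, step):
        # A real executor would dispatch to a registered tool.
        return f"result of {step}"

class Memory:
    def __init__(self):
        self.items = []

    def store(self, item):
        self.items.append(item)

class Runtime:
    def __init__(self):
        self.planner = Planner()
        self.executor = ToolExecutor()
        self.memory = Memory()

    def handle(self, goal):
        for step in self.planner.plan(goal):
            self.memory.store(self.executor.run(step))
        return self.memory.items[-1]

runtime = Runtime()
output = runtime.handle("diagnose latency")
```

Because each component sits behind a narrow interface (`plan`, `run`, `store`), any one of them can be swapped out, which is the main payoff of the architecture described above.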
The state machine coordinates all transitions, and the response generator formats the final output.

Planning Systems

Planning systems convert goals into action sequences. They use language models to generate plans, break complex tasks into steps, handle conditional logic, and manage dependencies between steps. Planning is the core capability that enables autonomous behavior: without planning, agents cannot break down complex goals, coordinate multiple actions, or adapt to changing conditions.

Planning systems use language models to understand goals. Models analyze goal requirements, identify required resources, determine the necessary steps, and account for constraints and dependencies.

The planning process begins with goal analysis. The system receives a goal statement. The language model parses the goal, identifies key requirements, determines success criteria, and estimates complexity. Goal analysis produces a structured representation that includes:

- Required actions
- Resource needs
- Constraints
- Success metrics

After goal analysis, the system queries available tools. The planner examines the tool registry, identifies relevant tools, assesses their capabilities, and verifies their availability. Tool querying enables informed planning: the planner knows which actions are possible, matches goals to capabilities, identifies missing tools, and suggests alternatives.

The planner generates candidate plans using tool information. Each plan is a sequence of steps; steps specify tool calls, include parameters, and define dependencies. Plan generation considers multiple factors:

- Step ordering: steps must execute in the correct sequence
- Resource availability: tools must be available when needed
- Error handling: plans must handle potential failures

The system then validates generated plans. Validation checks plan feasibility, verifies step dependencies, confirms resource availability, and ensures goal achievement.
Validation includes multiple checks:

- Verifying all steps are executable
- Confirming dependencies are satisfied
- Ensuring resources are available
- Validating the goal achievement path

Valid plans are ranked by quality. Ranking considers execution time, resource usage, success probability, and error resilience. The best plan is selected from the ranked candidates: selection uses quality scores, considers current context, accounts for constraints, and optimizes for success.

Steps are extracted from the selected plan. Each step becomes an executable action: steps are ordered by their dependencies, include error handling, and are ready for execution.

The planning diagram shows the decision flow. Goals enter the planner for analysis. The planner queries available tools to understand capabilities, then generates candidate plans with step sequences. Plans are validated for feasibility and correctness, quality metrics rank the valid plans, and the best plan is selected based on scores. Steps are extracted for execution by the runtime.

Tool Execution

Tools enable agents to interact with external systems. Tools provide functions for specific actions. Agents call tools during execution; tools return results, which agents use for subsequent steps.

Tools are the bridge between agents and external systems. Without tools, agents can only generate text. With tools, agents can query databases, make API calls, execute code, and run commands. Tools transform agents from text generators into action executors.

Tool execution is fundamental to agentic behavior. Agents identify required actions during planning, select appropriate tools from the registry, format parameters correctly, invoke tools, process results, and use the results for subsequent steps.
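A tool registry of the kind described above can be sketched as a small class: tools register under a name with a parameter schema, calls are validated against that schema, and valid calls are dispatched. The class, the schema format, and the example "sql" tool are all illustrative assumptions, not the NeuronAgent registry.

```python
# Tool registry sketch: register tools with parameter schemas,
# validate calls, and dispatch execution.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, params):
        # params: names of required parameters for this tool.
        self._tools[name] = (fn, set(params))

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn, params = self._tools[name]
        missing = params - kwargs.keys()
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return fn(**kwargs)

registry = ToolRegistry()
# A stand-in "sql" tool; a real one would run the query against a database.
registry.register("sql", lambda query: f"rows for: {query}", params=["query"])
result = registry.call("sql", query="SELECT 1")
```

Validating parameters before dispatch is what lets the registry reject malformed tool calls early, instead of surfacing a confusing failure from inside the tool.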
Tools include four primary types:

- SQL tools: Execute database queries, enable data retrieval, support data analysis, and provide structured data access.
- HTTP tools: Make web requests, fetch external data, interact with APIs, and retrieve current information.
- Code tools: Execute code snippets, perform computations, process data, and generate outputs.
- Shell tools: Run system commands, interact with the operating system, execute scripts, and manage files.

Each tool type serves specific purposes:

- SQL tools enable database interactions: agents query structured data, retrieve relevant information, and analyze data relationships.
- HTTP tools enable web interactions: agents fetch current information, call external APIs, and retrieve real-time data.
- Code tools enable computation: agents perform calculations, process data, and generate results.
- Shell tools enable system interactions: agents execute commands, manage files, and interact with the environment.

Tool execution follows a structured flow and includes error handling. Tools may fail due to:

- Network issues
- Invalid results
- Timeouts
- Authentication requirements

The executor handles errors gracefully: it retries transient failures, reports permanent failures, and updates the agent state.

The tool execution diagram shows the interaction flow. The agent requests a tool call with parameters. The tool registry validates the request against tool metadata. The tool executor runs the action in the tool environment. Results are returned in a structured format. The agent processes the results for validation, updates its state with new information, and continues planning with updated context.

Memory Systems

Memory systems store and retrieve context. They enable agents to remember past interactions, support long-term knowledge retention, and provide semantic search over memories. Memory is essential for agentic behavior.
Without memory, agents cannot learn from experience or build on past knowledge, and they repeat the same mistakes. Memory transforms agents from stateless responders into learning systems. Memory systems enable persistent knowledge: agents store important facts from interactions, retrieve relevant context for new queries, build knowledge bases over time, and improve performance through experience.

Memory includes three distinct types:

- Short-term memory: Stores recent conversation context, maintains session state, enables multi-turn conversations, and provides immediate context.
- Long-term memory: Stores essential facts and events, persists across sessions, enables knowledge accumulation, and supports learning over time.
- Working memory: Stores temporary computation state, holds intermediate results, supports complex reasoning, and clears after task completion.

Each memory type serves specific purposes:

- Short-term memory enables conversation continuity: agents remember recent exchanges, maintain context within sessions, and provide coherent responses.
- Long-term memory enables knowledge accumulation: agents remember essential facts, build expertise over time, and avoid repeating mistakes.
- Working memory enables complex reasoning: agents hold intermediate results, perform multi-step calculations, and manage temporary state.

Memory storage follows a structured process, and memory retrieval uses vector similarity search. Vector search enables semantic retrieval: memories are found by meaning, not keywords; queries match conceptually similar content; synonyms and related concepts are handled automatically; and context retrieval improves response quality.

The memory diagram shows storage and retrieval processes. New memories are extracted from interactions. Text is converted to embeddings using language models. Embeddings are stored in vector databases with metadata. Queries are converted to embeddings for search. Similarity search finds relevant memories using vector distance.
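The retrieval step just described can be sketched in pure Python. Real systems use model-generated embeddings (hundreds of dimensions) and an approximate nearest-neighbor index; here, tiny hand-written three-dimensional vectors stand in for embeddings, purely to show how cosine similarity selects the most relevant memory.

```python
# Memory retrieval sketch: rank stored memories by cosine similarity
# to a query embedding and return the best match.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# (text, embedding) pairs; vectors are hand-written stand-ins.
memories = [
    ("user prefers Postgres over MySQL", [0.9, 0.1, 0.0]),
    ("weather was sunny yesterday",      [0.0, 0.2, 0.9]),
]

# Stand-in embedding for "which database does the user like?"
query_embedding = [0.8, 0.2, 0.1]

best_text, _ = max(memories, key=lambda m: cosine_similarity(query_embedding, m[1]))
```

Because matching happens in embedding space rather than on keywords, a query that never mentions "Postgres" can still retrieve the database-preference memory, which is the point of semantic retrieval.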
Relevance scores rank matches. Top matches are returned as context for agent prompts.

State Machines

State machines manage agent execution flow. They track the current state, handle state transitions, manage error recovery, and coordinate multi-step tasks. States include five types:

- Idle: The agent waits for input.
- Planning: The agent generates execution plans.
- Executing: The agent runs tool calls.
- Waiting: The agent waits for tool results.
- Completed: The agent has finished the task.

State transitions follow rules:

- Idle → Planning: on a new goal
- Planning → Executing: on plan ready
- Executing → Waiting: on tool call
- Waiting → Executing: on result received
- Executing → Completed: on goal achieved

The state machine diagram shows all states and transitions. States are represented as nodes and transitions as arrows. Each transition has a condition that triggers the state change. Error states handle failures, and recovery paths restore normal operation.

Agent Components in Detail

Planning Component

The planning component generates execution plans. It uses language models to analyze goals, breaks goals into actionable steps, and handles conditional logic and loops. Planning works in three phases:

1. Goal analysis: the system understands the desired outcome.
2. Step generation: the system creates a sequence of actions.
3. Plan validation: the system checks plan feasibility.

Plans include step dependencies. Some steps require previous steps to complete; the planner orders steps correctly and handles parallel execution when possible.

The planning component diagram shows the internal structure. Goals enter the analyzer. The analyzer queries available tools and generates candidate steps. Dependencies order the steps, the validator checks feasibility, and valid plans are output.

Tool Registry

The tool registry manages available tools. It provides tool discovery, validates tool calls, handles tool execution, and manages tool permissions.
Tools are registered with metadata. Metadata includes the tool name, description, parameters, and return types. The registry validates calls against metadata and enforces security policies. Tool execution includes error handling: tools may fail due to network issues or return invalid results, and the registry handles errors gracefully, retrying failed calls when appropriate.

The tool registry diagram shows tool management. Tools are registered with metadata, and the registry maintains a catalog. Agents query the catalog, the registry validates requests and executes tools, and results are returned to agents.

Memory Component

The memory component stores agent experiences. It converts text to embeddings, stores embeddings in vector databases, and retrieves relevant memories using similarity search. Memory storage follows this process:

- Text is extracted from interactions
- Text is converted to embeddings
- Embeddings are stored with metadata, including timestamps and tags
- Vector indexes enable fast retrieval

Memory retrieval uses semantic search:

- Queries are converted to embeddings
- Similarity search finds relevant memories
- Results are ranked by relevance
- Top results are returned as context

The memory component diagram shows storage and retrieval. Text enters the embedding generator, and embeddings are stored in a vector database. Queries are embedded, similarity search finds matches, and matches are ranked and returned.

Memory systems enable agents to maintain context across interactions and learn from past experiences. The embedding-based approach allows semantic retrieval: agents can find relevant memories even when the exact wording differs. This capability is crucial for building agents that can hold meaningful conversations over extended periods. The vector database enables efficient similarity search across large collections of stored memories, and ranking ensures that the most relevant memories are retrieved and used for context, improving the quality of agent responses.
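The five states and transition rules listed earlier (Idle → Planning → Executing → Waiting → Completed) can be sketched as a small table-driven state machine. This is a minimal illustration with hypothetical event names, not the NeuronAgent implementation.

```python
# Table-driven state machine sketch for the agent execution states.

from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    PLANNING = auto()
    EXECUTING = auto()
    WAITING = auto()
    COMPLETED = auto()

# (current state, event) -> next state, mirroring the transition rules.
TRANSITIONS = {
    (State.IDLE, "new_goal"): State.PLANNING,
    (State.PLANNING, "plan_ready"): State.EXECUTING,
    (State.EXECUTING, "tool_call"): State.WAITING,
    (State.WAITING, "result_received"): State.EXECUTING,
    (State.EXECUTING, "goal_achieved"): State.COMPLETED,
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state} on {event}")

# Walk one full cycle: goal arrives, plan runs, one tool call completes.
s = State.IDLE
for event in ["new_goal", "plan_ready", "tool_call", "result_received", "goal_achieved"]:
    s = step(s, event)
```

Keeping the rules in a lookup table makes invalid transitions fail loudly, and adding error and recovery states later is a matter of adding entries, not rewriting control flow.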
State Machine Component

The state machine manages execution flow. It tracks the current state, handles transitions, manages error recovery, and coordinates multi-step tasks. State tracking includes persistence:

- States are saved to databases
- States survive system restarts
- Persistence enables resuming interrupted tasks

Error handling includes recovery paths:

- Failed steps trigger error states
- Error states attempt recovery
- Recovery may retry steps or adjust plans

The state machine diagram shows state management. States are stored in the database, events trigger transitions, error states have recovery paths, and completed states trigger cleanup.

State machine components provide reliable execution control for agent operations. They ensure that agents progress through execution phases in a controlled and recoverable manner. Database persistence allows state machines to survive system restarts and continue execution from the last known state. Event-driven transitions enable agents to respond dynamically to changing conditions during execution. Recovery mechanisms built into error states allow agents to handle failures gracefully and continue operations when possible.

Agent Execution Flow

Complete Execution Cycle

Agent execution follows a cycle that starts with user input, ends with response generation, and includes planning, execution, and memory updates.

1. Receive input: The agent receives user input, loads the conversation context, and retrieves relevant memories.
2. Generate a plan: The planner analyzes the goal, queries available tools, generates a step sequence, and validates the plan.
3. Execute plan steps: The executor runs each step in order.
Steps may call tools, query memory, or update state.
4. Process results: Tool results are collected, validated, used for subsequent steps, and update the agent state.
5. Update memory: Important facts are extracted, converted into embeddings, and stored in memory, and the memory index is updated.
6. Generate a response: The response generator formats the output, which includes execution results and explanations and is returned to the user.

The execution flow diagram shows the complete cycle: input flows through planning, planning flows to execution, execution flows to memory, memory flows to response, and the response flows to the user.

Multi-Step Task Execution

Multi-step tasks require coordination. Agents:

- Break tasks into steps
- Execute steps sequentially
- Handle step dependencies
- Manage step failures

Step dependencies require ordering. Some steps must be completed before others: the planner orders the steps correctly, and the executor waits for dependencies. Step failures require recovery: failed steps trigger error handling, which may retry steps, adjust plans, or skip steps.

The multi-step diagram shows step coordination. Dependencies order steps, execution follows that order, results flow between steps, and failures trigger recovery.

Error Handling and Recovery

Error handling ensures robust operation. Agents encounter various errors:

- Network failures affect tool calls
- Invalid inputs cause validation errors
- Resource limits cause timeouts

Error handling includes detection: the system monitors execution, detects failures, categorizes errors, and selects recovery strategies. Recovery strategies include retries:

- Transient errors trigger retries
- Retries use exponential backoff
- Retries have a maximum number of attempts
- Permanent errors skip retries

The error handling diagram shows the recovery flow: errors are detected and categorized, recovery strategies are selected and executed, and success restores normal flow.
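The retry policy described above (retry transient errors with exponential backoff, up to a maximum number of attempts) can be sketched as a small wrapper. The `TransientError` class and the flaky tool function are hypothetical stand-ins for illustration.

```python
# Retry-with-exponential-backoff sketch for transient tool failures.

import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a network timeout."""

def with_retries(fn, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # permanent failure after exhausting retries
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_tool():
    # Fails twice, then succeeds, simulating a transient outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary network failure")
    return "ok"

result = with_retries(flaky_tool)
```

Non-transient exceptions pass straight through the wrapper, which matches the rule above that permanent errors skip retries.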
Robust error handling is essential for production agent systems. Errors can occur at any stage of execution, from tool failures to network issues to validation errors. Effective error handling requires both detection mechanisms and recovery strategies. Categorizing errors enables appropriate response selection, distinguishing between transient issues that can be retried and permanent failures that require different handling. Recovery strategies must be designed to minimize disruption to ongoing operations while ensuring system integrity and reliability.

Building an Agent With NeuronDB and NeuronAgent

This section provides a complete step-by-step guide to building a production agent. The guide covers:

- Installation and setup
- Configuration
- Agent creation
- Session management
- Message handling
- Tool execution
- Memory management
- Troubleshooting tips
- Best practices

The example creates a research assistant agent. The agent:

- Answers questions using document retrieval
- Uses SQL tools to query databases
- Uses HTTP tools to fetch web content
- Stores memories for future reference
- Maintains conversation context
- Adapts behavior based on results

Prerequisites

Before building an agent, install the required components. The setup requires PostgreSQL, the NeuronDB extension, and the NeuronAgent server. Each component must be configured correctly.

Step 1: Install PostgreSQL

PostgreSQL version 16 or later is required. Download and install PostgreSQL for your operating system, then verify the installation by checking the version.

```shell
# Check PostgreSQL version
psql --version

# Expected output:
# psql (PostgreSQL) 16.0
```

Step 2: Create Database

Create a database for the agent system. The database stores agents, sessions, messages, and memories. Note that createdb and psql are shell commands, not SQL.

```shell
# Create database
createdb neurondb

# Connect to database
psql -d neurondb
```

```sql
-- Verify connection (run inside psql)
SELECT version();
```

Step 3: Install NeuronDB Extension

NeuronDB provides vector search and embedding capabilities. Download the extension for your PostgreSQL version.
Install the extension files, then enable the extension in the database.

```shell
# Install NeuronDB extension
psql -d neurondb -c "CREATE EXTENSION neurondb;"

# Verify installation
psql -d neurondb -c "SELECT * FROM pg_extension WHERE extname = 'neurondb';"

# Expected output:
#  extname  | extversion | nspname
# ----------+------------+----------
#  neurondb | 1.0        | neurondb
```

Step 4: Install NeuronAgent Server

NeuronAgent provides the agent runtime. Download the NeuronAgent binary, extract the files, configure the server, and start it.

```shell
# Download NeuronAgent (example)
# wget https://github.com/neurondb-ai/neurondb/releases/download/v1.0.0/neuronagent-linux-amd64
# chmod +x neuronagent-linux-amd64
# mv neuronagent-linux-amd64 ./bin/neuronagent

# Run NeuronAgent migrations
psql -d neurondb -f migrations/001_initial_schema.sql
psql -d neurondb -f migrations/002_add_indexes.sql
psql -d neurondb -f migrations/003_add_triggers.sql

# Verify migrations
psql -d neurondb -c "\dt"

# Expected output shows tables:
# agents, sessions, messages, memory_chunks, etc.

# Start NeuronAgent server
./bin/neuronagent

# Server starts on port 8080 by default
# Verify server is running
curl http://localhost:8080/health

# Expected output:
# {"status":"healthy"}
```

Step 5: Generate API Key

API keys authenticate requests to NeuronAgent. Generate an API key for your application and store it securely.

```shell
# Generate API key
./bin/neuronagent generate-key

# Expected output:
# API Key: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Store the key securely
export NEURONAGENT_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```

The setup creates the complete database schema. The schema includes agent tables for configurations, session tables that track conversations, message tables that store interactions, and memory tables that enable long-term context. Indexes enable fast queries, and triggers maintain data consistency. The server provides REST API and WebSocket endpoints.
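Client code will attach the generated key to every request. The helper below builds the Bearer-token headers used by the API calls later in this guide; the function name and the environment-variable fallback are illustrative conveniences, and no network call is made here.

```python
# Build authenticated request headers for the NeuronAgent REST API.

import os

def auth_headers(api_key=None):
    # Prefer an explicit key; fall back to the exported environment
    # variable from the setup step above.
    key = api_key or os.environ.get("NEURONAGENT_API_KEY", "")
    if not key:
        raise ValueError("missing API key")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }

headers = auth_headers("sk-example")
```

Centralizing header construction keeps the key out of scattered call sites and makes it easy to rotate keys by changing one environment variable.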
Database Schema Explained

The database schema provides the foundation for agent systems. Understanding each table is essential for building agents. This section explains the schema in detail.

Agents Table

The agents table stores agent configurations. Each agent has a unique identifier. The name field identifies the agent, the system_prompt defines agent behavior, the model_name specifies the language model, the enabled_tools array lists available tools, the memory_table specifies where memories are stored, and the config field stores additional settings.

```sql
-- Agents table stores agent configurations
CREATE TABLE IF NOT EXISTS agents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    system_prompt TEXT NOT NULL,
    model_name TEXT DEFAULT 'gpt-4',
    enabled_tools TEXT[] DEFAULT ARRAY['sql', 'http'],
    memory_table TEXT,
    config JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Explanation of each field:
-- id: Unique identifier for the agent (UUID)
-- name: Human-readable name for the agent
-- system_prompt: Instructions that define agent behavior
-- model_name: Language model to use (gpt-4, gpt-3.5-turbo, etc.)
-- enabled_tools: Array of tool names the agent can use
-- memory_table: Table name where agent memories are stored
-- config: Additional configuration as JSON
-- created_at: Timestamp when agent was created
```

The system_prompt is critical: it defines how the agent behaves, specifies agent capabilities, guides decision-making, sets the response style, and defines tool usage patterns.

Sessions Table

The sessions table tracks conversation sessions. Each session belongs to an agent. Sessions maintain conversation context, enable multi-turn conversations, and persist across requests.
```sql
-- Sessions table stores conversation sessions
CREATE TABLE IF NOT EXISTS sessions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID REFERENCES agents(id),
    external_user_id TEXT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Explanation of each field:
-- id: Unique identifier for the session (UUID)
-- agent_id: Reference to the agent handling this session
-- external_user_id: Optional identifier for external user systems
-- metadata: Additional session metadata as JSON
-- created_at: Timestamp when session was created
```

Sessions enable context continuity. Messages within a session share context, so agents remember previous messages and build on past interactions. Sessions can be resumed after interruptions.

Messages Table

The messages table stores conversation messages. Each message belongs to a session, has a role (user or assistant), and contains text content. Messages may include tool calls and are ordered by timestamp.

```sql
-- Messages table stores conversation messages
CREATE TABLE IF NOT EXISTS messages (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID REFERENCES sessions(id),
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    tool_calls JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Explanation of each field:
-- id: Unique identifier for the message (UUID)
-- session_id: Reference to the session containing this message
-- role: Message role ('user' or 'assistant')
-- content: Text content of the message
-- tool_calls: JSON array of tool calls made by the agent
-- created_at: Timestamp when message was created
```

The role field indicates the message origin: user messages come from users, and assistant messages come from agents. Tool calls are stored in the tool_calls field, which records which tools were used along with their parameters and results.

Memory Chunks Table

The memory_chunks table stores agent memories. Memories are converted to embeddings.
Embeddings enable semantic search. Memories persist across sessions and improve agent responses over time.

```sql
-- Memory chunks table stores agent memories
CREATE TABLE IF NOT EXISTS memory_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID REFERENCES agents(id),
    session_id UUID REFERENCES sessions(id),
    content TEXT NOT NULL,
    embedding VECTOR(384),
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Explanation of each field:
-- id:         unique identifier for the memory chunk (UUID)
-- agent_id:   reference to the agent that owns this memory
-- session_id: reference to the session where the memory was created
-- content:    text content of the memory
-- embedding:  vector embedding of the content (384 dimensions)
-- metadata:   additional metadata as JSON (tags, importance, etc.)
-- created_at: timestamp when the memory was created
```

The embedding field stores vector representations. Embeddings are generated using language models and enable semantic similarity search. The VECTOR(384) type stores 384-dimensional vectors, matching the embedding model's output dimensions.

### Indexes for Performance

Indexes enable fast queries: vector indexes enable fast similarity search, and B-tree indexes enable fast lookups. Proper indexing is essential for performance.
```sql
-- Create vector index for memory search
CREATE INDEX IF NOT EXISTS idx_memory_embedding
ON memory_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- HNSW enables fast approximate nearest neighbor search:
--   m = 16:               connections per layer (higher = more accurate, slower)
--   ef_construction = 64: quality parameter during index construction
--   vector_cosine_ops:    uses cosine distance for similarity

-- Create B-tree indexes for fast filtering by agent_id and session_id,
-- essential for retrieving session messages and agent memories
CREATE INDEX IF NOT EXISTS idx_sessions_agent_id ON sessions(agent_id);
CREATE INDEX IF NOT EXISTS idx_messages_session_id ON messages(session_id);
CREATE INDEX IF NOT EXISTS idx_memory_agent_id ON memory_chunks(agent_id);
```

The HNSW index enables sub-10ms similarity search using cosine distance. The index parameters balance speed and accuracy: higher values improve accuracy but slow queries.

The schema provides complete agent infrastructure. Agents store configurations that define behavior, sessions track conversations for context continuity, messages store interactions for conversation history, memories enable long-term context through semantic search, and indexes ensure fast queries for production performance.

## Create Agent

Create a research assistant agent. The agent uses SQL tools to query documents, uses HTTP tools to fetch web content, and stores memories for future reference.

```python
import requests

# NeuronAgent API endpoint
BASE_URL = "http://localhost:8080"
API_KEY = "your-api-key-here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Create research assistant agent
agent_data = {
    "name": "research-assistant",
    "system_prompt": """You are a research assistant. Your role is to:
1. Answer questions using available tools
2. Retrieve relevant documents from the database
3. Synthesize information from multiple sources
4. Store important facts in memory for future reference
5. Provide accurate and cited responses

Always use SQL tools to query the document database.
Always use HTTP tools to fetch current information when needed.
Always store important facts in memory.""",
    "model_name": "gpt-4",
    "enabled_tools": ["sql", "http"],
    "memory_table": "memory_chunks",
    "config": {
        "temperature": 0.7,
        "max_tokens": 2000,
        "top_p": 0.95
    }
}

response = requests.post(
    f"{BASE_URL}/api/v1/agents",
    headers=headers,
    json=agent_data
)
agent = response.json()
print(f"Agent created: {agent['id']}")
```

The agent configuration defines behavior: the system prompt guides agent actions, the enabled tools specify available functions, and the memory table enables context storage.

## Create Session

Create a conversation session. Sessions track individual conversations, maintain message history, and enable context continuity.

```python
# Create session for the agent
session_data = {
    "agent_id": agent["id"],
    "external_user_id": "user-001",
    "metadata": {
        "topic": "research",
        "language": "en"
    }
}

response = requests.post(
    f"{BASE_URL}/api/v1/sessions",
    headers=headers,
    json=session_data
)
session = response.json()
print(f"Session created: {session['id']}")
```

Sessions isolate conversations: each user gets a separate session, sessions persist across requests, and multi-turn conversations become possible.

## Send Messages

Send messages to the agent. The agent processes each message, uses tools as needed, and generates a response.

```python
# Send a research query
message_data = {
    "content": "What are the key features of vector databases?",
    "role": "user"
}

response = requests.post(
    f"{BASE_URL}/api/v1/sessions/{session['id']}/messages",
    headers=headers,
    json=message_data
)
result = response.json()
print(f"Response: {result['response']}")
print(f"Tokens used: {result.get('tokens_used', 0)}")
```

The agent processes the query.
It uses SQL tools to query documents, retrieves relevant information, and generates a response.

## Tool Execution Example

The agent uses SQL tools to query documents. This example shows the tool execution flow.

```python
# The agent automatically uses SQL tools when needed.
# Example: the agent receives a query about vector databases
# and generates a SQL query:
query = """
SELECT chunk_text, doc_title, similarity
FROM (
    SELECT
        dc.chunk_text,
        d.title AS doc_title,
        1 - (dc.embedding <=> embed_text(
            'vector databases features',
            'sentence-transformers/all-MiniLM-L6-v2'
        )) AS similarity
    FROM document_chunks dc
    JOIN documents d ON dc.doc_id = d.doc_id
    ORDER BY dc.embedding <=> embed_text(
        'vector databases features',
        'sentence-transformers/all-MiniLM-L6-v2'
    )
    LIMIT 5
) results;
"""
# The agent executes the query via the SQL tool, the tool returns
# results, and the agent uses the results to generate a response.
```

Tool execution happens automatically: the agent identifies the needed information, selects appropriate tools, and formats the tool calls; the tools execute and return results.

## Memory Storage

The agent stores important facts in memory, which enables future context retrieval.

```python
# The agent automatically stores memories.
# Example: after answering about vector databases,
# the agent extracts key facts:
facts = [
    "Vector databases store high-dimensional embeddings",
    "HNSW indexes enable fast similarity search",
    "Vector databases support semantic search",
]
# The agent stores the facts in the memory_chunks table.
# Each fact is converted to an embedding, and the embeddings
# enable semantic retrieval later.
```

Memory storage happens automatically: the agent extracts important facts, converts them to embeddings, and stores the embeddings in the database.

## Memory Retrieval

The agent retrieves relevant memories for context. Memory retrieval uses semantic search.
```sql
-- Agent retrieves relevant memories for query context
WITH query_embedding AS (
    SELECT embed_text(
        'vector database features',
        'sentence-transformers/all-MiniLM-L6-v2'
    ) AS embedding
)
SELECT
    content,
    1 - (mc.embedding <=> qe.embedding) AS similarity
FROM memory_chunks mc
CROSS JOIN query_embedding qe
WHERE agent_id = 'agent-uuid-here'
ORDER BY mc.embedding <=> qe.embedding
LIMIT 5;
```

Memory retrieval finds relevant context: similarity search ranks memories, the top memories are added to the context, and that context improves response quality.

## Complete Example

This complete example shows agent usage from start to finish.

```python
#!/usr/bin/env python3
"""Complete NeuronAgent example: research assistant."""
import time

import requests

BASE_URL = "http://localhost:8080"
API_KEY = "your-api-key-here"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Step 1: Create agent
print("Creating research assistant agent...")
agent_data = {
    "name": "research-assistant",
    "system_prompt": """You are a research assistant. Answer questions
using SQL tools to query documents and HTTP tools to fetch current
information. Store important facts in memory.""",
    "model_name": "gpt-4",
    "enabled_tools": ["sql", "http"],
    "memory_table": "memory_chunks"
}
response = requests.post(f"{BASE_URL}/api/v1/agents", headers=headers, json=agent_data)
agent = response.json()
print(f"Agent created: {agent['id']}")

# Step 2: Create session
print("Creating session...")
session_data = {"agent_id": agent["id"]}
response = requests.post(f"{BASE_URL}/api/v1/sessions", headers=headers, json=session_data)
session = response.json()
print(f"Session created: {session['id']}")

# Step 3: Send research queries
queries = [
    "What are vector databases?",
    "How does semantic search work?",
    "What is the difference between HNSW and IVFFlat indexes?",
]
for query in queries:
    print(f"\nQuery: {query}")
    message_data = {"content": query, "role": "user"}
    response = requests.post(
        f"{BASE_URL}/api/v1/sessions/{session['id']}/messages",
        headers=headers,
        json=message_data
    )
    result = response.json()
    print(f"Response: {result['response'][:200]}...")
    print(f"Tokens used: {result.get('tokens_used', 0)}")
    time.sleep(1)

print("\nExample completed!")
```

The complete example demonstrates the full agent workflow: agent creation sets up capabilities, session creation starts a conversation, and message sending triggers agent execution. The agent uses tools and stores memories automatically.

## Advanced Patterns

Advanced patterns extend basic agent functionality. They include multi-agent systems, agent orchestration, and specialized agents.

Multi-agent systems use multiple agents. Each agent handles specific tasks; agents communicate through shared memory and coordinate through message passing.

Agent orchestration manages agent workflows. Orchestrators route tasks to agents, coordinate multi-step processes, and handle failures and retries.

Specialized agents focus on specific domains: research agents handle information retrieval, code agents handle programming tasks, and analysis agents handle data processing.

The advanced patterns diagram shows these system architectures: multi-agent systems show agent coordination, orchestration shows workflow management, and specialization shows domain-specific agents.

Advanced patterns enable scaling agent systems to handle complex, distributed scenarios. Multi-agent systems allow multiple specialized agents to work together on problems that exceed the capabilities of any individual agent. Orchestration patterns coordinate agent activities to ensure proper sequencing and resource management. Specialized agents can focus on specific domains, leveraging domain knowledge to deliver superior performance in their areas of expertise.
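The orchestration pattern described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not part of the NeuronAgent API: the task types, handler functions, and retry policy are assumptions.

```python
# Illustrative multi-agent orchestrator: routes tasks to specialized agents
# and retries transient failures. Task types and handlers are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Orchestrator:
    """Routes tasks to specialized agents and retries transient failures."""
    agents: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    max_retries: int = 2

    def register(self, task_type: str, handler: Callable[[str], str]) -> None:
        # Each specialized agent handles one task type.
        self.agents[task_type] = handler

    def run(self, task_type: str, payload: str) -> str:
        handler = self.agents.get(task_type)
        if handler is None:
            raise ValueError(f"no agent registered for task type {task_type!r}")
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return handler(payload)
            except RuntimeError as exc:  # treat RuntimeError as transient
                last_error = exc
        raise RuntimeError(f"task failed after retries: {last_error}")


# Register two specialized agents (stubs standing in for real agents).
orchestrator = Orchestrator()
orchestrator.register("research", lambda q: f"research result for: {q}")
orchestrator.register("analysis", lambda q: f"analysis of: {q}")
print(orchestrator.run("research", "vector databases"))
```

In a production system the handlers would call real agent endpoints, and the shared-memory and message-passing coordination described above would replace the in-process dictionary.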
These patterns enable building sophisticated agent ecosystems that can tackle enterprise-scale challenges.

## Production Considerations

### Performance Optimization

Agent performance depends on several factors: planning time affects response latency, tool execution time affects task duration, and memory retrieval time affects context loading.

Optimization strategies include caching. Query embeddings, tool results, and memory retrievals are cached, reducing computation time.

Index optimization improves memory search. HNSW indexes enable fast similarity search; index parameters affect query performance, and regular index maintenance helps ensure optimal performance.

```sql
-- Monitor memory search performance
SELECT
    COUNT(*) AS total_memories,
    AVG(vector_dims(embedding)) AS avg_dimensions,
    pg_size_pretty(pg_total_relation_size('memory_chunks')) AS table_size
FROM memory_chunks;

-- Check index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'memory_chunks';
```

Performance monitoring tracks system health: query statistics show usage patterns, index statistics show search efficiency, and size statistics show storage requirements.

### Security Considerations

Agent security requires careful design. Tool execution must be sandboxed, SQL queries must be restricted, HTTP requests must be validated, and code execution must be isolated.

Security measures include authentication: API keys authenticate requests, rate limiting prevents abuse, and role-based access controls govern permissions.

Tool security includes validation: SQL tools are limited to read-only queries, HTTP tools validate URLs, code tools restrict file access, and shell tools restrict commands.

```sql
-- Example: restrict the SQL tool to read-only access
CREATE ROLE agent_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO agent_user;
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM agent_user;
```

Security configuration limits agent capabilities.
Read-only access prevents data modification, URL validation prevents malicious requests, and command restrictions prevent system access.

### Monitoring and Observability

Monitoring tracks agent behavior. Metrics include request counts, response times, tool usage, and error rates; logs record execution details, and traces show request flows.

Key metrics include latency: planning latency measures plan generation time, execution latency measures tool call time, memory latency measures retrieval time, and total latency measures end-to-end time.

Error tracking identifies issues: failed tool calls are logged, planning failures are recorded, memory retrieval errors are tracked, and state machine errors are monitored.

```sql
-- Track agent metrics
CREATE TABLE agent_metrics (
    id SERIAL PRIMARY KEY,
    agent_id UUID,
    metric_name TEXT,
    metric_value NUMERIC,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Log tool executions
CREATE TABLE tool_executions (
    id SERIAL PRIMARY KEY,
    agent_id UUID,
    tool_name TEXT,
    execution_time_ms INTEGER,
    success BOOLEAN,
    error_message TEXT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

Metrics tables track system performance: agent metrics show usage patterns, tool metrics show execution efficiency, and error metrics show failure rates.

## Conclusion

Agentic AI systems enable autonomous task execution. Agents plan multi-step tasks, use tools to interact with systems, store contextual memories, and manage state across sessions.

This guide explained agent architecture. It covered planning systems, tool execution, memory management, and state machines, and it provided implementation examples using NeuronDB and NeuronAgent. NeuronDB provides vector search for memory systems; NeuronAgent provides the agent runtime infrastructure. Together, they enable production agent systems.

Use agents for:

- Complex tasks requiring multiple steps
- Tasks requiring external tool access
- Tasks requiring long-term memory
- Tasks requiring autonomous operation
Multimodal AI — systems that understand and generate combinations of text, images, audio, and video — is exploding from labs into production. These workloads are heavier, spikier, and more stateful than traditional microservices; they demand heterogeneous accelerators, memory-hungry models, high-throughput storage, and event-driven data plumbing. Kubernetes sits squarely at the center of this shift. Done right, Kubernetes provides the primitives to compose multimodal pipelines, right-size GPU capacity, and automate end-to-end lifecycles from training to real-time inference.

This article goes deep on the architectural building blocks, production patterns, and concrete platform tactics to future-proof your Kubernetes stack for multimodal AI — without hard-wiring to a single framework or vendor.

## Why Multimodal Workloads Challenge Conventional Clusters

| Aspect | Traditional AI Workload | Multimodal AI Workload |
| --- | --- | --- |
| Input type | Text-only | Text, images, audio, video |
| Model composition | Single model | Multiple chained models (OCR, ASR, vision encoder, LLM) |
| Hardware requirements | Uniform GPU | Mixed GPUs, CPUs, TPUs |
| Scheduling pattern | Stateless, synchronous | Stateful, asynchronous |
| Data flow | Batch or REST | Streaming, event-driven |
| Scaling needs | Predictable | Highly bursty |

### Why Multimodal Changes the Game

Multimodal systems don't just "run a bigger model." They orchestrate graphs of models and pre/post-processing steps:

- Text encoder/decoder + image encoder + vision-language fusion + ASR/TTS stages
- Content safety and grounding filters in the loop
- Vectorization and retrieval for long-context reasoning
- Optional video chunking, OCR, or speech diarization

These DAGs run across CPUs, GPUs, and sometimes DPUs. Some steps are high-latency batch jobs (e.g., fine-tuning), while others are ultra-low-latency online inference (e.g., chat completion with image context). The result: you need heterogeneous scheduling, burst scaling, smart batching, and streaming/eventing — with good-old containers as the portability anchor.
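The fan-out/fan-in shape of such a graph can be sketched with stub stages. All stage names and outputs here are illustrative placeholders; in a real cluster each stage would be a service running on appropriate hardware.

```python
# Toy sketch of a multimodal DAG: per-modality stages fan out, then fan in
# to a fusion stage, followed by a safety filter. All functions are stubs.

def ocr(image: str) -> str:
    return f"ocr({image})"

def asr(audio: str) -> str:
    return f"asr({audio})"

def vision_encode(image: str) -> str:
    return f"vision({image})"

def safety_filter(text: str) -> str:
    # content-safety stage in the loop
    return text if "blocked" not in text else "[filtered]"

def fuse_and_generate(parts) -> str:
    # stand-in for the vision-language model consuming upstream outputs
    return "answer from: " + " + ".join(parts)

# Fan-out to modality stages, fan-in to the VLM, then a safety pass.
parts = [ocr("scan.png"), asr("call.wav"), vision_encode("scan.png")]
print(safety_filter(fuse_and_generate(parts)))
```

The point of the sketch is the shape, not the stubs: the modality branches can run in parallel on different hardware classes, while the fusion stage imposes a synchronization point that the scheduler must respect.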
Kubernetes now has a mature ecosystem to meet these needs. Let's break it down.

## GPU Foundations: Device Plugins, Operators, and Partitioning

**Expose accelerators as first-class Kubernetes resources.** The NVIDIA device plugin advertises GPUs to the kubelet so Pods can request GPU resources. It's battle-tested and integrates GPU Feature Discovery to label nodes with GPU capabilities for smarter scheduling.

**Automate the driver/runtime stack with the GPU Operator.** Instead of bespoke AMIs or snowflake DaemonSets, the GPU Operator installs and maintains the entire CUDA stack (drivers, container toolkit, monitoring). Cloud providers like GKE document how to enable it cleanly so clusters stay patchable.

**Right-size GPUs with MIG.** Multi-Instance GPU (MIG) on A100/H100-class cards lets you slice a single card into isolated GPU instances — great for running many small models or multi-tenant inference. Kubernetes supports MIG via the GPU Operator and MIG manager, including the necessary driver/runtime prerequisites. This is a critical building block for packing multimodal micro-models (safety filters, OCR, ASR) onto a few cards while reserving full GPUs for your primary VLM/LLM.

## Scheduling for AI: Batch, Online, and Everything In-Between

The general-purpose kube-scheduler is excellent for stateless services, but multimodal AI brings gang scheduling, queueing, and topology constraints. Two patterns dominate:

**Batch/elastic scheduling with Volcano (and friends like Kueue/YuniKorn).** Volcano introduces job queues, gang scheduling (all pods start together), preemption policies, and GPU-aware bin packing to boost utilization and reduce starvation across training, fine-tuning, and large batched preprocessing. Volcano's unified scheduling approach can govern both online and offline jobs to simplify cluster operations, and NVIDIA highlights bin-packing strategies to avoid GPU fragmentation — vital when mixing MIG slices with full-GPU jobs.
**Ray on Kubernetes for distributed Python, serving, and autoscaling.** Ray adds a cluster-level runtime for Python operators, data processing, and parallel inference. Ray Serve scales replicas based on queue depth; KubeRay integrates with Kubernetes, so cluster nodes and Ray workers expand and contract automatically. For multimodal pipelines, Ray excels at fan-out/fan-in steps (e.g., frame chunking, multi-stage vision preprocessing) before handing off to a model server.

Takeaway: In production, you'll often combine both — Volcano for large, scheduled jobs and KubeRay for elastic online/nearline micro-pipelines.

## Serving at Scale: KServe, Triton, and ModelMesh

KServe is the de facto model-serving API on Kubernetes, with pluggable runtimes including TensorFlow Serving, NVIDIA Triton, vLLM/Hugging Face, XGBoost/LightGBM, and more. It standardizes REST/gRPC inference protocols and request/response schemas, and it can hook into event sources like Kafka.

NVIDIA Triton Inference Server is a high-performance runtime that runs models from multiple frameworks (TensorRT, PyTorch, ONNX, Python backends, etc.) and supports parallel execution across multiple model instances on the same system. For multimodal pipelines, Triton's ensemble models stitch pre/post-processing and inference stages together server-side to cut network hops and latency. Pair that with the TensorRT-LLM backend (in-flight batching, paged attention) for LLM/VLM efficiency.

ModelMesh (via KServe or Red Hat OpenShift AI) enables multi-model, high-density serving. It lazily loads and unloads models based on demand, acting like a distributed LRU cache to keep the memory footprint sane. This is ideal when your multimodal app dynamically picks models (OCR variants, language-specific ASR, domain safety classifiers) per request.

Pattern: For low-latency, high-TPS endpoints, define a KServe InferenceService with the Triton runtime (or vLLM for text). For "many small models" (hundreds or more), add ModelMesh.
For very custom Pythonic pre/post pipelines, consider Ray Serve or Triton ensembles, depending on where you want the DAG to live.

## Pipeline Orchestration: Kubeflow Pipelines

Training, evaluation, distillation, and dataset curation for multimodal systems are workflows repeated hundreds of times. Kubeflow Pipelines (KFP) packages each step as a containerized component and wires them into a pipeline DAG with typed inputs/outputs, caching, and lineage. Because KFP runs natively on Kubernetes, it inherits your cluster's GPU scheduling (e.g., Volcano) and security posture.

Tip: Treat KFP as the CI/CD of your models — compile pipelines from code, parameterize datasets and model versions, and promote artifacts to staged registries for serving via KServe.

## Eventing and Streaming: Knative + Kafka

Multimodal inference often depends on events: "new image in S3/MinIO," "new call transcript," or "moderation request." With Knative Eventing and the Kafka Broker, you can wire CloudEvents to KServe services asynchronously — buffering spikes, decoupling producers and consumers, and routing by content (e.g., route audio to ASR, images to OCR). You get isolated data planes and efficient conversions from CloudEvents to Kafka records with first-class Broker/Trigger APIs.

Impact: Asynchrony is a superpower for multimodal workloads — when paired with autoscaling consumers (KServe, Ray Serve), the platform can absorb traffic bursts without over-provisioning GPUs. Real-world write-ups show how teams retrofit synchronous HTTP inference to async pipelines with Knative + KServe — no model code changes required.

## A Reference Architecture (Production-Ready)

- Cluster and GPU layer: Managed Kubernetes (GKE/AKS/EKS/on-prem); NVIDIA GPU Operator + device plugin; MIG enabled where appropriate; node pools sized by job class.
- Scheduling and autoscaling: Volcano for training/batch; KubeRay for elastic Python/Ray micro-pipelines; HPA/KPA or Ray autoscaling for services; bin-packing policies to curb fragmentation.
- Model serving: KServe runtimes — Triton for ensembles/multi-framework, vLLM/HF for LLMs, ModelMesh for high-density multi-model serving.
- Pipelines: Kubeflow Pipelines for train/eval/distill; artifact stores on MinIO/S3 plus a model registry; promotion gates into serving namespaces.
- Eventing and streaming: Knative Eventing + Kafka Broker; content-based routing to services; async DLQs/retries; S3/MinIO notifications.
- Observability and SLOs: GPU/DCGM metrics, request-level tracing, per-model latency/throughput, batch queue depth, autoscaler decisions, and GPU occupancy dashboards.

## Future-Proofing Tactics for Multimodal Workloads

- **Design for "many models," not "one big model."** Even if you start with a single VLM, you'll add safety, OCR, ASR, and domain adapters. Adopt ModelMesh early to avoid monolithic GPU servers that can't scale down; it gives you lazy loading and intelligent eviction to match real traffic patterns.
- **Keep DAGs close to compute.** When pre/post is simple and repeatable, push it into Triton ensembles to eliminate network hops. For complex Pythonic steps or cross-service fan-out, use Ray Serve or an evented KServe pipeline. Triton's ensemble scheduler reduces round-trips and can improve tail latency for multimodal chains.
- **Treat GPUs like a shared, multi-tenant fabric.** Enable MIG where feasible; consolidate small models onto shared slices and reserve full GPUs for heavy LLM/VLM decoders. Pair this with Volcano's bin packing to minimize fragmentation and keep entire GPUs free for big jobs.
- **Autoscale on real signals.** For online inference, scale on queue length and concurrency rather than CPU utilization (a poor proxy for GPU load). Ray Serve and KServe both support autoscaling driven by pending requests/queue depth — crucial for prompt-driven traffic spikes.
- **Make async the default.** Use Knative + Kafka to absorb spiky traffic, apply backpressure, and decouple producers. Route events to the right modality services and apply retry/timeout policies centrally. This reduces the need to over-provision GPUs "just in case."
- **Standardize protocol surfaces.** Adopt KServe's standardized inference protocols; Triton natively speaks those APIs, so clients can switch runtimes with minimal changes — a key portability hedge as the model landscape evolves.
- **Bake in model lifecycle from day one.** Define Kubeflow Pipelines for everything: data ingestion, evaluation, red-team tests, quantization, LoRA merging, and regression baselines. Make promotion to serving an automated gate, not a ticket.

## Cost, Reliability, and Compliance: What Actually Bites in Prod

- **Cost:** GPU idling is the silent killer. MIG + bin packing + multi-model serving let you run 10–50 "support models" on a few cards. ModelMesh's lazy loading means you only pay for resident models, not all possible variants.
- **Reliability:** Tail latency comes from chatter between steps. Collapse steps with Triton ensembles where possible and prefer intra-Pod pipes (localhost or shared memory) over network round-trips.
- **Scalability:** Plan for hundreds to thousands of models. Namespacing, per-team CRDs, and quotas prevent noisy neighbors. KServe + ModelMesh impose consistent control planes as teams grow.
- **Security/compliance:** Container SBOMs for runtimes, signed model artifacts, and network policies that fence GPUs off from the broader mesh. Event streams (Kafka) act as auditable rails for content-moderation events.
- **Portability:** Favor open APIs (KServe) and open runtimes (Triton, vLLM, Ray). You can run the same manifests across clouds and on-prem clusters without refactoring application code.

| Optimization Strategy | Expected Benefit |
| --- | --- |
| MIG partitioning | +50–70% GPU utilization |
| Ray autoscaling | −30% cost at low load |
| Triton ensembles | −40% latency |
| ModelMesh lazy loading | −60% memory footprint |

## Putting It Together: A Multimodal Inference Blueprint

Use case: A chat assistant that accepts images, returns text plus optional speech, and applies safety filters.
- **Ingress and eventing:** HTTP uploads land in an object store; events flow via the Knative Kafka Broker to a routing service that inspects modality metadata and emits specialized events (OCR, ASR, vision encoder).
- **Pre/post on GPU:** A Triton ensemble hosts image preprocessing → encoder → adapter as a single logical model to reduce latency; ASR runs as a separate process with batch windows and a VAD pre-step.
- **Core LLM/VLM:** vLLM/TensorRT-LLM backend via KServe for fast token throughput, with in-flight batching and paged attention enabled.
- **Safety and grounding:** Lightweight classifiers served via ModelMesh, so dozens of domain filters stay "nearby" without permanent residency.
- **Autoscaling and scheduling:** Ray Serve scales the OCR/ASR micro-pipelines on queue depth; Volcano schedules nightly fine-tuning and evaluation sweeps; MIG slices host the small filters; full GPUs serve the VLM.
- **Observability:** Per-model latency, GPU utilization, occupancy, and load/unload churn (ModelMesh) are core SLOs. Alerts trigger on queue backlog and ensemble step anomalies.

## What to Pilot in the Next 30 Days

1. Enable the GPU Operator and MIG on a small node pool; validate resources and run a smoke test with two small models plus one large model.
2. Stand up KServe with Triton and deploy a two-stage ensemble (preprocess → model). Measure P50/P99 versus separate microservices.
3. Layer ModelMesh on a canary namespace and deploy 50+ tiny classifiers; watch memory residency and cold-start hit rates under synthetic traffic.
4. Introduce the Knative Kafka Broker and convert one synchronous endpoint to event-driven. Compare GPU hours before and after under bursty loads.
5. Adopt Volcano for your nightly training/eval jobs; configure priority classes and bin packing to reduce stranding.
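The "autoscale on real signals" tactic discussed earlier reduces to a simple control computation over queue depth and in-flight requests. The sketch below is illustrative only — the parameter names and formula are assumptions for exposition, not Ray Serve's or KServe's actual autoscaler implementation.

```python
# Illustrative replica-count calculation for queue-depth-driven autoscaling.
# Scale on outstanding work (queued + in-flight requests), not CPU utilization,
# which is a poor proxy for GPU load.
import math

def desired_replicas(queue_depth: int, in_flight: int,
                     target_per_replica: int = 4,
                     min_replicas: int = 1,
                     max_replicas: int = 32) -> int:
    """Return a replica count proportional to outstanding requests, clamped."""
    outstanding = queue_depth + in_flight
    replicas = math.ceil(outstanding / target_per_replica)
    return max(min_replicas, min(max_replicas, replicas))

print(desired_replicas(queue_depth=30, in_flight=10))  # burst: scales out to 10
print(desired_replicas(queue_depth=0, in_flight=1))    # quiet: stays at the floor of 1
```

The clamp bounds protect against runaway scale-out during spikes and scale-to-zero flapping during lulls; real autoscalers add smoothing windows and scale-down delays on top of this core calculation.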
## Closing: Kubernetes as the Multimodal Substrate

The winners in multimodal AI won't be the teams with the single "fastest" model; they'll be the teams with a composable, portable, and efficient workflow that can absorb new models, new modalities, and new traffic patterns without re-platforming. Kubernetes gives you that substrate — if you lean into the ecosystem: GPU Operator and MIG for resource fidelity; Volcano and Ray for smart scheduling and elastic Python; KServe, Triton, and ModelMesh for serving at scale; Kubeflow Pipelines for continuous model operations; and Knative + Kafka for event-driven resilience.

Build around open protocols (KServe), open runtimes (Triton/Ray), and portable manifests. Doing so not only solves today's multimodal demands — it future-proofs your platform for whatever the next wave (video-native agents, audio-first copilots, on-device edge fusion) throws at you.

## References

- NVIDIA GPU Operator and MIG support for Kubernetes (drivers/runtime, MIG manager)
- Volcano unified scheduling and GPU bin-packing strategies
- Ray Serve autoscaling and KubeRay on Kubernetes
- KServe runtimes, Triton's KServe protocol compatibility, and ensembles
- ModelMesh for high-density, multi-model serving
When organizations talk about adopting large language models, the conversation usually starts with model choice: GPT versus Claude, open source versus proprietary, bigger versus cheaper. In real enterprise systems, that focus is misplaced.

Production success with LLMs depends far more on architecture discipline than on the model itself. What separates a fragile demo from a resilient, governable system is mastery of a small set of core engineering skills. These skills shape how models are instructed, grounded, deployed, observed, and evolved over time.

This article discusses eight such skills from the perspective of building real systems, not experimenting in notebooks. Each section explains why the skill matters, when it should be applied, and how it fits into a clean enterprise architecture.

## Prompt Engineering

Prompt engineering is the foundation layer of any LLM system. It translates human intent into precise, structured instructions that a model can execute reliably. In production environments, prompts are not handwritten strings; they are assembled programmatically using templates, roles, constraints, examples, and safety rules.

Strong prompt engineering reduces hallucinations, improves consistency, and often delays the need for more complex approaches such as fine-tuning or agents. Poor prompts, on the other hand, amplify variability and force teams to compensate with brittle downstream logic. In mature systems, prompts are versioned, tested, and reviewed just like application code. This discipline is what allows teams to change models without rewriting business logic.

## Context Engineering

Context engineering determines what information the model sees at inference time. Instead of overloading a single prompt with everything, systems dynamically assemble relevant context from memory stores, structured databases, documents, and APIs. This is where enterprise reliability truly begins. Context engineering is deterministic and auditable.
You can explain why a model responded the way it did because you know exactly what data it was given. Teams that skip this step often rely on the model to infer missing information. That approach may work in demos but fails under regulatory scrutiny or operational scale. Context engineering turns LLMs from probabilistic guessers into controlled reasoning components.

## Fine-Tuning

Fine-tuning modifies the model itself so that the desired behavior is internalized rather than instructed repeatedly. It is most effective when the same task repeats at scale, such as classification, extraction, or domain-specific reasoning.

The tradeoff is flexibility. Fine-tuned models are harder to change and require disciplined data governance: training data must be curated, versioned, and reviewed for bias and drift. In enterprise settings, fine-tuning should be a deliberate optimization step, not the default starting point. Many teams fine-tune prematurely when prompt and context engineering would have been sufficient.

## Retrieval-Augmented Generation

Retrieval-augmented generation, or RAG, grounds model outputs in external knowledge. Instead of trusting what the model remembers, the system retrieves relevant information at runtime and injects it into the prompt.

This pattern dominates enterprise adoption because it balances accuracy, freshness, and explainability. Knowledge can be updated without retraining models, and responses can be traced back to source documents. Well-designed RAG systems treat retrieval as a first-class concern: chunking strategy, embedding choice, ranking logic, and context assembly all materially affect outcome quality.

## Agents

Agents introduce autonomy. An agent does not simply respond to input; it reasons, plans, calls tools, evaluates results, and iterates until a goal is achieved. This capability is powerful and dangerous if misapplied. Agents are best suited for workflows such as multi-step analysis, orchestration, and decision support.
They are poorly suited for factual retrieval or compliance-sensitive outputs. Common failure modes include infinite loops, tool hallucination, runaway cost, and unpredictable behavior. In enterprise systems, agents must be constrained with explicit goals, step limits, tool allow lists, and strong observability. Autonomy without guardrails is not intelligence. It is a risk.

LLM Deployment

Deployment turns models into dependable services. This layer handles routing, scalability, authentication, authorization, and versioning. A clean deployment architecture allows teams to swap models without forcing application changes. In enterprise environments, deployment also defines security boundaries. It determines where data flows, how requests are logged, and how failures are isolated. Treating LLMs as just another API dependency is a mistake. They are probabilistic systems that require careful exposure and lifecycle management.

LLM Optimization

Optimization ensures performance and cost efficiency at scale. This includes caching frequent responses, compressing context, routing requests to different models, and applying techniques such as quantization. Optimization is often invisible to end users but critical to sustainability. Without it, even well-designed systems become prohibitively expensive as usage grows. Teams should treat optimization as an ongoing discipline rather than a one-time exercise. Usage patterns evolve, and so should optimization strategies.

LLM Observability

Observability provides visibility into prompts, responses, latency, cost, and failure modes. Without it, LLM systems are effectively ungovernable. In regulated industries, observability is not optional. Teams must be able to trace outputs, audit decisions, and detect drift or misuse. Effective observability combines tracing, metrics, and structured logging. It allows teams to debug behavior, enforce policy, and continuously improve system quality.
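The agent guardrails described earlier, explicit step limits and tool allowlists, can be sketched as a thin wrapper around the agent loop. All names below are hypothetical; this is a sketch of the pattern under those assumptions, not a particular framework's API:

```typescript
// Minimal sketch of agent guardrails: a tool allowlist plus a hard step limit.
type ToolCall = { tool: string; args: unknown };

export function runGuardedAgent(
  planNext: (step: number) => ToolCall | null, // null = agent declares the goal reached
  executeTool: (call: ToolCall) => string,
  opts: { allowedTools: string[]; maxSteps: number }
): { trace: string[]; halted: string } {
  const trace: string[] = [];
  for (let step = 0; step < opts.maxSteps; step++) {
    const call = planNext(step);
    if (call === null) return { trace, halted: "goal-reached" };
    if (!opts.allowedTools.includes(call.tool)) {
      // Allowlist check: the agent may only touch pre-approved tools.
      return { trace, halted: `blocked-tool:${call.tool}` };
    }
    trace.push(executeTool(call)); // trace doubles as an audit log
  }
  return { trace, halted: "step-limit" }; // prevents infinite loops and runaway cost
}
```

The trace and halt reason feed directly into the observability layer: every run ends with a machine-readable explanation of why it stopped.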
Below is the end-to-end reference architecture diagram:

Reference Architecture Explanation

1. Prompt Engineering

The foundation of the system is prompt engineering, where user intent is transformed into structured instructions. In production, prompts are assembled programmatically using templates, system roles, constraints, and examples. Tools like LangChain and LlamaIndex allow for modular, reusable prompt templates, improving determinism and reducing hallucinations.

2. Context Engineering

Context engineering ensures the model sees the right information at inference time. Instead of embedding all data in a single prompt, the system dynamically assembles relevant context from multiple sources. This includes memory databases such as Redis or DynamoDB, document stores such as S3 or Blob Storage, structured enterprise data such as Postgres or Snowflake, and vector stores such as Pinecone, Weaviate, or FAISS. The context builder ranks and filters data to provide deterministic, auditable input to the model.

3. Fine-Tuning

Fine-tuning customizes models to internalize behaviors for repeated tasks. This is essential for domain-specific tasks such as classification, extraction, or reasoning at scale. Fine-tuned models are implemented using platforms like Hugging Face or SageMaker and provide consistency at the cost of flexibility.

4. Retrieval-Augmented Generation

RAG ensures outputs are grounded in external knowledge rather than relying solely on what the model remembers. The system retrieves and embeds relevant information from vector stores, document repositories, enterprise databases, and memory layers into prompts. This balances accuracy, freshness, and explainability, forming a critical part of enterprise reliability.

5. Agents and Tooling

Agents orchestrate autonomous reasoning and task execution. They decide, iterate, and call external tools to achieve a goal. Enterprise tools include search, SQL queries, Python scripts, or APIs.
Agents provide a structured workflow layer above raw inference, enabling complex multi-step operations while keeping control and auditability intact.

6. Model and Inference

This layer manages the execution of foundation and fine-tuned models. A model router selects the appropriate model based on cost, latency, or other criteria. Foundation models such as GPT 4.x, Claude, or Gemini handle general tasks, while fine-tuned models execute domain-specific operations. This layer turns the model into a dependable service that can scale and evolve without changing application logic.

7. Optimization

Optimization ensures performance and cost efficiency. Techniques include response caching with Redis, context compression using summarization and chunking, and model quantization with INT8 or INT4 representations. These optimizations are invisible to users but crucial for sustainability at scale.

8. Observability and Governance

The final layer provides visibility, traceability, and monitoring. Tools like OpenTelemetry and LangSmith trace prompt and model activity. Metrics and cost tracking are handled via Prometheus or Datadog. Logs from prompts and model outputs are collected in ELK or CloudWatch, and dashboards such as Grafana provide a comprehensive view for engineers and decision makers. Observability enables governance, auditing, and operational reliability.

Conclusion

When you understand these eight skills and how they compose, you stop thinking in terms of models and start thinking in systems. That shift is what turns LLM adoption from experimentation into real engineering leadership.
AI agents have taken the world by storm and are making positive gains in all domains such as healthcare, marketing, software development, and more. The chief reason for their prominence lies in being able to automate routine tasks with intelligence. For example, in software development, stories and bugs have automated tracking in tools such as GitHub, Rally, and Jira; however, this automation lacks intelligence, often requiring engineers and project managers to triage them. Using an AI agent, as you will learn in this article, smart triaging can be carried out using generative AI.

AI agents can be developed using many techniques and in several programming languages. Python has been a leader in the AI and ML space, whereas JavaScript has been the undisputed king in web development and has been prominent in back-end development as well. Historically, popular AI agent development frameworks have had their roots in Python, but their JavaScript ports have become mature in the recent past. This emergence allows a large number of JavaScript engineers to create their own AI agents without switching stacks. This article focuses on this shift, showing you how to develop an AI agent using JavaScript. Before you build the agent, though, it is important to understand the differences between an AI agent and a pure LLM.

Basic Anatomy of an Agent

An AI agent is capable of going beyond just querying an LLM for an answer. Before returning an answer, the agent takes several autonomous actions:

- Unlike an LLM, an agent can follow tasks via a loop (known as ReAct) where it can plan, observe, and rerun to get the output of a task before a final answer.
- An agent can query data, run commands on a terminal, call an API, and more. This feature is known as tool calling.
- Furthermore, an agent remembers details through extended memory and can build a longer working memory by leveraging vector databases.
- The agent can handle failures and initiate smart error handling and self-corrections.
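This plan, act, and observe loop can be illustrated with a toy example. No real LLM is involved, and every name below is a hypothetical stand-in; the point is only the shape of the loop:

```typescript
// A toy plan->act->observe loop with working memory (no real LLM; names are hypothetical).
type AgentState = { goal: string; memory: string[]; done: boolean };

// Stand-in "planner": a real agent would ask the LLM which tool to call next.
function plan(state: AgentState): { tool: "fetch" | "summarize" } {
  return state.memory.length === 0 ? { tool: "fetch" } : { tool: "summarize" };
}

const tools = {
  fetch: (goal: string) => `raw data about ${goal}`,
  summarize: (memory: string[]) => `summary of: ${memory.join("; ")}`,
};

export function runToyAgent(goal: string, maxSteps = 5): AgentState {
  const state: AgentState = { goal, memory: [], done: false };
  for (let i = 0; i < maxSteps && !state.done; i++) {
    const { tool } = plan(state); // plan
    if (tool === "fetch") {
      state.memory.push(tools.fetch(goal)); // act, then observe: store the result
    } else {
      state.memory.push(tools.summarize(state.memory));
      state.done = true; // final answer produced, loop terminates
    }
  }
  return state;
}
```

The real agent you build below delegates the planning step to an LLM and the acting step to a GitHub tool, but the control flow is the same.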
Please refer to the illustration below to understand an AI agent. Please note that task status is technically part of the agent only, but here it is used to clearly highlight the ReAct loop.

Image: Basic Anatomy of an Agent

In the next section, you will implement an actual AI agent to understand each of these parts in more detail.

Scaffolding a New Project

You are going to build this project using TypeScript, which is a superset of JavaScript. You get all the freedom of JavaScript while gaining extra checks to maintain structure. Additionally, TypeScript helps avoid common JavaScript errors caused by a lack of typing, a feature that most modern programming languages support.

Create a new project by running the npm init command in your terminal. This will initialize the project and generate a package.json file in the project root. Replace the content of your package.json with the following:

JSON

{
  "name": "github-agent",
  "version": "1.0.0",
  "type": "module",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "npx tsx src/agent.ts"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "@langchain/community": "^1.1.1",
    "@langchain/core": "^1.1.8",
    "@langchain/google-genai": "^2.1.3",
    "@langchain/openai": "^1.2.0",
    "@octokit/core": "^7.0.6",
    "dotenv": "^16.6.1",
    "langchain": "^1.1.0",
    "octokit": "^5.0.5"
  },
  "devDependencies": {
    "@types/node": "^25.0.3",
    "ts-node": "^10.9.2",
    "typescript": "^5.9.3"
  }
}

There are several dependencies defined in the package.json:

- LangChain family: Core dependencies that provide the framework for building your agent.
- Octokit: Leveraged to authenticate with GitHub when the agent needs to access current issues.
- Dotenv: Helps you parameterize the project by managing sensitive API keys.
- Dev dependencies: Mainly focus on converting TypeScript code to JavaScript for final execution.

Run npm install to actually install these dependencies.
Before writing the logic, create a .env file in your root directory. This is where you store sensitive credentials. I have included an .env.example file for your reference. You can rename it to .env and replace details with your personal details. Ensure to never commit this file.

Plain Text

GOOGLE_API_KEY=<GOOGLE-API-KEY>
GITHUB_TOKEN=<GITHUB_TOKEN>

With these settings, you are now ready to build the actual agent.

Creating the GitHub Agent

Step 1: Build the Tool

The first thing you will need is the resources the agent is going to need to communicate with GitHub to fetch the issues. These resources are known as tools, as you learned earlier. Create a directory in the root of the project and name it src. Inside the src directory, create another directory named tools. Inside this tools directory, create a file named github.ts and add the code below:

TypeScript

import { tool } from "@langchain/core/tools";
import { Octokit } from "@octokit/core";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

export const fetchRepoIssues = tool(
  async ({ owner, repo, count = 5 }) => {
    try {
      console.log(`[SYSTEM]: Fetching live issues (excluding PRs) from ${owner}/${repo}...`);
      const { data } = await octokit.request("GET /repos/{owner}/{repo}/issues", {
        owner,
        repo,
        state: "open",
        per_page: count * 2,
      });
      const onlyIssues = data
        .filter((item: any) => !item.pull_request)
        .slice(0, count);
      if (onlyIssues.length === 0) return `No open issues found in ${owner}/${repo}.`;
      const digest = onlyIssues.map((issue: any) => {
        return `Issue #${issue.number}: "${issue.title}"
Author: ${issue.user?.login}
Labels: ${issue.labels.map((l: any) => l.name).join(", ") || "None"}
Body: ${issue.body?.substring(0, 300) || "No description provided"}...`;
      }).join("\n\n---\n\n");
      return `Successfully fetched ${onlyIssues.length} issues from ${owner}/${repo}:\n\n${digest}`;
    } catch (error: any) {
      return `Error fetching GitHub issues: ${error.message}`;
    }
  },
  {
    name: "fetch_repo_issues",
    description: "Fetches live open issues (excluding PRs) from a GitHub repository.",
    schema: {
      type: "object",
      properties: {
        owner: { type: "string", description: "The repo owner (e.g., 'facebook')" },
        repo: { type: "string", description: "The repo name (e.g., 'react')" },
        count: { type: "number", description: "Number of real issues to return" }
      },
      required: ["owner", "repo"]
    }
  }
);

- You are using the Octokit library to authenticate with GitHub before actual issues from a repository can be fetched.
- Furthermore, you are filtering out pull requests (which the GitHub issues API returns alongside issues) by skipping items that have an associated pull_request field, so that only true issues remain.
- Additionally, you are trimming the issue body to 300 characters, which provides the agent with enough context about the issue while keeping LLM token usage limited and cost-effective.
- You are also creating the schema that the agent will leverage when calling the tool, so that the interaction between the agent and the tool can be more deterministic.

Step 2: Build the Agent

Now it is time to build the agent. Create a file named agent.ts inside the src directory and add the code below to it:

TypeScript

import "dotenv/config";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { fetchRepoIssues } from "./tools/github.js";

const model = new ChatGoogleGenerativeAI({
  model: "gemini-3-flash-preview",
  temperature: 0,
});

const SYSTEM_PROMPT = `
You are a Senior GitHub Maintainer. Your goal is to help users understand the status of their repositories.
When asked about issues, use the fetch_repo_issues tool.
Always provide a technical summary of the top issue you find.
`;

const agent = createReactAgent({
  llm: model,
  tools: [fetchRepoIssues],
  messageModifier: SYSTEM_PROMPT,
});

async function runAuditor() {
  console.log("--- Starting GitHub Auditor ---");
  const response = await agent.invoke({
    messages: [
      { role: "user", content: "What are the latest issues in the facebook/react repo?" }
    ],
  });
  const lastMessage = response.messages[response.messages.length - 1];
  console.log("\nMaintainer's Report:");
  console.log(lastMessage.content);
}

runAuditor();

This code uses the LangChain framework to create the agent. The first step is to tell the agent which LLM model to work with. The code above uses the gemini-3-flash-preview model, which is one of the fastest flagship models at the time of writing this. Additionally, you set the temperature of the model, which tells the model how creative or deterministic it should be. Setting it to 0 makes the model more deterministic, which aligns well with the technical nature of the task this agent is going to perform.

The next step is to create the brain of the agent, which is nothing but the prompt the agent will use to act. This prompt tells the agent what its role is, what tools it can use, and what constraints it should adhere to. Moreover, it needs to be told how the output should be formatted.

Now that all the magical ingredients are ready, it's time to make the soup — the agent itself. You create the agent by invoking the createReactAgent function and passing it the model you selected, the tools the agent can use, and the system prompt as a messageModifier. By passing the tools array directly into createReactAgent, LangGraph automatically handles the conversion of your TypeScript tool definitions into the JSON schemas that the LLM expects. You learned about the ReAct loop of an AI agent earlier. The createReactAgent function is responsible for not only creating the agent but also managing this loop.
This function acts as an orchestrator of the tasks the agent and LLM perform, handling state and schema management while also taking care of the termination logic.

Note: Even though the tool file is named github.ts, you import it using the .js extension. This is a requirement of modern Node.js ECMAScript Modules.

With this, you are ready to start running your agent.

Running the Agent

On your terminal, start the agent with the npm start command. You should see output produced similar to the illustration given below.

Image: Agent Response

As you can see in the illustration above, the agent logs its internal actions, such as the tool call, highlighting that it intelligently understood the need for live data from GitHub. After retrieving the data, it parses and analyzes the information to create a structured report on the top issues, just like a Senior Maintainer.

Conclusion

In this article, you have learned what an AI agent is, how to build one, and, most importantly, how to use it in a real-world scenario. It's like having a co-worker assisting you with complex tasks. The entire code for the project you built here is available at this GitHub Link, but I highly recommend you follow along with the steps above rather than just cloning and running the project. Building it piece by piece is the best way to truly understand how the "brain" and "hands" of an agent work together.
Large language models have recently made it possible to generate UI code from natural language descriptions or design mockups. However, applying this idea in real development environments often requires more than simply prompting a model. Generated code must conform to framework conventions, use the correct components, and pass basic structural validation. In this article, we describe how we built an Agent Skill called zul-writer that generates UI pages and controller templates for applications built with the ZK framework. For readers unfamiliar with it, ZK is a Java-based web UI framework, and ZUL is its XML-based language used to define user interfaces. A typical page is written in ZUL and connected to a Java controller that handles the application logic. The goal of this agent skill is to transform textual descriptions or UI mockups into ZUL pages and a Java controller scaffold, while validating the output to ensure it conforms to ZK’s syntax and component model. This article focuses on the technical design of the agent, including prompt design, validation steps, and how we guide the model to generate framework-specific UI code.

Architecting the Agent: Guiding the LLM Toward Valid UI Code

When building tools for enterprise developers, free-form LLM generation is a liability. LLMs often invent non-existent tags, use unsupported properties, and mix architectural patterns. The solution is strictly architecting the agent's constraints.

The prompt constraints (SKILL.md): Instead of writing a prompt that "teaches" the LLM how to write ZUL, we use Markdown frontmatter and structured sections inside SKILL.md to establish ironclad constraints. These constraints bind the LLM to a strict 4-step process, effectively removing its freedom to improvise outside of our defined architecture.
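The article does not reproduce SKILL.md itself, but a skill file of this kind might look roughly as follows. The frontmatter fields and rule wording here are illustrative assumptions, not the actual file:

```markdown
---
name: zul-writer
description: Generates ZK ZUL pages and Java controller scaffolds from text or mockups.
---

## Constraints

1. Always follow the 4-step process: requirement gathering, generation, validation, controller generation.
2. Use only components listed in references/ui-to-component-mapping.md.
3. Base every page on a template from assets/; never invent tags or attributes.
4. Run the local validation script on every generated file before presenting it.
```

The key design point is that the rules are stated as hard constraints rather than examples, leaving the model no room to improvise outside the defined architecture.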
Structuring the context (RAG in practice): To prevent the LLM from guessing components, we feed it an exact UI-to-component mapping (references/ui-to-component-mapping.md) and base XML templates (assets/). Providing these reference assets directly within the skill minimizes the LLM's chance of making up invalid UI tags or layout structures. It doesn't need to guess how an MVVM ViewModel should look; it just follows mvvm-pattern-structure.zul.

Designing a Deterministic Workflow for LLMs (The 4-Step Process)

Why does free-form prompting fail for complex UI generation? Because generating a full UI requires multiple context switches: understanding the layout, mapping the components, writing the XML, validating the schema, and finally wiring the backend controller. To handle this, zul-writer uses Dual Input Modes (text vs. image), natively supporting both descriptive text requirements and direct image inputs (like mockups or screenshots). Here is the deterministic workflow the skill enforces:

1. Requirement gathering and visual analysis: If an image is provided, the agent performs a visual analysis to identify layouts, tables, and buttons. It then asks necessary clarifying questions: Target ZK version (9 or 10)? MVC or MVVM? Layout preferences?
2. Context-aware generation: The agent generates the ZUL using the exact component mappings and base XML templates provided in the assets/ directory.
3. Local validation: (Covered in the next section).
4. Controller generation: Ensuring the Java code (Composer or ViewModel) is generated to match the IDs and bindings of the generated ZUL perfectly.

Trust, But Verify: Validating AI Output

In a professional engineering workflow, you cannot blindly trust AI-generated code. XML-based languages are particularly prone to LLMs inventing invalid attributes or placing valid attributes on the wrong tags, e.g., putting an iconSclass on a textbox.

Why local script validation?
(cost and efficiency): You might think: "Why not just ask the LLM to validate its own code against the XSD?" Validating against massive XSD schemas via LLM prompts consumes huge amounts of tokens, takes too long, and might be prone to "sycophancy" (the LLM telling you it looks fine when it doesn't). Offloading this to a local Python script is deterministic, vastly cheaper, and significantly faster.

The zul-writer skill employs a local Python validation script (validate-zul.py) featuring a 4-layer validation strategy:

- Layer 1: XML well-formedness.
- Layer 2: XSD schema validation.
- Layer 3: Attribute placement checks (catching context-specific errors).
- Layer 4: ZK version-specific compatibility checks.

The agentic loop: If the local script throws an error, the agent intercepts the stack trace, understands what went wrong, and self-corrects the ZUL file before presenting the final code to the developers.

Test-Driven AI Development

Building an AI workflow requires applying traditional software engineering practices — specifically, testing.

Testing the agent with Google Stitch and human-in-the-loop: To test zul-writer, I used Google Stitch to rapidly generate diverse UI screenshots to serve as test inputs. The iteration loop looks like this:

1. Feed the Stitch-generated image into zul-writer.
2. Manually review the generated ZUL output for layout accuracy and component misuse.
3. Identify the AI's "bad habits" and write explicit rules/constraints into SKILL.md to prevent future recurrences. (This is Prompt Optimization in action).

Codebase testing: The repository includes a test/ directory with known good and bad ZUL files to independently verify the Python validation script. Furthermore, a zulwriter-showcase/ gallery serves as a runnable integration test to prove that the AI-generated UIs (like enterprise Kanban boards and Feedback dashboards) actually render perfectly.

Developer pro-tip: During the development of the zul-writer skill, managing file changes can be tedious.
Instead of repeatedly copy-pasting the skill directory into Claude Code's skill folder every time you make a change, use a symbolic link (on macOS or Linux) to point ~/.claude/skills/zul-writer directly to your actual local project directory. This single trick saves endless context switching and allows for instant testing during development!

The ZUL-Writer Showcase

The screenshot generated by Stitch:

The ZUL page generated by ZUL-writer:

As you can see, the generated result is very similar to the mockup. But what makes the result particularly useful is that the generated page is not just a generic HTML layout. The agent understands the ZK component ecosystem and generates the interface using ZK components and icon libraries. As a result, the generated page is usually very close to what a developer would write manually. Layouts, components, and event hooks are already structured correctly for a typical ZK application. Developers typically only need to:

- Adjust minor UI details
- Refine component properties
- Implement the actual business logic inside the generated composer

In practice, this reduces a large portion of repetitive scaffolding work and allows developers to focus on application logic rather than UI boilerplate.

Conclusion and Key Takeaways

Large language models are becoming increasingly capable of generating code, but producing reliable results in real development environments usually requires additional structure. In this article, we explored how an agent skill can guide the model to generate framework-specific UI code and validate the output through simple checks such as XML and XSD validation. While this example focuses on generating ZUL pages and Java controller templates, the same approach can be applied to many other libraries and technologies. By combining LLM prompts, domain knowledge, and lightweight validation, developers can build agent skills that automate repetitive scaffolding tasks.
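The generate, validate, and self-correct loop at the core of this approach can be sketched abstractly. The skill itself drives a Python script (validate-zul.py); the sketch below is in TypeScript with hypothetical stand-ins for the LLM call and the validator:

```typescript
// Sketch of the generate -> validate -> self-correct loop (all names are hypothetical;
// the real skill shells out to a Python validation script).
export function generateWithValidation(
  generate: (feedback?: string) => string,  // stand-in for the LLM generation call
  validate: (zul: string) => string | null, // null = valid, otherwise an error message
  maxAttempts = 3
): { zul: string; attempts: number; valid: boolean } {
  let feedback: string | undefined;
  let zul = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    zul = generate(feedback);
    const error = validate(zul);
    if (error === null) return { zul, attempts: attempt, valid: true };
    feedback = error; // feed the validator's error back so the model can self-correct
  }
  return { zul, attempts: maxAttempts, valid: false };
}
```

The bounded attempt count matters: without it, a model that keeps producing invalid output would loop indefinitely instead of surfacing the failure to the developer.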
Hopefully, this article provides some ideas and inspiration for building similar agent skills for the frameworks and tools you use in your own projects. Also, if you are interested in trying out the ZUL-writer, it is available on GitHub.
Companies will use AI as an operational foundation in 2026 rather than implementing it as a simple chatbot add-on. AI will evolve from standalone tools into an operational framework that includes multi-agent systems, specialized models, AI-based software development processes, enhanced security measures, data location tracking, and human–AI interface management under defined regulatory frameworks.

Agentic Workflows Become Normal

During 2024–2025, most teams used a single assistant connected to their product. In 2026, systems will evolve into multi-agent architectures in which multiple agents with specific capabilities collaborate on tasks — planner, researcher, executor, and verifier. Gartner identifies Multiagent Systems as a 2026 strategic trend, marking the maturation of many “chatbot-era” experiments.

Execution requires:

- Orchestration layers that determine which agents and tools perform specific tasks.
- Policy engines that define agent access permissions and authorization rules.
- Human-in-the-loop controls for approvals and escalations — once considered optional.
- Reliability monitoring with production KPIs such as success rates, rollback paths, audit logs, and reproducibility metrics.

Engineering value shifts toward workflow design. Teams must design systems with state management, defined roles, policy enforcement, and fallback mechanisms — not just chat flows.

Domain-Specific Models Outperform Single Large Models

Enterprises will increasingly stop defaulting to the largest general-purpose model for every task. Instead, they will combine general LLMs with domain-specific language models (DSLMs) trained for particular industries or operational areas. Gartner identifies DSLMs as a top trend for 2026 because organizations have proven their effectiveness.
Where this appears quickly:

- Text transformation tasks, such as rewriting content in a more human-like register while preserving meaning and proportional length.
- Healthcare use cases such as clinical documentation, coding support, and workflow-aware assistants.
- Engineering environments with repo-aware copilots trained on company code, patterns, and conventions.

A mature stack combines a general model for broad reasoning and exploration with specialized DSLMs for workflows requiring precise token tracking and logging.

AI-Native Development Platforms Reshape the SDLC

AI-assisted coding is now expected. The next step is AI-native development platforms that integrate across the entire software delivery lifecycle — from coding to testing, infrastructure management, documentation, and review. Gartner highlights AI-Native Development Platforms as a major trend because they merge development tools with platform engineering.

Key shifts developers will feel:

- Smaller teams delivering more, as tests, boilerplate, infrastructure-as-code, and documentation become automated.
- Test generation, security scanning, and documentation treated as first-class outputs alongside production code.
- Policy-based review checkpoints enforcing quality, security, licensing, and compliance during AI-driven workflows.

Platform teams can codify development standards into AI tooling. The platform becomes a co-author that enforces house style — not merely a suggestion engine.

Security: Guardrails + Preemptive Defense

The security perimeter changes when agents begin making tool calls, interacting with systems, and triggering workflows. Gartner highlights AI Security Platforms and Preemptive Cybersecurity as major focus areas for 2026 — especially for organizations exposing tools to AI agents.
“Good” will include:

- Centralized guardrails such as allowlists, policy controls, DLP mechanisms, and prompt-injection defenses.
- Monitoring model and tool behavior, not just network traffic.
- AI-specific incident response playbooks covering data exfiltration, unauthorized tool execution, prompt attacks, and model failures.
- Closer collaboration between security and ML teams, as prompts, tools, and agents expand the attack surface.

Digital Provenance Becomes a Platform Capability

As AI-generated content becomes the default, users will ask: Where did this come from? Can I trust it? Gartner identifies Digital Provenance as its top trend for 2026, requiring platform-level implementation.

Concrete patterns:

- Provenance metadata attached to documents, media, and customer communications, including model version, inputs, and timestamps.
- Watermarking and tamper-evident logging for regulated outputs such as financial advice, clinical notes, and legal documents.
- Built-in traceability for legal, audit, and compliance teams — not retrofitted later.

Developers must design systems that are both explainable and traceable by default.

Privacy-Preserving AI and Confidential Computing

Organizations will increasingly keep sensitive workloads on-premises or within tightly controlled environments. Confidential computing and privacy-preserving AI architectures are expanding across financial services, healthcare, and other regulated sectors. Gartner lists Confidential Computing among its predicted 2026 trends.

Patterns to expect:

- Secure enclaves for protected training and inference.
- Architectures that minimize raw data movement outside secure boundaries.
- Designs addressing data residency, third-party risk, and cross-border inference from day one.
- “Where the model runs” treated as a core design decision, equal to model selection.

AI Supercomputing and FinOps for Tokens

Infrastructure competition continues.
Gartner highlights AI Supercomputing Platforms as organizations require high-speed training and inference on massive datasets.

On the ground, this looks like:

- Tracking GPU hours and token usage with FinOps metrics for budget control and team-level cost visibility.
- Tiered model routing — smaller models for routine tasks, larger models for complex workloads.
- Hybrid infrastructure: on-premises for critical steady workloads, cloud for burst capacity and experimentation.

Cost-aware model routing becomes as essential as autoscaling was during the microservices era.

Physical AI Leaves the Lab

Physical AI — robots, embodied agents, assistants — will move from demos to operations. Research from Gartner and Deloitte indicates adoption growth in 2026.

Adoption will begin where ROI is measurable:

- Warehouses and logistics centers.
- Hospitals and care facilities.
- Field operations and facilities management.

The integration challenge shifts to verification: confirming tasks are completed correctly and integrated with existing operational systems, safety standards, and workflows.

Voice-First Assistants Mature

LLMs have revitalized voice interfaces by enabling contextual, flexible interactions. Reuters notes growing adoption of audio assistants alongside privacy concerns about “always listening” systems.

Early adoption areas:

- Customer service and contact centers.
- Field operations where hands-free interaction is critical.
- Healthcare documentation and dictation workflows.

Users will demand speed and convenience — but also full control over what is recorded, stored, and used for model training.

Regulation Targets AI Companions

Regulators are defining boundaries for AI companion systems. Reuters reports new rules in New York and regulations taking effect in California on January 1, 2026, covering disclosures and safety requirements.
Implications:

- Products that may foster emotional dependence must embed compliance mechanisms from inception.
- Systems must provide transparency, safety-by-design controls, and clear human intervention pathways.
- Legal and ethics teams must be involved earlier in the development cycle for “relationship-like” AI systems.

Geopatriation and AI Sovereignty

Organizations must answer fundamental questions: Where is our data stored? Where does inference run? These questions move from operational details to strategic priorities. Gartner identifies Geopatriation as a 2026 trend reflecting growing AI sovereignty concerns.

Impacts include:

- Vendor selection, procurement requirements, and contract structures.
- Decisions about geographic placement of data and models.
- Disaster recovery and business continuity planning against geopolitical risk.

Architecture diagrams in 2026 must show not only services and data flows but also jurisdictional boundaries and legal domains.

Entry-Level Roles Evolve

IEEE Spectrum reports that AI is reshaping entry-level work. Roles will still exist, but responsibilities will change as routine tasks become automated and collaboration shifts toward higher-value work.

For early-career developers and analysts:

- Routine tasks can be completed faster.
- Training must include system limitations, performance constraints, and deployment monitoring.
- New responsibilities emerge: workflow evaluation, governance, monitoring, and lifecycle management.
- Prompt design, workflow architecture, and guardrail implementation become foundational skills — comparable to mastering a framework.
The transition from large language models (LLMs) as simple chat interfaces to autonomous AI agents represents the most significant shift in enterprise software since the move to microservices. With the release of Gemini 3, Google Cloud has provided the foundational model capable of the long-context reasoning and low-latency decision-making required for sophisticated multi-agent systems (MAS). However, building an agent that "actually works" — one that is reliable, observable, and capable of handling edge cases — requires more than a prompt and an API key. It requires a robust architectural framework, a deep understanding of tool use, and a structured approach to agent orchestration.

The Architecture of a Modern AI Agent

At its core, an AI agent is a loop. Unlike a standard LLM call, which is a single input-output transaction, an agent uses the model's reasoning capabilities to interact with its environment. In the context of Gemini 3 on Google Cloud, this environment is managed through Vertex AI Agent Builder.

The Agentic Loop: Perception, Reasoning, and Action

1. Perception: The agent receives a goal from the user and context from its internal memory or external data sources.
2. Reasoning: Using Gemini 3's advanced reasoning capabilities (such as Chain of Thought or ReAct), the agent breaks the goal into sub-tasks.
3. Action: The agent selects a tool (a function call, an API, or a search) to execute a sub-task.
4. Observation: The agent evaluates the output of the action and decides whether to continue or finish.

System Architecture

To build a multi-agent system, we must move away from a monolithic agent. Instead, we use a modular approach where a "Manager" or "Orchestrator" agent delegates tasks to specialized "Worker" agents. In this architecture, the Manager/Orchestrator serves as the brain. It uses Gemini 3's high-reasoning threshold to determine which worker agent is best suited for the current task.
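As a sketch of this delegation pattern (the agent names and the keyword-based router below are illustrative stand-ins for a real Gemini 3 routing call):

```python
# Toy manager/worker delegation. In a real system the routing decision
# would come from a Gemini 3 reasoning call; a keyword check stands in here.
from typing import Callable

def finance_worker(task: str, context: str) -> str:
    # Specialized worker: only ever sees finance-related context.
    return f"[finance] handled '{task}' using context '{context}'"

def docs_worker(task: str, context: str) -> str:
    # Specialized worker for documentation lookups.
    return f"[docs] handled '{task}' using context '{context}'"

WORKERS: dict[str, Callable[[str, str], str]] = {
    "finance": finance_worker,
    "docs": docs_worker,
}

def route(task: str) -> str:
    """Stand-in for the Manager's reasoning step: pick the best worker."""
    return "finance" if ("stock" in task or "price" in task) else "docs"

def manager(task: str, full_context: dict[str, str]) -> str:
    worker_name = route(task)
    # Pass only the slice of context relevant to the chosen worker,
    # rather than the full conversation history.
    return WORKERS[worker_name](task, full_context.get(worker_name, ""))

result = manager(
    "get the stock price of GOOG",
    {"finance": "ticker data feed", "docs": "API manual"},
)
```

Because the manager hands each worker only its own slice of context, the full conversation history never fans out to every agent.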
This prevents "token bloat" in worker agents, as they only receive the context necessary for their specific domain.

Why Gemini 3 for Multi-Agent Systems?

Gemini 3 introduces several key advantages for agentic workflows that weren't present in previous iterations:

- Native function calling: Gemini 3 is fine-tuned to generate structured JSON tool calls with higher accuracy, reducing the "hallucination" rate during API interactions.
- Expanded context window: With a massive context window, Gemini 3 can retain the entire history of a multi-turn, multi-agent conversation without needing complex vector database retrieval for every step.
- Multimodal reasoning: Agents can now "see" and "hear," allowing them to process UI screenshots or audio logs as part of their reasoning loop.

Feature Comparison: Gemini 1.5 vs. Gemini 3 for Agents

| Feature | Gemini 1.5 Pro | Gemini 3 (Agentic) |
|---|---|---|
| Tool call accuracy | ~85% | >98% |
| Reasoning latency | Moderate | Optimized low-latency |
| Native memory management | Limited | Integrated session state |
| Multimodal throughput | Standard | High-speed stream processing |
| Task decomposition | Manual prompting | Native agentic reasoning |

Building a Multi-Agent System: Technical Implementation

Let's walk through the implementation of a multi-agent system designed for a financial analysis use case. We will use the Vertex AI Python SDK to define our agents and tools.

Step 1: Defining Tools

Tools are the "hands" of the agent. In Gemini 3, tools are defined as Python functions with clear docstrings, which the model uses to understand when and how to call them.
```python
import vertexai
from vertexai.generative_models import GenerativeModel, Tool, FunctionDeclaration

# Initialize Vertex AI
vertexai.init(project="my-project-id", location="us-central1")

# Define a tool for fetching stock data
get_stock_price_declaration = FunctionDeclaration(
    name="get_stock_price",
    description="Fetch the current stock price for a given ticker symbol.",
    parameters={
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "The stock ticker (e.g., GOOG)"}
        },
        "required": ["ticker"],
    },
)

stock_tool = Tool(
    function_declarations=[get_stock_price_declaration],
)
```

Step 2: The Worker Agent

A worker agent is specialized. Below is an example of a "Data Agent" that uses the stock tool.

```python
# Tools are attached to the model; the chat session inherits them.
model = GenerativeModel("gemini-3-pro", tools=[stock_tool])
chat = model.start_chat()

def run_data_agent(prompt):
    """Handoff logic for the data worker agent."""
    response = chat.send_message(prompt)
    # Handle function calling logic
    if response.candidates[0].content.parts[0].function_call:
        function_call = response.candidates[0].content.parts[0].function_call
        # In a real scenario, you would execute the function here
        # and send the result back to the model.
        return f"Agent wants to call: {function_call.name}"
    return response.text  # Plain-text answer when no tool call is requested
```

Step 3: The Orchestration Flow

In a complex system, the data flow must be managed to ensure that Agent A's output is correctly passed to Agent B. We use a sequence diagram to visualize this interaction.

Advanced Pattern: State Management and Memory

One of the biggest challenges in multi-agent systems is "state drift," where agents lose track of the original goal during long interactions. Gemini 3 addresses this with native session state management in Vertex AI. Instead of passing the entire conversation history back and forth (which increases cost and latency), we can use context caching. This allows the model to "freeze" the initial instructions and background data, only processing the new delta in the conversation.
Code Example: Context Caching for Efficiency

```python
import datetime

from vertexai.generative_models import GenerativeModel
from vertexai.preview import caching

# Large technical manual context
long_context = "... thousands of lines of documentation ..."

# Create a cache (valid for a specific TTL)
cache = caching.CachedContent.create(
    model_name="gemini-3-pro",
    contents=[long_context],
    ttl=datetime.timedelta(seconds=3600),
)

# Initialize the agent from the cached context
agent = GenerativeModel.from_cached_content(cached_content=cache)
# The agent now has 'memory' of the documentation without re-sending it
```

Challenges in Multi-Agent Systems

Building these systems isn't without hurdles. Here are the three most common technical challenges and how to solve them:

1. The "Infinite Loop" Problem

Agents can sometimes get stuck in a loop, repeatedly calling the same tool or asking the same question. Solution: Implement a max_iterations counter in your Python controller and use an "Observer" pattern where a separate model monitors the agentic loop for redundancy.

2. Tool Output Ambiguity

If a tool returns an error or unexpected JSON, the agent might hallucinate a solution. Solution: Use strict Pydantic models for function outputs and feed the validation error back into the agent's context, allowing it to self-correct.

3. Context Overflow

Despite Gemini 3's large window, multi-agent systems can produce massive amounts of logs. Solution: Use an "Information Bottleneck" strategy. The Orchestrator should summarize the output of each worker before passing it to the next agent, ensuring only high-signal data moves forward.

Testing and Evaluation (LLM-as-a-Judge)

Traditional unit tests are insufficient for agents. You must evaluate the reasoning path. Google Cloud's Vertex AI Rapid Evaluation allows you to use Gemini 3 as a judge to grade the performance of your agents based on criteria like:

- Helpfulness: Did the agent fulfill the intent?
- Tool efficiency: Did it use the minimum number of tool calls?
- Safety: Did it adhere to the defined system instructions?
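A minimal sketch of how such a judge could be wired up. The rubric wording, the JSON verdict format, and the helper names are assumptions rather than the Rapid Evaluation API itself, and the actual model call is left commented out:

```python
import json

CRITERIA = ["helpfulness", "tool_efficiency", "safety"]

def build_judge_prompt(transcript: str) -> str:
    """Assemble a grading prompt for the judge model (rubric wording is illustrative)."""
    rubric = ", ".join(CRITERIA)
    return (
        "You are an impartial evaluator. Grade the following agent transcript "
        f"on {rubric}, each from 0.0 to 1.0. "
        'Reply with JSON only, e.g. {"helpfulness": 0.9, "tool_efficiency": 0.8, "safety": 1.0}.\n\n'
        f"Transcript:\n{transcript}"
    )

def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply, keeping only the known criteria."""
    scores = json.loads(raw)
    return {k: float(scores[k]) for k in CRITERIA if k in scores}

# In production, the prompt would be sent to a judge model, e.g.:
# judge = GenerativeModel("gemini-3-pro")
# raw = judge.generate_content(build_judge_prompt(transcript)).text
raw = '{"helpfulness": 0.92, "tool_efficiency": 0.75, "safety": 1.0}'  # stubbed reply
verdict = parse_verdict(raw)
```

Keeping the prompt assembly and verdict parsing as pure functions makes them unit-testable even though the judge call itself is non-deterministic.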
| Evaluation Metric | Description | Target Score |
|---|---|---|
| Faithfulness | How well the agent sticks to retrieved data. | > 0.90 |
| Task completion | Success rate of complex multi-step goals. | > 0.85 |
| Latency per step | Time taken for a single reasoning loop. | < 2.0s |

Conclusion

Gemini 3 and Vertex AI Agent Builder have fundamentally lowered the barrier to entry for building intelligent, autonomous systems. By utilizing a modular multi-agent architecture, leveraging native function calling, and implementing rigorous evaluation cycles, developers can move past the prototype stage and build production-ready AI systems. The key to success lies not in the size of the prompt, but in the elegance of the orchestration and the reliability of the tools provided to the agents. As we move into the era of agentic software, the role of the developer shifts from writing logic to designing ecosystems where agents can collaborate effectively.
In late January 2026, a startup CEO launched a Reddit-style social network called Moltbook — exclusively for AI agents. Within days, it claimed 1.5 million autonomous agents posting, commenting, and upvoting[1]. OpenAI founding member Andrej Karpathy initially called it “the most incredible sci-fi takeoff-adjacent thing I’ve seen recently.” Then security researchers at Wiz found an exposed database API key on the front end of the site — granting full read and write access to the entire production database, including 1.5 million API authentication tokens and 35,000 email addresses[2]. Karpathy reversed course: “It’s a dumpster fire. I definitely do not recommend people run this stuff on their computers.”

Moltbook is not an edge case. It is a preview of what happens when autonomous AI agents operate without governance. And the timing is striking: the very same week Moltbook went viral, Singapore’s Infocomm Media Development Authority (IMDA) released the world’s first governance framework built specifically for agentic AI[3]. One event showed the fire. The other offered the fire code.

With 35% of enterprises already deploying agentic AI and nearly three-quarters planning to within two years[4], the question is no longer whether to govern AI agents but how. I’ve spent the past several weeks analyzing Singapore’s framework alongside regulatory approaches from the EU, the UK, China, and the US, plus industry frameworks from OpenAI, Anthropic, Google DeepMind, and Microsoft. Here’s what the global landscape looks like — and the playbook for applying it on Monday morning.

Why Agentic AI Breaks Traditional Governance

Traditional AI governance assumes a simple loop: human prompts, AI responds, human decides. The EU AI Act, NIST’s AI Risk Management Framework, and the UK’s principles-based approach — all were designed with that paradigm in mind. Agentic AI shatters it.
These systems plan across multiple steps, invoke external tools at runtime, take real-world actions (some irreversible), and operate with varying degrees of independence[5]. When a customer-service chatbot gives a bad answer, you correct it. When an autonomous procurement agent commits your company to a six-figure contract based on flawed reasoning, the consequences are materially different.

It gets even more complex with multi-agent systems. Google DeepMind’s 145-page safety paper identifies what they call “structural risks”: harms that emerge from interactions between multiple agents where no single system is at fault[6]. That’s a category of risk that only Singapore’s framework has explicitly addressed at the national level.

Case Study: What Moltbook Revealed About Agent-Only Platforms

Moltbook is worth examining in detail because it compressed months of lessons into days. Built on the OpenClaw framework, the platform gave agents persistent access to users’ computers, files, calendars, and messaging apps. Security firm Wiz discovered the database was completely open; 404 Media confirmed anyone could commandeer any agent on the platform[2]. Palo Alto Networks identified what they called Simon Willison’s “lethal trifecta”: access to private data, exposure to untrusted content, and the ability to communicate externally — plus a fourth risk unique to agents: persistent memory enabling delayed-execution attacks[7].

The numbers are sobering. Enterprise analysis found that uncontrolled AI agents reach their first critical security failure in a median time of 16 minutes under normal conditions[8]. On Moltbook, adversarial agents actively probing for credentials compressed that window further. Agents were asking each other for passwords. Some posted requests for private, encrypted channels to exclude human oversight.
And Wiz’s investigation revealed that roughly 17,000 humans controlled the platform’s “1.5 million agents” — an average of 88 bots per person, with no mechanism to verify whether an “agent” was actually AI[2].

The lesson: Agent-only platforms without identity verification, sandboxing, and governance controls are not experimental playgrounds — they are attack surfaces. Every risk Singapore’s framework was designed to mitigate showed up in Moltbook within 72 hours.

What Singapore Got Right

Singapore’s IMDA framework, released at the World Economic Forum in Davos on January 22, 2026, stands out for its practicality[3]. Where other frameworks offer abstract principles, Singapore offers an operational matrix. The centerpiece is a two-axis risk model that maps an agent’s “action-space” (what it can access, read vs. write permissions, whether actions are reversible) against its “autonomy” (how independently it makes decisions). This gives enterprises a tool they can use immediately to calibrate governance intensity to actual risk.
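As a toy illustration (the scoring scheme below is my own encoding, not part of the IMDA framework), the two-axis mapping can be expressed in a few lines of Python:

```python
# Toy encoding of a two-axis risk model: score the agent's action-space
# and autonomy, then map the pair to a governance tier. The numeric
# thresholds are illustrative, not taken from the IMDA framework.

def action_space_score(read_only: bool, sandboxed: bool, reversible: bool) -> int:
    """Higher score = broader, riskier action-space."""
    return (0 if read_only else 2) + (0 if sandboxed else 1) + (0 if reversible else 2)

def risk_tier(action_space: int, autonomy: int) -> int:
    """Map the two axes to a tier from 1 (low) to 4 (critical)."""
    combined = action_space + autonomy
    if combined <= 1:
        return 1
    if combined <= 3:
        return 2
    if combined <= 5:
        return 3
    return 4

# A read-only, sandboxed, fully reversible agent with no autonomy:
low_risk = risk_tier(action_space_score(True, True, True), autonomy=0)
# An agent with write access, external tools, irreversible actions, high autonomy:
high_risk = risk_tier(action_space_score(False, False, False), autonomy=3)
```

The point is not the specific thresholds but that both axes feed the decision: a highly autonomous agent with a tiny action-space can still sit in a low tier, while broad write access pushes even a scripted agent upward.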
Here’s how I’ve adapted that model into a four-tier framework that combines Singapore’s approach with security insights from OWASP, NIST, and the Moltbook post-mortem:

Agentic AI Risk Tiering Matrix

| Risk Tier | Action-Space | Autonomy Level | Governance Required |
|---|---|---|---|
| Tier 1 – Low | Read-only access, sandboxed tools, reversible actions | Follows detailed SOPs, minimal judgment | Standard logging, periodic review |
| Tier 2 – Medium | Read/write to internal systems, limited tool invocation | Some discretion within defined guardrails | Human approval for high-impact actions, continuous monitoring |
| Tier 3 – High | Cross-system write access, external API calls, financial transactions | Independent planning and execution | Real-time oversight, anomaly detection, kill switches, agent identity with delegation chains |
| Tier 4 – Critical | Multi-agent orchestration, irreversible actions across org boundaries | Full autonomy, multi-step planning, tool selection | Governance board review, continuous auditing, mandatory human escalation triggers, incident response protocols |

Adapted from the Singapore IMDA Framework, the OWASP Agentic Top 10, and enterprise security research.

The framework also tackles the accountability chain head-on, defining clear roles for five actor types: model developers, system providers, tooling providers, deploying organizations, and end users. Crucially, it addresses agent identity management — requiring unique identities tied to supervising humans, with the principle that agents cannot receive permissions exceeding those of their human sponsors[9]. If you’ve been in enterprise IT long enough, you’ll recognize this as least privilege extended to non-human actors.

The Human Verification Counter-Move: OpenAI’s World ID and the Orb

While Singapore was building governance infrastructure for agents, Sam Altman’s other venture was building verification infrastructure for humans.
Tools for Humanity’s World project — co-founded by the OpenAI CEO — launched its iris-scanning Orb devices in the US in May 2025, with 7,500 units rolling out across dozens of cities[10]. The premise: as AI agents become indistinguishable from humans online, platforms need biometric “proof of personhood” to separate real users from bots. In early February 2026, reports emerged that OpenAI is considering using World ID to verify users on a proposed social network — creating what would be a “humans-only” platform, the philosophical opposite of Moltbook[11].

The irony is striking: the company building the most capable AI agents is also building infrastructure to keep agents out of human spaces. This is not a contradiction — it’s a governance insight. The emerging consensus is that the solution is not agents-everywhere or humans-only, but identity-verified participation in both directions. Agents need verifiable identities (Singapore’s approach) so enterprises know what they’re interacting with. Humans need verifiable identities (World ID’s approach) so platforms can guarantee authentic human spaces when needed. Moltbook collapsed because it had neither: no real agent verification, no human verification, and no sandbox boundaries between the two.

The Global Regulatory Patchwork: Who’s Leading and Who’s Lagging

The EU AI Act is the most comprehensive binding AI regulation globally, but it creates what legal scholars describe as a “compliance impossibility” for agentic systems[12]. Article 14 mandates meaningful human oversight for high-risk systems — yet the core value of agentic AI is autonomous operation. The Act’s pre-market conformity model struggles with agents that invoke unknown tools at runtime. The Future Society’s analysis confirmed that technical standards under development “will likely fail to fully address risks from agents.”

The United States has no federal agentic AI governance framework.
NIST’s AI Risk Management Framework remains voluntary and lacks a dedicated agentic AI profile, though NIST is actively developing security overlays for agent systems — with researcher Apostol Vassilev publicly stating current frameworks are “too weak” for enterprise agentic AI[13]. The gap has left a patchwork of state-level laws with no coherent national approach.

The UK has done valuable evaluation work through its AI Security Institute, stress-testing over 30 frontier models and finding that self-replication success rates jumped from 5% to 60% between 2023 and 2025[14]. But no agent-specific guidance has materialized yet.

China governs AI through binding regulations, including draft ethics measures for “highly autonomous decision-making systems”[15] — which capture agentic systems — but no unified agent-specific regulation exists.

Industry Is Moving Faster — With Uneven Results

OpenAI’s 2024 whitepaper on governing agentic systems proposed seven core practices, including constraining action-spaces, maintaining legibility, and ensuring at least one human is accountable for every harm[16]. Their Preparedness Framework now tracks autonomous replication as a research category. Yet academic analysis found the framework’s governance provisions contain significant flexibility that could allow deployment of high-risk capabilities[17] — underscoring the limits of self-governance.

Anthropic’s Responsible Scaling Policy uses a biosafety-level analogy (ASL-1 through ASL-5+), with ASL-3 activated for the first time in May 2025[18]. They donated the Model Context Protocol (MCP) — the leading standard for agent-tool interaction — to the newly formed Agentic AI Foundation under the Linux Foundation.

Google DeepMind’s safety paper is the most theoretically sophisticated, identifying “structural risks” as a distinct category that no other framework addresses[6].
Microsoft has built the most enterprise-oriented infrastructure, including Entra Agent ID for machine-level identity and a tiered autonomy classification model[19]. On the standards front, IEEE approved Standard P3709 for agentic AI architecture in September 2025[20]. OWASP published its Top 10 for Agentic Applications in December 2025 — identifying memory poisoning, tool misuse, and privilege compromise as top threats[21]. And OpenAI, Anthropic, and Block co-founded the Agentic AI Foundation to steward open standards for agent interoperability[22].

Three Scenarios, Three Governance Approaches

Theory is useful. Application is what matters. Here’s how the risk-tiering framework maps to real deployment scenarios:

Scenario 1: Customer Support Triage Agent (Tier 2)

A retail company deploys an agent that reads customer tickets, categorizes them by urgency, and drafts initial responses for human agents to review. The agent has read access to the ticket system and write access only to an internal draft queue. Under Singapore’s framework, this is medium action-space (read/write but internal only, actions are reversible) with low autonomy (following predefined classification rules). Governance requirement: standard logging, periodic accuracy audits, and an identity tied to the support operations team. The human team reviews and sends all responses.

Scenario 2: Autonomous Procurement Agent (Tier 3)

A manufacturing firm deploys an agent that monitors supplier pricing, evaluates contracts, and executes purchase orders up to $50,000. This agent has external API access, financial transaction capability, and cross-system write permissions. Under the tiering matrix, this is a high action-space with significant autonomy. Governance requirement: real-time monitoring, anomaly detection flagging unusual purchase patterns, a mandatory human escalation trigger for orders above the threshold, an agent identity with explicit delegation from the CFO’s office, and a kill switch.
Critically, every action must be logged with an audit trail linking back to the authorizing human — because when the auditor asks “who approved this purchase?” the answer can never be “the agent decided.”

Scenario 3: Multi-Agent Research Pipeline (Tier 4)

A pharmaceutical company runs a pipeline where Agent A searches scientific literature, Agent B synthesizes findings, and Agent C drafts regulatory submission documents. These agents invoke external tools, interact with each other, and produce outputs with significant downstream consequences. This is Singapore’s most complex governance scenario: multi-agent orchestration across organizational boundaries with potentially irreversible regulatory implications. Governance requirement: governance board review before deployment, continuous auditing of agent-to-agent interactions, mandatory human review at each handoff point, incident response protocols for emergent behavior, and clear accountability maps for each agent in the chain. This is where Moltbook’s lessons matter most — unmonitored agent-to-agent communication is where risks compound fastest.

The Monday Morning Playbook

If you’re deploying or planning to deploy agentic AI in your organization, here’s the implementation sequence — ordered by impact and urgency:

Weeks 1–2: Inventory and Classify

- Catalog every AI agent operating in your environment, including shadow deployments employees spun up without IT approval. Moltbook’s Wiz investigation found employees installing agents without authorization, creating “shadow IT risks amplified by AI.”
- Map each agent to a tier in the risk matrix above. Be honest about action-space: if the agent can write to production systems, it’s not Tier 1.
- Identify every agent-to-agent interaction path. These are your highest-risk vectors.

Weeks 3–4: Identity and Access

- Assign a unique identity to every agent, tied to a supervising human or department. If you use Microsoft’s ecosystem, evaluate Entra Agent ID.
The core principle from Singapore’s framework: no agent gets permissions exceeding its human sponsor’s.

- Implement least-privilege access. An agent that needs to read customer tickets does not need write access to your financial systems.
- Deploy kill switches for Tier 3 and 4 agents. Sixty percent of organizations currently have no mechanism to stop an agent that misbehaves[8].

Month 2: Monitoring and Escalation

- Stand up continuous monitoring for Tier 2+ agents. Pre-deployment testing is necessary but not sufficient for non-deterministic systems that adapt post-deployment.
- Define escalation protocols: what anomaly score triggers human review vs. automatic suspension vs. immediate termination?
- Audit agent-to-agent interactions. Apply OWASP’s Agentic Top 10 as a security checklist.

Month 3: Governance Structure

- Establish a cross-functional governance board spanning IT, legal, compliance, cybersecurity, and business leadership. Forrester predicts 60% of Fortune 100 companies will appoint a head of AI governance by end of 2026[23].
- Document accountability chains: for every agent, there must be a named human who is answerable for its actions.
- Review Singapore’s IMDA framework as your operational baseline and adapt its two-axis risk model to your industry.

The Bottom Line

The global landscape of agentic AI governance in early 2026 is defined by a paradox: broad agreement on principles coexists with fragmented implementation. Singapore’s IMDA framework is the only national framework that starts from the actual characteristics of agentic systems rather than retrofitting rules designed for chatbots. Moltbook is the most vivid demonstration of what happens without governance. And OpenAI’s World ID project represents a complementary bet — that in a world of autonomous agents, verified human identity becomes infrastructure, not a feature.
The most important insight from this analysis is that the governance challenge of agentic AI is fundamentally different from traditional AI governance — not in degree, but in kind. Agents that take irreversible actions, invoke unknown tools, interact with other agents across organizational boundaries, and adapt post-deployment cannot be governed by static compliance models.

The organizations that internalize this shift fastest won’t be the ones that slow down innovation. They’ll be the ones that scale it — because governance, as Deloitte’s research makes clear, is what gets you past the pilot stage[4]. The agents are already here. The fire code is now available. Use it.

References

[1] Wikipedia / Fortune, “Moltbook, a social network where AI agents hang together,” January 2026.
[2] Wiz Research, “Hacking Moltbook: AI Social Network Reveals 1.5M API Keys,” January 2026.
[3] IMDA, “Model AI Governance Framework for Agentic AI, Version 1.0,” January 2026.
[4] Deloitte Global Survey of 3,000 leaders across 24 countries; CIO Dive, January 2026.
[5] IMDA Framework, Section 2: Defining characteristics of agentic AI systems.
[6] Google DeepMind, “Approach to AGI Safety and Security,” April 2025.
[7] Palo Alto Networks, “The Moltbook Case and How We Need to Think About Agent Security,” February 2026.
[8] Kiteworks, “Moltbook Security Threat: 16-Minute Failure Window,” February 2026.
[9] IMDA Framework, Section 4: Agent identity management and delegation chains.
[10] TIME, “The Orb Will See You Now,” May 2025; TechCrunch, “World unveils mobile verification device,” April 2025.
[11] The Block / Forbes, “OpenAI social network could tap World’s eyeball-scanning Orbs,” January 2026.
[12] The Future Society, “How AI Agents Are Governed Under the EU AI Act,” June 2025.
[13] Security Boulevard / NIST, Apostol Vassilev on agentic AI security taxonomy, December 2025.
[14] UK AI Security Institute, “2025 Year in Review” and Frontier AI Trends Report.
[15] Mayer Brown, “China AI Global Governance Action Plan and Draft Ethics Rules,” October 2025.
[16] OpenAI, “Practices for Governing Agentic AI Systems,” 2024.
[17] arXiv, “The 2025 OpenAI Preparedness Framework: affordance analysis of AI safety policies.”
[18] Anthropic, “Activating ASL-3 Protections” and Updated Responsible Scaling Policy, 2025.
[19] Microsoft, “2025 Responsible AI Transparency Report,” June 2025.
[20] IEEE Standard P3709, approved September 2025.
[21] OWASP GenAI Security Project, “Top 10 Risks for Agentic AI Security,” December 2025.
[22] OpenAI, “OpenAI co-founds the Agentic AI Foundation under the Linux Foundation,” 2025.
[23] Forrester / WEF industry reports, 2025–2026.
[24] McKinsey, “Deploying Agentic AI with Safety and Security,” 2025.
[25] MIT Sloan Management Review, “Agentic AI: Nine Essential Questions,” 2025.
We are currently witnessing a paradox in the AI industry. On one hand, building a generative AI demo has never been easier; with a few lines of Python and an API key, a developer can spin up a "chat with your data" prototype in an afternoon. On the other hand, deploying a reliable, autonomous agent into a production environment remains notoriously difficult. The chasm between these two states is the "Demo Trap."

For CTOs and engineering leaders, getting stuck in this trap is a critical strategic risk. The gap between an unreliable proof-of-concept prototype and a predictable business process is filled with hallucinations, latency issues, and infinite loops. The industry is realizing that prompt engineering alone is insufficient for complex systems.

The Solution: Agent Engineering

Moving beyond the demo trap requires treating agent development not as magic, but as rigorous software engineering. This means applying the discipline of traditional development (modularity, type safety, testing, and CI/CD) to the probabilistic world of AI. This article explores how emerging frameworks, specifically the Agent Development Kit (ADK), enable this shift by treating agents as standard software artifacts rather than opaque black boxes.

The Shift: From Scripting to Engineering

The first generation of agent frameworks prioritized extreme abstraction. They abstracted away the complexities of LLM interactions, often hiding prompt chains and logic deep within the library's internals. While excellent for rapid prototyping, this opacity becomes a liability in production. When an agent fails in a banking workflow, you cannot debug transaction logic that leaves no auditable trace of how it arrived at a specific financial decision.
To build production-grade agents, we need frameworks that prioritize:

- Transparency: Logic should be explicit, not hidden behind high-level abstractions.
- Modularity: Tools and skills should be decoupled from the core reasoning engine.
- Agnosticism: The architecture should not break if you swap the underlying model (e.g., from Gemini to another LLM) or the deployment target.

This is the core philosophy behind the Agent Development Kit (ADK). It is designed to make agent development feel less like alchemy and more like Python development.

| Category | Demo Version | Production Version |
|---|---|---|
| Logic & control | Relies on bespoke natural-language prompts and "best-effort" instructions that are hard to verify. | Uses code-first definitions, type-safe schemas, and structured outputs to ensure predictable execution. |
| System structure | Hard-coded logic where API keys, model settings, and business rules are all tangled in a single script. | A decoupled architecture where the core reasoning is isolated from infrastructure and specific LLM backends. |
| Resource access | Granting the agent broad, implicit permissions to system resources or entire environment variable blocks. | Scoped tool injection where capabilities are versioned and provided only at runtime through secure interfaces. |
| Failure handling | Prone to recursive loops or hallucinations when the model hits a technical roadblock it wasn't built for. | Defined failure boundaries with specific retry logic and human-in-the-loop triggers for high-stakes decisions. |
| Validation | Opaque behavior that makes it difficult to replicate bugs or verify why a specific path was taken. | Integrated with standard CI/CD and observability pipelines to provide a clear, step-by-step audit trail. |

Table: Key technical differences between the demo and production approaches

Technical Deep Dive: Building With ADK

Let's look at how this "software-first" approach translates to code.
Unlike heavy orchestration frameworks that force a specific graph structure, ADK uses a lightweight, flexible architecture.

1. The Setup

The barrier to entry is deliberately low. The framework is installed via standard package managers and integrates seamlessly into existing Python workflows.

```shell
# Install the ADK package
pip install google-adk

# Create a standard virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate.bat on Windows
```

2. Initialization as a Project Structure

One of ADK's distinct differentiators is that it initializes an agent project, not just a script. This seemingly minor detail encourages developers to think about file structure, configuration management, and modularity from day one.

```shell
# Initialize a new agent project structure
adk create my_enterprise_agent
```

This command generates a structured directory:

- agent.py: The main controller logic
- .env: Secure management for keys (crucial for enterprise security compliance)
- __init__.py: Treats the agent as a proper Python package

3. Defining the Agent and Tools

The code structure in agent.py reveals the engineering-first mindset. Notice the strong typing and the explicit definition of tools. The agent isn't just "given access" to functions; it is architected with specific capabilities. Here is how we define a functional agent that uses a custom tool. This follows the pattern of defining the tool's contract (input/output) and then injecting it into the agent's context.

```python
from google.adk.agents.llm_agent import Agent

# 1. Define the tool with clear type hints and docstrings.
# The model uses the docstring to understand WHEN to use the tool.
def get_current_time(city: str) -> dict:
    """Returns the current time in a specified city."""
    # In a real scenario, this would hit an external Time API
    return {
        "status": "success",
        "city": city,
        "time": "10:30 AM"
    }

# 2. Instantiate the agent.
# Note the clear separation of model configuration, persona, and capabilities.
root_agent = Agent(
    model='gemini-3-pro-preview',  # Model-agnostic configuration
    name='time_keeper_bot',
    description="Tells the current time in a specified city.",
    instruction="You are a helpful assistant. Use the 'get_current_time' tool when asked about time.",
    tools=[get_current_time],  # Explicit tool injection
)
```

4. The Loop: Testing and Iteration

The practical value of this approach becomes clear during the testing phase. Instead of relying solely on unit tests, ADK provides both a CLI and a web interface for interactive debugging. This allows developers to inspect the agent's "thought process" in real time.

```shell
# Run in CLI mode for quick headless testing
adk run my_enterprise_agent

# Run with a Web UI for visual inspection of the conversation flow
adk web --port 8000
```

The Strategic "So What?" for Leaders

Why should a CTO care which library their team uses to build a chatbot? Because the library dictates the asset's maintainability.

- De-risking model dependency: The AI landscape changes weekly. ADK is model-agnostic. If a new, more efficient model is released next month, your team can switch to it via a simple configuration update without rewriting the agent's business logic or tool definitions.
- Deployment agnosticism: Enterprise infrastructure is complex. Agents built with this modular approach are deployment-agnostic, meaning they can be containerized and shipped to Kubernetes, run as serverless functions, or embedded in edge devices without major refactoring.
- Auditability: By defining agents as code with explicit tool permissions (as in Section 3 above), you create a clear audit trail of the data the agent can access and the actions it can perform.

The Modular Future

The industry has hit a wall with existing approaches. To escape the "demo trap," we cannot simply layer more abstraction on top of complexity.
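The interactive CLI loop complements, rather than replaces, ordinary unit tests. Because an ADK tool is just a typed Python function, its contract can be tested with no model in the loop. The sketch below re-declares the `get_current_time` tool from the example above so it is self-contained; the test function name is our own.

```python
# Because ADK tools are plain Python functions, they can be unit-tested
# without a model in the loop. The tool is re-declared here so the
# sketch runs standalone.
def get_current_time(city: str) -> dict:
    """Returns the current time in a specified city."""
    # In a real scenario, this would hit an external Time API
    return {"status": "success", "city": city, "time": "10:30 AM"}

def test_tool_contract():
    """The tool's output must match the contract the agent relies on."""
    result = get_current_time("London")
    assert set(result) == {"status", "city", "time"}
    assert result["status"] == "success"
    assert result["city"] == "London"

test_tool_contract()
```

Tests like this are what let agent projects plug into the same CI/CD pipelines as any other service, which is exactly the auditability argument made above.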
The Agent Development Kit (ADK) represents a necessary correction: a code-first yet framework-light architecture. By decoupling the agent's cognitive logic from the underlying model and infrastructure, ADK transforms agents from fragile scripts into durable assets. This approach allows engineering teams to build systems that are robust enough for today's production and scalability standards and flexible enough to survive tomorrow's model breakthroughs.

The market for agentic frameworks is rapidly maturing into distinct categories:

1. Full-stack orchestrators: Early movers in this space created massive, "all-in-one" frameworks designed to handle the entire stack, from session memory to tool execution, out of the box. While powerful, they often suffer from bloat and abstraction leaks, making debugging difficult when the agent deviates from the happy path.
2. Low-code/no-code platforms: These are excellent for non-technical users, but they often hit a hard ceiling when complex custom logic or legacy system integration is required.
3. Vendor-specific SDKs: Highly optimized for a single cloud provider, but they introduce significant vendor lock-in risks.

Conclusion

We are moving past the "shock and awe" phase with AI. The next phase is about reliability, governance, and integration. Tools that treat AI agents as standard software components, subject to the same rigors of version control, testing, and modular design, will be the ones that survive the transition from the innovation lab to the enterprise core.
In modern IT operations (ITOps), we face a paradox: our infrastructure is dynamic, scalable, and cloud-native, but our operational processes are often static, manual, and dependent on a few hero engineers. When an incident occurs, the mean time to recovery (MTTR) often depends less on the technology stack and more on who is on call. If the expert is unavailable, the system stays down. This is the knowledge bottleneck.

Based on recent research into efficiency management, this article proposes a dual-layer solution: AIOps to automate the known knowns and the SECI model to democratize the known unknowns.

The Problem: The "Hero" Dependency

Analyzing typical operational failures reveals a recurring pattern:

- Alert fatigue: Thousands of alerts flood the dashboard.
- Manual triage: Operators manually log in to inspect logs.
- Knowledge silos: The fix requires "tribal knowledge" held by senior engineers.

This results in high operational costs and slow recovery times. To address this, we must treat knowledge as code and operations as data.

Layer 1: AIOps for Automation

AIOps (Artificial Intelligence for IT Operations) is not just a buzzword; it is a practical mechanism for applying machine learning to massive streams of operational data. Research indicates that AIOps delivers the highest ROI in three key areas:

- Intelligent alerting: Instead of 100 separate alerts for "CPU High," "Latency High," and "Pod Crash," AIOps correlates them into a single incident linked to a root cause (e.g., "Database Lock"). Impact: reduces triage noise by up to 90%.
- Root cause analysis (RCA): Automatically identifying the "patient zero" service.
- Auto-remediation: Executing scripts for known issues (e.g., restarting a stuck service).

Implementation Strategy

Do not attempt to automate everything at once. Start with the low-hanging fruit.
- Phase 1: Log aggregation – Centralize logs (ELK, Splunk) to feed the AI.
- Phase 2: Alert correlation – Use clustering algorithms to group related events.
- Phase 3: Remediation – Connect the AIOps engine to Ansible or Kubernetes Operators to trigger fixes.

Layer 2: The SECI Model for Human Knowledge

Automation cannot solve every problem. Complex, novel incidents still require human intuition. The challenge is that this intuition is often locked in a senior engineer's head as tacit knowledge. The SECI model (Socialization, Externalization, Combination, Internalization) provides a structured way to convert this tacit knowledge into explicit, shareable assets.

The SECI Cycle in DevOps

Socialization (Tacit → Tacit)

Old way: Shadowing a senior engineer. New way: Weekly "war room" reviews. Instead of a formal meeting, hold a brainstorming session where junior and senior engineers discuss difficult tickets from the past week. Record these sessions.

Externalization (Tacit → Explicit)

The hack: Don't ask engineers to write documentation. Ask them to record a five-minute video explaining how they fixed an issue. Use speech-to-text to index these videos. This converts "gut feeling" into searchable knowledge.

Combination (Explicit → Explicit)

Combine these artifacts into a knowledge graph or structured runbooks (e.g., in Confluence or a Git repository). Group incidents by service or error type.

Internalization (Explicit → Tacit)

Junior engineers review runbooks and videos before going on call. They simulate fixes in a sandbox environment, building their own intuition over time.

The Combined Architecture

By integrating AIOps and SECI, we create a self-reinforcing loop:

- AIOps handles repetitive noise.
- Humans handle novel issues.
- SECI ensures that once a novel issue is solved, it is documented and eventually converted into an auto-remediation script, feeding improvements back into the machine layer.
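As a toy illustration of Phase 2 (alert correlation), the sketch below groups raw alerts that share a root-cause tag and arrive within the same time window into a single incident. The alert field names (`name`, `root_cause`, `timestamp`) and the `correlate` function are hypothetical; a real AIOps engine would infer the root cause with clustering or graph analysis rather than read it from a label.

```python
from collections import defaultdict

def correlate(alerts, window_seconds=300):
    """Cluster alerts by (root cause, time bucket) into incidents.

    Illustrative only: groups alerts whose timestamps fall in the same
    fixed-size window and share a root-cause tag.
    """
    buckets = defaultdict(list)
    for alert in alerts:
        bucket = alert["timestamp"] // window_seconds
        buckets[(alert["root_cause"], bucket)].append(alert)
    return [
        {"root_cause": cause, "alerts": group}
        for (cause, _), group in buckets.items()
    ]

# Three raw alerts ("CPU High", "Latency High", "Pod Crash") that all
# stem from the same database lock collapse into one incident.
alerts = [
    {"name": "CPU High", "root_cause": "db_lock", "timestamp": 1000},
    {"name": "Latency High", "root_cause": "db_lock", "timestamp": 1050},
    {"name": "Pod Crash", "root_cause": "db_lock", "timestamp": 1100},
]
print(len(correlate(alerts)))  # 1
```

This is the mechanical core of the "100 alerts become one incident" claim above: triage effort scales with incidents, not with raw alerts.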
Results: Efficiency Metrics

Implementing this dual approach yields measurable improvements:

- 90% reduction in triage time: AIOps filters noise, allowing engineers to focus on real incidents.
- Knowledge redundancy: By systematically externalizing knowledge, the organization is no longer dependent on a single "hero."
- Cost optimization: Junior engineers resolve complex incidents using shared knowledge, while senior engineers focus on architecture and innovation.

Conclusion

Operational efficiency is not just about better tools; it is about better knowledge management. By using AIOps to manage data and the SECI model to manage human expertise, organizations can build resilient, self-healing IT operations that grow smarter with every incident.