Building AI Agents Capable of Exploring Contextual Data for Taking Action
AI agents are evolving fast, blending reasoning, context, and web skills to automate complex tasks and redefine the future of intelligent automation.
Join the DZone community and get the full member experience.
Join For FreeArtificial intelligence is on a rapid evolutionary track, and the once awe-inspiring conversational capabilities of ChatGPT raise very few eyebrows these days. AI developers are shifting into a higher gear, and these days, the focus is all about agents. They’re building more advanced AI systems that transform large language models into thinkers, decision-makers, and action-takers, which can automate many kinds of work.
To create an AI agent, the developer must assign an LLM to a specific role, assign it a clear goal to accomplish, and provide access to the necessary resources for the agent to fulfill its mission. When AI agents are focused on a clearly defined objective and can utilise APIs, web browsers, search engines, and databases as humans do, they can autonomously determine how to perform the assigned task.
Agentic AI represents an entirely new paradigm for developers, enabling multiple agents to collaborate on complex, multistep tasks and redefine the nature of business automation.
How Did We Get Here?
One of the most important capabilities for AI agents is their ability to understand context. LLMs can be taught to remember what was said earlier during a conversation or in previous sessions, and they can take this into account when it comes to making decisions, without any changes made to their underlying code. This in-context learning is what enables LLMs to adapt and respond more effectively to complex queries.
AI agents are further enhanced by retrieval-augmented generation (RAG), which is a popular technique that enables LLMs to augment their knowledge with data from dynamic sources beyond their initial training sets. This is what makes it possible to customize an LLM’s responses for a given context, such as providing customer service support for a specific organization.
A more recent development is multimodal models, or MLLMs, which enable AI agents to explore and navigate through a graphical user interface. MLLMs combine the capabilities of LLMs, which perform well at tasks involving natural language processing but struggle when it comes to processing visual elements, and large vision models or LVMs, which excel at processing visuals but do not possess the advanced reasoning skills of traditional LLMs.
By blending an LVM’s visual processing with an LLM’s reasoning, MLLMs can analyze and understand both text and images.
Navigating the Web
A key skill for any AI agent is the ability to explore, understand, and take actions online, which means developers need to teach it how to surf the web using a browser.
Browser Use
One of the most popular tools for this is Browser Use, an open-source framework that helps to make the internet “readable” to AI agents. Browser Use enables agents to go beyond their visual recognition capabilities by breaking down each website into a structured text.
Once this is accomplished, AI agents can process what they’re seeing online in a more deterministic way, including dynamic, embedded web elements that computer vision-based agents might miss. This means it can understand all of the options available on a specific web page and identify what it needs to do.
Scraping Browser
AI agents also need a specialized browser that allows them to navigate the web at scale, avoiding the various pitfalls set up by web publishers to try and prevent automated bots from navigating through them and importing their data. With Bright Data’s Scraping Browser, AI agents gain access to a variety of tools that can help them to do this at an unprecedented scale.
With unlimited concurrent sessions, thousands of agents can explore the web continuously, thanks to API and script management integrations that provide granular control.
It also offers a range of mechanisms for getting around the bot-blocking tools implemented by sites such as Amazon and Facebook that aim to curtail autonomous traffic. These include browser fingerprinting, automated retries, advanced Captcha solvers, and a library of more than 150 million proxy IP addresses.
Sequential Task Execution
Now that our AI agents are set up to explore the web, the next step for developers is to teach them to execute tasks sequentially, in a logical order, so they can undertake complex work involving multiple steps. When AI agents are tasked with gathering context from multiple sources and reasoning across them, they often struggle.
Some examples of this might include adaptive surveys, which require an agent to perform sentiment analysis in real-time and then ask follow-up questions. Similarly, tasks such as supplier risk assessment, customer churn analysis, and forecasting bottlenecks in manufacturing operations involve pulling data from multiple domains.
Agentic Teams
To address this, developers must devise a method for unifying input data and integrating it so that their AI agents can gain a comprehensive understanding of the information they’re drawing on. The easiest way to do this is to employ teams of specialized AI agents that are each trained to understand or work on a specific domain or task.
By using Crew AI’s open-source agentic AI framework, developers can quickly spin up a team of AI agents that can collaborate to perform multi-step tasks. These agentic teams will split a task between them, with each one focusing on whatever aspect falls within its capabilities, leaving the other tasks to an agent that’s better suited for it.
Once their work is complete, they’ll combine the results.
Standardized Interactions
These AI agent teams may require access to a range of different software tools to complete their assigned tasks, which is where the Model Context Protocol comes into play. The open-source MCP is rapidly emerging as the de facto way for AI agents to interact with software, APIs, and services because of the way it standardizes context sharing and action execution, allowing those agents to operate in dynamic, multi-tool environments.
MCP provides AI agents with structured access to almost any API, data source, or tool, enabling natural and flexible workflows within applications while reducing the custom logic required for integration. Just as APIs transformed the way software communicates, MCP is set to become a universal language for agent-tool interaction, providing support for chaining tools across domains to enable more powerful compound actions.
Cross-Domain Context
We’ll also need a semantic layer to link the information found in structured datasets with the live, unstructured data that’s sourced from the internet. Wren AI offers a powerful semantic layer that helps developers to standardize cross-domain data, which is often stored in incompatible formats, so it can be amalgamated and interpreted consistently by AI agents.
Crucially, it provides the business context that agents need to work with structured enterprise data, so it can be tagged and aligned with web-based data to create comprehensive knowledge graphs. By mapping different cross-domain entities in this way using a knowledge graph, AI agents can more accurately identify context-based relationships between them.
Armed with this ability to execute sequential tasks, developers will be able to create AI agents that can generate more relevant cross-domain insights by contextualizing external, web-based data against internal metrics. For instance, an AI agent might be able to connect an external news story regarding supply shortages to update the risk score in an organization’s internal procurement system, taking into account the company’s existing stocks, the expected duration of the shortage, and the ability to source alternatives from different suppliers.
Automation at Unprecedented Scales
AI agents represent a dramatic evolution of LLMs, which have transformed from providing simple, grounded responses based on their pre-trained data into intelligent entities that can actively explore and interact with their environments and complete assigned tasks.
When developers combine data and web exploration with logical reasoning and decision-making, AI agents can perform more complicated, multi-step tasks with greater autonomy and accuracy. It will usher in a new era of more robust and flexible task automation by LLMs with almost human-level understanding and problem-solving skills.
AI agents are becoming much more “human” in terms of what they can do, and we’re only just beginning to realize the possibilities this will unlock for enterprise acceleration.
Opinions expressed by DZone contributors are their own.
Comments