An agent can reason well and still fail badly. Most teams do not notice this during early experiments because nothing is under pressure yet. The model calls tools, answers questions, and produces outputs that look correct. From the outside, the system works.

The problems surface later, once the agent is expected to run continuously instead of intermittently. Restarts become normal, context has to survive across runs, external services are often involved, and the agent's actions are not always closely monitored. That is where the difference shows. At that point, outcomes depend far less on how the agent reasons and far more on how it is hosted, because hosting determines what happens when execution is interrupted, state disappears, or permissions suddenly block an action.

This article walks through what breaks once agents leave controlled environments, and why runtime control, memory persistence, tool mediation, and observability determine whether an agent behaves like a system or collapses into a script.

## Local Testing Works Because the Rules Are Simple

Most agents begin life in forgiving conditions. A developer runs them locally or on a small cloud instance, often with a single user and no real concurrency. Frameworks such as LangChain or LangGraph handle the wiring: the model is connected to tools, state is passed through in-memory objects, and behavior is easy to observe while everything runs in a single process.

In that environment, the system feels stable. State lives in memory for as long as the process stays alive. Tools are called directly, without mediation. Logs are easy to follow. When something goes wrong, restarting the process usually resets the world, and the problem disappears.

Production does not work that way. Once the same agent runs across machines, handles concurrent requests, and restarts without warning, those assumptions fall apart. Memory vanishes unless it is explicitly persisted. Execution spreads across services instead of living in one place.
Failures become intermittent and difficult to reproduce. If hosting does not account for this shift, the agent starts behaving unpredictably, even though individual model outputs may still look reasonable in isolation.

A prompt can describe what an agent is supposed to do. It cannot enforce how that behavior unfolds over time. That enforcement has to come from hosting.

## Runtimes Turn Agents Into Services

An agent implemented as a prompt loop has no real boundaries. It decides when to act, what to remember, and how to call tools. That is acceptable for experiments; it becomes dangerous once the agent touches real infrastructure. A runtime layer changes the operating model by separating intent from execution.

Below is a simplified example of a runtime-controlled agent loop. The model proposes actions. The runtime decides what actually happens.

```python
def process_step(agent_id, proposed_action):
    state = state_store.load(agent_id)
    decision = policy_engine.evaluate(
        agent_id=agent_id,
        action=proposed_action,
        state=state
    )
    if decision == "DENY":
        audit_log.record(agent_id, proposed_action, "DENIED")
        return state
    result = tool_gateway.execute(
        agent_id=agent_id,
        action=proposed_action
    )
    updated_state = state_store.persist(agent_id, result)
    audit_log.record(agent_id, proposed_action, "EXECUTED")
    return updated_state
```

This structure is what makes agent behavior predictable. The model suggests. The runtime enforces. When something fails, engineers inspect execution paths instead of guessing why the model said what it said.

Managed runtimes such as Amazon Bedrock Agents follow the same pattern. Execution control, state management, and logging live outside the model. The separation matters more than the platform.

## Memory Has to Survive the Process

Agents depend on context. During early development, that context often lives in prompt history or in-memory objects. This works until the first restart. In production, memory has to survive restarts and scaling events.
It also has to be inspectable. Without that, agents forget earlier decisions, repeat work, or contradict themselves across runs. From the outside, it looks like poor reasoning. It is usually missing state.

A simple persistent state model already fixes much of this.

```python
import time


class State:
    def __init__(self, context, history):
        self.context = context
        self.history = history
        self.updated_at = time.time()


class StateStore:
    def load(self, agent_id):
        return database.fetch(agent_id)

    def persist(self, agent_id, result):
        state = self.load(agent_id)
        state.history.append(result)
        state.updated_at = time.time()
        database.save(agent_id, state)
        return state
```

When state lives outside the prompt, engineers can see what the agent knew, what changed, and when. Without that visibility, behavior feels random even when the logic itself is not. Memory is not an optimization. It is part of the system's contract.

## Tools Should Be Mediated, Not Exposed

Most agents become useful only when they can act in the world. That usually means tools: APIs, databases, internal services, automation hooks. In prototypes, these tools are often called directly because it is fast. That shortcut does not survive scale.

Direct tool access lets the model decide when side effects occur. Permissions sprawl. Credentials end up embedded where they should not be. Auditing becomes difficult because there is no single path that captures what was called and why.

A mediated design inverts this. The model requests an action. The system decides whether the action is allowed, under what conditions, and with which permissions.
```python
def execute_tool(agent_id, tool_request):
    permissions = permission_service.get_permissions(agent_id)
    if not permissions.allows(tool_request.name):
        raise PermissionError("Action not permitted")
    credentials = credential_service.issue_scoped_credentials(
        agent_id=agent_id,
        tool=tool_request.name
    )
    return tool_executor.run(
        tool_request=tool_request,
        credentials=credentials
    )
```

This moves access control out of prompts and into configuration. Credentials can be rotated. High-risk operations can be restricted. The agent still reasons about what it wants to do. The system controls what actually happens.

## Guardrails Must Live Outside the Model

Many early designs rely on instructions in prompts to enforce safety rules. Do not delete data. Do not escalate privileges. Only read from this system. Those instructions are guidance, not enforcement.

When guardrails exist only in text, compliance depends on how the model interprets them in a given moment. That is not reliable enough for systems that perform real actions. Guardrails belong in the control layer, where actions are validated before execution.

```python
def evaluate_policy(action, environment):
    if environment == "production" and action.type == "destructive":
        return "DENY"
    if action.required_scope not in action.granted_scopes:
        return "DENY"
    return "ALLOW"
```

If an action is not allowed, the system says no. The explanation does not matter.

## One Agent Eventually Becomes a Bottleneck

As agents take on more responsibility, a single reasoning loop becomes harder to control. Information gathering, evaluation, policy enforcement, and execution carry different risks and permission requirements. Treating them as one unit increases complexity and widens access scopes.

A common production pattern is to separate these concerns. One component gathers information. Another evaluates conditions. A third applies organizational rules. A fourth executes approved actions. An orchestrator coordinates the flow.
```python
def orchestrate(task):
    data = data_agent.collect(task)
    assessment = evaluation_agent.analyze(data)
    decision = policy_agent.validate(assessment)
    if decision.approved:
        return execution_agent.execute(decision)
    return None
```

This mirrors how distributed systems have been built for years. Boundaries reduce blast radius and make failures easier to reason about.

## Observability Is a Hosting Responsibility

When agents operate continuously, visibility is no longer optional. Teams need to know what the agent saw, what it decided, which tools it called, and what changed as a result. Console output might work early on. It does not hold up in production. A hosting environment has to capture execution steps, tool usage, and state transitions in a structured way.

```python
import time


def record_event(agent_id, phase, details):
    telemetry.write({
        "agent_id": agent_id,
        "phase": phase,
        "details": details,
        "timestamp": time.time()
    })
```

With proper observability, agent behavior becomes something engineers can analyze instead of arguing about. Without it, every incident turns into guesswork.

## Frameworks Still Matter, But They Are Not Hosting

Agent frameworks such as LangChain, LangGraph, LlamaIndex, and CrewAI still play an important role. They speed up development, reduce boilerplate, and make it easier to express reasoning flows, tool chains, and memory patterns. For early experimentation, they are often exactly what teams need.

What they do not provide is a hosting environment. Frameworks do not solve identity, durable state, policy enforcement, execution control, or observability. They assume those concerns are handled elsewhere. As systems mature, this distinction becomes unavoidable.

In production architectures, frameworks live inside a structured runtime. The framework defines what the agent is allowed to reason about. The platform decides what the agent is actually allowed to do. That separation is what makes complex agent systems operable.
It preserves the flexibility of framework-driven development while preventing reasoning logic from becoming the enforcement mechanism.

## Conclusion

AI agents earn trust through consistency, not clever output. An agent that runs for weeks without drifting, respects permissions without constant reminders, and leaves a clear trail of decisions becomes genuinely useful. An agent that relies on fragile prompts and hidden, in-memory state does not, no matter how impressive it looks in a demo.

Strong hosting turns AI from a text generator into a dependable system component. A capable model is impressive. A well-hosted agent is reliable.
In the rapidly evolving landscape of Generative AI, the Retrieval-Augmented Generation (RAG) pattern has emerged as the gold standard for grounding Large Language Models (LLMs) in private, real-time data. However, as organizations move from proof of concept (PoC) to production, they encounter a significant hurdle: scaling. Scaling a vector store isn't just about adding more storage; it's about maintaining low latency, high recall, and cost efficiency while managing millions of high-dimensional embeddings.

Azure AI Search (formerly Azure Cognitive Search) has recently undergone major infrastructure upgrades, specifically targeting enhanced vector capacity and performance. In this technical deep dive, we explore how to architect high-scale RAG applications using the latest capabilities of Azure AI Search.

## 1. The Architecture of Scalable RAG

At its core, a RAG application consists of two distinct pipelines: the Ingestion Pipeline (data to index) and the Inference Pipeline (query to response). When scaling to millions of documents, the bottleneck usually shifts from the LLM to the retrieval engine. Azure AI Search addresses this by separating storage and compute through partitions and replicas, while offering specialized, hardware-accelerated vector indexing.

### System Architecture Overview

The following diagram illustrates a production-grade RAG architecture. Note how the Search service acts as the orchestration layer between raw data and the generative model.

## 2. Understanding Enhanced Vector Capacity

Azure AI Search has introduced new storage-optimized and compute-optimized tiers that significantly increase the number of vectors you can store per partition.

### The Vector Storage Math

Vector storage consumption is determined by the dimensionality of your embeddings and the data type (for example, float32).
A standard 1,536-dimensional embedding (common for OpenAI models) using float32 requires:

```
1536 dimensions * 4 bytes = 6,144 bytes per vector (plus metadata overhead)
```

With the latest enhancements, certain tiers can now support tens of millions of vectors per index, using techniques such as Scalar Quantization to reduce memory footprint without significantly impacting retrieval accuracy.

### Comparing Retrieval Strategies

To build at scale, you must choose the right search mode. Azure AI Search is unique in that it combines traditional full-text search with vector capabilities.

| Feature | Vector Search | Full-Text Search | Hybrid Search | Semantic Ranker |
| --- | --- | --- | --- | --- |
| Mechanism | Cosine similarity/HNSW | BM25 algorithm | Reciprocal Rank Fusion | Transformer-based L3 |
| Strengths | Semantic meaning, context | Exact keywords, IDs, SKUs | Best of both worlds | Highest relevance |
| Scaling | Memory intensive | CPU/IO intensive | Balanced | Extra latency (ms) |
| Use Case | "Tell me about security" | "Error code 0x8004" | General enterprise search | Critical RAG accuracy |

## 3. Deep Dive: High-Performance Vector Indexing

Azure AI Search uses the HNSW (Hierarchical Navigable Small World) algorithm for vector indexing. HNSW is a graph-based approach that enables approximate nearest neighbor (ANN) searches with sub-linear time complexity.

### Configuring the Index

When defining your index, the vectorSearch configuration is critical. You must define the algorithmConfiguration to balance speed and accuracy.
```python
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile
)

# Configure HNSW parameters
# m: number of bi-directional links created for each new element during construction
# efConstruction: tradeoff between index construction time and search speed
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw-config",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "metric": "cosine"
            }
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-profile",
            algorithm_configuration_name="my-hnsw-config"
        )
    ]
)

# Define the index schema
index = SearchIndex(
    name="enterprise-rag-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="my-vector-profile"
        )
    ],
    vector_search=vector_search
)
```

### Why m and efConstruction Matter

- m: Higher values improve recall for high-dimensional data but increase the memory footprint of the index graph.
- efConstruction: Higher values produce a more accurate graph but increase indexing time. For enterprise datasets with over one million documents, values between 400 and 1000 are commonly used for initial index builds.

## 4. Integrated Vectorization and Data Flow

A common challenge at scale is the orchestration tax: the overhead of managing separate embedding services and indexers. Azure AI Search now offers Integrated Vectorization.

### The Data Flow Mechanism

By using integrated vectorization, the Search service handles chunking and embedding internally. When a document is added to a data source (such as Azure Blob Storage), the indexer automatically detects the change, chunks the content, invokes the embedding model, and updates the index.
This significantly reduces custom pipeline complexity.

## 5. Implementing Hybrid Search with Semantic Ranking

Pure vector search often struggles with domain-specific jargon or product identifiers (for example, Part-99-X). To build a robust RAG system, implement Hybrid Search with Semantic Ranking.

Hybrid search combines the results from a vector query and a keyword query using Reciprocal Rank Fusion (RRF). The Semantic Ranker then takes the top 50 results and applies a secondary, more compute-intensive transformer model to re-order them based on actual meaning.

### Code Example: Performing a Hybrid Query

```python
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name="enterprise-rag-index",
    credential=credential
)

# User's natural language query
query_text = "How do I reset the firewall configuration for the Pro series?"

# This embedding should be generated via your choice of model (e.g., text-embedding-3-small)
query_vector = get_embedding(query_text)

results = client.search(
    search_text=query_text,  # Keyword search query
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector"
        )
    ],
    select=["id", "content"],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config"
)

for result in results:
    print(f"Score: {result['@search.score']} | Semantic Score: {result['@search.reranker_score']}")
    print(f"Content: {result['content'][:200]}...")
```

The @search.reranker_score provides a more reliable relevance signal for LLM context selection than cosine similarity alone.

## 6. Scaling Strategies: Partitions and Replicas

Azure AI Search scales in two dimensions: partitions and replicas.

Partitions (horizontal scaling for storage) provide more storage and faster indexing. If you are hitting the vector limit, you add partitions. Each partition effectively "slices" the index.
For example, if one partition holds 1M vectors, two partitions hold 2M.

Replicas (horizontal scaling for query volume) handle query throughput (queries per second, QPS). If your RAG app has 1,000 concurrent users, you need multiple replicas to prevent request queuing.

### Estimating Capacity

When designing your system, follow this rule of thumb:

- Low latency requirements: maximize replicas.
- Large dataset: maximize partitions.
- High availability: minimum of 2 replicas for a read-only SLA, 3 for a read-write SLA.

## 7. Performance Tuning and Best Practices

Building at scale requires more than just infrastructure; it requires smart data engineering.

### Optimal Chunking Strategies

The quality of your RAG system is directly proportional to the quality of your chunks.

- Fixed-size chunking: fast but often breaks context.
- Overlapping chunks: essential for ensuring context isn't lost at the boundaries. A common pattern is 512 tokens with a 10% overlap.
- Semantic chunking: using an LLM or specialized model to find logical breakpoints (paragraphs, sections). This is more expensive but yields better retrieval results.

### Indexing Latency vs. Search Latency

When you scale to millions of vectors, HNSW graph construction can take time. To optimize:

- Batch your uploads: don't upload documents one by one. Use the upload_documents batch API with 500-1000 documents per batch.
- Use the ParallelIndex approach: if your dataset is static and massive, consider using multiple indexers pointing to the same index to parallelize the embedding generation.

### Monitoring Relevance

Scaling isn't just about size; it's about maintaining quality. Use retrieval metrics to evaluate your index performance:

- Recall@K: how often is the correct document in the top K results?
- Mean Reciprocal Rank (MRR): how high up in the list is the relevant document?
- Latency P95: what is the 95th percentile response time for a hybrid search?

## 8. Conclusion: The Future of Vector-Enabled Search

Azure AI Search has evolved from a keyword index into a high-performance vector engine capable of powering large-scale RAG systems. With enhanced vector capacity, hybrid retrieval, and integrated vectorization, teams can focus on the generation layer rather than retrieval infrastructure. Future capabilities such as vector quantization and disk-backed HNSW will push scalability further, enabling billions of vectors at lower cost.

For enterprise architects, the takeaway is clear: scaling RAG isn't just about the LLM; it's about building a resilient, high-capacity retrieval foundation.

### Technical Checklist for Production Deployment

- Choose the right tier: S1, S2, or the new L-series (Storage Optimized) based on vector counts.
- Configure HNSW: tune m and efConstruction based on your recall requirements.
- Enable Semantic Ranker: use it for the final re-ranking step to significantly improve LLM output.
- Implement Integrated Vectorization: simplify your pipeline and reduce maintenance overhead.
- Monitor with Azure Monitor: keep an eye on Vector Index Size and Search Latency as your dataset grows.

For more technical guides on Azure, AI architecture, and implementation, follow: Twitter/X, LinkedIn, GitHub
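As a rough illustration of the partition/replica rule of thumb above, the sizing logic can be sketched as a small estimator. The per-partition vector count and per-replica QPS figures are assumptions you would benchmark for your own tier and index configuration, not published limits.

```python
import math

def estimate_capacity(total_vectors, vectors_per_partition, peak_qps,
                      qps_per_replica, high_availability="read-write"):
    """Rough sizing sketch: partitions scale storage, replicas scale query volume.

    vectors_per_partition and qps_per_replica are workload-specific
    assumptions that must be validated against your own benchmarks.
    """
    partitions = math.ceil(total_vectors / vectors_per_partition)
    replicas = math.ceil(peak_qps / qps_per_replica)
    # SLA rule of thumb from the article: at least 2 replicas for a
    # read-only SLA, 3 for a read-write SLA.
    min_replicas = 3 if high_availability == "read-write" else 2
    replicas = max(replicas, min_replicas)
    return {"partitions": partitions, "replicas": replicas}

# Example: 10M vectors at an assumed 1M per partition,
# 400 QPS peak at an assumed 50 QPS per replica
print(estimate_capacity(10_000_000, 1_000_000, 400, 50))
```

The point of the sketch is the shape of the decision: storage requirements and query volume are sized independently, and the availability floor overrides the throughput estimate when traffic is low.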
My recent journey into agentic developer systems has been driven by a desire to understand how AI moves from passive assistance to active participation in software workflows. In an earlier article, AI Co-creation in Developer Debugging Workflows, I explored how developers and AI systems collaboratively reason about code. As I went deeper into this space, I came across the Model Context Protocol (MCP) and became keen to understand what this component is and why it is important. I noticed that MCP was frequently referenced in discussions about agentic systems, yet rarely explained in a concrete, developer-centric way. This article is a direct outcome of that learning process, using a practical Git workflow example to clarify the role and value of MCP in intent-driven developer tooling.

## What Is an MCP Server?

At a conceptual level, an MCP server acts as a control plane between an AI assistant and external systems. Rather than allowing an LLM to issue arbitrary API calls, the MCP server implements the Model Context Protocol and exposes a constrained, well-defined set of capabilities that the model can invoke.

As illustrated in the diagram, the AI assistant functions as an MCP client, issuing structured MCP requests that represent user intent. The MCP server receives these requests, validates them against exposed capabilities and permissions, and translates them into concrete API calls or queries against external systems such as databases, version control platforms, or document stores. The results are then returned to the model as structured context, enabling subsequent reasoning or follow-up actions.

This intermediary role is critical. The MCP server is not merely a proxy; it enforces permission boundaries, operation granularity, and deterministic execution. By separating intent expression from execution logic, MCP reduces the risk of unsafe or unintended actions while enabling AI systems to operate on real developer tools in a controlled manner.
In effect, the MCP server bridges conversational AI and operational systems, making intent-driven workflows both practical and governable.

## Case Study: Intent-Driven Git Workflows Using GitHub MCP in VS Code

To ground the discussion, this section presents a concrete case study using the open-source github-mcp-server, integrated into Visual Studio Code via GitHub Copilot Chat. The goal of this case study is not to demonstrate feature completeness, but to illustrate how MCP enables intent-first interaction for common GitHub workflows.

### MCP Server Registration in VS Code

MCP servers are configured at the workspace or user level using a dedicated configuration file. In this setup, the GitHub MCP server is registered by adding an MCP configuration file under the VS Code workspace:

.vscode/mcp.json

```json
{
  "servers": {
    "github": {
      "url": "https://api.githubcopilot.com/mcp/"
    }
  }
}
```

This configuration declares GitHub as an MCP server and points the IDE's MCP client to a remote endpoint. Once registered, the IDE can discover the capabilities exposed by the GitHub MCP server and make them available to the chat interface as structured tools.

### Authentication via OAuth Approval

When the MCP server is first invoked, VS Code initiates an OAuth flow with GitHub. In this case, authentication was completed by approving access through a browser-based login using GitHub credentials (username and password, followed by any configured multi-factor authentication).

This OAuth-based flow has several important properties:

- Credentials are not stored directly in the MCP configuration.
- Permissions are scoped to the approved application.
- Token issuance and rotation are handled by the GitHub authorization system.

Once authorization is complete, the MCP server can securely execute GitHub operations on behalf of the user, subject to the granted scopes (these are listed as tools when configuring the MCP server).
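The control-plane role described above can be illustrated with a small sketch. This is not code from github-mcp-server; the capability names, scope strings, and registry structure are hypothetical. The point is only the shape of the validation step: an intent is rejected unless it names an exposed capability and the session's granted scopes cover it.

```python
# Hypothetical sketch of MCP-style capability mediation.
# Capability names and scope strings are illustrative only.
REGISTRY = {
    "list_repositories": {"required_scope": "repo:read"},
    "create_pull_request": {"required_scope": "repo:write"},
}

def handle_intent(capability, granted_scopes, execute):
    """Execute an intent only if it maps to an exposed, in-scope capability."""
    spec = REGISTRY.get(capability)
    if spec is None:
        return {"status": "rejected", "reason": "unknown capability"}
    if spec["required_scope"] not in granted_scopes:
        return {"status": "rejected", "reason": "missing scope"}
    return {"status": "ok", "result": execute(capability)}

# Example: a read-only session can list repositories but not open PRs
session_scopes = {"repo:read"}
print(handle_intent("list_repositories", session_scopes, lambda c: ["repo-a"]))
print(handle_intent("create_pull_request", session_scopes, lambda c: None))
```

The design choice worth noting is that rejection happens before any external call is made, which is what makes the granted-scopes model enforceable rather than advisory.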
### Alternative Authentication: Personal Access Tokens

In addition to browser-based OAuth authorization, the GitHub MCP server can also be configured using a GitHub Personal Access Token (PAT). This approach is useful when explicit credential control is required or when OAuth approval is not feasible in a given environment.

In this setup, the MCP configuration declares an Authorization header and prompts the user to supply the token securely at runtime, rather than hardcoding it in the file.

.vscode/mcp.json (PAT-based authentication)

```json
{
  "servers": {
    "github": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {
        "Authorization": "Bearer ${input:github_mcp_pat}"
      }
    }
  },
  "inputs": [
    {
      "type": "promptString",
      "id": "github_mcp_pat",
      "description": "GitHub Personal Access Token",
      "password": true
    }
  ]
}
```

This configuration has two practical advantages. First, the token is not committed to source control because it is collected via an interactive prompt. Second, it makes the authentication mechanism explicit and portable across environments while keeping the MCP server endpoint unchanged. After the token is provided, the IDE can invoke GitHub MCP capabilities through the same intent-driven prompts used in the OAuth-based setup.

### Verifying MCP Server Initialization in VS Code

After adding the MCP configuration, it is important to verify that the GitHub MCP server is correctly initialized and running. Visual Studio Code exposes MCP server lifecycle events directly in the Output panel, which serves both as a validation mechanism and a primary debugging surface.

Once the .vscode/mcp.json file is detected, VS Code attempts to start the configured MCP server automatically. In the Output tab, selecting the "MCP: github" channel shows detailed startup logs, including server initialization, connection state, authentication discovery, and tool registration.
The logs confirm several important stages:

- The GitHub MCP server transitions from Starting to Running
- OAuth-protected resource metadata is discovered
- The GitHub authorization server endpoint is identified
- The server responds successfully to the initialization handshake
- A total of 40 tools are discovered and registered

These log entries provide concrete evidence that the MCP server is active and that its capabilities are available to the IDE. They also offer visibility into the OAuth flow, making it clear when authentication is required and when it has been successfully completed.

From a practical standpoint, the Output panel becomes essential when troubleshooting MCP integrations. Configuration errors, authentication failures, or capability discovery issues surface immediately in these logs, allowing developers to debug MCP setup issues without leaving the IDE or guessing at silent failures.

### Executing GitHub Operations Through Intent

Once the GitHub MCP server is configured and running, GitHub operations become available inside the IDE as structured capabilities. Using Visual Studio Code with GitHub Copilot Chat, prompts expressed in natural language are translated into constrained GitHub operations via the github-mcp-server.

#### Repository Discovery

Prompt: "List all repos in my GitHub account."

The assistant invokes the repository-listing capability and returns the results directly in the IDE, validating authentication and MCP capability discovery.

#### Pull Request Creation

Prompt: "Create a PR."

Because the request is underspecified, the assistant asks for required parameters, including repository, change source, title, description, and base branch. After responding with:

"react-storybook-starter, staged changes, PR title – Add a dummy commit, PR description none, merge to master"

the assistant creates a branch, commits the staged changes, and opens a pull request. The PR is confirmed with its repository identifier.
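The clarification behavior in the pull-request interaction falls out of the tool's declared schema: the client can diff a partially specified request against the required parameters and ask only for what is missing. A minimal sketch of that idea follows; the field names here are hypothetical, not github-mcp-server's actual tool definition.

```python
# Hypothetical tool schema; real MCP tools declare required parameters
# in a machine-readable schema the client can inspect.
CREATE_PR_SCHEMA = {
    "required": ["repository", "source", "title", "base_branch"],
    "optional": ["description"],
}

def missing_parameters(request, schema):
    """Return the required parameters the user has not supplied yet."""
    return [p for p in schema["required"] if p not in request]

# "Create a PR" with no details triggers a clarification turn
request = {}
print(missing_parameters(request, CREATE_PR_SCHEMA))

# After the user answers, the request is complete and can be executed
request.update({
    "repository": "react-storybook-starter",
    "source": "staged changes",
    "title": "Add a dummy commit",
    "base_branch": "master",
})
print(missing_parameters(request, CREATE_PR_SCHEMA))
```

Because the schema, not the model, defines what is required, the assistant asks for clarification exactly once per missing field rather than guessing.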
#### Repository Creation

Prompt: "Create a new repo in mvmaishwarya. Repo name: problems-and-prep. Repo is public."

The MCP server executes the repository creation operation and returns confirmation that the public repository has been successfully provisioned.

### Observations from Intent-Driven Execution

Across these examples, several consistent behaviors emerge. First, the assistant requests clarification only when required by the operation's schema, avoiding unnecessary dialogue. Second, all actions are executed through explicitly exposed MCP capabilities rather than inferred or free-form API calls. Finally, the IDE remains the primary workspace, reducing context switching between terminals, browsers, and documentation.

Together, these interactions demonstrate how MCP enables GitHub workflows to shift from command-driven procedures to intent-driven execution while maintaining safety, transparency, and developer control.
The landscape of Machine Learning Operations (MLOps) is shifting from manual configuration to AI-driven orchestration. As organizations scale their AI initiatives, the bottleneck is rarely the model architecture itself, but rather the underlying infrastructure required to train, deploy, and monitor these models at scale. Amazon Q Developer, a generative AI–powered assistant, has emerged as a critical tool for architects and engineers looking to automate the lifecycle of AI infrastructure.

Traditionally, setting up a robust ML pipeline involved complex Infrastructure as Code (IaC), intricate IAM permissioning, and manual tuning of compute resources like NVIDIA H100s or AWS Trainium. Amazon Q Developer streamlines this by translating high-level architectural requirements into production-ready scripts, optimizing resource allocation, and troubleshooting connectivity issues within the AWS ecosystem. This article explores the technical architecture of using Amazon Q for ML infrastructure and provides practical implementation strategies.

## 1. The Architectural Blueprint of Q-Assisted ML Pipelines

To understand how Amazon Q Developer automates ML pipelines, we must examine its integration points within the AWS Well-Architected Framework. Amazon Q operates as a management layer that interfaces with the AWS Cloud Control API, SageMaker, and CloudFormation/CDK.

In a typical automated ML architecture, Amazon Q acts as the "intelligence agent" that sits between the developer's IDE and the target cloud environment. It doesn't just suggest code snippets; it understands the context of ML workloads, such as data throughput requirements and memory-intensive training jobs. This architecture ensures that the infrastructure is not a static set of scripts, but an evolving entity that can be refactored by Amazon Q based on performance metrics received from CloudWatch.

## 2. Automating Infrastructure as Code (IaC) for GPU Clusters

Provisioning high-performance compute clusters for deep learning is notoriously difficult. Misconfigurations in VPC subnets or security groups can lead to latency issues during distributed training (e.g., using Horovod or PyTorch Distributed Data Parallel). Amazon Q Developer excels at generating AWS CDK (Cloud Development Kit) code that follows best practices for networking and resource isolation.

When prompted to "Create a SageMaker pipeline with VPC-only access and GPU acceleration," Amazon Q generates the necessary constructs to ensure that training traffic stays within the AWS backbone, reducing data transfer costs and increasing security.

### Comparison: Manual vs. Q-Assisted Provisioning

| Feature | Manual Implementation | Q-Assisted Implementation |
| --- | --- | --- |
| Resource Selection | Manual benchmarking of P4/P5 instances | AI-driven recommendation based on workload |
| IAM Policy Creation | Trial and error (least privilege) | Automated generation of scoped IAM roles |
| Networking | Manual VPC/Subnet/NAT Gateway setup | Pattern-based VPC architecture generation |
| Scaling | Static auto-scaling policies | Dynamic scaling based on throughput projections |

## 3. Streamlining the Data Engineering Layer

ML pipelines are only as good as the data feeding them. Automating the ETL (Extract, Transform, Load) process is a primary use case for Amazon Q. It can generate AWS Glue jobs or Amazon EMR configurations that handle petabyte-scale data processing.

For example, if you need to partition a massive dataset in S3 by date and feature set, Amazon Q can provide the PySpark code necessary to optimize the storage layout for Athena queries. This reduces the time data scientists spend on "data plumbing" and allows them to focus on feature engineering.
```python
import boto3
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.sklearn.processing import SKLearnProcessor

# This script demonstrates a Q-assisted SageMaker Pipeline definition
def create_ml_pipeline(role_arn, bucket_name):
    # Initialize SageMaker Session
    sagemaker_session = sagemaker.Session()

    # Amazon Q assisted in generating this processing step configuration.
    # It ensures the use of the correct instance type for large-scale CSV processing.
    sklearn_processor = SKLearnProcessor(
        framework_version='0.23-1',
        role=role_arn,
        instance_type='ml.m5.xlarge',
        instance_count=2,
        base_job_name='data-prep-job'
    )

    # Step for data processing
    step_process = ProcessingStep(
        name="PreprocessData",
        processor=sklearn_processor,
        inputs=[ProcessingInput(source=f"s3://{bucket_name}/raw/", destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
        code="preprocess.py"  # Script logic also assisted by Q
    )

    return Pipeline(name="AutomatedMLPipeline", steps=[step_process])
```

4. Performance Optimization and Instance Selection

One of the most complex aspects of ML architecture is selecting the right instance type for the right task. Using the wrong instance can lead to throttled performance or excessive costs. Amazon Q Developer provides deep insights into instance families. It can suggest switching from ml.p3.2xlarge to ml.g5.2xlarge for certain inference workloads to achieve a better price-to-performance ratio.

Distributed Training Sequence

The following sequence diagram illustrates how Amazon Q facilitates the setup of a distributed training job across multiple nodes.

5.
Security, Governance, and Compliance

In highly regulated industries (e.g., finance and healthcare), ML infrastructure must adhere to strict compliance standards such as HIPAA and PCI DSS. Amazon Q Developer helps by suggesting security configurations that developers might otherwise overlook, including:

- Encryption at rest: Automatically adding KMS key IDs to S3 buckets and EBS volumes
- Encryption in transit: Enabling inter-node encryption for distributed training jobs
- VPC endpoints: Generating configurations for interface VPC endpoints to avoid traversing the public internet

When reviewing existing IaC templates, Amazon Q can identify overly permissive IAM roles and suggest refined policies that restrict access to specific S3 prefixes or SageMaker resources.

6. Practical Use Case: Real-Time Inference Pipeline

Consider a scenario in which a retail company needs to deploy a recommendation engine. The architecture requires a SageMaker endpoint, an API Gateway, and a Lambda function for preprocessing. Amazon Q Developer can generate the entire stack using the AWS Serverless Application Model (SAM). It provides the Swagger definition for the API, the Python code for the Lambda function (handling JSON validation), and the configuration for SageMaker Multi-Model Endpoints (MME) to save costs by hosting multiple models on a single instance.

Performance Considerations

- Cold starts: Q can suggest Lambda Provisioned Concurrency settings based on expected traffic.
- Endpoint latency: It can recommend enabling SageMaker Inference Recommender to find the optimal instance configuration for sub-100 ms latency.

Best Practices for Q-Driven ML Infrastructure

- Verify generated code: Always review AI-generated IaC in a sandbox environment before deploying to production.
- Contextual prompting: Provide Q with specific constraints (e.g., “Use Graviton-based instances where possible”) to optimize for cost.
- Iterative refinement: Use Q to refactor legacy ML pipelines.
Ask it to “modernize this CloudFormation template to use AWS CDK v2.”
- Integrate with CI/CD: Use Q to generate GitHub Actions or AWS CodePipeline definitions that automate testing of your ML infrastructure.

Conclusion

Amazon Q Developer is transforming the role of the ML architect from a manual scriptwriter into a high-level system designer. By automating the boilerplate of infrastructure provisioning, security configuration, and performance tuning, Q allows teams to deploy models faster and with greater confidence. As generative AI continues to evolve, the integration between developer assistants and cloud infrastructure will become the standard for building the next generation of AI-powered applications.
If you are deploying LLM inference in production, you are no longer just doing machine learning. You are doing applied mathematics plus systems engineering. Most teams tune prompts, choose a model, then wonder why latency explodes at peak traffic. The root cause is usually not the model. It is load, variability, and the queue that forms when the arrival rate approaches the service capacity.

This article gives you a practical, math-driven way to reason about LLM serving. We will use queueing theory, Little’s Law, and a simple simulation to answer the questions every leader gets asked. How many GPUs do we need? What is our safe throughput? How should we batch? What happens to p95 and p99 under bursty traffic? The goal is not to build a perfect analytical model. The goal is to build an engineering calculator you can defend.

Core Mental Model

Every request is a job. Jobs arrive over time. GPUs process jobs. If jobs arrive faster than you can process, they wait. Waiting is your latency. Define:

- Arrival rate: λ requests per second
- Service time: S seconds per request
- Service rate per worker: μ = 1 / S
- Number of workers: k (GPU replicas or GPU partitions)
- Utilization: ρ = λ / (k μ) = λ S / k

The first rule of production inference: keep ρ comfortably below 1. As ρ approaches 1, queues grow superlinearly, and tail latency blows up.

Little’s Law

Little’s Law is the simplest and most useful equation you can bring into an SLO meeting.

L = λ W

- L is the average number of jobs in the system
- W is the average time in the system (waiting plus service)
- λ is the arrival rate

If you can measure two of these, you get the third. More importantly, it forces clarity: if you want lower W at the same λ, you must reduce L by increasing service capacity or smoothing variability.

Why LLM Serving Is Harder Than Normal Web Serving

LLM inference violates the assumptions people unconsciously make when they reason about latency.
Service time is highly variable because prompt length varies, output length varies, tool use varies, and cache hit rate varies. Moreover, arrivals are bursty because enterprise traffic often has diurnal peaks and release-driven spikes. Batching increases throughput but can add waiting time because you may hold requests to form a batch. This variability is exactly where applied computational math helps. We do not need perfect predictions. We need safe bounds and policies that degrade gracefully.

A Simple Capacity Sizing Formula

Start with a capacity bound that is almost embarrassingly simple. If each request takes S seconds on average, and you have k identical workers, then stable operation requires:

λ < k / S

Rearrange to size k:

k > λ S

Then add headroom for variability and tail behavior. A common engineering rule is to target utilization ρ between 0.4 and 0.7 for strict tail latency, depending on burstiness and service time variance. So a practical sizing is:

k = ceil( λ S / ρ_target )

Example

Suppose peak λ is 120 requests per second. Average service time S is 0.18 seconds per request on your chosen model and hardware. If you target ρ_target = 0.6:

k = ceil(120 × 0.18 / 0.6)
  = ceil(21.6 / 0.6)
  = ceil(36)
  = 36

So you start with 36 workers. This is a starting point. Next, we incorporate batching and tail.

Batching as a Control Problem

Batching is not magic. It is a scheduling policy. If you batch B requests together, you often improve compute efficiency and reduce per-request service time. But you also introduce batch formation delay. A useful decomposition is:

Total latency = queue wait + batch wait + compute time

Batch wait is the time a request sits while you fill the batch. You can control it using a max wait timer. Given a max batch size B_max and a max batch wait T_max, dynamic batching accumulates requests until B_max is reached or T_max expires, then dispatches. Batching improves throughput when compute cost scales sublinearly with B.
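The sizing rule k = ceil(λ S / ρ_target) from the example above is easy to wrap in a small helper you can keep next to your capacity plans. A minimal sketch (the function name is illustrative):

```python
import math

def size_workers(arrival_rate, mean_service, rho_target):
    """k = ceil(lambda * S / rho_target), with a tiny epsilon so
    floating-point noise cannot round the ceiling up by one."""
    if not (0 < rho_target < 1):
        raise ValueError("rho_target must be in (0, 1)")
    return math.ceil(arrival_rate * mean_service / rho_target - 1e-9)

# The worked example from the text: 120 req/s, S = 0.18 s, rho_target = 0.6
print(size_workers(120, 0.18, 0.6))  # 36
```

Sweeping rho_target between 0.4 and 0.7 with this helper gives you the capacity range the headroom rule implies.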
For transformer decoding, you may get good scaling for prefill, and weaker scaling for long decode. The details depend on your serving stack. Batching is only beneficial if the throughput gains outweigh the added waiting, especially at p95 and p99. High-throughput serving of LLMs typically depends on batching and careful KV cache management, as described in PagedAttention and vLLM. If your workload is bursty, dynamic batching with a small T_max often dominates naive large batches. If you deploy with NVIDIA stacks, TensorRT LLM discusses in-flight batching and request scheduling.

A Tail Latency Heuristic You Can Use

Even without heavy theory, you can build a safe heuristic:

- Choose a latency SLO, for example, p95 under 800 ms
- Reserve part of the budget for model compute, for example, 300 ms
- Reserve part for network and orchestration, for example, 100 ms
- The rest is queueing plus batching budget, for example, 400 ms
- Enforce T_max below your queueing budget, for example, 20 to 50 ms

If T_max is too large, you manufacture tail latency even when you have capacity.

Simulation: A Small Model You Can Run

Analytical queueing models like M/M/k can be informative, but LLM service times are rarely exponential. A quick discrete event simulation is often more honest and aligns with standard performance modeling practice described in the Performance Modeling and Design of Computer Systems book. Below is a compact simulation that lets you explore capacity, service time variability, and batching timers. You can adapt it to your real telemetry distributions.
```python
import random
import heapq
import math
from statistics import mean

def percentile(xs, p):
    xs = sorted(xs)
    if not xs:
        return None
    i = int(math.ceil(p * len(xs))) - 1
    i = max(0, min(i, len(xs) - 1))
    return xs[i]

def simulate(
    seconds=120,
    arrival_rate=100.0,   # λ requests per second
    workers=24,           # k
    mean_service=0.20,    # seconds
    service_cv=0.8,       # coefficient of variation
    batch_max=8,
    batch_wait_max=0.03,  # seconds
    seed=0
):
    random.seed(seed)

    # Arrivals as a Poisson process
    t = 0.0
    arrivals = []
    while t < seconds:
        t += random.expovariate(arrival_rate)
        if t < seconds:
            arrivals.append(t)

    # Service time model: lognormal with chosen mean and cv
    if service_cv <= 0:
        sigma = 0.0
        mu = math.log(mean_service)
    else:
        sigma2 = math.log(1 + service_cv**2)
        sigma = math.sqrt(sigma2)
        mu = math.log(mean_service) - 0.5 * sigma2

    def sample_service_time(batch_size):
        # Simple batching efficiency curve
        # Replace this with measurements from your stack
        base = random.lognormvariate(mu, sigma)
        efficiency = 0.55 + 0.45 / math.sqrt(batch_size)
        return base * efficiency

    # Worker availability times
    worker_free = [0.0 for _ in range(workers)]
    heapq.heapify(worker_free)
    latencies = []

    # Batch accumulator
    batch = []
    batch_first_arrival = None
    idx = 0
    current_time = 0.0

    def dispatch_batch(dispatch_time, batch_items):
        nonlocal latencies
        free_time = heapq.heappop(worker_free)
        start_time = max(free_time, dispatch_time)
        service_time = sample_service_time(len(batch_items))
        finish_time = start_time + service_time
        heapq.heappush(worker_free, finish_time)
        for arrival_time in batch_items:
            latencies.append(finish_time - arrival_time)

    while idx < len(arrivals) or batch:
        next_arrival = arrivals[idx] if idx < len(arrivals) else float("inf")
        next_deadline = (batch_first_arrival + batch_wait_max) if batch_first_arrival is not None else float("inf")
        current_time = min(next_arrival, next_deadline)
        if current_time == next_arrival:
            at = next_arrival
            idx += 1
            if not batch:
                batch_first_arrival = at
            batch.append(at)
            if len(batch) >= batch_max:
                dispatch_batch(at, batch)
                batch = []
                batch_first_arrival = None
        else:
            dispatch_batch(current_time, batch)
            batch = []
            batch_first_arrival = None

    return {
        "mean": mean(latencies),
        "p50": percentile(latencies, 0.50),
        "p95": percentile(latencies, 0.95),
        "p99": percentile(latencies, 0.99),
        "max": max(latencies) if latencies else None,
        "count": len(latencies),
    }

if __name__ == "__main__":
    out = simulate(
        seconds=180,
        arrival_rate=120.0,
        workers=36,
        mean_service=0.18,
        service_cv=0.9,
        batch_max=8,
        batch_wait_max=0.03,
        seed=42
    )
    print(out)
```

How to use this in practice:

- Replace the service time sampler with your measured distribution
- Use real arrival traces, not just Poisson
- Sweep workers, batch_max, and batch_wait_max
- Track p95 and p99, not just mean

This turns a fuzzy infrastructure debate into a quantitative policy discussion.

A Deployment Playbook That Reads Like Applied Math

Step 1: Measure the Service Time Distribution
Instrument per-request compute time split into prefill and decode. Track prompt tokens, output tokens, and cache hits.

Step 2: Decide What You Are Optimizing
If your business cares about p99, size for p99. If your business cares about cost, set a max queueing budget and accept more shedding.

Step 3: Pick a Utilization Target and Enforce Admission Control
Choose ρ_target and do not exceed it at peak. Use a queue length circuit breaker. When overload hits, degrade and do not accumulate an infinite queue, as recommended by Google's SRE playbook.

Step 4: Use Dynamic Batching With a Strict Timer
Set batch_wait_max to protect tail latency. Use smaller batches under low load, larger batches under high load.

Step 5: Add a Second Lever: Request Shaping
Route long prompts to a separate pool. Cap max generation length by tier. Use early exit for low-confidence tasks.

Step 6: Validate With Chaos Load Tests
Replay bursty traffic. Replay worst-case long outputs. Confirm SLOs under realistic spikes.
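The queue-length circuit breaker from Step 3 takes very little code. A minimal sketch, with the class name and threshold arithmetic as illustrative assumptions rather than a recommendation:

```python
class AdmissionController:
    """Queue-length circuit breaker: shed load instead of letting the
    queue (and therefore waiting time) grow without bound."""

    def __init__(self, workers, mean_service, queue_budget_s):
        self.workers = workers
        self.mean_service = mean_service
        self.queue_budget_s = queue_budget_s
        self.queued = 0

    def max_queue_len(self):
        # A queue of length q waits roughly q * mean_service / workers,
        # so cap q where that wait would exceed the queueing budget.
        return round(self.queue_budget_s * self.workers / self.mean_service)

    def try_admit(self):
        if self.queued >= self.max_queue_len():
            return False  # fast-fail or degrade rather than queue
        self.queued += 1
        return True

    def release(self):
        self.queued -= 1

# 36 workers, 0.18 s mean service, 400 ms queueing budget (from the heuristic)
ac = AdmissionController(workers=36, mean_service=0.18, queue_budget_s=0.4)
print(ac.max_queue_len())  # 80
```

Requests rejected here should get a fast, explicit failure or a degraded response, which is exactly the overload behavior Step 3 calls for.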
What to Say to Leadership

When someone asks why p99 jumped from 900 ms to 6 seconds, you can say it clearly:

- We moved closer to utilization 1.
- Queueing delay grows nonlinearly near saturation.
- Batching timers and variability amplified the tail.
- We need either more capacity, stricter batching timers, or overload policies.

Applied mathematics is not an academic add-on to LLM systems. It is the difference between a demo and a reliable service. If you treat LLM inference as a queueing system, you gain levers you can measure and control: utilization, batching delay, service time variance, and admission control. That is how you hit SLOs while keeping costs rational.

The opinions expressed in this article are the authors’ personal opinions and do not reflect the opinions of their employer.
The tenets I introduced in Part 1 covered the functional mechanics — the core features that power an AI platform. But in production, functionality is only half the battle. These next six Operational Tenets are about how the platform survives the chaos of the real world and scales without breaking under its own complexity. Here are the pillars critical to operating an AI platform at scale:

7. Evaluation Pipelines: Making Quality Measurable

In deterministic systems, code either works or it doesn’t. In agentic systems, “working” is probabilistic and context-dependent. Moving beyond the happy-path demo requires translating the agentic system’s behavior into measurable signals that engineers can act on.

Quality Evaluation at Scale

Manual evaluation quickly becomes a bottleneck as agent workflows grow. Automating this with an evaluation platform allows reasoning traces and responses to be assessed against Gold Datasets — hand-curated “ground truth” examples of what a perfect interaction looks like. Such systems are built to evaluate quality benchmarks such as tool-calling correctness, policy adherence, factual accuracy, and task completion. Insights from these evaluations feed directly into engineering improvements, from prompt tuning and model selection to workflow optimization.

Concurrency & Latency Stress Testing

Quality alone is insufficient if the system degrades under load. Actively stress-testing multi-agent workflows uncovers race conditions and reveals how latency compounds across reasoning chains. Benchmarking under peak concurrency ensures the platform remains responsive and predictable as complexity increases.

8. Graceful Degradation: Designing for Partial Failure

Failures are inevitable in a complex agentic ecosystem. Models hit rate limits, tools time out, and sub-agents can misbehave. A resilient platform ensures localized failures do not cascade into a total breakdown of reasoning or user experience.
Functional Tiering

Agentic workflows should have multiple capability levels rather than a single “all-or-nothing” path. When a high-value function is unavailable — due to a tool outage, token exhaustion, a permission issue, or a dependency failure — the agent should gracefully pivot to the next best action. This helps preserve session continuity, maintain user trust, and allows the system to remain helpful even when optimal execution is temporarily unavailable. For example, if the agent can’t book the flight (Tier 1), it should at least provide flight options (Tier 2), and at worst, provide the booking link or customer service number (Tier 3).

Model Tiering & Fallbacks

Model selection can follow the same tiered philosophy. High-reasoning models are reserved for complex planning and synthesis, while lighter-weight models are sufficient for intent detection, clarification, or basic responses. The platform continuously monitors model health and performance; when latency spikes or rate limits are detected, deterministic circuit breakers can trigger an automatic fallback to lower-latency models. This ensures responsiveness — particularly Time to First Token (TTFT) — while preserving core functionality until full capacity is restored.

9. Deep Observability: Seeing the Agent Think

It’s not enough to know the system is running — what matters is whether the agent is working correctly. For agentic platforms, this warrants visibility into the full agent lifecycle and reasoning process, from user intent to final output.

Reasoning Trace Monitoring

A simple solution is to instrument the Orchestrator, sub-agents, and tools to log each step of their decision-making process. For example, if a workflow normally resolves a member query in three reasoning steps but suddenly takes ten, it signals a potential regression — perhaps a misfired tool, policy conflict, or prompt anomaly.
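That step-count regression signal takes very little code to check. A minimal sketch, where the z-score threshold is an illustrative assumption you would tune per workflow:

```python
from statistics import mean, stdev

def is_step_anomaly(history, current_steps, z_threshold=3.0):
    """Flag a reasoning trace whose step count deviates sharply
    from the workflow's historical baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current_steps != mu
    return abs(current_steps - mu) / sigma > z_threshold

# A workflow that normally resolves in ~3 steps suddenly takes 10
baseline = [3, 3, 4, 3, 2, 3, 3, 4]
print(is_step_anomaly(baseline, 10))  # True
print(is_step_anomaly(baseline, 3))   # False
```

In practice you would emit this flag as a metric per workflow and alert on it, rather than inspect traces by hand.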
Correlating reasoning traces with inputs, outputs, and intermediate tool calls allows automated anomaly detection, root cause analysis, and evaluation of model or prompt changes.

Agentic Distributed Tracing

Using protocols like OpenTelemetry, traces propagate across the entire agent mesh — from the user request through the Orchestrator, safety guardrails, sub-agents, and external tools, back to the response. This provides a holistic view of the agent lifecycle, enabling proactive tuning, debugging, and identification of latency hotspots, logic loops, or bottlenecks at any component.

10. Telemetry-Driven Iteration: The Feedback Loop

An agentic platform is an evolutionary engine: to improve, it must capture and interpret every interaction, not just the obvious signals.

Implicit vs. Explicit Feedback

Explicit signals — like thumbs up or down — are useful, but the real insight lies in implicit telemetry. Did the user act on the agent’s suggestion? Did they rephrase the query, issue a follow-up, or abandon the task? These subtle signals reveal whether the agent’s reasoning and recommendations truly aligned with user intent.

Continuous A/B Testing

Every parameter — temperature, response length, tone, or tool selection — can be treated as an experiment. Continuous A/B testing of these “micro-parameters” fine-tunes platform behavior, optimizing engagement, task completion, and user satisfaction. This telemetry-driven loop transforms every session into a source of learning, enabling the platform to evolve its personality and effectiveness over time.

11. Developer Productivity: Low-Touch Onboarding

For a platform to scale, the barrier to entry for new skills must be near zero. Low-touch, guaranteed-safe onboarding democratizes agent creation across the organization.

Plug-and-Play Onboarding

Adding a new agent or skill should be as simple as editing a configuration file or using a lightweight UI to define the workflow, tools, and pilot prompts.
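Such an onboarding file might look like the following fragment. The field names and values are purely illustrative assumptions, not a real platform schema:

```yaml
# Hypothetical skill-onboarding config: workflow, tools, and pilot prompts
skill:
  name: flight-booking-assistant
  owner: travel-team
  workflow: book_or_suggest_flights
  tools:
    - flight_search_api
    - booking_service
  pilot_prompts:
    - "Find me a flight from SFO to JFK next Friday"
  rollout:
    sandbox: true
    traffic_percent: 5
```

The point is that everything operational — rendering, auditing, rollout — is owned by the platform, and the developer only declares intent.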
The platform should be able to automatically handle UI rendering, response delivery, safety auditing, and mailbox logistics, allowing a prototype to be live in hours.

Sandbox Deployment for Safe Ramping

Before exposing new agents or workflows to all users, developers can deploy them in isolated sandboxes. This allows live testing under real conditions with controlled traffic, capturing telemetry and performance metrics without affecting production users. Sandboxing supports staged rollouts, gradual scaling, and safe experimentation, ensuring new capabilities are validated before wider release.

12. Resource & Token Governance: Scaling Economically

Even a perfectly designed agentic platform can falter if compute and token usage spiral out of control. Resource governance is a critical pillar of operational resilience, ensuring that scale doesn’t come at the cost of sustainability.

Quotas & Rate Limiting

We implemented a “Token Economy,” assigning budgets to individual workflows, agents, or business units. In addition to keeping workflows accountable, this prevents a single runaway workflow from monopolizing resources or spiraling costs through erroneous and expensive reasoning loops.

Cost Attribution & Optimization

The token governance platform provides granular visibility into cost per task. By identifying the most token-hungry reasoning chains, we can target them for model distillation, prompt optimization, or workload reallocation — ensuring economic sustainability while scaling to millions of users.

Conclusion

Building a production-grade agentic platform requires a shift in mindset. We are no longer just creating static logic; we are cultivating an ecosystem of intelligent reasoning. By focusing on these six operational pillars — Evaluation, Resilience, Observability, Telemetry, Productivity, and Governance — we transform AI from a series of impressive demos into a reliable, evolving foundation for the enterprise.
The transition from “cool” to “mission-critical” happens in these details.
The landscape of Artificial Intelligence has undergone a seismic shift with the emergence of Foundation Models (FMs). These models, characterized by billions (and now trillions) of parameters, require unprecedented levels of computational power. Training a model like Llama 3 or Claude is no longer a task for a single machine; it requires a coordinated symphony of hundreds or thousands of GPUs working in unison for weeks or months. However, managing these massive clusters is fraught with technical hurdles: hardware failures, network bottlenecks, and complex orchestration requirements. AWS SageMaker HyperPod was engineered specifically to solve these challenges, providing a purpose-built environment for large-scale distributed training. In this deep dive, we will explore the architecture, features, and practical implementation of HyperPod.

The Challenges of Large-Scale Distributed Training

Before diving into HyperPod, it is essential to understand why training Foundation Models is difficult. There are three primary bottlenecks:

- Hardware reliability: In a cluster of 2,048 GPUs, the probability of a single GPU or hardware component failing during a training run is nearly 100%. Without automated recovery, a single failure can crash the entire training job, wasting thousands of dollars in compute time.
- Network throughput: Distributed training requires constant synchronization of gradients and weights. Standard networking is insufficient; low-latency, high-bandwidth interconnects like Elastic Fabric Adapter (EFA) are required to prevent GPUs from idling while waiting for data.
- Infrastructure management: Setting up a cluster with Slurm or Kubernetes, configuring drivers, and ensuring consistent environments across nodes is an operational nightmare for data science teams.

SageMaker HyperPod addresses these issues by providing a persistent, resilient, and managed cluster environment.
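The "nearly 100%" reliability claim follows from simple probability, assuming independent node failures. A back-of-the-envelope sketch, where the per-node daily failure rate is a hypothetical number for illustration, not an AWS figure:

```python
def cluster_failure_probability(nodes, per_node_daily_rate, days):
    """P(at least one node fails during the run), assuming independent
    node failures with a constant daily failure rate."""
    p_node_survives_run = (1 - per_node_daily_rate) ** days
    return 1 - p_node_survives_run ** nodes

# 2,048 GPUs as 256 8-GPU nodes, a hypothetical 0.05% daily failure
# rate per node, over a 30-day training run
p = cluster_failure_probability(nodes=256, per_node_daily_rate=0.0005, days=30)
print(f"{p:.1%}")  # well above 95%: some failure is almost guaranteed
```

Even with an optimistically small per-node rate, the month-long run is almost certain to hit at least one failure, which is why automated recovery matters.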
System Architecture of SageMaker HyperPod

At its core, HyperPod creates a persistent cluster of Amazon EC2 instances (such as P5 or P4d instances) preconfigured with the necessary software stack for distributed training. Unlike standard SageMaker training jobs that spin up and down, HyperPod clusters are persistent, allowing for faster iterations and a more “bare-metal” feel while retaining managed benefits.

High-Level Architecture

In this architecture:

- Head node: Acts as the entry point, managing job scheduling via Slurm or Kubernetes.
- Worker nodes: The heavy lifters containing GPUs. They are interconnected via Elastic Fabric Adapter (EFA), enabling bypass of the OS kernel for ultra-low-latency communication.
- Storage layer: Typically Amazon FSx for Lustre, providing the high throughput necessary to feed data to thousands of GPU cores simultaneously.
- Health monitoring: A dedicated agent runs on each node, reporting status to the Cluster Manager.

Deep Dive into Key Features

1. Automated Node Recovery and Resilience

The standout feature of HyperPod is its ability to automatically detect and replace failing nodes. When a hardware fault is detected, HyperPod identifies the specific node, removes it from the cluster, provisions a new instance, and rejoins it to the Slurm cluster without human intervention.

2. High-Performance Interconnects (EFA)

For distributed training strategies like tensor parallelism, the interconnect speed is the limiting factor. SageMaker HyperPod leverages EFA, which provides up to 3,200 Gbps of aggregate network bandwidth on P5 instances. This allows the cluster to function as a single massive supercomputer.

3. Support for Distributed Training Libraries

HyperPod integrates seamlessly with the SageMaker Distributed (SMD) library, which optimizes collective communication primitives (AllReduce, AllGather) for AWS infrastructure.
It also supports standard frameworks like PyTorch Fully Sharded Data Parallel (FSDP) and DeepSpeed.

Comparing Distributed Training Approaches

| Feature | Standard SageMaker Training | SageMaker HyperPod | Self-Managed EC2 (DIY) |
| --- | --- | --- | --- |
| Persistence | Ephemeral (job-based) | Persistent cluster | Persistent instance |
| Fault tolerance | Manual restart | Automated node recovery | Manual intervention |
| Orchestration | SageMaker API | Slurm / Kubernetes | Manual / scripts |
| Scaling limit | High | Ultra-high (thousands of GPUs) | High (but complex) |
| Best for | Prototyping / single-node | Foundation models / LLMs | Custom OS/kernel needs |

To use HyperPod, you first define a cluster configuration, create the cluster, and then submit jobs via Slurm. Below is a simplified look at how you might define a cluster using the AWS SDK for Python (Boto3).

Step 1: Cluster Configuration

What this code does: It initializes a request to create a persistent HyperPod cluster. It defines two instance groups: a head node for management and 32 p5.48xlarge nodes (H100 GPUs) for training. The LifeCycleConfig points to a script that installs specific libraries or mount points during provisioning.

Step 2: Submitting a Slurm Job

Once the cluster is InService, you SSH into the head node and submit your training job using a Slurm script (submit.sh).

```shell
#!/bin/bash
#SBATCH --job-name=llama3_train
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# Activate your environment
source /opt/conda/bin/activate pytorch_env

# Run the distributed training script
srun python train_llm.py --model_config configs/llama3_70b.json --batch_size 4
```

What this code does: This is a standard Slurm script. It requests 32 nodes and 8 GPUs per node. The srun command handles the distribution of the train_llm.py script across all nodes in the HyperPod cluster.

Advanced Parallelism Strategies on HyperPod

When training models with trillions of parameters, the model weights alone might exceed the memory of a single GPU (even an H100 with 80 GB of VRAM).
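Circling back to Step 1: a cluster definition along the lines described there might look like the following sketch. The field names follow the SageMaker `create_cluster` API, but the names, role ARN, bucket, and lifecycle script path are all placeholders:

```python
# Hedged sketch of a HyperPod cluster definition: one head node plus
# 32 p5.48xlarge (H100) workers, each group with a lifecycle script.
cluster_request = {
    "ClusterName": "fm-training-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "head-node",
            "InstanceType": "ml.m5.4xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        },
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",  # H100 GPUs
            "InstanceCount": 32,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        },
    ],
}

# With boto3 this would be submitted as:
#   boto3.client("sagemaker").create_cluster(**cluster_request)
print(cluster_request["InstanceGroups"][1]["InstanceCount"])  # 32
```

Treat this as a starting shape to verify against the current CreateCluster API reference rather than copy verbatim.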
HyperPod facilitates several parallelism strategies:

Data Parallelism (DP)

Each GPU has a full copy of the model but processes different batches of data. Gradients are averaged at the end of each step. This is the easiest to implement but is memory-intensive.

Tensor Parallelism (TP)

A single layer of the model is split across multiple GPUs. For example, a large matrix multiplication is divided such that each GPU calculates a portion of the result. This requires the ultra-low latency of EFA.

Pipeline Parallelism (PP)

The model is split vertically by layers. Group 1 of GPUs handles layers 1–10, Group 2 handles layers 11–20, and so on. This reduces the memory footprint per GPU but introduces potential “bubbles,” or idle time.

Fully Sharded Data Parallel (FSDP)

FSDP shards model parameters, gradients, and optimizer states across all GPUs. It collects the necessary shards just in time for the forward and backward passes. This is currently the gold standard for scaling LLMs on HyperPod.

Optimized Data Loading with Amazon FSx for Lustre

Training scripts often become I/O-bound, meaning the GPUs are waiting for data to be read from storage. HyperPod clusters typically use Amazon FSx for Lustre as a high-performance scratch space.

- S3 integration: FSx for Lustre transparently links to an S3 bucket.
- Lazy loading: Data is pulled from S3 to the Lustre file system as the training script requests it.
- Local performance: Once the data is on the Lustre volume, it provides sub-millisecond latencies and hundreds of GB/s of throughput to the worker nodes.

Best Practices for SageMaker HyperPod

- Implement robust checkpointing: Since HyperPod automatically recovers nodes, your training script must be able to resume from the latest checkpoint. Use libraries like PyTorch Lightning or the SageMaker training toolkit to handle this.
- Use health check scripts: You can provide custom health check scripts to HyperPod.
If your application detects a specific software hang that the system-level monitor misses, you can trigger a node replacement programmatically.
- Optimize batch size: With the high-speed interconnects of HyperPod, you can often use larger global batch sizes across more nodes without a significant penalty in synchronization time.
- Monitor with CloudWatch: HyperPod integrates with Amazon CloudWatch, allowing you to track GPU utilization, memory usage, and EFA network traffic in real time.

Conclusion

AWS SageMaker HyperPod represents a significant milestone in the democratization of large-scale AI. By abstracting away the complexities of cluster management and providing built-in resilience, it allows research teams to focus on model architecture and data quality rather than infrastructure debugging. As foundation models continue to grow in complexity, the ability to maintain a stable, high-performance training environment becomes not just an advantage, but a necessity. Whether you are pretraining a new LLM from scratch or fine-tuning a massive model on a proprietary dataset, HyperPod provides the “supercomputer-as-a-service” experience required for the generative AI era.

Further Reading & Resources

- AWS SageMaker HyperPod Official Documentation — The primary resource for technical specifications, API references, and getting started guides.
- Optimizing Distributed Training on AWS — A collection of blog posts detailing best practices for using EFA and SMD libraries.
- PyTorch Fully Sharded Data Parallel (FSDP) Guide — Technical documentation on the sharding strategy commonly used within HyperPod clusters.
- DeepSpeed Optimization Library — An open-source library compatible with HyperPod that offers advanced pipeline and system optimizations for LLM training.
- Scaling Laws for Neural Language Models — The foundational research paper exploring why large-scale distributed training is necessary for model performance.
It’s hard to imagine a world without LLMs nowadays. I rarely reach for Google when ChatGPT can give me a far more curated answer with almost all the context it could need. However, these daily use cases often lean in creative directions. In B2B systems, the same creativity that is so useful day to day is not acceptable. This became clear when I first pitched the idea of using LLM-powered browser agents to fill out job application forms on behalf of job boards and agencies. A “small” mistake or hallucination, like choosing the wrong answer in a screening question, skipping a mandatory field, or hallucinating a value, means:

- The candidate never reaches the employer’s ATS
- Attribution breaks
- The impression of your system instantly becomes “creates spam”

Our product now pushes tens of thousands of applications through enterprise workflows, and “usually works” is not good enough. We need deterministic outcomes: for a given input, the system should produce the same, valid, structurally correct output nearly all the time. This article covers some of the tools and patterns we’ve used to make LLM-driven systems behave much more like deterministic software. The examples below are currently deployed to power an otherwise non-deterministic technology for enterprise customers, and they let us harness the flexibility of LLMs while building software our users can trust.

Determinism Can Be Viewed Across a Spectrum

LLMs are never deterministic in the strict “same output every time” sense, even with temperature=0. In practice, you can treat determinism as a spectrum:

- Hard constraints: “The output must be valid JSON matching this schema.”
- Soft constraints: “The extracted field must match one of these 12 allowed options.”
- Behavior consistency: “Given this distribution of inputs, how often does the system produce the correct, structurally valid result?”

You will never get 100% reliability everywhere.
But you can get close enough, know where the risk lives, and design the surrounding system so the rare failures are caught and handled. The techniques below describe how to build guardrails that let you trust the output to be what you need it to be.

Structured Output: The Most Obvious and Biggest Win

The simplest and most powerful tool is to force structure. OpenAI has supported it for a long time, and Anthropic has now followed suit. Instead of asking the model to please follow some JSON output format, these APIs let you specify a JSON schema with the request, forcing the LLM to generate tokens matching your schema. This is non-negotiable: to build a deterministic system, you need to know what the LLM will give you. For example, we use this to:

- Turn arbitrary job forms into a consistent representation of fields
- Let the LLM pick a field’s question type from a set of enum options

With structured output, your downstream logic can be regular code that expects clear types.

Testing With Iterations: Measuring Your Determinism

Once you have structured output and validation, you can start testing and measuring determinism instead of guessing. The core idea: run the same task many times and see how often it behaves correctly. If you’re lucky, you can test for concrete values and write standard test assertions. If you’re generating more abstract outputs, you may need to use another LLM as a ‘judge’ to evaluate your output. For example, we:

- Created a fixture of a job posting, asking the LLM to return the fields it can find in the form
- Ran this test over 50 iterations
- Asserted that for each iteration, we get the correct number of fields with matching labels

If 49 out of 50 runs are valid and correct, you know your success rate is 98%. That doesn’t mean 98% in production (your input data will differ), but it gives you a baseline and lets you compare prompts, models, or schemas objectively.
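A minimal iteration harness along these lines might look as follows. The `extract_fields` function and the expected labels are hypothetical stand-ins for your own LLM pipeline and fixture; here the stand-in is deterministic so the harness itself can be demonstrated end to end:

```python
import json

EXPECTED_LABELS = {"First name", "Last name", "Email"}  # hypothetical fixture expectations

def extract_fields(job_posting_html: str) -> str:
    # Stand-in for the real LLM call with structured output.
    return json.dumps({"fields": [{"label": l} for l in sorted(EXPECTED_LABELS)]})

def run_harness(fixture: str, iterations: int = 50) -> float:
    """Run the same task many times and measure how often it behaves correctly."""
    passed = 0
    for _ in range(iterations):
        try:
            out = json.loads(extract_fields(fixture))      # hard constraint: valid JSON
            labels = {f["label"] for f in out["fields"]}
            if labels == EXPECTED_LABELS:                  # correctness: matching labels
                passed += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            pass  # structural failures count against the success rate
    return passed / iterations

success_rate = run_harness("<html>...fixture job form...</html>")
print(f"success rate: {success_rate:.0%}")
```

Because the harness reports a single rate, re-running it after a prompt or schema change gives you an objective before/after comparison.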
This is also crucial for building a reliable system that does not regress. In practice, this is how we iterate:

- Add a fixture that caused problems.
- Change the prompt or schema to fix the problem.
- Re-run the test harness for 50–100 iterations on the new fixture set.
- Ship once you can validate a clear improvement.

This is also where people underestimate the importance of writing good tests. You want:

- A simple CLI/test framework that lets you run your tests many times.
- Outputs that are easy to compare (“95% valid JSON, 90% fully correct vs. 98% / 95% after the change”).

For browser automation, I can highly recommend testing via Playwright’s own UI. For browser automation tasks, it’s often very feasible to let the LLM decide on one operation and make a Playwright assertion on that output.

Taking Multiple Samples: Compounding Probabilities

Say your workflow, as measured above, has a ~2% failure rate on structure or correctness. If you run the pipeline twice on the same input and only accept the result when both outputs agree, the probability that both runs fail in the same way becomes dramatically smaller. Assuming independent failures at a 2% rate, the failure rate drops to 0.04% (2% * 2%). You can extend the idea further, keeping the cost implications in mind. Latency should not change much, as the samples are easily run in parallel. In practice, you may not want to take multiple samples for every operation, but this is a priceless method for reducing uncertainty in your agent’s critical operations.

Resolving Inconsistencies via an LLM

Once you’re taking multiple samples to reduce the failure rate, you’ll notice that your agent sometimes fails too easily. For example, your output consistency check may be too strict, or it may simply be hard to determine whether two outputs are the same. Throwing them away is expensive. Instead, you can ask another LLM to act as a judge at runtime.
A prompt like the following can be used, usually with the same system prompt as the original generation so that the judge has all the necessary context: “You are a strict judge. You receive two candidate generations for the same input and the original input text. Your job is to select the index of the correct generation, or return -1 if none are suited.” This gives you two wins:

- You can salvage cases where one sample is clearly wrong and the other is fine.
- With -1, you have a principled way to admit uncertainty and fall back to a slower path (manual review, a different model, etc.).

Verification Loops: Letting LLMs Check Their Own Work

The final layer is to think of your LLM pipeline as a loop, not a single shot. This is what is often described as a true “agent” architecture: letting an LLM-based system decide its own path and decide when it is done. In practice, you’d be surprised how well a second LLM, even the same model, can judge whether the previous generation was correct. You can build a system where your task always runs in a loop of asking the LLM:

- What should I do next to reach my goal?
- If I can’t take the next step, is my goal complete?
- If I can’t take the next step and the goal is not complete, is my goal impossible?

Doing so lets the LLM decide for itself once enough uncertainty is resolved. And you can, of course, combine this with the other techniques, running two of these loops in parallel, and so on.

Putting It All Together

LLMs are inherently non-deterministic, but there are a surprising number of techniques you can use to build a system whose output you can trust to be what you expect. By combining the ideas above, you can push LLM-driven workflows close to the reliability bar of traditional software: high enough that enterprise customers are willing to put real money and processes on top of it.
For our agentic job applications, that means letting job boards and agencies trust an LLM-powered agent to submit large volumes of applications into ATS systems they don’t have API access to. It’s inherently high-risk, as you can’t afford to hallucinate candidate input, but with sufficient testing, you can build a system you can trust.
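The multiple-sampling and judge-fallback patterns described above can be sketched together. The `generate` and `judge` callables here are hypothetical stand-ins for your LLM calls (the demo uses deterministic functions); in production they would wrap real API requests:

```python
from concurrent.futures import ThreadPoolExecutor

def sample_with_judge(generate, judge, task, n=2):
    """Run n independent generations in parallel; accept on agreement.

    On disagreement, ask a judge to pick an index; the judge may return -1
    to signal that neither candidate is suitable, in which case we return
    None so the caller can fall back to a slower path (manual review, etc.).
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        samples = list(pool.map(generate, [task] * n))
    if all(s == samples[0] for s in samples):
        return samples[0]                      # consensus: accept
    idx = judge(task, samples)                 # judge picks an index or -1
    return samples[idx] if idx >= 0 else None  # None => escalate / fall back

# Demo with deterministic stand-ins: both samples agree, so no judge call is needed.
result = sample_with_judge(lambda t: t.upper(), lambda t, s: -1, "apply")
```

Running the samples in parallel keeps latency roughly flat while multiplying out the independent failure probabilities, as described in the sampling section above.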
The era of passive AI chatbots is ending. We are now entering the age of agentic AI: systems that actively reason, plan, and execute tasks. For organizations, this represents a potential leap in productivity, but it also introduces new engineering challenges. Moving from a simple prompt to a reliable agent ecosystem requires a new, robust architecture. In this article, we’ll explore the anatomy of AI agents, how the Model Context Protocol (MCP) has finally solved the integration bottleneck, and how you can architect safe, scalable systems where humans and agents collaborate effectively.

What Is an AI Agent?

Some AI agents are simply a system prompt and a collection of tools sent to a model that does all the thinking. More powerful AI agents, however, use the LLM to recommend and propose actions, then run their own code to perform functions such as:

- Control execution: state machines, task graphs, retries, timeouts
- Enforce policy: auth, scopes, RBAC, allow/deny rules
- Validate actions: schema checks, safety filters, sandboxing
- Manage memory/state: databases, vector stores, session state
- Coordinate agents: message passing, role separation, voting
- Handle failure: rollbacks, circuit breakers, human-in-the-loop

While a standard LLM (large language model) is passive and waits for your input to generate a response, an AI agent is active. It uses reasoning to break down a goal into steps, decides which tools to use, and executes actions to achieve an outcome. Agentic AI systems initiate a session by sending an LLM a system prompt, which may include the definition of multiple agents and their tools. Some of these tools may allow agents to invoke other agents, manage the context itself, and even select the model for the next step.

- LLM chatbot flow: User Input -> Model -> Output
- AI agent flow: User Goal -> LLM -> Reasoning/Planning -> Tool Use -> Action -> Verification -> Output

In summary, an LLM chatbot gives you advice, but you have to do the work.
An agent, by contrast, is given a goal and access to software, and it comes back when the job is done. The mastery of building agentic AI systems lies in finding the right mix of agents, tools, and prompts that allow the LLM to accomplish your goals while still providing adequate guardrails and verification. To this end, managing the tools and other resources available to the AI agents is a major focus. This is where the Model Context Protocol (MCP) comes in.

The Model Context Protocol

MCP is an open standard introduced by Anthropic in November 2024. It standardizes how AI systems connect to external data and services. The idea behind MCP is that all LLM API providers allow the LLM to invoke tools, so developers benefit from a structured way to define those tools and make them available to the LLM in a uniform, consistent way. Prior to MCP, integrating third-party tools into agentic AI systems added a lot of friction. By providing a universal interface for reading files, executing functions, and handling contextual prompts, MCP enables AI models to access the data they need securely and consistently, regardless of where that information lives. Since its release, the protocol has been adopted by major AI providers, including OpenAI and Google, cementing its role as the industry standard for AI system integration. MCP operates through a straightforward client-server architecture with four key components:

- The host application, such as Claude Desktop, modern IDEs, or your AI system.
- The MCP client, which establishes one-to-one connections from the host to a server, often a built-in capability of AI frameworks.
- The MCP server, which exposes tools, resources, and prompts.
- The transport layer, which manages communication between clients and servers.

MCP also opened the door to an ecosystem where third-party platforms expose their capabilities to AI agents by publishing their own official MCP servers.
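In the same spirit as MCP's uniform tool definitions, a host-side tool registry that validates arguments before execution can be sketched in plain Python. Note this is an illustrative sketch, not the MCP SDK; the tool name, schema format, and `read_file` stub are all hypothetical:

```python
class ToolRegistry:
    """Mediates every tool call: lookup, argument validation, then execution."""

    def __init__(self):
        self._tools = {}

    def register(self, name, fn, schema):
        # schema maps argument names to expected types, a simplified
        # stand-in for the JSON Schemas used by real tool definitions.
        self._tools[name] = (fn, schema)

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn, schema = self._tools[name]
        for arg, typ in schema.items():  # validate before executing anything
            if not isinstance(kwargs.get(arg), typ):
                raise TypeError(f"{name}: argument {arg!r} must be {typ.__name__}")
        return fn(**kwargs)

registry = ToolRegistry()
registry.register("read_file", lambda path: f"<contents of {path}>", {"path": str})
out = registry.call("read_file", path="notes.txt")
```

The point of the mediation layer is that the model never calls a function directly; every invocation passes through a single, auditable choke point where schemas, scopes, and policies can be enforced.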
Large enterprises such as Microsoft, AWS, Atlassian, and Sumo Logic have all published MCP servers. MCP solves an important problem, but it is just one of many challenges for agents. Let’s look next at how to design safe agentic AI systems.

Designing Safe Agentic AI Systems

Agentic AI can go catastrophically wrong. The risks include:

- Prompt injection that hijacks workflows to exfiltrate data or execute ransomware.
- Privilege escalation via tool chaining that drains accounts or deletes production systems.
- Infinite loops that burn millions in API costs.
- Hallucinated actions that trigger irreversible trades or compliance violations.
- Token torching, where malicious actors hijack token spend through MCP.

Agents are often entrusted with access to APIs, browsers, and infrastructure systems. Without safeguards, this greatly amplifies your risks. Safe agentic AI requires a “defense-in-depth” approach built on multiple overlapping layers:

- Input validation, output auditing, and human-in-the-loop escalation form the verification backbone.
- Decisions are never fully autonomous when the blast radius or financial impact is high.
- Sandboxing and explicit permission boundaries prevent unauthorized access.
- Each agent should receive a distinct identity with least-privilege credentials and scoped tokens, rather than inheriting user permissions.
- Fault tolerance through retry logic, fallback models, and anomaly detection ensures that systems degrade gracefully under failure.
- Deep observability, implemented via standardized telemetry, structured logging, metrics collection, and real-time monitoring dashboards, enables rapid detection and response.
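Two of these layers, least-privilege allow-lists per agent and human-in-the-loop escalation for high-impact actions, can be sketched as follows. All names here (`ScopedAgent`, the action strings, the approver callback) are hypothetical illustrations, not a real framework API:

```python
APPROVAL_REQUIRED = {"transfer_funds", "delete_resource"}  # high blast radius

class ScopedAgent:
    """An agent identity with its own least-privilege action allow-list."""

    def __init__(self, name, allowed_actions):
        self.name = name
        self.allowed = set(allowed_actions)

    def act(self, action, approver=None):
        if action not in self.allowed:
            return ("denied", action)          # explicit permission boundary
        if action in APPROVAL_REQUIRED:
            # Never fully autonomous when the impact is high: require a human.
            if approver is None or not approver(self.name, action):
                return ("escalated", action)
        return ("executed", action)

agent = ScopedAgent("billing-bot", {"read_invoice", "transfer_funds"})
r1 = agent.act("read_invoice")     # low risk and in scope
r2 = agent.act("delete_resource")  # outside this agent's scope
r3 = agent.act("transfer_funds")   # in scope, but no approver supplied
```

Giving each agent its own scoped identity, rather than inheriting the user's full permissions, bounds the blast radius of any single compromised or confused agent.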
Engineering effective multi-agent systems requires deliberate architecture design that incorporates one or more coordination patterns:

- Centralized orchestration, where a supervisor agent coordinates specialized workers and maintains global state.
- Decentralized peer-to-peer communication, enabling flexible agent-to-agent interaction.
- Hierarchical delegation, which organizes agents into levels of abstraction.

Development environments like Sumo Logic's Dojo AI (an agentic AI platform for security operations centers) can help significantly, providing essential infrastructure for safely iterating on agentic systems before production deployment. Dojo AI is carefully curated according to its design principles and safeguards. Customers can use Dojo AI as is, or they can build their own agentic AI environment (similar to Dojo AI) for their own AI-based core competencies. The Sumo Logic MCP server lets you run data queries and make Dojo AI agent calls from your own AI agents when needed. Next, let's look at some of the different ways people interact with agentic AI systems.

How People Collaborate With AI Agents

Traditional systems follow a well-defined workflow and pre-programmed algorithms. User inputs and outputs are fully structured. Even in dynamic systems, user input can deterministically control the flow. Agentic AI systems, however, are different: the LLM controls the flow (within its guardrails). Users provide the initial intent and motivation, and later act as approvers and a gating function. The free-text conversation, in particular, is novel. So how do we best collaborate with these agents? One of the most common ways to interact with AI agents is through chatbots, where you can exchange text, images, and files with LLMs and their agents. Voice conversations are also becoming more popular. Of course, generic chatbots like ChatGPT, Gemini, and Claude Desktop are not aware of your agents out of the box; however, your agents can be introduced as MCP tools.
Another interesting option is to build a Slack application that allows agents to join channels, interact with users, monitor channels automatically, and respond to events. This is a rich environment because it lets humans and agents collaborate smoothly. The Slack user experience already supports group channels and threads, so agents can add details such as their chain of thought or citations without cluttering the screen. Multiple human users can engage with each other and with AI agents in the same channel. If you need an even more specialized user experience, you can build a custom web, desktop, or mobile application for your agents. You could create a chatbot like Mobot, a Slack application integration, or a custom suite of agents like Dojo AI.

The Future of Agents

Perhaps the most important thing to understand about AI agents is that they are coming faster than you think. In the past, major technological revolutions like the personal computer, the internet, and mobile phones took decades to become ubiquitous, and the pace of innovation was manageable. AI is different. Many experts predict that in just the next few years, AI agents will be able to perform any knowledge work better than the best humans. They will unlock scientific discoveries and provide unprecedented productivity gains. Manual labor is not far behind, with humanoid robots making impressive strides powered by AI agents. LLMs can already perform many tasks as well as humans, though they lack the ability to plan and operate over long time horizons, deal with complexity, and maintain coherence. But with a carefully constructed web of agents and a curated set of tools that collaborate over multiple iterations to accomplish long-horizon tasks, these constraints are being removed. There is much innovation in this domain beyond just MCP, such as agent orchestration, toolkits, and verification layers.
Industry Standardization

We’re now starting to see the standardization of techniques, tools, and formats for AI agents. For example, the Agentic AI Foundation (AAIF) is a new initiative under the Linux Foundation to ensure agentic AI evolves in an open, collaborative manner. Its members include Anthropic, OpenAI, Amazon, Google, Microsoft, and Block, and it hosts several prominent agent technologies, including MCP, goose, and AGENTS.md. There are other prominent open efforts as well, including Google's Agent2Agent (A2A) protocol and Agent Skills (also originating from Anthropic).

Dynamic User Experiences

The future of the user experience is all about generative UI. The LLM and agents will generate an appropriate UI on the fly depending on the query, user, conversation history, and more. For example, if you ask about the stock market, rather than providing a generic overview of today’s business news, the AI system may decide to show a historical timeline, a pie chart of your current positions, and links to relevant educational material. Everything will be tailored per user.

The Shift to AI Agents

The agentic shift is here. We’re moving from passive text generation to active, autonomous work. As we’ve seen, this shift requires more than just new models; it calls for careful architecture. To succeed, organizations should focus on:

- Leveraging the Model Context Protocol (MCP).
- Moving beyond simple prompts to a "defense-in-depth" strategy.
- Designing interfaces, such as Slack apps and custom UIs, where humans provide the intent and agents handle the execution.

AI agents may soon outperform top human knowledge workers, unlock major scientific and productivity gains, and eventually expand into physical work through robotics. Understanding their basics is the first step to harnessing their power for your organization. Have a really great day!
Firstly, LLMs are already widely used for working with unstructured natural-language data. They also excel at extracting information from semi-structured data, such as JSON files and other lengthy configuration files. This even lets us use them to interact with relational data, for example. Cloud-based LLMs are effective and powerful, but they have limits. That's where locally hosted LLMs come into play.

Local LLMs: Pros and Cons

I first realized the need for local LLMs while developing software for a critical industry (healthcare), where Personal Health Information is strictly regulated and, accordingly, the use of cloud-based LLMs is very limited. So, privacy is the first benefit of using local LLMs. The second reason cloud-based LLMs may not fit is the level of customization: when the system needs custom fine-tuning or other modifications, it may be easier to implement them on a local LLM. The third reason may be less rational, but it still makes sense: local LLMs are fun. You can use them much as you would cloud-based LLMs, but without depending on the Internet. You can download the model of interest to your laptop and handle much of your work routine as you would with a regular ChatGPT or Gemini. Of course, each local LLM will be more limited in terms of knowledge cutoff compared to cloud-based LLMs, especially when working in "thinking" mode. But if your goal is not deep research or analysis, a local model may be a great fit. The downsides of local LLMs are earlier knowledge cutoffs, lower intelligence, and lower speed. These are not always bottlenecks: for perhaps 70% of tasks, such as information extraction, summarization, and transformation, local models perform similarly to cloud-based systems. Scalability, however, may be a challenge. One more limitation, not often mentioned but still critical, especially for production usage, is licensing.
Architecture of Local LLM Runtimes

There are many great LLM runtimes that help you deploy and run an LLM locally, among them LM Studio, Ollama, and Jan AI. Their purpose is to provide an environment and a UI/API interface around the LLMs themselves, making them easier to work with and manage. The typical architecture of these runtimes is as follows. For example, Ollama uses llama.cpp as its engine, whose function is to load a model into memory and operate on it. The web server runs by default on port 11434 and allows local applications and CLI/GUI tools to communicate with the model. The user interacts with the model via the shell or a GUI application; software applications also connect through the web server. After installing the LLM runtime, select the desired model(s) and download them to the local PC/laptop. The runtime then loads the model into memory, and it becomes available for prompting.

Licensing

This topic is especially important to consider for production or commercial usage of LLMs. The good news is that most LLM runtimes have permissive licenses for commercial use (but double-check the specific tool for the exact details). The second layer is the LLM model itself. For example, if you use Ollama with a Meta Llama model, you need to read two licenses carefully:

- From Ollama
- From Meta Llama

It is therefore essential to confirm that both licenses allow commercial use of the model before building commercial applications.

Installation

This article showcases Ollama's capabilities. It is a good fit for local experiments as well as for building applications, and once you understand how this runtime works, it will be much easier to apply similar patterns to other runtimes.

Step 1. Install the Ollama Application

Download the application for Windows, Linux, or Mac from the official download page.

Step 2.
Pull the Model and Run It

For example, let's install our first local LLM. Run this in the terminal:

Shell
ollama pull llama3.2:3b
ollama run llama3.2:3b

Ollama is managed from the terminal, so you may find the following commands useful for manipulating Ollama models:

Shell
ollama list          # list installed models
ollama pull llama3.2 # download a model
ollama run llama3.2  # run chat in terminal
ollama rm llama3.2   # remove a model
ollama show llama3.2 # show model info (template, params, etc.)
ollama ps            # show loaded models

# On Mac, if brew was used to install Ollama:
brew install ollama                  # install ollama
brew services start ollama           # start server
brew services stop ollama            # stop server
brew services restart ollama         # restart server
brew services list | grep -i ollama  # check if ollama is running

UI Interface for Interaction

In July 2025, Ollama also released a GUI application for a visual experience when prompting local LLMs. It simplifies interactions and allows loading files as well. You can download it from the official site. The application lets you prompt local LLMs much like ChatGPT and similar tools, including attaching PDFs and other text-based files. Some models also support multimodality, such as accepting images as input.

Building Applications on Top of Local LLMs

The prerequisites to run the code below are:

1. Install Ollama locally.
2. Pull and start the local model (in this example, llama3.2:3b).

Shell
ollama pull llama3.2:3b
ollama run llama3.2:3b

This is the application code itself:

Python
from ollama import chat

messages = [
    {
        'role': 'user',
        'content': 'Generate a 3-4 sentence description of a random product from Amazon?',
    },
]

response = chat('llama3.2:3b', messages=messages)
print(response['message']['content'])

The example answer was:

Plain Text
I've generated a fictional product description.
Here it is: "The Intergalactic Dreamweaver" is a unique, patented sleep mask designed to enhance and control your dreams while you sleep ...

Remote Application Example Using Ollama

If you want to separate the Ollama server from the application server, that is easy to do, since Ollama includes a built-in web server. I just modified the previous code to point at the Ollama server (which may run on a separate machine):

Python
from ollama import Client

client = Client(host="http://localhost:11434")

messages = [
    {
        "role": "user",
        "content": "Generate a 3-4 sentence description of a random Amazon product?",
    }
]

response = client.chat(model="llama3.2:3b", messages=messages)
print(response["message"]["content"])

The Scalability Side of Local LLMs

Let's look at Ollama's multitasking model. If the application uses async mechanisms to issue many prompts to the LLM, Ollama currently handles them as a FIFO queue. This means the application will not encounter an error, but latency may increase. For example, I ran the following code successfully on a MacBook M4:

Python
import asyncio
import time

from ollama import AsyncClient

QTY = 20
MODEL = "llama3.2:3b"
PROMPT = "Please generate a random description for a product on Amazon, 3-4 sentences."


async def ask(i):
    client = AsyncClient()
    messages = [
        {
            "role": "user",
            "content": PROMPT,
        }
    ]
    response = await client.chat(MODEL, messages=messages)
    return i, response['message']['content']


async def main():
    start = time.time()
    tasks = [asyncio.create_task(ask(i)) for i in range(QTY)]
    results = await asyncio.gather(*tasks)
    total_time = time.time() - start

    results.sort(key=lambda x: x[0])
    for idx, answer in results:
        print(f"\n=== Answer #{idx + 1} ===")
        print(answer)

    print(f"\n--- Total time: {total_time:.2f} seconds ---")


if __name__ == "__main__":
    asyncio.run(main())

I changed only the QTY parameter, which determines the number of parallel requests sent to the Ollama server.
The metrics were the following:

- QTY = 1: 2.4 sec (2.4 sec per request)
- QTY = 2: 5.2 sec (2.6 sec per request)
- QTY = 10: 25 sec (2.5 sec per request)
- QTY = 20: 49 sec (2.5 sec per request)

This experiment shows that Ollama does not parallelize requests out of the box; instead, it queues them automatically, so the client side will ultimately receive an answer.

Conclusion

To conclude, let us return to the use cases and limitations of local LLMs. First of all, local LLMs are powerful enough to take seriously. They are no longer toys: they are production-ready tools with rich framework support that can solve fairly complex tasks. They can also be fine-tuned; while we didn't touch on this topic in this article, fine-tuning remains one of the important capabilities local LLMs offer. The limitations of local LLMs include scalability and speed. Licensing should not be a problem for ethical use, but caution is important here, because some models do not allow commercial use. Overall, local LLMs may be the only option for some critical industries, where privacy matters most. For other industries, they can be a good pick, with some trade-offs.
Tuhin Chattopadhyay, CEO at Tuhin AI Advisory and Professor of Practice, JAGSoM
Frederic Jacquet, Technology Evangelist, AI[4]Human-Nexus
Suri (thammuio), Data & AI Services and Portfolio
Pratik Prakash, Principal Solution Architect, Capital One