DZone
AI/ML

Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability of a computer system to mimic human intelligence through math and logic, while ML builds on AI by developing methods that "learn" through experience rather than explicit instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.

Latest Premium Content
  • Trend Report: Generative AI
  • Refcard #401: Getting Started With Agentic AI
  • Refcard #394: AI Automation Essentials

DZone's Featured AI/ML Resources

Azure SLM Showdown: Evaluating Phi-3, Llama 3, and Snowflake Arctic for Production
By Jubin Abhishek Soni
In the rapidly evolving landscape of generative AI, the industry is witnessing a significant shift. While the "bigger is better" mantra once dominated, the tide is turning. As organizations move from experimental pilots to production-grade applications, the focus has shifted toward small language models (SLMs). These models offer lower latency, reduced compute costs, and the ability to run on edge devices, while maintaining performance that rivals massive models like GPT-4 on specific tasks. Microsoft Azure has positioned itself as a premier destination for these models, offering them through the Model-as-a-Service (MaaS) framework and the Azure AI Model Catalog.

In this article, we provide a technical deep dive into three of the most prominent SLMs available on Azure: Microsoft's Phi-3, Meta's Llama 3 (8B), and Snowflake Arctic. We analyze their architectures, benchmark performance, deployment strategies, and cost efficiency to help you decide which model best fits your workload.

1. Microsoft Phi-3: The Master of Efficiency

Microsoft's Phi-3 family represents a breakthrough in how model quality is achieved. Rather than relying on sheer volumes of web-scraped data, Phi-3 was trained on a carefully curated dataset: a combination of heavily filtered web data and synthetic data designed to resemble the clarity and educational value of textbooks.

Architecture and Variations

Phi-3 is available in several sizes, but Phi-3 Mini (3.8B parameters) is the most popular for SLM use cases. Despite its small size, it frequently outperforms models twice its size (such as Llama 2 7B or Mistral 7B) on reasoning and logic tasks. It uses a dense Transformer architecture and is optimized for ONNX Runtime, making it ideal for cross-platform deployment.
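As a rough illustration of why these parameter counts translate into such different hardware requirements, here is a back-of-the-envelope, weights-only memory estimate. This is a sketch, not a deployment guide: real inference also needs memory for activations and the KV cache.

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weights-only memory estimate; activations and KV cache add more."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Compare full-precision (16-bit) vs. 4-bit quantized footprints.
for name, params in [("Phi-3 Mini", 3.8), ("Llama 3 8B", 8.0)]:
    for bits in (16, 4):
        print(f"{name} @ {bits}-bit: ~{model_memory_gb(params, bits):.1f} GB")
```

At 4-bit, Phi-3 Mini's weights fit in roughly 1.9 GB, which is why it can plausibly run on laptop-class or even mobile hardware, while Llama 3 8B at 16-bit wants a dedicated GPU.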
Pros and Cons

Pros:
  • Unmatched efficiency: Extremely low resource footprint; can run on basic CPU-only instances or mobile devices.
  • Reasoning capability: Exceptionally strong at logical reasoning and mathematics relative to its size.
  • Permissive licensing: The MIT license allows broad commercial use.

Cons:
  • Knowledge gaps: Because it prioritizes reasoning over factual memorization, it may struggle with niche factual queries unless paired with retrieval-augmented generation (RAG).
  • Context window limitations: While a 128k-context version exists, the baseline 4k version is limited for long-document processing.

2. Meta Llama 3 (8B): The Generalist Powerhouse

Llama 3 8B is the evolution of Meta's highly successful open-weights lineage. Trained on a massive 15 trillion tokens, Llama 3 emphasizes versatility and conversational fluency. It is the "Swiss Army knife" of SLMs, designed to handle everything from creative writing to complex coding.

Architecture and Improvements

Llama 3 uses a standard decoder-only Transformer architecture but introduces a more efficient tokenizer with a 128k vocabulary, significantly improving token compression and inference speed. It also features Grouped Query Attention (GQA), which improves performance during long-context inference.

Pros and Cons

Pros:
  • Generalization: Excellent at following complex instructions and maintaining a consistent persona.
  • Ecosystem support: As an industry standard for open-weights models, it has best-in-class support for quantization and fine-tuning tools (Unsloth, vLLM, etc.).
  • Fine-tuning potential: Highly responsive to supervised fine-tuning (SFT) and RLHF.

Cons:
  • Compute requirements: Needs more VRAM than Phi-3 and typically requires an A10 or T4 GPU for comfortable inference.
  • Licensing constraints: The Llama 3 Community License includes restrictions for very large-scale commercial deployments (over 700M monthly active users).

3. Snowflake Arctic: The Enterprise Specialist

Snowflake Arctic is a unique entrant in the SLM space.
While its total parameter count is large (480B), it uses a Mixture-of-Experts (MoE) architecture. In this setup, only a small subset of parameters (about 17B) is active during any single inference request. This makes it "small" in terms of compute cost per token, even though its memory footprint is larger.

Architecture and Enterprise Focus

Arctic was built specifically for enterprise tasks such as SQL generation, coding, and complex instruction following. It uses a dense-to-MoE hybrid design that prioritizes high-quality reasoning over broad creative knowledge.

Pros and Cons

Pros:
  • Data-to-SQL mastery: Outperforms nearly all peers at generating SQL and interacting with structured data.
  • MoE efficiency: Delivers the reasoning depth of a massive model with the token-generation speed of a much smaller one.
  • Apache 2.0 license: Fully open for commercial use without restrictive clauses.

Cons:
  • Memory footprint: Because all 480B parameters must be loaded into memory (unless using quantized or offloaded variants), it requires significantly more GPU memory than Phi-3 or Llama 3 8B.
  • Deployment complexity: Best suited to Azure's serverless MaaS endpoints rather than small self-hosted VMs.

Advanced Data Flow: RAG with SLMs

Retrieval-augmented generation (RAG) is one of the most common production patterns. SLMs are particularly well suited to RAG because they can process retrieved context with much lower latency than GPT-4. However, smaller context windows — such as Arctic's 4k or Llama 3's 8k — require more sophisticated retrieval strategies compared to Phi-3's 128k variant.

Technical Comparison Tables

To better understand how these models stack up, we have categorized their capabilities into three comparison tables covering technical specifications, benchmarks, and Azure-specific deployment factors.
Table 1: Technical Specifications

| Feature          | Phi-3 Mini        | Llama 3 8B        | Snowflake Arctic         |
| ---------------- | ----------------- | ----------------- | ------------------------ |
| Parameters       | 3.8 billion       | 8 billion         | 480B (17B active)        |
| Architecture     | Dense Transformer | Dense Transformer | MoE (Mixture of Experts) |
| Context window   | 4k / 128k         | 8k                | 4k                       |
| Tokenizer        | 32k vocab         | 128k vocab        | 32k vocab                |
| Licensing        | MIT               | Llama 3 Community | Apache 2.0               |
| Primary strength | Reasoning & logic | General purpose   | SQL & coding             |

Table 2: Benchmark Performance (Reported Figures)

| Benchmark        | Phi-3 Mini | Llama 3 8B | Snowflake Arctic |
| ---------------- | ---------- | ---------- | ---------------- |
| MMLU (general)   | 68.8%      | 66.6%      | 62.9%            |
| GSM8K (math)     | 82.5%      | 79.6%      | 66.1%            |
| HumanEval (code) | 58.5%      | 62.2%      | 64.3%            |
| BigBench Hard    | 69.7%      | 61.1%      | 51.5%            |

Table 3: Azure Deployment and Cost (Estimated)

| Factor                   | Phi-3 Mini           | Llama 3 8B           | Snowflake Arctic    |
| ------------------------ | -------------------- | -------------------- | ------------------- |
| Azure MaaS availability  | Yes (serverless)     | Yes (serverless)     | Yes (serverless)    |
| Min. recommended VM      | Standard_NC6s_v3     | Standard_NC24s_v3    | Standard_ND96asr_v4 |
| Cost per 1M input tokens | ~$0.10               | ~$0.15               | ~$0.24              |
| Cost per 1M output tokens| ~$0.10               | ~$0.60               | ~$0.24              |
| Fine-tuning support      | Azure AI Studio LoRA | Azure AI Studio LoRA | Azure ML / custom   |

Note: Costs are based on average Azure Model-as-a-Service pricing and are subject to regional variation.

Analysis: Which Model Should You Choose?

Use Case 1: Low-Latency Edge Applications

If you are building an application that needs to run on a local device or requires the absolute lowest latency for simple tasks (like text classification or basic summarization), Phi-3 Mini is the undisputed winner. Its small footprint allows it to be quantized to 4-bit and run on a standard laptop CPU while still producing coherent, logical responses.

Use Case 2: Sophisticated Chatbots and Creative Tools

For applications requiring "personality," conversational nuance, and broad general knowledge, Llama 3 8B is superior. It has a much lower hallucination rate in casual conversation than Phi-3 and handles creative tasks (like drafting emails or marketing copy) with much better flow and vocabulary diversity.
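Cost often decides between these options in practice. The sketch below turns Table 3's estimated prices into monthly serverless costs for a sample workload, and also computes the break-even point at which a dedicated VM undercuts pay-per-token pricing. All figures are the article's estimates or hypothetical examples, not quoted Azure prices.

```python
# Estimated per-1M-token prices from Table 3 (subject to regional variation).
PRICES = {
    "Phi-3 Mini":       {"input": 0.10, "output": 0.10},
    "Llama 3 8B":       {"input": 0.15, "output": 0.60},
    "Snowflake Arctic": {"input": 0.24, "output": 0.24},
}

def monthly_serverless_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Pay-per-token (MaaS) cost in USD for one month of traffic."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

def breakeven_tokens(vm_hourly_usd: float, blended_price_per_1m: float) -> float:
    """Monthly token volume above which a dedicated VM (flat hourly cost)
    undercuts serverless pricing. Assumes a 30-day month."""
    return vm_hourly_usd * 24 * 30 / blended_price_per_1m * 1_000_000

# Example workload: 500M input and 100M output tokens per month.
for model in PRICES:
    cost = monthly_serverless_cost(model, 500_000_000, 100_000_000)
    print(f"{model}: ${cost:.2f}/month")

# Hypothetical $3.00/hour GPU VM vs. a blended $0.20 per 1M tokens:
print(f"Break-even near {breakeven_tokens(3.00, 0.20) / 1e9:.1f}B tokens/month")
```

The break-even volume is what makes serverless the default recommendation: most applications never approach tens of billions of tokens per month.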
Use Case 3: Enterprise Data Bots and SQL Generation

If your goal is to build a copilot for your data warehouse or an internal tool that generates SQL queries from natural language, Snowflake Arctic is designed for this specific purpose. Its training focus on "Enterprise Intelligence" makes it more reliable for code generation and technical instruction following than its dense SLM counterparts.

Deployment Strategies on Azure

Azure offers two primary ways to deploy these models, each with distinct advantages.

1. Model-as-a-Service (Serverless APIs)

This is the recommended approach for most developers. You don't need to manage GPUs; instead, you call an API and pay per token.

  • Best for: Burst workloads, rapid prototyping, and applications where managing infrastructure is a bottleneck.
  • How-to: Navigate to Azure AI Studio, select the model from the catalog, and click "Deploy" -> "Serverless API."

2. Managed Online Endpoints (Dedicated Infrastructure)

This involves deploying the model onto a specific Azure VM instance (e.g., NCv3-series).

  • Best for: High-volume, steady-state workloads where token-based pricing becomes more expensive than hourly VM costs, or when heavy customization of the inference server (such as using vLLM) is required.
  • How-to: Use the azure-ai-ml Python SDK to define an endpoint and deployment configuration.

Fine-Tuning Example: Phi-3 on Azure AI Studio

Fine-tuning is essential for making an SLM perform like a specialized expert. Here is a conceptual workflow for fine-tuning Phi-3 using Low-Rank Adaptation (LoRA) on Azure.

Step 1: Data Preparation

Format your data into a JSONL file.
For Phi-3, each line should follow the ChatML-style messages structure:

```json
{"messages": [{"role": "user", "content": "Explain quantum physics to a toddler."}, {"role": "assistant", "content": "Quantum physics is like having a toy that can be in two boxes at the same time..."}]}
```

Step 2: Submission via Python SDK

Using the Azure AI SDK, you can trigger a fine-tuning job on a GPU cluster. The snippet is conceptual; exact class names and model paths may differ across SDK versions:

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.entities import FineTuningJob

# Initialize the client (credential, subscription_id, resource_group,
# and workspace_name must be defined for your environment)
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# Define the job
job = FineTuningJob(
    model="azureml://registries/azureml/models/Phi-3-mini-4k-instruct",
    task="chat_completion",
    training_data=Input(type="uri_file", path="path_to_your_data.jsonl"),
    hyperparameters={
        "learning_rate": "0.0002",
        "batch_size": "4",
        "epochs": "3",
    },
)

# Submit the job
ml_client.jobs.create_or_update(job)
```

This approach uses LoRA, which updates only a small fraction of the model's weights, significantly reducing the VRAM required for training and preventing "catastrophic forgetting."

Conclusion: The Right Tool for the Job

Choosing between Phi-3, Llama 3, and Snowflake Arctic on Azure is not about which model is objectively "best," but which best aligns with your operational constraints:

  • Choose Phi-3 when compute efficiency and logical reasoning are paramount.
  • Choose Llama 3 8B when you need a versatile, conversational generalist with a rich ecosystem.
  • Choose Snowflake Arctic when your application centers on structured data, SQL, and enterprise-grade code generation.

As Azure continues to expand its Model Catalog, standardized APIs make swapping models easier than ever, reducing the risk of model lock-in. Organizations should test prompts across all three to find the optimal balance of cost, performance, and capability for their specific workloads.
The Quantum Computing Mirage: What Three Years of Broken Promises Have Taught Me
By Igboanugo David Ugochukwu
I've lost count of how many quantum computing briefings I've sat through where executives project timelines on screens that quietly shift right every six months. The promises sound identical to what I heard in 2022, except the dates change. Quantum advantage was coming in 2024. Then 2025. Now it's 2026. Next year, I'll probably hear 2027.

Google's announcement about their Willow processor in late 2024 followed a script I could recite from memory. A hundred-plus qubits. Performance beyond classical supercomputers. A calculation verified as correct. The press release carefully avoids mentioning that the calculation serves no purpose beyond demonstrating the machine works. It's a benchmark divorced from any application someone would pay to run.

IBM's Nighthawk rollout last year hit similar notes. One hundred twenty qubits with better connectivity, allowing marginally more complex circuits. Expected availability late 2025, with scaling to thousands of qubits by 2028. I asked an IBM quantum researcher off the record what "thousands of qubits" actually enables. The answer, after some hedging: "Still not enough for most things people care about."

The Error Correction Problem That Keeps Getting Worse

Here's what quantum computing vendors don't emphasize in announcements: physical qubits are catastrophically unreliable. Decoherence times — how long a qubit maintains its quantum state before environmental noise destroys it — are measured in microseconds to milliseconds, depending on implementation. That's not long. Classical bits maintain state indefinitely unless you actively flip them.

Building reliable logical qubits from unreliable physical ones requires quantum error correction codes. Surface codes, the most promising approach, need roughly 1,000 physical qubits to produce one logical qubit at useful fidelity. Some estimates run higher — 1,500 to 2,000 physical qubits per logical qubit, depending on target error rates. Do the math.
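The math is short enough to write down. A sketch using the roughly 1,000-to-1 surface-code overhead cited above (the figures are the estimates quoted in this article, not vendor numbers):

```python
import math

def physical_qubits_needed(logical_qubits: int, overhead: int = 1_000) -> int:
    """Surface-code estimate: ~1,000 physical qubits per logical qubit;
    some estimates run 1,500-2,000, depending on target error rates."""
    return logical_qubits * overhead

needed = physical_qubits_needed(10_000)   # a machine for "useful" computation
current = 500                             # "a few hundred" qubits today
print(f"{needed:,} physical qubits needed")
print(f"~{math.log10(needed / current):.1f} orders of magnitude beyond today")
```

Even against IBM's thousands-of-qubits 2028 target rather than today's few hundred, the shortfall stays in the three-to-four-orders-of-magnitude range.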
If you need 10,000 logical qubits for a useful computation — and many proposed applications require more — you're looking at 10 million physical qubits. Current systems have a few hundred. IBM's roadmap targets thousands by 2028. Even if they hit that goal, they're three to four orders of magnitude short.

IBM published results last year showing 10× faster decoding of error correction codes. They're optimizing the classical computation required to interpret syndrome measurements and apply corrections. That matters for scalability, but it doesn't change the fundamental overhead ratio. You still need roughly 1,000 physical qubits per logical qubit. Decoding them 10× faster just means you waste less classical computing time managing the error correction.

The Qiskit framework improvements — 24 percent better accuracy on 100-qubit circuits through dynamic circuits and error mitigation — sound significant until you check baseline accuracy. Running a 100-qubit circuit on current hardware without extensive error mitigation gives you garbage results. Improving garbage by 24 percent still gives you mostly garbage. These aren't production systems. They're lab instruments requiring expert supervision.

What Google's Willow Actually Demonstrated

Google's Quantum Echoes algorithm on Willow performed a quantum simulation matching advanced nuclear magnetic resonance simulations of a 15-atom molecular system. The technical execution deserves respect. Simulating quantum systems using quantum hardware is conceptually clean — you're modeling quantum behavior with a quantum device. Classical computers struggle with this because quantum state spaces grow exponentially with system size.

But a 15-atom simulation doesn't get you to drug discovery. Pharmaceutically interesting molecules — proteins, enzyme binding sites, antibody structures — contain hundreds to thousands of atoms. The computational complexity doesn't scale linearly.
Simulating a system twice as large requires far more than twice the quantum resources. The gap between demonstrating a 15-atom system and simulating molecules relevant to drug development is enormous.

I spoke with a computational chemist at a major pharmaceutical company after Google's announcement. They use classical quantum chemistry software — Gaussian, ORCA, Q-Chem — to model molecular interactions daily. These tools run on HPC clusters and handle much larger systems than current quantum hardware can touch. Yes, they use approximations. Yes, certain calculations remain intractable. But classical methods keep improving, and quantum computers aren't close to being competitive for real pharmaceutical workflows.

The classical computing community responds to each quantum achievement by optimizing its algorithms. After Google's 2019 quantum supremacy claim with Sycamore, researchers developed classical algorithms that narrowed the performance gap substantially. The same pattern plays out repeatedly. Quantum researchers announce a hard problem for classical computers. Classical algorithm researchers respond with better approaches. The quantum advantage evaporates or becomes marginal.

Where the Qubit Scaling Curve Actually Goes

Superconducting qubits — the technology IBM and Google primarily use — require dilution refrigerators that cool systems to millikelvin temperatures. Operating temperatures are around 15 millikelvin, colder than outer space. The refrigeration systems are expensive, power-hungry, and physically large. You can't just stack them arbitrarily to add more qubits.

IBM's 300 mm wafer fabrication for quantum processors is genuine engineering progress. Producing superconducting qubit chips on semiconductor manufacturing infrastructure means you can potentially scale production using existing fab technology. The problem isn't making qubits. It's connecting them, controlling them, and reading them out without introducing noise.
Each qubit needs control lines — microwave pulses to manipulate quantum state. It needs readout circuitry to measure state. In superconducting systems, this means physical wiring running from room temperature down to millikelvin temperatures. Current systems use coaxial cables, which is manageable for hundreds of qubits but doesn't scale to millions. Some researchers are developing multiplexing schemes to reduce wiring requirements, but these introduce new error sources.

Connectivity matters critically. Most quantum algorithms assume all-to-all qubit connectivity — any qubit can interact directly with any other. Physical systems don't provide this. Superconducting qubit chips typically arrange qubits in 2D grids where each qubit couples to nearest neighbors. Running an algorithm that requires distant qubits to interact means executing SWAP operations to move quantum state around the chip. Each SWAP introduces errors and consumes gate-depth budget.

IBM's Nighthawk improved connectivity with 218 couplers for 120 qubits, allowing slightly more flexible circuit topologies. This enables 30 percent more complex circuits, which sounds good until you realize the baseline is "extremely limited circuits." Going from very constrained to somewhat less constrained doesn't fundamentally change what computations are feasible.

The Application Gap Nobody Wants to Quantify

Optimization problems appear frequently in quantum computing pitches. Portfolio optimization for finance. Logistics routing. Manufacturing scheduling. Supply chain management. Quantum annealing and variational quantum algorithms supposedly handle these better than classical approaches.

Except classical optimization has also improved dramatically. Mixed-integer programming solvers are far more capable than a decade ago. Machine learning techniques like reinforcement learning tackle the same optimization problems quantum algorithms target. Companies deploy these classical methods in production today and get useful results.
D-Wave has sold quantum annealers commercially since 2011. Their systems now have thousands of qubits. After over a decade, clear examples of quantum annealing outperforming classical optimization on real business problems remain scarce. Most published comparisons show quantum annealing competitive with or slightly better than some classical algorithms, but not decisively superior to state-of-the-art classical methods.

The pattern suggests quantum computing might find niche applications where it's marginally better than classical approaches for specific problem instances. That's not worthless, but it's far from the revolutionary impact implied by vendor marketing. Marginal improvements don't justify the infrastructure cost and operational complexity of quantum systems.

Cryptanalysis through Shor's algorithm represents the clearest threat quantum computing poses. It can factor large numbers exponentially faster than classical algorithms, breaking RSA and other public-key cryptosystems. This danger is mathematically proven, not speculative. A fault-tolerant quantum computer with a few thousand logical qubits running Shor's algorithm would compromise much of current encryption infrastructure.

The timeline matters. Building fault-tolerant systems with thousands of logical qubits requires millions of physical qubits with error correction. Generous estimates put this 15 to 20 years out. Conservative estimates say 30+ years, or possibly never. NIST finalized post-quantum cryptographic standards in 2024. Migration to quantum-resistant algorithms is underway. We're hardening systems against a threat that won't materialize for decades, if ever.

What the Talent Shortage Reveals

Quantum computing requires expertise at the intersection of physics, mathematics, and computer science. You need to understand quantum mechanics deeply enough to reason about qubit behavior. You need mathematical sophistication to work with Hilbert spaces and unitary transformations.
You need programming skills to implement algorithms. Universities produce maybe 500 to 1,000 graduates per year with serious quantum computing training globally — a generous estimate. Industry demand far exceeds supply. Companies hire PhD physicists and teach them software engineering, or hire software engineers and teach them quantum mechanics. Neither path is efficient.

If quantum computing were about to become mainstream technology, we'd see massive university program expansion training quantum developers. We'd see bootcamps and online courses producing job-ready quantum programmers. We'd see salary surveys for quantum software engineers reflecting high demand. Some of this exists, but at modest scale. The education pipeline doesn't reflect an imminent technology revolution.

Compare this to classical machine learning. When deep learning took off around 2012, universities couldn't expand programs fast enough. Online courses proliferated. Bootcamps emerged. Within five years, you could hire ML engineers without requiring PhD-level expertise. The ecosystem grew organically in response to genuine demand.

Quantum computing shows no comparable acceleration. It's a specialized research field attracting bright people, but the broader engineering community isn't rushing to acquire quantum skills because there are no jobs requiring them. Quantum computing remains primarily academic research and corporate R&D. The transition to an engineering discipline keeps receding.

Why IBM and Google Keep Building Anyway

Both companies have invested hundreds of millions, probably billions, in quantum computing programs. IBM operates multiple quantum data centers. Google has a dedicated quantum AI lab. Neither can walk away without admitting those investments were premature.

Cloud quantum computing as a service generates minimal revenue. Running experiments on IBM Quantum or Google's quantum processors is interesting for researchers but not a sustainable business model.
These offerings exist primarily for ecosystem development — getting people familiar with quantum programming so there's demand when useful hardware eventually arrives.

The corporate motivation isn't immediate ROI. It's positioning for potential future markets and hedging against competitors achieving breakthroughs. If quantum computing does eventually work, being years behind IBM or Google would be catastrophic for competitors. So Microsoft, Amazon, and others maintain quantum programs despite unclear commercial timelines.

Research publications and patent portfolios from quantum programs justify continued investment to boards and shareholders. The programs employ world-class physicists doing legitimate research. Even if practical quantum computers remain decades away, the research produces scientific value and attracts talent. This creates sustained funding without requiring near-term applications. Large tech companies are rich enough to run multi-decade research programs. Quantum computing can continue indefinitely at current burn rates without demonstrating utility, similar to the corporate research labs of previous eras.

What Changed Between 2019 and 2025

Google's 2019 quantum supremacy announcement generated enormous attention. Nature published the paper. Media coverage went worldwide. The narrative was: quantum computing has arrived; classical computers are obsolete for certain problems.

Within months, IBM published an analysis showing classical supercomputers could complete Google's benchmark calculation much faster than Google estimated. Other researchers developed improved classical algorithms. The quantum advantage shrank dramatically under scrutiny.

Since then, quantum announcements have been more cautious. Terms like "quantum advantage" get defined more carefully. Claims get hedged more thoroughly. Vendors emphasize specific benchmarks rather than general computational superiority. The field learned that overpromising invites backlash.
Qubit counts increased steadily. Error rates decreased incrementally. Coherence times improved gradually. None of these advances translated into new applications. The gap between better qubits and useful quantum computers didn't narrow meaningfully.

Classical computing also improved. GPUs got faster. AI accelerators proliferated. Neuromorphic chips emerged. For many problems quantum computing supposedly targets, classical hardware advances provided better near-term solutions. Quantum advantage became a moving target as classical capabilities increased.

The Post-Quantum Cryptography Migration Actually Matters

While quantum computers struggle to do anything useful, cryptographic infrastructure is undergoing massive changes to defend against them. NIST selected lattice-based and hash-based algorithms resistant to both quantum and classical attacks. Government agencies mandate transition timelines. Financial institutions are planning migrations.

This represents real cost and real risk. Changing cryptographic algorithms throughout complex systems takes years and breaks things. Compatibility issues emerge. Performance degrades because post-quantum algorithms are typically slower. Hardware acceleration for the new algorithms doesn't exist yet on most platforms.

The threat model is "harvest now, decrypt later" — adversaries capture encrypted data today, store it, and decrypt it once quantum computers arrive. For data with decades-long confidentiality requirements, this matters. You need to transition now even if quantum computers remain 20 years away.

So organizations spend millions upgrading cryptography to defend against machines that don't exist and might not exist for decades. It's insurance — arguably sensible insurance — but the situation is bizarre. Defensive measures against quantum computers have more immediate impact than quantum computers themselves.

Where Technical Depth Actually Lies

Understanding quantum computing requires getting past surface-level explanations.
Qubits aren't just "bits that can be 0 and 1 simultaneously." That's a cartoon. Qubits are quantum two-level systems described by complex probability amplitudes in a Hilbert space. Quantum gates are unitary transformations operating on those state vectors. Measurement collapses superposition to definite states probabilistically according to the Born rule.

Gate fidelity specifications matter. A two-qubit gate with 99.9 percent fidelity sounds good until you realize complex algorithms require thousands or millions of gates. Errors accumulate. At 99.9 percent fidelity, after 1,000 gates you've got roughly a 37 percent chance the computation is correct, assuming independent errors. Error correction overhead is why you need so many physical qubits.

Coherence times set fundamental limits. T1 relaxation time measures how quickly excited states decay to the ground state. T2 dephasing time measures how quickly relative phases between superposition components degrade. Superconducting qubits typically achieve T1 around 100 microseconds and T2 around 50 microseconds. Every operation must complete before coherence is lost. This constrains circuit depth severely.

Connectivity topology determines what algorithms you can run efficiently. Heavy-hex lattice connectivity, which IBM uses, provides each qubit with three connections arranged hexagonally. This is better than simple grid connectivity but still far from all-to-all. Mapping logical circuit topology to physical connectivity requires SWAP insertion, expanding circuit depth and introducing errors.

These technical details separate informed commentary from promotional fluff. When a vendor announces a new processor, the specifications that matter include qubit count, gate fidelities (single-qubit and two-qubit), coherence times, connectivity, readout fidelity, and crosstalk. Press releases often omit these or bury them in supplementary materials.
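The fidelity arithmetic quoted above is a one-line compound-probability calculation, under the same independent-errors simplification:

```python
def circuit_success_probability(gate_fidelity: float, gate_count: int) -> float:
    """Probability every gate executes without error, assuming each gate
    fails independently (the simplification used in the text)."""
    return gate_fidelity ** gate_count

p = circuit_success_probability(0.999, 1_000)
print(f"{p:.3f}")  # ~0.368, i.e., roughly the 37 percent figure
```

Push the gate count to a million, as fault-tolerant algorithms require, and the unprotected success probability is effectively zero, which is exactly why error correction is non-negotiable.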
The Next Five Years, Realistically

IBM will probably hit its thousands-of-qubits target by 2028. Google will build larger processors. Coherence times will improve incrementally. Error rates will decrease gradually. None of this will enable practical applications that justify the investment.

Research will continue producing interesting results. Quantum simulation experiments will handle slightly larger systems. Error correction demonstrations will show better performance. Benchmarks will be achieved proving quantum computers can do specific artificial tasks faster than classical systems.

Companies maintaining quantum programs will emphasize long-term positioning and research value. Funding will continue because shutting programs down means admitting massive sunk costs. Cloud quantum services will persist as loss leaders supporting ecosystem development.

Classical computing will advance simultaneously. AI chips will get faster. Novel architectures will emerge. For most problems, classical solutions will remain superior. The opportunity cost of quantum computing — what else could have been done with those resources — will grow harder to justify.

Post-quantum cryptography migration will accelerate, driven by compliance requirements and institutional caution. This might end up being quantum computing's primary impact: forcing cryptographic upgrades through the threat of future capabilities rather than demonstrating actual capabilities.

Universities will continue training quantum researchers. Talented physicists and mathematicians will work on legitimate problems. Scientific progress will occur — just not at the pace or with the impact industry roadmaps suggest.

The honest assessment most researchers would give privately: we're making progress, the physics works, scaling remains extremely hard, practical applications are distant, predictions about timelines are guesses, and anyone promising a quantum revolution by 2026 is either ignorant or lying.
That's not what you'll read in press releases. But that's the reality.
Ralph Wiggum Ships Code While You Sleep. Agile Asks: Should It?
By Stefan Wolpers
Automating TDD: Using AI to Generate Edge-Case Unit Tests
By Nikita Kothari
Agentic AI vs Copilots: The Architectural Shift from Assistance to Autonomy
By Nikita Kothari
From Prompt Loops to Systems: Hosting AI Agents in Production

An agent can reason well and still fail badly. Most teams do not notice this during early experiments because nothing is under pressure yet. The model calls tools, answers questions, and produces outputs that look correct. From the outside, the system works. The problems surface later, once the agent is expected to run continuously instead of intermittently. Restarts become normal, context has to survive across runs, external services are often involved, and their actions are not always closely monitored. That is where the difference shows. At that point, outcomes depend far less on how the agent reasons and far more on how it is hosted, because hosting determines what happens when execution is interrupted, state disappears, or permissions suddenly block an action.

This article walks through what breaks once agents leave controlled environments and why runtime control, memory persistence, tool mediation, and observability determine whether an agent behaves like a system or collapses into a script.

Local Testing Works Because the Rules Are Simple

Most agents begin life in forgiving conditions. A developer runs them locally or on a small cloud instance, often with a single user and no real concurrency. Frameworks such as LangChain or LangGraph handle the wiring: the model is connected to tools, state is passed through in-memory objects, and behavior is easy to observe while everything runs in a single process.

In that environment, the system feels stable. State lives in memory for as long as the process stays alive. Tools are called directly, without mediation. Logs are easy to follow. When something goes wrong, restarting the process usually resets the world, and the problem disappears.

Production does not work that way. Once the same agent runs across machines, handles concurrent requests, and restarts without warning, those assumptions fall apart. Memory vanishes unless it is explicitly persisted. Execution spreads across services instead of living in one place.
Failures become intermittent and difficult to reproduce. If hosting does not account for this shift, the agent starts behaving unpredictably, even though individual model outputs may still look reasonable in isolation.

A prompt can describe what an agent is supposed to do. It cannot enforce how that behavior unfolds over time. That enforcement has to come from hosting.

Runtimes Turn Agents Into Services

An agent implemented as a prompt loop has no real boundaries. It decides when to act, what to remember, and how to call tools. That is acceptable for experiments; it becomes dangerous once the agent touches real infrastructure. A runtime layer changes the operating model by separating intent from execution.

Below is a simplified example of a runtime-controlled agent loop. The model proposes actions. The runtime decides what actually happens.

Python

def process_step(agent_id, proposed_action):
    state = state_store.load(agent_id)

    decision = policy_engine.evaluate(
        agent_id=agent_id,
        action=proposed_action,
        state=state
    )

    if decision == "DENY":
        audit_log.record(agent_id, proposed_action, "DENIED")
        return state

    result = tool_gateway.execute(
        agent_id=agent_id,
        action=proposed_action
    )

    updated_state = state_store.persist(agent_id, result)
    audit_log.record(agent_id, proposed_action, "EXECUTED")
    return updated_state

This structure is what makes agent behavior predictable. The model suggests. The runtime enforces. When something fails, engineers inspect execution paths instead of guessing why the model said what it said. Managed runtimes such as Amazon Bedrock Agents follow the same pattern. Execution control, state management, and logging live outside the model. The separation matters more than the platform.

Memory Has to Survive the Process

Agents depend on context. During early development, that context often lives in prompt history or in-memory objects. This works until the first restart. In production, memory has to survive restarts and scaling events.
It also has to be inspectable. Without that, agents forget earlier decisions, repeat work, or contradict themselves across runs. From the outside, it looks like poor reasoning. It is usually missing state. A simple persistent state model already fixes much of this.

Python

class State:
    def __init__(self, context, history):
        self.context = context
        self.history = history
        self.updated_at = time.time()


class StateStore:
    def load(self, agent_id):
        return database.fetch(agent_id)

    def persist(self, agent_id, result):
        state = self.load(agent_id)
        state.history.append(result)
        state.updated_at = time.time()
        database.save(agent_id, state)
        return state

When state lives outside the prompt, engineers can see what the agent knew, what changed, and when. Without that visibility, behavior feels random even when the logic itself is not. Memory is not an optimization. It is part of the system's contract.

Tools Should Be Mediated, Not Exposed

Most agents become useful only when they can act in the world. That usually means tools: APIs, databases, internal services, automation hooks. In prototypes, these tools are often called directly because it is fast. That shortcut does not survive scale.

Direct tool access lets the model decide when side effects occur. Permissions sprawl. Credentials end up embedded where they should not be. Auditing becomes difficult because there is no single path that captures what was called and why.

The model requests an action. The system decides whether the action is allowed, under what conditions, and with which permissions.
Python

def execute_tool(agent_id, tool_request):
    permissions = permission_service.get_permissions(agent_id)

    if not permissions.allows(tool_request.name):
        raise PermissionError("Action not permitted")

    credentials = credential_service.issue_scoped_credentials(
        agent_id=agent_id,
        tool=tool_request.name
    )

    return tool_executor.run(
        tool_request=tool_request,
        credentials=credentials
    )

This moves access control out of prompts and into configuration. Credentials can be rotated. High-risk operations can be restricted. The agent still reasons about what it wants to do. The system controls what actually happens.

Guardrails Must Live Outside the Model

Many early designs rely on instructions in prompts to enforce safety rules. Do not delete data. Do not escalate privileges. Only read from this system. Those instructions are guidance, not enforcement. When guardrails exist only in text, compliance depends on how the model interprets them in a given moment. That is not reliable enough for systems that perform real actions. Guardrails belong in the control layer, where actions are validated before execution.

Python

def evaluate_policy(action, environment):
    if environment == "production" and action.type == "destructive":
        return "DENY"
    if action.required_scope not in action.granted_scopes:
        return "DENY"
    return "ALLOW"

If an action is not allowed, the system says no. The explanation does not matter.

One Agent Eventually Becomes a Bottleneck

As agents take on more responsibility, a single reasoning loop becomes harder to control. Information gathering, evaluation, policy enforcement, and execution carry different risks and permission requirements. Treating them as one unit increases complexity and widens access scopes.

A common production pattern is to separate these concerns. One component gathers information. Another evaluates conditions. A third applies organizational rules. A fourth executes approved actions. An orchestrator coordinates the flow.
Python

def orchestrate(task):
    data = data_agent.collect(task)
    assessment = evaluation_agent.analyze(data)
    decision = policy_agent.validate(assessment)

    if decision.approved:
        return execution_agent.execute(decision)
    return None

This mirrors how distributed systems have been built for years. Boundaries reduce blast radius and make failures easier to reason about.

Observability Is a Hosting Responsibility

When agents operate continuously, visibility is no longer optional. Teams need to know what the agent saw, what it decided, which tools it called, and what changed as a result. Console output might work early on. It does not hold up in production. A hosting environment has to capture execution steps, tool usage, and state transitions in a structured way.

Python

def record_event(agent_id, phase, details):
    telemetry.write({
        "agent_id": agent_id,
        "phase": phase,
        "details": details,
        "timestamp": time.time()
    })

With proper observability, agent behavior becomes something engineers can analyze instead of arguing about. Without it, every incident turns into guesswork.

Frameworks Still Matter, But They Are Not Hosting

Agent frameworks such as LangChain, LangGraph, LlamaIndex, and CrewAI still play an important role. They speed up development, reduce boilerplate, and make it easier to express reasoning flows, tool chains, and memory patterns. For early experimentation, they are often exactly what teams need.

What they do not provide is a hosting environment. Frameworks do not solve identity, durable state, policy enforcement, execution control, or observability. They assume those concerns are handled elsewhere. As systems mature, this distinction becomes unavoidable. In production architectures, frameworks live inside a structured runtime. The framework defines what the agent is allowed to reason about. The platform decides what the agent is actually allowed to do. That separation is what makes complex agent systems operable.
It preserves the flexibility of framework-driven development while preventing reasoning logic from becoming the enforcement mechanism.

Conclusion

AI agents earn trust through consistency, not clever output. An agent that runs for weeks without drifting, respects permissions without constant reminders, and leaves a clear trail of decisions becomes genuinely useful. An agent that relies on fragile prompts and hidden, in-memory state does not, no matter how impressive it looks in a demo. Strong hosting turns AI from a text generator into a dependable system component. A capable model is impressive. A well-hosted agent is reliable.

By Amit Chaudhary
Azure AI Search at Scale: Building RAG Applications with Enhanced Vector Capacity

In the rapidly evolving landscape of Generative AI, the Retrieval-Augmented Generation (RAG) pattern has emerged as the gold standard for grounding Large Language Models (LLMs) in private, real-time data. However, as organizations move from proof of concept (PoC) to production, they encounter a significant hurdle: scaling. Scaling a vector store isn’t just about adding more storage; it’s about maintaining low latency, high recall, and cost efficiency while managing millions of high-dimensional embeddings.

Azure AI Search (formerly Azure Cognitive Search) has recently undergone major infrastructure upgrades, specifically targeting enhanced vector capacity and performance. In this technical deep dive, we explore how to architect high-scale RAG applications using the latest capabilities of Azure AI Search.

1. The Architecture of Scalable RAG

At its core, a RAG application consists of two distinct pipelines: the Ingestion Pipeline (data to index) and the Inference Pipeline (query to response). When scaling to millions of documents, the bottleneck usually shifts from the LLM to the retrieval engine. Azure AI Search addresses this by separating storage and compute through partitions and replicas, while offering specialized, hardware-accelerated vector indexing.

System Architecture Overview

The following diagram illustrates a production-grade RAG architecture. Note how the Search service acts as the orchestration layer between raw data and the generative model.

2. Understanding Enhanced Vector Capacity

Azure AI Search has introduced new storage-optimized and compute-optimized tiers that significantly increase the number of vectors you can store per partition.

The Vector Storage Math

Vector storage consumption is determined by the dimensionality of your embeddings and the data type (for example, float32).
A standard 1,536-dimensional embedding (common for OpenAI models) using float32 requires:

1536 dimensions * 4 bytes = 6,144 bytes per vector (plus metadata overhead)

With the latest enhancements, certain tiers can now support tens of millions of vectors per index, using techniques such as Scalar Quantization to reduce memory footprint without significantly impacting retrieval accuracy.

Comparing Retrieval Strategies

To build at scale, you must choose the right search mode. Azure AI Search is unique in that it combines traditional full-text search with vector capabilities.

| Feature | Vector Search | Full-Text Search | Hybrid Search | Semantic Ranker |
| --- | --- | --- | --- | --- |
| Mechanism | Cosine similarity / HNSW | BM25 algorithm | Reciprocal Rank Fusion | Transformer-based L3 |
| Strengths | Semantic meaning, context | Exact keywords, IDs, SKUs | Best of both worlds | Highest relevance |
| Scaling | Memory intensive | CPU/IO intensive | Balanced | Extra latency (ms) |
| Use Case | "Tell me about security" | "Error code 0x8004" | General enterprise search | Critical RAG accuracy |

3. Deep Dive: High-Performance Vector Indexing

Azure AI Search uses the HNSW (Hierarchical Navigable Small World) algorithm for vector indexing. HNSW is a graph-based approach that enables approximate nearest neighbor (ANN) searches with sub-linear time complexity.

Configuring the Index

When defining your index, the vectorSearch configuration is critical. You must define the algorithmConfiguration to balance speed and accuracy.
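The per-vector math above extends directly to whole-index sizing, which is where quantization pays off. The sketch below uses a hypothetical helper (`vector_index_bytes` is not an Azure API) and deliberately ignores metadata and HNSW graph overhead; it assumes 4 bytes per dimension for float32 and 1 byte per dimension for int8 scalar quantization.

```python
def vector_index_bytes(num_vectors: int, dims: int = 1536, bytes_per_dim: int = 4) -> int:
    """Back-of-the-envelope raw vector storage: vectors * dims * bytes per dimension.
    Ignores metadata and graph-index overhead."""
    return num_vectors * dims * bytes_per_dim

ten_million = 10_000_000
full = vector_index_bytes(ten_million)                        # float32: 4 bytes/dim
quantized = vector_index_bytes(ten_million, bytes_per_dim=1)  # int8 scalar quantization

print(f"float32: {full / 1024**3:.1f} GiB")   # → float32: 57.2 GiB
print(f"int8:    {quantized / 1024**3:.1f} GiB")  # → int8:    14.3 GiB
```

At ten million 1,536-dimensional vectors, dropping from float32 to int8 cuts raw vector storage by 4x, which is why quantization is central to fitting large indexes into a partition's memory budget.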
Python

from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SearchableField
)

# Configure HNSW parameters
# m: number of bi-directional links created for each new element during construction
# efConstruction: tradeoff between index construction time and search speed
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw-config",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "metric": "cosine"
            }
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-profile",
            algorithm_configuration_name="my-hnsw-config"
        )
    ]
)

# Define the index schema
index = SearchIndex(
    name="enterprise-rag-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="my-vector-profile"
        )
    ],
    vector_search=vector_search
)

Why m and efConstruction Matter

- m: Higher values improve recall for high-dimensional data but increase the memory footprint of the index graph.
- efConstruction: Higher values produce a more accurate graph but increase indexing time. For enterprise datasets with over one million documents, values between 400 and 1000 are commonly used for initial index builds.

4. Integrated Vectorization and Data Flow

A common challenge at scale is the orchestration tax — the overhead of managing separate embedding services and indexers. Azure AI Search now offers Integrated Vectorization.

The Data Flow Mechanism

By using integrated vectorization, the Search service handles chunking and embedding internally. When a document is added to a data source (such as Azure Blob Storage), the indexer automatically detects the change, chunks the content, invokes the embedding model, and updates the index.
This significantly reduces custom pipeline complexity.

5. Implementing Hybrid Search with Semantic Ranking

Pure vector search often struggles with domain-specific jargon or product identifiers (for example, Part-99-X). To build a robust RAG system, implement Hybrid Search with Semantic Ranking. Hybrid search combines the results from a vector query and a keyword query using Reciprocal Rank Fusion (RRF). The Semantic Ranker then takes the top 50 results and applies a secondary, more compute-intensive transformer model to re-order them based on actual meaning.

Code Example: Performing a Hybrid Query

Python

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorQuery

client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name="enterprise-rag-index",
    credential=credential
)

# User's natural language query
query_text = "How do I reset the firewall configuration for the Pro series?"

# This embedding should be generated via your choice of model (e.g., text-embedding-3-small)
query_vector = get_embedding(query_text)

results = client.search(
    search_text=query_text,  # Keyword search query
    vector_queries=[
        VectorQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector"
        )
    ],
    select=["id", "content"],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config"
)

for result in results:
    print(f"Score: {result['@search.score']} | Semantic Score: {result['@search.reranker_score']}")
    print(f"Content: {result['content'][:200]}...")

The @search.reranker_score provides a more reliable relevance signal for LLM context selection than cosine similarity alone.

6. Scaling Strategies: Partitions and Replicas

Azure AI Search scales in two dimensions: Partitions and Replicas.

Partitions (Horizontal Scaling for Storage): Partitions provide more storage and faster indexing. If you are hitting the vector limit, you add partitions. Each partition effectively "slices" the index.
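Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. This is an illustrative implementation, not Azure's internal one; k=60 is the constant commonly cited in the RRF literature.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document IDs.
    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so documents ranked well by multiple retrievers rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# doc1 appears near the top of both the keyword and vector lists, so it wins.
keyword_results = ["doc3", "doc1", "doc7"]
vector_results = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# → ['doc1', 'doc3', 'doc5', 'doc7']
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.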
For example, if one partition holds 1M vectors, two partitions hold 2M.

Replicas (Horizontal Scaling for Query Volume): Replicas handle query throughput (Queries Per Second - QPS). If your RAG app has 1,000 concurrent users, you need multiple replicas to prevent request queuing.

Estimating Capacity

When designing your system, follow these rules of thumb:

- Low latency requirement: Maximize replicas.
- Large dataset: Maximize partitions.
- High availability: Minimum of 2 replicas for read-only SLA, 3 for read-write SLA.

7. Performance Tuning and Best Practices

Building at scale requires more than just infrastructure; it requires smart data engineering.

Optimal Chunking Strategies

The quality of your RAG system is directly proportional to the quality of your chunks.

- Fixed-size chunking: Fast but often breaks context.
- Overlapping chunks: Essential for ensuring context isn't lost at the boundaries. A common pattern is 512 tokens with a 10% overlap.
- Semantic chunking: Using an LLM or specialized model to find logical breakpoints (paragraphs, sections). This is more expensive but yields better retrieval results.

Indexing Latency vs. Search Latency

When you scale to millions of vectors, HNSW graph construction can take time. To optimize:

- Batch your uploads: Don't upload documents one by one. Use the upload_documents batch API with 500-1000 documents per batch.
- Use the ParallelIndex approach: If your dataset is static and massive, consider using multiple indexers pointing to the same index to parallelize the embedding generation.

Monitoring Relevance

Scaling isn't just about size; it's about maintaining quality. Use retrieval metrics to evaluate your index performance:

- Recall@K: How often is the correct document in the top K results?
- Mean Reciprocal Rank (MRR): How high up in the list is the relevant document?
- Latency P95: What is the 95th percentile response time for a hybrid search?
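The first two retrieval metrics above are cheap to compute offline against a labeled query set. The sketch below is illustrative: it assumes each evaluation query has a single known-relevant document ID, which would come from your own evaluation data.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_reciprocal_rank(all_ranked_ids, all_relevant_ids):
    """Average of 1/rank of the first relevant document per query (0 if absent)."""
    total = 0.0
    for ranked, relevant in zip(all_ranked_ids, all_relevant_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(all_relevant_ids)

# Two evaluation queries: the relevant doc is ranked 1st for the first query, 2nd for the second.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = ["a", "y"]
print(recall_at_k(ranked[0], "a", 3))          # → 1.0
print(mean_reciprocal_rank(ranked, relevant))  # → 0.75
```

Tracking these metrics over time catches relevance regressions from re-chunking, quantization, or HNSW parameter changes before they reach users; latency percentiles come from your monitoring stack rather than from code like this.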
8. Conclusion: The Future of Vector-Enabled Search

Azure AI Search has evolved from a keyword index into a high-performance vector engine capable of powering large-scale RAG systems. With enhanced vector capacity, hybrid retrieval, and integrated vectorization, teams can focus on the generation layer rather than retrieval infrastructure. Future capabilities such as vector quantization and disk-backed HNSW will push scalability further, enabling billions of vectors at lower cost. For enterprise architects, the takeaway is clear: scaling RAG isn’t just about the LLM — it’s about building a resilient, high-capacity retrieval foundation.

Technical Checklist for Production Deployment

- Choose the right tier: S1, S2, or the new L-series (Storage Optimized) based on vector counts.
- Configure HNSW: Tune m and efConstruction based on your recall requirements.
- Enable Semantic Ranker: Use it for the final re-ranking step to significantly improve LLM output.
- Implement Integrated Vectorization: Simplify your pipeline and reduce maintenance overhead.
- Monitor with Azure Monitor: Keep an eye on Vector Index Size and Search Latency as your dataset grows.

For more technical guides on Azure, AI architecture, and implementation, follow: Twitter/X, LinkedIn, GitHub

By Jubin Abhishek Soni
From Command Lines to Intent Interfaces: Reframing Git Workflows Using Model Context Protocol

My recent journey into agentic developer systems has been driven by a desire to understand how AI moves from passive assistance to active participation in software workflows. In an earlier article, AI Co-creation in Developer Debugging Workflows, I explored how developers and AI systems collaboratively reason about code. As I went deeper into this space, I came across the Model Context Protocol (MCP) and became keen to understand what this component is and why it is important. I noticed that MCP was frequently referenced in discussions about agentic systems, yet rarely explained in a concrete, developer-centric way. This article is a direct outcome of that learning process, using a practical Git workflow example to clarify the role and value of MCP in intent-driven developer tooling.

What Is an MCP Server?

At a conceptual level, an MCP server acts as a control plane between an AI assistant and external systems. Rather than allowing an LLM to issue arbitrary API calls, the MCP server implements the Model Context Protocol and exposes a constrained, well-defined set of capabilities that the model can invoke. As illustrated in the diagram, the AI assistant functions as an MCP client, issuing structured MCP requests that represent user intent. The MCP server receives these requests, validates them against exposed capabilities and permissions, and translates them into concrete API calls or queries against external systems such as databases, version control platforms, or document stores. The results are then returned to the model as structured context, enabling subsequent reasoning or follow-up actions.

This intermediary role is critical. The MCP server is not merely a proxy; it enforces permission boundaries, operation granularity, and deterministic execution. By separating intent expression from execution logic, MCP reduces the risk of unsafe or unintended actions while enabling AI systems to operate on real developer tools in a controlled manner.
In effect, the MCP server bridges conversational AI and operational systems, making intent-driven workflows both practical and governable.

Case Study: Intent-Driven Git Workflows Using GitHub MCP in VS Code

To ground the discussion, this section presents a concrete case study using the open-source github-mcp-server, integrated into Visual Studio Code via GitHub Copilot Chat. The goal of this case study is not to demonstrate feature completeness, but to illustrate how MCP enables intent-first interaction for common GitHub workflows.

MCP Server Registration in VS Code

MCP servers are configured at the workspace or user level using a dedicated configuration file. In this setup, the GitHub MCP server is registered by adding an MCP configuration file under the VS Code workspace:

.vscode/mcp.json

JSON

{
  "servers": {
    "github": {
      "url": "https://api.githubcopilot.com/mcp/"
    }
  }
}

This configuration declares GitHub as an MCP server and points the IDE’s MCP client to a remote endpoint. Once registered, the IDE can discover the capabilities exposed by the GitHub MCP server and make them available to the chat interface as structured tools.

Authentication via OAuth Approval

When the MCP server is first invoked, VS Code initiates an OAuth flow with GitHub. In this case, authentication was completed by approving access through a browser-based login using GitHub credentials (username and password, followed by any configured multi-factor authentication). This OAuth-based flow has several important properties:

- Credentials are not stored directly in the MCP configuration.
- Permissions are scoped to the approved application.
- Token issuance and rotation are handled by the GitHub authorization system.

Once authorization is complete, the MCP server can securely execute GitHub operations on behalf of the user, subject to the granted scopes (these are listed as tools when configuring the MCP server).
Alternative Authentication: Personal Access Tokens

In addition to browser-based OAuth authorization, the GitHub MCP server can also be configured using a GitHub Personal Access Token (PAT). This approach is useful when explicit credential control is required or when OAuth approval is not feasible in a given environment. In this setup, the MCP configuration declares an Authorization header and prompts the user to supply the token securely at runtime, rather than hardcoding it in the file.

.vscode/mcp.json (PAT-based authentication)

JSON

{
  "servers": {
    "github": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {
        "Authorization": "Bearer ${input:github_mcp_pat}"
      }
    }
  },
  "inputs": [
    {
      "type": "promptString",
      "id": "github_mcp_pat",
      "description": "GitHub Personal Access Token",
      "password": true
    }
  ]
}

This configuration has two practical advantages. First, the token is not committed to source control because it is collected via an interactive prompt. Second, it makes the authentication mechanism explicit and portable across environments while keeping the MCP server endpoint unchanged. After the token is provided, the IDE can invoke GitHub MCP capabilities through the same intent-driven prompts used in the OAuth-based setup.

Verifying MCP Server Initialization in VS Code

After adding the MCP configuration, it is important to verify that the GitHub MCP server is correctly initialized and running. Visual Studio Code exposes MCP server lifecycle events directly in the Output panel, which serves both as a validation mechanism and a primary debugging surface. Once the .vscode/mcp.json file is detected, VS Code attempts to start the configured MCP server automatically. In the Output tab, selecting the “MCP: github” channel shows detailed startup logs, including server initialization, connection state, authentication discovery, and tool registration.
The logs confirm several important stages:

- The GitHub MCP server transitions from Starting to Running
- OAuth-protected resource metadata is discovered
- The GitHub authorization server endpoint is identified
- The server responds successfully to the initialization handshake
- A total of 40 tools are discovered and registered

These log entries provide concrete evidence that the MCP server is active and that its capabilities are available to the IDE. They also offer visibility into the OAuth flow, making it clear when authentication is required and when it has been successfully completed. From a practical standpoint, the Output panel becomes essential when troubleshooting MCP integrations. Configuration errors, authentication failures, or capability discovery issues surface immediately in these logs, allowing developers to debug MCP setup issues without leaving the IDE or guessing at silent failures.

Executing GitHub Operations Through Intent

Once the GitHub MCP server is configured and running, GitHub operations become available inside the IDE as structured capabilities. Using Visual Studio Code with GitHub Copilot Chat, prompts expressed in natural language are translated into constrained GitHub operations via the github-mcp-server.

Repository Discovery

Prompt: “List all repos in my GitHub account.”

The assistant invokes the repository-listing capability and returns the results directly in the IDE, validating authentication and MCP capability discovery.

Pull Request Creation

Prompt: “Create a PR.”

Because the request is underspecified, the assistant asks for required parameters, including repository, change source, title, description, and base branch. After responding with: “react-storybook-starter, staged changes, PR title – Add a dummy commit, PR description none, merge to master” the assistant creates a branch, commits the staged changes, and opens a pull request. The PR is confirmed with its repository identifier.
Repository Creation

Prompt: “Create a new repo in mvmaishwarya. Repo name: problems-and-prep. Repo is public.”

The MCP server executes the repository creation operation and returns confirmation that the public repository has been successfully provisioned.

Observations from Intent-Driven Execution

Across these examples, several consistent behaviors emerge. First, the assistant requests clarification only when required by the operation’s schema, avoiding unnecessary dialogue. Second, all actions are executed through explicitly exposed MCP capabilities rather than inferred or free-form API calls. Finally, the IDE remains the primary workspace, reducing context switching between terminals, browsers, and documentation. Together, these interactions demonstrate how MCP enables GitHub workflows to shift from command-driven procedures to intent-driven execution while maintaining safety, transparency, and developer control.

By Aishwarya Murali
Amazon Q Developer for AI Infrastructure: Architecting Automated ML Pipelines

The landscape of Machine Learning Operations (MLOps) is shifting from manual configuration to AI-driven orchestration. As organizations scale their AI initiatives, the bottleneck is rarely the model architecture itself, but rather the underlying infrastructure required to train, deploy, and monitor these models at scale. Amazon Q Developer, a generative AI–powered assistant, has emerged as a critical tool for architects and engineers looking to automate the lifecycle of AI infrastructure.

Traditionally, setting up a robust ML pipeline involved complex Infrastructure as Code (IaC), intricate IAM permissioning, and manual tuning of compute resources like NVIDIA H100s or AWS Trainium. Amazon Q Developer streamlines this by translating high-level architectural requirements into production-ready scripts, optimizing resource allocation, and troubleshooting connectivity issues within the AWS ecosystem. This article explores the technical architecture of using Amazon Q for ML infrastructure and provides practical implementation strategies.

1. The Architectural Blueprint of Q-Assisted ML Pipelines

To understand how Amazon Q Developer automates ML pipelines, we must examine its integration points within the AWS Well-Architected Framework. Amazon Q operates as a management layer that interfaces with the AWS Cloud Control API, SageMaker, and CloudFormation/CDK. In a typical automated ML architecture, Amazon Q acts as the “intelligence agent” that sits between the developer’s IDE and the target cloud environment. It doesn’t just suggest code snippets; it understands the context of ML workloads, such as data throughput requirements and memory-intensive training jobs. This architecture ensures that the infrastructure is not a static set of scripts, but an evolving entity that can be refactored by Amazon Q based on performance metrics received from CloudWatch.
Automating Infrastructure as Code (IaC) for GPU Clusters Provisioning high-performance compute clusters for deep learning is notoriously difficult. Misconfigurations in VPC subnets or security groups can lead to latency issues during distributed training (e.g., using Horovod or PyTorch Distributed Data Parallel). Amazon Q Developer excels at generating AWS CDK (Cloud Development Kit) code that follows best practices for networking and resource isolation. When prompted to “Create a SageMaker pipeline with VPC-only access and GPU acceleration,” Amazon Q generates the necessary constructs to ensure that training traffic stays within the AWS backbone, reducing data transfer costs and increasing security. Comparison: Manual vs. Q-Assisted Provisioning FeatureManual ImplementationQ-Assisted ImplementationResource SelectionManual benchmarking of P4/P5 instancesAI-driven recommendation based on workloadIAM Policy CreationTrial and error (Least Privilege)Automated generation of scoped IAM rolesNetworkingManual VPC/Subnet/NAT Gateway setupPattern-based VPC architecture generationScalingStatic Auto-scaling policiesDynamic scaling based on throughput projections 3. Streamlining the Data Engineering Layer ML pipelines are only as good as the data feeding them. Automating the ETL (Extract, Transform, Load) process is a primary use case for Amazon Q. It can generate AWS Glue jobs or Amazon EMR configurations that handle petabyte-scale data processing. For example, if you need to partition a massive dataset in S3 by date and feature set, Amazon Q can provide the PySpark code necessary to optimize the storage layout for Athena queries. This reduces the time data scientists spend on “data plumbing” and allows them to focus on feature engineering. 
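The partitioned layout described above can be illustrated with a small, dependency-free sketch of the Hive-style key scheme that such a Glue/PySpark job would produce (the prefix and names are illustrative; in a real PySpark job the equivalent is `df.write.partitionBy("date", "feature_set").parquet(...)`):

```python
from datetime import date

def partition_key(base_prefix, record_date, feature_set, filename):
    """Build a Hive-style S3 key (date=.../feature_set=...) so Athena
    can prune partitions instead of scanning the whole dataset."""
    return (f"{base_prefix}/date={record_date.isoformat()}"
            f"/feature_set={feature_set}/{filename}")

# In a real Glue/PySpark job the same layout comes from:
#   df.write.partitionBy("date", "feature_set").parquet("s3://bucket/curated/")
key = partition_key("curated", date(2024, 3, 1), "clickstream", "part-0000.parquet")
print(key)  # curated/date=2024-03-01/feature_set=clickstream/part-0000.parquet
```

Athena then skips any partition not matched by a `WHERE date = ...` predicate, which is where the query-cost savings come from.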
```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# This script demonstrates a Q-assisted SageMaker Pipeline definition
def create_ml_pipeline(role_arn, bucket_name):
    # Initialize SageMaker session
    sagemaker_session = sagemaker.Session()

    # Amazon Q assisted in generating this processing step configuration.
    # It ensures the use of the correct instance type for large-scale CSV processing.
    sklearn_processor = SKLearnProcessor(
        framework_version="0.23-1",
        role=role_arn,
        instance_type="ml.m5.xlarge",
        instance_count=2,
        base_job_name="data-prep-job",
        sagemaker_session=sagemaker_session,
    )

    # Step for data processing
    step_process = ProcessingStep(
        name="PreprocessData",
        processor=sklearn_processor,
        inputs=[ProcessingInput(
            source=f"s3://{bucket_name}/raw/",
            destination="/opt/ml/processing/input",
        )],
        outputs=[ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
        )],
        code="preprocess.py",  # Script logic also assisted by Q
    )

    return Pipeline(name="AutomatedMLPipeline", steps=[step_process])
```

4. Performance Optimization and Instance Selection

One of the most complex aspects of ML architecture is selecting the right instance type for the right task. Using the wrong instance can lead to throttled performance or excessive costs. Amazon Q Developer provides deep insights into instance families. It can suggest switching from ml.p3.2xlarge to ml.g5.2xlarge for certain inference workloads to achieve a better price-to-performance ratio.

Distributed Training Sequence

The following sequence diagram illustrates how Amazon Q facilitates the setup of a distributed training job across multiple nodes.

5. Security, Governance, and Compliance

In highly regulated industries (e.g., finance and healthcare), ML infrastructure must adhere to strict compliance standards such as HIPAA and PCI DSS. Amazon Q Developer helps by suggesting security configurations that developers might otherwise overlook, including:

- Encryption at rest: Automatically adding KMS key IDs to S3 buckets and EBS volumes
- Encryption in transit: Enabling inter-node encryption for distributed training jobs
- VPC endpoints: Generating configurations for interface VPC endpoints to avoid traversing the public internet

When reviewing existing IaC templates, Amazon Q can identify overly permissive IAM roles and suggest refined policies that restrict access to specific S3 prefixes or SageMaker resources.

6. Practical Use Case: Real-Time Inference Pipeline

Consider a scenario in which a retail company needs to deploy a recommendation engine. The architecture requires a SageMaker endpoint, an API Gateway, and a Lambda function for preprocessing. Amazon Q Developer can generate the entire stack using the AWS Serverless Application Model (SAM). It provides the Swagger definition for the API, the Python code for the Lambda function (handling JSON validation), and the configuration for SageMaker Multi-Model Endpoints (MME) to save costs by hosting multiple models on a single instance.

Performance Considerations

- Cold starts: Q can suggest Lambda Provisioned Concurrency settings based on expected traffic.
- Endpoint latency: It can recommend enabling SageMaker Inference Recommender to find the optimal instance configuration for sub-100 ms latency.

Best Practices for Q-Driven ML Infrastructure

- Verify generated code: Always review AI-generated IaC in a sandbox environment before deploying to production.
- Contextual prompting: Provide Q with specific constraints (e.g., “Use Graviton-based instances where possible”) to optimize for cost.
- Iterative refinement: Use Q to refactor legacy ML pipelines.
Ask it to “modernize this CloudFormation template to use AWS CDK v2.”
- Integrate with CI/CD: Use Q to generate GitHub Actions or AWS CodePipeline definitions that automate testing of your ML infrastructure.

Conclusion

Amazon Q Developer is transforming the role of the ML architect from a manual scriptwriter into a high-level system designer. By automating the boilerplate of infrastructure provisioning, security configuration, and performance tuning, Q allows teams to deploy models faster and with greater confidence. As generative AI continues to evolve, the integration between developer assistants and cloud infrastructure will become the standard for building the next generation of AI-powered applications.
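As an illustration of the encryption-at-rest suggestion in section 5, the kind of configuration Q would emit boils down to a request like the following sketch. The bucket name and KMS key ARN are placeholders, and the dictionary follows the shape of boto3's `put_bucket_encryption` call:

```python
def bucket_encryption_request(bucket_name, kms_key_id):
    """Build an S3 default-encryption request enforcing SSE-KMS.
    Bucket name and key ID are illustrative placeholders."""
    return {
        "Bucket": bucket_name,
        "ServerSideEncryptionConfiguration": {
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                },
                "BucketKeyEnabled": True,  # reduces per-object KMS request costs
            }]
        },
    }

# Applied with, for example:
# boto3.client("s3").put_bucket_encryption(**bucket_encryption_request(
#     "ml-training-data", "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"))
```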

By Jubin Abhishek Soni
Queueing Theory for LLM Inference

If you are deploying LLM inference in production, you are no longer just doing machine learning. You are doing applied mathematics plus systems engineering. Most teams tune prompts, choose a model, then wonder why latency explodes at peak traffic. The root cause is usually not the model. It is load, variability, and the queue that forms when the arrival rate approaches the service capacity.

This article gives you a practical, math-driven way to reason about LLM serving. We will use queueing theory, Little’s Law, and a simple simulation to answer the questions every leader gets asked. How many GPUs do we need? What is our safe throughput? How should we batch? What happens to p95 and p99 under bursty traffic? The goal is not to build a perfect analytical model. The goal is to build an engineering calculator you can defend.

Core Mental Model

Every request is a job. Jobs arrive over time. GPUs process jobs. If jobs arrive faster than you can process, they wait. Waiting is your latency.

Define:
- Arrival rate: λ requests per second
- Service time: S seconds per request
- Service rate per worker: μ = 1 / S
- Number of workers: k (GPU replicas or GPU partitions)
- Utilization: ρ = λ / (k μ) = λ S / k

The first rule of production inference: keep ρ comfortably below 1. As ρ approaches 1, queues grow superlinearly, and tail latency blows up.

Little’s Law

Little’s Law is the simplest and most useful equation you can bring into an SLO meeting.

L = λ W

- L is the average number of jobs in the system
- W is the average time in the system (waiting plus service)
- λ is the arrival rate

If you can measure two of these, you get the third. More importantly, it forces clarity: if you want lower W at the same λ, you must reduce L by increasing service capacity or smoothing variability.

Why LLM Serving Is Harder Than Normal Web Serving

LLM inference violates the assumptions people unconsciously make when they reason about latency.
Service time is highly variable because prompt length varies, output length varies, tool use varies, and cache hit rate varies. Moreover, arrivals are bursty because enterprise traffic often has diurnal peaks and release-driven spikes. Batching increases throughput but can add waiting time because you may hold requests to form a batch. This variability is exactly where applied computational math helps. We do not need perfect predictions. We need safe bounds and policies that degrade gracefully.

A Simple Capacity Sizing Formula

Start with a capacity bound that is almost embarrassingly simple. If each request takes S seconds on average, and you have k identical workers, then stable operation requires:

λ < k / S

Rearrange to size k:

k > λ S

Then add headroom for variability and tail behavior. A common engineering rule is to target utilization ρ between 0.4 and 0.7 for strict tail latency, depending on burstiness and service time variance. So a practical sizing is:

k = ceil( λ S / ρ_target )

Example

Suppose peak λ is 120 requests per second. Average service time S is 0.18 seconds per request on your chosen model and hardware. If you target ρ_target = 0.6:

k = ceil(120 × 0.18 / 0.6)
k = ceil(21.6 / 0.6)
k = ceil(36)
k = 36

So you start with 36 workers. This is a starting point. Next, we incorporate batching and tail.

Batching as a Control Problem

Batching is not magic. It is a scheduling policy. If you batch B requests together, you often improve compute efficiency and reduce per-request service time. But you also introduce batch formation delay. A useful decomposition is:

Total latency = queue wait + batch wait + compute time

Batch wait is the time a request sits while you fill the batch. You can control it using a max wait timer. Define a maximum batch size B_max and a maximum batch wait T_max. Dynamic batching then accumulates requests until B_max is reached or T_max expires, then dispatches. Batching improves throughput when compute cost scales sublinearly with B.
For transformer decoding, you may get good scaling for prefill, and weaker scaling for long decode. The details depend on your serving stack. Batching is only beneficial if the throughput gains outweigh the added waiting, especially at p95 and p99. High-throughput serving of LLMs typically depends on batching and careful KV cache management, as described in PagedAttention and vLLM. If your workload is bursty, dynamic batching with a small T_max often dominates naive large batches. If you deploy with NVIDIA stacks, TensorRT-LLM discusses in-flight batching and request scheduling.

A Tail Latency Heuristic You Can Use

Even without heavy theory, you can build a safe heuristic.

- Choose a latency SLO, for example, p95 under 800 ms
- Reserve part of the budget for model compute, for example, 300 ms
- Reserve part for network and orchestration, for example, 100 ms
- The rest is queueing plus batching budget, for example, 400 ms
- Enforce T_max below your queueing budget, for example, 20 to 50 ms

If T_max is too large, you manufacture tail latency even when you have capacity.

Simulation: A Small Model You Can Run

Analytical queueing models like M/M/k can be informative, but LLM service times are rarely exponential. A quick discrete event simulation is often more honest and aligns with standard performance modeling practice described in the Performance Modeling and Design of Computer Systems book. Below is a compact simulation that lets you explore capacity, service time variability, and batching timers. You can adapt it to your real telemetry distributions.
```python
import random
import heapq
import math
from statistics import mean

def percentile(xs, p):
    xs = sorted(xs)
    if not xs:
        return None
    i = int(math.ceil(p * len(xs))) - 1
    i = max(0, min(i, len(xs) - 1))
    return xs[i]

def simulate(
    seconds=120,
    arrival_rate=100.0,   # λ requests per second
    workers=24,           # k
    mean_service=0.20,    # seconds
    service_cv=0.8,       # coefficient of variation
    batch_max=8,
    batch_wait_max=0.03,  # seconds
    seed=0,
):
    random.seed(seed)

    # Arrivals as a Poisson process
    t = 0.0
    arrivals = []
    while t < seconds:
        t += random.expovariate(arrival_rate)
        if t < seconds:
            arrivals.append(t)

    # Service time model: lognormal with chosen mean and cv
    if service_cv <= 0:
        sigma = 0.0
        mu = math.log(mean_service)
    else:
        sigma2 = math.log(1 + service_cv**2)
        sigma = math.sqrt(sigma2)
        mu = math.log(mean_service) - 0.5 * sigma2

    def sample_service_time(batch_size):
        # Simple batching efficiency curve.
        # Replace this with measurements from your stack.
        base = random.lognormvariate(mu, sigma)
        efficiency = 0.55 + 0.45 / math.sqrt(batch_size)
        return base * efficiency

    # Worker availability times
    worker_free = [0.0 for _ in range(workers)]
    heapq.heapify(worker_free)

    latencies = []

    # Batch accumulator
    batch = []
    batch_first_arrival = None
    idx = 0
    current_time = 0.0

    def dispatch_batch(dispatch_time, batch_items):
        free_time = heapq.heappop(worker_free)
        start_time = max(free_time, dispatch_time)
        service_time = sample_service_time(len(batch_items))
        finish_time = start_time + service_time
        heapq.heappush(worker_free, finish_time)
        for arrival_time in batch_items:
            latencies.append(finish_time - arrival_time)

    while idx < len(arrivals) or batch:
        next_arrival = arrivals[idx] if idx < len(arrivals) else float("inf")
        next_deadline = (batch_first_arrival + batch_wait_max) if batch_first_arrival is not None else float("inf")
        current_time = min(next_arrival, next_deadline)
        if current_time == next_arrival:
            at = next_arrival
            idx += 1
            if not batch:
                batch_first_arrival = at
            batch.append(at)
            if len(batch) >= batch_max:
                dispatch_batch(at, batch)
                batch = []
                batch_first_arrival = None
        else:
            dispatch_batch(current_time, batch)
            batch = []
            batch_first_arrival = None

    return {
        "mean": mean(latencies),
        "p50": percentile(latencies, 0.50),
        "p95": percentile(latencies, 0.95),
        "p99": percentile(latencies, 0.99),
        "max": max(latencies) if latencies else None,
        "count": len(latencies),
    }

if __name__ == "__main__":
    out = simulate(
        seconds=180,
        arrival_rate=120.0,
        workers=36,
        mean_service=0.18,
        service_cv=0.9,
        batch_max=8,
        batch_wait_max=0.03,
        seed=42,
    )
    print(out)
```

How to use this in practice:

- Replace the service time sampler with your measured distribution
- Use real arrival traces, not just Poisson
- Sweep workers, batch_max, and batch_wait_max
- Track p95 and p99, not just the mean

This turns a fuzzy infrastructure debate into a quantitative policy discussion.

A Deployment Playbook That Reads Like Applied Math

Step 1: Measure the Service Time Distribution
Instrument per-request compute time split into prefill and decode. Track prompt tokens, output tokens, and cache hits.

Step 2: Decide What You Are Optimizing
If your business cares about p99, size for p99. If your business cares about cost, set a max queueing budget and accept more shedding.

Step 3: Pick a Utilization Target and Enforce Admission Control
Choose ρ_target and do not exceed it at peak. Use a queue length circuit breaker. When overload hits, degrade and do not accumulate an infinite queue, as recommended by Google's SRE playbook.

Step 4: Use Dynamic Batching With a Strict Timer
Set batch_wait_max to protect tail latency. Use smaller batches under low load, larger batches under high load.

Step 5: Add a Second Lever: Request Shaping
Route long prompts to a separate pool. Cap max generation length by tier. Use early exit for low-confidence tasks.

Step 6: Validate With Chaos Load Tests
Replay bursty traffic. Replay worst-case long outputs. Confirm SLOs under realistic spikes.
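The capacity arithmetic above can be packaged into a small calculator: the sizing rule k = ceil(λS / ρ_target), plus an M/M/k (Erlang C) estimate of wait probability as an analytical cross-check to the simulation. This is a minimal sketch; the Erlang C part assumes exponential service times, which understates LLM variability, so treat its numbers as an optimistic floor:

```python
import math

def size_workers(lam, mean_service, rho_target):
    """k = ceil(lambda * S / rho_target); the epsilon keeps floating-point
    noise from rounding the result up by a whole extra worker."""
    return math.ceil(lam * mean_service / rho_target - 1e-9)

def erlang_c(lam, mean_service, k):
    """M/M/k wait probability and mean queueing delay Wq.
    Assumes Poisson arrivals and exponential service times."""
    a = lam * mean_service            # offered load in Erlangs
    rho = a / k                       # utilization
    if rho >= 1:
        raise ValueError("unstable: rho >= 1")
    head = sum(a**i / math.factorial(i) for i in range(k))
    tail = (a**k / math.factorial(k)) / (1 - rho)
    p_wait = tail / (head + tail)
    wq = p_wait * mean_service / (k * (1 - rho))
    return p_wait, wq

# The worked example: 120 req/s, 180 ms mean service, 60% target utilization
k = size_workers(120.0, 0.18, 0.6)
print(k)  # 36

# How often would a request have to queue at that size, and for how long?
p_wait, wq = erlang_c(120.0, 0.18, k)
print(round(p_wait, 3), round(wq * 1000, 1))  # wait probability, Wq in ms
```

For k = 1 the Erlang C formula reduces to the familiar M/M/1 results (p_wait = ρ, Wq = ρ / (μ − λ)), which is a useful sanity check before trusting it at larger k.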
What to Say to Leadership

When someone asks why p99 jumped from 900 ms to 6 seconds, you can say it clearly:

- We moved closer to utilization 1.
- Queueing delay grows nonlinearly near saturation.
- Batching timers and variability amplified the tail.
- We need either more capacity, stricter batching timers, or overload policies.

Applied mathematics is not an academic add-on to LLM systems. It is the difference between a demo and a reliable service. If you treat LLM inference as a queueing system, you gain levers you can measure and control: utilization, batching delay, service time variance, and admission control. That is how you hit SLOs while keeping costs rational.

The opinions expressed in this article are the authors’ personal opinions and do not reflect the opinions of their employer.

By Dhyey Mavani
From Prompts to Platforms: Scaling Agentic AI (Part 2)

The tenets I introduced in Part 1 covered the functional mechanics — the core features that power an AI platform. But in production, functionality is only half the battle. These next six Operational Tenets are about how the platform survives the chaos of the real world and scales without breaking under its own complexity. Here are the pillars critical to operating an AI platform at scale:

7. Evaluation Pipelines: Making Quality Measurable

In deterministic systems, code either works or it doesn’t. In agentic systems, “working” is probabilistic and context-dependent. Moving beyond the happy-path demo requires translating the agentic system’s behavior into measurable signals that engineers can act on.

Quality Evaluation at Scale

Manual evaluation quickly becomes a bottleneck as agent workflows grow. Automating this with an evaluation platform allows reasoning traces and responses to be assessed against Gold Datasets — hand-curated “ground truth” examples of what a perfect interaction looks like. Such systems are built to evaluate quality benchmarks such as tool-calling correctness, policy adherence, factual accuracy, and task completion. Insights from these evaluations feed directly into engineering improvements, from prompt tuning and model selection to workflow optimization.

Concurrency & Latency Stress Testing

Quality alone is insufficient if the system degrades under load. Actively stress-testing multi-agent workflows uncovers race conditions and reveals how latency compounds across reasoning chains. Benchmarking under peak concurrency ensures the platform remains responsive and predictable as complexity increases.

8. Graceful Degradation: Designing for Partial Failure

Failures are inevitable in a complex agentic ecosystem. Models hit rate limits, tools time out, and sub-agents can misbehave. A resilient platform ensures localized failures do not cascade into a total breakdown of reasoning or user experience.
Functional Tiering

Agentic workflows should have multiple capability levels rather than a single “all-or-nothing” path. When a high-value function is unavailable — due to a tool outage, token exhaustion, a permission issue, or a dependency failure — the agent should gracefully pivot to the next best action. This preserves session continuity, maintains user trust, and allows the system to remain helpful even when optimal execution is temporarily unavailable. For example, if the agent can’t book the flight (Tier 1), it should at least provide flight options (Tier 2), and at worst, provide the booking link or customer service number (Tier 3).

Model Tiering & Fallbacks

Model selection can follow the same tiered philosophy. High-reasoning models are reserved for complex planning and synthesis, while lighter-weight models are sufficient for intent detection, clarification, or basic responses. The platform continuously monitors model health and performance; when latency spikes or rate limits are detected, deterministic circuit breakers can trigger an automatic fallback to lower-latency models. This ensures responsiveness — particularly Time to First Token (TTFT) — while preserving core functionality until full capacity is restored.

9. Deep Observability: Seeing the Agent Think

It’s not enough to know the system is running — what matters is whether the agent is working correctly. For agentic platforms, this warrants visibility into the full agent lifecycle and reasoning process, from user intent to final output.

Reasoning Trace Monitoring

A simple solution is to instrument the Orchestrator, sub-agents, and tools to log each step of their decision-making process. For example, if a workflow normally resolves a member query in three reasoning steps but suddenly takes ten, it signals a potential regression — perhaps a misfired tool, policy conflict, or prompt anomaly.
Correlating reasoning traces with inputs, outputs, and intermediate tool calls allows automated anomaly detection, root cause analysis, and evaluation of model or prompt changes.

Agentic Distributed Tracing

Using standards like OpenTelemetry, traces propagate across the entire agent mesh — from the user request through the Orchestrator, safety guardrails, sub-agents, and external tools, back to the response. This provides a holistic view of the agent lifecycle, enabling proactive tuning, debugging, and identification of latency hotspots, logic loops, or bottlenecks at any component.

10. Telemetry-Driven Iteration: The Feedback Loop

An agentic platform is an evolutionary engine: to improve, it must capture and interpret every interaction, not just the obvious signals.

Implicit vs. Explicit Feedback

Explicit signals — like thumbs up or down — are useful, but the real insight lies in implicit telemetry. Did the user act on the agent’s suggestion? Did they rephrase the query, issue a follow-up, or abandon the task? These subtle signals reveal whether the agent’s reasoning and recommendations truly aligned with user intent.

Continuous A/B Testing

Every parameter — temperature, response length, tone, or tool selection — can be treated as an experiment. Continuous A/B testing of these “micro-parameters” fine-tunes platform behavior, optimizing engagement, task completion, and user satisfaction. This telemetry-driven loop transforms every session into a source of learning, enabling the platform to evolve its personality and effectiveness over time.

11. Developer Productivity: Low-Touch Onboarding

For a platform to scale, the barrier to entry for new skills must be near zero. Low-touch, guaranteed-safe onboarding democratizes agent creation across the organization.

Plug-and-Play Onboarding

Adding a new agent or skill should be as simple as editing a configuration file or using a lightweight UI to define the workflow, tools, and pilot prompts.
The platform should be able to automatically handle UI rendering, response delivery, safety auditing, and mailbox logistics, allowing a prototype to be live in hours.

Sandbox Deployment for Safe Ramping

Before exposing new agents or workflows to all users, developers can deploy them in isolated sandboxes. This allows live testing under real conditions with controlled traffic, capturing telemetry and performance metrics without affecting production users. Sandboxing supports staged rollouts, gradual scaling, and safe experimentation, ensuring new capabilities are validated before wider release.

12. Resource & Token Governance: Scaling Economically

Even a perfectly designed agentic platform can falter if compute and token usage spiral out of control. Resource governance is a critical pillar of operational resilience, ensuring that scale doesn’t come at the cost of sustainability.

Quotas & Rate Limiting

We implemented a “Token Economy,” assigning budgets to individual workflows, agents, or business units. In addition to keeping workflows accountable, this prevents a single runaway workflow from monopolizing resources or spiraling costs through erroneous and expensive reasoning loops.

Cost Attribution & Optimization

The token governance platform provides granular visibility into cost per task. By identifying the most token-hungry reasoning chains, we can target them for model distillation, prompt optimization, or workload reallocation — ensuring economic sustainability while scaling to millions of users.

Conclusion

Building a production-grade agentic platform requires a shift in mindset. We are no longer just creating static logic; we are cultivating an ecosystem of intelligent reasoning. By focusing on these six operational pillars — Evaluation, Resilience, Observability, Telemetry, Productivity, and Governance — we transform AI from a series of impressive demos into a reliable, evolving foundation for the enterprise.
The transition from “cool” to “mission-critical” happens in these details.
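The “Token Economy” quota idea from tenet 12 can be sketched as a minimal budget guard. Class, method, and workflow names here are illustrative, not the platform's actual implementation:

```python
class TokenBudget:
    """Per-workflow token budgets with hard denial once exhausted,
    so a runaway reasoning loop cannot monopolize spend."""
    def __init__(self):
        self.limits = {}
        self.used = {}

    def set_limit(self, workflow, tokens):
        self.limits[workflow] = tokens
        self.used.setdefault(workflow, 0)

    def try_consume(self, workflow, tokens):
        # Default-deny: a workflow with no configured limit gets nothing.
        if self.used.get(workflow, 0) + tokens > self.limits.get(workflow, 0):
            return False  # deny: request would exceed the budget
        self.used[workflow] += tokens
        return True

budget = TokenBudget()
budget.set_limit("support-agent", 10_000)
print(budget.try_consume("support-agent", 8_000))  # True
print(budget.try_consume("support-agent", 3_000))  # False: only 2,000 left
```

In a real deployment the counters would live in shared storage and feed the cost-attribution reports described above, but the enforcement decision is exactly this comparison.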

By Vivek Katarya
AWS SageMaker HyperPod: Distributed Training for Foundation Models at Scale

The landscape of Artificial Intelligence has undergone a seismic shift with the emergence of Foundation Models (FMs). These models, characterized by billions (and now trillions) of parameters, require unprecedented levels of computational power. Training a model like Llama 3 or Claude is no longer a task for a single machine; it requires a coordinated symphony of hundreds or thousands of GPUs working in unison for weeks or months. However, managing these massive clusters is fraught with technical hurdles: hardware failures, network bottlenecks, and complex orchestration requirements. AWS SageMaker HyperPod was engineered specifically to solve these challenges, providing a purpose-built environment for large-scale distributed training. In this deep dive, we will explore the architecture, features, and practical implementation of HyperPod.

The Challenges of Large-Scale Distributed Training

Before diving into HyperPod, it is essential to understand why training Foundation Models is difficult. There are three primary bottlenecks:

- Hardware reliability: In a cluster of 2,048 GPUs, the probability of a single GPU or hardware component failing during a training run is nearly 100%. Without automated recovery, a single failure can crash the entire training job, wasting thousands of dollars in compute time.
- Network throughput: Distributed training requires constant synchronization of gradients and weights. Standard networking is insufficient; low-latency, high-bandwidth interconnects like Elastic Fabric Adapter (EFA) are required to prevent GPUs from idling while waiting for data.
- Infrastructure management: Setting up a cluster with Slurm or Kubernetes, configuring drivers, and ensuring consistent environments across nodes is an operational nightmare for data science teams.

SageMaker HyperPod addresses these issues by providing a persistent, resilient, and managed cluster environment.
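The hardware reliability point can be made concrete with one line of probability: if each component fails independently with probability p during a run, the chance of at least one failure among n components is 1 − (1 − p)^n. A quick sketch (the 0.1% per-GPU figure is illustrative, not a measured rate):

```python
def p_any_failure(node_failure_prob, nodes):
    """Probability that at least one of `nodes` independent components
    fails during a run: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - node_failure_prob) ** nodes

# Even a 0.1% per-GPU failure chance over a run makes a failure
# somewhere in a 2,048-GPU cluster very likely.
print(round(p_any_failure(0.001, 2048), 3))  # ~0.871
```

This is why automated node recovery, rather than lower per-node failure rates alone, is the lever that matters at this scale.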
System Architecture of SageMaker HyperPod

At its core, HyperPod creates a persistent cluster of Amazon EC2 instances (such as P5 or P4d instances) preconfigured with the necessary software stack for distributed training. Unlike standard SageMaker training jobs that spin up and down, HyperPod clusters are persistent, allowing for faster iterations and a more “bare-metal” feel while retaining managed benefits.

High-Level Architecture

In this architecture:

- Head node: Acts as the entry point, managing job scheduling via Slurm or Kubernetes.
- Worker nodes: The heavy lifters containing GPUs. They are interconnected via Elastic Fabric Adapter (EFA), enabling bypass of the OS kernel for ultra-low-latency communication.
- Storage layer: Typically Amazon FSx for Lustre, providing the high throughput necessary to feed data to thousands of GPU cores simultaneously.
- Health monitoring: A dedicated agent runs on each node, reporting status to the Cluster Manager.

Deep Dive into Key Features

1. Automated Node Recovery and Resilience

The standout feature of HyperPod is its ability to automatically detect and replace failing nodes. When a hardware fault is detected, HyperPod identifies the specific node, removes it from the cluster, provisions a new instance, and rejoins it to the Slurm cluster without human intervention.

2. High-Performance Interconnects (EFA)

For distributed training strategies like tensor parallelism, the interconnect speed is the limiting factor. SageMaker HyperPod leverages EFA, which provides up to 3,200 Gbps of aggregate network bandwidth on P5 instances. This allows the cluster to function as a single massive supercomputer.

3. Support for Distributed Training Libraries

HyperPod integrates seamlessly with the SageMaker Distributed (SMD) library, which optimizes collective communication primitives (AllReduce, AllGather) for AWS infrastructure.
It also supports standard frameworks like PyTorch Fully Sharded Data Parallel (FSDP) and DeepSpeed.

Comparing Distributed Training Approaches

Feature | Standard SageMaker Training | SageMaker HyperPod | Self-Managed EC2 (DIY)
Persistence | Ephemeral (job-based) | Persistent cluster | Persistent instance
Fault Tolerance | Manual restart | Automated node recovery | Manual intervention
Orchestration | SageMaker API | Slurm / Kubernetes | Manual / scripts
Scaling Limit | High | Ultra-high (thousands of GPUs) | High (but complex)
Best For | Prototyping / single-node | Foundation models / LLMs | Custom OS/kernel needs

To use HyperPod, you first define a cluster configuration, create the cluster, and then submit jobs via Slurm. Below is a simplified look at how you might define a cluster using the AWS SDK for Python (Boto3).

Step 1: Cluster Configuration

What this code does: It initializes a request to create a persistent HyperPod cluster. It defines two instance groups: a head node for management and 32 p5.48xlarge nodes (H100 GPUs) for training. The LifeCycleConfig points to a script that installs specific libraries or mount points during provisioning.

Step 2: Submitting a Slurm Job

Once the cluster is InService, you SSH into the head node and submit your training job using a Slurm script (submit.sh).

```shell
#!/bin/bash
#SBATCH --job-name=llama3_train
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# Activate your environment
source /opt/conda/bin/activate pytorch_env

# Run the distributed training script
srun python train_llm.py --model_config configs/llama3_70b.json --batch_size 4
```

What this code does: This is a standard Slurm script. It requests 32 nodes and 8 GPUs per node. The srun command handles the distribution of the train_llm.py script across all nodes in the HyperPod cluster.

Advanced Parallelism Strategies on HyperPod

When training models with trillions of parameters, the model weights alone might exceed the memory of a single GPU (even an H100 with 80 GB of VRAM).
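For the Step 1 cluster configuration described above, a minimal sketch of the request is shown below. The dictionary follows the shape of the SageMaker CreateCluster API; the cluster name, role ARN, lifecycle script, and head-node instance type are illustrative assumptions:

```python
def hyperpod_cluster_request(role_arn, lifecycle_s3_uri):
    """Build a CreateCluster request for a persistent HyperPod cluster:
    one head node plus 32 p5.48xlarge (H100) workers. Names, the role ARN,
    and the lifecycle script location are placeholders."""
    lifecycle = {"SourceS3Uri": lifecycle_s3_uri, "OnCreate": "on_create.sh"}
    return {
        "ClusterName": "fm-training-cluster",
        "InstanceGroups": [
            {
                "InstanceGroupName": "head-node",
                "InstanceType": "ml.m5.4xlarge",
                "InstanceCount": 1,
                "ExecutionRole": role_arn,
                "LifeCycleConfig": lifecycle,
            },
            {
                "InstanceGroupName": "worker-group",
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 32,
                "ExecutionRole": role_arn,
                "LifeCycleConfig": lifecycle,
            },
        ],
    }

# Submitted with, for example:
# boto3.client("sagemaker").create_cluster(**hyperpod_cluster_request(
#     "arn:aws:iam::111122223333:role/HyperPodRole", "s3://my-bucket/lifecycle/"))
```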
HyperPod facilitates several parallelism strategies:

Data Parallelism (DP)

Each GPU has a full copy of the model but processes different batches of data. Gradients are averaged at the end of each step. This is the easiest to implement but is memory-intensive.

Tensor Parallelism (TP)

A single layer of the model is split across multiple GPUs. For example, a large matrix multiplication is divided such that each GPU calculates a portion of the result. This requires the ultra-low latency of EFA.

Pipeline Parallelism (PP)

The model is split vertically by layers. Group 1 of GPUs handles layers 1–10, Group 2 handles layers 11–20, and so on. This reduces the memory footprint per GPU but introduces potential “bubbles,” or idle time.

Fully Sharded Data Parallel (FSDP)

FSDP shards model parameters, gradients, and optimizer states across all GPUs. It collects the necessary shards just in time for the forward and backward passes. This is currently the gold standard for scaling LLMs on HyperPod.

Optimized Data Loading with Amazon FSx for Lustre

Training scripts often become I/O-bound, meaning the GPUs are waiting for data to be read from storage. HyperPod clusters typically use Amazon FSx for Lustre as a high-performance scratch space.

- S3 integration: FSx for Lustre transparently links to an S3 bucket.
- Lazy loading: Data is pulled from S3 to the Lustre file system as the training script requests it.
- Local performance: Once the data is on the Lustre volume, it provides sub-millisecond latencies and hundreds of GB/s of throughput to the worker nodes.

Best Practices for SageMaker HyperPod

- Implement robust checkpointing: Since HyperPod automatically recovers nodes, your training script must be able to resume from the latest checkpoint. Use libraries like PyTorch Lightning or the SageMaker training toolkit to handle this.
- Use health check scripts: You can provide custom health check scripts to HyperPod.
If your application detects a specific software hang that the system-level monitor misses, you can trigger a node replacement programmatically.
Optimize batch size: With the high-speed interconnects of HyperPod, you can often use larger global batch sizes across more nodes without a significant penalty in synchronization time.
Monitor with CloudWatch: HyperPod integrates with Amazon CloudWatch, allowing you to track GPU utilization, memory usage, and EFA network traffic in real time.

Conclusion

AWS SageMaker HyperPod represents a significant milestone in the democratization of large-scale AI. By abstracting away the complexities of cluster management and providing built-in resilience, it allows research teams to focus on model architecture and data quality rather than infrastructure debugging. As foundation models continue to grow in complexity, the ability to maintain a stable, high-performance training environment becomes not just an advantage, but a necessity. Whether you are pretraining a new LLM from scratch or fine-tuning a massive model on a proprietary dataset, HyperPod provides the “supercomputer-as-a-service” experience required for the generative AI era.

Further Reading & Resources

AWS SageMaker HyperPod Official Documentation — The primary resource for technical specifications, API references, and getting started guides.
Optimizing Distributed Training on AWS — A collection of blog posts detailing best practices for using EFA and SMD libraries.
PyTorch Fully Sharded Data Parallel (FSDP) Guide — Technical documentation on the sharding strategy commonly used within HyperPod clusters.
DeepSpeed Optimization Library — An open-source library compatible with HyperPod that offers advanced pipeline and system optimizations for LLM training.
Scaling Laws for Neural Language Models — The foundational research paper exploring why large-scale distributed training is necessary for model performance.

By Jubin Abhishek Soni
Tools for Building Deterministic LLM Systems

It’s hard to imagine a world without LLMs nowadays. I rarely reach for Google when ChatGPT can give me a far more curated answer with almost all the context it needs. However, these daily use cases often lean in creative directions. In the context of B2B systems, the same creativity that is so useful day to day is not acceptable. This became clear when I first pitched the idea of using LLM-powered browser agents to fill out job application forms on behalf of job boards and agencies. A “small” mistake or hallucination, like choosing the wrong answer in a screening question, skipping a mandatory field, or hallucinating a value, means:

The candidate never reaches the employer’s ATS.
Attribution breaks.
The impression of your system instantly becomes “creates spam.”

Our product now pushes tens of thousands of applications through enterprise workflows, and “usually works” is not good enough. We need deterministic outcomes: for a given input, your system should produce the same, valid, structurally correct output nearly all the time. This article covers some tools and patterns we’ve used to make LLM-driven systems behave much more like deterministic software. The examples that follow are currently deployed to power an otherwise non-deterministic technology for enterprise customers, and they allow us to harness the flexibility of LLMs while building software our users can trust.

Determinism Can Be Viewed Across a Spectrum

LLMs are never deterministic in the strict “same output every time” sense, even with temperature=0. In practice, you can treat determinism as a spectrum:

Hard constraints: “The output must be valid JSON matching this schema.”
Soft constraints: “The extracted field must match one of these 12 allowed options.”
Behavior consistency: “Given this distribution of inputs, how often does the system produce the correct, structurally valid result?”

You will never get 100% reliability everywhere.
But you can get close enough, know where the risk lives, and design the surrounding system so the rare failures are caught and handled. The techniques below describe how to build a system with different guardrails that allow you to trust the output to be what you need it to be.

Structured Output: The Most Obvious and Biggest Win

The simplest and most powerful tool is to force structure. OpenAI has supported it for a long time, and Anthropic has now followed suit. Instead of asking the model to please follow some JSON output format, these APIs let you specify a JSON schema for the generation, forcing the LLM to generate tokens matching your schema. This is non-negotiable. To build a deterministic system, you need to know what the LLM will give you. For example, we use this to:

Turn arbitrary job forms into a consistent representation of fields.
Let the LLM pick a field’s question type from a set of enum options.

With structured output, your downstream logic can be regular code that expects clear types.

Testing With Iterations: Measuring Your Determinism

Once you have structured output and validation, you can start testing and measuring determinism instead of guessing. The core idea: run the same task many times and see how often it behaves correctly. If you’re lucky, you can test for concrete values and write standard test assertions. If you’re generating more abstract outputs, you may need to use another LLM as a ‘judge’ to evaluate your output. For example, we:

Created a fixture of a job posting, asking the LLM to return the fields it can find in the form.
Ran this test over 50 iterations.
Asserted that each iteration returns the correct number of fields with matching labels.

If 49 out of 50 runs are valid and correct, you know your success rate is 98%. That doesn’t mean 98% in production (your input data will differ), but it gives you a baseline and lets you compare prompts, models, or schemas objectively.
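The iteration harness described above can be sketched in a few lines. This is a minimal, self-contained sketch: the call_model stub is hypothetical and simply simulates an LLM that produces malformed output roughly 2% of the time; in a real harness you would call your actual pipeline against the fixture.

```python
import json
import random


def call_model(job_posting: str) -> str:
    # Hypothetical stand-in for a real LLM call: returns JSON describing
    # the form fields, but is malformed ~2% of the time to simulate
    # structural failures.
    if random.random() < 0.02:
        return "not json"
    return json.dumps(
        {"fields": [{"label": "Name"}, {"label": "Email"}, {"label": "Resume"}]}
    )


def run_harness(iterations: int = 50) -> float:
    # Run the same task many times and measure how often the output is
    # both structurally valid and correct.
    expected_labels = {"Name", "Email", "Resume"}
    successes = 0
    for _ in range(iterations):
        raw = call_model("fixture: job posting text")
        try:
            parsed = json.loads(raw)
            labels = {field["label"] for field in parsed["fields"]}
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # structurally invalid output counts as a failure
        if labels == expected_labels:
            successes += 1
    return successes / iterations


if __name__ == "__main__":
    random.seed(7)  # fixed seed so the measurement is repeatable
    print(f"success rate: {run_harness(50):.0%}")
```

The same loop works unchanged whether the assertion is a concrete equality check, as here, or a call to an LLM judge for more abstract outputs.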
This is also crucial for building a reliable system that does not regress. In practice, this is how we iterate:

1. Add a fixture that caused problems.
2. Change the prompt or schema to fix the problem.
3. Re-run the test harness for 50–100 iterations on the new fixture set.
4. Ship once you can validate a concrete improvement.

This is also where people underestimate the importance of writing good tests. You want:

A simple CLI/test framework with which you can run your tests many times.
Outputs that are easy to compare (“95% valid JSON, 90% fully correct vs. 98% / 95% after the change”).

For browser automation, I can highly recommend testing via Playwright’s own UI. It’s often very feasible to let the LLM decide on one operation and make a Playwright assertion on that output.

Taking Multiple Samples: Compounding Probabilities

Say your workflow, as measured above, has a ~2% failure rate on structure or correctness. If you run the pipeline twice on the same input and only accept the result if both outputs agree, the probability that both runs fail in the same way is dramatically smaller. Assuming a 2% failure rate, we’ve now decreased the failure rate to 0.04% (2% * 2%). You can extend the idea further, keeping the cost implications in mind. Latency should not change, as the samples are easily run in parallel. In practice, while you may not want to take multiple samples for every operation, this is a priceless method for reducing uncertainty in your agent’s critical operations.

Resolving Inconsistencies via an LLM

Once you’re taking multiple samples to reduce the failure rate, you’ll notice that your agent sometimes fails too easily. For example, your output consistency check may be too strict, or it may simply be hard to decide whether two outputs are the same. Throwing them away is expensive. Instead, you can ask another LLM to act as a judge at runtime.
A prompt like the following can be used, usually with the same system prompt as the original generation so that the judge has all the necessary context: “You are a strict judge. You receive two candidate generations of the same input and the original input text. Your job is to select the index of the correct generation, or return -1 if none are suited.”

This gives you two wins:

You can salvage cases where one sample is clearly wrong and the other is fine.
You have a principled way to admit uncertainty and fall back to a slower path (manual review, a different model, etc.) with -1.

Verification Loops: Letting LLMs Check Their Own Work

The final layer is to think of your LLM pipeline as a loop, not a single shot. This is what is often described as a true “agent” architecture: letting an LLM-based system decide its own path and decide when it is done. In practice, you’d be surprised how well a second LLM, even the same model, can judge whether the previous generation was correct. You can build a system where your task always runs in a loop of asking the LLM:

What should I do next to reach my goal?
If I can’t take the next step, is my goal complete?
If I can’t take the next step and the goal is not complete, is my goal impossible?

Doing so lets the LLM decide for itself once enough uncertainty is resolved. And you can, of course, combine this with the other techniques, such as running two of these loops in parallel.

Putting It All Together

LLMs are inherently non-deterministic, but there are a surprising number of tricks you can use to build a system whose output you can fully trust to be what you expect. By combining the above ideas, you can push LLM-driven workflows close to the reliability bar of traditional software — high enough that enterprise customers are willing to put real money and processes on top of it.
For our agentic job applications, that means letting job boards and agencies trust an LLM-powered agent to submit large volumes of applications into ATS systems they don’t have API access to. It’s inherently high-risk, as you can’t afford to hallucinate candidate input, but with sufficient testing, you can build a system you can trust.

By Cornelius Renken
The Future of Agentic AI

The era of passive AI chatbots is ending. We are now entering the age of agentic AI: systems that actively reason, plan, and execute tasks. For organizations, this represents a potential leap in productivity, but it also introduces new engineering challenges. Moving from a simple prompt to a reliable agent ecosystem requires a new, robust architecture. In this article, we’ll explore the anatomy of AI agents, how the Model Context Protocol (MCP) has finally solved the integration bottleneck, and how you can architect safe, scalable systems where humans and agents collaborate effectively.

What Is an AI Agent?

Some AI agents are simply a system prompt and a collection of tools that are sent to a model that does all the thinking. However, more powerful AI agents use the LLM to recommend and propose actions, then run their own code to perform functions such as:

Control execution: state machines, task graphs, retries, timeouts
Enforce policy: auth, scopes, RBAC, allow/deny rules
Validate actions: schema checks, safety filters, sandboxing
Manage memory/state: databases, vector stores, session state
Coordinate agents: message passing, role separation, voting
Handle failure: rollbacks, circuit breakers, human-in-the-loop

While a standard LLM (large language model) is passive and waits for your input to generate a response, an AI agent is active. It uses reasoning to break down a goal into steps, decides which tools to use, and executes actions to achieve an outcome. Agentic AI systems initiate a session by sending an LLM a system prompt, which may include the definition of multiple agents and their tools. Some of these tools may allow agents to invoke other agents, manage the context itself, and even select the model for the next step.

        | LLM Chatbot                     | AI Agent
Flow    | User Input -> Model -> Output   | User Goal -> LLM -> Reasoning/Planning -> Tool Use -> Action -> Verification -> Output
Summary | Gives you advice, but you have to do the work. | You give the agents a goal and access to software; they come back when the job is done.

The mastery of building agentic AI systems is finding the right mix of agents, tools, and prompts that allow the LLM to accomplish your goals while still providing adequate guardrails and verification. To this end, managing the tools and other resources available to the AI agents is a major focus. This is where the Model Context Protocol (MCP) comes in.

The Model Context Protocol

MCP is an open standard introduced by Anthropic in November 2024. It standardizes how AI systems connect to external data and services. The idea behind MCP is that all LLM API providers allow the LLM to invoke tools, and developers benefit from a structured way to define those tools and make them available to the LLM in a uniform and consistent way. Prior to MCP, integrating third-party tools into agentic AI systems added a lot of friction. By providing a universal interface for reading files, executing functions, and handling contextual prompts, MCP enables AI models to access the data they need securely and consistently, regardless of where that information lives. Since its release, the protocol has been adopted by major AI providers, including OpenAI and Google, cementing its role as the industry standard for AI system integration. MCP operates through a straightforward client-server architecture with four key components:

The host application, such as Claude Desktop, modern IDEs, or your AI system.
The MCP client, which establishes one-to-one connections from the host to a server, often a built-in capability of AI frameworks.
The MCP server, which exposes tools, resources, and prompts.
The transport layer, which manages communication between clients and servers.

MCP also opened the door to an ecosystem where third-party platforms expose their capabilities to AI agents by publishing their own official MCP servers.
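To make the tool concept concrete, here is an illustrative sketch of a tool definition and a validating dispatcher. The field shape (name, description, inputSchema) mirrors the general form MCP uses for tools, but this is plain Python for illustration, not the actual SDK, and search_tickets is a hypothetical tool:

```python
import json

# A hypothetical MCP-style tool definition: a name, a description the
# model can read, and a JSON Schema describing the accepted arguments.
SEARCH_TICKETS_TOOL = {
    "name": "search_tickets",
    "description": "Search the issue tracker for tickets matching a query.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def dispatch(tool: dict, arguments: dict) -> dict:
    """Check model-supplied arguments against the schema's required
    fields before executing anything -- a tiny slice of the validation
    an MCP server performs for every tool call."""
    schema = tool["inputSchema"]
    missing = [k for k in schema.get("required", []) if k not in arguments]
    if missing:
        return {"error": f"missing required argument(s): {', '.join(missing)}"}
    # A real server would now run the tool; here we just echo the call.
    return {"tool": tool["name"], "arguments": arguments}


print(json.dumps(dispatch(SEARCH_TICKETS_TOOL, {"query": "login bug"})))
```

Because the schema travels with the tool, any MCP-aware host can present the same tool to any model without bespoke glue code, which is the friction MCP removes.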
Large enterprises such as Microsoft, AWS, Atlassian, and Sumo Logic have all published MCP servers. MCP solves an important problem, but it is just one among many for agents. Let’s look next at how to design safe agentic AI systems.

Designing Safe Agentic AI Systems

Agentic AI can go catastrophically wrong. There are multiple risks, such as:

Prompt injection that hijacks workflows to exfiltrate data or execute ransomware.
Privilege escalation via tool chaining that drains accounts or deletes production systems.
Infinite loops that burn millions in API costs.
Hallucinated actions that trigger irreversible trades or compliance violations.
Token torching, where malicious actors hijack token spend through MCP.

Agents are often entrusted with access to APIs, browsers, and infra systems. Without safeguards, this greatly amplifies your risks. Safe agentic AI requires a “defense-in-depth” approach built on multiple overlapping layers:

Input validation, output auditing, and human-in-the-loop escalation form the verification backbone.
Decisions are never fully autonomous when the blast radius or financial impact is high.
Sandboxing and explicit permission boundaries prevent unauthorized access.
Each agent should receive a distinct identity with least-privilege credentials and scoped tokens, rather than inheriting user permissions.
Fault tolerance through retry logic, fallback models, and anomaly detection ensures that systems degrade gracefully under failure.
Deep observability, implemented via standardized telemetry, structured logging, metrics collection, and real-time monitoring dashboards, enables rapid detection and response.
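The least-privilege layer above can be sketched as a simple scope check on every tool call. The scope names, agent identities, and tools here are hypothetical inventions for illustration; a real system would back this with IAM-style credentials and scoped tokens rather than an in-memory table:

```python
# Hypothetical least-privilege gate: each agent identity carries only
# the scopes it needs, and every tool call is checked against them.
AGENT_SCOPES = {
    "research-agent": {"web.read"},
    "ops-agent": {"web.read", "tickets.write"},
}

TOOL_REQUIRED_SCOPE = {
    "fetch_page": "web.read",
    "close_ticket": "tickets.write",
    "delete_database": "infra.admin",  # no agent holds this scope
}


def authorize(agent: str, tool: str) -> bool:
    """Allow the call only if the agent's scopes cover the tool."""
    required = TOOL_REQUIRED_SCOPE.get(tool)
    if required is None:
        return False  # deny unknown tools by default
    return required in AGENT_SCOPES.get(agent, set())


print(authorize("research-agent", "fetch_page"))    # True
print(authorize("research-agent", "close_ticket"))  # False
print(authorize("ops-agent", "delete_database"))    # False
```

Deny-by-default for unknown tools and agents is what keeps a tool-chaining escalation from silently widening an agent's reach.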
Engineering effective multi-agent systems requires deliberate architecture design that incorporates one or more coordination patterns:

Centralized orchestration, where a supervisor agent coordinates specialized workers and maintains the global state.
Decentralized peer-to-peer communication, enabling flexible agent-to-agent interaction.
Hierarchical delegation, which organizes agents into levels of abstraction.

Development environments like Sumo Logic's Dojo AI (an agentic AI platform for security operations centers) can help significantly, providing essential infrastructure for safely iterating on agentic systems before production deployment. Dojo AI is carefully curated according to its design principles and safeguards. Customers can use Dojo AI as is, and they can build their own agentic AI environment (similar to Dojo AI) for their own AI-based core competencies. The Sumo Logic MCP server lets you run data queries and make Dojo AI agent calls when needed from your own AI agents. Next, let's see some of the different ways people interact with agentic AI systems.

How People Collaborate With AI Agents

Traditional systems follow a well-defined workflow and pre-programmed algorithms. User inputs and outputs are fully structured. Even in dynamic systems, user input can deterministically control the flow. Agentic AI systems, however, are different. The LLM controls the flow (within its guardrails). Users provide initial intent and motivation, and later operate as approvers and a gating function. In particular, the free-text conversation is novel. So how do we best collaborate with these agents? One of the most common ways to interact with AI agents is through chatbots, where you can exchange text, images, and files with LLMs and their agents. Voice conversations are also becoming more popular. Of course, generic chatbots like ChatGPT, Gemini, and Claude Desktop are not aware of your agents out of the box. However, agents can be introduced as MCP tools.
Another interesting option is to build a Slack application, allowing agents to join channels, interact with users, automatically monitor the channels, and automatically respond to events. This is a rich environment, as it allows humans and agents to collaborate smoothly. The Slack user experience already supports group channels and threads, so agents can add details such as their chain of thought or citations without cluttering the screen. Multiple human users can engage with each other and AI agents all in the same channel. If you need an even more specialized user experience, you could build a custom web, desktop, or mobile application for your agents. You could even create a chatbot like Mobot, a Slack application integration, or a custom suite of agents like Dojo AI.

The Future of Agents

Perhaps the most important thing to understand about AI agents is that they are coming faster than you think. In the past, major technological revolutions like the personal computer, the internet, and mobile phones took decades to become ubiquitous, and the pace of innovation was manageable. AI is different. Many experts predict that in just the next few years, AI agents will be able to perform any knowledge work better than the best humans. They will unlock scientific discoveries and provide unprecedented productivity gains. Manual labor is not far behind, with humanoid robots making impressive strides powered by AI agents. LLMs can already perform many tasks as well as humans, though they lack the ability to plan and operate over long time horizons, deal with complexity, and maintain coherence. But with a carefully constructed web of agents and a curated set of tools that collaborate over multiple iterations to accomplish long-horizon tasks, these constraints are being removed. There is much innovation in this domain beyond just MCP, such as agent orchestration, toolkits, and verification layers.
Industry Standardization

We’re now starting to see the standardization of techniques, tools, and formats for AI agents. For example, the Agentic AI Foundation (AAIF) is a new initiative under the Linux Foundation to ensure agentic AI evolves in an open and collaborative manner. Its members include Anthropic, OpenAI, Amazon, Google, Microsoft, and Block. It hosts several prominent agent technologies, including MCP, goose, and AGENTS.md. There are other prominent open efforts as well, including Google's Agent2Agent (A2A) protocol and Agent Skills (also originating from Anthropic).

Dynamic User Experiences

The future of the user experience is all about generative UI. The LLM and agents will generate an appropriate UI on the fly depending on the query, user, conversation history, and more. For example, if you ask about the stock market, rather than providing a generic overview of today’s business news, the AI system may decide to show a historical timeline and a pie chart with your current positions, as well as links to relevant educational material. Everything will be tailored per user.

The Shift to AI Agents

The agentic shift is here. We’re moving from passive text generation to active, autonomous work. As we’ve seen, this shift requires more than just new models. It calls for careful architecture. To succeed, organizations should focus on:

Leveraging the Model Context Protocol (MCP).
Moving beyond simple prompts to a "defense-in-depth" strategy.
Designing interfaces, such as Slack apps and custom UIs, where humans provide the intent and agents handle the execution.

AI agents may soon outperform top human knowledge workers, unlock major scientific and productivity gains, and eventually expand into physical work through robotics. Understanding their basics is the first step to harnessing their power for your organization. Have a really great day!

By John Vester DZone Core CORE
The Developer’s Guide to Local LLMs: Building, Running, and Scaling With Ollama

LLMs are already widely used for working with unstructured natural-language data. They also excel at extracting information and working with semi-structured data, such as JSON files and other lengthy configuration files. This even lets us use them to interact with relational data, for example. Cloud-based LLMs are effective and powerful, but they have some limits. That's where locally hosted LLMs come into play.

Local LLMs: Pros and Cons

I first realized the need for local LLMs while developing software for a critical industry (healthcare), where Personal Health Information is strictly regulated and, accordingly, the use of cloud-based LLMs is very limited. So, privacy is the first benefit of using local LLMs. The second reason cloud-based LLMs may not fit is their level of customization. When the system needs custom fine-tuning or additional manipulations, it may be easier to implement them locally on the LLM. The third reason may not be so rational, but it also makes some sense: local LLMs are fun. You may use one the same way as you use the cloud-based LLMs, but conveniently, without depending on the Internet. You may download the model of interest to your laptop and handle much of your work routine as you would with a regular ChatGPT or Gemini. For sure, each local LLM will be more limited in terms of knowledge cutoff compared to the cloud-based LLMs, especially when working in "Thinking" mode. But if your goal is not deep research or analysis, it may be a great fit. The downsides of local LLMs are the knowledge cutoff, a lack of intelligence, and a lack of speed. These are not always bottlenecks. For example, for 70% of tasks, such as information extraction, summarization, and transformation, they will perform similarly to cloud-based systems. But scalability may face challenges. One more limitation, not often mentioned but still critical, especially for production usage, is licensing.
Architecture of the Local LLM Runtimes

You may find many great LLM runtimes that help you get started with deploying and running an LLM locally. Some of them are LM Studio, Ollama, and Jan AI. Their purpose is to provide an environment and a UI/API interface for the LLMs themselves, making working with them easier and more manageable. The typical architecture of these runtimes is the following: for example, Ollama uses llama.cpp as its engine, whose function is to load a model into memory and operate on it. The web server runs by default on port 11434 and allows communication with the model from local applications and CLI/GUI tools. The user interacts with the model via the shell or via a GUI application. Software applications also connect via the web server. After installing the LLM runtime, select the desired model(s) and download them to your local PC/laptop. After that, the runtime loads the model into memory, and it becomes accessible to prompts.

Licensing

This topic is important to consider, especially for production or commercial usage of LLMs. The good news is that most LLM runtimes have permissive licenses for commercial use (but double-check the specific tool for the exact details). The second layer is the LLM model itself. So, for example, if you use Ollama with the Meta Llama model, you need to read two licenses carefully:

From Ollama
From Meta Llama

It is especially important to understand whether both licenses allow usage of the model for commercial purposes before building commercial applications.

Installation

This article will showcase Ollama's capabilities. It is good for local experiments as well as for building applications. Once you understand how this runtime works, it will be much easier to apply similar patterns to other runtimes.

Step 1. Install the Ollama Application

Download an application for Windows, Linux, or Mac from the official download page.

Step 2.
Pull the Model and Run It

For example, let's install our first local LLM. Run this in the terminal:

Shell
ollama pull llama3.2:3b
ollama run llama3.2:3b

Ollama is manageable from the terminal, so you may find these commands useful for manipulating Ollama models:

Shell
ollama list          # list installed models
ollama pull llama3.2 # download a model
ollama run llama3.2  # run chat in terminal
ollama rm llama3.2   # remove a model
ollama show llama3.2 # show model info (template, params, etc.)
ollama ps            # show loaded models

# On Mac, if brew was used to install Ollama:
brew install ollama                 # install ollama
brew services start ollama          # start server
brew services stop ollama           # stop server
brew services restart ollama        # restart server
brew services list | grep -i ollama # check if ollama is running

UI Interface for Interaction

In July 2025, Ollama also released a GUI application for a visual experience when prompting local LLMs. It simplifies interactions and allows loading files as well. You may download it from the official site. The application allows prompting local LLMs like ChatGPT and other tools do, including adding PDF and other text-based files. Also, some models support multimodality, meaning specific models can generate images.

Building Applications on Top of Those Local LLMs

The prerequisites to run the code below are:

1. Install Ollama locally.
2. Pull and start the local model (in this particular example, llama3.2:3b).

Shell
ollama pull llama3.2:3b
ollama run llama3.2:3b

That is the application code itself:

Python
from ollama import chat

messages = [
    {
        'role': 'user',
        'content': 'Generate a 3-4 sentence description of a random product from Amazon?',
    },
]

response = chat('llama3.2:3b', messages=messages)
print(response['message']['content'])

The example answer was:

Plain Text
I've generated a fictional product description.
Here it is: "The Intergalactic Dreamweaver" is a unique, patented sleep mask designed to enhance and control your dreams while you sleep ...

Remote Application Example Using Ollama

If you want to separate the Ollama server from the application server, it is very easy to do, since Ollama includes a built-in web server. To do that, I just modified the previous code to point to the Ollama server (which may run on a separate host):

Python
from ollama import Client

client = Client(host="http://localhost:11434")

messages = [
    {
        "role": "user",
        "content": "Generate a 3-4 sentence description of a random Amazon product?",
    }
]

response = client.chat(model="llama3.2:3b", messages=messages)
print(response["message"]["content"])

Scalability Side of the Local LLMs

Let us understand the multitasking model for Ollama. If the application uses async mechanisms to send many prompts to the LLM, Ollama currently handles them as a queue (FIFO). It means the application will not encounter an error, but latency may increase. For example, I successfully ran this code on a MacBook M4.

Python
import asyncio
import time

from ollama import AsyncClient

QTY = 20
MODEL = "llama3.2:3b"
PROMPT = "Please generate a random description for a product on Amazon, 3-4 sentences."


async def ask(i):
    client = AsyncClient()
    messages = [
        {
            "role": "user",
            "content": PROMPT,
        }
    ]
    response = await client.chat(MODEL, messages=messages)
    return i, response['message']['content']


async def main():
    start = time.time()
    tasks = [asyncio.create_task(ask(i)) for i in range(QTY)]
    results = await asyncio.gather(*tasks)
    end = time.time()
    total_time = end - start

    results.sort(key=lambda x: x[0])
    for idx, answer in results:
        print(f"\n=== Answer #{idx + 1} ===")
        print(answer)

    print(f"\n--- Total time: {total_time:.2f} seconds ---")


if __name__ == "__main__":
    asyncio.run(main())

I changed only the QTY parameter, which determines the number of parallel requests sent to the Ollama server.
The metrics were the following:

QTY = 1: 2.4 sec (2.4 sec per request)
QTY = 2: 5.2 sec (2.6 sec per request)
QTY = 10: 25 sec (2.5 sec per request)
QTY = 20: 49 sec (2.5 sec per request)

This experiment shows that Ollama doesn't currently parallelize requests. But it has an automatic queue, which means the client side will ultimately receive an answer.

Conclusion

To conclude, let us return to the use cases and limitations of local LLMs. First of all, local LLMs are powerful enough to be worth considering. They are not toys anymore: they are production-ready tools with rich framework support that can solve fairly complex tasks. They may also be trained (fine-tuned), and while we didn't touch on this topic in this article, fine-tuning remains one of the important features local LLMs offer. The limitations of local LLMs include scalability and speed. Licensing should not be a problem for ethical use of LLMs; however, caution is important here, because some models may not allow commercial use. Overall, local LLMs may be the only option for some critical industries, where privacy matters most. For other industries, they may be a good pick, with some trade-offs.

By Iurii Iurchenko

Top AI/ML Experts


Tuhin Chattopadhyay

CEO at Tuhin AI Advisory and Professor of Practice,
JAGSoM

Dr. Tuhin Chattopadhyay is a celebrated technology thought leader among both the academic and corporate fraternity. Recipient of numerous prestigious awards, Tuhin is hailed as India's Top 10 Data Scientists by Analytics India Magazine. Besides driving his consultancy organization Tuhin AI Advisory, Dr. Tuhin also serves as Professor of Practice at JAGSoM, Bengaluru. His professional accomplishments can be explored from https://www.tuhin.ai/, art portfolio from https://tuhin.art/, joie de vivre from https://tuhinism.com/ and adventures with MySon from https://dogfather.rocks/.

Frederic Jacquet

Technology Evangelist,
AI[4]Human-Nexus

My goal is to deepen my research and analysis to track technological developments and understand their real impacts on businesses and individuals. I focus on untangling exaggerated perceptions and irrational fears from genuine technological advances. My approach is critical: I aim to move beyond myths and hype to identify the concrete, realistic progress we can expect from new technologies.

Suri (thammuio)

Data & AI Services and Portfolio

Seasoned Data & AI Technologist and Innovator with deep expertise in Big Data, Data Analytics, Cloud, Machine Learning, and Generative AI. He is passionate about building modern data ecosystems that drive intelligent analytics and business transformation. As a Forbes Technology Council and Entrepreneur Leadership Network member, Suri contributes thought leadership on technology strategy, AI innovation, and digital transformation. A founder of multiple startups and a lifelong learner, he combines enterprise experience with entrepreneurial agility to deliver impactful, future-ready data solutions.

Pratik Prakash

Principal Solution Architect,
Capital One

Pratik, an experienced solution architect and passionate open-source advocate, combines hands-on engineering expertise with extensive experience in multi-cloud and data science. Leading transformative initiatives across current and previous roles, he specializes in large-scale multi-cloud technology modernization. Pratik's leadership is highlighted by his proficiency in developing scalable serverless application ecosystems, implementing event-driven architecture, deploying AI/ML and NLP models, and crafting hybrid mobile apps. Notably, his strategic focus on an API-first approach drives digital transformation while embracing SaaS adoption to reshape technological landscapes.

The Latest AI/ML Topics

Mastering the AWS Well-Architected AI Stack: A Deep Dive into ML, GenAI, and Sustainability Lenses
Use AWS’s ML, GenAI, and Sustainability lenses together to build AI systems that are production-ready, governed, cost-efficient, and energy-efficient.
February 27, 2026
by Jubin Abhishek Soni
· 547 Views · 1 Like
End-to-End Automation Using Microsoft Playwright CLI
Learn how Microsoft Playwright CLI enables token-efficient, scalable browser automation for AI coding agents, improving performance and reducing costs.
February 27, 2026
by Kailash Pathak DZone Core CORE
· 456 Views
Unified Intelligence: Mastering the Azure Databricks and Azure Machine Learning Integration
Bridge the gap between Big Data and production ML. Learn to integrate Azure Databricks with Azure Machine Learning for a seamless, scalable end-to-end MLOps workflow.
February 27, 2026
by Jubin Abhishek Soni
· 472 Views
Similarity Search on Tabular Data With Natural Language Fields
Let us break down relational data silos with vector embeddings, unifying numerical, categorical, and natural-language fields into one semantic representation.
February 27, 2026
by CORRADO DE BARI
· 320 Views
AWS Bedrock vs. SageMaker: Choosing the Right GenAI Stack in 2026
Deciding between Bedrock's serverless ease and SageMaker's deep control? This guide breaks down the 2026 AWS GenAI landscape for you.
February 26, 2026
by Jubin Abhishek Soni
· 563 Views
I Watched an AI Agent Fabricate $47,000 in Expenses Before Anyone Noticed
This explores AI agent failures with organizations deploying autonomous systems faster than their governance, monitoring, and security controls can safely support.
February 26, 2026
by Igboanugo David Ugochukwu DZone Core CORE
· 532 Views
A Practical Guide to Building Generative AI in Java
Genkit Java makes building generative AI features in Java finally simple, with typed inputs/outputs, structured LLM responses, built-in observability, and a powerful DevUI.
February 26, 2026
by Xavier Portilla Edo DZone Core CORE
· 874 Views · 2 Likes
Intelligent Load Management for LLM Calls: From Static Rate Limits to Priority-Aware "Agent QoS"
Use a fair, priority-based tool scheduler instead of static rate limits, leveraging concurrency caps, signals, abort rules, and safe degradation.
February 26, 2026
by Anusha Kovi
· 415 Views
From Keywords to Meaning: The New Foundations of Intelligent Search
Learn about why keyword search fails at scale and how cloud-native vector databases enable semantic, AI-powered retrieval for smarter, more reliable results.
February 25, 2026
by Amit Kumar Padhy
· 599 Views
How We Cut AI API Costs by 70% Without Sacrificing Quality: A Technical Deep-Dive
Intelligent caching and model routing reduced our AI API costs from $12,340 to $3,680 per month. Production-tested optimizer. Open source. MIT license.
February 25, 2026
by Dinesh Elumalai
· 722 Views
Chunking Is the Hidden Lever in RAG Systems (And Everyone Gets It Wrong)
Chunking decisions made early in a RAG pipeline often determine whether retrieval works at all. Here is a practical look at why that matters.
February 25, 2026
by Anshul Sharma
· 662 Views
Cagent: Docker's Newest Low-Code Agentic Platform
Docker's cagent is a new open-source, low-code, YAML-centric AI agent builder and runtime. Instead of writing code, you describe agents and cagent runs them.
February 25, 2026
by Siri Varma Vegiraju DZone Core CORE
· 788 Views
How to Integrate an AI Chatbot Into Your Application: A Practical Engineering Guide
A practical engineering guide to integrating an AI chatbot into your application, covering architecture, backend flow, NLP handling, security, testing, and deployment.
February 24, 2026
by Manthan Bhavsar
· 673 Views
Integration Reliability for AI Systems: A Framework for Detecting and Preventing Interface Mismatch at Scale
Prevent AI system failure by enforcing contract consistency across four layers: validation, testing, runtime monitoring, and fail-fast boundaries.
February 24, 2026
by Anurag Jindal
· 1,114 Views
The DevSecOps Paradox: Why Security Automation Is Both Solving and Creating Pipeline Vulnerabilities
This article examines how DevSecOps and AI automation shifted attacks to CI/CD pipelines, making security tools themselves a growing attack surface.
February 24, 2026
by Igboanugo David Ugochukwu DZone Core CORE
· 815 Views · 1 Like
The AI4Agile Practitioners Report 2026
The AI4Agile Practitioners Report 2026: 83% of Agile practitioners use AI, but most spend 10% or less of their time with AI.
February 24, 2026
by Stefan Wolpers DZone Core CORE
· 570 Views
Azure SLM Showdown: Evaluating Phi-3, Llama 3, and Snowflake Arctic for Production
Evaluate Phi-3, Llama 3, and Snowflake Arctic. Learn to deploy cost-effective, high-performance SLMs on Azure for production workloads.
February 23, 2026
by Jubin Abhishek Soni
· 1,084 Views
The Quantum Computing Mirage: What Three Years of Broken Promises Have Taught Me
Despite steady progress, quantum computing remains decades from practical advantage, with cryptography upgrades as its only near-term impact.
February 23, 2026
by Igboanugo David Ugochukwu DZone Core CORE
· 1,055 Views · 3 Likes
Agentic AI vs Copilots: The Architectural Shift from Assistance to Autonomy
The industry is shifting from copilots that simply autocomplete code to agentic systems that autonomously plan and execute multi-step workflows in a recursive loop.
February 23, 2026
by Nikita Kothari
· 731 Views
From Prompt Loops to Systems: Hosting AI Agents in Production
AI agents fail in production because they rely on prompts instead of systems. Without proper hosting, memory, tool access, and controls, they become unreliable.
February 23, 2026
by Amit Chaudhary
· 564 Views
