Data Engineering

Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.

Functions of Data Engineering

AI/ML

Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.

Big Data

Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.

Data

Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.

Databases

A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.

IoT

IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.

Latest Premium Content
Trend Report: Generative AI
Trend Report: Database Systems
Refcard #401: Getting Started With Agentic AI
Refcard #394: AI Automation Essentials

DZone's Featured Data Engineering Resources

Nvidia’s Open Model Super Panel Made a Strong Case for Open Agents

By Corey Noles
The room for Nvidia’s Open Model Super Panel at San Jose Civic was packed well before Jensen Huang really got going. It felt less like a normal conference panel and more like one of those sessions where the industry starts saying the next platform shift out loud. Nvidia listed the session as “Open Models: Where We Are and Where We’re Headed,” moderated by Huang and held on March 18 during GTC 2026. (Image credit: Corey Noles/The Neuron)

But despite the title, the most interesting argument onstage was not really about open models. It was about open agents.

The Real Story Was the Move From Models to Systems

Huang opened the session by trying to kill the most boring framing in AI: the idea that the market is cleanly split between proprietary labs and open challengers. His point was broader than that. AI is not a single model, a single product, or a single winner-take-all category. It is a stack, a system, and increasingly a combination of many different model types working together.

“Proprietary versus open is not a thing. It’s proprietary and open,” Huang said. “A.I. is a system of models and systems of a lot of other things.”

That was the throughline of the discussion. Yes, the panel covered open models as infrastructure. Yes, it touched on why open systems widen access and why smaller players may create some of the most important specialized breakthroughs. But the stronger consensus was that the center of gravity is moving up the stack. Models matter. Open models matter a lot. But what increasingly matters more is the system wrapped around them: orchestration, memory, tools, identity, governance, and runtime. That is why the panel landed as such a strong case for open agents.

Aravind Srinivas Gave the Clearest Product Abstraction

The sharpest product framing came from Aravind Srinivas, who described Perplexity Computer in a way that captured where the market seems to be heading. Instead of asking users to choose a model, route tasks manually, and stitch together their own workflows, the system should take the task and decide how to solve it.

“A.I. is not the model, it’s the system. It’s the computer,” Srinivas said. “Perplexity Computer is the idea that you should build the organizational system of everything that A.I. can do.”

That is a bigger idea than product branding. It suggests the next useful abstraction layer in AI may not be a chatbot or even a single frontier model. It may be a computer for delegation: a system that knows which models to call, which tools to use, when open models are good enough, when closed models are worth using, and how to pull those pieces into one coherent workflow. Srinivas also made it clear that the future is unlikely to be a simple ideological split between open and closed systems. Different models will serve different functions.

Harrison Chase Made the Case for the Harness Layer

If Srinivas provided the cleanest product abstraction, Harrison Chase provided the clearest builder abstraction. His phrase, “harness engineering,” may have been one of the most important on the panel. Chase used it to describe everything around the model: which sub-agents are used, which skills are attached, how memory works, what tools are selected, and how the environment is configured for a specific domain or task.

“Harness engineering is everything around the model,” Chase said.

He made the point that when people are impressed by a polished AI product, they are often responding not just to the raw model quality but to the system surrounding it. That matters because it runs counter to one of the laziest ideas in AI discourse: that anything built around a model is “just a wrapper.” Once models get good enough, the wrapper stops being a wrapper and starts becoming the operating system. The harness is where general intelligence becomes useful intelligence.

That also helps explain why routing and orchestration are starting to look like durable product layers. A useful reference point here is The Neuron’s write-up of OpenRouter. While not identical to what the panel discussed, it maps closely to the same underlying shift: value is moving into the layer that decides how intelligence gets assembled and deployed.

OpenClaw Mattered Less as a Product Than as a Signal

OpenClaw hovered over the whole conversation even when the panel was not explicitly about it. Huang framed it as a turning point, not just because it exists, but because it makes a new category legible. In the panel transcript, he described it as a big deal. In a separate GTC press Q&A, he went even further, calling it an inflection point for what comes after reasoning systems and arguing that it now needs enterprise-grade layers, including privacy, governance, security, and optimized runtimes.

“OpenClaw is a big deal,” Huang said, a point he reiterated throughout GTC.

The point is not that OpenClaw is the only product that matters. The point is that it signals the conversation has shifted from answering to acting. That is the more important category change. The panelists kept circling the same idea, even when they used slightly different language: AI systems are moving beyond responses and into execution across files, tools, workflows, and goals.

Michael Truell Connected Coding Agents to the Rest of the Economy

Cursor CEO and Founder Michael Truell offered one of the cleanest bridges from coding agents to the rest of the economy. His argument was that coding was simply the first place this system style began working in a real, visible way. The same pattern is now spreading into other domains.

“What started working in coding last year … now, we’re going to all of these other domains,” Truell said.

That is a useful lens for understanding why this panel mattered. Coding agents are the preview but not the overall endpoint. The combination of models, files, CLIs, tool use, and rapid iteration made coding the first environment where agentic systems felt obviously real. If those same primitives spread outward into research, healthcare, legal workflows, operations, and back office work, then the real market is not “AI coding.” It is the much larger category of computer work being reinterpreted as agent work.
2026 Developer Research Report

By Carisse Dumaua
Hello, our dearest DZone Community! Last year, we asked you for your thoughts on emerging and evolving software development trends, your day-to-day as devs, and the workflows that work best — all to shape our 2026 Community Research Report. The goal is simple: to better understand our community and provide the right content and resources developers need to support their career journeys. After crunching some numbers and piecing the puzzle together, at last, it is in (and we have to warn you, it's quite a handful)!

This report summarizes the survey responses we collected from December 9, 2025, to January 27 of this year, and includes an overview of the DZone community, the stacks developers are currently using, the rising trend in AI adoption, year-over-year highlights, and so much more. Here are a few takeaways worth mentioning:

- AI use climbs this year, with 67.3% of readers now adopting it in their workflows.
- While most use multiple languages in their developer stacks, Python takes the top spot.
- Readers visit DZone primarily for practical learning and problem-solving.

This is just a small glimpse of what's waiting in our report, made possible by you. You can read the rest of it in the free 2026 Community Research Report.

We really appreciate you lending your time to help us improve your experience and nourish DZone into a better go-to resource every day. Here's to new learnings and even newer ideas!

— Your DZone Content and Community team
Optimizing Data Loader Jobs in SQL Server: Production Implementation Strategies
By Arvind Toorpu
The AI Cost-Cutting Fallacy: Why "Doing More with Less" is Breaking Engineering Teams
By Vitalii Oborskyi
Augmenting Your Dev Org with Agentic Teams
By Adam Mattis
AI in Patient Portals: From Digital Access to Intelligent Healthcare Experiences

Patient portals across mobile, web, and kiosk platforms have become the primary digital touchpoints between healthcare organizations and patients. The inception of these portals began with digitizing paper check-in forms and has evolved into full-fledged mobile and web applications that allow patients to view lab results, schedule appointments, and communicate with providers. As patient expectations rise — along with advances in consumer technology — traditional rule-based portals are no longer sufficient. This is where Artificial Intelligence (AI) is transforming patient portals from static systems into intelligent, adaptive healthcare experiences. In this article, I explore how AI is being applied in modern patient portals, like the ones in our healthcare organization, why it matters, and what engineering leaders should consider when introducing AI into healthcare-grade digital platforms.

The Limitations of Traditional Patient Portals

Despite widespread digital adoption, many patient portals still suffer from common issues that healthcare organizations must address:

- Complex navigation that frustrates users, especially elderly patients who are not familiar with technology
- Continued dependence on call centers for basic questions and clarifications
- Front-desk support still required for scheduling doctor appointments
- Reactive engagement instead of proactive care support

These challenges are not just UX problems — they directly impact patient satisfaction, clinician workload, and operational costs. AI offers a practical path forward by addressing these limitations without requiring complete platform rewrites.

Where AI Fits Naturally in Patient Portals

AI is beginning to fit naturally into patient portals, making them more helpful and easier to use while supporting better care delivery. Instead of static screens and long wait times for answers, AI features can respond to patient questions instantly, guide users through tasks, and provide personalized support.

Explaining Complex Results

For example, if a lab report shows an unfamiliar value like “eGFR: 52,” an AI-enabled portal can explain what that measurement represents and why it is monitored. It can also clarify normal ranges and suggest general next steps a patient might discuss with their provider.

Simplifying Medical Terminology

The portal can translate complex medical terms into easy-to-understand language.

Preparing for Doctor Visits

After reviewing lab results, patients might ask:

- “My glucose level is elevated — could that be related to my recent prescription changes?”
- “I’m concerned about my blood pressure. What should I ask my doctor about medications or lifestyle changes?”

AI can help generate relevant questions so patients arrive better prepared.

Scheduling Follow-Up Care

AI-enabled portals can present multiple appointment options and alternative suggestions to help patients quickly book convenient times.

Intelligent Virtual Assistants

Intelligent virtual assistants go beyond traditional chatbots. These AI-powered assistants embedded within patient portals can handle:

- Appointment scheduling and rescheduling
- Prescription refill guidance
- Insurance and billing-related questions
- Pre-visit instructions and reminders

Personalized Patient Experiences

Every patient’s journey is different. AI enables portals to move from static dashboards to context-aware personalization, such as:

- Highlighting relevant actions based on recent visits
- Adjusting content based on chronic conditions
- Surfacing reminders aligned with care plans
- Delivering personalized education materials

This level of personalization improves engagement without overwhelming patients with unnecessary information.

Predictive Engagement and Proactive Care

AI models can analyze historical interaction data to identify patterns such as:

- Missed appointments
- Delayed follow-ups
- Gaps in preventive care

Using these insights, patient portals can proactively nudge patients at the right time and through the right channel, reducing no-shows and improving adherence.

Clinical Workflow Support

The goal is not to replace clinicians. Instead, AI within patient portals can assist them indirectly by:

- Structuring symptom inputs before visits
- Summarizing patient-submitted messages
- Flagging high-priority requests
- Reducing administrative burden

This allows care teams to focus on clinical decision-making while AI handles triage support — without crossing into unsafe automation.

Engineering Considerations for AI-Driven Patient Portals

Engineering considerations are critical when implementing AI in patient portals to ensure optimized healthcare delivery and patient engagement. A primary focus must be data security and patient trust.

Data Privacy and Trust Are Non-Negotiable

Healthcare AI must be designed with:

- HIPAA-compliant data handling
- Explicit consent boundaries
- Auditability and traceability
- Clear patient communication about AI usage

Architecture Matters More Than Algorithms

In real-world patient portals, AI works best when built as decoupled, service-oriented components — often using event-driven or serverless architectures. This approach enables:

- Independent iteration of AI capabilities
- Safe rollback of features
- Controlled exposure to web and mobile clients
- Backward compatibility with existing systems

Measuring Success

The success of AI in patient portals should not be measured by model complexity, but by real-world outcomes such as:

- Reduced call-center volume
- Improved appointment adherence
- Faster response times
- Higher patient satisfaction scores
- Lower clinician burnout

The Road Ahead

AI will not replace patient portals — but it will redefine the patient experience. Future portals will function less like digital filing cabinets and more like intelligent care companions, helping patients navigate healthcare systems that are often fragmented and overwhelming. For healthcare organizations, the challenge is not whether to adopt AI, but how to do so responsibly, securely, and incrementally — without compromising trust or safety. When implemented thoughtfully, AI has the potential to make patient portals not just more efficient, but genuinely more human. Let’s not be afraid — instead, let’s be bold and embrace the evolution of technology to advance our industry and our profession.
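To make the "decoupled, service-oriented" idea concrete, here is a minimal sketch, not the author's implementation, of a result-explainer microservice a portal front end could call. The FastAPI framing, the llm_explain() helper, and the request model are illustrative assumptions; in practice any LLM call would have to run inside HIPAA-compliant infrastructure with auditing and consent controls.

Python

# Minimal sketch (hypothetical names): a decoupled "result explainer" service.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LabResult(BaseModel):
    test_name: str   # e.g., "eGFR"
    value: float     # e.g., 52
    unit: str        # e.g., "mL/min/1.73m2"

def llm_explain(result: LabResult) -> str:
    # Placeholder for a call to whichever LLM service the organization uses.
    # In production this must stay inside HIPAA-compliant infrastructure, be
    # logged for auditability, and never pass patient identifiers to the model.
    return (
        f"{result.test_name} of {result.value} {result.unit} is a value your "
        "care team monitors; discuss what it means for you with your provider."
    )

@app.post("/explain-result")
async def explain_result(result: LabResult) -> dict:
    # The portal calls this decoupled service, so the AI feature can be
    # iterated on, rolled back, or gated independently of the portal itself.
    return {"explanation": llm_explain(result)}

Because the explainer sits behind its own endpoint, it can be exposed to web and mobile clients gradually and removed without touching the rest of the portal, which is the "safe rollback" property the article describes.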

By Muhammed Harris Kodavath
Beyond Django and Flask: How FastAPI Became Python's Fastest-Growing Framework for Production APIs

In December 2025, FastAPI achieved what many thought was impossible just three years ago: it surpassed Flask in GitHub stars, reaching 88,000 compared to Flask's 68,400. This isn't just a popularity contest. It represents a fundamental architectural shift in how professional developers are building production APIs in 2026.

The numbers from early 2026 tell an even more compelling story. According to the latest JetBrains Python Developer Survey, FastAPI jumped from 29% to 38% adoption among Python developers in 2025 — a staggering 40% year-over-year increase. The 2025 Stack Overflow survey confirmed the trend with a five-percentage-point surge, making it one of the most significant shifts in the web framework landscape. Even more revealing: among developers starting new projects in late 2025 and early 2026, FastAPI has become the default choice over Django and Flask. PyPI download statistics show FastAPI (9 million monthly downloads) has essentially caught up to Django, while Flask’s growth has plateaued.

The 2026 Reality: Enterprise Adoption Accelerates

What separates 2026 from previous years is the shift from early-adopter experimentation to enterprise production deployment. Microsoft, Netflix, and Uber aren’t just using FastAPI — they’re standardizing on it for new API services. According to Belitsoft’s 2025 Python Development Trends analysis, FastAPI has “eclipsed Flask and Django” among professional developers building async-native, API-first applications. This enterprise momentum creates a self-reinforcing cycle. When industry leaders adopt a framework, it validates the technology for risk-averse organizations. The talent pool grows. The ecosystem matures. The feedback loop accelerates adoption even further.

The Async-First Era Has Arrived

The web development landscape fundamentally changed in 2025–2026 with the rise of agentic AI systems. Modern applications don’t make single API calls — they orchestrate multiple steps: embedding generation, vector database searches, LLM inference calls, and result streaming. These are inherently I/O-bound, async-heavy workloads where traditional synchronous frameworks create bottlenecks. FastAPI was architected for exactly this pattern. Built on Starlette (an ASGI web toolkit) and leveraging Python’s native asyncio, it handles thousands of concurrent requests without the thread-per-request overhead that limits WSGI frameworks. Real-world benchmarks consistently show FastAPI handling 20,000+ requests per second compared to Flask’s 4,000 — a 5x improvement.

Critical architectural change: The industry has moved from “async as an optimization” to “async as the foundation.” In 2026, if your framework wasn’t built async-first from day one, you’re working against the architectural grain of modern web applications.

Type Safety Becomes Non-Negotiable

Python’s dynamic typing has always involved a trade-off: rapid development at the cost of potential runtime errors. FastAPI fundamentally changes this equation through Pydantic integration, making type hints operational rather than merely documentary.
Here's what this looks like in practice:

Python

from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import Optional

class UserCreate(BaseModel):
    username: str = Field(..., min_length=3, max_length=50)
    email: str
    age: Optional[int] = Field(None, ge=0, le=150)

app = FastAPI()

@app.post("/users/")
async def create_user(user: UserCreate):
    # user is guaranteed valid by this point
    return {"username": user.username, "email": user.email}

FastAPI automatically validates incoming requests, returns detailed error messages for invalid data, and generates OpenAPI documentation. Your IDE provides autocomplete on model fields. Type checkers catch errors before runtime. This isn’t just developer convenience — it’s production reliability.

Automatic Documentation That Actually Matters

Every experienced developer knows the pain of outdated API documentation. Teams ship endpoint changes, parameter modifications, or response structure updates, and documentation quickly becomes inaccurate. FastAPI eliminates this entire category of problems. Navigate to /docs on any FastAPI application, and you get a fully interactive Swagger UI generated automatically from your code. The /redoc endpoint provides an alternative ReDoc interface. Both update in real time as you modify endpoints.

For teams practicing contract-first API design, this is transformative. Your implementation is the contract. There’s no drift, no synchronization problems, and no separate documentation repository to maintain. When you have multiple microservices, frontend teams consuming APIs, and third-party integrations, this zero-overhead documentation becomes critical infrastructure.

The AI/ML Integration Catalyst

FastAPI’s timing couldn’t have been better. The explosion of AI applications in 2024–2025 created massive demand for frameworks optimized for ML model serving. Data scientists work in Python. When they need to deploy models as production APIs, FastAPI has become the obvious choice. The statistics confirm this pattern. According to 2025 survey data, 42% of ML engineers use FastAPI, compared to 22% using Django and 28% using Flask. This isn’t random — FastAPI’s async capabilities align perfectly with ML serving patterns, where inference calls take variable amounts of time. With the AI application market reaching $62.4 billion in 2025 (37.2% CAGR), FastAPI sits at the intersection of the industry’s fastest-growing segments: machine learning model serving and high-performance API development.

2026 enterprise reality: Teams building RAG (Retrieval-Augmented Generation) systems, LLM orchestration layers, or AI agent APIs are choosing FastAPI by default. The framework's async architecture naturally handles the I/O-heavy patterns of modern AI applications.

Production-Ready Ecosystem Maturity

FastAPI is no longer experimental. The ecosystem has matured to enterprise production standards, with robust libraries for:

- Database integration (SQLAlchemy 2.0, SQLModel)
- Authentication (OAuth2, JWT with scopes)
- Background tasks (Celery, Dramatiq)
- Observability (OpenTelemetry, Prometheus)

Multiple development teams report a 30–40% reduction in API development time compared to traditional frameworks. This primarily comes from eliminating boilerplate validation code, automatic documentation generation, and intuitive design patterns.

The Django and Flask Response

Django added async support in version 3.1 and has improved it through Django 5.x releases.
However, the framework’s ORM remains largely synchronous, creating bottlenecks in async views. Database operations are executed in thread pools, which works functionally but introduces overhead under load. Teams often isolate AI-heavy endpoints into separate FastAPI services rather than forcing Django to handle workloads it wasn't designed for.

Flask can run under ASGI servers but wasn't architected async-first. The 2024 Flask ecosystem review by Miguel Grinberg noted that while Flask remains a solid choice for traditional web applications, async-first frameworks like FastAPI, Starlette, and Sanic are increasingly the default for modern API development. The verdict is clear: frameworks retrofitting async capabilities can't match the performance and developer experience of frameworks designed async-first from conception.

When to Choose What in 2026

Despite FastAPI’s momentum, the choice isn’t always clear-cut. Each framework still has legitimate use cases.

Choose FastAPI When:

- Building APIs or microservices: Especially those requiring high concurrency and async I/O
- Deploying ML models: The async architecture and type safety make it purpose-built for AI serving
- Starting greenfield projects: Modern async-first architecture aligns with where the ecosystem is headed
- Performance matters: I/O-bound workloads see 3-5x throughput improvements
- Team wants modern Python: Type hints, async/await, and contemporary patterns

Choose Django When:

- Building full-stack monoliths: The batteries-included approach with ORM, admin, and auth saves time
- Database-heavy CRUD apps: Django's mature ORM and migration system are battle-tested
- Team expertise exists: Deep Django knowledge in your organization has real value
- Admin interface matters: Django's built-in admin is still unmatched for content management

Choose Flask When:

- Rapid prototyping MVPs: The minimalist approach gets proof-of-concepts running quickly
- Simple web applications: Content-driven sites with moderate traffic where async isn't needed
- Maximum flexibility required: Projects needing complete architectural freedom
- Team familiarity: Existing Flask expertise and established patterns

The 2026 Migration Reality

Should your team migrate existing Django or Flask applications to FastAPI? The answer depends on context and shouldn't be driven by hype. Consider migration for I/O-heavy APIs spending most time waiting on external services, new microservices where greenfield architecture allows modern patterns, ML model serving where async and type safety provide clear value, and performance-critical endpoints where 3-5x throughput improvements justify the effort.

For established Django applications deeply integrated with the ORM, admin interface, and ecosystem, wholesale migration rarely makes sense. The pragmatic approach many teams take: keep existing Django services while building new microservices in FastAPI. This hybrid strategy leverages existing investments while adopting modern patterns for new development.

Looking Forward: Python 3.14 and Free-Threading

The async-first movement is just one part of Python's performance evolution. Python 3.14, released in late 2025, is the first version in which free-threaded builds (no Global Interpreter Lock) are officially supported. This enables true parallel execution of Python code across CPU cores. FastAPI positions itself well for this future. The framework's async architecture already handles I/O concurrency efficiently.
When CPU-bound tasks become truly parallel through free-threading, FastAPI applications will benefit from both async I/O concurrency and parallel CPU execution.

The Bottom Line

FastAPI surpassing Flask in GitHub stars isn’t just a milestone — it’s confirmation of a structural shift in Python web development. The industry has moved toward async-first, type-safe, API-centric architectures. FastAPI was designed for this world. The 40% year-over-year adoption growth, enterprise deployments, and dominance among ML engineers show this isn’t hype — it’s a baseline shift for modern Python API development. For teams starting new projects in 2026, the default has changed. Unless you specifically need Django’s batteries-included approach or Flask’s minimalism, FastAPI aligns more closely with where modern Python web development is heading.
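To ground the async-first serving pattern the article keeps returning to, here is a minimal sketch (not from the article) of an I/O-bound inference endpoint. The downstream model URL, request shape, and response shape are hypothetical placeholders.

Python

# Minimal sketch: while one request awaits the slow model call, the event loop
# stays free to serve other requests -- the concurrency win described above.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

@app.post("/predict")
async def predict(query: Query):
    # Hypothetical internal model service; swap in your own serving endpoint.
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            "http://model-service/infer", json={"text": query.text}
        )
        resp.raise_for_status()
    return resp.json()

The same handler written synchronously would pin a worker thread for the full duration of the model call; the awaited version is what lets a single process multiplex many in-flight inference requests.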

By Dinesh Elumalai
How to Use AWS IAM Identity Center for Scalable, Compliant Cloud Access Control

What Is AWS IAM Identity Center?

Think of IAM Identity Center (previously AWS SSO) as the gatekeeper to your cloud environment. Its role is to make sure only the right users or services gain access to your AWS resources, and only with the exact permissions they need. Built as a cloud-based identity management service, it handles authentication and authorization for AWS accounts and other supported business applications, all from a single pane of glass.

The Core Mission

- Centralized access: Decide who gets in and what they can do from a single control point.
- Seamless authentication: Users log in once and move across authorized applications.
- Extensive integrations: Integrates with AWS accounts, enterprise directories, and third-party services.

How Does Identity Center Fit Into AWS?

AWS environments can quickly become complex, spanning multiple accounts, regions, stacks, and workloads. In the past, managing identities, passwords, and permissions across all of them was a headache. Then came the push for single sign-on (SSO), so users wouldn’t have to juggle multiple logins. That’s where AWS IAM Identity Center steps in. Here’s how it fits into real-world setups:

- IAM Identity Center unifies access control across all accounts, while AWS Organizations helps manage multiple accounts.
- Your workforce might use applications outside AWS, like Microsoft 365, Salesforce, or Atlassian. IAM Identity Center covers those as well, giving users one login for everything.
- Whether you use Microsoft Active Directory or cloud-based providers like Okta or Azure AD, Identity Center integrates smoothly with them.

Key Features

Centralized User & Group Management

You can create users and groups within Identity Center, import them from external identity providers (IdPs), or combine both strategies. Mapping groups to specific permissions makes onboarding and offboarding much easier for administrators.

Fine-Grained Permissions

Permissions are controlled using AWS IAM policies or custom permission sets. You apply them to groups or users, enforcing least-privilege access across AWS accounts. No more “Oops, I gave everyone admin” moments.

Single Sign-On (SSO)

SSO is the magic word for user experience. Logging in once and then moving between AWS services and integrated external apps saves time and eliminates password fatigue.

Adaptable Identity Sources

You can manage users natively or connect to an external identity provider using standards such as SAML 2.0. In other words, you can link your existing workforce directory directly to AWS.

Audit & Compliance

Every action — login, access request, privilege grant — can be tracked, recorded, and audited. This helps meet compliance requirements and provides clarity about who did what, when, and where.

Getting Started

Success with IAM Identity Center is less about wizardry and more about clarity.

Step 1: Enable IAM Identity Center

Navigate to the AWS Management Console, search for “IAM Identity Center,” and enable it. AWS will guide you through the initial setup.

Step 2: Choose Your Identity Source

Inbound users must come from somewhere. Options include:

- Built-in directory (manage users and groups in AWS)
- Active Directory (on-premises or AWS Managed AD)
- External SAML-based provider

Step 3: Connect AWS Accounts & Applications

Select which AWS accounts and external business applications should fall under centralized access control. AWS offers a growing library of pre-integrated apps, including many popular SaaS solutions.
Step 4: Create and Assign Permission Sets

Define permission sets (collections of IAM policies). Assign them to users or groups and map them to the appropriate accounts or applications. The goal is minimal access with maximum efficiency. (A sketch of what this looks like in code follows at the end of this article.)

Step 5: Test and Monitor

A test drive never hurts. Log in as a user, verify access, and glance at audit logs. You’ll refine things as you go, almost certainly.

How Organizations Leverage IAM Identity Center

Here is how teams make their lives easier with IAM Identity Center:

- Onboarding & offboarding: Single-step assignment and revocation of privileges when employees join, relocate, or depart. No more orphaned access.
- Role-based access: Rather than controlling access one-by-one, use groups that represent real-world roles (dev, finance, admin, read-only, etc.).
- External user collaboration: Provide secure, time-limited access to partners or contractors without handing over the keys to your kingdom.
- Compliance audit trails: Simplify the auditor’s work with detailed logs of who did what, and when.

Lessons Learned and Best Practices

Of course, there’s no journey in the cloud without its bumps. IAM Identity Center is robust, but here’s what I make sure to keep an eye out for:

- Overlap of permissions: Double-check permissions, particularly if a user belongs to several groups with conflicting sets.
- Directory sync latency: If using external directories, sync delays can sometimes cause temporary confusion.
- Custom app support: Not all business apps natively support SAML or OIDC. You might require additional configuration.
- Credential lifecycle: Certain users still require long-lived API keys; these need to be handled outside the SSO framework.

When IAM Identity Center Might Not Be Enough

Although IAM Identity Center is well-designed, certain edge cases may require additional configurations or alternative solutions:

- Massive-scale environments: Some organizations with tens of thousands of users and ultra-complex hierarchies might require federated setups or hybrid models.
- Non-AWS resources: For fully multi-cloud or on-prem environments, consider broader tools like Azure AD or Okta.

Final Thoughts

Adopting AWS IAM Identity Center streamlines access management and improves daily life for both users and administrators. Its alignment with AWS security best practices and flexible integration options make it a strong foundation for cloud-first organizations. My suggestion? Start small. Experiment. Test thoroughly. You’ll likely see improvements in both team morale and security posture as manual, time-consuming processes fade into the background.
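As a concrete illustration of Step 4, here is a minimal sketch using boto3's sso-admin client, assuming that client is available in your environment; the instance ARN, account ID, and group ID are placeholders, and the managed policy and session duration are illustrative choices, not recommendations from the article.

Python

# Minimal sketch (placeholders throughout): create a read-only permission set
# and assign it to an Identity Center group for one AWS account.
import boto3

sso_admin = boto3.client("sso-admin")

INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-EXAMPLE"  # your Identity Center instance
ACCOUNT_ID = "111122223333"                              # target AWS account
GROUP_ID = "your-identity-center-group-id"               # group from your identity source

# Define the permission set (a collection of policies plus session settings).
ps = sso_admin.create_permission_set(
    InstanceArn=INSTANCE_ARN,
    Name="ReadOnlyAnalysts",
    Description="Read-only access for the analytics team",
    SessionDuration="PT8H",
)
ps_arn = ps["PermissionSet"]["PermissionSetArn"]

# Attach an AWS managed policy to keep the set least-privilege.
sso_admin.attach_managed_policy_to_permission_set(
    InstanceArn=INSTANCE_ARN,
    PermissionSetArn=ps_arn,
    ManagedPolicyArn="arn:aws:iam::aws:policy/ReadOnlyAccess",
)

# Map the group to the account with that permission set.
sso_admin.create_account_assignment(
    InstanceArn=INSTANCE_ARN,
    TargetId=ACCOUNT_ID,
    TargetType="AWS_ACCOUNT",
    PermissionSetArn=ps_arn,
    PrincipalType="GROUP",
    PrincipalId=GROUP_ID,
)

Scripting the assignment this way is what makes the onboarding and offboarding pattern above a single-step operation: adding or removing a user from the group changes their access everywhere the group is assigned.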

By Ankush Madaan
Square, SumUp, Shopify: Data Streaming for Real-Time Point-of-Sale (POS)

Point-of-Sale (POS) systems are no longer just cash registers. They are becoming real-time, connected platforms that handle payments, manage inventory, personalize customer experiences, and feed business intelligence. Small and medium-sized merchants can now access capabilities once reserved for enterprise retailers. Mobile payment platforms like Square, SumUp, and Shopify make it easy to sell anywhere and integrate sales channels seamlessly. At the same time, data streaming technologies such as Apache Kafka and Apache Flink are transforming retail operations. They enable instant insights and automated actions across every store, website, and supply chain partner. This post explores the current state of mobile payment solutions, the role of data streaming in retail, how Kafka and Flink power POS systems, the SumUp success story, and the future impact of Agentic AI on the checkout experience.

Mobile Payment and Business Solutions for Small and Medium-Sized Merchants

The payment landscape for small and medium-sized merchants has undergone a rapid transformation. For years, accepting card payments meant expensive contracts, bulky hardware, and complex integration. Today, companies like Square, SumUp, and Shopify have made payments simple, mobile, and affordable.

- Block (Square) offers a unified platform that combines payment processing, POS systems, inventory management, staff scheduling, and analytics. It is especially popular with small retailers and service providers who value flexibility and ease of use.
- SumUp started with mobile card readers but has expanded into full POS systems, online stores, invoicing tools, and business accounts. Their solutions target micro-merchants and small businesses, enabling them to operate in markets that previously lacked access to digital payment tools.
- Shopify integrates its POS offering directly into its e-commerce platform. This allows merchants to sell in physical stores and online with a single inventory system, unified analytics, and centralized customer data.

These companies have blurred the lines between payment providers, commerce platforms, and business management systems. The result is a market where even the smallest shop can deliver a payment experience once reserved for large retailers.

Data Streaming in the Retail Industry

Retail generates more event data every year. Every scan at a POS, every online click, every shipment update, and every loyalty point redemption is a data event. In traditional systems, these events are collected in batches and processed overnight or weekly. The problem is clear: by the time insights are available, the opportunity to act has often passed. Data streaming solves this by making all events available in real time. Retailers can instantly detect low stock in a store, trigger replenishment, or offer dynamic discounts based on current shopping patterns. Fraud detection systems can block suspicious transactions before completion. Customer service teams can see the latest order updates without contacting the warehouse.

In previous retail industry examples, data streaming has powered:

- Omnichannel inventory visibility for accurate stock counts across stores and online channels.
- Dynamic pricing engines that adjust prices based on demand and competitor activity.
- Personalized promotions triggered by live purchase behavior.
- Real-time supply chain monitoring to handle disruptions immediately.

Emerging Trend: Unified Commerce

The next stage beyond omnichannel is Unified Commerce.
Here, all sales channels — physical stores, online shops, mobile apps, marketplaces, and social commerce — operate on a single, real-time data foundation. Instead of integrating separate systems after the fact, every transaction, inventory update, and customer interaction flows through one unified platform. Data streaming technologies like Apache Kafka make Unified Commerce possible by ensuring all touchpoints share the same up-to-date information instantly. This enables consistent pricing, seamless cross-channel returns, accurate loyalty balances, and personalized experiences no matter where the customer shops. Unified Commerce turns fragmented retail technology into a single, connected nervous system.

Data Streaming with Apache Kafka and Flink for POS in Retail

In an event-driven retail architecture, Apache Kafka acts as the backbone. It ingests payment transactions, inventory updates, and customer interactions from multiple channels. Kafka ensures these events are stored durably, replayable for compliance, and available to downstream systems within milliseconds. Apache Flink adds continuous stream processing capabilities. For POS use cases, this means:

- Running fraud detection models in real time, with alerts sent instantly to the cashier or payment gateway.
- Aggregating sales data on the fly to power live dashboards for store managers.
- Updating loyalty points immediately after a purchase to improve customer satisfaction.
- Ensuring that both physical stores and e-commerce channels reflect the same stock levels at all times.

Together, Kafka and Flink create a foundation for operational excellence. They enable a shift from manual, reactive processes to automated, proactive actions. Using data streaming at the edge for POS systems enables ultra-low latency processing and local resilience, but scaling and managing it across multiple locations can be challenging. Running data streaming in the cloud offers central scalability and simplified governance, though it depends on reliable connectivity and may introduce slightly higher latency.

SumUp: Real-Time POS at Global Scale with Data Streaming in the Cloud

SumUp processes millions of transactions per day across more than 30 countries. To handle this scale and maintain high availability, they adopted an event-driven architecture powered by Apache Kafka and fully managed Confluent Cloud. In the Confluent customer story, SumUp explains how Kafka has allowed them to:

- Process every payment event in real time.
- Maintain a unified data platform across regions, ensuring compliance with local payment regulations.
- Scale easily to handle seasonal transaction spikes without service interruptions.
- Speed up developer delivery cycles by providing event data as a service across teams.

Implementing Critical Use Cases Across the Business

More than 20 teams at SumUp now rely on Confluent Cloud to deliver mission-critical capabilities.

- Global Bank Tribe: Operates SumUp’s banking and merchant payment services. Real-time data streaming keeps transaction records updated instantly in merchant accounts. Reusable data products improve resilience for high-volume processes such as 24/7 monitoring, fraud detection, and personalized recommendations.
- CRM Team: Delivers customer and product information to operational teams in real time. Moving away from batch processing creates a smoother customer experience and enables data sharing across the organization.
- Risk Data and Machine Learning Platform: Feeds standardized, near-real-time data into machine learning models. These models make decisions on the freshest data available, improving outcomes for both teams and merchants.

By embedding Confluent Cloud across multiple domains, SumUp has turned event data into a shared asset that drives operational efficiency, customer satisfaction, and innovation at scale. For merchants, this means faster transaction confirmations, improved reliability, and new digital services without downtime.

The Future of POS and Impact of Agentic AI

The POS of tomorrow will be more than a payment device. It will be a connected intelligence hub. Agentic AI, with autonomous systems capable of proactive decision-making, will play a central role. Future capabilities could include:

- AI-driven recommendations for upsells, customized to each shopper’s behavior and context.
- Predictive inventory replenishment that automatically places supplier orders when stock is low.
- Automated fraud prevention that adapts in real time to emerging threats.
- Dynamic loyalty program offers tailored at the exact moment of purchase.

When Agentic AI is powered by real-time event data from Kafka and Flink, decisions will be both faster and more accurate. This will shift POS systems from passive endpoints to active participants in business growth. For small and medium-sized merchants, this evolution will unlock capabilities previously available only to enterprise retailers. The result will be a competitive, data-driven retail landscape where agility and intelligence are built into every transaction.
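To show the first step of the event-driven pattern described above, here is a minimal sketch (not from the article) of a POS terminal publishing transaction events to Kafka with the confluent-kafka client; the broker address, topic name, and event schema are illustrative placeholders.

Python

# Minimal sketch: publish one POS sale event per transaction.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def publish_sale(store_id: str, sku: str, qty: int, amount_cents: int) -> None:
    event = {
        "store_id": store_id,
        "sku": sku,
        "qty": qty,
        "amount_cents": amount_cents,
    }
    # Keying by store keeps each store's events ordered within a partition,
    # which downstream Flink jobs (fraud checks, live dashboards) can rely on.
    producer.produce("pos.transactions", key=store_id, value=json.dumps(event))

publish_sale("store-042", "SKU-123", qty=2, amount_cents=1998)
producer.flush()  # block until the broker acknowledges the event

From here, a stream processor such as Flink would consume pos.transactions to update stock levels, loyalty balances, and fraud scores in real time, which is the flow the article describes.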

By Kai Wähner
Best Practices to Make Your Data AI-Ready

The key problem organizations encounter when implementing AI is not the technology itself, but the data needed to feed AI models. Many companies have plenty of data, but when it comes to quality, it often turns out to be messy, inconsistent, or biased. If you want your AI investments to deliver real value, you must make your data AI-ready first. Below, I share some best practices for building an AI-ready culture and establishing a data management framework that ensures high-quality data pipelines for AI initiatives.

Start with Understanding Which Data You Need

AI readiness begins with use cases. You need to understand what type and how much data you require to build an efficient data analytics platform. Start by defining how AI will change a specific process, decision, or metric for your company. A good AI data strategy aligns data usage with business goals. This approach prevents you from investing time and resources in cleaning data you won’t use. Trust me, it can greatly optimize costs for your AI projects.

Once you have defined your use cases, you need to specify the exact data requirements, including formats, fields, latency, and more. A common mistake I see is making vague statements instead of focused specifications. For example, “customer data” is too broad; it’s better to divide it into specific fields like “customer ID,” “email address,” and “signup date.” This makes validation concrete and automatable.

Build Strong Data Governance and Ownership

One thing I know for sure is that AI projects fail fast if no one owns the data quality process. You need someone in your organization accountable for field definitions, data catalogs, access policies, and quality metrics. Without clear ownership, data changes often go unnoticed. Governance should also enforce role-based access, encryption standards, and lineage tracking so that data is traceable from source to model input. These factors help you comply with policies like GDPR while also reducing risk in AI decision-making.

Use Metadata and Catalogs to Make Data Discoverable

Metadata helps you quickly understand what each dataset contains, how it was created, and how it changes over time. This makes data easy to find for analysts and AI engineers. Build or use a data catalog that:

- indexes tables, schemas, and fields;
- documents ownership and definitions;
- tracks lineage and usage.

Metadata catalogs also serve as the basis for trust and reproducibility. When someone knows exactly where a dataset came from and how it has been transformed, they can validate that the model is working with reliable inputs.

Maintain a Central Data Platform

Data silos are a common problem for most organizations. Implementing data analysis in healthcare, I experienced this firsthand. Data tied up in departmental systems slows discovery and increases fragmentation. I'm not saying you need an "everything goes here" system. That would be risky. But you do need a data management layer that allows you to find, query, and monitor data from a single place. Think of it like a shared library. Start by registering your most critical datasets, but not everything at once. Document ownership, field definitions, refresh frequency, and known quality issues. Standardize access through shared query interfaces, whether teams use SQL, APIs, or other tools. Also, build quality checks directly into pipelines, adding validation rules for freshness, completeness, and schema changes at ingestion.
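As a small illustration of what ingestion-time quality checks can look like, here is a minimal sketch (not from the article) covering the three rules just mentioned: freshness, completeness, and schema drift. The column names, null threshold, and freshness window are illustrative placeholders.

Python

# Minimal sketch: validate a batch before it feeds retraining or analytics.
import pandas as pd

EXPECTED_COLUMNS = {"customer_id", "email_address", "signup_date"}

def validate_batch(df: pd.DataFrame, max_age_days: int = 1) -> list[str]:
    issues: list[str] = []

    # Schema check: fail fast if expected fields are missing.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
        return issues

    # Completeness check: flag fields with too many nulls.
    for col in EXPECTED_COLUMNS:
        null_ratio = df[col].isna().mean()
        if null_ratio > 0.05:
            issues.append(f"{col}: {null_ratio:.1%} null values")

    # Freshness check: the newest record should be recent.
    newest = pd.to_datetime(df["signup_date"]).max()
    if (pd.Timestamp.now() - newest).days > max_age_days:
        issues.append(f"stale data: newest record is from {newest.date()}")

    # Empty list means the batch passes; otherwise alert and block retraining.
    return issues

Wiring a check like this into the pipeline is what turns "know before a model retrains on bad data" from a goal into an automated gate.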
Track and Improve Quality Continuously

AI models require fresh data to retrain, so ensuring data quality is an ongoing process. Automate checks and set thresholds that trigger alerts. This allows your team to intervene before issues become costly problems. If a pipeline breaks or a critical field starts missing values, you should know before a model retrains on bad data. Once models are live, monitor their outputs and link them back to data quality signals. If a model consistently makes errors tied to certain data fields, trace the issue back and fix it upstream.

Test AI Readiness Before Full Deployment

Implementing AI iteratively has become best practice. The same applies to testing data for AI readiness. Before committing to full production, run small pilot projects to validate that data quality is sufficient and measure whether the dataset actually supports the business use case. In one project I worked on, we tried to build an employee attrition model using HR system data and moved too quickly toward implementation. We assumed core fields like job level, manager ID, and role history were reliable. During model testing, we realized that role changes were overwritten instead of tracked over time. As a result, the model learned misleading patterns. We had to step back, redesign the data model, and introduce proper history tracking before continuing. Pilot tests like this help catch gaps and adjust quality standards without significant risk.

Wrapping Up

AI success depends on data that is complete, accurate, and structured. Models trained on partial or inconsistent data will perform poorly and produce misleading results. In this article, I intentionally didn’t focus on cleaning and preparing a specific dataset, but rather on building a framework for effective data management in organizations pursuing AI projects. To see real results from your AI initiatives, ensure a consistent and reliable data flow. This reduces costly errors and transforms data into a strategic asset rather than just a byproduct of operations.

By Mykhailo Kopyl
Software Testing in LLMs: The Shift Towards Autonomous Testing

I wanted to unpack a simple, clear reality on intelligent testing in the large language models (LLM) era. LLMs redefine software testing principles by accelerating intelligent testing across the entire SDLC, enabling autonomous test generation, self-verifying AI agents, and true shift-left quality across build and deployment pipelines.

Why Are LLMs a Testing Game-Changer?

The "why" cuts to the heart of testing's oldest challenges: People write tests. People maintain flaky scripts. People explore complex systems. These tasks are deeply rooted in language (specifications, bug reports, code) and reasoning (what to test next, why something failed). LLMs have learned the patterns of code, natural language, and logical discourse from a vast corpus of human knowledge. They can now participate in the intellectual work of testing. If my test suite is a sprawling, fragile beast, an LLM can help me refactor it. If I'm faced with a new, undocumented API, an LLM can help me explore it and hypothesize test scenarios. If a CI pipeline fails at 2 AM with a cryptic error, an LLM can triage it. We're not using them to replace testers, but to augment our cognitive reach. They automate the translation of thought into action, turning a risk idea into a test script, a failure trace into a diagnosis. This frees us to focus on higher-order strategy: designing better test oracles, understanding system risk, and guiding truly autonomous testing agents. That's the game-changer.

What Is an LLM in a Testing Context?

Let's start with a core notion: In testing, an LLM is a reasoning engine for quality. Forget the chatbot box. Think of it as a new kind of testing tool. It doesn't "know" your application. It doesn't "understand" quality in a human sense. Instead, it has learned a statistical map of how concepts like "login test," "boundary value," "race condition," or "XPath selector" relate to billions of lines of code, bug reports, and testing tutorials. I have to ask: How can it create a valid test if it's never seen my app? This is the shift. It's not recalling a specific test. It's synthesizing a new one by following the patterns of what test code, logical steps, and descriptive language look like. When you prompt, "Write a Playwright test for a login flow that includes an invalid password attempt," it predicts the most probable sequence of code tokens and actions that matches that request, much like a senior tester drawing on a lifetime of experience to draft a new test case. The tester's role evolves from authoring every script to orchestrating and validating the output of this reasoning engine. The LLM becomes a force multiplier.

How We "Program" This Testing Engine: The New Art of the Test Prompt

Our primary interface is the prompt. This is where testing skill meets AI interaction. My initial model was simple: "Write a test for X." But I learned by doing, just like exploratory testing. For example:

- Weak prompt: "Test the checkout page." This prompt gets a generic, likely useless script.
- Context-rich prompt: "Act as a security-focused QA. Given this HTML snippet of our checkout form, identify three key risks for a fraudulent transaction. For the top risk, generate a Puppeteer script that demonstrates it. Assume the card number field uses custom validation." The output from this prompt is targeted, insightful, and actionable.

I'm not just asking; I'm setting a testing mission.
I provide context (HTML, user stories), assign a testing role ("performance engineer," "accessibility auditor"), specify techniques ("use equivalence partitioning"), and demand a specific output format. This is meta-testing. I compare the LLM's output to my mental model of good testing. I refine, iterate, and guide. The prompt becomes the test charter for an AI co-pilot.

From Automation to Autonomy: The Evolving "Models" of Testing

LLMs are introducing new layers into our testing architecture:

- The Script Generator: This is entry-level. Translating natural language descriptions into executable test code (Selenium, Playwright, Cypress). It kills boilerplate.
- The Intelligent Explorer: Here's where autonomy begins. An LLM-powered agent explores applications via Model Context Protocol (MCP), an open standard connecting AI models to external tools and data for better responses. It clicks, observes, infers state, and decides next steps dynamically. "This looks like a data grid; let's test sorting and filtering" — mimicking exploratory testing at machine speed.
- The Analyst and Diagnostician: This is crucial. When a test fails, the LLM can analyze the stack trace, logs, video, and DOM snapshot. It can hypothesize the root cause: "The element wasn't found because a loading overlay is still present. The script needs an explicit wait for the overlay to disappear." It turns CI/CD failures into actionable insights.
- The Adaptive Test Manager: The future is systems where LLMs don't just write and run tests, but manage them. They can prioritize tests based on code changes, cluster similar failures, suggest flakiness fixes, and even generate "tests for your tests" to improve coverage.

What Does "Testing" Become in This Era?

The practice is splitting, much like the shift from manual to automated testing before it:

- LLM-augmented scripted testing: Enhancing traditional automation. "Maintain this test suite," "Convert these 100 manual test cases into API tests," "Generate performance test data." It's about scale and efficiency.
- LLM-driven exploratory testing: This is the frontier. Here, the tester defines a mission and constraints, and an LLM-powered agent executes a unique, adaptive exploration path. Each session is different. The tester's job is to analyze the agent's findings, refine the mission, and build new models. It's a collaborative, investigative loop.

New Testing Techniques for the LLM Era

New skills are emerging:

- Prompt engineering for testing: This is the new test case design. Being precise about scope, context, risk, and expected output format.
- Context engineering: Using retrieval-augmented generation (RAG) to ground the LLM in your specific context — your codebase, your bug database, your API docs. This turns a generic LLM into a domain expert on the system.
- Orchestration and validation: Designing the systems and guardrails that let LLM agents operate safely. Writing the "tests for the AI tester" and validating its outputs is now a critical testing activity.

Conclusion

This is a high-level map of the changing testing landscape.
The key takeaways:

- LLMs are reasoning allies that translate testing intuition into action at unprecedented scale.
- The tester's role is shifting from sole executor to strategic orchestrator and validator of AI-assisted processes.
- The goal is evolving from automated execution (running scripts) to augmented intelligence (LLM-powered exploration) and, ultimately, toward guided autonomy (self-adapting test systems).
- The core of testing remains: critical thinking, risk assessment, and a relentless curiosity about the system. LLMs provide a powerful new lens through which to apply that thinking.

Just as software testing has always been about learning the reality of the system, testing in the LLM era is about learning to partner with a new kind of intelligence. We build a shared model, test its boundaries, and evolve together. And that's what software testing is becoming.
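To ground the prompt-engineering idea in code, here is a minimal sketch (not from the article) of sending a context-rich test charter to an LLM; it assumes the openai Python client, and the model name and HTML snippet are placeholders for whatever your team actually uses.

Python

# Minimal sketch: a context-rich "testing mission" sent to an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

checkout_html = "<form id='checkout'>...</form>"  # use a real DOM snippet in practice

prompt = f"""Act as a security-focused QA engineer.
Given this HTML snippet of our checkout form, identify three key risks for a
fraudulent transaction. For the top risk, generate a Playwright (Python) script
that demonstrates it. Assume the card number field uses custom validation.

HTML:
{checkout_html}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model your team has approved
    messages=[{"role": "user", "content": prompt}],
)

# The output is a draft, not a verdict: review and validate it like any test code.
print(response.choices[0].message.content)

The printed draft still needs the orchestration-and-validation step described above before it goes anywhere near a CI pipeline.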

By Kathiresan Jayabalan
Hands-On With Kubernetes 1.35
Hands-On With Kubernetes 1.35

Kubernetes 1.35 was released on December 17, 2025, bringing significant improvements for production workloads, particularly in resource management, AI/ML scheduling, and authentication. Rather than just reading the release notes, I decided to test these features hands-on in a real Azure VM environment. This article documents my journey testing four key features in Kubernetes 1.35: In-place pod vertical scaling (GA)Gang scheduling (Alpha)Structured authentication configuration (GA)Node declared features (Alpha) All code, scripts, and configurations are available in my GitHub repository for you to follow along. Test Environment Setup: Cloud: Azure VM (Standard_D2s_v3: 2 vCPU, 8GB RAM)Kubernetes: v1.35.0 via MinikubeContainer runtime: containerdCost: ~$2 for full testing sessionRepository: k8s-135-labs Why Azure VM instead of local? Testing on cloud infrastructure provides production-like conditions and helps identify real-world challenges you might face during deployment. Feature 1: In-Place Pod Vertical Scaling (GA) Theory: The Resource Management Problem Traditional Kubernetes pod resizing has a critical limitation: it requires pod restart. Old Workflow: User requests more CPU for podPod must be deletedNew pod created with updated resourcesApplication downtimeState lost (unless persistent storage) For production workloads, this causes: Service interruptionsLost in-memory stateLonger scaling timesComplex orchestration needed What's New in K8s 1.35 In-place pod vertical scaling (now GA) allows resource changes without pod restart: YAML apiVersion: v1 kind: Pod spec: containers: - name: app resources: requests: cpu: "500m" memory: "256Mi" limits: cpu: "1000m" memory: "512Mi" resizePolicy: - resourceName: cpu restartPolicy: NotRequired # No restart for CPU! - resourceName: memory restartPolicy: RestartContainer # Memory needs restart Key innovation: Different restart policies for different resources. CPU changes typically don't require restart, while memory might. Hands-On Testing Repository: lab1-in-place-resize I created an automated demo script that simulates a real-world scenario: Scenario: Application scaling up to handle increased load Initial (Light Load): 250m CPU, 256Mi memoryTarget (Peak Load): 500m CPU, 1Gi memoryIncrease: 2x CPU, 4x memory Shell # Run the automated demo ./auto-resize-demo.sh Auto-resize script output showing 250m →500m and Memory 256Mi → 1Gi Results: CPU doubled (250m → 500m) without restartMemory quadrupled (256Mi → 1Gi) without restartRestart count: 0Total time: 20 seconds Critical Discovery: QoS Class Constraints During testing, I encountered an important limitation that's not well-documented: The error: Plain Text The Pod "qos-test" is invalid: spec: Invalid value: "Guaranteed": Pod QOS Class may not change as a result of resizing QoS error message when trying to resize only requests What I learned: Kubernetes has three QoS classes: Guaranteed: requests = limitsBurstable: requests < limitsBestEffort: no requests/limits The rule: In-place resize cannot change QoS class. Wrong (fails): YAML # Initial: Guaranteed QoS requests: { cpu: "500m" } limits: { cpu: "500m" } # Resize attempt: Would become Burstable requests: { cpu: "250m" } limits: { cpu: "500m" } # QoS change! 
Correct (works):

YAML
# Resize both proportionally
requests: { cpu: "250m" }
limits: { cpu: "250m" }   # Stays Guaranteed

Production Impact

Before K8s 1.35:

Plain Text
Monthly cost for 100 Java pods:
- Startup: 2 CPUs × 5 minutes = wasted during idle
- Scaling event: Pod restart required
- Result: Over-provisioned or frequent restarts

After K8s 1.35:

Plain Text
Monthly cost for 100 Java pods:
- Dynamic: High CPU during startup, low during steady-state
- Scaling: No restarts needed
- Result: 30-40% cost savings observed in testing

Key Takeaways

Production-ready: GA status means stable for critical workloads
Real savings: 30-40% cost reduction for bursty workloads
QoS constraint: Plan resource changes to maintain QoS class
Fast: Changes apply in seconds, not minutes

Best use cases:

Java applications (high startup, low steady-state)
ML inference (variable load)
Batch processing (scale down after processing)

Feature 2: Gang Scheduling (Alpha)

Theory: The Distributed Workload Problem

Modern AI/ML and big data workloads often require multiple pods to work together. Traditional Kubernetes scheduling treats each pod independently, leading to resource deadlocks.

The problem:

Plain Text
PyTorch Training Job: Needs 8 GPU pods (1 master + 7 workers)
Cluster: Only 5 GPUs available

What happens:
├─ 5 worker pods scheduled → Consume all GPUs
├─ Master + 2 workers pending
├─ Training cannot start (needs all 8)
├─ 5 GPUs wasted indefinitely
└─ Other jobs blocked

This is called partial scheduling — some pods run, others wait, nothing works.

What Is Gang Scheduling?

Gang scheduling ensures a group of pods (a "gang") schedules together atomically:

Plain Text
Training Job: Needs 8 GPU pods
Cluster: Only 5 GPUs available

With Gang Scheduling:
├─ All 8 pods remain pending
├─ No resources wasted
├─ Smaller jobs can run
└─ Once 8 GPUs available → all schedule together

Key principle: All or nothing.

Implementation Challenge

Kubernetes 1.35 introduces a native Workload API for gang scheduling (Alpha), but I discovered it requires feature gates that caused kubelet instability:

Shell
# Attempted native approach
--feature-gates=WorkloadAwareScheduling=true

# Result: kubelet failed to start
Error: "context deadline exceeded"

Solution: Use scheduler-plugins — the mature, production-tested implementation.

Hands-On Testing

Repository: lab2-gang-scheduling

Setup:

Shell
# Automated installation
./setup-gang-scheduling.sh

# What it installs:
# 1. scheduler-plugins controller
# 2. PodGroup CRD
# 3. RBAC permissions

Key discovery: It works with the default Kubernetes scheduler — no custom scheduler needed!

Test 1: Small Gang (Success Case)

YAML
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-gang
spec:
  scheduleTimeoutSeconds: 300
  minMember: 3   # Requires 3 pods minimum

Shell
# Create 3 pods with the gang label
for i in {1..3}; do
  kubectl apply -f training-worker-$i.yaml
done

Result:

Plain Text
NAME                READY   STATUS    AGE
training-worker-1   1/1     Running   6s
training-worker-2   1/1     Running   6s
training-worker-3   1/1     Running   6s

All pods scheduled within 1 second of each other!
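The training-worker manifests live in the lab repository rather than in the article. As a rough sketch (not the author's exact file), one worker pod for this test might look like the following, assuming the scheduler-plugins coscheduling convention of linking pods to their PodGroup via the scheduling.x-k8s.io/pod-group label; the image and resource values are placeholders:

YAML
# Hypothetical training-worker-1.yaml: one member of the 3-pod gang above
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-1
  labels:
    # Assumed coscheduling label associating the pod with the PodGroup "training-gang"
    scheduling.x-k8s.io/pod-group: training-gang
spec:
  containers:
    - name: worker
      image: busybox:1.36          # placeholder workload
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "200m"
          memory: "128Mi"

The other two workers would differ only in name; the PodGroup's minMember: 3 is what makes the three of them schedule (or stay pending) as a unit.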
PodGroup status:

YAML
status:
  phase: Running
  running: 3

Test 2: Large Gang (All-or-Nothing)

Now let's prove gang behavior by creating a gang that's too large:

YAML
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: large-training-gang
spec:
  minMember: 5

Shell
# Create 5 pods requesting 600m CPU each
# Total: 3000m (exceeds our 2 vCPU VM)
for i in {1..5}; do
  kubectl apply -f large-training-$i.yaml
done

All 5 pods stay pending, proving all-or-nothing behavior.

Result:

Plain Text
NAME               READY   STATUS    AGE
large-training-1   0/1     Pending   15s
large-training-2   0/1     Pending   15s
large-training-3   0/1     Pending   15s
large-training-4   0/1     Pending   15s
large-training-5   0/1     Pending   15s

Event:

Plain Text
Warning  FailedScheduling  60s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu

Perfect gang behavior: all pods pending, no partial scheduling, no wasted resources!

Comparison: With vs. Without Gang Scheduling

Scenario | Without gang | With gang
Small gang (3 pods, enough resources) | Schedule individually | All schedule together
Large gang (5 pods, insufficient resources) | ❌ Partial: 2-3 Running, rest Pending | All remain Pending
Resource efficiency | Wasted (partial gang can't work) | Efficient (resources available for other jobs)
Deadlock prevention | No protection | Protected

Production Considerations

Alpha feature warning: Not recommended for production yet
Scheduler-plugins is the mature alternative
The native API will improve in K8s 1.36+

Production alternatives:

Volcano Scheduler
KAI Scheduler (NVIDIA)
Kubeflow with scheduler-plugins

Key Takeaways

Critical for AI/ML: Distributed training needs gang scheduling
Prevents deadlocks: All-or-nothing prevents resource waste
Works today: scheduler-plugins is production-ready
Alpha status: The native API needs maturation

Best use cases:

PyTorch/TensorFlow distributed training
Apache Spark jobs
MPI applications
Any multi-pod workload

Feature 3: Structured Authentication Configuration (GA)

Theory: The Authentication Configuration Challenge

Traditional Kubernetes authentication uses command-line flags on the API server:

Shell
kube-apiserver \
  --oidc-issuer-url=https://accounts.google.com \
  --oidc-client-id=my-client-id \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups \
  --oidc-username-prefix=google: \
  --oidc-groups-prefix=google:

Problems:

Command lines become extremely long
Difficult to validate before restart
No schema validation
Hard to manage multiple auth providers
Requires API server restart for changes

What's New in K8s 1.35

Structured authentication configuration moves auth config to YAML files:

YAML
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://accounts.google.com
      audiences:
        - my-kubernetes-cluster
    claimMappings:
      username:
        claim: email
        prefix: "google:"
      groups:
        claim: groups
        prefix: "google:"

Benefits:

Clear, structured format
Schema validation
Version controlled
Easy to manage multiple providers
Better error messages

Hands-On Testing

Repository: lab3-structured-auth

⚠️ Warning: This lab modifies the API server configuration. While safe in minikube, this is risky in production without proper testing.

The challenge: Modifying the API server configuration requires editing static pod manifests — get it wrong and your cluster breaks.
My approach: Create backup firstTest in disposable minikubeVerify thoroughly before production Test: GitHub Actions JWT Authentication I configured the API server to accept JWT tokens from GitHub Actions: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://token.actions.githubusercontent.com audiences: - kubernetes-test claimMappings: username: claim: sub prefix: "github:" Implementation steps: Plain Text # 1. Create auth config cat > /tmp/auth-config.yaml <<EOF [config above] EOF # 2. Copy to minikube minikube cp /tmp/auth-config.yaml /tmp/auth-config.yaml # 3. Backup API server manifest minikube ssh sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/backup.yaml # 4. Add authentication-config flag sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml # Add: --authentication-config=/tmp/auth-config.yaml API server manifest showing authentication-config flag added API Server Restart: The API server automatically restarts when the manifest changes: Shell kubectl get pods -n kube-system -w | grep kube-apiserver Verification: Shell # Check authentication-config flag is active minikube ssh "sudo ps aux | grep authentication-config" Process showing --authentication-config=/tmp/auth-config.yaml flag API verification: Shell # Check authentication API is available kubectl api-versions | grep authentication Result: Shell authentication.k8s.io/v1 Success! Structured authentication is working. Before/After Comparison Before: YAML spec: containers: - command: - kube-apiserver - --advertise-address=192.168.49.2 - --authorization-mode=Node,RBAC After: YAML spec: containers: - command: - kube-apiserver - --authentication-config=/tmp/auth-config.yaml # NEW! - --advertise-address=192.168.49.2 - --authorization-mode=Node,RBAC Multiple Providers Example The structured format makes multiple auth providers easy: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://token.actions.githubusercontent.com audiences: [kubernetes-test] claimMappings: username: {claim: sub, prefix: "github:"} - issuer: url: https://accounts.google.com audiences: [my-cluster] claimMappings: username: {claim: email, prefix: "google:"} - issuer: url: https://login.microsoftonline.com/{tenant-id}/v2.0 audiences: [{client-id}] claimMappings: username: {claim: preferred_username, prefix: "azuread:"} Key Takeaways Production-ready: GA status, safe for critical clustersBetter management: Clear structure beats command-line flagsMulti-provider: Easy to configure multiple identity providersRequires restart: API server must restart to load config Best use cases: Organizations with multiple identity providersComplex authentication requirementsDynamic team structuresCompliance requirements Feature 4: Node Declared Features (Alpha) Theory: The Mixed-Version Cluster Problem During Kubernetes cluster upgrades, you typically have a rolling update: Plain Text Cluster During Upgrade: ├─ node-1 (K8s 1.34) → Old features ├─ node-2 (K8s 1.34) → Old features ├─ node-3 (K8s 1.35) → New features ✅ └─ node-4 (K8s 1.35) → New features ✅ The challenge: Scheduler doesn't know which nodes support which featuresPods using K8s 1.35 features might land on 1.34 nodes → FailManual node labeling requiredHigh operational overhead What Is Node Declared Features? 
Nodes automatically advertise their supported Kubernetes features:

YAML
status:
  declaredFeatures:
    - GuaranteedQoSPodCPUResize
    - SidecarContainers
    - PodReadyToStartContainersCondition

Benefits:

Automatic capability discovery
Safe rolling upgrades
Intelligent scheduling
Zero manual configuration

Hands-On Testing

Repository: lab4-node-features

This Alpha feature requires enabling a feature gate in the kubelet configuration.

Initial state:

Shell
kubectl get --raw /metrics | grep NodeDeclaredFeatures

Result:

Shell
kubernetes_feature_enabled{name="NodeDeclaredFeatures",stage="ALPHA"} 0

The feature is disabled by default.

Enabling the Feature

Shell
minikube ssh

# Backup kubelet config
sudo cp /var/lib/kubelet/config.yaml /tmp/backup.yaml

# Edit kubelet config

Add the feature gate:

YAML
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  NodeDeclaredFeatures: true   # ADD THIS
authentication:
  anonymous:
    enabled: false

Kubelet config after the featureGates addition.

Restart kubelet:

Shell
sudo systemctl restart kubelet
sudo systemctl status kubelet

Verification

Shell
# Check that the node now declares features
kubectl get node minikube -o jsonpath='{.status.declaredFeatures}' | jq

Result:

JSON
[
  "GuaranteedQoSPodCPUResize"
]

Success! The node is advertising its capabilities.

The Connection to Lab 1

Notice something interesting? The declared feature is GuaranteedQoSPodCPUResize, the exact capability we tested in Lab 1!

What this means:

A node running K8s 1.35 knows it supports in-place pod resizing
It advertises this capability automatically
The scheduler can route pods requiring this feature here
Older nodes (K8s 1.34) wouldn't declare this feature

Testing Feature-Aware Scheduling

Shell
# Create a pod
kubectl apply -f feature-aware-pod.yaml

# Check scheduling
kubectl get pod feature-aware-pod

Result:

Plain Text
NAME                READY   STATUS    RESTARTS   AGE
feature-aware-pod   1/1     Running   0          7s

Complete test flow: feature declared, pod created, and successfully scheduled.

Pod successfully scheduled on a feature-capable node!
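The feature-aware-pod.yaml used above comes from the lab repository and isn't reproduced in the article. As a rough sketch (my assumption of its shape, not the author's exact manifest), it could be as simple as a Guaranteed QoS pod with a CPU resize policy, i.e., a pod that actually exercises the capability the node declares; the image is a placeholder:

YAML
# Hypothetical feature-aware-pod.yaml: a pod that exercises GuaranteedQoSPodCPUResize
apiVersion: v1
kind: Pod
metadata:
  name: feature-aware-pod
spec:
  containers:
    - name: app
      image: nginx:1.27            # placeholder image
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"
        limits:
          cpu: "250m"              # requests = limits -> Guaranteed QoS
          memory: "128Mi"
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired   # CPU can be resized in place

Until feature-based node affinity lands (see the next section), the scheduler places this pod normally; the declared feature simply confirms the node can honor the resize policy.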
Future: Smart Scheduling

In future Kubernetes versions (when this reaches Beta/GA), you'll be able to:

YAML
apiVersion: v1
kind: Pod
metadata:
  name: resize-requiring-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/declared-feature-InPlacePodVerticalScaling
                operator: Exists   # Only schedule on nodes with this feature
  containers:
    - name: app
      image: myapp:latest

Key Takeaways

Automatic discovery: Nodes advertise capabilities without manual config
Safe upgrades: Mixed-version clusters handled intelligently
Feature connection: Links to the Lab 1 in-place resize capability
Alpha status: Requires a feature gate; not production-ready

Best use cases:

Rolling cluster upgrades
Mixed-version environments
Feature-dependent workloads
Testing new capabilities

Lessons Learned: What Worked and What Didn't

Challenges Encountered

Alpha features are tricky: The native Workload API caused kubelet failures. Solution: use the mature scheduler-plugins instead. Lesson: Alpha doesn't mean "almost ready."

QoS constraints are not well-documented: I spent time debugging resize failures before discovering the QoS class immutability requirement. Lesson: Test thoroughly, document findings.

API server modifications are risky: They required a careful backup strategy; minikube made recovery easy. Lesson: Always test in disposable environments first.

What Worked Well

GA features are solid: In-place resize was flawless and structured auth had no issues; both are ready for production.

Scheduler-plugins maturity: More reliable than native Alpha APIs and production-tested by many organizations. Lesson: Mature external projects beat Alpha native features.

Azure VM testing environment: Realistic conditions, easy to reset, cost-effective (~$2 total). Lesson: Cloud VMs are ideal for feature testing.

Production Readiness Assessment

Ready for Production

1. In-place pod vertical scaling (GA): Stable, tested, documented. Real cost savings (30-40%) with clear constraints (QoS preservation). Recommendation: Deploy to production now.

2. Structured authentication configuration (GA): Mature, well-designed, and better than command-line flags, though it requires an API server restart. Recommendation: Use for new clusters; migrate existing ones carefully.

Use With Caution ⚠️

3. Gang scheduling (Alpha): The native API is unstable; use scheduler-plugins instead (production-ready). Essential for AI/ML workloads. Recommendation: Use scheduler-plugins, not the native API.

4. Node Declared Features (Alpha): Requires a feature gate and has limited current value, but will be critical when GA. Recommendation: Wait for Beta/GA unless testing upgrades.

Cost and Time Investment

Testing Environment Costs

Azure VM: Standard_D2s_v3
Duration: 8 hours of testing
Compute cost: ~$0.77 (VM stopped between sessions)
Storage cost: ~$0.10
Total: Less than $1 for comprehensive testing

Time Investment

Activity | Time
Environment setup | 30 min
Lab 1 (In-place resize) | 1.5 hours
Lab 2 (Gang scheduling) | 2 hours
Lab 3 (Structured auth) | 1 hour
Lab 4 (Node features) | 1.5 hours
Documentation | 1.5 hours
Total | 8 hours

ROI: Knowledge gained far exceeds time invested. Testing prevented production issues.
Recommendations for Your Kubernetes Journey If You're Running K8s 1.34 or Earlier Upgrade path: 1.34 → 1.35 is straightforwardFocus on GA features first: In-place resize, structured authTest in dev/staging: Use my repository as starting pointMeasure impact: Track cost savings from in-place resize If You're Running AI/ML Workloads Implement gang scheduling immediately: Use scheduler-pluginsTest distributed training: Prevent resource deadlocksMonitor scheduling: Ensure all-or-nothing behavior workingPlan for native API: Will mature in K8s 1.36+ If You're Managing Large Clusters Structured auth: Migrate now for better managementRolling upgrades: Plan for node feature declaration (future)Cost optimization: In-place resize reduces over-provisioningMulti-tenancy: Gang scheduling prevents noisy neighbor issues Complete Repository All code, scripts, and detailed instructions are available: GitHub: https://github.com/opscart/k8s-135-labs Each lab includes: Detailed theory and backgroundStep-by-step instructionsAutomated scripts where possibleTroubleshooting guidesProduction recommendationsRollback procedures Conclusion Kubernetes 1.35 brings meaningful improvements to production workloads: For cost optimization: In-place pod resize delivers real savings (30-40% in my tests)Eliminates over-provisioning for bursty workloadsNo application changes required For AI/ML workloads: Gang scheduling prevents resource deadlocksEssential for distributed trainingScheduler-plugins provides production-ready solution For operations: Structured authentication simplifies managementNode declared features will improve rolling upgradesBetter observability and debugging The bottom line: K8s 1.35 GA features are production-ready and deliver immediate value. Alpha features show promising future directions but need more maturation. Connect: Blog: https://opscart.comGitHub: https://github.com/opscartLinkedIn: linkedin.com/in/shamsherkhan Other projects: Kubectl-health-snapshot – Kubernetes Optimization Security Validatork8s-ai-diagnostics – Kubernetes AI Diagnostics References Kubernetes 1.35 Release NotesKEP-1287: In-Place Pod Vertical ScalingScheduler-Plugins DocumentationKEP-3331: Structured Authentication ConfigurationKEP-4568: Node Declared Features

By Shamsher Khan
Failure Handling in AI Pipelines: Designing Retries Without Creating Chaos
Failure Handling in AI Pipelines: Designing Retries Without Creating Chaos

Retries have become an integral part of AI tools and systems. In most systems I have seen, teams approach failures with blanket retrying. This often yields duplicate work, cost spikes, wasted compute, and operational instability. Every unnecessary retry triggers another inference call, an embedding request, or a downstream write, without improving the outcome.

In most early-stage AI tools, the pattern is: if a request fails, a retry is added. If the retry succeeds intermittently, the logic is considered sufficient. This approach works fine while the application is in a test environment or seeing light usage; as soon as it faces higher traffic and concurrent execution, retries begin to dominate system behavior. Consequences like these become visible:

Increased token usage and cost
Inconsistent latency
Repeated processing of the same job
Workers look busy, but the queues are not draining
Logs that carry no useful signal

To avoid these consequences, AI tools must treat failures as structured states and respond appropriately to their nature. At a minimum, failures should be sorted into three broad categories.

1. Transient (Retryable)

Temporary failures should be retried with appropriate backoff. Examples: timeouts, HTTP 429 rate limits, 5xx upstream errors, short-lived network interruptions.

2. Permanent (Non-Retryable)

For these, retries won't change the outcome, so they should not be retried. Examples: invalid payloads, schema mismatches, missing required fields, authentication errors, incorrect model configuration, API key failures, policy violations.

3. Unknown (Quarantine)

Any failure that cannot be confidently classified into the two categories above should be marked as unknown. These should not be retried indefinitely; they require controlled handling, often through quarantine or dead-letter routing. Examples: inconsistent upstream data, unexpected response structures, edge cases, unhandled exceptions.

Let's ground this in a real-world AI application. Consider an AI-based data enrichment workflow inside a multi-tenant SaaS platform. A typical job within this workflow is structured as:

Step 1: The system receives source data.
Step 2: An LLM is invoked to normalize or enrich selected fields.
Step 3: The enriched output is written to a database.
Step 4: An event is emitted for downstream indexing or analytics.

This flow appears straightforward. The complexity arises when any individual step fails, and the ideal response depends on the nature of the failure. A few examples:

The LLM returns a 429 rate-limit response. The workflow should retry with bounded backoff.
The LLM returns a 503 temporary outage. Retrying may also be reasonable.
The payload is missing a required field, such as title. Retrying will not resolve the issue; the job should be marked failed with a clear reason.
The tenant configuration lacks a required model name. This is a configuration error rather than a transient failure, so no retry is needed.
A database write times out. Retry behavior depends on idempotency guarantees and write semantics.

Simple and Powerful Production-Friendly Failure Model

We should produce failure records that operators can read and understand.
For example: JSON { "job_id": "job_84721", "tenant_id": "tenant_A", "stage": "LLM_CALL", "status": "FAILED", "failure_type": "TRANSIENT", "reason": "RATE_LIMIT", "http_status": 429, "attempts": 3, "next_action": "RETRY", "timestamp": "2026-02-12T16:10:00Z" } To understand it better, let’s look at this code defining failure classification and retry policy. Step 1: Defining the Failure Types and Classification Python import random import time from dataclasses import dataclass from typing import Optional, Dict, Any class FailureType: TRANSIENT = "TRANSIENT" # retryable NON_RETRYABLE = "NON_RETRYABLE" UNKNOWN = "UNKNOWN" @dataclass class Failure: failure_type: str reason: str http_status: Optional[int] = None detail: Optional[str] = None def classify_failure(err: Exception, http_status: Optional[int] = None) -> Failure: """ Classify failures into TRANSIENT / NON_RETRYABLE / UNKNOWN. Keep this logic small and explicit. """ # Common transient HTTP statuses if http_status in (408, 429, 500, 502, 503, 504): reason = "RATE_LIMIT" if http_status == 429 else "UPSTREAM_UNAVAILABLE" return Failure(FailureType.TRANSIENT, reason, http_status=http_status) # Auth/config errors are usually permanent until fixed if http_status in (401, 403): return Failure(FailureType.NON_RETRYABLE, "AUTH_OR_PERMISSION", http_status=http_status) # Bad request / schema problems are usually permanent if http_status in (400, 404, 422): return Failure(FailureType.NON_RETRYABLE, "BAD_REQUEST_OR_SCHEMA", http_status=http_status) # Known local validation errors if isinstance(err, ValueError): return Failure(FailureType.NON_RETRYABLE, "INPUT_VALIDATION", detail=str(err)) # Everything else: quarantine unless you have a reason to retry return Failure(FailureType.UNKNOWN, "UNCLASSIFIED_EXCEPTION", detail=str(err)) Step 2: Retry Policy With Exponential Backoff and Jitter Python @dataclass class RetryPolicy: max_attempts: int = 5 base_delay_sec: float = 0.5 # initial delay max_delay_sec: float = 15.0 # cap jitter_ratio: float = 0.2 # +/- 20% randomness def compute_backoff(policy: RetryPolicy, attempt: int) -> float: # Exponential backoff: base * 2^(attempt-1), capped delay = min(policy.base_delay_sec * (2 ** (attempt - 1)), policy.max_delay_sec) # Add jitter to avoid synchronized retries jitter = delay * policy.jitter_ratio return max(0.0, delay + random.uniform(-jitter, jitter)) Step 3: A Wrapper That Applies Classification and Policy Python def run_with_failure_handling( *, job_id: str, tenant_id: str, stage: str, policy: RetryPolicy, fn, fn_kwargs: Dict[str, Any] ) -> Dict[str, Any]: """ Runs a single stage (e.g., LLM call) with: - classification - bounded retries - backoff + jitter """ last_failure: Optional[Failure] = None for attempt in range(1, policy.max_attempts + 1): try: return fn(**fn_kwargs) except Exception as e: # If your fn can provide http_status, pass it in explicitly. http_status = getattr(e, "http_status", None) failure = classify_failure(e, http_status=http_status) last_failure = failure # Decide what to do next if failure.failure_type == FailureType.NON_RETRYABLE: return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "STOP" } if failure.failure_type == FailureType.UNKNOWN: # Conservative choice: do not retry unknown failures forever. # Quarantine after 1 attempt (or 2 if you prefer). 
return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "QUARANTINE" } # Transient: retry if attempts remain if attempt < policy.max_attempts: delay = compute_backoff(policy, attempt) time.sleep(delay) continue # Ran out of attempts return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "DLQ" } # Should not reach here, but return last known state return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": (last_failure.failure_type if last_failure else FailureType.UNKNOWN), "reason": (last_failure.reason if last_failure else "UNKNOWN"), "attempts": policy.max_attempts, "next_action": "DLQ" } Failure Handling and Idempotency Failure handling and idempotency are a pair. Idempotency prevents duplicates from retries, whereas failure handling prevents retries from becoming chaotic. If the retry logic is aggressive and jobs are not idempotent, the usage cost will be high, as there will be duplicate inference calls and duplicate DB writes, leading to a confusing state. If the retry logic is disciplined and jobs are idempotent, the system becomes predictable: retries resolve to state checks, only one execution wins, and operators can reprocess failures intentionally. Closing Thoughts In summary, retries are not the enemy for any AI tool; uncontrolled retries are. A production-grade AI tool shouldn’t just retry because of failure; it should understand why the job failed and should retry with discipline when retry proves to be beneficial and stops when it doesn’t.

By Aditya Gupta
Reducing Daily PM Overhead With a Chat-Based AI Agent
Reducing Daily PM Overhead With a Chat-Based AI Agent

As a project manager, I have often encountered time losses caused by daily operational routines. Depending on how many departments are involved in development, these delays can range from two extra days per task to one or even two weeks for a relatively small feature. These delays usually occur in processes not directly related to development itself: clarifying requirements, working in task trackers, searching for information, duplicating work, and constantly switching between tasks. This is also supported by research: around 90% of professionals say they regularly lose time because of inefficient processes and tools, and about half of them lose more than 10 hours every week because of this. Against this background, task management tools like Jira play a mixed role. On the one hand, they are an industry standard. On the other hand, developers frequently mention its complexity, heavy configuration, and process overhead. As a result, Jira often appears at the top of “most disliked but still widely used” tool lists. With this in mind, I decided to run an experiment to see whether I could build an AI agent myself that could eliminate these issues — or at least part of them. Before this experiment, I had no prior experience working with AI agents. Task Build an agent that can be integrated into a work messenger and configured so that task creation and monitoring happen entirely inside the messenger, without switching to external tools. Must: Allow developers to update the status of their tasks directly in the chat, without opening Jira.Allow the PM to generate an automatic report showing the current project status. Should: Allow developers to create tasks for themselves by selecting from a predefined project task database. The Agent Agent workflow Architecture and Components Interaction Channel Telegram Bot API – serves as both the inbound channel (messages, buttons, photos) and the outbound channel (menus, task lists, confirmations), and may be replaced with Slack or Microsoft Teams as alternative messaging interfaces. Task Tracker Jira Cloud REST API/Agile API – used for retrieving active sprints, searching issues via JQL, fetching issue details and changelogs, logging work time, attaching files, adding comments, and performing workflow transitions. State Storage Redis – used to store conversational context, including the selected task and the expected user action (e.g., update or close). Summarization module LLM (openai) – generates a concise daily report based on Jira activity and issue changelogs. Inputs, Outputs, and Interfaces Inputs Telegram message: Text commands (/start)Photo messages with a caption (time spent, optional link)Telegram callback_query: Inline button interactions such as update_task, task_<KEY>, close_task_<KEY>, get_tasks, get_report, start_task_<KEY> Outputs Telegram messages: MenusTask listsPrompts requesting a photo and time spentConfirmation messagesDaily summary reportJira operations: Attachments (screenshots)Comments (structured)Worklog entries (time tracking)Workflow transitions (in progress/ready for review)Reading active sprint data and issue changelogs Core Logic: Router → Action Scenarios Entry point: The Telegram Trigger listens for both message and callback_query events (button clicks). Next, the Get Chat ID node normalizes the context (chatId, userId, username), allowing the same workflow to handle both messages and callback buttons consistently. 
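As a rough illustration of what that normalization does (not the author's actual node code), here is a small Python sketch over a raw Telegram update payload. The field names follow the Telegram Bot API; the is_callback flag and function name are my own additions:

Python
# Minimal sketch: normalize a Telegram update into one context shape,
# whether it arrived as a plain message or as a callback_query (button click).
def normalize_update(update: dict) -> dict:
    if "callback_query" in update:
        cq = update["callback_query"]
        return {
            "chat_id": cq["message"]["chat"]["id"],
            "user_id": cq["from"]["id"],
            "username": cq["from"].get("username"),
            "callback_data": cq.get("data"),   # e.g. "task_TOBB-37"
            "is_callback": True,
        }
    msg = update["message"]
    return {
        "chat_id": msg["chat"]["id"],
        "user_id": msg["from"]["id"],
        "username": msg["from"].get("username"),
        "callback_data": None,
        "is_callback": False,
    }

Downstream nodes can then key routing and stored state off this single context shape instead of handling two event formats everywhere.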
All events are then routed through a Router (Switch) node, which branches logic based on the event type: /startupdate_task / task_... (update selected task)close_task / close_task_... (close or transition a task)get_tasks (retrieve active tasks)get_report (daily report)photo upload (incoming photo)info_... / start_task_... (start a task from To Do) Core Use Cases Use Case 1. /start: Show Menu and Active Sprint Information Purpose: It serves as the entry point for all actions and provides a lightweight sprint “dashboard.” The bot sends an HTTP request to the Jira Agile API to retrieve the active sprint for a specific board (in this example: board/34, state=active).Node Get Sprint Data extracts: sprint name, sprint goal, and calculates days_left until the sprint endDate.Node Send Menu sends a message with interactive buttons: Update task Close task Get task Get a report The developer sees this as a bot within the interface of a messenger they are already familiar with: Use case 1 Use Case 2. Update Task: Select a Task → Send a Screenshot With Time → Bot Updates Jira Automatically This flow is implemented as a mini-dialog, with conversational state stored in Redis. 2.1. Selecting a Task to Update When the user clicks update_task, the bot retrieves a list of active tasks via JQL (statuses such as In Progress, In Review, Ready For Review).The Get Active Tasks To Send node builds an inline keyboard with buttons in the format task_<KEY>.The Send Tasks node sends this list to the user as clickable buttons. 2.2. Persisting the Selected Task (State) When the user selects a specific task (e.g., task_TOBB-37), the workflow stores state in Redis: key: TOBB-<chatId>value: { "action": "update", "task": "<callback_data>" } The Prompt Update node then asks the user to: “Send a screenshot and include the time spent in the caption (e.g., 2h 30m).” Use case 2 2.3. Photo Received: Validate → Upload to Jira → Log Time When a photo is received, the Analyze Photo node performs three checks: Whether a task is selected in Redis (otherwise: “no task selected to update”)Whether a photo is presentWhether the caption contains a valid time format (regex such as 2h 30m) If all checks pass, the following pipeline is executed: Retrieve the file from Telegram (getFile) and download the imageUpload to Jira: Attach the image as an issue attachmentAdd a comment with time spent: Structured comment containing “time spent + screenshot reference”Jira Log Work: Create a worklog entry: (/rest/api/3/issue/<key>/worklog),Clear Saved Task: Remove state from Redis,Send Task Updated: Send confirmation and return to the main menu. Use case 2.3 Key effect for reducing “noise”: The bot enforces a minimal Definition of Done for task updates with a screenshot and time spent. This significantly reduces empty updates and fragmented chat messages. Use Case 3. Close Task: Similar Flow With an Additional Status Transition The close flow mirrors Update Task, but additionally requires a link to file in the caption. The user selects a task from In Progress (close_task_<KEY>).State is saved in Redis as { "action": "close", "task": ... }.When a photo is received, the bot extracts: time spent and link (URL from the caption). 
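The exact parsing logic isn't shown in the article. A minimal sketch of how a caption such as "2h 30m https://example.com/pr/42" could be turned into Jira-ready values might look like this; the regexes and the parse_caption helper are mine, not the author's:

Python
import re

# Hypothetical caption parser: extract time spent ("2h 30m") and an optional URL.
def parse_caption(caption: str) -> dict:
    time_match = re.search(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?", caption or "", re.IGNORECASE)
    hours = int(time_match.group(1) or 0)
    minutes = int(time_match.group(2) or 0)
    if hours == 0 and minutes == 0:
        raise ValueError("caption must include time spent, e.g. '2h 30m'")
    url_match = re.search(r"https?://\S+", caption)
    return {
        "time_spent_seconds": hours * 3600 + minutes * 60,
        "link": url_match.group(0) if url_match else None,
    }

The resulting seconds value is what would feed the worklog call the article already makes against /rest/api/3/issue/<key>/worklog, and the extracted link is what the close flow adds to the Jira comment.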
The bot then: Attaches the screenshotAdds a comment containing the linkLogs work timePerforms a Jira workflow transition to Ready For Review (transition id "2" in the JSON)Clears Redis stateSends confirmation and returns the menu Practical outcome: Task closure becomes formalized and reproducible: proof of work (screenshot), time accounting (worklog), and a deterministic next workflow state. Use Case 4. Start Task: move from To Do to In Progress In the “new tasks” branch, the bot runs a JQL query:status IN ("To Do") and displays tasks as buttons info_<KEY>.When info_<KEY> is clicked, the bot sends: a link to the task and a Start task button.On start_task_<KEY>, the bot performs a Jira transition to In Progress (transition id "21")Sends “task started”, and returns the menu. Use Case 5. PM Report: Daily Summary via LLM This is the only place where the LLM is used as an actual “reasoning component.” Get Jira Day Activity: Retrieves issues updated in the last 24 hours(updated >= -1d ORDER BY updated DESC).For each issue, the changelog is fetched (Get Issue Details).Format Activity Data computes: Number of status changes,Number of unique issues affected,How many moved to Done / In Progress / In Review,Authors of changes,A list of changes (key, summary, from → to, date).LLM summary: converts this data into a concise 5–7 sentence English report.Send report: delivers the report to Telegram. Outcome: The PM receives a human-readable daily summary without manually navigating Jira or Slack. Conclusion As a result of this experiment, we ended up with an agent that does more than simply “pass data through.” It performs three critical functions: Enforces rules (data hygiene): task updates and closures require explicit proof of work (a screenshot) along with time spent (and, in some cases, a link).Maintains conversational context: using Redis, the agent remembers which task the user selected and what action is currently expected (update vs. close).Executes Jira actions automatically: handling attachments, comments, worklog entries, workflow transitions, and generating a daily LLM-based summary report. Building an agent like this required a subscription and 3–4 weeks of work by someone with no prior preparation or technical background. An experienced person could assemble a similar agent in about four working days, plus one additional day for bugfix. This is especially beneficial during the pre-production stage, when a manager can not only design the product plan but also build an agent to simplify the delivery process. As a result, during production, there is no need to be distracted by routine tasks, repetitive actions, or constant switching to the Jira interface. In addition, an agent like this — and the tools that allow it to be built — provides several advantages: It saves the time of technical specialists, who no longer need to be distracted by such improvements, as they can be implemented by management independently.It gives managers autonomy from technical specialists — once they understand how their agent works, they can adapt and refine it to suit development needs.An agent integrated as a bot into an existing and familiar messenger helps developers stay focused on their tasks within an interface they already know. This reduces stress within the team. Potential I arrived at this idea only during the experiment itself, but perhaps agents could become the first step toward changing outdated team structures and reshaping daily routines. 
According to classic agile principles, one PM effectively manages a team of 5–10 people. This is believed to make daily stand-ups more efficient. However, what if production were organised so that an agent acted as the link between team members and the PM? In this setup, the PM would monitor not seven people directly, but a single agent connected to 5–7 teams. The agent could handle the initial technical checks and generate daily reports accessible at any time, allowing the PM to step into the production process only when something goes off plan. Thank you.

By Evgeniy Tolstykh
When Million Requests Arrive in a Minute: Why Reactive Auto Scaling Fails and the Predictive Fix
When Million Requests Arrive in a Minute: Why Reactive Auto Scaling Fails and the Predictive Fix

Reactive autoscaling is a critical safety net. Demand rises, metrics spike, policies trigger, and capacity increases. But flash-crowd events, product drops, major campaigns, and limited-inventory moments do not ramp. They cliff. Users arrive at once, and reactive scaling is structurally late because "scale triggered" is only the start of the journey to usable capacity. If your demand spike arrives faster than your system can warm up, reactive scaling will lag no matter how well you tune it.

The fix is planning and verification: scaling before the event and proving the system is ready before customers arrive. This article outlines a practitioner approach: schedule-aware, tier-based predictive scaling using capacity targets and an executor that verifies readiness.

Why Reactive Scaling Loses Against Flash Crowds

Reactive scaling assumes:

Demand ramps gradually enough to be detected early.
Signals (CPU, request rate, latency) change soon enough to trigger action.
Provisioning time is short relative to demand growth.
Workloads are ready to serve traffic as soon as they are "up."

Flash crowds violate all four. Time is consumed by provisioning compute, registering capacity and passing health checks, application warm-up (caches and connection pools), and dependency readiness (datastores, rate limits, downstream saturation). The result is predictable: traffic arrives instantly, while usable capacity arrives minutes later, after customers have already experienced errors and latency.

The Pivot: Treat Peak Traffic Events as Planned Operational Events

Peak traffic is unpredictable in volume but often predictable in timing. Drops, campaigns, and major announcements have scheduled start times. That enables a different operating model:

Scale ahead of time instead of waiting for metrics to turn red.
Define what "ready" means beyond desired capacity.
Continuously verify readiness as the event approaches.

The questions shift from "What is the load right now?" to: What event is coming (and when)? How risky is it (tier)? What capacity do critical services need? And when must scaling begin so the system is ready by start time?

A Practitioner Architecture: Control Plane, Policy Engine, Executor

A robust predictive scaling solution typically looks like three components:

1. Control Plane (Operations Hub)

The control plane orchestrates the workflow and maintains operational state: schedule and window (pre-, during-, and post-event), tier, services in scope, controls (manual overrides/safety locks), and an audit trail. It triggers actions as events enter the pre-scale window and coordinates readiness checks through the peak period.

2. Policy Engine (Config-Driven Capacity Targets)

The policy engine maps tier + service identity → capacity target. The key design choice: capacity is configuration, not code. Define tiers such as BASELINE (normal day), ELEVATED (higher demand), and PEAK (launch posture). Store tier targets in version-controlled config so service owners can adjust them safely, with review, without deploying code to change capacity.

3. Scaling Executor (Actuation With Verification)

The executor applies targets to your scaling mechanism (autoscaling groups, container orchestrators, platform scaling APIs) and verifies that reality matches intent. Teams often treat "set desired = X" as success. It is not.

Fig. 1: Reactive auto scaling architecture

The goal: healthy, routed, warmed capacity equals the target before T-0.
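The article doesn't show what "capacity as configuration" looks like on disk. A minimal sketch of a version-controlled tier-target file might look like the following; the service names, numbers, and keys are invented for illustration, with the readiness thresholds borrowed from the checklist later in this article:

YAML
# Hypothetical tier targets, reviewed like any other config change
tiers:
  BASELINE:            # normal day
    checkout-api:
      min_replicas: 20
    inventory-api:
      min_replicas: 12
  ELEVATED:            # higher demand
    checkout-api:
      min_replicas: 60
    inventory-api:
      min_replicas: 30
  PEAK:                # launch posture
    checkout-api:
      min_replicas: 180
    inventory-api:
      min_replicas: 90
readiness:
  healthy_capacity_threshold: 0.95    # >= 95% of target healthy and routed
  verify_minutes_before_event: 30

The policy engine resolves tier + service to a target from a file like this; the executor's job is then to make reality match it and to flag drift well before T-0.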
At a minimum, the executor should provide overlap protection, drift detection (non-convergence), bounded scaling, and break-glass override. The Peak Traffic Scaling Playbook: What to Do and When Predictive scaling works when it is operationalized into a repeatable timeline: T-90 to T-60 Minutes: Start Pre-Scale Apply tier targets to critical path services.Start warm-up actions where appropriate (cache priming, connection pre-establishment). T-30 Minutes: Convergence Verification Gate Confirm capacity is provisioned, healthy, and routable.Confirm key SLO signals are stable under synthetic traffic. T-0 Through Tail: Maintain Peak Posture Hold capacity through the predicted peak and tail.Monitor error budget burn and dependency saturation.Allow controlled overrides if reality exceeds forecasts. Tail End: Controlled Scale-Down Step down gradually and confirm stability at each step.Capture metrics for tuning tiers next time. Readiness Verification: Beyond “Desired Count” A readiness checklist should reflect user impact, not just fleet size: Fleet and Routing Healthy targets meet threshold (e.g., ≥ 95% of target)Capacity is registered and receiving trafficNo abnormal imbalance (hot nodes/shards) Application Warm-Up Cache behavior stable (hit rate or warm complete)Connection pools within limitsStartup behavior normal (no repeated crashes/restarts) Dependencies Downstream error rate stableRate limits not near exhaustionDatastore/queue/cache metrics within safe bands A simple drift rule can be highly effective: if time-to-peak traffic is within 30 minutes and healthy capacity is below threshold, escalate early. The goal is to discover “not ready” before customers do. When Reactive Scaling Is Enough Reactive scaling is often sufficient when demand ramps over minutes (not seconds), warm-up time is short, workloads are stateless and immediately ready, or strict budget caps forbid pre-scaling. But for high-heat events where demand arrives faster than readiness can be achieved, predictive scaling is a structural advantage. Bottom Line If your peak arrives faster than your platform can warm up, reactive scaling will always lag. A schedule-aware, tier-based predictive framework paired with readiness verification and strong guardrails shifts peak events from reactive firefighting to planned operations. In flash-crowd systems, readiness beats reactivity.

By Shalini Sudarsan

The Latest Data Engineering Topics

Vibe Coding Is Great for Demo; It’s Not a Strategy for GenAI Value in the SDLC
Vibe coding speeds prototyping, but SDLC gains need guardrails, tests, specs, repo context, and secure workflows: optimizing feedback and quality, not code generation.
March 20, 2026
by Bhupender Saini
· 473 Views
10 Strategies for Scaling Synthetic Data in LLM Training
Learn 10 proven strategies to scale synthetic data for LLM training, ensuring quality, diversity, governance, and long-term model performance.
March 20, 2026
by Chirag Shivalker
· 627 Views
Modern Best Practices for Web Security Using AI and Automation
AI-driven security boosts threat detection, automates response, and enhances proactive defenses for smarter, faster, and safer cybersecurity operations.
March 20, 2026
by Sandesh Basrur
· 731 Views · 1 Like
Kubernetes Scheduler Plugins: Optimizing AI/ML Workloads
Custom Kubernetes scheduler plugins improve GPU utilization by understanding GPU topology, workload types, and gang scheduling requirements.
March 20, 2026
by Varun Kumar Reddy Gajjala
· 635 Views
AI as a SQL Performance Tuning Assistant: A Structured Evaluation
Can AI genuinely provide engineering‑grade SQL optimization insights, or does it mainly offer confident‑sounding but shallow guidance?
March 20, 2026
by Manish Adawadkar
· 601 Views
Why Agentic AI Demands Intent-Based Chaos Engineering
Intent-based chaos engineering tests AI systems with calculated stress, using topology, sensitivity, and SLA insights to ensure predictable resilience.
March 20, 2026
by Sayali Patil
· 626 Views · 4 Likes
Scalable Cloud-Native Java Architecture With Microservices and Serverless
Learn all about scalable, cloud-native architectures with microservices and serverless technologies, boosting agility, performance, and cost-efficiency.
March 20, 2026
by Harris Anderson
· 669 Views
Stop Trusting Your RAG Pipeline: 5 Guardrails I Learned the Hard Way
RAG alone doesn’t stop hallucinations. I use five guardrails: relevance scoring, forced citations, NLI checks, staleness detection, and confidence scoring.
March 20, 2026
by Mayur Vekariya
· 581 Views
Toward Intelligent Data Quality in Modern Data Pipelines
Data quality goes beyond null checks. Learn how GenAI helps detect subtle issues, explain anomalies, and strengthen testing in modern data pipelines.
March 20, 2026
by Sireesha Pulipati
· 523 Views
Why Security Scanning Isn't Enough for MCP Servers
A secure MCP server can still break production. Twenty heuristic rules score readiness by catching missing timeouts, unsafe retries, and absent error schemas.
March 19, 2026
by Nik Kale
· 1,417 Views · 1 Like
Nvidia’s Open Model Super Panel Made a Strong Case for Open Agents
At GTC 2026, Jensen Huang, Aravind Srinivas, Harrison Chase, Mira Murati, and Michael Truell made a compelling case that the future of AI belongs to open agent systems, not just open models.
March 19, 2026
by Corey Noles
· 902 Views
Microsoft Fabric: The Developer's Guide on API Automation of Security and Data Governance
The article discusses popular scenarios for automating governance in Microsoft Fabric, which may help organizations govern Fabric more easily.
March 19, 2026
by Iurii Iurchenko
· 858 Views
From DLT to Lakeflow Declarative Pipelines: A Practical Migration Playbook
Migrating from DLT to Lakeflow is mostly an API refactor, swapping DLT for pipelines, separating streaming and materialized tables, and updating CDC logic.
March 19, 2026
by Seshendranath Balla Venkata
· 828 Views
AI-Assisted Code Review With Claude Code (Terminal)
A practical tutorial, security-first guide from installation to your first code review session with Claude Code in terminal.
March 19, 2026
by Hanna Labushkina
· 942 Views · 2 Likes
Push Filters Down, Not Up: The Data Layer Design Principle Most Developers Learn Too Late
Data fetching without filters or limits is a costly, hidden bug in the backend. API parameters must flow into SQL queries, not filter after full data transfer.
March 19, 2026
by Sanjay Mishra
· 846 Views
Building MCP Hub for DevOps and CI/CD Pipelines
Model Context Protocol links AI with DevOps tools, automating code reviews, deployments, and security to speed up workflows and reduce manual work.
March 19, 2026
by Suman Basak
· 973 Views
Agentic AI: A New Threat Surface
Learn about agentic AI, its autonomous capabilities, and emerging security threats, including memory poisoning, API misuse, and multi-agent vulnerabilities.
March 19, 2026
by Sandesh Basrur
· 736 Views
Zero-Cost AI with Java
Create a zero-cost AI application quickly using Ollama and Java with Spring AI — with no extra costs and full compatibility with other LLMs like OpenAI.
March 18, 2026
by Fernando Boaglio
· 1,513 Views · 1 Like
How Piezoelectric Energy Harvesting Is Solving the Battery Waste Crisis in Industrial IoT
Industrial piezoelectric sensors decouple IIoT reliability from battery dependence that compromises data resolution and responsiveness.
March 18, 2026
by Emily Newton
· 755 Views
How LLMs Reach 1 Million Token Context Windows — Context Parallelism and Ring Attention
Learn how Ring Attention and context parallelism enable LLMs to scale to 10M tokens through distributed GPU training and memory optimization.
March 18, 2026
by Kevin Vu
· 920 Views