Series (2/4): Toward a Shared Language Between Humans and Machines — From Multimodality to World Models: Teaching Machines to Experience
A Comprehensive Analysis of Async Communication in Microservice Architecture
Kubernetes in the Enterprise
Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.
Getting Started With CI/CD Pipeline Security
Java Caching Essentials
The technology landscape is undergoing a profound transformation. For decades, businesses have relied on traditional web-based software to enhance user experiences and streamline operations. Today, a new wave of innovation is redefining how applications are built, powered by the rise of AI-driven development. However, as leaders adopt AI, a key challenge has emerged: ensuring its quality, trust, and reliability. Unlike traditional systems with clear requirements and predictable outputs, AI introduces complexity and unpredictability, making quality assurance (QA) both more challenging and more critical. Business decision-makers must now rethink their QA strategy and investments to safeguard reputation, reduce risk, and unlock the full potential of intelligent solutions. If your organization is investing in AI capabilities, understanding this quality challenge isn't just a technical concern; it's a business necessity that could determine the success or failure of your AI initiatives. In this blog, we'll explore how AI-driven development is reshaping QA — and what organizations can do to ensure quality keeps pace with innovation.

Why Traditional Testing Falls Short

Let's take a practical example. Imagine an interview agent built on top of a large language model (LLM) using the OpenAI API. Its job is to screen candidates, ask context-relevant questions, and summarize responses. Sounds powerful, but here's where traditional testing challenges emerge:

- Non-Deterministic Outputs: Unlike a rules-based form, the AI agent might phrase the same question differently each time. This variability makes it impossible to write a single "pass/fail" test script.
- Dynamic Learning Models: Updating the model or fine-tuning with new data can change behavior overnight. Yesterday's green test might fail today.
- Contextual Accuracy: An answer can be grammatically correct yet factually misleading. Testing must consider not just whether the system responds, but whether it responds appropriately.
- Ethical and Compliance Risks: AI systems can accidentally produce biased or non-compliant outputs. Testing must expand beyond functionality to include fairness, transparency, and safety.

Clearly, a new approach is needed.

AI-Powered Testing

So, what does a modern approach to testing look like? We call it the AI-powered test, a fresh approach that redefines quality assurance for intelligent systems. Instead of force-fitting traditional, deterministic testing methods onto non-deterministic AI models, businesses need a flexible, risk-aware, and AI-assisted framework. At its core, AI-powered testing means:

- Testing at the behavioral level, not just the functional level.
- Shifting the question from "Does it work?" to "Does it work responsibly, consistently, and at scale?"
- Using AI itself as a tool to enhance QA, not just as a subject to be tested.

This approach ensures that organizations not only validate whether AI applications function, but also whether they are reliable, ethical, and aligned with business goals.

Pillars of AI-Powered Testing

To make this shift practical, we recommend you plan your AI QA strategy around the following key pillars:

1. Scenario-Based Validation
Instead of expecting identical outputs, testers validate whether responses are acceptable across a wide range of real-world scenarios. For example, does the Interview Agent always ask contextually relevant questions, regardless of candidate background or job description?
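To show what scenario-based validation can look like in practice, here is a minimal, hedged TypeScript sketch. The askAgent function is a stand-in for whatever call invokes the interview agent, and the mustMention/mustAvoid guardrail terms are illustrative placeholders, not a prescribed rubric; the point is checking behavior across scenarios rather than asserting a single exact output.

TypeScript
// Stand-in for the real agent call (for example, a wrapper around an LLM API).
async function askAgent(s: { role: string; candidateBackground: string }): Promise<string> {
  return `Tell me about a recent project relevant to the ${s.role} role.`; // stubbed reply for illustration
}

type Scenario = {
  role: string;
  candidateBackground: string;
  mustMention: string[]; // terms that signal a contextually relevant question
  mustAvoid: string[];   // phrases that would indicate bias or policy violations
};

// Behavioral check: wording may vary between runs, but every response
// must stay inside the guardrails defined for the scenario.
async function validateScenario(s: Scenario): Promise<boolean> {
  const reply = (await askAgent(s)).toLowerCase();
  const relevant = s.mustMention.some(term => reply.includes(term.toLowerCase()));
  const clean = s.mustAvoid.every(term => !reply.includes(term.toLowerCase()));
  return relevant && clean;
}

// Run the same agent across many real-world scenarios instead of asserting one golden output.
const scenarios: Scenario[] = [
  { role: "Data Engineer", candidateBackground: "5 years in ETL", mustMention: ["project", "pipeline"], mustAvoid: ["age", "marital status"] },
  { role: "Frontend Developer", candidateBackground: "career switcher", mustMention: ["project", "javascript"], mustAvoid: ["age", "marital status"] },
];

Promise.all(scenarios.map(validateScenario)).then(results =>
  console.log(`${results.filter(Boolean).length}/${results.length} scenarios within guardrails`)
);

A real harness would run each scenario many times and score pass rates rather than a single boolean, but the structure stays the same.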
2. AI Evaluation Through Flexibility
AI systems should be judged on quality ranges rather than rigid outputs. Think of it as setting "guardrails" instead of a single endpoint. Does the AI stay within acceptable tone, accuracy, and intent even if the exact wording varies?

3. Continuous Monitoring and Drift Detection
Since AI models evolve, testing can't be a one-time activity. Organizations must invest in continuous monitoring to detect shifts in accuracy, fairness, or compliance. Just as cybersecurity requires constant vigilance, so too does AI assurance.

4. Human Judgment
Automation is powerful, but human judgment remains essential. QA teams should include domain experts who can review edge cases and make subjective assessments that machines can't. For business leaders, this means budgeting not only for automation tools but also for skilled oversight.

The Future Is Moving From 'AI for Testing' to 'Testing for AI'

AI is reshaping every part of the technology ecosystem, and software testing is no exception. We have AI-driven test automation tools like Robonito, KaneAI, testRigor, Testim, Loadmill, and Applitools. These are powerful allies that use AI to make traditional testing faster and more efficient. They can write test scripts from plain English, self-heal when the user interface changes, and intelligently identify visual bugs. These tools are excellent for improving the efficiency of testing traditional applications. But the real frontier is "AI platforms designed to test other AIs." This is where the future lies. Think of these as "AI test agents," specialized AI systems built to audit, challenge, and validate other AI. This emerging space is transforming how we think about quality assurance in the age of intelligent systems.

Key Directions in "Testing for AI"

- LLM evaluation platforms: New platforms are being developed to rigorously test applications powered by LLMs. For example, such a platform can generate thousands of diverse, adversarial prompts for an Interview Agent to check for robustness, test for toxic or biased outputs, and compare the model's responses against a predefined knowledge base to check for factual accuracy and hallucinations.
- Model monitoring and bias detection tools: Companies like Fiddler AI and Arize AI provide platforms that monitor your AI in production. They act as a continuous QA system, flagging data drift (when real-world data starts to look different from training data) and detecting in real time if the model's outputs are becoming biased or skewed.
- Agent-to-agent testing: There are many companies working on AI agent testing and agent-to-agent testing tools. For example, LambdaTest recently launched a beta version of its agent-to-agent testing platform — a unified environment for testing AI agents, including chatbots and voice assistants, across real-world scenarios to ensure accuracy, reliability, efficiency, and performance.

Why This Matters for Business Leaders

From a C-suite perspective, investing in AI-powered testing isn't just a technical decision; it's a business imperative. Here's why:

- Customer trust. A chatbot that provides incorrect medical advice or a hiring tool that shows bias can damage brand reputation overnight. Quality isn't just about uptime anymore; it's about ethical, reliable experiences.
- Regulators are watching. AI regulation is tightening worldwide. Whether it's GDPR, the EU AI Act, or emerging US frameworks, organizations will be held accountable for how their AI behaves. Testing for compliance should be part of your risk management strategy.
- Cost of failure. With AI embedded in core business processes, errors don't just affect a single user; they can cascade across markets and stakeholders. Proactive QA is far cheaper than reactive damage control.
- Competitive advantage. Companies that can assure reliable, responsible AI will differentiate themselves. Just as "secure by design" became a market expectation in software, "trustworthy AI" will become a business differentiator.

Building Your AI QA Roadmap

So, how should an executive get started? Here's a phased approach we recommend to clients:

Phase 1: Assess Current Gaps
Map where AI is currently embedded in your systems. Identify areas where quality risks could impact customers, compliance, or brand reputation.

Phase 2: Redefine QA Metrics
Move beyond pass/fail. Introduce new metrics such as accuracy ranges, bias detection, explainability scores, and response relevance.

Phase 3: Invest in AI-Powered Tools
Adopt platforms that can automate scenario generation, inconsistency detection, and continuous monitoring. Look for solutions that scale with your AI adoption.

Phase 4: Build Cross-Functional Oversight
Build a governance model that includes compliance, legal, and business leaders alongside IT. Quality must reflect business priorities, not just technical checklists.

Phase 5: Establish Continuous Governance
Treat AI QA as an ongoing discipline, not a project phase. Regularly review model performance, monitor for drift, and update guardrails as the business evolves.

Final Thoughts

The era of AI-driven applications is here, and it's accelerating. But with innovation comes responsibility. Traditional QA approaches built for deterministic systems are no longer sufficient. By adopting an AI-powered testing strategy, organizations can ensure their AI systems are not only functional but also ethical, reliable, and aligned with business goals. The message for leaders is clear: if you want to harness AI as a competitive advantage, you must also invest in the processes that make it trustworthy. Modern QA is no longer just about preventing bugs; it's about protecting your brand and your customers, and securing your organization's future in an AI-first world.
Categorical data is data with a predefined set of values. Using "Child," "Adult," or "Senior" instead of a person's age as a number is one example of age categorization. However, before using categorical data, one must know about the various forms of categorical data. First of all, categorical data may or may not be defined in an order. To say that the size of a box is small, medium, or large means that there is an order described as small < medium < large. The same does not hold for, say, sports equipment, which could also be categorical data, but differentiated by names like dumbbell, grippers, or gloves; that is, the items have no inherent order. Those that can be ordered are known as "ordinal" while those where there is no such ordering are "nominal" in nature. Many times, an analyst converts numerical data to categorical data to make things easier. Besides using "Adult," "Child," or "Senior" class instead of age as a number, there can also be special cases, such as using "regular item" or "accessory" for equipment. In many problems, the output is also categorical. Whether a customer will churn or not, whether a person will buy a product or not, or whether an item is profitable — these are classic classification problems often tackled in AI consulting engagements. All problems where the output is categorical are known as classification problems. R provides various ways to transform and handle categorical data. A simple way to transform data into classes is to use the split and cut functions in R, or the cut2 function in the Hmisc library. Let's use the iris dataset to categorize data. This dataset is available in R and can be called by using the 'attach' function. The dataset consists of 150 observations across five features: sepal length, sepal width, petal length, petal width, and species.

Plain Text
attach(iris) #Call the iris dataset
x=iris #store a copy of the dataset into x

#using the split function
list1=split(x, cut(x$Sepal.Length, 3)) #This will create a list of 3 split on the basis of sepal.length
summary(list1) #View the class ranges for list1
          Length Class      Mode
(4.3,5.5] 6      data.frame list
(5.5,6.7] 6      data.frame list
(6.7,7.9] 6      data.frame list

#using Hmisc library
library(Hmisc)
list2=split(x, cut2(x$Sepal.Length, g=3)) #This will also create a similar list but with left boundary included
summary(list2) #View the class ranges for list2
          Length Class      Mode
[4.3,5.5) 6      data.frame list
[5.5,6.4) 6      data.frame list
[6.4,7.9] 6      data.frame list

The first list, list 1, divides the dataset into 3 groups based on sepal length, with equal ranges. The second list, list 2, also divides the dataset into 3 groups based on sepal length, but it tries to keep the number of values equal in each group. We can check this using the range function.

Plain Text
#Range of sepal.length
range(x$Sepal.Length) #The output is 4.3 to 7.9

We can see that list 2 consists of three groups: the first has the range 4.3–5.5, the second has the range 5.5–6.4, and the third has the range 6.4–7.9. There is, however, one difference between the output of list1 and list2. List1 ensures the range across the three groups is equal. List2, on the other hand, keeps the number of values in each group roughly equal. An alternative to creating a list of data frames is to add the group range as another feature in the dataset, as shown below.
R
x$class <- cut(x$Sepal.Length, 3) #Add the class label instead of creating a list of data
x$class2 <- cut2(x$Sepal.Length, g=3) #Add the class label instead of creating a list of data

If the classes are to be indexed as numbers 1, 2, 3… instead of their actual range, we can just convert the factor to numeric; here we store the result in a new column, group, so the original class ranges are kept for later use. Using the indexes is also easier than the range of each group.

R
x$group=as.numeric(x$class) #Add the numeric class index as a new column

In our example, the group values will now be either 1, 2, or 3. Suppose we now want to find the number of values in each class. How many rows fall into class 1? Or class 2? We can use the table() function in R to get that count.

R
class_length=table(x$group)
class_length #The sizes are 59, 71 and 20 as indicated in the output below
 1  2  3
59 71 20

This is a good way to get a quick summary of the classes and their sizes. However, this is where it ends. We cannot make further computations or use this information in our dataset. Moreover, class_length is a table and needs to be converted to a Data Frame before it is useful. The issue is that transforming a table into a Data Frame will create the variable names as Var1 and Freq, as the table does not retain the original feature name.

R
#Transforming the table to a Data Frame
class_length_df=as.data.frame(class_length)
class_length_df
#The output is:
  Var1 Freq
1    1   59
2    2   71
3    3   20
#Here we see that the variable is named as Var1. We need to rename the variable using the names() function
names(class_length_df)[1]="group" #Changing the first variable Var1 to group
class_length_df
  group Freq
1     1   59
2     2   71
3     3   20

In this case, where we have a few variables, we can easily rename the variable, but this is very risky in a large dataset where one can accidentally rename another important feature. As I said, there is more than one way to do the same thing in R. All this hassle could have been avoided if there had been a function that would generate our class size as a Data Frame to start with. The "plyr" package has the count() function, which accomplishes this task. Using the count function in the plyr package is as simple as passing the original Data Frame and the variable whose counts we want.

R
#Using the plyr library
library(plyr)
class_length2=count(x,"group") #Using the count function
class_length2
#The output is:
  group freq
1     1   59
2     2   71
3     3   20

The same output, in fewer steps. Let's verify our output.

R
#Checking the data type of class_length2
class(class_length2) #Output is data.frame

The plyr package is very useful when it comes to categorical data. As we see, the count() function is really flexible and can generate the Data Frame we want. It is now easy to add the frequency of the categorical data to the original Data Frame x.

Comparison

The table() function is really useful as a quick summary and, with a little work, can produce an output similar to that given by the count() function. When we go a little further towards N-way tables, the table function transformed to a Data Frame works just like the count() function.
R
#Using the table for 2 way
two_way=as.data.frame(table(subset(x,select=c("class","class2"))))
two_way
      class    class2 Freq
1 (4.3,5.5] [4.3,5.5)   52
2 (5.5,6.7] [4.3,5.5)    0
3 (6.7,7.9] [4.3,5.5)    0
4 (4.3,5.5] [5.5,6.4)    7
5 (5.5,6.7] [5.5,6.4)   49
6 (6.7,7.9] [5.5,6.4)    0
7 (4.3,5.5] [6.4,7.9]    0
8 (5.5,6.7] [6.4,7.9]   22
9 (6.7,7.9] [6.4,7.9]   20

two_way_count=count(x,c("class","class2"))
two_way_count
      class    class2 freq
1 (4.3,5.5] [4.3,5.5)   52
2 (4.3,5.5] [5.5,6.4)    7
3 (5.5,6.7] [5.5,6.4)   49
4 (5.5,6.7] [6.4,7.9]   22
5 (6.7,7.9] [6.4,7.9]   20

The difference is still noticeable. While both outcomes are similar, the count() function omits values that are null or have a size of 0. Hence, the count() function produces cleaner output and outperforms the table() function, which produces frequency tables for all possible combinations of the variables. What if we want the N-way frequency table of the entire Data Frame? In this case, we can simply pass the entire Data Frame into the table() or count() function. However, the table() function will be very slow in this case, as it will take time to calculate the frequencies of all possible combinations of features, whereas the count() function will only calculate and display the combinations where the frequency is non-zero.

R
#For the entire dataset
full1=count(x) #much faster
full2=as.data.frame(table(x))

What if we want to display our data in a cross-tabulated format instead of displaying it as a list? We have the xtabs function for this purpose.

R
cross_tab = xtabs(~ class + class2, x)
cross_tab
           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]        52         7         0
  (5.5,6.7]         0        49        22
  (6.7,7.9]         0         0        20

However, the class of this object is an xtabs table.

R
class(cross_tab)
"xtabs" "table"

Converting the same to a Data Frame regenerates the same output as the table() function does.

R
y=as.data.frame(cross_tab)
y
      class    class2 Freq
1 (4.3,5.5] [4.3,5.5)   52
2 (5.5,6.7] [4.3,5.5)    0
3 (6.7,7.9] [4.3,5.5)    0
4 (4.3,5.5] [5.5,6.4)    7
5 (5.5,6.7] [5.5,6.4)   49
6 (6.7,7.9] [5.5,6.4)    0
7 (4.3,5.5] [6.4,7.9]    0
8 (5.5,6.7] [6.4,7.9]   22
9 (6.7,7.9] [6.4,7.9]   20

There is another difference when we use cross-tabulated output for N-way classification when N > 2. Because we can show only two features in a cross-tabulated format, xtabs divides the data by the third variable and displays cross-tabulated outputs for each value of the third variable. Illustrating the same for class, class2, and Species.

R
threeway_cross_tab = xtabs(~ class + class2 + Species, x)
threeway_cross_tab
, , Species = setosa

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]        45         2         0
  (5.5,6.7]         0         3         0
  (6.7,7.9]         0         0         0

, , Species = versicolor

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]         6         5         0
  (5.5,6.7]         0        28         8
  (6.7,7.9]         0         0         3

, , Species = virginica

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]         1         0         0
  (5.5,6.7]         0        18        14
  (6.7,7.9]         0         0        17

The output becomes larger and harder to read as N increases in an N-way cross-tabulation. In this situation again, the count() function seamlessly produces a clean, easily visualizable output.
R
threeway_cross_tab_df = count(x, c('class', 'class2', 'Species'))
threeway_cross_tab_df
       class    class2    Species freq
1  (4.3,5.5] [4.3,5.5)     setosa   45
2  (4.3,5.5] [4.3,5.5) versicolor    6
3  (4.3,5.5] [4.3,5.5)  virginica    1
4  (4.3,5.5] [5.5,6.4)     setosa    2
5  (4.3,5.5] [5.5,6.4) versicolor    5
6  (5.5,6.7] [5.5,6.4)     setosa    3
7  (5.5,6.7] [5.5,6.4) versicolor   28
8  (5.5,6.7] [5.5,6.4)  virginica   18
9  (5.5,6.7] [6.4,7.9] versicolor    8
10 (5.5,6.7] [6.4,7.9]  virginica   14
11 (6.7,7.9] [6.4,7.9] versicolor    3
12 (6.7,7.9] [6.4,7.9]  virginica   17

The same output is presented in a concise way by count(). The count() function in the plyr package is thus very useful when it comes to counting frequencies of categorical variables. Authored by Chaitanya Sagar, Founder and CEO, Perceptive Analytics. A recognized thought leader in analytics and data science, he frequently writes on performance dashboards, AI integration, and decision intelligence.
A static site does not have to feel frozen. With a bit of JavaScript, a static page can request data from an API and update the page on the fly. That is the whole idea behind an API-only approach: HTML, CSS, and JavaScript live on a CDN, the browser calls APIs for content, and the page updates itself. Why should teams care? It is fast, cheap, and simple. Static files load from a CDN, deploys are trivial, and scale happens without heavy servers. It also works for real sites, like a blog fed by a headless CMS API, a product grid powered by a commerce API, or a contact form that posts to a forms service. This guide covers how the flow works, how to fetch and render data safely, and how to handle speed, SEO, and reliability. It does not teach custom backend builds or advanced app frameworks. The roadmap: start with the API-only model, then learn the basic flow, when to use it, how to pick APIs, how to build dynamic parts, and how to make it fast and reliable.

API-Only: How a Static Site Gets Dynamic Without a Backend

Picture the loop. A user opens a static page. JavaScript runs. It calls an API. The API returns JSON. JavaScript turns JSON into HTML and updates the DOM. That is the full move from static shell to dynamic content. How does this differ from SSR or SSG? With SSR, the server builds the HTML for each request. With SSG, a build step renders HTML ahead of time. With API-only, the browser builds the view after the page loads. Each model has a fit. API-only shines when the page can hydrate with data after first paint. Good dynamic pieces for API-only:

- News feeds and event lists
- Product grids with price and stock
- Site search results and filters
- Comments through a hosted service
- Forms that submit to an API
- Maps and geodata overlays

Know the limits. Heavy login flows, secret tokens, and strict SEO that requires full HTML on first load may not fit. Also, some features that demand protected logic belong on a server. API-only can still use serverless functions or third-party APIs. Those count as APIs that the static site calls, not a full custom server. For a deeper walkthrough of adding dynamic data to a static site with a hosted API, review this clear guide on displaying dynamic content on a Pages static site: Cloud.gov's knowledge base article.

The Basic Flow: Fetch JSON, Render HTML, Repeat as Data Changes

Keep a simple mental model:

- Load static assets from a CDN. The page shell is instant and cacheable.
- Use fetch with async functions to call one or more endpoints. Parse JSON.
- Insert data into the page with DOM updates or small templates.

This can run when the page loads, on button clicks, or on a timer for live updates. Think in three layers: data in, transform, view out. That mental model works for vanilla JS or any small library.
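To make the three-layer loop concrete, here is a minimal sketch in TypeScript. It assumes a hypothetical endpoint (https://api.example.com/posts) that returns a JSON array with title and summary fields, and a container element with id="posts" in the static HTML; adapt the names to your own API.

TypeScript
// Minimal fetch-and-render loop for a static page (hypothetical endpoint and fields).
type Post = { title: string; summary: string };

async function loadPosts(): Promise<void> {
  const container = document.querySelector<HTMLElement>("#posts");
  if (!container) return;

  container.textContent = "Loading...";
  try {
    // 1. Data in: call the API and parse JSON.
    const response = await fetch("https://api.example.com/posts");
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const posts: Post[] = await response.json();

    // 2. Transform: build DOM nodes from the data.
    const list = document.createElement("ul");
    for (const post of posts) {
      const item = document.createElement("li");
      item.textContent = `${post.title}: ${post.summary}`;
      list.append(item);
    }

    // 3. View out: swap the new content into the page.
    container.replaceChildren(list);
  } catch (err) {
    container.textContent = "Could not load posts. Please try again.";
    console.error(err);
  }
}

// Run on load; the same function can be reused on a button click or a timer.
document.addEventListener("DOMContentLoaded", () => void loadPosts());

The HTML shell stays fully static; only the data call and the DOM update happen in the browser.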
When API-Only Shines, and When to Pick Something Else

Good fits:
- Public or semi-public data where anyone can view the content
- Read-heavy content, like blogs or docs with dynamic sections
- Dashboards that can render after load
- Marketing pages with live testimonials or a featured products carousel
- Teams that value low hosting cost and simple deploys

Poor fits:
- Pages that must ship full HTML on first paint for SEO
- Complex auth flows that require server secrets, like client secrets for OAuth
- Write-heavy apps that need strict rules, audits, or secure business logic

Workarounds:
- Use a tiny serverless proxy to handle secrets or strict CORS, then return safe data to the browser.
- Pre-render key pages on a build hook for SEO, and load the rest via API calls after paint.
- Use providers that support public browser keys scoped to domains and routes.

Choosing APIs: REST or GraphQL, Headless CMS, and Quotas

Pick a data source that fits the job. REST offers simple URLs with resources and verbs. GraphQL uses one endpoint and lets the client choose fields. REST tends to be easier for small sites; GraphQL can cut extra fields and calls. Common providers:

- Headless CMS: Contentful, Sanity, or similar, for blog posts and pages
- Spreadsheets as APIs: Airtable, or Google Sheets via API, for quick data tables
- Search APIs: Algolia for instant search
- Commerce APIs: Stripe products, Shopify Storefront for listings
- Forms APIs: Formspree, Getform for contact or lead forms
- Public open data: city, state, or federal datasets

Always check rate limits, pricing tiers, CORS support, and uptime SLAs. A friendly API with poor CORS or low limits can break a launch.

Build the Dynamic Parts: Fetch Data, Render UI, Handle State

The same ideas work with vanilla JS or a light library. Start small. Fetch on load, render a list, and show loading and error states. Then add detail pages and posting forms. Key patterns:

- Fetch and render on page load
- Lists and grids for arrays of items
- Detail pages keyed by an ID from the URL
- Client-side search and filters for small data sets
- Server-side filters and pagination for large data sets
- Form posts to a forms API, with optimistic UI and fallback

For a practical primer that keeps things light, this write-up shows how to attach an API to a static site with clear examples: Raymond Camden's article.

Set Up Clean Data Fetching With Fetch and CORS in Mind

Use fetch to call the endpoint, parse JSON, and handle errors. Set a timeout with AbortController so the UI can fail fast. Wrap calls in a helper so endpoints and headers live in one place. Keep keys and base URLs in a config module, not scattered across the app. CORS matters because the browser blocks cross-origin requests unless the API allows it. Some APIs block direct browser calls. Use a tiny serverless proxy if a private key is required or the provider does not support CORS to the browser. Keep the proxy minimal, return only what the UI needs, and cache responses when safe. Tips (a small helper sketch follows below):

- Centralize API configs and headers
- Use try/catch around await fetch
- Surface clear error messages for the user
- Log technical details for developers, not users

For separation of concerns between serving static HTML and making API requests, this Stack Overflow thread explains the split well: How to serve static or dynamic HTML files with a RESTful API.
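As a sketch of those tips, the helper below centralizes the base URL and headers and uses AbortController to fail fast after a timeout. The config values and endpoint path are placeholders, not a real provider's API.

TypeScript
// Hypothetical central API config; any keys used here must be browser-safe and domain-restricted.
const API = {
  baseUrl: "https://api.example.com",
  headers: { Accept: "application/json" },
  timeoutMs: 8000,
};

// Small wrapper: one place for endpoints, headers, timeouts, and error handling.
async function apiGet<T>(path: string): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), API.timeoutMs);
  try {
    const response = await fetch(`${API.baseUrl}${path}`, {
      headers: API.headers,
      signal: controller.signal,
    });
    if (!response.ok) {
      // Surface a clear message for the UI; log details for developers.
      throw new Error(`Request to ${path} failed with status ${response.status}`);
    }
    return (await response.json()) as T;
  } finally {
    clearTimeout(timer);
  }
}

// Usage (illustrative): callers stay small and consistent.
// const products = await apiGet<Product[]>("/products?fields=name,price");

Callers wrap apiGet in try/catch, show a friendly message on failure, and log the thrown error for debugging.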
Render Lists, Detail Pages, Search, and Filters Without a Framework

Most UI needs boil down to repeatable patterns:

- Lists and grids: loop through an array, clone a small template, and fill fields. Use DocumentFragment for speed.
- Detail views: read an ID from the query string or hash, fetch one item, then render. Fall back to a not-found state if the ID is missing.
- Search and filters: for small data, filter in memory. For big data, pass query params to the API and render results. Debounce input events to avoid spam calls.
- Pagination and infinite scroll: request a page at a time, append new items, and stop when there is no next page. Keep the URL in sync with the current page or filter.

Keep templates simple and testable. A few small helpers can keep the DOM code tidy.

Loading, Errors, and Empty States That Feel Friendly

Users judge how an app behaves when things go wrong. Set expectations with clear statements:

- Show skeletons or spinners while loading
- Use short, human error messages with a hint to retry
- Offer a retry button on network errors
- Show empty states that teach the next step, not a blank screen

Add a safe timeout per request. If a search takes too long, cancel and invite the user to try again. For accessibility, update aria-live regions with status messages and keep focus stable on updates. Do not trap keyboard users in modals or spinners.

Keep Secrets Safe: API Keys, Tokens, and a Minimal Proxy

Never ship private secrets in the browser. Public keys are fine only if the provider marks them as public and allows origin restrictions. Options that work:

- Use browser-safe keys with strict domain and route rules
- Store secrets in serverless or edge functions, and call those functions from the client
- Use OAuth flows that are designed for public clients, like PKCE

Avoid keeping sensitive tokens in localStorage. Prefer memory during the session or secure cookies from a proxy when needed. Rotate keys, limit scopes, and watch logs for abuse.

Make It Fast, Secure, and SEO-Friendly For Real Users

The polish moves a demo into production. Focus on caching, payload size, SEO, and monitoring. The advice here works on Netlify, Vercel, Cloudflare Pages, GitHub Pages, or any static host.

Cache Smart With the Browser, CDN, and a Service Worker

Use HTTP caching to get instant loads:

- Set Cache-Control headers on static assets to a long max-age with fingerprinted file names
- Use ETags for API responses where data changes often
- Prefer stale-while-revalidate so repeat visits feel instant

If a serverless proxy sits in front of third-party APIs, cache responses at the edge when data can be stale for a short time. For advanced use, a simple service worker can cache API JSON. Serve the cached data right away, then refresh in the background and update the view when new data arrives.

Speed Wins: Cut Payloads, Lazy Load, and Batch Requests

Quick wins add up:

- Request only the fields needed, not full objects
- Compress JSON at the edge when the host supports it
- Debounce search inputs to reduce calls (a small sketch follows the table below)
- Batch small requests into one when possible
- Lazy load sections when they scroll into view
- Use an image CDN for thumbnails, with WebP or AVIF formats
- Measure with Lighthouse and WebPageTest, then fix the biggest issues first

A small table helps teams decide where to optimize first.

Area | Symptom | Quick fix
API payload | Slow JSON transfers | Reduce fields, gzip, cache at the edge
Images | Heavy thumbnails | WebP/AVIF, responsive sizes, CDN
JS execution | Main thread feels blocked | Split bundles, defer non-critical JS
Network chatter | Too many round-trips | Batch requests, prefetch on hover
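The debounce quick win from the list above can be this small. The search endpoint and the #search input are hypothetical placeholders; swap in the page's own API helper.

TypeScript
// Generic debounce: wait for the user to stop typing before calling the API.
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Hypothetical search call; replace with the real endpoint or a shared fetch helper.
async function runSearch(query: string): Promise<void> {
  if (!query.trim()) return;
  const results = await fetch(`/api/search?q=${encodeURIComponent(query)}`).then(r => r.json());
  console.log("results", results); // render into the page instead of logging
}

// Fire at most one request per 300 ms pause instead of one per keystroke.
const input = document.querySelector<HTMLInputElement>("#search");
if (input) {
  input.addEventListener("input", debounce(() => void runSearch(input.value), 300));
}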
SEO for Client-Rendered Pages: Pre-Render, Metadata, and Structured Data

Client-rendered pages can be crawled, but first-paint HTML may be thin. For key pages, add a light pre-render step. Trigger a build with a webhook when content changes. Render static HTML for top routes, then hydrate with fresh data after load. Practical steps:

- Set titles, meta descriptions, canonical tags, and social tags in the static shell
- Add JSON-LD structured data where it fits, like article or product schema
- Provide fallback HTML for key sections so crawlers see some content
- Keep URLs clean and stable, and avoid hash-only routing for indexable pages

For a nice example of pulling API data at build time to reduce runtime calls, see this write-up on adding dynamic content to a static site at build time: Griffa's post.

Reliability and Monitoring: Timeouts, Retries, and Graceful Fallbacks

Network hiccups happen. Design for them:

- Set per-request timeouts
- Use exponential backoff when retrying
- Circuit-break after repeated failures and show a friendly notice
- Cache the last good data and display it for a short window
- Log errors with correlation IDs so issues are trackable
- Wire light alerts with your host or a service like Sentry

A small status widget can show whether the API is healthy. If the service is down, switch to a cached mode and avoid hammering the endpoint.

Quick Comparison: API-Only vs. SSR vs. SSG

Approach | Where HTML is built | Best for | Tradeoffs
API-only | Browser at runtime | Dynamic sections on static sites | SEO can be lighter on first paint
SSR | Server per request | SEO-critical pages, auth-heavy | Higher cost, more infrastructure
SSG | Build time | Content sites with stable pages | Needs rebuilding on content change

For more on mixing static and dynamic, including build-time pulls and runtime calls, this practical guide shows the spectrum well: Cloud.gov's knowledge base article.

Conclusion

API-only sites follow a simple path: ship a static shell, fetch data from APIs, and render it fast and safely. Start with one small section, then scale the pattern across the site. The result is a fast, low-cost site that still feels alive. Quick start checklist:

- Pick one section to make dynamic
- Choose an API with good docs and CORS
- Wire fetch, show loading, and friendly errors
- Add caching in the browser, CDN, or proxy
- Pre-render key pages and add structured data

Ready to try it? Build a small card list fed by a headless CMS or a spreadsheet API. Keep the first slice tiny, get it live, then grow with confidence.
Language models give the impression of conversing with us as if they really understood. But behind this fluency lies an illusion: machines share neither our experiences nor our intentions. This article explores the fundamental barriers that prevent any genuine mutual understanding: the absence of lived experience, the absence of a world, and the radical difference in how reasoning works. Anyone who has ever translated between two human languages can't help but notice that the task is quite complex, even when mastering both languages perfectly. Language holds many subtleties and ambiguities, unspoken meanings, and things that are simply untranslatable from one language to another. These difficulties often have their roots in cultural grounding as well as in lived experience, frames of thought that shape languages. But as soon as translation moves from human-to-human to human-to-machine, the difficulty takes on an entirely different dimension.

Absence of Shared Experience

We then face obstacles such as the lack of shared experience and cultural memory, the absence of a perception of the world. The point is that the machine has no grounding in reality. Finally, we have to face the radical divergence between our intentions and meaningful emotions versus the purely logical operations executed by the machine. In other words, we are talking about the gap between the richness of human meaning and the mechanics of calculation. Machine language, its algorithms, and its applications therefore achieve only an imitation of human language, even when their performance is often striking. What artificial intelligence manages to produce reduces to mathematical formalisms, logic, and statistics. Its algorithms and processes break down our intentions into strictly executable instructions, precisely where human language conveys experiences, emotions, and values.

Symbol Grounding Problem

This is precisely the question raised by the Symbol Grounding Problem (SGP): how could a machine attach true meaning to words without going through an embodied experience of the world? Today, it is clear that large language models (LLMs) give the illusion of human conversation. They generate coherent texts, but in reality, they are limited to predicting word sequences, without demonstrating genuine cultural or contextual understanding. Faced with these limits, we will see that several paths are emerging: Fei-Fei Li (widely regarded as a pioneer of modern computer vision and today co-director of Stanford's Human-Centered AI Institute) advocates a 3D spatial intelligence, Yann LeCun (often described as one of the founding fathers of deep learning and now Chief AI Scientist at Meta) is developing "world models" with the goal of simulating reality, and other researchers are exploring hybrid approaches to language processing, from automatic translation between programming languages (TransCoder) to quantum methods.

"Before we reach human-level AI, we will have to reach cat- and dog-level AI. We're far from that. We're still missing something important.
Despite the linguistic capabilities of LLMs, a house cat has much more common sense and understanding of the world than any LLM." — Yann LeCun

IBM (long recognized as a pioneer in computing and now a leading player in quantum and AI research) is part of this movement by combining these two axes: its research on "world models" aims to equip machines with internal representations of physical dynamics, while its work on Quantum Natural Language Processing (QNLP) seeks to overcome the current limits of automatic translation by leveraging the properties of quantum computing.

Experience, Lived Reality

Humans speak by drawing on cultural memory and shared human experience; machines, by contrast, manipulate symbols without ever linking them to lived reality. This is exactly what Stevan Harnad formulated under the name "Symbol Grounding Problem": as long as a system is limited to processing signs that refer only to other signs, it remains trapped in a closed dictionary, unable to connect words to things. Where a human understands "cat" because they have seen, heard, or touched one, a machine merely aligns statistical correlations. This absence of embodied experience explains why the language produced by today's large models, however fluent it may be, remains on the surface of what a conversation truly is. We have all observed that exchanges with these models feel like natural conversation, but it is only an illusion generated by word sequence prediction. Behind this fluency, there is no intention, no emotional charge, no social memory, let alone morality or consciousness. Their outputs reflect the corpora on which they were trained, including their biases. A striking example is presented in the article "AI Speaks for the World — But Whose Humanity Does It Learn From?" (DZone), which shows how the models end up privileging dominant voices at the expense of others. I encourage you to check it out to better grasp the extent of this bias.

The Absence of a World

The second barrier lies in what could be called "the absence of a world." Human language is fundamentally rooted in a connection to reality: we describe what we see, we anticipate actions, and we interpret gestures. Machines, by contrast, have no direct access to a sensory or motor foundation. Their syntax is without a world. A striking example illustrates this absence of a world. When asked to generate the image of a glass filled to the brim, a generative AI almost always draws a glass literally half full. Why? Because it has no direct connection to physical reality: for it, "filled" corresponds to the dominant representations in its training data, where a "full" glass is often shown… half full. This simple mismatch reveals that it does not understand the concrete notion of "to the brim," which is obvious to any human who has ever seen liquid right up to the rim. You can try it yourself by asking any image generator: "Produce a photorealistic image of a wine glass, filled to the brim." As Fei-Fei Li reminds us, "Language does not exist in nature. Humans not only survive, live, and work, but we also build a civilization beyond language." "The world is in 3D." To understand a scene is also to grasp the permanence of objects, spatial coherence, and the laws of physics.
Without embodied perception, AI can only simulate fragments of reality, often incoherent, or simply concepts it has never experienced, such as the notion of "filled to the brim."

The Mode of Functioning

Finally, the third barrier lies in the fact that, fundamentally, humans and machines simply do not function in the same way. Human language carries emotions, intentions, and acknowledged ambiguities. Machine language, on the other hand, is functional: it breaks down instructions and executes them without ever projecting meaning. Where we correct our words to better persuade or move, the machine produces without ever reviewing what it generates. These three obstacles show that, despite what we have called the "illusion of conversation" of today's models, building a true common language with machines remains a real challenge. From my point of view, it is precisely this triple fracture, experience, perception, and intention, that explains why LLMs, despite their ability to surprise and impress us, remain far from any genuine understanding.

To Be Continued…

These limits reveal that, for now, machine language remains floating in a void. But if understanding cannot arise spontaneously, can it be taught? In the next article, we will examine the ways of giving machines a kind of perceptual and spatial experience through multimodality and world models. Links to the previous articles in this series: Series: Toward a Shared Language Between Humans and Machines

References

- Abbaszade, Mina; Zomorodi, Mariam; Salari, Vahid; Kurian, Philip. "Toward Quantum Machine Translation of Syntactically Distinct Languages." [link]
- Brodsky, Sascha. "World models help AI learn what five-year-olds know about gravity." IBM. [link]
- Gubelmann, Reto. "Pragmatic Norms Are All You Need – Why The Symbol Grounding Problem Does Not Apply to LLMs." [link]
- Harnad, Stevan. "The Symbol Grounding Problem." [link]
- LEO (Linguist Education Online). "Human Intelligence in the Age of AI: How Interpreters and Translators Can Thrive in 2025." [link]
- Meta AI. "Yann LeCun on a vision to make AI systems learn and reason like animals and humans." [link]
- Opara, Chidimma. "Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis." [link]
- Qi, Zia; Perron, Brian E.; Wang, Miao; Fang, Cao; Chen, Sitao; Victor, Bryan G. "AI and Cultural Context: An Empirical Investigation of Large Language Models' Performance on Chinese Social Work Professional Standards." [link]
- Roziere, Baptiste; Lachaux, Marie-Anne; Chanussot, Lowik; Lample, Guillaume. "Unsupervised Translation of Programming Languages." [link]
- Strickland, Eliza. "AI Godmother Fei-Fei Li Has a Vision for Computer Vision." IEEE Spectrum. [link]
- Trott, Sean. "Humans, LLMs, and the symbol grounding problem (pt. 1)." [link]
- Nature. "Chip-to-chip photonic quantum teleportation over optical fibers." 2025. [link]
Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Database Systems Research

With databases powering nearly every modern application nowadays, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing? We're diving into key topics such as:

- The databases and query languages developers rely on
- Experiences and challenges with cloud migration
- Practices and tools for data security and observability
- Data processing architectures and the role of real-time analytics
- Emerging approaches like vector and AI-assisted databases

Join the Database Systems Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team
As a data engineer, I recently encountered a challenging scenario that highlighted the complexities of Apache Spark memory management and Spark internal processing. Despite working with what seemed like a moderate dataset (25 GB), I experienced a driver Out of Memory (OOM) error that halted my data replication job. In this article, I will discuss Spark's internal processing complexity and memory management, which can help us build a resilient data replication solution.

Scenario

Let's jump into the code snippet that is causing the OOM issue.

Scala
//Read all data files
val df = spark.read.json(allFiles: _*)

//Create DDBRecord dataset
val recordsDS: Dataset[DDBRecord] = df.map(
  row => Record.fromRow(row, dynamoDBPartitionKey, Option(dynamoDBSortKey)))

//Aggregate using custom aggregator
val aggregatedDS = recordsDS.groupByKey { r =>
  (r.ddbPartitionKey.orNull, r.ddbSortKey.orNull)
}.agg(aggregator.toColumn.name("result"))

//Convert back to dataFrame from the transformed dataset
val resultDf = aggregatedDS.map(r => r._2.toRow)(RowEncoder.apply(df.schema))

if (resultDf.isEmpty) {
  return
}
//apply the net changes in the Apache Iceberg target

Our data file is DynamoDB exported data, which typically has rows containing Metadata, Keys, OldImage, and NewImage. Our goal is to apply the net changes to the Apache Iceberg target. The objective of the code above is to determine net changes by keeping track of the oldest and newest images in the data files. For example, if there are multiple updates, we need to keep only the oldest (OldImage) and the latest (NewImage) data. We are using a custom model class (DDBRecord) and a custom aggregator to achieve this task. Here are additional details about the data and infrastructure.

- Dataset: 25 GB of DynamoDB exported JSON data
- Cluster: 9-machine Glue Spark cluster (G.8X)
- Cluster configuration: 32 vCPU per machine, 128 GB memory per machine

The Unexpected Challenge

Surprisingly, while performing a simple isEmpty() operation on a transformed DataFrame, the Spark driver immediately threw an Out of Memory error. This was unexpected, given the seemingly modest data size and powerful cluster configuration. As we know, Spark defers the execution of transformations until an action is called. In this scenario, Spark starts the execution when the action resultDf.isEmpty is triggered. The driver is throwing an OOM error as part of executing this action.

Dive Deep

Below is a typical stack trace for an OOM issue where Spark experiences low memory and ultimately OOM when it tries to execute the action. As is evident from the stack trace, Spark calls the explainString method to generate a plain-text representation of the whole plan. It uses plain text for Spark events and logging purposes. You can control the plain-text generation by setting the config spark.sql.explain.mode to Simple, Formatted, Extended, etc.
Java Caused by: java.lang.OutOfMemoryError at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) at java.lang.StringBuilder.append(StringBuilder.java:141) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:203) at scala.collection.immutable.Stream.addString(Stream.scala:691) at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:377) at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:376) at scala.collection.immutable.Stream.mkString(Stream.scala:760) at org.apache.spark.sql.catalyst.util.package$.truncatedString(package.scala:179) at org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:307) at java.lang.String.valueOf(String.java:2994) at java.lang.StringBuilder.append(StringBuilder.java:136) at org.apache.spark.sql.catalyst.expressions.If.toString(conditionalExpressions.scala:105) at java.lang.String.valueOf(String.java:2994) at java.lang.StringBuilder.append(StringBuilder.java:136) at org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:213) at org.apache.spark.sql.catalyst.trees.TreeNode.formatArg(TreeNode.scala:918) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$formatArg$1(TreeNode.scala:911) at scala.collection.immutable.List.map(List.scala:297) at org.apache.spark.sql.catalyst.trees.TreeNode.formatArg(TreeNode.scala:911) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$argString$1(TreeNode.scala:931) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.AbstractIterator.addString(Iterator.scala:1431) at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:377) at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:376) at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:379) at scala.collection.AbstractIterator.mkString(Iterator.scala:1431) at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:949) at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:956) at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:402) at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:404) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:1070) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:1098) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:1098) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:1098) at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:991) at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:657) at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:284) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:313) at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:267) **at 
org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:246)** at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3920) at org.apache.spark.sql.Dataset.isEmpty(Dataset.scala:3201) ... 16 more

But why does Spark explain the plan during execution? Spark needs to generate a plan string representation for logging, Spark UI, optimization, and debugging purposes. Spark has an internal event bus where this information is sent for other components to act on. For example, the Spark UI listens to the event bus and uses the event for UI rendering. Whenever an action is invoked, Spark materializes the logical plan and determines the corresponding physical plan for execution. As part of this step, Spark calls explainString and sends the result to the event bus.

Why would a plan string representation cause OOM? On a further deep dive, we find that the generated plan spans multiple pages. There are two factors contributing to the massive plan: first, the schema size, and second, the dataset created from the DataFrame. In our implementation, we are creating a Dataset from a DataFrame to produce net changes. The Spark Dataset is a distributed collection of data that combines the benefits of RDDs and DataFrames. Since a Dataset has schema plus type safety, Spark adds null safety for every field. If we have nested attributes, this type safety check is applied recursively, making the whole plan a giant string. Below is a typical part of the plan when an attribute Keys from the DataFrame is used to represent the same in a Dataset. Type safety checks make the plan string representation bigger.

createexternalrow(if (isnull(Keys#786)) null else createexternalrow(if (Keys#786.isNullAt) null else if (isnull(Keys#786.uuid)) null else createexternalrow(if (Keys#786.uuid.isNullAt) null else Keys#786.uuid.S.toString, StructField(S,StringType,true)), StructField(uuid,StructType(StructField(S,StringType,true)),true))

It is fine to generate a plan. However, storing the plain-text representation of the plan is expensive and causes memory issues when the source schema itself is big. The Spark Jira issue mentions the same OOM issue that we are facing. In Spark, there are two properties to control the plan generation: maxToStringFields and maxPlanStringLength. These properties only affect the formatting or truncate output; they don't control the traversal itself.

Deep Dive Conclusion: The Spark plan text representation is massive due to the Dataset's internal representation and the way Spark traverses the plan to generate its string form. This is contributing to a significant memory drop and, in some cases, OOM.

What Is the Fix?
Our investigation indicates that the current implementation creates a Dataset from a DataFrame, and its internal representation is massive when Spark represents the plan as a plain string. We have a couple of options to reduce the memory footprint and address the issue. We need a plan for debugging purposes, but not at the cost of compromising the functionality. We would prefer to turn the explain plan off to address the low-memory issue. However, this feature is not available in Spark. This pull request talks about adding an Off mode. Instead, we optimized our implementation so that it does not create a Dataset, because the Dataset's internal representation is what contributes to the massive string representation of the plan. Below is the equivalent code to compute the net changes using DataFrame operations only. Essentially, we group records by key (sorted by timestamp) and pick the oldImage and newImage for net changes.

Scala
//Instead of using a dataset, perform dataframe aggregation only.
val df = spark.read.json(allFiles: _*)
val netChangesDf = df.groupBy(keyColumns: _*)
  .agg(
    min_by(col("OldImage"), col("_write_ts")).as("_earliest_old"),
    max_by(col("NewImage"), col("_write_ts")).as("_latest_new"),
    max_by(col("Keys"), col("_write_ts")).as("_final_keys"),
    max_by(col("Metadata"), col("_write_ts")).as("_final_metadata")
  )

if (netChangesDf.isEmpty) {
  return
}
//proceed with the rest

My local testing clearly shows a significant improvement in memory consumption after the fix. The image below shows the object size during execution. After the fix, the object size has reduced drastically.

Quick Rules of Thumb for OOM

- Driver → User Memory risk: anything that pulls results back (collect, show, take with a large n); anything that expands metadata (schema explosion, file listing).
- Executor → Unified Memory risk: anything that redistributes data (join, groupBy, shuffle); caching or broadcast of large datasets.

The table below helps you narrow down the investigation.

Aspect | Driver | Executor
Schema, logical/physical plans | Lives in User Memory | Not relevant
File metadata (S3 file listing, partition info) | User Memory | Not relevant
Task scheduling metadata | User Memory | Not relevant
Row collection (collect(), count(), isEmpty()) | Results land in User Memory | Executors compute them
Shuffle buffers, joins, aggregations | Minimal at driver (control plane only) | Unified Memory
Caching (persist, broadcast) | Broadcasts take Storage Memory | Cached data in executors takes Storage Memory
UDF execution | Rare (only if driver executes an action locally) | Executors run UDFs in Execution/User Memory
OOM cause | Often schema explosion, collect(), file listing, huge plan | Often shuffle/join spill, skew, insufficient execution memory

Conclusion

Memory management in Spark is nuanced. Even seemingly simple operations can lead to unexpected memory challenges. I would advise keeping track of memory usage and applying the appropriate configuration and code improvements to build a reliable data processing solution.
The Challenge
The cost of maintaining a system capable of processing tens of thousands of near-simultaneous requests, but which spends greater than 90 percent of its time in an idle state, cannot be justified. Containerization promised the ability to scale workloads on demand, which includes scaling down when demand is low. Maintaining many pods across multiple clusters just so the system doesn't waste time in the upscaling process contradicts the point of workload containerization.

The Solution
Fermyon produces a platform called SpinKube that leverages WebAssembly (WASM), originally created to execute small elements of bytecode in untrusted web browser environments, as a means of executing small workloads in large quantities in Kubernetes server environments. Because WASM workloads are smaller and easier to maintain, pods can be spun up just-in-time as network demand rises without consuming extensive time in the process. And because WASM consists of pre-compiled bytecode, it can be executed on server platforms powered by Ampere® Altra® without all the multithreading and microcode overhead that other CPUs typically bring to their environments — overhead that would, in less compute-intensive circumstances such as these, be unnecessary anyway.

Implementation
As a demonstration of SpinKube's effectiveness, ZEISS Group's IT engineers partnered with Ampere, Fermyon, and Microsoft to produce a system that spins up new WASM pods as demand rises in a just-in-time scenario. The demonstration proves that, in practice, a customer order processing system running on SpinKube, compared to a counterpart running with conventional Kubernetes pods, yields dramatic benefits.

"When we looked at a runtime-heavy workload with Node.js, we could process the same number of orders in the same time with an Ampere processor VM environment for 60% cheaper than an alternative x86 VM instance" - Kai Walter, Distinguished Architect, ZEISS Group

Background: The Overprovisioning Conundrum
It's still one of the most common practices in infrastructure resource management today: overprovisioning. Before the advent of Linux containers and workload orchestration, IT managers were told that overprovisioning their virtual machines was the proper way to ensure resources were available at times of peak demand. Indeed, resource oversubscription used to be taught as a "best practice" for VM administrators. The intent at the time was to help admins maintain KPIs for performance and availability while limiting the risks involved with overconsumption of compute, memory, and storage. At first, Kubernetes promised to eliminate the need for overprovisioning entirely by making workloads more granular, more nimble, and easier to scale. But right away, platform engineers discovered that using Kubernetes' autoscaler add-on to conjure new pods into existence at the very moment they're required consumed minutes of precious time. From the end user's point of view, minutes might as well be hours. Today, there's a common provisioning practice for Kubernetes called paused pods.
Simply put, it's faster to wake up sleeping pods than create new ones on the fly. The practice involves instructing cluster autoscalers to spin up worker pods well in advance of when they're needed. Initially, these pods are delegated to worker nodes where other pods are active. Although they're maintained alongside active pods, they're given low priority. When demand increases and the workload needs scaling up, the status of a paused pod is changed to pending. This triggers the autoscaler to relocate it to a new worker node where its priority is elevated to that of other active pods. Although it takes just as much time to spin up a paused pod as a standard one, that time is spent well in advance. Thus, the latency involved with spinning up a pod gets moved to a place in time where it doesn't get noticed. Pod pausing is a clever way to make active workloads seem faster to launch. But when peak demand levels become orders of magnitude greater than nominal demand levels, the sheer volume of overprovisioned, paused pods becomes cost prohibitive.

ZEISS Stages a Breakthrough
This is where ZEISS found itself. Founded in 1846, ZEISS Group is the world leader in scientific optics and optoelectronics, with operations in over 50 countries. In addition to serving consumer markets, ZEISS's divisions serve the industrial quality and research, medical technology, and semiconductor manufacturing industries. The behavior of customers in the consumer markets can be very correlated, resulting in occasional large waves of orders with a lull in activity in between. Because of this, ZEISS's worldwide order processing system can receive as few as zero customer orders at any given minute, and over 10,000 near-simultaneous orders the next minute. Overprovisioning isn't practical for ZEISS. The logic for an order processing system is far more mundane than, say, a generative AI-based research project. What's more, it's needed only sporadically. In such cases, overprovisioning involves allocating massive clusters of pods, all of which consume valuable resources, while spending more than 90 percent of their existence essentially idle. What ZEISS requires of its digital infrastructure instead is: worker clusters with much lower profiles, consuming far less energy while slashing operational costs; behavior management capabilities that allow for automatic and manual alterations to cloud environments in response to rapidly changing network conditions; and planned migration in iterative stages, enabling the earlier order processing system to be retired on a pre-determined itinerary over time, rather than all at once.

"The whole industry is talking about mental load at the moment. One part of my job... is to take care that we do not overload our teams. We do not make huge jumps in implementing stuff. We want our teams to reap the benefits, but without the need to train them again. We want to adapt, to iterate — to improve slightly." - Kai Walter, Distinguished Architect, ZEISS Group

The solution to ZEISS's predicament may come from a source that, just three years ago, would have been deemed unlikely, if not impossible: WebAssembly (WASM). It's designed to run binary, untrusted bytecode on client-side web browsers — originally, pre-compiled JavaScript. In early 2024, open source developers created a framework for Kubernetes called Spin.
This framework enables event-driven, serverless microservices to be written in Rust, TypeScript, Python, or TinyGo, and deployed in low-overhead server environments with cold start times measurable only in milliseconds. Fermyon and Microsoft are principal maintainers of the SpinKube platform. This platform incorporates the Spin framework, along with the containerd-shim-spin component that enables WASM workloads to be orchestrated in Kubernetes by way of the runwasi library. Using these components, a WASM bytecode application may be distributed as an artifact rather than a conventional Kubernetes container image. Unlike a container, this artifact is not a self-contained system packaged together with all its dependencies. It's literally just the application compiled into bytecode. After the Spin app is applied to its designated cluster, the Spin operator provisions the app with the foundation, accompanying pods, services, and underlying dependencies that the app needs to function as a container. This way, Spin redefines the WASM artifact as a native Kubernetes resource. For its demonstration, ZEISS developed three Spin apps in WASM: a distributor and two receivers. The distributor app receives order messages from an ingress queue, then the two receiver apps process the orders, the first handling simpler orders that take less time and the second handling more complex orders. The Fermyon Platform for Kubernetes manages the deployment of WASM artifacts with the Spin framework. The system is literally that simple. In practice, according to Kai Walter, distinguished architect with ZEISS Group, a SpinKube-based demonstration system could process a test data set of 10,000 orders at approximately 60% less cost for Rust and TypeScript sample applications by running them on Ampere-powered Dpds v5 instances on Azure.

Migration Without Relocation
Working with Microsoft and Fermyon, ZEISS developed an iterative migration scheme enabling it to deploy its Spin apps in the same Ampere arm64-based node pools ZEISS was already using for its existing, conventional Kubernetes system. The new Spin apps would then run in parallel with the old apps without having to first create new, separate network paths, and then devise some means of A/B splitting ingress traffic between those paths.

"We would not create a new environment. That was the challenge for the Microsoft and Fermyon team. We expected to reuse our existing Kubernetes cluster and, at the point where we see fit, we will implement this new path in parallel to the old path. The primitives that SpinKube delivered allows for that kind of co-existence. Then we can reuse Arm node pools for logic that was not allowed on Arm chips before." - Kai Walter, Distinguished Architect, ZEISS Group

WASM apps use memory, compute power, and system resources much more conservatively. (Remember, WASM was created for web browsers, which have minimal environments.) As a result, the entire order processing system can run on two of the smallest, least expensive instance classes available in Azure: Standard DS2 (x86) and D2pds v5 (Ampere Altra 64-bit), both with just two vCPUs per instance.
However, ZEISS discovered in this pilot project that by moving to WASM applications running on SpinKube, it could transparently change the underlying architecture from x86 instances to Ampere-based D2pds instances, reducing costs by approximately 60 percent. SpinKube and Ampere Altra make it feasible for global organizations like ZEISS to stage commodity workloads with high scalability requirements on dramatically less expensive cloud computing platforms, potentially cutting costs by more than one-half without impacting performance.

References
It's Time to Reboot Software Development by Matt Butcher, CEO, Fermyon
Introducing Spin 3.0 by Radu Matei and Michelle Dhanani, Fermyon blog
Building a Serverless Python WebAssembly App with Spin by Matt Butcher, CEO of Fermyon
Taking Spin for a spin on AKS by Kai Walter, Distinguished Architect, ZEISS Group
Cloud Native Processors & Efficient Compute — Ampere Developer Summit session featuring Ampere chief evangelist Sean Varley, ScyllaDB CEO Dor Laor, and Fermyon senior software engineer Kate Goldenring, conducted September 26, 2024
Integrating serverless WebAssembly with SpinKube and cloud services — video featuring Sohan Maheshwar, Lead Developer Advocate, AuthZed
Check out the full Ampere article collection here.
Support vector machines (SVMs) are one of the most powerful and versatile supervised machine learning algorithms. Initially famous for their high performance "out of the box," they are capable of performing both linear and non-linear classification, regression, and outlier detection. For classification tasks, the core idea behind SVM is to find the optimal hyperplane that best separates the different classes in the feature space. In this developer's guide, we'll go beyond a simple fit and predict. We'll walk through the essential practical steps to build, tune, and evaluate a high-performance SVM classifier using Python's Scikit-learn library. We will focus on the details that make the difference between a mediocre model and a production-ready one, including data preprocessing, hyperparameter tuning, and a deep dive into evaluation.

This guide will cover:
How SVMs work (hyperplanes, margins, and the kernel trick).
A critical, must-do step: preparing and scaling your data.
Building a robust training Pipeline.
Tuning the key hyperparameters (C and gamma) with GridSearchCV.
Evaluating the model using a confusion matrix and the ROC-AUC score with code and visualizations.

How Do SVMs Work? The Core Concepts
The Linear Case: Hyperplanes and Margins
Imagine you have data points belonging to two different classes on a scatter plot. The goal of an SVM is to draw a line (or a hyperplane in higher dimensions) that separates these two classes. But it doesn't just draw any line; it finds the optimal line. This optimal hyperplane is the one that has the maximum margin, meaning the largest possible distance between the hyperplane and the nearest data points of each class. These nearest points, the ones that touch the edge of the margin, are called the "support vectors." They are the most critical data points because they alone define the position and orientation of the decision boundary. This focus on the maximum margin and support vectors is what makes SVMs so robust and effective, as it leads to better generalization on unseen data.

The Non-Linear Case: The Kernel Trick
The real power of SVMs becomes apparent when your data isn't linearly separable. What if your classes are arranged in concentric circles? You can't draw a single straight line to separate them. This is where the kernel trick comes in. A kernel is a function that takes your low-dimensional data and projects it into a higher-dimensional space where it does become linearly separable. Imagine your concentric circles in 2D. A kernel function could project this data into 3D, turning the circles into two parallel "bowls," one nested inside the other. Now, in this 3D space, you can easily slide a 2D plane (a hyperplane) right between them to separate the two classes. The most popular kernel is the Radial Basis Function (RBF) kernel, which is the default in Scikit-learn. It's incredibly flexible and can create complex, non-linear decision boundaries.

Step 1: Preparing Data for SVM (A Critical Step)
This is the most common pitfall for developers new to SVMs. SVMs are not scale-invariant. They work by finding the hyperplane that maximizes the distance of the margins. If one feature (e.g., "Salary" in dollars) ranges from 0 to 1,000,000, while another feature (e.g., "Years of Experience") ranges from 0 to 50, the "Salary" feature will completely dominate the distance calculations. Your model will perform terribly. Rule #1 of SVMs: You must scale your features before training.
The most common method is Standardization (or Z-score normalization), which rescales the data to have a mean of 0 and a standard deviation of 1. Let's create some sample data to work with.

Python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_curve,
    auc,
    RocCurveDisplay
)

# 1. Create a synthetic dataset
# We'll make it non-linear to show the power of the RBF kernel
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=0.8,  # Make them a bit hard to separate
    random_state=42
)

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Visualize the unscaled data
# We'll plot the test set to see what we're trying to predict
sns.scatterplot(
    x=X_test[:, 0],
    y=X_test[:, 1],
    hue=y_test,
    palette=['#FF5733', '#335BFF'],
    alpha=0.8
)
plt.title("Unscaled Test Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(["Class 0", "Class 1"])
plt.show()

Step 2: Building a Preprocessing and Training Pipeline
Instead of scaling our data manually (scaler.fit_transform(X_train), scaler.transform(X_test)), the best practice is to use a Pipeline. A Pipeline chains steps together. It will only fit the StandardScaler on the training data and then safely transform both the training and test data. This prevents the cardinal sin of data leakage (letting information from the test set "leak" into your training process).

Python
# Create a pipeline that first scales the data, then trains the SVM
# This is the standard, robust way to do it.
# probability=True is needed later for the ROC-AUC curve
# C and gamma are hyperparameters we will tune
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='auto', probability=True, random_state=42))
])

# Now, just fit the entire pipeline. It handles scaling automatically.
print("Training the SVM pipeline...")
svm_pipeline.fit(X_train, y_train)
print("Training complete.")

# Make predictions on the test data
y_pred = svm_pipeline.predict(X_test)

Step 3: Tuning Key Hyperparameters (C and Gamma)
Our pipeline works, but how do we know C=1.0 and gamma='auto' are the best choices? We don't. We need to tune them.

C (Regularization Parameter): This parameter controls the trade-off between achieving a "clean" (low misclassification) margin and a "wide" (smooth) margin.
Low C: A very soft margin. The model allows for more misclassifications in the training data to get a wider, simpler margin. This can prevent overfitting (high bias, low variance).
High C: A hard margin. The model tries to classify every training sample correctly. This can lead to a very complex, narrow margin that overfits the training data (low bias, high variance).

gamma (Kernel Coefficient for 'rbf'): This parameter defines how far the influence of a single training example (a support vector) reaches.
Low gamma: A large radius of influence. The model is simpler and smoother. This can lead to underfitting.
High gamma: A small radius of influence. Each support vector has a very local "bubble" of influence. This can create highly complex, "island-like" decision boundaries that overfit the data.
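If you want to see the effect of these two knobs before tuning them, the short sketch below plots the decision boundary for a few illustrative (C, gamma) pairs. This is an optional addition, not part of the original walkthrough; it assumes scikit-learn 1.1 or newer for DecisionBoundaryDisplay and reuses the X_train and y_train arrays created in the earlier snippet.

Python
# Optional: visualize what C and gamma do before tuning them.
# The (C, gamma) pairs are arbitrary, chosen only to contrast a smooth boundary
# with an overfit, "island-like" one. Reuses X_train / y_train from above.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (c_val, g_val) in zip(axes, [(0.1, 0.01), (1, 0.1), (100, 10)]):
    # Each model gets its own scaler + SVM so the comparison stays fair
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=c_val, gamma=g_val))
    ]).fit(X_train, y_train)
    # Shade the predicted class regions, then overlay the training points
    DecisionBoundaryDisplay.from_estimator(
        model, X_train, response_method='predict', alpha=0.4, ax=ax
    )
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=10, edgecolor='k')
    ax.set_title(f"C={c_val}, gamma={g_val}")
plt.tight_layout()
plt.show()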
We can find the best combination of C and gamma using GridSearchCV (Grid Search Cross-Validation).

Python
# Define the "grid" of parameters to search
# We'll try a few values for C and gamma
param_grid = {
    'svm__C': [0.1, 1, 10, 100],  # Note the 'svm__' prefix
    'svm__gamma': [1, 0.1, 0.01, 0.001]
}

# IMPORTANT: We pass the *pipeline* to GridSearchCV, not just the model.
# This ensures that scaling is part of the cross-validation for each grid combination.
# cv=5 means 5-fold cross-validation. n_jobs=-1 uses all CPU cores.
grid_search = GridSearchCV(
    svm_pipeline,
    param_grid,
    cv=5,
    scoring='roc_auc',
    verbose=2,
    n_jobs=-1
)

print("Running GridSearchCV to find best C and gamma...")
grid_search.fit(X_train, y_train)

# Get the best model found by the grid search
best_svm = grid_search.best_estimator_
print(f"\nBest parameters found: {grid_search.best_params_}")
print(f"Best cross-validation ROC-AUC score: {grid_search.best_score_:.4f}")

# Make new predictions using the *best* model
y_pred_tuned = best_svm.predict(X_test)

Step 4: Evaluating Your Tuned SVM Model
Now that we have tuned our model, let's see how it actually performs on the held-out test data.

The Confusion Matrix: A Deeper Look
The confusion matrix is the foundation for most classification metrics. It gives you a detailed breakdown of your model's predictions versus the actual labels.
True Positive (TP): Actual was 1, Model predicted 1.
True Negative (TN): Actual was 0, Model predicted 0.
False Positive (FP): Actual was 0, Model predicted 1. (Type I Error)
False Negative (FN): Actual was 1, Model predicted 0. (Type II Error)

From this, Scikit-learn's classification_report calculates:
Precision: Of all the times the model predicted "Positive," how often was it right? TP / (TP + FP)
Recall (Sensitivity): Of all the actual "Positive" cases, how many did the model find? TP / (TP + FN)
F1-Score: The harmonic mean of Precision and Recall. A great all-around metric.

Python
# 1. Classification Report
print("\n--- Classification Report (Tuned Model) ---")
print(classification_report(y_test, y_pred_tuned))

# 2. Confusion Matrix
print("\n--- Confusion Matrix (Tuned Model) ---")
cm = confusion_matrix(y_test, y_pred_tuned)
print(cm)

# 3. Visualize the Confusion Matrix
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Tuned SVM Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

The ROC Curve and AUC Score
The Receiver Operating Characteristic (ROC) curve is one of the most important evaluation metrics for a binary classifier. It plots the True Positive Rate (Recall) against the False Positive Rate as you vary the decision threshold of the classifier. The goal is to have a curve that bows as far as possible to the top-left corner.
Top-Left Corner: A perfect classifier (100% True Positives, 0% False Positives).
Diagonal Line: A model that is no better than random guessing.

The Area Under the Curve (AUC) summarizes the ROC curve into a single number.
AUC = 1.0: A perfect classifier.
AUC = 0.5: A classifier that is no better than random chance.
Python
# We need the predicted *probabilities* for the ROC curve
# This is why we set probability=True in the SVC
y_pred_proba = best_svm.predict_proba(X_test)[:, 1]  # Get probabilities for class 1

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate AUC
roc_auc = auc(fpr, tpr)
print(f"\nTest Set ROC-AUC Score (Tuned Model): {roc_auc:.4f}")

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'Tuned SVM (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Random Chance (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

An AUC score (e.g., 0.97) indicates a high-performing model that is excellent at distinguishing between the two classes.

Conclusion
The Support Vector Machine is a robust and effective algorithm, but only when used correctly. We've seen that SVMs are not a simple "plug-and-play" model. By following this guide, you've moved from a basic concept to a production-ready approach. You now know that feature scaling is mandatory, not optional. You know how to use a Pipeline to prevent data leakage and streamline your workflow. And, most importantly, you know how to tune the critical C and gamma hyperparameters to build a model that truly generalizes, and then prove its performance with a confusion matrix and ROC-AUC curve.
AWS Aurora Database supports a global multi-region setup, including a primary and a secondary region. When engineering with Aurora Global, the default settings are great, but understanding all the available configuration options and how they come together saves time and effort. This article explains Global Write Forwarding, a very handy setting that lets applications running in both the primary and secondary regions read and write, and examines its effects in detail. Note that this article is not about Aurora DSQL, which is a different service that supports an active-active setup out of the box.

Aurora Defaults
Aurora Global Database sets up a writer and a reader in the primary region, plus a reader and a standby writer instance in the secondary region. The standby writer will be promoted to a writer during a region failover, where the secondary region becomes primary. Endpoints are created to connect to 3 of the 4 instances (shown in Figure 1): a read-write endpoint to interact with the primary writer, a read-only endpoint to interact with the primary reader, and a read-replica endpoint to interact with the secondary reader.

Figure 1

Based on how the security groups are configured, these endpoints can be accessed from both regions, but they interact only with their respective DB instance. This default setup works best as an active-standby setup where the application in the secondary region can be available but not actively serving any traffic involving the database connection. Configuring cross-region access is possible, but inconvenient to maintain.

Global Endpoint
A global endpoint can be enabled, which proxies the primary region writer endpoint by default. The purpose of a global endpoint is that the application using the endpoint does not need to change when a region failover event occurs, in which case the global endpoint switches to point at the newly promoted writer. This setup can work with active-active compute when the same global endpoint is used in both regions, provided the application can tolerate the latency everywhere a database access is involved in the secondary region. The global endpoint connects only to the primary writer. The reader instances in either region are not used unless those endpoints are explicitly coded as application data sources. See Figure 2.

Figure 2

Global Write Forwarding
We have established that the default settings, combined with the use of the global endpoint, are great for region failover scenarios. If you have coded the application to use the global endpoint for writes and the local reader endpoints for reads, you likely care about sharing the load across read-heavy and write-heavy workloads. One of the settings available is to enable write forwarding. When this setting is enabled, a connection using the global endpoint from the secondary region reads locally and forwards writes to the primary region, without needing any additional application code to switch between reads and writes. This setting has an effect only on the secondary cluster; the primary region will still need additional code to separate reads and writes. This is great for read-heavy applications, letting them remain active-active and providing low-latency local database reads. However, it brings a challenge of distributed data consistency. See Figure 3 for the updated connectivity patterns with and without global write forwarding.
Figure 3

Consistency
If you observe the path 1->2->3 in Figure 3, which is the data flow from the secondary region when write forwarding is enabled, you see that we are writing to and reading from different clusters, relying on Aurora to keep the data consistent. Thankfully, Aurora handles this by default, but the question remains: are the default settings good enough for your use case? There are three ways Aurora provides read consistency: session, eventual, and global. For write forwarding to work, the consistency setting must be configured correctly, and the defaults differ between MySQL and PostgreSQL. Session consistency is the default setting in Postgres; it limits blocking and works in most cases. MySQL does not have a default consistency set, and until a value is set, the write forwarding setting is ignored by Aurora. Click through the links above to see how the global and eventual consistency settings work before choosing one.

Replication Lag
If there is a delay in replication between the regions, write forwarding may not behave as expected under its consistency configuration. The recovery point objective setting controls the lag and is set to 1 minute by default, meaning Aurora tries to keep the lag under 1 minute; it can be adjusted, but cannot go any lower than 20 seconds. This setting blocks transactions until the lag clears to maintain data integrity. Before changing the default setting, make sure that your application tolerates the delay and data integrity trade-offs.

Conclusion
Understanding the configuration options available with the Aurora database and their trade-offs helps in choosing the optimal settings for your workload. Default settings are a good starting point, but sometimes we learn the trade-offs the hard way, like during a production incident when the settings fail under specific conditions. Fully understand the application throughput, data access patterns, and related configurations before relying on them.
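To make the consistency discussion concrete, here is a minimal sketch of how a secondary-region application might opt into a consistency level before issuing forwarded writes. It assumes an Aurora MySQL global database with write forwarding enabled and uses the pymysql driver; the endpoint, credentials, and table are placeholders, and aurora_replica_read_consistency is the session variable AWS documents for choosing a consistency level.

Python
# Sketch only: endpoint, credentials, and schema below are placeholders.
import pymysql

conn = pymysql.connect(
    host="my-app.global-cluster.example.rds.amazonaws.com",  # global endpoint (placeholder)
    user="app_user",
    password="app_password",
    database="orders",
)

try:
    with conn.cursor() as cur:
        # Choose a consistency level for this session; as noted above, until a value
        # is set, Aurora MySQL does not apply write forwarding for the session.
        cur.execute("SET aurora_replica_read_consistency = 'session'")
        # This write is forwarded to the primary region's writer...
        cur.execute("INSERT INTO order_events (sku, qty) VALUES (%s, %s)", ("ZX-100", 2))
        conn.commit()
        # ...while reads are served locally once the session's own writes are visible.
        cur.execute("SELECT COUNT(*) FROM order_events")
        print(cur.fetchone())
finally:
    conn.close()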
AI agents are a reality now and are one of the key research goals for AI companies and research labs. These agents automate monotonous and complicated workflows within cloud environments. They can enhance human capabilities in code generation and debugging. They improve productivity by reducing manual effort, freeing people for creative and higher-level thinking while the agents do what they do best. With this, AI agents are evolving cloud and data systems. Scalability is maximized and efficiency is realized through their implementation, because humans finally have the time to innovate while AI handles the tedious work, optimizing resources, predicting problems, and tailoring solutions. They can even detect errors quickly and make decisions based on data.

Understanding Autonomous AI Agents
AI agents are autonomous, rational software systems that can perform a variety of tasks, such as processing data, conducting analysis, or orchestrating processes in cloud ecosystems. While their goals are set by humans, they work independently, utilizing data to help drive their decisions. They redefine and expand upon generative AI by working alongside or for humans. They will only continue to improve as their memory, entitlements, and tools improve. As large language models (LLMs) advance, agents will advance with them, since they build on current LLMs with an extra layer of autonomy. Current frameworks such as Microsoft Copilot, OpenAI, AutoGen, and LangChain allow anyone to take advantage of AI agents. The agents can be connected to existing data to take on repetitive tasks. Customer service, healthcare, and automotive enterprises are adopting AI agents that shift how businesses function, with autonomous, data-driven decision-making increasing efficiency across sectors.

AI Agents Transforming Data Ecosystems
AI agents are beneficial in enhancing data pipelines, optimizing data storage and governance, and preparing data for machine learning models.

Streamlining ETL Processes
Extract, transform, and load (ETL) processes encompass the extraction, transformation, and loading of data from multiple sources. An AI agent can clean and streamline that data while making sure it is ready for analysis, automating data integration and making the data easier for the team to use. One of the main challenges with ETL is human error, whether oversights or coding issues. With an AI agent's ability to detect errors, data quality can be maintained seamlessly: the agent can flag issues and rectify them before the data is transformed or analyzed.

Optimizing Data Storage and Governance
Another critical role for AI agents is managing data storage intelligently through options like OneLake, where governance is the default, or Microsoft Purview, where data takes significantly less time to locate. This comes from AI's ability to detect and classify data, which the agent can then interact with and act on. AI agents are remarkably good at predicting and mitigating compliance risks by analyzing massive amounts of historical data. They can notice patterns, manage risk, and run tests across thousands of scenarios, which benefits governance and compliance.

Scalability and Real-Time Analytics Enabled by AI Agents
Many common challenges arise with cloud ecosystems, one of which is cost. Costs are difficult to forecast, and third-party services and energy expenses can balloon, especially when it comes time to scale.
Latency is another difficulty for real-time cloud computing, and infrastructure capabilities are strained by the need to integrate numerous data sources and services across networks. AI agents can help address these common challenges when packaged with containerization tools such as Docker, while serverless deployment and Kubernetes tackle infrastructure challenges and offer simplified scaling. Distributed computing frameworks such as Spark on Azure can deploy and scale AI agents with just a few lines of code, and AWS Redshift can leverage LLMs through a simple SQL command to do anything from translation to summarization in real time.

Architecting Autonomous and Resilient Data Systems
Dynamic Workload Management and Self-Healing Pipelines
AI agents make it possible to scale dynamic workloads with automation and decision-making. They can break complex scenarios into pieces and remain adaptable under adverse conditions, effectively acting as self-healing data pipelines no matter the situation. Bottlenecks and delays can be avoided through intelligent resource allocation, with regularly monitored data used to adjust as needed. And because they can predict failures, recovery becomes far less stressful, since they can adapt to expected failures in advance.

Industry Adaptation
Many industries are seeing the benefits of AI agents, from healthcare to finance to retail. AI-powered chatbots can help with insurance, prescription refills, or care instructions. AI agents can track market trends and identify company risks before they materialize. Finally, AI can offer real-time insights into inventory management for the retail sector.

Addressing Ethics, Governance, and Cost Optimization
As with every new technology, there are concerns around ethics, governance, and cost. The historical data that AI agents ingest can contain sensitive information that must be handled carefully, so it is important for organizations to prioritize ethics to reduce breaches. To help quell concerns, incorporating explainable AI (XAI) into workflows is useful for demonstrating transparency and fairness. Tracking data access, conducting audits, and regularly monitoring data are important for maintaining compliance with regulations such as GDPR. The costs of AI agents can be optimized with serverless cloud technologies such as Azure Functions, Google Cloud Functions, and AWS Lambda, which also improve availability and scalability.

Actionable Steps
To integrate AI agents, it is crucial to evaluate which areas of the business require assistance and which AI agents are best equipped for those areas, and to test and learn over time. There is also a return on investment to assess across several dimensions of business impact: operational, governance, customer, employee, and financial. To ensure investments are made in the right technologies, it is important to select AI agents that will support the organization in the long term.

Conclusion
The time is now to revolutionize cloud and data ecosystems with the inclusion of AI agents. Their strength is to automate tasks, clean and prepare data, detect anomalies in numerous industries, and increase the productivity of individuals. To future-proof cloud strategies and allow for scalable solutions to data storage, management, and analysis, evaluating AI agents wherever they are expected to benefit the business is an important next step.