DZone Spotlight

Senior Developers, What to Read Next?

By Daniel Buza
Recently, one of my best friends, who also happens to be one of the smartest developers I have the luck to know, asked me what book he should read next to further develop his skills. It took me some time to gather my thoughts, and the result might be useful for others, too. Spoiler alert: I could not find a single book that I would call the one to read as a senior developer. Instead, I summarized the books that I found good for one reason or another. To be clear, this is a subjective list; feel free to agree or disagree with my choices, and feel free to leave a comment or contact me in any other way to share your thoughts.

First of all, why read books? We are in 2025 — everything important can be summarized in a 160-character-long message, and nobody has the mental capacity to consume anything longer than four seconds. Well, jokes aside, it is a valid concern that books can get outdated, and following tech websites can help you stay up to date. (AI, a currently hot topic, is a really good example: I am not aware of many well-written and still up-to-date books on it.) While this is true, I still have two reasons why people should keep reading physical books:

  • Because physical books usually go much deeper into topics than shorter publications. They tend to present not only the direct results, but also the ideas, thoughts, and assumptions behind advice (e.g., how you should write code) or engineering decisions (e.g., why a given framework's API was built in a particular way). If you want to learn more, these aspects are far more important than knowing a whole API by heart or knowing the best practices without understanding the reasoning behind them.
  • Because you will remember better: at least in my experience, I remember much more reliably which book I read something in than which blog I read it on. I remember the font, the size of the book, and how many pages were on the left and right side; therefore, when I look for something I read on paper, I usually find it faster than something I read online. This might not apply to everyone, but according to my colleagues, it applies to most of us.

As a side note, I will not link any online shop for the books, but I will provide the ISBNs, and you can look them up in the shop of your choice.

The Core

Let's start with my advice on two books. I know my friend has already read them, but I was unsure if he has the physical copies, too: Clean Code (ISBN 978-0132350884) and Clean Architecture (978-0134494166) by Robert C. Martin. You do not have to agree with everything in them, but I expect every single developer in my team to know the concepts and arguments presented in these two. As a side note, the second edition of Clean Code (978-0135398579) will be released soon (October 27, 2025), and I am already excited to get it. In general, I think it is a good idea to re-read these books every couple of years.

Technical Books

I do not believe that any of the following books will be 100% new and unknown to developers who have been around for a while. Still, they might contain aspects you have not thought through until now, so they could be a good addition to your library.

I really liked The Software Craftsman (978-0134052502) because it places software development in a different context, discussing internal processes, interviewing, personal dedication, and other aspects.
This book will not help you with low-level coding details, but it can give you some insight into the industry you are working in, which definitely helps you improve as a developer, too.

Get Your Hands Dirty on Clean Architecture (978-1805128373) managed to surprise me. It has a really good section about layered architecture, and it complements the Clean Architecture book really well by detailing some code-level aspects, which can be really helpful if you are not used to actually writing code according to clean architecture standards.

Kent Beck's book, Test Driven Development (978-0321146533), is simply the best book I have seen on TDD. I really liked how he demonstrates the strength of TDD through multiple examples, even writing an xUnit tool.

The next book does not strictly belong in this list, as it is not meant for senior developers with years, maybe even decades, of experience, but I find its writing style really good. Java By Comparison (978-1680502879) aims to help developers with less experience learn best practices and avoid common mistakes. If you, as a senior, are asked by a junior what to read, this could be a really good pick.

Clean Agile (978-0135781869) provides insight into how R. C. Martin remembers the early days of the agile movement: how it evolved, the situations in which the agile methodology fails to help, and, in general, what it was originally intended for. Reading this can greatly improve the value you add to agile projects, simply by giving you a better understanding of the methodology itself, and maybe you can help your team become more efficient, regardless of how strictly you follow a given ruleset.

Non-Technical Books

The books mentioned up to this point were written by tech people for tech people, which is ultimately a valid approach because we should learn from each other. The following books do not fall into the same category: they are not necessarily written by technical people and are meant for more than just technical audiences. I still recommend them.

Humble Pi: When Math Goes Wrong in the Real World (978-0593084694) is a super entertaining book. You would not expect anything else if you are familiar with the author. Still, beyond being entertaining, it brings attention to responsibility, which I find really important for a software developer to be aware of. Some mistakes lead to a lot of money being lost. Some mistakes lead to people's lives ending earlier. I do not want to make all of us paranoid, but from time to time, everyone should consider what can happen if something in their code malfunctions. In the same book, I found interesting details on topics I was not even aware of, and on how complex they can be. My favorite part was about calendars: everyone knows that dealing with time zones and various ways of counting days and years can be tricky. But I was not aware it could be this tricky, and how much of a struggle it was back when sending a message from one European city to another took weeks rather than seconds.

Have you ever felt you are Surrounded by Idiots (978-1785042188)? If you work as a developer and have never felt this, please contact me. I mean, really. I want to know where you work and how to apply. This book describes people and groups. I assume you are a person and you work in some sort of group. Therefore, this book could be useful for understanding others and the dynamics in which they are most efficient. I do not believe this book will help you resolve conflicts.
But it can help you understand the reasons behind conflicts.

The last book I recommend is Thinking, Fast and Slow (978-0374533557). It covers many topics that are only marginally related, or not related at all, to software development. Still, understanding how people make decisions and what the limits of rational behaviour are can help you a lot if you want to practice some self-reflection. And I believe most of us developers could practice a bit more of it.

Wrapping Up

Feel free to pick any book from this list for the reasons I mentioned or for any other reason you may have, or pick any other book you believe will help you become a better developer. My main message is simply this: please consume high-quality sources, so that we do not keep falling into the same mistakes over and over again.
Where Stale Data Hides Inside Your Architecture (and How to Spot It)

By Andreas Kozachenko
Every system collects stale data over time — that part is obvious. What's less obvious is how much of it your platform will accumulate and, more importantly, whether it builds up in places it never should. That's no longer just an operational issue but an architectural one.

In my experience, I've often found stale data hiding in corners nobody thinks about. On the surface, they look harmless, but over time, they start shaping system behavior in ways that are hard to ignore. And it's not just a rare edge case: studies show that, on average, more than half of all organizational data ends up stale. That means the risks are not occasional but systemic, quietly spreading across critical parts of the platform.

The impact isn't limited to performance. Outdated records interfere with correctness, break consistency across services, and complicate debugging. What's more, stale data quietly consumes storage and processing resources, increasing operational costs. Based on what I've seen in enterprise platforms, I can point to several hidden spots that deserve far more attention than they usually get.

Where Stale Data Finds Room to Hide

My team often joins enterprise projects with goals like improving performance or reducing costs. Each time, the same lesson surfaces: by examining the spots below, platforms become leaner, faster, and far easier to maintain.

Cache Layers as Hidden Conflict Zones

Stale data often hides not in caching itself but in the gaps between cache layers. When application, storefront, and CDN caches don't align, the system starts serving conflicting versions of the truth, like outdated prices or mismatched product images. In one enterprise ecommerce platform, we traced product inconsistencies back to five overlapping cache levels that overwrote each other unpredictably — a classic case of caching mistakes. The fix required reproducing the conflicts with architects and tightening configurations. A clear warning sign that your cache may hide stale data is when problems vanish after cache purges, only to return later. It often means the layers are competing rather than cooperating.

Synchronization Jobs That Drift

Another source of stale data is asynchronous synchronization. On paper, delayed updates look harmless, as background jobs will "catch up later." In practice, those delays create a silent drift between systems. For example, users of a jewelry platform saw outdated loyalty points after login because updates were queued asynchronously. Customers assumed their balances had disappeared, support calls surged, and debugging became guesswork. The issue was fixed by forcing a back-end check whenever personal data pages were opened. A common signal is when user-facing data only appears correct after manual refreshes or additional interactions.

Historical Transaction Data That Never Leaves

One of the heaviest anchors for enterprise systems is transactional history that stays in production far longer than it should. Databases are built to serve current workloads, not to carry the full weight of years of completed orders and returns. This is exactly what my team encountered in a European beauty retail platform: the production database had accumulated years of records, slowing queries, bloating indexes, and dragging overnight batch jobs while costs crept higher. The fix was smart archiving: moving old records out of production and deleting them once the retention periods expired. A telling signal is when routine reports or nightly jobs begin stretching into business hours without clear functional changes.
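To illustrate the archiving idea in isolation (this is a generic sketch, not code from the platform described above), a scheduled job could copy expired rows into an archive table and purge them from the hot table in a single transaction. The table and column names (orders, orders_archive, completed_at) and the JDBC-based access are assumptions made for the example:

Java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.LocalDate;
import javax.sql.DataSource;

public class OrderArchiver {

    private final DataSource dataSource;
    private final int retentionYears;

    public OrderArchiver(DataSource dataSource, int retentionYears) {
        this.dataSource = dataSource;
        this.retentionYears = retentionYears;
    }

    // Copies orders older than the retention window into an archive table,
    // then deletes them from the hot table, all in one transaction.
    public void archiveExpiredOrders() throws SQLException {
        LocalDate cutoff = LocalDate.now().minusYears(retentionYears);
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement copy = conn.prepareStatement(
                         "INSERT INTO orders_archive SELECT * FROM orders WHERE completed_at < ?");
                 PreparedStatement purge = conn.prepareStatement(
                         "DELETE FROM orders WHERE completed_at < ?")) {
                copy.setObject(1, cutoff);
                purge.setObject(1, cutoff);
                copy.executeUpdate();
                purge.executeUpdate();
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}

A job like this would typically run off-peak on a schedule, so that the hot table only ever carries the data current workloads actually need.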
Legacy Integrations as Silent Data Carriers

Integrations with legacy systems often look stable because they "just work." The trouble is that over time, those connections become blind spots. Data is passed along through brittle transformations, copied into staging tables, or synchronized with outdated protocols. At first, the mismatches are too small to notice, but they slowly build into systemic inconsistencies that are painful to trace. A signal worth watching is when integrations are left undocumented, or when no one on the team can explain why a particular sync job still runs. That usually means it's carrying stale data along with it.

Backups With Hidden Liabilities

Backups are the one place everyone assumes data is safe. The paradox is that safety can turn into fragility when outdated snapshots linger for years. Restoring them may quietly inject obsolete records back into production or test systems, undermining consistency at the very moment resilience is needed most. The architectural pain lies in rising storage costs and the risk of corrupted recovery. A simple indicator is when backup retention policies are unclear or unlimited. If "keep everything forever" is the default, stale data has already found its way into your disaster recovery plan.

Having seen the corners where stale data tends to accumulate, the next question is: how do you tell when it's quietly active in yours?

Spotting the Signals of Stale Data

Over the years, I've learned to watch for patterns like these:

  • Lagging reality: Dashboards or analytics that consistently trail behind real events, even when pipelines look healthy.
  • Phantom bugs: Issues that disappear after retries or re-deployments, only to return without code changes.
  • Inconsistent truths: Two systems show different values for the same entity — prices, stock, balances — without a clear root cause.
  • Process creep: Batch jobs or syncs that take longer every month, even when business volume hasn't grown at the same pace.
  • Operational tells: Teams relying on manual purges, ad-hoc scripts, or "refresh and check again" advice as standard troubleshooting steps.

Signals spotted, hiding places uncovered — the next question is obvious: what do you actually do about it? Here is some practical advice.

Keeping Data Fresh by Design

Preventing stale data requires making freshness an architectural principle. It often starts with centralized cache management, because without a single policy for invalidation and refresh, caches across layers will drift apart. From there, real-time synchronization becomes critical, as relying on overnight jobs or delayed pipelines almost guarantees that inconsistencies will creep in. But even when data moves in real time, correctness can't be assumed. Automated quality checks, from anomaly detection to schema validation, are what keep silent errors from spreading across systems. And finally, no system operates in isolation. Imports and exports from external sources need fail-safes: guardrails that reject corrupted or outdated feeds before they poison downstream processes (a sketch of such a guardrail follows below). Taken together, these practices shift data freshness from reactive firefighting to proactive governance, ensuring systems stay fast, consistent, and trustworthy.
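Here is a minimal, hypothetical sketch of such a feed guardrail. The record shape, field names, and staleness window are assumptions made for the example, not details of the platforms described above:

Java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class FeedGuardrail {

    private final Duration maxAge;

    public FeedGuardrail(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // A feed record with its business payload and the time it was produced upstream.
    public record FeedRecord(String id, String payload, Instant producedAt) { }

    // Keeps only records that carry a timestamp and are fresher than maxAge;
    // everything else is dropped before it can poison downstream processes.
    public List<FeedRecord> filterFresh(List<FeedRecord> incoming, Instant now) {
        return incoming.stream()
                .filter(r -> r.producedAt() != null)
                .filter(r -> Duration.between(r.producedAt(), now).compareTo(maxAge) <= 0)
                .toList();
    }
}

An import job would call filterFresh(...) before writing anything downstream, and could log or alert on whatever was rejected so that a misbehaving upstream source becomes visible instead of silently spreading stale records.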
Fresh Data as an Ongoing Architectural Discipline

In my experience, the cost of stale data rarely hits all at once — it creeps in. Performance slows a little, compliance checks get harder, and customer trust erodes one mismatch at a time. That's why I see data freshness not as a cleanup task but as an ongoing architectural discipline. The good news is you don't need to fix everything at once. Start by asking where stale data is most visible in your system today, and treat that as your entry point to building resilience.

Trend Report

Kubernetes in the Enterprise

Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.


Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta

Refcard #216

Java Caching Essentials

By Granville Barnett

More Articles

Advanced Snowflake SQL for Data Engineering Analytics

Snowflake is a cloud-native data platform known for its scalability, security, and excellent SQL engine, making it ideal for modern analytics workloads. In this article, I take a deep dive into advanced SQL queries for online retail analytics, using Snowflake's capabilities to derive insights for trend analysis, customer segmentation, and user journey mapping. It covers seven practical queries, each with a query flow, a BI visualization, and sample inputs/outputs based on a sample online retail dataset, plus a system architecture diagram.

Why Snowflake?

Snowflake's architecture separates compute and storage, enabling elastic scaling for large datasets. It supports semi-structured data (e.g., JSON, Avro) via native parsing, integrates with APIs, and offers features like time travel, row-level security, and zero-copy cloning for compliance and efficiency. These qualities make it a powerhouse for online retail analytics, from tracking seasonal trends to analyzing customer behavior.

Scenario Context

The examples below use a pseudo online retail platform, "ShopSphere," which tracks customer interactions (logins, purchases) and transaction values. The dataset includes two tables:

  • event_log: Records user events (e.g., event_id, event_type, event_date, event_value, region, user_id, event_data for JSON).
  • user: Stores user details (e.g., user_id, first_name, last_name).

The queries are framed in a relatable business scenario, with sample data reflecting varied transaction amounts and regional differences. All sample data is synthetic, designed to demonstrate query logic in an online retail setting.

Getting Started With Snowflake

To follow along, create a Snowflake database and load the sample tables. Below is the SQL to set up the event_log and user tables:

SQL
CREATE TABLE event_log (
    event_id INT,
    event_type STRING,
    event_date DATE,
    event_value DECIMAL(10,2),
    region STRING,
    user_id INT,
    event_data VARIANT
);

CREATE TABLE user (
    user_id INT PRIMARY KEY,
    first_name STRING NOT NULL,
    last_name STRING NOT NULL
);

Insert the sample data provided in each query section. Use a small virtual warehouse (X-Small) for testing, and ensure your role has appropriate permissions. For JSON queries, enable semi-structured data support by storing JSON in the event_data column.

Advanced SQL Queries

Below are seven advanced SQL queries showcasing Snowflake's strengths, each with a query flow diagram, sample input/output, and Snowflake-specific enhancements. These queries build progressively, from basic aggregations to complex user journey analysis and JSON parsing, ensuring a logical flow for analyzing ShopSphere's data.

1. Grouping Data by Year and Quarter

This query aggregates events by year and quarter to analyze seasonal trends, critical for inventory planning or marketing campaigns.

Query:

SQL
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(QUARTER FROM event_date) AS quarter,
    COUNT(*) AS event_count,
    SUM(event_value) AS total_value
FROM event_log
GROUP BY year, quarter
ORDER BY year, quarter;

Explanation: The query extracts the year and quarter from event_date, counts events, and sums transaction values per group. Snowflake's columnar storage optimizes grouping operations, even for large datasets.

Snowflake Enhancements

  • Scalability: Handles millions of rows with auto-scaling compute.
  • Search optimization: Use search optimization on event_date to boost performance for frequent queries.
  • Clustering: Cluster on event_date for faster aggregations.
Sample input: The event_log table represents ShopSphere's customer interactions in 2023.

event_id | event_type | event_date | event_value | region | user_id
1        | Login      | 2023-01-15 | 0.00        | US     | 101
2        | Purchase   | 2023-02-20 | 99.99       | EU     | 102
3        | Login      | 2023-03-25 | 0.00        | Asia   | 103
4        | Purchase   | 2023-04-10 | 149.50      | US     | 101
5        | Login      | 2023-05-05 | 0.00        | EU     | 102
6        | Purchase   | 2023-06-15 | 75.25       | Asia   | 103

Sample output:

year | quarter | event_count | total_value
2023 | 1       | 3           | 99.99
2023 | 2       | 3           | 224.75

BI tool visualization: A bar chart of event counts by quarter, highlighting seasonal patterns.

Query flow: (diagram)

2. Calculating Running Totals for Purchases

Running totals track cumulative transaction values, useful for monitoring sales trends or detecting anomalies.

Query:

SQL
-- Running totals track cumulative transaction values, useful for monitoring sales trends or detecting anomalies.
SELECT
    event_type,
    event_date,
    event_value,
    SUM(event_value) OVER (PARTITION BY event_type ORDER BY event_date) AS running_total
FROM event_log
WHERE event_type = 'Purchase'
  AND event_date BETWEEN '2023-01-01' AND '2023-06-30';

Explanation: This query calculates cumulative purchase values, ordered by date, building on Query 1's aggregation by focusing on purchases. Snowflake's window functions ensure efficient processing.

Snowflake Enhancements

  • Window functions: Optimized for high-performance analytics.
  • Time travel: Use AT (OFFSET => -30) to query historical data.
  • Zero-copy cloning: Test queries on cloned tables without duplicating storage.

Sample input (subset of event_log for purchases in 2023):

event_id | event_type | event_date | event_value
2        | Purchase   | 2023-02-20 | 99.99
4        | Purchase   | 2023-04-10 | 149.50
6        | Purchase   | 2023-06-15 | 75.25

Sample output:

event_type | event_date | event_value | running_total
Purchase   | 2023-02-20 | 99.99       | 99.99
Purchase   | 2023-04-10 | 149.50      | 249.49
Purchase   | 2023-06-15 | 75.25       | 324.74

BI visualization: The running total of purchase values over time, illustrating sales growth.

Query flow: (diagram)

3. Computing Moving Averages for Login Frequency

Moving averages smooth out fluctuations in login events, aiding user engagement analysis and complementing the purchase trends from Query 2.

Query:

SQL
SELECT
    event_date,
    COUNT(*) AS login_count,
    AVG(COUNT(*)) OVER (ORDER BY event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_avg
FROM event_log
WHERE event_type = 'Login'
GROUP BY event_date;

Explanation: This query calculates a three-day moving average of daily login counts. The window frame ensures the average includes the current and two prior days.

Snowflake Enhancements

  • Window frames: Efficiently process sliding windows.
  • Materialized views: Precompute aggregates for faster reporting.
  • Data sharing: Share results securely with marketing teams.

Sample input (subset of event_log for logins):

event_id | event_type | event_date
1        | Login      | 2023-01-15
3        | Login      | 2023-01-16
5        | Login      | 2023-01-17
7        | Login      | 2023-01-18

Sample output:

event_date | login_count | three_day_avg
2023-01-15 | 1           | 1.00
2023-01-16 | 1           | 1.00
2023-01-17 | 1           | 1.00
2023-01-18 | 1           | 1.00

BI visualization: The three-day moving average of login counts, showing whether daily fluctuations exist.

Query flow: (diagram)

4. Time Series Analysis for Regional Purchases

This query detects daily changes in purchase values by region, building on Query 2 to identify market-specific trends.
Query:

SQL
SELECT
    event_date,
    region,
    event_value,
    event_value - LAG(event_value, 1) OVER (PARTITION BY region ORDER BY event_date) AS daily_difference
FROM event_log
WHERE event_type = 'Purchase'
  AND region = 'US';

Explanation: The LAG function retrieves the previous day's purchase value, enabling daily difference calculations for the US region.

Snowflake Enhancements

  • Clustering: Cluster on region and event_date for faster queries.
  • Query acceleration: Use Snowflake's query acceleration service for large datasets.
  • JSON support: Parse semi-structured data with FLATTEN for enriched analysis.

Sample input (subset of event_log for US purchases):

event_date | region | event_value
2023-02-20 | US     | 99.99
2023-04-10 | US     | 149.50

Sample output:

event_date | region | event_value | daily_difference
2023-02-20 | US     | 99.99       | NULL
2023-04-10 | US     | 149.50      | 49.51

BI visualization: The daily differences in purchase values for the US region, showing fluctuations.

Query flow: (diagram)

5. Generating Hierarchical Subtotals With ROLLUP

ROLLUP creates subtotals for reporting, extending Query 1's aggregations for financial summaries across years and regions.

Query:

SQL
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    region,
    SUM(event_value) AS total_value
FROM event_log
WHERE event_type = 'Purchase'
GROUP BY ROLLUP (year, region)
ORDER BY year, region;

Explanation: ROLLUP generates subtotals for each year and region, with NULL indicating higher-level aggregations (e.g., total per year or grand total).

Snowflake Enhancements

  • Materialized views: Precompute results for faster dashboards.
  • Dynamic warehouses: Scale compute for complex aggregations.
  • Security: Apply row-level security for region-specific access.

Sample input (subset of event_log for purchases):

event_date | region | event_value
2023-02-20 | EU     | 99.99
2023-04-10 | US     | 149.50
2023-06-15 | Asia   | 75.25

Sample output:

year | region | total_value
2023 | Asia   | 75.25
2023 | EU     | 99.99
2023 | US     | 149.50
2023 | NULL   | 324.74
NULL | NULL   | 324.74

BI visualization: Total purchase values by region for 2023, with a separate bar for the yearly total.

Query flow: (diagram)

6. Recursive CTE for Customer Purchase Paths

This query uses a recursive CTE to trace customer purchase sequences, enabling user journey analysis for personalized marketing.

Query:

SQL
WITH RECURSIVE purchase_path AS (
    SELECT user_id, event_id, event_date, event_value, 1 AS path_level
    FROM event_log
    WHERE event_type = 'Purchase'
      AND event_date = (SELECT MIN(event_date)
                        FROM event_log
                        WHERE user_id = event_log.user_id
                          AND event_type = 'Purchase')
    UNION ALL
    SELECT e.user_id, e.event_id, e.event_date, e.event_value, p.path_level + 1
    FROM event_log e
    JOIN purchase_path p
      ON e.user_id = p.user_id
     AND e.event_date > p.event_date
     AND e.event_type = 'Purchase'
)
SELECT u.user_id, u.first_name, u.last_name, p.event_date, p.event_value, p.path_level
FROM purchase_path p
JOIN user u ON p.user_id = u.user_id
ORDER BY u.user_id, p.path_level;

Explanation: The recursive CTE builds a sequence of purchases per user, starting with their first purchase. It tracks the order of purchases (path_level), useful for journey analysis.

Snowflake Enhancements

  • Recursive CTEs: Efficiently handles hierarchical data.
  • Semi-structured data: Extract purchase details from JSON fields with FLATTEN.
  • Performance: Optimize with clustering on user_id and event_date.
Sample input

user table:

user_id | first_name | last_name
101     | Alice      | Smith
102     | Bob        | Johnson

event_log (purchases):

event_id | user_id | event_date | event_value | event_type
2        | 101     | 2023-02-20 | 99.99       | Purchase
4        | 101     | 2023-04-10 | 149.50      | Purchase
6        | 102     | 2023-06-15 | 75.25       | Purchase

Sample output:

user_id | first_name | last_name | event_date | event_value | path_level
101     | Alice      | Smith     | 2023-02-20 | 99.99       | 1
101     | Alice      | Smith     | 2023-04-10 | 149.50      | 2
102     | Bob        | Johnson   | 2023-06-15 | 75.25       | 1

BI visualization: Purchase values by user and path level, illustrating customer purchase sequences.

Query flow: (diagram)

7. Parsing JSON Events

This query extracts fields from semi-structured JSON data in event_log.

Query:

SQL
SELECT
    e.event_date,
    e.event_data:product_id::INT AS product_id,
    e.event_data:category::STRING AS category
FROM event_log e
WHERE e.event_type = 'Purchase'
  AND e.event_data IS NOT NULL;

Explanation: The query uses Snowflake's colon notation to parse JSON fields (product_id, category) from the event_data column, enabling detailed product analysis. This builds on the previous queries by adding semi-structured data capabilities.

Snowflake Enhancements

  • Native JSON support: Parse JSON without external tools.
  • Schema-on-read: Handle evolving JSON schemas dynamically.
  • Performance: Use VARIANT columns for efficient JSON storage.

Sample input (subset of event_log with JSON data):

event_id | event_date | event_type | event_data
2        | 2023-02-20 | Purchase   | {"product_id": 101, "category": "Electronics"}
4        | 2023-04-10 | Purchase   | {"product_id": 102, "category": "Clothing"}

Sample output:

event_date | product_id | category
2023-02-20 | 101        | Electronics
2023-04-10 | 102        | Clothing

BI visualization: The distribution of purchases by product category, highlighting category popularity.

Query flow: (diagram)

System Architecture

(Diagram) Snowflake's role in ShopSphere's data ecosystem, integrating with external sources, ETL tools, and BI platforms.

Explanation: The system architecture diagram is structured in four layers to reflect the data lifecycle in ShopSphere's ecosystem, using distinct shapes for clarity:

  • External data sources: CRM systems and API feeds, shown as ellipses, provide raw customer and transaction data, forming the pipeline's input.
  • Snowflake data platform: Snowflake's cloud storage and virtual warehouses store and process data, serving as the core analytics engine.
  • ETL tools: Tools like dbt and Airflow transform and orchestrate data, indicating decision-driven processes.
  • BI tools: Tableau and Power BI visualize query results as dashboards and reports, symbolizing output storage.

Practical Considerations

The following considerations ensure the queries are robust in real-world scenarios, building on the technical foundation established above.

Performance Optimization

  • Clustering keys: Use clustering on high-cardinality columns (e.g., user_id, event_date) to improve query performance for large datasets.
  • Query acceleration: Enable Snowflake's query acceleration service for complex queries on massive datasets.
  • Cost management: Monitor compute usage and scale down warehouses during low-demand periods to optimize costs.

Data Quality

  • Handling edge cases: Account for missing data (for instance, NULL values in event_value) or duplicates (e.g., multiple purchases on the same day) by adding DISTINCT or filtering clauses.
  • Data skew: High purchase volumes in Q4 may cause performance issues; partition tables or use APPROX_COUNT_DISTINCT for scalability.
Security and Compliance

  • Row-level security: Implement policies to restrict access to sensitive data (for example, region-specific results).
  • Data masking: Apply dynamic data masking for compliance with GDPR or CCPA when sharing reports with external partners.

Conclusion

Snowflake's advanced SQL capabilities, combined with its scalable architecture and features like time travel, semi-structured data support, and zero-copy cloning, make it a powerful online retail analytics platform. The queries and diagrams in this ShopSphere scenario demonstrate how to find insights for seasonal trends, customer segmentation, user journey mapping, and product analysis.

Business Impact

These queries enable ShopSphere to optimize operations and drive growth:

  • Query 1's seasonal trends informed a 15% increase in Q4 inventory, boosting sales.
  • Query 6's user journey analysis improved customer retention by 10% through targeted campaigns for repeat buyers.
  • Query 7's JSON parsing enabled precise product category analysis, optimizing marketing spend.

Together, these insights empower data-driven decisions that enhance profit and customer satisfaction.

By Ram Ghadiyaram
*You* Can Shape Trend Reports: Join DZone's Database Systems Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Database Systems Research

With databases powering nearly every modern application nowadays, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing?

We're diving into key topics such as:

  • The databases and query languages developers rely on
  • Experiences and challenges with cloud migration
  • Practices and tools for data security and observability
  • Data processing architectures and the role of real-time analytics
  • Emerging approaches like vector and AI-assisted databases

Join the Database Systems Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team

By DZone Editorial
Infusing AI into Your Java Applications

Artificial intelligence (AI) is becoming increasingly pervasive. As an Enterprise Java developer, you might be wondering what value AI can add to your business applications, what tools Java provides to easily do that, and what skills and knowledge you might need to learn. In this article, we equip you with the basic knowledge and skills that you need to start exploring the capabilities of AI to build intelligent and responsive Enterprise Java applications.

When we talk about AI in this article, we mean getting responses from a large language model (LLM) based on a request that the Java application sends to the LLM. In our article's example, we create a simple chatbot that customers can ask for planetary tourist destination recommendations, and then use to book a spaceship to visit them. We demonstrate using Java frameworks like LangChain4j with Quarkus to efficiently interact with LLMs and create satisfying applications for end users.

Hello (AI) World: Getting an LLM to Respond to a Prompt

The first version of our spaceship rental application will build a chatbot that interacts with customers using natural language. It should answer any customer questions about planets they wish to visit in the solar system. For the full application code, see the spaceship rental step-01 directory in the GitHub repository.

The chatbot sends the customer's questions to the application, which interacts with the LLM to help process the natural language questions and to respond to the customer. For the AI-related parts of the application, we create just two files:

  • An AI service, CustomerSupportAgent.java, which builds a prompt informing the LLM about our solar system's planets and instructs the LLM to answer questions from customers.
  • A WebSocket endpoint, ChatWebSocket.java, which receives the user's messages from the chatbot.

AI services are Java interfaces that provide a layer of abstraction. When using LangChain4j, these interfaces make LLM interaction easier. AI services are an integration point, so in a real application, you would need to consider security, observability, and fault tolerance of the connections and interactions with the LLM. As well as handling LLM connection details (stored separately in the application.properties configuration file), an AI service builds the prompts and manages chat memory for the requests it sends to the LLM.

The prompt is built from two pieces of information in the AI service: the system message and the user message. System messages are typically used by developers to give the LLM contextual information and instructions for handling the request, often including examples that you want the LLM to follow when generating its response. User messages provide the LLM with application user requests.

The CustomerSupportAgent interface is registered as the AI service in the application. It defines the messages used to build the prompt and sends the prompt to the LLM:

Java
@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    @SystemMessage("""
            You are a friendly, but terse customer service agent for Rocket's
            Cosmic Cruisers, a spaceship rental shop. You answer questions from
            potential guests about the different planets they can visit.
            If asked about the planets, only use info from the fact sheet below.
            """ + PlanetInfo.PLANET_FACT_SHEET)
    String chat(String userMessage);
}

Let's look at what this code is doing.
The @SessionScoped annotation maintains the session for the duration of the web service connection and maintains the chat memory for the duration of the conversation. The @RegisterAiService annotation registers an interface as an AI service; LangChain4j automatically implements the interface. The @SystemMessage annotation tells the LLM how to behave when responding to the prompt.

When the end user types a message in the chatbot, the WebSocket endpoint passes the message to the chat() method in the AI service. There is no @UserMessage annotation specified in our AI service interface, so the AI service implementation automatically creates a user message with the chat() method parameter value (in this case the userMessage parameter). The AI service adds the user's message to the system message to build a prompt that it sends to the LLM, then displays the response from the LLM in the chatbot interface. Note that, for readability, the planet information has been placed in a separate PlanetInfo class. Alternatively, you could place the planet information directly in the system message.

The ChatWebSocket class defines a WebSocket endpoint for the application's chatbot UI to interact with:

Java
@WebSocket(path = "/chat/batch")
public class ChatWebSocket {

    private final CustomerSupportAgent customerSupportAgent;

    public ChatWebSocket(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?";
    }

    @OnTextMessage
    public String onTextMessage(String message) {
        return customerSupportAgent.chat(message);
    }
}

The ChatWebSocket class uses constructor injection to automatically obtain a reference to the CustomerSupportAgent AI service. When the end user types a message in the chatbot, the onTextMessage() method passes the message to the AI service chat() method. For example, if the user asks, "What's a good planet to visit if I want to see volcanoes?", the application responds with a recommendation and an explanation of why the user, as a fan of volcanoes, might like to visit there:

The Spaceship Rental application chatbot

Providing an Illusion of Memory

As you continue your conversation with the chatbot, it might seem as though it is aware of previous messages exchanged, that is, the context of your conversation. When you talk to another person, you take for granted that they remember what you (and they) last said. Requests to an LLM are stateless, though, so each response is generated solely based on the information contained within the request prompt. To maintain context in a conversation, the AI service uses chat memory, through LangChain4j, to store prior user messages and the chatbot's responses.

By default, the Quarkus LangChain4j extension stores the chat in memory, and the AI service manages the chat memory (for example, by dropping or summarizing the oldest messages) as needed to remain within the memory limits. LangChain4j by itself would require you to first configure a memory provider, but that is not needed when using the Quarkus LangChain4j extension. This gives a practical illusion of memory to end users and improves the user experience so they can enter follow-on messages without needing to repeat everything they previously said.
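For comparison, this is roughly what the wiring looks like with plain LangChain4j, where you create the chat memory and build the AI service yourself. Treat it as a sketch: the builder method names can vary between LangChain4j versions, and the AgentFactory class is hypothetical, not part of the sample application.

Java
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;

public class AgentFactory {

    // Builds the AI service by hand, the way plain LangChain4j requires.
    // With the Quarkus LangChain4j extension this wiring happens automatically.
    public static CustomerSupportAgent create(ChatLanguageModel model) {
        // Keep only the last 10 messages so the prompt stays within limits.
        ChatMemory memory = MessageWindowChatMemory.withMaxMessages(10);

        return AiServices.builder(CustomerSupportAgent.class)
                .chatLanguageModel(model) // method name may differ in newer LangChain4j versions
                .chatMemory(memory)
                .build();
    }
}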
The user chatbot experience can also be improved by streaming the responses from the LLM.

Streaming Responses for a More Responsive User Experience

You might notice that responses in the chat message window take time to generate and then appear all at once. To improve the chatbot's perceived responsiveness, we can modify the code to return each token of the response as it is generated. This approach, called streaming, allows users to start reading a partial response before the entire response is available. For the full application code, see the GitHub spaceship rental step-02 directory.

Changing our application to stream the chatbot response is easy. First, we'll update the CustomerSupportAgent interface to add a method that returns an instance of the SmallRye Mutiny Multi<String> interface:

Java
@SessionScoped
@RegisterAiService
@SystemMessage("""
        You are a friendly, but terse customer service agent for Rocket's
        Cosmic Cruisers, a spaceship rental shop. You answer questions from
        potential guests about the different planets they can visit.
        If asked about the planets, only use info from the fact sheet below.
        """ + PlanetInfo.PLANET_FACT_SHEET)
public interface CustomerSupportAgent {

    String chat(String userMessage);

    Multi<String> streamChat(String userMessage);
}

Moving the @SystemMessage annotation to the interface means that the annotation doesn't have to be added to each of the methods in the interface. The streamChat() method returns the LLM's response to the chat window one token at a time (instead of waiting to display the full response all at once).

We also need to call the new streamChat() method from a WebSocket endpoint. To preserve both batch and stream functionality, we create a new ChatWebSocketStream class that exposes the /chat/stream WebSocket endpoint:

Java
@WebSocket(path = "/chat/stream")
public class ChatWebSocketStream {

    private final CustomerSupportAgent customerSupportAgent;

    public ChatWebSocketStream(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?";
    }

    @OnTextMessage
    public Multi<String> onStreamingTextMessage(String message) {
        return customerSupportAgent.streamChat(message);
    }
}

The customerSupportAgent.streamChat() call invokes the AI service to send the user message to the LLM. After making some minor tweaks to the UI, we can now toggle streaming on and off in our chatbot:

The application with the new streaming option enabled

With streaming enabled, each token (each word, or part-word) produced by the LLM is immediately returned to the chat interface.

Generating Structured Outputs From Unstructured Data

Up to this point, the LLM's outputs have been intended for the application's end user. But what if, instead, we want the LLM's output to be used directly by our application? When the LLM responds to a request, the AI service that mediates the interaction with the LLM can return structured outputs, which are formats that are more structured than a String, such as POJOs, lists of POJOs, and native types. Returning structured outputs significantly simplifies the integration of an LLM's output with your Java code because it enforces that the output received by the application from the AI service maps to your Java object's predefined schema.

Let's demonstrate the usefulness of structured outputs by helping the end user select a spaceship from our fleet that meets their needs. For the full application code, see the GitHub spaceship rental step-03 directory.
We begin by creating a simple Spaceship record to store information about each individual spaceship in the fleet:

Java
record Spaceship(String name,
                 int maxPassengers,
                 boolean hasCargoBay,
                 List<String> allowedDestinations) {
}

Similarly, to represent the user's query about the spaceships in our fleet, we create a SpaceshipQuery record, which is based on the information the user provided in the chat:

Java
@Description("A request for a compatible spaceship")
public record SpaceshipQuery(int passengers,
                             boolean hasCargo,
                             List<String> destinations) {
}

The Fleet class populates several Spaceship objects and provides a way to filter out those that do not match the user's request (a rough sketch of such a class appears a little further below).

Next, we update the CustomerSupportAgent interface to take the user's message (unstructured text) and create a structured output in the form of the SpaceshipQuery record. To accomplish this feat, we only need to set the return type for a new extractSpaceshipAttributes() method in our AI service to be a SpaceshipQuery:

Java
SpaceshipQuery extractSpaceshipAttributes(String userMessage);

Under the covers, LangChain4j automatically generates a request to the LLM including a JSON schema representation of the desired response. LangChain4j deserializes the JSON-formatted response from the LLM and uses it to return a SpaceshipQuery record, as requested.

We also need to know whether the user's input is about one of our spaceships, or about some other topic. This filtering is accomplished using a simpler, structured output request that returns a boolean:

Java
@SystemMessage("""
        You are a friendly, but terse customer service agent for Rocket's
        Cosmic Cruisers, a spaceship rental shop. Respond with 'true' if the
        user message is regarding spaceships in our rental fleet, and 'false'
        otherwise.
        """)
boolean isSpaceshipQuery(String userMessage);

Our last addition to the CustomerSupportAgent interface enables the agent to provide a spaceship suggestion based on our fleet and the user's request, with and without streaming:

Java
@UserMessage("""
        Given the user's query regarding available spaceships for a trip
        {message}, provide a well-formed, clear and concise response listing
        our applicable spaceships. Only use the spaceship fleet data from
        {compatibleSpaceships} for your response.
        """)
String suggestSpaceships(String message, List<Spaceship> compatibleSpaceships);

@UserMessage("""
        Given the user's query regarding available spaceships for a trip
        {message}, provide a well-formed, clear and concise response listing
        our applicable spaceships. Only use the spaceship fleet data from
        {compatibleSpaceships} for your response.
        """)
Multi<String> streamSuggestSpaceships(String message, List<Spaceship> compatibleSpaceships);

Our last step is to update the ChatWebSocket and ChatWebSocketStream classes to first check if the user's query is about spaceships in our fleet. If so, the customer support agent creates a SpaceshipQuery record by extracting the information from the user's message and then responds with suggested spaceships from the fleet that are compatible with the user's request.
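The Fleet class itself is not listed in this article. As a rough sketch only (the class in the sample repository may look different), an in-memory version consistent with the records above could be as simple as:

Java
import java.util.List;

// Assumes it lives in the same package as the Spaceship and SpaceshipQuery records.
public class Fleet {

    // A small, hard-coded fleet; the real sample application may define these differently.
    private static final List<Spaceship> SPACESHIPS = List.of(
            new Spaceship("Star Hopper", 2, false, List.of("Mars", "Venus")),
            new Spaceship("Cosmic Carrier", 6, true, List.of("Mars", "Jupiter", "Saturn")),
            new Spaceship("Galaxy Cruiser", 10, true, List.of("Mercury", "Mars", "Neptune")));

    // Returns only the spaceships that can carry the requested passengers,
    // have a cargo bay if one is needed, and cover every requested destination.
    public static List<Spaceship> findCompatibleSpaceships(SpaceshipQuery query) {
        return SPACESHIPS.stream()
                .filter(s -> s.maxPassengers() >= query.passengers())
                .filter(s -> !query.hasCargo() || s.hasCargoBay())
                .filter(s -> s.allowedDestinations().containsAll(query.destinations()))
                .toList();
    }
}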
The updated code is similar for both the ChatWebSocket and ChatWebSocketStream classes, so only the ChatWebSocket class is shown here:

Java
@OnTextMessage
public String onTextMessage(String message) {
    boolean isSpaceshipQuery = customerSupportAgent.isSpaceshipQuery(message);
    if (isSpaceshipQuery) {
        SpaceshipQuery userQuery = customerSupportAgent.extractSpaceshipAttributes(message);
        List<Spaceship> spaceships = Fleet.findCompatibleSpaceships(userQuery);
        return customerSupportAgent.suggestSpaceships(message, spaceships);
    } else
        return customerSupportAgent.chat(message);
}

With these updates, the customer support agent is ready to use the structured outputs to provide the user with spaceship suggestions:

The application providing the user with spaceship suggestions based on the structured output

With that, we have completed an AI-infused Java chatbot application that provides planetary tourism recommendations and spaceship rentals. To continue learning, experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs.

More on These AI Concepts

We've discussed various AI concepts throughout this article. If you want to know more about any of them, here is a quick explainer.

Large Language Models (LLMs)

When we talk about AI in this article, we generally mean getting responses from a large language model. LLMs are machine learning models that are trained to generate a sequence of outputs based on a sequence of inputs (often text inputs and outputs, but some multi-modal LLMs can work with images, audio, or video). LLMs can perform a wide variety of tasks, such as summarizing a document, translating between languages, extracting facts, writing code, and so on. This task of creating new content from the input is what's referred to as generative AI, or GenAI. You can infuse such capabilities into your application as needed.

Making Requests to LLMs: Prompts, Chat Memory, and Tokens

How you request information from an LLM influences not only the response you get back from the LLM but also the end user's experience and the application's running costs.

Prompts

Sending a request to an LLM, whether from application code or as an end user in a chat interface, involves writing a prompt. A prompt is the information (usually, but not always, text) to which the LLM responds. If you think of communicating with an LLM like communicating with another person, how you phrase your request is important to making sure the other person (or the LLM, in this case) understands what you want to know. For example, give the context of the request before going on to ask for a specific piece of information, and avoid providing lots of irrelevant information that could confuse the listener.

Chat Memory

Unlike when you are talking to another person, LLMs are stateless and don't remember the previous request, so everything you need the LLM to take into consideration needs to be in your request: the prompt, any previous requests and responses (the chat memory), and any tools you provide to help the LLM respond. However, providing too much information to the LLM in the prompt can potentially complicate the request. It can also be costly.

Tokens

LLMs convert the words in your prompt into a sequence of tokens. Most hosted LLMs charge usage based on the number of tokens in the request and response. A token can represent a whole word or a part of a word. For example, the word "unbelievable" is typically split into multiple tokens: "un", "bel", and "ievable".
The more tokens that you include in the request, especially when you include all the chat memory, the greater the potential cost of running the application. Providing all the chat memory in a request can make requests both costly and less clear. Requests to LLMs are limited in length, so it's important to manage the chat memory and how much information is included in the request. This can be helped a lot by the Java frameworks that you use, such as LangChain4j with Quarkus, which we use for the sample application in this article.

LangChain4j and Quarkus Frameworks

LangChain4j is an open-source Java framework that manages interactions between Java applications and LLMs. For example, LangChain4j, through the concept of AI services, stores and helps you to manage chat memory, so that you can keep requests to the LLM efficient, focused, and less expensive.

Quarkus is a modern, cloud-native, open-source Java framework optimized for developer productivity, running in containerized environments, and with fast startup and low memory usage. The LangChain4j extensions to Quarkus simplify the configuration of connecting to and interacting with LLMs in AI-infused Java applications.

The LangChain4j project can be used with other Java application frameworks, including Open Liberty, Spring Boot, and Micronaut. MicroProfile and Jakarta EE are also working together with LangChain4j to provide an open standards-based programming model for developing AI applications.

The Sample Application

You can find the complete sample application that we demonstrate throughout this article on GitHub. The application is written in Java and runs on Quarkus using the Quarkus LangChain4j extensions.

Conclusion

Infusing AI into Java applications enhances the application's capabilities and the end user's experience. With the help of Java frameworks like Quarkus and LangChain4j to simplify interactions with LLMs, Java developers can easily infuse AI into business applications. Writing AI-infused applications in Java means you're working in Java's robust, enterprise-ready ecosystem, which not only helps you to easily interact with AI models, but also makes it easy for the applications to benefit from enterprise essentials such as performance, security, observability, and testing.

The field of AI is rapidly evolving. By mastering the concepts and technologies in this article, you can stay ahead of the curve and start exploring how AI can help you build intelligent and engaging Java applications. Experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. If you'd like to learn more, try this tutorial on how to extend the knowledge of the LLM with content from PDF documents by using retrieval-augmented generation (RAG): Build an AI-powered document assistant with Quarkus and LangChain4j.

Thanks to Red Hatters Clement Escoffier, Markus Eisele, and Georgios Andrianakis for valuable review comments. This article was originally published on InfoQ on Aug 15, 2025.

By Don Bourne
Fixing a Test Involves Much More Than Simply Making It Pass

Software tests are essential mechanisms in the development lifecycle, designed to verify that code behaves as expected. They catch bugs, ensure that functionality aligns with requirements, and maintain overall system reliability through automated assertions on inputs, outputs, and states. However, fixing a failing test extends far beyond merely adjusting it to pass, as this superficial approach can mask deeper issues in the underlying code or architecture. True resolution demands a thorough investigation into the root cause, whether that is a logic error, an environmental mismatch, a dependency failure, or an outdated assumption in the test itself, followed by targeted corrections that preserve the test's intent as a safeguard against regressions.

Blindly modifying assertions to match unexpected outcomes, such as changing an expected value from false to true without validating why the discrepancy occurred, risks introducing false positives and eroding trust in the test suite. This way, we are potentially allowing flawed software to reach production, where it could lead to security vulnerabilities, data inconsistencies, or user-facing failures. After all, a proper fix not only restores the green status but also enhances code quality, observability, documentation, and team knowledge. Test failures should be treated as opportunities for genuine improvement rather than quick band-aids.

As developers and testers, we rely on tools like Postman to streamline API testing and ensure our applications behave as expected. Postman's intuitive interface, powerful scripting capabilities, and recent AI integrations make it a go-to for automating tests in CI/CD pipelines or during exploratory testing. However, one seemingly helpful feature — the "Fix test" option powered by Postbot (Postman's AI assistant) — can introduce subtle risks if used without caution.

In this article, we'll explore how the "Fix test" option works, why it might tempt you to use it, and most importantly, the potential dangers of blindly accepting its suggestions. We'll use an example to illustrate how "fixing" a test to make it pass can mask underlying issues, leading to false confidence in your API's reliability. By the end, you'll have best practices to avoid these pitfalls and maintain robust, meaningful tests.

Postman Fundamentals

Postman allows you to write tests in JavaScript that run after sending a request, verifying aspects like status codes, response bodies, headers, and more. These tests use the pm.test and pm.expect syntax to assert expected outcomes. For instance, you might check if a response JSON field matches a specific value.

Postbot is Postman's AI-powered assistant. I entered the following prompt into Postbot:

How can you help me? What can you do?

Here is the response that I've got:

Textile
You can call me Postbot :) I'm here to help you with your API workflows in Postman. Here are some of the things I can do:
1. Add or update test scripts for your API requests.
2. Add or update visualization scripts to visualize your API responses.
3. Edit existing test scripts to make changes or fix errors.
4. Add or update descriptions for your requests to enhance documentation.
5. Construct new API requests based on your existing requests.
6. Debug and fix your API requests to ensure they work correctly.

If a test fails, Postbot can analyze the failure and offer suggestions, including a "Fix test" option. This feature scans the actual response and proposes updates to your test script.
It usually involves adjusting assertions to align with the received data. I find Postbot very helpful indeed, as it can save me a great deal of time in writing tests and debugging. But I’ve also learned to always be critical of its results. While convenient, this AI bot may prioritize making the test pass over diagnosing the root cause. Postbot might rewrite an assertion from expecting false to true if that's what the response currently returns, without questioning whether the response itself is correct. How A Boolean Flip Can Hide a Bug Let’s say you're testing an API endpoint for user authentication. Your test collection includes a request to log in with valid credentials, followed by a test script that verifies a boolean field in the response: JavaScript pm.test("Verify authentication status", function () { var jsonData = pm.response.json(); pm.expect(jsonData.isAuthenticated).to.be.false; // Expected: false for guest mode }); You run the test, but it fails because the response shows isAuthenticated: true. Intrigued, you invoke Postbot (via the AI scripting panel or failure suggestions), and it offers the "Fix test" option. Using "Fix test" updates the script to: JavaScript pm.test("Verify authentication status", function () { var jsonData = pm.response.json(); pm.expect(jsonData.isAuthenticated).to.be.true; // Now matches the actual response }); Voilà — the test now passes on rerun. But did you fix the problem? In this scenario, the API might have a bug. Perhaps a configuration error is incorrectly authenticating guests as full users, exposing sensitive data. By "fixing" the test, you've adapted it to the faulty behavior instead of investigating the discrepancy. The test suite turns green, but the underlying issue persists. Are we sure that there is no security vulnerability that we've just masked by turning the test suite green? This example highlights a core issue that goes beyond testing tools. It actually goes beyond testing types and testing levels: Testing is a critical thinking activity. Our tests should exist for a reason. They should be there in order to verify behaviour, functionality, or anything else that we need to verify. If a test passes or fails, we must understand why. Is it a false positive or a false negative? Did the test catch a bug or not? We need to investigate. In the example, instead of investigating, we just let AI change the test so that it passes testing. All we’ve got now is a test that passes testing for the wrong reason. The test suite is green, misleading us into the wrong conclusion that authentication works as expected. The Dangers of Blindly "Fixing" Tests Our authentication example also highlights another important fact. Postbot's fixes are reactive to the current response. They are not proactive in ensuring it aligns with business requirements or specifications. This is where we need to be critical and investigate as testers. While the "Fix test" feature can accelerate testing and development, misusing it can lead to several risks: Masking bugs in the API or backend. Tests exist to catch deviations from expected behavior. If the response is wrong due to a code change, database issue, or external dependency, updating the test to match it sweeps the problem under the rug. This erodes the test's value as a safety net.False positives in CI/CD pipelines. Green tests give teams confidence to deploy. If "fixed" tests ignore real issues, you risk shipping broken features. 
In a fast-paced environment, this can result in production outages, data corruption, or compliance violations.
Loss of test intent and documentation. Original tests often encode business rules or assumptions (e.g., "isAuthenticated should be false for unauthenticated requests"). Altering them without review dilutes this intent and makes the test suite harder to maintain and understand for future team members.
Encouraging lazy debugging practices. Relying on AI to "fix" failures discourages thorough root-cause analysis. What if the failure stems from flaky networks, incorrect request payloads, or environmental differences? Skipping the investigation misses opportunities to improve overall system reliability.
Accumulated technical debt. Repeated quick fixes can bloat your test scripts with outdated or incorrect assertions. When the API evolves legitimately, these band-aid tests will fail en masse, requiring more effort to untangle.
Best Practices: Safely Leveraging Postbot and "Fix Test"
The practices below emphasize a balanced approach where AI assists but does not replace human judgment. The goal is for test fixes to contribute to long-term quality rather than short-term convenience. By integrating them into your workflow, you can mitigate risks while still benefiting from the efficiency of automation.
Always investigate failures first. Use Postbot to generate initial tests or suggestions, but validate them against specs. Before clicking "Fix test," manually review the response and ask: is this the expected output according to the API docs or requirements? Use Postman’s console or external tools to debug, and after a fix, add comments explaining the change.
Version control your collections. Store Postman collections in Git so that changes from "Fix test" appear in diffs, allowing peer review before merging.
Incorporate environment variables for flexibility. Instead of hardcoding values, use variables like {{expectedAuthStatus}}. This makes tests modular without rewriting assertions.
Run tests in multiple environments. Test against staging, production-like setups, or mocks to confirm whether failures are environment-specific.
Educate your team. Share this awareness in code reviews or team meetings. Encourage a culture where test failures prompt discussion, not automatic fixes.
By treating "Fix test" as a starting point rather than a solution, you maintain control over your test integrity. This is of paramount importance: why do we test at all if our test results are misleading and meaningless?
Wrapping Up
Software testing is a critical thinking activity. Do we understand how and why our systems work? Can we determine whether something is wrong and why? There is no testing without observability and no observability without testing. We need to wear our inspector's hat and investigate. After all, this is what we do; this is who we are: inspectors. Postman’s AI features like Postbot are game-changers for productivity. However, they are just tools, not oracles. Making a test pass can be tempting, but truly fixing it requires understanding the "why" behind failures. In our example, flipping a boolean assertion might greenlight a security flaw, underscoring the need for vigilance. Next time a test fails in Postman, pause before fixing. Dive into the details. Align with requirements. Be confident that your changes enhance reliability, not just silence alerts. By doing so, you will build a more robust API testing practice that catches issues early and keeps your applications trustworthy.

By Stelios Manioudakis, PhD
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86

The Challenge Developing open-source software for deployment on Arm64 architecture requires a robust continuous integration and continuous deployment (CI/CD) environment. Yet, there has historically been a disparity between the levels of support for Arm64 and traditional x86 processor architectures, with Arm64 usually at a disadvantage. Developers of infrastructure components for multiple architectures have certain expectations of their work environments: Consistency of the tools and methods they use across platforms, so they don’t have to adopt different development procedures just to adopt a less prevalent platform.Performance from their platforms and support mechanisms, so their deployment schemes don’t suffer from speed deficiency when they choose to support multiple platforms.Testing coverage so the very same tests for efficiency, compliance, and security apply to all platforms simultaneously and without substantial differentiation.Maintainability, enabling developers to automate their integration and redevelopment processes so they apply to all platforms without alteration. Product managers for these same components have these same requirements, plus at least two more: Platform coverage capability, so that technical account managers (TAM) may have the skills and readiness they need to respond to customer needs.Support tiering capability, enabling TAM and other IT personnel to classify their levels of software support according to their capability to respond to urgent or emerging customer issues. The Solution Working in collaboration with both Ampere and infrastructure provider Equinix, open-source developer Alex Ellis made available his Actuated CI/CD platform to some of the most critical open-source projects in the cloud-native software ecosystem. Actuated takes GitHub self-hosted automation processes demonstrated by security engineers to be inherently vulnerable to malicious attack, and runs them in microVMs abstracted from the public Internet. Implementation Several key open-source Cloud Native Computing Foundation projects took advantage of an Actuated environment to run all of their GitHub Actions for Arm64. This environment is based on Ampere® Altra® processors made available with the help of infrastructure provider Equinix. The success of this initiative was instrumental in prompting GitHub to implement full support of the Arm64 architecture with GitHub Actions. Now, developers who had been running Arm64 build processes in QEMU emulation environments on x86 architectures can relocate those processes to Arm64 on bare metal. Self-Hosted Runners for GitHub Actions on ARM64 GitHub dominates the hosting of software projects these days. The most popular way that GitHub-hosted projects generate builds and releases for continuous integration is with the platform’s built-in CI toolset, GitHub Actions. The most important role played by the GitHub Actions CI/CD platform is automating software development pipelines. The party responsible for triggering any GitHub Action is a runner. It’s an agent running on a server, waiting for something to do and eager to do it once it’s given the assignment. It’s assigned a job from the workflow and tasked with getting it done. GitHub is a complete software deployment platform. As such, it hosts its own runners, each of which is adapted to its specified target environment and architecture. Until recently, GitHub did not offer hosted runner environments for Arm64. 
Projects that wanted to generate Arm64-native builds did have an option — the self-hosted runner. GitHub users could install an agent on a physical or virtual machine hosted elsewhere, and have GitHub Actions dispatch jobs to that host, managed by the project users. This required project administrators not only to manage the project itself but also to take care of the maintenance and security of the build environment that the projects would use. In CNCF’s case, developers took advantage of credits to Equinix Metal, enabling them to provision bare metal instances and use them as self-hosted runners for projects. But for a code lab whose projects must be made available 24/7/365 to other developers worldwide, the security of self-hosted runners poses a challenge: Anyone could clone the project repository, modify the Actions jobs, and get access to the runner node to run arbitrary jobs, according to this GitHub documentation. Another problem was ensuring consistency between CI runs. With self-hosted runners, if there were side effects of the CI jobs, such as configuration changes or files left behind afterwards, they would still be there for ensuing jobs. This posed a problem — when running a CI job to build or test software, you should have a controlled environment, so that the only thing that changes between runs is the software. In the case of self-hosted runners, the environment can drift over time. In the absence of a cleanup process, it was possible for runs of the same build job on the same host to generate different results over time. One way developers bypassed the need for Arm64 native runners was by running virtual Arm64 environments on x86 servers, using QEMU open-source emulation. Emulated environments add a huge performance overhead for software compilations, which run at a fraction of the pace of compilations on native, non-emulated hardware. Emulation worked well enough for developing small to medium projects. But if developers had to build something big and important for ARM64, the strain would become so great on their virtual environments that builds would completely fail. “In the past, people were doing builds using QEMU,” said Equinix’s Developer Partner Manager Ed Vielmetti. “Say you were building a compiler, where the intermediate steps require large amounts of memory and very deep integration with the processor. That just would not work in an emulated environment.” The Disparity Phenomenon Unlike the typical enterprise, the Cloud Native Computing Foundation has a special obligation to build its cloud-native components for all the world’s major processor architectures. Projects such as the containerd portable container runtime, the etcd key/value data store, the fluentd log data collector, the Falco real-time threat detection tool, and the OpenTelemetry observability and instrumentation toolkit, among dozens of others, are critical dependencies for the cloud-native ecosystem, and as such, must be built for both x86 and Arm64. To build low-level infrastructure components with support for Arm64, CNCF developers need access to native Arm64 infrastructure. This means, ironically, they need the very class of tools they’re trying to create. At first, Ampere and Equinix collaborated with CNCF to address these gaps by donating Ampere Altra-based servers or setting up Altra-based bare metal nodes at Equinix facilities. The granularity of the Arm64-based server resources that Equinix could share was bare metal nodes — a 160-core dual-socket Ampere Altra system. 
Ideally, a server like this would be shared among several projects, but this was, at the time, beyond the capabilities of the CNCF. This is the problem that Ampere and Actuated proposed to solve for CNCF by allowing multiple projects to run on fewer hosts, thus providing easy access to build services for more projects while consuming less hardware. “OpenTelemetry is a full-on, full-time-on, CI/CD system,” said Antoine Toulmé, Senior Engineering Manager for Blockchain and DLT and Splunk Maintainer for OpenTelemetry project. “We were able to leverage [our Ampere server] infrastructure for ourselves, but we weren’t able to share it with open source at large." “We cannot give GitHub runners away,” Toulmé said. “Once we were happy with certifying the downstream distributions to our customers, we opened issues with the OpenTelemetry project saying we would like to see ARM64 support being delivered at the highest level — meaning, it should run for every commit, it should run for main, it should run all the time. And the feedback was, well, great, but there are no ARM64 runners in GitHub. So we’re going to need you to work with what we can do here.” Due to the lack of readily available Arm64 platforms for these projects, developers were unaware if the changes they committed were causing issues on Arm64, as test suites were not run as frequently as for x86. Since container orchestration platforms are among the platforms being developed to support Arm64, this phenomenon became a vicious cycle: Releases were gated on passing integration test suites for x86, but releases were not gated on the same test suites passing for Arm64. The solution CNCF’s developers would discover falls far short of qualifying as radical or revolutionary — in fact, it’s more of a bug fix in practice. It’s so simple to implement that it completely compensates for this disparity, not just for CNCF but for any developer of any platform-level component for any architecture. Breakthrough: Actuated, Plus Editing One Line of Code To take the first step towards platform parity between x86 and Arm64, Ampere enlisted the help of Alex Ellis, the creator of a service called Actuated. It’s a product that runs GitHub Actions jobs in secure, isolated microVMs, instrumented to receive build jobs from GitHub Actions, and offering developers visibility into the performance of their build jobs and the load on the shared build systems. Actuated could run all the CNCF’s existing GitHub Actions runners after altering a single line of their configuration files, plus in some cases the pasting of a few code snippets — changes which took less than five minutes to implement. These changes enabled GitHub-hosted projects to point to Actuated’s microVM-driven environment on Ampere Altra processors for their build jobs. “Falco really needed Arm64 GitHub runners to elevate its support for the architecture and enlarge its user base,” Falco project Senior Open Source Engineer and Sysdig Maintainer Federico Di Pierro said. “[Actuated] was the perfect solution for us because it was easy to leverage and relieved any burden for the maintainers. This way, we as maintainers can focus on what really matters for the project, instead of fighting with maintaining and deploying self-hosted infrastructure. 
Now we are building, testing, and releasing artifacts for ARM64, leveraging Actuated for many of our projects, and it works flawlessly.” Having seen the increase in demand for Arm native build environments in recent years, GitHub announced last June the availability in public beta of Arm64-based hosted runners for GitHub Actions, powered by Ampere compute instances on Microsoft Azure, followed in January 2025 by the release into public preview of free hosted runners for public repositories. For OpenTelemetry, this means the end of network loads as high as 10 times their assigned bandwidth caps, on account of OpenTelemetry builds constantly downloading dependencies from Docker Hub repositories. “Yeah, we were definitely breaking things,” the OpenTelemetry Project’s Antoine Toulmé said. “We got lucky, because the Arm runners for GitHub shipped. We have moved to ARM runners, we are happy as can be, and nothing is breaking anymore.” Now for the first time, project maintainers can pay as close attention to the safety and security of Arm64 builds as they have for x86 builds, knowing that they’re no longer likely to encounter performance degradations or penalties. “[Actuated] gave us great confidence in the CI builds on ARM64,” Principal Software Engineer and AWS Maintainer for the containerd project Phil Estes said. “If the Arm CI breaks now, there’s no way we will merge that [pull request] until we figure out why... We have full confidence now that [build failures] are not an issue with flaky hardware [as they sometimes were before].” For its part, Oracle is continuing its policy of donating $3 million per year in OCI credits for Arm64 instances powered by Ampere to CNCF projects. This generosity, along with the newfound stability of Arm64 platforms catalyzed by Ampere and Equinix, and brought about by Actuated, is enabling prominent cloud infrastructure vendors, including Red Hat, SUSE, Canonical, and Mirantis, to provide full support for their enterprise customers who choose ARM64 infrastructure. Parity makes it possible for enterprises to make sensible choices about their computing infrastructure and platforms without incurring penalties just for choosing an alternative architecture. Large cloud customers are proving that Arm64 can provide organizations with the performance they need and reduced expenses for workloads — all with industry-leading energy efficiency. But organizations can’t experience those benefits until they can deploy their workloads on all infrastructure options on a level playing field with one another and measure the results for themselves. Leveling the Playing Field In early 2023, few options existed for GitHub-hosted projects that wanted to fully integrate Arm64 into their continuous integration processes. Through this initiative, leveraging an innovative software solution from Actuated with Ampere CPUs hosted by Equinix, we lowered the bar for CNCF projects to make a start towards parity of support for ARM64 and x86. Key cloud-native projects, including etcd, containerd, Open Telemetry, Falco, and others, were able to advance their support of Arm64, accelerate their CI runs on native Arm64 infrastructure, and support increasing numbers of their users taking advantage of ARM64 compute in the cloud. By the end of this pilot project, the number of options for developers has grown considerably. 
The CNCF now offers its projects the ability to run GitHub Actions jobs on managed Kubernetes clusters on OCI, using Ampere-powered instances and the GitHub project Actions Runner Controller, and with the addition of hosted Arm64 runners to GitHub, it has never been easier for projects to easily support this fast-growing and exciting architecture for cloud-native applications. Check out the full Ampere article collection here.

By Scott Fulton III
A Fresh Look at Optimizing Apache Spark Programs

I have spent countless hours debugging slow Spark jobs, and it almost always comes down to a handful of common pitfalls. Apache Spark is a powerful distributed processing engine, but getting top performance requires more than just running your code on a cluster. Even with Spark’s built-in Catalyst optimizer and Tungsten execution engine, a poorly written or configured Spark job can run slowly or inefficiently. In my years as a software engineer, I have learned that getting top performance from Spark requires moving beyond the defaults and treating performance tuning as a core part of the development process. In this article, I will share the practical lessons I use to optimize Spark programs for speed and resource efficiency. Overview: The goal is to tackle performance from every angle. We will start at the top with cluster-level configurations like resource allocation and memory, then dive right into the code to cover best practices for writing efficient Spark APIs. From there, we will get into the often overlooked but critical layer of data storage and formatting for faster I/O. To wrap it all up, we will see how monitoring the Spark UI and logs is key to refining performance over time. Prelude: Understanding Spark’s Architecture and Lazy Evaluation Before we get into performance optimization, it helps to anchor on how Spark runs your program. Diagram: Spark architecture Driver: The driver program runs your main Spark application, builds a logical plan (a DAG of transformations), turns it into a physical plan, and schedules tasks across the Executor programs. It tracks job progress and collects results. Executors: Executors live on worker nodes. They run tasks in parallel, keep partitions of data in memory or on disk, and spill to disk if needed. Executors are where your transformations actually execute. Cluster Manager (allocator): YARN, Kubernetes, or Spark’s standalone manager decides how many executors you get and where they run. With the right settings, Spark can scale executors up/down as workload changes. Lazy evaluation: Transformations like select, filter, groupBy, and join do not execute immediately. Spark builds a plan and only materializes work when you call an action (count, collect, write, show, etc.). This gives Spark the chance to: Push filters down to the data sourcePrune unused columnsReorder operationsChoose efficient join/partition strategies Spark Configuration Tuning Spark offers many knobs and configuration properties that can significantly impact performance. Proper tuning of these settings ensures that your Spark cluster resources are used optimally. Key areas include dynamic resource allocation, memory management, and setting the right level of parallelism for shuffles. 1. Dynamic Resource Allocation By default, Spark uses a static number of executors for the entire application. However, in real-world scenarios, data sizes can vary from one job run to another or even between stages of a single job. Allocating a fixed number of executors can either waste resources or cause bottlenecks. Dynamic Resource Allocation allows Spark to scale the number of executors up or down based on workload. This means Spark can add executors when your job has a lot of work and removes them when they are no longer needed, improving efficiency. To enable dynamic allocation, set spark.dynamicAllocation.enabled to true in your Spark configuration. 
You should also ensure the cluster is configured to support this by enabling either the external shuffle service or Spark’s internal shuffle tracking (via spark.shuffle.service.enabled or spark.dynamicAllocation.shuffleTracking.enabled). Additionally, you can tune related settings such as the minimum and maximum number of executors Spark can allocate (spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors respectively). For example, you might allow Spark to scale between 2 and 10 executors depending on load: Python from pyspark.sql import SparkSession spark = SparkSession.builder \ .config("spark.dynamicAllocation.enabled", "true") \ .config("spark.dynamicAllocation.minExecutors", 2) \ .config("spark.dynamicAllocation.maxExecutors", 10) \ .getOrCreate() With dynamic allocation on, Spark will automatically “stretch” or “shrink” your cluster usage as needed, so you don’t have to guess the perfect number of executors for each job. This is especially useful for multi-tenant environments or jobs with varying input sizes. 2. Memory Configuration Memory is a critical resource in Spark. If executors or the driver run out of memory, you will get the dreaded out-of-memory (OOM) errors which often crash the job. Proper memory tuning helps avoid these issues. When you allocate memory (spark.executor.memory for executors, spark.driver.memory for the driver), it is important to understand that these two serve different purposes. The driver uses its memory mainly for job coordination — building the DAG, scheduling tasks, and collecting results, and not for holding large datasets. The executors, on the other hand, use their memory for actual data processing and caching, and this is where most tuning matters. Each executor JVM divides its allocated heap into regions for different uses: Execution memory (for shuffles, sorts, joins, and aggregations)Storage memory (for caching datasets and broadcast variables)User and system memory (for internal metadata and user objects) The parameter spark.memory.fraction controls what portion of the executor’s JVM heap is reserved for the combined execution + storage regions, with the remainder (usually 25%) left for user and system overhead. By default, spark.memory.fraction is set to 0.75, meaning roughly 75% of the executor heap is available for Spark’s data operations, and 25% is reserved for overhead and bookkeeping. Within that memory region reserved for data operations (the 75% part), Spark dynamically manages execution and storage memory, giving preference to computation when needed (e.g., a large join) and evicting cached blocks if space runs short. For most applications, this default split works well, but certain workloads benefit from fine-tuning. For example, if you cache large datasets, increasing spark.memory.fraction slightly (e.g., toward 0.8) gives Spark’s unified memory region more room for storage memory, allowing more data to stay in memory and reducing cache evictions. However, this leaves less overall JVM heap available for temporary objects, which can increase the chance of garbage collection or spills during computation. Conversely, if your job involves heavy joins, aggregations, or UDFs with little caching, lowering the fraction (e.g., to 0.6–0.7) slightly reduces the unified region but gives the JVM more free heap outside Spark’s managed memory, helping it handle transient objects and preventing out-of-memory errors during shuffle-intensive workloads. 
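To make this concrete, here is a hedged configuration sketch. The values are illustrative placeholders rather than recommendations and should be tuned against your own workload and cluster:

Python

from pyspark.sql import SparkSession

# Illustrative only: a cache-heavy job that gives the unified region a bit more
# room (spark.memory.fraction = 0.8) while keeping executor memory well within
# the physical memory of the node.
spark = SparkSession.builder \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.memory.fraction", "0.8") \
    .getOrCreate()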
The key is to avoid excessive spilling (which occurs when Spark runs out of in-memory space for shuffles) while leaving enough headroom to prevent OOM errors. You can monitor this balance in the Spark UI’s Executors tab. Frequent disk spills or long GC times are signals that the memory configuration might need adjustment. In summary, allocate sufficient memory to each executor (where most of the work happens), and be mindful of how that memory is divided. Keeping the defaults is often safest unless you have specific caching or shuffle-heavy workloads. And of course, never exceed the physical memory available on a node, or the OS may terminate your processes. 3. Shuffle Parallelism (Partitions) Another important configuration is the degree of parallelism for shuffle operations. Spark breaks tasks into partitions; too few partitions means each task handles a lot of data (potentially straining memory and increasing shuffle time), while too many partitions means a lot of small tasks with overhead and possibly too many small output files. By default, Spark uses 200 shuffle partitions for joins and aggregations (spark.sql.shuffle.partitions = 200). This default may not be optimal for your data size or cluster. Tuning the number of partitions can lead to better performance. For big jobs with very large datasets, you might increase the partitions to get more parallelism. For example, if processing a 1 TB dataset, 200 partitions (~5 GB per partition) is likely too few — you might raise it to 1000 or more so each partition is ~1 GB or less. On the other hand, for smaller jobs or after filtering down data, 200 partitions might be too many (each partition too small). In such cases, reducing the number of partitions can cut down scheduling overhead and small task inefficiencies. Spark’s repartition() and coalesce() transformations or SQL hints can also be used in code to increase or decrease partitions for a given DataFrame. A common rule of thumb is to aim for 100 MB–200 MB of data per partition for efficient processing. You can estimate this by looking at your input data size and dividing by a target partition size. For example, if you expect to process ~50 GB of data, using about 250–500 partitions might be reasonable (since 50 GB / 200 MB ≈ 250). Conversely, if you have a small dataset of 1 GB, having 200 partitions (each ~5 MB) is overkill — you could reduce spark.sql.shuffle.partitions to maybe 10 or 20 in that case. The goal is balanced partitions: not too large to cause memory issues, and not too small to cause excessive task overhead. In Spark 3.x and beyond, there is also an Adaptive Query Execution (AQE) feature that can automatically optimize the number of post-shuffle partitions at runtime. When AQE is enabled (spark.sql.adaptive.enabled=true by default in recent versions), Spark will merge small shuffle partitions or split skewed ones on the fly based on actual data statistics. This helps mitigate issues of suboptimal partition counts. We’ll discuss AQE a bit more later, but it is good to be aware that Spark itself can adjust some of this if configured. Takeaway: Don’t just stick with the default 200 partitions for all jobs. Tune it or let AQE handle it, but ensure your tasks are neither overwhelmed by data nor spending more time coordinating than computing. Code Level Optimization Optimizing Spark is not just about configs. How you write your Spark code has a big impact on performance. 
This section covers best practices in coding your transformations and actions to minimize expensive operations and fully leverage Spark’s internal optimizations. The guiding principle is to work with Spark, not against it: use high-level APIs that allow Spark to optimize under the hood, and be mindful of operations that cause a lot of data movement or computation. 1. Prefer Data Frame/Dataset APIs Over RDDs Apache Spark started with the low-level RDD API, but nowadays you should favor the higher-level Data Frame or Dataset APIs for most tasks. Data Frames and Datasets provide declarative, SQL-like operations (e.g., select, filter, groupBy) that Spark’s Catalyst optimizer can analyze and optimize aggressively. In contrast, RDDs (with manual map, reduce, etc.) do not get those optimizations. The result: Data Frame/Dataset operations are generally much faster and more memory efficient than equivalent RDD code because Spark can optimize the query plan and even generate efficient underlying code (thanks to Tungsten). For example, if you have data you want to aggregate by key, doing it as a Data Frame df.groupBy("key").agg(sum("value")) will let Spark decide the best execution plan. If you did the same with RDDs (e.g., using rdd.groupByKey() and then summing), Spark has less information to optimize, and the operation could be slower or use more memory. In short, leverage Spark’s built-in engine by using DataFrames/Datasets and express your computations at a high level. This not only gives performance benefits, but typically leads to cleaner and more concise code as well. The only time you might need RDDs is for very specialized processing or when using certain legacy libraries, but even then, you can often convert between DataFrames and RDDs as needed. As a rule of thumb, if an operation is supported in the DataFrame API, use it instead of the RDD API. High-level APIs let Spark do the hard work of optimization for you. 2. Filter Early and Minimize Data Shuffles Minimizing data movement (shuffling) is critical for performance. Shuffles (like those that happen during joins or aggregations) involve redistributing data across the network and are often the most expensive part of a Spark job. One of the best ways to cut down shuffle cost is to simply reduce the data size early in your job. Apply filters and projections as soon as possible in your ETL pipeline so that subsequent operations work on less data. If you only need a subset of data, do not wait until after a big join to filter it and instead push those filters up front. Removing even 10% of data upfront can have ripple effects of speeding up downstream shuffles and computations. For example, if you are processing logs for a specific date range, filter by date as the first step rather than after doing a bunch of processing. This “predicate pushdown” approach often happens automatically with DataFrame queries on persistent data (Parquet will only read needed partitions), but it is also a mindset for your transformations: do the narrow, data-reducing transformations early, and the wide, expensive ones (like joins) later on the smaller data. In addition, consider data partitioning before heavy operations to avoid skew and balance the load. If you know you will join two DataFrames on a key that is very skewed, you might repartition the DataFrame by that key first to ensure even distribution, or use techniques like salting. 
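Here is a hedged sketch of both ideas, filtering early and repartitioning on the join key; the paths, DataFrame names, columns, and partition count are hypothetical, and an existing SparkSession named spark is assumed:

Python

from pyspark.sql.functions import col

# Filter and prune as early as possible so the join below shuffles far less data.
events = (spark.read.parquet("/data/events")          # hypothetical path
          .filter(col("event_date") >= "2025-10-01")
          .select("customer_id", "event_type", "event_date"))

customers = spark.read.parquet("/data/customers")     # hypothetical path

# Repartition both sides on the join key ahead of the join; for a heavily
# skewed key, salting the key may also be needed.
events = events.repartition(400, "customer_id")
customers = customers.repartition(400, "customer_id")
joined = events.join(customers, "customer_id")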
On the other hand, if you have a many-partitioned dataset but your next stage is an aggregate that does not need so many partitions, you could coalesce it down to avoid unnecessary shuffle partitions. In summary, think about data volume at each step and try to keep it minimal. Every time you can cut down data size or avoid an unnecessary shuffle, you win performance gains. 3. Use Efficient Join Strategies (Broadcast Small Data) Joins are a common source of performance problems in Spark, especially when joining a very large dataset with a smaller one. A standard join will shuffle both sides of the join across the network to group matching keys together, which can be expensive. However, Spark provides an optimization for the scenario where one dataset is much smaller than the other: broadcast joins. A broadcast join sends a copy of the small dataset to every executor, so that Spark can join without shuffling the large dataset at all. Each executor already has the small dataset locally in memory. This can dramatically speed up joins where, say, you have a large fact table and a small dimension table (e.g., a few MBs or a few thousand records). Spark’s Catalyst optimizer will automatically broadcast a table in a join if it estimates the table size is below a certain threshold (controlled by spark.sql.autoBroadcastJoinThreshold, default ~10MB). But sometimes you know better which table is small, or the estimates are off, so you can also manually hint or specify a broadcast. Using the Data Frame API in PySpark/Scala, you can import the broadcast function and apply it to the smaller Data Frame in a join, for example: Python from pyspark.sql.functions import broadcast # large_df: a huge DataFrame # small_df: a small DataFrame that fits in memory result_df = large_df.join(broadcast(small_df), on="customer_id") In the above code, small_df will be broadcast to all executors. The join can then be performed without a costly shuffle of large_df. Broadcast joins work best when the smaller dataset can easily fit in each executor’s memory. If the “small” dataset is too large to broadcast (bigger than the threshold or memory limits), then a broadcast join could backfire by causing memory pressure, so use this for truly small reference data. Another join strategy is to ensure that both sides of a large join are partitioned by the join key (using Data Frame's repartition on the key, or ensuring the data was written partitioned by that key). If both datasets are partitioned on the join key, Spark will still do a shuffle join, but it can make sure the same keys end up in the same partition with potentially less network traffic (this is more advanced and often handled by Catalyst/AQE). 4. Avoid UDFs and Utilize Built-in Functions Spark allows you to define user-defined functions (UDFs) to run custom code on Data Frame rows, but they come at a cost. UDFs (especially in PySpark) break the optimized flow: they force Spark to serialize data to Python, execute your Python code, and deserialize results. They are also “black boxes” to the Catalyst optimizer (Spark cannot look inside them to apply further optimizations). All this means UDFs typically run much slower than Spark’s native operations. Whenever possible, try to use Spark’s built-in functions (in pyspark.sql.functions or the Data Frame API) instead of writing a UDF. Spark has a rich set of functions for string manipulation, date processing, aggregations, etc., which run in the JVM and can be vectorized and optimized by Spark. 
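As an illustration, here is a hedged sketch comparing a Python UDF with the equivalent built-in; it assumes an existing DataFrame df with a string column name:

Python

from pyspark.sql import functions as func
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Slow path: a Python UDF is opaque to Catalyst and forces rows through
# serialization between the JVM and Python.
to_upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df_slow = df.withColumn("name_upper", to_upper_udf(col("name")))

# Fast path: the built-in runs inside the JVM and stays visible to the optimizer.
df_fast = df.withColumn("name_upper", func.upper(col("name")))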
For example, to convert a string column to uppercase, using the built-in func.upper(col("name")) is far faster than writing a Python UDF to do the same, because the built-in version will execute within Spark’s engine (and even leverage optimized Java code). If you absolutely must use a UDF (for example, you need some complex logic not available as a built-in, or you want to use a specific Python library function on your data), consider using Pandas UDFs (vectorized UDFs). Pandas UDFs can process a batch of rows at once using Apache Arrow for data transfer between JVM and Python, which is much more efficient than the “one row at a time” standard UDF. They were introduced in Spark 2.3+ and can often give performance closer to built-ins while still allowing you to write Python code for the custom logic. To summarize: avoid UDFs unless necessary. Check Spark’s function library first — chances are there is already a way to do what you need. By using built-in functions, you keep the execution within the optimized Spark engine and benefit from Catalyst’s ability to rearrange and optimize those operations. 5. Cache Judiciously to Reuse Data Spark’s lazy evaluation means it will recompute transformations each time an action is called, unless it can optimize otherwise. If you have a DataFrame or RDD that you need to use multiple times in your job (say you do multiple actions or multiple joins with it), it might make sense to cache (persist) that data in memory. Caching can save expensive recomputation in iterative algorithms or complex pipelines. However, caching is not a silver bullet. Use it only for datasets that are reused enough to pay off the cost of caching. If you cache something that is only used once, you incur extra overhead (to store it) for no benefit. Identify points in your job where caching makes a difference. For example, if you filter and transform a base Data Frame and then use the result in two separate joins, caching the result of that filter + transform step might be worthwhile so that step is not redone for each join. When using df.cache() or df.persist(), prefer the MEMORY_ONLY storage level for purely in-memory caching, or MEMORY_AND_DISK if the data is too large (Spark will spill to disk if it cannot fit in RAM). Also, remember to undo persisting the data later if it is no longer needed to free up executor memory. A common pattern is to cache the “core” dataset that many downstream operations depend on. For instance, in a machine learning pipeline, you might cache the prepared feature Data Frame that is used by multiple models for training. Or in an ETL, cache the cleaned data that will be aggregated in multiple ways. This can give significant speedups. Spark will automatically avoid recomputation of some intermediate results thanks to DAG optimization, but it is not omniscient — explicit caching is your tool to hint to Spark, “I’m going to use this again, keep it around.” Used properly, caching and persistence are powerful for optimization. Data Storage and Format Optimization Optimizing how data is stored and accessed can greatly speed up Spark jobs, especially for I/O heavy workloads. This is often an overlooked aspect of performance. By choosing efficient data formats and organizing data cleverly, you can reduce the amount of data that Spark needs to read and write, thus accelerating your jobs. 
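Before getting into the specifics, here is a hedged sketch of a storage-friendly write and read; the paths, columns, and partition keys are hypothetical, and an existing DataFrame df and SparkSession spark are assumed:

Python

from pyspark.sql.functions import col

# Write compressed, columnar Parquet, partitioned by fields that are commonly filtered on.
df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .option("compression", "snappy") \
    .parquet("/warehouse/events")

# Reads that filter on the partition columns only touch the matching folders
# (partition pruning) and only the selected columns (column pruning).
recent = (spark.read.parquet("/warehouse/events")
          .filter((col("year") == 2025) & (col("month") == 10))
          .select("user_id", "event_type"))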
Use Columnar, Compressed File Formats If you have a choice, store your data in a columnar format like Parquet or ORC rather than as plain text (CSV/JSON) or other row-based formats. Columnar formats are optimized for analytic queries: Spark can skip reading columns that are not needed (saving I/O) and apply predicate pushdown (only read data that matches filters). They also integrate well with Spark’s off-heap memory management. Additionally, they support compression and encoding that significantly reduce file sizes. For example, Parquet with Snappy compression often gives huge space savings and speeds up data reads due to needing to transfer less data from disk. The compression is automatic and split into blocks so that Spark can still read in parallel. In short, prefer Parquet/ORC with compression for Spark data — your jobs will thank you with faster disk and network operations. Partition Data on Disk for Pruning If your data is large and stored in a distributed filesystem (like HDFS or S3), organize it into partitioned directories by key fields that you commonly filter by. For instance, it is very common to partition datasets by date (.../year=2025/month=10/day=11/ etc.). When you load a partitioned table and apply a filter like WHERE year=2025 AND month=10, Spark (through Hive Metastore or internal logic) will only read the folders matching those partition values, skipping the rest entirely. This partition pruning can cut down the runtime dramatically for time range queries or other partition key filters. Choose partition columns that have enough granularity to split data, but not so high cardinality that you get a million tiny files. Dates, regions, or categorical fields are good candidates. On a related note, avoid creating too many small files in the output. If you have, say, 1000 partitions each resulting in a small file of a few KB, Spark’s reading overhead increases (and many files can strain the file system). It’s often better to coalesce or repartition data before writing out to end up with larger chunkier files (e.g., each 100MB or more). The goal is a balance: not one huge monolithic file (which can’t be read in parallel efficiently), but also not tens of thousands of tiny files. File I/O operations have overhead, so fewer, larger files are usually better for throughput. Bucketing and Sorting For certain use cases (especially joining on a key or doing aggregations), partitioning might not be enough. Spark offers bucketing (hash partitioning data into a fixed number of files by column and optionally sorting within each bucket). If two tables are bucketed by the same key into the same number of buckets, Spark can do a join or aggregation on those with less shuffle. Bucketing is more of an advanced technique and adds complexity (and it is static once set up), but it can be worth it for repeated queries on huge data. If you have control over how data is written, consider bucketing on high cardinality join keys to optimize those operations. In summary, store data in a Spark-friendly way. Use the right formats (columnar + compressed), partition the data for predicate pushdown, and write outputs in appropriately sized files. These optimizations at the storage layer often make a big difference, because they reduce the amount of work Spark has to do before your actual computation even starts. Monitoring Tuning is an iterative process. 
Even after applying the above strategies, it’s important to monitor your Spark application’s performance and identify any remaining bottlenecks or areas to improve. Spark provides excellent tools for this. The Spark Web UI is your first stop. When you run a Spark application, you can access its UI (typically at http://<driver-node>:4040 for local or the tracking URL in cluster mode). On the UI, examine the Jobs and Stages tabs to see where time is being spent. Look at metrics like Task Duration, Shuffle Read/Write, Memory Spill, etc. For example, if you see a stage with an enormous Shuffle Read size, that is a hint that maybe you need to reduce data shuffling or use a broadcast join. Large "Shuffle Write" metrics indicate a lot of data being shuffled; this might signal an opportunity for optimization by filtering or repartitioning differently. Similarly, if you notice tasks are slow due to GC, that indicates memory tuning might be needed. Spark UI also helps spot data skew. If one task is running much longer than others in a stage, perhaps one partition had way more data (skew), and you might handle that by better partitioning or using techniques like salting or the spark.sql.adaptive.skewJoin.enabled feature in AQE. Besides the UI, you can enable Spark event logs and use tools or Spark History Server to analyze runs after the fact. There are also external APM (Application Performance Management) tools and metrics sinks (Ganglia, Prometheus, etc.) that can collect Spark metrics. Leverage these tools to get insight into your job’s behavior. It is often helpful to test your job on a sample of data or on a smaller scale, profile it, and then extrapolate or catch issues before running on full data. Finally, remember that optimization is a balancing act and sometimes requires tradeoffs. The “best” settings can depend on your specific data sizes, data distributions, and cluster resources. What works for one job (or one day’s data) might need adjustment for another. By monitoring and understanding Spark’s execution, you will build an intuition for where the bottleneck is — be it CPU, memory, network, or I/O and apply the appropriate fix from the toolbox of techniques discussed above. Key Takeaways Leverage Spark’s optimizer: Use high-level APIs (Data Frames/Datasets) and built-in functions to let Spark’s Catalyst and Tungsten engines optimize your code. Avoid using RDDs or UDFs unless absolutely necessary, as they bypass many optimizations.Tune resources and configs: Adjust Spark configurations to your workload. Enable dynamic allocation for variable workloads, set appropriate executor memory to avoid OOM, and tune the number of shuffle partitions so tasks are neither too large nor too small.Minimize data shuffling: Design your computations to reduce data movement. Filter data early in the pipeline, and use techniques like broadcast joins for small lookup datasets. Where possible, partition or coalesce data to avoid skew and excessive tiny tasks.Efficient data storage: Use columnar formats like Parquet/ORC with compression for input and output data. Partition your data files by common keys (e.g., date) to enable Spark to skip reading unnecessary data. Avoid writing out a myriad of small files. Aim for fewer, larger files for better I/O throughput.Cache smartly: Cache/persist intermediate datasets only when they are reused multiple times in your job. 
This can save time on re-computation for iterative algorithms or repeated joins, but do not waste memory on caching data that is used just once.
Monitor and iterate: Use the Spark UI and logs to profile your job’s performance. Identify stages with heavy shuffles, long tasks, or skew, and address those with the appropriate optimization techniques. Optimization is iterative — observe, tweak, and repeat to home in on the best performance for your Spark application.
By following these best practices and continuously observing your Spark jobs, you can significantly improve the speed and efficiency of your Apache Spark programs.

By Nataraj Mocherla
Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API

If you are a developer, regardless of the technologies you use, at some point you will face the challenge of handling large Excel files. Most commonly, you will see the "out of memory" error when handling these files. Here is the requirement: you get a workbook with 100 spreadsheets, and each of them has 1 million records with a minimum of 100 columns. For your reference, here is how many rows each Excel version can support:
1,048,576 rows – Excel 365, 2013, 2010, 2007
65,536 rows – Excel 2003 and earlier versions
Cost and Hosting Server Limitations
Third-party license purchases are not feasible; open source is allowed.
Deployment should be on average-tier cloud environments (4 GB disk/4 GB RAM or less) or on an on-premises Windows Server (16/32 GB RAM) already loaded with 10+ running applications.
How would you handle this situation? Let's explore the current digital market: are there any open-source solutions available to meet this requirement? I can’t use Interop when the cloud runs on Linux, and it also causes intermittent hang issues in multi-threaded applications. Most developers will point to the popular Apache POI streaming library, so let’s proceed with a practical implementation using Apache POI. To get a large dataset (Excel) file, there are many websites to choose from, such as awesome-public-datasets, Google Dataset Search, and World Bank Data. Personally, I frequently visit Kaggle. Let's download the Containers dataset as a 1.9 GB CSV file and then save it as an Excel file (.xlsx), which comes to around 600 MB. The sheet includes 87 columns and 1,048,576 rows. Before we run the solution, let me capture my laptop's resource usage so that we can compare later.
Resource Usage
CPU: 3%
Memory: 54%
IDE
I am using Eclipse IDE (v4.36.0).
Apache POI
I am using the latest Apache POI 5.2.5 and the other dependency JARs available from Apache POI.
Source Code
Here, I am just attempting to read the sheet names from the workbook, not the rows.
Java
public static void main(String[] args) throws Exception {
    String filePath = "C:\\POC\\Containers_Dataset.xlsx";
    ReadExcelbyApachePOI(filePath);
}

/* List out sheet names */
static void ReadExcelbyApachePOI(String filePath) throws Exception {
    try (OPCPackage opcPackage = OPCPackage.open(new File(filePath), PackageAccess.READ)) {
        // Creating a full XSSFWorkbook parses the entire file into memory.
        XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
        XSSFReader xssfReader = new XSSFReader(opcPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();
            System.out.println("Sheetname: " + sheetName);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Result
Encountering a "Java heap space" out-of-memory error:
Java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Resource Usage
CPU: 58%
Memory: 94%
Let us try the sjxlsx open-source API instead. It is an open-source Java API whose source code was first published on Google Code and now appears unmaintained; it is available on GitHub, where anyone can download it and adapt it to their needs. "Memory" and "speed" are the primary goals of this API. It provides two modes, "classic" and "stream":
Classic – All records of the sheet are loaded.
Stream – Read one record at a time.
Microsoft XLSX uses XML+zip (OOXML) to store the data, so, to be fast, "sjxlsx" uses StAX for XML input and output.
Source Code Java public static void main(String[] args) throws Exception { String filePath = "C:\\POC\\Containers_Dataset.xlsx"; SimpleXLSXWorkbook workbook = newWorkbook(filePath); testLoadALL(workbook); } private static SimpleXLSXWorkbook newWorkbook(String filePath) { return new SimpleXLSXWorkbook(new File(filePath)); } /*Read Each Row*/ private static void printRow(int rowPos, com.incesoft.tools.excel.xlsx.Cell[] row) { int cellPos = 0; for (com.incesoft.tools.excel.xlsx.Cell cell : row) { System.out.println(com.incesoft.tools.excel.xlsx.Sheet.getCellId(rowPos, cellPos) + "=" + cell.getValue()); cellPos++; } } /*Load & Read workbook * false => Read each row * true => Load all rows */ public static void testLoadALL(SimpleXLSXWorkbook workbook) { com.incesoft.tools.excel.xlsx.Sheet sheetToRead = workbook.getSheet(0,false); SheetRowReader rowreader = sheetToRead.newReader(); int rowPos = 0; while (rowreader != null) { com.incesoft.tools.excel.xlsx.Cell[] row = rowreader.readRow(); printRow(rowPos, row); rowPos++; } } Resource Usage CPU: 3% (No changes)RAM: 61% (7% usage: 1 GB usage) Output Java BN1048519=40298 BO1048519=0 BP1048519=0 BQ1048519=0 BR1048519=0 BS1048519=610 BT1048519=0 BU1048519=1 BV1048519=64240 BW1048519=923 BX1048519=158 BY1048519=32 BZ1048519=0 CA1048519=0 CB1048519=0 CC1048519=0 CD1048519=0 CE1048519=0 CF1048519=0 CG1048519=0 CH1048519=10000206 CI1048519=0 A1048520=100.64.0.2-10.16.0.9-35919-8080-6 B1048520=100.64.0.2 C1048520=35919 D1048520=10.16.0.9 E1048520=8080 F1048520=6 G1048520=45266.83932053241 H1048520=41626 I1048520=6 J1048520=5 K1048520=515 L1048520=357 M1048520=515 N1048520=0 O1048520=85.8333333333333 P1048520=210.24786958888899 Q1048520=357 R1048520=0 S1048520=71.400000000000006 T1048520=159.65525359348399 U1048520=20948.445682986501 V1048520=264.25791572574798 W1048520=4162.5999999999904 X1048520=12728.124713056101 Y1048520=40374 Z1048520=9 AA1048520=41626 AB1048520=8325.2000000000007 AC1048520=17922.528077813098 AD1048520=40374 AE1048520=29 AF1048520=41594 AG1048520=10398.5 AH1048520=20011.5685292282 AI1048520=40406 AJ1048520=26 AK1048520=1 AL1048520=1 AM1048520=0 AN1048520=0 AO1048520=0 AP1048520=0 AQ1048520=200 AR1048520=168 AS1048520=144.14068130495301 AT1048520=120.11723442079401 AU1048520=0 AV1048520=515 AW1048520=79.272727272727295 AX1048520=179.87445116474399 AY1048520=32354.8181818181 AZ1048520=2 BA1048520=2 BB1048520=0 BC1048520=2 BD1048520=10 BE1048520=0 BF1048520=0 BG1048520=0 BH1048520=0.83333333333333304 BI1048520=79.272727272727195 BJ1048520=85.8333333333333 BK1048520=71.400000000000006 BL1048520=0 BM1048520=0 BN1048520=0 BO1048520=0 BP1048520=0 BQ1048520=0 BR1048520=0 BS1048520=46 BT1048520=0 BU1048520=32 BV1048520=64240 BW1048520=502 BX1048520=1 BY1048520=32 BZ1048520=0 CA1048520=0 CB1048520=0 CC1048520=0 CD1048520=0 CE1048520=0 CF1048520=0 CG1048520=0 CH1048520=41626 CI1048520=0 A1048521=100.64.0.2-10.16.0.9-9479-8080-6 B1048521=100.64.0.2 C1048521=9479 D1048521=10.16.0.9 E1048521=8080 F1048521=6 G1048521=45266.835683206016 H1048521=111205 I1048521=6 J1048521=5 K1048521=537 L1048521=357 Performance Results Woo-hoo! I can read the records. The winner is "sjxlsx." It has been proven that this library consumes less than 1 GB of memory, compared to the higher usage by Apache POI. It is an excellent open-source Java API for reading large Excel datasets. Additional Features In addition to that, it supports writing Excel and offers rich data formatting in sheets. 
Java public static void main(String[] args) throws Exception { // WRITE - we take WRITE as a special kind of MODIFY SimpleXLSXWorkbook workbook = newWorkbook(); OutputStream output = ExcelOutput("write"); WriteExcel(workbook, output); output.close(); } private static SimpleXLSXWorkbook newWorkbook() { return new SimpleXLSXWorkbook(new File("/sample.xlsx")); } private static OutputStream ExcelOutput(String suffix) throws Exception { return new BufferedOutputStream(new FileOutputStream("/sample_" + suffix + ".xlsx")); } public static void WriteExcel(SimpleXLSXWorkbook workbook, OutputStream outputStream) throws Exception { com.incesoft.tools.excel.xlsx.Sheet sheet = workbook.getSheet(0); WriteRecords(workbook, sheet, 0); workbook.commit(outputStream); } static public void WriteRecords(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet, int rowOffset) { int columnCount = 10; int rowCount = 10; int offset = rowOffset; for (int r = offset; r < offset + rowCount; r++) { int modfiedRowLength = sheet.getModfiedRowLength(); for (int c = 0; c < columnCount; c++) { sheet.modify(modfiedRowLength, c, r + "," + c, null); } } } Writing a Richly Styled Row Java public static void WriteRichStyleRow(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet) throws Exception { Font font = wb.createFont(); font.setColor("FFFF0000"); Fill fill = wb.createFill(); fill.setFgColor("FF00FF00"); CellStyle style = wb.createStyle(font, fill); RichText richText = wb.createRichText(); richText.setText("test_text"); Font font2 = wb.createFont(); font2.setColor("FFFF0000"); richText.applyFont(font2, 1, 2); sheet.modify(0, 0, (String) null, style); sheet.modify(1, 0, richText, null); } Summary Ultimately, "sjxlsx" provides an efficient, lightweight way to read large Excel files without infrastructure headaches. Thank you!

By Mahendran Chinnaiah
Stop React Native Crashes: A Production-Ready Error Handling Guide

You are demonstrating your next mobile application idea to investors, and suddenly the application crashes. You have no idea whether it crashed because of bad Wi-Fi at the venue, because your backend service is down, or because the application itself hit a snag. There are no logs or alerts to tell you what just happened. Merely imagining this scenario is enough to make any founder or developer nervous, which is why robust error handling is critical in modern mobile applications. React Native has become an industry standard for building cross-platform native mobile applications, and it offers a rich ecosystem of error handling options at various levels of the architecture. However, projects often lack clear guidelines on how to handle the different error scenarios, which results in either overdone or underdone error handling in otherwise robust and feature-rich applications. In this article, you will learn how to adopt a three-layer error handling mechanism that is gaining widespread popularity in enterprise-level applications built at leading companies. The article is a practical guide rather than pure theory, so you can apply these concepts immediately. Error Types To make sense of the layers of error handling in React Native applications, you first need to understand the core error types. These categories are explained below: Network errors: Network errors can occur due to loss of connectivity, for example, losing cell reception while traveling or a Wi-Fi disconnection. These are the most common errors a mobile app experiences. Component errors: Component errors can happen due to incorrectly implemented React lifecycle hooks, invalid JSX rendering, accessing undefined state properties, or infinite render loops. These errors can crash the entire app at once. Runtime errors: These are core JavaScript runtime errors such as accessing properties on null or undefined values, JSON parsing failures, type mismatches, and unhandled promise rejections. They are the toughest to catch because they usually slip through standard testing. Because these errors are so varied, error handling needs to be layered so the application can both avoid crashing and recover when something does go wrong. Three Layers of Error Handling Implementing error handling at three critical layers ensures you catch all major errors. These layers are local error handling, component-level error handling, and global error handling. Note that these layers are not a 1:1 match for the error types above; instead, they work together to ensure that you catch all errors safely. Three layers of error handling Layer 1: Local Error Handling Local error handling refers to errors caught and handled within specific functions or code blocks. These errors are predictable and generally recoverable. You can perform several types of local error handling, as described below: Try-catch: Classic try-catch blocks are particularly useful when working with async operations such as calling external APIs. Please refer to the code below: JavaScript const fetchUserProfile = async (userId) => { try { const response = await fetch(`https://api.example.com/users/${userId}`); if (!response.ok) { throw new Error(`HTTP error! status: ${response.status}`); } const data = await response.json(); return data; } catch (error) { console.error('Failed to fetch user profile:', error); Alert.alert('Error', 'Unable to load profile.
Please try again.'); return null; } }; React query error handling: Async operations can be enhanced using modern constructs like React Query, which, along with standard try-catch level error handling, offers more options such as retrying failed operations. This allows for graceful recovery from an error state. Observe the sample implementation given below. JavaScript import { useQuery, useMutation } from '@tanstack/react-query'; import api from './services/api'; const UserProfile = () => { const { data, error, isError, isLoading } = useQuery({ queryKey: ['user', 'profile'], queryFn: async () => { const response = await api.get('/user/profile'); return response.data; }, retry: 3, retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000), onError: (error) => { if (error.code !== 'NETWORK_ERROR') { Alert.alert('Error', 'Failed to load profile'); } }, }); if (isLoading) return <Text>Loading...</Text>; if (isError) return <Text>Error: {error.message}</Text>; return <Text>{data?.name}</Text>; }; Form validation: Form validations at React Native components not only improve the app’s user experience but also improve backend performance by reducing unnecessary API calls. Below is an example for you to refer to: JavaScript import { useForm } from 'react-hook-form'; import { z } from 'zod'; import { zodResolver } from '@hookform/resolvers/zod'; const loginSchema = z.object({ email: z.string().email('Invalid email address'), password: z.string().min(6, 'Password must be at least 6 characters'), }); const LoginForm = () => { const { handleSubmit, formState: { errors } } = useForm({ resolver: zodResolver(loginSchema), }); const onSubmit = async (data) => { try { await api.post('/auth/login', data); } catch (error) { Alert.alert('Login Failed', error.message); } }; return ( <View> {errors.email && <Text style={styles.error}>{errors.email.message}</Text>} {errors.password && <Text style={styles.error}>{errors.password.message}</Text>} </View> ); }; Layer 2: Component-Level Error Handling Error boundaries catch component-level errors in the child component tree that go uncaught during local error handling. These can include infinite rendering, improper lifecycle management, and accessing undefined properties during rendering inside components. Without an Error Boundary, a single error can crash the entire application, leading to an unpleasant customer experience. There are two major steps involved in creating an error boundary: 1) creating a dedicated error boundary component, and 2) wrapping the entire application code in it as a child. The code below highlights how to create the error boundary component. JavaScript import React from 'react'; import { View, Text, TouchableOpacity, StyleSheet } from 'react-native'; class ErrorBoundary extends React.Component { constructor(props) { super(props); this.state = { hasError: false, error: null, errorInfo: null }; } static getDerivedStateFromError(error) { return { hasError: true, error }; } componentDidCatch(error, errorInfo) { console.error('ErrorBoundary caught an error:', error, errorInfo); this.setState({ errorInfo: errorInfo.componentStack, }); } handleReset = () => { this.setState({ hasError: false, error: null, errorInfo: null }); }; render() { if (this.state.hasError) { return ( <View style={styles.container}> <Text style={styles.title}>Oops! Something went wrong</Text> <Text style={styles.message}> We're sorry for the inconvenience. Please try again. 
</Text> {__DEV__ && this.state.error && ( <View style={styles.errorDetails}> <Text style={styles.errorText}> {this.state.error.toString()} </Text> {this.state.errorInfo && ( <Text style={styles.errorStack}> {this.state.errorInfo} </Text> )} </View> )} <TouchableOpacity style={styles.button} onPress={this.handleReset} > <Text style={styles.buttonText}>Try Again</Text> </TouchableOpacity> </View> ); } return this.props.children; } } Once the error boundary component is created, you can wrap the entire application in it, as shown in the code below. JavaScript import React from 'react'; import { SafeAreaView } from 'react-native'; import ErrorBoundary from './components/ErrorBoundary'; import MainNavigator from './navigation/MainNavigator'; const App = () => { return ( <ErrorBoundary> <SafeAreaView style={{ flex: 1 }}> <MainNavigator /> </SafeAreaView> </ErrorBoundary> ); }; export default App; Layer 3: Global Error Handling Global error handling is the final safety net: it catches errors that slip past both local and component-level error handling. These include uncaught JavaScript exceptions, unhandled promise rejections, and more. Furthermore, global error handling can centralize API error handling, keeping error handling manageable across the entire application. Global error handling component: Implementing the global error handler is quite similar to the error boundary component, with two stages: creating the component and initializing it in the root application. You can refer to the code below to create a global error handling component. JavaScript import { Alert } from 'react-native'; class GlobalErrorHandler { static init() { this.setupErrorHandler(); this.setupPromiseRejectionHandler(); } static setupErrorHandler() { ErrorUtils.setGlobalHandler((error, isFatal) => { console.error('Global Error Handler:', { error, isFatal, message: error.message, stack: error.stack, }); if (isFatal) { Alert.alert( 'Unexpected Error', 'The app encountered a critical error and needs to restart. ' + 'We apologize for the inconvenience.', [ { text: 'Restart', onPress: () => {}, }, ] ); } else { console.warn('Non-fatal error occurred:', error.message); } }); } static setupPromiseRejectionHandler() { if (__DEV__) { require('promise/setimmediate/rejection-tracking').enable({ allRejections: true, onUnhandled: (id, error) => { console.warn('Unhandled Promise Rejection (Dev):', id, error); }, onHandled: (id) => { console.log('Promise rejection was handled:', id); }, }); } } static handleError(error, context = {}) { console.error('Handled Error:', { error, context, message: error.message, stack: error.stack, }); let userMessage = 'Something went wrong'; if (error.code === 'NETWORK_ERROR') { userMessage = 'No internet connection. Please check your network.'; } else if (error.code === 'UNAUTHORIZED') { userMessage = 'Your session has expired. Please login again.'; } else if (error.message) { userMessage = error.message; } return userMessage; } } export default GlobalErrorHandler; Once created, initialize it in the main application component using the below reference code.
JavaScript import React, { useEffect } from 'react'; import { SafeAreaView } from 'react-native'; import ErrorBoundary from './components/ErrorBoundary'; import GlobalErrorHandler from './utils/GlobalErrorHandler'; import MainNavigator from './navigation/MainNavigator'; const App = () => { useEffect(() => { GlobalErrorHandler.init(); }, []); return ( <ErrorBoundary> <SafeAreaView style={{ flex: 1 }}> <MainNavigator /> </SafeAreaView> </ErrorBoundary> ); }; export default App; Axios interceptors: An Axios interceptor acts as a centralized API error handler. You can refer to the code below to implement it; once implemented, import it wherever you need to make API calls. JavaScript import axios from 'axios'; import AsyncStorage from '@react-native-async-storage/async-storage'; import { Alert } from 'react-native'; const api = axios.create({ baseURL: 'https://api.example.com', timeout: 10000, }); api.interceptors.request.use( async (config) => { const token = await AsyncStorage.getItem('authToken'); if (token) { config.headers.Authorization = `Bearer ${token}`; } return config; }, (error) => Promise.reject(error) ); api.interceptors.response.use( (response) => response, async (error) => { if (!error.response) { Alert.alert('Network Error', 'Please check your internet connection'); return Promise.reject({ code: 'NETWORK_ERROR', message: 'No internet connection', }); } switch (error.response.status) { case 401: await AsyncStorage.removeItem('authToken'); return Promise.reject({ code: 'UNAUTHORIZED', message: 'Session expired. Please login again.', }); case 403: Alert.alert('Access Denied', 'You do not have permission'); break; case 404: Alert.alert('Not Found', 'Resource not found'); break; case 500: case 502: case 503: Alert.alert('Server Error', 'Please try again later'); break; } return Promise.reject(error); } ); export default api; Conclusion You have learned about different errors your React Native application can encounter and a layered approach to handling them. This knowledge will help you build mobile applications that are reliable and resilient. If you wish to dive deeper, you can explore error logging services like Sentry and Firebase Crashlytics. While these services have costs, they are very useful tools for understanding and fixing production errors.

By Anujkumarsinh Donvir DZone Core CORE
AI-Assisted Kubernetes Diagnostics: A Practical Implementation

Kubernetes troubleshooting follows a repetitive pattern: identify unhealthy pods, examine descriptions, review logs, analyze events, and correlate information to find root causes. For common issues like CrashLoopBackOff, ImagePullBackOff, or OOMKilled pods, engineers repeat the same diagnostic steps daily, sometimes dozens of times per week in busy production environments. The traditional workflow requires running multiple kubectl commands in sequence, mentally correlating outputs from pod descriptions, container logs, event streams, and resource configurations. An engineer investigating a single failing pod might execute 5–10 commands, read through hundreds of lines of output, and spend 10–30 minutes connecting the dots between symptoms and root causes. For straightforward issues like memory limits or missing images, this time investment yields solutions that follow predictable patterns. Large language models can process this same information — pod descriptions, logs, events — and apply pattern recognition trained on thousands of similar scenarios. Instead of an engineer manually correlating data points, an LLM can analyze the complete context at once and suggest likely root causes with specific remediation steps. This article walks through a proof-of-concept tool available at opscart/k8s-ai-diagnostics (https://github.com/opscart/k8s-ai-diagnostics). The tool detects unhealthy pods in a namespace, analyzes them using OpenAI GPT-4, and provides diagnostics with suggested remediation steps. For certain failure types like CrashLoopBackOff or OOMKilled, it applies fixes automatically with human approval. The implementation stays minimal — just Python, kubectl, and the OpenAI API — making it easy to deploy and test in existing Kubernetes environments. The Problem Space Manual Diagnostic Overhead When a pod fails in Kubernetes, the diagnostic process typically looks like this: Shell # Check pod status kubectl get pods -n production # Examine pod details kubectl describe pod failing-pod -n production # Review container logs kubectl logs failing-pod -n production # Check previous container logs if crashed kubectl logs failing-pod -n production --previous # Examine events kubectl get events -n production --field-selector involvedObject.name=failing-pod For experienced engineers, this workflow becomes muscle memory. However, it still requires context switching between multiple kubectl commands, mental correlation of information across different outputs, knowledge of common failure patterns and their solutions, and time to write and apply remediation patches. Common Failure Patterns Kubernetes pods fail in predictable ways. ImagePullBackOff: Wrong image name, missing credentials, or registry connectivity issues. CrashLoopBackOff: Application startup failures, missing dependencies, or configuration errors. OOMKilled: Container memory usage exceeds defined limits. Probe Failures: Readiness or liveness probes fail due to application issues or misconfigurations. Each pattern has typical root causes and standard remediation approaches. This repetitive nature makes automation worth exploring.
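Before turning to the tool itself, it is worth seeing how even the first step, spotting unhealthy pods programmatically instead of scanning kubectl output by eye, can be scripted. The snippet below is an illustrative sketch only, not code from the repository; the function name and the health heuristics are assumptions.

Python
import json
import subprocess

# Waiting reasons that usually indicate a pod needs attention.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "CreateContainerConfigError"}

def find_unhealthy_pods(namespace):
    """Hypothetical helper: list pods in a namespace that look unhealthy."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    unhealthy = []
    for pod in json.loads(raw).get("items", []):
        name = pod["metadata"]["name"]
        phase = pod["status"].get("phase", "Unknown")
        statuses = pod["status"].get("containerStatuses", [])
        # CrashLoopBackOff and friends surface on the container statuses,
        # not on the pod phase, which often still reads "Running".
        waiting = {
            (cs.get("state", {}).get("waiting") or {}).get("reason")
            for cs in statuses
        }
        not_ready = any(not cs.get("ready", False) for cs in statuses)
        if phase == "Succeeded":
            continue  # completed Job pods are fine
        if phase != "Running" or not_ready or waiting & BAD_WAITING_REASONS:
            unhealthy.append(name)
    return unhealthy

print(find_unhealthy_pods("ai-apps"))

Completed pods (phase Succeeded) are treated as healthy here; adjust the heuristics to match what your team considers actionable.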
The Solution: LLM-Powered Diagnostics The k8s-ai-diagnostics project implements an agent that: Scans a namespace for unhealthy podsCollects pod descriptions and logs via kubectlSends context to OpenAI GPT-4 for analysisReceives structured diagnostics, including root cause, reasons, and fixesOptionally applies remediation with human approval Architecture The tool uses a simple pipeline: Shell ┌──────────────────┐ │ kubectl CLI │ │ (pod status, │ │ descriptions, │ │ logs) │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ Python Script │ │ - Detect pods │ │ - Collect data │ │ - Build context │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ OpenAI GPT-4 │ │ - Analyze data │ │ - Root cause │ │ - Suggest fixes │ └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ Remediation │ │ - Human approve │ │ - Apply patches │ │ - kubectl cmds │ └──────────────────┘ The implementation keeps dependencies minimal: Python 3.8+, kubectl, and the OpenAI API. Installation and Setup Prerequisites Shell # Python 3.8 or higher python3 --version # kubectl configured with cluster access kubectl cluster-info # OpenAI API key export OPENAI_API_KEY="your-api-key" Installation Shell # Clone repository git clone https://github.com/opscart/k8s-ai-diagnostics.git cd k8s-ai-diagnostics # Create virtual environment python3 -m venv venv source venv/bin/activate # Install dependencies pip install -r requirements.txt Deploy Test Scenarios Set up local env The repository includes test deployments that simulate common failures: Shell # Create namespace kubectl create namespace ai-apps # Deploy test scenarios sh k8s-manifests/deploy.sh This deploys four intentionally broken pods: broken-nginx: ImagePullBackOff (invalid image name)crashy: CrashLoopBackOff (container exits immediately)oom-test: OOMKilled (exceeds memory limits)unhealthy-probe: Probe failures (missing expected files) Verify deployment: Shell kubectl get pods -n ai-apps Expected output: Shell NAME READY STATUS RESTARTS AGE broken-nginx-5f6cdfb774-m7kw7 0/1 ImagePullBackOff 0 2m crashy-77747bbb47-mr75j 0/1 CrashLoopBackOff 6 2m oom-test-5fd8f6b8d9-c9p52 0/1 OOMKilled 3 2m unhealthy-probe-78d9b76567-5x8h6 0/1 Running 1 2m Running the Diagnostic Agent Execute the agent: Python python3 k8s_ai_agent.py The script prompts for a namespace: Python Enter the namespace to scan: ai-apps Example Diagnostic Session Python Found 4 unhealthy pod(s): ['broken-nginx', 'oom-test', 'crashy', 'unhealthy-probe'] Analyzing pod: crashy... k8s_ai_agent.py execution Plain Text ROOT CAUSE ANALYSIS: Container is exiting immediately with code 1. The application fails to start due to a missing dependency or configuration error. DIAGNOSTIC DETAILS: - Exit Code: 1 (general application error) - Container restart count: 6 - Last termination reason: Error - No application logs available (exits too quickly) SUGGESTED REMEDIATION: 1. Review container entrypoint and command 2. Check for missing environment variables 3. Verify required config files are mounted 4. Add verbose logging to startup process IMMEDIATE ACTION: Restart the pod to clear transient state issues. The agent then prompts: Plain Text Detected CrashLoopBackOff. Suggest restarting the pod. Do you want to apply the above remediation? (yes/no): yes Upon approval, it executes: Shell kubectl delete pod crashy-77747bbb47-mr75j -n ai-apps Result: Plain Text Deployment crashy is now healthy. OOMKilled Remediation Example For memory issues: Plain Text Analyzing pod: oom-test... 
ROOT CAUSE ANALYSIS: Container terminated due to Out-of-Memory (OOM) condition. Memory consumption exceeded the defined limit of 128Mi. SUGGESTED REMEDIATION: Increase memory limit to accommodate actual usage patterns. Based on termination history and typical application requirements, recommend 400Mi limit. PATCH TO APPLY: spec: template: spec: containers: - name: oom-test resources: limits: memory: "400Mi" requests: memory: "200Mi" Detected OOMKilled. Suggest increasing memory limits. Do you want to apply the above remediation? (yes/no): yes The agent patches the deployment: Shell kubectl patch deployment oom-test -n ai-apps --type='json' \ -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"400Mi"}]' ImagePullBackOff Handling For image issues, the agent identifies the problem but skips automatic remediation: Plain Text Analyzing pod: broken-nginx... ROOT CAUSE ANALYSIS: Failed to pull container image "nginx:nonexistent-tag". The specified image tag does not exist in the registry. DIAGNOSTIC DETAILS: - Image: nginx:nonexistent-tag - Error: manifest for nginx:nonexistent-tag not found - Registry: docker.io SUGGESTED REMEDIATION: 1. Verify image tag exists: docker pull nginx:nonexistent-tag 2. Check image name spelling and tag format 3. Ensure registry credentials are configured if using private registry 4. Update deployment with correct image reference ImagePullBackOff detected — likely an image issue. Skipping remediation. This requires manual intervention because the correct image name needs human judgment. How GPT-4 Analysis Works Context Building The agent collects context before sending to GPT-4: Python def collect_pod_context(namespace, pod_name): context = { 'pod_description': run_kubectl(['describe', 'pod', pod_name, '-n', namespace]), 'pod_logs': run_kubectl(['logs', pod_name, '-n', namespace, '--tail=100']), 'previous_logs': run_kubectl(['logs', pod_name, '-n', namespace, '--previous', '--tail=50']), 'pod_events': run_kubectl(['get', 'events', '-n', namespace, '--field-selector', f'involvedObject.name={pod_name}']) } return context Prompt Construction The system prompt guides GPT-4 to provide structured responses: Python system_prompt = """ You are a Kubernetes expert analyzing pod failures. Provide: 1. ROOT CAUSE ANALYSIS: Clear identification of the primary issue 2. DIAGNOSTIC DETAILS: Supporting evidence from events and logs 3. SUGGESTED REMEDIATION: Specific fixes with commands or YAML patches 4. IMMEDIATE ACTION: What to do right now Focus on actionable advice. For resource issues, suggest specific limits. For configuration problems, identify missing or incorrect settings. """ user_prompt = f""" Analyze this Kubernetes pod failure: POD NAME: {pod_name} NAMESPACE: {namespace} STATUS: {pod_status} DESCRIPTION: {pod_description} LOGS: {logs} EVENTS: {events} Provide detailed diagnosis and remediation steps. 
""" GPT-4 Response Parsing The agent extracts structured information from GPT-4's response: Python def parse_diagnosis(response): diagnosis = { 'root_cause': extract_section(response, 'ROOT CAUSE'), 'details': extract_section(response, 'DIAGNOSTIC DETAILS'), 'remediation': extract_section(response, 'SUGGESTED REMEDIATION'), 'immediate_action': extract_section(response, 'IMMEDIATE ACTION') } return diagnosis The tool implements different remediation approaches based on failure type: IssueDiagnosisAutomated ActionRationaleImagePullBackOffImage issueNone (manual)Requires human judgment on correct imageCrashLoopBackOffContainer crashPod restartClears transient state issuesOOMKilledMemory overusePatch memory limitsPrevents future OOM killsProbe failureMisconfigurationNone (manual)Needs application-level fixes Restart Remediation For CrashLoopBackOff: Python def restart_pod(namespace, pod_name): """Delete pod to trigger recreation by deployment""" run_kubectl(['delete', 'pod', pod_name, '-n', namespace]) # Wait for new pod to be ready wait_for_pod_ready(namespace, deployment_name) Memory Patch Remediation For OOMKilled: Python def patch_memory_limit(namespace, deployment_name, new_limit='400Mi'): """Patch deployment to increase memory limit""" patch = { 'spec': { 'template': { 'spec': { 'containers': [{ 'name': get_container_name(namespace, deployment_name), 'resources': { 'limits': {'memory': new_limit}, 'requests': {'memory': str(int(new_limit[:-2]) // 2) + 'Mi'} } }] } } } } run_kubectl(['patch', 'deployment', deployment_name, '-n', namespace, '--type', 'strategic', '-p', json.dumps(patch)]) Important note: The current implementation uses a fixed memory value (400Mi) as a remediation. This is a limitation — the actual memory size should be determined from historical usage patterns captured through metrics systems like Prometheus. Without access to actual resource consumption data, the tool cannot make informed decisions about appropriate memory limits. This is why Prometheus integration (discussed in the Future Work section) is necessary for production use. Human Approval Gate All remediation actions require explicit approval: Python def apply_remediation(diagnosis, namespace, resource): print(f"\n{diagnosis['immediate_action']}") print(f"\nSUGGESTED FIX:\n{diagnosis['remediation']}") response = input("\nDo you want to apply the above remediation? (yes/no): ") if response.lower() == 'yes': execute_remediation(namespace, resource, diagnosis) else: print("Skipping remediation. Manual intervention required.") Real-World Example: Complete Workflow Here's a complete diagnostic session for an OOMKilled pod: Python $ python3 k8s_ai_agent.py Enter the namespace to scan: production Found 1 unhealthy pod(s): ['api-server-7d5f8b-xk2p9'] Analyzing pod: api-server-7d5f8b-xk2p9... === GPT-4 ANALYSIS === ROOT CAUSE ANALYSIS: The api-server container was terminated by the OOM killer. Memory usage exceeded the configured limit of 256Mi. Analysis of previous terminations shows this is a recurring pattern, indicating the application's memory footprint has grown beyond the initial resource allocation. 
DIAGNOSTIC DETAILS: - Container terminated with reason: OOMKilled - Memory limit: 256Mi - Restart count: 8 (all OOM-related) - Last termination: 2 minutes ago - Pattern: Crashes occur approximately 15-20 minutes after startup - Application logs show increasing heap usage before termination MEMORY ANALYSIS: Based on the termination pattern and logs: - Initial memory usage: ~80Mi - Pre-crash memory usage: ~250Mi (approaching limit) - Growth pattern suggests memory leak or legitimate increased load - Recommend 512Mi limit to provide headroom SUGGESTED REMEDIATION: 1. Immediate: Increase memory limit to 512Mi 2. Set request to 384Mi (75% of limit) 3. Monitor actual usage with: kubectl top pod api-server-7d5f8b-xk2p9 -n production 4. Investigate memory leak if usage continues growing 5. Consider implementing horizontal pod autoscaling YAML PATCH: spec: template: spec: containers: - name: api-server resources: limits: memory: "512Mi" requests: memory: "384Mi" IMMEDIATE ACTION: Patch deployment with increased memory limits ====================== Detected OOMKilled. Suggest increasing memory limits. REMEDIATION PLAN: - Patch deployment 'api-server' memory limit to 512Mi - Set memory request to 384Mi - Trigger rolling restart Do you want to apply the above remediation? (yes/no): yes Applying remediation... deployment.apps/api-server patched Waiting for new pod to be ready... Pod api-server-7d5f8b-n8x4m is now Running Deployment api-server is now healthy. NEXT STEPS: 1. Monitor pod memory usage: kubectl top pod -n production 2. Check application metrics for memory leak indicators 3. Review application logs for memory-related warnings 4. Consider adding memory usage alerts What Works Well Rapid Triage The tool scans an entire namespace and identifies all unhealthy pods in seconds, providing immediate visibility into cluster health. Context Analysis GPT-4 analyzes pod descriptions, logs, and events together, correlating information that might require multiple mental steps for a human operator. Knowledge Application The model applies Kubernetes best practices and common troubleshooting patterns without requiring the operator to remember specific commands or solutions. Executable Output Rather than just identifying problems, the tool provides specific kubectl commands, YAML patches, and remediation scripts ready to execute. Safe Automation The human approval gate ensures operators review recommendations before applying changes, preventing automated mistakes in production environments. Current Limitations Single LLM Provider The POC only supports OpenAI GPT-4. Adding support for Anthropic Claude, local models via Ollama, or Azure OpenAI would improve flexibility and reduce vendor lock-in. Simple Remediation Logic Current automated fixes are limited: Pod restarts for CrashLoopBackOffMemory limit patches for OOMKilledNo automated fixes for ImagePullBackOff or probe failures More work would require: Image name validation and correctionProbe configuration analysis and fixesNetwork policy adjustmentsRBAC issue resolution Single-Container Assumption The memory patching logic assumes deployments have a single container. Multi-container pods require more analysis to determine which container needs resource adjustments. No Historical Context The agent analyzes each pod independently without considering: Previous diagnostic sessionsRemediation success/failure patternsCluster-wide trendsRelated failures in other namespaces Limited Observability Integration The tool relies solely on kubectl output. 
Integration with monitoring systems would provide historical resource usage trends, performance metrics before failures, application-specific telemetry, and distributed tracing context. CLI-Only Interface The current implementation is command-line interactive. Production use would benefit from a web dashboard for visualization, API endpoints for integration, Slack/Teams notifications, and incident management system integration. Cost Considerations Each diagnostic session calls the OpenAI API. For large clusters with many unhealthy pods, costs can accumulate. Implementing caching, local models, or rate limiting would help manage expenses. Security Concerns Sending pod logs to external APIs (OpenAI) raises data security issues: logs may contain sensitive information; API keys, tokens, or credentials might leak; and compliance requirements may prohibit external data transmission. Production deployments need log sanitization to remove sensitive data, local LLM options for sensitive environments, and audit trails of what data was sent externally. Future Work Multi-Provider LLM Support Add support for alternative models: Python class LLMProvider: def __init__(self, provider='openai', model='gpt-4'): self.provider = provider self.model = model def analyze(self, context): if self.provider == 'openai': return self._openai_analyze(context) elif self.provider == 'anthropic': return self._claude_analyze(context) elif self.provider == 'ollama': return self._ollama_analyze(context) Prometheus Integration Incorporate time-series metrics: Python def enhance_context_with_metrics(namespace, pod_name): metrics = { 'cpu_usage': query_prometheus( f'rate(container_cpu_usage_seconds_total{{pod="{pod_name}"}}[5m])' ), 'memory_usage': query_prometheus( f'container_memory_working_set_bytes{{pod="{pod_name}"}}' ), 'restart_history': query_prometheus( f'kube_pod_container_status_restarts_total{{pod="{pod_name}"}}' ) } return metrics This integration would solve the current limitation where OOMKilled remediation uses fixed memory values (400Mi). With Prometheus data, the tool could analyze actual memory usage patterns over time and recommend appropriate limits based on real consumption trends rather than arbitrary values.
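One gap in the sketch above: query_prometheus is referenced but never defined. A minimal version could call the Prometheus HTTP API directly. The helper below is an assumption rather than existing project code, and the endpoint URL is a placeholder you would point at your own Prometheus service.

Python
import requests  # assumed dependency, not in the project's requirements

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # placeholder endpoint

def query_prometheus(promql):
    """Hypothetical helper: run an instant PromQL query and return the result list."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    # Each entry pairs a metric's labels with its latest sample value.
    return payload["data"]["result"]

With real samples in hand, the fixed 400Mi default could give way to a limit derived from observed peak working-set memory plus headroom.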
Feedback Loop Track remediation success to improve future diagnostics: Python class RemediationTracker: def record_outcome(self, pod_name, diagnosis, action, success): """Track which fixes worked""" outcome = { 'pod': pod_name, 'issue_type': diagnosis['type'], 'action_taken': action, 'successful': success, 'timestamp': datetime.now() } self.store_outcome(outcome) def get_success_rate(self, issue_type): """Calculate success rate for specific issue types""" outcomes = self.query_outcomes(issue_type) return sum(o['successful'] for o in outcomes) / len(outcomes) Expanded Remediation Expand automated fixes: Python class AdvancedRemediation: def fix_image_pull_error(self, namespace, pod_name, diagnosis): """Attempt to fix common image pull issues""" # Check if image exists with 'latest' tag # Verify imagePullSecrets are configured # Test registry connectivity # Suggest alternative image sources pass def fix_probe_failure(self, namespace, deployment, diagnosis): """Adjust probe configuration based on actual app behavior""" # Analyze startup time from logs # Recommend appropriate initialDelaySeconds # Suggest probe endpoint alternatives pass Web Dashboard Build a visualization layer: Python // React component for real-time diagnostics function DiagnosticsDashboard() { const [pods, setPods] = useState([]); const [analyses, setAnalyses] = useState({}); useEffect(() => { // Poll for unhealthy pods fetchUnhealthyPods().then(setPods); }, []); return ( <div> <PodList pods={pods} onAnalyze={runDiagnostics} /> <AnalysisPanel analyses={analyses} /> <RemediationQueue onApprove={applyFix} /> </div> ); } Incident Management Integration Connect to existing workflows: Python def create_incident_with_diagnosis(pod_name, diagnosis): """Create PagerDuty incident with analysis""" incident = { 'title': f'Pod Failure: {pod_name}', 'description': diagnosis['root_cause'], 'urgency': determine_urgency(diagnosis), 'body': { 'type': 'incident_body', 'details': format_diagnosis_for_incident(diagnosis) } } pagerduty_client.create_incident(incident) Getting Started Quick Start Shell # Clone and setup git clone https://github.com/opscart/k8s-ai-diagnostics.git cd k8s-ai-diagnostics python3 -m venv venv source venv/bin/activate pip install -r requirements.txt # Set OpenAI API key export OPENAI_API_KEY="your-key" # Deploy test scenarios kubectl create namespace ai-apps sh k8s-manifests/deploy.sh # Run diagnostics python3 k8s_ai_agent.py # Enter namespace: ai-apps Production Considerations Before using in production: Test in non-production environments – Verify remediation logic doesn't cause unintended consequencesImplement log sanitization – Remove sensitive data before sending to OpenAISet up monitoring – Track diagnostic success rates and API costsConfigure rate limiting – Prevent API quota exhaustionDocument approval workflows – Define who can approve which types of remediationEstablish rollback procedures – Know how to revert automated changes Conclusion The k8s-ai-diagnostics project demonstrates that LLMs can automate routine Kubernetes troubleshooting tasks. By combining kubectl's data collection capabilities with GPT-4's analytical reasoning, the tool provides diagnostic insights that previously required experienced SRE intervention. The POC shows particular strength in handling common failure patterns like CrashLoopBackOff and OOMKilled scenarios, where automated remediation can reduce MTTR. The human approval gate maintains safety while allowing operators to move quickly when confident in the recommendations. 
However, the current implementation has clear limitations. Production readiness requires addressing security concerns around data transmission, expanding remediation capabilities beyond simple cases, and integrating with existing observability and incident management infrastructure. The OOMKilled remediation, for example, currently uses fixed memory values rather than analyzing actual usage patterns — a gap that Prometheus integration would fill. For teams experiencing high volumes of routine pod failures, this approach offers a way to reduce operational toil. The tool handles repetitive diagnostic work, letting engineers focus on complex issues that require human judgment and problem-solving. As observability integration improves and remediation logic matures, LLM-augmented troubleshooting will become more viable for production environments. Additional Resources GitHub repository: opscart/k8s-ai-diagnosticsKubernetes troubleshooting: kubernetes.io/docs/tasks/debugOpenAI API documentation: platform.openai.com/docskubectl reference: kubernetes.io/docs/reference/kubectl

By Shamsher Khan
Building Realistic Test Data in Java: A Hands-On Guide for Developers

There’s something that every backend or API developer faces sooner or later: the need for good fake data. Whether you’re testing a new API, populating a database for demos, or simply trying to make your unit tests less “boring”, fake data is part of your daily routine. The problem? Most fake data feels… fake. You end up with “John Doe” and “123 Main Street” repeated over and over, which doesn’t look great when showing a prototype to your team or client. So today, let’s fix that. In this article, we’ll explore two powerful Java libraries that make generating fake yet realistic data a breeze: DataFaker and EasyRandom. We’ll go beyond just generating names and emails — we’ll learn how to integrate both libraries inside a Spring Boot 3 project, how to combine their strengths, and how to make everything available through a REST API that returns test data. This isn’t a theoretical overview. We’ll look at real code, and you’ll walk away knowing exactly how to reproduce it in your next project. Why Bother Generating Fake Data? Let’s face it: manually crafting test data is time-consuming and error-prone. Imagine you’re developing a system for managing users. You need to test pagination, filtering, sorting, and edge cases (like missing emails or very long names). Instead of hand-writing 100 lines of sample JSON, wouldn’t it be nicer to generate it automatically and instantly? Good fake data helps you: Validate your logic in a more realistic scenarioShowcase prototypes with data that “looks real”Stress test APIs or UI components with variable inputsAutomate unit tests without boilerplate “mock builders” So instead of hardcoding “Alice” and “Bob,” we’ll let DataFaker and EasyRandom do the heavy lifting. DataFaker: The Modern, Improved JavaFaker If you’ve used JavaFaker in the past, DataFaker is its modern, actively maintained successor. It’s built for recent Java versions (Java 17+), is fast, and offers hundreds of data categories — including names, addresses, finance, company information, internet data, crypto keys, and even Star Wars characters if you feel nostalgic. Let’s see a quick example: Java import net.datafaker.Faker; Faker faker = new Faker(); System.out.println(faker.name().fullName()); System.out.println(faker.internet().emailAddress()); System.out.println(faker.address().fullAddress()); Run that, and you’ll get something like: Plain Text Matilde Marques [email protected] Rua do Carmo 45, 1200-093 Lisboa Pretty cool, right? And it even looks localized if you change the locale. Java Faker faker = new Faker(new Locale("pt")); Now your data fits your language and region — an enjoyable touch for international testing. EasyRandom: Because We Need More Than Fields While DataFaker focuses on realistic field-level data, EasyRandom (formerly Random Beans) takes a different approach. It’s great when you have complex Java objects — like entities or DTOs — and you want them automatically filled with random but valid values. Think of EasyRandom as a smart “object generator” that knows how to populate your classes, including nested objects, lists, and maps. Example: Java import org.jeasy.random.EasyRandom; EasyRandom easyRandom = new EasyRandom(); Person randomPerson = easyRandom.nextObject(Person.class); This will create a fully populated Person instance, with random strings, numbers, and even nested attributes. So, where DataFaker gives realism (e.g., “John Smith, [email protected]”), EasyRandom gives structure and automation (e.g., filling an entire POJO graph). And the best part? 
You can combine both — letting EasyRandom create your object and then using DataFaker to polish specific fields with more believable data. Combining DataFaker and EasyRandom: The Sweet Spot Here’s where things get fun. We’ll create a small Spring Boot REST API that exposes endpoints to generate fake users. Each user will have an id, fullName, email, phone, and address. We’ll use DataFaker for realism and EasyRandom for automation. Our project structure looks like this: Plain Text src/ ├─ main/java/com/example/fakedata/ │ ├─ Application.java │ ├─ config/ │ ├─ api/ │ ├─ controller/ │ ├─ domain/ │ ├─ dto/ │ ├─ service/ │ └─ mapper/ └─ resources/ └─ static/index.html The User Domain Class We’ll keep it simple, using Lombok to avoid boilerplate: Java @Data @Builder public class User { private String id; private String fullName; private String email; private String phone; private String address; } And for the API responses, we’ll use a Java record for immutability and readability: Java public record UserDto(String id, String fullName, String email, String phone, String address) { } The Service: Combining Both Libraries Here’s the core of our project: Java @Service public class DataGenService { private final Faker faker = new Faker(Locale.ENGLISH); private final EasyRandom easyRandom; public DataGenService() { EasyRandomParameters params = new EasyRandomParameters() .seed(System.currentTimeMillis()) .stringLengthRange(5, 20); this.easyRandom = new EasyRandom(params); } public User randomUserViaDatafaker() { return User.builder() .id(UUID.randomUUID().toString()) .fullName(faker.name().fullName()) .email(faker.internet().emailAddress()) .phone(faker.phoneNumber().cellPhone()) .address(faker.address().fullAddress()) .build(); } public User randomUserViaEasyRandom() { User u = easyRandom.nextObject(User.class); if (u.getId() == null || u.getId().isBlank()) { u.setId(UUID.randomUUID().toString()); } u.setFullName(faker.name().fullName()); u.setEmail(faker.internet().emailAddress()); return u; } public List<User> manyUsers(int count, boolean easyRandomMode) { return IntStream.range(0, count) .mapToObj(i -> easyRandomMode ? randomUserViaEasyRandom() : randomUserViaDatafaker()) .collect(Collectors.toList()); } } You can see how we use DataFaker for realism and EasyRandom for structure — like a two-chef recipe: one creates the base, the other adds seasoning. The REST Controller Now, let’s make it accessible through a REST API. Java @RestController @RequestMapping("/api/users") public class UserController { private final DataGenService service; public UserController(DataGenService service) { this.service = service; } @GetMapping("/{count}") public ApiResponse<List<UserDto>> generateUsers(@PathVariable int count, @RequestParam(defaultValue = "false") boolean easy) { List<UserDto> users = service.manyUsers(count, easy) .stream().map(UserMapper::toDto) .collect(Collectors.toList()); return ApiResponse.of(users); } } And to make our API responses consistent, we wrap everything in an envelope with a timestamp: Java public record ApiResponse<T>(T data, Instant timestamp) { public static <T> ApiResponse<T> of(T data) { return new ApiResponse<>(data, Instant.now()); } } That way, every API call returns data like this: JSON { "data": [ { "id": "e7b1c37a-8b20-43c1-8ff3-b4aef8d89c3a", "fullName": "Lina Cordeiro", "email": "[email protected]", "phone": "+351 912 345 678", "address": "Rua do Comércio 12, Porto" } ], "timestamp": "2025-10-06T13:02:45.321Z" } Much cleaner and easier to debug. Why Timestamp in Responses? 
Adding timestamps isn’t just for looks. It’s a simple, useful practice that improves observability. When debugging requests in distributed systems or when clients log responses, having the server timestamp right in the payload helps you correlate events — it’s a micro detail with macro benefits. Why Both Libraries Are Better Together You might wonder: “Why not just use DataFaker alone?” Good question. DataFaker is unbeatable for producing realistic values, but it doesn’t automatically populate deep object structures.EasyRandom, on the other hand, is great for object graphs, but its randomness feels too synthetic — you’ll end up with “[email protected].” Together, they give you: Realism + AutomationEase of integration with tests and APIsConsistency through configuration and seeds It’s a bit like combining a random word generator with a translator — one provides variation, the other makes sense of it. Going Further: Postman, Docker, and CI/CD Our complete project also includes: A Postman collection for quick testingA Dockerfile and docker-compose.yml for containerizationGitHub Actions CI and Dependabot setup for automated builds and dependency updates That makes this small demo a production-grade reference project for testing and learning. If you’re mentoring junior developers or building internal utilities, this is a great example to show clean architecture and reproducible data generation. Repo: github.com/wallaceespindola/fake-data-springboot Practical Ideas for Using This Setup Load testing: Generate thousands of fake users to populate a database.UI prototyping: Feed your frontend with realistic API data.Demo environments: Seed a sandbox with dynamic sample users.Unit tests: Replace new User("a","b") with a call to DataGenService.randomUserViaDatafaker().Data anonymization: Quickly replace sensitive production data with fake equivalents. Each of these is a real-world scenario where this combination shines. Closing Thoughts The difference between a “meh” test dataset and a “wow, this looks real!” demo often comes down to how you generate data. With DataFaker and EasyRandom, you can automate that process elegantly — using modern Java, minimal boilerplate, and libraries that just make sense together. You’ll not only save hours when building tests or mock APIs but also deliver demos that feel alive, diverse, and realistic. The best part? It’s all open-source, lightweight, and easy to integrate with Spring Boot, Quarkus, Micronaut, or even a plain Java console app. So next time you need to populate an API or test your system’s resilience, don’t settle for "John Doe" anymore. Give your fake data some personality — and let Java do the heavy lifting. Need more tech insights? Check out my GitHub repo and LinkedIn page. Happy coding!

By Wallace Espindola
