DZone Spotlight

Monday, May 12
The Human Side of Logs: What Unstructured Data Is Trying to Tell You

By Alvin Lee
It’s Friday afternoon, and your dashboards look great. Charts are green. CPU usage is stable. Database query times are within your SLA. You’re feeling great and ready for the weekend. But a significant issue is slipping past all of your metrics — and it’s about to ruin your weekend. You don’t know about the problem yet because there’s a disparity between your metrics and the actual user experience. Your dashboards might look great, but your users are telling a different story. It happens to even the best of us. On April 17, 2025, for example, Walmart's website experienced a significant outage. Users were unable to add items to their carts or access certain pages. Revenue was lost, user complaints surged on X and Downdetector, and undoubtedly someone was woken up from a deep sleep to log on and help get it fixed.

Critical signals don’t always show up in CPU graphs or 5xx error charts. They surface in chat threads, Downdetector complaints, and even failed login attempt logs. The first signals often come from people, not probes. Why? Because traditional monitoring tools focus on structured data, such as CPU usage, memory consumption, database usage, and network throughput. And while these metrics are essential, they can miss the nuances of user behavior and experience. But unstructured data, such as error messages, user feedback, and logs, can tell a more thorough story. It can provide critical insights into system issues that structured data can overlook and that you otherwise wouldn’t know about — until it’s too late. The signs are there — if you're listening.

In this article, I’ll explore why unstructured data matters, where you should look for unstructured data, what signals to watch for, and how observability platforms can help you tap into unstructured data without drowning you in noise.

Why Unstructured Data Matters

Structured data is the data that comes in the formats you expect — rows, columns, numbers, stats — and tells you all about what logically happened. It’s the duration of an API call, the response status code, or the CPU load on a node. Unstructured data, on the other hand, is the messy data. And it’s everywhere. It can be found in support tickets, bug reports, chat threads, error messages, and offhand complaints. It’s the data that often arrives in natural language, not numeric values. It’s messy, not always clear in meaning, and as the name suggests, it’s unstructured.

But unstructured data is critical. It tells you where confusion lives, where intent breaks down, and where the user’s mental model clashes with what the software does. It captures the emotions, intentions, and frustrations that users feel when systems misbehave. And when ingested, interpreted, and aggregated, important patterns can emerge. For example, unstructured data can start to paint a picture if your app is seeing:

• A surge in password reset attempts
• Rage clicks after a UI release
• A sudden drop in engagement
• Support tickets clustered around a broken journey

Sometimes your best clues aren’t in the metrics — they’re in this unstructured data. Structured observability gives you the dashboard. Unstructured data gives you the story. And if you're not reading both, you're missing half the plot.

What Signals Should You Watch For?

So, where should you look for unstructured data, and what should you watch for? There are many sources. Here are a few to start:

• Pay attention to session logs that show users repeatedly attempting the same action. That’s not just a retry — it’s friction.
• Watch for freeform error messages that never get piped into your dashboard. That’s often where the real context behind a failure lives.
• Don’t ignore the chatter on Slack, Jira, or even social media. When three engineers complain about the same “sluggish page,” chances are there’s a performance regression that your latency graph has smoothed over.
• Even vague user feedback can be invaluable. A spike in “can’t log in” support tickets may be attributed to session expiry handling, rather than infrastructure failure. You’ll only catch it if you’re collecting and analyzing the whole narrative — system logs, yes, but also what people say and do when something doesn’t work the way they expect.
• Watch for security anomalies. Failed login attempts, credential stuffing, token mismatches — these may not trigger alarms if thresholds aren’t breached, but patterns buried in raw logs can signal a threat weeks before your SIEM lights up.

How Observability Platforms Can Wrangle Unstructured Data

One of the biggest misconceptions about unstructured data is that it has to be cleaned, labeled, and modeled before it’s useful. Yes, that was true in the past. Teams often spent hours writing regular expressions or building brittle parsers just to extract a few fields from a messy log line. But that’s no longer the case. Modern observability platforms are designed to ingest unstructured data at scale without requiring perfect formatting or predefined schemas. You can pump in raw error messages, user reviews, support tickets, and Slack threads — and the platform handles the rest. Machine learning, natural language processing, and pattern recognition do the bulk of the work. That means you don’t need a data wrangler to find value. A modern observability platform can:

• Automatically surface spikes in login failures by IP address block or geographic location.
• Cluster similar feedback into themes using sentiment analysis, even if the wording varies.
• Correlate failed transactions to specific deployments, even when the logs don’t follow strict naming conventions.

To be effective, you need an observability platform that can ingest both structured and unstructured data. One that provides a comprehensive view of system health. One that, by analyzing unstructured data, helps you identify and address issues proactively — before your weekend is ruined. For example, here’s a screenshot from Sumo Logic showing a parse statement from a section of unstructured log data, and how it helps you make sense of the data. With a modern platform like this, you can use a “schema-on-read” approach: you store the data as is, then analyze it when needed. And if you can get friendly pricing, you won’t have to worry about the amount of ingest; you can ingest everything — system logs, application traces, and behavioral data — and explore it later without gaps.

Example Use Cases

For example, let’s say you work at an e-commerce company that experiences a sudden surge in negative comments on social media and customer reviews, all of which mention difficulties with the checkout process. Traditional monitoring tools, focused on structured data like transaction success rates, show no anomalies. But by employing sentiment analysis on unstructured data sources, you identify a pattern: customers are frustrated with a recent update to the checkout interface. This insight means you can promptly address the issue, improve customer satisfaction, and stop any potential revenue loss.
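To make that theme-detection idea concrete, here is a deliberately naive Java sketch that buckets raw feedback text into themes by keyword. The ticket strings and theme names are invented for illustration, and a real platform would rely on NLP and sentiment models rather than hard-coded keywords, but even this toy version shows how a spike in one bucket (say, "checkout") becomes a signal you can act on.

Java

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeedbackThemes {
    public static void main(String[] args) {
        // Invented examples of unstructured feedback (tickets, reviews, chat messages).
        List<String> feedback = List.of(
            "Checkout button does nothing after the latest update",
            "Can't log in from the mobile app",
            "Cart empties itself when I try to pay",
            "Login keeps asking me to reset my password",
            "Payment page spins forever");

        // Hard-coded themes stand in for what a platform would learn via clustering.
        Map<String, List<String>> themes = new LinkedHashMap<>();
        themes.put("checkout", new ArrayList<>());
        themes.put("login", new ArrayList<>());
        themes.put("other", new ArrayList<>());

        for (String text : feedback) {
            String lower = text.toLowerCase();
            if (lower.contains("checkout") || lower.contains("cart") || lower.contains("pay")) {
                themes.get("checkout").add(text);
            } else if (lower.contains("log in") || lower.contains("login") || lower.contains("password")) {
                themes.get("login").add(text);
            } else {
                themes.get("other").add(text);
            }
        }

        // A sudden spike in one bucket is the kind of behavioral signal described above.
        themes.forEach((theme, items) -> System.out.println(theme + ": " + items.size() + " " + items));
    }
}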
When observability platforms process behavioral signals at scale, the value isn’t just technical — it’s operational and financial.

• E-commerce teams can identify and resolve friction in the checkout flow before it tanks conversion rates.
• SaaS platforms can correlate a rise in support volume with a regression in a recent release and take action before churn increases.
• SRE and platform teams can detect misconfigurations or silent failures earlier, which means reduced incident duration and lower downtime costs.

This kind of pattern recognition turns what used to be support overhead into strategic insight.

Conclusion

You’re already tracking the numbers — latency, error rates, CPU usage. But that’s only half the story. The other half lives in the messy commotion: Slack rants, support tickets, log messages, and user reviews that don’t fit into a tidy schema. Unstructured data is where behavioral signals live. It can show you why things are breaking, not just what is broken. It captures confusion, intent, and frustration long before structured telemetry raises a red flag. If you're responsible for user experience, reliability, or security, you can’t afford to ignore what people are saying about your product — or how they're interacting with it. The tools exist to make unstructured data useful. Now it’s a matter of putting them to work. The human side of logs is talking. Start listening.
How to Convert XLS to XLSX in Java

By Brian O'Neill
Why Upgrading XLS to XLSX Is Worth Your Time

Ask any seasoned Java developer who's worked with Excel files long enough, and you’ll probably hear a similar refrain: the old XLS Excel format is clunky and annoying. It’s been around since the late ’80s, and while it’s still supported in a lot of systems, it's not doing us many favors today. It was, after all, replaced with XLSX for a reason. Unfortunately, there’s still a lot of important data packed in those old binary XLS containers, and some developers are tasked with making clean conversions to XLSX to improve the usability (and security) of that data for the long run.

In this article, we're taking a close look at why converting legacy XLS files to the newer XLSX format is important. We'll dig into what changes under the hood during that conversion, why XLSX is clearly much better suited to modern workflows, and what your best options are to efficiently build out programmatic XLS to XLSX conversions in a Java application.

Why People Still Use XLS — And Why It’s a Headache

The older XLS format uses a binary file structure. That alone sets it apart from almost every major document format we use today, which tend to favor XML or JSON-based standards. If you've stumbled across an old finance export or been forced to inherit reporting logic from 2006, there's a good chance you've dealt with binary Excel data, and you probably haven’t been excited to do that again.

The XLS format has some hard limits baked in that don't make a ton of sense today. It’s capped at 65,536 rows and 256 columns per sheet (a staggering 983,040 rows and 16,128 columns short of modern XLSX capabilities), and finding clean interoperability for XLS with newer APIs or cloud services can be hit-or-miss. Even more frustratingly, because XLS is a binary container, you can't easily crack it open to see what's wrong internally. You’re stuck relying on some library to parse XLS contents correctly — and good luck if you hit something nonstandard during that process. In comparison, looking for errors in the Open XML file structure that XLSX (and all modern MS Office) files use is like searching for typos in a kid’s book.

And then there's the tooling: popular open-source Java conversion libraries like Apache POI can handle XLS files, but that option requires a different code path, different classes, and generally more brittle behavior compared to working with XLSX and other Open XML files. We’ll cover this particular challenge in some more detail later on in this article.

What’s So Great About XLSX Anyway?

Modern Excel’s XLSX format is part of Microsoft’s Office Open XML standard. That means it’s just a ZIP archive full of plain, neatly organized XML files at its most basic level. XML is both human-readable and machine-friendly: the best of both worlds. Instead of being one giant binary blob, each part of the XLSX spreadsheet — the worksheets, the shared string table, the style definitions — is broken out neatly into a series of structured XML documents.
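You can see this for yourself by opening an .xlsx file as a ZIP archive and listing its parts. Here is a minimal Java sketch of that idea; the file name is a placeholder, and the entry names in the comment are simply the ones you would typically expect to find.

Java

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListXlsxParts {
    public static void main(String[] args) throws Exception {
        // "workbook.xlsx" is a hypothetical path; any valid .xlsx file will do.
        try (ZipFile zip = new ZipFile("workbook.xlsx")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                // Typical entries: [Content_Types].xml, xl/workbook.xml,
                // xl/worksheets/sheet1.xml, xl/sharedStrings.xml, xl/styles.xml, ...
                System.out.println(entries.nextElement().getName());
            }
        }
    }
}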
For example, if we built a simple spreadsheet containing a small table of text cells, we would find that data represented in the worksheet XML file like so:

XML

<cols>
  <col min="2" max="2" width="16.81640625" customWidth="1"/>
  <col min="3" max="3" width="25.81640625" customWidth="1"/>
</cols>
<sheetData>
  <row r="2" spans="1:3" x14ac:dyDescent="0.35">
    <c r="A2" s="1"/>
    <c r="B2" s="2" t="s">
      <v>0</v>
    </c>
    <c r="C2" s="2" t="s">
      <v>1</v>
    </c>
  </row>
  <row r="3" spans="1:3" x14ac:dyDescent="0.35">
    <c r="B3" t="s">
      <v>2</v>
    </c>
    <c r="C3" t="s">
      <v>2</v>
    </c>
  </row>
  <row r="4" spans="1:3" x14ac:dyDescent="0.35">
    <c r="B4" t="s">
      <v>2</v>
    </c>
    <c r="C4" t="s">
      <v>2</v>
    </c>
  </row>
</sheetData>

The column display settings are defined in the <cols> tag, and the actual cell data (which carries shared string references in this case) is represented in the <sheetData> tag. You don't need an advanced degree in any computational field to figure out what's going on here, and that's a great thing.

This structure really does matter. It makes debugging easier, version control more sensible, the format extensible, and the entire file more future-proof. And, of course, it plays far, far more nicely with open-source tools, cloud APIs, and Java libraries, which typically prefer a diet of well-defined portable formats. So, if you’re working on anything that involves transforming spreadsheet data, exposing it via APIs, or piping it through cloud platforms, XLSX is by far the safer and more scalable choice. There are numerous reasons why binary containers were abandoned in favor of compressed XML to begin with, and everything we’ve talked about here is a contributing factor.

What Actually Happens During the XLS to XLSX Upgrade

Upgrading XLS to XLSX programmatically is a bit more complex than the simple “Save As” operation Excel lets you do manually within the desktop application. Under the hood, binary-to-compressed-XML conversion involves some heavy lifting. The old binary workbook must be unpacked and rewritten entirely into an XML-based structure. That means all the cells, rows, and sheets get redefined as the appropriate set of XML elements. Each style, font, and border from the XLS binary container gets converted into an XML equivalent, too, and the formulas get reserialized.

If XLS files carry legacy macros or embedded objects (which, by the way, you should never implicitly trust the security of in ANY spreadsheet handler), the story gets a little messier. Old macros and objects don’t always translate cleanly into modern Excel, and you can easily lose fidelity depending on the conversion library you’re using. XLSX also doesn’t support macros directly the way XLS does; macros will either be cleansed from the XLSX file automatically, or the Excel application will suggest redefining the file as an XLSM (macro-enabled XLSX) document. Thankfully, though, the vast majority of XLS spreadsheets store little more than tabular data, basic formatting, and simple formulas. The conversion for those files to XLSX tends to be much smoother.

Open-Source Libraries That Get the Job Done

Apache POI is still the best open-source default for Excel work in Java, and it supports both XLS and XLSX. That said, there’s a significant catch: you’re working with two separate APIs to handle XLS and XLSX documents.
For XLS files, you’ll be using the HSSF API (which literally stands for “Horrible Spreadsheet Format”), and for XLSX, you’ll be using the XSSF API (which simply stands for “XML Spreadsheet Format”). In practice, building a conversion workflow via Apache POI means you’ll need to 1) load the XLS file with HSSFWorkbook, 2) build a new XSSFWorkbook, and 3) (tragically) manually copy each sheet, row, and cell from one to the other. It’s certainly doable — but it’s also extremely tedious. In this case, POI unfortunately doesn’t give you the magic method you probably want for file format conversions. You’ll need to write that translation logic yourself (a rough sketch of that copy loop appears at the end of this article). Still, if you're already using POI in your project for another purpose, or if you just want maximum control over the workbook structure, it's a solid option. Just don’t expect it to be elegant.

Handling XLS to XLSX With a Third-Party Web API

A simpler option for handling XLS to XLSX conversions involves using a fully realized web API solution. This abstracts the complexity away from your environment. The option we'll demonstrate here isn’t open source, and it does require an API key, but it’ll plug straight into your Java project, and it’ll use very minimal code compared to patchwork open-source solutions. Below, we’ll walk through code examples you can use to structure your API call for XLS to XLSX conversions.

If we're working with Maven, we'll first add the following repository reference to our pom.xml:

XML

<repositories>
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>

And we'll then add the dependency to our pom.xml:

XML

<dependencies>
  <dependency>
    <groupId>com.github.Cloudmersive</groupId>
    <artifactId>Cloudmersive.APIClient.Java</artifactId>
    <version>v4.25</version>
  </dependency>
</dependencies>

If we're working with Gradle, we'll need to add it in our root build.gradle (at the end of repositories):

Groovy

allprojects {
  repositories {
    ...
    maven { url 'https://jitpack.io' }
  }
}

And then add the dependency in build.gradle:

Groovy

dependencies {
  implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}

After we've installed the SDK, we'll place the import classes at the top of our file (commented out for now):

Java

// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ConvertDocumentApi;

Finally, we'll configure the API client, set our API key in the authorization snippet, and make our XLS to XLSX conversion:

Java

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    byte[] result = apiInstance.convertDocumentXlsToXlsx(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentXlsToXlsx");
    e.printStackTrace();
}

We'll get our XLSX file content as a byte array (byte[] result), and we can write that content to a new file with the .xlsx extension. This simplifies automated XLS to XLSX conversion workflows considerably.

Conclusion

In this article, we learned about the differences between the XLS and XLSX formats and discussed the reasons why XLSX is clearly the superior modern format. We suggested a popular open-source library as one option for building automated XLS to XLSX conversion logic in Java and a fully realized web API solution to abstract the entire process away from our environment.
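As promised above, here is a rough, values-only sketch of the manual Apache POI copy loop (HSSFWorkbook in, XSSFWorkbook out). It deliberately ignores styles, merged regions, macros, and other fidelity concerns, assumes a recent POI version where getCellType() returns the CellType enum, and uses placeholder file paths; treat it as a starting point rather than a finished converter.

Java

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsToXlsxSketch {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("legacy.xls");          // hypothetical input path
             HSSFWorkbook oldWb = new HSSFWorkbook(in);
             XSSFWorkbook newWb = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("converted.xlsx")) { // hypothetical output path

            for (int i = 0; i < oldWb.getNumberOfSheets(); i++) {
                Sheet oldSheet = oldWb.getSheetAt(i);
                Sheet newSheet = newWb.createSheet(oldSheet.getSheetName());

                for (Row oldRow : oldSheet) {
                    Row newRow = newSheet.createRow(oldRow.getRowNum());
                    for (Cell oldCell : oldRow) {
                        Cell newCell = newRow.createCell(oldCell.getColumnIndex());
                        // Copy cell values only; styles and formats are not carried over.
                        switch (oldCell.getCellType()) {
                            case STRING:  newCell.setCellValue(oldCell.getStringCellValue());  break;
                            case NUMERIC: newCell.setCellValue(oldCell.getNumericCellValue()); break;
                            case BOOLEAN: newCell.setCellValue(oldCell.getBooleanCellValue()); break;
                            case FORMULA: newCell.setCellFormula(oldCell.getCellFormula());    break;
                            default:      /* BLANK, ERROR, etc. left empty */                  break;
                        }
                    }
                }
            }
            newWb.write(out);
        }
    }
}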

Trend Report

Generative AI

AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide. Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more. We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.

Refcard #158

Machine Learning Patterns and Anti-Patterns

By Tuhin Chattopadhyay

Refcard #269

Getting Started With Data Quality

By Miguel Garcia

More Articles

*You* Can Shape Trend Reports: Join DZone's Software Supply Chain Security Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you choose) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Software Supply Chain Security Research

Supply chains aren't just for physical products anymore; they're a critical part of how software is built and delivered. At DZone, we're taking a closer look at the state of software supply chain security to understand how development teams are navigating emerging risks through smarter tooling, stronger practices, and the strategic use of AI. Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. We're exploring key topics such as:

• SBOM adoption and real-world usage
• The role of AI and ML in threat detection
• Implementation of zero trust security models
• Cloud and open-source security postures
• Modern approaches to incident response

Join the Security Research

We’ve also created some painfully relatable memes about the state of software supply chain security. If you’ve ever muttered “this is fine” while scanning dependencies, these are for you! Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team

By Lauren Forbes
AI’s Role in Everyday Development

Introduction

From automated code generation to intelligent debugging and DevOps optimization, AI-powered tools are enhancing efficiency and improving software quality. As software engineering evolves, developers who leverage AI can significantly reduce development time, minimize errors, and improve productivity. Can software engineers really be replaced? Let’s explore briefly. This article explores how AI can be integrated into various aspects of software development, covering real-world examples and tools that can assist engineers in their daily workflows.

1. AI for Code Generation and Auto-Completion

AI-driven code generation tools have transformed how developers write code by providing real-time suggestions, automating repetitive tasks, and even generating entire functions based on context. These tools significantly reduce development effort and allow engineers to focus on building innovative features instead of writing boilerplate code.

Key AI Tools

• GitHub Copilot – Uses OpenAI’s Codex to suggest code snippets and entire functions.
• Tabnine – Predicts and completes lines of code using deep learning models.
• Codeium – Offers AI-driven auto-completions to accelerate development.

Example Use Case

A back-end developer working with Node.js can use GitHub Copilot to auto-generate API route handlers, reducing boilerplate code and improving efficiency. AI-powered completion can also help suggest parameter names, detect missing dependencies, and improve code consistency.

2. AI for Debugging and Error Detection

Debugging is a time-consuming process, but AI-powered tools can detect errors, suggest fixes, and even predict potential bugs before they cause major issues. AI can analyze millions of lines of code, compare patterns, and identify potential runtime errors with remarkable accuracy.

Key AI Tools

• DeepCode – Analyzes code and suggests security and performance improvements.
• Snyk – Identifies and fixes vulnerabilities in open-source dependencies.
• CodiumAI – Helps with AI-driven bug detection and auto-fixes.

Example Use Case

A software engineer using Python may run their code through DeepCode to get suggestions for potential null-pointer exceptions or security vulnerabilities. AI-based debugging assistants can also provide real-time explanations of why a particular error occurred and suggest best practices for fixing it.

3. AI-Driven Automated Code Reviews

Code reviews ensure high-quality software, but manually reviewing every line of code can be tedious. AI can automate this process, helping teams maintain clean and efficient codebases while enforcing best practices.

Key AI Tools

• Codacy – Provides automated feedback on security, performance, and style.
• SonarQube – Performs static code analysis to find vulnerabilities.
• Amazon CodeWhisperer – Suggests code improvements during the review process.

Example Use Case

A DevOps engineer integrates SonarQube into the CI/CD pipeline to enforce best practices before merging pull requests. AI-assisted code reviews can automatically check for security vulnerabilities, inconsistent styling, and inefficiencies, ensuring that teams maintain high code quality.

4. AI-Powered Documentation Generation

Writing documentation is often overlooked, but AI can automatically generate and update documentation based on code changes. AI-based documentation generators can extract information from function definitions, comments, and structured data to create detailed and easy-to-read documentation.
Key AI Tools

• Mintlify – Generates API documentation from function comments.
• AutoDocs – Extracts documentation from structured codebases.

Example Use Case

A full-stack engineer working on a React project uses Mintlify to generate up-to-date API documentation effortlessly. AI-based tools can also suggest inline comments and documentation improvements to enhance code readability.

5. AI for Software Testing and QA

AI generates test cases, detects anomalies, and automates regression testing. AI-powered testing solutions help teams catch bugs faster and optimize test coverage.

Key AI Tools

• Testim – Uses AI to create automated tests for web applications.
• Applitools – Provides AI-powered visual testing for UI verification.
• Mabl – Automates functional UI and API testing using AI insights.

Example Use Case

A quality assurance (QA) engineer uses Testim to generate Selenium-based automated tests for a new web application. AI testing tools also analyze historical test results to suggest potential areas of failure and improve test efficiency.

6. AI in DevOps and CI/CD Optimization

AI-powered DevOps solutions help with intelligent deployments, resource management, and infrastructure monitoring. AI can analyze server logs, predict failures, and optimize deployment workflows.

Key AI Tools

• Harness – AI-driven continuous deployment automation.
• Google Cloud AI Ops – Provides predictive analytics for infrastructure monitoring.
• AWS DevOps Guru – Uses ML to analyze system performance and prevent failures.

Example Use Case

A cloud engineer integrates AWS DevOps Guru to analyze logs and predict potential downtime in a Kubernetes cluster. AI-driven alerts help teams proactively address performance issues before they impact users.

7. AI for Code Refactoring and Optimization

Refactoring large codebases manually is tedious, but AI can analyze and restructure code for better readability and performance. AI refactoring tools help detect inefficient patterns and suggest optimizations.

Key AI Tools

• Refact.ai – Suggests code refactoring improvements.
• ChatGPT – Can explain and optimize code snippets.

Example Use Case

An enterprise developer working on a legacy Java application uses AI to convert the monolithic architecture into microservices. AI tools can also help with renaming variables, restructuring classes, and improving modularization.

8. AI for Continuous Learning and Skill Enhancement

AI-powered educational platforms help engineers stay up to date with the latest technologies and best practices by providing interactive learning experiences and real-time coding assistance.

Key AI Tools

• ChatGPT – Explains concepts and provides learning recommendations.
• Copilot Labs – Offers interactive coding assistance and explanations.

Example Use Case

A junior engineer learning machine learning asks ChatGPT to explain decision trees in simple terms with code examples. AI-based learning assistants can also provide code snippets and interactive coding exercises to help learners grasp complex concepts faster.

Conclusion

Using these AI tools for coding, debugging, testing, DevOps, and learning, developers can focus on solving more significant engineering challenges while AI handles routine tasks. AI isn’t here to replace software engineers; instead, it empowers them to build software faster, smarter, and with higher quality. What AI tools do you use in your daily workflow?

By Mahesh Ganesamoorthi
Automatic Code Transformation With OpenRewrite

Code Maintenance/Refactoring Challenges

As with most problems in business, the challenge with maintaining code is to minimize cost and maximize benefit over some reasonable amount of time. For software maintenance, costs and benefits largely revolve around two things: the quantity and quality of both old and new code.

Quantity

According to SonarQube, our organization maintains at least 80 million lines of code. That’s a lot, especially if we stay current with security patches and rapid library upgrades.

Quality

In the fast-paced environments we often find ourselves in, a lot of code changes must come from:

• Copying what you see, either from nearby code or from places like StackOverflow.
• Knowledge that can be applied quickly.

These typically boil down to decisions made by individual programmers. Of course, this comes with pros and cons. This post is not meant to suggest human contributions are not extremely beneficial! I will discuss some benefits and negatives of automated refactoring and why we are moving from Butterfly, our current tool for automated refactoring, to OpenRewrite.

Benefits and Costs of Automated Refactoring

When we think about automation, we typically think about the benefits, and that is where I’ll start. Some include:

• If a recipe exists and works perfectly, the human cost is almost 0, especially if you have an easy way to apply recipes on a large scale. Of course, this human cost saving is the obvious and huge benefit.
• Easy migration to newer libraries/patterns/etc. brings security patches, performance improvements, and lower maintenance costs.
• An automated change can be educational. Hopefully, we still find time to thoroughly read documentation, but we often don’t! Seeing your refactored code should be educational and should help with future development costs.

There are costs to automated refactoring. I will highlight:

• If a recipe does not exist, OpenRewrite is not cost-free. As with all software, the cost of creating a recipe will need to be justified by its benefit. These costs may become substantial if we try to move towards a code change that is not reviewed by humans.
• OpenRewrite and AI reward you if you stick with commonly used programming languages, libraries, tools, etc. Sometimes going against the norm is justified. For example, Raptor 4's initial research phases looked at other technology stacks besides Spring and JAX-RS. Some goals included performance improvement. One of the reasons those other options were rejected is that they did not have support in Raptor's automated refactoring tool. Decisions like this can have a big impact on a larger organization.
• Possible loss of ‘design evolution.’ I believe in the ‘good programmers are lazy’ principle, and part of that laziness is avoiding the pain you go through to keep software up to date. This laziness serves to evolve software so that it can easily be updated. If you take away the pain, you take away one of the main incentives for doing that.

What We’ve Been Using: Butterfly

‘Butterfly’ is a two-part system. Its open-source command-line interface (CLI), officially named ‘Butterfly’, modifies files. There is also a hosted transformation tool called Butterfly, which can be used to run Butterfly transformations on GitHub repositories. This post focuses on replacing the CLI and its extension API with OpenRewrite. There is an OpenRewrite-powered large-scale change management tool (LSCM) named Moderne, which is not free.

Where We Are Going: OpenRewrite

Why are we switching to OpenRewrite?
• Adopted by open source projects that we use (Spring, Java, etc.).
• Maintained by a company, Moderne.
• Lossless Semantic Trees (akin to Abstract Syntax Trees), which allow compiler-like transformation. These are much more powerful than tools like regular expression substitution.
• Visitor pattern. Tree modification happens primarily by visiting tree members.
• They are tracking artificial intelligence to see how it can be leveraged for code transformation.

We are still early in the journey with OpenRewrite. While it is easy to use existing recipes, crafting new ones can be tricky (a bare-bones recipe skeleton is sketched later in this article).

What About Artificial Intelligence?

If you aren’t investigating AI, you certainly should be. If AI can predict what code should be created for a new feature, it certainly should be useful in code transformation, which is arguably easier than creation. Our organization has started the journey of incorporating AI into its toolset. We will be monitoring how tools like OpenRewrite and AI augment one another. On that note, we are investigating using AI to create OpenRewrite recipes.

How We’ve Used OpenRewrite

So far, our usage has been manually running recipes against a single software project. There have been multiple uses of OpenRewrite against an individual software project. I come from the JVM framework team, so our usage involved refactoring Java libraries. You can find some examples of that below:

• JUnit 4 to JUnit 5
• JAX-RS refactoring. Comments discuss some impressive changes. Note that there are multiple commits. More on why that was needed later.
• Nice GitHub release notes refactoring. This is a trivial PR, but being able to do it on a large scale with low cost helps with cost-based arguments when value is not widely agreed upon.
• Running the UpgradeSpringBoot_3_2, CommonStaticAnalysis, UpgradeToJava17, and MigrateHamcrestToAssertJ recipes on a larger organization project with a whopping 800K lines of code resulted in ~200K modified lines spanning ~4K files with an estimated time savings of ~8 days. I believe that is quite an underestimate of the savings!
• JUnit4 -> JUnit5 refactoring. Estimated savings: 1d 23h 31m.
• Common static analysis refactoring. Estimated savings: 3d 21h 29m. If you are tired of manually satisfying Sonar, then this recipe could be for you!

Unfortunately, these need to be bulk closed due to an issue (we’re not trying to hide anything!). You can read about that here. Again, I think OpenRewrite significantly underestimates some of these savings. Execution time was ~20 minutes. That was the computer’s time, not mine!

Caveat: It’s Only Easy When It’s Easy

When a recipe exists and has no bugs, everything is great! When it doesn’t, you have multiple questions. The two main ones are:

• Does an LST/parser exist? For example, OpenRewrite has no parser for C++ code, so there is no way to create a recipe for that language.
• If there is an LST/parser, how difficult is it to create a recipe? There are a bunch of interesting and easy ways to compose existing recipes; however, when you have to work directly with an LST, it can be challenging.

In short, it’s not always the answer. Good code development and stewardship still play a large role in minimizing long-term costs.

Manual Intervention

So far, the most complicated transformations have required human cleanup. Fortunately, those were in test cases, and the issues were apparent in a failed build. Until we get more sophisticated with detecting breaking changes, please understand that you own the changes, even if they come via a tool like OpenRewrite.
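To ground the earlier points about Lossless Semantic Trees and the visitor pattern, here is the bare-bones recipe skeleton referenced above. It loosely follows the Recipe/JavaIsoVisitor shape from OpenRewrite's documentation; the class name is invented, the visitor deliberately changes nothing, and exact method signatures vary a bit between OpenRewrite versions, so treat this as a sketch rather than a drop-in recipe.

Java

import org.openrewrite.ExecutionContext;
import org.openrewrite.Recipe;
import org.openrewrite.TreeVisitor;
import org.openrewrite.java.JavaIsoVisitor;
import org.openrewrite.java.tree.J;

// A do-little recipe: it visits method invocations and returns them unchanged.
// Real recipes mutate the LST (rename calls, swap imports, etc.) inside the visitor.
public class NoOpInvocationRecipe extends Recipe {

    @Override
    public String getDisplayName() {
        return "No-op method invocation visitor";
    }

    @Override
    public String getDescription() {
        return "Demonstrates the visitor pattern; makes no changes.";
    }

    @Override
    public TreeVisitor<?, ExecutionContext> getVisitor() {
        return new JavaIsoVisitor<ExecutionContext>() {
            @Override
            public J.MethodInvocation visitMethodInvocation(J.MethodInvocation method, ExecutionContext ctx) {
                // Inspect (or rewrite) the invocation here; returning it unchanged is a no-op.
                return super.visitMethodInvocation(method, ctx);
            }
        };
    }
}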
Triaging Problems

OpenRewrite does not have application logging like normal Java software. It also does not always produce errors in ways that you might expect. To help with these problems, we have a recommendations page in our internal OpenRewrite documentation.

Conclusion

Hopefully, you are excited about the new tools coming that will help you maximize the value.

Resource

• OpenRewrite documentation

By Gangadhararamachary Ramadugu
How to Configure and Customize the Go SDK for Azure Cosmos DB

The Go SDK for Azure Cosmos DB is built on top of the core Azure Go SDK package, which implements several patterns that are applied throughout the SDK. The core SDK is designed to be quite customizable, and its configurations can be applied with the ClientOptions struct when creating a new Cosmos DB client object using NewClient (and other similar functions). If you peek inside the azcore.ClientOptions struct, you will notice that it has many options for configuring the HTTP client, retry policies, timeouts, and other settings. In this blog, we will cover how to make use of (and extend) these common options when building applications with the Go SDK for Cosmos DB. I have provided code snippets throughout this blog. Refer to this GitHub repository for runnable examples.

Retry Policies

Common retry scenarios are handled in the SDK. You can dig into cosmos_client_retry_policy.go for more info. Here is a summary of errors for which retries are attempted:

| Error Type / Status Code | Retry Logic |
| ----------------------------------------- | -------------------------------------------------------------------------------- |
| Network Connection Errors | Retry after marking the endpoint unavailable and waiting for defaultBackoff. |
| 403 Forbidden (with specific substatuses) | Retry after marking the endpoint unavailable and updating the endpoint manager. |
| 404 Not Found (specific substatus) | Retry by switching to another session or endpoint. |
| 503 Service Unavailable | Retry by switching to another preferred location. |

Let's see some of these in action.

Non-Retriable Errors

For example, here is a function that tries to read a database that does not exist.

Go

func retryPolicy1() {
	c, err := auth.GetClientWithDefaultAzureCredential("https://demodb.documents.azure.com:443/", nil)
	if err != nil {
		log.Fatal(err)
	}

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})

	// Set logging level to include retries
	azlog.SetEvents(azlog.EventRetryPolicy)

	db, err := c.NewDatabase("i_dont_exist")
	if err != nil {
		log.Fatal("NewDatabase call failed", err)
	}

	_, err = db.Read(context.Background(), nil)
	if err != nil {
		log.Fatal("Read call failed: ", err)
	}
}

The azcore logging implementation is configured using SetListener and SetEvents to write retry policy event logs to standard output. See the Logging section in the azcosmos package README for details. Let's look at the logs generated when this code is run:

Plain Text

//....
Retry Policy Event: exit due to non-retriable status code
Retry Policy Event: =====> Try=1 for GET https://demodb.documents.azure.com:443/dbs/i_dont_exist
Retry Policy Event: response 404
Retry Policy Event: exit due to non-retriable status code
Read call failed: GET https://demodb-region.documents.azure.com:443/dbs/i_dont_exist
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: 404 Not Found

When a request is made to read a non-existent database, the SDK gets a 404 (not found) response for the database. This is recognized as a non-retriable error, and the SDK stops retrying. Retries are only performed for retriable errors (like network issues or certain status codes). The operation failed because the database does not exist.

Retriable Errors - Invalid Account

This function tries to create a Cosmos DB client using an invalid account endpoint. It sets up logging for retry policy events and attempts to create a database.
Go

func retryPolicy2() {
	c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", nil)
	if err != nil {
		log.Fatal(err)
	}

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})

	// Set logging level to include retries
	azlog.SetEvents(azlog.EventRetryPolicy)

	_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
	if err != nil {
		log.Fatal(err)
	}
}

Let's look at the logs generated when this code is run, and see how the SDK handles retries when the endpoint is unreachable:

Plain Text

//....
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #1, Delay=682.644105ms
Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #2, Delay=2.343322179s
Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #3, Delay=7.177314269s
Retry Policy Event: =====> Try=4 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: MaxRetries 3 exceeded
failed to retrieve account properties: Get "https://iamnothere.docume

Each failed attempt is logged, and the SDK retries the operation several times (three times to be specific), with increasing delays between attempts. After exceeding the maximum number of retries, the operation fails with an error indicating the host could not be found - the SDK automatically retries transient network errors before giving up. But you don't have to stick to the default retry policy. You can customize the retry policy by setting the azcore.ClientOptions when creating the Cosmos DB client.

Configurable Retries

Let's say you want to set a custom retry policy with a maximum of two retries and a delay of one second between retries. You can do this by creating a policy.RetryOptions struct and passing it to the azcosmos.ClientOptions when creating the client.

Go

func retryPolicy3() {
	retryPolicy := policy.RetryOptions{
		MaxRetries: 2,
		RetryDelay: 1 * time.Second,
	}

	opts := azcosmos.ClientOptions{
		ClientOptions: policy.ClientOptions{
			Retry: retryPolicy,
		},
	}

	c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", &opts)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(c.Endpoint())

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})
	azlog.SetEvents(azlog.EventRetryPolicy)

	_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
	if err != nil {
		log.Fatal(err)
	}
}

Each failed attempt is logged, and the SDK retries the operation according to the custom policy — only two retries, with a 1-second delay after the first attempt and a longer delay after the second.
After reaching the maximum number of retries, the operation fails with an error indicating the host could not be found.

Plain Text

Retry Policy Event: =====> Try=1 for GET https://iamnothere.documents.azure.com:443/
//....
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #1, Delay=1.211970493s
Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #2, Delay=3.300739653s
Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: MaxRetries 2 exceeded
failed to retrieve account properties: Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
exit status 1

Note: The first attempt is not counted as a retry, so the total number of attempts is three (1 initial + 2 retries). You can customize this further by implementing fault injection policies. This allows you to simulate various error scenarios for testing purposes.

Fault Injection

For example, you can create a custom policy that injects a fault into the request pipeline. Here, we use a custom policy (FaultInjectionPolicy) that simulates a network error on a configurable fraction of requests.

Go

type FaultInjectionPolicy struct {
	failureProbability float64 // e.g., 0.3 for 30% chance to fail
}

// Implement the Policy interface
func (f *FaultInjectionPolicy) Do(req *policy.Request) (*http.Response, error) {
	if rand.Float64() < f.failureProbability {
		// Simulate a network error
		return nil, &net.OpError{
			Op:  "read",
			Net: "tcp",
			Err: errors.New("simulated network failure"),
		}
	}

	// no failure - continue with the request
	return req.Next()
}

This can be used to inject custom failures into the request pipeline. The function below configures the Cosmos DB client to use this policy, sets up logging for retry events, and attempts to create a database.

Go

func retryPolicy4() {
	opts := azcosmos.ClientOptions{
		ClientOptions: policy.ClientOptions{
			PerRetryPolicies: []policy.Policy{&FaultInjectionPolicy{failureProbability: 0.6}},
		},
	}

	c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", &opts) // Updated to use opts
	if err != nil {
		log.Fatal(err)
	}

	azlog.SetListener(func(cls azlog.Event, msg string) {
		// Log retry-related events
		switch cls {
		case azlog.EventRetryPolicy:
			fmt.Printf("Retry Policy Event: %s\n", msg)
		}
	})

	// Set logging level to include retries
	azlog.SetEvents(azlog.EventRetryPolicy)

	_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test_1"}, nil)
	if err != nil {
		log.Fatal(err)
	}
}

Take a look at the logs generated when this code is run. Each request attempt fails due to the simulated network error. The SDK logs each retry, with increasing delays between attempts. After reaching the maximum number of retries (default = 3), the operation fails with an error indicating a simulated network failure. Note: This can change depending on the failure probability you set in the FaultInjectionPolicy. In this case, we set it to 0.6 (60% chance to fail), so you may see different results each time you run the code.
Plain Text

Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
//....
Retry Policy Event: MaxRetries 0 exceeded
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=794.018648ms
Retry Policy Event: =====> Try=2 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #2, Delay=2.374693498s
Retry Policy Event: =====> Try=3 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #3, Delay=7.275038434s
Retry Policy Event: =====> Try=4 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: MaxRetries 3 exceeded
Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=968.457331ms
2025/05/05 19:53:50 failed to retrieve account properties: read tcp: simulated network failure
exit status 1

Do take a look at Custom HTTP pipeline policies in the Azure SDK for Go documentation for more information on how to implement custom policies.

HTTP-Level Customizations

There are scenarios where you may need to customize the HTTP client used by the SDK. For example, when using the Cosmos DB emulator locally, you want to skip certificate verification to connect without SSL errors during development or testing. TLSClientConfig allows you to customize TLS settings for the HTTP client, and setting InsecureSkipVerify: true disables certificate verification – useful for local testing but insecure for production.

Go

func customHTTP1() {
	// Create a custom HTTP client that skips TLS certificate verification
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	clientOptions := &azcosmos.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			Transport: client,
		},
	}

	c, err := auth.GetEmulatorClientWithAzureADAuth("http://localhost:8081", clientOptions)
	if err != nil {
		log.Fatal(err)
	}

	_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
	if err != nil {
		log.Fatal(err)
	}
}

All you need to do is pass the custom HTTP client to the ClientOptions struct when creating the Cosmos DB client. The SDK will use this for all requests. Another scenario is when you want to set a custom header for all requests to track requests or add metadata. All you need to do is implement the Do method of the policy.Policy interface and set the header in the request:

Go

type CustomHeaderPolicy struct{}

func (c *CustomHeaderPolicy) Do(req *policy.Request) (*http.Response, error) {
	correlationID := uuid.New().String()
	req.Raw().Header.Set("X-Correlation-ID", correlationID)
	return req.Next()
}

Looking at the logs, notice the custom header X-Correlation-ID is added to each request:

Plain Text

//...
Request Event: ==> OUTGOING REQUEST (Try=1)
   GET https://ACCOUNT_NAME.documents.azure.com:443/
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin)
   X-Correlation-Id: REDACTED
   X-Ms-Cosmos-Sdk-Supportedcapabilities: 1
   X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT
   X-Ms-Version: 2020-11-05
Request Event: ==> OUTGOING REQUEST (Try=1)
   POST https://ACCOUNT_NAME-region.documents.azure.com:443/dbs
   Authorization: REDACTED
   Content-Length: 27
   Content-Type: application/query+json
   User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin)
   X-Correlation-Id: REDACTED
   X-Ms-Cosmos-Sdk-Supportedcapabilities: 1
   X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT
   X-Ms-Documentdb-Query: True
   X-Ms-Version: 2020-11-05

OpenTelemetry Support

The Azure Go SDK supports distributed tracing via OpenTelemetry. This allows you to collect, export, and analyze traces for requests made to Azure services, including Cosmos DB. The azotel package is used to connect an instance of OpenTelemetry's TracerProvider to an Azure SDK client (in this case, Cosmos DB). You can then configure the TracingProvider in azcore.ClientOptions to enable automatic propagation of trace context and emission of spans for SDK operations.

Go

func getClientOptionsWithTracing() (*azcosmos.ClientOptions, *trace.TracerProvider) {
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatalf("failed to initialize stdouttrace exporter: %v", err)
	}

	tp := trace.NewTracerProvider(trace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)

	op := azcosmos.ClientOptions{
		ClientOptions: policy.ClientOptions{
			TracingProvider: azotel.NewTracingProvider(tp, nil),
		},
	}

	return &op, tp
}

The above function creates a stdout exporter for OpenTelemetry (prints traces to the console). It sets up a TracerProvider, registers this as the global tracer, and returns a ClientOptions struct with the TracingProvider set, ready to be used with the Cosmos DB client.

Go

func tracing() {
	op, tp := getClientOptionsWithTracing()
	defer func() {
		_ = tp.Shutdown(context.Background())
	}()

	c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", op)
	//....

	container, err := c.NewContainer("existing_db", "existing_container")
	if err != nil {
		log.Fatal(err)
	}

	//ctx := context.Background()
	tracer := otel.Tracer("tracer_app1")
	ctx, span := tracer.Start(context.Background(), "query-items-operation")
	defer span.End()

	query := "SELECT * FROM c"
	pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil)

	for pager.More() {
		queryResp, err := pager.NextPage(ctx)
		if err != nil {
			log.Fatal("query items failed:", err)
		}

		for _, item := range queryResp.Items {
			log.Printf("Queried item: %+v\n", string(item))
		}
	}
}

The above function calls getClientOptionsWithTracing to get tracing-enabled options and a tracer provider, and ensures the tracer provider is shut down at the end (which flushes traces). It creates a Cosmos DB client with tracing enabled and executes an operation to query items in a container. The SDK call is traced automatically and exported to stdout in this case. You can plug in any OpenTelemetry-compatible tracer provider, and traces can be exported to various backends. Here is a snippet for the Jaeger exporter. The traces are quite large, so here is a small snippet of the trace output. Check the query_items_trace.txt file in the repo for the full trace output:

Go

//...
{ "Name": "query_items democontainer", "SpanContext": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "f2c892bec75dbf5d", "TraceFlags": "01", "TraceState": "", "Remote": false }, "Parent": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "b833d109450b779b", "TraceFlags": "01", "TraceState": "", "Remote": false }, "SpanKind": 3, "StartTime": "2025-05-06T17:59:30.90146+05:30", "EndTime": "2025-05-06T17:59:36.665605042+05:30", "Attributes": [ { "Key": "db.system", "Value": { "Type": "STRING", "Value": "cosmosdb" } }, { "Key": "db.cosmosdb.connection_mode", "Value": { "Type": "STRING", "Value": "gateway" } }, { "Key": "db.namespace", "Value": { "Type": "STRING", "Value": "demodb-gosdk3" } }, { "Key": "db.collection.name", "Value": { "Type": "STRING", "Value": "democontainer" } }, { "Key": "db.operation.name", "Value": { "Type": "STRING", "Value": "query_items" } }, { "Key": "server.address", "Value": { "Type": "STRING", "Value": "ACCOUNT_NAME.documents.azure.com" } }, { "Key": "az.namespace", "Value": { "Type": "STRING", "Value": "Microsoft.DocumentDB" } }, { "Key": "db.cosmosdb.request_charge", "Value": { "Type": "STRING", "Value": "2.37" } }, { "Key": "db.cosmosdb.status_code", "Value": { "Type": "INT64", "Value": 200 } } ], //.... Refer to Semantic Conventions for Microsoft Cosmos DB. What About Other Metrics? When executing queries, you can get basic metrics about the query execution. The Go SDK provides a way to access these metrics through the QueryResponse struct in the QueryItemsResponse object. This includes information about the query execution, including the number of documents retrieved, etc. Plain Text func queryMetrics() { //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(context.Background()) if err != nil { log.Fatal("query items failed:", err) } log.Println("query metrics:\n", *queryResp.QueryMetrics) //.... } } The query metrics are provided as a simple raw string in a key-value format (semicolon-separated), which is very easy to parse. Here is an example: Plain Text totalExecutionTimeInMs=0.34;queryCompileTimeInMs=0.04;queryLogicalPlanBuildTimeInMs=0.00;queryPhysicalPlanBuildTimeInMs=0.02;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.07;indexLookupTimeInMs=0.00;instructionCount=41;documentLoadTimeInMs=0.04;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=9;retrievedDocumentSize=1251;outputDocumentCount=9;outputDocumentSize=2217;writeOutputTimeInMs=0.02;indexUtilizationRatio=1.00 Here is a breakdown of the metrics you can obtain from the query response: Plain Text | Metric | Unit | Description | | ------------------------------ | ----- | ------------------------------------------------------------ | | totalExecutionTimeInMs | ms | Total time taken to execute the query, including all phases. | | queryCompileTimeInMs | ms | Time spent compiling the query. | | queryLogicalPlanBuildTimeInMs | ms | Time spent building the logical plan for the query. | | queryPhysicalPlanBuildTimeInMs | ms | Time spent building the physical plan for the query. | | queryOptimizationTimeInMs | ms | Time spent optimizing the query. | | VMExecutionTimeInMs | ms | Time spent executing the query in the Cosmos DB VM. | | indexLookupTimeInMs | ms | Time spent looking up indexes. 
| instructionCount | count | Number of instructions executed for the query. |
| documentLoadTimeInMs | ms | Time spent loading documents from storage. |
| systemFunctionExecuteTimeInMs | ms | Time spent executing system functions in the query. |
| userFunctionExecuteTimeInMs | ms | Time spent executing user-defined functions in the query. |
| retrievedDocumentCount | count | Number of documents retrieved by the query. |
| retrievedDocumentSize | bytes | Total size of documents retrieved. |
| outputDocumentCount | count | Number of documents returned as output. |
| outputDocumentSize | bytes | Total size of output documents. |
| writeOutputTimeInMs | ms | Time spent writing the output. |
| indexUtilizationRatio | ratio | Ratio of index utilization (1.0 means fully utilized). |
Conclusion
In this blog, we covered how to configure and customize the Go SDK for Azure Cosmos DB. We looked at retry policies, HTTP-level customizations, OpenTelemetry support, and how to access query metrics. The Go SDK for Azure Cosmos DB is designed to be flexible and customizable, allowing you to tailor it to your specific needs. For more information, refer to the package documentation and the GitHub repository. I hope you find this useful!
Resources
Go SDK for Azure Cosmos DB
Core Azure Go SDK package
ClientOptions
NewClient

By Abhishek Gupta DZone Core CORE
The Cypress Edge: Next-Level Testing Strategies for React Developers
The Cypress Edge: Next-Level Testing Strategies for React Developers

Introduction Testing is the backbone of building reliable software. As a React developer, you’ve likely heard about Cypress—a tool that’s been making waves in the testing community. But how do you go from writing your first test to mastering complex scenarios? Let’s break it down together, step by step, with real-world examples and practical advice. Why Cypress Stands Out for React Testing Imagine this: You’ve built a React component, but it breaks when a user interacts with it. You spend hours debugging, only to realize the issue was a missing prop. Cypress solves this pain point by letting you test components in isolation, catching errors early. Unlike traditional testing tools, Cypress runs directly in the browser, giving you a real-time preview of your tests. It’s like having a pair of eyes watching every click, hover, and API call. Key Advantages: Real-Time Testing: Runs in the browser with instant feedback.Automatic Waiting: Eliminates flaky tests caused by timing issues.Time Travel Debugging: Replay test states to pinpoint failures.Comprehensive Testing: Supports unit, integration, and end-to-end (E2E) tests Ever felt like switching between Jest, React Testing Library, and Puppeteer is like juggling flaming torches? Cypress simplifies this by handling component tests (isolated UI testing) and E2E tests (full user flows) in one toolkit. Component Testing vs. E2E Testing: What’s the Difference? Component Testing: Test individual React components in isolation. Perfect for verifying props, state, and UI behavior.E2E Testing: Simulate real user interactions across your entire app. Great for testing workflows like login → dashboard → checkout. Think of component tests as “microscope mode” and E2E tests as “helicopter view.” You need both to build confidence in your app. Setting Up Cypress in Your React Project Step 1: Install Cypress JavaScript npm install cypress --save-dev This installs Cypress as a development dependency. Pro Tip: If you’re using Create React App, ensure your project is ejected or configured to support Webpack 5. Cypress relies on Webpack for component testing. Step 2: Configure Cypress Create a cypress.config.js file in your project root: JavaScript const { defineConfig } = require('cypress'); module.exports = defineConfig({ component: { devServer: { framework: 'react', bundler: 'webpack', }, }, e2e: { setupNodeEvents(on, config) {}, baseUrl: 'http://localhost:3000', }, }); Step 3: Organize Your Tests JavaScript cypress/ ├── e2e/ # E2E test files │ └── login.cy.js ├── component/ # Component test files │ └── Button.cy.js └── fixtures/ # Mock data This separation ensures clarity and maintainability. Step 4: Launch the Cypress Test Runner JavaScript npx cypress open Select Component Testing and follow the prompts to configure your project. 
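The examples below focus on component testing, but the e2e/ folder from Step 3 works the same way. Here is a minimal sketch of an E2E spec, assuming the baseUrl configured above and a login page similar to the one built later in this article; the /login route and the "Welcome back" text are illustrative assumptions:
JavaScript
// cypress/e2e/login.cy.js
describe('Login flow', () => {
  it('logs the user in and lands on the dashboard', () => {
    // Relative paths resolve against baseUrl from cypress.config.js
    cy.visit('/login');

    cy.get('[data-testid="email-input"]').type('user@example.com');
    cy.get('[data-testid="password-input"]').type('password123');
    cy.get('[data-testid="submit-button"]').click();

    // Assert on the resulting URL and some visible confirmation text
    cy.url().should('include', '/dashboard');
    cy.contains('Welcome back').should('be.visible'); // assumed dashboard copy
  });
});
You can run this spec headlessly with npx cypress run --e2e once your dev server is up.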
Writing Your First Test: A Button Component The Component Create src/components/Button.js: JavaScript import React from 'react'; const Button = ({ onClick, children, disabled = false }) => { return ( <button onClick={onClick} disabled={disabled} data-testid="custom-button" > {children} </button> ); }; export default Button; The Test Create cypress/component/Button.cy.js: JavaScript import React from 'react'; import Button from '../../src/components/Button'; describe('Button Component', () => { it('renders a clickable button', () => { const onClickSpy = cy.spy().as('onClickSpy'); cy.mount(<Button onClick={onClickSpy}>Submit</Button>); cy.get('[data-testid="custom-button"]').should('exist').and('have.text', 'Submit'); cy.get('[data-testid="custom-button"]').click(); cy.get('@onClickSpy').should('have.been.calledOnce'); }); it('disables the button when the disabled prop is true', () => { cy.mount(<Button disabled={true}>Disabled Button</Button>); cy.get('[data-testid="custom-button"]').should('be.disabled'); }); }); Key Takeaways: Spies:cy.spy() tracks function calls.Selectors:data-testid ensures robust targeting.Assertions: Chain .should() calls for readability.Aliases:cy.get('@onClickSpy') references spies. Advanced Testing Techniques Handling Context Providers Problem: Your component relies on React Router or Redux. Solution: Wrap it in a test provider. Testing React Router Components: JavaScript import { MemoryRouter } from 'react-router-dom'; cy.mount( <MemoryRouter initialEntries={['/dashboard']}> <Navbar /> </MemoryRouter> ); Testing Redux-Connected Components: JavaScript import { Provider } from 'react-redux'; import { store } from '../../src/redux/store'; cy.mount( <Provider store={store}> <UserProfile /> </Provider> ); Leveling Up: Testing a Form Component Let’s tackle a more complex example: a login form. 
The Component Create src/components/LoginForm.js: JavaScript import React, { useState } from 'react'; const LoginForm = ({ onSubmit }) => { const [email, setEmail] = useState(''); const [password, setPassword] = useState(''); const handleSubmit = (e) => { e.preventDefault(); if (email.trim() && password.trim()) { onSubmit({ email, password }); } }; return ( <form onSubmit={handleSubmit} data-testid="login-form"> <input type="email" value={email} onChange={(e) => setEmail(e.target.value)} data-testid="email-input" placeholder="Email" /> <input type="password" value={password} onChange={(e) => setPassword(e.target.value)} data-testid="password-input" placeholder="Password" /> <button type="submit" data-testid="submit-button"> Log In </button> </form> ); }; export default LoginForm; The Test Create cypress/component/LoginForm.spec.js: JavaScript import React from 'react'; import LoginForm from '../../src/components/LoginForm'; describe('LoginForm Component', () => { it('submits the form with email and password', () => { const onSubmitSpy = cy.spy().as('onSubmitSpy'); cy.mount(<LoginForm onSubmit={onSubmitSpy} />); cy.get('[data-testid="email-input"]').type('test@example.com').should('have.value', 'test@example.com'); cy.get('[data-testid="password-input"]').type('password123').should('have.value', 'password123'); cy.get('[data-testid="submit-button"]').click(); cy.get('@onSubmitSpy').should('have.been.calledWith', { email: 'test@example.com', password: 'password123', }); }); it('does not submit if email is missing', () => { const onSubmitSpy = cy.spy().as('onSubmitSpy'); cy.mount(<LoginForm onSubmit={onSubmitSpy} />); cy.get('[data-testid="password-input"]').type('password123'); cy.get('[data-testid="submit-button"]').click(); cy.get('@onSubmitSpy').should('not.have.been.called'); }); }); Key Takeaways: Use .type() to simulate user input.Chain assertions to validate input values.Test edge cases, such as missing fields. Authentication Shortcuts Problem: Testing authenticated routes without logging in every time.Solution: Use cy.session() to cache login state. JavaScript beforeEach(() => { cy.session('login', () => { cy.visit('/login'); cy.get('[data-testid="email-input"]').type('user@example.com'); cy.get('[data-testid="password-input"]').type('password123'); cy.get('[data-testid="submit-button"]').click(); cy.url().should('include', '/dashboard'); }); cy.visit('/dashboard'); // Now authenticated! }); This skips redundant logins across tests, saving time. Handling API Requests and Asynchronous Logic Most React apps fetch data from APIs. Let’s test a component that loads user data. The Component Create src/components/UserList.js: JavaScript import React, { useEffect, useState } from 'react'; import axios from 'axios'; const UserList = () => { const [users, setUsers] = useState([]); const [loading, setLoading] = useState(false); useEffect(() => { setLoading(true); axios.get('https://api.example.com/users') .then((response) => { setUsers(response.data); setLoading(false); }) .catch(() => setLoading(false)); }, []); return ( <div data-testid="user-list"> {loading ? 
( <p>Loading...</p> ) : ( <ul> {users.map((user) => ( <li key={user.id} data-testid={`user-${user.id}`}> {user.name} </li> ))} </ul> )} </div> ); }; export default UserList; The Test Create cypress/component/UserList.spec.js: JavaScript import React from 'react'; import UserList from '../../src/components/UserList'; describe('UserList Component', () => { it('displays a loading state and then renders users', () => { cy.intercept('GET', 'https://api.example.com/users', { delayMs: 1000, body: [{ id: 1, name: 'John Doe' }, { id: 2, name: 'Jane Smith' }], }).as('getUsers'); cy.mount(<UserList />); cy.get('[data-testid="user-list"]').contains('Loading...'); cy.wait('@getUsers').its('response.statusCode').should('eq', 200); cy.get('[data-testid="user-1"]').should('have.text', 'John Doe'); cy.get('[data-testid="user-2"]').should('have.text', 'Jane Smith'); }); it('handles API errors gracefully', () => { cy.intercept('GET', 'https://api.example.com/users', { statusCode: 500, body: 'Internal Server Error', }).as('getUsersFailed'); cy.mount(<UserList />); cy.wait('@getUsersFailed'); cy.get('[data-testid="user-list"]').should('be.empty'); }); }); Why This Works: cy.intercept() mocks API responses without hitting a real server.delayMs simulates network latency to test loading states.Testing error scenarios ensures your component doesn’t crash. Best Practices for Sustainable Tests Isolate Tests: Reset state between tests using beforeEach hooks.Use Custom Commands: Simplify repetitive tasks (e.g., logging in) by adding commands to cypress/support/commands.js.Avoid Conditional Logic: Don’t use if/else in tests—each test should be predictable.Leverage Fixtures: Store mock data in cypress/fixtures to keep tests clean. Use Data Attributes as Selectors Example: data-testid="email-input" instead of #email or .input-primary.Why? Class names and IDs change; test IDs don’t. Mock Strategically Component Tests: Mock child components with cy.stub().E2E Tests: Mock APIs with cy.intercept(). Keep Tests Atomic Test one behavior per block: One test for login success.Another for login failure. Write Resilient Assertions Instead of: JavaScript cy.get('button').should('have.class', 'active'); Write: JavaScript cy.get('[data-testid="status-button"]').should('have.attr', 'aria-checked', 'true'); Cypress Time Travel Cypress allows users to see test steps visually. Use .debug() to pause and inspect state mid-test. JavaScript cy.get('[data-testid="submit-button"]').click().debug(); FAQs: Your Cypress Questions Answered Q: How do I test components that use React Router? A: Wrap your component in a MemoryRouter to simulate routing in your tests: JavaScript cy.mount( <MemoryRouter> <YourComponent /> </MemoryRouter> ); Q: Can I run Cypress tests in CI/CD pipelines? A: Absolutely! You can run your tests head less in environments like GitHub Actions using the command: JavaScript cypress run Q: How do I run tests in parallel to speed up CI/CD? A: To speed up your tests, you can run them in parallel with the following command: JavaScript npx cypress run --parallel Q: How do I test file uploads? A: You can test file uploads by selecting a file input like this: JavaScript cy.get('input[type="file"]').selectFile('path/to/file.txt'); Wrapping Up Cypress revolutionizes testing by integrating it smoothly into your workflow. Begin with straightforward components and progressively address more complex scenarios to build your confidence and catch bugs before they affect users. 
Keep in mind that the objective isn't to achieve 100% test coverage; rather, it's about creating impactful tests that ultimately save you time and prevent future headaches.

By Raju Dandigam
Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways
Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways

In my experience managing large-scale Kubernetes deployments across multi-cloud platforms, traffic control often becomes a critical bottleneck, especially when dealing with mixed workloads like APIs, UIs, and transactional systems. While Istio’s default ingress gateway does a decent job, I found that relying on a single gateway can introduce scaling and isolation challenges. That’s where configuring multiple Istio Ingress Gateways can make a real difference. In this article, I’ll walk you through how I approached this setup, what benefits it unlocked for our team, and the hands-on steps we used, along with best practices and YAML configurations that you can adapt in your own clusters. Why Do We Use an Additional Ingress Gateway? Using an additional Istio Ingress Gateway provides several advantages: Traffic isolation: Route traffic based on workload-specific needs (e.g., API traffic vs. UI traffic or transactional vs. non-transactional applications).Multi-tenancy: Different teams can have their gateway while still using a shared service mesh.Scalability: Distribute traffic across multiple gateways to handle higher loads efficiently.Security and compliance: Apply different security policies to specific gateway instances.Flexibility: You can create any number of additional ingress gateways based on project or application needs.Best practices: Kubernetes teams often use Horizontal Pod Autoscaler (HPA), Pod Disruption Budget (PDB), Services, Gateways, and Region-Based Filtering (via Envoy Filters) to enhance reliability and performance. Understanding Istio Architecture Istio IngressGateway and Sidecar Proxy: Ensuring Secure Traffic Flow When I first began working with Istio, one of the key concepts that stood out was the use of sidecar proxies. Every pod in the mesh requires an Envoy sidecar to manage traffic securely. This ensures that no pod can bypass security or observability policies. Without a sidecar proxy, applications cannot communicate internally or with external sources.The Istio Ingress Gateway manages external traffic entry but relies on sidecar proxies to enforce security and routing policies.This enables zero-trust networking, observability, and resilience across microservices. How Traffic Flows in Istio With Single and Multiple Ingress Gateways In an Istio service mesh, all external traffic follows a structured flow before reaching backend services. The Cloud Load Balancer acts as the entry point, forwarding requests to the Istio Gateway Resource, which determines traffic routing based on predefined policies. Here's how we structured the traffic flow in our setup: Cloud Load Balancer receives external requests and forwards them to Istio's Gateway Resource.The Gateway Resource evaluates routing rules and directs traffic to the appropriate ingress gateway: Primary ingress gateway: Handles UI requests.Additional ingress gateways: Route API, transactional, and non-transactional traffic separately.Envoy Sidecar Proxies enforce security policies, manage traffic routing, and monitor observability metrics.Requests are forwarded to the respective Virtual Services, which process and direct them to the final backend service. This structure ensures better traffic segmentation, security, and performance scalability, especially in multi-cloud Kubernetes deployments. Figure 1: Istio Service Mesh Architecture – Traffic routing from Cloud Load Balancer to Istio Gateway Resource, Ingress Gateways, and Service Mesh. 
Key Components of Istio Architecture Ingress gateway: Handles external traffic and routes requests based on policies.Sidecar proxy: Ensures all service-to-service communication follows Istio-managed rules.Control plane: Manages traffic control, security policies, and service discovery. Organizations can configure multiple Istio Ingress Gateways by leveraging these components to enhance traffic segmentation, security, and performance across multi-cloud environments. Comparison: Single vs. Multiple Ingress Gateways We started with a single ingress gateway and quickly realized that as traffic grew, it became a bottleneck. Splitting traffic using multiple ingress gateways was a simple but powerful change that drastically improved routing efficiency and fault isolation. On the other hand, multiple ingress gateways allowed better traffic segmentation for APIs, UI, and transaction-based workloads, improved security enforcement by isolating sensitive traffic, and scalability and high availability, ensuring each type of request is handled optimally. The following diagram compares a single Istio Ingress Gateway with multiple ingress gateways for handling API and web traffic. Figure 2: Single vs. Multiple Istio Ingress Gateways – Comparing routing, traffic segmentation, and scalability differences. Key takeaways from the comparison: A single Istio Ingress Gateway routes all traffic through a single entry point, which may become a bottleneck.Multiple ingress gateways allow better traffic segmentation, handling API traffic and UI traffic separately.Security policies and scaling strategies can be defined per gateway, making it ideal for multi-cloud or multi-region deployments. Feature Single Ingress Gateway Multiple Ingress Gateways Traffic Isolation No isolation, all traffic routes through a single gateway Different gateways for UI, API, transactional traffic Resilience If the single gateway fails, traffic is disrupted Additional ingress gateways ensure redundancy Scalability Traffic bottlenecks may occur Load distributed across multiple gateways Security Same security rules apply to all traffic shared Custom security policies per gateway Setting Up an Additional Ingress Gateway How Additional Ingress Gateways Improve Traffic Routing We tested routing different workloads (UI, API, transactional) through separate gateways. This gave each gateway its own scaling behavior and security profile. It also helped isolate production incidents — for example, UI errors no longer impacted transactional requests. The diagram below illustrates how multiple Istio Ingress Gateways efficiently manage API, UI, and transactional traffic. Figure 3: Multi-Gateway Traffic Flow – External traffic segmentation across API, UI, and transactional ingress gateways. How it works: Cloud Load Balancer forwards traffic to the Istio Gateway Resource, which determines routing rules.Traffic is directed to different ingress gateways: The Primary ingress gateway handles UI traffic.The API Ingress Gateway handles API requests.The Transactional Ingress Gateway ensures financial transactions and payments are processed securely.The Service Mesh enforces security, traffic policies, and observability. Step 1: Install Istio and Configure Operator For our setup, we used Istio’s Operator pattern to manage lifecycle operations. It’s flexible and integrates well with GitOps workflows. Prerequisites Kubernetes cluster with Istio installedHelm installed for deploying Istio components Ensure you have Istio installed. 
If not, install it using the following commands: Plain Text curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$(istio_version) TARGET_ARCH=x86_64 sh - export PATH="$HOME/istio-$ISTIO_VERSION/bin:$PATH" Initialize the Istio Operator Plain Text istioctl operator init Verify the Installation Plain Text kubectl get crd | grep istio Alternative Installation Using Helm Istio Ingress Gateway configurations can be managed using Helm charts for better flexibility and reusability. This allows teams to define customizable values.yaml files and deploy gateways dynamically. Helm upgrade command: Plain Text helm upgrade --install istio-ingress istio/gateway -f values.yaml This allows dynamic configuration management, making it easier to manage multiple ingress gateways. Step 2: Configure Additional Ingress Gateways With IstioOperator We defined separate gateways in the IstioOperator config (additional-ingress-gateway.yaml) — one for UI and one for API — and kept them logically grouped using Helm values files. This made our Helm pipelines cleaner and easier to scale or modify. Below is an example configuration to create multiple additional ingress gateways for different traffic types: YAML apiVersion: install.istio.io/v1alpha1 kind: IstioOperator metadata: name: additional-ingressgateways namespace: istio-system spec: components: ingressGateways: - name: istio-ingressgateway-ui enabled: true k8s: service: type: LoadBalancer - name: istio-ingressgateway-api enabled: true k8s: service: type: LoadBalancer Step 3: Additional Configuration Examples for Helm We found that adding HPA and PDB configs early helped ensure we didn’t hit availability issues during upgrades. This saved us during one incident where the default config couldn’t handle a traffic spike in the API gateway. Below are sample configurations for key Kubernetes objects that enhance the ingress gateway setup: Horizontal Pod Autoscaler (HPA) YAML apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: ingressgateway-hpa namespace: istio-system spec: minReplicas: 2 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 80 scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: istio-ingressgateway Pod Disruption Budget (PDB) YAML apiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: ingressgateway-pdb namespace: istio-system spec: minAvailable: 1 selector: matchLabels: app: istio-ingressgateway Region-Based Envoy Filter YAML apiVersion: networking.istio.io/v1alpha3 kind: EnvoyFilter metadata: name: region-header-filter namespace: istio-system spec: configPatches: - applyTo: HTTP_FILTER match: context: GATEWAY listener: filterChain: filter: name: envoy.filters.network.http_connection_manager subFilter: name: envoy.filters.http.router proxy: proxyVersion: ^1\.18.* patch: operation: INSERT_BEFORE value: name: envoy.filters.http.lua typed_config: '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua inlineCode: | function envoy_on_response(response_handle) response_handle:headers():add("X-Region", "us-eus"); end Step 4: Deploy Additional Ingress Gateways Apply the configuration using istioctl: Plain Text istioctl install -f additional-ingress-gateway.yaml Verify that the new ingress gateways are running: Plain Text kubectl get pods -n istio-system | grep ingressgateway After applying the configuration, we monitored the rollout using kubectl get pods and validated each gateway's service endpoint. 
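For reference, these are the kinds of checks we might run at this point; the deployment and service names assume the istio-ingressgateway-ui and istio-ingressgateway-api components defined in the IstioOperator configuration above:
Plain Text
# Confirm each additional gateway deployment rolled out cleanly
kubectl rollout status deployment/istio-ingressgateway-ui -n istio-system
kubectl rollout status deployment/istio-ingressgateway-api -n istio-system

# Verify each gateway received its own LoadBalancer address
kubectl get svc -n istio-system | grep ingressgateway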
Naming conventions like istio-ingressgateway-ui really helped keep things organized. Step 5: Define Gateway Resources for Each Ingress Each ingress gateway should have a corresponding gateway resource. Below is an example of defining separate gateways for UI, API, transactional, and non-transactional traffic: YAML apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: my-ui-gateway namespace: default spec: selector: istio: istio-ingressgateway-ui servers: - port: number: 443 name: https protocol: HTTPS hosts: - "ui.example.com" Repeat similar configurations for API, transactional, and non-transactional ingress gateways. Make sure your gateway resources use the correct selector. We missed this during our first attempt, and traffic didn’t route properly — a simple detail, big impact. Step 6: Route Traffic Using Virtual Services Once the gateways are configured, create Virtual Services to control traffic flow to respective services. YAML apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: my-api-service namespace: default spec: hosts: - "api.example.com" gateways: - my-api-gateway http: - route: - destination: host: my-api port: number: 80 Repeat similar configurations for UI, transactional, and non-transactional services. Just a note that VirtualServices gives you fine-grained control over traffic. We even used them to test traffic mirroring and canary rollouts between the gateways. Resilience and High Availability With Additional Ingress Gateways One of the biggest benefits we noticed: zero downtime during regional failovers. Having dedicated gateways meant we could perform rolling updates with zero user impact. This model also helped us comply with region-specific policies by isolating sensitive data flows per gateway — a crucial point when dealing with financial workloads. If the primary ingress gateway fails, additional ingress gateways can take over traffic seamlessly.When performing rolling upgrades or Kubernetes version upgrades, separating ingress traffic reduces downtime risk.In multi-region or multi-cloud Kubernetes clusters, additional ingress gateways allow better control of regional traffic and compliance with local regulations. Deploying additional IngressGateways enhances resilience and fault tolerance in a Kubernetes environment. Best Practices and Lessons Learned Many teams forget that Istio sidecars must be injected into every application pod to ensure service-to-service communication. Below are some lessons we learned the hard way When deploying additional ingress gateways, consider implementing: Horizontal Pod Autoscaler (HPA): Automatically scale ingress gateways based on CPU and memory usage.Pod Disruption Budgets (PDB): Ensure high availability during node upgrades or failures.Region-Based Filtering (EnvoyFilter): Optimize traffic routing by dynamically setting request headers with the appropriate region.Dedicated services and gateways: Separate logical entities for better security and traffic isolation.Ensure automatic sidecar injection is enabled in your namespace: Plain Text kubectl label namespace <your-namespace> istio-injection=enabled Validate that all pods have sidecars using: Plain Text kubectl get pods -n <your-namespace> -o wide kubectl get pods -n <your-namespace> -o jsonpath='{.items[*].spec.containers[*].name}' | grep istio-proxy Without sidecars, services will not be able to communicate, leading to failed requests and broken traffic flow. 
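Building on the traffic mirroring and canary rollouts mentioned in Step 6, here is a hedged sketch of a weighted VirtualService; the host, gateway, subsets, and weights are illustrative and assume a DestinationRule that defines v1 and v2 subsets for the my-api service:
YAML
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-api-canary
  namespace: default
spec:
  hosts:
    - "api.example.com"
  gateways:
    - my-api-gateway
  http:
    - route:
        - destination:
            host: my-api
            subset: v1
          weight: 90
        - destination:
            host: my-api
            subset: v2
          weight: 10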
When upgrading additional ingress gateways, consider the following: Delete old Istio configurations (if needed): If you are upgrading or modifying Istio, delete outdated configurations: Plain Text kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io istio-sidecar-injector kubectl get crd --all-namespaces | grep istio | awk '{print $1}' | xargs kubectl delete crd Ensure updates to proxy version, deployment image, and service labels during upgrades to avoid compatibility issues. YAML proxyVersion: ^1.18.* image: docker.io/istio/proxyv2:1.18.6 Scaling down Istio Operator: Before upgrading, scale down the Istio Operator to avoid disruptions. Plain Text kubectl scale deployment -n istio-operator istio-operator --replicas=0 Backup before upgrade: Plain Text kubectl get deploy,svc,cm,secret -n istio-system -o yaml > istio-backup.yaml Monitoring and Observability With Grafana With Istio's built-in monitoring, Grafana dashboards provide a way to segregate traffic flow by ingress type: Monitor API, UI, transactional, and non-transactional traffic separately.Quickly identify which traffic type is affected when an issue occurs in Production using Prometheus-based metricsIstio Gateway metrics can be monitored in Grafana & Prometheus to track traffic patterns, latency, and errors.It provides real-time metrics for troubleshooting and performance optimization.Using PrometheusAlertmanager, configure alerts for high error rates, latency spikes, and failed request patterns to improve reliability. FYI, we extended our dashboards in Grafana to visualize traffic per gateway. This was a game-changer — we could instantly see which gateway was spiking and correlate it to service metrics. Prometheus alerting was configured to trigger based on error rates per ingress type. This helped us catch and resolve issues before they impacted end users. Conclusion Implementing multiple Istio Ingress Gateways significantly transformed the architecture of our Kubernetes environments. This approach enabled us to independently scale different types of traffic, enforce custom security policies per gateway, and gain enhanced control over traffic management, scalability, security, and observability. By segmenting traffic into dedicated ingress gateways — for UI, API, transactional, and non-transactional services — we achieved stronger isolation, improved load balancing, and more granular policy enforcement across teams. This approach is particularly critical in multi-cloud Kubernetes environments, such as Azure AKS, Google GKE, Amazon EKS, Red Hat OpenShift, VMware Tanzu Kubernetes Grid, IBM Cloud Kubernetes Service, Oracle OKE, and self-managed Kubernetes clusters, where regional traffic routing, failover handling, and security compliance must be carefully managed. By leveraging best practices, including: Sidecar proxies for service-to-service securityHPA (HorizontalPodAutoscaler) for autoscalingPDB (PodDisruptionBudget) for availabilityEnvoy filters for intelligent traffic routingHelm-based deployments for dynamic configuration Organizations can build a highly resilient and efficient Kubernetes networking stack. Additionally, monitoring dashboards like Grafana and Prometheus provide deep observability into ingress traffic patterns, latency trends, and failure points, allowing real-time tracking of traffic flow, quick root-cause analysis, and proactive issue resolution. 
By following these principles, organizations can optimize their Istio-based service mesh architecture, ensuring high availability, enhanced security posture, and seamless performance across distributed cloud environments.
References
Istio Architecture Overview
Istio Ingress Gateway vs. Kubernetes Ingress
Istio Install Guide (Using Helm or Istioctl)
Istio Operator & Profiles for Custom Deployments
Best Practices for Istio Sidecar Injection
Istio Traffic Management: VirtualServices, Gateways & DestinationRules

By Prabhu Chinnasamy
Measuring the Impact of AI on Software Engineering Productivity
Measuring the Impact of AI on Software Engineering Productivity

It is hard to imagine a time not long ago where AI has not been front and center of our everyday news, let alone in the software engineering world? The advent of LLMs coupled with the existing compute power catapulted the use of AI in our everyday lives and in particular so in the life of a software engineer. This article breaks down some of the use cases of AI in software engineering and suggests a path to investigate the key question: Did we actually become more productive? It has only been a few years since the inception of GitHub Copilot in 2021. Since then, AI assisted coding tools have had a significant impact on software engineering practices. As of 2024 it is estimated that 75% of developers use some kind of AI tool. Often, these tools are not fully rolled out in organizations and used on the side. However, Gartner estimates that we will reach 90% enterprise adoption by 2028. Today there are dozens of tools that do or claim can help software engineers in their daily lives. Besides GitHub Copilot, ChatGTP, and Google Gemini, common tools include GitLab Duo, Claude, Jetbrains AI, Cody, Bolt, Cursor, and AWS CodeWhisperer. Updates are reported almost daily leading to new and advanced solutions. AI Assisted Coding with GitHub Copilot AI in Software Engineering: What is Changing? Looking at the use cases inside engineering organization, we can identify a number of key purposes: Building proof of concepts and scaffolding quickly for new products. Engineers use AI-based solutions that leverage intrinsic knowledge about frameworks for generating initial blueprints and solutions. Solutions include Bolt, v0 and similar.Writing new code, iterating existing code and using AI as a perceived productivity assistant. The purpose is to quickly iterate on existing solutions, have an AI-supported “knowledge base” and assistance. This type of AI not only produces code, but to a degree, replaces expert knowledge forums and sites such as Stack Overflow. This is the space where we have seen the most success with solutions being embedded on the IDE, connected to repositories and tightly integrated into the software development process. Automating engineering processes through Agentic AI. The latest approach is increasing the level of automation on niche tasks as well as connecting across tasks and development silos. Besides automating more mundane tasks, Agentic AI shapes up to be helpful in creating test cases, optimizing build pipelines and managing the whole planning-to-release process and is an area of much ongoing development. For the purpose of this article let us focus on the most mature technology—AI assisted coding solutions. Besides all the progress and the increasing adoption of AI, the main question remains: Are we any more productive? Productivity means getting done what needs to be done with a particular benefit in mind. Producing more code can be a step in the right direction, but it might also have unintended consequences of producing low-quality code, code that works, but does not meet the intention, or where junior developers might blindly accept code leading to issues down the road. Obviously, a lot depends on the skill of the prompt engineer (asking the right question), the ability to iterate on the AI generated code (the expertise and experience of the developer) and of course on the maturity of the AI technology. Let us dive into the productivity aspect in more detail. 
AI and Productivity: The Big Unknown One of the key questions in rolling out AI tools across the engineering organization is judging its productivity impact. How do we know if and when AI assisted coding really helps our organization to be more productive? What are good indicators and what might be good metrics to measure and track productivity over time? Firstly, as mentioned above, productivity does not mean simply writing more code. More code is just more code. It does not mean it necessarily does anything useful or adds something to a product that is actually needed. Nonetheless, more code produced quickly is helpful if it solves a business problem. Internal indicators for this can be that feature tickets get resolved quicker, code reviews are accepted (quickly) and security and quality criteria are met—either through higher pre-release pass rates, or lower incidence tickets post release. As such, some common indicators for productivity are: The throughput of your accepted coding activities as for instance defined by the number of PRs you get approved and merged in a week. The number of feature tickets or tasks that can be resolved in a sprint, for instance measured by the number of planning tickets you can complete. The quality and security standard of your coding activities. For instance, does AI coding assistance generate less security issues, do quality tests pass more often, or do code reviews take less time and less cycles?The time it takes to get any of the above done and a release out of the door. Do you release more often? Are your release pipelines more reliable? All things being equal, in a productive AI assisted coding organization we would expect that you would be able to ship more or ship faster—ideally both. ROI: Measuring the Impact of AI The best time to measure your engineering productivity is today. Productivity is never a single number and the trend is important. Having a baseline to measure the current state against future organizational and process improvements is crucial to gauge any productivity gains. If you haven’t invested heavily into AI tooling yet but planning to, it is a good time to establish a baseline. If you have invested in AI, it is essential to track ongoing changes over time. You can do this with manual investigation at certain points in time, or automatically and continuously with software engineering intelligence platforms such as Logilica, which not only track your ongoing metrics, but also enable you to forensically look into the past and future project states. There are a number of key metrics we suggest tracking and see if your AI investment pays off. We suggest centering them around the following dimensions: Speed of delivery: Are you able to deliver faster than before? This means, are you able to respond to customer needs and market demand quicker and more flexibly? Indicators are your release cadence, your lead time for releases, lead time for customer and planning tickets and even cycle times for each individual code activities (PRs).Features shipped: Are you able to actually ship more? Not producing more code only, but finishing more planned tasks, approving and merging code activity (PRs), and are you able to have more or larger releases? Throughput metrics are important if they are balanced with time and quality metrics.Level of predictability: One main challenge with software engineering is having on-target delivery and not letting deadlines or scope slip. Do your AI initiatives help you with this? 
For instance, do you hit the target dates more reliably? Does your sprint planning improve, and conversely, are you able to reduce your sprint overruns? Does your effort more reliably align with the business expectation, e.g., do you track if new features increase and bug fixing/technical debt decreases?Security/quality expectations: Does your downstream release pipeline improve with less build failures? Do you hit your testing and security scanning criteria? Do you see less support tickets since the introduction of AI? Is there a change in user sentiment that supports your ongoing investment?Developer team health: Lastly, does the introduction of AI positively impact your developer team health, lead to less overload and to happier teams? This is a big one and much less clear cut than one might expect. While AI assisted development can produce more code quicker, it is unclear if it does not create more burden elsewhere. For instance, more code means more code reviews, easily making humans a bottle neck again, which leads to frustration and burn out. Also, AI generated code might be larger, leading to larger PR where the actual submitter has less confidence in his own AI-assisted code. QA/security might feel the extra burden and customers report more bugs that take longer to resolve. Overall, it is essential to track engineering processes and key metrics from multiple dimensions simultaneously, ensuring that your AI investment actually delivers positive, measurable productivity gains. Tracking the Impact of AI Assisted Development Outlook AI assisted development has arrived. It is a new reality that will rapidly permeate all parts of the software development lifecycle. As such, it is critical to build up the expertise and strategies to use that technology in the most beneficial way. Ensuring success requires the right level of visibility into the software engineering processes to provide the essential observability for decision makers. Those decisions are two-fold: Justifying the investment to executive teams with data-driven evidence, and being able to set the right guardrails for productivity improvements and process goals. There is the inevitable hype cycle around AI assisted coding. To look beyond the hype it is important to measure the positive impact and steer its adoptions into the right direction, to ensure a positive business impact. Software Engineering Intelligence (SEI) platforms connect with your engineering data and give you the visibility into your processes and bottlenecks to get answers to the above questions. These platforms automate the measuring and analytics process for you to focus on the data-driven division making. In future parts of this series we will dive into details of how predictive models can be applied to your engineering processes, how you can use AI to monitor your software engineering AI and how SEI platforms can help you to build high-performance engineering organizations.

By Ralf Huuck
Simplify Authorization in Ruby on Rails With the Power of Pundit Gem
Simplify Authorization in Ruby on Rails With the Power of Pundit Gem

Hi, I'm Denis, a backend developer. I’ve recently been working on building a robust all-in-one CRM system for HR and finance, website, and team management. Using the Pundit gem, I was able to build an efficient role-based access system, and now I'd like to share my experience. Managing authorization efficiently became a crucial challenge as this system expanded, requiring a solution that was both scalable and easy to maintain. In Ruby on Rails, handling user access can quickly become complex, but the Pundit gem keeps it manageable. In this article, I will explore how Pundit simplifies authorization, its core concepts, and how you can integrate it seamlessly into your Rails project. By the end of this guide, you'll hopefully have a clear understanding of how to use Pundit effectively to manage user permissions and access control in your Rails applications.
Why Choose Pundit for Authorization?
Pundit is a lightweight and straightforward authorization library for Rails that follows the policy-based authorization pattern. Unlike other authorization libraries, such as CanCanCan, which rely on rule-based permissions, Pundit uses plain Ruby classes to define access rules. Some key benefits of Pundit include:
Simplicity: Uses plain old Ruby objects (POROs) to define policies, making it easy to read and maintain.
Flexibility: Provides fine-grained control over permissions, allowing complex authorization rules to be implemented easily.
Scalability: Easily extendable as your application grows without adding unnecessary complexity.
Security: Encourages explicit authorization checks, reducing the risk of unauthorized access to resources.
Unlike role-based systems that define broad roles, Pundit allows for granular, action-specific permission handling. This approach improves maintainability and prevents bloated permission models.
Pundit vs. CanCanCan
Pundit and CanCanCan are both popular authorization libraries for Rails, but they take different approaches:
| Feature | Pundit | CanCanCan |
| Authorization Method | Policy-based (separate classes for each resource) | Rule-based (centralized abilities file) |
| Flexibility | High (you define logic explicitly) | Medium (relies on DSL rules) |
| Complexity | Lower (straightforward Ruby classes) | Higher (complex rules can be harder to debug) |
| Performance | Generally better for large applications | Can slow down with many rules |
If you need explicit, granular control over access, Pundit is often the better choice. If you prefer a more declarative, centralized way of defining permissions, CanCanCan might be suitable.
Getting Started With Pundit
Before diving into using Pundit, it’s important to understand how it fits into Rails’ authorization system. By relying on clear, independent policies, Pundit keeps your code maintainable and easy to follow. Now, let’s walk through the setup process and see how you can start using Pundit to manage access control in your application.
1. Installing Pundit Gem
To begin using Pundit in your Rails project, add it to your Gemfile and run bundle install. Then, install Pundit by running its install generator. This command generates an ApplicationPolicy base class that will be used for defining your policies. This base class provides default behavior for authorization checks and serves as a template for specific policies you create for different models.
2. Defining Policies
Policies in Pundit are responsible for defining authorization rules for a given model or resource. A policy is simply a Ruby class stored inside the app/policies/ directory.
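For reference, here is a rough sketch of the setup pieces referenced above: the Gemfile entry, the install generator, and a trimmed version of the ApplicationPolicy base class it creates. Treat this as illustrative; the exact generated template varies across Pundit versions.
Ruby
# Gemfile
gem "pundit"

# Then, from the project root:
#   bundle install
#   rails generate pundit:install

# app/policies/application_policy.rb (trimmed sketch of the generated base class)
class ApplicationPolicy
  attr_reader :user, :record

  def initialize(user, record)
    @user = user
    @record = record
  end

  # Action-level checks default to "deny"; concrete policies override them.
  def index?
    false
  end

  def show?
    false
  end

  def create?
    false
  end

  def update?
    false
  end

  def destroy?
    false
  end

  # Scope restricts which records a user may see in collection actions.
  class Scope
    def initialize(user, scope)
      @user = user
      @scope = scope
    end

    def resolve
      raise NotImplementedError, "You must define #resolve in #{self.class}"
    end

    private

    attr_reader :user, :scope
  end
end
With this base class in place, a model-specific policy only needs to override the actions it actually allows.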
For example, let’s generate a policy for a Post model: This generates a PostPolicy class inside app/policies/post_policy.rb. A basic policy class looks like this: Each method defines an action (e.g., show?, update?, destroy?) and returns true or false based on whether the user has permission to perform that action. Keeping policy methods small and specific makes them easy to read and debug. 3. Using Policies in Controllers In your controllers, you can leverage Pundit's authorize method to enforce policies. Here’s how you can integrate Pundit into a PostsController: Here, authorize @post automatically maps to PostPolicy and calls the appropriate method based on the controller action. This ensures authorization is consistently checked before performing actions on a resource. 4. Handling Authorization at the View Level Pundit provides the policy helper, which allows you to check permissions in views: You can also use policy_scope to filter records based on permissions: This ensures that only authorized data is displayed to the user, preventing unauthorized access even at the UI level (but data loading with policy scope is recommended on the non-view level). 5. Custom Scopes for Querying Data Pundit allows you to define custom scopes for fetching data based on user roles. Modify PostPolicy to include a Scope class: In the controller: This ensures users only see records they are authorized to view, adding an extra layer of security and data privacy. In our experience, it is often necessary to load data from another scope, and then you need to specify additional parameters when loading data from the policy scope: Also, when you have several scopes for one policy, you can specify which one you need (because by default, the scope uses the "resolve" method for scope). For example, in your policy, you have: And you can call it: 6. Rescuing a Denied Authorization in Rails It's important not only to verify authorization correctly but also to handle errors and access permissions properly. In my implementation, I used role-based access rules to ensure secure and flexible control over user permissions, preventing unauthorized actions while maintaining a smooth user experience. I won’t be dwelling a lot upon them in this article, as I described them in detail in one of my recent CRM overviews. Pundit raises a Pundit::NotAuthorizedError you can rescue_from in your ApplicationController. You can customize the user_not_authorized method in every controller. So you can also change the behavior of your application when access is denied. Best Practices for Using Pundit To get the most out of Pundit, it's essential to follow best practices that ensure your authorization logic remains clean, efficient, and scalable. Let’s explore some key strategies to keep your policies well-structured and your application secure. 1. Сreating a Separate Module: A Clean and Reusable Approach A well-structured application benefits from modularization, reducing repetitive code, and improving maintainability. Module encapsulates authorization logic, making it easy to reuse across multiple controllers. Let’s break it down: The load_and_authorize_resource method is a powerful helper that: Loads the resource based on controller actions.Authorizes the resource using Pundit.Automatically assigns the loaded resource to an instance variable for use in the controller actions. Example: This means that controllers no longer need to explicitly load and authorize records, reducing boilerplate code. 
For example, the load_and_authorize method dynamically loads records based on controller actions: index: Loads all records.new/create: Initializes a new record.Other actions: Fetches a specific record using a flexible find strategy. This makes it easy to add authorization without cluttering individual controllers. 2. Applying It in a Controller With the AuthorizationMethods module included in ApplicationController, controllers become much cleaner. For example, in PostsControllerloading and authorizing a Post record is as simple as: With load_and_authorize_resource, the controller: ✅ Automatically loads Post records✅ Ensures authorization is enforced ✅ Remains clean and maintainable Other Best Practices for Pundit Keep policies concise and focused. Each policy should only contain logic related to authorization to maintain clarity and separation of concerns.Use scopes for query-level authorization. This ensures that unauthorized data is never retrieved from the database, improving both security and efficiency.Always call authorize in controllers. This prevents accidental exposure of sensitive actions by ensuring explicit permission checks.Avoid authorization logic in models. Keep concerns separate by handling authorization in policies rather than embedding logic within models. Wrap Up Pundit simplifies authorization in Ruby on Rails by providing a clean and structured way to define and enforce permissions. By using policies, scopes, and controller-based authorization, you can create secure, maintainable, and scalable applications with minimal complexity. If you’re building a Rails app that requires role-based access control, Pundit is a powerful tool that can streamline your authorization logic while keeping your codebase clean and easy to manage.

By Denys Kozlovskyi
Cookies Revisited: A Networking Solution for Third-Party Cookies
Cookies Revisited: A Networking Solution for Third-Party Cookies

Cookies are fundamental aspects of a web application that end users and developers frequently deal with. A cookie is a small piece of data that is stored in a user’s browser. The data element is used as a medium to communicate information between the web browser and the application's server-side layer. Cookies serve various purposes, such as remembering a user’s credentials (not recommended), targeting advertisements (tracking cookies), or helping to maintain a user’s authentication status in a web application. Several fantastic articles have been written on cookies over the years. This article focuses on handling cross-domain, aka third-party, cookies.
Cookie Types
Before jumping straight into the main goal of this article, let’s briefly highlight the categories into which we can break cookies. One category is based on the type of use case they solve, and the other is based on the ownership of the cookie.
Breakdown Based on Use Cases
Session Cookie
As the name suggests, a session cookie is used to manage a user’s web session. Typically, after successful authentication, the server sends the cookie back to the browser in the “Set-Cookie” response header. The browser includes the cookie in subsequent requests to the server. The server validates the cookie to make sure the user is still authenticated before responding with data. If the user logs out or the session times out, the cookie is invalidated. Likewise, if the user closes the browser, the session cookie becomes inaccessible.
JavaScript
// Setting a session cookie
document.cookie = "session_cookie=value; path=/";
Persistent Cookie
A persistent cookie is a type of cookie that doesn’t die when the browser is closed or when the user signs out of a web application. Its purpose is to retain some information on the user's workstation for a longer period of time. One common use case for a persistent cookie is two-factor authentication on a website. We’ve all encountered this experience, especially when logging into online banking portals. After entering our user ID and password, we’re often prompted for a second layer of authentication. This second factor is typically a one-time passcode (OTP), which is sent either to our mobile device via SMS or voice call, or to our email address (though using email is generally discouraged, as email accounts are more prone to compromise). Generally, the second-factor authentication screen gives us the option to remember the device. If we choose that option, the application typically generates a random code and persists it on the server side. The application sets that random code as a persistent cookie and sends it back to the browser. During a subsequent login, the client-side code of the application sends the persistent cookie in the request after successful authentication. If the server finds the persistent cookie valid, it doesn’t prompt the user for second-factor authentication; otherwise, the user is challenged for the OTP code.
JavaScript
// Setting a persistent cookie for 30 days
var expirationDate = new Date();
expirationDate.setDate(expirationDate.getDate() + 30);
document.cookie = "persistent_cookie=value; expires=" + expirationDate.toUTCString() + "; path=/";
Tracking Cookie
Unlike session or persistent cookies, which have been common since the inception of cookie-based solutions in web applications, tracking cookies are comparatively new and mostly a phenomenon of the past decade. Here, a website tracks a user's browsing activity and stores it in the browser. Later, it is used to display relevant advertisements to the user when they access the internet from the same browser. Because tracking cookies capture user data, websites that implement them prompt users to accept or reject them.
JavaScript
// Setting a tracking cookie with the SameSite=None and Secure options for cross-site access
var expirationDate = new Date();
expirationDate.setDate(expirationDate.getDate() + 365); // Expires after 1 year
document.cookie = "tracking_cookie=value; expires=" + expirationDate.toUTCString() + "; path=/; SameSite=None; Secure";
Breakdown Based on Ownership
First-Party Cookie
Imagine we open the URL www.abc.com in a browser tab. The website uses cookies, and a cookie is set in the browser. As the URL in the browser, www.abc.com, matches the domain of the cookie, it is a first-party cookie. In other words, a cookie issued for the website address present in the browser address bar is a first-party cookie.
Third-Party Cookie
Now, imagine there is a webpage within www.abc.com that loads content from a different website, www.xyz.com. Typically, this is done using an iFrame HTML tag. The cookie issued by www.xyz.com is called a third-party cookie. As the domain of the cookie for www.xyz.com doesn’t match the URL present in the address bar of the browser (www.abc.com), the cookie is considered third-party.
Solving Third-Party Cookie Access Issue
For privacy reasons, Safari on Mac and iOS, as well as Chrome in Incognito mode, block third-party cookies. Even if the third-party cookie is set with the SameSite=None; Secure attributes, Safari and Chrome Incognito will block it. Therefore, the iFrame-based content embedding example explained above will not work in browsers that enforce this restriction. In order to solve the problem, some networking work needs to be done:
An alias record, such as xyz-thirdparty.abc.com, needs to be created.
The alias record xyz-thirdparty.abc.com needs to have www.xyz.com as the target endpoint in the network configuration.
A certificate needs to be generated with the CN and Subject Alternative Name set to xyz-thirdparty.abc.com by a Certificate Authority (e.g., VeriSign). The certificate needs to be installed in the infrastructure (e.g., reverse proxy, web server, load balancer, etc.) of www.xyz.com.
The iFrame code should use the target URL xyz-thirdparty.abc.com instead of www.xyz.com.
This way, the cookie issued by www.xyz.com will actually be issued under the alias record xyz-thirdparty.abc.com. As the domain of the cookie is abc.com, which matches the domain of the URL present in the browser address bar (www.abc.com), the cookie will be treated as first-party. The application using the iFrame will work in Safari and Chrome Incognito mode.
Note: The subdomain for the alias record could be anything, like foo.abc.com. I have used xyz-thirdparty as the alias subdomain for demonstration purposes only.
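To make the moving parts concrete, here is a hedged sketch of the alias-record approach described above; the record values, paths, and cookie attributes are illustrative only:
Plain Text
# DNS: alias record in the abc.com zone pointing at the third-party host
xyz-thirdparty.abc.com.   300   IN   CNAME   www.xyz.com.

# Embedding page on www.abc.com references the alias, not www.xyz.com
<iframe src="https://xyz-thirdparty.abc.com/embedded-app" title="Embedded app"></iframe>

# Cookie set by the embedded application, now scoped under abc.com
Set-Cookie: embedded_session=value; Domain=.abc.com; Path=/; Secure; HttpOnly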
The diagram below demonstrates the networking solution.
Network configuration for cross-domain iFrame
Consideration
The www.xyz.com website must use the X-Frame-Options header (or, in modern browsers, the Content-Security-Policy frame-ancestors directive) in its infrastructure (e.g., reverse proxy) and allow www.abc.com as a site permitted to frame it. Otherwise, even with the alias record solution, www.abc.com will not be able to iFrame www.xyz.com. As a side note, these headers control whether a website can be framed and, if so, by which specific sites. This is done to protect a website from clickjacking attacks.
Conclusion
Protecting end users and websites from malicious attacks is critical in the modern web. Browsers are becoming more restrictive with additional controls. However, there are legitimate use cases where cross-domain communication needs to happen in a browser, such as embedding one website within another in an iFrame. Third-party cookies become a challenge to implement in cross-domain, iFrame-based implementations. This article demonstrated how to implement this feature using network configuration.
References
Saying goodbye to third-party cookies in 2024
X-Frame-Options

By Dipankar Saha
Top Book Picks for Site Reliability Engineers

I believe reading is fundamental. Site reliability engineers (SREs) need to have deep knowledge in a wide range of subjects and topics, such as coding, operating systems, computer networking, large-scale distributed systems, SRE best practices, and more, to be successful at their job. In this article, I discuss a few books that will help SREs become better at their job.

1. Site Reliability Engineering, by the Google SRE Team

Google originally coined the term "Site Reliability Engineering." This book is a must-read for anyone interested in site reliability engineering. It covers a wide range of topics that SREs focus on day to day, such as SLOs, eliminating toil, monitoring distributed systems, release management, incident management, infrastructure, and more, and gives an overview of the different elements that SREs work on. Although this book has many topics specific to Google, it provides a good framework and mental model for various SRE topics. The online version of this book is freely available here, so there is no excuse not to read it.

2. The Site Reliability Workbook, by the Google SRE Team

After the success of the original site reliability engineering book, the Google SRE team released this book as a continuation, adding more implementation details to the topics in the first book. One of my favorite chapters is "Introducing Non-Abstract Large Scale System Design," and I have read it multiple times. Like the first book, this one is also free to read online here.

3. Systems Performance, by Brendan Gregg

I was introduced to Brendan Gregg's work through his famous blog post "Linux Performance Analysis in 60,000 Milliseconds." This book introduced me to the USE method, which can help you quickly troubleshoot performance issues; USE stands for utilization, saturation, and errors. The book covers topics such as Linux kernel internals, various observability tools (to analyze CPU, memory, disk, file systems, and network), and application performance. The USE method helped me apply methodical problem solving while troubleshooting complex distributed system issues. This book can help you gain a deeper understanding of troubleshooting performance issues on a Linux operating system. More information about the book can be found here.

4. The Linux Programming Interface, by Michael Kerrisk

Having a deeper understanding of operating systems can provide a valuable advantage for SREs. Most of the time, SREs use many commands to configure and troubleshoot various OS-related issues; however, understanding how the operating system works internally makes troubleshooting easier. This book provides a deeper understanding of the Linux OS and focuses on its system call interface. The majority of teams and companies use Linux to run production systems, but you may work on teams where other operating systems, like Windows, are used. If that is the case, including a book specific to that OS in your reading list is worthwhile. You can check out the above-mentioned book here.

5. TCP/IP Illustrated: The Protocols, Volume 1, by Kevin Fall and Richard Stevens

This book is great for learning about core networking protocols such as IP (Internet Protocol), ICMP (Internet Control Message Protocol), ARP (Address Resolution Protocol), UDP (User Datagram Protocol), and TCP (Transmission Control Protocol). Having a strong understanding of the TCP/IP protocol suite and knowing how to use various tools to debug networking issues are core skills for SREs. This book gives the reader a strong understanding of how these protocols work under the hood. Details about the book are found here.

6. The Illustrated Network: How TCP/IP Works in a Modern Network, by Walter Goralski

While TCP/IP Illustrated provides an in-depth explanation of the core TCP/IP protocols, this book focuses on the fundamental principles and how they work in a modern networking context. It is a great addition to your library alongside TCP/IP Illustrated, providing a deeper and broader understanding of the TCP/IP protocols. More about this book can be found here.

7. Designing Data-Intensive Applications, by Martin Kleppmann

This is a great book for understanding how distributed systems work through the lens of data-oriented systems. If you are working on distributed database systems, this book is a must-read. I personally learned a lot from this book because I currently work as an SRE on CosmosDB (a globally distributed database service). What makes this book specifically useful for SREs is its focus on the reliability, scalability, and maintainability of data-intensive applications. It dives deep into distributed database concepts such as replication, partitioning, transactions, and the problems of distributed consensus. You can learn more about this book here.

8. Building Secure and Reliable Systems, by the Google SRE Team

This book extends the principles of site reliability engineering to encompass security, arguing that security and reliability are not separate concerns but are deeply related and should be addressed together. It advocates integrating security practices into every stage of the system lifecycle, from design and development to deployment and operations. Google has made this book available for free here.

9. Domain-Specific Books

Often, SREs work in specific domains such as databases, real-time communication systems, ERP/CRM systems, AI/ML systems, and more, and having a general understanding of these domains is important to be effective at your job. Including a book in your reading list that provides a breadth of knowledge about your domain is a great idea.

Conclusion

By reading these books, you can develop a deeper understanding of subjects such as coding, operating systems, computer networking, distributed systems, and SRE principles, which will help you become a better site reliability engineer. Personally, these books broadened the essential knowledge I need to perform my job as an SRE effectively and also helped me while pursuing opportunities across teams and organizations. Happy reading!

By Krishna Vinnakota
