Generative AI
AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry exponentially, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide. Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more. We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.
In this hands-on tutorial, you'll learn how to automate sentiment analysis and categorize customer feedback using Snowflake Cortex, all through a simple SQL query without needing to build heavy and complex machine learning algorithms. No MLOps is required. We'll work with sample data simulating real customer feedback comments about a fictional company, "DemoMart," and classify each customer feedback entry using Cortex's built-in function. We'll determine sentiment (positive, negative, neutral) and label the feedback into different categories. The goal is to:

Load a sample dataset of customer feedback into a Snowflake table.
Use the built-in LLM-powered classification (CLASSIFY_TEXT) to tag each entry with a sentiment and classify the feedback into a specific category.
Automate this entire workflow to run weekly using a Snowflake task.
Generate insights from the classified data.

Prerequisites

A Snowflake account with access to Snowflake Cortex
Role privileges to create tables, tasks, and procedures
Basic SQL knowledge

Step 1: Create Sample Feedback Table

We'll use a sample dataset of customer feedback that covers products, delivery, customer support, and other areas. Let's create a table in Snowflake to store this data. Here is the SQL for creating the required table to hold customer feedback.

SQL
CREATE OR REPLACE TABLE customer.csat.feedback (
  feedback_id INT,
  feedback_ts DATE,
  feedback_text STRING
);

Now, you can load the data into the table using Snowflake's Snowsight interface. The sample data "customer_feedback_demomart.csv" is available in the GitHub repo. You can download and use it.

Step 2: Use Cortex to Classify Sentiment and Category

Let's read and process each row from the feedback table. Here's the magic. This single query classifies each piece of feedback for both sentiment and category:

SQL
SELECT
  feedback_id,
  feedback_ts,
  feedback_text,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING AS sentiment,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
    feedback_text,
    ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request']
  ):label::STRING AS feedback_category
FROM customer.csat.feedback
LIMIT 10;

I have used the CLASSIFY_TEXT function available within Snowflake Cortex to derive the sentiment based on the feedback_text and to further classify the feedback into a specific category, such as 'Product', 'Customer Service', 'Delivery', and so on. P.S.: You can change the categories based on your business needs.

Step 3: Store Classified Results

Let's store the classified results in a separate table for further reporting and analysis. For this, I have created a table named feedback_classified, as shown below. It includes a processed_timestamp column that records when each row was classified; the initial load and the incremental task in Step 4 both write to and rely on this column.

SQL
CREATE OR REPLACE TABLE customer.csat.feedback_classified (
  feedback_id INT,
  feedback_ts DATE,
  feedback_text STRING,
  sentiment STRING,
  feedback_category STRING,
  processed_timestamp TIMESTAMP_LTZ
);

Initial Bulk Load

Now, let's do an initial bulk classification for all existing data before moving on to the incremental processing of newly arriving data.
SQL
-- Initial Load
INSERT INTO customer.csat.feedback_classified
SELECT
  feedback_id,
  feedback_ts,
  feedback_text,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
    feedback_text,
    ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request']
  ):label::STRING AS feedback_label,
  CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP
FROM customer.csat.feedback;

Once the initial load has completed successfully, let's build a SQL statement that fetches only incremental data based on the processed_timestamp column value. For the incremental load, we need fresh customer feedback. For that, let's insert ten new records into our raw table customer.csat.feedback.

SQL
INSERT INTO customer.csat.feedback (feedback_id, feedback_ts, feedback_text) VALUES
(5001, CURRENT_DATE, 'My DemoMart order was delivered to the wrong address again. Very disappointing.'),
(5002, CURRENT_DATE, 'I love the new packaging DemoMart is using. So eco-friendly!'),
(5003, CURRENT_DATE, 'The delivery speed was slower than promised. Hope this improves.'),
(5004, CURRENT_DATE, 'The product quality is excellent, I’m genuinely impressed with DemoMart.'),
(5005, CURRENT_DATE, 'Customer service helped me cancel and reorder with no issues.'),
(5006, CURRENT_DATE, 'DemoMart’s website was down when I tried to place my order.'),
(5007, CURRENT_DATE, 'Thanks DemoMart for the fast shipping and great support!'),
(5008, CURRENT_DATE, 'Received a damaged item. This is the second time with DemoMart.'),
(5009, CURRENT_DATE, 'DemoMart app is very user-friendly. Shopping is a breeze.'),
(5010, CURRENT_DATE, 'The feature I wanted is missing. Hope DemoMart adds it soon.');

Step 4: Automate Incremental Data Processing With TASK

Now that we have newly added (incremental) fresh data in our raw table, let's create a task to pick up only new data and classify it automatically. We will schedule this task to run every Sunday at midnight UTC.

SQL
-- Creating task
CREATE OR REPLACE TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = 'USING CRON 0 0 * * 0 UTC' -- Run every Sunday at midnight UTC
AS
INSERT INTO customer.csat.feedback_classified
SELECT
  feedback_id,
  feedback_ts,
  feedback_text,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING,
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
    feedback_text,
    ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request']
  ):label::STRING AS feedback_label,
  CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP
FROM customer.csat.feedback
WHERE feedback_ts > (SELECT COALESCE(MAX(PROCESSED_TIMESTAMP), '1900-01-01') FROM CUSTOMER.CSAT.FEEDBACK_CLASSIFIED);

This will automatically run every Sunday at midnight UTC, process any newly arrived customer feedback, and classify it.
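One operational note the steps above don't mention: in Snowflake, a newly created task starts out suspended, so the schedule will not fire until the task is resumed (and the role resuming it needs the EXECUTE TASK privilege). A minimal follow-up statement:

SQL
-- Tasks are created in a suspended state; resume the task so the weekly schedule takes effect
ALTER TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED RESUME;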
Step 5: Visualize Insights You can now build dashboards in Snowsight to see weekly trends using a simple query like this: SQL SELECT feedback_category, sentiment, COUNT(*) AS total FROM customer.csat.feedback_classified GROUP BY feedback_category, sentiment ORDER BY total DESC; Conclusion With just a few lines of SQL, you: Ingested raw feedback into a Snowflake table.Used Snowflake Cortex to classify customer feedback and derive sentiment and feedback categoriesAutomated the process to run weeklyBuilt insights into the classified feedback for business users/leadership team to act upon by category and sentiment This approach is ideal for support teams, product teams, and leadership, as it allows them to continuously monitor customer experience without building or maintaining ML infrastructure. GitHub I have created a GitHub page with all the code and sample data. You can access it freely. The whole dataset generator and SQL scripts are available on GitHub.
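As a small extension of the Step 5 query, the weekly task cadence suggests bucketing results by week as well. The variant below is a sketch, not part of the original walkthrough, using Snowflake's DATE_TRUNC to group feedback by calendar week:

SQL
SELECT
  DATE_TRUNC('week', feedback_ts) AS feedback_week,
  feedback_category,
  sentiment,
  COUNT(*) AS total
FROM customer.csat.feedback_classified
GROUP BY DATE_TRUNC('week', feedback_ts), feedback_category, sentiment
ORDER BY feedback_week, total DESC;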
In Terraform, you will often need to convert a list to a string when passing values to configurations that require a string format, such as resource names, cloud instance metadata, or labels. Terraform uses HCL (HashiCorp Configuration Language), so handling lists requires functions like join() or format(), depending on the context. How to Convert a List to a String in Terraform The join() function is the most effective way to convert a list into a string in Terraform. This concatenates list elements using a specified delimiter, making it especially useful when formatting data for use in resource names, cloud tags, or dynamically generated scripts. The join(", ", var.list_variable) function, where list_variable is the name of your list variable, merges the list elements with ", " as the separator. Here’s a simple example: Shell variable "tags" { default = ["dev", "staging", "prod"] } output "tag_list" { value = join(", ", var.tags) } The output would be: Shell "dev, staging, prod" Example 1: Formatting a Command-Line Alias for Multiple Commands In DevOps and development workflows, it’s common to run multiple commands sequentially, such as updating repositories, installing dependencies, and deploying infrastructure. Using Terraform, you can dynamically generate a shell alias that combines these commands into a single, easy-to-use shortcut. Shell variable "commands" { default = ["git pull", "npm install", "terraform apply -auto-approve"] } output "alias_command" { value = "alias deploy='${join(" && ", var.commands)}'" } Output: Shell "alias deploy='git pull && npm install && terraform apply -auto-approve'" Example 2: Creating an AWS Security Group Description Imagine you need to generate a security group rule description listing allowed ports dynamically: Shell variable "allowed_ports" { default = [22, 80, 443] } resource "aws_security_group" "example" { name = "example_sg" description = "Allowed ports: ${join(", ", [for p in var.allowed_ports : tostring(p)])}" dynamic "ingress" { for_each = var.allowed_ports content { from_port = ingress.value to_port = ingress.value protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } } } The join() function, combined with a list comprehension, generates a dynamic description like "Allowed ports: 22, 80, 443". This ensures the security group documentation remains in sync with the actual rules. Alternative Methods For most use cases, the join() function is the best choice for converting a list into a string in Terraform, but the format() and jsonencode() functions can also be useful in specific scenarios. 1. Using format() for Custom Formatting The format() function helps control the output structure while joining list items. It does not directly convert lists to strings, but it can be used in combination with join() to achieve custom formatting. Shell variable "ports" { default = [22, 80, 443] } output "formatted_ports" { value = format("Allowed ports: %s", join(" | ", var.ports)) } Output: Shell "Allowed ports: 22 | 80 | 443" 2. Using jsonencode() for JSON Output When passing structured data to APIs or Terraform modules, you can use the jsonencode() function, which converts a list into a JSON-formatted string. Shell variable "tags" { default = ["dev", "staging", "prod"] } output "json_encoded" { value = jsonencode(var.tags) } Output: Shell "["dev", "staging", "prod"]" Unlike join(), this format retains the structured array representation, which is useful for JSON-based configurations. 
Creating a Literal String Representation in Terraform Sometimes you need to convert a list into a literal string representation, meaning the output should preserve the exact structure as a string (e.g., including brackets, quotes, and commas like a JSON array). This is useful when passing data to APIs, logging structured information, or generating configuration files. For most cases, jsonencode() is the best option due to its structured formatting and reliability in API-related use cases. However, if you need a simple comma-separated string without additional formatting, join() is the better choice. Common Scenarios for List-to-String Conversion in Terraform Converting a list to a string in Terraform is useful in multiple scenarios where Terraform requires string values instead of lists. Here are some common use cases: Naming resources dynamically: When creating resources with names that incorporate multiple dynamic elements, such as environment, application name, and region, these components are often stored as a list for modularity. Converting them into a single string allows for consistent and descriptive naming conventions that comply with provider or organizational naming standards.Tagging infrastructure with meaningful identifiers: Tags are often key-value pairs where the value needs to be a string. If you’re tagging resources based on a list of attributes (like team names, cost centers, or project phases), converting the list into a single delimited string ensures compatibility with tagging schemas and improves downstream usability in cost analysis or inventory tools.Improving documentation via descriptions in security rules: Security groups, firewall rules, and IAM policies sometimes allow for free-form text descriptions. Providing a readable summary of a rule’s purpose, derived from a list of source services or intended users, can help operators quickly understand the intent behind the configuration without digging into implementation details.Passing variables to scripts (e.g., user_data in EC2 instances): When injecting dynamic values into startup scripts or configuration files (such as a shell script passed via user_data), you often need to convert structured data like lists into strings. This ensures the script interprets the input correctly, particularly when using loops or configuration variables derived from Terraform resources.Logging and monitoring, ensuring human-readable outputs: Terraform output values are often used for diagnostics or integration with logging/monitoring systems. Presenting a list as a human-readable string improves clarity in logs or dashboards, making it easier to audit deployments and troubleshoot issues by conveying aggregated information in a concise format. Key Points Converting lists to strings in Terraform is crucial for dynamically naming resources, structuring security group descriptions, formatting user data scripts, and generating readable logs. Using join() for readable concatenation, format() for creating formatted strings, and jsonencode() for structured output ensures clarity and consistency in Terraform configurations.
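To make the user_data scenario from the list above concrete, here is a small sketch; the AMI ID, instance type, and package list are illustrative placeholders rather than values from this article. The join() call flattens the list into a single string that the startup script can consume:

Shell
variable "packages" {
  default = ["nginx", "git", "unzip"]
}

resource "aws_instance" "example" {
  ami           = "ami-12345678" # placeholder AMI ID
  instance_type = "t3.micro"

  # join() turns the list into one space-separated string for the install command
  user_data = <<-EOF
    #!/bin/bash
    yum install -y ${join(" ", var.packages)}
  EOF
}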
Jenkins is an open-source CI/CD tool written in Java that is used for organising CI/CD pipelines. Currently, at the time of writing this blog, it has 24k stars and 9.1k forks on GitHub. With support for over 2,000 plugins, Jenkins is a well-known tool in the DevOps world. The following are multiple ways to install and set up Jenkins:

Using the Jenkins installer package for Windows
Using Homebrew for macOS
Using the generic Java package (war)
Using Docker
Using Kubernetes
Using apt for Ubuntu/Debian Linux OS

In this tutorial blog, I will cover the step-by-step process to install and set up Jenkins using Docker Compose for an efficient and seamless CI/CD experience. Using Docker with Jenkins allows users to set up a Jenkins instance quickly with minimal manual configuration. It ensures portability and scalability, as with Docker Compose, users can easily set up Jenkins and its required services, such as volumes and networks, using a single YAML file. This allows the users to easily manage and replicate the setup in different environments.

Installing Jenkins Using Docker Compose

Installing Jenkins with Docker Compose makes the setup process simple and efficient, and allows us to define configurations in a single file. This approach removes the complexity and difficulty faced while installing Jenkins manually and ensures easy deployment, portability, and quick scaling.

Prerequisite

As a prerequisite, Docker Desktop needs to be installed and running on the local machine. Docker Compose is included in Docker Desktop along with Docker Engine and Docker CLI.

Jenkins With Docker Compose

Jenkins can be instantly set up by running the following docker-compose command in the terminal:

Plain Text
docker compose up -d

This docker-compose command should be run from the folder where the Docker Compose file is placed. So, let's create a new folder jenkins-demo, and inside this folder, let's create another new folder jenkins-configuration and a new file docker-compose.yaml. The following is the folder structure:

Plain Text
jenkins-demo/
├── jenkins-configuration/
└── docker-compose.yaml

The following content should be added to the docker-compose.yaml file.

YAML
# docker-compose.yaml
version: '3.8'
services:
  jenkins:
    image: jenkins/jenkins:lts
    privileged: true
    user: root
    ports:
      - 8080:8080
      - 50000:50000
    container_name: jenkins
    volumes:
      - /Users/faisalkhatri/jenkins-demo/jenkins-configuration:/var/jenkins_home
      - /var/run/docker.sock:/var/run/docker.sock

Decoding the Docker Compose File

The first line in the file is a comment. The services block starts from the second line, which includes the details of the Jenkins service. The Jenkins service block contains the image, user, and port details. The Jenkins service will run the LTS Jenkins image with root privileges and name the container jenkins. The ports are responsible for mapping container ports to the host machine. The details of these ports are as follows:

8080:8080: This maps port 8080 inside the container to port 8080 on the host machine. It is important, as it is required for accessing the Jenkins web interface. It lets us access Jenkins in the browser by navigating to http://localhost:8080.
50000:50000: This maps port 50000 inside the container to port 50000 on the host machine. It is the JNLP (Java Network Launch Protocol) agent port, which is used for connecting Jenkins build agents to the Jenkins Controller instance.
It is important, as we would be using distributed Jenkins setups, where remote build agents connect to the Jenkins Controller instance. The privileged: true setting will grant the container full access to the host system and allow running the process as the root user on the host machine. This will enable the container to perform the following actions:

Access all the host devices
Modify the system configurations
Mount file systems
Manage network interfaces
Perform admin tasks that a regular container cannot perform

These actions are important, as Jenkins may require permissions to run specific tasks while interacting with the host system, like managing Docker containers, executing system commands, or modifying files outside the container. Any data stored inside the container is lost when the container stops or is removed. To overcome this issue, volumes are used in Docker to persist data beyond the container's lifecycle. We will use Docker volumes to keep the Jenkins data intact, as it is needed every time we start Jenkins. Jenkins data will be stored in the jenkins-configuration folder on the local machine. The /Users/faisalkhatri/jenkins-demo/jenkins-configuration folder on the host is mapped to /var/jenkins_home in the container. Changes made inside the container in the respective folder will be reflected in the folder on the host machine, and vice versa. The line /var/run/docker.sock:/var/run/docker.sock mounts the Docker socket from the host into the container, allowing the Jenkins container to communicate directly with the Docker daemon running on the host machine. This enables Jenkins, which is running inside the container, to manage and run Docker commands on the host, allowing it to build and run other Docker containers as part of CI/CD pipelines.

Installing Jenkins With Docker Compose

Let's run the installation process step by step as follows:

Step 1 — Running Jenkins Setup

Open a terminal, navigate to the jenkins-demo folder, and run the following command:

Plain Text
docker compose up -d

After the command executes successfully, open any browser on your machine and navigate to http://localhost:8080; you should see the Unlock Jenkins screen as shown in the screenshot below:

Step 2 — Finding the Jenkins Password From the Docker Container

The password to unlock Jenkins can be found by navigating to the jenkins container (remember, we gave the name jenkins to the container in the Docker Compose file) and checking its logs by running the following command in the terminal:

Plain Text
docker logs jenkins

Copy the password from the logs, paste it into the Administrator password field on the Unlock Jenkins screen in the browser, and click the Continue button.

Step 3 — Setting up Jenkins

The “Getting Started” screen will be displayed next, which will prompt us to install plugins to set up Jenkins. Select Install suggested plugins and proceed with the installation. It will take some time for the installations to complete.

Step 4 — Creating Jenkins user

After the installation is complete, Jenkins will show the next screen to update the user details. It is recommended to update the user details with a password and click Save and Continue. This username and password can then be used to log in to Jenkins.

Step 5 — Instance Configuration

In this window, we can update the Jenkins accessible link so it can be further used to navigate and run Jenkins. However, we can leave it as it is for now — http://localhost:8080. Click the Save and Finish button to complete the setup.
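If the initial password has already scrolled out of the container logs, it can also be read directly from the Jenkins home volume; this alternative is not part of the original steps, but /var/jenkins_home/secrets/initialAdminPassword is the standard location Jenkins writes it to:

Plain Text
docker exec jenkins cat /var/jenkins_home/secrets/initialAdminPassword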
With this, the Jenkins installation and setup are complete; we are now ready to use Jenkins.

Summary

Docker is the go-to tool for instantly spinning up a Jenkins instance. Using Docker Compose, we installed Jenkins successfully in just five simple steps. Once Jenkins is up and running, we can install the required plugins and set up CI/CD workflows as needed. Using Docker volumes allows us to use Jenkins seamlessly, as it preserves the instance data between restarts (see the quick check below). In the next tutorial, we will learn about installing and setting up Jenkins agents that will help us run Jenkins jobs.
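To verify the volume-backed persistence mentioned in the summary, you can recreate the container and confirm that jobs, plugins, and user settings survive; the data lives in the mapped jenkins-configuration folder, not in the container itself:

Plain Text
docker compose down   # stops and removes the Jenkins container
docker compose up -d  # recreates it; Jenkins starts with the previous configuration intact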
Few concepts have changed how we approach writing code in Java as much as Java Streams. They provide a clean, declarative way to process collections and have thus become a staple in modern Java applications. However, for all their power, Streams present their own challenges, especially where flexibility, composability, and performance optimization are priorities. What if your programming needs more expressive functional paradigms? What if you are looking for laziness and safety beyond what Streams provide and want to explore functional composition at a lower level? In this article, we will be exploring other functional programming techniques you can use in Java that do not involve using the Streams API.

Java Streams: Power and Constraints

Java Streams are built on a simple premise—declaratively process collections of data using a pipeline of transformations. You can map, filter, reduce, and collect data with clean syntax. They eliminate boilerplate and allow chaining operations fluently. However, Streams fall short in some areas:

They are not designed for complex error handling.
They offer limited lazy evaluation capabilities.
They don’t integrate well with asynchronous processing.
They lack persistent and immutable data structures.

One of our fellow DZone members wrote a very good article on "The Power and Limitations of Java Streams," which describes both the advantages and limitations of what you can do using Java Streams. I agree that Streams provide a solid basis for functional programming, but I suggest looking around for something even more powerful. The following alternatives are discussed in the remainder of this article, expanding upon points introduced in the referenced piece.

Vavr: A Functional Java Library

Why Vavr?

Provides persistent and immutable collections (e.g., List, Set, Map)
Includes Try, Either, and Option types for robust error handling
Supports advanced constructs like pattern matching and function composition

Vavr is often referred to as a "Scala-like" library for Java. It brings a strong functional flavor that bridges Java's verbosity and the expressive needs of functional paradigms. Example:

Java
Option<String> name = Option.of("Bodapati");
String result = name
    .map(n -> n.toUpperCase())
    .getOrElse("Anonymous");
System.out.println(result); // Output: BODAPATI

Using Try, developers can encapsulate exceptions functionally without writing try-catch blocks:

Java
Try<Integer> safeDivide = Try.of(() -> 10 / 0);
System.out.println(safeDivide.getOrElse(-1)); // Output: -1

Vavr’s value becomes even more obvious in concurrent and microservice environments where immutability and predictability matter.

Reactor and RxJava: Going Asynchronous

Reactive programming frameworks such as Project Reactor and RxJava provide more sophisticated functional processing streams that go beyond what Java Streams can offer, especially in the context of asynchrony and event-driven systems. Key features:

Backpressure control and lazy evaluation
Asynchronous stream composition
Rich set of operators and lifecycle hooks

Example:

Java
Flux<Integer> numbers = Flux.range(1, 5)
    .map(i -> i * 2)
    .filter(i -> i % 3 == 0);
numbers.subscribe(System.out::println);

Use cases include live data feeds, user interaction streams, and network-bound operations. In the Java ecosystem, Reactor is heavily used in Spring WebFlux, where non-blocking systems are built from the ground up.
RxJava, on the other hand, has been widely adopted in Android development where UI responsiveness and multithreading are critical. Both libraries teach developers to think reactively, replacing imperative patterns with a declarative flow of data. Functional Composition with Java’s Function Interface Even without Streams or third-party libraries, Java offers the Function<T, R> interface that supports method chaining and composition. Example: Java Function<Integer, Integer> multiplyBy2 = x -> x * 2; Function<Integer, Integer> add10 = x -> x + 10; Function<Integer, Integer> combined = multiplyBy2.andThen(add10); System.out.println(combined.apply(5)); // Output: 20 This simple pattern is surprisingly powerful. For example, in validation or transformation pipelines, you can modularize each logic step, test them independently, and chain them without side effects. This promotes clean architecture and easier testing. JEP 406 — Pattern Matching for Switch Pattern matching, introduced in Java 17 as a preview feature, continues to evolve and simplify conditional logic. It allows type-safe extraction and handling of data. Example: Java static String formatter(Object obj) { return switch (obj) { case Integer i -> "Integer: " + i; case String s -> "String: " + s; default -> "Unknown type"; }; } Pattern matching isn’t just syntactic sugar. It introduces a safer, more readable approach to decision trees. It reduces the number of nested conditions, minimizes boilerplate, and enhances clarity when dealing with polymorphic data. Future versions of Java are expected to enhance this capability further with deconstruction patterns and sealed class integration, bringing Java closer to pattern-rich languages like Scala. Recursion and Tail Call Optimization Workarounds Recursion is fundamental in functional programming. However, Java doesn’t optimize tail calls, unlike languages like Haskell or Scala. That means recursive functions can easily overflow the stack. Vavr offers a workaround via trampolines: Java static Trampoline<Integer> factorial(int n, int acc) { return n == 0 ? Trampoline.done(acc) : Trampoline.more(() -> factorial(n - 1, n * acc)); } System.out.println(factorial(5, 1).result()); Trampolining ensures that recursive calls don’t consume additional stack frames. Though slightly verbose, this pattern enables functional recursion in Java safely. Conclusion: More Than Just Streams "The Power and Limitations of Java Streams" offers a good overview of what to expect from Streams, and I like how it starts with a discussion on efficiency and other constraints. So, I believe Java functional programming is more than just Streams. There is a need to adopt libraries like Vavr, frameworks like Reactor/RxJava, composition, pattern matching, and recursion techniques. To keep pace with the evolution of the Java enterprise platform, pursuing hybrid patterns of functional programming allows software architects to create systems that are more expressive, testable, and maintainable. Adopting these tools doesn’t require abandoning Java Streams—it means extending your toolbox. What’s Next? Interested in even more expressive power? Explore JVM-based functional-first languages like Kotlin or Scala. They offer stronger FP constructs, full TCO, and tighter integration with functional idioms. Want to build smarter, more testable, and concurrent-ready Java systems? Time to explore functional programming beyond Streams. The ecosystem is richer than ever—and evolving fast. 
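One addendum to the Vavr section above: the article demonstrates Option and Try but mentions Either without showing it. A minimal sketch of Either-based error handling, using Vavr's io.vavr.control.Either (the parsePort example itself is illustrative, not from the article), could look like this:

Java
// Sketch: Either keeps the failure reason as a value instead of throwing
static Either<String, Integer> parsePort(String raw) {
    try {
        int port = Integer.parseInt(raw);
        return (port > 0 && port < 65536)
                ? Either.right(port)
                : Either.left("Port out of range: " + raw);
    } catch (NumberFormatException e) {
        return Either.left("Not a number: " + raw);
    }
}

String message = parsePort("8080")
    .map(p -> "Listening on " + p)                    // transforms the success (right) value
    .getOrElseGet(error -> "Config error: " + error); // falls back using the failure (left) value
System.out.println(message); // Output: Listening on 8080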
What are your thoughts about functional programming in Java beyond Streams? Let’s talk in the comments!
When I first began working with serverless architectures in 2018, I quickly discovered that my traditional security playbook wasn't going to cut it. The ephemeral nature of functions, the distributed service architecture, and the multiplicity of entry points created a fundamentally different security landscape. After several years of implementing IAM strategies for serverless applications across various industries, I've compiled the approaches that have proven most effective in real-world scenarios. This article shares these insights, focusing on practical Python implementations that address the unique security challenges of serverless environments. The Shifting Security Paradigm in Serverless Traditional security models rely heavily on network perimeters and long-running servers where security agents can monitor activity. Serverless computing dismantles this model through several key characteristics: Execution lifetime measured in milliseconds: Functions that spin up, execute, and terminate in the blink of an eye make traditional agent-based security impracticalHighly distributed components: Instead of monolithic services, serverless apps often comprise dozens or hundreds of small functionsMultiple ingress points: Rather than funneling traffic through a single application gatewayComplex service-to-service communication patterns: With functions frequently calling other servicesPerformance sensitivity: Where security overhead can significantly impact cold start times During a financial services project last year, we learned this lesson the hard way when our initial security approach added nearly 800ms to function cold starts—unacceptable for an API that needed to respond in under 300ms total. Core Components of Effective Serverless IAM Through trial and error across multiple projects, I've found that serverless IAM strategies should address four key areas: 1. User and Service Authentication Authenticating users and services in a serverless context requires approaches optimized for stateless, distributed execution: JWT-based authentication: These stateless tokens align perfectly with the ephemeral nature of serverless functionsOpenID Connect (OIDC): For standardized authentication flows that work across service boundariesAPI keys and client secrets: When service-to-service authentication is requiredFederated identity: Leveraging identity providers to offload authentication complexity 2. Authorization and Access Control After verifying identity, you need robust mechanisms to control access: Role-based access control (RBAC): Assigning permissions based on user rolesAttribute-based access control (ABAC): More dynamic permissions based on user attributes and contextPolicy enforcement points: Strategic locations within your architecture where access decisions occur 3. Function-Level Permissions The functions themselves need careful permission management: Principle of least privilege: Granting only the minimal permissions requiredFunction-specific IAM roles: Approving tailored permissions for each functionResource-based policies: Controlling which identities can invoke your functions 4. 
Secrets Management Secure handling of credentials and sensitive information: Managed secrets services: Cloud-native solutions for storing and accessing secretsEnvironment variables: For injecting configuration at runtimeParameter stores: For less sensitive configuration information Provider-Specific Implementation Patterns Having implemented serverless security across major cloud providers, I've developed practical patterns for each platform. These examples reflect real-world implementations with necessary simplifications for clarity. AWS: Pragmatic IAM Approaches AWS offers several robust options for serverless authentication: Authentication with Amazon Cognito Here's a streamlined example of validating Cognito tokens in a Lambda function, with performance optimizations I've found effective in production: Python # Example: Validating Cognito tokens in a Lambda function import json import os import boto3 import jwt import requests from jwt.algorithms import RSAAlgorithm # Cache of JWKs - crucial for performance jwks_cache = {} def lambda_handler(event, context): try: # Extract token from Authorization header auth_header = event.get('headers', {}).get('Authorization', '') if not auth_header or not auth_header.startswith('Bearer '): return { 'statusCode': 401, 'body': json.dumps({'message': 'Missing or invalid authorization header'}) } token = auth_header.replace('Bearer ', '') # Verify the token decoded_token = verify_token(token) # Process authenticated request with user context user_id = decoded_token.get('sub') user_groups = decoded_token.get('cognito:groups', []) # Your business logic here, using the authenticated user context response_data = process_authorized_request(user_id, user_groups, event) return { 'statusCode': 200, 'body': json.dumps(response_data) } except jwt.ExpiredSignatureError: return { 'statusCode': 401, 'body': json.dumps({'message': 'Token expired'}) } except Exception as e: print(f"Authentication error: {str(e)}") return { 'statusCode': 401, 'body': json.dumps({'message': 'Authentication failed'}) } def verify_token(token): # Decode the token header header = jwt.get_unverified_header(token) kid = header['kid'] # Get the public keys if not cached region = os.environ['AWS_REGION'] user_pool_id = os.environ['USER_POOL_ID'] if not jwks_cache: keys_url = f'https://cognito-idp.{region}.amazonaws.com/{user_pool_id}/.well-known/jwks.json' jwks = requests.get(keys_url).json() jwks_cache.update(jwks) # Find the key that matches the kid in the token key = None for jwk in jwks_cache['keys']: if jwk['kid'] == kid: key = jwk break if not key: raise Exception('Public key not found') # Construct the public key public_key = RSAAlgorithm.from_jwk(json.dumps(key)) # Verify the token payload = jwt.decode( token, public_key, algorithms=['RS256'], audience=os.environ['APP_CLIENT_ID'] ) return payload This pattern has performed well in production, with the key caching strategy reducing token verification time by up to 80% compared to our initial implementation. 
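The authorization section earlier lists RBAC but doesn't show it in code. Building on the user_groups pulled from the Cognito token above, a minimal sketch of a role check inside process_authorized_request might look like this; the group names and permission strings are illustrative placeholders, not from the original article:

Python
# Sketch: role-based authorization using the Cognito groups extracted above.
# Group names and permissions are illustrative placeholders.
ROLE_PERMISSIONS = {
    'admins': {'read_reports', 'write_reports', 'manage_users'},
    'analysts': {'read_reports'},
}

def is_authorized(user_groups, required_permission):
    """Return True if any of the caller's groups grants the required permission."""
    return any(required_permission in ROLE_PERMISSIONS.get(group, set()) for group in user_groups)

def process_authorized_request(user_id, user_groups, event):
    # Deny early if the caller's groups don't include the needed permission
    if not is_authorized(user_groups, 'read_reports'):
        return {'userId': user_id, 'authorized': False, 'message': 'Insufficient permissions'}
    # ... business logic for the authorized request would go here ...
    return {'userId': user_id, 'authorized': True}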
Secrets Management with AWS Secrets Manager After securing several high-compliance applications, I've found this pattern for secrets management to be both secure and performant: Python # Example: Using AWS Secrets Manager in Lambda with caching import json import boto3 import os from botocore.exceptions import ClientError # Cache for secrets to minimize API calls secrets_cache = {} secrets_ttl = {} SECRET_CACHE_TTL = 300 # 5 minutes in seconds def lambda_handler(event, context): try: # Get the secret - using cache if available and not expired api_key = get_secret('payment-api-key') # Use secret for external API call result = call_payment_api(api_key, event.get('body', {})) return { 'statusCode': 200, 'body': json.dumps({'transactionId': result['id']}) } except ClientError as e: print(f"Error retrieving secret: {e}") return { 'statusCode': 500, 'body': json.dumps({'message': 'Internal error'}) } def get_secret(secret_id): import time current_time = int(time.time()) # Return cached secret if valid if secret_id in secrets_cache and secrets_ttl.get(secret_id, 0) > current_time: return secrets_cache[secret_id] # Create a Secrets Manager client secrets_manager = boto3.client('secretsmanager') # Retrieve secret response = secrets_manager.get_secret_value(SecretId=secret_id) # Parse the secret if 'SecretString' in response: secret_data = json.loads(response['SecretString']) # Cache the secret with TTL secrets_cache[secret_id] = secret_data secrets_ttl[secret_id] = current_time + SECRET_CACHE_TTL return secret_data else: raise Exception("Secret is not a string") The caching strategy here has been crucial in high-volume applications, where we've seen up to 95% reduction in Secrets Manager API calls while maintaining reasonable security through controlled TTL. 
Azure Serverless IAM Implementation When working with Azure Functions, I've developed these patterns for robust security: Authentication with Azure Active Directory (Entra ID) For enterprise applications on Azure, this pattern has provided a good balance of security and performance: Python # Example: Validating AAD token in Azure Function import json import os import jwt import requests import azure.functions as func from jwt.algorithms import RSAAlgorithm import logging from datetime import datetime, timedelta # Cache for JWKS with TTL jwks_cache = {} jwks_timestamp = None JWKS_CACHE_TTL = timedelta(hours=24) # Refresh keys daily def main(req: func.HttpRequest) -> func.HttpResponse: try: # Extract token auth_header = req.headers.get('Authorization', '') if not auth_header or not auth_header.startswith('Bearer '): return func.HttpResponse( json.dumps({'message': 'Missing or invalid authorization header'}), mimetype="application/json", status_code=401 ) token = auth_header.replace('Bearer ', '') # Validate token start_time = datetime.now() decoded_token = validate_token(token) validation_time = (datetime.now() - start_time).total_seconds() # Log performance for monitoring logging.info(f"Token validation completed in {validation_time} seconds") # Process authenticated request user_email = decoded_token.get('email', 'unknown') user_name = decoded_token.get('name', 'User') return func.HttpResponse( json.dumps({ 'message': f'Hello, {user_name}', 'email': user_email, 'authenticated': True }), mimetype="application/json", status_code=200 ) except Exception as e: logging.error(f"Authentication error: {str(e)}") return func.HttpResponse( json.dumps({'message': 'Authentication failed'}), mimetype="application/json", status_code=401 ) def validate_token(token): global jwks_cache, jwks_timestamp # Decode without verification to get the kid header = jwt.get_unverified_header(token) kid = header['kid'] # Get tenant ID from environment tenant_id = os.environ['TENANT_ID'] # Get the keys if not cached or expired current_time = datetime.now() if not jwks_cache or not jwks_timestamp or current_time - jwks_timestamp > JWKS_CACHE_TTL: keys_url = f'https://login.microsoftonline.com/{tenant_id}/discovery/v2.0/keys' jwks = requests.get(keys_url).json() jwks_cache = jwks jwks_timestamp = current_time logging.info("JWKS cache refreshed") # Find the key matching the kid key = None for jwk in jwks_cache['keys']: if jwk['kid'] == kid: key = jwk break if not key: raise Exception('Public key not found') # Construct the public key public_key = RSAAlgorithm.from_jwk(json.dumps(key)) # Verify the token client_id = os.environ['CLIENT_ID'] issuer = f'https://login.microsoftonline.com/{tenant_id}/v2.0' payload = jwt.decode( token, public_key, algorithms=['RS256'], audience=client_id, issuer=issuer ) return payload The key implementation detail here is the TTL-based JWKS cache, which has dramatically improved performance while ensuring keys are periodically refreshed. 
Google Cloud Serverless IAM Implementation For Google Cloud Functions, these patterns have proven effective in production environments: Authentication with Firebase This approach works well for consumer-facing applications with Firebase Authentication: Python # Example: Validating Firebase Auth token in Cloud Function import json import firebase_admin from firebase_admin import auth from firebase_admin import credentials import time import logging from functools import wraps # Initialize Firebase Admin SDK (with exception handling for warm instances) try: app = firebase_admin.get_app() except ValueError: cred = credentials.ApplicationDefault() firebase_admin.initialize_app(cred) def require_auth(f): @wraps(f) def decorated_function(request): # Performance tracking start_time = time.time() # Get the ID token auth_header = request.headers.get('Authorization', '') if not auth_header or not auth_header.startswith('Bearer '): return json.dumps({'error': 'Unauthorized - Missing token'}), 401, {'Content-Type': 'application/json'} id_token = auth_header.split('Bearer ')[1] try: # Verify the token decoded_token = auth.verify_id_token(id_token) # Check if token is issued in the past auth_time = decoded_token.get('auth_time', 0) if auth_time > time.time(): return json.dumps({'error': 'Invalid token auth time'}), 401, {'Content-Type': 'application/json'} # Track performance validation_time = time.time() - start_time logging.info(f"Token validation took {validation_time*1000:.2f}ms") # Add user info to request request.user = { 'uid': decoded_token['uid'], 'email': decoded_token.get('email'), 'email_verified': decoded_token.get('email_verified', False), 'auth_time': auth_time } # Continue to the actual function return f(request) except Exception as e: logging.error(f'Error verifying authentication token: {e}') return json.dumps({'error': 'Unauthorized'}), 401, {'Content-Type': 'application/json'} return decorated_function @require_auth def secure_function(request): # The function only executes if auth is successful user = request.user return json.dumps({ 'message': f'Hello, {user["email"]}!', 'userId': user['uid'], 'verified': user['email_verified'] }), 200, {'Content-Type': 'application/json'} The decorator pattern has been particularly valuable, standardizing authentication across dozens of functions in larger projects. Hard-Earned Lessons and Best Practices After several years of implementing serverless IAM in production, I've learned these critical lessons: 1. Implement Least Privilege with Precision One of our earlier projects granted overly broad permissions to Lambda functions. This came back to haunt us when a vulnerability in a dependency was exploited, giving the attacker more access than necessary. Now, we religiously follow function-specific permissions: YAML # AWS SAM example with precise permissions Resources: ProcessPaymentFunction: Type: AWS::Serverless::Function Properties: Handler: payment_handler.lambda_handler Runtime: python3.9 Policies: - DynamoDBReadPolicy: TableName: !Ref CustomerTable - SSMParameterReadPolicy: ParameterName: /prod/payment/api-key - Statement: - Effect: Allow Action: - secretsmanager:GetSecretValue Resource: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:payment/* 2. Implement Smart Caching for Performance Authentication processes can significantly impact cold start times. Our testing showed that a poorly implemented token validation flow could add 300-500ms to function execution time. 
This optimized caching approach has been effective in real-world applications: Python # Example: Smart caching for token validation import json import jwt import time from functools import lru_cache import threading # Thread-safe token cache with TTL class TokenCache: def __init__(self, ttl_seconds=300): self.cache = {} self.lock = threading.RLock() self.ttl = ttl_seconds def get(self, token_hash): with self.lock: cache_item = self.cache.get(token_hash) if not cache_item: return None expiry, user_data = cache_item if time.time() > expiry: # Token cache entry expired del self.cache[token_hash] return None return user_data def set(self, token_hash, user_data): with self.lock: expiry = time.time() + self.ttl self.cache[token_hash] = (expiry, user_data) # Initialize cache token_cache = TokenCache() def get_token_hash(token): # Create a hash of the token for cache key import hashlib return hashlib.sha256(token.encode()).hexdigest() def validate_token(token): # Check cache first token_hash = get_token_hash(token) cached_user = token_cache.get(token_hash) if cached_user: print("Cache hit for token validation") return cached_user print("Cache miss - validating token") # Actual token validation logic here decoded = jwt.decode(token, verify=False) # Placeholder for actual verification # Extract user data user_data = { 'sub': decoded.get('sub'), 'email': decoded.get('email'), 'roles': decoded.get('roles', []) } # Cache the result token_cache.set(token_hash, user_data) return user_data In high-volume applications, intelligent caching like this has improved average response times by 30-40%. 3. Implement Proper Defense in Depth During a security audit of a serverless financial application, we discovered that while our API Gateway had authentication enabled, several functions weren't verifying the JWT token payload. This created a vulnerability where valid but expired tokens could be reused. We now implement defense in depth consistently: Python # Example: Multiple validation layers def process_order(event, context): try: # 1. Verify authentication token (already checked by API Gateway, but verify again) auth_result = verify_token(event) if not auth_result['valid']: return { 'statusCode': 401, 'body': json.dumps({'error': auth_result['error']}) } user = auth_result['user'] # 2. Validate input data structure body = json.loads(event.get('body', '{}')) validation_errors = validate_order_schema(body) if validation_errors: return { 'statusCode': 400, 'body': json.dumps({'errors': validation_errors}) } # 3. Verify business-level authorization auth_result = check_order_authorization(user, body) if not auth_result['authorized']: return { 'statusCode': 403, 'body': json.dumps({'error': auth_result['reason']}) } # 4. Process with proper input sanitization processed_data = sanitize_order_input(body) # 5. Execute with error handling result = create_order(user['id'], processed_data) # 6. Return success with minimal information return { 'statusCode': 200, 'body': json.dumps({'orderId': result['id']}) } except Exception as e: # Log detailed error internally but return generic message log_detailed_error(e) return { 'statusCode': 500, 'body': json.dumps({'error': 'An unexpected error occurred'}) } This approach has proven effective in preventing various attack vectors. 4. Build Secure Service-to-Service Communication One of the more challenging aspects of serverless security is function-to-function communication. 
In a recent project, we implemented this pattern for secure internal communication: Python # Example: Service-to-service communication with JWT import json import jwt import time import os import requests def generate_service_token(service_name, target_service): # Create a signed JWT for service-to-service auth secret = os.environ['SERVICE_JWT_SECRET'] payload = { 'iss': service_name, 'sub': f'service:{service_name}', 'aud': target_service, 'iat': int(time.time()), 'exp': int(time.time() + 60), # Short-lived token (60 seconds) 'scope': 'service' } return jwt.encode(payload, secret, algorithm='HS256') def call_order_service(customer_id, order_data): service_token = generate_service_token('payment-service', 'order-service') # Call the order service with the token response = requests.post( os.environ['ORDER_SERVICE_URL'], json={ 'customerId': customer_id, 'orderDetails': order_data }, headers={ 'Authorization': f'Bearer {service_token}', 'Content-Type': 'application/json' } ) if response.status_code != 200: raise Exception(f"Order service error: {response.text}") return response.json() This pattern ensures that even if one function is compromised, the attacker has limited time to exploit the service token. 5. Implement Comprehensive Security Monitoring After a security incident where unauthorized token usage went undetected for days, we implemented enhanced security monitoring: Python # Example: Enhanced security logging for authentication import json import time import logging from datetime import datetime import traceback def log_auth_event(event_type, user_id, ip_address, success, details=None): """Log authentication events in a standardized format""" log_entry = { 'timestamp': datetime.utcnow().isoformat(), 'event': f'auth:{event_type}', 'userId': user_id, 'ipAddress': ip_address, 'success': success, 'region': os.environ.get('AWS_REGION', 'unknown'), 'functionName': os.environ.get('AWS_LAMBDA_FUNCTION_NAME', 'unknown') } if details: log_entry['details'] = details # Log in JSON format for easy parsing logging.info(json.dumps(log_entry)) def authenticate_user(event): try: # Extract IP from request context ip_address = event.get('requestContext', {}).get('identity', {}).get('sourceIp', 'unknown') # Extract and validate token auth_header = event.get('headers', {}).get('Authorization', '') if not auth_header or not auth_header.startswith('Bearer '): log_auth_event('token_missing', 'anonymous', ip_address, False) return {'authenticated': False, 'error': 'Missing authentication token'} token = auth_header.replace('Bearer ', '') # Track timing for performance monitoring start_time = time.time() try: # Validate token (implementation details omitted) decoded_token = validate_token(token) validation_time = time.time() - start_time user_id = decoded_token.get('sub', 'unknown') # Log successful authentication log_auth_event('login', user_id, ip_address, True, { 'validationTimeMs': round(validation_time * 1000), 'tokenExpiry': datetime.fromtimestamp(decoded_token.get('exp')).isoformat() }) return { 'authenticated': True, 'user': { 'id': user_id, 'email': decoded_token.get('email'), 'roles': decoded_token.get('roles', []) } } except jwt.ExpiredSignatureError: # Extract user ID from expired token for logging try: expired_payload = jwt.decode(token, options={'verify_signature': False}) user_id = expired_payload.get('sub', 'unknown') except: user_id = 'unknown' log_auth_event('token_expired', user_id, ip_address, False) return {'authenticated': False, 'error': 'Authentication token expired'} except 
Exception as e: log_auth_event('token_invalid', 'unknown', ip_address, False, { 'error': str(e), 'tokenFragment': token[:10] + '...' if len(token) > 10 else token }) return {'authenticated': False, 'error': 'Invalid authentication token'} except Exception as e: # Unexpected error in authentication process error_details = { 'error': str(e), 'trace': traceback.format_exc() } log_auth_event('auth_error', 'unknown', 'unknown', False, error_details) return {'authenticated': False, 'error': 'Authentication system error'} This comprehensive logging approach has helped us identify suspicious patterns and potential attacks before they succeed. Advanced Patterns from Production Systems As our serverless systems have matured, we've implemented several advanced patterns that have proven valuable: 1. Fine-Grained Authorization with OPA For a healthcare application with complex authorization requirements, we implemented Open Policy Agent: Python # Example: Using OPA for authorization in AWS Lambda import json import requests import os def check_authorization(user, resource, action): """Check if user is authorized to perform action on resource using OPA""" # Create authorization query auth_query = { 'input': { 'user': { 'id': user['id'], 'roles': user['roles'], 'department': user.get('department'), 'attributes': user.get('attributes', {}) }, 'resource': resource, 'action': action, 'context': { 'environment': os.environ.get('ENVIRONMENT', 'dev'), 'timestamp': datetime.utcnow().isoformat() } } } # Query OPA for authorization decision try: opa_url = os.environ['OPA_URL'] response = requests.post( f"{opa_url}/v1/data/app/authz/allow", json=auth_query, timeout=0.5 # Set reasonable timeout ) # Parse response if response.status_code == 200: result = response.json() is_allowed = result.get('result', False) # Log authorization decision log_auth_event( 'authorization', user['id'], 'N/A', is_allowed, { 'resource': resource.get('type') + ':' + resource.get('id'), 'action': action, 'allowed': is_allowed } ) return { 'authorized': is_allowed, 'reason': None if is_allowed else "Not authorized for this operation" } else: # OPA service error log_auth_event( 'authorization_error', user['id'], 'N/A', False, { 'statusCode': response.status_code, 'response': response.text } ) # Fall back to deny by default return { 'authorized': False, 'reason': "Authorization service error" } except Exception as e: # Error communicating with OPA log_auth_event( 'authorization_error', user['id'], 'N/A', False, {'error': str(e)} ) # Default deny on errors return { 'authorized': False, 'reason': "Authorization service unavailable" } This approach has allowed us to implement complex authorization rules that would be unwieldy to code directly in application logic. 2. 
Multi-Tenant Security Pattern For SaaS applications with multi-tenant requirements, we've developed this pattern: Python # Example: Multi-tenant request handling in AWS Lambda import json import boto3 import os from boto3.dynamodb.conditions import Key def lambda_handler(event, context): try: # Authenticate user auth_result = authenticate_user(event) if not auth_result['authenticated']: return { 'statusCode': 401, 'body': json.dumps({'error': auth_result['error']}) } user = auth_result['user'] # Extract tenant ID from token or path parameter requested_tenant_id = event.get('pathParameters', {}).get('tenantId') user_tenant_id = user.get('tenantId') # Security check: User can only access their assigned tenant if not user.get('isAdmin', False) and requested_tenant_id != user_tenant_id: log_auth_event( 'tenant_access_denied', user['id'], get_source_ip(event), False, { 'requestedTenant': requested_tenant_id, 'userTenant': user_tenant_id } ) return { 'statusCode': 403, 'body': json.dumps({'error': 'Access denied to this tenant'}) } # Create tenant-specific DynamoDB client dynamodb = boto3.resource('dynamodb') table = dynamodb.Table(os.environ['DATA_TABLE']) # Query with tenant isolation to prevent data leakage result = table.query( KeyConditionExpression=Key('tenantId').eq(requested_tenant_id) ) # Audit the data access log_data_access( user['id'], requested_tenant_id, 'query', result['Count'] ) return { 'statusCode': 200, 'body': json.dumps({ 'items': result['Items'], 'count': result['Count'] }) } except Exception as e: # Log the error but return generic message log_error(str(e), event) return { 'statusCode': 500, 'body': json.dumps({'error': 'Internal server error'}) } This pattern has successfully prevented tenant data leakage even in complex multi-tenant systems. Conclusion: Security is a Journey, Not a Destination Implementing IAM in serverless architectures requires a different mindset from traditional application security. Rather than focusing on perimeter security, the emphasis shifts to identity-centric, fine-grained permissions that align with the distributed nature of serverless applications. Through my journey implementing serverless security across various projects, I've found that success depends on several key factors: Designing with least privilege from the start - It's much harder to reduce permissions later than to grant them correctly initiallyBalancing security with performance - Intelligent caching and optimization strategies are essentialBuilding defense in depth - No single security control should be your only line of defenseMonitoring and responding to security events - Comprehensive logging and alerting provides visibilityContinuously adapting security practices - Serverless security is evolving rapidly as the technology matures The serverless paradigm has fundamentally changed how we approach application security. By embracing these changes and implementing the patterns described in this article, you can build serverless applications that are both secure and scalable. Remember that while cloud providers secure the underlying infrastructure, the security of your application logic, authentication flows, and data access patterns remains your responsibility. The shared responsibility model is especially important in serverless architectures where the division of security duties is less clear than in traditional deployments. 
As serverless adoption continues to grow, expect to see more sophisticated security solutions emerge that address the unique challenges of highly distributed, ephemeral computing environments. By implementing the practices outlined here, you'll be well-positioned to leverage these advancements while maintaining strong security fundamentals.
Have you ever found yourself staring at a whiteboard filled with boxes and arrows, wondering whether it will become the next great microservices architecture or the worst distributed monolith that ever existed? Same here, and more often than I would like to admit. Last month, I was talking to one of my cofounder friends, and he mentioned, “We have 47 services!” with pride. Two weeks later, I was going through their docs and found that deploying a simple feature required changes in six of those services. What I thought was a “microservices” architecture turned out to be a monolith split into pieces, with all the complexity of distribution and none of the benefits. Perhaps the most important and most underappreciated step in this architectural style is partitioning the microservices correctly. Get it right, and you gain independent deployability, fault isolation, and faster-moving teams. Get it wrong, and welcome to a distributed system that is far harder to maintain than the monolith you wanted to replace. The Anti-Patterns: How Boundaries Fail The Distributed Monolith: Death by a Thousand Cuts An application made up of multiple interdependent services is a pattern I encounter frequently, known as the “distributed monolith.” It carries the complexity of distribution without the advantages. Here are some indicators that you are running a distributed monolith: a single change forces modifications across several services; taking one service down breaks its neighbors; releases require heavy cross-team coordination. A team I recently worked with had to coordinate deployments across eight services just to add a field to the user profile. That is neither microservices nor a service; that is an unnecessarily intricate web of self-inflicted torture. The Shared Database Trap “But we need the same data!” is the alarm bell that usually precedes this trap. Having many services read and write the same database tables creates hidden coupling that erases the isolation your architecture is supposed to provide. I saw a retail company suffer four hours of downtime on Black Friday because their inventory service changed a database schema that their order service relied on. Nanoservice Growth: Over-Indulging on a Good Thing The failure can also go in the opposite direction, into what I call “nanoservice madness.” You create an endless number of tiny services, and your architecture starts to resemble spaghetti. One gaming company I consulted for had built individual microservices for user achievements, user preferences, user friends, and even separate services for user authentication and the user profile. Each of these services had its own deployment pipeline, database, and even an on-call rotation. The operational overhead was far too much for their small team. A Defined Scenario: Redesigning an E-Commerce Boundary Let me show you an actual scenario from last year.
I was consulting for an e-commerce business that had a typical case of a “distributed monolith.” Their initial architecture was something along the lines of this: YAML # Original architecture with poor boundaries services: product-service: responsibilities: - Management of product catalogs - Inventory management - Rules associated with pricing - Discount calculations database: shared_product_db dependencies: - user-service - order-service order-service: responsibilities: - Management and creation of orders - Processing of payments - Coordination of shipping database: shared_order_db dependencies: - product-service - user-service user-service: responsibilities: - User profiles - Authentication - Authorization - User preferences database: shared_user_db dependencies: - product-service It was obvious what the problems were. Services did have an appropriate amount of responsibilities but were overloaded with circular dependencies and too much knowledge of each other. Changes required coordinating at minimum three separate teams which is a disaster waiting to happen. Their business professionals were with us for a week. By the end of day one, the sticky notes had taken over the walls. The product team was in a heated debate with the inventory folks over who “owned” the concept of a product being “in stock.” It was chaotic, but by the end of the week, we had much clearer boundaries. The end result is as follows: YAML services: catalog-service: responsibilities: - Product information - Categorization - Search database: catalog_db dependencies: [] inventory-service: responsibilities: - Stock tracking - Reservations database: inventory_db dependencies: [] pricing-service: responsibilities: - Base prices - Discounts - Promotions database: pricing_db dependencies: [] order-service: responsibilities: - Order creation - Tracking - History database: order_db dependencies: - catalog-service - inventory-service - pricing-service (all async) payment-service: responsibilities: - Payment processing - Refunds database: payment_db dependencies: [] user-profile-service: responsibilities: - Profile management - Preferences database: user_profile_db dependencies: [] auth-service: responsibilities: - Authentication - Authorization database: auth_db dependencies: [] I understand your initial thoughts, “You went from 3 services to 7? That is increasing complexity, not decreasing it,” right? The thing is, every service now has one, dedicated responsibility. The dependencies are reduced and mostly asynchronous. Each service is fully in control of its data. The outcome was drastic. The average time to implement new features decreased by 60%, while deployment frequency went up by 300%. Their Black Friday sale was the real test for us six months later. Each service scaled on its load patterns rather than overstocking resources like the previous year. While the catalog service required 20 instances, payment only needed five. In the middle of the night, their CTO texted me a beer emoji, the universal sign of a successful launch. A Practical Strategy Finding The Right Boundaries Start With Domain-Driven Design (But Make It Work) As much as Domain-Driven Design (DDD) purists would like to disagree, you don’t need to be a purist to benefit from DDD’s tools for exploring boundaries. Start with Event Storming. This is a workshop approach where you gather domain experts and developers to construct business process models using sticky notes representing domain events, commands, and aggregates. 
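As a rough illustration, the raw output of an Event Storming session can be captured in a lightweight, reviewable format, with events and commands grouped under the candidate context that owns them. The contexts and event names below are hypothetical, loosely modeled on the e-commerce example above, not a prescribed notation.
YAML
# Hypothetical Event Storming capture: events grouped by candidate bounded context
candidate_contexts:
  inventory:
    domain_events:
      - StockLevelChanged
      - StockReserved
      - ReservationExpired
    commands:
      - ReserveStock
    owns_data: [stock_levels, reservations]
  pricing:
    domain_events:
      - PriceChanged
      - PromotionApplied
    commands:
      - ApplyPromotion
    owns_data: [base_prices, promotions]
Clusters of events that consistently change together, and data that only one cluster ever touches, are strong candidates for a single service boundary.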
This type of collaboration often exposes boundaries that already exist naturally in your domain. The “Two Pizza Team” Rule Still Works Amazon still enforces its famous rule that a team should be small enough to be fed by two pizzas, and the same sizing applies to the microservices the team owns. If a service grows so complicated that it takes more than 5-8 engineers to maintain, that's often a sign it should be split. The inverse is also true: if you have 20 services and only 5 engineers, there is a good chance you've become too fine-grained. The Cognitive Load Test An approach I adopted in 2025, which I like to call 'the cognitive load test' for boundary determination, has proven very effective. It's straightforward: can a new team member understand the goals, duties, and functions of the service within a day? If not, the service probably does too much or is too fragmented. Actionable Insights for 2025 and Beyond The Strangler Fig Pattern: Expand Your Horizons When remodeling an existing system, don't try to get the boundaries perfect on the first attempt. Apply the Strangler Fig pattern, which gradually replaces parts of the monolithic architecture with well-structured microservices (it is named after a vine that slowly overtakes its host tree). A healthcare client of mine spent 18 months trying to design the perfect microservices architecture without writing a single line of code. By the end of that drawn-out process, the business requirements had changed so much that the design was obsolete. The Newest Pattern: Boundary Observability A trend I've started noticing in 2025 is what I'm calling “boundary-testing observability”: monitoring cross-service dependencies and data consistency. Tools such as ServiceMesh and BoundaryGuard will notify you when services are getting too chatty or when data duplication is threatening consistency. Concluding Remarks: Boundaries Are a Journey, Not a Destination After assisting countless businesses with adopting microservices, I've learned that domain boundaries shift as business needs change, and that this learning never stops. If there's clear value in doing so, start with coarse-grained services and split them further from there. Boundaries are subjective. There is a fine line between data that should be shared and data that should be duplicated, so be pragmatic. Most importantly, pay attention to the problems and pain your teams face; there is a strong chance they are clues to boundary issues. As my mentor used to say, “the best microservices architecture isn’t the one that looks prettiest on a whiteboard—it’s the one that lets your teams ship features quickly and safely without wanting to quit every other Tuesday.”
Any form of data that we can use to make decisions for writing code, be it requirements, specifications, user stories, and the like, must have certain qualities. In agile development, for example, we have the INVEST qualities. More specifically, a user story must be Independent of all others and Negotiable, i.e., not a specific contract for features. It must be Valuable (or vertical) and Estimable (to a good approximation). It must also be Small to fit within an iteration and Testable (in principle, even if there isn’t a test for it yet). This article goes beyond agile, waterfall, rapid application development, and the like. I will summarise a set of general and foundational qualities as a blueprint for software development. To effectively leverage AI for code generation, while fundamental principles of software requirements remain, their emphasis and application must adapt. This ensures the AI, which lacks human intuition and context, can produce code that is not only functional but also robust, maintainable, and aligned with project constraints. For each fundamental quality, I first explain its purpose. Its usefulness and applicability when code is generated by AI are also discussed. The level of detail that I want to cover this topic necessitates two articles. This article summarizes the "what" we should do. A follow-up article gives an elaborate example about "how" we can do that. Documented Software requirements must be documented and should not just exist in our minds. Documentation may be as lightweight as possible as long as it’s easy to maintain. After all, documentation's purpose is to be a single source of truth. When we say requirements must be "Documented" for human developers, we mean they need to be written down somewhere accessible (wiki, requirements doc, user stories, etc.). If they only exist in someone's head or if they are scattered across chat messages, they probably won't be very effective. This ensures alignment, provides a reference point, and helps with onboarding. While lightweight documentation is often preferred (like user stories), there's usually an implicit understanding that humans can fill in gaps through conversation, experience, and shared context. For AI code generation, the "Documented" quality takes on a more demanding role: The documentation is the primary input: AI-code assistants don't attend planning meetings. They may not ask clarifying questions in real-time (though some tools allow interactive refinement). Currently, they lack the years of contextual experience a human developer has. The written requirement document could be the most direct and often sole instruction set the AI receives for a specific task. Its quality can directly dictate the quality of the generated code.Need for machine interpretability: While we can understand natural language fairly well, even with some ambiguity, AI models perform best with clear, structured, and unambiguous input. This means that the format and precision of the documentation could be a game-changer. Vague language can lead to unpredictable or incorrect assumptions by the AI.Structured formats aid consistency: We could use Gherkin for BDD, specific prompt templates, or even structured data formats like JSON/YAML for configuration-like requirements. Using predefined structures or templates for requirements can be very useful. This way, the necessary details (like error handling, edge cases, and non-functional requirements) are consistently considered and provided to the AI. 
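As a minimal sketch of what such a structured input can look like (the field names and the registration example are illustrative, not a standard), a requirement template might be expressed as:
YAML
# Hypothetical structured requirement template for AI code generation
requirement:
  id: REQ-101
  title: Register a new user account
  description: >
    Create a user account from an email address and password and return
    the new account's identifier.
  inputs:
    email: must be a syntactically valid, unique email address
    password: minimum 12 characters
  happy_path: returns HTTP 201 with a JSON body containing user_id and email
  error_handling:
    duplicate_email: return HTTP 409 with error code EMAIL_ALREADY_REGISTERED
    invalid_input: return HTTP 400 listing the failing fields
  edge_cases:
    - an email differing only by letter case is treated as a duplicate
  non_functional:
    - passwords are stored only as salted hashes
    - p95 latency under 300 ms
  acceptance_criteria:
    - a successful registration is retrievable via GET /users/{user_id}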
This can lead to more predictable and reliable code generation.Single source of truth is paramount: Because the document is the spec fed to the AI, ensuring it's the definitive, up-to-date version is critical. Changes must be reflected in the documentation before regeneration is attempted. Correct We must understand correctly what is required from the system and what is not required. This may seem simple, but how many times have we implemented requirements that were wrong? The Garbage In, Garbage Out (GIGO) rule applies here. For AI-code generation, the importance of correctness can be evaluated if we consider that: AI executes literally: AI code generators are powerful tools for translating instructions (requirements) into code. However, they typically lack deep domain understanding. Currently, they lack the "common sense" to question if the instructions themselves are logical or align with broader business goals. If you feed an AI a requirement that is clearly stated but functionally wrong, the AI will likely generate code that perfectly implements that wrong functionality.Reduced opportunity for implicit correction: We might read a requirement and, based on our experience or understanding of the project context, spot a logical flaw or something that contradicts a known business rule. We might say, "Are you sure this is right? Usually, we do X in this situation." This provides a valuable feedback loop to catch incorrect requirements early. An AI is much less likely to provide this kind of proactive, context-aware sanity check. It usually assumes the requirements it receives are the intended truth.Validation is key: The burden of ensuring correctness falls heavily on the requirement definition and validation process before the AI gets involved. The people defining and reviewing the requirements must be rigorous in confirming that what they are asking for is truly what is needed. Complete This is about having no missing attributes or features. While incomplete requirements are an issue, again, we may infer missing details, ask clarifying questions, or rely on implicit knowledge. This is not always the case, however, even for us humans! Requirements may remain incomplete even after hours of meetings and discussions. In the case of AI-generated code, I've seen AI-assistants going both ways. There are cases where AI assistants generate what is explicitly stated. The resulting gaps led to incomplete features or the AI making potentially incorrect assumptions. There are also cases where the AI-assistant spotted the missing attributes and made suggestions. In any case, for completeness, I think it's still worth being as explicit as we can be. Requirements must detail not just the "happy path" but also: Edge cases: Explicitly list known edge cases.Error handling: Specify how errors should be handled (e.g., specific exceptions, return codes, logging messages).Non-functional requirements (NFRs): Performance targets, security constraints (input validation, output encoding, authentication/authorization points), scalability considerations, and data handling/privacy rules must be stated clearly.Assumptions: Explicitly list any assumptions being made. Unambiguous When we read the requirements, we can all understand the same thing. Ambiguous requirements may lead to misunderstandings, long discussions, and meetings for clarification. They may also lead to rework and bugs. In the worst case, requirements may be interpreted differently and we may develop something different than what was expected. 
In the case of AI assistants, it also looks particularly dangerous. Patterns and rules: AI models process the input text according to the patterns and rules they've learned. They don't inherently "understand" the underlying business goal or possess common sense in the human way. If a requirement can be interpreted in multiple ways, the AI might arbitrarily choose one interpretation based on its training data. This may not be the one intended by the stakeholders.Unpredictable results: Ambiguity leads directly to unpredictability in the generated code. You might run the same ambiguous requirement through the AI (or slightly different versions of it) and get vastly different code implementations. Each time you run the code, the AI-assistant may handle the ambiguity in a different way. Consistent Consistency in requirements means using the same terminology for the same concepts. It means that statements don't contradict each other and maintain a logical flow across related requirements. For human teams, minor inconsistencies can often be resolved through discussion or inferred context. In the worst case, inconsistency can also lead to bugs and rework. However, for AI code generators, consistency is vital for several reasons: Pattern recognition: The AI assistants will try to extract patterns for your requirements. Because LLMs lack an internal semantic model of the system, they won’t infer that ‘Client’ and ‘User’ refer to the same entity unless that connection is made explicit. This can lead to generating separate, potentially redundant code structures, data fields, or logic paths, or fail to correctly link related functionalities.Inability to resolve contradictions: AI models struggle with logical contradictions. If one requirement states "Data must be deleted after 30 days," and another related requirement states "User data must be archived indefinitely," the AI may not ask for clarification or determine the correct business rule. It might implement only one rule (often the last one it processed), try to implement both (leading to errors), or fail unpredictably.Impact on code quality: Consistency in requirements often translates to consistency in the generated code. If requirements consistently use specific naming conventions for features or data elements, the AI is more likely to follow those conventions in the generated code (variables, functions, classes). Inconsistent requirements can lead to inconsistently structured and named code. This makes it harder to understand and maintain.Logical flow: Describing processes or workflows requires a consistent logical sequence. Jumbled or contradictory steps in the requirements can confuse the AI about the intended order of operations. Testable We must have an idea about how to test that the requirements are fulfilled. A requirement is testable if there are practical and objective ways to determine whether the implemented solution meets it. Testability is paramount for both human-generated code and AI-generated code. Our confidence must primarily come from verifying code behavior. Rigorous testing against clear, testable requirements is the primary mechanism to ensure that the code is reliable and fit for purpose. Testable requirements provide the blueprint for verification. Testability calls for smallness, observability, and controllability. A small requirement here implies that it results in a small unit of code under test. This is where decomposability, simplicity, and modularity become important. 
Smaller, well-defined, and simpler units of code with a single responsibility are inherently easier to understand, test comprehensively, and reason about than large, monolithic, and complex components. If an AI generates a massive, tangled function, even if it "works" for the happy path, verifying all its internal logic and edge cases is extremely difficult. You can't be sure what unintended behaviours might lurk within. For smallness, decompose large requirements into smaller, more manageable sub-requirements. Each sub-requirement should ideally describe a single, coherent piece of functionality with its own testable outcomes. Observability is the ease with which you can determine the internal state of a component and its outputs, based on its inputs. This holds true before, during, and after a test execution. Essentially, can you "see" what the software is doing and what its results are? To test, we need to be able to observe behaviour or state. If the effects of an action are purely internal and not visible, testing is difficult. For observability, we need clear and comprehensive logging, exposing relevant state via getters or status endpoints. We need to return detailed and structured error messages, implement event publishing, or use debuggers effectively. This way we can verify intermediate steps, understand the flow of execution, and diagnose why a test might be failing. Describe external behavior: Focus on what the system does that can be seen, not how it does it internally (unless the internal "how" impacts an NFR like performance that needs constraint).Specify outputs: Detail the format, content, and destination of any outputs (UI display, API responses, file generation, database entries, logged messages). Example: Upon successful registration, the system MUST return an HTTP 201 response with a JSON body containing user_id and email.Define state changes: If a state change is an important outcome, specify how that state can be observed. Example: After order submission, the order status MUST be 'PENDING_PAYMENT' and this status MUST be retrievable via the /orders/{orderId}/status endpoint.Require logging for key events: Log key state changes and decision points at INFO level. The system MUST log an audit event with event_type='USER_LOGIN_SUCCESS' and user_id upon successful login. Controllability is the ease with which we can "steer" a component into specific states or conditions. How easily can we provide a component with the necessary inputs (including states of dependencies) to execute a test and isolate it from external factors that are not part of the test? We can achieve this through techniques like dependency injection (DI), designing clear APIs and interfaces, using mock objects or stubs for dependencies, and providing configuration options. This allows us to easily set up specific scenarios, test individual code paths in isolation, and create deterministic tests. Problems Caused by Poor Controllability Hardcoded Dependencies They can force you to test your unit along with its real dependencies. This turns unit tests into slow, potentially unreliable integration tests. You can't easily simulate error conditions from the dependency. Reliance on Global State If a component reads or writes to global variables or singletons, it's hard to isolate tests. One test might alter the global state, causing subsequent tests to fail or behave unpredictably. Resetting the global state between tests can be complex. 
Lack of Clear Input Mechanisms If a component's behaviour is triggered by intricate internal state changes or relies on data from opaque sources rather than clear input parameters, it's difficult to force it into the specific state needed for a particular test. Consequences Slow tests: Tests that need to set up databases, call real APIs, or wait for real timeouts run slowly, discouraging frequent execution.Flaky tests: Tests relying on external systems or shared state can fail intermittently due to factors outside the code under test (e.g., network issues, API rate limits).Difficult to write and maintain: Complex setups and non-deterministic behaviour lead to tests that are hard to write, understand, and debug when they fail. The "Arrange" phase of a test becomes a huge effort. Traceable Traceability in software requirements means being able to follow the life of a requirement both forwards and backwards. You should be able to link a specific requirement to the design elements, code modules, and test cases that implement and verify it. Conversely, looking at a piece of code or a test case, you should be able to trace it back to the requirement(s) it fulfills. Traceability tells us why that code exists and what business rule or functionality it's supposed to implement. Without this link, code can quickly become opaque "magic" that developers are hesitant to touch. Debugging and root cause analysis: When AI-generated code exhibits a bug or unexpected behavior, tracing it back to the source requirement is often the first step. Was the requirement flawed? Did the AI misinterpret a correct requirement? Traceability guides the debugging process.Maintenance and change impact analysis: Requirements inevitably change. If REQ-123 is updated, traceability allows you to quickly identify the specific code sections (potentially AI-generated). Tests associated with REQ-123 will need review, modification, or regeneration. Without traceability, finding all affected code sections becomes a time-consuming and error-prone manual search.Verification and coverage: Traceability helps verify that our requirements have code and tests. You can check if any requirements have been missed or if any generated code doesn't trace back to a valid requirement. Viable A requirement is "Viable" if it can realistically be implemented within the project's given constraints. These constraints typically include available time, budget, personnel skills, existing technology stack, architectural patterns, security policies, industry regulations, performance targets, and the deployment environment. Need for explicit constraints: To ensure that AI assistants generate viable code, the requirements must explicitly state the relevant constraints. These act as guardrails, guiding the AI towards solutions that are not just technically possible but also practical and appropriate for your specific project context. Perhaps your company standardized on using the FastAPI framework for Python microservices. Maybe that direct database access from certain services is forbidden by the security policy. Maybe your deployment target is a low-memory container environment, or maybe a specific external (paid) API suggested by the AI exceeds the project budget. 
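One way to make these guardrails explicit is to attach a constraints block to the requirements handed to the AI. The sketch below is illustrative only, reusing the example constraints mentioned above; the exact fields and values would come from your own standards.
YAML
# Hypothetical constraints block supplied alongside the requirements
constraints:
  language: Python 3.12
  framework: FastAPI            # assumed company standard for microservices
  architecture:
    - services must not access another service's database directly
  deployment:
    target: container with a 256 MB memory limit
  dependencies:
    disallowed:
      - paid third-party APIs not already approved in the project budget
  security:
    - validate and sanitize all external input
    - use the organization's existing identity provider; do not implement a custom auth scheme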
Wrapping Up When writing requirements for AI-generated code, the fundamental principles remain, but the emphasis shifts towards: Extreme explicitness: Cover edge cases, errors, and NFRs meticulously.Unambiguity and precision: Use clear, machine-interpretable language.Constraint definition: Guide the AI by specifying architecture, tech stack, patterns, and NFRs.Testability: Define clear, measurable acceptance criteria. Smallness, observability, and controllability are important.Structured input: Format requirements for optimal AI consumption. In essence, the requirements for AI code generation mean being more deliberate, detailed, and directive. It's about providing the AI with a high-fidelity blueprint that minimizes guesswork. A blueprint that maximizes the probability of generating correct, secure, efficient, and maintainable code. Code that aligns with project goals and technical standards. This involves amplifying the importance of qualities like completeness, unambiguity, and testability. It also involves evolving the interpretation of understandability to suit an AI "developer." Currently, it seems that carefully crafting software requirements can also reduce hallucinations in AI-generated code. However, it's not expected to eliminate hallucinations entirely just through the requirements alone. The quality and structure of the input prompt (including the requirements) significantly influence how prone the AI is to hallucinate details. Hallucinations also stem from model limitations, training data artifacts, and prompt-context boundaries. Such factors are beyond the scope of this article.
"There is no good or bad code. But how you write it… that makes all the difference.” - Master Shifu The sun had just touched the tips of the Valley of Peace. Birds chirped, the wind whispered tales of warriors, and Po—the Dragon Warrior—was busy trying to write some Java code. Yes, you read that right. Master Shifu stood behind him, watching, amused and concerned. Po (scratching his head): “Master Shifu, I’m trying to make this app where each Kung Fu move is chosen based on the enemy. But the code is… bloated. Classes everywhere. If OOP was noodles, this is a full buffet.” Shifu (calmly sipping tea): “Ah, the classic Strategy Pattern. But there’s a better way, Po… a functional way. Let me show you the path.” The Traditional (OOP) Strategy Pattern – Heavy Like Po’s Lunch Po wants to choose a fighting strategy based on his opponent. Java // Strategy Interface interface FightStrategy { void fight(); } // Concrete Strategies class TigerFightStrategy implements FightStrategy { public void fight() { System.out.println("Attack with swift tiger strikes!"); } } class MonkeyFightStrategy implements FightStrategy { public void fight() { System.out.println("Use agile monkey flips!"); } } // Context class Warrior { private FightStrategy strategy; public Warrior(FightStrategy strategy) { this.strategy = strategy; } public void fight() { strategy.fight(); } public void setStrategy(FightStrategy strategy) { this.strategy = strategy; } } Usage Java Warrior po = new Warrior(new TigerFightStrategy()); po.fight(); // Output: Attack with swift tiger strikes! po.setStrategy(new MonkeyFightStrategy()); po.fight(); // Output: Use agile monkey flips! Why This Is a Problem (and Why Po Is Annoyed) Po: “So many files, interfaces, boilerplate! All I want is to change moves easily. This feels like trying to meditate with a noodle cart passing by!” Indeed, OOP Strategy pattern works, but it's verbose, rigid, and unnecessarily class-heavy. It violates the spirit of quick Kung Fu adaptability! Enter Functional Programming – The Way of Inner Simplicity Shifu (nodding): “Po, what if I told you… that functions themselves can be passed around like scrolls of wisdom?” Po: “Whoa... like… JScrolls? Shifu: “No, Po. Java lambdas.” In functional programming, functions are first-class citizens. You don’t need classes to wrap behavior. You can pass behavior directly. Higher-Order Functions are functions that take other functions as parameters or return them. Po, In Java 8 onwards, we can do that easily with the help of lambda. Lambda can wrap the functionality and can be passed to another method as a parameter. Strategy Pattern – The Functional Way in Java Java import java.util.function.Consumer; class Warrior { private Consumer<Void> strategy; public Warrior(Consumer<Void> strategy) { this.strategy = strategy; } public void fight() { strategy.accept(null); } public void setStrategy(Consumer<Void> strategy) { this.strategy = strategy; } } But there’s a better, cleaner way with just lambdas and no class at all. 
Java import java.util.function.Supplier; public class FunctionalStrategy { public static void main(String[] args) { // Each strategy is just a lambda Runnable tigerStyle = () -> System.out.println("Attack with swift tiger strikes!"); Runnable monkeyStyle = () -> System.out.println("Use agile monkey flips!"); Runnable pandaStyle = () -> System.out.println("Roll and belly-bounce!"); // Fighter is a high-order function executor executeStrategy(tigerStyle); executeStrategy(monkeyStyle); executeStrategy(pandaStyle); } static void executeStrategy(Runnable strategy) { strategy.run(); } } Shifu (with a gentle tone): “Po, in the art of code—as in Kung Fu—not every move needs a name, nor every master a title. In our example, we summoned the ancient scroll of Runnable… a humble interface with but one method—run(). In Java 8, we call it Functional Interface. Think of it as a silent warrior—it expects no inputs (parameters), demands no rewards (return type), and yet, performs its duty when called. Each fighting style—tiger, monkey, panda—was not wrapped in robes of classes, but flowed freely as lambdas. And then, we had the executeStrategy() method… a higher-order sensei. It does not fight itself, Po. It simply receives the wisdom of a move—a function—and executes it when the time is right. This… is the way of functional composition. You do not command the move—you invite it. You do not create many paths—you simply choose the next step.” Benefits – As Clear As The Sacred Pool of Tears No extra interfaces or classes Easily switch behaviors at runtimeMore readable, composable, and flexiblePromotes the power of behavior as data. Real-World Example: Choosing Payment Strategy in an App Java Map<String, Runnable> paymentStrategies = Map.of( "CARD", () -> System.out.println("Processing via Credit Card"), "UPI", () -> System.out.println("Processing via UPI"), "CASH", () -> System.out.println("Processing via Cash") ); String chosen = "UPI"; paymentStrategies.get(chosen).run(); // Output: Processing via UPI Po: “This is amazing! It’s like picking dumplings from a basket, but each dumpling is a deadly move.” Shifu: “Exactly. The Strategy was never about the class, Po. It was about choosing the right move at the right moment… effortlessly.” One move = One lambda. The good part is, this lambda only holds the move details—nothing else. So any warrior can master these moves and use them when needed, without having to rely on some bounded object that wrapped the move inside a bulky, boilerplate class. Final Words of Wisdom “The strength of a great developer lies not in how many patterns they know… but in how effortlessly they flow between object thinking and function weaving to craft code that adapts like water, yet strikes like steel.”- Master Shifu, on the Tao of Design Patterns. Coming Up in the Series Code of Shadows: Master Shifu and Po Use Functional Java to Solve the Decorator Pattern MysteryKung Fu Commands: Shifu Teaches Po the Command Pattern with Java Functional Interfaces
PostgreSQL employs sophisticated techniques for data storage and indexing to ensure efficient data management and fast query performance. This guide explores PostgreSQL's mechanisms, showcases practical examples, and includes simulated performance metrics to illustrate the impact of indexing. Data Storage in PostgreSQL Table Structure and TOAST (The Oversized-Attribute Storage Technique) Table Structure: PostgreSQL stores table data in a format known as a heap. Each table's heap contains one or more pages (blocks), where each page is typically 8KB in size; this size can only be changed when compiling PostgreSQL from source. Rows exceeding the page size are handled using TOAST, which compresses and stores oversized attributes in secondary storage. Example: Managing Large Text Data Consider a documents table: SQL CREATE TABLE documents ( doc_id SERIAL PRIMARY KEY, title TEXT, content TEXT ); Scenario: Storing a document with 10MB of content. Without TOAST: The entire document resides in the table, slowing queries. With TOAST: The content is compressed and stored separately, leaving a pointer in the main table. Expected performance improvement: query execution time drops from ~4.2 seconds without TOAST to ~2.1 seconds with TOAST (50% faster). TOAST significantly reduces table size, enhancing read and write efficiency. MVCC (Multi-Version Concurrency Control): Consistency with Row Versions: PostgreSQL uses MVCC to ensure data consistency and support concurrent transactions. Each transaction sees a snapshot of the database, isolating it from others and preventing locks during long queries. Transaction Management with XIDs: Each row version carries Transaction IDs (XIDs) indicating when it was created and when it expired, enabling PostgreSQL to manage concurrency and recovery efficiently. For example, while an inventory item is being edited during sales report generation, MVCC ensures the sales report sees the original data while the update proceeds independently. Indexing in PostgreSQL Indexes in PostgreSQL optimize queries by reducing the need for full-table scans. Below are examples showcasing indexing techniques, their use cases, and expected improvements. B-Tree Index: Default for Range Queries B-tree indexes are efficient for equality and range queries. Example: Product Price Filtering Given a products table: SQL CREATE TABLE products ( product_id SERIAL PRIMARY KEY, name TEXT, price NUMERIC ); Query Without Index SQL SELECT * FROM products WHERE price BETWEEN 50 AND 100; Execution Time: ~8.3 seconds (full scan on 1 million rows). Query With B-Tree Index SQL CREATE INDEX idx_price ON products(price); SELECT * FROM products WHERE price BETWEEN 50 AND 100; Execution Time: ~0.6 seconds (direct row access). Performance improvement: query execution time drops from ~8.3 seconds to ~0.6 seconds (~92.8% faster). Hash Index: Fast Equality Searches Hash indexes are ideal for simple equality searches. Example: User Email Lookup Given a users table: SQL CREATE TABLE users ( user_id SERIAL PRIMARY KEY, name TEXT, email TEXT UNIQUE ); Query Without Index SQL SELECT * FROM users WHERE email = 'user@example.com'; Execution Time: ~4.5 seconds (scans 500,000 rows). Query With Hash Index SQL CREATE INDEX idx_email_hash ON users USING hash(email); SELECT * FROM users WHERE email = 'user@example.com'; Execution Time: ~0.3 seconds.
Performance improvement: query execution time drops from ~4.5 seconds to ~0.3 seconds (~93.3% faster). GiST Index: Handling Spatial Data GiST indexes are designed for complex data types, such as geometric or spatial queries. Example: Store Locator Given a locations table: SQL CREATE TABLE locations ( location_id SERIAL PRIMARY KEY, name TEXT, coordinates GEOMETRY(Point, 4326) ); Query Without Index SQL SELECT * FROM locations WHERE ST_DWithin(coordinates, ST_MakePoint(40.748817, -73.985428), 5000); Execution Time: ~6.7 seconds. Query With GiST Index SQL CREATE INDEX idx_coordinates_gist ON locations USING gist(coordinates); SELECT * FROM locations WHERE ST_DWithin(coordinates, ST_MakePoint(40.748817, -73.985428), 5000); Execution Time: ~1.2 seconds. Performance improvement: query execution time drops from ~6.7 seconds to ~1.2 seconds (~82% faster). GIN Index: Full-Text Search GIN indexes optimize composite or multi-value data types, such as arrays or JSON. Example: Tag Search Given an articles table: SQL CREATE TABLE articles ( article_id SERIAL PRIMARY KEY, title TEXT, tags TEXT[] ); Query Without Index SQL SELECT * FROM articles WHERE tags @> ARRAY['technology']; Execution Time: ~9.4 seconds. Query With GIN Index SQL CREATE INDEX idx_tags_gin ON articles USING gin(tags); SELECT * FROM articles WHERE tags @> ARRAY['technology']; Execution Time: ~0.7 seconds. Performance improvement: query execution time drops from ~9.4 seconds to ~0.7 seconds (~92.6% faster). BRIN Index: Large Sequential Datasets BRIN indexes summarize data blocks, suitable for massive sequential datasets. Example: Log File Queries Given a logs table: SQL CREATE TABLE logs ( log_id SERIAL PRIMARY KEY, log_time TIMESTAMP, message TEXT ); Query Without Index SQL SELECT * FROM logs WHERE log_time BETWEEN '2023-01-01' AND '2023-01-31'; Execution Time: ~45 seconds. Query With BRIN Index SQL CREATE INDEX idx_log_time_brin ON logs USING brin(log_time); SELECT * FROM logs WHERE log_time BETWEEN '2023-01-01' AND '2023-01-31'; Execution Time: ~3.2 seconds. Performance improvement: query execution time drops from ~45 seconds to ~3.2 seconds (~92.9% faster). Performance Considerations Impact on Writes: Indexes can slow down INSERT, UPDATE, or DELETE operations as they require updates to all associated indexes. Balancing the number and type of indexes is crucial. Example: An orders table with multiple indexes may experience slower insert speeds, requiring careful optimization. Index Maintenance: Over time, indexes can fragment and degrade in performance. Regular maintenance with commands like REINDEX can restore efficiency: SQL REINDEX INDEX idx_salary; Using Execution Plans: Analyze queries with EXPLAIN to understand index usage and identify performance bottlenecks: SQL EXPLAIN SELECT * FROM employees WHERE salary BETWEEN 50000 AND 70000; Conclusion PostgreSQL employs effective storage and indexing strategies, such as the TOAST mechanism for handling oversized data and various specialized index types, to significantly enhance query performance. This guide provides examples and performance metrics that showcase the tangible benefits of using indexes in various scenarios. By applying these techniques, database engineers can optimize both read and write operations, leading to robust and scalable database systems.
The Challenge Our organization has maintained a large monolithic codebase in Team Foundation Version Control (TFVC) for over a decade. As development velocity has increased and teams have moved toward agile methodologies, microservices, and cloud-native architectures, the limitations of TFVC have become increasingly apparent. The centralized version control model hinders collaboration, branching, and automation, and our existing classic build and release pipelines in TFS are tightly coupled with legacy tooling that no longer aligns with modern DevOps practices. We have observed significant bottlenecks in: Managing concurrent feature development across teamsImplementing flexible CI/CD workflowsIntegrating with cloud-based infrastructure and toolsAdopting containerized, microservice-oriented deployments To enable a scalable, collaborative, and DevOps-friendly environment, we must migrate our TFVC repositories to Git, which is better suited for distributed development, supports lightweight branching, and integrates seamlessly with modern CI/CD pipelines and platforms like Azure DevOps, GitHub, and Kubernetes. Overview While TFVC has served enterprises for years, its centralized nature and complex branching model make it less suitable for modern development paradigms. In contrast, Git, a distributed version control system, empowers teams to move faster, collaborate more effectively, and align with industry-standard CI/CD practices. In this blog, we will walk through Why should we migrate from TFVC to GitKey challenges during migrationStep-by-step guide using Azure DevOpsAn example use casePost-migration best practices Why Migrate from TFVC to Git? 1. Align With Modern Tooling Git integrates seamlessly with tools like GitHub, GitLab, Bitbucket, Azure DevOps Repos, Kubernetes, Docker, and more. TFVC is limited mostly to older Visual Studio versions and TFS 2. Distributed Workflows Git allows every developer to work independently with a local copy of the entire codebase, enabling offline work, faster operations, and streamlined collaboration. 3. Agile and DevOps Support Git's branching and merging strategies suit agile development and trunk-based development better than TFVC’s heavyweight model. 4. Cloud-Native and Microservices Ready Microservices require isolated, independently deployable repositories. Git supports this easily with its lightweight branching, tagging, and submodule capabilities. Challenges We May Face While Git offers substantial benefits, the migration is not trivial, especially in large enterprises ChallengeDescriptionRepository SizeTFVC projects can be large with extensive historyHistory PreservationWe may want to retain commit history, comments and metadataUser MappingMapping historical TFVC users to Git commit authorsTool FamiliarityDevelopers may need Git trainingPipeline DependenciesExisting TFS build/release pipelines may break post-migration Step-by-Step Migration from TFVC to Git (Using Azure DevOps) Azure DevOps provides native tools to facilitate TFVC-to-Git migrations. Let’s walk through a real-world example: Scenario A legacy monolithic application is stored in a TFVC repository in Azure DevOps Server 2019. The organization wants to modernize development by migrating this codebase to Git and starting to use YAML pipelines. Step 1: Prepare the Environment Git-TFS.NET Framework 4.7.2+Git Step 2: Install Git-TFS Git-TFS is a .NET tool that allows you to clone a TFVC repository and convert it into a Git repository. 
Shell choco install gittfs Or manually download from this link. Step 3: Clone the TFVC Repository With Git History Now we will create a Git repository by fetching history from TFVC: Shell git tfs clone http://your-tfs-url:8080/tfs/DefaultCollection $/YourProject/MainBranch --branches=all Notes: --branches=all will attempt to migrate TFVC branches to Git branches.$ is the root symbol for TFVC paths. We can also limit the history to a certain number of changesets for performance: Shell git tfs clone http://your-tfs-url:8080/tfs/DefaultCollection $/YourProject/MainBranch --changeset=10000 Step 4: Push to Git Repository in Azure DevOps Create a new Git Repo in Azure DevOps: Go to Project > Repos > New Repository.Select Git, name it appropriately. Then, push your migrated Git repo. Shell cd <<YOUR-GIT-REPO>> git remote add origin https://dev.azure.com/your-org/YourProject/_git/YourProject-Git git push -u origin --all Step 5: Validate and Set Up CI/CD Ensure all branches and tags are present.Recreate pipelines using Azure Pipelines (YAML) or any Git-based CI/CD system.Define branch policies, pull request templates, and protection rules. Example Use Case Let's assume we are migrating a health care management system developed in the .NET framework and hosted in TFVC. Before Migration Single monolithic TFVC repository.Classic release pipelines in TFS.Developers struggle with branching and rollback. After Migration Git repository with main, feature/*, and release/* branches.Developers create pull requests for features and hotfixes.Azure YAML pipelines automate builds and deployments Sample Git Branching Strategy Shell main │ ├── feature/add-enrollment-integration ├── feature/optimize-db-calls ├── release/v1.0 Sample Azure DevOps (YAML) CI/CD Pipeline YAML trigger: branches: include: - main - release/* pool: vmImage: 'windows-latest' steps: - task: UseDotNet@2 inputs: packageType: 'sdk' version: '6.x' - script: dotnet build - script: dotnet test - script: dotnet publish -c Release Best Practices Post-Migration Train the team on Git commands and workflows.Automate branching policies and PR reviews.Archive or decommission TFVC repositories to avoid confusion.Use semantic versioning, tagging, and GitHub flow based on team size.Monitor Git performance with tools like Git Large File Storage (LFS) if needed. Understanding Branch Migration With Git-TFS When we run a basic git tfs clone command, Git-TFS only clones the main branch (trunk) and its history.To migrate all branches, we must add --branches=all option: Shell git tfs clone http://tfs-url:8080/tfs/Collection $/YourProject/MainBranch --branches=all This: Identifies TFVC branches as defined in the TFS repositoryAttempts to map them into Git branchesTries to preserve the merge relationships, if any Migrate Selective TFVC Branches to Git 1. Identify TFVC Branch Paths Find the full TFVC paths for the branches we care about: Shell $/YourProject/Main $/YourProject/Dev $/YourProject/Release/1.0 $/YourProject/Release/2.0 We can use Visual Studio or the tf branches command to list these 2. Clone the Main Branch First Shell git tfs clone http://tfs-server:8080/tfs/DefaultCollection $/YourProject/Main --with-branches --branches=none --debug This creates a Git repository tracking the Main TFVC branch and avoids pulling in unwanted branches.--branches=none ensures only this branch is cloned (avoids automatic detection of others).--with-branches initializes Git-TFS to track additional branches later. 3. 
Add Additional Branches Add additional desired branches using: Shell cd YourProject git tfs branch -i $/YourProject/Dev git tfs branch -i $/YourProject/Release/1.0 4. Fetch All the Branches Now download the full changeset history for all the added branches: Shell git tfs fetch 5. Verify Git Branches List the available branches in the local Git repo: Shell git branch -a Expected output: Shell * main remotes/tfs/Dev remotes/tfs/Release-1.0 6. Create Local Branches (Optional) If we want to work locally on these branches: Shell git checkout -b dev remotes/tfs/Dev git checkout -b release/1.0 remotes/tfs/Release-1.0 7. Commit (Only If Modifications Are Made Locally) If any manual changes are done to the working directory, don’t forget to commit: Shell git add . git commit -m "Post-migration cleanup or updates" 8. Push to Git Remote (e.g., Azure DevOps or GitHub) First, add your remote Git repository: Shell git remote add origin https://dev.azure.com/your-org/YourProject/_git/YourProject-Git Then push your branches: Shell git push -u origin main git push -u origin dev git push -u origin release/1.0 Other Methods While Git-TFS is one common approach, there are multiple methods to migrate from TFVC to Git, each with trade-offs depending on your goals: whether you need full history, multiple branches, scalability for large repos, or simplicity for new development. Below are the main options: 1. Shallow Migration (No History, Clean Slate) This is best for: Teams that want a fresh start in GitRewriting architecture to microservicesRepositories with bloated or irrelevant TFVC history Steps Create a Git repo.Export the latest code snapshot from TFVC (e.g, using tf get).Add, commit, and push to Git. Shell tf get $/YourProject/Main git init git add . git commit -m "Initial commit from TFVC snapshot" git remote add origin <GitRepoURL> git push -u origin main Challenges We lose historical commit history.Can't track file-level changes pre-migration. 2. Manual Branch-by-Branch Migration This is best for: Large monoliths broken down into microservicesControlled, phased migration Steps Identify key branches (e.g., main, dev, release).Export them one by one using git-tfs clone.Push each to separate Git repos or branches. Challenges Requires effort to maintain consistency across branchesRisk of missing context between branches Conclusion Migrating from TFVC to Git is not just a source control update — it's a strategic step toward modernization. Git enables speed, agility, and scalability in software development that centralized systems like TFVC cannot match. By adopting Git, you not only align with current development trends but also lay the foundation for DevOps, microservices, and scalable delivery pipelines. Whether you’re handling a single project or thousands of TFVC branches, start small, validate your process, and iterate. With the right tooling and planning, the transition to Git can be smooth and incredibly rewarding.