Deployment Resources

DZone's Featured Deployment Resources

Troubleshooting Kubernetes Pod Crashes: Common Causes and Effective Solutions

By Srinivas Chippagiri

Kubernetes has become the de facto standard for container orchestration, offering scalability, resilience, and ease of deployment. However, managing Kubernetes environments is not without challenges. One common issue faced by administrators and developers is pod crashes. In this article, we will explore the reasons behind pod crashes and outline effective strategies to diagnose and resolve these issues.

Common Causes of Kubernetes Pod Crashes

1. Out-of-Memory (OOM) Errors

Cause
The memory limit set in the container's resource limits is too low. Containers often consume more memory than initially estimated, leading to termination.

Symptoms
Pods are evicted, restarted, or terminated with an OOMKilled error. Memory leaks or inefficient memory usage patterns often exacerbate the problem.

Logs Example

    State: Terminated
    Reason: OOMKilled
    Exit Code: 137

Solution
- Analyze memory usage using metrics-server or Prometheus.
- Increase memory limits in the pod configuration.
- Optimize code or container processes to reduce memory consumption.
- Implement monitoring alerts to detect high memory utilization early.

Code Example for Resource Limits (YAML)

    resources:
      requests:
        memory: "128Mi"
        cpu: "500m"
      limits:
        memory: "256Mi"
        cpu: "1"

2. Readiness and Liveness Probe Failures

Cause
Probes fail due to improper configuration, delayed application startup, or runtime failures in application health checks.

Symptoms
Pods enter the CrashLoopBackOff state or fail health checks. Applications might be unable to respond to requests within the defined probe time limits.

Logs Example

    Liveness probe failed: HTTP probe failed with status code: 500

Solution
- Review probe configurations in the deployment YAML.
- Test endpoint responses manually to verify health status.
- Increase probe timeout and failure thresholds.
- Use startup probes for applications with long initialization times.
Code Example for Probes (YAML)

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10

3. Image Pull Errors

Cause
Incorrect image name, tag, or registry authentication issues. Network connectivity problems may also contribute.

Symptoms
Pods fail to start and remain in the ErrImagePull or ImagePullBackOff state. Failures often occur due to missing or inaccessible images.

Logs Example

    Failed to pull image "myrepo/myimage:latest": Error response from daemon: manifest not found

Solution
- Verify the image name and tag in the deployment file.
- Ensure Docker registry credentials are properly configured using secrets.
- Confirm image availability in the specified repository.
- Pre-pull critical images to nodes to avoid network dependency issues.

Code Example for Image Pull Secrets (YAML)

    imagePullSecrets:
      - name: myregistrykey

4. CrashLoopBackOff Errors

Cause
The application crashes due to bugs, missing dependencies, or misconfiguration in environment variables and secrets.

Symptoms
Repeated restarts and logs showing application errors. These often point to unhandled exceptions or missing runtime configurations.

Logs Example

    Error: Cannot find module 'express'

Solution
- Inspect logs using kubectl logs <pod-name>.
- Check application configurations and dependencies.
- Test locally to identify code or environment-specific issues.
- Implement better exception handling and failover mechanisms.

Code Example for Environment Variables (YAML)

    env:
      - name: NODE_ENV
        value: production
      - name: PORT
        value: "8080"

5. Node Resource Exhaustion

Cause
Nodes run out of CPU, memory, or disk space due to high workloads or improper resource allocation.

Symptoms
Pods are evicted or stuck in the Pending status. Resource exhaustion impacts overall cluster performance and stability.

Logs Example

    0/3 nodes are available: insufficient memory.
Solution
- Monitor node metrics using tools like Grafana or Metrics Server.
- Add more nodes to the cluster or reschedule pods using resource requests and limits.
- Use cluster autoscalers to dynamically adjust capacity based on demand.
- Implement quotas and resource limits to prevent overconsumption.

Effective Troubleshooting Strategies

Analyze Logs and Events
Use kubectl logs <pod-name> and kubectl describe pod <pod-name> to investigate issues.

Inspect Pod and Node Metrics
Integrate monitoring tools like Prometheus, Grafana, or Datadog.

Test Pod Configurations Locally
Validate YAML configurations with kubectl apply --dry-run=client.

Debug Containers
Use ephemeral containers or kubectl exec -it <pod-name> -- /bin/sh to run interactive debugging sessions.

Simulate Failures in Staging
Use tools like Chaos Mesh or LitmusChaos to simulate and analyze crashes in non-production environments.

Conclusion

Pod crashes in Kubernetes are common but manageable with the right diagnostic tools and strategies. By understanding the root causes and implementing the solutions outlined above, teams can maintain high availability and minimize downtime. Regular monitoring, testing, and refining configurations are key to avoiding these issues in the future.
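The OOMKilled log in section 1 reports Exit Code: 137. By the usual Linux convention, an exit code above 128 means the process was terminated by a signal, and the signal number is the exit code minus 128, so 137 corresponds to signal 9 (SIGKILL). The following is a minimal sketch of that arithmetic; the class name ExitCodes and its methods are hypothetical helpers written for illustration, not part of any Kubernetes client library, and the signal table is deliberately incomplete.

```java
import java.util.Map;

/** Illustrative helper: decodes container exit codes such as those shown by kubectl describe pod. */
final class ExitCodes {
    // A few common Linux signal numbers; this table is not exhaustive.
    private static final Map<Integer, String> SIGNALS = Map.of(
            2, "SIGINT",
            6, "SIGABRT",
            9, "SIGKILL",
            11, "SIGSEGV",
            15, "SIGTERM");

    private ExitCodes() {}

    /** Returns the signal number for exit codes in the range 129..255, or -1 otherwise. */
    static int signalOf(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    /** Produces a short human-readable description of an exit code. */
    static String describe(int exitCode) {
        if (exitCode == 0) {
            return "exited normally";
        }
        int signal = signalOf(exitCode);
        if (signal < 0) {
            return "application error (exit code " + exitCode + ")";
        }
        return "killed by signal " + signal + " (" + SIGNALS.getOrDefault(signal, "unknown") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(137)); // killed by signal 9 (SIGKILL)
        System.out.println(describe(1));   // application error (exit code 1)
    }
}
```

Exit code 137 on its own only says the container received SIGKILL; it is the Reason: OOMKilled field that attributes the kill to the memory limit. An exit code such as 1, by contrast, usually points to an error raised by the application itself, as in the CrashLoopBackOff example.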

Effective Exception Handling in Microservices Integration

By RENJITH RAMACHANDRAN

Microservices architecture offers benefits such as scalability, agility, and maintainability, making it ideal for building robust applications. Spring Boot, as the preferred framework for developing microservices, provides various mechanisms to simplify integration with different systems. The modules offered by the Spring framework abstract much of the complexity, allowing developers to integrate seamlessly with external systems.

Integration types may vary depending on the system, including API integration, messaging system integration, or database connectivity. Each system requires specific error-handling mechanisms. Regardless of the integration type, the API layer should not directly expose errors returned by the integrated systems, so that responses stay consistent and user-friendly.

Error Handling in a Sample Spring Boot Application

Below is an example of a Spring Boot application with a /register API call for user registration. This API demonstrates integration with a database to save user details, an internal messaging system to post messages, and an external API.
Code Snippet 1: Java

    @PostMapping("/register")
    public ResponseEntity<String> registerUser(@RequestBody User user) {
        userRegistrationService.registerUser(user);
        return new ResponseEntity<>("User registered successfully", HttpStatus.CREATED);
    }

Code Snippet 2: Java

    public void registerUser(User user) {
        saveUserEntity(user);
        registerEvents(user);
        invokeLoginApi(user);
    }

    public ResponseEntity<String> invokeLoginApi(User user) {
        LoginDTO loginDTO = new LoginDTO();
        loginDTO.setUsername(user.getUsername());
        loginDTO.setPassword(user.getPassword());
        String url = "http://localhost:8080/api/auth/login";
        HttpHeaders headers = new HttpHeaders();
        headers.set("Content-Type", "application/json");
        HttpEntity<LoginDTO> request = new HttpEntity<>(loginDTO, headers);
        return restTemplate.exchange(url, HttpMethod.POST, request, String.class);
    }

    private UserEntity saveUserEntity(User user) {
        UserEntity userEntity = new UserEntity();
        userEntity.setUsername(user.getUsername());
        userEntity.setFirstName(user.getFirstName());
        userEntity.setLastName(user.getLastName());
        userEntity.setEmail(user.getEmail());
        userRegistrationRepository.save(userEntity);
        LoginDTO loginDTO = saveLoginDTO(user);
        return userEntity;
    }

    private User registerEvents(User user) {
        UserRegisteredEvent userRegisteredEvent = new UserRegisteredEvent();
        userRegisteredEvent.setEmail(user.getEmail());
        userRegisteredEvent.setFirstName(user.getFirstName());
        userRegisteredEvent.setLastName(user.getLastName());
        applicationEventPublisher.publishEvent(userRegisteredEvent);
        return user;
    }

Code Snippet 1 illustrates the controller code, which uses dependency injection to autowire and call the service layer. The registerUser method in the service layer performs three key operations: saving user information in the database, publishing an event, and invoking the login API to authenticate the user.
If an error occurs during data saving, event publishing, or API invocation, the system will return a generic 500 error, as shown in Figure 1. This error is not informative, making it difficult for the invoking client to understand the root cause. Developers must rely on logs to identify and debug the issue.

Figure 1

Controller Advice

A controller advice can handle these exceptions and return a meaningful error, which invoking clients can use, as shown in Code Snippet 3 and Figure 2.

Code Snippet 3: Java

    @ControllerAdvice
    public class GlobalExceptionHandler {

        @ExceptionHandler(Exception.class)
        public ResponseEntity<String> handleGeneralException(Exception ex) {
            return new ResponseEntity<>("An unexpected error occurred: " + ex.getMessage(),
                    HttpStatus.INTERNAL_SERVER_ERROR);
        }
    }

Figure 2

Each integration layer may encounter different types of errors, and it is crucial to return meaningful information to the invoking client so that appropriate messages can be displayed. Returning a generic error for all scenarios is not a good design practice.

The Approach

To handle errors effectively, custom exceptions should be defined for each integration layer. Exceptions specific to an integration layer should be caught and encapsulated within these custom exception classes. These exceptions can be grouped under a single custom exception or differentiated by introducing specific attributes, enabling the Controller Advice to return more detailed and meaningful error responses for each scenario.

Figure 3 below illustrates the implementation of different custom exception classes, designed to encapsulate exceptions from various integration layers. Each custom exception class extends a base class, DemoException, which itself extends the RuntimeException class. This hierarchical structure ensures a consistent approach to exception handling across all integration layers.
Figure 3

Code Snippet 4: Java

    package com.example.demo.exceptionhandling.exception;

    public class DemoException extends RuntimeException {

        private String errorCode;

        public DemoException(String errorCode) {
            super(errorCode);
            // Store the code so that handlers can read it from the exception.
            this.errorCode = errorCode;
        }

        public DemoException() {
            super();
        }
    }

Code Snippet 5: Java

    public void registerUser(User user) throws DemoException {
        saveUserEntity(user);
        registerEvents(user);
        invokeLoginApi(user);
    }

    public ResponseEntity<String> invokeLoginApi(User user) throws DemoAPIException {
        try {
            LoginDTO loginDTO = new LoginDTO();
            loginDTO.setUsername(user.getUsername());
            loginDTO.setPassword(user.getPassword());
            String url = "http://localhost:8080/api/auth/login";
            HttpHeaders headers = new HttpHeaders();
            headers.set("Content-Type", "application/json");
            HttpEntity<LoginDTO> request = new HttpEntity<>(loginDTO, headers);
            return restTemplate.exchange(url, HttpMethod.POST, request, String.class);
        } catch (Exception e) {
            throw new DemoAPIException("API-001:Error while invoking API");
        }
    }

    private UserEntity saveUserEntity(User user) throws DemoDataException {
        try {
            UserEntity userEntity = new UserEntity();
            userEntity.setUsername(user.getUsername());
            userEntity.setFirstName(user.getFirstName());
            userEntity.setLastName(user.getLastName());
            userEntity.setEmail(user.getEmail());
            userRegistrationRepository.save(userEntity);
            LoginDTO loginDTO = saveLoginDTO(user);
            return userEntity;
        } catch (Exception e) {
            throw new DemoDataException("DATA-001:Error while saving user data");
        }
    }

    private User registerEvents(User user) throws DemoEventException {
        try {
            UserRegisteredEvent userRegisteredEvent = new UserRegisteredEvent();
            userRegisteredEvent.setEmail(user.getEmail());
            userRegisteredEvent.setFirstName(user.getFirstName());
            userRegisteredEvent.setLastName(user.getLastName());
            applicationEventPublisher.publishEvent(userRegisteredEvent);
            return user;
        } catch (Exception e) {
            throw new DemoEventException("EVENT-001:Error while sending the user data");
        }
    }

As illustrated in Code Snippet 5, each method throws exceptions specific to its functionality, allowing the errors to be handled in the ControllerAdvice class. This enables the application to return detailed and specific error responses.

Code Snippet 6 demonstrates the ControllerAdvice code, which handles each exception individually. Figure 4 shows the formatted error response. Unlike the generic error shown in Figure 1, the new error response is more descriptive, enabling the client code to handle it more effectively.

Code Snippet 6: Java

    @ExceptionHandler(Exception.class)
    public ResponseEntity<String> handleGeneralException(Exception ex) {
        return new ResponseEntity<>("An unexpected error occurred: " + ex.getMessage(),
                HttpStatus.INTERNAL_SERVER_ERROR);
    }

    @ExceptionHandler(DemoException.class)
    public ResponseEntity<String> handleDemoException(DemoException ex) {
        return new ResponseEntity<>(ex.getMessage(), HttpStatus.BAD_REQUEST);
    }

    @ExceptionHandler(DemoDataException.class)
    public ResponseEntity<String> handleDemoDataException(DemoDataException ex) {
        return new ResponseEntity<>(ex.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
    }

    @ExceptionHandler(DemoAPIException.class)
    public ResponseEntity<String> handleDemoAPIException(DemoAPIException ex) {
        return new ResponseEntity<>(ex.getMessage(), HttpStatus.SERVICE_UNAVAILABLE);
    }

Figure 4

Conclusion

Proper handling of errors from different integration layers is essential when developing microservices. It provides interfacing applications with better visibility into the errors, allowing them to handle issues appropriately while preventing the exposure of implementation details and code-related information, which could pose security risks. The code for the example above can be found in this link.
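To see how the exception hierarchy and the handler mapping fit together without starting a Spring context, the idea can be sketched in plain Java. This is an illustration under assumptions rather than the article's implementation: the three-argument constructor, getErrorCode(), and the ErrorMapping.statusFor helper are hypothetical additions, and the status codes are written as plain integers that mirror the HttpStatus values used in Code Snippet 6.

```java
/** Plain-Java stand-in for the ControllerAdvice mapping; no Spring dependency. */
final class ErrorMapping {
    private ErrorMapping() {}

    /** Most specific exception type is checked first, then the DemoException base, then everything else. */
    static int statusFor(Throwable ex) {
        if (ex instanceof DemoAPIException) return 503;   // SERVICE_UNAVAILABLE
        if (ex instanceof DemoDataException) return 500;  // INTERNAL_SERVER_ERROR
        if (ex instanceof DemoException) return 400;      // BAD_REQUEST (base handler)
        return 500;                                       // generic fallback
    }

    public static void main(String[] args) {
        DemoException ex = new DemoAPIException("API-001", "Error while invoking API",
                new IllegalStateException("connection refused"));
        System.out.println(statusFor(ex) + " " + ex.getMessage());
    }
}

/** Base exception that keeps the error code as a field and preserves the original cause. */
class DemoException extends RuntimeException {
    private final String errorCode;

    DemoException(String errorCode, String message, Throwable cause) {
        super(errorCode + ":" + message, cause);
        this.errorCode = errorCode;
    }

    String getErrorCode() {
        return errorCode;
    }
}

class DemoDataException extends DemoException {
    DemoDataException(String errorCode, String message, Throwable cause) {
        super(errorCode, message, cause);
    }
}

class DemoEventException extends DemoException {
    DemoEventException(String errorCode, String message, Throwable cause) {
        super(errorCode, message, cause);
    }
}

class DemoAPIException extends DemoException {
    DemoAPIException(String errorCode, String message, Throwable cause) {
        super(errorCode, message, cause);
    }
}
```

Two details are worth noting. Passing the caught exception as the cause keeps the underlying failure available in server logs while the client still sees only the code and message. And because Code Snippet 6 defines no dedicated handler for DemoEventException, such an exception falls through to the DemoException handler, which this sketch reproduces by returning 400 for it.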

The Power of Docker and Cucumber in Automation Testing

By naga Harini Kodey

Strengthening Cloud Security: Privacy-Preserving Techniques for Compliance With Regulations and the NIST Framework

By Dorababu Nadella

How OpenAI’s Downtime Incident Teaches Us to Build More Resilient Systems

By Vasanthi Govindaraj

How to Test PATCH Requests for API Testing With Playwright Java

Automated API testing offers multiple benefits, including speeding up the testing lifecycle and providing faster feedback. It helps in enhancing the efficiency of the APIs and allows teams to deliver new features to the market quickly. There are multiple tools and frameworks available in the market today that offer automation testing of the APIs, including Postman, Rest Assured, SuperTest, etc. The latest entry on this list is the Playwright framework, which offers API and Web Automation Testing.

In this tutorial blog, we will discuss and cover the following points:
- What is a PATCH API request?
- How do you test PATCH API requests in automation testing using Playwright Java?

Getting Started

It is recommended to check out the earlier tutorial blog for details on the prerequisites, setup, and configuration.

Application Under Test

We will be using the free-to-use RESTful e-commerce APIs available over GitHub. This application can be set up using NodeJS or Docker. It offers multiple APIs related to order management functionality that allow creating, retrieving, updating, and deleting orders.

What Is a PATCH Request?

A PATCH request is used for partially updating a resource. It is similar to a PUT request; the difference is that PUT requires the whole request body to be sent in the request, while with PATCH, we can send only the fields that need to be updated. Another difference is that a PUT request is idempotent; that is, making the same request repeatedly leaves the resource in the same state as making it once, whereas a PATCH request is not guaranteed to be idempotent.

The following is an example of updating the order with a PATCH request using the RESTful e-commerce API. The same PATCH API will be further used in this blog to write the automation tests using Playwright Java.

PATCH (/partialUpdateOrder/{id})

This partially updates the order using its Order ID.
This API needs the id, i.e., the order_id, as the path parameter to look up the existing order and partially update it. The partial details of the order must be supplied in JSON format in the request body. Since it is a PATCH request, we just need to send the fields that we need to update; all other details do not need to be included in the request. Additionally, as a security measure, a valid authentication token must be supplied with the PATCH request; otherwise, the request will fail.

The PATCH request will return the updated order details in the response with Status Code 200. If the update fails, one of the following status codes will be returned, depending on the criteria:

Status Code | Criteria
----------- | --------
404         | There is no order for the order_id supplied to update the order
400         | Token authentication fails, the request body is incorrect, or no request body is sent in the request
403         | No authentication is supplied while sending the request

How to Test PATCH APIs Using Playwright Java

Playwright offers the required methods that allow performing API testing seamlessly. Let’s now delve into writing the API automation tests for PATCH API requests using Playwright Java. The PATCH API (/partialUpdateOrder/{id}) will be used for updating the order partially.

Test Scenario: Update Order Using PATCH

- Start the RESTful e-commerce service.
- Use POST requests to create some orders in the system.
- Update the product_name, product_amount, and qty of the order with order_id "1".
- Check that Status Code 200 is returned in the response.
- Check that the order details have been updated correctly.

Test Implementation

To update the order partially, we need to send the request body containing only the fields to update, along with the authorization token. This token ensures that a valid user of the application is updating the order.

1. Generate the Authentication Token

The token can be generated using the POST /auth API endpoint. This API endpoint needs the login credentials to be supplied in the request body.
The valid login credentials are as follows:

Field name | Value
---------- | -----
Username   | admin
Password   | secretPass123

On passing these valid login credentials, the API returns the JWT token in the response with Status Code 200. We will generate the credentials for the token request using the getCredentials() method from the TokenBuilder class, which is available in the testdata package.

Java

    public static TokenData getCredentials() {
        return TokenData.builder()
            .username("admin")
            .password("secretPass123")
            .build();
    }

This getCredentials() method returns a TokenData object containing the username and password fields.

Java

    @Getter
    @Builder
    public class TokenData {
        private String username;
        private String password;
    }

Once the token is generated, it can be used in the PATCH API request for partially updating the order.

2. Generate the Test Data for Updating Order

The next step in updating the order partially is to generate the request body with the required data. As discussed in the earlier POST request tutorial blog, we will add a new method, getPartialUpdatedOrder(), to the existing OrderDataBuilder class that generates the test data at runtime.

Java

    public static OrderData getPartialUpdatedOrder() {
        return OrderData.builder()
            .productName(FAKER.commerce().productName())
            .productAmount(FAKER.number().numberBetween(550, 560))
            .qty(FAKER.number().numberBetween(3, 4))
            .build();
    }

This method sets only three fields, product_name, product_amount, and qty, and uses them to generate a new JSON object that will be passed as the request body to the PATCH API request.

3. Update the Order Using PATCH Request

We have come to the final stage now, where we will be testing the PATCH API request using Playwright Java. Let’s create a new test method testShouldPartialUpdateTheOrderUsingPatch() in the existing HappyPathTests class.
Java

    @Test
    public void testShouldPartialUpdateTheOrderUsingPatch() {
        final APIResponse authResponse = this.request.post("/auth",
            RequestOptions.create().setData(getCredentials()));
        final JSONObject authResponseObject = new JSONObject(authResponse.text());
        final String token = authResponseObject.get("token").toString();

        final OrderData partialUpdatedOrder = getPartialUpdatedOrder();
        final int orderId = 1;

        final APIResponse response = this.request.patch("/partialUpdateOrder/" + orderId,
            RequestOptions.create()
                .setHeader("Authorization", token)
                .setData(partialUpdatedOrder));

        final JSONObject updateOrderResponseObject = new JSONObject(response.text());
        final JSONObject orderObject = updateOrderResponseObject.getJSONObject("order");

        assertEquals(response.status(), 200);
        assertEquals(updateOrderResponseObject.get("message"), "Order updated successfully!");
        assertEquals(orderId, orderObject.get("id"));
        assertEquals(partialUpdatedOrder.getProductAmount(), orderObject.get("product_amount"));
        assertEquals(partialUpdatedOrder.getQty(), orderObject.get("qty"));
    }

This method first hits the Authorization API to generate the token. The response from the Authorization API is stored in the authResponseObject variable, which is then used to extract the value of the token field in the response.

The request body to be sent in the PATCH API is generated and stored in the partialUpdatedOrder object, so that the same object can later be used for validating the response.

Next, the token is set using the setHeader() method, and the request body object is sent using the setData() method. The PATCH API request is sent using the patch() method of Playwright, which partially updates the order.
Response body:

JSON

    {
      "message": "Order updated successfully!",
      "order": {
        "id": 1,
        "user_id": "1",
        "product_id": "1",
        "product_name": "Samsung Galaxy S23",
        "product_amount": 5999,
        "qty": 1,
        "tax_amt": 5.99,
        "total_amt": 505.99
      }
    }

The response received from this PATCH API is stored in the response variable and is used further to validate the response.

The last step is to perform the assertions. The response from the PATCH API returns a JSON object that is stored in the object named updateOrderResponseObject. The message field is available in the main response body, so it is verified by calling the get() method on updateOrderResponseObject, which returns the value of the message field.

The JSON object order received in the response is stored in the object named orderObject, which is used for checking the values of the order details. The partialUpdatedOrder object, which holds the request body that we sent to partially update the order, supplies the expected values, and orderObject supplies the actual values for the assertions.

Test Execution

We will be creating a new testng.xml file (testng-restfulecommerce-partialupdateorder.xml) to execute the tests sequentially, i.e., first calling the POST API test to generate orders and then calling the PATCH API test to partially update the order.

XML

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
    <suite name="Restful ECommerce Test Suite">
        <test name="Testing Happy Path Scenarios of Creating and Updating Orders">
            <classes>
                <class name="io.github.mfaisalkhatri.api.restfulecommerce.HappyPathTests">
                    <methods>
                        <include name="testShouldCreateNewOrders"/>
                        <include name="testShouldPartialUpdateTheOrderUsingPatch"/>
                    </methods>
                </class>
            </classes>
        </test>
    </suite>

The following test execution screenshot from IntelliJ IDE shows that the tests were executed successfully, and a partial update of the order was successful.
Summary

PATCH API requests allow updating a resource partially. They offer flexibility, since only the fields that need to change have to be sent. In this blog, we tested the PATCH API requests using Playwright Java for automation testing.

Testing all the HTTP methods is equally important while performing API testing. We should perform isolated tests for each endpoint as well as end-to-end testing for all the APIs to make sure that all APIs of the application are well integrated with each other and run seamlessly.
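The difference between PUT and PATCH described in the "What Is a PATCH Request?" section can be illustrated with a small in-memory sketch. It is not how the RESTful e-commerce service is implemented; the PartialUpdate class and its put() and patch() methods are hypothetical, and an order is modeled as a simple map whose field names are borrowed from the response body shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory illustration of full replacement (PUT) versus partial update (PATCH). */
final class PartialUpdate {
    private PartialUpdate() {}

    /** PUT semantics: the stored resource is replaced by the request body as a whole. */
    static Map<String, Object> put(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    /** PATCH semantics: only the fields present in the body overwrite the stored values. */
    static Map<String, Object> patch(Map<String, Object> stored, Map<String, Object> body) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(body);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 1);
        stored.put("product_name", "Samsung Galaxy S23");
        stored.put("qty", 1);

        Map<String, Object> body = Map.of("qty", 3);

        // PATCH keeps id and product_name and changes only qty.
        System.out.println("PATCH -> " + patch(stored, body));
        // PUT with the same body drops every field that was not sent.
        System.out.println("PUT   -> " + put(stored, body));
    }
}
```

A merge-style update like this one happens to be idempotent, because applying the same body twice gives the same result. PATCH is not guaranteed to be idempotent in general because a server may define patch operations, such as appending to a list, whose effect accumulates on each call.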

By Faisal Khatri


Deploying LLMs Securely With OWASP Top 10

Generative Artificial Intelligence (GenAI) adoption is picking up pace. According to McKinsey, the rate of implementation has doubled compared to just ten months prior, with 65 percent of respondents saying their companies regularly use GenAI. The promise of disruptive impact to existing businesses — or delivering services into markets in new and more profitable ways — is driving much of this interest. Yet many adopters aren’t aware of the security risks at hand. Earlier this year, the Open Worldwide Application Security Project (OWASP) released a Top 10 for Large Language Model (LLM) applications. Designed to provide hands-on guidance to software developers and security architects, the OWASP Top 10 guide lays out best practices for securely implementing GenAI applications that rely on LLMs. By explicitly naming the most critical vulnerabilities seen in LLMs thus far, prevention becomes a simpler task. The biggest challenge around secure AI development, however, is not just knowing that problems might exist — this is true in any large software deployment. Instead, it is about how to help bridge the gaps that exist in cross-functional teams that span development, security, and line of business teams. Using OWASP Top 10, developers and security professionals can collaboratively map out their responsibilities across the whole AI application, then align with line of business teams on why these controls matter. On top of the technical security perspective, CISOs, too, can get a better overview of potential business risks associated with AI applications and attacks against their infrastructure. These attacks can vary from exploiting “traditional” software flaws that must be remediated to the hijacking and abuse of LLM service accounts that can cost organizations tens of thousands of dollars per day. For the CISO, explaining risk to the board and the business is essential when so many eyes are watching these potentially transformative projects. 
In essence, the pressure to deliver is high and the level of risk is growing; security should be involved from the start.

What Is in the OWASP Top 10 Today?

The OWASP Top 10 covers the following areas:

- Prompt injection: How attackers might manipulate an LLM to carry out unforeseen or risky actions that affect the security of the output
- Insecure output handling: Similar to more traditional security failures like cross-site scripting (XSS), Cross-Site Request Forgery (CSRF), and code execution, which occur when the LLM does not have appropriate security for its back-end systems
- Training data poisoning: When the data set used to train the LLM includes an exploitable vulnerability
- Model denial of service: When an attacker carries out actions that are resource-intensive, leading to either poor performance or significant bills for the victim organization
- Supply chain vulnerabilities: Vulnerabilities that occur in more traditional software components or services and can then be exploited in an attack; the more complex the LLM application, the more likely it will have vulnerabilities to fix over time
- Sensitive information disclosure: Incidents where an LLM may disclose information that should not be publicly available; to prevent this, ensure that training and production data are sanitized before using them
- Insecure plugin design: Potential issues where plugins don’t protect the LLM against unauthorized access or insecure input, which can then lead to issues like remote code execution
- Excessive agency: When the LLM can carry out more actions or has more permissions than it should have, leading to it undertaking tasks that it should not be allowed to carry out
- Over-reliance: Where users rely on the LLM without carrying out appropriate checks on the content; this can lead to hallucinations, potential misinformation, or even legal issues
- Model theft: When attackers access, copy, or steal proprietary LLM models

To implement the appropriate security measures for GenAI applications,
developers should engage with security teams and carry out threat model exercises to better understand the potential risks involved.

Adding Security to GenAI

To deploy GenAI, developers have two choices: they can either adopt a service that fills the gap in their applications, or they can choose specific components and build a complete AI pipeline. For enterprises that want to use their own data to inform their AI applications, running a stack of AI components that equip them to more easily personalize their apps will be an advantage.

These components are typically cloud-native and run on containerized architecture, as they can require large volumes of compute instances and storage to operate. Commonly, AI deployments use an orchestration tool like Kubernetes to manage workloads efficiently. Kubernetes makes it easier to deploy and scale GenAI, but it also introduces more complexity and security risks that organizations must consider.

A containerized AI model running on a cloud platform has a very different set of security concerns from a traditional on-premises deployment or even other cloud-native containerized environments. With so many moving parts involved in the pipeline, and many of those elements being relatively new and developed fast, getting the right security insight in place right from the start is essential.

The first step for security teams is to understand which components are in each AI application, covering the data, application components, and infrastructure together. This should be part of their overall IT asset inventory so they can catalog any existing AI services, tools, and owners in one place and keep that record up to date. This approach can also be used to create a Software Bill of Materials (SBOM) for their AI system that includes all the relevant components.
An AI SBOM is a comprehensive list of all the software components, dependencies, and metadata associated with a GenAI workload, which teams can then use to track any changes that are needed over time. They can also apply this same approach to their AI data sources, based on how sensitive the data those sources contain. For general data (say a product catalog, website pages, or standard support material), information can be rated as public and used by anyone. For more sensitive data that includes intellectual property, personally identifiable information (PII), or other information that must be kept secure, they can assign different levels of security access. This should help teams prioritize security for their AI workloads based on their risk severity level as well as the data that those workloads might use.

Posture Management

While implementing security for AI is essential, having proof of that implementation is also necessary for compliance and reporting. Ensuring that a system complies with the OWASP Top 10 is something that should be regularly affirmed so deployments are kept secure over time. This involves regularly auditing outlined aspects of the technology and checking that common security tasks are being followed, but it also covers the business process and workflow side.

Security and development teams can achieve this by mapping out their organization’s LLM security strategy to the MITRE Adversarial Threat Landscape for Artificial Intelligence Systems (ATLAS) framework. This will help teams look at their GenAI security in context with other application and software security requirements, such as those around their APIs, as well as other security holes that may exist. Whereas the OWASP Top 10 for LLMs provides guidance on where to harden GenAI applications and LLM systems against attack, MITRE ATLAS can be used to understand the signs that attackers might create during their reconnaissance or attack processes.
By looking out for evidence of threat actors’ Tactics, Techniques & Procedures (TTPs), organizations can be more proactive in their approach.

Hardening Systems Involves Thinking Ahead

New systems like GenAI require new security models and new thinking around how to stay ahead of attackers. However, in the rush to meet the business goals set by implementing these new applications, it is all too easy to overlook the role that security should play. Frameworks like the OWASP Top 10 for LLMs help everyone understand the potential risks that exist and apply those guidelines to specific applications.

However, AI workloads don’t run on their own; they may exist in software containers that are orchestrated by Kubernetes and run on the cloud. Kubernetes faces similar problems to GenAI in that it is still a very new technology and the best practices for security are not well-established across the industry. Using OWASP’s guidance for each of these platforms can make it easier to model potential threats and protect each layer of the infrastructure involved.

Clear-cut and auditable security processes can improve compliance reporting, threat mitigation, and workflows around application updates. They also improve the wider approach to deploying applications, ensuring that cloud security controls, processes, and procedures reduce potential exposure to evolving threats.
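
The inventory and data-rating approach described in this article can be sketched as a small data structure. This is only an illustration: the `Sensitivity` levels, the `Component` fields, and the sample entries are assumptions made up for the example, not part of any SBOM standard or of the OWASP or MITRE frameworks.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical rating scale; a higher value means more restricted."""
    PUBLIC = 1       # e.g. product catalog, website pages
    INTERNAL = 2
    RESTRICTED = 3   # e.g. intellectual property, PII


@dataclass
class Component:
    """One entry in a simplified, illustrative AI SBOM."""
    name: str
    kind: str        # "model", "library", "data-source", ...
    version: str
    owner: str
    sensitivity: Sensitivity = Sensitivity.PUBLIC
    dependencies: list[str] = field(default_factory=list)


def review_order(inventory: list[Component]) -> list[Component]:
    """Sort components so the most sensitive ones are reviewed first."""
    return sorted(inventory, key=lambda c: (-c.sensitivity, c.name))


# Made-up sample inventory
inventory = [
    Component("product-catalog", "data-source", "2024-11", "web-team"),
    Component("customer-tickets", "data-source", "2024-11", "support-team",
              Sensitivity.RESTRICTED),
    Component("embedding-service", "library", "1.4.0", "ml-platform",
              Sensitivity.INTERNAL, ["customer-tickets"]),
]

for component in review_order(inventory):
    print(f"{component.sensitivity.name:<10} {component.kind:<12} {component.name}")
```

Keeping owners and sensitivity ratings next to each component is one way to make the inventory usable for prioritization. A real AI SBOM would normally follow an established format and be generated by tooling rather than written by hand.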

By Nigel Douglas

Mainframe to Serverless Migration on AWS: Challenges and Solutions

Companies across the globe spend more than $65 billion each year to maintain their legacy mainframe systems. Moving from mainframes to serverless systems on AWS gives businesses a chance to cut operating costs and benefit from cloud-native architecture. This fundamental change lets companies replace their rigid, monolithic systems with adaptable ones that meet market needs.

AWS serverless technologies offer modern alternatives to traditional mainframe components. Amazon EventBridge and Amazon API Gateway stand out as prime examples. These services simplify infrastructure management. They also deliver better scaling options and lower operating costs.

This piece covers the technical hurdles, strategies, and best practices you need for a successful mainframe-to-AWS serverless move, so your organization can navigate this complex transition with confidence.

Understanding Mainframe Architecture and AWS Serverless Components

Mainframe systems have remained the backbone of enterprise computing since the 1950s. The world's largest banks still depend on these systems, with 96 out of the top 100 using them. About 71 percent of Fortune 500 companies rely on mainframes for their critical operations.

In the traditional mainframe architecture, a single powerful computer serves multiple users through terminal connections. These systems handle both batch and online transaction processing. They use Job Control Language (JCL) for batch operations and let users interact through GUI or 3270 terminal interfaces. Mainframes excel at processing massive I/O volumes. They manage huge data repositories with databases that range from gigabytes to terabytes.

AWS serverless architecture represents a fundamental shift in computing. It offers a complete suite of services that removes infrastructure management worries.
The main AWS serverless components are:

- AWS Lambda: Provides an event-driven compute service that scales automatically
- Amazon API Gateway: Lets you create and manage RESTful APIs
- Amazon EventBridge: Makes serverless event bus implementation easier
- AWS Step Functions: Coordinates complex workflows and state management

The serverless platform shows impressive scalability. AWS Lambda can handle concurrent executions of multiple functions while keeping costs low through a pay-per-use model. AWS has launched many fully managed serverless services over the last several years. These services integrate smoothly with existing AWS services and third-party solutions.

Organizations must assess several critical factors before moving from mainframe to serverless architecture. The AWS Migration Acceleration Program (MAP) for Mainframe provides a structured approach. It offers processes, tools, and services built specifically for cloud migration projects. The program follows three steps: assess readiness, mobilize resources, and migrate workloads.

Data migration needs careful planning because mainframes store data in Direct Access Storage Device (DASD) or Virtual Tape Library (VTL) formats. AWS offers storage options like Amazon S3, Amazon EFS, and Amazon FSx. These alternatives improve scalability and security while delivering high performance.

The move to serverless requires attention to performance optimization. New challenges like cold start latencies can take 5-10 seconds for inactive functions. However, the benefits often outweigh these challenges. Customers report 60 to 90 percent cost savings after moving mainframe workloads to AWS. Automatic scaling and reduced operational overhead make the transition worthwhile.

Technical Migration Challenges

Organizations face major technical hurdles when moving from mainframe to serverless architecture. Studies show that more than 80% of data migration projects fail to achieve their goals. This highlights how complex these changes can be.

Data Migration Complexities

Data migration stands as a critical challenge in mainframe modernization. Legacy systems store massive amounts of data that could be flawed, inconsistent, or fail to meet current industry standards. The task becomes even more complex because mainframe systems use proprietary languages and technologies. This makes adapting data to cloud platforms extremely difficult. Organizations should put these measures in place to tackle these challenges:

- Resilient data management systems with strong backup and recovery protocols
- Step-by-step migration phases with thorough validation testing
- Automated validation tools that check compliance with GDPR and HIPAA

Code Conversion and Refactoring Challenges

Fewer and fewer professionals know mainframe legacy programming languages like COBOL/DB2 and NATURAL/ADABAS. This talent gap leads to higher costs and risks in maintaining legacy systems. Teams must handle complex tasks like flow normalization, code restructuring, and data layer extraction during refactoring.

Large and complex mainframe systems often lack proper documentation, which makes code conversion harder. Teams find it difficult to integrate with modern agile development processes. This affects how quickly organizations can bring products to market and create new solutions.

Performance and Scalability Concerns

Many believe cloud migration offers unlimited scalability. Cloud platforms do offer better scalability than on-premises setups, but they have their limits. Organizations must work hard to maintain performance levels during and after migration, especially with high-volume transaction processing. Teams need to optimize performance by carefully planning resource use and capacity. Well-executed modernization projects can cut infrastructure costs by up to 70%.

Legacy mainframe systems often can't keep up with modern needs. This creates bottlenecks that stop organizations from moving forward.
The COVID-19 pandemic has made these challenges more obvious, especially with remote access issues and unpredictable demand patterns. Organizations now need to break down data silos faster and use data analysis better to stay competitive.

Implementation Strategy and Architecture

A successful mainframe-to-serverless migration needs a well-laid-out plan that tackles both technical and operational aspects. AWS provides complete solutions that help organizations modernize their legacy systems and keep their business running smoothly.

Choosing the Right AWS Services

The AWS ecosystem gives you a strong set of services built specifically for mainframe modernization. The solution typically runs modernized applications inside Docker containers orchestrated by Amazon Elastic Container Service (Amazon ECS), while AWS Secrets Manager and Parameter Store manage environment configuration. Here are the most important AWS services for modernization:

- Amazon Aurora PostgreSQL: Serves as a replacement for mainframe database engines
- Amazon S3: Handles task inputs and outputs
- AWS Step Functions: Manages workflow orchestration
- Amazon EventBridge: Enables real-time event processing
- Amazon API Gateway: Helps with service integration

Breaking Down Monolithic Applications

Moving from monolithic to microservices architecture needs a systematic approach. Organizations should use a two-phase transformation strategy:

1. Technical Stack Transformation
- Convert programs to REST APIs
- Change COBOL programs and JCLs into single executables
- Implement in-memory cache optimization
- Deploy services to chosen servers

2. Business Split Transformation
- Apply Domain-Driven Design principles
- Identify bounded contexts
- Separate business functionalities
- Create independent microservices

Designing Serverless Microservices

Serverless architecture implementation aims to create scalable, maintainable services. AWS Mainframe Modernization service supports both automated refactoring and replatforming patterns.
It delivers cloud-native deployment by converting online and batch COBOL and PL/I applications to Java. This approach has shown remarkable results. One implementation delivered 1,018 transactions per second (equivalent to a 15,200 MIPS IBM mainframe) and reduced annual infrastructure costs from $16 million to $365,000.

The architecture makes use of AWS-managed services and serverless technology. Each microservice stays elastic and reduces system administrator tasks. Application Load Balancers provide encryption in transit and application health checks for HTTP-based services. Network Load Balancers handle other services, such as IBM CICS.

AWS Secrets Manager handles sensitive data such as database credentials, while Parameter Store manages non-sensitive environment configuration such as database endpoints. This separation provides secure and efficient configuration management while maintaining operational flexibility.

Security and Compliance Considerations

Cloud migration security has changed substantially with serverless architectures. The AWS shared responsibility model moves about 43% of compliance requirements to AWS. This allows organizations to concentrate on securing their applications.

Identity and Access Management

AWS Identity and Access Management (IAM) is the cornerstone of security control in serverless environments. Organizations need to set up detailed permissions that follow the principle of least privilege. Users should only get the permissions they need for their specific job functions. IAM offers a complete system for authentication and authorization that includes:

- Multi-factor authentication (MFA) to improve security
- Role-based access control to manage resources
- Programmatic and console-based access management
- Integration with existing identity providers

Data Encryption and Protection

A mainframe-to-serverless migration needs multiple security layers for data protection.
AWS Mainframe Modernization works with AWS Key Management Service (KMS) to encrypt all stored data on the server side. The service creates and manages symmetric encryption keys. This helps organizations meet strict encryption requirements and reduces operational complexity. Security measures protect data in different states:

- TLS 1.2 or higher protocols safeguard data in transit
- AWS KMS-managed keys encrypt data at rest
- AWS Secrets Manager protects application secrets

Regulatory Compliance Requirements

AWS serverless architecture supports various compliance frameworks with built-in controls for major regulatory standards. Organizations can draw on AWS compliance programs certified for:

- SOC (System and Organization Controls)
- PCI DSS (Payment Card Industry Data Security Standard)
- HIPAA (Health Insurance Portability and Accountability Act)
- FedRAMP (Federal Risk and Authorization Management Program)
- ISO (International Organization for Standardization)

Container security needs a different approach than traditional environments, especially in highly regulated industries. Serverless environments change rapidly. This demands automated security controls throughout the software development lifecycle. Traditional security tools don't cope well with the dynamic nature of serverless architectures.

Risk intelligence plays a vital role in container security. Organizations need complete scanning and monitoring capabilities to maintain their security posture. AWS provides integrated security services that enable automated vulnerability scanning, compliance monitoring, and threat detection across serverless infrastructure.

Performance Optimization and Testing

Performance optimization and testing are crucial for successful mainframe-to-serverless migration on AWS. Studies of serverless platform performance have focused on CPU performance, network speed, and memory capacity measurements.

Load Testing and Benchmarking

Testing serverless infrastructure needs a systematic approach to confirm system performance. Artillery Community Edition has become a popular open-source tool to test serverless APIs. It shows median response times of 111ms with a p95 time of 218ms in standard implementations. Organizations can use Serverless Artillery to handle higher-throughput scenarios. It runs the Artillery package on Lambda functions to generate more load.

Performance testing tools show that AWS serverless platforms have decreased tail latency, improved handling of bursty workloads, and improved image fetch speed. The ServerlessBench framework stands out with its detailed performance analysis capabilities.

Monitoring and Observability Setup

Amazon CloudWatch works as the core monitoring solution and gives detailed insights into serverless application performance. Lambda Insights delivers essential metrics such as:

- Invocation rates and duration tracking
- System-level CPU utilization
- Memory usage patterns
- Network performance indicators
- Error count and failure rates

CloudWatch Application Insights uses machine learning to create dashboards that spot potential problems, including metric anomalies and log errors. AWS X-Ray helps developers create service maps with visual representations of tracing results that identify bottlenecks and connection latencies.

Performance Tuning Strategies

You can optimize serverless performance through smart capacity planning and resource allocation. Lambda functions support memory configurations from 128 MB to 10,240 MB. CPU allocation increases proportionally with memory allocation. This scalability lets organizations fine-tune performance based on specific workload needs.
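
Because CPU scales with memory, a larger memory setting can finish faster and still cost less per invocation. The following sketch shows the arithmetic under stated assumptions: the per-GB-second rate and the measured durations are illustrative placeholders, not published benchmarks, so check current AWS pricing and your own measurements.

```python
# Assumed compute rate in USD per GB-second; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667

# Hypothetical measurements: memory size (MB) -> average duration (ms)
measured = {128: 2400.0, 512: 560.0, 1024: 290.0, 2048: 180.0}


def cost_per_invocation(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Compute cost of one invocation: GB allocated x seconds billed x rate."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory must be between 128 and 10,240 MB")
    return (memory_mb / 1024) * (duration_ms / 1000) * price


costs = {mem: cost_per_invocation(mem, dur) for mem, dur in measured.items()}
cheapest = min(costs, key=costs.get)
fastest = min(measured, key=measured.get)

for mem in sorted(costs):
    print(f"{mem:>5} MB  {measured[mem]:>7.1f} ms  ${costs[mem]:.10f}")
print("cheapest:", cheapest, "MB; fastest:", fastest, "MB")
```

With these made-up numbers, 512 MB is the cheapest setting and 2,048 MB the fastest, which shows why the lowest memory setting is not automatically the most economical one.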
Key optimization steps include:

- Function startup time evaluation and optimization
- SDK client initialization outside function handlers
- Implementation of execution environment reuse
- Smart use of local file system caching
- Connection pooling for database operations

The AWS Lambda Power Tuning tool automates the optimization process. It tests different memory configurations systematically to find the most efficient settings for specific use cases. Testing data shows that importing individual service libraries instead of the entire AWS SDK can cut initialization time by up to 125ms.

CloudWatch Container Insights gives real-time visibility into containerized workloads. It offers detailed monitoring at the task, service, and cluster levels. Organizations can maintain optimal performance while managing complex serverless architectures during and after migration from mainframe systems.

Conclusion

AWS's complete suite of services helps organizations plan and execute their mainframe-to-serverless migration carefully. This technological move needs thorough planning. Companies that begin this journey can address complex modernization challenges while keeping their operations stable. Several key aspects lead to successful migration:

- AWS services like Lambda, EventBridge, and API Gateway offer strategic ways to apply changes
- Security frameworks protect data through encryption, access management, and compliance measures
- System optimization techniques ensure strong operations
- Testing methods verify migration success and system reliability

Organizations that switched from mainframe to serverless architecture showed remarkable benefits. Many achieved 90% cost reduction and improved operational efficiency. AWS's serverless platform meets modern enterprise computing's needs through scalability, security, and performance.

Your mainframe modernization success depends on monitoring, optimization, and adaptation to new technologies.
Organizations that embrace this change position themselves to gain agility, reduce costs, and build competitive advantage.

By Sajith Narayanan

Dropwizard vs. Micronaut: Unpacking the Best Framework for Microservices

Microservices architecture has reshaped the way we design and build software, emphasizing scalability, maintainability, and agility. Two frameworks, Dropwizard and Micronaut, have gained prominence in the microservices ecosystem, each offering unique features to simplify and optimize development. In this article, we delve into a detailed comparison to help you determine which framework best suits your needs.

Comparison Overview

Dropwizard and Micronaut differ significantly in their design philosophies and capabilities:

- Dropwizard is a well-established Java framework that emphasizes simplicity and a "batteries-included" philosophy. It bundles popular libraries like Jetty, Jersey, and Jackson to create production-ready RESTful services quickly.
- Micronaut, a modern framework, targets cloud-native and serverless applications. It features compile-time dependency injection, AOT (Ahead-of-Time) compilation, and built-in support for reactive programming and serverless deployments.

Advantages

Dropwizard
- Mature and reliable: Dropwizard has been around for a long time and has a robust, well-documented ecosystem.
- Ease of use: With pre-integrated libraries, setting up a project is quick and straightforward.
- Metrics and monitoring: Dropwizard provides out-of-the-box support for monitoring and performance metrics using the Metrics library.

Micronaut
- Performance: Micronaut’s AOT compilation and compile-time dependency injection reduce startup times and memory usage.
- Cloud-native features: It offers native integrations for AWS, Google Cloud, and other cloud providers, streamlining serverless deployments.
- Reactive programming: Micronaut has first-class support for non-blocking, event-driven architectures, improving scalability and responsiveness.

Challenges

Dropwizard
- Memory consumption: Dropwizard applications can be more memory-intensive and have longer startup times, making them less ideal for serverless use cases.
- Limited reactive support: Reactive programming requires additional libraries and configurations, as it is not natively supported.

Micronaut
- Learning curve: Developers used to traditional frameworks like Spring may find Micronaut’s approach unfamiliar at first.
- Younger ecosystem: Although rapidly evolving, Micronaut’s ecosystem is newer and might not be as stable or extensive as that of Dropwizard.

Use Cases

Dropwizard Use Cases
- Building traditional REST APIs where rapid prototyping is crucial
- Applications requiring robust metrics and monitoring features
- Monolithic or microservices projects where memory usage is not a critical constraint

Micronaut Use Cases
- Cloud-native applications requiring minimal memory and fast startup times
- Serverless deployments on platforms like AWS Lambda, Google Cloud Functions, or Azure Functions
- Reactive microservices designed for scalability and low-latency responses

Practical Examples

Dropwizard Example

Setting up a basic RESTful service in Dropwizard:

Java
// Imports assume Dropwizard 2.x; newer releases relocate some of these packages.
import io.dropwizard.Application;
import io.dropwizard.Configuration;
import io.dropwizard.setup.Environment;
import javax.ws.rs.GET;
import javax.ws.rs.Path;

// HelloWorldApplication.java: application entry point
public class HelloWorldApplication extends Application<Configuration> {
    public static void main(String[] args) throws Exception {
        new HelloWorldApplication().run(args);
    }

    @Override
    public void run(Configuration configuration, Environment environment) {
        // Register the JAX-RS resource with the embedded Jersey container
        environment.jersey().register(new HelloWorldResource());
    }
}

// HelloWorldResource.java: serves GET /hello
@Path("/hello")
public class HelloWorldResource {
    @GET
    public String sayHello() {
        return "Hello, Dropwizard!";
    }
}

Micronaut Example

Creating a similar service in Micronaut:

Java
import io.micronaut.http.annotation.*;

// Serves GET /hello
@Controller("/hello")
public class HelloWorldController {
    @Get
    public String sayHello() {
        return "Hello, Micronaut!";
    }
}

Running an Application

Dropwizard
- Set up a Maven project and include the Dropwizard dependencies.
- Define configuration files for the application.
- Use the java -jar command to run the Dropwizard service.

Micronaut
- Use the Micronaut CLI to create a new project: mn create-app example.app
- Configure any additional cloud or serverless settings if needed.
- Run the application with ./gradlew run or java -jar.

Best Practices

Dropwizard
- Monitor performance: Use the Metrics library to monitor the health and performance of your microservices.
- Keep it simple: Stick to Dropwizard's opinionated configurations for quicker development and maintenance.

Micronaut
- Optimize for cloud: Leverage Micronaut’s cloud integrations for efficient deployment and scaling.
- Use GraalVM: Compile Micronaut applications to native images using GraalVM for even faster startup and lower memory usage.

Conclusion

Both Dropwizard and Micronaut are excellent frameworks for building microservices, but they cater to different needs. Dropwizard is a solid choice for teams seeking a well-integrated, production-ready solution with a mature ecosystem. Micronaut, with its cutting-edge features like AOT compilation and cloud-native support, is ideal for modern, scalable applications.

Choosing the right framework depends on your project's specific requirements, including performance needs, deployment strategies, and team expertise. For traditional microservices with a need for reliability and simplicity, Dropwizard shines. For cloud-native and reactive architectures, Micronaut is the clear winner.

By Nilesh Jain

Launching Pega Web Mashup Forms on a Secure Static Website With AWS S3

I am pleased to share that I have designed and implemented a cost-effective and efficient solution for delivering a workflow-based form directly to end users via a publicly accessible static website. This solution streamlined user access while minimizing infrastructure complexity, ensuring a secure, scalable, and user-centered experience.

Overview

Pega Web Mashup is a powerful feature from Pega that allows businesses to embed Pega forms and applications into existing websites or portals. By using AWS S3, we can host a static website that contains a Pega Web Mashup form, which enables rapid deployment with minimal infrastructure overhead. This tutorial will walk through the steps to deploy a Pega Web Mashup form as a static website using an AWS S3 bucket.

By combining Pega’s web mashup capabilities with the cost-effective and scalable AWS S3 hosting solution, we can efficiently publish interactive forms, surveys, and other Pega-based tools for broader accessibility. Here’s a step-by-step guide on how to accomplish this.

Architecture Diagram

Required Components

- AWS account
- Pega Application Server
- Web Mashup HTML code for a Pega Form
- Amazon S3 bucket
- Amazon Route 53
- Amazon CloudFront

Step-by-Step Guide

Step 1: Prepare the Pega Web Mashup Code

In the Pega application, generate the mashup code by following the steps below:

- Navigate to App Studio > Channels and Interfaces.
- Select the Web Mashup option and add a new channel if one doesn’t already exist.
- Configure the mashup parameters, including the Application Name, Access Group, and Portal.
- Customize other settings, such as the display mode and allowed actions, to fit your requirements.
- Copy the generated HTML snippet. This snippet will be embedded into the HTML file of your static website.
- Ensure that any references to Pega assets (CSS, JS) are correct.
- Add necessary cross-origin headers if your Pega instance is hosted on a separate domain.

HTML
<div data-pega-encrypted='true'
     data-pega-gadgetname='PegaGadget'
     data-pega-action='createNewWork'
     data-pega-applicationname='appname'
     data-pega-threadname='thread1'
     data-pega-channelID='id'
     data-pega-resizetype='stretch'
     data-pega-url='https://demo.pegapp.com/appname'
     data-pega-action-param-parameters='{"","""}'>
</div>

Step 2: Create an S3 Bucket for the Static Website

Set up your S3 bucket to serve as a static website by following the steps below:

- Log in to the AWS Management Console, go to S3, and click Create Bucket. Enter a unique name for the bucket and choose a region.
- Disable Block Public Access (required for public website access) in the bucket's Block Public Access settings and confirm the change.
- In the bucket properties, navigate to Static website hosting.
- Enable static website hosting and enter the file name for the Index Document (e.g., index.html). Save these changes.

Step 3: Set Up and Upload Your HTML File in S3 Bucket

Follow the steps below:

- Create an HTML file named index.html that will host the Pega Web Mashup form.
- Open a text editor and start a new HTML document.
- Paste the Pega Web Mashup code you generated in Step 1 into the body of this HTML file.

HTML
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Pega Web Mashup Form</title>
</head>
<body>
  <h1>Embed Pega Form</h1>
  <div data-pega-gadgetname="PegaGadget"
       data-pega-action="createNewWork"
       data-pega-application="appName"
       data-pega-port="https://demo.pega.com/appName/">
  </div>
</body>
</html>

- In your S3 bucket, go to the Objects tab and click Upload to add the index.html file.
- Ensure permissions allow the file to be publicly accessible.

Step 4: Configure Bucket Permissions

Follow the steps below:

- To make the website public, modify the bucket permissions and set a bucket policy.
- In the Permissions tab, go to Bucket Policy.
- Then, paste a policy that allows public read access to the bucket contents.
Here’s the code for the public policy.
- Verify that the permissions are set correctly for individual files, particularly index.html.

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}

Step 5: Access the Static Website

Follow the steps below:

- Once the S3 bucket is set up and permissions are configured, you can access your Pega Web Mashup form via the S3 bucket's website endpoint.
- In the Properties tab, find the Static website hosting section.
- The Bucket website endpoint URL provided there is the URL to access your static website. Open it in a browser to see your Pega form embedded within a static page.

Security Considerations

- Cross-Origin Resource Sharing (CORS): If your Pega instance is on a different domain, configure CORS settings in both AWS S3 and Pega to allow the mashup to load correctly.
- Security: Avoid exposing sensitive data, and consider adding AWS CloudFront for additional security, caching, and performance benefits.
- Data Collection: If the form collects sensitive data, ensure you configure encryption and other security measures in AWS to protect user data.

Conclusion

By following these steps, you can launch a Pega Web Mashup form on a static website hosted in an AWS S3 bucket. This method enables quick, scalable deployment without extensive infrastructure, providing an efficient way to share Pega-based forms and applications with your audience. As a next step, consider integrating further with AWS Lambda or CloudFront to enhance functionality and improve user experience.
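
If you create several buckets, the public-read bucket policy from Step 4 can be generated rather than edited by hand. The sketch below is one way to do that. The function name, the `Sid` label, and the bucket name `example-mashup-site` are hypothetical; `2012-10-17` is the current IAM policy language version.

```python
import json


def public_read_policy(bucket_name: str) -> dict:
    """Build a bucket policy that allows anonymous reads of every object."""
    if not bucket_name or "/" in bucket_name:
        raise ValueError("expected a bare bucket name without slashes")
    return {
        "Version": "2012-10-17",  # policy language version, not a date you choose
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # optional label (hypothetical)
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


policy_json = json.dumps(public_read_policy("example-mashup-site"), indent=2)
print(policy_json)
```

The resulting JSON string can be pasted into the Bucket Policy editor. If you script the setup with boto3, the same string could be passed to the S3 client's `put_bucket_policy(Bucket=..., Policy=policy_json)` call.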

By Sridevi Kakolu

Idempotency and Reliability in Event-Driven Systems: A Practical Guide


Introduction to Event-Driven Architectures and Idempotency

The Rise of Event-Driven Architectures

Modern e-commerce systems often depend on event-driven architectures to ensure scalability and responsiveness. For instance, when a user places an order, events like "Order Placed," "Payment Processed," and "Inventory Updated" are triggered asynchronously.

Why Idempotency Matters in Distributed Systems

In distributed systems, events can be duplicated or retried due to network failures, leading to problems like duplicate orders or incorrect inventory adjustments. Idempotency ensures that processing an event multiple times yields the same result as processing it once.

Understanding Idempotency

What Is Idempotency?

Idempotency ensures that an operation has the same effect no matter how many times it is executed. For example, if a "Place Order" API is called twice due to a network glitch, only one order should be created.

Idempotency vs. Other Fault-Tolerance Mechanisms

Idempotency focuses on correctness under retries, while fault-tolerance mechanisms like retries or circuit breakers deal with failures but may not prevent duplicates.

Challenges in Achieving Idempotency

Common Causes of Duplicate Events

- Network failures: An API gateway like AWS API Gateway might retry a request if the response isn't received promptly.
- Retries and acknowledgment delays: A payment gateway might resend a "Payment Confirmation" event if the acknowledgment is delayed.
- Faulty producers or consumers: An e-commerce checkout microservice might emit duplicate "Order Created" events due to a bug in the system.

Potential Pitfalls Without Idempotency

- Data inconsistencies: Processing duplicate "Inventory Update" events can lead to incorrect stock levels.
- Business logic failures: Charging a customer twice for the same order damages trust and creates refund headaches.
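
To make the duplicate "Inventory Update" problem concrete, here is a minimal in-memory sketch of an idempotent consumer that remembers which event IDs it has already applied. The event shape (`event_id`, `sku`, `quantity`) and the class name are assumptions for illustration. A production system would keep the processed IDs in durable storage rather than in a Python set.

```python
class InventoryConsumer:
    """Applies each 'Inventory Update' event at most once, keyed by event ID."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.processed_ids = set()

    def handle(self, event):
        """Return True if the event changed stock, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return False  # duplicate delivery: ignore, state stays the same
        sku, quantity = event["sku"], event["quantity"]
        available = self.stock.get(sku, 0)
        if available < quantity:
            raise ValueError(f"insufficient stock for {sku}")
        self.stock[sku] = available - quantity
        self.processed_ids.add(event_id)  # record only after a successful update
        return True


consumer = InventoryConsumer({"sku-1": 10})
event = {"event_id": "evt-001", "sku": "sku-1", "quantity": 2}

print(consumer.handle(event))   # first delivery is applied
print(consumer.handle(event))   # retry of the same event is ignored
print(consumer.stock["sku-1"])  # stock was reduced only once
```

Without the `processed_ids` check, the retried event would reduce stock twice, which is exactly the data inconsistency described above.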

E-Commerce Process Flow Diagram

The following diagram illustrates the sequence of operations and interactions between various components of an e-commerce platform. It highlights a customer's journey from browsing products to completing a purchase and tracking the order. This diagram typically includes core processes such as user interactions, backend system workflows, payment processing, inventory updates, and delivery mechanisms.

The flow provides a holistic view of how various components interact to deliver a seamless shopping experience. Also, implementing idempotency in critical workflows, such as payment processing and inventory updates, ensures that the system remains reliable and consistent, even in the face of network glitches or retries.

Adopting AWS services such as Amazon SQS FIFO queues, DynamoDB, and SNS can significantly simplify the implementation of idempotency in event-driven architectures.

Key Processes in the Diagram

1. User Browsing and Search
- Users browse the Product Catalog or search for specific items.
- The backend retrieves data from the Product Catalog Service, often cached using Amazon ElastiCache for faster results.

2. Wishlist Management
- Users can add items to their wishlist.
- Operations are idempotent to ensure the same product isn’t added multiple times.

3. Add to Cart and Checkout
- Products are added to the shopping cart, ensuring idempotent quantity adjustments to prevent duplication.
- At checkout, the system verifies the cart contents and calculates the total price.

4. Payment Processing
- The payment gateway initiates the transaction.
- Idempotency ensures a single payment is processed even if retries occur due to gateway timeouts.

5. Order Placement
- Upon successful payment, an "Order Placed" event is triggered.
- The system creates the order record, and idempotency prevents duplicate orders from being created.

6. Inventory Update
- Inventory is adjusted based on the placed order.
- Idempotent updates ensure stock levels are accurate even with duplicate or retry events.

7. Order Fulfillment and Delivery
- The order status progresses through stages like "Processing," "Shipped," and "Delivered."
- Updates are idempotent to avoid incorrect status changes from duplicate events.

8. Order Tracking and Notifications
- Users can check their order status.
- Notifications (e.g., email/SMS) are sent idempotently to avoid spamming users with duplicates.

Idempotency Requirements

1. Cart Updates
- Adding the same product twice should update the quantity, not create duplicate cart entries.
- Implementation: Use a unique cart item identifier and a conditional update in DynamoDB.

2. Payment Gateway
- Payment retries must not result in duplicate charges.
- Implementation: Use an IdempotencyKey stored in DynamoDB to track completed transactions.

3. Order Placement
- Duplicate "Order Created" events from a retry should not create multiple orders.
- Implementation: Use a unique orderID and a conditional PutItem operation in Amazon DynamoDB.

4. Inventory Updates
- Adjusting stock levels should account for retries to avoid over-reduction.
- Implementation: Use distributed locks (e.g., with Amazon DynamoDB TTL) to handle concurrent updates.

5. Notifications
- Email or SMS notifications triggered by an event should be sent only once.
- Implementation: Use a deduplication key with services like Amazon Simple Notification Service (SNS).

Conclusion

In event-driven systems, especially in e-commerce platforms and cloud architectures like AWS, idempotency is essential to guarantee reliability, fault tolerance, and data consistency. By implementing robust idempotent patterns and leveraging tools such as DynamoDB, SQS, and SNS, developers can mitigate the risks posed by retries and duplicate events.

This guide demonstrates how adopting these practices not only enhances system reliability but also builds trust with users by delivering seamless and error-free experiences.
As the demand for resilient and scalable systems grows, mastering idempotency becomes a cornerstone of modern software design.
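
The payment requirement described earlier (a retry must not produce a second charge) can be illustrated with a minimal Python sketch. It is not an AWS implementation: `IdempotencyStore`, `run_once`, `charge_card`, and the key `pay-order-1001` are hypothetical names, and a plain dictionary stands in for a table such as DynamoDB.

```python
from typing import Callable, Dict


class IdempotencyStore:
    """Hypothetical in-memory stand-in for a table keyed by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    def run_once(self, key: str, action: Callable[[], dict]) -> dict:
        # If this key was already processed, return the stored result
        # instead of running the side effect a second time.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


charges = []  # records every real charge so a duplicate would be visible


def charge_card(order_id: str, amount_cents: int) -> dict:
    """Hypothetical side effect: pretend to charge the customer."""
    charges.append((order_id, amount_cents))
    return {"order_id": order_id, "status": "charged", "amount_cents": amount_cents}


store = IdempotencyStore()
first = store.run_once("pay-order-1001", lambda: charge_card("order-1001", 2599))
retry = store.run_once("pay-order-1001", lambda: charge_card("order-1001", 2599))
print(first == retry, len(charges))
```

The retry receives the stored result and no second charge is recorded. In a real system the check and the write would need to be a single atomic operation (for example, a conditional write), since this in-memory version is not safe under concurrent callers.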

By Vikram Mohanagandhi

How and Why the Developer-First Approach Is Changing the Observability Landscape

Developers play a crucial role in modern companies. If we want our product to be successful, we need a developer-first approach that includes observability from day one. Read on to understand why.

The World Has Changed

Many things have changed in the last decade. In our quest for greater scalability, resilience, and flexibility in our organization's digital infrastructure, we have pivoted away from traditional monolithic architectures toward microservices and cloud-native applications. In today's fast-paced technological landscape, building isolated, independently deployable services offers significant advantages over the intertwined codebases characteristic of monolithic systems. By adopting cloud-native principles for public or hybrid cloud environments, we have further streamlined application development and delivery, while container orchestration tools like Kubernetes help us use resources efficiently and support deployment patterns such as horizontal scaling to match fluctuations in demand. This shift not only makes more efficient use of cloud resources but also supports DevOps culture, in which continuous integration and delivery accelerate time-to-market for new features and enhancements in line with our business objectives.

To keep up with this fast-changing world, we have made deployments much easier: with the move from laborious manual processes to streamlined CI/CD pipelines and infrastructure deployment tools, they have become frequent daily tasks rather than rare, challenging events.
This transition has substantially complicated system architectures across many dimensions, including infrastructure, configuration, security, and machine learning integrations, and we have become proficient at managing that complexity in our deployments. The complexity of databases, however, has not been adequately addressed. It has surged dramatically: each application now uses multiple database types, from SQL and NoSQL systems to specialized setups for machine learning or vector search, and frequent deployments keep changing them. Because these changes are often rolled out asynchronously, database schemas or background jobs can change at any time without warning, with cascading effects on performance throughout our interconnected systems. This affects the business directly and complicates resolution for developers and DevOps engineers, who often lack the expertise to troubleshoot database-centric problems alone and must call on operations experts or specialized database administrators (DBAs). Without automated solutions, the process depends on manual intervention and remains fragile.

In the past, we would put the burden of increased complexity on specialized teams like DBAs or operations. Unfortunately, this is no longer possible. The complexity of deployments and applications has increased enormously because of the hundreds of databases and services we deploy every day. We now face multi-tenant architectures with hundreds of databases, thousands of serverless applications, and millions of changes going through the pipelines each day. Even if we wanted to handle this complexity with specialized teams of DBAs or DevOps engineers, it is simply impossible.
It would be a mistake to think that this is irrelevant to mainstream business applications. Let's see why.

Developers Are Evaluating Your Business

Many companies have realized that streamlining developers' work benefits the whole company. There are two main reasons: performance improvement and new domains.

Automation in development can significantly reduce MTTR and improve velocity. Today's business problems are addressed by digital solutions that are ultimately built and maintained by developers. Keeping developers far from the end of the funnel means higher MTTR, more bugs, and longer troubleshooting. If, on the other hand, we reorganize the environment so that developers can work faster, they can directly affect all the organizational metrics. Our goal is therefore to involve developers in all activities and shift left as much as possible. By putting more tasks directly on development teams, we affect not only the technical metrics but also the business KPIs and customer-facing OKRs.

The second reason is the rise of new domains, especially around machine learning. AI solutions are significantly reshaping today's world. With large language models, recommendation systems, image recognition, and smart devices, we can build better products and solve our customers' issues faster. However, AI changes so rapidly that only developers can tame this complexity. They need to understand not only the technical side of AI solutions but also the business domain they work in. Developers need to know how to build and train recommendation systems, but also why these systems recommend specific products and how the people who use them behave. This turns developers into experts in sociology, politics, economics, finance, communication, psychology, and any other domain that benefits from AI.
Both of these reasons give developers a crucial role in running our businesses. The days of developers just taking tasks from a Jira board are long gone. Developers now lead the business end-to-end, and the performance of the business strongly depends on their performance. We therefore need to make our solutions more developer-centric to lower MTTR, improve velocity, and enable developers to move faster.

Developers increasingly advocate for an ecosystem where every component, from configuration changes to deployment processes, is encapsulated in code, a philosophy known as infrastructure as code (IaC). This approach streamlines setup and ensures consistency across environments. The shift toward full automation reinforces the trend: developers want continuous integration and delivery pipelines that build, test, and deploy software without human intervention wherever possible. Removing manual steps reduces mistakes and oversights and speeds up the overall development cycle. Developers also want these automated processes to be transparent and reversible, with quick feedback loops when issues arise during testing and seamless rollback after a failed deployment or unexpected behavior in production. Ultimately, the goal is an efficient, error-resistant workflow in which code governs not only functionality but also infrastructure changes and automation, replacing traditional manual processes.
Developers critically evaluate each tool under their purview — whether these be platforms for infrastructure management like Puppet or Chef, continuous integration systems such as Jenkins, deployment frameworks including Kubernetes, monitoring solutions (perhaps Prometheus or Grafana), or even AI and machine learning applications. They examine how maintenance-friendly the product is: Can it handle frequent updates without downtime? Does its architecture allow for easy upgrades to newer versions with minimal configuration changes required by developers themselves? The level of automation built into these products becomes a central focus - does an update or change trigger tasks automatically, streamlining workflows and reducing the need for manual intervention in routine maintenance activities? Beyond mere functionality, how well does it integrate within their existing pipelines? Are its APIs easily accessible so that developers can extend capabilities with custom scripts if necessary? For instance, integrating monitoring tools into CI/CD processes to automatically alert when a release has failed or rolled back due to critical issues is an essential feature assessed by savvy devs who understand the cascading effects of downtime in today's interconnected digital infrastructure. Their focus is not just immediate utility but future-proofing: they seek out systems whose design anticipates growth, both in terms of infrastructure complexity and the sheer volume of data handled by monitoring tools or AI applications deployed across their stacks — ensuring that what today might be cutting edge remains viable for years to come. Developers aim not just at building products but also curating ecosystem components tailored towards seamless upkeep with minimal manual input required on everyday tasks while maximizing productivity through intelligent built-in mechanisms that predict, prevent, or swiftly rectify issues. 
Developers play an essential role in shaping technology within organizations by cooperating with teams at various levels, including management, platform engineering, and senior leadership, to present findings, proposed enhancements, or innovative solutions aimed at improving efficiency, security, scalability, user experience, or other critical factors. These collaborations ensure that technology strategy aligns closely with business objectives while drawing on developers' expertise in building and maintaining software. By communicating their insights through structured meetings such as code reviews, daily stand-ups, retrospectives, or dedicated strategy sessions, developers help guide informed decision-making at every level of leadership. It follows that systems must keep developers in mind to be successful.

Your System Must Be Developer-First

Companies are increasingly moving to platform solutions to increase their operational velocity, enabling faster development cycles and quicker time-to-market. By providing integrated tools and services, platform solutions streamline workflows, reduce the complexity of managing multiple systems, and foster collaboration across teams. This consolidated approach allows companies to accelerate innovation, respond swiftly to market changes, and deliver value to customers more efficiently, ultimately gaining a competitive edge. However, to increase operational velocity, these solutions must be developer-first.

Let's look at some examples of products that have shifted toward prioritizing developers. The first is cloud computing. Manual deployments are a thing of the past. Developers now prefer to manage everything as code, enabling repeatable, automated, and reliable deployments.
Cloud platforms have embraced this approach by offering code-centric mechanisms for creating infrastructure, monitoring, wikis, and even documentation. Solutions like AWS CloudFormation and Azure Resource Manager allow developers to represent the system's state as code, which they can browse and modify using their preferred tools.

Another example is internal developer platforms (IDPs), which empower developers to build and deploy their services independently. Developers no longer need to coordinate with other teams to create infrastructure and pipelines. Tasks that once required manual input from multiple teams are now automated and available through self-service, which removes dependencies on others and lets developers work more efficiently.

Yet another example is artificial intelligence tools. AI is significantly enhancing developer efficiency by integrating with developers' tools and workflows. By automating repetitive tasks such as code generation, debugging, and testing, AI lets developers focus on creative problem-solving and innovation. AI-powered tools can also provide real-time suggestions, detect potential issues before they become problems, and optimize code performance, all within the development environment. This integration accelerates development and improves code quality, leading to faster, more reliable deployments and a more productive development cycle. Many tools (especially at Microsoft) now come with AI assistants that streamline developers' work.

Observability 2.0 to the Rescue

We have seen several solutions that keep the developer experience in mind. Let's now look at a domain that lacks this approach: monitoring and databases. Monitoring systems often prioritize raw and generic metrics because they are readily accessible and applicable across various systems and applications.
These metrics typically include data that can be universally measured, such as CPU usage or memory consumption. Regardless of whether an application is CPU-intensive or memory-intensive, these basic metrics are always available. Similarly, metrics like network activity, the number of open files, CPU count, and runtime can be monitored consistently across different environments.

The issue with these metrics is that they are too general and don't provide much insight. For instance, a spike in CPU usage might be observed, but what does it mean? Or perhaps the application is consuming a lot of memory: does that indicate a problem? Without a deeper understanding of the application, it is hard to interpret these metrics meaningfully.

Another important consideration is how many metrics to collect and how to group them. Simply tracking "CPU usage" isn't sufficient: we need to categorize metrics by factors like node type, application, country, or other relevant dimensions. This introduces its own challenges. If we aggregate all metrics under a single "CPU" label, we might miss critical issues affecting only a subset of the sources. For example, if you have 100 hosts and only one experiences a CPU spike, this won't be apparent in aggregated data. Metrics like p99 or tm99 offer more insight than averages, but they still fall short: if each host experiences a CPU spike at a different time, these metrics might not detect the problem. Once we recognize this, we might capture additional dimensions, create more dashboards for various subsets, and set thresholds and alarms for each one individually. That quickly leads to an overwhelming number of metrics.

There is a discrepancy between what developers want and what evangelists or architects think is the right way. Architects and C-level executives promote monitoring solutions that developers just can't stand. These solutions fall short because they swamp users with raw data instead of presenting curated aggregates and actionable insights. To make things better, monitoring solutions need to switch gears to observability 2.0 and database guardrails.

First and foremost, developers aim to avoid issues altogether. They seek modern observability solutions that can prevent problems before they occur. This goes beyond merely monitoring metrics: it encompasses the entire software development lifecycle (SDLC) and every stage of development within the organization. Production issues don't begin with a sudden surge in traffic; they originate much earlier, when developers first implement their solutions, and they surface once those solutions are deployed to production and customers start using them. Observability solutions must shift to monitoring all aspects of the SDLC and all the activities that happen throughout the development pipeline. This includes the production code and how it is running, but also the CI/CD pipeline, development activities, and every single test executed against the database.

Second, developers deal with hundreds of applications each day. They can't waste time manually tuning alerting for each application separately. Monitoring solutions must automatically detect anomalies, fix issues before they happen, and tune alarms based on real traffic. They shouldn't raise alarms based on hard limits like 80% CPU load. Instead, they should understand whether high CPU usage is abnormal or inherent to the application domain.

Last but not least, monitoring solutions can't just monitor. They need to fix issues as soon as they appear. Many database problems can be solved by introducing indexes, updating statistics, or changing the system configuration, and monitoring systems can perform these activities automatically.
Developers should be called if and only if there are business decisions to be made. When that happens, they should be given the full context of what is happening, why, where, and what choice they need to make. They shouldn't be debugging anything, as all the troubleshooting should be done automatically by the tooling.

Stay In the Loop With Developers In Mind

Over the past decade, significant changes have occurred. In pursuit of greater scalability, resilience, and flexibility in our organization's digital infrastructure, we have moved away from traditional monolithic application architectures and adopted modern software engineering practices like microservices and cloud-native applications. This shift reflects the recognition that, in today's rapidly evolving technological environment, building isolated, independently deployable services provides substantial benefits compared to the tightly coupled codebases typical of monolithic systems.

To complete this transition, we need to make all our systems developer-centric. That means rethinking what we build and how we build it, so that our products take developers into account and integrate with their environments. Instead of swamping developers with data and forcing them to do the hard work, we need to provide solutions and answers. Many products have already shifted to this approach. Your product shouldn't stay behind.
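
The contrast drawn earlier between a hard limit such as 80% CPU and an alarm that knows what is normal for a given application can be sketched in Python. This is only an illustration under simple assumptions: the function names, the sample histories, and the z-score rule with a limit of 3 are hypothetical choices, not a description of any particular monitoring product.

```python
from statistics import mean, stdev


def static_alert(cpu_percent: float, limit: float = 80.0) -> bool:
    """Hard-limit rule: alert whenever CPU exceeds a fixed percentage."""
    return cpu_percent > limit


def baseline_alert(history: list, cpu_percent: float, z_limit: float = 3.0) -> bool:
    """Alert only when the reading is far from this application's own history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return cpu_percent != mu
    return abs(cpu_percent - mu) / sigma > z_limit


# Hypothetical CPU histories (percent) for two very different applications.
batch_job = [88, 91, 90, 89, 92, 90, 91, 89]  # CPU-bound by design
web_api = [12, 15, 11, 14, 13, 12, 15, 14]    # normally almost idle

print(static_alert(90), baseline_alert(batch_job, 90))  # True False
print(static_alert(60), baseline_alert(web_api, 60))    # False True
```

The fixed threshold pages someone for the batch job that always runs hot and stays silent for the web API that has jumped to several times its usual load, while the baseline rule does the opposite in both cases.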

By Adam Furmanek

CORE

Strategies for Effectively Managing Terraform State

Terraform is a leading infrastructure-as-code tool developed by HashiCorp and has become a keystone of modern infrastructure management. Using a declarative approach, Terraform enables organizations to define, provision, and manage infrastructure that spans many cloud providers. At the core of Terraform's functionality is the state file, which acts as a database of the real-world resources Terraform manages and their corresponding configurations.

The state file retains information about the current state of your infrastructure: resource IDs, attributes, and metadata. Terraform uses it to work out which changes a configuration change requires. Without a state file, Terraform could not know what has been provisioned, apply incremental changes, or track the current state. The state file is Terraform's single source of truth, which lets it create, update, and delete infrastructure predictably and consistently.

Why State Management Is Crucial

State management is arguably the most important part of using Terraform. Improper handling of state files can result in configuration drift, resource conflicts, and even accidental deletion of resources. Because the state file contains sensitive information about the infrastructure, it must be handled carefully and kept safe from unauthorized access and corruption.

Proper state management helps ensure that your infrastructure is reproduced identically across environments such as development, staging, and production. Keeping state files correct and up to date enables Terraform to plan changes accurately and avoid discrepancies between the intended and actual state of your infrastructure.

Another important role of state management is team collaboration. In multi-user environments, where different team members work on the same infrastructure, there needs to be a way to share and lock state files to avoid race conditions that might introduce conflicts or inconsistencies. That is where remote state backends come in: they store state files centrally so the whole team can work with them.

State management is one of the basic building blocks of the infrastructure-as-code approach in Terraform. It ensures that your infrastructure is managed reliably, securely, and consistently across all environments, cloud accounts, and deployment regions. Understanding state files and how best to manage them allows organizations to get the most value from Terraform and avoid common pitfalls of infrastructure automation.

Understanding Terraform State

Terraform state is integral to how Terraform manages infrastructure. It is a file that records the present state of every infrastructure resource managed by Terraform. The file holds information about each resource, its attributes, and its metadata, and acts as the single source of truth about the state of the infrastructure.

How Terraform Uses State Files

Terraform relies on the state file to map the resources defined in your configuration files to the actual resources in the cloud or on other platforms. This mapping allows Terraform to understand which resources are being managed, how they relate to one another, and how they should be updated or destroyed.

When you run terraform plan, Terraform compares the current state of resources, as stored in the state file, with the desired state specified in the configuration. This comparison identifies the changes needed to align the actual infrastructure with the intended configuration. For instance, if you have added a new resource to the configuration, Terraform will detect that this resource doesn't exist in the state file and will propose to create it.

In addition to mapping resources, the state file tracks metadata, including resource dependencies and other vital information that might not be explicitly defined in your configuration. This metadata is essential for managing complex infrastructure, because it ensures that operations like resource creation or destruction are performed in the correct order to respect dependencies and prevent conflicts.

The state file also improves Terraform's performance. Instead of querying the cloud provider or infrastructure platform every time it needs to assess the infrastructure, Terraform can use the state file to quickly determine the current state. This efficiency is especially important in large-scale environments, where querying each resource could be time-consuming and costly.

Understanding the role of the Terraform state file is crucial for successful infrastructure management, as it underpins Terraform's ability to manage, track, and update infrastructure accurately.

Common Challenges in Terraform State Management

State File Corruption

State file corruption is one of the major risks of working with Terraform and can cause high-severity problems in infrastructure management. If a state file is corrupted beyond repair, Terraform loses track of existing resources; if this is not detected and handled correctly, the result is either incorrect changes to the infrastructure or a complete deployment failure. Corruption can have a variety of causes, such as file system errors, manual editing, or improper shutdowns during state operations. Its impact can be severe, ranging from misconfigurations to expensive downtime.
Concurrency Issues

Concurrency issues arise when several users or automation tools attempt to update the Terraform state file at the same time. Because the state file is a shared resource, Terraform is designed so that only one process writes to it at a time. Without appropriate locking, concurrent operations can overwrite or corrupt the state file, leading to inconsistencies in the infrastructure. This is a particular problem in collaborative environments, where many team members work on the same infrastructure.

State File Size and Performance

As infrastructure grows, so does the Terraform state file. A large state file can degrade performance, making operations like terraform plan and terraform apply slow and cumbersome, because Terraform must read, write, and update the entire state file during these operations. Large state files can also complicate debugging and increase the risk of corruption, making it harder to manage infrastructure efficiently. Proper state management strategies are essential to mitigate these performance issues and keep Terraform a reliable, scalable tool for infrastructure management.

Best Practices for Managing Terraform State

Effective Terraform state management is important for the reliability, security, and performance of your infrastructure-as-code workflows. State files contain vital information about the current state of your infrastructure, so mismanagement can lead to corruption, security vulnerabilities, and performance bottlenecks. The following best practices help mitigate these risks.

1. Use Remote State Storage

One of the best state management practices is to store state (.tfstate) files in a remote backend. By default, Terraform stores the state file on the local disk of the machine where it runs. That may suffice for small projects or single-user environments, but it quickly becomes limiting in collaborative or production environments.

Key benefits of remote state storage include:

Better collaboration: Storing the state file remotely gives team members a safe, shared place from which to read and update the infrastructure state. This is critical in collaborative workflows where many developers or DevOps engineers work on the same project.
Improved security: Remote backends such as AWS S3, Azure Blob Storage, or Terraform Cloud provide encryption at rest and in transit, access control, and audit logs. This safeguards sensitive data stored in the state file, such as resource identifiers, IP addresses, and in some cases even credentials.
Data redundancy and durability: Remote storage typically offers replication and high availability, and often backups, which protects against data loss from local hardware failures or unintentional deletion.

To set up remote state, configure a backend that uses a cloud provider's storage service. For example, with AWS S3:

    terraform {
      backend "s3" {
        bucket = "your-terraform-state-bucket"
        key    = "path/to/your/statefile"
        region = "us-west-2"
      }
    }

2. Enable State Locking

State locking places a lock on the state file so that concurrent operations cannot modify it at the same time. Without a lock, concurrent operations can corrupt the state file or leave the infrastructure inconsistent. When locking is enabled, Terraform automatically acquires a lock for any operation that modifies state and releases it when the operation is complete.
State locking is especially important in collaborative environments, where several team members might be working on the infrastructure simultaneously. Without state locking, two users could change the state file at the same time, causing conflicts and problems with your infrastructure.

With AWS S3 as your backend, you can use a DynamoDB table for state locking by configuring it as follows:

    terraform {
      backend "s3" {
        bucket         = "your-terraform-state-bucket"
        key            = "path/to/your/statefile"
        region         = "us-west-2"
        dynamodb_table = "terraform-lock-table"
      }
    }

This configuration ensures that Terraform uses a DynamoDB table to lock the state file during operations, preventing concurrent modifications.

3. Version Control for State Files

Versioning is a fundamental practice in managing any codebase, and it is just as relevant to Terraform state files. Keeping earlier versions of the state file makes it possible to go back to a previous state if something goes wrong during an infrastructure update.

Terraform does not version state files the way a version control system versions configurations, but you can achieve the same effect by storing state in a remote backend that supports versioning. For example, AWS S3 lets you turn on versioning for the bucket that stores your state files. Every change to the state file is then kept as a separate version that you can restore whenever you need to.

Here is how to enable versioning for an S3 bucket:

Open the S3 console in the AWS account you use.
Select the bucket used for Terraform state storage.
Click "Properties."
Under "Bucket Versioning," click "Edit" and turn on versioning.

S3 will then keep a history of state changes, so previous states can be restored if a problem occurs.

4. State File Encryption

Because Terraform state files contain sensitive information about your infrastructure, they should be encrypted both at rest and in transit. If unauthorized people gain access to an encrypted state file, they cannot read its content without the appropriate decryption keys.

Remote backends such as AWS S3, Azure Blob Storage, and Terraform Cloud support encryption of state files. AWS S3, for instance, supports server-side encryption with Amazon S3-managed keys (SSE-S3), with AWS Key Management Service keys (SSE-KMS), or with customer-provided keys (SSE-C). Amazon S3 applies SSE-S3 to newly stored objects by default, so a state file stored in S3 is encrypted with S3-managed keys unless you configure otherwise. To get more granular control over the encryption keys, you can use SSE-KMS:

    terraform {
      backend "s3" {
        bucket     = "your-terraform-state-bucket"
        key        = "path/to/your/statefile"
        region     = "us-west-2"
        encrypt    = true                    # enable server-side encryption for the state object
        kms_key_id = "alias/your-kms-key"    # KMS key used for SSE-KMS
      }
    }

This configuration ensures that the state file is encrypted using a specific KMS key, providing additional security.

5. Minimize State File Size

As your infrastructure grows, so does the Terraform state file. Large state files can slow down Terraform operations, making commands like terraform plan and terraform apply take longer to execute. To minimize the state file size and maintain performance, consider the following techniques:

Use data sources: Instead of managing all resources directly in Terraform, use data sources to reference existing resources that are managed elsewhere. This can reduce the number of managed resources tracked in the state and speed up Terraform operations.
Minimize resource configurations: Avoid unnecessary or redundant resource configurations that add to the state file size.
Regularly review and clean up obsolete resources or configurations that are no longer needed.
Split large configurations: If your Terraform configuration manages a very large infrastructure, consider splitting it into multiple smaller configurations, each with its own state file. This way, you can manage different parts of your infrastructure independently, reducing the size of each state file and improving performance.

Implementing these best practices for managing Terraform state ensures that your infrastructure as code workflows are reliable, secure, and scalable. Proper state management is a cornerstone of successful Terraform usage, helping you avoid common pitfalls and maintain a healthy, performant infrastructure.

Terraform State Management Strategies

Effective state management is critical when using Terraform, especially in complex infrastructure setups. Here are key strategies to manage Terraform state effectively:

1. Managing State in Multi-Environment Setups

In multi-environment setups (e.g., development, staging, production), managing state can be challenging. A common practice is to use separate state files for each environment. This approach ensures that changes in one environment do not inadvertently impact another. You can achieve this by configuring separate backends for each environment or using different state paths within a shared backend. For instance, in AWS S3, you can define different key paths for each environment:

    terraform {
      backend "s3" {
        bucket = "your-terraform-state-bucket"
        key    = "prod/terraform.tfstate"  # Use "dev/" or "staging/" for other environments
        region = "us-west-2"
      }
    }

This setup isolates states, reducing the risk of cross-environment issues and allowing teams to work independently on different stages of the infrastructure lifecycle.

2. Handling Sensitive Data in State Files

Terraform state files may contain sensitive information, such as resource configurations, access credentials, and infrastructure secrets.
Managing this data securely is vital to prevent unauthorized access. Key strategies include:

Encryption: Always encrypt state files at rest and in transit. Remote backends like AWS S3, Azure Blob Storage, and Terraform Cloud offer encryption options, ensuring that state data is protected from unauthorized access.
Sensitive data management: Avoid storing sensitive data directly in the Terraform configuration files. Instead, use environment variables or a secure secret management system (e.g., HashiCorp Vault, AWS Secrets Manager), and mark variables with Terraform’s sensitive attribute so that their values are redacted from CLI output and logs. Note that a sensitive value used by a resource is still written to the state file in plain text, which is why encrypting the state and restricting access to it remains necessary.

    variable "db_password" {
      type      = string
      sensitive = true
    }

This configuration marks the variable as sensitive, preventing its value from being displayed in Terraform's plan and apply output.

3. Using Workspaces for Multi-Tenant Environments

Terraform workspaces are a convenient way to manage state for different tenants or environments within a single backend. Workspaces allow you to manage multiple states in the same configuration directory, each representing a different environment or tenant.

Create workspaces: You can create and switch between workspaces using the Terraform CLI commands:

    terraform workspace new dev
    terraform workspace select dev

Organize by tenant or environment: Each workspace has its own isolated state, making it easier to manage multiple tenants or environments without risking cross-contamination of state data.
Best practices: When using workspaces, ensure that naming conventions are clear and consistent. Workspaces should be used in cases where you have similar infrastructure setups across different environments or tenants. However, for significantly different infrastructures, separate Terraform configurations might be more appropriate.
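In scripts, selecting a workspace that does not exist yet makes the command fail, so a small wrapper can help. The following is a minimal sketch, assuming the terraform binary is on the PATH; ensure_workspace and current_workspace are hypothetical helper names introduced here for illustration, not Terraform commands.

```shell
# Sketch: switch to a workspace, creating it first if it does not exist.
# ensure_workspace and current_workspace are hypothetical helper names.
ensure_workspace() {
  # "workspace select" exits non-zero when the workspace is missing,
  # so fall back to "workspace new", which creates it and switches to it.
  terraform workspace select "$1" 2>/dev/null || terraform workspace new "$1"
}

# Print the name of the workspace that is currently active.
current_workspace() {
  terraform workspace show
}
```

A pipeline step could call ensure_workspace dev before running terraform plan, so that the first run for a new tenant or environment does not need a manual setup step.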
Tools and Resources for Terraform State Management

Terraform CLI Commands

Managing Terraform state files well depends on knowing the relevant Terraform CLI commands. The most important ones are:

terraform state: Manages the state file directly. Its subcommands let you list resources, move resources between states, and remove resources from the state file when they no longer exist in the configuration.
terraform refresh: Updates the state file with the real-time state of the infrastructure so that it correctly reflects the current environment. Recent Terraform versions deprecate this command in favor of terraform apply -refresh-only.
terraform import: Imports pre-existing infrastructure into the Terraform state file, which makes it possible to bring manually created resources under Terraform management.

These commands help maintain consistency between the actual infrastructure and the state file, a critical aspect of Terraform state management.
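The commands above can be sketched as one state clean-up session. This is an illustration under stated assumptions: the resource addresses and the bucket name are placeholders, and run and DRY_RUN are hypothetical helpers added here so the commands are only printed unless DRY_RUN is set to 0.

```shell
# Sketch of a state clean-up session. Resource addresses and the bucket name
# are placeholders. DRY_RUN and run are illustrative helpers, not Terraform features.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

state_session() {
  run terraform state list                                            # list tracked resources
  run terraform state show aws_s3_bucket.logs                         # inspect one resource
  run terraform state mv aws_s3_bucket.logs aws_s3_bucket.audit_logs  # rename the address in state
  run terraform state rm aws_s3_bucket.legacy                         # stop tracking; the bucket itself is kept
  run terraform import aws_s3_bucket.manual my-manual-bucket          # adopt an existing bucket
  run terraform apply -refresh-only                                   # reconcile state with real infrastructure
}
```

Printing the commands first is a deliberate choice: state mv and state rm rewrite the state file, so reviewing the exact invocations (and having bucket versioning enabled) before executing them reduces the chance of an unrecoverable mistake.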
Third-Party Tools

In addition to native Terraform tools, several third-party tools can enhance Terraform state management:

Terraform Cloud: HashiCorp's hosted service for Terraform, with built-in state management features such as remote state storage, state locking, and versioning. It is a solid option for teams.
Atlantis: A tool that automates Terraform operations such as plan and apply through integration with version control systems, which is especially useful when many developers work on the same infrastructure.
Terragrunt: A thin wrapper for Terraform that provides extra tools for working with multiple Terraform modules, automating remote state configuration, keeping configurations DRY (Don’t Repeat Yourself), and managing locking.
Atmosly: Atmosly supports Terraform pipelines, offering state management assistance and integration within Terraform workflows. This streamlines state handling and enhances pipeline automation, making it easier for teams to manage their Terraform deployments efficiently.

Together with Terraform's native CLI commands, these tools form a more comprehensive toolkit for keeping your infrastructure’s state predictable and secure as the infrastructure grows.

Conclusion

Effective Terraform state management is important for integrity, security, and performance. This article has covered best practices such as remote state storage, state locking, encryption, splitting state files in large deployments, and using workspaces for multi-tenancy, all of which significantly reduce the risk of state file corruption and concurrency conflicts. Take a closer look at how you manage Terraform state today, and consider adopting the techniques and tools described here for better infrastructure management.

By Ankush Madaan

Understanding IaC Tools: CloudFormation vs. Terraform

AWS CloudFormation and Terraform: not sure which to choose? This article will help you make an informed decision.

Cloud computing has revolutionized the world of DevOps. It is not just a buzzword anymore; it is here to change the way we develop and maintain our applications. While there are countless reasons why businesses of all sizes should use cloud computing, there is one limitation: you have to provision your infrastructure manually. You have to go to the consoles of your cloud providers and tell them exactly what you want. This works well for small use cases, but what if you have different people making configuration changes in the console? You could end up with a very complicated infrastructure that only becomes harder to maintain. There is no efficient way to collaborate or keep track of changes to the cloud infrastructure.

Well, there is: Infrastructure as Code. Infrastructure as Code (IaC) is a trendy term in cloud computing. It is the process of managing your IT infrastructure as code. Yes, that is right. Instead of going to the console and doing everything manually, IaC allows you to write configuration files to provision your cloud infrastructure. IaC gives us benefits like consistency, easy and fast maintenance, and less room for human error.

Using IaC With Amazon Web Services

AWS is the leading cloud computing service in the world, with double the market share of the next cloud provider. It offers over 200 services that can cater to hundreds and thousands of use cases. When starting to use IaC with AWS, you will often narrow down your choices to AWS CloudFormation and the open-source tool Terraform. If you want to choose between the two, understanding the multitude of features both tools offer can be overwhelming. In this article, we will examine the differences between AWS CloudFormation and Terraform to help you decide which tool is better suited to your needs.

Terraform vs.
AWS CloudFormation: Differences

Modularity

When using IaC in big organizations, modularity can be a significant factor in choosing the right tool.

CloudFormation

CloudFormation does not have native support for modules. Instead, it allows you to use nested stacks as modules. For example, you can create a standard CloudFormation template for provisioning an S3 bucket in your organization. When end users wish to create an S3 bucket, they can use this CloudFormation template as a nested stack to provision the standard S3 bucket.

There is also an AWS service, the AWS Service Catalog, which can assist with modularity for CloudFormation. The AWS Service Catalog is designed for organizations that need to limit the scope of AWS services to meet compliance, security, cost, or performance requirements. It uses CloudFormation templates on the backend.

Let us return to the S3 example to see how this works. If not used properly, S3 buckets can quickly become a serious risk to your confidential data, so you want a standard way of using S3 in your organization. The first option is to create the nested stack template, which can be used within other CloudFormation stacks. Alternatively, you can use the AWS Service Catalog, which allows users to launch this standard template from the console UI and specify some parameters for slight customizations. Either approach lets you control how infrastructure is provisioned in your AWS accounts and prevent unwanted scenarios.

CloudFormation's nested stacks and the AWS Service Catalog can therefore support standard configurations in large organizations, though they may require more manual configuration.

Terraform

Terraform has native support for modules. It allows you to create standard configurations, similar to CloudFormation's nested stack templates, and use them in other Terraform configurations.
Since Terraform is an open-source tool, you can also find and use pre-made open-source modules in the Terraform Registry. You can also create your own modules with your own configurations and host them on a private module registry. Terraform’s native support for modules provides a straightforward approach to modularity. However, managing modules across a large team might require additional governance to ensure proper usage.

Using a nested stack in CloudFormation is not as easy as using modules in Terraform. The main reason is that passing data from a CFN template to the nested stack can be complicated. CloudFormation does not have a centralized repository for sharing templates. The AWS Service Catalog allows you to manage this process but primarily enforces rules via the console. While CloudFormation templates can encapsulate complex tasks, users would still have to specify parameters when creating resources. On the other hand, Terraform has an established method for creating, maintaining, and sharing modules. You can see the exact requirements of the modules in the Terraform Module Registry and easily use them in your Terraform files.

Control and Governance Over Infrastructure

If you want to limit what resources your people can create in your AWS accounts, both AWS CloudFormation and Terraform provide you with the means to do so.

CloudFormation

CloudFormation provides control via IAM policies, allowing you to manage user access to resources. However, this control is AWS-specific, which can be ideal if your infrastructure is fully AWS-centered. In our S3 bucket example, you might want to remove all "S3 Create" permissions from users and only allow them to create S3 buckets from the AWS Service Catalog or nested stacks.

Terraform

Terraform allows you to control which resources your users can create using a policy-as-code tool, Sentinel. Sentinel enables you to enforce fine-grained, logic-based policies to allow or deny user actions via Terraform.
For example, you can deny all resources that create S3 buckets and only allow users to create S3 buckets from a standard module.

State Management

AWS CloudFormation and Terraform need to keep track of the resources they maintain.

Terraform

Terraform stores the state of your infrastructure in a state file. This file is stored locally by default; however, you can store it on remote backends like S3 and have multiple users make changes to the same set of infrastructure.

CloudFormation

CloudFormation does state maintenance internally in the background, so users don’t need to worry about manually managing a state file. This is good for those who want a fully managed service.

Both AWS CloudFormation and Terraform allow you to check what changes will be made to your infrastructure. In Terraform, you can run the command terraform plan to see how Terraform plans to apply your configuration changes. In CloudFormation, users can see this information via Change Sets.

Language

Terraform

Terraform uses the HashiCorp Configuration Language (HCL), a language created by HashiCorp. It is similar to JSON, with additional built-in features and capabilities.

CloudFormation

CloudFormation templates are written in YAML or JSON.

Logging and Rollbacks

Both AWS CloudFormation and Terraform have good logging capabilities. In my experience, the errors and issues have been straightforward (for the most part).

CloudFormation

By default, CloudFormation rolls back all your changes if a stack change fails. This is a good feature, and it can be disabled for debugging purposes.

Terraform

Terraform will not automatically roll back your changes if it fails. This is not an issue, as you can always run the terraform destroy command to delete the half-provisioned configuration and start a Terraform run again.
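To make the preview step described under State Management more concrete, here is a rough side-by-side sketch of previewing and then applying a change with each tool. The stack name, change set name, and file names are placeholders, and run and DRY_RUN are hypothetical helpers that print the commands instead of executing them unless DRY_RUN is set to 0.

```shell
# Sketch: preview-then-apply with Terraform and with CloudFormation.
# Stack, change-set, and file names are placeholders. DRY_RUN and run are
# illustrative helpers, not part of either CLI.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

preview_terraform() {
  run terraform plan -out=tfplan   # save the proposed changes to a plan file
  run terraform show tfplan        # review the saved plan
  run terraform apply tfplan       # apply exactly what was reviewed
}

preview_cloudformation() {
  run aws cloudformation create-change-set \
    --stack-name demo-stack --change-set-name demo-changes \
    --template-body file://template.yaml
  run aws cloudformation describe-change-set \
    --stack-name demo-stack --change-set-name demo-changes
  run aws cloudformation execute-change-set \
    --stack-name demo-stack --change-set-name demo-changes
}
```

In both flows the reviewed artifact (the plan file or the change set) is what gets applied, which is the property that makes the preview trustworthy.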
Scope

Terraform

Terraform's multi-cloud support allows you to deploy infrastructure across AWS, Azure, Google Cloud, and other platforms and provides flexibility if you're working in a multi-cloud environment.

CloudFormation

CloudFormation is tightly integrated with AWS, making it a good option for AWS-only infrastructures but limited for multi-cloud setups.

Feature Support

CloudFormation

AWS CloudFormation typically receives updates first for new services and features, given its close integration with AWS.

Terraform

In cases where Terraform lacks certain AWS features, you can integrate CloudFormation stacks directly into your Terraform code as a workaround.

Technical Support

CloudFormation

The paid AWS technical support plan also covers CloudFormation support.

Terraform

HashiCorp has paid plans for technical support on Terraform as well.

Conclusion

Both AWS CloudFormation and Terraform are robust and fully developed tools, each with its own advantages. The differences above can help you determine which tool best suits your needs. If you plan to use multiple cloud platforms, Terraform offers multi-cloud support, while AWS CloudFormation is an excellent choice for AWS-specific environments. Ultimately, both tools are fair game and can effectively manage IaC. The right choice depends on your requirements, whether you're focusing on AWS alone or working with multiple cloud providers.

By Pallavi Godse

Deploying Databricks Asset Bundles

Disclaimer: All the views and opinions expressed in the blog belong solely to the author and not necessarily to the author's employer or any other group or individual. This article is not a promotion for any cloud/data management platform. All the images and code snippets are publicly available on the Azure/Databricks website.

In my other blogs, I have provided details on Databricks, how to create Unity Catalog, etc. in Azure cloud. In this blog, I will provide information on the Databricks Asset Bundle, when to use it, and how to deploy it in a Databricks workspace in Azure using the Databricks CLI.

What Is Databricks Asset Bundle (DABs)?

Databricks Asset Bundles are an Infrastructure as Code (IaC) approach to managing your Databricks objects. Since bundles are defined and managed through YAML templates and files you create and maintain alongside source code, they map well to scenarios where IaC is an appropriate approach.

The Databricks Asset Bundle is a tool that makes it easier to move things like code, data pipelines, machine learning models, and settings from one Databricks workspace to another. Imagine you have built some data tools or projects in one place and want to set them up in another; the Asset Bundle helps "package" everything together so you can move and reuse it without having to recreate each part. It's like zipping up a folder with all your work to send somewhere else, making it much faster and more organized to share or deploy.

Image Source: Databricks

Why Use Databricks Asset Bundle (DABs)?

Databricks Asset Bundles make it easy to move and manage data projects across different workspaces or environments.
Simplifies Deployment: If you’ve developed code, models, or data workflows in one environment (like a development environment), Asset Bundles let you deploy everything to another (like production) Databricks workspace without redoing the setup, following DevOps best practices.
Easy Collaboration: Teams can share complete data projects, including all necessary components, in a structured way. This makes it easy for others to set up and use those projects.
Version Control and Consistency: Asset Bundles help ensure that all parts of a project stay consistent and up-to-date across environments so no steps are missed.
Reduces Setup Time: By packaging everything into a single bundle, you save time on configuration, making it faster to roll out updates or set up projects in new workspaces.

How to Deploy a Databricks Job Using Databricks Asset Bundle (DABs)

Prerequisites

Databricks workspace is already configured in Azure cloud.
The user running the CLI commands has access to the workspace and is able to create jobs in it.
Databricks CLI is installed and configured on the local machine. For installation steps, refer to the Databricks CLI installation instructions for your OS.

DAB Demo

Step 1: Validate That the Databricks CLI Is Installed Correctly

Run the following command in the command prompt or terminal:

    databricks -v

You should see an output similar to the one below (your version could be different):

    Databricks CLI v0.221.1

Step 2: Log In to the Workspace

Run the following command to initiate OAuth token management locally for each target workspace:

    databricks auth login --host <workspace-url>

The CLI prompts for the Databricks profile name. Press Enter to accept the default or type a different name. The profile information is saved in the ~/.databrickscfg file (on Mac).
Step 3: Initialize the Bundle

Run the following command to initiate the bundle creation:

    databricks bundle init

Enter a unique name for this project, such as demo. Select "no" for all other questions. After the command runs successfully, you should see the bundle's folders created.

Step 4: Update the Bundle

Create a new file demo.yaml inside the Resources folder, and copy and paste the content below. This file contains the Databricks job definition for a Python notebook task. It also contains the job cluster (required compute) definition. Replace the notebook path with the path of an existing notebook in your workspace. For more asset bundle configuration options, refer to the Databricks Asset Bundles documentation.

    resources:
      jobs:
        sb_demo_job:
          name: "sb demo job"
          tasks:
            - task_key: demo_task
              notebook_task:
                notebook_path: <REPLACE THIS BLOCK WITH AN EXISTING NOTEBOOK IN YOUR WORKSPACE>
                source: WORKSPACE
              job_cluster_key: Job_cluster
          job_clusters:
            - job_cluster_key: Job_cluster
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: ON_DEMAND_AZURE
                  spot_bid_max_price: -1
                node_type_id: Standard_D4ds_v5
                spark_env_vars:
                  PYSPARK_PYTHON: /databricks/python3/bin/python3
                enable_elastic_disk: true
                data_security_mode: SINGLE_USER
                runtime_engine: PHOTON
                num_workers: 1
          queue:
            enabled: true

Step 5: Validate the Bundle

Run the following command to validate the bundle. Make sure you are running this command within your bundle folder, where the databricks.yml file is present.

    databricks bundle validate

Any error in the bundle configuration will show up in the output of this command. If you have multiple profiles in your .databrickscfg file, add -p <PROFILE NAME> to the command as a parameter.

Step 6: Deploy the Bundle

Run the following command to deploy the bundle in the Databricks dev workspace using the -t parameter. You can find all the targets in the databricks.yml inside your bundle folder.
    databricks bundle deploy -t dev

If you have multiple profiles in your .databrickscfg file, add -p <PROFILE NAME> to the command. You should see output confirming the deployment.

Step 7: Validate the Bundle Deployment

Log into the Databricks workspace and click on the "Workflows" menu in the left menu panel. Search for the job name from your bundle YAML file; the job should appear in the list. This validates that you have successfully configured and deployed a Databricks job with a notebook task in the workspace. Now, you can commit and push the YAML file to your favorite Git repository and deploy it in all environments using the CI/CD pipeline.

Conclusion

Automating deployment through the Databricks Asset Bundle ensures that all the jobs and Delta Live Table pipelines are deployed consistently in Databricks workspaces. It also ensures that the configurations are codified, version-controlled, and migrated across environments following DevOps best practices. The above steps demonstrate how anyone can easily develop, validate, and deploy a job in a Databricks workspace using Databricks CLI from a local workstation during the development and unit testing phases. Once validated, the same YAML file and the associated notebook(s) can be pushed to a version control tool (e.g., GitHub), and a CI/CD pipeline can be implemented to deploy the same job in test and production environments.
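As a rough sketch of what such a pipeline step might run, the function below validates, deploys, and then triggers the job for a given target and profile. It assumes the job key sb_demo_job from the bundle YAML above; bundle_cycle, run, and DRY_RUN are hypothetical helpers, and the commands are only printed unless DRY_RUN is set to 0.

```shell
# Sketch of a CI step for a bundle. bundle_cycle, run, and DRY_RUN are
# illustrative helpers; sb_demo_job is the job key from the bundle YAML.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: bundle_cycle <target> <profile>
bundle_cycle() {
  run databricks bundle validate -t "$1" -p "$2"          # check the configuration
  run databricks bundle deploy -t "$1" -p "$2"            # create or update the job
  run databricks bundle run -t "$1" -p "$2" sb_demo_job   # trigger the deployed job
}
```

For example, bundle_cycle dev DEFAULT would cover the local development loop, while a pipeline could pass a different target name defined in databricks.yml for test or production.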

By Soumya Barman

The Latest Deployment Topics