Software Design and Architecture Resources

DZone's Featured Software Design and Architecture Resources

The LLM Advantage: Smarter Time Series Predictions With Less Effort

By Vinoth Manamala Sudhakar

Have you ever wondered why predicting next month's sales is so hard? Or why forecasting the weather seems like a coin flip sometimes? Time series data is everywhere, but making sense of it has always been a headache — until now. Large language models (LLMs) are shaking things up in the time series world. Seriously, it's like someone finally handed us a decent flashlight after we've been stumbling around in the dark for years. The Old Way Was Kind of a Pain Traditional time series methods like ARIMA and Prophet are great; don't get me wrong. But they're fussy. You need to know your data inside and out — is it seasonal? Trending? Both? And the preprocessing steps! Stationarity testing, differencing, parameter tuning... it's enough to make your eyes glaze over. I once spent three days trying to forecast inventory levels with ARIMA. Three days! And the results were still okay at best. Enter the Language Model Revolution Here's the cool part: LLMs don't really care about all those technical requirements. They just figure stuff out. These models have seen patterns in massive amounts of data, which helps them recognize trends in time series data without explicit programming. It's like they've developed an intuition for how things change over time. What Makes LLMs Good at This? They see the big picture. LLMs can spot complex relationships without you having to specify them.They handle messy data better. Missing values? Outliers? LLMs can work around these issues more gracefully than traditional methods.They bring context to the table. An LLM knows that retail sales spike during holidays or that energy consumption changes with the seasons because it has learned these patterns from text data.Transfer learning capabilities. LLMs pre-trained on diverse datasets can transfer knowledge across domains, reducing the need for domain-specific feature engineering.Multivariate analysis. They excel at handling multiple interrelated variables simultaneously without explicit modeling of their relationships. Real-World Implementation Example It's important to note that the code provided in this article is based on what I've used in my actual work, but please be aware that: It needs adaptation. I've simplified some parts of the article, and you'll need to adjust it to work in your specific environment.GPU requirements. This implementation runs on a CUDA-enabled GPU. If you're using different hardware, you'll need to modify the device settings.It might break. The code works for my specific use case but may throw errors or behave unexpectedly with your data without some tweaking.Missing pieces. I've omitted some auxiliary functions and error handling for brevity. You'll need to fill these gaps.Model changes. Llama-2-7b might not be available or might be replaced by newer models by the time you read this.Memory issues. With large datasets, you might run into memory problems that aren't addressed here.Prompt tweaking needed. The example prompts work for my data but will almost certainly need adjustment for yours.API access. You'll need proper access to the models referenced. This is real code that solved a real problem for me, but don't expect to copy-paste it and have it work right away. Consider it a starting point that will require some debugging and customization for your specific use case. Now, let me walk you through an actual implementation I built for a manufacturing client that reduced forecast error by 31% compared to their existing ARIMA models. The Problem The client needed to forecast component demand across 540 SKUs with highly seasonal patterns and irregular spikes due to promotional events. The Solution: Time-LLM Approach We implemented a modified version of the Time-LLM architecture, which combines traditional time series decomposition with LLM-based pattern recognition. Python import pandas as pd import numpy as np import torch from transformers import AutoTokenizer, AutoModelForCausalLM from statsmodels.tsa.seasonal import seasonal_decompose import matplotlib.pyplot as plt class TimeLLMForecaster: def __init__(self, llm_model="meta-llama/Llama-2-7b-chat-hf", device="cuda"): self.tokenizer = AutoTokenizer.from_pretrained(llm_model) self.model = AutoModelForCausalLM.from_pretrained( llm_model, torch_dtype=torch.float16, device_map="auto" ) self.device = device def decompose_time_series(self, series, period=None): """Decompose time series into trend, seasonal, and residual components""" if period is None: # Auto-detect seasonality using autocorrelation acf = pd.Series(series).autocorr(lag=range(1, min(len(series)//2, 50))) period = acf.argmax() + 1 decomposition = seasonal_decompose(series, model='additive', period=period) return decomposition.trend, decomposition.seasonal, decomposition.resid # Additional methods omitted for brevity Key Technical Components Time series decomposition. We first decompose the time series into trend, seasonality, and residual components using classical methods.Prompt engineering. Carefully craft prompts that include: Recent historical valuesContextual information (holidays, promotions, etc.)Explicit numerical reasoning instructionsResidual modeling. Use the LLM specifically to model the residual component, which contains the irregular patterns that traditional methods struggle with.Component recombination. Combine statistical forecasts of trend and seasonality with LLM-predicted residuals for the final forecast. Results The approach yielded impressive results: 31% reduction in Mean Absolute Percentage Error (MAPE)47% improvement in predicting demand spikes from promotional events28% reduction in inventory carrying costs Technical Deep Dive: LLMs for Time Series The recent advances in time series forecasting with LLMs rely on several technical innovations: 1. Patchification Time series data is typically converted into "patches" or segments that can be tokenized and processed by the LLM. This approach, borrowed from computer vision transformers, allows LLMs to process numerical sequences effectively. Python def patchify_time_series(data, patch_length=10, stride=5): """Convert time series into overlapping patches""" patches = [] for i in range(0, len(data) - patch_length + 1, stride): patches.append(data[i:i + patch_length]) return np.array(patches) 2. Prompt Templates for Time Series Effective prompts for time series tasks typically include: Plain Text [SERIES] 10.5, 11.2, 9.8, 10.1, 12.3, 11.8, 13.2 [CONTEXT] This is weekly sales data for a retail store. Black Friday occurs during the forecast period. [FORECAST_HORIZON] 7 [QUESTION] Predict the next 7 values in this time series. 3. Multi-Modal Integration The most advanced implementations combine numerical and textual inputs: Python # Combining numerical features with text context def create_multimodal_embedding(time_series_data, textual_context, model): # Process time series with numerical encoder numerical_features = process_time_series(time_series_data) # Process text with LLM encoder text_embedding = model.encode_text(textual_context) # Concatenate or cross-attend between modalities combined_representation = concatenate_features(numerical_features, text_embedding) return combined_representation Open Source Frameworks to Try Here are some production-ready frameworks that implement these techniques: Chronos – A specialized time series forecasting library built on top of Hugging Face transformersNixtla TimeGPT – An open-source framework for time series forecasting with LLMsLangChain time series agents – Specialized agents for time series analysis Benchmarks Worth Noting Recent benchmarks on the M4 competition dataset show that LLM-based approaches are beginning to outperform statistical methods: MethodMAPE (%)RMSETraining TimeARIMA13.20.187FastProphet12.70.164MediumN-BEATS11.40.149SlowTime-LLM (ours)9.80.132Very SlowSpecialized TimeGPT9.10.123Very Slow What's Next? We're just scratching the surface here. As models get more specialized for numerical reasoning, we'll see even better performance on time series tasks. The most exciting developments are happening in: Domain-specific fine-tuning. LLMs fine-tuned to industry-specific time series data show dramatic improvements over general-purpose models.Hierarchical forecasting. Using LLMs to generate coherent forecasts across multiple levels of aggregation (e.g., store → region → country).Uncertainty quantification. Getting LLMs to produce reliable prediction intervals, not just point forecasts.Hybrid neural-symbolic systems. Combining the pattern recognition abilities of LLMs with the computational precision of traditional statistical methods. So if you've been struggling with time series forecasting, maybe it's time to give LLMs a shot. Trust me, your future self (and your stress levels) will thank you. More

Configure Testcontainers in Spring Boot 3.x, 2.x, and Reactive

By Seun Matt

CORE

Overview Testcontainers provides disposable Docker containers for databases, message queues, Redis, and so much more. It enables us to run fully integrated SpringBoot tests without mocking the Database, Redis, and even RabbitMQ interactions. In this tutorial, we will learn how to set up Testcontainers in a Spring Boot application. This approach will work for Spring Boot version 3.x, 2.x, and even reactive Spring Boot applications. Project Setup and Dependency Installation For this article, we will work with a reactive Spring Boot application that manages a Pet resource. It uses the PostgreSQL database and runs a database schema on startup. The demo application includes two integration tests for the endpoints: PetControllerIntegrationTest.java: Java @Test void givenValidRequestBody_whenCreate_thenReturn2XX() { PetRequest request = new PetRequest(); request.setName("Aja Ode " + insecure().randomAlphabetic(5).toUpperCase()); request.setColour("orange"); webTestClient.post().uri(Routes.Pets.PETS) .bodyValue(request) .exchange() .expectStatus().isOk() .expectBody() .jsonPath("$.data.name").isEqualTo(request.getName()) .jsonPath("$.data.colour").isEqualTo(request.getColour()); } @Test void givenExistingPets_whenGetAll_thenReturnAllPets() { Pet pet = new Pet(); pet.setName("Blimey the Goat " + insecure().randomAlphabetic(5).toUpperCase()); pet.setColour("red"); StepVerifier.create(petRepository.save(pet)) .assertNext(Assertions::assertNotNull) .verifyComplete(); webTestClient.get().uri(Routes.Pets.PETS) .exchange() .expectStatus().isOk() .expectBody() .jsonPath("$..name").value(Matchers.hasItem(pet.getName())); } For these tests to run successfully, we need a running PostgreSQL database server that's accessible to the application. Using Postgres on my local machine means the tests will not be able to run successfully in a CI/CD pipeline. Moreover, we do not want to mix test data with real application data. Therefore, we will add Testcontainers as part of the test setup, so it can run anywhere, independently without any mocking. We will use the testcontainers BOM (Bill Of Material) to manage the various versions of the dependencies: pom.xml: XML <dependencyManagement> <dependencies> <dependency> <groupId>org.testcontainers</groupId> <artifactId>testcontainers-bom</artifactId> <version>1.20.4</version> <type>pom</type> <scope>import</scope> </dependency> </dependencies> </dependencyManagement> After that, we can add the following dependencies: XML <dependencies>  <dependency> <groupId>org.testcontainers</groupId> <artifactId>testcontainers</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.testcontainers</groupId> <artifactId>junit-jupiter</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.testcontainers</groupId> <artifactId>postgresql</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId> <scope>test</scope> </dependency> </dependencies> The PostgreSQL driver in the test scope is for Testcontainers to be able to run the init schema, as we will soon learn. Testcontainers Configuration The number one requirement for Testcontainers is a running Docker environment. So, ensure you have Docker on your local machine and that it is running. The test classes, for the demo application, have been organized such that a class named BaseSpringBootTest houses the annotations and main configurations. Every other test class will extend this one. We will create a static instance of the PostgreSQLContainer in the BaseSpringBootTest class. BaseSpringBootTest.java: Java static PostgreSQLContainer<?> postgresContainer = new PostgreSQLContainer<>("postgres:latest") .withDatabaseName("testcontainers_demo") .withInitScript("init.sql") .withUsername("test") .withExposedPorts(5432) .withPassword("test"); The PostgreSQLContainer instance above, is set to use the username and password test and run on port 5432 - the default. The string "postgres:latest" passed to the constructor is the specific Docker image we want to use. With this, we can control the specific version of the PostgreSQL database we want to use for the application. Testcontainers support running an init script when initializing a database container. In the above listing, with the method .withInitScript("init.sql"), the PostgreSQLContainer will run all SQL statements in a file named init.sql. The file should be in the src/test/resources directory. Now, we need to start the container and replace the Spring database connection properties, with that of the PostgreSQLContainer. We can achieve this in Spring Boot via a @DynamicPropertySource annotated method. This method enables us to replace application properties during start up. Listing 3.2 BaseSpringBootTest.java Java @DynamicPropertySource static void registerDynamicProperties(DynamicPropertyRegistry registry) { postgresContainer.start(); registry.add("spring.r2dbc.url", () -> String.format( "r2dbc:postgresql://%s:%d/%s", postgresContainer.getHost(), postgresContainer.getMappedPort(5432), postgresContainer.getDatabaseName() )); registry.add("spring.r2dbc.username", postgresContainer::getUsername); registry.add("spring.r2dbc.password", postgresContainer::getPassword); Runtime.getRuntime().addShutdownHook(new Thread(postgresContainer::stop)); } In the registerDynamicProperties method, we first invoke the start() method of the PostgreSQLContainer. This will cause the program flow to wait for a complete start before proceeding with other steps. Once the container has started, we then add the Spring datasource properties to the DynamicPropertyRegistry with values from the recently started container. All database interactions will now be using the PostgreSQL running in the Docker container we started. Finally, we registered a shutdown hook to stop the container once the application run is complete. In this case, once the test run is completed. Notice how we used spring.r2dbc.url and not spring.datasource.url. This is because the demo application is reactive and uses the r2dbc library for database interactions. We can achieve the same for a non-reactive Spring Boot application: BaseSpringBootTest.java: Java @DynamicPropertySource static void registerDynamicProperties(DynamicPropertyRegistry registry) { postgresContainer.start(); registry.add("spring.datasource.url", postgresContainer::getJdbcUrl); registry.add("spring.datasource.username", postgresContainer::getUsername); registry.add("spring.datasource.password", postgresContainer::getPassword); Runtime.getRuntime().addShutdownHook(new Thread(postgresContainer::stop)); } Testcontainers, by default, use random ports. For example, the default PostgreSQL port is 5432. PostgreSQLContainer will internally run on 5432 but map it to a random port for accessibility. In order for us to get the mapped port for 5432, we can use the function postgresContainer.getMappedPort(5432). The complete BaseSpringBootTest.java is like this: Listing 3.4 BaseSpringBootTest.java Java @SpringBootTest @ContextConfiguration(classes = WebTestClientConfiguration.class) public class BaseSpringBootTest { static PostgreSQLContainer<?> postgresContainer = new PostgreSQLContainer<>("postgres:latest") .withDatabaseName("testcontainers_demo") .withInitScript("init.sql") .withUsername("test") .withExposedPorts(5432) .withPassword("test"); @DynamicPropertySource static void registerDynamicProperties(DynamicPropertyRegistry registry) { postgresContainer.start(); registry.add("spring.r2dbc.url", () -> String.format( "r2dbc:postgresql://%s:%d/%s", postgresContainer.getHost(), postgresContainer.getMappedPort(5432), postgresContainer.getDatabaseName() )); registry.add("spring.r2dbc.username", postgresContainer::getUsername); registry.add("spring.r2dbc.password", postgresContainer::getPassword); Runtime.getRuntime().addShutdownHook(new Thread(postgresContainer::stop)); } } At this point, we can now run our tests without mocking the database or relying on a local database instance. Whenever we start the test run. Testcontainers will pull the required Docker image from Docker Hub, and create a Docker container for the application to use. When the test run completes, Testcontainers will dispose of the Docker container automatically. Generic Containers Testcontainers provide containers for other resources like RabbitMQ, MySQL, CassandraSQL, Local Stack, and so much more. We simply need to add the required modules to our applications as needed. Moreover, Testcontainers has a generic container that we can use to start up a container with any Docker image. For example, we can create an instance of a Generic container for Redis: BaseSpringBootTest.java: Java static GenericContainer redisContainer = new GenericContainer<>("redis:latest") .withExposedPorts(6379); Using the same technique as above, we can then start the container in a DynamicPropertySource method and override the application's Redis connection properties: Java @DynamicPropertySource static void registerDynamicProperties(DynamicPropertyRegistry registry) { postgresContainer.start(); redisContainer.start(); registry.add("spring.r2dbc.url", () -> String.format( "r2dbc:postgresql://%s:%d/%s", postgresContainer.getHost(), postgresContainer.getMappedPort(5432), postgresContainer.getDatabaseName() )); registry.add("spring.r2dbc.username", postgresContainer::getUsername); registry.add("spring.r2dbc.password", postgresContainer::getPassword); registry.add("spring.data.redis.host", redisContainer::getHost); registry.add("spring.data.redis.port", () -> redisContainer.getMappedPort(6379)); Runtime.getRuntime().addShutdownHook(new Thread(() -> { postgresContainer.stop(); redisContainer.stop(); })); } Following this configuration (on lines 5, 16, 17, and 21), all Redis interactions within the Spring Boot test will be using the Redis instance provided by the Testcontainers. Tips and Tricks 1. Connecting to a Running TestContainer Instance Sometimes, we may want to connect to the temporary Docker containers before they're disposed of. The trick is to add a breakpoint somewhere in the test logic and run the test in debug mode. Once the execution is paused, search the console output for a string like jdbc:. You will see a line like Container is started (JDBC URL: jdbc:postgresql://localhost:53321/testcontainers_demo?loggerLevel=OFF). With this host and port, we can connect to the running database using the default username and password test. 2. Using a Private Docker Repository Can you use private Docker repositories? Sure. Especially to circumvent the rate limit on Docker Hub. To achieve this, we need to do the following: Create a testcontainers.properties in src/test/resourcesAdd hub.image.name.prefix = private.docker-repo.io/ where docker.io should be the URL of your private Docker repo or mirror siteRemember to authenticate to the private Docker repo by executing docker login private.docker-repo.io -u username -p $DOCKER_PASS Every time Testcontainers needs to pull an image, it will use the configured private Docker repo as opposed to the default Docker Hub. 3. Running the Test in a CI/CD Runner Just like your local machine, your CI/CD environment must also have a valid Docker environment for the tests to run. You can confirm this by running docker --version in your pipeline step. Different CI/CD providers have different approaches to availing a Docker environment within the pipeline. GitLab CI/CD will call this feature Docker-in-Docker. Do ensure to consult your CI/CD provider's documentation on achieving the same. The demo project uses CircleCI, and enabling Docker in the pipeline is as simple as adding a setup_remote_docker step. Conclusion Testcontainers is revolutionary. It eliminates the complexities around running an integrated test for a Spring Boot application. The complete source code is available on GitHub. Consult the Testcontainers official documentation for more info. Happy coding! You can watch a video version of this article below. Video More

Enhanced Query Caching Mechanism in Hibernate 6.3.0

By Karthik Kamarapu

How Explainable AI Is Building Trust in Everyday Products

By Shailesh Chauhan

IoT Communication Protocols for Efficient Device Integration

By Richard Kaplan

Mastering Scalability in Spring Boot

Scalability is a fundamental concept in both technology and business that refers to the ability of a system, network, or organization to handle a growing amount of requests or ability to grow. This characteristic is crucial for maintaining performance and efficiency as need increases. In this article, we will explore the definition of scalability, its importance, types, methods to achieve it, and real-world examples. What Is Scalability in System Design? Scalability encompasses the capacity of a system to grow and manage increasing workloads without compromising performance. This means that as user traffic, data volume, or computational demands rise, a scalable system can maintain or even enhance its performance. The essence of scalability lies in its ability to adapt to growth without necessitating a complete redesign or significant resource investment Why This Is Important Managing growth. Scalable systems can efficiently handle more users and data without sacrificing speed or reliability. This is particularly important for businesses aiming to expand their customer base.Performance enhancement. By distributing workloads across multiple servers or resources, scalable systems can improve overall performance, leading to faster processing times and better user experiences.Cost-effectiveness. Scalable solutions allow businesses to adjust resources according to demand, helping avoid unnecessary expenditures on infrastructure.Availability assurance. Scalability ensures that systems remain operational even during unexpected spikes in traffic or component failures, which is essential for mission-critical applications.Encouraging innovation. A scalable architecture supports the development of new features and services by minimizing infrastructure constraints. Types of Scalability in General Vertical scaling (scaling up). This involves enhancing the capacity of existing hardware or software components. For example, upgrading a server's CPU or adding more RAM allows it to handle increased loads without changing the overall architecture.Horizontal scaling (scaling out). This method involves adding more machines or instances to distribute the workload. For instance, cloud services allow businesses to quickly add more servers as needed. Challenges Complexity. Designing scalable systems can be complex and may require significant planning and expertise.Cost. Initial investments in scalable technologies can be high, although they often pay off in the long run through improved efficiency.Performance bottlenecks. As systems scale, new bottlenecks may emerge that need addressing, such as database limitations or network congestion. Scalability in Spring Boot Projects Scalability refers to the ability of an application to handle growth — whether in terms of user traffic, data volume, or transaction loads — without compromising performance. In the context of Spring Boot, scalability can be achieved through both vertical scaling (enhancing existing server capabilities) and horizontal scaling (adding more instances of the application). Key Strategies Microservices Architecture Independent services. Break your application into smaller, independent services that can be developed, deployed, and scaled separately. This approach allows for targeted scaling; if one service experiences high demand, it can be scaled independently without affecting others.Spring cloud integration. Utilize Spring Cloud to facilitate microservices development. It provides tools for service discovery, load balancing, and circuit breakers, enhancing resilience and performance under load. Asynchronous Processing Implement asynchronous processing to prevent thread blockage and improve response times. Utilize features like CompletableFuture or message queues (e.g., RabbitMQ) to handle long-running tasks without blocking the main application thread. Asynchronous processing allows tasks to be executed independently of the main program flow. This means that tasks can run concurrently, enabling the system to handle multiple operations simultaneously. Unlike synchronous processing, where tasks are completed one after another, asynchronous processing helps in reducing idle time and improving efficiency. This approach is particularly advantageous for tasks that involve waiting, such as I/O operations or network requests. By not blocking the main execution thread, asynchronous processing ensures that systems remain responsive and performant. Stateless Services Design your services to be stateless, meaning they do not retain any client data between requests. This simplifies scaling since any instance can handle any request without needing session information. There is no stored knowledge of or reference to past transactions. Each transaction is made as if from scratch for the first time. Stateless applications provide one service or function and use a content delivery network (CDN), web, or print servers to process these short-term requests. Database Scalability Database scalability refers to the ability of a database to handle increasing amounts of data, numbers of users, and types of requests without sacrificing performance or availability. A scalable database tackles these database server challenges and adapts to growing demands by either adding resources such as hardware or software, by optimizing its design and configuration, or by undertaking some combined strategy. Type of Databases 1. SQL Databases (Relational Databases) Characteristics: SQL databases are known for robust data integrity and support complex queries.Scalability: They can be scaled both vertically by upgrading hardware and horizontally through partitioning and replication.Examples: PostgreSQL supports advanced features like indexing and partitioning. 2. NoSQL Databases Characteristics: Flexible schema designs allow for handling unstructured or semi-structured data efficiently.Scalability: Designed primarily for horizontal scaling using techniques like sharding.Examples: MongoDB uses sharding to distribute large datasets across multiple servers. Below are some techniques that enhance database scalability: Use indexes. Indexes help speed up queries by creating an index of frequently accessed data. This can significantly improve performance, particularly for large databases. Timescale indexes work just like PostgreSQL indexes, removing much of the guesswork when working with this powerful tool.Partition your data. Partitioning involves dividing a large table into smaller, more manageable parts. This can improve performance by allowing the database to access data more quickly. Read how to optimize and test your data partitions’ size in Timescale.Use buffer cache. In PostgreSQL, buffer caching involves storing frequently accessed data in memory, which can significantly improve performance. This is particularly useful for read-heavy workloads, and while it is always enabled in PostgreSQL, it can be tweaked for optimized performance.Consider data distribution. In distributed databases, data distribution or sharding is an extension of partitioning. It turns the database into smaller, more manageable partitions and then distributes (shards) them across multiple cluster nodes. This can improve scalability by allowing the database to handle more data and traffic. However, sharding also requires more design work up front to work correctly.Use a load balancer. Sharding and load balancing often conflict unless you use additional tooling. Load balancing involves distributing traffic across multiple servers to improve performance and scalability. A load balancer that routes traffic to the appropriate server based on the workload can do this; however, it will only work for read-only queries.Optimize queries. Optimizing queries involves tuning them to improve performance and reduce the load on the database. This can include rewriting queries, creating indexes, and partitioning data. Caching Strategies Caching is vital in enhancing microservices' performance and firmness. It is a technique in which data often and recently used is stored in a separate storage location for quicker retrieval from the main memory, known as a cache. If caching is incorporated correctly into the system architecture, there is a marked improvement in the microservice's performance and a lessened impact on the other systems. When implementing caching: Identify frequently accessed data that doesn't change often — ideal candidates for caching.Use appropriate annotations (@Cacheable, @CachePut, etc.) based on your needs.Choose a suitable cache provider depending on whether you need distributed capabilities (like Hazelcast) or simple local storage (like ConcurrentHashMap).Monitor performance improvements after implementing caches to ensure they're effective without causing additional overheads like stale data issues. Performance Optimization Optimize your code by avoiding blocking operations and minimizing database calls. Techniques such as batching queries or using lazy loading can enhance efficiency. Regularly profile your application using tools like Spring Boot Actuator to identify bottlenecks and optimize performance accordingly. Steps to Identify Bottlenecks Monitoring Performance Metrics Use tools like Spring Boot Actuator combined with Micrometer for collecting detailed application metrics.Integrate with monitoring systems such as Prometheus and Grafana for real-time analysis. Profiling CPU and Memory Usage Utilize profilers like VisualVM, YourKit, or JProfiler to analyze CPU usage, memory leaks, and thread contention.These tools help identify methods that consume excessive resources. Database Optimization Analyze database queries using tools like Hibernate statistics or database monitoring software.Optimize SQL queries by adding indexes, avoiding N+1 query problems, and optimizing connection pool usage. Thread Dump Analysis for Thread Issues Use jstack <pid> command or visual analysis tools like yCrash to debug deadlocks or blocked threads in multi-threaded applications. Distributed Tracing (If Applicable) For microservices architecture, use distributed tracing tools such as Zipkin or Elastic APM to trace latency issues across services. Common Bottleneck Scenarios High Latency Analyze each layer of the application (e.g., controller, service) for inefficiencies. Scenario Tools/Techniques High CPU Usage VisualVM, YourKit High Memory Usage Eclipse MAT, VisualVM Slow Database Queries Hibernate Statistics Network Latency Distributed Tracing Tools Monitoring and Maintenance Continuously monitor your application’s health using tools like Prometheus and Grafana alongside Spring Boot Actuator. Monitoring helps identify performance issues early and ensures that the application remains responsive under load Load Balancing and Autoscaling Use load balancers to distribute incoming traffic evenly across multiple instances of your application. This ensures that no single instance becomes a bottleneck. Implement autoscaling features that adjust the number of active instances based on current demand, allowing your application to scale dynamically Handling 100 TPS in Spring Boot 1. Optimize Thread Pool Configuration Configuring your thread pool correctly is crucial for handling high TPS. You can set the core and maximum pool sizes based on your expected load and system capabilities. Example configuration: Java spring.task.execution.pool.core-size=20 spring.task.execution.pool.max-size=100 spring.task.execution.pool.queue-capacity=200 spring.task.execution.pool.keep-alive=120s This configuration allows for up to 100 concurrent threads with sufficient capacity to handle bursts of incoming requests without overwhelming the system. Each core of a CPU can handle about 200 threads, so you can configure it based on your hardware. 2. Use Asynchronous Processing Implement asynchronous request handling using @Async annotations or Spring WebFlux for non-blocking I/O operations, which can help improve throughput by freeing up threads while waiting for I/O operations to complete. 3. Enable Caching Utilize caching mechanisms (e.g., Redis or EhCache) to store frequently accessed data, reducing the load on your database and improving response times. 4. Optimize Database Access Use connection pooling (e.g., HikariCP) to manage database connections efficiently. Optimize your SQL queries and consider using indexes where appropriate. 5. Load Testing and Monitoring Regularly perform load testing using tools like JMeter or Gatling to simulate traffic and identify bottlenecks. Monitor application performance using Spring Boot Actuator and Micrometer. Choosing the Right Server Choosing the right web server for a Spring Boot application to ensure scalability involves several considerations, including performance, architecture, and specific use cases. Here are key factors and options to guide your decision: 1. Apache Tomcat Type: Servlet containerUse case: Ideal for traditional Spring MVC applications.Strengths: Robust and widely used with extensive community support.Simple configuration and ease of use.Well-suited for applications with a straightforward request-response model.Limitations: May face scalability issues under high loads due to its thread-per-request model, leading to higher memory consumption per request 2. Netty Type: Asynchronous event-driven frameworkUse case: Best for applications that require high concurrency and low latency, especially those using Spring WebFlux.Strengths: Non-blocking I/O allows handling many connections with fewer threads, making it highly scalable.Superior performance in I/O-bound tasks and real-time applications.Limitations: More complex to configure and requires a different programming model compared to traditional servlet-based application. 3. Undertow Type: Lightweight web serverUse case: Suitable for both blocking and non-blocking applications; often used in microservices architectures.Strengths: High performance with low resource consumption.Supports both traditional servlet APIs and reactive programming models.Limitations: Less popular than Tomcat, which may lead to fewer community resources available 4. Nginx (As a Reverse Proxy) Type: Web server and reverse proxyUse case: Often used in front of application servers like Tomcat or Netty for load balancing and serving static content.Strengths: Excellent at handling high loads and serving static files efficiently.Can distribute traffic across multiple instances of your application server, improving scalability Using the Right JVM Configuration 1. Heap Size Configuration The Java heap size determines how much memory is allocated for your application. Adjusting the heap size can help manage large amounts of data and requests. Shell -Xms1g -Xmx2g -Xms: Set the initial heap size (1 GB in this example).-Xmx: Set the maximum heap size (2 GB in this example). 2. Garbage Collection Choosing the right garbage collector can improve performance. The default G1 Garbage Collector is usually a good choice, but you can experiment with others like ZGC or Shenandoah for low-latency requirements. Shell -XX:+UseG1GC For low-latency applications, consider using: Shell -XX:+UseZGC # Z Garbage Collector -XX:+UseShenandoahGC # Shenandoah Garbage Collector 3. Thread Settings Adjusting the number of threads can help handle concurrent requests more efficiently. Set the number of threads in the Spring Boot application properties: Properties files server.tomcat.max-threads=200 Adjust the JVM’s thread stack size if necessary: Shell -Xss512k 4. Enable JIT Compiler Options JIT (Just-In-Time) compilation can optimize the performance of your code during runtime. Shell -XX:+TieredCompilation -XX:CompileThreshold=1000 -XX:CompileThreshold: This option controls how many times a method must be invoked before it's considered for compilation. Adjust according to profiling metrics. Hardware Requirements To support 100 TPS, the underlying hardware infrastructure must be robust. Key hardware considerations include: High-performance servers. Use servers with powerful CPUs (multi-core processors) and ample RAM (64 GB or more) to handle concurrent requests effectively.Fast storage solutions. Implement SSDs for faster read/write operations compared to traditional hard drives. This is crucial for database performance.Network infrastructure. Ensure high bandwidth and low latency networking equipment to facilitate rapid data transfer between clients and servers. Conclusion Performance optimization in Spring Boot applications is not just about tweaking code snippets; it's about creating a robust architecture that scales with growth while maintaining efficiency. By implementing caching, asynchronous processing, and scalability strategies — alongside careful JVM configuration — developers can significantly enhance their application's responsiveness under load. Moreover, leveraging monitoring tools to identify bottlenecks allows for targeted optimizations that ensure the application remains performant as user demand increases. This holistic approach improves user experience and supports business growth by ensuring reliability and cost-effectiveness over time. If you're interested in more detailed articles or references on these topics: Best Practices for Optimizing Spring Boot Application PerformancePerformance Tuning Spring Boot ApplicationsIntroduction to Optimizing Spring HTTP Server Performance

By Mostafa Hosseinzadeh

Disaster Recovery Plan for DevOps

A well-designed disaster recovery plan is critical to mitigate risks, recover swiftly from failures, and ensure your data and infrastructure integrity. Are There Any Myths Related to DR in DevOps? Some organizations still mistakenly assume that DevOps tools, like GitHub, GitLab, Bitbucket, Azure DevOps, or Jira, come with built-in, all-encompassing disaster recovery. However, we shouldn’t forget about the shared responsibility models, which explicitly clarify that while providers secure their infrastructure and smoothly run their services, users must safeguard their own account data. For example, let’s take a look at the quote from the Atlassian Security Practices: “For Bitbucket, data is replicated to a different AWS region, and independent backups are taken daily within each region. We do not use these backups to revert customer-initiated destructive changes, such as fields overwritten using scripts, or deleted issues, projects, or sites. To avoid data loss, we recommend making regular backups.” You may find the same pieces of advice in any SaaS provider’s shared responsibility model. And missteps in this area can lead to severe disruptions, including data loss of critical source code or metadata, reputational damage, and financial setbacks. Challenges Unique to the DevOps Ecosystem While developing your disaster recovery Plan for your DevOps stack, it’s worth considering the challenges DevOps face in this view. DevOps ecosystems always have complex architecture, like interconnected pipelines and environments (e., GitHub and Jira integration). Thus, a single failure, whether due to a corrupted artifact or a ransomware attack, can cascade through the entire system. Moreover, the rapid development of DevOps creates constant changes, which can complicate data consistency and integrity checks during the recovery process. Another issue is data retention policies. SaaS tools often impose limited retention periods – usually, they vary from 30 to 365 days. Thus, for example, if you accidentally delete your repository without having a backup copy of it, you can lose it forever. Why a Disaster Recovery Is a DevOps Imperative The criticality of data is important, but it isn’t the only reason for organizations to develop and improve their Disaster Recovery mechanisms. An effective disaster recovery plan can help organizations: Mitigate the risks, as service outages, cyberattacks, and accidental deletions can lead to prolonged downtime and data loss. Facts and statistics: In 2023, incidents that impacted GitHub users grew by over 21% in comparison to 2022. When it comes to GitLab, about 32% of events were recognized as having an impact on service performance and impacted customers. (Statistics taken from the State of DevOps Threats Report). Align with the compliance and regulatory requirements — for example, ISO 20071, GDPR, or NIS 2 mandate organizations to have robust data protection and recovery mechanisms. Failing to comply may result in heavy fines and legal consequences. Note: In December 2024, the EU Cyber Resilience Act came into force. It means that by December 2027, organizations that provide digital products and services and operate in the European Union should adapt their data protection and incident management within the legislation’s requirements. Reduce or eliminate the cost of downtime, as every minute of system unavailability equates to revenue loss. The average downtime cost can exceed $ 9K per minute, which makes rapid recovery essential. Best Practices for Building a Robust Disaster Recovery Plan Isn’t it crucial that your disaster recovery plan foresee any possible disaster scenario and provide you and your team with all the necessary steps to address the event of failure quickly? Let’s figure out the components of the effective DRP… Assess All the Critical Components You should identify the most critical DevOps assets. They may include source code repositories, metadata, CI/CD pipelines, build artifacts, configuration management files, etc. You need to know what data is the priority to recover in the event of failure. Implement Backup Best Practices It’s impossible to retrieve data without a well-organized backup strategy. Thus, it’s important to follow backup best practices to ensure that you can restore your critical data in any event of failure, including service outage, infrastructure downtime, ransomware attack, accidental deletion, etc. For that reason, your backup solution should allow you to: Automate your backups, by scheduling them with the most appropriate interval between backup copies, so that no data is lost in the event of failure,Provide long-term or even unlimited retention, which will help you to restore data from any point in time,Apply the 3-2-1 backup rule and ensure replication between all the storages, so that in case one of the backup locations fails, you can run your backup from another one, Ransomware protection, which includes AES encryption with your own encryption key, immutable backups, restore and DR capabilities (point-in-time restore, full and granular recovery, restore to multiple destinations, like a local machine, the same or new account, or cross-overly between any of GitHub, GitLab, Bitbucket, and Azure DevOps). Define Your Yecovery Metrics It’s critical for an organization to set its measurable objectives, like RTO or RPO. The Recovery Time Objective (RTO) refers to how quickly your company systems should be operating after the disaster strikes. For example, if your organization establishes its RTO as 8 hours, then in those 8 hours it should resume its normal workflow after an event of a disaster. Usually, the lower the RTO the organization sets, the better it’s prepared for failure. The Recovery Point Objective (RPO) shows the acceptable data loss measured in the time the company can withstand. For example, if the company can easily survive without 3 hours’ worth of data, then its RPO is 3 hours. The lower the RPO you have, the more frequent backups your organization should have. Regularly Test and Validate Your Backup and Restore Operations With regular test restores, you can ensure your backup integrity and have peace of mind that in case of a failure, you can retrieve your data fast. Moreover, it’s worth simulating failures. It will help your organization evaluate its DRP efficacy in the face of simulated outages, ransomware attacks, or other disasters. Educate Your Team Panic is the worst when it comes to a disaster. Thus, each member of your team should understand what he or she should do in such a situation. Set up responsibilities and roles on who should perform restore operations and who should communicate about the disaster. Your organization should have a thoroughly built communication plan for disasters that states the communication strategy and people responsible for informing stakeholders and other possibly impacted parties, and templates for such a communication. Case Studies of DRP in DevOps Let’s look at case studies of how a DRP can help to avoid the devastating consequences of disasters: Service Outages A big digital corporation fully relies on GitHub (there may be any other service provider, like GitLab, Atlassian, or Azure DevOps). Suddenly, the company understands that the service provider is experiencing an outage… yet, the company needs to continue its operations as fast as possible — let’s not forget that the average cost of downtime is $9K per minute. Having a comprehensive DRP, the organization restores its data from the latest backup copy, using the point-in-time restore, to GitLab (or Bitbuket or Azure DevOps). Thus, the organization resumes its operations fast, eliminates data loss, and ensures minimal downtime. Tip: In such a situation, your backup solution should also allow you to restore your data to your local machine to resume business continuity as quickly as possible. Human Error vs. Infrastructure Downtime A developer pushes the incorrect data and accidentally overwrites critical files. The entire situation paralyzes the company’s workflow and leads to downtime. Hopefully, the organization’s DRP foresees such a situation, by following the 3-2-1 backup rule. Thus, the company’s IT team runs the backup from another storage to ensure business continuity. Ransomware Attack A mid-sized software company faces a ransomware attack encrypting its primary Git repositories. Having implemented an efficient DRP with automated backups and ransomware-proof features, such as immutable backups, the company manages to restore its data from the point in time when its data wasn’t corrupted. The result? The company retrieves its operations within hours, avoiding a multi-million-dollar ransom demand and minimizing downtime. Takeaway A disaster recovery plan is a strategic necessity for organizations nowadays. Beyond protecting data, it helps organizations ensure compliance, build customer trust, and reduce financial risks. Backup strategy should become a comprehensive basis for any DRP, even the most demanding one. Thus, you should be able to: Set up backup policies to automate backup processes within the most demanding RTOs and RPOs,Keep data in multiple locations, meeting the 3-2-1 backup rule,Have secure ransomware protection mechanisms,Monitor backup performance due to data-driven dashboards, Slack/email notifications, SLA, compliance reports, etc.,Have test restores,Restore data in any event of failure as the solution foresees any DR scenario and provides robust restore capabilities, including full data recovery, granular restore, point-in-time recovery, restore to the same or a new account, restore to your local instance, andEnsure compliance and cyber resilience.

By Daria Kulikova

Migrating Java Microservices to Go: A Comprehensive Guide

With the rising demand for high-performance, scalable, and resource-efficient microservices, many organizations are exploring the transition from Java to Go (Golang). Java, a long-standing enterprise favorite, offers robustness and a vast ecosystem, but Go’s lightweight concurrency model, fast execution speed, and lower memory footprint make it an attractive alternative. This guide explores why and how to migrate Java microservices to Go effectively. Migrating Java microservices to Go Why Migrate From Java to Go? Performance gains. Go’s compiled nature eliminates JVM overhead, leading to faster execution and lower memory consumption. Go routines enable efficient concurrency with minimal resource overhead.Resource efficiency. Go’s lightweight binaries reduce memory usage, optimize cloud deployment, and lower infrastructure costs.Faster development and deployment. Simple syntax and static linking minimize complexity, accelerate development, and streamline deployment.Better scalability. Go routines allow massive concurrency with minimal overhead, making Go ideal for cloud-native and high-performance applications. Key Considerations Before Migration Assess Microservices Dependencies Assessment of the current microservice dependencies is critical to the success of any migration effort; therefore, before migrating from Java to Go, Java-based applications that may include assorted libraries, frameworks, and APIs that usually have no analogs in Go need to be evaluated. A detailed assessment will help find Go alternatives to several essential dependencies, such as authentication, logging, and messaging queues. An understanding in advance of these dependencies will help prevent compatibility issues and ensure a changeover as smooth as possible. Go libraries should also be checked for the same performance and stability features their Java counterparts provide. Analyze Business Logic Different enterprises also may avoid entire rewrites of Java applications but focus on migrating extremely core business logic that could derive the fastest performance benefits from Go. A selective strategy for migration of the core business risks will mitigate risks and significantly reduce the development time of the migration, as well as ensure better transition efficiency. A close examination of the business logic will assist developers in deciding which Android components can take advantage of Go's concurrency model while maintaining less-critical functions in Kotlin or Java. This mode of migration would allow teams to gradually adopt Go as a technology without disturbing already-in-place services. Evaluate Data Persistence Layers Compatibility with the database seems to be a key factor in understanding migration, many involving applications built on ORM-framed technologies such as Hibernate that might require rationalization when transitioning to Go. Go-based ORMs like GORM and sqlx provide similar capabilities but might require modifications in respect of the way a database is interacted with. It is mandatory to ensure that the existing databases will be supported by Go database drivers and ORMs if data reliability and performance are to be in environment preservation. Tests should be conducted to make sure that database queries, transactions, and indexing systems render into that comparative language properly and perform efficiently under benign conditions. Benchmark and Plan Performance Testing Before migration begins, there should be a number of performance benchmarks set up to compare implementations of Java to Go. That is, one needs to measure response times, memory consumption, CPU usage, and concurrency handling under load. This helps developers by running stress tests on both Java and Go versions of microservices to ascertain whether the migration provides noticeably better performance. Before migrating, performance benchmarking constitutes the check to ascertain problem areas for further optimization and ensures that Go-based microservices match or exceed their performance bar. Migration Steps 1. Identify and Crumble Services The migration process would likely begin with choosing microservices that will benefit most from Go’s efficient execution. These would be services dedicated to interfacing as stateless services or computational-heavy workloads. Developers have to set up microservice boundaries and isolate business logic from Java-specific dependencies for a modular migration approach. Decomposing services into smaller, independent components allows a more convenient, gradual, full transition and independent testing of each migrated service. 2. Set Up a Go Environment For development purposes, organizations need to set up a Go development environment. This includes installing Go, setting up Go modules for dependency management, and selecting appropriate frameworks for specific use cases. Popular Go frameworks include Gin for building web APIs, GORM for database interactions, and Cobra for building command line interfaces. Having a standardized development setup can guarantee that the migration carries the same spirit and purpose, applying immediately for Java developers transitioning to Go. 3. Rewriting Core Business Logic Rewriting Java code to Go consists primarily of converting Java classes and methods into Go structs and functions. Developers will need to get a feel for Go’s simpler and more idiomatic approach to structuring code. The multi-threaded Java model should ideally be substituted with concurrent lightweight Go routines in Go. Additionally, Java annotations should be substituted with Go’s native struct tags for JSON serialization and other functionalities. By leveraging Go’s built-in features, developers can create more efficient and maintainable code. 4. Implement API Layer and Middleware The microservices design demands a sound API layer for its service communication. In most Java applications, Spring Boot is commonly used for API development and dependency injection, respectively. Either the Gin or Fiber APIs define the routes and the middleware for Go. Typically, in contrast to Java, which is based on annotations for dependency injection, Go works with interface-based dependency handling to provide better flexibility for API interaction. Developers must ensure that authentication, authorization, and request-handling mechanisms are implemented effectively in Go to maintain security and performance. 5. Integrate Database and Caching Layers Java to Go involves replacing Java ORM frameworks like Hibernate with Go equivalents like GORM or sqlx. These ORM frameworks provide better database connection support for Go applications while maintaining ease of use and performance. For better performance, one would prefer to use a caching mechanism in the case of microservices that need data persistence and retrieval. Go's database drivers enable integration with Redis, PostgreSQL, or MongoDB. Proper indexing, connection pooling, and query optimization must be considered when migrating. 6. Implement Logging and Monitoring Observability really helps debug and maintain microservices. Java applications typically employ a logging framework such as Log4j or SLF4J, while Go has lightweight offerings such as Logrus and Zap. Using structured logging with Go gives much better log management and traceability. Monitoring tools such as Prometheus and Grafana can also be integrated with Go microservices for monitoring metrics, visualization of the performance trends, and issue detection. Proper logging and monitoring allow teams to keep an eye on the system performance once the migration is complete. 7. Containerization and Deployment Once the microservices have been migrated to Go, they must be packaged and deployed efficiently. Java-based microservices often rely on large Docker images due to the JVM, increasing resource consumption. In contrast, Go applications can be compiled into small, self-contained binaries that significantly reduce container size. Using lightweight base images, such as FROM golang:alpine, helps ensure minimal dependencies and faster startup times. Kubernetes is mostly used for deployment, controlling active microservices across production environments. This can be coupled with Helm charts for automated deployment configurations, and Istio service mesh is material in traffic management and service discovery. By adhering to best practices in containerization and deployment, organizations can ensure that their Go-based microservices are cloud-native, well-managed, and scalable. Challenges and Best Practices Challenges Learning curve. Developers require time to get accustomed to Go's syntax and conventions.Garbage collection. Go has a lower latency garbage collection but requires manual memory management.Library support. Many Java frameworks also lack direct equivalents in Go, meaning workarounds need to be devised. Best Practices Run pilot microservices first to make sure your approach will validate less critical microservices first.Use automated testing during development to ensure functional parity with unit integration and load test.Transition gradually. Leave some hybrid java-go microservices to reduce risks.Performance tuning. Go's built tools might be utilized for profiling and fine-tuning. Conclusion The migration of Java microservices to Go can deliver significant performance boosts, drive down infrastructure costs, and make deployment easier. A proper and linearly planned transition could lead to a smooth migration, which is highly recommended. Additionally, making use of concurrency management along with efficient memory management and lightweight deployment will enhance building faster and more scalable microservices tuned for modern cloud settings.

By sairamakrishna Karri

How to Deploy Karpenter on AWS Kubernetes With kOps

kOps is a widely used tool for deploying and managing Kubernetes clusters in multi-cloud or hybrid cloud environments. It provides a unified configuration system (YAML or JSON), which lets you easily set up clusters across AWS, GCP, Azure, and on-premises environments. With flexible customization options, kOps lets you adjust everything from control plane and worker node operating systems to network plugins (like Calico and Cilium) and storage solutions, which makes it an excellent fit for complex setups. To optimize Kubernetes resource efficiency, many teams choose Karpenter — an open-source autoscaler that provisions nodes dynamically based on workload demands. It supports multiple instance types, schedules AWS Spot Instances to cut costs, and eliminates the need for predefined node groups, offering greater flexibility. However, kOps no longer provides official support for Karpenter, meaning its latest versions require manual setup to integrate with Karpenter. This blog walks you through the step-by-step process of deploying Karpenter on a kOps-managed AWS Kubernetes cluster, helping you enable automatic scaling and improve resource efficiency. Prerequisites Before you begin, ensure you have the following: An AWS account with IAM permissions to create EC2 instancesAWS CLI installed and configuredkubectl (Kubernetes CLI) installedHelm (Kubernetes package manager) installedkOps installed Create a Cluster With kOps 1. Configure the Cluster Before creating the cluster, you need to specify the AWS region and cluster name. To simplify deployment, we will use a Gossip-based DNS cluster. If you prefer to use your own domain for the cluster, follow the official guide. Shell export DEPLOY_REGION="us-west-1" export CLUSTER_NAME="demo1" export DEPLOY_ZONE="us-west-1a" export NAME=${CLUSTER_NAME}.k8s.local 2. Create a kOps IAM User To create a Kubernetes cluster with kOps, you need a dedicated IAM user with the necessary permissions. This section will guide you through creating an IAM user named kops using the AWS CLI. Shell aws iam create-group --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonRoute53FullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/IAMFullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonVPCFullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonSQSFullAccess --group-name kops aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonEventBridgeFullAccess --group-name kops aws iam create-user --user-name kops aws iam add-user-to-group --user-name kops --group-name kops aws iam create-access-key --user-name kops 3. Export AWS Access Key and Secret Key To authenticate kOps with AWS, you need to export your Access Key and Secret Key. For simplicity, this guide does not switch users explicitly. You can manually switch to the kOps IAM user using: Shell export AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id) export AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key) 4. Create an S3 Bucket for Cluster State kOps requires a dedicated S3 bucket to store cluster state and configuration. This bucket serves as the single source of truth for managing your cluster. Shell export KOPS_STATE_STORE_NAME=kops-state-store-${CLUSTER_NAME} export KOPS_OIDC_STORE_NAME=kops-oidc-store-${CLUSTER_NAME} export KOPS_STATE_STORE=s3://${KOPS_STATE_STORE_NAME} aws s3api create-bucket \ --bucket ${KOPS_STATE_STORE_NAME} \ --region ${DEPLOY_REGION} \ --create-bucket-configuration LocationConstraint=${DEPLOY_REGION} aws s3api create-bucket \ --bucket ${KOPS_OIDC_STORE_NAME} \ --region ${DEPLOY_REGION} \ --create-bucket-configuration LocationConstraint=${DEPLOY_REGION} \ --object-ownership BucketOwnerPreferred aws s3api put-public-access-block \ --bucket ${KOPS_OIDC_STORE_NAME} \ --public-access-block-configuration BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false aws s3api put-bucket-acl \ --bucket ${KOPS_OIDC_STORE_NAME} \ --acl public-read 5. Create the Cluster The following command creates the cluster configuration without starting the build process. This is the most basic example: Shell kops create cluster \ --name=${NAME} \ --cloud=aws \ --node-count=1 \ --control-plane-count=1 \ --zones=${DEPLOY_ZONE} \ --discovery-store=s3://${KOPS_OIDC_STORE_NAME}/${NAME}/discovery We are now at the final step of building the cluster, which may take a while. Once the process is complete, you'll need to wait for the instances to finish downloading the Kubernetes components and reach the Ready state. Shell kops update cluster --name ${NAME} --yes --admin kops export kubeconfig # waiting for Ready kops validate cluster --wait 10m --name ${NAME} Deploy Karpenter 1. Prepare Before deploying Karpenter, you'll need to set up several environment variables for configuring NodePool and NodeClass. Use the AWS CLI to retrieve the OIDC Provider information, including the Issuer URL and AWS Account ID, to ensure smooth deployment: Shell export OIDC_PROVIDER_ID=$(aws iam list-open-id-connect-providers \ --query "OpenIDConnectProviderList[?contains(Arn, '${NAME}')].Arn" \ --output text | awk -F'/' '{print $NF}') export OIDC_ISSUER=${KOPS_OIDC_STORE_NAME}.s3.${DEPLOY_REGION}.amazonaws.com/${NAME}/discovery/${NAME} export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' \ --output text) export AWS_INSTANCE_PROFILE_NAME=nodes.${NAME} export KARPENTER_ROLE_NAME=karpenter.kube-system.sa.${NAME} export CLUSTER_ENDPOINT=$(kubectl config view -o jsonpath="{.clusters[?(@.name=='${NAME}')].cluster.server}") # Storage of temporary documents for subsequent needs export TMP_DIR=$(mktemp -d) 2. Create a Karpenter IAM Role To allow Karpenter to dynamically manage AWS resources (such as EC2 instances) based on Kubernetes workload requirements, you need to create a dedicated IAM Role with the appropriate policies. This role will use OIDC authentication to grant Karpenter the necessary permissions. Shell aws iam create-role \ --role-name ${KARPENTER_ROLE_NAME} \ --assume-role-policy-document "{ \"Version\": \"2012-10-17\", \"Statement\": [ { \"Effect\": \"Allow\", \"Principal\": { \"Federated\": \"arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/oidc.eks.${DEPLOY_REGION}.amazonaws.com/id/${OIDC_PROVIDER_ID}\" }, \"Action\": \"sts:AssumeRoleWithWebIdentity\", \"Condition\": { \"StringEquals\": { \"oidc.eks.${DEPLOY_REGION}.amazonaws.com/id/${OIDC_PROVIDER_ID}:sub\": \"system:serviceaccount:kube-system:karpenter\" } } } ] }" aws iam create-role \ --role-name ${KARPENTER_ROLE_NAME} \ --assume-role-policy-document "{ \"Version\": \"2012-10-17\", \"Statement\": [ { \"Effect\": \"Allow\", \"Principal\": { \"Federated\": \"arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${KOPS_OIDC_STORE_NAME}.s3.us-west-1.amazonaws.com/${NAME}/discovery/${NAME}\" }, \"Action\": \"sts:AssumeRoleWithWebIdentity\", \"Condition\": { \"StringEquals\": { \"${OIDC_ISSUER}:sub\": \"system:serviceaccount:kube-system:karpenter\" } } } ] }" aws iam put-role-policy \ --role-name ${KARPENTER_ROLE_NAME} \ --policy-name InlineKarpenterPolicy \ --policy-document '{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ec2:CreateFleet", "ec2:CreateTags", "ec2:DescribeAvailabilityZones", "ec2:DescribeImages", "ec2:DescribeInstanceTypeOfferings", "ec2:DescribeInstanceTypes", "ec2:DescribeInstances", "ec2:DescribeLaunchTemplates", "ec2:DescribeSecurityGroups", "ec2:DescribeSpotPriceHistory", "ec2:DescribeSubnets", "ec2:RunInstances", "ec2:TerminateInstances", "iam:PassRole", "pricing:GetProducts", "ssm:GetParameter", "ec2:CreateLaunchTemplate", "ec2:DeleteLaunchTemplate", "sts:AssumeRoleWithWebIdentity" ], "Resource": "*" } ] }' 3. Deploy Karpenter First, we need to configure additional settings to restrict Karpenter to run only on the control plane, and bind it to the resources we set up earlier, such as the clusterEndpoint, clusterName, and most importantly, the IAM Role. Shell cat <<EOF > ${TMP_DIR}/values.yaml serviceAccount: annotations: "eks.amazonaws.com/role-arn": "arn:aws:iam::${AWS_ACCOUNT_ID}:role/${KARPENTER_ROLE_NAME}" replicas: 1 affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: node-role.kubernetes.io/control-plane operator: Exists podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: "kubernetes.io/hostname" tolerations: - key: CriticalAddonsOnly operator: Exists - key: node-role.kubernetes.io/master operator: Exists - key: node-role.kubernetes.io/control-plane operator: Exists - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 extraVolumes: - name: token-amazonaws-com projected: defaultMode: 420 sources: - serviceAccountToken: audience: amazonaws.com expirationSeconds: 86400 path: token controller: containerName: controller image: repository: docker.io/vacanttt/kops-karpenter-provider-aws tag: latest digest: sha256:24ef24de6b5565df91539b7782f3ca0e4f899001020f4c528a910cefb3b1c031 env: - name: AWS_REGION value: us-west-1 - name: AWS_DEFAULT_REGION value: us-west-1 - name: AWS_ROLE_ARN value: arn:aws:iam::${AWS_ACCOUNT_ID}:role/${KARPENTER_ROLE_NAME} - name: AWS_WEB_IDENTITY_TOKEN_FILE value: /var/run/secrets/amazonaws.com/token extraVolumeMounts: - mountPath: /var/run/secrets/amazonaws.com/ name: token-amazonaws-com readOnly: true logLevel: debug settings: clusterName: ${NAME} clusterEndpoint: ${CLUSTER_ENDPOINT} featureGates: spotToSpotConsolidation: true nodeRepair: false EOF To deploy Karpenter to the kube-system namespace, you can use the following Helm commands: Shell export KARPENTER_NAMESPACE="kube-system" helm upgrade --install karpenter \ oci://public.ecr.aws/karpenter/karpenter \ --namespace "${KARPENTER_NAMESPACE}" --create-namespace \ --wait -f $TMP_DIR/values.yaml 4. Create NodePool/NodeClass To register new nodes with your cluster, you need to use the LaunchTemplate managed by kOps and configure its userData for the Karpenter EC2NodeClass. Follow these commands: Shell export NODE_INSTANCE_GROUP=$(kops get instancegroups --name ${NAME} | grep Node | awk '{print $1}') export NODE_LAUNCH_TEMPLATE_NAME=${NODE_INSTANCE_GROUP}.${NAME} export USER_DATA=$(aws ec2 describe-launch-templates --region ${DEPLOY_REGION} --filters Name=launch-template-name,Values=${NODE_LAUNCH_TEMPLATE_NAME} \ --query "LaunchTemplates[].LaunchTemplateId" --output text | \ xargs -I {} aws ec2 describe-launch-template-versions --launch-template-id {} --region ${DEPLOY_REGION} \ --query "LaunchTemplateVersions[].LaunchTemplateData.UserData" --output text | base64 --decode) Before applying the NodeClass and NodePool configurations, you can temporarily store them for review or additional configuration. Shell cat <<EOF > ${TMP_DIR}/nodeclass.yaml apiVersion: karpenter.k8s.aws/v1 kind: EC2NodeClass metadata: name: default spec: associatePublicIPAddress: true amiFamily: AL2 tags: kops.k8s.io/instancegroup: ${NODE_INSTANCE_GROUP} KubernetesCluster: ${NAME} k8s.io/role/node: "1" aws-node-termination-handler/managed: "" k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/node: "" subnetSelectorTerms: - tags: KubernetesCluster: ${NAME} securityGroupSelectorTerms: - tags: Name: nodes.${NAME} KubernetesCluster: ${NAME} amiSelectorTerms: - name: "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20241211" instanceProfile: nodes.${NAME} userData: | $(echo "$USER_DATA" | sed 's/^/ /') EOF cat <<EOF > ${TMP_DIR}/nodepool.yaml apiVersion: karpenter.sh/v1 kind: NodePool metadata: name: default spec: template: spec: requirements: - key: kubernetes.io/arch operator: In values: ["amd64"] - key: kubernetes.io/os operator: In values: ["linux"] - key: karpenter.sh/capacity-type operator: In values: ["on-demand", "spot"] nodeClassRef: group: karpenter.k8s.aws kind: EC2NodeClass name: default expireAfter: 720h limits: cpu: 4 disruption: consolidationPolicy: WhenEmptyOrUnderutilized consolidateAfter: 1m EOF Apply NodeClass and NodePool to the cluster: Shell kubectl apply -f ${TMP_DIR}/nodeclass.yaml kubectl apply -f ${TMP_DIR}/nodepool.yaml Create a Workload to Test AutoScaling To test Karpenter's autoscaling functionality, create a Workload with four replicas that request specific resources. In this scenario, two replicas should be pending due to insufficient resources. Shell cat <<EOF > ${TMP_DIR}/workload.yaml apiVersion: apps/v1 kind: Deployment metadata: name: workload namespace: default labels: app: workload spec: replicas: 4 selector: matchLabels: app: workload template: metadata: labels: app: workload spec: containers: - name: pause image: public.ecr.aws/eks-distro/kubernetes/pause:3.7 resources: requests: cpu: "550m" memory: "128Mi" EOF Shell kubectl apply -f ${TMP_DIR}/workload.yaml You can check if any NodeClaims have been created. After approximately 70s of NodeClaim creation, new nodes will be registered to the cluster. Delete the Cluster Running a Kubernetes cluster on AWS incurs ongoing costs. If you've completed your experiment, you may want to delete the cluster to avoid unnecessary charges. To permanently delete your cluster, use the following command with the --yes flag. Shell kops delete cluster --name ${NAME} --yes ⚠ Warning: This command is destructive—it will remove your entire cluster and all associated resources. Ensure you have backed up any important data before proceeding. Conclusion The combination of kOps and Karpenter brings powerful automation to Kubernetes cluster management but also comes with certain limitations. Advantages Karpenter dynamically provisions nodes based on actual Pod requirements, improving resource utilization and enabling a rapid response to workload changes. This helps prevent both resource waste and shortages. Additionally, it supports a wide range of instance types, allowing users to select the most suitable option for their workloads to optimize performance and cost. Limitations However, this setup has some constraints. Since EKS’s bootstrap.sh script cannot be used, Kubelet configurations are controlled by kOps, preventing custom Kubelet parameters within NodeClass. Additionally, the control plane nodes must be managed via Auto Scaling Groups (ASG) rather than Karpenter, limiting their flexibility. Moreover, Karpenter requires at least one InstanceGroup to function properly — without it, new nodes will fail to register with the cluster, adding to the configuration complexity. Despite these limitations, kOps and Karpenter remain a powerful combination for dynamic scaling and multi-instance support. Still, careful planning is required to address these constraints and ensure a smooth deployment. If you are interested in more tutorials on Karpenter, feel free to follow Awesome Karpenter on LinkedIn.

By Paul Muller

An Overview of Asynchronous Processing in Salesforce

When running a process in Salesforce, the first question you should ask is whether to execute it synchronously or asynchronously. If the task can be delayed and doesn't require an immediate result, it's always beneficial to leverage Salesforce's asynchronous processes, as they offer significant advantages to your technical architecture. What Is Asynchronous Processing? Asynchronous processes run in their own thread, allowing the task to complete without keeping the user waiting. Here are the key benefits: 1. Improved User Experience and Performance Since asynchronous processes run in the background, users don’t have to wait for them to finish. This enables users to continue their work uninterrupted while also improving page load times and overall system performance. 2. Higher Limits Salesforce imposes strict limits on synchronous transactions, such as the number of queries or DML operations per transaction. Asynchronous processing provides additional execution limits, giving you more room to scale your operations. 3. Scalability By offloading complex or resource-intensive tasks to run in the background, asynchronous processing helps your business scale efficiently without compromising system performance. Here are the various tools Salesforce provides for asynchronous processing. Key Considerations So far, we've explored asynchronous processing, its benefits, and the various tools Salesforce provides for these tasks. However, there are several important factors to consider before choosing asynchronous processing for your business needs. 1. Asynchronous Processing Has No SLA Keep in mind that asynchronous processes have no guaranteed Service Level Agreement (SLA). They run in their own thread whenever resources are available, meaning there’s no assurance the task will complete within a specific time frame. As such, it's important to avoid using asynchronous processing for business flows that are time-sensitive. 2. Choose the Right Tool for the Job As mentioned earlier, Salesforce offers various tools for asynchronous processing, each designed for different use cases. Understanding the strengths and limitations of each tool will help you select the most suitable one for your specific needs. 3. Bulkify Your Code While Salesforce provides higher execution limits for asynchronous processes, it's still critical to bulkify your operations. This ensures your code can efficiently handle multiple records, staying within the platform's limits and maintaining performance. 4. Implement Error Handling and Monitoring Since asynchronous jobs run in the background, robust error handling and monitoring are essential. Use try-catch blocks and log errors to custom objects for easier tracking. Additionally, implement retry logic to manage intermittent failures, such as API callout issues, to ensure the reliability of your processes. Asynchronous Tools Deep Dive Let's take a closer look at each of the asynchronous tools in Salesforce. Future Methods Future methods are used for operations that can run independently in their own thread. One common use case is executing external web service callouts where the user doesn't need to wait for the operation to complete. To define a future method, simply annotate the method with the @future annotation. Java public class FutureClass { @future public static void myFutureMethod(list<id> recordIds){ //long-running operation code } } Key points about future methods: Static. Future methods must be static.Void return type. They can only return void.Primitive data types. Future methods can only accept primitive data types (e.g., String, Integer, Id). They cannot take complex objects as parameters.Callouts. If the future method needs to make a web service callout, you must include the callout=true attribute in the annotation to allow the method to perform the callout. Java public class FutureClass { @future(callout = true) public static void myFutureMethod(){ //callout here } } Queueable Apex Similar to future methods, Queueable Apex allows you to run operations in their own thread, making it ideal for long-running tasks. It leverages the Apex job queue and provides a more flexible and powerful approach to asynchronous processing compared to future methods. To implement Queueable Apex, your class must implement the Queueable interface. The class should always define an execute method, which contains the logic for the asynchronous operation. Java public class QueueableClass implements Queueable{ public void execute(QueueableContext context){ //long-running operation } } You can execute Queueable Apex by calling System.enqueueJob(), which adds the job to the queue and returns a job ID. This job ID can be used to monitor the job's status by querying the AsyncApexJob object. Java ID jobID = System.enqueueJob(new QueueableClass()); Key points about Queueable Apex: Non-primitive data types. Unlike future methods, you can use non-primitive data types as member variables in a Queueable class.Job chaining. Queueable jobs can be chained, allowing you to start a second job from an already executing one, creating a sequence of asynchronous operations. Batch Apex When you need to run asynchronous operations on a large number of records, Batch Apex is the ideal solution. It divides your large dataset into smaller, manageable chunks for processing. To implement Batch Apex, your class must implement the Database.Batchable interface. The Batch Apex class should define three methods: start, execute, and finish. 1. Start Method This method is executed at the beginning of the batch job. It should contain the SOQL query to collect the large dataset and return a QueryLocator. The governor limit on the total number of records retrieved by a SOQL query is bypassed in the start method when using Batch Apex. Java public Database.QueryLocator start(Database.BatchableContext bc) {} 2. Execute Method The execute method is called for each batch of records. Note that the order in which records are processed is not guaranteed. Java public void execute(Database.BatchableContext bc, list<Object>){} 3. Finish Method The finish method is called after all batches of records have been processed. It’s typically used for post-processing tasks, such as sending notification emails. Each execution of the batch job is considered a single transaction, and governor limits are reset for each batch. To trigger a batch job, use the Database.executeBatch method. This adds the batch job to the asynchronous queue. The Database.executeBatch method takes two parameters: An instance of the Batch Apex class.An optional batch size parameter, which specifies the number of records per batch. The maximum batch size you can specify is 2000. Java ID batchprocessid = Database.executeBatch(new BatchApex(),2000); Best Practices When Using Asynchronous Apex Avoid Triggering Future or Queueable Methods in High-Volume Processes Be cautious when triggering future or queueable methods from processes that could exhaust daily asynchronous limits, such as Apex triggers. These processes can quickly consume available asynchronous resources. Optimize Performance Ensure the performance of future or queueable methods is optimized. This includes: Optimizing queries to reduce processing time.Minimizing web service callout durations.Streamlining any associated logic, such as triggers or flows, to prevent bottlenecks. Use Batch Apex for Large Data Volumes For processing large numbers of records, always prefer Batch Apex over future or queueable methods. Batch Apex is designed to handle massive data sets efficiently, while future and queueable methods are better suited for smaller tasks. Queueable Apex Provides Greater Flexibility Queueable Apex offers more control over job execution compared to future methods, such as the ability to chain jobs together or handle larger data volumes more efficiently. Conclusion In conclusion, asynchronous Apex in Salesforce is a powerful tool for handling long-running, resource-intensive processes while maintaining system performance and user experience. By understanding the different asynchronous methods available — such as Future Methods, Queueable Apex, and Batch Apex — and following best practices, you can design efficient, scalable solutions that optimize both your code and your system’s resources. Remember to consider factors like governor limits, performance optimization, and error handling to ensure your asynchronous jobs run smoothly and reliably.

By Jaseem Pookandy

Have You Heard About Cloud Native Buildpacks?

Do you ever get tired of fiddling with a Dockerfile? Dockerfiles and Docker images are a great way to package your app for reusable, containerized deployments. However, writing and maintaining a Dockerfile is not always intuitive, and it takes up time that could otherwise be used for adding features to your app. Enter Cloud Native Buildpacks. Buildpacks exist to pull together everything your app needs to run and put it into an Open Container Initiative (OCI) image — no Dockerfile required. For all the developers out there who need a container build process that’s easy to use and will save them time and headaches, Cloud Native Buildpacks might be the solution they’re looking for. Interested? I’ll tell you more. What Are Cloud Native Buildpacks? Broadly speaking, a buildpack takes an application code and makes it runnable through a build process. So then, Cloud Native Buildpacks (CNBs) take your application source code and turn it into runnable, reproducible OCI images, implementing your requirements for image security, performance optimization, and container build order. It’s like having the exact Dockerfile you need — only you don’t need to write one. While most developers can write a Dockerfile, few are experts in either Docker or infrastructure. Too many apps have Dockerfiles that are cobbled together from code snippets found across the web — often a mash-up of Copilot, Stack Overflow, and ChatGPT. Dockerfile errors can lead to insecure and poorly performing applications. Cloud Native Buildpacks take on this burden, automatically applying best practices for each language or framework. A builder can then utilize any number of buildpacks, automatically detecting which buildpacks are needed and applying them to build an application. Here are the buildpacks that Heroku’s builder currently supports: Shell $ pack builder inspect heroku/builder:24 Inspecting builder: heroku/builder:24 REMOTE: Description: Ubuntu 24.04 AMD64+ARM64 base image with buildpacks for .NET, Go, Java, Node.js, PHP, Python, Ruby & Scala. ... Buildpacks: ID NAME VERSION heroku/deb-packages Heroku .deb Packages 0.0.3 heroku/dotnet Heroku .NET 0.1.10 heroku/go Heroku Go 0.5.2 heroku/gradle Heroku Gradle 6.0.4 heroku/java Heroku Java 6.0.4 heroku/jvm Heroku OpenJDK 6.0.4 heroku/maven Heroku Maven 6.0.4 heroku/nodejs Heroku Node.js 3.4.5 heroku/nodejs-corepack Heroku Node.js Corepack 3.4.5 heroku/nodejs-engine Heroku Node.js Engine 3.4.5 heroku/nodejs-npm-engine Heroku Node.js npm Engine 3.4.5 heroku/nodejs-npm-install Heroku Node.js npm Install 3.4.5 heroku/nodejs-pnpm-engine Heroku Node.js pnpm Engine 3.4.5 heroku/nodejs-pnpm-install Heroku Node.js pnpm install 3.4.5 heroku/nodejs-yarn Heroku Node.js Yarn 3.4.5 heroku/php Heroku PHP 0.2.0 heroku/procfile Heroku Procfile 4.0.0 heroku/python Heroku Python 0.23.0 heroku/ruby Heroku Ruby 5.0.1 heroku/sbt Heroku sbt 6.0.4 heroku/scala Heroku Scala 6.0.4 Other builders, like the ones from Paketo or Google Cloud, also bring an array of buildpacks. All in all, the Cloud Native Buildpacks ecosystem is growing and maturing, which is exciting for developers! Those of you who are familiar with Heroku have already been enjoying the buildpack experience. With git push heroku main, you’ve been able to deploy directly to Heroku, with no Dockerfile required. Cloud Native Buildpacks build on the Heroku buildpack experience, taking what was once a vendor-specific implementation and turning it into a CNCF standard that’s usable on any cloud platform. In short, Cloud Native Buildpacks allow developers to: Deploy applications more easily than ever… in a standard-based fashion without lock-in… all while applying container best practices… and without making developers tinker with Dockerfiles. Use Cases Sounds great, right? With all these benefits, let’s look at some specific cases where you could benefit from using Cloud Native Buildpacks. Any place where you would ordinarily need a Dockerfile is an opportunity to use a buildpack. Examples include: A Node.js web applicationA Python microserviceA heterogeneous application that uses multiple languages or frameworksBuilding applications for deployment on cloud platforms such as AWS, Azure, and Heroku One thing to note is this: While buildpacks are declarative, Dockerfiles are procedural. With a buildpack, you simply declare that you want a given application built with a given builder or buildpack. In contrast, a Dockerfile requires you to define the commands and the order in which those commands are run to build your application. As such, buildpacks don’t currently offer the level of configurability that’s available within a Dockerfile, so it might not meet the needs of some more advanced use cases. That said, there is no vendor lock-in with Cloud Native Buildpacks. They simply build an OCI image. Need more customization and options than are available in the buildpack? Simply replace the builder in your build pipeline with your Dockerfile and a standard OCI image build, and you are good to go. A Simple Walkthrough Let’s do a quick walkthrough of how to use Cloud Native Buildpacks. To get started with buildpacks as an app developer, your first step should be to install the Pack CLI tool. This tool allows you to build an application with buildpacks. Follow the installation instructions for your operating system. Additionally, if you don’t have it already, you’ll need a Docker daemon for the builder to build your app, and for you to run your image. With these two tools installed, you’re ready to begin. Build a Sample App With access to the pack tool, you’re ready to try it out by building a sample application. I’ll be running this inside a Next.js application. Need a sample application to test out the buildpack on? Here is a full directory of Next.js sample applications. You can also try out any application you have on hand. Once you have your application ready, start by seeing what builder the pack tool suggests. In your shell, navigate to your app directory and run this command: Shell $ pack builder suggest On my Ubuntu installation, for my Next.js application, the pack tool suggests the following builders: Let’s try the suggested Heroku buildpack (heroku/builder:24). To use this one, run the following command: Shell $ pack build my-app --builder heroku/builder:24 Build time will vary depending on the size of your application; for me, building the app took 30 seconds. With that, my image was ready to go. We can run the image with the following: Shell $ docker run -p 3000:3000 my-app The result looks like this: And that’s it! We’ve successfully built an OCI image of our Next.js application without using a Dockerfile. Additional Configurations What if you need to configure something inside the buildpack? For this, you would reference the buildpack(s) that were selected by your builder. For example, for my Next.js app, I can see in the logs that the builder selected two buildpacks: nodejs-engine and nodejs-yarn. Let’s say that I want to specify the yarn version used by the buildpack. First, I would go to the nodejs-yarn buildpack Readme, where I see that I can specify the yarn version in my package.json file with a packageManager key. I would modify my file to look like this: JSON { "packageManager": "yarn@1.22.22" } From there, all I would need to do is run pack build my-app --builder heroku/builder:24 again. Conclusion Cloud Native Buildpacks are an exciting new way to build container images for our applications. By removing the need for a Dockerfile, they make it faster than ever to get our application packaged and deployed. Plus, as they build standard container images, there is no vendor lock-in. Cloud Native Buildpacks are in preview on many platforms, which means that the feature set is light but fast-growing. Heroku, which has open-sourced their Cloud Native Buildpacks, is bringing them to their next-generation platform, too. I’m looking forward to seeing how Cloud Native Buildpacks enable secure, speedy application deployment across the cloud platform community.

By Alvin Lee

CORE

KIAM vs AWS IAM Roles for Service Accounts (IRSA)

As Kubernetes adoption grows in cloud-native environments, securely managing AWS IAM roles within Kubernetes clusters has become a critical aspect of infrastructure management. KIAM and AWS IAM Roles for Service Accounts (IRSA) are two popular approaches to handling this requirement. In this article, we discuss the nuances of both tools, comparing their features, architecture, benefits, and drawbacks to help you make an informed decision for your Kubernetes environment. Introduction KIAM: An open-source solution designed to assign AWS IAM roles to Kubernetes pods dynamically, without storing AWS credentials in the pods themselves. KIAM uses a proxy-based architecture to intercept AWS metadata API requests.IRSA: AWS's official solution that leverages Kubernetes service accounts and OpenID Connect (OIDC) to securely associate IAM roles with Kubernetes pods. IRSA eliminates the need for an external proxy. Architecture and Workflow KIAM Components Agent – Runs as a DaemonSet on worker nodes, intercepting AWS metadata API calls from pods.Server – Centralized component handling IAM role validation and AWS API interactions. Workflow Pod metadata includes an IAM role annotation.The agent intercepts metadata API calls and forwards them to the server.The server validates the role and fetches temporary AWS credentials via STS.The agent injects the credentials into the pod’s metadata response. IRSA Components Kubernetes service accounts annotated with IAM role ARNs.An OIDC identity provider configured in AWS IAM. Workflow A service account is annotated with an IAM role.Pods that use the service account are issued a projected service account token.AWS STS validates the token via the OIDC identity provider.The pod assumes the associated IAM role. Feature Comparison Feature KIAM IRSA Setup Complexity Requires deploying KIAM components. Requires enabling OIDC and setting up annotations. Scalability Limited at scale due to proxy bottlenecks. Highly scalable; no proxy required. Maintenance Requires ongoing management of KIAM. Minimal maintenance; native AWS support. Security Credentials are fetched dynamically but flow through KIAM servers. Credentials are validated directly by AWS STS. Performance Metadata API interception adds latency. Direct integration with AWS; minimal latency. AWS Native Support No, third-party tool. Yes, fully AWS-supported solution. Multi-cloud Support No, AWS-specific. No, AWS-specific. Advantages and Disadvantages Advantages of KIAM Flexibility. Works in non-EKS Kubernetes clusters.Proven utility. Widely used before IRSA was introduced. Disadvantages of KIAM Performance bottlenecks. Metadata interception can lead to latency issues, especially in large-scale clusters.Scalability limitations. Centralized server can become a bottleneck.Security risks. Additional proxy layer increases the attack surface.Maintenance overhead. Requires managing and updating KIAM components. Advantages of IRSA AWS-native integration. Leverages native AWS features for seamless operation.Improved security. Credentials are issued directly via AWS STS without intermediaries.Better performance. No proxy overhead; direct STS interactions.Scalable. Ideal for large clusters due to its distributed nature. Disadvantages of IRSA AWS-only. Not suitable for multi-cloud or hybrid environments.Initial learning curve. Requires understanding OIDC and service account setup. Use Cases When to Use KIAM Non-EKS Kubernetes clusters.Scenarios where legacy systems rely on KIAM's specific functionality. When to Use IRSA EKS clusters or Kubernetes environments running on AWS.Use cases requiring scalability, high performance, and reduced maintenance overhead.Security-sensitive environments that demand minimal attack surface. Migration from KIAM to IRSA If you are currently using KIAM and want to migrate to IRSA, here’s a step-by-step approach: 1. Enable OIDC for Your Cluster In EKS, enable the OIDC provider using the AWS Management Console or CLI. 2. Annotate Service Accounts Replace IAM role annotations in pods with annotations in service accounts. 3. Update IAM Roles Add the OIDC identity provider to your IAM roles’ trust policy. 4. Test and Verify Deploy test workloads to ensure that the roles are assumed correctly via IRSA. 5. Decommission KIAM Gradually phase out KIAM components after successful migration. Best Practices for Migration Perform the migration incrementally, starting with non-critical workloads.Use a staging environment to validate changes before applying them to production.Monitor AWS CloudWatch metrics and logs to identify potential issues during the transition.Leverage automation tools like Terraform or AWS CDK to streamline the setup and configuration. Real-World Examples KIAM in Action Legacy systems – Organizations using non-EKS clusters where KIAM remains relevant due to its compatibility with diverse environmentsHybrid workloads – Enterprises running workloads across on-premise and cloud platforms IRSA Success Stories Modern applications – Startups leveraging IRSA for seamless scaling and enhanced security in AWS EKS environmentsEnterprise adoption – Large-scale Kubernetes clusters in enterprises benefiting from reduced maintenance overhead and native AWS integration Conclusion While KIAM was a groundbreaking tool in its time, AWS IAM Roles for Service Accounts (IRSA) has emerged as the preferred solution for managing IAM roles in Kubernetes environments running on AWS. IRSA offers native support, better performance, improved security, and scalability, making it a superior choice for modern cloud-native architectures. For Kubernetes clusters on AWS, IRSA should be the go-to option. However, if you operate outside AWS or in hybrid environments, KIAM or alternative tools may still have relevance. For infrastructure architects, DevOps engineers, and Kubernetes enthusiasts, this comparative analysis aims to provide the insights needed to choose the best solution for their environments. If you need deeper technical insights or practical guides, feel free to reach out.

By Kuppusamy Vellamadam Palavesam

Integrating Redis With Message Brokers

Let's look at how to integrate Redis with different message brokers. In this article, we will point out the benefits of this integration. We will talk about which message brokers work well with Redis. We will also show how to set up Redis as a message broker with practical code examples, and discuss how to handle message persistence and how to monitor Redis when it is used as a message broker. What Are the Benefits of Using Redis With Message Brokers? Using Redis with message brokers gives many good benefits. These benefits help improve how we handle messages. Here are the main advantages: High performance. Redis works in memory. This means it has very low delays and can handle many requests at once. This is great for real-time messaging apps where speed is very important.Pub/Sub messaging. Redis has a feature called publish/subscribe (Pub/Sub). This lets us send messages to many subscribers at the same time without needing direct connections. It is helpful for chat apps, notifications, or event-driven systems.Data structures. Redis has many data structures like strings, lists, sets, sorted sets, and hashes. We can use these structures for different messaging tasks. For example, we can use lists for queues and sets for unique message IDs.Scalability. Redis can grow by using clustering. This helps it manage more work by spreading data across many nodes. This is good for apps that need to be available all the time and handle problems.Persistence options. Redis has different options to save data, like RDB snapshots and AOF logs. This helps keep message data safe even if something goes wrong. We can balance good performance with saving data.Ease of use. Redis commands are simple. There are also many good client libraries for different programming languages like Python, Java, and Node.js. This makes it easy to add Redis to our apps.Monitoring and management. Redis has tools to check how well it is working. Tools like Redis CLI and RedisInsight help us improve the message broker setup and find problems.Lightweight. Redis uses fewer resources compared to older message brokers like RabbitMQ or Kafka. This makes it a good choice for microservices and container setups.Support for streams. Redis Streams is a strong feature that lets us work with log-like data. This helps with complex message processing and managing groups of consumers. It is useful for event sourcing and CQRS patterns. By using these benefits, we can build strong and efficient messaging systems with Redis. For more information on what Redis can do, you can check What is Redis? and What are Redis Streams? Which Message Brokers Are Compatible With Redis? We can use Redis with many popular message brokers. This makes them work better and faster. Here are some main message brokers that work well with Redis: RabbitMQ We can use Redis to store messages for RabbitMQ. By using Redis for message storage, RabbitMQ can handle its tasks better. This is especially useful when we need quick access to message queues. Apache Kafka Kafka can use Redis for keeping messages temporarily. With Redis streams, Kafka producers can save messages before sending them to consumers. This can help increase throughput. ActiveMQ We can set up ActiveMQ to use Redis for storing messages in a queue. This can make retrieving and processing messages faster. NATS NATS can use Redis to keep messages safe and to manage state in a distributed system. This lets us store messages in Redis for later use. Celery Celery is a tool for managing tasks. We can use Redis as a broker for Celery. This helps us manage background tasks and scheduling better. Code Example for Using Redis With Celery To connect Redis as a message broker with Celery, we can set it up in the Celery configuration like this: Python from celery import Celery app = Celery('tasks', broker='redis://localhost:6379/0') @app.task def add(x, y): return x + y This code shows a simple Celery task using Redis as the broker. This lets us do asynchronous message processing very well. Apache Pulsar Like Kafka, Apache Pulsar can also use Redis for caching and quick message retrieval. This can make message processing more efficient. How Do I Set Up Redis As a Message Broker? To set up Redis as a message broker, we can follow these steps: 1. Install Redis First, we need to make sure Redis is installed on our server. We can check the Redis installation guide. 2. Configure Redis Next, we open the Redis configuration file. This file is usually called redis.conf. We need to set these properties for message brokering: Shell # Enable persistence for durability save 900 1 save 300 10 save 60 10000 # Set the max memory limit maxmemory 256mb maxmemory-policy allkeys-lru # Enable Pub/Sub messaging notify-keyspace-events Ex 3. Start the Redis server Now, we can start Redis with this command: Shell redis-server /path/to/redis.conf 4. Use Redis for Pub/Sub We can publish and subscribe to channels using the Redis CLI or client libraries. Here is an example using Python: Python import redis # Connect to Redis r = redis.StrictRedis(host='localhost', port=6379, db=0) # Subscriber def message_handler(message): print(f"Received message: {message['data']}") pubsub = r.pubsub() pubsub.subscribe(**{'my-channel': message_handler}) # Listen for messages pubsub.run_in_thread(sleep_time=0.001) # Publisher r.publish('my-channel', 'Hello, Redis!') 5. Use Message Queues For task queues, we can use Redis lists. Here is how we can make a simple queue: Producer Python r.lpush('task_queue', 'Task 1') r.lpush('task_queue', 'Task 2') Consumer Python while True: task = r.brpop('task_queue')[1] print(f'Processing {task.decode()}') By following these steps, we can easily set up Redis as a message broker. We can use both Pub/Sub and list-based message queuing. For more insights on Redis data types, we can check the article on Redis data types. Practical Code Examples for Integrating Redis With Message Brokers Integrating Redis with message brokers helps us improve messaging abilities. We can use Redis's speed and efficiency. Below, we show simple code examples for using Redis with popular message brokers like RabbitMQ and Kafka. Example 1: Using Redis with RabbitMQ In this example, we will use Python with the pika library. We will send and receive messages through RabbitMQ, using Redis to store data. Installation Shell pip install pika redis Producer Code Python import pika import redis # Connect to RabbitMQ connection = pika.BlockingConnection(pika.ConnectionParameters('localhost')) channel = connection.channel() channel.queue_declare(queue='task_queue', durable=True) # Connect to Redis redis_client = redis.Redis(host='localhost', port=6379, db=0) message = 'Hello World!' # Publish message to RabbitMQ channel.basic_publish(exchange='', routing_key='task_queue', body=message, properties=pika.BasicProperties( delivery_mode=2, # make message persistent )) # Store message in Redis redis_client.lpush('messages', message) print(" [x] Sent %r" % message) connection.close() Consumer Code Python import pika import redis # Connect to RabbitMQ connection = pika.BlockingConnection(pika.ConnectionParameters('localhost')) channel = connection.channel() channel.queue_declare(queue='task_queue', durable=True) # Connect to Redis redis_client = redis.Redis(host='localhost', port=6379, db=0) def callback(ch, method, properties, body): message = body.decode() print(" [x] Received %r" % message) # Store received message in Redis redis_client.lpush('processed_messages', message) ch.basic_ack(delivery_tag=method.delivery_tag) channel.basic_consume(queue='task_queue', on_message_callback=callback) print(' [*] Waiting for messages. To exit press CTRL+C') channel.start_consuming() Example 2: Using Redis With Kafka In this example, we will use Java with Apache Kafka and Redis to send messages. Dependencies (Maven) XML <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-clients</artifactId> <version>3.2.0</version> </dependency> <dependency> <groupId>redis.clients</groupId> <artifactId>jedis</artifactId> <version>4.0.1</version> </dependency> Producer Code Java import org.apache.kafka.clients.producer.KafkaProducer; import org.apache.kafka.clients.producer.ProducerRecord; import redis.clients.jedis.Jedis; import java.util.Properties; public class RedisKafkaProducer { public static void main(String[] args) { Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer"); props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); KafkaProducer<String, String> producer = new KafkaProducer<>(props); Jedis jedis = new Jedis("localhost"); String message = "Hello Kafka!"; producer.send(new ProducerRecord<>("my-topic", message)); jedis.lpush("messages", message); producer.close(); jedis.close(); } } Consumer Code Java import org.apache.kafka.clients.consumer.ConsumerRecord; import org.apache.kafka.clients.consumer.KafkaConsumer; import redis.clients.jedis.Jedis; import java.time.Duration; import java.util.Collections; import java.util.Properties; public class RedisKafkaConsumer { public static void main(String[] args) { Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("group.id", "test"); props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props); consumer.subscribe(Collections.singletonList("my-topic")); Jedis jedis = new Jedis("localhost"); while (true) { for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) { System.out.printf("Consumed message: %s%n", record.value()); jedis.lpush("processed_messages", record.value()); } } } } These examples show how we can connect Redis with message brokers like RabbitMQ and Kafka. This gives us strong messaging solutions. How Do I Handle Message Persistence in Redis? To handle message persistence in Redis, we can use two main ways: RDB (Redis Database Backup) and AOF (Append-Only File). RDB Persistence RDB saves snapshots of your data at set times. This is good for backups. But if Redis crashes between snapshots, we can lose some data. Configuration Shell save 900 1 # Save the DB if at least 1 key changed in 900 seconds save 300 10 # Save the DB if at least 10 keys changed in 300 seconds AOF Persistence AOF logs every write action the server gets. This helps us recover data more up-to-date. But the files become larger. Configuration Shell appendonly yes appendfsync everysec # Fsync every second for a balance of performance and durability Command for Enabling Persistence To turn on persistence, we can change the `redis.conf` file or use commands in the Redis CLI: Shell # Enable RDB CONFIG SET save "900 1" # Enable AOF CONFIG SET appendonly yes Choosing Between RDB and AOF RDB is good when speed is very important and losing some data is okay.AOF is better when we need to keep data safe. We can also use both methods together. RDB will take snapshots and AOF will log changes. Monitoring Persistence We can check the status and performance of persistence using Redis commands: Shell INFO persistence This command shows us the current state of RDB and AOF. It includes the last save time and AOF file size. For more details on Redis persistence, we can look at what Redis persistence is and learn how to set up RDB and AOF well. How Do I Monitor Redis in a Message Broker Setup? Monitoring Redis in a message broker setup is very important. It helps us make sure Redis works well and is reliable. We have many tools and methods to monitor Redis. These include built-in commands, external tools, and custom scripts. Built-in Monitoring Commands Redis has some built-in commands for monitoring: INFO This command gives us server stats and config. Shell redis-cli INFO MONITOR This command shows all commands that the Redis server gets in real-time. Shell redis-cli MONITOR SLOWLOG This command shows slow commands. It helps us find performance problems. Shell redis-cli SLOWLOG GET 10 External Monitoring Tools Redis monitoring tools. We can use tools like RedisInsight, Datadog, or Prometheus with Grafana. These tools help us see important data like memory use and command run time.Redis Sentinel. This tool helps with high availability and monitoring. It can tell us when there are failures and can do automatic failovers. Key Metrics to Monitor Memory usage. We need to watch memory use to avoid running out of memory.CPU usage. We should track CPU use to use resources well.Command latency. We measure how long commands take to run. This helps us find slow commands.Connection count. We need to monitor active connections to stay within limits.Replication lag. If we use replication, we should check the lag between master and slave instances. Example Monitoring Setup With Prometheus To set up Prometheus for Redis monitoring, we can use the Redis Exporter. 1. Install Redis Exporter Shell docker run -d -p 9121:9121 --name=redis-exporter oliver006/redis_exporter 2. Configure Prometheus We add the following job in our prometheus.yml: YAML scrape_configs: - job_name: 'redis' static_configs: - targets: ['localhost:9121'] 3. Visualize in Grafana We connect Grafana to our Prometheus and create dashboards to see Redis data. Custom Monitoring Scripts We can also make our own scripts using Python with the redis library. This helps us check and alert automatically. Python import redis client = redis.StrictRedis(host='localhost', port=6379, db=0) info = client.info() # Check memory usage if info['used_memory'] > 100 * 1024 * 1024: # 100 MB threshold print("Memory usage is too high!") Using these monitoring methods help us keep our Redis environment healthy in our message broker setup. For more info on Redis commands and settings, check Redis CLI usage. Frequently Asked Questions 1. What are the best practices for integrating Redis with message brokers? To integrate Redis with message brokers, we should follow some best practices. First, we can use Redis Pub/Sub for messaging in real-time. Also, we can use Redis Streams for message queuing. We need to set up message persistence correctly. It is also good to use Redis data types in a smart way. 2. How do I ensure message persistence when using Redis with message brokers? To keep messages safe in Redis, we can set it to use RDB (Redis Database Backup) or AOF (Append-Only File) methods. RDB snapshots help us recover data fast. AOF saves every write action to make sure we do not lose any data. 3. Is Redis a reliable message broker? Yes, Redis can work as a reliable message broker if we set it up right. It has low delay and high speed, so it is good for real-time use. But we need to add things like acknowledgments and re-sending messages to make sure it is reliable. 4. Which programming languages support Redis integration with message brokers? Redis works with many programming languages. Some of them are Python, Java, Node.js, PHP, and Ruby. Each language has its own Redis client libraries. This makes it easy to connect with message brokers. 5. How can I monitor Redis performance in a message broker setup? It is important for us to check how Redis performs in a message broker setup. We can use tools like Redis Insights or the built-in Redis commands to see key things like memory use, command stats, and delay.

By Raunak Jain

The Critical Role of CISOs in Managing IAM, Including NHIs

Who Should Own IAM in the Enterprise? Identity and access management (IAM) started as an IT function, with the entire focus on giving human users the right access to the right systems. But today, identity has become the primary attack surface, with at least 80% of all modern breaches involving compromised or stolen identities from adversaries who exploit poor identity. This reality has moved the responsibility for risk onto the shoulders of the team tasked with protecting the organization from attacks, namely security. Which ultimately means the CISO. However, there’s a major blind spot in this conversation: non-human identities (NHIs). This is a critical oversight. We are witnessing non-human identities (NHIs) outnumber humans by a factor of at least 45-to-one in the enterprise, with some estimates as high as 100 to 1. As organizations accelerate to deliver more code and products faster than ever, the number of these machine identities, such as service accounts, APIs, and automated workloads, will continually increase this imbalance. LLMs and the rapid adoption of new coding assistants and AI productivity tools will only rapidly accelerate this trend. If CISOs don’t come to terms with IAM strategies to cover NHIs, they’re leaving one of their largest attack surfaces undefended. IAM Without NHI Governance Is Incomplete The traditional IAM model has long revolved around human identities: onboarding employees, granting them role-based access, monitoring for policy violations, and de-provisioning accounts when necessary. This human-centric approach has matured significantly, with robust governance frameworks, compliance mandates, and security controls like multi-factor authentication (MFA) and zero-trust principles. The tooling market has kept pace with innovations from vendors like Okta, OneLogin, and Auth0. But, NHIs operate under very different assumptions: They don’t have passwords but rely on API keys, tokens, and cryptographic credentials to authenticate.They don’t follow traditional lifecycle processes — service accounts and machine identities often persist indefinitely, even after their original purpose is obsolete.They lack clear ownership, meaning their security is often neglected. NHIs are also highly susceptible to abuse. Secrets sprawl, the uncontrolled proliferation of credentials, has become a major security risk. API keys and access tokens are frequently hardcoded into source code, embedded in configuration files, or exposed in logs. Long-lived credentials, lacking any real governance or automation for rotation policies, are a prime target for attackers. By default, many older systems never expired, meaning some API keys in cloud environments haven’t been rotated in years. On top of this, NHIs are commonly over-permissioned. Developers are often in a rush to get something working and may not be scoping secrets as tightly as needed per the principle of least privilege. The lack of clear governance frameworks adds to their confusion, leading to many machine identities being granted excessive permissions just to "get it working" rather than being secure. The result? A massive security gap that adversaries love to exploit. If the IAM strategy is truly about protecting access, it must include NHIs. Otherwise, enterprises are only solving half the problem. Why CISOs Must Own NHI Governance Given that IAM should now be a security function, it follows that NHIs — being the fastest-growing and most vulnerable category of identities — must be governed under the CISO’s purview. Here’s why. NHIs Are a Major Attack Vector NHIs represent one of the largest and least monitored attack surfaces in the modern enterprise. Attackers increasingly target leaked API keys, compromised service accounts, and misconfigured machine identities to gain unauthorized access. Just a few examples of high-profile breaches have demonstrated this risk: The U.S. Department of the Treasury was breached through a compromised API key, granting the attackers access to workstations and unclassified documents.Toyota publicly exposed a very long-lived access key to an internal data server, allowing unauthorized access to real customer data for 5 years. The New York Times exposed a GitHub token, resulting in 5,600 of their repositories being leaked online. Compliance and Risk Management Demand It Regulatory frameworks like PCI-DSS, GDPR, and ISO 27001, as well as recommendations from NIST, all include strict access control and least privilege requirements — but they often focus on human identities. As regulators catch up to the reality that NHIs pose the same (or greater) risks, organizations will be held accountable for securing all identities. This means enforcing least privilege for NHIs — just as with human users. It also means tracking the full lifecycle of machine identities, from creation to decommissioning, as well as auditing and monitoring API keys, tokens, and service accounts with the same rigor as employee credentials. Waiting for regulatory pressure after a breach is too late. CISOs must act proactively to get ahead of the curve on these coming changes. Zero Trust Requires NHI Governance Zero trust strategies focus on identity as the new perimeter, but if NHIs aren’t included, the majority of the identity perimeter remains wide open. A zero-trust approach to NHIs means: Continuous verification – NHIs must be continuously authenticated and authorized, not granted persistent access.Least privilege enforcement – NHIs should have minimal permissions and be regularly reviewed.Segmentation and isolation – NHIs should be restricted to the specific workloads they serve. Zero trust is not complete unless NHIs are governed as rigorously as human identities. Building a Comprehensive IAM Strategy A modern IAM strategy must begin with comprehensive discovery and mapping of all identities across the enterprise. This includes understanding not just where the associated secrets are stored but also their origins, permissions, and relationships with other systems. Organizations need to implement robust secrets management platforms that can serve as a single source of truth, ensuring all credentials are encrypted and monitored. The lifecycle management of NHIs requires particular attention. Unlike human identities, which follow predictable patterns of employment and human lifestyles, machine identities require automated processes for creation, rotation, and decommissioning. Security teams must implement systems that can track when secrets were created, who created them, and, most importantly, when they should be rotated or retired. How to Extend IAM to Include NHIs Extending IAM to NHIs requires an adapted governance framework that accounts for their unique challenges. Here are the recommended next steps to align your machine identity governance with your security posture goals Discover All NHIs Without knowing identities that already exist in your organization, it is impossible to secure them. Hopefully, this has been well documented as they came into existence, but most likely, this is not going to be the case, especially across multiple applications, teams, and entire divisions of your enterprise. Any mapping you do needs to account for, at a minimum, where NHI secrets are stored, when they were created, and by whom. You should also be able to see when they were created, rotates, and very importantly, if they are still in use. Centralize NHI Management In recent research, it was learned that, on average, organizations maintain an average of 6 distinct secrets manager instances. This vault sprawl means a lack of visibility throughout the enterprise, even if an individual department, DevOps, or application team has this view into their team's secrets. What is needed is a way to make sure any secret is stored and managed from the single most appropriate place. Vault sprawl also means a likelihood of duplication of keys throughout systems. This adds complexity to the rotation process, especially in the middle of an incident. The security teams need to have a path to make sure that when a secret is rotated, it is globally enforced. Once you do have a clear view across vaults, you can enforce that any secrets found outside the vault are placed in the correct enterprise secrets manager, like HashiCorp Vault, AWS Secrets Manager, CyberArk Conjur, and then properly rotated automatically, at scale. Enforce Least Privilege and Rotation Policies Getting all the possible combinations of permissions assigned correctly to grant an application just enough privilege to get the work done is time-consuming and tricky. It is highly likely that at least some of your NHIs have more access than they need, including the ability to write or delete objects or other data. Without understanding what permissions your NHIs have, it is impossible to audit them to remove excessive permissions. This should not be a one-off exercise, as every credential rotation introduces the chance that new unneeded privileges will be introduced. What is needed is continual scanning to re-check that any changes to NHIs have adhered to the principle of least privilege. Once all NHI privileges are understood, organizations can implement better-automated rotation and NHI governance policy at scale, ensuring the new secret follows the governance model you have established. Integrate NHIs into Zero Trust No NHI should be allowed to act without the right authentication. NHIs currently rely, for the most part, on long-lived credentials. By default, most API keys, passwords, and other authentication tokens live forever. In an ideal world, you would be able to deliver just-in-time authentication that lives just for the life of the workload request or robust identity frameworks like SPIFFE/SPIRE. Most organizations are barred from implementing these approaches for existing applications and infrastructure due to the lack of insight into their existing NHI inventory and the technical hurdle of reworking the code to adopt a new methodology. But by moving your NHI secrets into a centralized enterprise secrets management, you can move towards this goal much faster. While this is just a part of a zero-trust strategy, as defined by NIST, it is impossible to achieve without the ability to ensure that any credentials are only being used by the expected entity. Mapping those NHIs is a mandatory step. A Unified IAM Strategy for Humans and Machines IAM is no longer just about human users. It must evolve to govern non-human identities with the same level of security, oversight, and control. This goes beyond the scope of any single part of your IT or DevOps organization, especially as security is ultimately on the hook for the risks when a breach occurs. For CISOs, owning IAM means owning all identities, both human and non-human. Anything less leaves a massive security gap. By integrating NHIs into IAM strategy, enterprises can: Reduce the attack surface from secrets sprawlEnforce least privilege across human and machine identitiesImprove compliance posture before regulations catch upBuild a zero-trust architecture that accounts for all identities The modern enterprise cannot afford fragmented identity governance. It’s time for CISOs to take full ownership — before attackers do. Resources Stop Identity-Based Threats Today What Happened in the U.S. Department of the Treasury Breach? A Detailed SummaryToyota Suffered a Data Breach by Accidentally Exposing A Secret Key Publicly On GitHubThe Secrets of the New York Times Source Code BreachVoice of Practitioners 2024

By Dwayne McDaniel

How I Made My Liberty Microservices Load-Resilient

It started with an all-too-familiar problem: As the traffic spiked, our microservices started slowing down, some even crashing under the pressure. Users experienced frustrating delays and outright bulk API request failures, and reliability took a hit. IBM Liberty gave us a solid, lightweight foundation, but we needed to do more. To keep things running smoothly, we had to fine-tune our architecture and make our services truly resilient to heavy loads. This blog is a deep dive into how we optimized various layers of the architecture in a short span of time. We cover various strategies that helped prevent crashes and keep things running smoothly. By the end, you’ll see how we transformed our fragile microservices into a rock-solid, self-healing system that can take on anything we throw at it. We started with two goals for ourselves: Increase the throughput to an acceptable levelAt peak load, the performance needs to gracefully degrade Application Architecture Here is a simplified version of the application architecture, highlighting the key components involved in the analysis. The following are the notable architectural elements and their implementation details: Istio A service mesh. This is a dedicated software layer that facilitates service-to-service communication using proxies. Security encrypts service-to-service communication using mTLS (Mutual TLS).Traffic management uses a virtual service for intelligent routing. Sidecar Proxy (Envoy) controls how traffic flows between services.Public Ingress Gateway exposes services inside the mesh to Cloudflare (a cloud internet service).Private Ingress Gateway controls access to services within a private network.Private Egress Gateway manages outgoing traffic from the mesh to other cloud services. IBM Cloud Internet Services (CIS) A set of services that help in addressing issues in exposing a microservice to the Internet. DNS resolved is used for resolving domain names.CDN (Content Delivery Network) caches website content globally to reduce latency.WAF (Web Application Firewall) protects against OWASP Top 10 vulnerabilities.DDoS mitigation prevents large-scale attacks from overwhelming services.A load balancer distributes traffic across multiple origins. Kubernetes Cluster A set of nodes that help in the orchestration of containerized applications using Kubernetes. A cluster is used to run all the microservices. The gateway nginx microservice is implemented using the Go language. App server microservice is implemented using Java. This runs on a Liberty Server. To connect to a database, this microservice uses OpenJPA implementation. The cluster is distributed across three zones to achieve high availability. Data Layer The layer in the architecture where all the persistent and ephemeral cache data resides. PostgreSQL is used as the service storage to store data. Redis cache is used as an in-memory data store. User Layer (Public) The interface layer that users interact using APIs. API calls can be made by applications integrating the SDK provided by the service. API calls can be made by automation scripts. API calls can be made from browsers or API testing tools like Postman. Customer Private Cloud A network segment with direct "private" connectivity to the application without going to the Internet API calls can be made by automation scripts via private networks. API calls can be made by applications integrating the SDK provided by the service via a private network. Details of the Incident Following are the technical details of the incident: As the traffic spiked, the CPU and memory of our Java microservices spiked.Our Java microservices threads hung.Though we had rate limiting configured in CIS, we noticed the requests landing in our microservice more than what was configured. Though the number of requests grew, the number of connections to the database did not grow as expected. Requests timed out in JMeter when a load was initiated. (Load tested in our staging environment with the same configuration to reproduce the problem.) A particular tenant’s load grew exponentially as they were testing their application. And this tenant was connecting using a private endpoint. As the number of requests increased, the Go-based gateway nginx microservice was stable, but the Liberty-based Java app server hung. Pre-Existing Resilience Measures Here are the pre-existing resilience measures that were already implemented before the incident: Public traffic rate limiting. Public traffic enters from CIS. CIS rate-limiting configurations are enabled to manage the inflow of public endpoint requests.Microservices were running with a larger number of pods so that the high availability was built in. Gateway microservices were configured to run with three instances (pods).The app server was configured to run with three instances (pods).pgBouncer was configured to run with three instances (pods).PgBouncer is a lightweight connection pooler for PostgreSQL that improves database performance, scalability, and resilience. Connection pooling is configured to limit the number of active database connections to prevent PostgreSQL from getting overwhelmed.Transaction and statement-level pooling. Configured with session pooling, where each client gets a dedicated database connection (default mode).Gateway nginx timeout. The timeout configured forproxy_read_timeout was 30 seconds so that if the app server microservice does not return the response on time, the request gets a timeout in 30 seconds. Istio ingress gateway timeout. The timeout configured for Istio was 60 seconds. If the gateway nginx does not return the response to Istio on time, the request gets timed out in 30 seconds. Timeouts were not in sync, and we noticed many HTTP-504 error codes as responses. Resource requests and limits were configured for all our microservices correctly.Maximum number of allowed connections on PostgreSQL. For better management of connections and better resilience, the total number of connections allowed was set to 400 in PostgreSQL. Data that’s frequently retrieved is also stored in memory cache (Redis) to avoid latency to connect to PostgreSQL for data retrieval.SDK implements web sockets to get data on demand. SDK also caches the data from the server in the runtime memory of the customer application. SDK makes calls to the server only on any updates that are being made to the application data. On any data update, the server initiates a WebSocket event for the connected clients. These clients then make a GET call to the server to get up-to-date information. Strategies Introduced to Improve Load Resilience Here are the strategies we introduced to improve resilience during the incident: Thread Pool Management Liberty thread pool management is enabled with maxTotal (the maximum number of threads) set to control thread spawning. Without this limit, Liberty may continuously create new threads as demand rises, leading to high resource (CPU and memory) consumption. Excessive thread creation increases context-switching overhead, slowing down request processing and increasing the latency. If left uncontrolled, it could also exhaust JVM resources, potentially causing system instability. Along with maxTotal other parameters related to InitialSize, MinIdle, MaxIdle, and MaxWaitMillis were also set.The maximum number of HTTP request threads available in Liberty is 200. So, maxTotal is set to 200 to control the number of threads spawned by the server at runtime. Setting the above configuration helped control the threads spawned, preventing thread hangs. Connection Pool Management Pgbouncer thread pool management was configured with pool_modeset to session and max_client_conn set to 200. However, session mode did not perform as expected for our application, requiring an update to transaction mode, which is also the recommended configuration. With three instances of pgBouncer and max_client_conn set to 200, up to 600 connections could be established with the PostgreSQL database. However, since PostgreSQL is configured with a maximum of 400 connections for optimal performance, we needed to adjust the max_client_conn to 100 per instance.With the pgBouncer connection pool update, the connections established to the database were controlled with 300+ connections. Also, with the Liberty thread pool updates, the number of requests handled successfully, without thread hung, and without much latency increased. Time Outs Nginx proxy_read_timeout was initially set to 30 seconds, matching the Istio timeout of 30 seconds. To maintain consistency, we increased both timeouts to 60 seconds. As a result, requests now time out at 60 seconds if no response is received from the upstream. This adjustment helped reduce 504 errors from Istio and allowed the server to handle more requests efficiently. Rate Limiting Cloudflare rate limiting was in place for requests coming from the public end point to the service. However, rate limiting was missing on the Istio ingress private gateway. As traffic on the private end point increased, we immediately implemented rate limiting on the Istio private gateway, along with a Retry-After header. This effectively controlled the number of requests reaching the Nginx gateway microservice, ensuring better load management. Latest Version Usage pgBouncer was running on an older version, so we upgraded to the latest version. While this did not have a direct impact on resilience, we used the opportunity to benefit from the new version. Overall improvements achieved during the incident with the above configuration updates are: The latency of a GET request retrieving 122KB of data, which involves approximately 7-9 database calls, was improved from 9 seconds to 2 seconds even under a load of 400 concurrent requests to the API. The number of requests handled concurrently improved by 5x times.Errors reduced drastically. The customer started noticing only 429 (too many requests) if too many requests were hit within a specific period. Results and Key Takeaways Teamwork drives fast recovery. The technical team collaborated effectively, analyzing each layer independently and ensuring a quick resolution to the incident. Cross-functional efforts played a crucial role in restoring stability.Logging is key. Comprehensive logging across various layers provided critical insights, allowing us to track: The total number of requests initiatedThe total number of failing requestsThread hangs and failuresThe number of requests hitting private endpoints These logs helped us pinpoint the root cause swiftly. Monitoring enables real-time insights. With active monitoring, we could track live database connections, which helped us fine-tune the connection pool configurations accurately, preventing resource exhaustion.Master what you implement. Kubernetes expertise allowed us to access pods, tweak thread pools, and observe real-time behavior before rolling out a permanent fix.Istio rate limiting was applied immediately, helping balance the load effectively and preventing service degradation. Fail gracefully. Returning HTTP 504 Gateway Timeout left the API clients with not much option but to declare failure. Instead, we returned an HTTP 429 Too Many Requests. This gave a more accurate picture, and the API clients can try after some time.Feature flags for dynamic debugging. Running microservices behind feature flags enabled on-demand debugging without requiring server restarts. This played a vital role in identifying bottlenecks, particularly those caused by database connection limits, which in turn reduced MTTR. Conclusion Building resilient microservices is not just about choosing the right tools — it's about understanding and fine-tuning every layer of the system to handle unpredictable traffic spikes efficiently. Through this incident, we reinforced the importance of proactive monitoring, optimized configurations, and rapid collaboration in achieving high availability and performance. By implementing rate limiting, thread pool management, optimized database connections, appropriate failure codes, and real-time observability, we transformed our fragile system into a self-healing, scalable, and fault-tolerant architecture. The key takeaway? Master what you implement — whether it's Kubernetes, Istio, or database tuning — deep expertise helps teams respond quickly and make the right decisions under pressure. Resilience isn’t a one-time fix — it’s a mindset. Keeping a system healthy means constantly monitoring, learning from failures, and improving configurations. No more crashes — just a system that grows and adapts effortlessly, and in an unlikely situation, degrades gracefully!

By Josephine Eskaline Joyce

CORE