A microservices architecture is a development method for designing applications as modular services that seamlessly adapt to a highly scalable and dynamic environment. Microservices help solve complex issues such as speed and scalability, while also supporting continuous testing and delivery. This Zone will take you through breaking down the monolith step by step and designing a microservices architecture from scratch. Stay up to date on the industry's changes with topics such as container deployment, architectural design patterns, event-driven architecture, service meshes, and more.
Logs of an application are the initial step to start debugging and analysis of issues, so they are quite an important part of the application. However, they are often ignored during the testing phase. As the world is moving to cloud-based microservices, gaining insights into any customer issue heavily relies on logs. If they are not properly structured or don’t contain enough information to analyze the issue, they can be a significant stumbling block for engineers. In this article, we’ll explore why testing microservice logs is crucial and how engineers can ensure logs are up to the mark. Why Logs Matter Logs are the backbone of debugging, monitoring, and security. They help engineers: Debug issues: Logs provide the first line of information when issues arise, helping to pinpoint the cause quickly. Monitor performance: They track the performance of microservices, identifying bottlenecks and ensuring smooth operations. Ensure security: Logs can highlight unauthorized access attempts and other security-related events. Meet compliance: Industries with regulatory requirements often need detailed logging to remain compliant. Consequences of Ignoring Log Testing Despite their importance, logs are frequently overlooked during the testing phase. This neglect can lead to: Insufficient information: Logs lacking critical details make it difficult to troubleshoot issues. Inconsistent format: Inconsistencies in log format can hinder automated analysis and make manual reviews cumbersome. Performance issues: Poorly implemented logging can cause significant performance overhead. Best Practices for Log Testing in Microservices To ensure logs are useful and efficient, engineers should integrate the following best practices into their processes. 1. Define Clear Log Requirements Start by defining what needs to be captured in the logs. Essential components include: Error messages: Ensure all errors are logged with context to facilitate diagnosis. Transaction IDs: Include IDs to trace actions across different microservices. Timestamps: Precise timestamps are critical for tracking the sequence of events. User information: Capture user IDs or session IDs to address user-specific issues. 2. Advocate Structured Logging Structured logging uses a consistent format, such as JSON, which aids both automated and manual log analysis. For example, a well-structured log entry might look like this: JSON { "timestamp": "2024-07-11T10:00:00Z", "level": "ERROR", "transactionId": "12345", "userId": "67890", "message": "Failed to connect to database", "details": { "error": "Connection timeout", "retryCount": 3 } } Using structured logging, particularly in JSON format, brings several advantages: Consistency: Ensures that all logs follow a uniform structure, making automated parsing easier. Readability: JSON format is human-readable and can be easily understood by validation engineers. Interoperability: JSON logs can be easily integrated with various logging and monitoring tools. 3. Validate Log Content Validation engineers should ensure that logs contain necessary information and are correctly formatted. Techniques include the following. Automated Tests Verify that logs are generated correctly during different scenarios. Scenario coverage: Test logging for various scenarios, including normal operations, error conditions, and edge cases. Log content checks: Use automated checks to ensure that logs include all required fields and follow the defined format. 
Log volume tests: Simulate high-load conditions to ensure that logging does not degrade performance or miss entries. Manual Reviews Periodically check logs to ensure they meet the defined requirements. Random sampling: Review a random sample of log entries to verify their accuracy and completeness. Consistency checks: Ensure that logs from different microservices or instances follow the same structure and contain the same level of detail. 4. Monitor Log Performance Ensure logging does not degrade application performance by: Sampling: Logging only a subset of high-frequency operations. Asynchronous logging: Using asynchronous methods to reduce impact on performance. 5. Secure Logging Practices Logs often contain sensitive information, so secure logging practices are essential: Encryption: Encrypt log data both during transmission and when stored. Access control: Restrict log access to authorized personnel only. What To Test in Logs When validating logs, consider the following aspects: Completeness Are all necessary events being logged? Ensure that logs capture all relevant actions, state changes, and errors. Validate that no critical information is missing from the logs. Accuracy Is the information logged correctly? Verify that log entries accurately reflect the events that occurred. Ensure that logs do not contain misleading or incorrect information. Consistency Are log entries consistently formatted? Check that all logs follow the same structure and format. Ensure that logs are uniformly structured across different microservices. Timeliness Are logs being generated in real time? Validate that logs are recorded promptly and reflect real-time events. Ensure that there is no significant delay between an event and its logging. Security Are logs protected against unauthorized access and tampering? Ensure that logs are encrypted and stored securely. Validate that access to logs is restricted to authorized personnel only. Examples of Poor Logging Practices Example 1: Insufficient Information JSON { "timestamp": "2024-07-11T10:00:00Z", "level": "ERROR", "message": "Failed to connect to database" } Issue: Lacks context, no transaction ID, user ID, or error details. Example 2: Inconsistent Format JSON { "timestamp": "2024-07-11T10:00:00Z", "level": "ERROR", "transactionId": "12345", "userId": "67890", "message": "Failed to connect to database: Connection timeout" } JSON { "time": "2024/07/11 10:01:00", "severity": "ERROR", "transaction": "12346", "user": "67891", "msg": "Database connection timeout" } Issue: Different formats make automated analysis difficult. Conclusion In the world of cloud-based microservices, logs are crucial for debugging, monitoring, and security. Yet, their importance is often overlooked during the testing phase. Validation engineers play a critical role in ensuring that logs are comprehensive, consistent, and secure. By following best practices and focusing on thorough log testing, we can significantly enhance the reliability and efficiency of our microservices. Remember, proper log testing is not optional — it is essential for maintaining robust cloud-based applications.
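To make the automated log-content checks described above concrete, here is a minimal sketch in Java. It assumes logs are emitted as JSON lines, that Jackson and JUnit 5 are on the classpath, and that the required field names follow the structured-log example shown earlier; adjust both to your own logging convention.

Java

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;

import java.util.List;

import static org.junit.jupiter.api.Assertions.assertTrue;

class LogEntryFormatTest {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Fields every structured log entry is expected to carry (assumed convention)
    private static final List<String> REQUIRED_FIELDS =
            List.of("timestamp", "level", "transactionId", "userId", "message");

    @Test
    void errorLogContainsRequiredFields() throws Exception {
        // In a real test this line would be read from the service's captured log output
        String logLine = "{\"timestamp\":\"2024-07-11T10:00:00Z\",\"level\":\"ERROR\"," +
                "\"transactionId\":\"12345\",\"userId\":\"67890\"," +
                "\"message\":\"Failed to connect to database\"}";

        JsonNode entry = MAPPER.readTree(logLine);
        for (String field : REQUIRED_FIELDS) {
            assertTrue(entry.hasNonNull(field), "Missing required log field: " + field);
        }
    }
}

A check like this can run in CI for each scenario (normal operation, error conditions, edge cases) so that format drift is caught long before an incident forces someone to read the logs.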
In the realm of distributed systems, ensuring that only one process can access a shared resource at any given time is crucial — this is where mutual exclusion comes into play. Without a reliable way to enforce mutual exclusion, systems can easily run into issues like data inconsistency or race conditions, potentially leading to catastrophic failures. As distributed systems grow more complex, the need for robust algorithms to manage access to shared resources becomes ever more critical. Algorithms To Address the Challenge Over the years, several algorithms have been developed to address the challenge of mutual exclusion in distributed environments. One of the most well-known is the Majority Quorum Algorithm. This algorithm is effective in maintaining data consistency by requiring a majority of nodes to agree before a shared resource can be accessed. However, it can be quite demanding in terms of communication, especially when dealing with a large network of nodes, leading to significant overhead and latency issues. On the other hand, there’s the Tree Quorum Algorithm. This method organizes nodes into a binary tree structure, reducing the number of nodes that need to be involved in the quorum. By strategically choosing nodes that form a quorum based on the tree structure, it significantly lowers communication costs while also improving fault tolerance. In distributed systems, achieving both low communication overhead and high fault tolerance is often a challenging balance — the Tree Quorum Algorithm excels at striking this balance. Practical Example Let’s dive into a practical example to illustrate how the Tree Quorum Algorithm can be implemented and used. Imagine you have a distributed system where you need to ensure mutual exclusion across a network of five nodes. Instead of contacting all nodes, as you might with a majority quorum, the tree quorum approach allows you to communicate with just a subset, following a path from the root node down to a leaf. This drastically reduces the number of messages you need to send, making your system more efficient. 
Here’s a quick Python example that illustrates how you might implement this:

Python

class TreeNode:
    def __init__(self, id):
        self.id = id
        self.left = None
        self.right = None
        self.is_active = True  # Represents the node's active status


def construct_tree(nodes):
    """Constructs a binary tree from a list of nodes."""
    if not nodes:
        return None
    root = TreeNode(nodes[0])
    queue = [root]
    index = 1
    while index < len(nodes):
        current_node = queue.pop(0)
        if index < len(nodes):
            current_node.left = TreeNode(nodes[index])
            queue.append(current_node.left)
            index += 1
        if index < len(nodes):
            current_node.right = TreeNode(nodes[index])
            queue.append(current_node.right)
            index += 1
    return root


def form_quorum(node, depth):
    """Forms a quorum based on a specific depth level of the tree, handling failures."""
    if not node or depth == 0:
        return []
    quorum = []
    # Check if the node is active before adding to the quorum
    if node.is_active:
        quorum.append(node.id)
    if depth > 1:
        # Try forming quorum from left and right children
        if node.left:
            quorum.extend(form_quorum(node.left, depth - 1))
        if node.right:
            quorum.extend(form_quorum(node.right, depth - 1))
    return quorum


def simulate_failure(node, failed_nodes):
    """Simulates failure of nodes by marking them as inactive."""
    if node:
        if node.id in failed_nodes:
            node.is_active = False
        simulate_failure(node.left, failed_nodes)
        simulate_failure(node.right, failed_nodes)


# Example usage:
nodes = ['A', 'B', 'C', 'D', 'E']
root = construct_tree(nodes)

# Simulate failures of nodes 'B' and 'D'
simulate_failure(root, ['B', 'D'])

# Forming a quorum at depth 2
quorum = form_quorum(root, 2)
print(f"Formed Quorum: {quorum}")

In the above code, we construct a binary tree from a list of nodes and then traverse the tree to form a quorum. The algorithm is designed to check if nodes are active before adding them to the quorum, which helps in handling failures. If some nodes fail, the algorithm dynamically adjusts by choosing alternative paths through the tree, ensuring that a quorum can still be formed without involving the failed nodes.

Why Does This Matter?

Now, why does this matter? It's simple: efficiency and fault tolerance are key in distributed systems. The Tree Quorum Algorithm not only makes your system more efficient by reducing the communication overhead but also ensures that your system can continue to function even when some nodes go down. Beyond mutual exclusion, this algorithm can also be applied to other critical tasks like replicated data management and commit protocols in distributed databases. For example, it can help ensure that read operations always return the most up-to-date data, or that distributed transactions either fully commit or fully roll back, without getting stuck in an inconsistent state. In conclusion, the Tree Quorum Algorithm offers a smart and scalable solution to the age-old problem of mutual exclusion in distributed systems, proving that sometimes, less really is more.
In contemporary software architecture, distributed systems have long been recognized as the foundation for applications with high availability, scalability, and reliability goals. As systems shifted away from centralized structures, it became increasingly important to focus on the components and architectures that support a distributed structure. Among the available frameworks, Spring Boot is widely adopted and encompasses many tools, libraries, and components to support these patterns. This article focuses on specific recommendations for implementing various distributed system patterns with Spring Boot, backed by sample code and professional advice.

Spring Boot Overview

Spring is one of the most popular Java EE frameworks for creating applications. The Spring framework offers a comprehensive programming and configuration mechanism for the Java platform. It seeks to make Java EE programming easier and to increase developers' productivity, and it can be used on any type of deployment platform. It tries to meet modern industry demands by making application development rapid and straightforward. While the Spring framework focuses on giving you flexibility, the goal of Spring Boot is to reduce the amount of code and give developers the most straightforward approach possible to create web applications. Spring Boot's default configuration and annotation setup lessen the time it takes to design an application, and it facilitates the creation of stand-alone applications with minimal, if any, configuration. It is constructed on top of the Spring framework. Spring Boot has a layered, hierarchical structure in which each layer communicates with the layers directly above and below it:

Presentation layer: Converts the JSON parameter to an object, processes HTTP requests (from the specific RESTful API), authenticates the request, and sends it to the business layer. In brief, it is made up of views, i.e., the frontend section.

Business layer: Manages all business logic. It is composed of service classes and uses services from the data access layers. It also carries out validation and authorization.

Persistence layer: Using various tools like JDBC and repositories, the persistence layer translates business objects from and to database rows. It houses all of the storage logic.

Database layer: CRUD (create, retrieve, update, and delete) actions are carried out at the database layer, which also contains the actual scripts that import and export data into and out of the database.

Figure: Spring Boot flow architecture

Table 1: Significant differences between Spring and Spring Boot

1. Microservices Pattern

The microservices pattern is arguably one of the most used designs in the current software world. It entails breaking down a complex, monolithic application into a collection of small, interoperable services. Each microservice executes its own processes and interconnects with other services using simple, lightweight protocols, commonly RESTful APIs or message queues. The main advantages of microservices are that they are easier to scale, isolate faults well, and can be deployed independently. Spring Boot and Spring Cloud provide an impressive list of features to help implement a microservices architecture.
Services from Spring Cloud include service registry, provided by Netflix Eureka or Consul; configuration offered by Spring Cloud config; and resilience pattern offered through either Hystrix or recently developed Resilience4j. Let’s, for instance, take a case where you’re creating an e-commerce application. This application can be split into several microservices covering different domains, for example, OrderService, PaymentService, and InventoryService. All these services can be built, tested, and implemented singularly in service-oriented systems. Java @RestController @RequestMapping("/orders") public class OrderController { @Autowired private OrderService orderService; @PostMapping public ResponseEntity<Order> createOrder(@RequestBody Order order) { Order createdOrder = orderService.createOrder(order); return ResponseEntity.status(HttpStatus.CREATED).body(createdOrder); } @GetMapping("/{id}") public ResponseEntity<Order> getOrder(@PathVariable Long id) { Order order = orderService.getOrderById(id); return ResponseEntity.ok(order); } } @Service public class OrderService { // Mocking a database call private Map<Long, Order> orderRepository = new HashMap<>(); public Order createOrder(Order order) { order.setId(System.currentTimeMillis()); orderRepository.put(order.getId(), order); return order; } public Order getOrderById(Long id) { return orderRepository.get(id); } } In the example above, OrderController offers REST endpoints for making and retrieving orders, while OrderService manages the business logic associated with orders. With each service operating in a separate, isolated environment, this pattern may be replicated for the PaymentService and InventoryService. 2. Event-Driven Pattern In an event-driven architecture, the services do not interact with each other in a request-response manner but rather in a loosely coupled manner where some services only produce events and others only consume them. This pattern is most appropriate when there is a need for real-time processing while simultaneously fulfilling high scalability requirements. It thus establishes the independence of the producers and consumers of events — they are no longer tightly linked. An event-driven system can efficiently work with large and unpredictable loads of events and easily tolerate partial failures. Implementation With Spring Boot Apache Kafka, RabbitMQ, or AWS SNS/SQS can be effectively integrated with Spring Boot, greatly simplifying the creation of event-driven architecture. Spring Cloud Stream provides developers with a higher-level programming model oriented on microservices based on message-driven architecture, hiding the specifics of different messaging systems behind the same API. Let us expand more on the e-commerce application. Consider such a scenario where the order is placed, and the OrderService sends out an event. This event can be consumed by other services like InventoryService to adjust the stock automatically and by ShippingService to arrange delivery. 
Java // OrderService publishes an event @Autowired private KafkaTemplate<String, String> kafkaTemplate; public void publishOrderEvent(Order order) { kafkaTemplate.send("order_topic", "Order created: " + order.getId()); } // InventoryService listens for the order event @KafkaListener(topics = "order_topic", groupId = "inventory_group") public void consumeOrderEvent(String message) { System.out.println("Received event: " + message); // Update inventory based on the order details } In this example, OrderService publishes an event to a Kafka topic whenever a new order is created. InventoryService, which subscribes to this topic, consumes and processes the event accordingly. 3. CQRS (Command Query Responsibility Segregation) The CQRS pattern suggests the division of the handling of commands into events that change the state from the queries, which are events that retrieve the state. This can help achieve a higher level of scalability and maintainability of the solution, especially when the read and write operations within an application are significantly different in the given area of a business domain. As for the support for implementing CQRS in Spring Boot applications, let’s mention the Axon Framework, designed to fit this pattern and includes command handling, event sourcing, and query handling into the mix. In a CQRS setup, commands modify the state in the write model, while queries retrieve data from the read model, which could be optimized for different query patterns. A banking application, for example, where account balances are often asked, but the number of transactions that result in balance change is comparatively less. By separating these concerns, a developer can optimize the read model for fast access while keeping the write model more consistent and secure. Java // Command to handle money withdrawal @CommandHandler public void handle(WithdrawMoneyCommand command) { if (balance >= command.getAmount()) { balance -= command.getAmount(); AggregateLifecycle.apply(new MoneyWithdrawnEvent(command.getAccountId(), command.getAmount())); } else { throw new InsufficientFundsException(); } } // Query to fetch account balance @QueryHandler public AccountBalance handle(FindAccountBalanceQuery query) { return new AccountBalance(query.getAccountId(), this.balance); } In this code snippet, a WithdrawMoneyCommand modifies the account balance in the command model, while a FindAccountBalanceQuery retrieves the balance from the query model. 4. API Gateway Pattern The API Gateway pattern is one of the critical patterns used in a microservices architecture. It is the central access point for every client request and forwards it to the right microservice. The following are the cross-cutting concerns: Authentication, logging, rate limiting, and load balancing, which are all handled by the gateway. Spring Cloud Gateway is considered the most appropriate among all the available options for using an API Gateway in a Spring Boot application. It is developed on Project Reactor, which makes it very fast and can work with reactive streams. Let us go back to our first e-commerce example: an API gateway can forward the request to UserService, OrderService, PaymentService, etc. It can also have an authentication layer and accept subsequent user requests to be passed to the back-end services. 
Java

@Bean
public RouteLocator customRouteLocator(RouteLocatorBuilder builder) {
    return builder.routes()
        .route("order_service", r -> r.path("/orders/**")
            .uri("lb://ORDER-SERVICE"))
        .route("payment_service", r -> r.path("/payments/**")
            .uri("lb://PAYMENT-SERVICE"))
        .build();
}

In this example, the API Gateway routes requests to the appropriate microservice based on the request path. The lb:// prefix indicates that these services are registered with a load balancer (such as Eureka).

5. Saga Pattern

The Saga pattern manages transactions that span multiple services in a distributed environment. With multiple microservices, each potentially owning its own database, maintaining data consistency becomes challenging. The Saga pattern ensures that either all the operations across services complete successfully or compensating transactions are performed to reverse the effects of the steps that failed.

The Saga pattern can be implemented in Spring Boot using either choreography, where services coordinate directly through events, or orchestration, where a central coordinator oversees the saga. Each strategy has advantages and disadvantages depending on the intricacy of the transactions and the degree of service coupling.

Imagine a scenario where placing an order involves multiple services, such as PaymentService, InventoryService, and ShippingService. Every service has to execute successfully for the order to be confirmed. If any service fails, compensating transactions must be performed to bring the system back to its initial state. The snippet below sketches an orchestrated saga; the service and compensation method names are illustrative.

Java

public void processOrder(Order order) {
    try {
        paymentService.processPayment(order.getPaymentDetails());
        inventoryService.reserveItems(order.getItems());
        shippingService.scheduleShipment(order);
    } catch (Exception e) {
        // Compensating transactions reverse the steps that already succeeded
        inventoryService.releaseItems(order.getItems());
        paymentService.refundPayment(order.getPaymentDetails());
        throw new IllegalStateException("Order failed and was compensated", e);
    }
}

Figure 2: Amazon's Saga Pattern Functions Workflow

The Saga pattern is a failure management technique that helps coordinate transactions across several microservices to preserve data consistency in distributed systems. Every transaction in a microservice publishes an event, and the subsequent transaction is started based on the event's result. Depending on whether the transactions are successful or unsuccessful, they can proceed in one of two ways. As demonstrated in Figure 2, the Saga pattern can use AWS Step Functions to construct an order processing system, where every step (like "ProcessPayment") has a separate step to handle its success (like "UpdateCustomerAccount") or failure (like "SetOrderFailure").

A company or developer ought to consider implementing the Saga pattern if:

The application must provide data consistency among several microservices without tightly coupling them together.
Some transactions run for a long time, and they want to avoid blocking other microservices because of one prolonged operation.
If an operation in the sequence fails, it must be possible to roll back the steps that have already completed.

It is important to remember that the Saga pattern becomes more complex as the number of microservices increases, and that debugging is challenging. The pattern also requires writing compensating transactions to reverse and undo changes, which demands a disciplined programming approach.

6. Circuit Breaker Pattern

The Circuit Breaker is another fundamental design pattern in distributed systems; it helps prevent cascading failures and thereby improves the system's reliability.
It works by wrapping potentially failing operations in a circuit breaker object that watches for failures. When failures exceed a specified threshold, the circuit "opens," and subsequent calls to the operation immediately return an error or a fallback response without performing the task. This lets the system fail fast and protects downstream services that might otherwise be overwhelmed. In Spring, you can apply the Circuit Breaker pattern with the help of Spring Cloud Circuit Breaker and Resilience4j. Here's a concise implementation:

Java

// Add dependency in build.gradle or pom.xml
// implementation 'org.springframework.cloud:spring-cloud-starter-circuitbreaker-resilience4j'

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class ExampleService {

    @CircuitBreaker(name = "exampleBreaker", fallbackMethod = "fallbackMethod")
    public String callExternalService() {
        // Simulating an external service call that might fail
        if (Math.random() < 0.7) { // 70% chance of failure
            throw new RuntimeException("External service failed");
        }
        return "Success from external service";
    }

    public String fallbackMethod(Exception ex) {
        return "Fallback response: " + ex.getMessage();
    }
}

// In application.properties or application.yml
resilience4j.circuitbreaker.instances.exampleBreaker.failureRateThreshold=50
resilience4j.circuitbreaker.instances.exampleBreaker.waitDurationInOpenState=5000ms
resilience4j.circuitbreaker.instances.exampleBreaker.slidingWindowSize=10

In this implementation:

The @CircuitBreaker annotation is added to the callExternalService method.
A fallback method is specified to be called when the circuit is open.
The circuit breaker properties are configured in the application configuration file.

This configuration improves system stability by preventing cascading failures and allowing the service to handle errors in the external call gracefully.

Conclusion

By applying the microservices pattern, event-driven pattern, command query responsibility segregation, API gateway pattern, saga pattern, and circuit breaker pattern with the help of Spring Boot, developers can build distributed systems that are scalable, resilient, maintainable, and able to evolve. Spring Boot's extensive ecosystem makes it possible to address the challenges of distributed computing, which makes the framework a strong choice for developers who want to create cloud applications. The examples and explanations in this article are intended to help the reader begin using distributed system patterns when developing applications with Spring Boot. As they gain experience, developers can explore further patterns and more sophisticated techniques to better optimize their systems and make sure they can withstand the demands of today's complex and dynamic software environments.
Managing database connection strings securely for any microservice is critical; often, we secure the username and password using the environment variables and never factor in masking or hiding the database hostname. In reader and writer database instances, there would be a mandate in some organizations not to disclose the hostname and pass that through an environment variable at runtime during the application start. This article discusses configuring the hostname through environment variables in the properties file. Database Configurations Through Environment Variables We would typically configure the default connection string for Spring microservices in the below manner, with the database username and password getting passed as the environment variables. Java server.port=8081 server.servlet.context-path=/api/e-sign/v1 spring.esign.datasource.jdbc-url=jdbc:mysql://localhost:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false spring.esign.datasource.username=${DB_USER_NAME} spring.esign.datasource.password=${DB_USER_PASSWORD} spring.esign.datasource.driver-class-name=com.mysql.cj.jdbc.Driver spring.esign.datasource.minimumIdle=5 spring.esign.datasource.maxLifetime=120000 If our microservice connects to a secure database with limited access and the database administrator or the infrastructure team does not want you to provide the database hostname, then we have an issue. Typically, the production database hostname would be something like below: Java spring.esign.datasource.jdbc-url=jdbc:mysql://prod-db.fabrikam.com:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false spring.esign.datasource.username=${DB_USER_NAME} spring.esign.datasource.password=${DB_USER_PASSWORD} Using @Configuration Class In this case, the administrator or the cloud infrastructure team wants them to provide the hostname as an environment variable at runtime when the container starts. One of the options is to build and concatenate the connection string in the configuration class as below: Java @Configuration public class DatabaseConfig { private final Environment environment; public DatabaseConfig(Environment environment) { this.environment = environment; } @Bean public DataSource databaseDataSource() { String hostForDatabase = environment.getProperty("ESIGN_DB_HOST", "localhost:3306"); String dbUserName = environment.getProperty("DB_USER_NAME", "user-name"); String dbUserPassword = environment.getProperty("DB_USER_PASSWORD", "user-password"); String url = String.format("jdbc:mysql://%s/e-sign?allowPublicKeyRetrieval=true&useSSL=false", hostForDatabase); DriverManagerDataSource dataSource = new DriverManagerDataSource(); dataSource.setDriverClassName("com.mysql.cj.jdbc.Driver"); dataSource.setUrl(url); dataSource.setUsername(dbUserName); // Replace with your actual username dataSource.setPassword(dbUserPassword); // Replace with your actual password return dataSource; } } The above approach would work, but we need to use the approach with application.properties, which is easy to use and quite flexible. The properties file allows you to collate all configurations in a centralized manner, making it easier to update and manage. It also improves readability by separating configuration from code. The DevOps team can update the environment variable values without making code changes. Environment Variable for Database Hostname Commonly, we use environment variables for database username and password and use the corresponding expression placeholder expressions ${} in the application properties file. 
Java spring.esign.datasource.username=${DB_USER_NAME} spring.esign.datasource.password=${DB_USER_PASSWORD} However, for the database URL, we need to use the environment variable only for the hostname and not for the connection string, as each connection string for different microservices would have different parameters. So, to address this, Spring allows you to have the placeholder expression within the connection string shown below; this gives flexibility and the ability to stick with the approach of using the application.properties file instead of doing it through the database configuration class. Java spring.esign.datasource.jdbc-url=jdbc:mysql://${ESIGN_DB_HOST}:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false Once we have decided on the above approach and if we need to troubleshoot any issue for whatever reason in lower environments, we can then use the ApplicationListener interface to see the resolved URL: Java @Component public class ApplicationReadyLogger implements ApplicationListener<ApplicationReadyEvent> { private final Environment environment; public ApplicationReadyLogger(Environment environment) { this.environment = environment; } @Override public void onApplicationEvent(ApplicationReadyEvent event) { String jdbcUrl = environment.getProperty("spring.esign.datasource.jdbc-url"); System.out.println("Resolved JDBC URL: " + jdbcUrl); } } If there is an issue with the hostname configuration, it will show as an error when the application starts. However, after the application has been started, thanks to the above ApplicationReadyLogger implementation, we can see the database URL in the application logs. Please note that we should not do this in production environments where the infrastructure team wants to maintain secrecy around the database writer hostname. Using the above steps, we can configure the database hostname as an environment variable in the connection string inside the application.properties file. Conclusion Using environment variables for database hostnames to connect to data-sensitive databases can enhance security and flexibility and give the cloud infrastructure and DevOps teams more power. Using the placeholder expressions ensures that our configuration remains clear and maintainable.
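One small convenience that complements the approach above for local development is Spring's default-value syntax for property placeholders. The sketch below assumes the same ESIGN_DB_HOST variable discussed in the article; the class and getter names are illustrative.

Java

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class DatabaseHostProperties {

    // Resolves to the ESIGN_DB_HOST environment variable when it is set,
    // and falls back to localhost:3306 for local development otherwise.
    @Value("${ESIGN_DB_HOST:localhost:3306}")
    private String databaseHost;

    public String getDatabaseHost() {
        return databaseHost;
    }
}

The same ${ESIGN_DB_HOST:localhost} default syntax also works inside the jdbc-url entry in application.properties, which lets lower environments start without the variable; teams that want startup to fail fast when the variable is missing should simply omit the default in production profiles.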
A specter is haunting modern development: as our architecture has grown more mature, developer velocity has slowed. A primary cause of lost developer velocity is a decline in testing: in testing speed, accuracy, and reliability. Duplicating environments for microservices has become a common practice in the quest for consistent testing and production setups. However, this approach often incurs significant infrastructure costs that can affect both budget and efficiency. Testing services in isolation isn’t usually effective; we want to test these components together. High Costs of Environment Duplication Duplicating environments involves replicating entire setups, including all microservices, databases, and external dependencies. This approach has the advantage of being technically quite straightforward, at least at first blush. Starting with something like a namespace, we can use modern container orchestration to replicate services and configuration wholesale. The problem, however, comes in the actual implementation. For example, a major FinTech company was reported to have spent over $2 million annually just on cloud costs. The company spun up many environments for previewing changes and for the developers to test them, each mirroring their production setup. The costs included server provisioning, storage, and network configurations, all of which added up significantly. Each team needed its own replica, and they expected it to be available most of the time. Further, they didn’t want to wait for long startup times, so in the end, all these environments were running 24/7 and racking up hosting costs the whole time. While namespacing seems like a clever solution to environment replication, it just borrows the same complexity and cost issues from replicating environments wholesale. Synchronization Problems Problems of synchronization are one issue that rears its head when trying to implement replicated testing environments at scale. Essentially, for all internal services, how certain are we that each replicated environment is running the most updated version of every service? This sounds like an edge case or small concern until we remember the whole point of this setup was to make testing highly accurate. Finding out only when pushing to production that recent updates to Service C have broken my changes to Service B is more than frustrating; it calls into question the whole process. Again, there seem to be technical solutions to this problem: Why don’t we just grab the most recent version of each service at startup? The issue here is the impact on velocity: If we have to wait for a complete clone to be pulled configured and then started every time we want to test, we’re quickly talking about many minutes or even hours to wait before our supposedly isolated replica testing environment is ready to be used. Who is making sure individual resources are synced? This issue, like the others mentioned here, is specific to scale: If you have a small cluster that can be cloned and started in two minutes, very little of this article applies to you. But if that’s the case, it’s likely you can sync all your services’ states by sending a quick Slack message to your single two-pizza team. Third-party dependencies are another wrinkle with multiple testing environments. Secrets handling policies often mean that third-party dependencies can’t have all their authentication info on multiple replicated testing environments; as a result, those third-party dependencies can’t be tested at an early stage. 
This puts pressure back on staging, as it is the only point where a real end-to-end test can happen.

Maintenance Overhead

Managing multiple environments also brings a considerable maintenance burden. Each environment needs to be updated, patched, and monitored independently, leading to increased operational complexity. This can strain IT resources, as teams must ensure that each environment remains in sync with the others, further escalating costs. A notable case involved a large enterprise that found its duplicated environments increasingly challenging to maintain. Testing environments became so divergent from production that it led to significant issues when deploying updates. The company experienced frequent failures because changes tested in one environment did not accurately reflect the state of the production system, leading to costly delays and rework. The result was small teams "going rogue," pushing their changes straight to staging and only checking whether they worked there. Not only were the replicated environments abandoned, hurting staging's reliability, but the platform team was also still paying to run environments that no one was using.

Scalability Challenges

As applications grow, the number of environments may need to increase to accommodate various stages of development, testing, and production. Scaling these environments can become prohibitively expensive, especially when dealing with high volumes of microservices. The infrastructure required to support numerous replicated environments can quickly outpace budget constraints, making it challenging to maintain cost-effectiveness. For instance, a tech company that initially managed its environments by duplicating production setups found that as its service portfolio expanded, the costs associated with scaling these environments became unsustainable. The company faced difficulty in keeping up with the infrastructure demands, leading to a reassessment of its strategy.

Alternative Strategies

Given the high costs associated with environment duplication, it is worth considering alternative strategies. One approach is dynamic environment provisioning, where environments are created on demand and torn down when no longer needed. This method can help optimize resource utilization and reduce costs by avoiding the need for permanently duplicated setups. It can keep costs down but still comes with the trade-off of sending some testing to staging anyway. That's because there are shortcuts we must take to spin up these dynamic environments, such as using mocks for third-party services. This may put us back at square one in terms of testing reliability, that is, how well our tests reflect what will happen in production. At this point, it's reasonable to consider alternative methods that use technical fixes to make staging and other near-to-production environments easier to test on. One such method is request isolation, a model for letting multiple tests occur simultaneously in the same shared environment.

Conclusion: A Cost That Doesn't Scale

While duplicating environments might seem like a practical solution for ensuring consistency in microservices, the infrastructure costs involved can be significant. By exploring alternative strategies such as dynamic provisioning and request isolation, organizations can better manage their resources and mitigate the financial impact of maintaining multiple environments.
Real-world examples illustrate the challenges and costs associated with traditional duplication methods, underscoring the need for more efficient approaches in modern software development.
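To make the request-isolation idea mentioned above more tangible, here is a minimal sketch in Java, assuming Spring Boot 3 (the jakarta.servlet namespace) and a hypothetical X-Test-Tenant header that tags which test a request belongs to. Real implementations propagate such a header across every hop so that a router or service mesh can steer tagged traffic to the service version under test while everyone else shares the same environment.

Java

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.IOException;

// Captures the (hypothetical) X-Test-Tenant header so that downstream calls
// made while handling this request can carry the same tag.
@Component
public class TestTenantFilter extends OncePerRequestFilter {

    public static final String HEADER = "X-Test-Tenant";
    private static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    public static String currentTenant() {
        return CURRENT_TENANT.get();
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        try {
            CURRENT_TENANT.set(request.getHeader(HEADER));
            filterChain.doFilter(request, response);
        } finally {
            CURRENT_TENANT.remove();
        }
    }
}

@Configuration
class TestTenantPropagationConfig {

    @Bean
    RestTemplate restTemplate(RestTemplateBuilder builder) {
        // Forwards the tag on outgoing calls so routing infrastructure can
        // direct tagged requests to the test deployment of a dependency.
        return builder.additionalInterceptors((request, body, execution) -> {
            String tenant = TestTenantFilter.currentTenant();
            if (tenant != null) {
                request.getHeaders().add(TestTenantFilter.HEADER, tenant);
            }
            return execution.execute(request, body);
        }).build();
    }
}

This is only a sketch of the propagation half of the pattern; the routing half (deciding which instance receives tagged traffic) typically lives in the mesh or gateway layer rather than in application code.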
Architecture design is imperative to modern software development. It impacts application performance, development time, quality, and overall user experience. Monolithic and microservices are two popular architectural designs powering most software applications today. Monolithic design is a more traditional approach that treats the entire application as a single entity. On the other hand, microservices are a more modern approach that divides the application into smaller components, giving developers granular control. For most developers, currently, microservices stand out as the best approach since they are a novel concept and provide finer control. However, this is not necessarily true in all cases. This article will discuss the monolithic vs microservices architectural design approaches in detail and highlight scenarios where either is desirable. Table of Contents What is monolithic architecture What is microservices architecture Key differences between monolithic vs microservices architecture Monolithic vs microservices architecture: How to choose Conclusion What Is Monolithic Architecture? Monolithic architecture is the conventional method in application development. A single codebase contains all logic, workflows, and databases. This architecture treats the entire application as a single unit, with all components tightly coupled. To illustrate what monolithic architecture entails, consider a large machine with several cogs and gears working simultaneously. If a single gear breaks, the entire machine stops, and if a single component is to be replaced, all gears must be halted. Pros Easier development and deployment: Since the entire logic is all in one place, it is easier to understand and extend, especially for small applications. A single application also means fewer individual components and easier deployment. Easier testing and debugging: Test cases are simpler to implement due to a unified architecture and a single executable file. Bugs are also generally easier to locate and fix. Cons Difficult to scale: Maintaining a growing codebase becomes exponentially challenging in monolithic applications. Since all modules are tightly coupled, any updates to a single component will require updating the entire application code. Slower development: A monolithic architecture usually does not allow for parallel development since the ripple effects impact multiple modules. The growing application complexity also means that bugs become challenging to fix over time. Technology bound: Monolithic apps are usually built on a single programming language, so their technology adaptation is limited. They are strictly linked to their initial design, and new database architecture or development frameworks outside the language cannot be included in the application. What Is Microservices Architecture? Microservices architecture improves upon monolithic design by introducing modularity to the application. It divides the application into smaller chunks called microservices, each developed and operating independently of the other. The architecture design in Microservices is such that each module handles a different functionality. One might be hosting databases and responsible for CRUD operations, while another might process Frontend forms. Each module is loosely coupled with the other and is often linked via API endpoints. Pros Scalability: Microservices offer unparalleled scalability and flexibility for application development. 
Every time a new feature is introduced, only the specific service in question needs to be stopped, or an entirely new service can be created and integrated depending on the requirements. As a result, any new development or bug fix can be implemented with minimal downtime. Flexibility: The modular architecture can accommodate various languages and technologies depending on requirements. As such, separate services can be developed to maintain SQL and NoSQL databases. Moreover, the same application can integrate Python for a machine learning backend and React JS for a blazing-fast frontend. Fault isolation: Disruption in one service does not halt the entire application. All independent modules continue to function as expected. Faster deployment: Individual services are independently deployable. This allows multiple developers to work in parallel and push their changes to deployment.

Cons

Increased development costs: Each microservice has an independent development cycle, test suites, and deployment procedures. They also increase infrastructure complexity and cost. Increased overhead: There is added overhead in maintaining the various services and the communication between them. The more complex architecture may also require maintenance from specialized experts. Lack of standardization: With multiple services running in parallel, standardizing workflows becomes a challenge. Different developers working on the various services might implement different coding practices, logging schemas, and development frameworks.

Key Differences Between Monolithic vs Microservices Architecture

Both monolithic and microservices architectures are popular in software development. However, these design patterns have stark differences, making them suitable for different use cases.

Structure
Monolithic: A unified, single structure.
Microservices: A modular structure consisting of various smaller services.

Scalability
Monolithic: Difficult to scale due to tightly coupled modules.
Microservices: Easier to scale due to the modular structure.

Deployment
Monolithic: Relatively easy to deploy for a small application, but it gets exponentially more challenging as the application grows.
Microservices: Each microservice is independently deployable, making the process relatively easier.

Development
Monolithic: Development is simple as long as the application remains small; large-scale applications can become complex and difficult to develop and maintain.
Microservices: Carries a development overhead due to the additional maintenance of the infrastructure itself; however, developing the individual services is simpler and faster.

Technology Diversity
Monolithic: Strictly bound to legacy frameworks, databases, and the original programming language.
Microservices: Each new service can be built with the technology of choice.

Modification
Monolithic: Even minor modifications require halting the entire application, and the ripple effects are felt across the application.
Microservices: Each service can be modified individually without touching the rest of the application.

Fault Tolerance
Monolithic: Failure of any workflow can disrupt the entire application.
Microservices: Failure of a single service does not impact other services.

Resource Utilization
Monolithic: Uses fewer resources due to the simpler architecture and limited technologies used.
Microservices: The more complex infrastructure requires higher resource utilization for smooth functioning.

Monolithic architecture contains all its components in a single place. All workflows are created in a single deployable package. The unified architecture is easy to start but gets challenging to maintain as the application grows.
Essentially, development and deployment complexities grow exponentially, making the application difficult to scale and modify. On the other hand, microservices were created to tackle the challenges of modern large-scale applications. They treat different modules as independent components and connect them via API endpoints. Their complex infrastructure carries additional overhead but allows: Scalability Fault tolerance Easy modification Monolithic vs Microservices Architecture: How To Choose At first glance, it may seem that the microservices architecture is the better choice. However, there are many more factors to consider before developing app infrastructure. The first aspect to consider is the nature of the application and the designated budget. From this perspective, microservices architecture is best suited for large-scale applications expected to scale and host several thousand users. The complex infrastructure is challenging but pays off in the long run since the large application will be easier to maintain and upgradable to the latest technologies. Consequently, microservices architecture is better suited for high-budget scenarios in that it requires senior software architects to build and long development times to get the initial infrastructure up and running. However, not all applications fit the description discussed above. Many small-scale applications have a limited development budget and target a limited user base. While these can still benefit from modularity, the development overhead is not worth the extra effort. As such, monolithic architecture is best suited for such applications as the unified implementation is simple to build and understand. Other aspects like debugging and testing are also easier to perform while avoiding unnecessary resource utilization and Development costs. Conclusion Monolithic and microservices are two popular development architectures in the software domain. The monolithic structure is a legacy approach that treats the entire application as a single unit. On the other hand, microservices divide the application into smaller modules called services. Both design patterns offer different advantages and suit different use cases. Monolithic architecture is better suited for smaller applications that require little to no scalability. The unified structure makes developing and maintaining the code base easier and does not carry the overhead costs of complex infrastructure design. Microservices, in contrast, are suited for large applications with a growing user base. This offers better scalability, fault tolerance, and quicker updates and deployment. In closing, design architecture can significantly impact the application's performance, scalability, and user experience. For this reason, it is imperative to select the right architecture and think it through before making a decision.
Ever wondered how Netflix keeps you glued to your screen with uninterrupted streaming bliss? Netflix Architecture is responsible for the smooth streaming experience that attracts viewers worldwide behind the scenes. Netflix's system architecture emphasizes how important it is to determine how content is shaped in the future. Join us on a journey behind the scenes of Netflix’s streaming universe! Netflix is a term that means entertainment, binge-watching, and cutting-edge streaming services. Netflix’s rapid ascent to popularity may be attributed to its vast content collection, worldwide presence, and resilient and inventive architecture. From its start as a DVD rental service in 1997 to its development into a major worldwide streaming company, Netflix has consistently used cutting-edge technology to revolutionize media consumption. Netflix Architecture is designed to efficiently and reliably provide content to millions of consumers at once. The scalability of Netflix’s infrastructure is critical, given its 200 million+ members across more than 190 countries. So, let’s delve into the intricacies of Netflix Architecture and uncover how it continues shaping how we enjoy our favorite shows and movies. Why Understand Netflix System Architecture? It’s important to understand Netflix System Architecture for several reasons. Above all, it sheds light on how Netflix accommodates millions of customers throughout the globe with a flawless streaming experience. We can learn about the technology and tactics that underlie its success better by exploring the nuances of this architecture. Furthermore, other industries can benefit from using Netflix’s design as a blueprint for developing scalable, reliable, and efficient systems. Its design principles and best practices can teach us important lessons about building and optimizing complicated distributed systems. We may also recognize the continual innovation driving the development of digital media consumption by understanding Netflix’s Architecture. Understanding the Requirements for System Design System design is crucial in developing complex software or technological infrastructure. These specifications act as the basis around which the entire system is constructed, driving choices and forming the end product. However, what are the prerequisites for system design, and what makes them crucial? Let’s explore. Functional Requirements The system’s functional requirements specify the features, functions, and capabilities that it must include. These specifications outline the system’s main objective and detail how various parts or modules interact. Functional requirements for a streaming platform like Netflix, for instance, could encompass the following, including but not limited to: Account creation: Users should be able to create accounts easily, providing necessary information for registration. User login: Registered users should have the ability to securely log in to their accounts using authentication credentials. Content suggestion: The platform should offer personalized content suggestions based on user preferences, viewing history, and other relevant data. Video playback capabilities: Users should be able to stream videos seamlessly, with options for playback controls such as play, pause, rewind, and fast forward. Non-Functional Requirements Non-functional requirements define the system’s behavior under different scenarios and ensure that it satisfies certain quality requirements. 
They cover performance, scalability, dependability, security, and compliance aspects of the system. Non-functional requirements for a streaming platform like Netflix, for instance, could include but are not limited to: Performance requirements: During periods of high utilization, the system must maintain low latency and high throughput. Compliance requirements: Regarding user data protection, the platform must abide by Data Protection Regulations standards. Scalability requirements: The infrastructure must be scalable to handle growing user traffic without sacrificing performance. Security requirements: To prevent unwanted access to user information, strong authentication and encryption procedures must be put in place. Reliability and availability requirements: For uninterrupted service delivery, the system needs to include failover methods and guarantee high uptime. Netflix Architecture: Embracing Cloud-Native After a significant setback due to database corruption in August 2008, Netflix came to the crucial conclusion that it was necessary to move away from single points of failure and towards highly dependable, horizontally scalable, cloud-based solutions. Netflix started a revolutionary journey by selecting Amazon Web Services (AWS) as its cloud provider and moving most of its services to the cloud by 2015. Following seven years of intensive work, the cloud migration was finished in early January 2016, which meant that the streaming service’s last remaining data center components were shut down. But getting to the cloud wasn’t a simple task. Netflix adopted a cloud-native strategy, completely overhauling its operational model and technological stack. This required embracing NoSQL databases, denormalizing their data model, and moving from a monolithic application to hundreds of microservices. Changes in culture were also necessary, such as adopting DevOps procedures, continuous delivery, and a self-service engineering environment. Despite the difficulties, this shift has made Netflix a cloud-native business that is well-positioned for future expansion and innovation in the rapidly changing field of online entertainment. Netflix Architectural Triad A strong architectural triad — the Client, Backend, and Content Delivery Network (CDN) — is responsible for Netflix’s flawless user experience. With millions of viewers globally, each component is essential to delivering content. Client The client-side architecture lies at the heart of the Netflix experience. This includes the wide range of devices users use to access Netflix, such as computers, smart TVs, and smartphones. Netflix uses a mix of web interfaces and native applications to ensure a consistent user experience across different platforms. Regardless of the device, these clients manage playback controls, user interactions, and interface rendering to deliver a unified experience. Users may easily browse the extensive content library and enjoy continuous streaming thanks to the client-side architecture’s responsive optimization. Netflix Architecture: Backend Backend architecture is the backbone of Netflix’s behind-the-scenes operations. The management of user accounts, content catalogs, recommendation algorithms, billing systems, and other systems is done by a complex network of servers, databases, and microservices. In addition to handling user data and coordinating content delivery, the backend processes user requests. 
Furthermore, the backend optimizes content delivery and personalizes recommendations using state-of-the-art technologies like big data analytics and machine learning, which raises user satisfaction and engagement. The backend architecture of Netflix has changed significantly over time. It began moving to cloud infrastructure in 2008 and adopted Spring Boot as its primary Java framework in 2018. When combined with the scalability and dependability provided by AWS (Amazon Web Services), homegrown technologies like Ribbon, Eureka, and Hystrix have been crucial in effectively coordinating backend operations. Netflix Architecture: Content Delivery Network The Content Delivery Network completes the Netflix architectural triad. A Content Delivery Network (CDN) is a strategically positioned global network of servers that aims to deliver content to users with optimal reliability and minimum delay. Netflix runs a Content Delivery Network (CDN) called Open Connect. It reduces buffering and ensures smooth playback by caching and serving material from sites closer to users. Even during times of high demand, Netflix reduces congestion and maximizes bandwidth utilization by spreading content over numerous servers across the globe. This decentralized method of content delivery improves global viewers' watching experiences while lowering buffering times and increasing streaming quality. Client-Side Components Web Interface Over the past few years, Netflix's web interface has seen a considerable transformation, switching from Silverlight to HTML5 to stream premium video content. With this change, there is no longer a need to install and maintain browser plug-ins, which simplifies the user experience. Netflix has increased its compatibility with a wide range of online browsers and operating systems, including Chrome OS, Chrome, Internet Explorer, Safari, Opera, Firefox, and Edge, since the introduction of HTML5 video. Netflix's use of HTML5 extends beyond simple playback. The platform has welcomed HTML5 adoption as an opportunity to support numerous industry standards and technological advancements. Mobile Applications Netflix's mobile applications extend the streaming experience to users of smartphones and tablets. These applications guarantee that users may access their favorite material while on the road. They are available on multiple platforms, including iOS and Android. By utilizing a combination of native development and platform-specific optimizations, Netflix provides a smooth and user-friendly interface for a wide range of mobile devices. With features like personalized recommendations, seamless playback, and offline downloading, Netflix's mobile applications meet the changing needs of viewers on the go. Users of the Netflix mobile app can enjoy continuous viewing of their favorite series and films while commuting, traveling, or just lounging around the house. Netflix is committed to providing a captivating and delightful mobile viewing experience with frequent upgrades and improvements. Smart TV Apps The Gibbon rendering layer, a JavaScript application for dynamic updates, and a native Software Development Kit (SDK) comprise the complex architecture upon which the Netflix TV application is based. The application guarantees fluid UI rendering and responsiveness across multiple TV platforms by utilizing React-Gibbon, a customized variant of React. Prioritizing performance optimization means focusing on measures such as frames per second and key input responsiveness.
Rendering efficiency is increased by methods like prop iteration reduction and inline component creation; performance is further optimized by style optimization and custom component development. With a constant focus on enhancing the TV app experience for consumers across many platforms, Netflix cultivates a culture of performance excellence. Revamping the Playback Experience: A Journey Towards Modernization Netflix has completely changed how people watch and consume digital media over the last ten years. But even though the streaming giant has been releasing cutting-edge features regularly, the playback interface's visual design and user controls hadn't changed much since 2013. After realizing that the playback user interface needed to be updated, the Web UI team set out to redesign it. The team's three main canvases were Pre Play, Video Playback, and Post Play. Their goal was to increase customer satisfaction and engagement. By utilizing technologies like React.js and Redux to expedite development and enhance performance, Netflix revolutionized its playback user interface. Netflix Architecture: Backend Infrastructure Content Delivery Network (CDN) Netflix's infrastructure depends on its Content Delivery Network (CDN), also referred to as Netflix Open Connect, which allows content to be delivered to millions of viewers globally with ease. Globally distributed, the CDN is essential to ensuring that customers in various locations receive high-quality streaming content. The Netflix Open Connect CDN works by strategically positioning servers, called Open Connect Appliances (OCAs), close to Internet service providers (ISPs) and their users. When content delivery is at its peak, this proximity reduces latency and guarantees effective performance. Netflix is able to maximize bandwidth utilization and lessen its dependence on costly backbone capacity by pre-positioning content within ISP networks, which improves the overall streaming experience. Scalability is one of Netflix's CDN's primary features. With OCAs installed in about 1,000 locations across the globe, including isolated locales like islands and the Amazon rainforest, Netflix is able to meet the expanding demand for streaming services across a wide range of geographic areas. Additionally, Netflix grants OCAs to qualified ISPs so they can offer Netflix content straight from their networks. This strategy guarantees improved streaming for subscribers while also reducing ISPs' operating costs. Netflix cultivates a win-win relationship with ISPs by providing localized content distribution and collaborating with them, which enhances the streaming ecosystem as a whole. Transforming Video Processing: The Microservices Revolution at Netflix By implementing microservices, Netflix has transformed its video processing pipeline, enabling unmatched scalability and flexibility to satisfy the needs of studio operations as well as member streaming. The switch from a monolithic platform to a microservices-based platform ushered in a new age of agility and feature development velocity. Each step of the video processing workflow is represented by a separate microservice, allowing for simplified orchestration and decoupled functionality. Together, these services—which range from video inspection to complexity analysis and encoding—produce excellent video assets suitable for studio and streaming use cases.
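To make the idea of decoupled, per-step processing more concrete, here is a minimal, hypothetical Python sketch of a pipeline whose steps could each live behind an independently deployed service. It is not Netflix's actual implementation; the VideoAsset structure and the inspect, analyze_complexity, and encode steps are assumptions made purely for illustration.
Python
# Hypothetical sketch: each video-processing step modeled as an independent, swappable unit.
# Names and logic are illustrative only and do not reflect Netflix's real services or APIs.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VideoAsset:
    """Carries the state of one source file as it moves through the pipeline."""
    source_uri: str
    metadata: dict = field(default_factory=dict)


def inspect(asset: VideoAsset) -> VideoAsset:
    # Stand-in for a media-inspection step validating the source file.
    asset.metadata["inspected"] = True
    return asset


def analyze_complexity(asset: VideoAsset) -> VideoAsset:
    # Stand-in for a complexity-analysis step choosing encoding parameters.
    asset.metadata["complexity"] = "high" if asset.source_uri.endswith(".mov") else "standard"
    return asset


def encode(asset: VideoAsset) -> VideoAsset:
    # Stand-in for an encoding step producing a streamable output profile.
    asset.metadata["encoded_profile"] = f"h264-{asset.metadata.get('complexity', 'standard')}"
    return asset


# The orchestrator only knows the ordered list of steps; each step can be
# developed, deployed, and scaled independently behind this narrow interface.
Step = Callable[[VideoAsset], VideoAsset]


def run_pipeline(asset: VideoAsset, steps: List[Step]) -> VideoAsset:
    for step in steps:
        asset = step(asset)
    return asset


if __name__ == "__main__":
    result = run_pipeline(VideoAsset("s3://bucket/title-123/source.mov"),
                          [inspect, analyze_complexity, encode])
    print(result.metadata)
In a real deployment, each step would be a separate microservice invoked over the network by a workflow orchestrator rather than a local function call, which is what allows the steps to be developed, scaled, and released independently.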
Microservices have produced noticeable results by facilitating quick iteration and adaptation to shifting business requirements. Playback Process in Netflix Open Connect Netflix Open Connect's playback procedure gives customers worldwide a flawless and excellent viewing experience. It functions as follows: Health reporting: Open Connect Appliances (OCAs) report to the cache control services in Amazon Web Services (AWS) on a regular basis regarding their learned routes, content availability, and overall health. User request: A user on a client device asks the Netflix application hosted on AWS to play back a TV show or movie. Authorization and file selection: After verifying user authorization and licensing, the AWS playback application services choose the precise files needed to process the playback request. Steering service: Based on the data stored by the cache control service, the AWS steering service selects the OCAs that should serve the files, constructs their URLs, and passes them to the playback application services. Content delivery: The playback application services send the URLs of the relevant OCAs to the client device, and the chosen OCA serves the requested files to the client over HTTP/HTTPS. Below is a visual representation demonstrating the playback process: Databases in Netflix Architecture Leveraging Amazon S3 for Seamless Media Storage Netflix's ability to withstand the April 21, 2011, AWS outage demonstrated the value of its cloud infrastructure, particularly its reliance on Amazon S3 for data storage. Netflix's systems were built to endure such outages by leveraging services like SimpleDB, S3, and Cassandra. Netflix's use of Amazon S3 (Simple Storage Service) for media storage is a foundation of its infrastructure, powering the streaming giant's huge collection of films, TV series, and original content. Petabytes of data are needed to serve millions of Netflix users worldwide, and S3 is the perfect choice for storing this data since it offers scalable, reliable, and highly accessible storage. Another important consideration that led Netflix to select S3 for media storage is scalability. With S3, Netflix can easily expand its storage capacity without having to worry about adding more hardware or maintaining complicated storage infrastructure as its content collection grows. To meet the growing demand for streaming content without sacrificing user experience or speed, Netflix needs this scalability. Embracing NoSQL for Scalability and Flexibility The need for structured storage access throughout a highly distributed infrastructure drives Netflix's database selection process. Netflix adopted the paradigm shift towards NoSQL distributed databases after realizing the shortcomings of traditional relational models in the context of Internet-scale operations. In its database ecosystem, three essential NoSQL solutions stand out: Cassandra, Hadoop/HBase, and SimpleDB. Amazon SimpleDB As Netflix moved to the AWS cloud, SimpleDB from Amazon became an obvious solution for many use cases. It was appealing because of its powerful query capabilities, automatic replication across availability zones, and durability. SimpleDB's hosted solution reduced operational overhead, which is in line with Netflix's policy of using cloud providers for non-differentiated operations. Apache HBase Apache HBase evolved as a practical, high-performance solution for Hadoop-based systems.
Its dynamic partitioning strategy makes it easier to redistribute load and create clusters, which is crucial for handling Netflix's growing volume of data. HBase's robust consistency architecture is enhanced by its support for distributed counters, range queries, and data compression, which makes it appropriate for a variety of use cases. Apache Cassandra The open-source NoSQL database Cassandra provides performance, scalability, and flexibility. Its dynamic cluster growth and horizontal scalability meet Netflix's requirement for unlimited scale. Because of its adaptable consistency, replication mechanisms, and flexible data model, Cassandra is perfect for cross-regional deployments and scaling without single points of failure. Since each NoSQL tool is best suited for a certain set of use cases, Netflix has adopted a number of them. While Cassandra excels in cross-regional deployments and fault-tolerant scaling, HBase integrates naturally with the Hadoop platform. NoSQL adoption, a pillar of Netflix's long-term cloud strategy, comes with a learning curve and operational expense, but the benefits in terms of scalability, availability, and performance make the investment worthwhile. MySQL in Netflix's Billing Infrastructure Netflix's billing system experienced a major transformation as part of its extensive migration to AWS cloud-native architecture. Because billing is central to Netflix's operations, the move to AWS was handled carefully to minimize the impact on members' experiences and to meet strict financial standards. Tracking billing periods, monitoring payment statuses, and providing data to financial systems for reporting are just a few of the tasks that Netflix's billing infrastructure handles. The billing engineering team managed a complicated ecosystem that included batch tasks, APIs, connectors with other services, and data management to accomplish these functionalities. The selection of database technology was one of the most important choices made during the move. MySQL was chosen as the database solution due to the need for scalability and the requirement for ACID transactions in payment processing. Building robust tooling, optimizing code, and removing unnecessary data were all part of the migration process in order to accommodate the new cloud architecture. Before the existing member data was transferred, thorough testing was carried out on clean datasets, with proxies and redirectors handling traffic redirection. Migrating to MySQL on AWS was a complicated process; it required careful planning, methodical implementation, and ongoing testing and iteration. In spite of the difficulties, the move went well, allowing Netflix to use the scalability and dependability of AWS cloud services for its billing system. In summary, switching Netflix's billing system to MySQL on AWS involved extensive engineering work and had wide-ranging effects. By modernizing its billing system and adopting cloud-based solutions, Netflix has prepared itself for upcoming developments in the digital space. Here is Netflix's post-migration architecture: Content Processing Pipeline in Netflix Architecture The Netflix content processing pipeline is a systematic approach for handling digital assets provided by content and fulfillment partners. The three main phases are ingestion, transcoding, and packaging.
Ingestion Source files, such as audio, timed text, or video, are thoroughly examined for accuracy and compliance throughout the ingestion stage. These verifications include semantic signal domain inspections, file format validation, decodability of compressed bitstreams, compliance with Netflix delivery criteria, and the integrity of data transfer. Transcoding and Packaging Once sources make it past the ingestion stage, they are transcoded to produce output elementary streams. After that, these streams are encrypted and placed in distribution-ready streamable containers. Ensuring Seamless Streaming With Netflix's Canary Model Since client applications are the main way users engage with a brand, excellent quality is essential for global digital products. Netflix invests significant amounts of money in guaranteeing thorough evaluation of updated application versions. Nevertheless, thorough internal testing becomes difficult because Netflix is accessible on thousands of devices and is powered by hundreds of independently deployed microservices. As a result, it is crucial to support release decisions with solid field data acquired during the update process. To expedite the assessment of updated client applications, Netflix formed a specialized team to mine health signals from the field. This investment increased development velocity, improving application quality and development procedures. Client applications: There are two ways that Netflix upgrades its client apps: through direct downloads and app store deployments. Direct downloads give Netflix greater control over distribution. Deployment strategies: Although the advantages of regular, incremental releases for client apps are well known, updating software presents certain difficulties. Since every user's device delivers data in a stream, efficient signal sampling is crucial. The deployment strategies employed by Netflix are customized to tackle the distinct challenges posed by a wide range of user devices and complex microservices. The strategy differs based on the kind of client — for example, smart TVs vs. mobile applications. New client application versions are progressively made available through staged rollouts, which provide prompt failure handling and intelligent backend service scaling. During rollouts, keeping an eye on client-side error rates and adoption rates guarantees consistency and effectiveness in the deployment procedure. Staged rollouts: To reduce risks and scale backend services wisely, staged rollouts entail progressively deploying new software versions. AB tests/client canaries: Netflix employs an intensive variation of A/B testing known as "Client Canaries," which involves testing complete apps to guarantee timely upgrades within a few hours. Orchestration: Orchestration lessens the workload associated with frequent deployments and analysis. It is useful for managing A/B tests and client canaries. In summary, millions of customers may enjoy flawless streaming experiences thanks to Netflix's use of the client canary model, which guarantees frequent app updates. Netflix Architecture Diagram Netflix's system architecture is a complex ecosystem made up of Python and Java with Spring Boot for backend services, and Apache Kafka and Flink for data processing and real-time event streaming. Redux, React.js, and HTML5 on the front end provide a captivating user experience.
Numerous databases offer real-time analytics and handle enormous volumes of media content, including Cassandra, HBase, SimpleDB, MySQL, and Amazon S3. Jenkins and Spinnaker help with continuous integration and deployment, and AWS powers the entire infrastructure with scalability, dependability, and global reach. These technologies make up only a small portion of Netflix's huge tech stack, which underscores its dedication to providing flawless entertainment experiences to its vast worldwide audience. Conclusion of Netflix Architecture Netflix System Architecture has revolutionized the entertainment industry. Throughout its evolution from a DVD rental service to a major worldwide streaming player, Netflix's technological infrastructure has been essential to its success. Netflix Architecture, supported by Amazon Web Services (AWS), guarantees uninterrupted streaming for a global user base. Netflix ensures faultless content delivery across devices with its Client, Backend, and Content Delivery Network (CDN). Netflix's innovative use of HTML5 and personalized suggestions improves the user experience. Despite some obstacles along the way, Netflix came out stronger after making the switch to a cloud-native setup. In the quickly evolving field of online entertainment, Netflix has positioned itself for future development and innovation by embracing microservices, NoSQL databases, and cloud-based solutions. Any tech venture can benefit from understanding Netflix's system. Put simply, Netflix's System Architecture aims to transform the way we consume media — it's not just about technology. Behind the scenes, this architecture quietly makes sure that everything runs smoothly while viewers binge-watch, increasing everyone's enjoyment of the entertainment.
Navigating toward a cloud-native architecture can be both exciting and challenging. The expectation of learning valuable lessons should always be top of mind as design becomes a reality. In this article, I wanted to focus on an example where my project seemed like a perfect serverless use case, one where I'd leverage AWS Lambda. Spoiler alert: it was not. Rendering Fabric.js Data In a publishing project, we utilized Fabric.js — a JavaScript HTML5 canvas library — to manage complex metadata and content layers. These complexities included spreads, pages, and templates, each embedded with fonts, text attributes, shapes, and images. As the content evolved, teams were tasked with updates, necessitating the creation of a publisher-quality PDF after each update. We built a Node.js service to run Fabric.js, generating PDFs and storing resources in AWS S3 buckets with private cloud access. During a typical usage period, over 10,000 teams were using the service, with each individual contributor sending multiple requests to the service as a result of manual page saves or auto-saves driven by the Angular client. The service was set up to run as a Lambda in AWS. The idea of paying at the request level seemed ideal. Where Serverless Fell Short We quickly realized that our Lambda approach wasn't going to cut it. The spin-up time turned out to be the first issue. Not only was there the time required to start the Node.js service, but preloading nearly 100 different fonts that could be used by those 10,000 teams caused delays too. We were also concerned about Lambda's 250 MB limit on the size of the unzipped deployment package. The initial release of the code was already over 150 MB in size, and we still had a large backlog of feature requests that would only drive this number higher. Finally, the complexity of the pages — especially as more elements were added — demanded increased CPU and memory to ensure quick PDF generation. After observing the usage for first-generation page designs completed by the teams, we forecasted the need for nearly 12 GB of RAM. Currently, AWS Lambdas are limited to 10 GB of RAM. Ultimately, we opted for dedicated EC2 compute resources to handle the heavy lifting. Unfortunately, this decision significantly increased our DevOps management workload. Looking for a Better Solution Although I am no longer involved with that project, I've always wondered if there was a better solution for this use case. While I appreciate AWS, Google, and Microsoft providing enterprise-scale options for cloud-native adoption, what kills me is the associated learning curve for every service. The company behind the project was a smaller technology team. Oftentimes teams in that position struggle with adoption when it comes to using the big three cloud providers. The biggest challenges I continue to see in this regard are: A heavy investment in DevOps or CloudOps to become cloud-native. Gaining a full understanding of what appears to be endless options. Tech debt related to cost analysis and optimization. Since I have been working with the Heroku platform, I decided to see if they had an option for my use case. Turns out, they introduced large dynos earlier this year. For example, with their Performance-L RAM dyno, my underlying service would get 50x the compute power of a standard dyno and 30 GB of RAM. The capability to write to AWS S3 has been available from Heroku for a long time too.
V2 Design in Action Using the Performance-L RAM dyno in Heroku would be no different (at least operationally) than using any other dyno in Heroku. To run my code, I just needed the following items: A Heroku account The Heroku command-line interface (CLI) installed locally After navigating to the source code folder, I would issue a series of commands to log in to Heroku, create my app, set up my AWS-related environment variables, and run up to five instances of the service using the Performance-L dyno with auto-scaling in place:
Shell
heroku login
heroku apps:create example-service
heroku config:set AWS_ACCESS_KEY_ID=MY-ACCESS-ID AWS_SECRET_ACCESS_KEY=MY-ACCESS-KEY
heroku config:set S3_BUCKET_NAME=example-service-assets
heroku ps:scale web=5:Performance-L-RAM
git push heroku main
Once deployed, my example-service application can be called via standard RESTful API calls. As needed, the auto-scaling technology in Heroku could launch up to five instances of the Performance-L dyno to meet consumer demand. I would have gotten all of this without having to spend a lot of time understanding a complicated cloud infrastructure or worrying about cost analysis and optimization. Projected Gains As I thought more about the CPU and memory demands of our publishing project — during standard usage seasons and peak usage seasons — I saw how these performance dynos would have been exactly what we needed. Instead of crippling our CPU and memory when the requested payload included several Fabric.js layers, we would have had enough horsepower to generate the expected image, often before the user navigated to the page containing the preview images. We wouldn't have had size constraints on our application source code, which we would inevitably have hit with AWS Lambda's limitations within the next 3 to 4 sprints. The time required for our DevOps team to learn Lambdas first and then switch to EC2 hit our project's budget pretty noticeably. And even then, those services weren't cheap, especially when spinning up several instances to keep up with demand. But with Heroku, the DevOps investment would be considerably reduced and placed into the hands of software engineers working on the use case. Just like any other dyno, it's easy to use and scale up the performance dynos either with the CLI or the Heroku dashboard. Conclusion My readers may recall my personal mission statement, which I feel can apply to any IT professional: "Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else." — J. Vester In this example, I had a use case that required a large amount of CPU and memory to process complicated requests made by over 10,000 consumer teams. I walked through what it would have looked like to fulfill this use case using Heroku's large dynos, and all I needed was a few CLI commands to get up and running. Burning out your engineering and DevOps teams is not your only option. There are alternatives available to relieve the strain. By taking the Heroku approach, you avoid the steep learning curve that often comes with cloud adoption from the big three. Even better, the tech debt associated with cost analysis and optimization never sees the light of day. In this case, Heroku adheres to my personal mission statement, allowing teams to focus on what is likely a mountain of feature requests to help product owners meet their objectives. Have a really great day!
The Cloud Native Telecom Initiative (CNTI) and the Cloud-Native Network Functions (CNF) Test Catalog are powerful tools designed to ensure telco applications adhere to cloud-native principles and best practices. However, a common misconception is that these tools are limited to telco applications. In reality, the CNTI/CNF Test Catalog is highly versatile and can be effectively used to validate the cloud nativeness of non-telco microservices. This article aims to guide you through the process of utilizing the CNTI/CNF Test Catalog for non-telco microservices, overcoming potential challenges, and adding custom tests. Installation To get started, download the latest CNTI test suite binary from the GitHub releases page. Extract the zip file to your preferred location and provide executable permissions. Ensure your machine meets the prerequisites. Execute the following command to set up the test suite:
./cnf-testsuite setup
This command will create a cnf-testsuite namespace in your cluster. For detailed preparation steps, refer to the installation guide. Configuration Before executing tests, note that the tool installs your microservice from the Helm charts or manifest location provided in the configuration file. This can be a challenge for non-telco applications, as they may have dependencies or require pre/post-scripting. To address this, you can run the suite on an already installed microservice using the same release name and namespace. Here's a sample configuration file:
YAML
---
release_name: <release name of existing service>
helm_directory: <folder location> # or use manifest directory
helm_install_namespace: <namespace of existing service>
Save this as cnf-testsuite.yml and ensure all paths are relative to the directory containing the CNTI binary. Use the following command to set up the test configuration:
./cnf-testsuite cnf_setup cnf-config=./cnf-testsuite.yml
Execution With the CNTI suite successfully configured, you can execute the suite for all categories, specific categories, or individual tests:
./cnf-testsuite compatibility
Refer to the test categories and execution guide to identify and run applicable tests for your service. Non-telco applications, for example, won't require 5G-specific tests. Adding Custom Tests To add custom tests, clone the CNTI test suite repository:
git clone https://github.com/cnti-testcatalog/testsuite
Navigate to the src/tasks/workload directory and edit the category test file where you want to add your custom test. The suite is written in Crystal, so it's advisable to use Ubuntu for development or perform changes on Windows and build on Ubuntu.
Custom Test Example Here's an example of a custom test to check if resource requests are below specified limits:
Crystal
desc "Check if resource requests are less than 0.5 CPU and 512 MB memory"
task "resource_requests" do |t, args|
  CNFManager::Task.task_runner(args, task: t) do |args, config|
    resp = ""
    task_response = CNFManager.workload_resource_test(args, config) do |resource, container, initialized|
      test_passed = true
      resource_ref = "#{resource[:kind]}/#{resource[:name]}"
      cpu_request_value = 1
      memory_request_value = 1024
      begin
        cpu_request = container.as_h["resources"].as_h["requests"].as_h["cpu"].as_s
        memory_request = container.as_h["resources"].as_h["requests"].as_h["memory"].as_s
        cpu_request_value = if cpu_request.ends_with?("m")
          cpu_request.gsub("m", "").to_i / 1000.0
        else
          cpu_request.to_i
        end
        memory_request_value = if memory_request.ends_with?("Mi")
          memory_request.gsub("Mi", "").to_i
        elsif memory_request.ends_with?("Gi")
          memory_request.gsub("Gi", "").to_i * 1024
        else
          memory_request.to_i
        end
        if cpu_request_value > 0.5 || memory_request_value > 512
          test_passed = false
          stdout_failure("Resource requests for container #{container.as_h["name"].as_s} part of #{resource_ref} in #{resource[:namespace]} namespace exceed limits (CPU: #{cpu_request}, Memory: #{memory_request})")
        end
      rescue ex
        test_passed = false
        stdout_failure("Error occurred while checking resource requests for container #{container.as_h["name"].as_s} part of #{resource_ref} in #{resource[:namespace]} namespace")
      end
      test_passed
    end
    if task_response
      CNFManager::TestcaseResult.new(CNFManager::ResultStatus::Passed, "Resource requests within limits")
    else
      CNFManager::TestcaseResult.new(CNFManager::ResultStatus::Failed, "Resource requests exceed limits")
    end
  end
end
This test ensures that the CPU and memory requests for a container are within specified limits. Conclusion The CNTI/CNF Test Catalog is a robust tool, not just for telco applications, but for any cloud-native microservice. By following the steps outlined in this article, you can configure, execute, and even extend the capabilities of the CNTI test suite to fit your non-telco applications. Embrace the flexibility and power of the CNTI/CNF Test Catalog to enhance the reliability and performance of your microservices.
The evolution of software architecture and process orchestration reflects a continual quest for optimization and efficiency, mirroring the progression in the domain of AI model development. From monolithic architectures to service-oriented designs and beyond, each phase has built upon its predecessors to enhance flexibility and responsiveness. This journey provides a valuable framework for understanding the emerging paradigm of the LLM Orchestrator. Monolithic to Modular: The Foundations Initially, software systems were largely monolithic, with all components tightly integrated into a single, indivisible unit. This architecture made deployments simple and straightforward but lacked scalability and flexibility. As systems grew more complex, the limitations of the monolithic design became apparent, sparking a shift towards more modular architectures. Emergence of Service-Oriented Architecture (SOA) and Microservices The advent of Service-Oriented Architecture (SOA) marked a significant evolution in software design. In SOA, discrete functions are broken down into individual services, each performing a specific task. This modularity allowed for greater scalability and easier maintenance, as services could be updated independently without affecting the entire system. SOA also facilitated reuse, where services could be leveraged across different parts of an organization or even between multiple applications, significantly enhancing efficiency. Building on the principles of SOA, the concept of microservices emerged as an even more granular approach to structuring applications. Microservices architecture takes the idea of SOA further by decomposing services into smaller, more tightly focused components that are easier to develop, deploy, and scale independently. This evolution represented a natural extension of SOA, aiming to provide even greater flexibility and resilience in application development and management. BPEL and Dynamic Orchestration To orchestrate the services facilitated by SOA effectively, Business Process Execution Language (BPEL) was developed as a standard way to manage complex workflows and business processes. BPEL supports dynamic orchestration, allowing for adaptations to changing business conditions and enabling seamless integration with various systems. This capability makes it an essential tool in advanced process management, providing the flexibility to manage and automate detailed service interactions at scale. By defining precise process logic and execution paths, BPEL helps businesses enhance operational efficiency and responsiveness. The principles and functionalities that BPEL introduced are now being mirrored in the capabilities being evolved with the LLM Orchestrator, illustrating a clear lineage and similarity in advancing orchestration technologies. AI and LLM Orchestration: Navigating Model Diversity and Strategic Selection As the domain of AI model development has evolved, so has the sophistication in deploying these models. The modern AI ecosystem, enriched by platforms like Hugging Face, showcases an extensive array of Large Language Models (LLMs), each specialized to perform distinct tasks with precision and efficiency. This rich tapestry of models ranges from those optimized for language translation and legal document analysis to those suited for creative content generation and more. This diversity necessitates a strategic approach to orchestration, where selecting the right model is just one facet of a broader orchestration strategy. 
Strategic Model Selection: A Key Aspect of LLM Orchestration Choosing the right LLM involves a multidimensional evaluation, where parameters like task suitability, cost efficiency, performance metrics, and sustainability considerations like carbon emissions play crucial roles. This process ensures that the selected model aligns with the task's specific requirements and broader organizational goals. Task suitability: The primary factor is aligning a model's training and capabilities with the intended task. Cost efficiency: This involves evaluating the economic impact, especially for processes involving large volumes of data or continuous real-time analysis. Performance metrics: Assessing a model's accuracy, speed, and reliability based on benchmark tests and real-world applications. Carbon emissions: For sustainability-focused organizations, prioritizing models optimized for lower energy consumption and reduced carbon emissions is crucial. A minimal sketch of how these criteria might be combined into a selection score appears after the summary below. Beyond Selection: The Broader Role of LLM Orchestration While selecting the right model is vital, LLM orchestration encompasses much more. It involves dynamically integrating various AI models to function seamlessly within complex operational workflows. This orchestration not only leverages the strengths of each model but also ensures that they work in concert to address multi-faceted challenges effectively. By orchestrating multiple specialized models, organizations can create more comprehensive, agile, and adaptive AI-driven solutions. The Future: Seamless AI Integration and Cloud Evolution Looking ahead, the LLM Orchestrator promises to enhance the capability of AI systems to handle more complex, nuanced, and variable tasks. By dynamically selecting and integrating task-specific models based on real-time data, the Orchestrator can adapt to changing conditions and requirements with unprecedented agility. Cloud platforms will further enhance their AI deployment capabilities with the introduction of services like the LLM Orchestrator. This feature is set to revolutionize how AI capabilities are managed and deployed, enabling on-demand scalability and the integration of specialized AI microservices. These advancements will allow for the dynamic combination of services to efficiently tackle complex tasks, meeting the evolving needs of modern enterprises. Summary The evolution from monolithic software to service-oriented architectures, and the subsequent orchestration of these services through BPEL, provides a clear parallel to the current trends in AI model development. The LLM Orchestrator stands poised to drive this evolution forward, heralding a future where AI not only supports but actively enhances human decision-making and creativity through sophisticated, seamless integration. This orchestration is not merely a technological improvement — it represents a significant leap toward a more responsive and intelligent digital ecosystem.
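As a closing illustration of the strategic-selection step discussed above, here is a minimal, hypothetical Python sketch that scores candidate models against the criteria listed earlier (task suitability, performance, cost efficiency, and carbon emissions). The model names, scores, weights, and normalization constants are assumptions for illustration only; a real orchestrator would draw these values from benchmarks, pricing data, and live telemetry, and would also route the request to the chosen model.
Python
# Hypothetical sketch: weighted scoring of candidate LLMs on the criteria discussed above.
# Model entries, weights, and normalization constants are illustrative, not real benchmarks.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CandidateModel:
    name: str
    task_suitability: float       # 0..1, fit of the model's training to the task
    performance: float            # 0..1, benchmark/real-world quality
    cost_per_1k_tokens: float     # USD, lower is better
    grams_co2_per_request: float  # estimated emissions, lower is better


def score(model: CandidateModel, weights: Dict[str, float]) -> float:
    """Combine the criteria into a single weighted score (higher is better)."""
    # Turn the "lower is better" criteria into 0..1 penalties before weighting.
    cost_penalty = min(model.cost_per_1k_tokens / 0.10, 1.0)
    carbon_penalty = min(model.grams_co2_per_request / 5.0, 1.0)
    return (weights["task"] * model.task_suitability
            + weights["performance"] * model.performance
            + weights["cost"] * (1.0 - cost_penalty)
            + weights["carbon"] * (1.0 - carbon_penalty))


def select_model(candidates: List[CandidateModel], weights: Dict[str, float]) -> CandidateModel:
    # Pick the candidate with the highest combined score.
    return max(candidates, key=lambda m: score(m, weights))


if __name__ == "__main__":
    candidates = [
        CandidateModel("general-large", 0.70, 0.90, 0.06, 4.0),
        CandidateModel("legal-specialist", 0.95, 0.85, 0.03, 2.0),
    ]
    weights = {"task": 0.4, "performance": 0.3, "cost": 0.2, "carbon": 0.1}
    print(select_model(candidates, weights).name)
Keeping the weights explicit makes the trade-off between suitability, performance, cost, and sustainability transparent and easy to tune per organization or per task.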
Amol Gote
Solution Architect,
Innova Solutions (Client - iCreditWorks Start Up)
Ray Elenteny
Solution Architect,
SOLTECH
Nicolas Duminil
Silver Software Architect,
Simplex Software
Satrajit Basu
Chief Architect,
TCG Digital