Monitoring and Observability

Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.

Latest Premium Content

Trend Report: Observability and Performance
Refcard #368: Getting Started With OpenTelemetry
Refcard #293: Getting Started With Prometheus

DZone's Featured Monitoring and Observability Resources

How to Achieve SOC 2 Compliance in AWS Cloud Environments

By Chase Bolt
Did you know that cloud security was one of the most pressing challenges of using cloud solutions in 2023? As businesses increasingly depend on cloud services like Amazon Web Services (AWS) to host their applications, securing sensitive data in the cloud becomes non-negotiable. Organizations must ensure their technology infrastructure meets the highest security standards. One such standard is SOC 2 (Systems and Organization Controls 2) compliance.

SOC 2 is more than a regulatory checkbox. It represents a business's commitment to robust security measures and instills trust in customers and stakeholders. SOC 2 compliance for AWS evaluates how securely an organization's technology setup manages data storage, processing, and transfer. Let's further discuss SOC 2 compliance, its importance in AWS, and how organizations can achieve SOC 2 compliance for AWS.

What Is SOC 2 Compliance?

SOC 2 is an auditing standard developed by the American Institute of CPAs (AICPA). It ensures organizations protect sensitive customer data by securing their systems, processes, and controls. SOC 2 is based on five Trust Services Criteria (TSC), and achieving compliance involves rigorous evaluation against these criteria:

- Security: Ensures an organization's systems and data are protected against unauthorized access, breaches, and cyber threats. It involves implementing security measures such as access controls, encryption, and firewalls.
- Availability: Assesses the organization's ability to keep its systems and services accessible and operational whenever users or stakeholders need them. This includes measures to prevent and mitigate downtime, such as redundancy, failover mechanisms, disaster recovery plans, and proactive monitoring.
- Processing integrity: Evaluates the accuracy, completeness, and reliability of the organization's processes and operations. This involves implementing checks and balances to validate the accuracy of data, and mechanisms to monitor data integrity.
- Confidentiality: Protects sensitive information from unauthorized access, disclosure, or exposure. This includes encryption, data masking, and other measures that prevent unauthorized users or entities from accessing or viewing confidential data.
- Privacy: Ensures customers' personal information is handled in compliance with relevant privacy regulations and standards. This involves implementing policies, procedures, and controls to protect individuals' privacy rights.

SOC 1 vs. SOC 2 vs. SOC 3: Head-to-Head Comparison

Understanding the key differences between SOC 1, SOC 2, and SOC 3 is essential for organizations looking to demonstrate their commitment to security and compliance. Below is a comparison highlighting various aspects of these controls.
Aspect | SOC 1 | SOC 2 | SOC 3
Scope | Financial controls | Operational and security controls | High-level operational controls
Target audience | Auditors, regulators | Customers, business partners | General audience
Focus area | Controls impacting the financial reporting of service organizations | Trust Services Criteria (Security, Availability, Processing Integrity, Confidentiality, Privacy) | Trust Services Criteria (Security, Availability, Processing Integrity, Confidentiality, Privacy)
Evaluation timeline | 6-12 months | 6-12 months | 3-6 months
Who needs to comply | Collection agencies, payroll providers, payment processing companies, etc. | SaaS companies, data hosting or processing providers, and cloud storage providers | Organizations that hold SOC 2 compliance and want to use it in marketing to a general audience

Importance of SOC 2 Compliance in AWS

Understanding AWS's shared responsibility model is important when navigating SOC 2 compliance within AWS. This model outlines the respective responsibilities of AWS and its customers: AWS secures the cloud infrastructure, while customers manage security in the cloud. This means customers are accountable for securing their data, applications, and services hosted on AWS. The model has crucial implications for SOC 2 compliance:

- Data security: As a customer, it is your responsibility to secure your data. This involves ensuring secure data transmission, implementing encryption, and controlling data access.
- Compliance management: You must ensure that your applications, services, and processes comply with SOC 2 requirements, which necessitates continuous monitoring and management.
- User access management: You are responsible for configuring AWS services to meet SOC 2 requirements, including permissions and security settings.
- Staff training: Ensure your team is adequately trained in AWS security best practices and SOC 2 requirements. This prevents non-compliance caused by misunderstanding or misuse of AWS services.

Challenges of Achieving SOC 2 Compliance in AWS

Here are some of the challenges businesses face when working toward SOC 2 compliance on AWS:

- Complexity of AWS environments: Understanding the complex architecture of AWS setups requires in-depth knowledge and expertise. It can be challenging for businesses to ensure that all components are configured securely.
- Data protection and privacy: The dynamic nature of cyber threats and the need for comprehensive measures to prevent unauthorized access can make securing sensitive data in the AWS environment challenging.
- Evolving compliance requirements: Adapting to changing compliance standards requires constant monitoring and updating of policies and procedures, which can strain resources and expertise.
- Training and awareness: Ensuring that all personnel are adequately trained and aware of their roles and responsibilities in maintaining compliance can be difficult, especially in large organizations with diverse teams and skill sets.
- Scalability: As AWS environments grow, ensuring security measures can scale effectively to meet increasing demands becomes complex. Scaling security controls alongside business growth while staying compliant adds another layer of complexity.

How Organizations Can Achieve SOC 2 Compliance for Their AWS Cloud Environments

Achieving SOC 2 compliance in AWS involves a structured approach to ensure the best security practices. Here's a step-by-step guide:

1. Assess Your Current Landscape

Start by conducting a comprehensive assessment of your current AWS environment. Examine existing security processes and controls, and identify potential vulnerabilities and compliance gaps against SOC 2 requirements. This stage includes internal audits, risk assessments, and evaluation of existing policies and procedures.

2. Identify Required Security Controls

Develop a thorough security program detailing all security controls required to meet SOC 2 compliance, including measures for data protection, access controls, system monitoring, and more. You can also access the AWS SOC report via the AWS Artifact tool, which provides a comprehensive list of security controls.

3. Use AWS Tools for SOC 2 Compliance

Leverage the suite of security tools AWS offers to facilitate SOC 2 compliance. These include:

- AWS Identity and Access Management (IAM): Administers access to AWS services and resources.
- AWS Config: Enables you to review, audit, and analyze the configurations of your AWS resources.
- AWS Key Management Service (KMS): Simplifies the creation and administration of cryptographic keys and controls their usage across AWS services and within your applications.
- AWS CloudTrail: Records the AWS API calls made within your account, including activity from the AWS SDKs, the AWS Management Console, command line tools, and other AWS services.

4. Develop Documentation of Security Policies

Document your organization's security policies and procedures in alignment with SOC 2 requirements. This includes creating detailed documentation outlining security controls, processes, and responsibilities.

5. Enable Continuous Monitoring

Implement continuous monitoring mechanisms to track security events and compliance status in real time. Use AWS services like Amazon GuardDuty, AWS Config, and AWS Security Hub to automate monitoring and ensure ongoing compliance with SOC 2 standards. (A small scripted check is sketched below.)
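To make the continuous-monitoring step concrete, here is a minimal sketch (not from the original article) that uses boto3 to pull the current compliance state of your AWS Config rules. The region and the flat-summary output are assumptions; in practice you would feed results like these into your evidence-collection or alerting pipeline.

Python

import boto3

# Minimal sketch: summarize AWS Config rule compliance as lightweight evidence
# for the continuous-monitoring step. Assumes AWS Config is already recording
# in this region and that credentials are configured (e.g., via `aws configure`).
config = boto3.client("config", region_name="us-east-1")

def compliance_summary():
    results = {}
    response = config.describe_compliance_by_config_rule()
    while True:
        for rule in response["ComplianceByConfigRules"]:
            results[rule["ConfigRuleName"]] = rule["Compliance"]["ComplianceType"]
        token = response.get("NextToken")
        if not token:
            break
        response = config.describe_compliance_by_config_rule(NextToken=token)
    return results

if __name__ == "__main__":
    for rule_name, state in sorted(compliance_summary().items()):
        marker = "" if state == "COMPLIANT" else "  <-- review"
        print(f"{rule_name}: {state}{marker}")

One common pattern is to run a check like this on a schedule (for example, from a Lambda function) and route non-compliant findings to Security Hub or a ticketing system so audit evidence stays current between formal reviews.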
Typical SOC 2 Compliance Process Timeline

The SOC 2 compliance process usually spans 6 to 12 months and consists of several phases, from preparation to achieving compliance:

- Preparation (1-2 months): Assess current security practices and identify gaps. Then develop a plan to address the identified gaps while configuring AWS services and updating policies.
- Implementation (3-6 months): Execute the AWS configurations outlined in the preparation phase. Implement the security controls and measures needed to align with SOC 2 standards.
- Documentation (1-2 months): Gather documentation of the AWS environment, cataloging policies, procedures, and operational practices. Conduct an internal review to ensure the documentation is complete and aligned with SOC 2 requirements.
- Auditing (1-2 months): Engage a qualified auditor with expertise in evaluating AWS environments for SOC 2 compliance and collaborate with them through the audit process. After the audit, the auditor provides a detailed SOC 2 report.

Conclusion

Achieving SOC 2 compliance in AWS requires planning, rigorous implementation, and an ongoing commitment to security best practices. Organizations can navigate SOC 2 compliance by following the shared responsibility model, using AWS tools, and maintaining continuous vigilance. As cloud-hosted applications take over the digital space, prioritizing security and compliance becomes crucial. With the right approach and dedication, organizations can attain SOC 2 compliance and strengthen their position as a trusted partner.
Your Kubernetes Survival Kit: Master Observability, Security, and Automation

By Prabhu Chinnasamy
Kubernetes has become the de facto standard for orchestrating containerized applications. As organizations increasingly embrace cloud-native architectures, ensuring observability, security, policy enforcement, progressive delivery, and autoscaling is like ensuring your spaceship has enough fuel, oxygen, and a backup plan before launching into the vastness of production.

With the rise of multi-cloud and hybrid cloud environments, Kubernetes observability and control mechanisms must be as adaptable as a chameleon, scalable like your favorite meme stock, and technology-agnostic like a true DevOps pro. Whether you're managing workloads on AWS, Azure, GCP, or an on-premises Kubernetes cluster, having a robust ecosystem of tools is not a luxury — it's a survival kit for monitoring applications, enforcing security policies, automating deployments, and optimizing performance.

In this article, we dive into some of the most powerful Kubernetes-native tools that transform observability, security, and automation from overwhelming challenges into powerful enablers. We will explore tools for:

- Tracing and observability: Jaeger, Prometheus, Thanos, Grafana Loki
- Policy enforcement: OPA, Kyverno
- Progressive delivery: Flagger, Argo Rollouts
- Security and monitoring: Falco, Tetragon, Datadog Kubernetes Agent
- Autoscaling: Keda
- Networking and service mesh: Istio, Linkerd
- Deployment validation and SLO monitoring: Keptn

So, grab your Kubernetes control panel, adjust your monitoring dashboards, and let's navigate the wild, wonderful, and sometimes wacky world of Kubernetes observability and reliability!

This diagram illustrates key Kubernetes tools for observability, security, deployment, and scaling. Each category highlights tools like Prometheus, OPA, Flagger, and Keda to enhance reliability and performance.

Why These Tools Matter in a Multi-Cloud Kubernetes World

Kubernetes is a highly dynamic system, managing thousands of microservices, scaling resources based on demand, and orchestrating deployments across different cloud providers. The complexity of Kubernetes requires a comprehensive observability and control strategy to ensure application health, security, and compliance.

Observability: Understanding System Behavior

Without proper monitoring and tracing, identifying bottlenecks, debugging issues, and optimizing performance becomes a challenge. Tools like Jaeger, Prometheus, Thanos, and Grafana Loki provide full visibility into distributed applications, ensuring that every microservice interaction is tracked, logged, and analyzed.

Policy Enforcement: Strengthening Security and Compliance

As Kubernetes clusters grow, managing security policies and governance becomes critical. Tools like OPA and Kyverno allow organizations to enforce fine-grained policies, ensuring that only compliant configurations and access controls are deployed across clusters.

Progressive Delivery: Reducing Deployment Risks

Modern DevOps and GitOps practices rely on safe, incremental releases. Flagger and Argo Rollouts automate canary deployments, blue-green rollouts, and A/B testing, ensuring that new versions of applications are introduced without downtime or major disruptions.

Security and Monitoring: Detecting Threats in Real Time

Kubernetes workloads are dynamic, making security a continuous process. Falco, Tetragon, and the Datadog Kubernetes Agent monitor runtime behavior, detect anomalies, and prevent security breaches by providing deep visibility into container- and node-level activities.
Autoscaling: Optimizing Resource Utilization

Kubernetes offers built-in Horizontal Pod Autoscaling (HPA), but many workloads require event-driven scaling beyond CPU and memory thresholds. Keda enables scaling based on real-time events, such as queue length, message brokers, and custom business metrics.

Networking and Service Mesh: Managing Microservice Communication

In large-scale microservice architectures, network traffic management is essential. Istio and Linkerd provide service mesh capabilities, ensuring secure, reliable, and observable communication between microservices while optimizing network performance.

Deployment Validation and SLO Monitoring: Ensuring Reliable Releases

Keptn automates deployment validation, ensuring that applications meet service-level objectives (SLOs) before rolling out to production. This helps maintain stability and improve reliability in cloud-native environments.

Comparison of Key Tools

While each tool serves a distinct purpose, some overlap in functionality. Below is a comparison of key tools that offer similar capabilities:

Category | Tool 1 | Tool 2 | Key Difference
Tracing and observability | Jaeger | Tracestore | Jaeger is widely adopted for tracing, whereas Tracestore is an emerging alternative.
Policy enforcement | OPA | Kyverno | OPA uses Rego, while Kyverno offers Kubernetes-native CRD-based policies.
Progressive delivery | Flagger | Argo Rollouts | Flagger integrates well with service meshes; Argo Rollouts is optimized for GitOps workflows.
Security monitoring | Falco | Tetragon | Falco focuses on runtime security alerts, while Tetragon extends eBPF-based monitoring.
Networking and service mesh | Istio | Linkerd | Istio offers more advanced features but is complex; Linkerd is simpler and lightweight.

1. Tracing and Observability With Jaeger

What is Jaeger?

Jaeger is an open-source distributed tracing system designed to help Kubernetes users monitor and troubleshoot transactions in microservice architectures. Originally developed by Uber, it has become a widely adopted solution for end-to-end request tracing.

Why Use Jaeger in Kubernetes?

- Distributed tracing: Provides visibility into request flows across multiple microservices.
- Performance bottleneck detection: Helps identify slow service interactions and dependencies.
- Root cause analysis: Enables debugging of latency issues and failures.
- Seamless integration: Works well with Prometheus, OpenTelemetry, and Grafana.
- Multi-cloud ready: Deployable across AWS, Azure, and GCP Kubernetes clusters for global observability.

Comparison: Jaeger vs. Tracestore

Feature | Jaeger | Tracestore
Adoption | Widely adopted in Kubernetes environments | Emerging solution
Open source | Yes | Limited information available
Integration | Works with OpenTelemetry, Prometheus, and Grafana | Less integration support
Use case | Distributed tracing, root cause analysis | Similar use case but less proven

Jaeger is the preferred choice for most Kubernetes users due to its mature ecosystem, active community, and strong integration capabilities.

How Jaeger Is Used in Multi-Cloud Environments

Jaeger can be deployed in multi-cluster and multi-cloud environments by:

- Deploying Jaeger as a Kubernetes service to trace transactions across microservices.
- Using OpenTelemetry for tracing and sending trace data to Jaeger for analysis.
- Storing trace data in distributed storage solutions like Elasticsearch or Cassandra for scalability.
- Integrating with Grafana to visualize trace data alongside Kubernetes metrics.

In short, Jaeger is an essential tool for observability and debugging in modern cloud-native architectures. Whether running Kubernetes workloads on-premises or across multiple cloud providers, it provides a robust solution for distributed tracing and performance monitoring.

This diagram depicts Jaeger tracing the flow of requests across multiple services (e.g., Service A → Service B → Service C). Jaeger UI visualizes the traces, helping developers analyze latency issues, bottlenecks, and request paths in microservices architectures.
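As a concrete illustration of the instrumentation side (not part of the original article), here is a minimal Python sketch that emits spans with OpenTelemetry over OTLP to a collector that forwards to Jaeger; the service name, endpoint, operation names, and attribute are placeholder assumptions for your environment. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed.

Python

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Assumed endpoint: an OTLP/gRPC receiver (a recent Jaeger collector, or an
# OpenTelemetry Collector in front of Jaeger).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    # Each unit of work becomes a span; nested calls become child spans automatically.
    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment service here

place_order("12345")

Once every service is instrumented like this and propagates trace context on its outbound calls, the Jaeger UI can stitch the spans into the end-to-end request paths described above.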
Observability With Prometheus

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed specifically for cloud-native environments. As part of the Cloud Native Computing Foundation (CNCF), it has become the default monitoring solution for Kubernetes due to its reliability, scalability, and deep integration with containerized applications.

Why Use Prometheus in Kubernetes?

- Time-series monitoring: Captures metrics in a time-series format, enabling historical analysis.
- Powerful query language (PromQL): Allows users to filter, aggregate, and analyze metrics efficiently.
- Scalability: Handles massive workloads across large Kubernetes clusters.
- Multi-cloud deployment: Can be deployed across AWS, Azure, and GCP Kubernetes clusters for unified observability.
- Integration with Grafana: Provides real-time dashboards and visualizations.
- Alerting mechanism: Works with Alertmanager to notify teams about critical issues.

How Prometheus Works in Kubernetes

Prometheus scrapes metrics from various sources within the Kubernetes cluster, including:

- The Kubernetes API server for node and pod metrics.
- Application endpoints exposing Prometheus-formatted metrics.
- Node exporters for host-level system metrics.
- Custom metrics exporters for application-specific insights.

How Prometheus Is Used in Multi-Cloud Environments

Prometheus supports multi-cloud observability by:

- Deploying Prometheus instances per cluster to collect and store local metrics.
- Using Thanos or Cortex for long-term storage, enabling centralized querying across multiple clusters.
- Integrating with Grafana to visualize data from different cloud providers in a single dashboard.
- Leveraging Alertmanager to route alerts dynamically based on cloud-specific policies.

In short, Prometheus is the go-to monitoring solution for Kubernetes, providing powerful observability into containerized workloads. When combined with Grafana, Thanos, and Alertmanager, it forms a comprehensive monitoring stack suitable for both single-cluster and multi-cloud environments.

This diagram shows how Prometheus scrapes metrics from multiple services (e.g., Service 1 and Service 2) and sends the collected data to Grafana for visualization. Grafana serves as the user interface where metrics are displayed in dashboards for real-time monitoring and alerting.
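To show what an "application endpoint exposing Prometheus-formatted metrics" can look like, here is a small sketch (not from the original article) using the prometheus_client Python library; the metric names, labels, and port are illustrative assumptions rather than a prescribed schema.

Python

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics; names and labels are assumptions, not a required schema.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

def handle_request(path: str) -> None:
    with LATENCY.labels(path=path).time():      # records the duration into the histogram
        time.sleep(random.uniform(0.01, 0.1))   # simulated work
    REQUESTS.labels(path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics for Prometheus to scrape
    while True:
        handle_request("/checkout")

In a cluster, a Service plus a scrape annotation (or a ServiceMonitor, if you run the Prometheus Operator) is typically enough for Prometheus to discover and scrape an endpoint like this.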
Long-Term Metrics Storage With Thanos

What is Thanos?

Thanos is an open-source system designed to extend Prometheus' capabilities by providing long-term metrics storage, high availability, and federated querying across multiple clusters. It ensures that monitoring data is retained for extended periods while allowing centralized querying of distributed Prometheus instances.

Why Use Thanos in Kubernetes?

- Long-term storage: Retains Prometheus metrics indefinitely, overcoming local retention limits.
- High availability: Ensures continued access to metrics even if a Prometheus instance fails.
- Multi-cloud and multi-cluster support: Enables federated monitoring across Kubernetes clusters on AWS, Azure, and GCP.
- Query federation: Aggregates data from multiple Prometheus instances into a single view.
- Cost-effective storage: Supports object storage backends like Amazon S3, Google Cloud Storage, and Azure Blob Storage.

How Thanos Works With Prometheus

Thanos extends Prometheus by introducing the following components:

- Sidecar: Attaches to Prometheus instances and uploads data to object storage.
- Store Gateway: Allows querying of stored metrics across clusters.
- Querier: Provides a unified API for running queries across multiple Prometheus deployments.
- Compactor: Optimizes and deduplicates historical data.

Comparison: Prometheus vs. Thanos

Feature | Prometheus | Thanos
Data retention | Limited (based on local storage) | Long-term storage in object stores
High availability | No built-in redundancy | HA setup with global querying
Multi-cluster support | Single-cluster focus | Multi-cluster observability
Query federation | Not supported | Supported across clusters

In short, Thanos is a must-have addition to Prometheus for organizations running multi-cluster and multi-cloud Kubernetes environments. It provides scalability, availability, and long-term storage, ensuring that monitoring data is never lost and remains accessible across distributed systems.

Log Aggregation and Observability With Grafana Loki

What is Grafana Loki?

Grafana Loki is a log aggregation system designed specifically for Kubernetes environments. Unlike traditional log management solutions, Loki does not index log content, making it highly scalable and cost-effective. It integrates seamlessly with Prometheus and Grafana, allowing users to correlate logs with metrics for better troubleshooting.

Why Use Grafana Loki in Kubernetes?

- Lightweight and efficient: Does not require full-text indexing, reducing storage and processing costs.
- Scalability: Handles high log volumes across multiple Kubernetes clusters.
- Multi-cloud ready: Can be deployed on AWS, Azure, and GCP, supporting centralized log aggregation.
- Seamless Prometheus integration: Allows correlation of logs with Prometheus metrics.
- Powerful query language (LogQL): Enables efficient filtering and analysis of logs.

How Grafana Loki Works in Kubernetes

Loki ingests logs from multiple sources, including:

- Promtail: A lightweight log agent that collects logs from Kubernetes pods.
- Fluentd/Fluent Bit: Alternative log collectors for forwarding logs to Loki.
- Grafana dashboards: Visualize logs alongside Prometheus metrics for deep observability.

Comparison: Grafana Loki vs. Traditional Log Management

Feature | Grafana Loki | Traditional log systems (ELK, Splunk)
Indexing | Indexes only labels (lightweight) | Full-text indexing (resource-intensive)
Scalability | Optimized for large-scale clusters | Requires significant storage and CPU
Cost | Lower cost due to minimal indexing | Expensive due to indexing overhead
Integration | Works natively with Prometheus and Grafana | Requires additional integrations
Querying | Uses LogQL for efficient filtering | Uses full-text search and queries

In short, Grafana Loki is a powerful yet lightweight log aggregation tool that provides scalable and cost-effective log management for Kubernetes environments. By integrating with Grafana and Prometheus, it enables full-stack observability, allowing teams to quickly diagnose issues and improve system reliability.

This diagram shows Grafana Loki collecting logs from multiple services (e.g., Service 1 and Service 2) and forwarding them to Grafana for visualization. Loki efficiently stores logs, while Grafana provides an intuitive interface for analyzing and troubleshooting logs.
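To give a feel for LogQL outside of the Grafana UI, here is a small sketch (not from the original article) that queries Loki's HTTP API directly from Python; the Loki address, namespace label, and filter string are placeholder assumptions (locally you might reach Loki via kubectl port-forward).

Python

import time

import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"  # assumed address

params = {
    "query": '{namespace="payments"} |= "error"',  # LogQL: error lines from one namespace
    "start": int((time.time() - 3600) * 1e9),      # last hour, in nanoseconds
    "end": int(time.time() * 1e9),
    "limit": 50,
}

response = requests.get(LOKI_URL, params=params, timeout=10)
response.raise_for_status()

for stream in response.json().get("data", {}).get("result", []):
    labels = stream.get("stream", {})
    for timestamp, line in stream.get("values", []):
        print(labels.get("pod", "?"), timestamp, line)

The same label selectors work in Grafana's Explore view, which is where most teams would run queries like this day to day.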
2. Policy Enforcement With OPA and Kyverno

What is OPA?

Open Policy Agent (OPA) is an open-source policy engine that provides fine-grained access control and governance for Kubernetes workloads. OPA allows users to define policies using Rego, a declarative query language, to enforce rules across Kubernetes resources.

Why Use OPA in Kubernetes?

- Fine-grained policy enforcement: Enables strict access control at all levels of the cluster.
- Dynamic admission control: Evaluates and enforces policies before resources are deployed.
- Auditability and compliance: Ensures Kubernetes configurations follow compliance frameworks.
- Integration with CI/CD pipelines: Validates Kubernetes manifests before deployment.

This diagram illustrates how OPA handles incoming user requests by evaluating security policies. Requests are either allowed or denied based on these policies. Allowed requests proceed to the Kubernetes service, ensuring policy enforcement for secure access control.

What is Kyverno?

Kyverno is a Kubernetes-native policy management tool that enforces security and governance rules using Kubernetes Custom Resource Definitions (CRDs). Unlike OPA, which requires learning Rego, Kyverno enables users to define policies using familiar Kubernetes YAML.

Why Use Kyverno in Kubernetes?

- Kubernetes-native: Uses CRDs instead of a separate policy language.
- Easy policy definition: Allows administrators to write policies using standard Kubernetes configurations.
- Mutation and validation: Can modify resource configurations dynamically.
- Simplified governance: Enforces best practices for security and compliance.

Comparison: OPA vs. Kyverno

Feature | OPA | Kyverno
Policy language | Uses Rego (custom query language) | Uses native Kubernetes YAML
Integration | Works with Kubernetes and external apps | Primarily for Kubernetes workloads
Mutation | No built-in mutation support | Supports modifying configurations
Ease of use | Requires learning Rego | Simple for Kubernetes admins

How OPA and Kyverno Work in Multi-Cloud Environments

Both OPA and Kyverno help maintain consistent policies across Kubernetes clusters deployed on different cloud platforms.

- OPA: Used in multi-cloud scenarios where policy enforcement extends beyond Kubernetes (e.g., APIs, CI/CD pipelines).
- Kyverno: Ideal for Kubernetes-only policy management across AWS, Azure, and GCP clusters.
- Global policy synchronization: Ensures that all clusters follow the same security and governance policies.

In short, both OPA and Kyverno offer robust policy enforcement for Kubernetes environments, but the right choice depends on the complexity of your governance needs. OPA is powerful for enterprise-scale policies across various systems, while Kyverno simplifies Kubernetes-native policy enforcement.
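For a sense of how a policy decision looks from the client side, here is a minimal sketch (not from the original article) that asks OPA's REST Data API to evaluate a manifest before it is applied, for example from a CI pipeline. The policy path kubernetes/admission/deny, the label requirement, and the localhost address are assumptions; in a real cluster, OPA Gatekeeper or Kyverno would intercept the request at admission time instead.

Python

import requests

# Hypothetical manifest missing a required label, evaluated against an assumed
# `kubernetes.admission` Rego package with a `deny` rule loaded into an OPA
# server listening on localhost:8181.
manifest = {
    "kind": "Deployment",
    "metadata": {"name": "payments", "labels": {}},
}

response = requests.post(
    "http://localhost:8181/v1/data/kubernetes/admission/deny",
    json={"input": {"request": {"object": manifest}}},
    timeout=5,
)
violations = response.json().get("result", [])

if violations:
    print("Policy violations:", violations)
else:
    print("Manifest allowed")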
3. Progressive Delivery With Flagger and Argo Rollouts

What is Flagger?

Flagger is a progressive delivery tool designed for automated canary deployments, blue-green deployments, and A/B testing in Kubernetes. It integrates with service meshes like Istio, Linkerd, and Consul to shift traffic between different application versions based on real-time metrics.

Why Use Flagger in Kubernetes?

- Automated canary deployments: Gradually shifts traffic to a new version based on performance.
- Traffic management: Works with service meshes to control routing dynamically.
- Automated rollbacks: Detects failures and reverts to a stable version if issues arise.
- Metrics-based decision making: Uses Prometheus, Datadog, or other observability tools to determine release stability.
- Multi-cloud ready: Can be deployed across Kubernetes clusters in AWS, Azure, and GCP.

What Are Argo Rollouts?

Argo Rollouts is a Kubernetes controller for progressive delivery strategies, including blue-green deployments, canary releases, and experimentation. It is part of the Argo ecosystem, making it a great choice for GitOps-based workflows.

Why Use Argo Rollouts in Kubernetes?

- GitOps-friendly: Integrates seamlessly with Argo CD for declarative deployments.
- Advanced traffic control: Works with ingress controllers and service meshes to shift traffic dynamically.
- Feature-rich canary deployments: Supports progressive rollouts with fine-grained control over traffic shifting.
- Automated analysis and promotion: Evaluates new versions against key performance indicators (KPIs) before full rollout.
- Multi-cloud deployment: Works across different cloud providers for global application releases.

Comparison: Flagger vs. Argo Rollouts

Feature | Flagger | Argo Rollouts
Integration | Works with service meshes (Istio, Linkerd) | Works with ingress controllers, Argo CD
Deployment strategies | Canary, blue-green, A/B testing | Canary, blue-green, experimentation
Traffic control | Uses the service mesh for traffic shifting | Uses ingress controllers and service meshes
Rollbacks | Automated rollback based on metrics | Automated rollback based on analysis
Best for | Service mesh-based progressive delivery | GitOps workflows and feature flagging

How Flagger and Argo Rollouts Work in Multi-Cloud Environments

Both tools enhance multi-cloud deployments by ensuring safe, gradual releases across Kubernetes clusters.

- Flagger: Works best in service mesh environments, allowing traffic-based gradual deployments across cloud providers.
- Argo Rollouts: Ideal for GitOps-driven pipelines, making declarative, policy-driven rollouts across multiple cloud clusters seamless.

In short, both Flagger and Argo Rollouts provide progressive delivery mechanisms to ensure safe, automated, and data-driven deployments in Kubernetes. Choosing between them depends on your infrastructure setup (service mesh vs. ingress controllers) and workflow preference (standard Kubernetes vs. GitOps).

4. Security and Monitoring With Falco, Tetragon, and Datadog Kubernetes Agent

What is Falco?

Falco is an open-source runtime security tool that detects anomalous activity in Kubernetes clusters. It leverages Linux kernel system calls to identify suspicious behaviors in real time.

Why Use Falco in Kubernetes?

- Runtime threat detection: Identifies security threats based on kernel-level events.
- Compliance enforcement: Ensures best practices by monitoring for unexpected system activity.
- Flexible rule engine: Allows users to define custom security policies.
- Multi-cloud ready: Works across Kubernetes clusters in AWS, Azure, and GCP.

This diagram demonstrates Falco's role in monitoring Kubernetes nodes for suspicious activities. When Falco detects unexpected behavior, it generates alerts for immediate action, helping ensure runtime security in Kubernetes environments.

What is Tetragon?
Tetragon is an eBPF-based security observability tool that provides deep visibility into process execution, network activity, and privilege escalations in Kubernetes.

Why Use Tetragon in Kubernetes?

- High-performance security monitoring: Uses eBPF for minimal overhead.
- Process-level observability: Tracks container execution and system interactions.
- Real-time policy enforcement: Blocks malicious activities dynamically.
- Ideal for zero-trust environments: Strengthens security posture with deep runtime insights.

What is the Datadog Kubernetes Agent?

The Datadog Kubernetes Agent is a full-stack monitoring solution that provides real-time observability across metrics, logs, and traces, integrating seamlessly with Kubernetes environments.

Why Use the Datadog Kubernetes Agent?

- Unified observability: Combines metrics, logs, and traces in a single platform.
- Security monitoring: Detects security events and integrates with compliance frameworks.
- Multi-cloud deployment: Works across AWS, Azure, and GCP clusters.
- AI-powered alerts: Uses machine learning to identify anomalies and prevent incidents.

Comparison: Falco vs. Tetragon vs. Datadog Kubernetes Agent

Feature | Falco | Tetragon | Datadog Kubernetes Agent
Monitoring focus | Runtime security alerts | Deep process-level security insights | Full-stack observability and security
Technology | Uses kernel system calls | Uses eBPF for real-time insights | Uses agent-based monitoring
Anomaly detection | Detects rule-based security events | Detects system behavior anomalies | AI-driven anomaly detection
Best for | Runtime security and compliance | Deep forensic security analysis | Comprehensive monitoring and security

How These Tools Work in Multi-Cloud Environments

- Falco: Monitors Kubernetes workloads in real time across cloud environments.
- Tetragon: Provides low-latency security insights, ideal for large-scale, multi-cloud Kubernetes deployments.
- Datadog Kubernetes Agent: Unifies security and observability for Kubernetes clusters running across AWS, Azure, and GCP.

In short, each of these tools serves a unique purpose in securing and monitoring Kubernetes workloads. Falco is great for real-time anomaly detection, Tetragon provides deep security observability, and the Datadog Kubernetes Agent offers a comprehensive monitoring solution.

5. Autoscaling With Keda

What is Keda?

Kubernetes Event-Driven Autoscaling (Keda) is an open-source autoscaler that enables Kubernetes workloads to scale based on event-driven metrics. Unlike traditional Horizontal Pod Autoscaling (HPA), which primarily relies on CPU and memory usage, Keda can scale applications based on custom metrics such as queue length, database connections, and external event sources.

Why Use Keda in Kubernetes?

- Event-driven scaling: Supports scaling based on external event sources (Kafka, RabbitMQ, Prometheus, etc.).
- Efficient resource utilization: Reduces the number of running pods when demand is low, cutting costs.
- Multi-cloud support: Works across Kubernetes clusters in AWS, Azure, and GCP.
- Works with existing HPA: Extends Kubernetes' built-in Horizontal Pod Autoscaler.
- Flexible metric sources: Can scale applications based on logs, messages, or database triggers.

How Keda Works in Kubernetes

Keda consists of two main components:

- Scaler: Monitors external event sources (e.g., Azure Service Bus, Kafka, AWS SQS) and determines when scaling is needed.
- Metrics Adapter: Passes event-based metrics to Kubernetes' HPA to trigger pod scaling.
Comparison: Keda vs. Traditional HPA

Feature | Traditional HPA | Keda
Scaling trigger | CPU and memory usage | External events (queues, messages, databases, etc.)
Event-driven | No | Yes
Custom metrics | Limited support | Extensive support via external scalers
Best for | CPU/memory-bound workloads | Event-driven applications

How Keda Works in Multi-Cloud Environments

- AWS: Scales applications based on SQS queue depth or DynamoDB load.
- Azure: Supports Azure Event Hub, Service Bus, and Functions.
- GCP: Integrates with Pub/Sub for event-driven scaling.
- Hybrid/multi-cloud: Works across cloud providers by integrating with Prometheus, RabbitMQ, and Redis.

In short, Keda is a powerful autoscaling solution that extends Kubernetes' capabilities beyond CPU- and memory-based scaling. It is particularly useful for microservices and event-driven applications, making it a key tool for optimizing workloads across multi-cloud Kubernetes environments.

This diagram represents how Keda scales Kubernetes pods dynamically based on external event sources like Kafka, RabbitMQ, or Prometheus. When an event trigger is detected, Keda scales pods in the Kubernetes cluster accordingly to handle increased demand.

6. Networking and Service Mesh With Istio and Linkerd

What is Istio?

Istio is a powerful service mesh that provides traffic management, security, and observability for microservices running in Kubernetes. It abstracts network communication between services and enhances reliability through load balancing, security policies, and tracing.

Why Use Istio in Kubernetes?

- Traffic management: Implements fine-grained control over traffic routing, including canary deployments and retries.
- Security and authentication: Enforces zero-trust security with mutual TLS (mTLS) encryption.
- Observability: Integrates with tools like Prometheus, Jaeger, and Grafana for deep monitoring.
- Multi-cloud and hybrid support: Works across Kubernetes clusters in AWS, Azure, and GCP.
- Service discovery and load balancing: Automatically discovers services and balances traffic efficiently.

This diagram illustrates how Istio controls traffic flow between services (e.g., Service A and Service B). Istio enables mTLS encryption for secure communication and offers traffic control capabilities to manage service-to-service interactions within the Kubernetes cluster.

What is Linkerd?

Linkerd is a lightweight service mesh designed to be simpler and faster than Istio while providing essential networking capabilities. It offers automatic encryption, service discovery, and observability for microservices.

Why Use Linkerd in Kubernetes?

- Lightweight and simple: Easier to deploy and maintain than Istio.
- Automatic mTLS: Provides encrypted communication between services by default.
- Low resource consumption: Requires fewer system resources than Istio.
- Native Kubernetes integration: Uses Kubernetes constructs for streamlined management.
- Reliable and fast: Optimized for performance with minimal overhead.
Comparison: Istio vs. Linkerd

Feature | Istio | Linkerd
Complexity | Higher complexity, more features | Simpler, easier to deploy
Security | Advanced security (mTLS, RBAC) | Lightweight mTLS encryption
Observability | Deep integration with tracing and monitoring tools | Basic logging and metrics support
Performance | More resource-intensive | Lightweight, optimized for speed
Best for | Large-scale enterprise deployments | Teams needing a simple service mesh

How Istio and Linkerd Work in Multi-Cloud Environments

- Istio: Ideal for enterprises running multi-cloud Kubernetes clusters with advanced security, routing, and observability needs.
- Linkerd: Suitable for lightweight service mesh deployments across hybrid cloud environments where simplicity and performance are key.

In short, both Istio and Linkerd are excellent service mesh solutions, but the choice depends on your organization's needs. Istio is best for feature-rich, enterprise-scale networking, while Linkerd is ideal for those who need a simpler, lightweight solution with strong security and observability.

7. Deployment Validation and SLO Monitoring With Keptn

What is Keptn?

Keptn is an open-source control plane that automates deployment validation, service-level objective (SLO) monitoring, and incident remediation in Kubernetes. It helps organizations ensure that applications meet predefined reliability standards before and after deployment.

Why Use Keptn in Kubernetes?

- Automated quality gates: Validates deployments against SLOs before full release.
- Continuous observability: Monitors application health using Prometheus, Dynatrace, and other tools.
- Self-healing capabilities: Detects performance degradation and triggers remediation workflows.
- Multi-cloud ready: Works across Kubernetes clusters on AWS, Azure, and GCP.
- Event-driven workflow: Uses cloud-native events to trigger automated responses.

How Keptn Works in Kubernetes

Keptn integrates with Kubernetes to provide automated deployment verification and continuous performance monitoring:

- Quality gates: Ensure that applications meet reliability thresholds before deployment.
- Service-level indicators (SLIs): Monitor key performance metrics (latency, error rate, throughput).
- SLO evaluation: Compares SLIs against predefined objectives to determine deployment success.
- Remediation actions: Trigger rollback or scaling actions if service quality degrades.

Comparison: Keptn vs. Traditional Monitoring Tools

Feature | Keptn | Traditional monitoring (e.g., Prometheus)
SLO-based validation | Yes | No
Automated rollbacks | Yes | Manual intervention required
Event-driven actions | Yes | No
Remediation workflows | Yes | No
Multi-cloud support | Yes | Yes

How Keptn Works in Multi-Cloud Environments

- AWS: Works with AWS Lambda, EKS, and CloudWatch for automated remediation.
- Azure: Integrates with Azure Monitor and AKS for SLO-driven validation.
- GCP: Supports GKE and Stackdriver for continuous monitoring.
- Hybrid cloud: Works across multiple Kubernetes clusters for unified service validation.

In short, Keptn is a game-changer for Kubernetes deployments, enabling SLO-based validation, self-healing, and continuous reliability monitoring. By automating deployment verification and incident response, Keptn ensures that applications meet performance and availability standards across multi-cloud Kubernetes environments.

Conclusion

Kubernetes observability and reliability are essential for ensuring seamless application performance across multi-cloud and hybrid cloud environments.
The tools discussed in this guide — Jaeger, Prometheus, Thanos, Grafana Loki, OPA, Kyverno, Flagger, Argo Rollouts, Falco, Tetragon, Datadog Kubernetes Agent, Keda, Istio, Linkerd, and Keptn — help organizations optimize monitoring, security, deployment automation, and autoscaling. By integrating these tools into your Kubernetes strategy, you can achieve enhanced visibility, automated policy enforcement, secure deployments, and efficient scalability, ensuring smooth operations in any cloud environment.
Mastering Kubernetes Observability: Boost Performance, Security, and Stability With Tracestore, OPA, Flagger, and Custom Metrics
By Prabhu Chinnasamy
Exploring Reactive and Proactive Observability in the Modern Monitoring Landscape
By Abeetha Bala
Secure Your Oracle Database Passwords in AWS RDS With a Password Verification Function
By Arvind Toorpu
Building Generative AI Services: An Introductory and Practical Guide

Amazon Web Services (AWS) offers a vast range of generative artificial intelligence solutions that allow developers to add advanced AI capabilities to their applications without having to worry about the underlying infrastructure. This guide highlights the creation of functional applications using Amazon Bedrock, a serverless, API-based offering that provides access to foundation models from leading suppliers, including Anthropic, Stability AI, and Amazon.

As the demand for AI-powered applications grows, developers seek easy and scalable ways to integrate generative AI into their applications. AWS provides this capability through its generative AI services, and the standout among these is Amazon Bedrock. Amazon Bedrock lets you access foundation models via API without worrying about underlying infrastructure, scaling, and model training. Through this practical guide, you will learn how to use Bedrock for a variety of generation tasks, including Q&A, summarization, image generation, conversational AI, and semantic search.

Local Environment Setup

Let's get started by setting up the AWS SDK for Python and configuring our AWS credentials.

Shell

pip install boto3
aws configure

Confirm that your account has access to the Bedrock service and the underlying foundation models via the AWS console. Once done, we can experiment with some generative AI use cases!

Intelligent Q&A With Claude v2

This application demonstrates how to create a question-and-answer assistant using Anthropic's Claude v2 model. Framing the input as a conversation allows you to instruct the assistant to give concise, on-topic answers to user questions. Such an application is especially well suited to customer service, knowledge bases, or virtual helpdesk agents. Let's take a look at a practical example of talking with Claude:

Python

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "prompt": "Human: How can I reset my password?\n\nAssistant:",
    "max_tokens_to_sample": 200,
    "temperature": 0.7,
    "stop_sequences": ["\nHuman:"]
}

response = client.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

This prompt simulates a human question to which a knowledgeable assistant gives structured and coherent answers. A variation of this method can be used to create custom assistants that provide logically correct responses to user queries.

Summarization Using Amazon Titan

The Amazon Titan text model enables easy summarization of long texts into concise, meaningful abstractions. This greatly improves the reading experience, enhances user engagement, and reduces cognitive load for applications such as news reporting, legal documents, and research papers.

Python

body = {
    "inputText": "Cloud computing provides scalable IT resources via the internet...",
    "taskType": "summarize"
}

response = client.invoke_model(
    modelId="amazon.titan-text-lite-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

By altering the nature of the task and the source text, the same strategy can be applied to content simplification, keyword extraction, and paraphrasing.

Text-to-Image Generation Using Stability AI

Visual content is crucial to marketing, social media, and product design.
Using Stability AI's Stable Diffusion model in Bedrock, a user can generate images from text prompts, simplifying creative workflows and enabling real-time content generation features.

Python

import base64
from PIL import Image
from io import BytesIO

body = {
    "prompt": "A futuristic smart ring with a holographic display on a table",
    "cfg_scale": 10,
    "steps": 50
}

response = client.invoke_model(
    modelId="stability.stable-diffusion-xl-v0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

image_data = json.loads(response['body'].read())
img_bytes = base64.b64decode(image_data['artifacts'][0]['base64'])
Image.open(BytesIO(img_bytes)).show()

This technique is especially well suited to user interface mockups, game asset production, or real-time visualization tools in design software.

Conversation With Claude v2

Let's expand on the Q&A example. This use case demonstrates a multi-turn conversation experience with Claude v2. The assistant maintains context and answers appropriately across conversational turns:

Python

conversation = """
Human: Help me plan a trip to Seattle.
Assistant: Sure! Business or leisure?
Human: Leisure.
Assistant: """

body = {
    "prompt": conversation,
    "max_tokens_to_sample": 200,
    "temperature": 0.5,
    "stop_sequences": ["\nHuman:"]
}

response = client.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

Multi-turn conversation is crucial for building booking agents, chatbots, or any agent that gathers information from users step by step.

Using Embeddings for Retrieval

Text embeddings are numerical representations that capture semantic meaning. Amazon Titan generates embeddings that can be stored in vector databases and used for semantic search, recommendation systems, or similarity measurement.

Python

body = {
    "inputText": "Explain zero trust architecture."
}

response = client.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

embedding_vector = json.loads(response["body"].read())['embedding']
print(len(embedding_vector))

You can retrieve documents by meaning using embeddings, which greatly improves retrieval efficiency for consumer and enterprise applications.

Additional Day-to-Day Applications

By combining these usage scenarios, developers can build well-architected, production-grade applications. For example:

- A customer service system can use Claude for question-and-answer conversations, Titan to summarize content, and embeddings to search for documents.
- A design application can use Stable Diffusion to generate images based on user-defined parameters.
- A bot driven by Claude can escalate requests to a human through AWS Lambda functions.

AWS Bedrock provides out-of-the-box integration with services including Amazon Kendra (enterprise search across documents), AWS Lambda (serverless backend functionality), and Amazon API Gateway (scalable APIs) to enable full-stack generative applications.
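As a small extension (not from the original article), the sketch below shows one way to use the Titan embeddings call above for a toy semantic search: embed a handful of documents, embed the query, and rank by cosine similarity. The documents and query are made-up examples, and it reuses the Bedrock client defined earlier; in production you would typically store the vectors in a vector database rather than a Python list.

Python

import json
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text):
    # Reuses the Bedrock client and Titan embeddings model shown above.
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text})
    )
    return json.loads(response["body"].read())["embedding"]

documents = [
    "Reset your password from the account settings page.",
    "Invoices are emailed on the first day of each month.",
    "Enable multi-factor authentication to secure your login.",
]
indexed = [(doc, embed(doc)) for doc in documents]

query_vector = embed("How do I change my password?")
best_doc, _ = max(indexed, key=lambda item: cosine_similarity(query_vector, item[1]))
print(best_doc)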
Conclusion

Generative AI services from AWS, especially Amazon Bedrock, provide developers with versatile, scalable tools to implement advanced AI use cases with ease. By using serverless APIs to invoke text, image, and embedding models, you can accelerate product development without managing model infrastructure. Whether building assistants, summarizers, generators, or search engines, Bedrock delivers enterprise-grade performance and simplicity.

By Srinivas Chippagiri
From Code to Customer: Building Fault-Tolerant Microservices With Observability in Mind

Microservices have become the go-to approach for building systems that need to scale efficiently and stay resilient under pressure. However, a microservices architecture comes with many potential points of failure—dozens or even hundreds of distributed components communicating over a network. To ensure your code makes it all the way to the customer without hiccups, you need to design for failure from the start. This is where fault tolerance and observability come in. By embracing Site Reliability Engineering (SRE) practices, developers can build microservices that not only survive failures but automatically detect and recover from them.

In this article, we'll explore how to build fault-tolerant backend microservices on Kubernetes, integrating resilience patterns (retries, timeouts, circuit breakers, bulkheads, rate limiting, etc.) with robust observability, monitoring, and alerting. We'll also compare these resilience strategies and provide practical examples—from Kubernetes health probes to alerting rules—to illustrate how to keep services reliable from code to customer.

Microservices, Kubernetes, and the Need for Resilience

Microservices offer benefits like scalability, elasticity, and agility, but their distributed nature means more things can go wrong. A single user request might traverse multiple services; if any one of them fails or slows down, the whole user experience suffers. In a monolithic system, failures are usually contained within a single process. But in a microservices architecture, if failures aren't handled carefully, they can ripple across services and cause broader system issues. Running microservices on Kubernetes adds another layer: while Kubernetes provides self-healing (it can restart crashed containers) and horizontal scaling, it's still up to us to ensure each service is resilient to partial failures.

Fault tolerance is a system's ability to keep running—even if something goes wrong. It might not work perfectly, but it can still function, often in a limited or degraded mode, instead of crashing completely. This is different from just high availability. Fault tolerance expects that components will fail and designs the system to handle it gracefully (ideally without downtime). In practice, building fault-tolerant microservices means anticipating failures—timeouts, errors, crashes, network issues—and coding defensively against them.

Site Reliability Engineering (SRE), introduced by Google, bridges the gap between development and operations, with reliability at its core. An SRE approach encourages designing systems with reliability as a feature. Key SRE concepts include redundancy, automation of recovery, and setting Service Level Objectives (SLOs) (e.g., "99.9% of requests succeed within 200ms"). In a microservices context, SRE practices translate to using robust resilience patterns and strong observability so that we can meet our SLOs and catch issues early. The SRE mindset of "embracing failure" means designing systems with the expectation that things will go wrong. So instead of hoping everything always works, we build in ways to handle failures gracefully when they happen.

Observability is critical here—it's the property of the system that allows us to understand its internal state from the outside. As the saying goes, "you can't fix what you can't see." We need deep visibility into our microservices in production to detect failures and anomalous behavior.
Traditional monitoring might not be enough for highly distributed systems, so modern observability focuses on collecting rich telemetry (logs, metrics, traces) and correlating it to get a clear picture of system health. Observability, combined with alerting, ensures that when something does go wrong (and eventually it will), we know about it immediately and can pinpoint the cause.

In summary, microservices demand resilience. By leveraging Kubernetes features and SRE best practices—designing for failure, implementing resilience patterns, and instrumenting everything for observability—we can build services that keep running smoothly from code to customer even when the unexpected happens.

Resilience Patterns for Fault-Tolerant Microservices

One of the best ways to build fault tolerance is to implement proven resilience design patterns in your microservices. These patterns, often used in combination, help the system handle errors gracefully and prevent failures from snowballing. Below we describe key resilience techniques—retries, circuit breakers, timeouts, rate limiting, bulkheads, and graceful degradation—and how they improve reliability.

Retries and Exponential Backoff

Retrying a failed operation is a simple but effective way to handle transient faults. If a request to a service times out or returns an error due to a temporary issue (like a brief network glitch or the service being momentarily overwhelmed), the calling service can wait a bit and try the request again. Often the second attempt will succeed if the issue was transient. This improves user experience by avoiding unnecessary errors for one-off hiccups.

However, retries must be used carefully. It's crucial to limit the number of retries and use exponential backoff (increasing the wait time between attempts) to avoid a retry storm that might flood the network or the struggling service. For example, a service might retry after 1 second, then 2 seconds, then 4 seconds, and give up after 3 attempts. Backoff (plus a bit of random jitter) prevents all clients from retrying in sync and knocking a service down harder. It's also important that the operation being retried is idempotent, meaning it can be repeated without side effects. If not, a retry might cause duplicate actions (like charging a customer twice). In practice, many frameworks support configurable retry policies. For instance, in Java you might use Resilience4j or Spring Retry to automatically retry calls, and in .NET the Polly library does similarly.

Pros: Retries are straightforward to implement and great for transient failures that are likely to succeed on a second try. They improve resiliency without much complex logic.

Cons: If not bounded, retries can pile extra load onto a service that's already having a hard time — which can just make the problem worse. They also can increase latency for the end user (while the caller keeps retrying). That's why combining retries with the next pattern (timeouts) and with circuit breakers is critical to know when to stop retrying.
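As a minimal illustration (not from the original article), here is a Python sketch that combines a per-call timeout with bounded retries and exponential backoff plus jitter, assuming an idempotent GET against a hypothetical downstream URL.

Python

import random
import time

import requests

def get_with_retries(url, attempts=3, base_delay=1.0, timeout=0.3):
    """Call a downstream service with a per-request timeout and exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)  # fail fast on slow calls
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == attempts:
                raise                                      # give up; let the caller fall back
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)                              # 1s, 2s, 4s ... plus jitter

# Example call against a hypothetical service:
# inventory = get_with_retries("http://inventory-service/items/42")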
Setting the right timeout value requires balancing: too long a timeout and the user waits unnecessarily; too short and you might cut off a successful response. A common approach is to base timeouts on expected SLA of the service – e.g. if Service B usually responds in 100ms, maybe set a 300ms timeout for some buffer. If the timeout is exceeded, you can treat it as an error (often triggering a retry or a fallback). Timeouts work hand-in-hand with retries: for example, timeout after 300ms, then retry (with backoff). They also feed into circuit breakers (too many timeouts might trip the breaker, as repeated slow responses indicate a problem). Pros: Timeouts prevent slow services from stalling the entire system . They release resources promptly and trigger your error-handling logic (retry or fallback) so that the user isn’t stuck waiting. This helps the system keep running smoothly. Cons: Choosing timeout values can be tricky. If a timeout is set too low, you might cancel requests that would have succeeded (causing a failure that wouldn’t have happened). On the other hand, a too-high timeout just delays the inevitable. Also, a timeout by itself doesn’t solve the problem – you need a plan for what to do next (retry, return an error, etc.). It’s one piece of the puzzle, but an essential one. Circuit Breakers When a downstream service is consistently failing or unreachable, circuit breakers step in to protect your system. A circuit breaker is analogous to an electrical circuit breaker in your house: when too many faults happen, it “trips” and opens the circuit, preventing further attempts that are likely to fail. In microservices, a circuit breaker watches the interactions between services and stops calls to a service that is either down or severely struggling . This stops the system from wasting effort on calls that are unlikely to succeed and gives the downstream service some breathing room to recover. Circuit Breaker States: A circuit breaker usually cycles through three states—Closed, Open, and Half-Open. When it’s Closed, everything’s running smoothly, and requests pass through as normal. But if failures start piling up (like hitting 50% errors or too many timeouts in a short time), the breaker flips to Open. In the Open state, calls get blocked right away—instead of repeatedly trying a service that’s struggling, you immediately return an error or a fallback, saving time and resources. After some cooldown period, the breaker enters Half-Open state, where it lets a small number of test requests through. If those calls go through successfully, it means the service has recovered, so the breaker closes and things go back to normal. If the test requests fail, the breaker goes back to Open for another wait period. This cycle prevents endless failures and tries to only resume calls when the service seems likely to be OK. Circuit Breaker Pattern States – the breaker opens when failures exceed a threshold, then periodically allows test requests in half-open state to probe if the service has recovered. Using circuit breakers, you can also provide fallback logic when the breaker is open – e.g., return cached data or a default response, so the user gets something useful (this is part of graceful degradation, discussed later). Libraries like Netflix Hystrix (now in maintenance) or newer ones like Resilience4j implement the circuit breaker pattern for you, tracking error rates and managing these state transitions automatically. 
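For illustration, here is a minimal sketch of how the thresholds and states described above might be configured with Resilience4j; the service name, the numbers, and the fallback value are hypothetical and would need tuning for real traffic.

Java

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // trip when at least 50% of recent calls fail
                .slidingWindowSize(20)                           // evaluate failures over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open for 30s before probing
                .permittedNumberOfCallsInHalfOpenState(3)        // allow 3 trial calls in half-open state
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("recommendation-service", config);

        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, CircuitBreakerExample::callRecommendations);

        String result;
        try {
            result = guarded.get();
        } catch (Exception e) {
            // When the breaker is open (or the call fails), fall back to a generic response.
            result = "best-sellers";
        }
        System.out.println(result);
    }

    static String callRecommendations() {
        // Placeholder for a call to a hypothetical downstream recommendations service.
        return "personalized-recommendations";
    }
}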
Service meshes (like Istio) can even do circuit breaking at the network level for any service. Pros: Circuit breakers help stop problems from spreading by isolating the service that’s failing. They also fail fast—a user gets an error or fallback immediately instead of waiting through retries to a down service. This improves overall system responsiveness and stability. Additionally, by shedding load to an unhealthy service, that service gets breathing room to recover rather than being hammered continuously. Cons: Circuit breakers add complexity – there’s an extra component to configure and monitor. You need to tune the thresholds (how many failures trip it? how long to stay open? how many trial requests in half-open?) for your specific traffic patterns and failure modes. Misconfiguration can either make the breaker too sensitive (tripping too often) or not sensitive enough. Also, during the time a breaker is open, some functionality is unavailable (unless you have a good fallback), which might degrade the user experience – but arguably better a degraded experience than a total outage. Rate Limiting and Throttling Sometimes the “failure” you need to handle is simply too much traffic. Services can become overwhelmed not just by internal failures, but by clients suddenly sending far more requests than the system can handle (whether due to a spike or a bug or malicious usage). Rate limiting is a strategy to curb the rate of incoming requests to a service, so that it doesn’t overload and crash. Essentially, you allow only a certain number of requests per second (or minute) and either queue or reject excess requests beyond that rate. In practice, rate limiting often uses algorithms like token buckets or leaky buckets to smooth out bursts. You might implement it at an API gateway or load balancer in front of your service, or within the service code. For example, an API might enforce that each user can only make 100 requests per minute – ensuring fairness and protecting the backend from abuse. In a microservices context, you might also limit how quickly one service can bombard another. By shedding excess load, the service can focus on handling a manageable volume of requests, thus maintaining overall system stability . This is sometimes called load shedding – under extreme load, prefer to serve fewer requests quickly and correctly, rather than trying to serve everything and failing across the board. It’s better for some users to get a “please try again later” than for all users to experience a crash. Pros: Rate limiting helps keep important services from getting overloaded. It ensures graceful degradation under high load by dropping or deferring less important traffic . It also improves fairness (one noisy client can’t starve others) and can be used to enforce quotas or SLAs. Cons: If not tuned, rate limiting can inadvertently reject legitimate traffic, hurting user experience. It also requires a strategy for what to do with excess requests (do you buffer them in a queue? Simply drop them?). In distributed systems, implementing a global rate limit is hard – you might have to coordinate counters across multiple instances. There’s also the question of which requests to shed first (some systems prioritize based on importance – e.g. drop reporting or analytics calls before user-facing ones). Bulkhead Isolation The term “bulkhead” comes from ship design – compartments in a ship ensure that if one section floods, it doesn’t sink the entire vessel. 
In microservices, the Bulkhead Pattern means isolating resources for each component or functionality so that a failure in one does not incapacitate others . Practically, this could mean allocating separate thread pools or connection pools for different downstream calls, or running certain services on dedicated hardware or Kubernetes nodes. For example, imagine a service that calls two external APIs: one for payments and one for notifications. Without bulkheads, if the payment API hangs and all your service threads get stuck waiting on it, your service might exhaust its thread pool and be unable to even serve the (independent) notifications functionality. If you apply bulkheading, you could have one thread pool (or goroutine pool) for payment calls and a separate one for notification calls. Then a slowdown in payments will only exhaust the payment pool, while the notification pool is still free to process its tasks . In essence, you partition your system’s resources by concern, so that one failing part can’t exhaust everything. Another example at the infrastructure level is running multiple instances of a microservice across different nodes or availability zones. That way, if one node has an issue, not all instances of the service are wiped out. Kubernetes can help here with anti-affinity rules (to not co-locate all replicas on one node) and PodDisruptionBudgets (to ensure a minimum number of pods remain up during maintenance). Bulkhead pattern can also mean having dedicated caches or databases per service, so one service’s DB overload doesn’t slow down another. Pros: Bulkheads provide strong fault isolation . Each service or component gets its own sandbox of resources – if it fails or overloads, others continue running unaffected. This prevents cascade failures and increases overall system robustness. It’s especially useful to protect critical services from less critical ones (or from noisy neighbors in multi-tenant scenarios). Cons: The downside is resource fragmentation. If you dedicate, say, 5 threads exclusively to payments and 5 to notifications, what if payments are idle but notifications have a huge backlog? You might be under-utilizing resources that could have been used elsewhere. Bulkheading requires you to forecast and allocate capacity per segment of your system, which isn’t always straightforward. It also adds configuration complexity – separate pools, separate deployments, etc. Still, in systems where reliability is paramount, accepting some inefficiency in exchange for isolation is usually worth it. Graceful Degradation and Fallbacks Despite all these protective measures, sometimes a part of your system will fail outright or become unavailable. Graceful degradation means designing your application to degrade gracefully when that happens, rather than catastrophically crashing or returning cryptic errors. In practice, this involves implementing fallbacks – alternative code paths or default behaviors that activate when a dependency is unavailable . For example, if your personalized recommendations service is down, your e-commerce site could fall back to showing best-selling products (a generic list) instead of personal recommendations. Users still see product suggestions – not as tailored, but better than an error message or an empty section. 
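To make the thread-pool isolation described above concrete, here is a minimal sketch using plain Java executors; the pool sizes and the payment/notification calls are hypothetical placeholders.

Java

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BulkheadExample {

    // Separate, bounded thread pools so a hang in the payment API cannot starve notifications.
    private static final ExecutorService paymentPool = Executors.newFixedThreadPool(5);
    private static final ExecutorService notificationPool = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        Future<String> payment = paymentPool.submit(BulkheadExample::callPaymentApi);
        Future<String> notification = notificationPool.submit(BulkheadExample::callNotificationApi);

        // Even if the payment pool is saturated by slow calls, notifications keep flowing.
        System.out.println(notification.get(2, TimeUnit.SECONDS));
        System.out.println(payment.get(2, TimeUnit.SECONDS));

        paymentPool.shutdown();
        notificationPool.shutdown();
    }

    static String callPaymentApi() {        // placeholder for a hypothetical downstream call
        return "payment-ok";
    }

    static String callNotificationApi() {   // placeholder for a hypothetical downstream call
        return "notification-ok";
    }
}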
Similarly, if an optional microservice like a user review service is slow, you might time out and simply display the page without reviews, perhaps with a note like “Reviews are unavailable at the moment.” The key is that the core functionality continues to work, and users are informed subtly that a feature is degraded rather than the entire app failing. Graceful degradation often goes hand-in-hand with the other patterns: circuit breakers trigger the fallback logic when tripped, and bulkheads ensure other parts can still function. Feature toggles can also help – you can automatically disable certain non-critical features in an outage to reduce load. Caching can also help with graceful degradation—by serving stored data when the live source isn’t available.

Pros: This pattern maintains partial functionality for the user. Rather than an all-or-nothing system, you prioritize what’s most important and ensure that remains available. It improves user experience by at least providing a meaningful response or page, even if some features are absent. It also buys you time – if users can still do the primary tasks, your team can work on the outage for the secondary service without fire-fighting a total production outage.

Cons: Implementing fallbacks increases development effort – you basically need to handle double logic (the primary path and the secondary path). Not everything has an obvious fallback; some features might just have to be offline. And if you rely too much on degraded mode, you might miss that a dependency is down (so make sure to still alert on it!). But overall, the cons are minimal compared to the reliability gained by thoughtful graceful degradation.

To summarize these patterns: each one addresses a specific failure mode, and they are even more powerful when combined. For instance, you might set a timeout on a request, retry a couple of times with backoff, then if it is still failing trigger the circuit breaker, which cuts off calls and uses a fallback – all while other parts of the system remain safe thanks to bulkheads and rate limits. Next, we’ll compare these strategies side-by-side and then dive into observability and how to monitor all this in Kubernetes.

Comparing Resilience Strategies

The following comparison covers the key resilience strategies – retries, circuit breakers, timeouts, rate limiting, and bulkheads – highlighting their benefits, drawbacks, and best-fit scenarios:

Retries (with backoff)
Pros: Simple to implement; addresses transient faults effectively. Increases success rate for momentary glitches (network blips, etc.).
Cons: Can increase load if overused (risk of retry storms). Adds latency if many retries; requires idempotent operations.
Best use cases: Handling intermittent failures (e.g., brief network timeouts). Use in combination with timeouts and only for operations safe to repeat.

Circuit Breakers
Pros: Prevent cascade failures by halting calls to bad components. Fail-fast improves user experience during outages (no long hangs). Allows the downstream service time to recover.
Cons: Introduces complexity (must manage states and thresholds). Misconfiguration can trigger too often or not enough. While open, some functionality is unavailable (need fallbacks).
Best use cases: Calling external or unstable services where failures can spike. Protecting critical pathways from repeated failures in dependencies. Use when you can provide a reasonable default response on failure.

Timeouts
Pros: Prevent the system from hanging on slow responses. Free resources promptly, enabling quicker recovery or retries. Essential for keeping threads from blocking indefinitely.
Cons: Choosing the right timeout is hard (too low = false fail, too high = slow). A timeout still results in an error unless paired with a retry or fallback. If misused, can cut off genuinely slow but needed responses.
Best use cases: All external calls should have a timeout (baseline best practice). Use shorter timeouts on interactive user requests, slightly longer for background tasks. Tune based on observed response times and SLOs (e.g., 95th percentile latency).

Rate Limiting
Pros: Shields services from overload by shedding excess load. Ensures fair usage (no single client can hog resources). Can maintain overall system responsiveness under high load.
Cons: Legitimate traffic might be dropped if limits are too strict. Needs coordination in distributed systems (to enforce global limits). Users may see errors or throttling responses if hitting the limit.
Best use cases: Public APIs or multi-tenant systems to enforce quotas. Protecting downstream services that have fixed capacity or costly operations. As a form of “load shedding” during traffic spikes (serve most users normally, shed the rest).

Bulkheads
Pros: Strong fault isolation: one service’s failure can’t take down others. Preserves resources for critical functions even if others fail. Improves stability in complex systems by containing failures.
Cons: Can lead to under-utilization (static partitioning of resources). More configuration (managing separate pools, deployments, etc.). Hard to predict the optimal resource split among components.
Best use cases: Separating critical vs. non-critical workloads (e.g., dedicating capacity for core features). Any scenario where one bad actor or heavy load could starve others (thread pools per client or function). Deploying redundant instances across different nodes/zones to avoid single points of failure.

As this comparison suggests, each strategy has a role to play. Often, you will combine several of them to cover different failure scenarios. For example, a well-designed microservice might include timeouts for all outbound calls, retries to handle temporary failures, a circuit breaker to prevent cascading issues during persistent outages, bulkhead isolation to keep one part of the system from overwhelming the rest, and rate limiting on incoming requests to protect against overload. These patterns complement each other. The end goal is a service that remains responsive and correct, delivering at least a baseline service to the customer even when dependencies misbehave or the system is under stress. Now that we’ve covered how to build fault tolerance into the code and architecture, let’s turn to observability – how do we ensure we can monitor and alert on all these moving parts?
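First, though, here is a compact sketch of how several of these patterns can be composed around a single outbound call using Resilience4j's Decorators helper. The service name and the callReviewService() method are hypothetical, default configurations are used purely for brevity, and the exact composition API may differ between library versions.

Java

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

public class ResilientCallExample {

    public static void main(String[] args) {
        Retry retry = Retry.ofDefaults("reviews");                      // bounded retries with backoff
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("reviews");  // stop calling a failing dependency
        RateLimiter limiter = RateLimiter.ofDefaults("reviews");        // cap the outbound call rate
        Bulkhead bulkhead = Bulkhead.ofDefaults("reviews");             // bound concurrent calls to this dependency

        Supplier<String> decorated = Decorators.ofSupplier(ResilientCallExample::callReviewService)
                .withRetry(retry)
                .withCircuitBreaker(breaker)
                .withRateLimiter(limiter)
                .withBulkhead(bulkhead)
                .decorate();

        String reviews;
        try {
            reviews = decorated.get();
        } catch (Exception e) {
            reviews = "Reviews are unavailable at the moment."; // graceful degradation fallback
        }
        System.out.println(reviews);
    }

    static String callReviewService() {
        // Placeholder for an HTTP call (with its own request timeout) to a hypothetical review service.
        return "5 reviews";
    }
}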
Observability and Monitoring in Practice

Building a fault-tolerant system is not just about writing resilient code; it’s also about having excellent visibility into the system’s behavior at runtime. Observability is about instrumenting the system so that you can answer the question “what’s going on inside?” by examining its outputs (logs, metrics, traces). In the context of SRE practices, observability and monitoring are what enable us to enforce reliability targets and catch issues before they impact customers. Let’s discuss how to integrate observability into your microservices and Kubernetes deployments.

Health Checks and Kubernetes Probes

One of the simplest but most effective observability mechanisms is the health check. A health check is an endpoint or function that reports whether a service is healthy (e.g., can connect to its database, has its necessary conditions met, etc.). Health checks are used by Kubernetes through liveness and readiness probes to automate self-healing and rolling updates:

Liveness Probe: Tells Kubernetes if the container is still alive and working. If the liveness check fails (for example, your app’s health endpoint is unresponsive or returns an unhealthy status), Kubernetes will assume the container is deadlocked or crashed and will restart it. This is great for automatically recovering from certain failures – your app will be restarted without human intervention, improving uptime.

Readiness Probe: This tells Kubernetes whether your application is ready to serve requests. If the probe fails, the pod isn’t shut down—instead, Kubernetes simply stops routing traffic to it by removing it from the service endpoints. The container keeps running in the background, giving it time to recover and become ready again without being restarted. This allows an app to signal “I’m not OK to serve right now” (e.g., during startup, or if it lost connection to a dependency) so that no requests are routed to it until it’s ready again. This prevents sending traffic to a broken instance.

You can implement probes as simple HTTP endpoints (like /healthz for liveness, /ready for readiness) or even just a TCP socket check. Here’s an example of how to configure liveness and readiness probes in a Kubernetes Deployment YAML:

YAML

containers:
  - name: my-service
    image: my-service:latest
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz   # health endpoint for liveness
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 15
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready     # ready endpoint for readiness
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3

In this snippet, Kubernetes will start checking the liveness of the pod 30 seconds after it starts, hitting /healthz. If that endpoint fails too many times, Kubernetes will kill and restart the container. The readiness check begins 5 seconds after the pod starts. Until the /ready endpoint responds successfully, the pod won’t receive any traffic from the service. These probes implement automatic self-healing and load balancing control—a cornerstone of SRE’s autonomous recovery principle (systems that fix themselves).

Beyond Kubernetes-level health checks, you should also consider application-level health. Many microservices include a health check that also verifies connections to crucial dependencies (database, message queue, etc.) and perhaps the status of recent operations. This can feed into readiness status – e.g., if DB is down, mark the app as not ready.
Fail fast and stop taking requests if you know you can’t handle them properly. Metrics, Logging, and Tracing (The Three Pillars) A robust observability setup typically stands on three pillars: metrics, logs, and traces. Each provides a different view: Metrics are numeric measurements over time – things like request rates, error counts, latency percentiles, CPU usage, memory, etc. Metrics are key for monitoring trends and triggering alerts. For microservices, important metrics include HTTP request throughput, response times, error rates (e.g. number of 5xx responses), database query durations, queue lengths, and so on. SRE often focuses on the “four golden signals” of latency, traffic, errors, and saturation—for each service, you’d want to measure how long requests take, how many are coming through, how many are failing, and resource usage like CPU/memory or backlog. By instrumenting your code to record metrics (for instance, using Prometheus client libraries or OpenTelemetry metrics), you gain the ability to see how the system is performing internally. Logs are the sequence of events and messages that your services produce. They are indispensable for debugging. When something goes wrong (e.g., a circuit breaker trips or a request fails after all retries), the logs will contain the details of what happened. Aggregating logs from all microservices (using a centralized logging system like the ELK stack – Elasticsearch/Kibana – or cloud log solutions) is important so you can search across services. With microservices, a single user transaction might produce logs in multiple services – having them indexed with timestamps and maybe a trace ID (more on that next) will help reconstruct the story when investigating incidents. Distributed Tracing ties everything together by tracking the path of a request as it flows through multiple services. Each request gets a unique trace ID, and each service records spans (with timing) for its part of the work. This allows you to see, for example, that a user request to endpoint /checkout touched Service A (200ms), then called Service B (150ms of which 50ms was waiting on Service C), etc., and ultimately one component had a 500ms delay that slowed down the whole request. Tracing is extremely useful for pinpointing where in a complex chain a slowdown or error occurred. Tools like Zipkin, Jaeger, or OpenTelemetry’s tracing can be integrated into your microservices. Many frameworks will automatically propagate trace IDs (often via headers like X-Trace-ID) so that all logs and metrics can also be tagged with the trace, giving you a correlated view. Distributed tracing brings insight into cross-service interactions that metrics and logs alone might miss . Implementing these in Kubernetes: you might deploy a Prometheus server to scrape metrics from all services (each service exposes an HTTP /metrics endpoint). You’d set up a Grafana dashboard to visualize metrics and look at trends (e.g., request rate vs error rate over time). For logs, you could use Fluentd/Fluent Bit as a DaemonSet to collect container logs and send to Elasticsearch or a hosted logging service. For tracing, running a Jaeger agent or OpenTelemetry Collector on the cluster can gather spans from services and store them for analysis. Modern service meshes also often collect metrics and traces automatically for all traffic, which can be a shortcut to instrumenting every service. 
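To close the loop from telemetry to action, alerting rules turn these metrics into notifications. The following is a minimal Prometheus alerting rule sketch; the http_requests_total metric name, its status label, and the 5% threshold are assumptions that depend on how your services are instrumented.

YAML

groups:
  - name: service-slos
    rules:
      - alert: HighErrorRate
        # Assumes services expose a counter such as http_requests_total with a status label.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "More than 5% of requests are failing; check recent deploys and downstream dependencies."

With a rule like this loaded into Prometheus (and routed through Alertmanager), a sustained error-rate breach pages someone instead of waiting for a customer complaint.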
Conclusion Delivering reliable microservices from code to customer requires a blend of smart design, engineering discipline, and operational awareness. By implementing resilience patterns such as retries, circuit breakers, timeouts, rate limiting, and bulkheads, we can make our microservices robust against many types of failures. Techniques like graceful degradation ensure that even when something does break, users experience a degraded service rather than a total outage. Running these services on Kubernetes, we leverage features like liveness/readiness probes and auto-scaling for self-healing and responsiveness. At the same time, observability and SRE practices tie it all together—with thorough monitoring, logging, tracing, and alerting, we catch issues early and meet our reliability targets. As one author put it, building fault-tolerant microservices requires a combination of “redundancy, isolation, graceful degradation, monitoring, automated recovery, and rigorous testing,” all working in unison . By investing in these patterns and practices, you ensure that your microservices can withstand failures and still serve your customers. The road from code to customer is fraught with unexpected bumps—but with a fault-tolerant architecture and observability-driven operations, you can navigate it, delivering a smooth and reliable experience to users. In the end, resilience is not just a technical feature, but a customer feature: it keeps your service trustworthy and responsive, which is what ultimately keeps your users happy.

By Ravi Teja Thutari
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution

It's 3 AM. Your phone buzzes with an alert. A critical API is responding slowly, with angry customer tweets already appearing. Your architecture spans dozens of microservices across multiple cloud providers. Where do you even begin?

Without distributed tracing, you're reduced to:

Checking individual service metrics, trying to guess which might be the culprit
Digging through thousands of log lines across multiple services
Manually correlating timestamps to guess at request paths
Hoping someone on your team remembers how everything connects

But with distributed tracing in place, you can:

See the entire request flow from frontend to database and back
Immediately identify which specific service is introducing latency
Pinpoint exact database queries, API calls, or code blocks causing the problem
Deploy a targeted fix within minutes instead of hours

As Ben Sigelman, co-creator of OpenTelemetry, puts it: "Distributed systems have become the norm, not the exception, and with that transition comes a new class of observability challenges." When your microservices architecture resembles a complex spider web, how do you track down that one frustrating bottleneck causing your customers pain?

The Three Pillars of Observability

Logs: Detailed records of discrete events
Metrics: Aggregated numerical measurements over time
Traces: End-to-end request flows across distributed systems

Charity Majors, CTO at Honeycomb, explains their relationship: "Metrics tell you something's wrong. Logs might tell you what's wrong. Traces tell you why and where it's wrong."

What Is Distributed Tracing?

Distributed tracing tracks requests as they propagate through distributed systems, creating a comprehensive picture showing:

The path taken through various services
Time spent in each component
Dependency relationships
Failure points and error propagation

Each "span" in a trace represents a unit of work in a specific service, capturing timing information, metadata, and contextual logs.

Real-World Impact: When Tracing Saves the Day

Shopify's Black Friday Victory

During Black Friday 2020, Shopify processed $2.9 billion in sales across their architecture of thousands of microservices. Jean-Michel Lemieux, former CTO, shared how distributed tracing helped them identify a database contention issue invisible in logs and metrics. The fix was deployed within minutes, avoiding potential millions in lost revenue.

Uber's Mysterious Timeouts

Uber encountered riders experiencing timeouts only in certain regions and times of day. Their traces revealed these issues occurred when requests routed through a specific API gateway with an authentication middleware component that became CPU-bound under specific conditions—a needle that would have remained hidden in their haystack without tracing.

How Tracing Fits with Metrics and Logs

The three pillars work best together in a complementary workflow: Metrics serve as your front-line defense, signaling when something's wrong. Logs provide detailed context about specific events. Traces connect the dots between services, revealing the "why" and "where."

As Frederic Branczyk, Principal Engineer at Polar Signals, explains: "Metrics tell you something is wrong. Logs help you understand what's wrong. But traces help you understand why it's wrong."
Getting Started with Distributed Tracing

Step 1: Choose Your Framework

OpenTelemetry (opentelemetry.io): The CNCF's vendor-neutral standard that's becoming the industry default
Jaeger (jaegertracing.io): A mature CNCF graduated project for end-to-end tracing

Step 2: Instrument Your Code

Modern tracing libraries provide automatic instrumentation for popular frameworks and libraries. Here's a simple example using OpenTelemetry in JavaScript:

// Initialize OpenTelemetry
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

// Create a span for a critical operation
async function processOrder(orderId) {
  const span = tracer.startSpan('process-order');
  span.setAttribute('order.id', orderId);

  try {
    // Your business logic here
    await validateOrder(orderId);
    await processPayment(orderId);
    await shipOrder(orderId);

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    throw error;
  } finally {
    span.end(); // Always remember to end the span!
  }
}

Step 3: Set Up Collection and Storage

Several excellent options exist to collect and visualize your traces:

Open-source: Jaeger, Zipkin, SigNoz
Commercial: Honeycomb, Datadog, New Relic
Cloud-native: AWS X-Ray, Google Cloud Trace, Azure Application Insights

Step 4: Focus on Meaningful Data

Start with critical paths and high-value transactions. Add business context through tags like customer IDs and transaction types. The OpenTelemetry Semantic Conventions provide excellent guidance on what to instrument.

Step 5: Start Small, Then Expand

Begin with a pilot project before scaling across your architecture. Many teams start by instrumenting their API gateway and one critical downstream service to demonstrate value.

Common Pitfalls to Avoid

Excessive Data Collection: Leading to high costs and noise
Poor Sampling: Missing critical issues
Inadequate Context: Not capturing enough business information
Incomplete Coverage: Missing key services or dependencies
Siloed Analysis: Failing to connect traces with metrics and logs

The Future of Distributed Tracing

Watch for these emerging trends:

AI-powered anomaly detection
Continuous profiling integration
Enhanced privacy controls
eBPF-based instrumentation
Business-centric observability

Conclusion: From Haystack to Clarity

In today's complex distributed systems, finding the root cause of performance issues can feel like searching for a needle in a haystack. Distributed tracing transforms this process by illuminating the entire request journey.

Tracing is not optional for serious distributed systems. While logs and metrics remain essential, they simply cannot provide the end-to-end visibility that modern architectures demand. Without distributed tracing, you're operating with a dangerous blind spot—seeing symptoms without understanding root causes, detecting failures without understanding their propagation paths.

End-to-end observability requires all three pillars working together:

Metrics to detect problems
Logs to understand details
Traces to connect everything and show the complete picture

As Cindy Sridharan, author of "Distributed Systems Observability," wrote: "The best time to implement tracing was when you built your first microservice. The second-best time is now."

Your future self—especially the one getting paged at 3 AM—will thank you. Don't wait for the next production crisis to start your tracing journey.

By Rishab Jolly
Secure IaC With a Shift-Left Approach
Secure IaC With a Shift-Left Approach

Imagine you're building a skyscraper—not just quickly, but with precision. You rely on blueprints to make sure every beam and every bolt is exactly where it should be. That’s what Infrastructure as Code (IaC) is for today’s cloud-native organizations—a blueprint for the cloud. As businesses race to innovate faster, IaC helps them automate and standardize how cloud resources are built. But here’s the catch: speed without security is like skipping the safety checks on that skyscraper. One misconfigured setting, an exposed secret, or a non-compliant resource can bring the whole thing down—or at least cause serious trouble in production. That’s why the shift-left approach to secure IaC matters more than ever. What Does “Shift-Left” Mean in IaC? Shifting left refers to moving security and compliance checks earlier in the development process. Rather than waiting until deployment or runtime to detect issues, teams validate security policies, compliance rules, and access controls as code is written—enabling faster feedback, reduced rework, and stronger cloud governance. For IaC, this means, Scanning Terraform templates and other configuration files for vulnerabilities and misconfigurations before they are deployed.Validating against cloud-specific best practices.Integrating policy-as-code and security tools into CI/CD pipelines. Why Secure IaC Matters? IaC has completely changed the game when it comes to managing cloud environments. It’s like having a fast-forward button for provisioning—making it quicker, more consistent, and easier to repeat across teams and projects. But while IaC helps solve a lot of the troubles around manual operations, it’s not without its own set of risks. The truth is, one small mistake—just a single misconfigured line in a Terraform script—can have massive consequences. It could unintentionally expose sensitive data, leave the door open for unauthorized access, or cause your setup to drift away from compliance standards. And because everything’s automated, those risks scale just as fast as your infrastructure. In cloud environments like IBM Cloud, where IaC tools like Terraform and Schematics automate the creation of virtual servers, networks, storage, and IAM policies, a security oversight can result in- Publicly exposed resources (e.g., Cloud Object Storage buckets or VPC subnets).Over-permissive IAM roles granting broader access than intended.Missing encryption for data at rest or in transit.Hard-coded secrets and keys within configuration files.Non-compliance with regulatory standards like GDPR, HIPAA, or ISO 27001. These risks can lead to data breaches, service disruptions, and audit failures—especially if they go unnoticed until after deployment. Secure IaC ensures that security and compliance are not afterthoughts but are baked into the development process. It enables: Early detection of mis-configurations and policy violations.Automated remediation before deployment.Audit-ready infrastructure, with traceable and versioned security policies.Shift-left security, empowering developers to code safely without slowing down innovation. When done right, Secure IaC acts as a first line of defense, helping teams deploy confidently while reducing the cost and impact of security fixes later in the lifecycle. Components of Secure IaC Framework The Secure IaC Framework is structured into layered components that guide organizations in embedding security throughout the IaC lifecycle. 
Building Blocks of IaC (Core foundation for all other layers)—These are the fundamental practices required to enable any Infrastructure as Code approach:

Use declarative configuration (e.g., Terraform, YAML, JSON).
Embrace version control (e.g., Git) for all infrastructure code.
Define idempotent and modular code for reusable infrastructure.
Enable automation pipelines (CI/CD) for repeatable deployments.
Follow consistent naming conventions, tagging policies, and code linting.

Build Secure Infrastructure - Focuses on embedding secure design and architectural patterns into the infrastructure baseline:

Use secure-by-default modules (e.g., encryption, private subnets).
Establish network segmentation, IAM boundaries, and resource isolation.
Configure monitoring, logging, and default denial policies.
Choose secure providers and verified module sources.

Automate Controls - Empowers shift-left security by embedding controls into the development and delivery pipelines (see the pipeline sketch at the end of this article):

Run static code analysis (e.g., Trivy, Checkov) pre-commit and in CI.
Enforce policy-as-code using OPA or Sentinel for approvals and denials.
Integrate configuration management and IaC test frameworks (e.g., Terratest).

Detect & Respond - Supports runtime security through visibility, alerting, and remediation:

Enable drift detection tools to track deviations from IaC definitions.
Use runtime compliance monitoring (e.g., IBM Cloud SCC).
Integrate with SOAR platforms or incident playbooks.
Generate security alerts for real-time remediation and Root Cause Analysis (RCA).

Design Governance—Establishes repeatable, scalable security practices across the enterprise:

Promote immutable infrastructure for consistent and tamper-proof environments.
Use golden modules or signed templates with organizational guardrails.
Implement change management via GitOps, PR workflows, and approval gates.
Align with compliance standards (e.g., CIS, NIST, ISO 27001) and produce audit reports.

Anatomy of Secure IaC

Creating a secure IaC environment involves incorporating several best practices and tools to ensure that the infrastructure is resilient, compliant, and protected against potential threats. These practices are implemented and tracked at various phases of the IaC environment lifecycle:

The design phase involves not just the IaC script design and tooling decisions, but also the design of how organizational policies are incorporated into the IaC scripts.
The development phase involves coding best practices, implementing the IaC scripts and associated policies, and the pre-commit checks that the developer can run before committing. These checks help ensure a clean code check-in and detect code smells upfront.
The build phase involves all the code security checks and policy verification. This is a quality gate in the pipeline that stops the deployment on any failures.
The deployment phase supports deployment to various environments along with their respective configurations.
The maintenance phase is also crucial, as threat detection, vulnerability detection, and monitoring play a key role.

Key Pillars of Secure IaC

Below is a list of key pillars of Secure IaC, incorporating all the essential tools and services.
These pillars align with cloud-native capabilities to enforce a secure-by-design, shift-left approach for Infrastructure as Code: Reference templates like Deployable Architectures or AWS Terraform Modules. Reusable, templatized infrastructure blueprints designed for security, compliance, and scalability.Promotes consistency across environments (dev/test/prod).Often include pre-approved Terraform templates.Managed IaC platformsllike IBM Cloud Schematics or AWS CloudFormation Enables secure execution of Terraform code in isolated workspaces.Supports: Role-Based Access Control (RBAC)Encrypted variablesApproval workflows (via GitOps or manual)Versioned infrastructure plansLifecycle resource management using IBM Cloud Projects or Azure Blueprints Logical grouping of cloud resources tied to governance and compliance requirements.Simplifies multi-environment deployments (e.g. dev, QA, prod).Integrates with IaC deployment and CI/CD for isolated, secure automation pipelines.Secrets Management Centralized secrets vault to manage: API keysCertificatesIAM credentialsProvides dynamic secrets, automatic rotation, access logging, and fine-grained access policies.Key Management Solutions (KMS/HSM) Protect sensitive data at rest or in transit Manages encryption keys with full customer control and auditability.KMS-backed encryption is critical for storage, databases, and secrets.Compliance Posture Management Provides posture management and continuous compliance monitoring.Enables: Policy-as-Code checks on IaC deploymentsCustom rules enforcementCompliance posture dashboards (CIS, NIST, GDPR)Introduce Continuous Compliance (CC) pipelines as part of the CI/CD pipelines for shift-left enforcement.CI/CD Pipelines (DevSecOps) Integrate security scans and controls into delivery pipelines using GitHub Actions, Tekton, Jenkins, or IBM Cloud Continuous DeliveryPipeline stages include: Terraform lintingStatic analysis (Checkov, tfsec)Secrets scanningCompliance policy validationChange approval gates before Schematics applyPolicy-as-Code Use tools like OPA (Open Policy Agent) policies to: Block insecure resource configurationsRequire tagging, encryption, and access policiesAutomate compliance enforcement during plan and applyIAM & Resource Access Governance Apply least privilege IAM roles for projects, and API keys.Use resource groups to scope access boundaries.Enforce fine-grained access to Secrets Manager, KMS, and Logs.Audit and Logging Integrate with Cloud Logs to: Monitor infrastructure changesAudit access to secrets, projects, and deploymentsDetect anomalies in provisioning behaviorMonitoring and Drift Detection Use monitoring tools like IBM Instana, Drift Detection, or custom Terraform state validation to: Continuously monitor deployed infrastructureCompare live state to defined IaCRemediate unauthorized changes Checklist: Secure IaC 1. Code Validation and Static Analysis Integrate static analysis tools (e.g., Checkov, TFSec) into your development workflow. Scan Terraform templates for misconfigurations and security vulnerabilities. Ensure compliance with best practices and CIS benchmarks. 2. Policy-as-Code Enforcement Define security policies using Open Policy Agent (OPA) or other equivalent tools. Enforce policies during the CI/CD pipeline to prevent non-compliant deployments. Regularly update and audit policies to adapt to evolving security requirements. 3. Secrets and Credential Management Store sensitive information in Secrets Manager. Avoid hardcoding secrets in IaC templates. 
Implement automated secret rotation and access controls. 4. Immutable Infrastructure and Version Control Maintain all IaC templates in a version-controlled repository (e.g., Git). Implement pull request workflows with mandatory code reviews. Tag and document releases for traceability and rollback capabilities. 5. CI/CD Integration with Security Gates Incorporate security scans and compliance checks into the CI/CD pipeline. Set up approval gates to halt deployments on policy violations. Automate testing and validation of IaC changes before deployment. 6. Secure Execution Environment Utilize IBM Cloud Schematics or AWS Cloud Formation or any equivalent tool for executing Terraform templates in isolated environments. Restrict access to execution environments using IAM roles and policies. Monitor and log all execution activities for auditing purposes. 7. Drift Detection and Continuous Monitoring Implement tools to detect configuration drift between deployed resources and IaC templates. Regularly scan deployed resources for compliance. Set up alerts for unauthorized changes or policy violations. Benefits of Shift-Left Secure IaC Here are the key benefits of adopting Shift-Left Secure IaC, tailored for cloud-native teams focused on automation, compliance, and developer enablement: Early Risk Detection and RemediationFaster, More Secure DeploymentsAutomated Compliance EnforcementReduced Human Error and Configuration DriftImproved Developer ExperienceEnhanced Auditability and TraceabilityReduced Cost of Security FixesStronger Governance with IAM and RBACContinuous Posture Assurance Conclusion Adopting a shift-left approach to secure IaC in cloud platforms isn’t just about preventing mis-configurations—it’s about building smarter from the start. When security is treated as a core part of the development process rather than an afterthought, teams can move faster with fewer surprises down the line. With cloud services like Schematics, Projects, Secrets Manager, Key Management, Cloud Formation, and Azure Blueprints, organizations have all the tools they need to catch issues early, stay compliant, and automate guardrails. However, the true benefit extends beyond security—it establishes the foundation for platform engineering. By baking secure, reusable infrastructure patterns into internal developer platforms, teams create a friction-less, self-service experience that helps developers ship faster without compromising governance.
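As a concrete illustration of the CI/CD security gate described in the checklist above, here is a minimal pipeline sketch. It assumes GitHub Actions, a terraform/ directory, and the Checkov CLI; the workflow name, paths, and tool choice are placeholders you would adapt to your own pipeline and scanners.

YAML

# .github/workflows/iac-scan.yml
name: iac-security-scan
on:
  pull_request:
    paths:
      - "terraform/**"
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform format check
        run: terraform fmt -check -recursive terraform/
      - name: Static analysis with Checkov
        run: |
          pip3 install checkov
          checkov -d terraform/ --quiet   # non-zero exit fails the job on policy violations

A failing scan blocks the pull request, which is exactly the kind of approval gate the checklist calls for before any Terraform plan is applied.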

By Josephine Eskaline Joyce DZone Core CORE
OTel Me Why: The Case for OpenTelemetry Beyond the Shine
OTel Me Why: The Case for OpenTelemetry Beyond the Shine

My blog on pricing from the other day caught the attention of the folks over at MetricFire, and we struck up a conversation about some of the ideas, ideals, and challenges swirling around monitoring, observability, and its place in the broader IT landscape. At one point, JJ, the lead engineer, asked, “You blogged about gearing up to get a certification in Open Telemetry. What is it about OTel that has you so excited?” I gave a quick answer, but JJ’s question got me thinking, and I wanted to put some of those ideas down here. OTel Is the Best Thing Since… Let me start by answering JJ’s question directly: I find Open Telemetry exciting because it’s the biggest change in the way monitoring and observability are done since Traces (which came out around 2000, but wasn’t widely used until 2010-ish). And Traces were the biggest change since… ever. Let me explain. See this picture? This was what it was like to use monitoring to understand your environment back when I started almost 30 years ago. What we wanted was to know what was happening in that boat. But that was never an option. We could scrape metrics together from network and OS commands, we could build some scripts and db queries that gave me a little bit more insight. We could collect and (with a lot of work) aggregate log messages together to spot trends across multiple systems. All of that would give us an idea of how the infrastructure was running, and infer the things that might be happening topside. But we never really knew. Tracing changed all that. All of a sudden we could get hard data (and get it in real time) about what users were doing, and what was happening in the application when they did it. It was a complete sea change (pun intended) for how we worked and what we monitored. Even so, tracing didn’t remove the need for metrics and logs. And famous (or infamous) “three pillars” of observability. Recently I started working through the book “Learning OpenTelemetry” and one of the comments that struck me was that these aren’t “three pillars” in the sense that they don’t combine to hold up a unified whole. Authors Ted Young and Austin Parker re-framed the combination of Metrics, Logs, and Traces as “The three browser tabs of observability” because many tools put the effort back on the user to to flip between screens and put it all together by sight. On the other hand, OTel outputs is able to present all three streams of data as a single “braid.” From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission. It should be noted that despite OTel’s ability to combine and correlate this information, the authors of the book point out later that many tools still lack the ability to present it that way. Despite it being a work in progress (but what, in the world of IT, isn’t?), I still feel that OTel has already proven its potential to change the face of monitoring and observability. OTel is the Esperanto of Monitoring Almost every vendor will jump at the chance to get you to send all your data to them. They insist that theirs is the One True Observability Tool. In fact, let’s get this out in the open: There simply isn’t a singular “best” monitoring tool out there any more than there’s one singular “best” programming language, or car model, or pizza style.* There isn’t a single tool which will cover 100% of your needs in every single use case. 
And for the larger tools, even the use cases that aren’t part of their absolute sweet spot are going to cost you (in terms of hours or dollars) to get right. So you’re going to have multiple tools. It goes without saying (or at least it should) that you’re not going to ship a full copy of all your data to multiple vendors. Therefore a big part of your work as a monitoring engineer (or team of engineers) is to map your telemetry to the use cases they support, and thus to the tools you need to employ in those use cases. That’s not actually a hard problem. Sure, it’s complex, but once you have the mapping, making it happen is relatively easy. But, as I like to say, it’s not the cost to buy the puppy that is the problem, it’s the cost to keep feeding it. Because the tools you have today are going to change down the road. That’s when things get CRAZY hard. You have to hope things are documented well enough to understand all those telemetry-to-use-case mappings. (Narrator: they will not, in fact, have it documented well enough) Then you have to also hope your instrumentation is documented and understood well enough to know how to de-couple tool x and instrument tool y such that you maintain the same capabilities. (Narrator: this is not how it will go down.) But OTel solves both the “buying the puppy” and the “feeding the puppy” problem. My friend Matt Macdonald-Wallace (Solutions Architect at Grafana) put it like this: OTEL does solve a lot of the problems around ‘Oh great! now we’re trapped with vendor x and it’s going to cost us millions to refactor all this code’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’ Not only that, but OTel's ability to create pipelines (for those who are not up to speed on that concept, it’s the ability to identify, filter, sample, and transform a stream of data before sending it to a specific destination) means you can send the same data stream to multiple locations selectively. Meaning your security team can get their raw unfiltered syslog while it’s still on-premises. Some of the data – traces, logs, and/or metrics – can go to one or more vendors. Which is why I say: OTel is the esperanto of observability. OTel’s Secret Sauce Isn’t OTLP … it’s standardization. Before I explain why the real benefit to Otel is not OTLP, I should take a second to explain what OTLP is: If you look up “What is Open Telemetry Line Protocol?” you’ll probably find some variation of “... a set of standards, rules, and/or conventions that specify how OTel elements send data from the thing that created it to a destination.” This is technically true, but also not very helpful. Functionally, OTLP is the magic box that takes metrics, logs, or traces and sends them where they need to go. It’s not as low level as, say, TCP, but in terms of how it changes a monitoring engineer’s day, it may as well be. We don’t use OTLP so much as we indicate it should be used. Just to be clear, OTLP is amazingly cool and important. It’s just not (in my opinion) AS important as some other aspects. No, there are (at least) two things that, in my opinion, make OTel such an evolutionary shift in monitoring: Collectors First, it standardizes the model of having a 3-tier, collector (not agent) in the middle, architecture. For us old-timers in the monitoring space, the idea of a collector is nothing new. In the bygone era of everything-on-prem, you couldn’t get away with a thousand (or even a hundred) agents all talking to some remote destination. 
The shift to cloud architecture changed all that, but it’s still not the best idea. Having a single (or small number) of load-balanced systems that take all the data from multiple targets—with the added benefit of being able to then process that data, filtering, sampling, combining, etc—before sending it forward is not just A Good Idea™, it can have a direct impact on your bottom line by only sending the data you WANT (and in the form you want it) out the egress port that racks up such a big part of your monthly bill. Semantics Look, I’ll be the first to tell you that I’m not the world’s best developer. So the issue of semantic terminology doesn’t usually keep me up at night. What DOES keep me up is the inability to get at a piece of data that I know should be there, but isn’t. What I mean is that it’s fairly common that the same data point—say bandwidth—is referred to by a completely different name and location on devices from two different vendors. And maybe that doesn’t seem so weird. But how about the same data point being different on two different types of devices from the same vendor? Still not weird? Let’s talk about the same data point being different on the same device type from the same vendor, but two different models? Getting weird, right (not to mention annoying). But the real kicker is when the same data point is different on two different parts of the same DEVICE. Once you’ve run down that particular rabbit whole, you have a whole different appreciation for semantic naming. If I’m looking for CPU or bandwidth or latency or whatever, I would really REALLY like for it to be called the same thing and be found in the semantically same location. OTel does this, and does it as a core aspect of the platform. I’m not the only one to have noticed it, either. Several years ago, during a meeting between the maintainers of Prometheus and OpenTelemetry, an unnamed Prometheus maintainer quipped, “You know, I’m not sure about the rest of this, but these semantic conventions are the most valuable thing I’ve seen in a while.” It may sound a bit silly, but it’s also true. From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission. Summarizing the Data I’ll admit that OpenTelemetry is still very (VERY) shiny for me. But I’ll also admit that the more I dig into it, the more I find to like. Hopefully this blog has given you some reasons to check out OTel, too. * OK, I lied. 1) Perl 2) The 1967 Ford Mustang 390 GT/A and 3) deep dish from Tel Aviv Kosher Pizza in Chicago

By Leon Adato
It Costs That Much Because Observability Takes Hours
It Costs That Much Because Observability Takes Hours

Today’s blog title is inspired by this song, "It Costs That Much." My daughter started singing it regularly after she opened her bakery. Read on for details on that story, and how it relates to observability. I thought of it, and my daughter’s reasons for singing it, after a few responses to my recent blog post, "Observability Expenses: When ‘Pennies on the Dollar’ Add Up Quickly." It touched a nerve, which was nice to see. This is an important, nuanced, and complex conversation. I believe that getting folks involved in this conversation is better for everyone. Some of the thoughtful responses included this one and this one. But apparently for some folks, it touched the WRONG nerve. People misunderstood the point I was trying to make. Some people (and I’m not linking to those posts) took it as a reason to bash vendors for charging anything at all. To posit that all software (and especially all monitoring software) should be free. I wasn’t saying that monitoring tools should be free. I wasn’t even saying that vendors are charging too much. I was saying that if there’s no plan for the observability data we collect, ANY cost seems like too much. The problem is in our lack of clear goals, not the cost of services. If there’s no plan, you can’t manage (or even estimate) the costs. Let me explain by going back to the title of the blog, and the song that inspired it. My daughter tends to sing the catchy little ditty after one of the (unfortunately many) calls with a prospective customer, who tells her – with all the confidence of someone who’s baking expertise extends only as far as opening a box of Duncan Hines cake mix – that her cakes are way too expensive, because, and I quote: “anyone could make them.” To her credit, my daughter remains professional (and calm, which is more than I would have managed). “Free” Isn’t (Always) the Right Answer Monitoring and observability are ferociously difficult to get right, and the people who devote their time (and sometimes careers) to making it work – to say nothing about making it better – deserve all of our respect. They also deserve to be paid for their efforts. Like those folks who call my daughter’s bakery, you know who typically thinks they shouldn’t have to pay for observability? People who have only the most tenuous idea of what goes into it and how it works. For folks at the forefront of observability, like Liz, this is nothing new. There are those who will be quick to say, “But Leon, open source tools like Grafana are free! Why doesn’t everyone else do that?” OK, but hear me out. While the phrase “Linux is only free if your time has no value” is far less true than when Jamie Zawinski wrote it back in 1998, the spirit of it still holds. A thing is only as free as the real total cost of it—inclusive of time, effort, and money. Lest you think I’m working my way toward the old “Car Triangle” conundrum. …that’s NOT actually my point. It actually IS possible to have all three. My daughter can throw together a cake, or a loaf of challah, or a batch of cookies faster, better, and cheaper than she can buy them. But of course, that’s only if you discount all the time it took her to get to this point. I won’t waste your time sharing that complete (and oft-repeated) story of the itemized $10,000 repair bill which read: Breakdown of Charges: Turning the screw: $1 Knowing which screw to turn: $9,999 The point stands—the only cheat code for the car triangle dilemma is if you place yourself in the middle of it, bridging the gaps. 
As the late-90s commercial says, “For everything else, there’s MasterCard.”

Not All Observability Data Is Equal

So, back to my original point: my recent post was not trying to say observability should be free, or that vendors are charging too much. To reiterate the main point: if there isn’t a meaningful plan for the observability data you collect, any cost is too high. The corollary is that if you DO have a meaningful plan, you will also know what it’s worth to you to collect, process, and store that data.

But nothing in our lives as IT practitioners is ever as easy as coming up with a single plan for all our data and calling it a day. Different data types have different value propositions, and some of those value profiles are themselves variable, based on outside factors like the time of year (e.g., Black Friday) or what has just happened. Cribl understood this all the way back in 2022, when they framed the idea succinctly:

“As it happens, all of this high-fidelity data is completely worthless until it comes time for forensics, where it turns out to be invaluable. You have absolutely no need for such a high level of detail until the exact moment when you do need it – and you actually can’t do anything without it.”

To expand on that idea a little: I don’t think the nice folks at Cribl are saying that lo-fi data is fine for daily ops monitoring and observability, and that you only need high-fi data when you’re in the middle of a retro. Instead, I think it’s akin to something my friend Alec Isaacson, Solutions Architect at Grafana, once said: “Murphy’s law requires that you’re probably not collecting the data you need.” Unless, of course, you are.

Cognitive (Data) Dissonance

We don’t need all the data, until we do, and in that liminal moment it goes from worthless to desperately important. We don’t want to pay anything to collect, process, or store the data – until there’s a crisis, at which point we’re willing to spare no expense.

For the poor, beleaguered monitoring engineer (and yes, I will die on the hill that this is a real job, or should be), this may feel like our professional Kobayashi Maru scenario. But it’s not. I believe there’s a way to thread this needle, and in the next blog I’m going to explore exactly that, along with how OTel fits into all of this.

By Leon Adato
Managing Encrypted Aurora DAS Over Kinesis With AWS SDK

When it comes to auditing and monitoring database activity, Amazon Aurora's Database Activity Stream (DAS) provides a secure and near real-time stream of database activity. By default, DAS encrypts all data in transit using AWS Key Management Service (KMS) with a customer-managed key (CMK) and streams the encrypted data into Amazon Kinesis, a serverless streaming data service. While this is great for compliance and security, reading and interpreting the encrypted data stream requires additional effort — particularly if you're building custom analytics, alerting, or logging solutions. This article walks you through how to read the encrypted Aurora DAS records from Kinesis using the AWS Encryption SDK.

Security and compliance are top priorities when working with sensitive data in the cloud — especially in regulated industries such as finance, healthcare, and government. Amazon Aurora's DAS is designed to help customers monitor database activity in real time, providing deep visibility into queries, connections, and data access patterns. However, this stream of data is encrypted in transit by default using a customer-managed AWS KMS key and routed through Amazon Kinesis Data Streams for consumption. While this encryption model enhances data security, it introduces a technical challenge: how do you access and process the encrypted DAS data? The payload cannot be directly interpreted, as it's wrapped in envelope encryption and protected by your KMS CMK.

Understanding the Challenge

Before discussing the solution, it's important to understand how Aurora DAS encryption works:

- Envelope encryption model: Aurora DAS uses envelope encryption, where the data is encrypted with a data key, and that data key is itself encrypted using your KMS key.
- Two encrypted components: Each record in the Kinesis stream contains the database activity events (encrypted with a data key) and the data key (encrypted with your KMS CMK).
- Kinesis data stream format: The records follow this structure:

JSON

{
  "type": "DatabaseActivityMonitoringRecords",
  "version": "1.1",
  "databaseActivityEvents": "[encrypted audit records]",
  "key": "[encrypted data key]"
}

Solution Overview: AWS Encryption SDK Approach

Aurora DAS encrypts data in multiple layers, and the AWS Encryption SDK helps you unwrap all of that encryption so you can see what’s going on. Here's why this specific approach is required:

- Handles envelope encryption: The SDK is designed to work with the envelope encryption pattern used by Aurora DAS.
- Integrates with KMS: It seamlessly integrates with your KMS keys for the initial decryption of the data key.
- Manages cryptographic operations: The SDK handles the complex cryptographic operations required for secure decryption.

The decryption process follows these key steps:

1. First, decrypt the encrypted data key using your KMS CMK.
2. Then, use that decrypted key to decrypt the database activity events.
3. Finally, decompress the decrypted data to get the readable JSON output.

Implementation

Step 1: Set Up Aurora With Database Activity Streams

Before implementing the decryption solution, ensure you have:

- An Aurora PostgreSQL or MySQL cluster with sufficient permissions
- A customer-managed KMS key for encryption
- Database Activity Streams enabled on your Aurora cluster

When you turn on DAS, AWS sets up a Kinesis stream called aws-rds-das-[cluster-resource-id] that receives the encrypted data.
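If you haven't enabled the stream yet, here is a minimal sketch of turning it on with boto3; the cluster ARN, region, and KMS key ID below are placeholders, and the same operation is also available from the console or CLI:

Python

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Start a Database Activity Stream on an existing Aurora cluster.
# ResourceArn and KmsKeyId are placeholders -- substitute your own values.
response = rds.start_activity_stream(
    ResourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
    Mode="async",                # 'sync' or 'async'
    KmsKeyId="your-cmk-key-id",  # the customer-managed KMS key
    ApplyImmediately=True,
)

# The response includes the name of the Kinesis stream DAS writes to,
# which should look like aws-rds-das-<cluster-resource-id>.
print(response["KinesisStreamName"])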
Step 2: Prepare the AWS Encryption SDK Environment

For decrypting DAS events, your processing application (typically a Lambda function) needs the AWS Encryption SDK. This SDK is not included in standard AWS runtimes and must be added separately.

Why this matters: The AWS Encryption SDK provides specialized cryptographic algorithms and protocols designed specifically for the envelope encryption patterns used by AWS services like DAS.

The most efficient approach is to create a Lambda Layer containing:

- aws_encryption_sdk: Required for the envelope decryption process
- boto3: Needed for AWS service interactions, particularly with KMS

Step 3: Implement the Decryption Logic

Here’s a Lambda function example that handles decrypting DAS events. Each part of the decryption process is thoroughly documented with comments in the code:

Python

import base64
import json
import zlib

import boto3
import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy
from aws_encryption_sdk.internal.crypto import WrappingKey
from aws_encryption_sdk.key_providers.raw import RawMasterKeyProvider
from aws_encryption_sdk.identifiers import WrappingAlgorithm, EncryptionKeyType

# Configuration - update these values
REGION_NAME = 'your-region'               # Change to your region
RESOURCE_ID = 'your cluster resource ID'  # Change to your RDS resource ID

# Initialize encryption client with appropriate commitment policy
# This is required for proper operation with the AWS Encryption SDK
enc_client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.FORBID_ENCRYPT_ALLOW_DECRYPT
)


# Custom key provider class for decryption
# This class is necessary to use the raw data key from KMS with the Encryption SDK
class MyRawMasterKeyProvider(RawMasterKeyProvider):
    provider_id = "BC"

    def __new__(cls, *args, **kwargs):
        obj = super(RawMasterKeyProvider, cls).__new__(cls)
        return obj

    def __init__(self, plain_key):
        RawMasterKeyProvider.__init__(self)
        # Configure the wrapping key with proper algorithm for DAS decryption
        self.wrapping_key = WrappingKey(
            wrapping_algorithm=WrappingAlgorithm.AES_256_GCM_IV12_TAG16_NO_PADDING,
            wrapping_key=plain_key,
            wrapping_key_type=EncryptionKeyType.SYMMETRIC
        )

    def _get_raw_key(self, key_id):
        # Return the wrapping key when the Encryption SDK requests it
        return self.wrapping_key


# First decryption step: use the data key to decrypt the payload
def decrypt_payload(payload, data_key):
    # Create a key provider using our decrypted data key
    my_key_provider = MyRawMasterKeyProvider(data_key)
    my_key_provider.add_master_key("DataKey")
    # Decrypt the payload using the AWS Encryption SDK
    decrypted_plaintext, header = enc_client.decrypt(
        source=payload,
        materials_manager=aws_encryption_sdk.materials_managers.default.DefaultCryptoMaterialsManager(
            master_key_provider=my_key_provider)
    )
    return decrypted_plaintext


# Second step: decompress the decrypted data
# DAS events are compressed before encryption to save bandwidth
def decrypt_decompress(payload, key):
    decrypted = decrypt_payload(payload, key)
    # Use zlib with specific window bits for proper decompression
    return zlib.decompress(decrypted, zlib.MAX_WBITS + 16)


# Main Lambda handler function that processes events from Kinesis
def lambda_handler(event, context):
    session = boto3.session.Session()
    kms = session.client('kms', region_name=REGION_NAME)

    for record in event['Records']:
        # Step 1: Get the base64-encoded data from Kinesis
        payload = base64.b64decode(record['kinesis']['data'])
        record_data = json.loads(payload)

        # Step 2: Extract the two encrypted components
        payload_decoded = base64.b64decode(record_data['databaseActivityEvents'])
        data_key_decoded = base64.b64decode(record_data['key'])

        # Step 3: Decrypt the data key using KMS
        # This is the first level of decryption in the envelope model
        data_key_decrypt_result = kms.decrypt(
            CiphertextBlob=data_key_decoded,
            EncryptionContext={'aws:rds:dbc-id': RESOURCE_ID}
        )
        decrypted_data_key = data_key_decrypt_result['Plaintext']

        # Step 4: Use the decrypted data key to decrypt and decompress the events
        # This is the second level of decryption in the envelope model
        decrypted_event = decrypt_decompress(payload_decoded, decrypted_data_key)

        # Step 5: Process the decrypted event
        # At this point, decrypted_event contains the plaintext JSON of database activity
        print(decrypted_event)

        # Additional processing logic would go here
        # For example, you might:
        # - Parse the JSON and extract specific fields
        # - Store events in a database for analysis
        # - Trigger alerts based on suspicious activities

    return {
        'statusCode': 200,
        'body': json.dumps('Processing Complete')
    }

Step 4: Error Handling and Performance Considerations

As you implement this solution in production, keep these key factors in mind:

Error handling:

- KMS permissions: Ensure your Lambda function has the necessary KMS permissions so it can decrypt the data successfully.
- Encryption context: The encryption context must match exactly (aws:rds:dbc-id).
- Resource ID: Make sure you're using the correct Aurora cluster resource ID — if it's off, the KMS decryption step will fail.

Performance considerations:

- Batch size: Configure appropriate Kinesis batch sizes for your Lambda.
- Timeout settings: Decryption operations may require longer timeouts.
- Memory allocation: Processing encrypted streams requires more memory.

Conclusion

Aurora's Database Activity Streams provide powerful auditing capabilities, but the default encryption presents a technical challenge for utilizing this data. By leveraging the AWS Encryption SDK and understanding the envelope encryption model, you can successfully decrypt and process these encrypted streams.

The key takeaways from this article are:

- Aurora DAS uses a two-layer envelope encryption model that requires specialized decryption.
- The AWS Encryption SDK is essential for properly handling this encryption pattern.
- The decryption process involves first decrypting the data key with KMS, then using that key to decrypt the actual events.
- Proper implementation enables you to unlock valuable database activity data for security monitoring and compliance.

By following this approach, you can build robust solutions that leverage the security benefits of encrypted Database Activity Streams while still gaining access to the valuable insights they contain.

By Shubham Kaushik
The Truth About AI and Job Loss

I keep finding myself in conversations with family and friends asking, “Is AI coming for our jobs?” Which roles are getting Thanos-snapped first? Will there still be space for junior individual contributors in organizations? And many more. With so many conflicting opinions, I felt overwhelmed and anxious, so I decided to take action instead of staying stuck in uncertainty. I began collecting historical data and relevant facts to gain a clearer understanding of the direction and impact of the current AI surge.

So, Here’s What We Know

- Microsoft reports that over 30% of the code on GitHub Copilot is now AI-generated, highlighting a shift in how software is being developed.
- Major tech companies — including Google, Meta, Amazon, and Microsoft — have implemented widespread layoffs over the past 18–24 months.
- Current generative AI models, like GPT-4 and CodeWhisperer, can reliably write functional code, particularly for standard, well-defined tasks.
- Productivity gains: Occupations in which many tasks can be performed by AI are experiencing nearly five times higher productivity growth than the sectors with the least AI adoption.
- AI systems still require a human “prompt” or input to initiate the thinking process. They do not ideate independently or possess genuine creativity — they follow patterns and statistical reasoning based on training data.
- Despite rapid progress, today’s AI is still far from achieving human-level general intelligence (AGI). It lacks contextual awareness, emotional understanding, and the ability to reason abstractly across domains without guidance or structured input.
- Job displacement and creation: The World Economic Forum's Future of Jobs Report 2025 reveals that 40% of employers expect to reduce their workforce where AI can automate tasks.
- And many more.

There’s a lot of conflicting information out there, making it difficult to form a clear picture. With so many differing opinions, it's important to ground the discussion in facts. So, let’s break it down from a data engineer’s point of view — by examining the available data, identifying patterns, and drawing insights that can help us make sense of it all.

Navigating the Noise

Let’s start with the topic that’s on everyone’s mind — layoffs. It’s the most talked-about and often the most concerning aspect of the current tech landscape. Below is a trend analysis based on layoff data collected across the tech industry.

Figure 1: Layoffs (in thousands) over time in tech industries

Although the first AI research boom began in the 1980s, the current AI surge started in the late 2010s and gained significant momentum in late 2022 with the public release of OpenAI's ChatGPT. The COVID-19 pandemic further complicated the technological landscape. Initially, there was a hiring surge to meet the demands of a rapidly digitizing world. However, by 2023, the tech industry experienced significant layoffs, with over 200,000 jobs eliminated in the first quarter alone. This shift was attributed to factors such as economic downturns, reduced consumer demand, and the integration of AI technologies.

Since then, as shown in Figure 1, layoffs have continued intermittently, driven by various factors including performance evaluations, budget constraints, and strategic restructuring. For instance, in 2025, companies like Microsoft announced plans to lay off up to 6,800 employees, accounting for less than 3% of its global workforce, as part of an initiative to streamline operations and reduce managerial layers.
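As a concrete (if simplified) illustration of the kind of trend analysis behind Figure 1, here is a minimal pandas sketch. The file name and column names are hypothetical stand-ins for whatever layoff dataset you aggregate; the real chart was built from a larger, cleaned dataset:

Python

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per layoff event, with 'date' and
# 'employees_laid_off' columns. Adjust to match your own dataset.
df = pd.read_csv("tech_layoffs.csv", parse_dates=["date"])

# Roll noisy per-event records up into a quarterly series, in thousands.
quarterly = (
    df.set_index("date")["employees_laid_off"]
      .resample("Q")
      .sum()
      .div(1_000)
)

quarterly.plot(kind="bar", title="Tech layoffs per quarter (thousands)")
plt.tight_layout()
plt.show()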
Between 2024 and early 2025, the tech industry experienced significant workforce reductions. In 2024 alone, approximately 150,000 tech employees were laid off across more than 525 companies, according to data from the US Bureau of Labor Statistics. The trend has continued into 2025, with over 22,000 layoffs reported so far this year, including a striking 16,084 job cuts in February alone, highlighting the ongoing volatility in the sector.

It really makes me think — have all these layoffs contributed to the rise in the US unemployment rate? And has the number of job openings dropped too? I think it’s worth taking a closer look at these trends.

Figure 2: Employment and unemployment counts in the US from the JOLTS database

Figure 2 illustrates employment and unemployment trends across all industries in the United States. Interestingly, the data appear relatively stable over the past few years, which raises some important questions. If layoffs are increasing, where are those workers going? And what about recent graduates who are still struggling to land their first jobs? We’ve talked about the layoffs — now let’s explore where those affected are actually going. While this may not reflect every individual experience, here’s what the available online data reveals.

After the Cuts

Well, I wondered: have tech job openings decreased as well?

Figure 3: Job openings over the years in the US

Even with all the news about layoffs, the tech job market isn’t exactly drying up. As of May 2025, there are still around 238,000 open tech positions across startups, unicorns, and big-name public companies. Just back in December 2024, more than 165,000 new tech roles were posted, bringing the total to over 434,000 active listings that month alone. And if we look at the bigger picture, the US Bureau of Labor Statistics expects an average of about 356,700 tech job openings each year from now through 2033. A lot of that is due to growth in the industry and the need to replace people leaving the workforce. So yes — while things are shifting, there’s still strong demand for tech talent, especially for those keeping up with evolving skills.

With so many open positions still out there, what’s causing the disconnect when it comes to actually finding a job?

New Wardrobe for Tech Companies

If those jobs are still out there, then it’s worth digging into the specific skills companies are actually hiring for. Recent data from LinkedIn reveals that job skill requirements have shifted by approximately 25% since 2015, and the pace of change is accelerating, with that number expected to double by 2027. In other words, companies are now looking for a broader and more updated set of skills than what may have worked for us over the past decade.

Figure 4: Skill bucket

The graph indicates that technical skills remain a top priority, with 59% of job postings emphasizing their importance. In contrast, soft skills appear to be a lower priority, mentioned in only 46% of listings, suggesting that companies still place greater value on technical expertise in their hiring criteria.

Figure 5: AI skill requirement in the US

Focusing specifically on the comparison between all tech jobs and those requiring AI skills, a clear trend emerges. As of 2025, around 19% to 25% of tech job postings now explicitly call for AI-related expertise — a noticeable jump from just a few years ago. This sharp rise reflects how deeply AI is becoming embedded across industries.
In fact, nearly one in four new tech roles now list AI skills as a core requirement — more than doubling since 2022.

Figure 6: Skill distribution in open jobs

Python remains the most sought-after programming language in AI job postings, maintaining its top position from previous years. Additionally, skills in computer science, data analysis, and cloud platforms like Amazon Web Services have seen significant increases in demand. For instance, mentions of Amazon Web Services in job postings have surged by over 1,778% compared to data from 2012 to 2014. While the overall percentage of AI-specific job postings is still a small fraction of the total, the upward trend underscores the growing importance of AI proficiency in the modern workforce.

Final Thought

I recognize that this analysis is largely centered on the tech industry, and the impact of AI can look very different across other sectors. That said, I’d like to leave you with one final thought: technology will always evolve, and the real challenge is how quickly we can evolve with it before it starts to leave us behind.

We’ve seen this play out before. In the early 2000s, when data volumes were manageable, we relied on database developers. But with the rise of IoT, the scale and complexity of data exploded, and we shifted toward data warehouse developers skilled in tools like Hadoop and Spark. Fast-forward to the 2010s and beyond, and we’ve entered the era of AI and data engineers — those who can manage the scale, variety, and velocity of data that modern systems demand.

We’ve adapted before — and we’ve done it well. But what makes this AI wave different is the pace. This time, we need to adapt faster than we ever have in the past.

By Niruta Talwekar
Observability Expenses: When ‘Pennies on the Dollar’ Add Up Quickly

I’ve specialized in monitoring and observability for 27 years now, and I’ve seen a lot of tools and techniques come and go (RMON, anyone?), and more than a few come and stay (rumors of the death of SNMP have been – and continue to be – greatly exaggerated). Lately I’ve been exploring one of the more recent improvements in the space – OpenTelemetry (which I’m abbreviating to “OTel” for the remainder of this blog). I wrote about my decision to dive into OTel recently: "What’s Got Me Interested in OpenTelemetry—And Pursuing Certification". For the most part, I’m enjoying the journey.

But there’s a problem that has existed with observability for a while now, and it’s something OTel is not helping. The title of this post hints at the issue, but I want to be more explicit. Let’s start with some comparison shopping.

Before I piss off every vendor in town, I want to be clear that these are broad, rough, high-level numbers. I’ve linked to the pricing pages if you want to check the details, and I acknowledge that what you see below isn’t necessarily indicative of the price you might actually pay after getting a quote on a real production environment.

New Relic charges 35¢ per GB for any data you send them:
- …although the pricing page doesn’t make this particularly clear

Datadog has a veritable laundry list of options, but at a high level, they charge:
- $15–$34 per host
- 60¢–$1.22 per million NetFlow records
- $1.06–$3.75 per million log records
- $1.27–$3.75 per million spans

Dynatrace’s pricing page sports a list almost as long as Datadog’s, but some key items:
- 15¢ per 100,000 metrics, plus 0.07¢ per gig per day for retention
- 2¢ per gig for logs, plus 0.07¢ per gig per day to retain them, plus 0.035¢ per gig queried
- Events have the same rate as logs
- 0.014¢ per 1,000 spans

Grafana – it must be noted – is open source and effectively gives you everything for free if you’re willing to do the heavy lifting of installing and hosting. But their pricing can be summed up as:
- $8.00 for 1k metrics (up to 1/minute)
- 50¢ per gig for logs and traces, with 30 days retention

This list is neither exhaustive nor complete. I’ve left off a lot of vendors, not because they don’t also have consumption-based pricing, but because it would just be more of the same. Even with the ones above, the details aren’t complete. Some companies not only charge for consumption (ingest), they also charge to store the data, and charge again to query the data (looking at you, New Relic). Some companies push you to pick a tier of service, and if you don’t, they’ll charge you an estimated rate based on the 99th percentile of usage for the month (looking at you, Datadog). It should surprise nobody that what appears on the pricing page isn’t even the final word. Some of these companies are, even now, looking at redefining their interpretation of the “consumption-based pricing” concept in ways that might make things even more opaque (looking at you AGAIN, New Relic).

Even with all of that said, I’m going out on a limb and stating for the record that each and every one of those price points is so low that even the word “trivial” is too big. That is, until the production workloads meet the pricing sheet. At that point, those itty-bitty numbers add up to real money, and quickly.

The Plural of Anecdote

I put this question out to some friends, asking if they had real-world sticker-shock experiences. As always, my friends did not disappoint.

“I did a detailed price comparison of New Relic with Datadog a couple years ago with Fargate as the main usage.
New Relic was significantly cheaper until you started shipping logs, and then Datadog was suddenly 30-40% cheaper even with APM. [But] their per-host cost also factors in and makes APM rather unattractive unless you’re doing something serverless. We wanted to use it on Kubernetes but it was so expensive, management refused to believe the costs with services on Fargate, so I was usually showing my numbers every 2-3 months.”
– Evelyn Osman, Head of Platform at enmacc

“All I got is the memory of the CFO’s face when he saw the bill.”
– someone who prefers to remain anonymous, even though that quote is freaking epic

And of course there’s the (now infamous, in observability circles) whodunit mystery of the $65 million Datadog bill.

The First Step Is Admitting You Have a Problem

Once upon a time (by which I mean the early 2000s), the challenge with monitoring (observability wasn’t a term we used yet) was how to identify the data we needed, then get the systems to give up that data, and then store that data in a way that made it possible (let alone efficient) to use in queries, displays, alerts, and such. That was where almost all the cost rested. The systems themselves were on-premises and, once the hardware was bought, effectively “free.” The result was that the accepted practice was to collect as much as possible and keep it forever.

And despite the change in technology, many organizations’ reasoning has remained the same. Grafana Solutions Architect Alec Isaacson points out that his conversations with customers sometimes go like this: “I collect CDM metrics from my most critical systems every 5 seconds because once, a long time ago, someone got yelled at when the system was slow and the metrics didn’t tell them why.”

Today, collecting monitoring and observability data (“telemetry”) is comparatively easy, but – both as individuals and as organizations – we haven’t changed our framing of the problem. So we continue to grab every piece of data available to us. We instrument our code with every tag and span we can think of. If there’s a log message, we ship it. Hardware metrics? Better grab those, because they’ll provide context. If there’s network telemetry (NetFlow, VPC Flow Logs, streaming telemetry), we suck that up too. But we never take the time to think about what we’re going to do with it. Ms. Osman’s experience illustrates the result:

“[They] had no idea what they were doing with monitoring […] all the instrumentation and logging was enabled, then there was lengthy retention ‘just in case.’ So they were just burning ridiculous amounts of money.”

To connect it to another bad behavior that we’ve (more or less) broken ourselves of: back in the early days of “lift and shift” (often more accurately described as “lift and shit”) to the cloud, we not only moved applications wholesale; we moved them onto the biggest systems the platform offered. Why? Because in the old on-prem context you could only ask for a server once, and therefore you asked for the biggest thing you could get in order to future-proof your investment. This decision turned out to be not only amusingly naive but horrifically expensive, and it took everyone a few years to understand how “elastic compute” worked and to retool their applications for the new paradigm.

Likewise, it’s high time we recognize and acknowledge that we cannot afford to collect every piece of telemetry data available to us, and moreover, that we don’t have a plan for that data even if money were no object.
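To put a rough number on "cannot afford," here is a hedged back-of-the-envelope calculation using the list prices quoted earlier and entirely made-up (but not unrealistic) volumes; your actual quote and usage will differ:

Python

# Hypothetical volumes for a mid-sized environment -- adjust to taste.
LOG_GB_PER_DAY = 200          # logs shipped per day, in GB
INGEST_PRICE_PER_GB = 0.35    # a flat per-GB ingest rate like the one quoted above

monthly_log_cost = LOG_GB_PER_DAY * INGEST_PRICE_PER_GB * 30
print(f"~${monthly_log_cost:,.0f}/month just to ingest logs")            # ~$2,100

HOSTS = 500
PER_HOST_PRICE = 23           # roughly the midpoint of a $15-$34 per-host range
print(f"~${HOSTS * PER_HOST_PRICE:,.0f}/month for host-based pricing")   # ~$11,500

Pennies per gigabyte, multiplied by everything, every day, stops being pennies fast.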
Admit It: Your Problem Also Has a Problem

Let me pivot to OTel for a moment. One of the key reasons – possibly THE key reason – to move to it is to remove, forever and always, the pain of vendor lock-in. This is something I explored in my last blog post, and it was echoed recently by a friend of mine:

“OTel does solve a lot of the problems around ‘Oh great! Now we’re trapped with vendor X and it’s going to cost us millions to refactor all this code,’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’”
– Matt Macdonald-Wallace, Solutions Architect, Grafana Labs

To be very clear, OTel does an amazing job at solving this problem, which is incredible in its own right. BUT… there’s a downside to OTel that people don’t notice right away, if they notice it at all. And that problem makes the previous problem even worse. OTel takes all of your data (metrics, logs, traces, and the rest), collects it up, and sends it wherever you want it to go. But OTel doesn’t always do it EFFICIENTLY.

Example 1: Log Messages

Let’s take the log message below, which comes straight out of syslog. Yes, good old RFC 5424. Born in the 80s, standardized in 2009, and the undisputed “chatty Kathy” of network message protocols. I’ve seen modestly-sized networks generate upwards of 4 million syslog messages per hour. Most of it was absolutely useless drivel, mind you. But those messages had to go somewhere and be processed (or dropped) by some system along the way. It’s one of the reasons I’ve suggested a syslog and trap “filtration system” since basically forever. Nitpicking about message volume aside, there’s value in some of those messages, to some IT practitioners, some of the time. And so we have to consider (and collect) them too.

<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - [http_method="GET"; http_uri="/example"; http_version="1.1"; http_status="200"; client_addr="127.0.0.1"; http_user_agent="my.service/1.0.0"] HTTP request processed successfully

As-is, that log message is 228 bytes – barely even a drop in the bucket of telemetry you collect every minute, let alone every day. But for what I’m about to do, I want a real apples-to-apples comparison, so here’s what it would look like if I JSON-ified it:

JSON

{
  "pri": 134,
  "version": 1,
  "timestamp": "2018-12-13T14:17:40.000Z",
  "hostname": "myserver",
  "appname": "myapp",
  "procid": 10,
  "msgid": "-",
  "structuredData": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  },
  "message": "HTTP request processed successfully"
}

That bumps the payload up to 336 bytes without whitespace, or 415 bytes with. Now, for comparison, here’s a sample OTLP log message:

{
  "resource": {
    "service.name": "myapp",
    "service.instance.id": "10",
    "host.name": "myserver"
  },
  "instrumentationLibrary": {
    "name": "myapp",
    "version": "1.0.0"
  },
  "severityText": "INFO",
  "timestamp": "2018-12-13T14:17:40.000Z",
  "body": {
    "text": "HTTP request processed successfully"
  },
  "attributes": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  }
}

That (generic, minimal) message weighs in at 420 bytes without whitespace, or 520 bytes all-inclusive. It’s still tiny, but even so, the OTel version with whitespace is 25% bigger than the JSON-ified message (with whitespace), and more than twice as large as the original log message.
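If you want to sanity-check those numbers yourself, a quick back-of-the-napkin measurement looks something like the sketch below. The exact byte counts will vary with whitespace, field ordering, and how your pipeline actually serializes records, so treat them as approximate rather than authoritative:

Python

import json

raw_syslog = (
    '<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - '
    '[http_method="GET"; http_uri="/example"; http_version="1.1"; '
    'http_status="200"; client_addr="127.0.0.1"; '
    'http_user_agent="my.service/1.0.0"] HTTP request processed successfully'
)

json_version = {
    "pri": 134,
    "version": 1,
    "timestamp": "2018-12-13T14:17:40.000Z",
    "hostname": "myserver",
    "appname": "myapp",
    "procid": 10,
    "msgid": "-",
    "structuredData": {
        "http_method": "GET",
        "http_uri": "/example",
        "http_version": "1.1",
        "http_status": "200",
        "client_addr": "127.0.0.1",
        "http_user_agent": "my.service/1.0.0",
    },
    "message": "HTTP request processed successfully",
}

# Compare the raw syslog line with compact and pretty-printed JSON, in bytes.
print(len(raw_syslog.encode("utf-8")))
print(len(json.dumps(json_version, separators=(",", ":")).encode("utf-8")))
print(len(json.dumps(json_version, indent=2).encode("utf-8")))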
Once we start applying real-world data, things balloon even more. My point here is this: if OTel does that to every log message, these tiny costs add up quickly.

Example 2: Prometheus

It turns out that modern methods of metric management are just as susceptible to inflation:

- A typical Prometheus metric, formatted in JSON, is 291 bytes.
- That same metric converted to OTLP metrics format weighs in at 751 bytes.

It’s true that OTLP has a batching function that mitigates this, but that only helps with transfer over the wire. Once it arrives at the destination, many (not all, but most) vendors unbatch before storing, so it goes back to being 2.5x larger than the original message. As my buddy Josh Biggley has said, “2.5x metrics ingest better have a fucking amazing story to tell about context to justify that cost.”

It’s Not You, OTel, It’s Us (But It’s Also You)

If this all feels a little hypercritical of OTel, then please give me a chance to explain. I honestly believe that OTel is an amazing advancement, and anybody who’s serious about monitoring and observability needs to adopt it as a standard – that goes for users as well as vendors. The ability to emit the braid of logs, metrics, and traces while maintaining its context, regardless of destination, is invaluable. (But…)

OTel was designed by (and for) software engineers. It originated in that bygone era (by which I mean “2016”) when we were still more concerned about the difficulty of getting the data than the cost of moving, processing, and storing it. OTel is, by design, biased toward volume.

The joke of this section’s title notwithstanding, the problem really isn’t OTel. We really are at fault – specifically, our unhealthy relationship with telemetry. If we insist on collecting and transmitting every single data point, we have nobody to blame but ourselves for the sky-high bills we receive at the end of the month.

Does This Data Bring You Joy?

It’s easy to let your observability solution do the heavy lifting and shunt every byte of data into a unified interface. It’s easy to do if you’re a software engineer who (nominally, at least) owns the monitoring and observability solutions. It’s even easier if you’re a mere consumer of those services, an innocent bystander. Folks who fall into this category include those closely tied to a particular silo (database, storage, network, etc.); help desk and NOC teams who receive the tickets and provide support but aren’t involved in the instrumentation or the tools the instrumentation is connected to; and teams with more specialized needs that nevertheless overlap with monitoring and observability, like information security.

But let’s be honest: if you’re a security engineer, how can you justify paying twice the cost to ingest logs or metrics, versus the perfectly good standards that already exist and have served well for years? Does that mean you might be using more than one tool? Yes. But as I have pointed out time and time again, there is not (and never has been, and never will be) a one-size-fits-all solution. In most situations there’s not even a one-size-fits-MOST solution. Monitoring and observability have always been about heterogeneous implementations. The sooner you embrace that ideal, the sooner you will begin building observability ecosystems that serve your needs, your team, and your business. To that end, there’s a serious ROI discussion to be had before you go all in on OTel or any other observability solution.
<EOF> (For Now)

We’ve seen the marketplace move from per-seat (or per-interface, per-chassis, or per-CPU) pricing to a consumption model before. And we’ve also seen technologies move back (like the way cell service moved from per-minute or per-text to unlimited data with a per-month charge). I suspect we may see a similar pendulum swing with monitoring and observability at some point in the future.

But for now, we have to contend both with the prevailing pricing system as it exists today and with our own compulsion – born at a different point in the history of monitoring – to collect, transmit, and store every bit (and byte) of telemetry that passes beneath our nose.

Of course, cost isn’t the only factor. Performance, risk, and more need to be considered. But at the heart of it all is the very real need for us to start asking ourselves:

- What will I do with this data?
- Who will use it?
- How long do I need to store it?
- And of course: who is going to pay for it?

By Leon Adato
When Airflow Tasks Get Stuck in Queued: A Real-World Debugging Story

Recently, my team encountered a critical production issue in which Apache Airflow tasks were getting stuck in the "queued" state indefinitely. As someone who has worked extensively with Airflow's scheduler, I've handled my share of DAG failures, retries, and scheduler quirks, but this particular incident stood out both for its technical complexity and for the organizational coordination it demanded.

The Symptom: Tasks Stuck in Queued

It began when one of our business-critical Directed Acyclic Graphs (DAGs) failed to complete. Upon investigation, we discovered several tasks were stuck in the "queued" state — not running, failing, or retrying, just permanently queued.

First Steps: Isolating the Problem

A teammate and I immediately began our investigation with the fundamental checks:

- Examined Airflow UI logs: Nothing unusual beyond standard task submission entries
- Reviewed scheduler and worker logs: The scheduler was detecting the DAGs, but nothing was reaching the workers
- Confirmed worker health: All Celery workers showed as active and running
- Restarted both the scheduler and workers: Despite this intervention, tasks remained stubbornly queued

Deep Dive: Uncovering a Scheduler Bottleneck

We soon suspected a scheduler issue. We observed that the scheduler was queuing tasks but not dispatching them. This led us to investigate:

- Slot availability across workers
- Message queue health (RabbitMQ in our environment)
- Heartbeat communication logs

We initially hypothesized that the scheduler machine might be overloaded by its dual responsibility of scheduling tasks and parsing DAGs, so we increased min_file_process_interval to 2 minutes. While this reduced CPU utilization by limiting how frequently the scheduler parsed DAG files, it didn't resolve our core issue — tasks remained stuck in the queued state.

After further research, we discovered that our Airflow version (2.2.2) contained a known issue causing tasks to become trapped in the queued state under specific scheduler conditions. This bug was fixed in Airflow 2.6.0, with the solution documented in PR #30375. However, upgrading wasn't feasible in the short term. The migration from 2.2.2 to 2.6.0 would require extensive testing, custom plugin adjustments, and deployment pipeline modifications — none of which could be implemented quickly without disrupting other priorities.

Interim Mitigations and Configuration Optimizations

While working on the backported fix, we implemented several tactical measures to stabilize the system:

- Increased parsing_processes to 8 to parallelize DAG parsing and improve parsing time
- Increased scheduler_heartbeat_sec to 30s and min_file_process_interval to 120s (up from the default setting of 30s) to reduce scheduler load
- Implemented continuous monitoring to ensure tasks were being processed appropriately

We also deployed a temporary workaround using a script referenced in this GitHub comment, which forcibly transitions tasks from queued to running state (a simplified sketch of the idea appears below). We scheduled it via a cron job with an additional filter targeting only task instances that had been queued for more than 10 minutes. This approach provided temporary relief while we finalized our long-term solution.

However, we soon discovered limitations with the cron job. While effective for standard tasks that could eventually reach completion once moved from queued to running, it was less reliable for sensor-related tasks. After being pushed to the running state, sensor tasks would often transition to up_for_reschedule and then back to queued, becoming stuck again.
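For illustration only, here is a minimal sketch of what such an "unsticker" script can look like, assuming Airflow 2.x ORM models and the Celery executor. The actual script we ran in production is the one from the GitHub comment linked above; treat this as a rough approximation of the idea, not a drop-in replacement:

Python

# Hedged sketch: find task instances stuck in QUEUED longer than a threshold
# and push them along -- roughly what our cron-scheduled workaround did.
from datetime import datetime, timedelta, timezone

from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def unstick_queued_tasks(threshold_minutes=10, session=None):
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=threshold_minutes)
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        # Force the task instance out of QUEUED so the executor picks it up again.
        ti.state = State.RUNNING
        session.merge(ti)
    session.commit()
    return len(stuck)


if __name__ == "__main__":
    print(f"Advanced {unstick_queued_tasks()} stuck task instance(s)")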
This required the cron job to repeatedly advance these tasks, essentially functioning as an auxiliary scheduler. We suspect this behavior stems from inconsistencies between the scheduler's in-memory state and the actual task states in the database. This unintentionally made our cron job responsible for orchestrating part of the sensor lifecycle — clearly not a sustainable solution.

The Fix: Strategic Backporting

After evaluating our options, we decided to backport the specific fix from Airflow 2.6.0 to our existing 2.2.2 environment. This approach allowed us to implement the necessary correction without undertaking a full upgrade cycle. We created a targeted patch by cherry-picking the fix from the upstream PR and applying it to our forked version of Airflow. The patch can be viewed here: GitHub Patch.

How to Apply the Patch

Important disclaimer: The patch referenced in this article is specifically designed for Airflow deployments using the Celery executor. If you're using a different executor (such as Kubernetes, Local, or Sequential), you'll need to backport the appropriate changes for your specific executor from the original PR (#30375). The file paths and specific code changes may differ based on your executor configuration.

If you're facing similar issues, here's how to apply this patch to your Airflow 2.2.2 installation:

Download the Patch File

First, download the patch from the GitHub link provided above. You can use wget or download the patch file directly:

Shell

wget -O airflow-queued-fix.patch https://github.com/gurmeetsaran/airflow/pull/1.patch

Navigate to Your Airflow Installation Directory

This is typically where your Airflow Python package is installed.

Shell

cd /path/to/your/airflow/installation

Apply the Patch Using git

Use the git apply command to apply the patch:

Shell

git apply --check airflow-queued-fix.patch # Test if the patch can be applied cleanly
git apply airflow-queued-fix.patch # Actually apply the patch

Then:

- Restart your Airflow scheduler to apply the changes.
- Monitor task states to verify that newly queued tasks are being properly processed by the scheduler.

Note that this approach should be considered a temporary solution until you can properly upgrade to a newer Airflow version that contains the official fix.

Organizational Lessons

Resolving the technical challenge was only part of the equation. Equally important was our approach to cross-team communication and coordination:

- We engaged our platform engineering team early to validate our understanding of Airflow's architecture.
- We maintained transparent communication with stakeholders so they could manage downstream impacts.
- We meticulously documented our findings and remediation steps to facilitate future troubleshooting.
- We learned the value of designating a dedicated communicator — someone not involved in the core debugging but responsible for tracking progress, taking notes, and providing regular updates to leadership, preventing interruptions to the engineering team.

We also recognized the importance of assembling the right team — collaborative problem-solvers focused on solutions rather than just identifying issues. Establishing a safe, solution-oriented environment significantly accelerated our progress. I was grateful to have the support of a thoughtful and effective manager who helped create the space for our team to stay focused on diagnosing and resolving the issue, minimizing external distractions.
Key Takeaways

This experience reinforced several valuable lessons:

- Airflow is powerful but sensitive to scale and configuration parameters
- Comprehensive monitoring and detailed logging are indispensable diagnostic tools
- Sometimes the issue isn't a failing task but a bottleneck in the orchestration layer
- Version-specific bugs can have widespread impact — staying current helps, even when upgrades require planning
- Backporting targeted patches can be a pragmatic intermediate solution when complete upgrades aren't immediately feasible
- Effective cross-team collaboration can dramatically influence incident response outcomes

This incident reminded me that while technical expertise is fundamental, the ability to coordinate and communicate effectively across teams is equally crucial. I hope this proves helpful to others who find themselves confronting a mysteriously stuck Airflow task and wondering, "Now what?"

By Gurmeet Saran

Top Monitoring and Observability Experts


Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director of Technical Marketing & Evangelism. He's renowned in the development community as a speaker, lecturer, author, and baseball expert. His current role allows him to help the world understand the challenges organizations face with cloud native observability. He brings a unique perspective to the stage, with a professional life dedicated to sharing his deep expertise in open source technologies and organizations, and he is a CNCF Ambassador. Follow him at https://www.schabell.org.

The Latest Monitoring and Observability Topics

The AWS Playbook for Building Future-Ready Data Systems
Gone are the days when teams dumped everything into a central data warehouse and hoped analytics would magically appear.
July 9, 2025
by Junaith Haja
· 81 Views
Deploy Serverless Lambdas Confidently Using Canary
Improve AWS Lambda reliability with canary deployments, gradually release updates, minimize risk, catch bugs early, and deploy faster with confidence.
July 7, 2025
by Prajwal Nayak
· 708 Views
How We Broke the Monolith (and Kept Our Sanity): Lessons From Moving to Microservices
Moving from a monolith to microservices is messy but worth it — expect surprises, invest in automation, and focus on team culture as much as code.
July 3, 2025
by Shushyam Malige Sharanappa
· 1,665 Views · 2 Likes
DevOps Remediation Architecture for Azure CDN From Edgio
The article explains how organizations can implement the migration from the retiring Azure CDN from Edgio to Azure Front Door.
June 30, 2025
by Karthik Bojja
· 1,292 Views · 1 Like
Transform Settlement Process Using AWS Data Pipeline
Modern AWS data pipelines automate ETL for settlement files using S3, Glue, Lambda, and Step Functions, transforming data from raw to curated with full orchestration.
June 30, 2025
by Prabhakar Mishra
· 1,157 Views · 2 Likes
Serverless Machine Learning: Running AI Models Without Managing Infrastructure
The article empowers developers to deploy and serve ML models without needing to manage servers, clusters, or VMs, reducing time-to-market and cognitive overhead.
June 26, 2025
by Bhanu Sekhar Guttikonda
· 2,071 Views · 4 Likes
How to Banish Anxiety, Lower MTTR, and Stay on Budget During Incident Response
Cutting log ingestion seems thrifty — until an outage happens and suddenly you really need those signals! See how zero-cost ingestion can get rid of MTTR anxiety.
June 26, 2025
by John Vester DZone Core CORE
· 1,355 Views · 1 Like
How to Monitor and Optimize Node.js Performance
Optimize Node.js apps with tools and techniques for better performance, learn monitoring, reduce memory leaks, and improve scalability and responsiveness easily.
June 26, 2025
by Anubhav D
· 1,300 Views · 2 Likes
IBM App Connect Enterprise 13 Installation on Azure Kubernetes Service (AKS)
This article provides a step-by-step guide showing how to install IBM App Connect Enterprise 13 in an Azure Kubernetes Service cluster.
June 25, 2025
by JEAN PAUL TABJA
· 1,468 Views · 1 Like
Real-Object Detection at the Edge: AWS IoT Greengrass and YOLOv5
Real-time object detection at the edge using YOLOv5 and AWS IoT Greengrass enables fast, offline, and scalable processing in bandwidth-limited or remote environments.
June 23, 2025
by Anil Jonnalagadda
· 1,647 Views · 13 Likes
Your Kubernetes Survival Kit: Master Observability, Security, and Automation
Master Kubernetes with this guide to observability (Tracestore), security (OPA), automation (Flagger), and custom metrics. Includes Java/Node.js examples.
June 20, 2025
by Prabhu Chinnasamy
· 2,562 Views · 44 Likes
How to Achieve SOC 2 Compliance in AWS Cloud Environments
Achieving SOC 2 compliance in AWS requires planning, rigorous implementation, and ongoing commitment to security best practices.
June 17, 2025
by Chase Bolt
· 1,114 Views · 1 Like
Mastering Kubernetes Observability: Boost Performance, Security, and Stability With Tracestore, OPA, Flagger, and Custom Metrics
This guide walks you through using Tracestore, OPA, Flagger, and custom metrics to make Kubernetes more observable, with better tracing, policy control, and performance.
June 16, 2025
by Prabhu Chinnasamy
· 1,848 Views · 44 Likes
Building Generative AI Services: An Introductory and Practical Guide
Amazon Bedrock simplifies AI app development with serverless APIs, offering Q&A, summarization, and image generation using top models like Claude and Stability AI.
June 11, 2025
by Srinivas Chippagiri DZone Core CORE
· 1,990 Views · 7 Likes
Exploring Reactive and Proactive Observability in the Modern Monitoring Landscape
Embracing the shift in landscape from traditional monitoring systems to automated anomaly detection, pattern recognition, and root cause analysis.
June 10, 2025
by Abeetha Bala
· 936 Views · 1 Like
Secure Your Oracle Database Passwords in AWS RDS With a Password Verification Function
Enforce strong password policies for Oracle databases on AWS RDS using built-in or custom verification functions via the rdsadmin package.
June 10, 2025
by arvind toorpu DZone Core CORE
· 965 Views · 1 Like
From Code to Customer: Building Fault-Tolerant Microservices With Observability in Mind
Learn how to build resilient Kubernetes microservices with fault tolerance, SRE practices, and observability from code to customer.
June 9, 2025
by Ravi Teja Thutari
· 1,209 Views · 4 Likes
Secure IaC With a Shift-Left Approach
Shift-Left secure Infrastructure as Code helps catch issues early, automate compliance, and build secure, scalable cloud infrastructure.
June 6, 2025
by Josephine Eskaline Joyce DZone Core CORE
· 1,728 Views · 3 Likes
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution
Use distributed tracing—the key third pillar of observability—to track requests across microservices and turn debugging from guesswork into precise insights.
June 6, 2025
by Rishab Jolly
· 1,496 Views · 2 Likes
OTel Me Why: The Case for OpenTelemetry Beyond the Shine
Someone asked me why I was so excited about OpenTelemetry. The reasons have more to do with its innovation and utility than its novelty.
June 4, 2025
by Leon Adato
· 666 Views · 3 Likes