In enterprises, SREs, DevOps engineers, and cloud architects often debate which observability platform to choose for faster troubleshooting and a better understanding of the performance of their production systems. To get maximum value for their team, they need to answer questions such as: Will an observability tool support all kinds of workloads and heterogeneous systems? Will it support all kinds of data aggregation, such as logs, metrics, traces, and topology? Will the investment in the (ongoing or new) observability tool be justified?

In this article, we will show the best way to get started with unified observability of your entire infrastructure using open-source Apache Skywalking and the Istio service mesh.

Istio Service Mesh of a Multi-Cloud Application

Let us take a multi-cloud example where multiple services are hosted on on-prem or managed Kubernetes clusters. The first step toward unified observability is to form a service mesh using Istio. The idea is that every service or workload in the Kubernetes clusters (or VMs) is accompanied by an Envoy proxy, abstracting security and networking out of the business logic. As you can see in the image below, a service mesh is formed, and the network communication from the edge to workloads, among workloads, and between clusters is controlled by the Istio control plane. The Istio service mesh emits logs, metrics, and traces for each Envoy proxy, which helps achieve unified observability. We then need a visualization tool like Skywalking to collect that data and populate it for granular observability.

Why Skywalking for Observability

SREs from large companies such as Alibaba, Lenovo, AB InBev, and Baidu use Apache Skywalking, and the common reasons are: Skywalking aggregates logs, metrics, traces, and topology. It natively supports popular service mesh software like Istio.
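As described above, every workload should be paired with an Envoy sidecar. One common way to arrange this (a minimal sketch; the namespace name `demo` is only an example, not from the demo repo) is to label the namespace so that Istio's mutating webhook injects the sidecar into every new pod automatically:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                     # example namespace; replace with your own
  labels:
    istio-injection: enabled     # Istio's webhook injects an Envoy sidecar into new pods here
```

Equivalently, you can run `kubectl label namespace demo istio-injection=enabled` on an existing namespace; already-running pods must be restarted to receive the sidecar.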
While some tools cannot ingest data from Envoy sidecars, Skywalking supports sidecar integration out of the box. It supports OpenTelemetry (OTel) standards for observability; these days, OTel standards and instrumentation are popular for metrics, traces, and logs (MTL). Skywalking supports observability-data collection from almost all the elements of the full stack: database, OS, network, storage, and other infrastructure. It is open source and free (with an affordable enterprise version).

Now, let us see how to integrate Istio and Apache Skywalking into your enterprise.

Steps To Integrate Istio and Apache Skywalking

We have created a demo that establishes the connection between the Istio data plane and Skywalking, where Skywalking collects data from the Envoy sidecars and populates it in the observability dashboards.

Note: By default, Skywalking comes with predefined dashboards for Apache APISIX and AWS Gateways. Since we are using the Istio Gateway, it will not get a dedicated dashboard out of the box, but we'll get metrics for it in other locations.

If you want to watch the video, check out my latest Istio-Skywalking configuration video. You can refer to the GitHub link here.

Step 1: Add Kube-State-Metrics To Collect Metrics From the Kubernetes API Server

We have installed the kube-state-metrics service to listen to the Kubernetes API server and send those metrics to Apache Skywalking. First, add the Prometheus community repo:

Shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Then update the repo to fetch the latest charts:

Shell
helm repo update

And now you can install kube-state-metrics:

Shell
helm install kube-state-metrics prometheus-community/kube-state-metrics

Step 2: Install Skywalking Using Helm Charts

We will install Skywalking version 9.2.0 for this observability demo. You can run the following command to install Skywalking into a namespace (my namespace is skywalking). You can refer to the values.yaml.
Shell
helm install skywalking oci://registry-1.docker.io/apache/skywalking-helm -f values.yaml -n skywalking

(Optional reading) In the Helm chart values.yaml, you will notice that:

We have set the oap (Observability Analysis Platform, i.e., the back end) and ui flags to true. Similarly, for databases, we have enabled postgresql as true.

For tracking metrics from Envoy access logs, we have configured the following environment variables:
SW_ENVOY_METRIC: default
SW_ENVOY_METRIC_SERVICE: true
SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
These select the logs and metrics from Envoy based on the Istio configuration (the ALS settings are the rules for analyzing Envoy access logs).

We enable the OpenTelemetry receiver and configure it to receive data in otlp format, and we enable OTel rules according to the data we will send to Skywalking. In a few moments (in Step 3), we will configure the OTel collector to scrape istiod, k8s, kube-state-metrics, and the Skywalking OAP itself, so we have enabled the appropriate rules:
SW_OTEL_RECEIVER: default
SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp"
SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: "istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap"
SW_TELEMETRY: prometheus
SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
SW_TELEMETRY_PROMETHEUS_PORT: 1234
SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: false
SW_TELEMETRY_PROMETHEUS_SSL_KEY_PATH: ""
SW_TELEMETRY_PROMETHEUS_SSL_CERT_CHAIN_PATH: ""

With this, we have instructed Skywalking to collect data from the Istio control plane, the Kubernetes cluster, nodes, and services, and also from the OAP itself. The telemetry settings enable Skywalking OAP's self-observability, meaning it exposes Prometheus-compatible metrics on port 1234 with SSL disabled; again, in Step 3, we will configure the OTel collector to scrape this endpoint.
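Taken together, the settings described above correspond to a values.yaml fragment roughly like the following. This is a sketch against the 9.2.x chart layout; exact key names may vary between chart versions, so treat it as illustrative rather than the demo's exact file:

```yaml
oap:
  # Observability Analysis Platform (the Skywalking back end)
  env:
    SW_ENVOY_METRIC: default
    SW_ENVOY_METRIC_SERVICE: "true"
    SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
    SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
    SW_OTEL_RECEIVER: default
    SW_OTEL_RECEIVER_ENABLED_HANDLERS: otlp
    SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap
    SW_TELEMETRY: prometheus
    SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
    SW_TELEMETRY_PROMETHEUS_PORT: "1234"
    SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: "false"
ui:
  # Skywalking web UI
  enabled: true
postgresql:
  # storage back end for the OAP
  enabled: true
```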
In the Helm chart, we have also enabled the creation of a service account for the Skywalking OAP.

Step 3: Setting Up the Istio + Skywalking Configuration

After that, we can install Istio using this IstioOperator configuration. In the IstioOperator configuration, we have set up meshConfig so that every sidecar has the Envoy access log service enabled, with the addresses of both the access log service and the metrics service pointing to Skywalking. Additionally, with proxyStatsMatcher, we are configuring all metrics to be sent via the metrics service.

YAML
meshConfig:
  defaultConfig:
    envoyAccessLogService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    envoyMetricsService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    proxyStatsMatcher:
      inclusionRegexps:
        - .*
  enableEnvoyAccessLogService: true

Step 4: OpenTelemetry Collector

Once the Istio and Skywalking configuration is done, we need to feed metrics from applications, gateways, nodes, etc., to Skywalking. We have used opentelemetry-collector.yaml to scrape the Prometheus-compatible endpoints. In the collector, we have specified that OpenTelemetry will scrape metrics from istiod, the Kubernetes cluster, kube-state-metrics, and Skywalking itself.

We have also created a service account for OpenTelemetry. Using opentelemetry-serviceaccount.yaml, we have set up a service account and declared a ClusterRole and ClusterRoleBinding to define which actions the OpenTelemetry service account can take on various resources in our Kubernetes cluster. Once you deploy opentelemetry-collector.yaml and opentelemetry-serviceaccount.yaml, data will flow into Skywalking from Envoy, the Kubernetes cluster, kube-state-metrics, and Skywalking (OAP) itself.

Step 5: Observability of Kubernetes Resources and Istio Resources in Skywalking

To check the Skywalking UI, port-forward the Skywalking UI service to a local port (say 8080).
Run the following command:

Shell
kubectl port-forward svc/skywalking-skywalking-helm-ui -n skywalking 8080:80

You can then open the Skywalking UI at localhost:8080. (Note: For generating load against the services and seeing the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.)

Once you are in the Skywalking UI (refer to the image below), you can select Service Mesh in the left-side menu and then select Control Plane or Data Plane. Skywalking provides all the resource-consumption and observability data of the Istio control and data planes, respectively. The data-plane view provides information about all the Envoy proxies attached to services, with metrics, logs, and traces for each proxy. Refer to the image below, where all the observability details are displayed for just one service proxy. Skywalking also shows the resource consumption of Envoy proxies across namespaces.

Similarly, Skywalking provides all the observability data of the Istio control plane. Note: if you have multiple control planes in different namespaces (or in multiple clusters), you just point them at the Skywalking OAP service. Skywalking provides control-plane metrics such as the number of pilot pushes, ADS monitoring, and more.

Apart from the Istio service mesh, we also configured Skywalking to fetch information about the Kubernetes cluster. As you can see in the image below, Skywalking provides a Kubernetes dashboard with the number of nodes, pods, K8s deployments, services, and containers, along with the resource-utilization metrics of each K8s resource. Similarly, you can drill further down into a service in the Kubernetes cluster and get granular information about its behavior and performance (refer to the images below).
For generating load against the services and seeing the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.

Benefits of the Istio-Skywalking Integration

There are several benefits of integrating Istio and Apache Skywalking for unified observability:
- Ensure 100% visibility of the technology stack, including apps, sidecars, network, database, OS, etc.
- Reduce the time to find the root cause of issues or anomalies in production (MTTR) by up to 90% with faster troubleshooting.
- Save approximately $2M of lifetime spend on closed-source solutions, complex pricing, and custom integrations.
The ability to measure the internal states of a system by examining its outputs is called observability. A system becomes "observable" when it is possible to estimate its current state using only information from outputs, namely sensor data. You can use observability data to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of observability: metrics, logs, and traces.

What Is the Difference Between Observability and Monitoring?

"Observability wouldn't be possible without monitoring." Monitoring is a term that closely relates to observability. The major difference is that observability refers to the ability to gain insight into the internal workings of a system, while monitoring refers to the act of collecting data on system performance and behavior. In addition, monitoring doesn't really consider the end goal: it focuses on predefined metrics and thresholds to detect deviations from expected behavior, whereas observability aims to provide a deep understanding of system behavior, allowing exploration and discovery of unexpected issues. In terms of perspective and mindset, monitoring adopts a "top-down" approach with predefined alerts based on known criteria; observability takes a "bottom-up" approach, encouraging open-ended exploration and adaptability to changing requirements.

Observability vs. monitoring at a glance:
- Observability tells you why a system is at fault; monitoring notifies you that a system is at fault.
- Observability acts as a knowledge base to define what needs monitoring; monitoring focuses only on watching systems and detecting faults across them.
- Observability focuses on giving context to data; monitoring is focused on data collection.
- Observability gives a more complete assessment of the overall environment; monitoring keeps track of monitoring KPIs.
- Observability is a traversable map; monitoring is a single plane.
- Observability gives you complete information; monitoring gives you limited information.
Observability creates the potential to monitor different events; monitoring is the process of putting observability to use. Monitoring detects anomalies and alerts you to potential problems. Observability, however, detects issues and helps you understand their root causes and underlying dynamics.

Three Pillars of Observability

Observability, built on the three pillars (metrics, logs, and traces), revolves around the core concept of "events." Events are the fundamental units of monitoring and telemetry, each time-stamped and quantifiable. What distinguishes events is their context, especially in user interactions. For example, when a user clicks "Pay Now" on an e-commerce site, this action is an event expected to complete within seconds. In monitoring tools, "significant events" are key. They trigger:
- Automated alerts: notifying SREs or operations teams
- Diagnostic tools: enabling root-cause analysis
Imagine a server's disk nearing 99% capacity; that is significant, but understanding which applications and users caused it is vital for effective action.

1. Metrics

Metrics serve as numeric indicators, offering insights into a system's health. While some metrics like CPU, memory, and disk usage are obvious system health indicators, numerous other critical metrics can uncover underlying issues. For instance, a gradual increase in OS handles can lead to a system slowdown, eventually necessitating a reboot to restore accessibility. Similar valuable metrics exist throughout the various layers of modern IT infrastructure. Careful consideration is crucial when determining which metrics to continuously collect and how to analyze them effectively. This is where domain expertise plays a pivotal role. While most monitoring tools can detect evident issues, the best ones go further by detecting and alerting on complex problems. It's also essential to identify the subset of metrics that serve as proactive indicators of impending system problems.
For instance, an OS handle leak rarely occurs abruptly. Tracking the gradual increase in the number of handles in use over time makes it possible to predict when the system might become unresponsive, allowing for proactive intervention.

Advantages of metrics:
- Quantitative and intuitive for setting alert thresholds
- Lightweight and cost-effective to store
- Excellent for tracking trends and system changes
- Provide real-time component state data
- Constant overhead cost; not affected by data surges

Challenges of metrics:
- Limited insight into the "why" behind issues
- Lack the context of individual interactions or events
- Risk of data loss in case of collection/storage failure
- Fixed-interval collection may miss critical details
- Excessive sampling can impact performance and costs

2. Logs

Logs frequently contain intricate details about how an application processes requests. Unusual occurrences in these logs, such as exceptions, can signal potential issues within the application. Monitoring these errors and exceptions is a vital aspect of any observability solution. Parsing logs can also reveal valuable insights into the application's performance. Logs often hold insights that may remain elusive when using APIs (application programming interfaces) or querying application databases, and many independent software vendors (ISVs) don't offer alternative methods to access the data available in logs. Therefore, an effective observability solution should enable log analysis and facilitate the capture of log data and its correlation with metric and trace data.
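The handle-leak scenario discussed under Metrics amounts to a simple linear extrapolation: given periodic samples of handles in use, estimate when the count will cross a known limit. A minimal sketch in awk, where the sample data and the 1,000-handle limit are invented purely for illustration:

```shell
# Samples: "<minutes-elapsed> <handles-in-use>", one per line.
# Fit a slope from the first and last samples, then project the
# time at which usage reaches the limit.
printf '0 200\n10 260\n20 320\n30 380\n' |
awk 'NR == 1 { t0 = $1; h0 = $2 }       # remember first sample
             { t1 = $1; h1 = $2 }       # keep updating last sample
     END {
       slope = (h1 - h0) / (t1 - t0)    # handles per minute
       limit = 1000                     # assumed OS handle limit
       eta = t1 + (limit - h1) / slope
       print "ETA (minutes): " eta
     }'
# prints: ETA (minutes): 133.333
```

In practice, the samples would come from your metrics store rather than printf, and a robust fit (least squares over many points) would replace the two-point slope; the point is only that trend data turns a reactive alert into a predictive one.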
Advantages of logs:
- Easy to generate, typically a timestamp plus plain text
- Often require minimal integration effort from developers
- Most platforms offer standardized logging frameworks
- Human-readable, making them accessible
- Provide granular insights for retrospective analysis

Challenges of logs:
- Can generate large data volumes, leading to costs
- Can impact application performance, especially without asynchronous logging
- Retrospective by nature, not proactive
- Persistence challenges in modern architectures
- Risk of log loss in containers and auto-scaling environments

3. Traces

Tracing is a relatively recent development, especially suited to the complex nature of contemporary applications. It works by collecting information from different parts of the application and putting it together to show how a request moves through the system. A trace is represented as spans; for example, span A is the root span, and span B is a child of span A. The primary advantage of tracing lies in its ability to deconstruct end-to-end latency and attribute it to specific tiers or components. While it can't tell you exactly why there's a problem, it's great for figuring out where to look.

Advantages of traces:
- Ideal for pinpointing issues within a service
- Offer end-to-end visibility across multiple services
- Identify performance bottlenecks effectively
- Aid debugging by recording request/response flows
- Provide contextual insights into system behavior

Challenges of traces:
- Limited ability to reveal long-term trends
- Complex systems may yield diverse trace paths
- Don't explain the cause of slow or failing spans
- Add overhead, potentially impacting system performance

Integrating tracing used to be difficult, but with service meshes, it's now much easier. Service meshes handle tracing and stats collection at the proxy level, providing seamless observability across the entire mesh without requiring extra instrumentation from the applications within it.
Each of the components discussed above has its pros and cons, even though one might want to use them all.

Observability Tools

Observability tools gather and analyze data related to user experience, infrastructure, and network telemetry to proactively address potential issues, preventing any negative impact on critical business key performance indicators (KPIs).

Observability Survey Report 2023 - key findings

Some popular observability tooling options include:
- Prometheus: A leading open-source monitoring and alerting toolkit known for its scalability and support for multi-dimensional data collection
- Grafana: A visualization and dashboarding platform often used with Prometheus, providing rich insights into system performance
- Jaeger: An open-source distributed tracing system for monitoring and troubleshooting microservices-based architectures
- Elasticsearch: A search and analytics engine that, when paired with Kibana and Beats, forms the ELK Stack for log management and analysis
- Honeycomb: An event-driven observability tool that offers real-time insights into application behavior and performance
- Datadog: A cloud-based observability platform that integrates logs, metrics, and traces, providing end-to-end visibility
- New Relic: Offers application performance monitoring (APM) and infrastructure monitoring solutions to track and optimize application performance
- Sysdig: Focused on container monitoring and security, Sysdig provides deep visibility into containerized applications
- Zipkin: An open-source distributed tracing system for monitoring request flows and identifying latency bottlenecks

Conclusion

Logs, metrics, and traces are essential observability pillars that work together to provide a complete view of distributed systems. Incorporating them strategically, such as placing counters and logs at entry and exit points and using traces at decision junctures, enables effective debugging.
Correlating these signals enhances our ability to navigate metrics, inspect request flows, and troubleshoot complex issues in distributed systems.
This is an article from DZone's 2023 Database Systems Trend Report. For more: Read the Report.

Hearing the vague statement, "We have a problem with the database," is a nightmare for any database manager or administrator. Sometimes it's true, sometimes it's not; so what exactly is the issue? Is there really a database problem? Or is it a problem with networking, an application, a user, or another possible scenario? If it is the database, what is wrong with it?

Figure 1: DBMS usage

Databases are a crucial part of modern businesses, and there are a variety of vendors and types to consider. Databases can be hosted in a data center, in the cloud, or in both for hybrid deployments. The data stored in a database can be used in various ways, including websites, applications, analytical platforms, etc. As a database administrator or manager, you want to be aware of the health and trends of your databases. Database monitoring is as crucial as the databases themselves: how good is your data if you can't guarantee its availability and accuracy?

Database Monitoring Considerations

Database engines and databases are systems hosted on a complex IT infrastructure consisting of a variety of components: servers, networking, storage, cables, etc. Database monitoring should be approached holistically, with consideration of all infrastructure components as well as database monitoring itself.

Figure 2: Database monitoring clover

Let's talk more about database monitoring. As seen in Figure 2, I'd combine monitoring into four pillars: availability, performance, activity, and compliance. These are broad but interconnected pillars with overlap. You could add a fifth "clover leaf" for security monitoring, but I include that aspect of monitoring in activity and compliance, for the same reason capacity planning falls into availability monitoring. Let's look deeper into monitoring concepts.
While availability monitoring seems like a good starting topic, I will deliberately start with performance, since performance issues may render a database unavailable, and because availability monitoring is "monitoring 101" for any system.

Performance Monitoring

Performance monitoring is the process of capturing, analyzing, and alerting on the performance metrics of the hardware, OS, network, and database layers. It can help avoid unplanned downtime, improve user experience, and help administrators manage their environments efficiently.

Native Database Monitoring

Most, if not all, enterprise-grade database systems come with a set of tools that allow database professionals to examine internal and/or external database conditions and operational status. These are system-specific, technical tools that require SME-level knowledge. In most cases, they provide point-in-time performance data with limited or nonexistent historical value. Some vendors provide additional tools to simplify performance data collection and analysis. With the expansion of cloud-based offerings (PaaS or IaaS), I've noticed some improvements in monitoring data collection and the available analytics and reporting options. However, native performance monitoring is still a set of tools for a database SME.

Enterprise Monitoring Systems

Enterprise monitoring systems (EMSs) offer a centralized approach to keeping IT systems under systematic review. Such systems allow monitoring of most IT infrastructure components, thus consolidating supervised systems into a set of dashboards. There are several vendors offering comprehensive database monitoring systems to cover some or all of your monitoring needs. Such solutions can cover multiple database engines or be specific to a particular database engine or monitoring aspect. For instance, if you only need to monitor SQL Server and are interested in the performance of your queries, then you need a monitoring system that identifies bottlenecks and contention.
Let's discuss environments with thousands of database instances (on-premises and in the cloud) scattered across multiple data centers around the globe. Monitoring complexity grows with the number of monitored devices, the diversity of database types, and the geographical distribution of your data centers and the actual data that you monitor. It is imperative to have a global view of all database systems under one management umbrella and an ability to identify issues, preferably before they impact your users. EMSs are designed to help organizations align database monitoring with IT infrastructure monitoring, and most solutions include an out-of-the-box set of dashboards, reports, graphs, alerts, useful tips, and health history and trend analytics. They also come with preset, industry-outlined thresholds for performance counters/metrics that should be adjusted to your specific conditions.

Manageability and Administrative Overhead

Native database monitoring is usually handled by a database administrator (DBA) team. If it needs to be automated, expanded, or otherwise modified, the DBA/development teams would handle that. In a large enterprise environment, this can be managed efficiently by DBAs at a rudimentary level for internal, DBA-specific use cases. Bringing in a third-party system (like an EMS) requires management. Hypothetically, a vendor has installed and configured monitoring for your company; that partnership can continue, or internal personnel can take over EMS management (with appropriate training). There is no "wrong" approach; it depends solely on your company's operating model and should be assessed accordingly.

Data Access and Audit Compliance Monitoring

Your databases must be secure! Unauthorized access to sensitive data can be as harmful as data loss. Data breaches and malicious activities, intentional or not: no company would be happy with such publicity. That brings us to audit compliance and data access monitoring.
There are many laws and regulations around data compliance. Some are common between industries, some are industry-specific, and some are country-specific. For instance, SOX compliance is required for all public companies in numerous countries, and US healthcare must follow HIPAA regulations. Database management teams must implement a set of policies, procedures, and processes to enforce the laws and regulations applicable to their company. Audit reporting can be a tedious and cumbersome process, but it can and should be automated. While implementing audit compliance and data access monitoring, you can improve your database audit reporting as well; it's virtually the same data set.

What do we need to monitor to comply with various laws and regulations? These are normally mandatory:
- Access changes and access attempts
- Settings and/or object modifications
- Data modifications/access
- Database backups

Who should be monitored? Usually, access to make changes to a database or data is strictly controlled:
- Privileged accounts: usually DBAs; ideally, they shouldn't be able to access data, but that is not always possible in their job, so their activity must be monitored
- Service accounts: either database or application service accounts with rights to modify objects or data
- "Power" accounts: users with rights to modify database objects or data
- "Lower" accounts: accounts with read-only activity

As with performance monitoring, most database engines provide a set of auditing tools and mechanisms. Another option is third-party compliance software, which uses database-native auditing, logs, and tracing to capture compliance-related data. It provides audit data storage capabilities and, most importantly, a set of compliance reports and dashboards to adhere to a variety of compliance policies. Compliance complexity directly depends on the regulations that apply to your company and the diversity and size of your database ecosystem.
While we monitor access and compliance, we want to ensure that our data is not being misused. Adequate measures should be in place for when unauthorized access or abnormal data usage is detected. Some audit compliance monitoring systems provide means to block abnormal activities.

Data Corruption and Threats

Database data corruption is a serious issue that can lead to permanent loss of valuable data. Commonly, data corruption occurs due to hardware failures, but it can also be due to database bugs or even bad coding. Modern database engines have built-in capabilities to detect, and sometimes prevent, data corruption. Data corruption generates an appropriate error code that should be monitored and highlighted, and checking database integrity should be part of the periodic maintenance process. Other threats include intentional or unintentional data modification and ransomware. While data corruption and malicious data modification can be detected by DBAs, ransomware threats fall outside of the monitoring scope for database professionals. It is imperative to have a bulletproof backup to recover from those threats.

Key Database Performance Metrics

Database performance metrics are extremely important data points that measure the health of database systems and help database professionals maintain efficient support. Some of the metrics are specific to a database type or vendor, and I will generalize them as "internal counters."

Availability

The first step in monitoring is to determine whether a device or resource is available. There is a thin line between system and database availability: a database could be up and running, but clients may not be able to access it. With that said, we need to monitor the following metrics:
- Network status: Can you reach the database over the network? If yes, what is the latency?
While network status may not commonly fall into the direct responsibility of a DBA, database components have configuration parameters that might be responsible for a loss of connectivity.
- Server up/down
- Storage availability
- Service up/down: another shared area between database and OS support teams
- Whether the database is online or offline

CPU, Memory, Storage, and Database Internal Metrics

The next important set of server components, which could in essence escalate into an availability issue, is CPU, memory, and storage. The following four performance areas are tightly interconnected and affect each other:
- Lack of available memory
- High CPU utilization
- Storage latency or throughput bottlenecks
- A set of database internal counters that can provide more context to utilization issues

For instance, a lack of memory may force a database engine to read and write data more frequently, creating contention on the I/O system, and 100% CPU utilization can often cause an entire database server to stop responding. Numerous database internal counters can help database professionals analyze usage trends and identify appropriate actions to mitigate potential impact.

Observability

Database observability is based on the metrics, traces, and logs we collected per the discussion above. There are a plethora of factors that may affect system and application availability and customer experience; database performance metrics are just one set of possible failure points. The infrastructure underneath a database engine is complex. To successfully monitor a database, we need a clear picture of the entire ecosystem and the state of its components. Relevant performance data collected from the various components can be a tremendous help in identifying and addressing issues before they impact users. The entire database monitoring concept is data-driven, and it is our responsibility to make it work for us.
Monitoring data needs to tell us a story that every consumer can understand. With database observability, this story can be transparent and provide a clear view of your database estate. Balanced Monitoring As you can gather from this article, there are many points of failure in any database environment. While database monitoring is the responsibility of database professionals, it is a collaborative effort of multiple teams to ensure that your entire IT ecosystem is operational. So what's considered "too much" monitoring, and when is it not enough? I will use DBAs' favorite phrase: it depends. Assess your environment – It would be helpful to have a configuration management database. If you don't, create a full inventory of your databases and corresponding applications: database sizes, number of users, maintenance schedules, utilization times — as many details as possible. Assess your critical systems – Outline your critical systems and relevant databases. Most likely those will fall into a category of maximum monitoring: availability, performance, activity, and compliance. Assess your budget – It's not uncommon to have a tight cash flow allocated to IT operations. You may or may not have funds to purchase a "we-monitor-everything" system, and certain monitoring aspects would have to be developed internally. Find a middle ground – Your approach to database monitoring is unique to your company's requirements. Collecting monitoring data that has no practical or actionable applications is not efficient. Defining actionable KPIs for your database monitoring is key to finding a balance — monitor what your team can use to ensure systems availability, stability, and satisfied customers. Remember: Successful database monitoring is data-driven, proactive, continuous, actionable, and collaborative. This is an article from DZone's 2023 Database Systems Trend Report. For more: Read the Report
This is an article from DZone's 2023 Data Pipelines Trend Report. For more: Read the Report Organizations today rely on data to make decisions, innovate, and stay competitive. That data must be reliable and trustworthy to be useful. Many organizations are adopting a data observability culture that safeguards their data accuracy and health throughout its lifecycle. This culture involves putting in motion a series of practices that enable you and your organization to proactively identify and address issues, prevent potential disruptions, and optimize your data ecosystems. When you embrace data observability, you protect your valuable data assets and maximize their effectiveness. Understanding Data Observability "In a world deluged by irrelevant information, clarity is power." – Yuval Noah Harari, 21 Lessons for the 21st Century, 2018 As Yuval Noah Harari puts it, data is an incredibly valuable asset today. As such, organizations must ensure that their data is accurate and dependable. This is where data observability comes in — but what is data observability, exactly? Data observability is the means to ensure our data's health and accuracy, which means understanding how data is collected, stored, processed, and used, plus being able to discover and fix issues in real time. By doing so, we can optimize our system's effectiveness and reliability by identifying and addressing discrepancies while ensuring compliance with regulations like GDPR or CCPA. We can gather valuable insights that prevent errors from recurring in the future by taking such proactive measures. Why Is Data Observability Critical? Data reliability is vital. We live in an era where data underpins crucial decision-making processes, so we must safeguard it against inaccuracies and inconsistencies to ensure our information is trustworthy and precise.
Data observability allows organizations to proactively identify and address issues before they can spread downstream, preventing potential disruptions and costly errors. One of the advantages of practicing data observability is that it'll ensure your data is reliable and trustworthy. This means continuously monitoring your data to avoid making decisions based on incomplete or incorrect information, giving you more confidence.
Figure 1: The benefits of companies using analytics (Data source: The Global State of Enterprise Analytics, 2020, MicroStrategy)
Analyzing your technology stack can also help you find inefficiencies and areas where resources are underutilized, saving you money. But incorporating automation tools into your data observability process is the cherry on top of the proverbial cake, making everything more efficient and streamlined. Data observability is a long-term approach to safeguarding the integrity of your data so that you can confidently harness its power, whether it's for informed decision-making, regulatory compliance, or operational efficiency. Advantages and Disadvantages of Data Observability When making decisions based on data, it's essential to be quick. But what if the data isn't dependable? That's where data observability comes in. However, like any tool, it has its advantages and disadvantages.
IMPLEMENTING DATA OBSERVABILITY: ADVANTAGES AND DISADVANTAGES
Advantage – Trustworthy insights for intelligent decisions: Data observability provides decision-makers with reliable insights, ensuring well-informed choices in business strategy, product development, and resource allocation. | Disadvantage – Resource-intensive setup: Implementing data observability demands time and resources to set up tools and processes, but the long-term benefits justify the initial costs.
Advantage – Real-time issue prevention: Data observability acts as a vigilant guardian for your data, instantly detecting issues and averting potential emergencies, thus saving time and resources while maintaining data reliability. | Disadvantage – Computational overhead from continuous monitoring: Balancing real-time monitoring with computational resources is essential to optimize observability.
Advantage – Enhanced team alignment through shared insights: Data observability fosters collaboration by offering a unified platform for teams to gather, analyze, and act on data insights, facilitating effective communication and problem-solving. | Disadvantage – Training requirements for effective tool usage: Data observability tools require skill, necessitating ongoing training investments to harness their full potential.
Advantage – Accurate data for sustainable planning: Data observability establishes the foundation for sustainable growth by providing dependable data that's essential for long-term planning, including forecasting and risk assessment. | Disadvantage – Privacy compliance challenges: Maintaining data observability while adhering to strict privacy regulations like GDPR and CCPA can be intricate, requiring a delicate balance between data visibility and privacy compliance.
Advantage – Resource savings: Data observability allows you to improve how resources are allocated by identifying areas where your technology stack is inefficient or underutilized. As a result, you can save costs and prevent over-provisioning resources, leading to a more efficient and cost-effective data ecosystem. | Disadvantage – Integration complexities: Integrating data observability into existing data infrastructure may pose challenges due to compatibility issues and legacy systems, potentially necessitating investments in specific technologies and external expertise for seamless integration.
Table 1
To sum up, data observability has both advantages and disadvantages, such as providing reliable data, detecting real-time problems, and enhancing teamwork.
However, it requires significant time, resources, and training while respecting data privacy. Despite these challenges, organizations that adopt data observability are better prepared to succeed in today's data-driven world and beyond. Cultivating a Data-First Culture Data plays a crucial role in today's fast-paced and competitive business environment. It enables informed decision-making and drives innovation. To achieve this, it's essential to cultivate an environment that values data. This culture should prioritize accuracy, dependability, and consistent monitoring throughout the data's lifecycle. To ensure effective data observability, strong leadership is essential. Leaders should prioritize data from the top down, allocate necessary resources, and set a clear vision for a data-driven culture. This leadership fosters team collaboration and alignment, encouraging teams to work together towards the same objectives. When teams collaborate in a supportive work environment, critical data is properly managed and utilized for the organization's benefit. Technical teams and business users must work together to create a culture that values data. Technical teams build the foundation of data infrastructure, while business users access data to make decisions. Collaboration between these teams leads to valuable insights that drive business growth.
Figure 2: Data generated, gathered, copied, and consumed (Data source: Data and Analytics Leadership Annual Executive Survey 2023, NewVantage Partners)
By leveraging data observability, organizations can make informed decisions, address issues quickly, and optimize their data ecosystem for the benefit of all stakeholders. Nurturing Data Literacy and Accountability Promoting data literacy and accountability is not only a matter of efficiency but also an ethical consideration.
Assigning both ownership and accountability for data management empowers people to make informed decisions based on data insights, strengthens transparency, and upholds principles of responsibility and integrity, ensuring accuracy, security, and compliance with privacy regulations. A data-literate workforce is a safeguard, identifying instances where data may be misused or manipulated for unethical purposes.
Figure 3: The state of data responsibility and data ethics (Data source: Amount of data created, consumed, and stored 2010–2020, with forecasts to 2025, 2023, Statista)
Overcoming Resistance To Change Incorporating observability practices is often a considerable challenge, and facing resistance from team members is not uncommon. However, you should confront these concerns and communicate clearly to promote a smooth transition. You can encourage the adoption of data-driven practices by highlighting the long-term advantages of better data quality and observability, which might inspire your coworkers to welcome changes. Showcasing real-life cases of positive outcomes, like higher revenue and customer satisfaction, can also help make a case. Implementing Data Observability Techniques You can keep your data pipelines reliable and at a high quality by implementing data observability. This implementation involves using different techniques and features that will allow you to monitor and analyze your data. Those processes include data profiling, anomaly detection, lineage, and quality checks. These tools will give you a holistic view of your data pipelines, allowing you to monitor their health and quickly identify any issues or inconsistencies that could affect their performance. Essential Techniques for Successful Implementation To ensure the smooth operation of pipelines, you must establish a proper system for monitoring, troubleshooting, and maintaining data. Employing effective strategies can help achieve this goal. Let's review some key techniques to consider.
Connectivity and Integration For optimal data observability, your tools must integrate smoothly with your existing data stack. This integration should not require major modifications to your pipelines, data warehouses, or processing frameworks. This approach allows for an easy deployment of the tools without disrupting your current workflows. Data Monitoring at Rest Observability tools should be able to monitor data while it's at rest without needing to extract it from the current storage location. This method ensures that the monitoring process doesn't affect the speed of your data pipelines and is cost-effective. Moreover, this approach makes your data safer as it doesn't require extraction. Automated Anomaly Detection Automated anomaly detection is an important component of data observability. Machine learning models identify patterns and behaviors in the data, enabling alerts to be sent when unexpected deviations occur, reducing the number of false positives and alleviating the workload of data engineers who would otherwise have to manage complex monitoring rules. Dynamic Resource Identification Data observability tools give you complete visibility into your data ecosystem. These tools should automatically detect important resources, dependencies, and invariants. They should be flexible enough to adapt to changes in your data environment, giving you insights into vital components without constant manual updates and making data observability extensive and easy to configure. Comprehensive Contextual Information For effective troubleshooting and communication, data observability needs to provide comprehensive contextual information. This information should cover data assets, dependencies, and reasons behind any data gaps or issues. Having the full context will allow data teams to identify and resolve any reliability concerns quickly.
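To make the anomaly-detection idea concrete, here is a deliberately simple baseline-deviation check in Python. Real observability platforms use far more sophisticated models (seasonality-aware, multivariate), but the core pattern — learn a baseline, flag values that deviate too far from it — can be sketched with a z-score. The threshold of 3 standard deviations is an illustrative assumption.

```python
from statistics import mean, stdev

def detect_anomaly(history, value, z_threshold: float = 3.0) -> bool:
    """Flag `value` as anomalous if it deviates from the historical baseline
    by more than `z_threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Constant history: anything different at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

In practice the alerting layer would also suppress repeats and attach context (which table, which pipeline run) so engineers are not flooded with raw flags — this is where the false-positive reduction mentioned above comes in.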
Preventative Measures Data observability involves monitoring data assets and offers preventive measures to avoid potential issues. With insights into your data and suggestions for responsible alterations or revisions, you can proactively address problems before they affect data pipelines. This approach leads to greater efficiency and time savings in the long run. If you need to keep tabs on data, it can be tough to ensure everything is covered. Only using batch and stream processing frameworks isn't enough. That's why it's often best to use a tool specifically made for this purpose. You could use a data platform, add it to your existing data warehouse, or opt for open-source tools. Each of these options has its own advantages and disadvantages: Use a data platform – Data platforms are designed to manage all of your organization's data in one place and grant access to that data through APIs instead of via the platform itself. There are many benefits to using a data platform, including speed, easy access to all your organization's information, flexible deployment options, and increased security. Additionally, many platforms include built-in capabilities for data observability, so you can ensure your databases perform well without having to implement an additional solution. Build data observability into your existing platform – If your organization only uses one application or tool to manage its data, this approach is probably the best for you, provided it includes an observability function. Incorporating data observability into your current setup is a must-have if you manage complex data stored in multiple sources, thus improving the reliability of your data flow cycle. Balancing Automation and Human Oversight Figure 4: Balancing automation and human oversight While automation is a key component of data observability, it's important to strike a balance between automation and human oversight.
While automation can help with routine tasks, human expertise is necessary for critical decisions and ensuring data quality. Implementing data observability techniques involves seamless integration, automated anomaly detection, dynamic resource identification, and comprehensive contextual information. Balancing automation and human oversight is important for efficient and effective data observability, resulting in more reliable data pipelines and improved decision-making capabilities. Conclusion In conclusion, data observability empowers organizations to thrive in a world where data fuels decision-making by ensuring data's accuracy, reliability, and trustworthiness. We can start by cultivating a culture that values data integrity, collaboration between technical and business teams, and a commitment to nurturing data literacy and accountability. You will also need a strong data observability framework to monitor your data pipelines effectively. This includes a set of techniques that will help identify issues early and optimize your data ecosystems. But automated processes aren't enough, and we must balance our reliance on automation with human oversight, recognizing that while automation streamlines routine tasks, human expertise remains invaluable for critical decisions and maintaining data quality. With data observability, data integrity is safeguarded, and its full potential is unlocked — leading to innovation, efficiency, and success.
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany. This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens. Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the discussions and chats that happen in the breaks between talks, where you can connect with core maintainers of various aspects of the Prometheus project. Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there. Let's dive right in and see what the event had to offer this year in Berlin. This overview will be my impressions of each day of the event, but not all the sessions will be covered. Let's start with a short overview of the insights taken from sessions, chats, and the social event: OpenTelemetry interoperability (in all flavors) is the hot topic of the year. Native Histograms were a big topic the last two years; this year they showed a lot of promise here and there, but were not a major focus of the talks. The Perses dashboard and visualization project presented its Alpha release as a truly open-source project based on the Apache 2.0 license. By my count, there were ~150 attendees, and they also live-streamed all talks/lightning talks, which will also be made available on their YouTube channel post-event. Day 1 The day started with a lovely walk through the center of Berlin to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline): What's New in Prometheus and Its Ecosystem Native Histograms - Efficiency and more details Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40).
Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated." stringlabels - Storing labels differently for a significant memory reduction keep_firing_for field added to alerting rules - How long an alert will continue firing after the condition has cleared scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding big monolithic config files OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics SNMP Exporter (v0.24) - Breaking changes: new configuration format; splits connection settings from metrics details, simpler to change. Also added the ability to query multiple modules in just one scrape. MySQLd Exporter (v0.15) - Multi-target support; use a single exporter to monitor multiple MySQL-like servers Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms Alertmanager - New receivers: MS Teams, Discord, Webex Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now Every Tuesday, Prometheus meets for Bug Scrub at 11:00 UTC. Calendar: https://prometheus.io/community. What's Coming New Alertmanager UI Metadata Improvements Exemplar Improvements Remote Write v2 Perses: The CNCF Candidate for Observability Visualization Summary An announcement was given of the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility - purpose-built for observability data; a truly open-source alternative with the Apache 2.0 license. Perses was born from the CNCF landscape's lack of visualization tooling projects: Perses - An exploration of a standard dashboard format Chronosphere, Red Hat, and Amadeus are displayed as founding members GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment Chronosphere supported its development, and Red Hat is integrating the Perses package into the OpenShift Console.
There is an exploration of its usage with Prometheus/PromLens. Currently it only displays metrics, but Red Hat is working on integrating tracing with OpenTelemetry; logs are on the future wishlist. Feature details were presented for the development of dashboards Includes Grafana migration tooling I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for CNCF Sandbox status. Towards Making Prometheus OpenTelemetry Native Summary OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental. Details on the Effort OTLP ingestion is there experimentally. The experience with target_info is a big pain point at the moment. Takes about half the bandwidth of remote write, 30-40% more CPU due to gzip The new Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; may inspire Prometheus remote write 2.0. There is a GitHub milestone to track this. Thinking about using collector remote config to solve "split configuration" between the Prometheus server and OpenTelemetry clients Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos Summary Shopify states they are running "highly scalable globally distributed and highly dynamic" cloud infrastructure, so they are on "Planet Scale" with Prometheus.
Details on the Effort Huge Ruby shop, latency-sensitive, with large scaling events around the retail cycle and flash sales HPA struggles with scaling up quickly enough Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters Backend is Thanos-based, but they have added a lot on top of it (custom work) Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution) Have a router layer on top of Thanos to decouple ingestion and storage; sounds like they're evolving into a Mimir-like setup Split the query layer into two deployments: one for short-term queries and one for longer-term queries Team- and service-centric UI for alerting, integrated with SLO tracking Native histograms solved cardinality challenges and, combined with Thanos' distributed querier, made very high cardinality queries work; as they stated, "This changed the game for us." When migrating from the previous observability vendor, they decided not to convert dashboards; instead, they worked with developers to build new, cleaner ones. Developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue. Lightning Talks Summary It's always fun to end the day with a quick series of talks that are ad-hoc collected from the attendees. Below is a list of ones I thought were interesting as well as a short summary, should you want to find them in the recordings: AlertManager UI: Alertmanager will get a new UI in React. Elm didn't get traction as a common language; considering alternatives to Bootstrap Implementing integrals with Prometheus and Grafana: Integrals in PromQL – the inverse of rates; a pure-PromQL version of the delta counter, using sum_over_time and Grafana variables to simplify getting all the right factors.
Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore prod metrics; an interesting idea to integrate this deeply Day 2 After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline): Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus Summary Overview of the metrics story at Shopify, with over 1k teams running it: Originally forwarding metrics "from observability vendor agent" Issues because that was multiplying the cardinality across exporter instances; same with the sidecar model Built a StatsD protocol-aware load balancer Running as a sidecar also had ownership issues, stating, "We would be on call for every application" DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level Didn't want per-instance metrics because of cardinality, and metrics are more domain-level Roughly one exporter per 50-100 nodes Load balancer sanitizes label values and drops labels Pre-aggregation on short time scales to deal with "hot loop instrumentation"; this resulted in roughly a 20x reduction in bandwidth use Compensating for the lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor) "We have close to a thousand teams right now" Prometheus Java Client 1.0.0 Summary V1.0.0 was released last week. This talk was an overview of some of their updates featuring native histograms and OpenTelemetry support. Rewrote the underlying model, so there are breaking changes, with a migration module for Prometheus simpleclient metrics. JavaDoc can be found here.
Almost as simple as updating imports in your Java app to use; I'm going to update my workshop Java example for instrumentation to the new API Includes good examples in the project Exposes native + classic histograms by default, scraper's choice A lot more configurations available as Java properties Callback metrics (this is great for writing exporters) OTel push support (on a configurable interval) Allows standard OTel names (with dots), automatically replaces dots with underscores for the Prometheus format Integrates with the OTel tracing client to make exemplars work - picks exemplars from the tracing context, extends the tracing context to mark that trace so it does not get sampled away Despite supporting OTel, this is still a performance-minded client library All metric types support concurrent updates Dropped Pushgateway support for now, but will port it forward The JMX exporter will pick up these changes as a side effect once it is updated Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused Lightning Talks Summary Again, here is a list of lightning talks I thought were interesting from the final day and a short summary, should you want to find them in the recordings: Tracking object storage costs Trying to measure object storage costs, as they are the number 2 cost in their cloud bills; built a Prometheus Price Exporter Object storage cost is ~half of Grafana's cloud bill; varies by customer (can be as low as 2%) Trick for extending sparse metrics with zeroes: or on() vector(0) They have a prices exporter in the works; promised to open source it Prom operator - what's next?
A tour of some more features coming in the Prometheus operator: shard autoscaling, scrape classes, support for Kubernetes events, and Prometheus agent deployment as a DaemonSet Prometheus adoption stats 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled Final Impressions For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have "getting started" sessions. Most of it assumes you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into the status of features in the monitoring world.
This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. To meet their needs for real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs, and of course, be scalable to support the huge and ever-enlarging data size. The rest of this article is about what their log processing architecture looks like and how they realize stable data ingestion, low-cost storage, and quick queries with it. System Architecture This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing. ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is stored in HDFS for data verification or replay. DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables are also put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are tolerant of duplication, the fact tables are arranged in the Duplicate Key model of Apache Doris. DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis. ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model. Architecture 2.0 evolved from Architecture 1.0, which was supported by ClickHouse and Apache Hive.
The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0. Real-Case Practice Stable Ingestion of 15 Billion Logs Per Day In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads. A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations: Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce writing frequency and the number of transactions processed by Doris per unit of time. This can relieve data writing pressure and avoid generating too many data versions. Data Pre-Aggregation: For data that has the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.
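The pre-aggregation step above can be sketched in plain Python to show the idea: records that share a primary-key ID are merged into one flat record before writing, so the warehouse sees a single write per ID instead of one per source table. In the user's pipeline this happens inside Flink; the field names below (`id`, `src_ip`, `event`) are hypothetical placeholders.

```python
from collections import defaultdict

def preaggregate(records):
    """Merge records sharing a primary-key 'id' (e.g., rows arriving from
    several source tables) into one flat record per id, so only a single
    write per id reaches the warehouse."""
    merged = defaultdict(dict)
    for rec in records:
        rid = rec["id"]
        merged[rid]["id"] = rid
        for key, value in rec.items():
            if key != "id":
                # Later records overwrite earlier fields of the same name,
                # mirroring a last-write-wins flattening policy.
                merged[rid][key] = value
    return list(merged.values())
```

The design choice here is to trade a small amount of buffering in the stream processor for far fewer (and larger) writes downstream, which is exactly what keeps Doris's transaction count and data-version count under control.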
Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overheads), and dialing up max_tablet_version_num to avoid version accumulation. These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris can ensure quicker data updates. Storage Strategies to Reduce Costs by 50% The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs. ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, specify the compression method as "ZSTD" upon table creation; this realizes a compression ratio of 10:1. Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) is stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder," it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage. Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition.
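The tiering policy described above is essentially an age-based classification, which can be sketched as follows. The 7-day cooldown comes from the article; the 90-day cutoff for moving from HDD to object storage is an assumed value for illustration only, since the article does not state when data becomes "even colder."

```python
def storage_tier(age_days: int, cooldown_days: int = 7, cold_days: int = 90) -> str:
    """Classify data by age: hot data on SSD, cooled data on HDD, and the
    oldest data in object storage. cold_days is an assumed cutoff."""
    if age_days <= cooldown_days:
        return "SSD"            # hot: frequently queried, lowest latency
    if age_days <= cold_days:
        return "HDD"            # warm: cheaper, still locally attached
    return "object_storage"     # cold: cheapest, single copy instead of three
```

In Doris itself this movement is automatic once the cooldown policy is configured; the sketch just makes the decision boundaries explicit.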
Data that is 3~6 months old has two replicas, and data older than 6 months has a single copy. With these three strategies, the user has reduced their storage costs by 50%.

Differentiated Query Strategies Based on Data Size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user applies different query strategies for different data sizes:

Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour, which avoids data skew. To further ensure the balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data of the most recent 20 days is kept. In this way, they find the balance point between data backlog and analytic needs.

100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Queries on these tables are thus much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.

More than 1T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated. In this way, queries over 2 billion log records can be done in 1~2s.

These strategies have shortened the response time of queries. For example, a query for a specific data item used to take minutes; now it can be finished in milliseconds. In addition, for big tables containing 10 billion data records, queries on different dimensions can all be done in a few seconds.

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numeric and datetime values.
They have also provided valuable feedback about the auto-bucketing logic in Doris. Currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime, but little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic where the reference for deciding the number of buckets is the data size and distribution of the previous day. They've brought this to the Apache Doris community, and the community is now working on this optimization.
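As a side note, the snowflake IDs the user relies on as a bucketing field combine a timestamp with worker and sequence bits, which is why they spread evenly across buckets while remaining time-ordered. Here is a minimal sketch, assuming the common 41/10/12-bit layout and a custom epoch (the user's exact layout is not documented here):

```java
// A minimal sketch of a snowflake-style ID: timestamp bits, then worker
// and sequence bits. High bits follow time, low bits spread evenly, which
// is what makes such IDs a reasonable bucketing field. The bit widths and
// epoch below are the common convention, not the user's exact scheme.
public class SnowflakeSketch {
    private static final long EPOCH = 1_600_000_000_000L; // custom epoch (assumption)
    private final long workerId;   // 10 bits
    private long sequence = 0L;    // 12 bits
    private long lastMillis = -1L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF; // stay within 12 bits in the same millisecond
        } else {
            sequence = 0L;
            lastMillis = now;
        }
        // 41 bits of timestamp | 10 bits of worker | 12 bits of sequence
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch gen = new SnowflakeSketch(1);
        long first = gen.nextId();
        long second = gen.nextId();
        System.out.println(second > first); // IDs are strictly time-ordered here
    }
}
```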
Your team celebrates a success story: a trace identified a pesky latency issue in your application's authentication service. A fix was swiftly implemented, and everyone celebrated a quick win in the next team meeting. But the celebrations are short-lived. Just days later, user complaints surge about a related payment gateway timeout. It turns out that the fix did improve performance at one point but created a situation in which key information was never cached. Other parts of the software reacted badly to the change, and the whole thing had to be reverted. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built this way. Relying solely on a single trace gave us a partial view of a broader problem. This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems.

The Limiting Factor

The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is a synchronized block in another related method, causing thread contention and slowing down the entire system.

Temporal blindness is the second problem. Think of a Java garbage collection (GC) log.
A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system.

The last problem is related to the previous two: missing context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace.

Strength in Numbers

Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue. Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound under high transaction loads. This helps us distinguish between correlation and causation.
Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation. This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the "tongs" motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs.

Example

Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting concepts taken by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that work should probably still be done manually, since a machine can't understand the why, only the what. It seems it can detect the venerable Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors?
How much does the performance issue impact the rest of the application? These become questions with easy answers once we see all the different aspects laid out together.

Magical APIs

The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA); the great Vlad Mihalcea has an excellent explanation of it. The TL;DR is rather simple: we write a simple database query using the ORM, but we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem with "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what is going on. Observability is one of the best ways to understand why these things fail. In the past, I would reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle.

Final Word

Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight. While these traces offer valuable insights, their true potential is only realized when they are viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience.
In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
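To make the "strength in numbers" argument concrete, here is a toy sketch that groups span durations by operation and reports a 95th-percentile duration per method. The span data and method names are invented for illustration, and this is not the API of any particular tracing backend; it only shows how a distribution surfaces what a single trace cannot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: judge operations by the distribution of their span
// durations rather than by any single trace. Data and names are made up.
public class SpanStats {
    record Span(String operation, long durationMs) {}

    // Returns the 95th-percentile duration per operation (nearest-rank method).
    public static Map<String, Long> p95ByOperation(List<Span> spans) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (Span s : spans) {
            grouped.computeIfAbsent(s.operation(), k -> new ArrayList<>()).add(s.durationMs());
        }
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            List<Long> durations = e.getValue();
            Collections.sort(durations);
            int idx = Math.max(0, (int) Math.ceil(0.95 * durations.size()) - 1);
            result.put(e.getKey(), durations.get(idx));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("processTransaction", 40),
                new Span("processTransaction", 45),
                new Span("processTransaction", 900), // the outlier one lone trace can't explain
                new Span("fetchUserData", 12));
        System.out.println(p95ByOperation(spans));
    }
}
```

A single 40ms trace of processTransaction() looks healthy; only the grouped view reveals the tail that users actually feel.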
If your environment is like many others, it can often seem like your systems produce logs filled with excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs, and they don’t typically have any specific purpose or focus holding them together, you may dread sifting through them. Without the right tools, it can feel like you’re stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good material from the heaps of galactic junk. Though it can feel like more trouble than it’s worth, sorting through logs is crucial. Logs hold many valuable insights into what’s happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we’re going to take a look at how log analytics can help you make sense of your log data without much effort. We’ll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let’s blast off and turn that cosmic trash into treasure!

The Truth Is Out There: Getting Value Just From the Things You’re Already Logging

One massive benefit a log analytics platform offers any systems engineer is a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator makes it much easier to see simultaneous events across your infrastructure. You’ll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data.
Of course, you could make your own aggregation interface from scratch, but log aggregation tools often provide a number of extra features that are worth the additional investment, such as powerful search and fast analytics.

Searching Through the Void: Using Search Query Language To Find Things

You’ve probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting all your logs in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let’s take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find in your log data. For example, if you know your log lines follow the format “From: Jane To: John,” you can parse out the “from” and “to” with the following query:

* | parse "From: * To: *" as (from, to)

This would store “Jane” in the “from” field and “John” in the “to” field. Another valuable language feature you could tap into would be keyword expressions.
You could use this query to search across your logs for any instances where a command with root privileges failed:

(su OR sudo) AND (fail* OR error)

Here is a listing of General Search Examples that are drawn from parsing a single Apache log message.

Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics

One other aspect of searching is that it typically looks into the past. Sometimes, you need to see things as they happen. Let’s take a look at Live Tail and LogReduce, two tools that improve on simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic’s offering, so we’ll dive into them.

Live Tail

At its simplest, Live Tail lets you see a live feed of your log messages. It’s like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you’re looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in.

LogReduce

LogReduce gives you more insight into your search query’s aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of “Signatures” that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the pattern’s relevance to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all fairly advanced and can be hard to grasp without a demo, so you can dive deeper by watching this video.

Integrated Log Aggregation

Often, you’ll need information from systems you aren’t running directly mixed in with your other logs.
That’s why it’s important to make sure you can integrate your log aggregator with other systems, and many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren’t only available on the ELK stack; Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company’s 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don’t need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of those logs in your aggregator together.

Conclusion

This has been a brief tour through some of the features available with log aggregation. There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can unlock immense value from your already existing log data. That way, you can become a better cosmic junk collector!
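To see what a field expression does conceptually, here is a plain-Java sketch that mimics the parse "From: * To: *" as (from, to) query shown earlier with a regular expression, binding each wildcard to a named field. This imitates the idea only, not Sumo Logic's actual query engine.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch of a field expression: each wildcard in the parse
// pattern becomes a capture group whose value is bound to a named field.
// This mimics the idea of the query language, not its implementation.
public class FieldExpression {
    private static final Pattern FROM_TO = Pattern.compile("From: (\\S+) To: (\\S+)");

    public static Map<String, String> parse(String logLine) {
        Matcher m = FROM_TO.matcher(logLine);
        if (!m.find()) {
            return Map.of(); // no match: no fields extracted
        }
        return Map.of("from", m.group(1), "to", m.group(2));
    }

    public static void main(String[] args) {
        // Extracts from=Jane and to=John from a line in the documented format.
        System.out.println(parse("2024-01-01 mail: From: Jane To: John"));
    }
}
```

Once fields like "from" and "to" are bound, an aggregation service can count, group, and alert on them, which is exactly where these tools pull ahead of grep.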