Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Recently, I encountered a task where a business was using AWS Elastic Beanstalk but was struggling to understand the system state due to the lack of comprehensive metrics in CloudWatch. By default, CloudWatch only provides a few basic metrics, such as CPU and network; memory and disk metrics are not included in the default collection. Fortunately, each Elastic Beanstalk virtual machine (VM) comes with a CloudWatch agent that can easily be configured to collect additional metrics. For example, if you need information about VM memory consumption, which AWS does not provide out of the box, you can configure the CloudWatch agent to collect this data. This greatly enhances your visibility into the performance and health of your Elastic Beanstalk environment, allowing you to make informed decisions and optimize your application's performance.

How To Configure Custom Metrics in AWS Elastic Beanstalk

To accomplish this, you'll need to edit your Elastic Beanstalk zip bundle and include a cloudwatch.config file in the .ebextensions folder at the top of your bundle. Note that the configuration file should be chosen based on your operating system, as described in this article. By doing so, you'll be able to customize the CloudWatch agent settings and enable the collection of additional metrics, such as memory consumption, to gain deeper insights into your Elastic Beanstalk environment and effectively monitor and optimize your application on AWS.

Linux-based config:

```yaml
files:
  "/opt/aws/amazon-cloudwatch-agent/bin/config.json":
    mode: "000600"
    owner: root
    group: root
    content: |
      {
        "agent": {
          "metrics_collection_interval": 60,
          "run_as_user": "root"
        },
        "metrics": {
          "append_dimensions": {
            "InstanceId": "$${aws:InstanceId}"
          },
          "metrics_collected": {
            "mem": {
              "measurement": [
                "mem_total",
                "mem_available",
                "mem_used",
                "mem_free",
                "mem_used_percent"
              ]
            }
          }
        }
      }

container_commands:
  apply_config_metrics:
    command: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
```

Windows-based config:

```yaml
files:
  "C:\\Program Files\\Amazon\\AmazonCloudWatchAgent\\cw-memory-config.json":
    content: |
      {
        "agent": {
          "metrics_collection_interval": 60,
          "run_as_user": "root"
        },
        "metrics": {
          "append_dimensions": {
            "InstanceId": "$${aws:InstanceId}"
          },
          "metrics_collected": {
            "mem": {
              "measurement": [
                "mem_total",
                "mem_available",
                "mem_used",
                "mem_free",
                "mem_used_percent"
              ]
            }
          }
        }
      }

container_commands:
  01_set_config_and_reinitialize_cw_agent:
    command: powershell.exe cd 'C:\Program Files\Amazon\AmazonCloudWatchAgent'; powershell.exe -ExecutionPolicy Bypass -File ./amazon-cloudwatch-agent-ctl.ps1 -a append-config -m ec2 -c file:cw-memory-config.json -s; powershell.exe -ExecutionPolicy Bypass -File ./amazon-cloudwatch-agent-ctl.ps1 -a start; exit
```

As you may have noticed, I enabled only a few memory-related metrics: mem_total, mem_available, mem_used, mem_free, and mem_used_percent. You can enable more metrics as needed; the complete list of available metrics can be found here. Once you have updated your application, it is worth creating a CloudWatch dashboard to visualize these metrics. To do so, navigate to the AWS CloudWatch console, select Dashboards, and click Create dashboard. From there, add a widget by clicking the Add widget button and selecting Line to create a line chart that displays the desired metrics.
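Before building widgets, it can help to confirm that the agent is actually publishing the new metrics. Here is a minimal sketch using boto3 (not part of the original setup), assuming your AWS credentials and region are configured and that the agent publishes under its default CWAgent namespace:

```python
import boto3

# List the custom memory metrics published by the CloudWatch agent.
# Assumes the default "CWAgent" namespace used by the agent configuration above.
cloudwatch = boto3.client("cloudwatch")

paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="CWAgent"):
    for metric in page["Metrics"]:
        dimensions = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(metric["MetricName"], dimensions)
```

If the mem_* metrics appear in the output, the agent configuration has been applied correctly.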
Customizing a dashboard with relevant metrics can provide valuable insights into the performance and health of your Elastic Beanstalk environment, making it easier to monitor and optimize your application on AWS. In the case of the example above, we'll see five new metrics in the CWAgent section. Based on them, we can configure a memory widget and get something like this.

Final Thoughts

Feel free to explore the wide variety of metrics and AWS widgets available in CloudWatch to further customize your dashboard. If you have any questions or need assistance, feel free to ask me in the comments.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.

Cloud native and observability are an integral part of developers' lives. Understanding their responsibilities within observability at scale helps developers tackle the challenges they face on a daily basis. There is more to observability than just collecting and storing data, and developers are essential to surviving these challenges.

Observability Foundations

Gone are the days of monitoring a known application environment, debugging services within our development tooling, and waiting for new resources to deploy our code to. Everything has become dynamic, agile, and quickly available with auto-scaling infrastructure in the final production deployment environments. Developers now strive to observe everything they create, from development to production, often owning their code for the entire lifecycle. The tooling of old, such as Nagios and HP OpenView, can't keep up with constantly changing cloud environments that contain thousands of microservices. The infrastructure for cloud-native deployments is designed to scale dynamically as needed, making it even more essential for observability platforms to condense all that data noise and detect trends leading to downtime before they happen.

Splintering of Responsibilities in Observability

Cloud-native complexity not only changed the developer world but also impacted how organizations are structured. The responsibilities of creating, deploying, and managing cloud-native infrastructure have split into a series of new organizational teams. Developers are tasked with more than just code creation and are expected to adopt more hybrid roles within some of these new teams. Observability teams have been created to focus on specific aspects of the cloud-native ecosystem and provide their organization a service within the cloud infrastructure. In Table 1, we can see the splintering of traditional roles into these teams with specific focuses.
Table 1. Who's who in the observability game

| Team | Focus | Maturity goals |
|------|-------|----------------|
| DevOps | Automation and optimization of the app development lifecycle, including post-launch fixes and updates | Early stages: developer productivity |
| Platform engineering | Designing and building toolchains and workflows that enable self-service capabilities for developers | Early stages: developer maturity and productivity boost |
| CloudOps | Provides organizations proper cloud resource management, using DevOps principles and IT operations applied to cloud-based architectures to speed up business processes | Later stages: cloud resource management, costs, and business agility |
| SRE | All-purpose role aiming to manage reliability for any type of environment; a full-time job avoiding downtime and optimizing performance of all apps and supporting infrastructure, regardless of whether it's cloud native | Early to late stages: on-call engineers trying to reduce downtime |
| Central observability team | Responsible for defining observability standards and practices, delivering key data to engineering teams, and managing tooling and observability data storage | Later stages, owning: defining monitoring standards and practices, delivering monitoring data to engineering teams, measuring reliability and stability of monitoring solutions, and managing tooling and storage of metrics data |

To understand how these teams work together, imagine a large, mature, cloud native organization that has all the teams featured in Table 1:

- The DevOps team is the first line for standardizing how code is created, managed, tested, updated, and deployed. They work with the toolchains and workflows provided by the platform engineering team, advising on new tooling and/or workflows and creating continuous improvements to both.
- A CloudOps team focuses on cloud resource management and getting the most out of the budgets the other teams spend on the cloud.
- An SRE team is on call to manage reliability, avoiding downtime for all supporting infrastructure in the organization. They provide feedback to all the teams to improve tools, processes, and platforms.
- The overarching central observability team sets the observability standards for all teams to adhere to, delivering the right observability data to the right teams and managing tooling and data storage.

Why Observability Is Important to Cloud Native

Today, cloud native usage has grown to the point that developers are overwhelmed by vast responsibilities that go beyond just coding. The complexity introduced by cloud-native environments means that observability is becoming essential to solving many of the challenges developers face.

Challenges

Increasing cloud native complexity means that developers are delivering more code faster and passing more rigorous testing to ensure that their applications work at cloud native scale. These challenges expanded the need for observability within what was traditionally the developers' coding environment. Not only do developers need to provide code and testing infrastructure for their applications, they are also required to instrument that code so that business metrics can be monitored. Over time, developers learned that fully automated instrumentation was overkill, with much of that data being unnecessary. This led them to fine-tune their instrumentation and turn to manual instrumentation, collecting only the metrics they needed. Another challenge arises when decisions are made to integrate existing application landscapes with new observability practices in an organization.
The time developers spend manually instrumenting existing applications so that they provide the needed data to an observability platform is an often overlooked burden. New observability tools designed to help with metrics, logs, and traces are introduced to the development teams, leading to more challenges for developers. Often, these tools are mastered by only a few, leading to siloed knowledge, which results in organizations paying premium prices for advanced observability tools only to have them used superficially, as little more than toys.

Finally, when exploring the data ingested from our cloud infrastructure, the first thing that becomes obvious is that we don't need to keep everything that is being ingested. We need the ability to control our telemetry data and find out what is unused by our observability teams. There are some questions we need to answer about how we can:

- Identify ingested data not used in dashboards or alerting rules, nor touched in ad hoc queries by our observability teams
- Control telemetry data with aggregation and rules before we put it into expensive, longer-term storage
- Use only the telemetry data needed to support the monitoring of our application landscape

Tackling the flood of cloud data in such a way as to filter out the unused telemetry data, keeping only that which serves our observability needs, is crucial to making this data valuable to the organization.

Cloud Native at Scale

The use of cloud-native infrastructure brings a lot of flexibility, but when done at scale, the small complexities can become overwhelming. This is due to the premise of cloud native, where we describe how our infrastructure should be set up, how our applications and microservices should be deployed, and finally, how it automatically scales when needed. This approach reduces our control over how our production infrastructure reacts to surges in customer usage of an organization's services.

Empowering Developers

Empowering developers starts with platform engineering teams that focus on developer experience. We create developer experiences in our organization that treat observability as a priority, dedicating resources to creating a telemetry strategy from day one. In this culture, we're setting up development teams for success with cloud infrastructure, using observability alongside testing, continuous integration, and continuous deployment. Developers not only own the code they deliver but are now encouraged and empowered to create, test, and own the telemetry data from their applications and microservices. This is a brave new world where they are the owners of their work, providing agility and consensus within the various teams working on cloud solutions. Rising to the challenges of observability in a cloud native world is a success metric for any organization, and they can't afford to get it wrong. Observability needs to be front of mind with developers, considered a first-class citizen in their daily workflows, and consistently helping them with the challenges they face.

Artificial Intelligence and Observability

Artificial intelligence (AI) has risen in popularity not only within developer tooling but also in the observability domain. The application of AI in observability falls within one of two use cases:

- Monitoring machine learning (ML) solutions or large language model (LLM) systems
- Embedding AI into observability tooling itself as an assistant

The first case is when you want to monitor specific AI workloads, such as ML or LLMs.
These workloads can be further split into two situations you might want to monitor: the training platform and the production platform. Training infrastructure and the processes involved can be approached like any other workload, with monitoring achieved through instrumentation and existing methods, such as observing specific traces through a solution. This is not the complete monitoring story for these solutions, but out-of-the-box observability tools are quite capable of supporting infrastructure and application monitoring of these workloads.

The second case is when AI assistants, such as chatbots, are included in the observability tooling that developers are exposed to. This is often in the form of a code assistant, for example one that helps fine-tune a dashboard or query our time series data ad hoc. While these are nice to have, organizations are very mindful of developer usage when inputting queries that include proprietary or sensitive data. It's important to understand that training these tools might include using proprietary data in their training sets, or even the data developers input, to further train the agents for future query assistance. Predicting the future of AI-assisted observability is not going to be easy, as organizations consider their data one of their most valued assets and will continue to protect its usage outside of their control. To that end, one direction that might help adoption is to have agents trained only on in-house data, but that means the training data is smaller than for publicly available agents.

Cloud-Native Observability: The Developer Survival Pattern

While we spend a lot of time on tooling as developers, we all understand that tooling is not always the fix for the complex problems we face. Observability is no different, and while developers are often exposed to the mantra of metrics, logs, and traces for solving their observability challenges, this is not a path to follow without considering the big picture. The amount of data generated in cloud-native environments, especially at scale, makes it impossible to continue collecting all data. This flood of data, the challenges that arise, and the inability to sift through the information to find the root causes of issues become detrimental to the success of development teams. It would be more helpful if developers were supported with just the right amount of data, in just the right forms, and at the right time to solve issues. One does not mind observability if solutions to problems are found quickly, situations are remediated faster, and developers are satisfied with the results. If this is done with one log line, two spans from a trace, and three metric labels, then that's all we want to see.

To do this, developers need to know when issues arise with their applications or services, preferably before they happen. They start troubleshooting with data that their instrumented applications produce, succinctly pointing to areas within the offending application. Tooling allows the investigating developer to see dashboards reporting visual information that directs them to the problem and the potential moment it started. It is crucial for developers to be able to remediate the problem, maybe by rolling back a code change or deployment, so the application can continue to support customer interactions. Figure 1 illustrates the path taken by cloud native developers when solving observability problems.
The last step for any developer is to determine how issues encountered can be prevented going forward.

Figure 1. Observability pattern

Conclusion

Observability is essential for organizations to succeed in a cloud native world. The splintering of responsibilities in observability, along with the challenges that cloud-native environments bring at scale, cannot be ignored. Understanding the challenges that developers face in cloud native organizations is crucial to achieving observability happiness. Empowering developers, providing ways to tackle observability challenges, and understanding how the future of observability might look are the keys to handling observability in modern cloud environments.

DZone Refcard resources:

- Full-Stack Observability Essentials by Joana Carvalho
- Getting Started With OpenTelemetry by Joana Carvalho
- Getting Started With Prometheus by Colin Domoney
- Getting Started With Log Management by John Vester
- Monitoring and the ELK Stack by John Vester

This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
In today's world of IT digital transformation, more applications are hosted in cloud environments every day. Monitoring and maintaining these applications is challenging, and we need proper metrics in place to measure and act on. This is where SLAs, SLOs, and SLIs come into the picture: they enable effective monitoring and help maintain system performance.

Defining SLA, SLO, SLI, and SRE

What Is an SLA? (Commitment)

A Service Level Agreement is an agreement between the cloud provider and the client/user about measurable metrics, for example, uptime. It is normally handled by the company's legal department as per business and legal terms. It includes all the factors covered by the agreement and the consequences if it is breached, for example, credits or penalties. It mostly applies to paid services, not free services.

What Is an SLO? (Objective)

A Service Level Objective is an objective the cloud provider must meet to satisfy the agreement made with the client. It specifies individual metric expectations that the provider must meet to satisfy a client's expectations (e.g., availability). This helps improve overall service quality and reliability for clients.

What Is an SLI? (How Did We Do?)

A Service Level Indicator is the actual measurement used to check compliance with an SLO. It gives a quantified view of the service's performance (e.g., 99.92% of requests served within a latency threshold).

Who Is an SRE?

A Site Reliability Engineer is an engineer who always thinks about minimizing the gaps between software development and operations. The role is closely related to DevOps, which focuses on identifying those gaps. An SRE creates and uses automation tools to monitor and observe software reliability in production environments.

In this article, we will discuss the importance of SLOs/SLIs/SLAs and how a Site Reliability Engineer implements them in production applications.

Implementation of SLOs and SLIs

Let's assume we have an application service up and running in a production environment. The first step is to determine what an SLO should be and what it should cover.

Example of SLOs

SLO = target
- Above this target: good
- Below this target: bad; needs an action item

When setting a target, do not aim for 100% reliability. It is practically impossible to achieve because of patches, deployments, downtime, etc. This is where the error budget (EB) comes into the picture. The EB is the maximum amount of time that a service can fail without contractual consequences. For example:

SLA = 99.99% uptime
EB = roughly 52 mins and 36 secs per year, or 4 mins and 23 secs per month, during which the system can go down without consequences.

The next step is how to measure this SLO, and that is where the SLI comes into the picture: an indicator of the level of service that you are providing.

Example of SLIs

HTTP request SLI = number of successful requests / total requests

Common SLI Metrics

- Durability
- Response time
- Latency
- Availability
- Error rate
- Throughput

Leverage automated deployment, monitoring, and reporting tools to check SLIs and detect deviations from SLOs in real time (e.g., Prometheus, Grafana).
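To make the arithmetic above concrete, here is a minimal Python sketch (the request counts are illustrative, not real measurements) that computes the error budget implied by an SLO target and a simple availability SLI:

```python
# Minimal sketch: error budget for an SLO target plus an availability SLI.
# The request counts below are made-up numbers for illustration.

MINUTES_PER_YEAR = 365.25 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

def error_budget_minutes(slo_target: float) -> tuple[float, float]:
    """Downtime allowed per year and per month for a given SLO target (e.g., 0.9999)."""
    budget_fraction = 1.0 - slo_target
    return budget_fraction * MINUTES_PER_YEAR, budget_fraction * MINUTES_PER_MONTH

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI = successful requests / total requests."""
    return successful_requests / total_requests

yearly, monthly = error_budget_minutes(0.9999)
print(f"99.99% SLO -> error budget: {yearly:.1f} min/year, {monthly:.1f} min/month")

sli = availability_sli(successful_requests=998_750, total_requests=1_000_000)
print(f"Availability SLI: {sli:.4%}  (meets a 99.92% SLO: {sli >= 0.9992})")
```

The table below shows how such SLO/SLI pairs are typically written down per category.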
| Category | SLO | SLI |
|----------|-----|-----|
| Availability | 99.92% uptime/month | X% of the time the app is available |
| Latency | 92% of requests with response time under 240 ms | X average response time for user requests |
| Error rate | Less than 0.8% of requests result in errors | X% of requests that fail |

Challenges

- SLA: SLAs are normally written by business or legal teams with no input from technical teams, which results in missing key aspects to measure.
- SLO: Not measurable, or too broad to calculate.
- SLI: There are too many metrics and differences in how the measures are captured and calculated, which demands a lot of effort from SREs and yields less beneficial results.

Best Practices

- SLA: Involve the technical team when SLAs are written by the company's business/legal team and the provider. This helps reflect real technical scenarios in the agreement.
- SLO: Keep it simple and easily measurable so we can check whether we are in line with objectives.
- SLI: Define standard metrics to monitor and measure. This helps SREs check the reliability and performance of the services.

Conclusion

Implementation of SLAs, SLOs, and SLIs should be included as part of system requirements and design, and it should be in continuous improvement mode. SREs need to understand and take responsibility for how the systems serve the business needs and take the necessary measures to minimize impact.
In this article, I will discuss:

- The concept of Deep Work
- Why it is important in this day and age
- The unique challenges that Site Reliability Engineers face that make it hard to do Deep Work in their field
- Strategies that Site Reliability Engineering teams can employ to overcome these challenges and create an environment for Deep Work

What Is Deep Work?

Let's take a look at what Deep Work is. The concept was introduced by Cal Newport in his book "Deep Work: Rules for Focused Success in a Distracted World." In it, Newport defines Deep Work as the act of focusing without distraction on a cognitively demanding task. The opposite of Deep Work is Shallow Work, which Newport defines as logistical-style tasks that can be performed while distracted, like work coordination and communication tasks that are easy to replicate.

Why Is Deep Work Important?

Firstly, Deep Work is meaningful and satisfying. Based on a recent Gallup survey, employee engagement in the United States has hit a record low due to less clarity and satisfaction with their organizations. Deep Work can help solve this problem. Secondly, Deep Work can pave the path to a flow state, and research has found that the flow state leads to happiness. Finally, Deep Work is rewarding. Doing cognitively demanding work brings value to teams and organizations, which in turn leads to promotions and financial rewards for the individual doing the Deep Work. As Cal Newport says, "A deep life is a good life."

Now, let's look at some of the activities that are cognitively demanding for SREs, the activities that can be considered shallow, and some strategies that SRE teams can employ to promote Deep Work.

What Are Some Cognitively Demanding Tasks for SREs?

The following are some of the cognitively demanding tasks through which SRE teams can have a greater impact on their organizations:

- Automation and building services: Developing good automation to eliminate toil, improve the efficiency of managing infrastructure, and reduce costs is a cognitively demanding task. Contributing to the codebases that backend teams develop can also be a good opportunity for SREs.
- Improving observability: Improving the observability of systems through designing and creating usable dashboards, tuning alerts to improve the signal-to-noise ratio, instrumenting codebases to emit useful metrics, etc.
- Debugging and troubleshooting difficult issues impacting production systems: Troubleshooting difficult issues affecting production availability under time pressure.
- Improving processes: Improving processes such as change management and incident management to raise the overall efficiency of the team, and improving SLOs.
- Improving documentation: Writing good documentation is impactful and requires focus. Examples include usable troubleshooting guides, standard operating procedures, and architectural diagrams.
- Learning new technical skills: Continuous learning is key to becoming better at an SRE job. Learning new technical skills and keeping up with the latest technology trends, such as generative AI, is cognitively demanding as well.

What Challenges Do SREs Face To Perform Deep Work?
The following are some shallow tasks that SREs need to do to run the business and that make it difficult for them to do Deep Work:

1. Deployments and Upgrades

These are essential activities for the business but tend to be repetitive in nature. Depending on the level of automation that exists within the team, SREs spend some amount of time on these activities.

2. Answering Questions From Other Engineers

SRE team members are frequently interrupted by ad hoc questions from other teams, since SRE teams tend to have deeper knowledge of production systems and infrastructure.

3. Production Access Requests

In many teams, access to production systems is restricted to the SRE team to maintain the stability of production environments. Members of teams such as backend engineering and data engineering may interrupt SREs to get information from production systems for various purposes, such as debugging issues.

4. Randomization Due to On-Call and Production Issues

SREs tend to have end-to-end knowledge of the production systems and are often pulled into on-call issues even when they are not in the current on-call rotation. This takes time away from working on meaningful projects.

5. Meetings

There is a lot of overhead with meetings. In SRE roles, many people sometimes join troubleshooting calls that tend to run very long, with a lot of engineers acting as bystanders for extended periods of time.

6. Answering Emails and Replying to Teams/Slack Chats

This is a common activity for most people working in the knowledge economy, and SREs are not immune to it. Replying to emails and chats constantly randomizes an SRE's time and takes their attention away from important work.

What Strategies Can SREs Employ To Facilitate Deep Work?

Now let's look at some of the strategies that SRE teams can employ to minimize time spent on shallow work and spend that time on Deep Work:

1. Invest in Automation

SRE teams should prioritize investing time in automation to eliminate toil and reduce the operational burden of activities such as deployments and upgrades. Creating robust continuous integration and continuous deployment pipelines with built-in automated verification will reduce time spent on these activities. The goal should be to give development teams the tools they need to self-serve upgrades and deployments. SRE team management should plan projects so that proper resources are allocated for this kind of work.

2. Build Just-in-Time Access Systems

Just-in-time access systems with proper audit trails and approval processes can give people outside SRE teams appropriate access to production environments, so SRE teams don't have to spend time providing shadow access to others and can focus on Deep Work.

3. Proactively Plan for Projects

SRE teams can put proper project management in place to prioritize important work, such as improving the observability of critical production services.

4. Share the On-Call Load With R&D and Backend Engineering Teams

Sharing the on-call load with backend engineering teams, while letting SRE teams focus on improving tooling and documentation and on training others to effectively handle on-call issues, helps here as well.
5. Follow Efficient On-Call Rotations and Incident Management Processes

Following efficient on-call rotations, where only the responsible on-call engineers for that week handle most of the on-call issues, lets other engineers focus on dedicated projects and makes Deep Work possible for the rest of the team. Clear, easy-to-follow troubleshooting guides aid this as well.

6. Create Time Blocks To Focus on Important Projects

On a personal level, individual SRE team members can block time on the calendar to focus on important projects and avoid randomization.

7. Provide Time and Resources for Continuous Learning

Giving SRE team members time to learn and explore new technologies, and the freedom to apply them to solve reliability problems, is a great way to facilitate learning. Providing subscriptions to online learning services and books is also a great idea.

8. Allow SREs To Work on Projects of Their Choice

Allowing SRE team members to work on projects of their choice is a great way to encourage them to do Deep Work. For example, writing features used by end users, experimenting with a new piece of technology, or working with a different team for a short term are some ways to implement this idea. Google famously allowed all of its employees to spend 20% of their time on projects of their choice; implementing such a policy is a great way to encourage Deep Work.

Conclusion

By following the strategies discussed in this article, Site Reliability Engineers can aim to perform Deep Work and achieve happiness, satisfaction, and rewarding work while having a greater impact on their organizations.
I'm in the process of adding more components to my OpenTelemetry demo (again!). The new design deploys several warehouse services behind the inventory service so the latter can query the former for data via their respective HTTP interfaces. I implemented each warehouse on top of a different technology stack. This way, I can show OpenTelemetry traces across several stacks. Anyone should be able to add a warehouse in their favorite tech stack if it returns the correct JSON payload to the inventory.

For this, I want to make the configuration of the inventory "easy": add a new warehouse with a simple environment variable pair, i.e., the endpoint and its optional country. The main issue is that environment variables are not structured. I searched for a while and found a relevant post. Its idea is simple but efficient; here's a sample from the post:

```properties
FOO__1__BAR=setting-1   #1
FOO__1__BAZ=setting-2   #1
FOO__2__BAR=setting-3   #1
FOO__2__QUE=setting-4   #1
FIZZ__1=setting-5       #2
FIZZ__2=setting-6       #2
BILL=setting-7          #3
```

1. Map-like structure
2. Table-like structure
3. Just a value

With this approach, I could configure the inventory like this:

```yaml
services:
  inventory:
    image: otel-inventory:1.0
    environment:
      WAREHOUSE__0__ENDPOINT: http://apisix:9080/warehouse/us   #1
      WAREHOUSE__0__COUNTRY: USA                                #2
      WAREHOUSE__1__ENDPOINT: http://apisix:9080/warehouse/eu   #1
      WAREHOUSE__2__ENDPOINT: http://warehouse-jp:8080          #1
      WAREHOUSE__2__COUNTRY: Japan                              #2
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
      OTEL_RESOURCE_ATTRIBUTES: service.name=inventory
      OTEL_METRICS_EXPORTER: none
      OTEL_LOGS_EXPORTER: none
```

1. Warehouse endpoint
2. Set country

You can see the three warehouses configured above. Each has an endpoint/optional country pair.

My first attempt looked like the following:

```rust
lazy_static::lazy_static! {                                                       //1
    static ref REGEXP_WAREHOUSE: Regex = Regex::new(r"^WAREHOUSE__(\d)__.*").unwrap();
}

std::env::vars()
    .filter(|(key, _)| REGEXP_WAREHOUSE.find(key.as_str()).is_some())             //2
    .group_by(|(key, _)| key.split("__").nth(1).unwrap().to_string())             //3
    .into_iter()                                                                  //4
    .map(|(_, mut group)| {                                                       //5
        let some_endpoint = group.find(|item| item.0.ends_with("ENDPOINT"));      //6
        let endpoint = some_endpoint.unwrap().1;
        let some_country = group                                                  //7
            .find(|item| item.0.ends_with("COUNTRY"))
            .map(|(_, country)| country);
        println!("Country pair is: {:?}", some_country);
        (endpoint, some_country).into()                                           //8
    })
    .collect::<Vec<_>>()
```

1. For making constants out of code evaluated at runtime
2. Filter out warehouse-related environment variables
3. Group by index
4. Back to an iterator with the help of itertools
5. Consists of just the endpoint, or the endpoint and the country
6. Get the endpoint
7. Get the country
8. Into a structure (irrelevant here)

I encountered issues several times when I started the demo: the code somehow didn't find the endpoint at all. I chose this approach because I've been taught that it's more performant to iterate through the key-value pairs of a map than to iterate through its keys and then look each value up in the map. I tried to change to the latter approach:
```rust
lazy_static! {
    static ref REGEXP_WAREHOUSE_ENDPOINT: Regex =
        Regex::new(r"^WAREHOUSE__(?<index>\d)__ENDPOINT.*").unwrap();                          //1
}

std::env::vars()
    .filter(|(key, _)| REGEXP_WAREHOUSE_ENDPOINT.find(key.as_str()).is_some())                 //2
    .map(|(key, endpoint)| {
        let some_warehouse_index = REGEXP_WAREHOUSE_ENDPOINT.captures(key.as_str()).unwrap();  //3//4
        println!("some_warehouse_index: {:?}", some_warehouse_index);
        let index = some_warehouse_index.name("index").unwrap().as_str();
        let country_key = format!("WAREHOUSE__{}__COUNTRY", index);                            //5
        let some_country = var(country_key);                                                   //6
        println!("endpoint: {}", endpoint);
        (endpoint, some_country).into()
    })
    .collect::<Vec<_>>()
```

1. Change the regex to capture only the endpoint-related variables
2. Filter out warehouse-related environment variables
3. I'm aware that the filter_map() function exists, but I think it's clearer to separate them here
4. Capture the index
5. Create the country environment variable name from a known string and the index
6. Get the country

With this code, I didn't encounter any issues. Now that it works, I'm left with two questions:

- Why doesn't the group_by()/find() version work in the deployed Docker Compose despite working in the tests?
- Is anyone interested in making a crate out of it?

To Go Further

- Structured data in environment variables
- lazy_static crate
- envconfig crate
Observability providers often group periods of activity into sessions as the primary way to model a user's experience within a mobile app. Each session represents a contiguous chunk of time during which telemetry about the app is gathered, and it usually coincides with a user actively using the app. Therefore, sessions and their associated telemetry are a good way to represent user experience in discrete blocks. But is this really enough? Is there a better way to understand the intersection of users, app behavior, and the business impact of app performance? To answer those questions, I'd like to share my thoughts on the current state of mobile observability, how we got here, and why we should move beyond sessions and focus on users as the main signal to measure app health in the long term.

What Are You Observing?

When you add instrumentation to make your mobile app observable, what exactly are you observing? Traditionally, there are two ways of answering this question, and both are missing a piece of the larger picture.

Observing "The App"

First, it could be answering the question of what object is being observed, in which case the answer would unsurprisingly be "the app." What is implied but not stated is that, traditionally, you are observing the entire deployment of the app in aggregate: an app that is potentially running on millions of different mobile devices. The data provided by observability tooling that typically gets scrutinized consists of aggregates of telemetry from those devices: the total number of crashes, the P99 of cold app startup time, etc. By looking at aggregates, you are observing the big picture of how your app is running in production, an abstraction that provides a high-level overview of app quality. What you don't get from aggregates is how individual users experience the app, or the sequence of events that leads to specific anomalies that are hard to reproduce in-house.

Observing "The Users"

A second reading of the question is more nuanced: you are observing the users of your app, specifically what is happening in the app while people are using it. This is where sessions come in, providing telemetry collected on one device for a period of time, laid out so you can see what is happening in the app in sequence. This is how you can find correlations between events in an ad hoc fashion, which is tremendously helpful for debugging difficult-to-reproduce problems. Mobile observability providers use sessions as a key selling point of their RUM products. Sessions combine mobile telemetry with the runtime context, including event sequencing, and viewed together, they can explain performance anomalies. Seeing the events that preceded an app crash, along with the details of said crash, can really speed up the debugging of hard-to-reproduce issues. Providing better telemetry and more useful context in sessions has traditionally been one way mobile observability providers differentiate themselves.

Why Not Both?

Combining insights gleaned from both readings of the question can lead to powerful results. Not only can you drill into outliers and debug the cause of hard-to-reproduce problems, but you can also use the aggregates to tell you how many people are impacted by each problem. In addition, you can examine commonalities among affected users to uncover further clues for finding the root cause. Based on this telemetry, powerful datasets and visualizations can be built that reveal key details of mobile performance problems, as well as quantify their pervasiveness.
It can do that not only for the problems you know about but also for the ones you may not have anticipated. In other words, it can surface the unknown unknowns, which is the hallmark of good observability tooling. To varying degrees, most of the mobile observability platforms out there today can provide this level of insight.

Is This It?

So far, the status quo sounds great. If you can get all this from the current generation of mobile observability tooling, what more can you ask for? Before I answer this very obviously leading question, I want to go back to the original question: what is being observed? And instead of simply asking that, I want to zoom out even further: why do you want to observe what you are trying to observe?

Why Are You Observing?

Asking about the what of mobile observability clarifies the types of questions you want the tooling to answer, but it doesn't get to the core of why you want those questions answered. That is, what are you going to do when you get those answers, and are they complete enough to give you the means to do what you want? Traditionally, mobile observability tooling is used to monitor crashes, ANRs, and other performance problems so that they can be fixed in future releases. Mobile developers and other users of the tooling not only want to know how frequently these problems occur, but they also want enough information to help them find the root causes. Knowing is only half the battle: if the tooling doesn't provide enough debugging information, it is next to useless. In other words, performance problems are the what, while finding the cause and ultimately fixing the issues are the why.

The Limitations of Aggregates

Traditional backend observability data is usually first looked at in aggregate, and the same is true for mobile: how many times has a particular crash occurred, what is the P99 app startup time, and so on. Existing issues are ranked according to their perceived severity, and the order they are worked on, and whether they are worked on at all, is largely based on that. The higher the severity, the higher the priority. And how is the severity of a performance problem determined? This usually comes down to a combination of how frequently a problem occurs and "how bad" the problem is when it occurs. Aggregates like frequency and regression rates provide the baseline data for this assessment, but those numbers are filtered through the lens of the people doing the prioritizing, through their experience and understanding of the app, in order for the severity to be worked out. Using aggregates alone to determine severity is difficult, even for knowledgeable people, because it's missing one key puzzle piece: how users are individually impacted when they encounter a particular performance problem. Knowing that the P99 app startup time is 30% slower won't tell you the increased level of frustration experienced by those who were impacted by the extra delay. That is because individual users are nowhere to be found when you look at aggregates like P99. Aggregates treat an app as a single system, not as the millions of individual systems that it actually is, each running on a different device with an individual user behind it who experiences the app and its performance problems in their own unique way. While you know the increase in the absolute time it took for the app to start, how can you properly and objectively assess the impact of that regression if you can't quantify how it has affected the way those users use your app?
For some, it may just mean waiting a little longer for the loading screen to disappear; for others, it may be the straw that broke the camel's back, and they won't use your app again. Determining how and whether a performance issue affects future app usage is the key to understanding impact, and aggregates aren't designed to give you that kind of insight.

The Limitations of Sessions

In the traditional backend observability space, users are represented in telemetry as a high-cardinality attribute, if they are represented at all. This is because the utility of knowing the specific users making requests is limited for backend performance tracking. There are often other factors that are more directly relevant, and high-cardinality attributes are not generally useful for aggregation. The main use case for tracking users explicitly in backend data is the potential to link them to your mobile data. This linkage provides additional attributes that can then be associated with the request that led to slow backend traces. For example, you can add context that may be too expensive to track directly in the backend, like the specific payload blobs for the request, but that is easily collectible on the client.

For mobile observability, tracking users explicitly is of paramount importance. In this space, platforms and vendors recognize that modeling a user's experience is essential, because knowing the totality and sequencing of the activities around the time a user experiences performance problems is key for debugging. By grouping temporally related events for a user and presenting them in chronologically sorted order, they have created what has become de rigueur in mobile observability: the user session. Presenting telemetry this way allows mobile developers to spot patterns and provide explanations as to why performance problems occur. This is especially useful for difficult-to-reproduce problems that may not be apparent if you simply look at aggregates. Sometimes it's not obvious that a particular crash happens right after the device loses network connectivity, not unless you look at a user's telemetry laid out in sequential order. This is the power of user sessions, and why they have become table stakes for mobile observability.

But there is still a gap: user sessions are but a slice of time in the journey a user takes with a mobile app. An implicit assumption when looking at a session is that things happening within it will only impact other things that happen within the same session. If you zoom out a bit to consider multiple sequential sessions for the same user, you can start getting more context (e.g., a crash in a previous session on a particular screen always leads to the next app startup being really slow). But the utility of this technique starts to fray as you consider more and more sessions for a user. It gets increasingly harder to find a direct causal link between events the farther apart they are. While looking at session timelines is useful for debugging specific performance problems from the perspective of a representative user, it is difficult to predict any long-term impact those problems might have on the user and how they use your app. Perhaps even more difficult is drawing any conclusions about the broader impact of performance problems on your app and the company's key metrics, like revenue and DAU. In other words, sessions are useful for debugging performance problems, not for assessing their long-term impact.
Putting the "User" Ahead of "Sessions"

If sessions alone are not sufficient to assess the long-term impact of performance problems on key company metrics, what is still missing? In short, it requires a fundamental change: centering your observability practices around understanding user behavior in the long run, particularly when users' perception of the app's performance changes. This will involve aggregating data in novel ways that are not often seen in mobile observability. To do this, you must first track the behavior of users throughout their lifetime using the app. Specifically, you need to look at the behavior of users after they encounter a performance problem and compare that to their behavior before they encountered it. You can also group users into cohorts: similar users who were impacted vs. similar users who weren't. By observing the difference in other behaviors, you may begin to see correlations between performance issues and negative user trends. And if you're lucky, some of those correlational relationships may turn out to be causal, which would allow you to determine the impact more directly through further analysis and experimentation. In other words, rather than simply looking at the impact on counts of crashes that may or may not matter, you can look at how churn and conversion rates for your app are affected, which definitely matter. How you can do that is the subject of an entirely different post, but suffice it to say, you can't even begin to do this type of analysis until you start aggregating mobile telemetry in the correct way: through the lens of the aggregate behavior of different user cohorts. And before you can do that, you need to start collecting and annotating telemetry in a way that allows that level of aggregation, provided that your tooling supports this.

This is to say, the question you should ask yourself is this: is your mobile observability telemetry conducive to being broken down by user cohorts, linked together with other datasets to give a full-stack view of app performance from the user's perspective, and analyzed to show the overall engagement of those users in the long run? If the answer is yes, then you have all the ingredients you need to fully leverage mobile observability beyond just looking at sessions for crash debugging.
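As a rough, vendor-neutral illustration of the cohort comparison described above, here is a short Python sketch. It assumes a hypothetical per-user export with a flag for whether the user hit a given performance problem and their session counts before and after; the numbers are made up:

```python
import pandas as pd

# Hypothetical per-user export: one row per user, with a flag for whether the user
# encountered a given performance problem and their session counts before/after.
users = pd.DataFrame({
    "user_id":         ["u1", "u2", "u3", "u4", "u5", "u6"],
    "hit_problem":     [True, True, True, False, False, False],
    "sessions_before": [12, 8, 15, 10, 9, 14],
    "sessions_after":  [3, 1, 6, 11, 8, 13],
})

# Compare how engagement changed for the impacted cohort vs. the unaffected cohort.
users["engagement_change"] = users["sessions_after"] / users["sessions_before"]
summary = users.groupby("hit_problem")["engagement_change"].agg(["mean", "count"])
print(summary)

# A large gap between cohorts is only a correlation, but it points to where deeper
# analysis or experimentation is worth the effort before treating it as causal.
```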
Hey internet humans! I've recently re-entered the world of observability and monitoring after a short detour in the Internal Developer Portal space. Since my return, I have felt a strong urge to discuss the generally sad state of observability in the market today. I still have a strong memory of myself knee-deep in Kubernetes configs, drowning in a sea of technical jargon, not clearly knowing if I'd actually monitored everything in my stack, deploying heavy agents, and fighting with engineering managers and devs just to get their code instrumented, only to find out I didn't have half the stuff I thought I did. Sound familiar? Most of us have been there. It's like trying to find your way out of a maze with a blindfold on while someone repeatedly spins you around and points you the wrong way. Not exactly a walk in the park, right?

The three pain points that are top of mind for me these days are:

- The state of instrumentation for observability
- The horrible surprise bills vendors are springing on customers, and the insanely confusing pricing models that can't even be calculated
- Ownership and storage of data: data residency issues, compliance, and control

Instrumentation

The monitoring community has a fantastic new tool at its disposal: eBPF. Ever heard of it? It's a game-changing tech (a cheat code, if you will, to get around that horrible manual instrumentation) that allows us to trace what's going on in our systems without all the usual headaches. No complex setups, no intrusive instrumentation, just clear, detailed insights into our app's performance. With eBPF, we can dive deep into the inner workings of applications and infrastructure, capturing data at the kernel level with minimal overhead. It's like having X-ray vision for our software stack without the pain of having to corral all of the engineers to instrument the code manually.

I've had first-hand experience deploying monitoring solutions at scale during my tenure at companies like Datadog, Splunk, and, before microservices were cool, CA Technologies. I've seen the patchwork of APM, infrastructure, logs, OpenTelemetry, custom instrumentation, open source, etc. that is often stitched together (usually poorly) just to get at the basics. Each of these usually comes at a high technical maintenance cost and requires SREs, platform engineers, developers, DevOps, etc. to coordinate (also usually ineffectively) to instrument code, deploy everywhere they're aware of, and cross their fingers hoping they're going to get most of what should be monitored. At this point, two things happen:

- Not everything is monitored, because we have no idea where everything is. We end up with far less than 100% coverage.
- We start having those cringe-worthy discussions on "should we monitor this thing" due to the sheer cost of monitoring, which often exceeds the cost of the infrastructure our applications and microservices are running on.

Let's be clear: this isn't a conversation we should be having. Indeed, OpenTelemetry is fantastic for a number of things: it solves vendor lock-in and has a much larger community working on it. But I must be brutally honest here: it takes A LOT OF WORK. It takes real collaboration between all of the teams to make sure everyone is instrumenting manually and that every single library we use is well supported, assuming we can properly validate that the legacy code we're trying to instrument actually contains what we think it does.
From my observations, this generally results in an incomplete patchwork of tools, giving us a very incomplete picture 95% of the time. Circling back to eBPF technology: with proper deployment and some secret sauce, these are two core concerns we simply don't have to worry about, as long as there's a simplified pricing model in place. We can get full 360-degree visibility into our environments with tracing, metrics, and logs without the hassle and without wondering if we can really afford to see everything.

The Elephant in the Room: Cost and the Awful State of Pricing in the Observability Market Today

If only I had a penny for every time I've heard the saying, "I need an observability tool to monitor the cost of my observability tool." Traditional monitoring tools often come with a hefty price tag attached, and often one that's a big fat surprise when we add a metric or a log line, especially at scale! It's not just about the initial investment: it's the unexpected overage bills that really sting. You see, these tools typically charge based on the volume of data ingested, and it's easy to underestimate just how quickly those costs can add up. We've all been there before: monitoring a Kubernetes cluster with hundreds of pods, each generating logs, traces, and metrics. Before we know it, we're facing a mountain of data and a surprise sky-high bill to match. Or perhaps we decide we need a new facet on a metric and get hit with a massive charge for metric cardinality. Or maybe a dev decides it's a great idea to add an additional log line to a high-volume application, and our log bill grows exponentially overnight. It's a tough pill to swallow, especially when we're trying to balance the need for comprehensive, complete monitoring with budget constraints.

I've seen customers receive tens of thousands of dollars (sometimes hundreds of thousands) in "overage" bills because a developer added a few extra log lines or because someone needed additional cardinality in a metric. Those costs are very real for those very simple mistakes, and often there are no controls in place to keep them from happening. From my personal experience: I wish you the best of luck in trying to negotiate those bills down. You're stuck, as these companies have no interest in customers paying less once they get hit with those bills. As a customer-facing architect, I've had customers see red, and boy, that sucks. The ethics behind surprise pricing are dubious at best.

That's where a modern solution should step in to save the day. By flipping the script on traditional pricing models and offering transparent pricing that's based on usage, not volume, ingest, egress, or some unknown metric that you have no idea how to calculate, we should be able to get specific about the cost of monitoring and set clear expectations, knowing we can see everything end to end without sacrificing visibility because the cost may be too high. With eBPF and a bit of secret sauce, we'll never have to worry about surprise overage charges again. We can know exactly what we are paying for upfront, giving us peace of mind and control over our monitoring costs. And it's not just about cost, it's about value. We don't just want a monitoring tool; we want a partner in our quest for observability. We want a team and community dedicated to helping us get the most out of our monitoring setup, providing guidance and support every step of the way.
It must change from the impersonal, transactional approach of legacy vendors.

Ownership and Storage of Data

The next topic I'd like to touch on is the importance of data residency, compliance, and security in the realm of observability solutions. In today's business landscape, maintaining control over where and how data is stored and accessed is crucial. Various regulations, such as the GDPR (General Data Protection Regulation), require organizations to adhere to strict guidelines regarding data storage and privacy. Traditional cloud-based observability solutions may present challenges in meeting these compliance requirements, as they often store data on third-party servers dispersed across different regions. I've seen this happen, and I've seen customers take extraordinary steps to avoid going to the cloud while employing massive teams of in-house developers just to keep their data within their walls.

Opting for an observability solution that allows for on-premises data storage addresses these concerns effectively. By keeping monitoring data within the organization's data center, businesses gain greater control over its security and compliance. This approach minimizes the risk of unauthorized access or data breaches, thereby enhancing data security and simplifying compliance efforts. Additionally, it aligns with data residency requirements and regulations, providing assurance to stakeholders regarding data sovereignty and privacy. Moreover, choosing an observability solution with on-premises data storage can yield significant cost savings in the long term. By leveraging existing infrastructure and eliminating costly cloud storage and data transfer fees, organizations can optimize their operational expenses. Transparent pricing models further enhance cost efficiency by providing clarity and predictability, ensuring that organizations can budget effectively without encountering unexpected expenses.

On the other hand, relying on a Software-as-a-Service (SaaS) observability provider can introduce complexities, security risks, and other issues. With SaaS solutions, organizations relinquish control over data storage and management, placing sensitive information in the hands of third-party vendors. This increases the potential for security breaches and data privacy violations, especially when dealing with regulations like the GDPR. Additionally, dependence on external service providers can lead to vendor lock-in, making it challenging to migrate data or switch providers in the future. Moreover, fluctuations in pricing and service disruptions can disrupt operations and strain budgets, further complicating the observability landscape for organizations.

For organizations seeking to ensure compliance, enhance data security, and optimize costs, an observability solution that facilitates on-premises data storage offers a compelling option. By maintaining control over data residency and security while achieving cost efficiencies, businesses can focus on their core competencies and revenue-generating activities with confidence.
Tech teams do their best to develop amazing software products. They spend countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products. The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to:
Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. It includes considerations for system uptime, fault tolerance, and recovery mechanisms.
Performance: The system should function within acceptable speed and resource usage parameters. It scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions.
Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior.
Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions.
Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. It includes mechanisms for data backup, recovery, and rollback.
Maintainability: The system should be easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements.
This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted with their pros and cons. A distinction between software and hardware failure models follows. Finally, reliability growth models are explored, including a list of factors to consider when choosing the right model. Mean Time Metrics Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time between failures), MTTR (mean time to recovery, repair, respond, or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents. The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties.
Mean Time To Repair: This focuses on the time it takes to fix a failed component.
Mean Time To Recovery: This considers the time to restore full functionality after a failure.
Mean Time To Respond: This emphasizes the initial response time to acknowledge and investigate an incident.
Mean Time To Resolve: This encompasses the entire incident resolution process, including diagnosis, repair, and recovery.
While these metrics overlap, each provides a distinct perspective on how quickly a team resolves incidents. MTTA, or Mean Time To Acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness. 
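To make these mean time definitions concrete, here is a minimal sketch (my own illustration, not taken from the article) showing how a team might compute MTTA and two of the MTTR flavors from a handful of incident records. The field names and sample timestamps are hypothetical.
Python
# Minimal sketch: computing MTTA and MTTR-style metrics from incident records.
# All field names and sample data below are hypothetical.
from datetime import datetime, timedelta

incidents = [
    {"fired": datetime(2024, 5, 1, 10, 0), "acked": datetime(2024, 5, 1, 10, 4),
     "restored": datetime(2024, 5, 1, 10, 35), "resolved": datetime(2024, 5, 1, 11, 10)},
    {"fired": datetime(2024, 5, 3, 22, 15), "acked": datetime(2024, 5, 3, 22, 21),
     "restored": datetime(2024, 5, 3, 23, 5), "resolved": datetime(2024, 5, 4, 0, 0)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    total = sum(deltas, timedelta())
    return total.total_seconds() / 60 / len(deltas)

mtta = mean_minutes([i["acked"] - i["fired"] for i in incidents])              # mean time to acknowledge
mttr_recovery = mean_minutes([i["restored"] - i["fired"] for i in incidents])  # mean time to recovery
mttr_resolve = mean_minutes([i["resolved"] - i["fired"] for i in incidents])   # mean time to resolve

print(f"MTTA: {mtta:.1f} min, recovery: {mttr_recovery:.1f} min, resolve: {mttr_resolve:.1f} min")
The same pattern extends to MTBF and MTTF once you also record the uptime between failures; tracking these numbers over rolling windows, rather than as one-off snapshots, is usually what makes them actionable.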
MTBF, or Mean Time Between Failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, resource allocation, and predicting system uptime. For a system that cannot or should not be repaired, MTTF, or Mean Time To Failure, represents the average time that the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing. This makes MTTF particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity. An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans.
MTBF: This would represent the average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road.
MTTF: This would represent the average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable or not.
Key Differentiators
Feature | MTBF | MTTF
Repairable System | Yes | No
Repair Time | Considered in the calculation | Not considered in the calculation
Failure Focus | Time between subsequent failures | Time to the first failure
Application | Planning maintenance, resource allocation | Assessing inherent system reliability
The Bigger Picture MTTR, MTTA, MTTF, and MTBF can also be used together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues. Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system. Probability Distributions for Reliability The following probability distributions are commonly used in reliability engineering to model the time until the failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time. Exponential Distribution Model This model assumes a constant failure rate over time. This means that the probability of a component failing is independent of its age or how long it has been operating. Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited. Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the Exponential Distribution Model wouldn't capture. 
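To make the constant-failure-rate assumption concrete, here is a minimal sketch (with a made-up MTBF value) of the exponential reliability function R(t) = e^(-λt), taking the failure rate λ as 1/MTBF. It is an illustration of the model described above, not a fitted analysis.
Python
# Minimal sketch of the exponential reliability model: R(t) = exp(-lambda * t),
# where the failure rate lambda is assumed constant and estimated as 1 / MTBF.
# The MTBF value below is a hypothetical example.
import math

mtbf_hours = 1_000.0             # assumed observed MTBF for a component
failure_rate = 1.0 / mtbf_hours  # constant failure rate (failures per hour)

def reliability(t_hours: float) -> float:
    """Probability that the component survives t_hours without failing."""
    return math.exp(-failure_rate * t_hours)

for t in (100, 500, 1000, 2000):
    print(f"R({t} h) = {reliability(t):.3f}")
# Note that R(MTBF) is about 0.368: under this model, roughly 63% of units
# are expected to fail before reaching their MTBF.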
Weibull Distribution Model This model offers more flexibility by allowing dynamic failure rates. It can model situations where the probability of failure increases over time at an early stage (infant mortality failures) or at a later stage (wear-out failures). Infant mortality failures: This could represent new components with manufacturing defects that are more likely to fail early on. Wear-out failures: This could represent components like mechanical parts that degrade with use and become more likely to fail as they age. Applications: The Weibull Distribution Model is more versatile than the Exponential Distribution Model. It's a good choice for analyzing a wider range of hardware components with varying failure patterns. Limitations: The Weibull Distribution Model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the Exponential Distribution would suffice. The Software vs Hardware Distinction The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors: 1. Root Cause of Failures Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects. Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time. 2. Failure Patterns Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out. Software: The behavior of software failures in time can be very tricky and usually depends on the evolution of our code, among others. A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running. 3. Failure Prediction, Prevention, Repairs Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components. Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement. 4. Interdependence and Interaction Failures The level of interdependence between software and hardware varies for different systems, domains, and applications. 
Tight coupling between software and hardware may cause interaction failures. There can be software failures due to hardware and vice-versa. Here's a table summarizing the key differences:
Feature | Hardware Reliability Models | Software Reliability Models
Root Cause of Failures | Physical Degradation, Defects, Environmental Factors | Code Defects, Design Flaws, External Dependencies
Failure Patterns | Time-Dependent (Infant Mortality, Wear-Out) | Non-Time Dependent (Bugs Remain Until Fixed)
Prediction Focus | Average Times Between Failures (MTBF, MTTF) | Number of Remaining Defects
Prevention Strategies | Preventive Maintenance Schedules | Code Review, Testing, Bug Fixes
By understanding the distinct characteristics of hardware and software failures, we may be able to leverage tailored reliability models, whenever necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems. Code Complexity Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity:
SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics
Fortify: Provides static code analysis for security vulnerabilities and code complexity
CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases
PMD: An open-source tool for identifying common coding flaws and complexity metrics
Defect Density Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product. Reliability Growth Models Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal. Some models exhibit characteristics of exponential growth. Other models exhibit characteristics of power law growth while there exist models that exhibit both exponential and power law growth. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults. While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help for further study. Traditional growth models encompass the commonly used and foundational models, while the Bayesian approach represents a distinct methodology. The advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative and not exhaustive. 
Traditional Growth Models Musa-Okumoto Model It assumes a logarithmic Poisson process for fault detection and removal, where the number of failures observed over time follows a logarithmic function of the number of initial faults. Jelinski-Moranda Model It assumes a constant failure intensity over time and is based on the concept of error seeding. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system. Goel-Okumoto Model It incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection. Non-Homogeneous Poisson Process (NHPP) Models They assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time. Bayesian Approach Wall and Ferguson Model It combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth. Advanced Growth Models Duane Model This model assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate and it reflects how quickly the reliability of the system is improving as testing and debugging occur. Coutinho Model Based on the Duane model, it extends to the idea of an instantaneous failure rate. This rate involves the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth. Gooitzen Model It incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing. This model provides a more realistic representation of the fault detection and removal process by accounting for imperfect debugging. Littlewood Model It acknowledges that as system failures are discovered during testing, the underlying faults causing these failures are repaired. Consequently, the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth when a software repair introduces further errors. Rayleigh Model The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase. It provides an estimation of the number of defects that will occur in the future based on the observed data. Choosing the Right Model There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider. Specific objectives: Determine the specific objectives and goals of reliability growth analysis. Whether the goal is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes. Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms. Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems. Development stage: Consider the stage of development the system is in. 
Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors. Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable. Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models may require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders. Assumptions and limitations: Understand the underlying assumptions and limitations of each reliability growth model. Choose a model whose assumptions align with the characteristics of the system and the available data. Predictive capability: Assess the predictive capability of the model in accurately forecasting future reliability levels based on past data. Flexibility and adaptability: Consider the flexibility and adaptability of the model to different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts. Resource requirements: Evaluate the resource requirements associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the available resources and capabilities of the organization. Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models. Models that have been validated and verified against real-world data are more trustworthy and reliable. Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of reliability growth model. Certain industries may have specific guidelines or recommendations for reliability analysis that need to be adhered to. Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved. Wrapping Up Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health. The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with your system's characteristics, you can create a comprehensive reliability assessment plan. This tailored approach may allow us to identify potential weaknesses and prioritize improvement efforts.
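Before moving on, here is a small, self-contained sketch of what one of the growth models discussed above looks like in practice. It plots the Goel-Okumoto-style mean value function m(t) = a(1 - e^(-bt)); the parameter values are illustrative assumptions of mine, not fitted from real data.
Python
# Minimal sketch of a Goel-Okumoto-style reliability growth curve.
# m(t) = a * (1 - exp(-b * t)) is the expected cumulative number of defects
# found after t units of testing time; 'a' is the total expected defect count
# and 'b' the detection rate. Both values here are made-up assumptions.
import math

a = 120.0   # assumed total number of latent defects
b = 0.05    # assumed defect detection rate per testing day

def expected_defects_found(t_days: float) -> float:
    return a * (1.0 - math.exp(-b * t_days))

for t in (10, 30, 60, 90):
    found = expected_defects_found(t)
    remaining = a - found
    print(f"day {t:>3}: ~{found:5.1f} defects found, ~{remaining:5.1f} expected to remain")
In real use, a and b would be estimated from the defect discovery data collected during testing, and the fitted curve would then guide how much additional testing effort is worthwhile.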
If your system is facing an imminent security threat—or worse, you’ve just suffered a breach—then logs are your go-to. If you’re a security engineer working closely with developers and the DevOps team, you already know that you depend on logs for threat investigation and incident response. Logs offer a detailed account of system activities. Analyzing those logs helps you fortify your digital defenses against emerging risks before they escalate into full-blown incidents. At the same time, your logs are your digital footprints, vital for compliance and auditing. Your logs contain a massive amount of data about your systems (and hence your security), and that leads to some serious questions: How do you handle the complexity of standardizing and analyzing such large volumes of data? How do you get the most out of your log data so that you can strengthen your security? How do you know what to log? How much is too much? Recently, I’ve been trying to use tools and services to get a handle on my logs. In this post, I’ll look at some best practices for using these tools—how they can help with security and identifying threats. And finally, I’ll look at how artificial intelligence may play a role in your log analysis. How To Identify Security Threats Through Logs Logs are essential for the early identification of security threats. Here’s how: Identifying and Mitigating Threats Logs are a gold mine of streaming, real-time analytics, and crucial information that your team can use to its advantage. With dashboards, visualizations, metrics, and alerts set up to monitor your logs you can effectively identify and mitigate threats. In practice, I’ve used both Sumo Logic and the ELK stack (a combination of Elasticsearch, Kibana, Beats, and Logstash). These tools can help your security practice by allowing you to: Establish a baseline of behavior and quickly identify anomalies in service or application behavior. Look for things like unusual access times, spikes in data access, or logins from unexpected areas of the world. Monitor access to your systems for unexpected connections. Watch for frequent and unusual access to critical resources. Watch for unusual outbound traffic that might signal data exfiltration. Watch for specific types of attacks, such as SQL injection or DDoS. For example, I monitor how rate-limiting deals with a burst of requests from the same device or IP using Sumo Logic’s Cloud Infrastructure Security. Watch for changes to highly critical files. Is someone tampering with config files? Create and monitor audit trails of user activity. This forensic information can help you to trace what happened with suspicious—or malicious—activities. Closely monitor authentication/authorization logs for frequent failed attempts. Cross-reference logs to watch for complex, cross-system attacks, such as supply chain attacks or man-in-the-middle (MiTM) attacks. Using a Sumo Logic dashboard of logs, metrics, and traces to track down security threats It’s also best practice to set up alerts to see issues early, giving you the lead time needed to deal with any threat. The best tools are also infrastructure agnostic and can be run on any number of hosting environments. Insights for Future Security Measures Logs help you with more than just looking into the past to figure out what happened. They also help you prepare for the future. Insights from log data can help your team craft its security strategies for the future. 
Benchmark your logs against your industry to help identify gaps that may cause issues in the future. Hunt through your logs for signs of subtle IOCs (indicators of compromise). Identify rules and behaviors that you can use against your logs to respond in real-time to any new threats. Use predictive modeling to anticipate future attack vectors based on current trends. Detect outliers in your datasets to surface suspicious activities What to Log. . . And How Much to Log So we know we need to use logs to identify threats both present and future. But to be the most effective, what should we log? The short answer is—everything! You want to capture everything you can, all the time. When you’re first getting started, it may be tempting to try to triage logs, guessing as to what is important to keep and what isn’t. But logging all events as they happen and putting them in the right repository for analysis later is often your best bet. In terms of log data, more is almost always better. But of course, this presents challenges. Who’s Going To Pay for All These Logs? When you retain all those logs, it can be very expensive. And it’s stressful to think about how much money it will cost to store all of this data when you just throw it in an S3 bucket for review later. For example, on AWS a daily log data ingest of 100GB/day with the ELK stack could create an annual cost of hundreds of thousands of dollars. This often leads to developers “self-selecting” what they think is — and isn’t — important to log. Your first option is to be smart and proactive in managing your logs. This can work for tools such as the ELK stack, as long as you follow some basic rules: Prioritize logs by classification: Figure out which logs are the most important, classify them as such, and then be more verbose with those logs. Rotate logs: Figure out how long you typically need logs and then rotate them off servers. You probably only need debug logs for a matter of weeks, but access logs for much longer. Log sampling: Only log a sampling of high-volume services. For example, log just a percentage of access requests but log all error messages. Filter logs: Pre-process all logs to remove unnecessary information, condensing their size before storing them. Alert-based logging: Configure alerts based on triggers or events that subsequently turn logging on or make your logging more verbose. Use tier-based storage: Store more recent logs on faster, more expensive storage. Move older logs to cheaper, slow storage. For example, you can archive old logs to Amazon S3. These are great steps, but unfortunately, they can involve a lot of work and a lot of guesswork. You often don’t know what you need from the logs until after the fact. A second option is to use a tool or service that offers flat-rate pricing; for example, Sumo Logic’s $0 ingest. With this type of service, you can stream all of your logs without worrying about overwhelming ingest costs. Instead of a per-GB-ingested type of billing, this plan bills based on the valuable analytics and insights you derive from that data. You can log everything and pay just for what you need to get out of your logs. In other words, you are free to log it all! Looking Forward: The Role of AI in Automating Log Analysis The right tool or service, of course, can help you make sense of all this data. And the best of these tools work pretty well. The obvious new tool to help you make sense of all this data is AI. 
With data that is formatted predictably, we can apply classification algorithms and other machine-learning techniques to find out exactly what we want to know about our application. AI can: Automate repetitive tasks like data cleaning and pre-processing Perform automated anomaly detection to alert on abnormal behaviors Automatically identify issues and anomalies faster and more consistently by learning from historical log data Identify complex patterns quickly Use large amounts of historical data to more accurately predict future security breaches Reduce alert fatigue by reducing false positives and false negatives Use natural language processing (NLP) to parse and understand logs Quickly integrate and parse logs from multiple, disparate systems for a more holistic view of potential attack vectors AI probably isn’t coming for your job, but it will probably make your job a whole lot easier. Conclusion Log data is one of the most valuable and available means to ensure your applications’ security and operations. It can help guard against both current and future attacks. And for log data to be of the most use, you should log as much information as you can. The last problem you want during a security crisis is to find out you didn’t log the information you need.
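As a concrete (and intentionally simplified) illustration of the log-analysis ideas above, the sketch below scans authentication log lines and flags source IPs with an unusual number of failed logins in a short window. The log format, threshold, and sample lines are assumptions for the example and do not reference any particular logging tool.
Python
# Minimal sketch: flag IPs with many failed logins inside a sliding window.
# The log format (timestamp, result, ip) and the threshold are assumptions.
import re
from collections import defaultdict
from datetime import datetime, timedelta

LOG_PATTERN = re.compile(r"^(?P<ts>\S+ \S+) auth (?P<result>FAILED|OK) ip=(?P<ip>\S+)")
WINDOW = timedelta(minutes=5)
THRESHOLD = 10  # failed attempts per window that we treat as suspicious

sample_lines = [
    "2024-05-01 10:00:01 auth FAILED ip=203.0.113.7",
    "2024-05-01 10:00:03 auth FAILED ip=203.0.113.7",
    "2024-05-01 10:01:14 auth OK ip=198.51.100.2",
    # in practice these lines would be streamed from your log pipeline
]

failures = defaultdict(list)  # ip -> timestamps of recent failed attempts

def check_line(line: str):
    match = LOG_PATTERN.match(line)
    if not match or match["result"] != "FAILED":
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
    ip = match["ip"]
    # keep only failures still inside the window, then add the new one
    window = failures[ip] = [t for t in failures[ip] if ts - t <= WINDOW]
    window.append(ts)
    if len(window) >= THRESHOLD:
        return f"ALERT: {len(window)} failed logins from {ip} within {WINDOW}"
    return None

for line in sample_lines:
    alert = check_line(line)
    if alert:
        print(alert)
A real deployment would push this kind of rule into your log platform's alerting layer rather than a script, but the underlying logic, counting suspicious events per source over a time window, is the same.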
The world of Telecom is evolving at a rapid pace, and it is not just important, but crucial for operators to stay ahead of the game. As 5G technology becomes the norm, it is not just essential, but a strategic imperative to transition seamlessly from 4G technology (which operates on OpenStack cloud) to 5G technology (which uses Kubernetes). In the current scenario, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. However, with the upcoming 5G world, operators can adopt a unified monitoring and alert system for all their products. This single system, with its ability to monitor network equipment, customer devices, and service platforms, offers a reassuringly holistic view of the entire system, thereby reducing complexity and enhancing efficiency. By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and enhance customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep. Prometheus, Grafana, and Alert Manager Prometheus is a tool for monitoring and alerting systems, utilizing a pull-based monitoring system. It scrapes, collects, and stores Key Performance Indicators (KPI) with labels and timestamps, enabling it to collect metrics from targets, which are the Network Functions' namespaces in the 5G telecom world. Grafana is a dynamic web application that offers a wide range of functionalities. It visualizes data, allowing the building of charts, graphs, and dashboards that the 5G Telecom operator wants to visualize. Its primary feature is the display of multiple graphing and dashboarding support modes using GUI (Graphical user interface). Grafana can seamlessly integrate data collected by Prometheus, making it an indispensable tool for telecom operators. It is a powerful web application that supports the integration of different data sources into one dashboard, enabling continuous monitoring. This versatility improves response rates by alerting the telecom operator's team when an incident emerges, ensuring a minimum 5G network function downtime. The Alert Manager is a crucial component that manages alerts from the Prometheus server via alerting rules. It manages the received alerts, including silencing and inhibiting them and sending out notifications via email or chat. The Alert Manager also removes duplications, grouping, and routing them to the centralized webhook receiver, making it a must-have tool for any telecom operator. Architectural Diagram Prometheus Components of Prometheus (Specific to a 5G Telecom Operator) Core component: Prometheus server scrapes HTTP endpoints and stores data (time series). The Prometheus server, a crucial component in the 5G telecom world, collects metrics from the Prometheus targets. In our context, these targets are the Kubernetes cluster that houses the 5G network functions. Time series database (TSDB): Prometheus stores telecom Metrics as time series data. HTTP Server: API to query data stored in TSDB; The Grafana dashboard can query this data for visualization. Telecom operator-specific libraries (5G) for instrumenting application code. 
Push gateway (scrape target for short-lived jobs) Service Discovery: In the world of 5G, network function pods are constantly being added or deleted by Telecom operators to scale up or down. Prometheus's adaptable service discovery component monitors the ever-changing list of pods. The Prometheus Web UI, accessible through port 9090, is a data visualization tool. It allows users to view and analyze Prometheus data in a user-friendly and interactive manner, enhancing the monitoring capabilities of the 5G telecom operators. The Alert Manager, a key component of Prometheus, is responsible for handling alerts. It is designed to notify users if something goes wrong, triggering notifications when certain conditions are met. When alerting triggers are met, Prometheus alerts the Alert Manager, which sends alerts through various channels such as email or messenger, ensuring timely and effective communication of critical issues. Grafana for dashboard visualization (actual graphs) With Prometheus's robust components, your Telecom operator's 5G network functions are monitored with diligence, ensuring reliable resource utilization, tracking performance, detection of errors in availability, and more. Prometheus can provide you with the necessary tools to keep your network running smoothly and efficiently. Prometheus Features The multi-dimensional data model identified by metric details uses PromQL (Prometheus Querying Language) as the query language and the HTTP Pull model. Telecom operators can now discover 5G network functions with service discovery and static configuration. The multiple modes of dashboard and GUI support provide a comprehensive and customizable experience for users. Prometheus Remote Write to Central Prometheus from Network Functions 5G Operators will have multiple network functions from various vendors, such as SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Using multiple Prometheus/Grafana dashboards for each network function can lead to a complex and inefficient 5G network operator monitoring process. To address this, it is highly recommended that all data/metrics from individual Prometheus be consolidated into a single Central Prometheus, simplifying the monitoring process and enhancing efficiency. The 5G network operator can now confidently monitor all the data at the Central Prometheus's centralized location. This user-friendly interface provides a comprehensive view of the network's performance, empowering the operator with the necessary tools for efficient monitoring. Grafana Grafana Features Panels: This powerful feature empowers operators to visualize Telecom 5G data in many ways, including histograms, graphs, maps, and KPIs. It offers a versatile and adaptable interface for data representation, enhancing the efficiency and effectiveness of your data analysis. Plugins: This feature efficiently renders Telecom 5G data in real-time on a user-friendly API (Application Programming Interface), ensuring operators always have the most accurate and up-to-date data at their fingertips. It also enables operators to create data source plugins and retrieve metrics from any API. Transformations: This feature allows you to flexibly adapt, summarize, combine, and perform KPI metrics query/calculations across 5G network functions data sources, providing the tools to effectively manipulate and analyze your data. 
Annotations: Rich events from different Telecom 5G network functions data sources are used to annotate metrics-based graphs. Panel editor: Reliable and consistent graphical user interface for configuring and customizing 5G telecom metrics panels Grafana Sample Dashboard GUI for 5G Alert Manager Alert Manager Components The Ingester swiftly ingests all alerts, while the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you're not bombarded with notifications. The Silencer is there to mute alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier will ensure that third parties are notified promptly. Alert Manager Functionalities Grouping: Grouping categorizes similar alerts into a single notification system. This is helpful during more extensive outages when many 5G network functions fail simultaneously and when all the alerts need to fire simultaneously. The telecom operator will expect only to get a single page while still being able to visualize the exact service instances affected. Inhibition: Inhibition suppresses the notification for specific low-priority alerts if certain major/critical alerts are already firing. For example, when a critical alert fires, indicating that an entire 5G SMF (Session Management Function) cluster is not reachable, AlertManager can mute all other minor/warning alerts concerning this cluster. Silences: Silences are simply mute alerts for a given time. Incoming alerts are checked to match the regular expression matches of an active silence. If they match, no notifications will be sent out for that alert. High availability: Telecom operators will not load balance traffic between Prometheus and all its Alert Managers; instead, they will point Prometheus to a list of all Alert Managers. Dashboard Visualization Grafana dashboard visualizes the Alert Manager webhook traffic notifications as shown below: Configuration YAMLs (Yet Another Markup Language) Telecom Operators can install and run Prometheus using the configuration below: YAML prometheus: enabled: true route: enabled: {} nameOverride: Prometheus tls: enabled: true certificatesSecret: backstage-prometheus-certs certFilename: tls.crt certKeyFilename: tls.key volumePermissions: enabled: true initdbScriptsSecret: backstage-prometheus-initdb prometheusSpec: retention: 3d replicas: 2 prometheusExternalLabelName: prometheus_cluster image: repository: <5G operator image repository for Prometheus> tag: <Version example v2.39.1> sha: "" podAntiAffinity: "hard" securityContext: null resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi serviceMonitorNamespaceSelector: matchExpressions: - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]} serviceMonitorSelectorNilUsesHelmValues: false podMonitorSelectorNilUsesHelmValues: false ruleSelectorNilUsesHelmValues: false Configuration to route scrape data segregated based on the namespace and route to Central Prometheus. Note: The below configuration can be appended to the Prometheus mentioned in the above installation YAML. 
YAML remoteWrite: - url: <Central Prometheus URL for namespace 1 by 5G operator> basicAuth: username: name: <secret username for namespace 1> key: username password: name: <secret password for namespace 1> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 1> action: keep - url: <Central Prometheus URL for namespace 2 by 5G operator> basicAuth: username: name: <secret username for namespace 2> key: username password: name: <secret password for namespace 2> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 2> action: keep Telecom Operators can install and run Grafana using the configuration below. YAML grafana: replicas: 2 affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app.kubernetes.io/name" operator: In values: - Grafana topologyKey: "kubernetes.io/hostname" securityContext: false rbac: pspEnabled: false # Must be disabled due to tenant permissions namespaced: true adminPassword: admin image: repository: <artifactory>/Grafana tag: <version> sha: "" pullPolicy: IfNotPresent persistence: enabled: false initChownData: enabled: false sidecar: image: repository: <artifactory>/k8s-sidecar tag: <version> sha: "" imagePullPolicy: IfNotPresent resources: limits: cpu: 100m memory: 100Mi requests: cpu: 50m memory: 50Mi dashboards: enabled: true label: grafana_dashboard labelValue: "Vendor name" datasources: enabled: true defaultDatasourceEnabled: false additionalDataSources: - name: Prometheus type: Prometheus url: http://<prometheus-operated>:9090 access: proxy isDefault: true jsonData: timeInterval: 30s resources: limits: cpu: 400m memory: 512Mi requests: cpu: 50m memory: 206Mi extraContainers: - name: oauth-proxy image: <artifactory>/origin-oauth-proxy:<version> imagePullPolicy: IfNotPresent ports: - name: proxy-web containerPort: 4181 args: - --https-address=:4181 - --provider=openshift # Service account name here must be "<Helm Release name>-grafana" - --openshift-service-account=monitoring-grafana - --upstream=http://localhost:3000 - --tls-cert=/etc/tls/private/tls.crt - --tls-key=/etc/tls/private/tls.key - --cookie-secret=SECRET - --pass-basic-auth=false resources: limits: cpu: 100m memory: 256Mi requests: cpu: 50m memory: 128Mi volumeMounts: - mountPath: /etc/tls/private name: grafana-tls extraContainerVolumes: - name: grafana-tls secret: secretName: grafana-tls serviceAccount: annotations: "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana] service: targetPort: 4181 annotations: service.alpha.openshift.io/serving-cert-secret-name: <secret> Telecom Operators can install and run Alert Manager using the configuration below. YAML alertmanager: enabled: true alertmanagerSpec: image: repository: prometheus/alertmanager tag: <version> replicas: 2 podAntiAffinity: hard securityContext: null resources: requests: cpu: 25m memory: 200Mi limits: cpu: 100m memory: 400Mi containers: - name: config-reloader resources: requests: cpu: 10m memory: 10Mi limits: cpu: 25m memory: 50Mi Configuration to route Prometheus Alert Manager data to the Operator's centralized webhook receiver. Note: The below configuration can be appended to the Alert Manager mentioned in the above installation YAML. 
YAML
config:
  global:
    resolve_timeout: 5m
  route:
    group_by: ['alertname']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
    routes:
      - receiver: '<Network function 1>'
        group_wait: 10s
        group_interval: 10s
        group_by: ['alertname','oid','action','time','geid','ip']
        matchers:
          - namespace="<namespace 1>"
      - receiver: '<Network function 2>'
        group_wait: 10s
        group_interval: 10s
        group_by: ['alertname','oid','action','time','geid','ip']
        matchers:
          - namespace="<namespace 2>"
Conclusion The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G Telecom operators. Prometheus periodically captures the status of all monitored 5G Telecom network functions through the HTTP protocol, and any component can be connected to the monitoring as long as the 5G Telecom operator provides the corresponding HTTP interface. Prometheus and Grafana Agent give the 5G Telecom operator control over the metrics the operator wants to report; once the data is in Grafana, it can be stored in a Grafana database as extra data redundancy. In conclusion, Prometheus allows 5G Telecom operators to improve their operations and offer better customer service. Adopting a unified monitoring and alert system like Prometheus is one way to achieve this.
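As a final, hedged illustration (not part of the operator configuration above), the sketch below shows one way a team could sanity-check that remote-written metrics are actually arriving at the Central Prometheus by querying its standard HTTP API (GET /api/v1/query). The URL, credentials, and PromQL expression are placeholders.
Python
# Minimal sketch: verify that remote-written metrics are visible in the
# Central Prometheus via its HTTP API. URL, auth, and query are placeholders.
import json
import urllib.parse
import urllib.request

CENTRAL_PROMETHEUS = "https://central-prometheus.example.com"  # placeholder URL
QUERY = 'up{namespace="network-function-1"}'                   # placeholder PromQL

url = f"{CENTRAL_PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
request = urllib.request.Request(url)
# request.add_header("Authorization", "Basic ...")  # add basic auth if the endpoint requires it

with urllib.request.urlopen(request, timeout=10) as response:
    payload = json.load(response)

for result in payload.get("data", {}).get("result", []):
    labels = result.get("metric", {})
    value = result.get("value", [None, None])[1]
    print(f"{labels.get('instance', 'unknown instance')}: up={value}")
If the query returns one series per network function namespace, the remoteWrite relabeling shown earlier should be routing data to the central instance as intended.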
Joana Carvalho
Observability and Monitoring Specialist,
Sage
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone