Monitoring and Observability

Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.

Latest Refcards and Trend Reports

Trend Report: Performance and Site Reliability
Refcard #368: Getting Started With OpenTelemetry
Refcard #293: Getting Started With Prometheus

DZone's Featured Monitoring and Observability Resources

The Convergence of Testing and Observability

By Mirco Hering
This is an article from DZone's 2023 Automated Testing Trend Report. For more: Read the Report.

One of the core capabilities that has seen increased interest in the DevOps community is observability. Observability improves monitoring in several vital ways, making it easier and faster to understand business flows and allowing for enhanced issue resolution. Furthermore, observability goes beyond an operations capability and can be used for testing and quality assurance.

Testing has traditionally faced the challenge of identifying the appropriate testing scope. "How much testing is enough?" and "What should we test?" are questions each testing executive asks, and the answers have been elusive. There are fewer arguments about testing new functionality; while not trivial, you know the functionality you built in new features and hence can derive the proper testing scope from your understanding of the functional scope. But what else should you test? What is a comprehensive general regression testing suite, and what previous functionality will be impacted by the new functionality you have developed and will release? Observability can help us with this as well as the unavoidable defect investigation. But before we get to this, let's take a closer look at observability.

What Is Observability?

Observability is not monitoring with a different name. Monitoring is usually limited to observing a specific aspect of a resource, like disk space or memory of a compute instance. Monitoring one specific characteristic can be helpful in an operations context, but it usually only detects a subset of what is concerning. All monitoring can show is that the system looks okay, but users can still be experiencing significant outages.

Observability aims to make us see the state of the system by making data flows "observable." This means that we can identify when something starts to behave out of order and requires our attention. Observability combines logs, metrics, and traces from infrastructure and applications to gain insights. Ideally, it organizes these around workflows instead of system resources and, as such, creates a functional view of the system in use. Done correctly, it lets you see what functionality is being executed and how frequently, and it enables you to identify performance characteristics of the system and workflow.

Figure 1: Observability combines metrics, logs, and traces for insights

One benefit of observability is that it shows you the actual system. It is not biased by what the designers, architects, and engineers think should happen in production. It shows the unbiased flow of data. The users, over time (and sometimes from the very first day), find ways to use the system quite differently from what was designed. Observability makes such changes in behavior visible.

Observability is incredibly powerful in debugging system issues as it allows us to navigate the system to see where problems occur. Observability requires a dedicated setup and some contextual knowledge, similar to traceability. Traceability is the ability to follow a system transaction over time through all the different components of our application and infrastructure architecture, which means you have to have common information, like an ID, that enables this. OpenTelemetry is an open standard that can be used and provides useful guidance on how to set this up. Observability makes identifying production issues a lot easier. And we can use observability for our benefit in testing, too.
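As an aside for readers who have not used the OpenTelemetry setup mentioned above, here is a minimal, illustrative sketch of manual instrumentation with the OpenTelemetry Java API. The service class, operation name, and attribute are assumptions for illustration, not part of the original article:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutService {

    // Illustrative instrumentation scope name.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("shop.checkout");

    public void placeOrder(String orderId) {
        // The trace ID carried by this span is the "common information"
        // that lets tooling follow the transaction across components.
        Span span = tracer.spanBuilder("placeOrder").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... business logic; downstream calls instrumented with
            // OpenTelemetry join this trace automatically ...
        } finally {
            span.end();
        }
    }
}

In most real deployments, the auto-instrumentation agent creates spans like this for you; the sketch only shows where the shared trace ID comes from.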
Observability of Testing: How to Look Left

Two aspects of observability make it useful in the testing context: its ability to make the actual system usage observable and its usefulness in finding problem areas during debugging.

Understanding the actual system behavior is most directly useful during performance testing. Performance testing is the pinnacle of testing since it tries to achieve as close to the realistic peak behavior of a system as possible. Unfortunately, performance testing scenarios are often based on human knowledge of the system instead of objective information. For example, performance testing might be based on the prediction of 10,000 customer interactions per hour during a sales campaign based on the information of the sales manager. Observability information can help define the testing scenarios by using the information to look for the times the system was under the most stress in production and then simulate similar situations in the performance test environment.

We can use a system signature to compare behaviors. A system signature in the context of observability is the set of values for logs, metrics, and traces during a specific period. Take, for example, a marketing promotion for new customers. The signature of the system should change during that period to show more new account creations with its associated functionality and the related infrastructure showing up as being more "busy." If the signature does not change during the promotion, we would predict that we also don't see the business metrics move (e.g., user sign-ups). In this example, the business metrics and the signature can be easily matched.

Figure 2: A system behaving differently in test, which shows up in the system signature

In many other cases, this is not true. Imagine an example where we change the recommendation engine to use our warehouse data going forward. We expect the system signature to show increased data flows between the recommendation engine and our warehouse system. You can see how system signatures and the changes of the system signature can be useful for testing; any differences in signature between production and the testing systems should be explainable by the intended changes of the upcoming release. Otherwise, investigation is required.

In the same way, information from the production observability system can be used to define a regression suite that reflects the functionality most frequently used in production. Observability can give you information about the workflows still actively in use and which workflows have stopped being relevant. This information can optimize your regression suite both from a maintenance perspective and, more importantly, from a risk perspective, making sure that core functionality, as experienced by the user, remains in a working state.

Implementing observability in your test environments means you can use the power of observability for both production issues and your testing defects. It removes the need for debugging modes to some degree and relies upon the same system capability as production. This way, observability becomes how you work across both dev and ops, which helps break down silos.

Observability for Test Insights: Looking Right

In the previous section, we looked at using observability by looking left or backward, ensuring we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing.
During the test cycles, we see how this new feature changes the workflows, which shows up in our observability solution. We can see the new features being used and other features changing in usage as a result. The signature of our application has changed when we consider the logs, traces, and metrics of our system in test. Once we go live, we predict that the signature of the production system will change in a very similar way. If that happens, we will be happy. But what if the signature of the production system does not change as predicted?

Let's take an example: We created a new feature that leverages information from previous bookings to better serve our customers by allocating similar seats and menu options. During testing, we tested the new feature with our test data set, and we see an increase in accessing the bookings database while the customer booking is being collated. Once we go live, we realize that the workflows are not utilizing the customer booking database, and we leverage the information from our observability tooling to investigate. We have found a case where the users are not using our new features or are not using the features in the expected way. In either case, this information allows us to investigate further to see whether more change management is required for the users or whether our feature is just not solving the problem in the way we wanted it to.

Another way to use observability is to evaluate the performance of your changes in test and the impact on the system signature — comparing this afterwards with the production system signature can give valuable insights and prevent overall performance degradation. Our testing efforts (and the associated predictions) have now become a valuable tool for the business to evaluate the success of a feature, which elevates testing to become a business tool and a real value investment.

Figure 3: Using observability in test by looking left and looking right

Conclusion

While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. It will create objectiveness for defining testing efforts and results by evaluating them against the actual system behavior in production. It also provides value to developer, tester, and business communities, which makes it a valuable tool for breaking down barriers. Using the same practices and tools across communities drives a common culture — after all, culture is nothing but repeated behaviors.

This is an article from DZone's 2023 Automated Testing Trend Report. For more: Read the Report.
Log Analysis Using grep

By Muhammad Raza
I recently began a new role as a software engineer, and in my current position, I spend a lot of time in the terminal. Even though I have been a long-time Linux user, I embarked on my Linux journey after becoming frustrated with setting up a Node.js environment on Windows during my college days. It was during that time that I discovered Ubuntu, and it was then that I fell in love with the simplicity and power of the Linux terminal. Despite starting my Linux journey with Ubuntu, my curiosity led me to try other distributions, such as Manjaro Linux, and ultimately Arch Linux. Without a doubt, I have a deep affection for Arch Linux. However, at my day job, I used macOS, and gradually, I also developed a love for macOS. Now, I have transitioned to macOS as my daily driver. Nevertheless, my love for Linux, especially Arch Linux and the extensive customization it offers, remains unchanged.

Anyway, in this post, I will be discussing grep and how I utilize it to analyze logs and uncover insights. Without a doubt, grep has proven to be an exceptionally powerful tool. However, before we delve into grep, let's first grasp what grep is and how it works.

What Is grep and How Does It Work?

grep is a powerful command-line utility in Unix-like operating systems used for searching text or regular expressions (patterns) within files. The name "grep" stands for "Global Regular Expression Print." It's an essential tool for system administrators, programmers, and anyone working with text files and logs.

How It Works

When you use grep, you provide it with a search pattern and a list of files to search through. The basic syntax is:

    grep [options] pattern [file...]

Here's a simple understanding of how it works:

Search pattern: You provide a search pattern, which can be a simple string or a complex regular expression. This pattern defines what you're searching for within the files.
Files to search: You can specify one or more files (or even directories) in which grep should search for the pattern. If you don't specify any files, grep reads from the standard input (which allows you to pipe in data from other commands).
Matching lines: grep scans through each line of the specified files (or standard input) and checks if the search pattern matches the content of the line.
Output: When a line containing a match is found, grep prints that line to the standard output. If you're searching within multiple files, grep also prefixes the matching lines with the file name.
Options: grep offers various options that allow you to control its behavior. For example, you can make the search case-insensitive, display line numbers alongside matches, invert the match to show lines that don't match, and more.

Backstory of Development

grep was created by Ken Thompson, one of the early developers of Unix, and its development dates back to the late 1960s. The context of its creation lies in the evolution of the Unix operating system at Bell Labs. Ken Thompson, along with Dennis Ritchie and others, was involved in developing Unix in the late 1960s. As part of this effort, they were building tools and utilities to make the system more practical and user-friendly. One of the tasks was to develop a way to search for patterns within text files efficiently. The concept of regular expressions was already established in the field of formal language theory, and Thompson drew inspiration from this. He created a program that utilized a simple form of regular expressions for searching and printing lines that matched the provided pattern.
This program eventually became grep. The initial version of grep used a simple and efficient algorithm to perform the search, which is based on the use of finite automata. This approach allowed for fast pattern matching, making grep a highly useful tool, especially in the early days of Unix when computing resources were limited. Over the years, grep has become an integral part of Unix-like systems, and its functionality and capabilities have been extended. The basic concept of searching for patterns in text using regular expressions, however, remains at the core of grep's functionality.

grep and Log Analysis

So you might be wondering how grep can be used for log analysis. Well, grep is a powerful tool that can be used to analyze logs and uncover insights. In this section, I will be discussing how I use grep to analyze logs and find insights.

Isolating Errors

Debugging often starts with identifying errors in logs. To isolate errors using grep, I use the following techniques:

Search for error keywords: Start by searching for common error keywords such as "error", "exception", "fail", or "invalid". Use case-insensitive searches with the -i flag to ensure you capture variations in case.
Multiple pattern search: Use the -e flag to search for multiple patterns simultaneously. For instance, you could search for both "error" and "warning" messages to cover a wider range of potential issues.
Contextual search: Use the -C flag to display a certain number of lines of context around each match. This helps you understand the context in which an error occurred.

Tracking Down Issues

Once you've isolated errors, it's time to dig deeper and trace the source of the issue:

Timestamp-based search: If your logs include timestamps, use them to track down the sequence of events leading to an issue. You can use grep along with regular expressions to match specific time ranges.
Unique identifiers: If your application generates unique identifiers for events, use these to track the flow of events across log entries. Search for these identifiers using grep.
Combining with other tools: Combine grep with other command-line tools like sort, uniq, and awk to aggregate and analyze log entries based on various criteria.

Identifying Patterns

Log analysis is not just about finding errors; it's also about identifying patterns that might provide insights into performance or user behavior:

Frequency analysis: Use grep to count the occurrence of specific patterns. This can help you identify frequently occurring events or errors.
Custom pattern matching: Leverage regular expressions to define custom patterns based on your application's unique log formats.
Anomaly detection: Regular expressions can also help you detect anomalies by defining what "normal" log entries look like and searching for deviations from that pattern.

Conclusion

In the world of debugging and log analysis, grep is a tool that can make a significant difference. Its powerful pattern-matching capabilities, combined with its versatility in handling regular expressions, allow you to efficiently isolate errors, track down issues, and identify meaningful patterns in your log files. With these techniques in your toolkit, you'll be better equipped to unravel the mysteries hidden within your logs and ensure the smooth operation of your systems and applications. Happy log hunting! Remember, practice is key.
The more you experiment with grep and apply these techniques to your real-world scenarios, the more proficient you'll become at navigating through log files and gaining insights from them.

Examples

Isolating Errors

Search for lines containing the word "error" in a log file:

    grep -i "error" application.log

Search for lines containing either "error" or "warning" in a log file:

    grep -i -e "error" -e "warning" application.log

Display lines containing the word "error" along with 2 lines of context before and after:

    grep -C 2 "error" application.log

Tracking Down Issues

Search for log entries within a specific time range (using regular expressions for timestamp matching):

    grep "^\[2023-08-31 10:..:..]" application.log

Search for entries associated with a specific transaction ID:

    grep "TransactionID: 12345" application.log

Count the occurrences of a specific error:

    grep -c "Connection refused" application.log

Identifying Patterns

Count the occurrences of each type of error in a log file:

    grep -i -o "error" application.log | sort | uniq -c

Search for log entries containing IP addresses:

    grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" application.log

Detect unusual patterns using negative lookaheads in regular expressions (note that lookaheads are a Perl-compatible feature, so this needs -P rather than -E, and a build of GNU grep with PCRE support):

    grep -P "^(?!.*normal).*error" application.log

Lastly, I hope you enjoyed reading this and got a chance to learn something new from this post. If you have any grep tips, or a story about how you started your Linux journey, feel free to comment below, as I would love to hear them.
Send Your Logs to Loki
By Nicolas Fränkel CORE

Getting Started With Prometheus Workshop: Instrumenting Applications
By Eric D. Schabell CORE

Inferencing: The AI-Led Future of Observability?
By Samyukktha T

Extracting Maximum Value From Logs

Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system. However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging.

With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That's what we'll cover in this post. We'll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention. Let's dive in with a look at log levels and formats.

Logging Design

Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats.

Log Levels

Log levels are used to categorize log messages based on their severity. Specific log levels used may vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest):

TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change.
DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments.
INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution.
WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary.
ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality.

Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively. When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently. Later, we'll discuss how to deal with third-party applications, where you have no control over the log levels. We'll also look at legacy applications that you control but are too expansive to migrate to the standard log levels.

Structured Versus Unstructured Logs

Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically. Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems. On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly. However, programmatically extracting specific information from the resulting logs can be very challenging. Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system.
If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits. However, if all you need is simplicity and readability, then unstructured logs may be sufficient. In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages. For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning.

Logging Challenges

With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let's review the challenges this brings.

Disparate Destinations

Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome. For this, you'll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic.

Varying Formats

Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields. Unifying the information you get from a diversity of logs and formats requires the right tools.

Inconsistent Log Levels

Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels. One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation. You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose.

Log Storage Cost

Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn't cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system.

Dealing With These Challenges

While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices.

Aggregate Your Logs

When you run a distributed system, you should use a centralized logging solution. As you run log collection agents on each machine in your system, these collectors will send all the logs to your central observability platform. Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation.

Move Toward a Unified Format

Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format. The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down.
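To make the unified-format idea more concrete, here is a minimal sketch of such a transformation in Java using Jackson. The legacy line layout, field names, and component label are illustrative assumptions, not a schema prescribed by the article:

import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyLogNormalizer {

    // Assumed legacy format: "2023-08-31 10:15:42 ERROR Payment failed for order 42"
    private static final Pattern LEGACY_LINE = Pattern.compile(
            "^(\\S+ \\S+) (TRACE|DEBUG|INFO|WARNING|ERROR) (.*)$");

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String toUnifiedJson(String legacyLine, String component) throws Exception {
        Matcher m = LEGACY_LINE.matcher(legacyLine);
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("component", component);
        if (m.matches()) {
            entry.put("timestamp", m.group(1));
            entry.put("level", m.group(2));
            entry.put("message", m.group(3));
        } else {
            // Keep unparseable lines instead of dropping them.
            entry.put("level", "UNKNOWN");
            entry.put("message", legacyLine);
        }
        return MAPPER.writeValueAsString(entry);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toUnifiedJson(
                "2023-08-31 10:15:42 ERROR Payment failed for order 42", "billing"));
        // {"component":"billing","timestamp":"2023-08-31 10:15:42","level":"ERROR","message":"Payment failed for order 42"}
    }
}

In practice this kind of normalization usually lives in the log pipeline (a collector or ingestion rule) rather than in application code, but the principle is the same: map every source format onto one consistent set of fields.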
Establish a Logging Standard Across Your Applications

For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics. If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard. If a migration is not feasible, treat your legacy applications like you would third-party applications.

Enrich Logs From Third-Party Sources

Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities. To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).

Manage Log Volume, Frequency, and Retention

Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.

Volume: Monitoring generated log volume helps you control resource consumption and performance impacts.
Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring.
Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage.
Rotation: Periodically archive or purge older log files to manage log file sizes effectively.
Compression: Compress log files to reduce storage requirements.

Log-Based Metrics

Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.

Benefits

Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues.
Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience.
Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability.
Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs.

Challenges

Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn't make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task. This identification requires a deep understanding of system behavior and close alignment with your business objectives.
Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next. Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge.
Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively.
Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance.
Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information, not contributing toward meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events.

Long-Term Log Retention

After you've made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let's cover the critical questions for this area.

How Long Should You Keep Logs Around?

How long you should keep a log around depends on several factors, including:

Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting.
Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years.
Company policy: Your company may have policies that dictate how long logs should be kept.
Log size: If your logs are large, you may need to rotate them or delete them more frequently.
Storage cost: Regardless of where you store your logs—on-premise or in the cloud—you will need to factor in the cost of storage.

How Do You Reduce the Level of Detail and Cost of Older Logs?

Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around. When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures:

Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10).
Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code.
Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements.

Conclusion

In this article, we've looked at how to get the most out of logging in large-scale systems. Although logging in these systems presents a unique set of challenges, we've looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources. Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system. And you can do this while keeping your logging costs at bay.

By Alvin Lee CORE
Logging Incoming Requests in Spring WebFlux

In the world of modern software development, meticulous monitoring and robust debugging are paramount. With the rise of reactive programming paradigms, Spring WebFlux has emerged as a powerful framework for building reactive, scalable, and highly performant applications. However, as complexity grows, so does the need for effective logging mechanisms. Enter the realm of logging input requests in Spring WebFlux — a practice that serves as a critical foundation for both diagnosing issues and ensuring application security.

Logging, often regarded as the unsung hero of software development, provides developers with invaluable insights into their applications' inner workings. Through comprehensive logs, developers can peer into the execution flow, troubleshoot errors, and track the journey of each request as it traverses through the intricate layers of their Spring WebFlux application. But logging is not a one-size-fits-all solution; it requires thoughtful configuration and strategic implementation to strike the balance between informative insights and performance overhead.

In this article, we embark on a journey through the landscape of Spring WebFlux and delve into the art of logging input requests. We'll explore the nuances of intercepting and capturing crucial details of incoming requests, all while maintaining security and privacy standards. By the end, you'll be equipped with the knowledge to empower your Spring WebFlux application with insightful logs, fostering enhanced debugging, streamlined monitoring, and a fortified security posture. So, fasten your seatbelts as we unravel the techniques, best practices, and considerations for logging input requests in Spring WebFlux, and learn how this practice can elevate your application development to new heights.

Action

Although WebFilters are frequently employed to log web requests, we will choose to utilize AspectJ for this scenario. Assuming that all the endpoints in our project are located within a package named "controller" and that Controller classes end with the term "Controller," we can craft an advice method as depicted below.

@Aspect
@Component
public class RequestLoggingAspect {

    @Around("execution (* my.cool.project.controller.*..*.*Controller.*(..))")
    public Object logInOut(ProceedingJoinPoint joinPoint) {
        Class<?> clazz = joinPoint.getTarget().getClass();
        Logger logger = LoggerFactory.getLogger(clazz);
        Date start = new Date();

        Object result = null;
        Throwable exception = null;
        try {
            result = joinPoint.proceed();
            if (result instanceof Mono<?> monoOut) {
                return logMonoResult(joinPoint, clazz, logger, start, monoOut);
            } else if (result instanceof Flux<?> fluxOut) {
                return logFluxResult(joinPoint, clazz, logger, start, fluxOut);
            } else {
                return result;
            }
        } catch (Throwable e) {
            exception = e;
            throw e;
        } finally {
            if (!(result instanceof Mono<?>) && !(result instanceof Flux<?>)) {
                doOutputLogging(joinPoint, clazz, logger, start, result, exception);
            }
        }
    }
}

The RequestLoggingAspect stands out for its adept handling of diverse return types, including Flux, Mono, and non-WebFlux, within a Spring WebFlux framework. Employing the AspectJ @Around annotation, it seamlessly intercepts methods in "Controller" classes, offering tailored logging for each return type.
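Before looking at the logging helpers, it may help to see the kind of endpoint this advice applies to. The following controller is a hypothetical sketch: only the package name and the "Controller" suffix come from the pointcut above; the endpoint itself is invented for illustration.

package my.cool.project.controller;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class GreetingController {

    // Hypothetical endpoint: because the class name ends with "Controller"
    // and lives under the controller package, calls to it are intercepted
    // by RequestLoggingAspect's logInOut advice.
    @GetMapping("/greetings/{name}")
    public Mono<String> greet(@PathVariable String name) {
        return Mono.just("Hello, " + name);
    }
}

Any call to /greetings/{name} now flows through logInOut, which hands the returned Mono to logMonoResult, described next.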
Below is the logMonoResult method, which efficiently logs with contextView to retrieve contextual data from the WebFlux environment. This method adeptly handles Mono return types, capturing various scenarios while maintaining a structured logging approach. It gracefully integrates deferred contextual information and ensures seamless logging of different outcomes. From handling empty results to tracking successes and errors, the logMonoResult method seamlessly facilitates detailed logging within the Spring WebFlux context:

private <T, L> Mono<T> logMonoResult(ProceedingJoinPoint joinPoint, Class<L> clazz, Logger logger, Date start, Mono<T> monoOut) {
    return Mono.deferContextual(contextView -> monoOut
            .switchIfEmpty(Mono.<T>empty()
                    .doOnSuccess(logOnEmptyConsumer(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[empty]", null))))
            .doOnEach(logOnNext(v -> doOutputLogging(joinPoint, clazz, logger, start, v, null)))
            .doOnEach(logOnError(e -> doOutputLogging(joinPoint, clazz, logger, start, null, e)))
            .doOnCancel(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[cancelled]", null)))
    );
}

Likewise, the logFluxResult method is presented below. This method orchestrates comprehensive logging while seamlessly incorporating the contextView to obtain contextual information from the WebFlux environment. By accommodating diverse scenarios, such as empty results or cancellations, the logFluxResult method optimizes logging within the Spring WebFlux ecosystem:

private <T> Flux<T> logFluxResult(ProceedingJoinPoint joinPoint, Class<?> clazz, Logger logger, Date start, Flux<T> fluxOut) {
    return Flux.deferContextual(contextView -> fluxOut
            .switchIfEmpty(Flux.<T>empty()
                    .doOnComplete(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[empty]", null))))
            .doOnEach(logOnNext(v -> doOutputLogging(joinPoint, clazz, logger, start, v, null)))
            .doOnEach(logOnError(e -> doOutputLogging(joinPoint, clazz, logger, start, null, e)))
            .doOnCancel(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[cancelled]", null)))
    );
}

Let's delve into the details of the logOnNext, logOnError, logOnEmptyConsumer, and logOnEmptyRunnable methods, explaining how they contribute to comprehensive request logging. These methods encapsulate intricate logging procedures and utilize the contextView to maintain contextual information from the WebFlux environment. The combination of MDC (Mapped Diagnostic Context) and signal processing ensures precise logging under various scenarios.

logOnNext Method

The logOnNext method is designed to log information when a signal indicates a successful next event. It uses the signal's contextView to extract contextual variables such as transaction ID (TRX_ID) and path URI (PATH_URI). Later we will describe how such values can be put into the context. These variables are then included in the MDC to enable consistent tracking throughout the logging process. The logging statement is encapsulated within the MDC context, guaranteeing that the correct transaction and path details are associated with the log statement. This approach ensures that successful events are accurately logged within the relevant context.
private static <T> Consumer<Signal<T>> logOnNext(Consumer<T> logStatement) {
    return signal -> {
        if (!signal.isOnNext()) return;
        String trxIdVar = signal.getContextView().getOrDefault(TRX_ID, "");
        String pathUriVar = signal.getContextView().getOrDefault(PATH_URI, "");
        try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar);
             MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) {
            T t = signal.get();
            logStatement.accept(t);
        }
    };
}

logOnError Method

The logOnError method mirrors the behavior of logOnNext, but it focuses on error events. It extracts the contextual variables from the signal's contextView and places them in the MDC. This ensures that errors are logged in the proper context, making it easier to identify the specific transaction and path associated with the error event. By encapsulating the error log statement within the MDC, this method ensures that error logs are informative and appropriately contextualized.

public static <T> Consumer<Signal<T>> logOnError(Consumer<Throwable> errorLogStatement) {
    return signal -> {
        if (!signal.isOnError()) return;
        String trxIdVar = signal.getContextView().getOrDefault(TRX_ID, "");
        String pathUriVar = signal.getContextView().getOrDefault(PATH_URI, "");
        try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar);
             MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) {
            errorLogStatement.accept(signal.getThrowable());
        }
    };
}

logOnEmptyConsumer and logOnEmptyRunnable Methods

Both of these methods deal with scenarios where the signal is empty, indicating that there's no result to process. The logOnEmptyConsumer method is designed to accept a Consumer and executes it when the signal is empty. It retrieves the contextual variables from the provided contextView and incorporates them into the MDC before executing the log statement.

private static <T> Consumer<T> logOnEmptyConsumer(final ContextView contextView, Runnable logStatement) {
    return signal -> {
        if (signal != null) return;
        String trxIdVar = contextView.getOrDefault(TRX_ID, "");
        String pathUriVar = contextView.getOrDefault(PATH_URI, "");
        try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar);
             MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) {
            logStatement.run();
        }
    };
}

private static Runnable logOnEmptyRunnable(final ContextView contextView, Runnable logStatement) {
    return () -> {
        String trxIdVar = contextView.getOrDefault(TRX_ID, "");
        String pathUriVar = contextView.getOrDefault(PATH_URI, "");
        try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar);
             MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) {
            logStatement.run();
        }
    };
}

In both cases, these methods ensure that the correct context, including transaction and path details, is established through MDC before executing the log statements. This allows for consistent and meaningful logging even in situations where there is no explicit result to process.

To introduce the transaction ID and path variables into the WebFlux context, consider the following WebFilter configuration. As a @Bean with highest priority, the slf4jMdcFilter extracts the request's unique ID and path URI, incorporating them into the context. This ensures that subsequent processing stages, including the RequestLoggingAspect, can seamlessly access this enriched context for precise and comprehensive request handling.
@Bean
@Order(Ordered.HIGHEST_PRECEDENCE)
WebFilter slf4jMdcFilter() {
    return (exchange, chain) -> {
        String requestId = exchange.getRequest().getId();
        return chain.filter(exchange)
                .contextWrite(Context.of(Constants.TRX_ID, requestId)
                        .put(Constants.PATH_URI, exchange.getRequest().getPath()));
    };
}

Ultimately, for comprehensive logging of diverse request types, the inclusion of a method named doOutputLogging becomes essential. While a detailed implementation of this method is beyond our scope, it serves as a conduit for logging incoming expressions, either via a tailored logger to match your scenario or potentially routed to a database or alternate platform. This method can be customized to align precisely with your distinct necessities and specifications.

private <T> void doOutputLogging(final ProceedingJoinPoint joinPoint, final Class<?> clazz, final Logger logger, final Date start, final T result, final Throwable exception) {
    //log(...);
    //db.insert(...);
}

Summary

In summary, effective request logging in Spring WebFlux is pivotal for debugging and enhancing application performance. By leveraging AspectJ and WebFilters, developers can simplify the process of logging input and output across diverse endpoints. The showcased RequestLoggingAspect efficiently handles different return types, while the slf4jMdcFilter WebFilter enriches logs with transaction and path data. Although the logMonoResult, logFluxResult, and doOutputLogging methods serve as adaptable templates, they offer customization options to suit specific needs. This empowers developers to tailor logging to their preferences, whether for internal logs or external data storage.

By Dursun Koç CORE
It's 2 AM. Do You Know What Your Code Is Doing?

Once we press the merge button, that code is no longer our responsibility. If it performs sub-optimally or has a bug, it is now the problem of the DevOps team, the SRE, etc. Unfortunately, those teams work with a different toolset. If my code uses up too much RAM, they will increase RAM. When the code runs slower, they will increase CPU. In case the code crashes, they will increase concurrent instances. If none of that helps, they will call you up at 2 AM.

A lot of these problems are visible before they become a disastrous middle-of-the-night call. Yes, DevOps should control production, but the information they gather from production is useful for all of us. This is at the core of developer observability, which is a subject I'm quite passionate about. I'm so excited about it I dedicated a chapter to it in my debugging book. Back when I wrote that chapter, I dedicated most of it to active developer observability tools like Lightrun, Rookout, et al. These tools work like production debuggers. They are fantastic in that regard. When I have a bug and know where to look, I can sometimes reach for one of these tools (I used to work at Lightrun, so I always use it). But there are other ways. Tools like Lightrun are active in their observability; we add a snapshot similarly to a breakpoint and get the type of data we expect. I recently started playing with Digma, which takes a radically different approach to developer observability. To understand that, we might need to revisit some concepts of observability first.

Observability Isn't Pillars

I've been guilty of listing the pillars of observability just as much as the next guy. They're even in my book (sorry). To be fair, I also discussed what observability really means… Observability means we can ask questions about our system and get answers, or at least have a clearly defined path to get those answers. Sounds simple when running locally, but when you have a sophisticated production environment and someone asks you: is anyone even using that block of code? How do you know? You might have lucked out and had a log in that code, and it might still be lucky that the log is at the right level and piped properly so you can check. The problem is that if you added too many logs or too much observability data, you might have created a disease worse than the cure: over-logging or over-observing. Both can bring down your performance and significantly impact the bank account, so ideally, we don't want too many logs (I discuss over-logging here), and we don't want too much observability.

Existing developer observability tools work actively. To answer the question of whether someone is using the code, I can place a counter on the line and wait for results. I can give it a week's timeout and find out in a week. Not a terrible situation but not ideal either. I don't have that much patience.

Tracing and OpenTelemetry

It's a sad state of affairs that most developers don't use tracing in their day-to-day job. For those of you who don't know it, it is like a call stack for the cloud. It lets us see the stack across servers and through processes. No, not method calls. More at the entry point level, but this often contains details like the database queries that were made and similarly deep insights. There's a lot of history with OpenTelemetry, which I don't want to get into. If you're an observability geek, you already know it, and if not, then it's boring. What matters is that OpenTelemetry is taking over the world of tracing.
It's a runtime agent, which means you just add it to the server, and you get tracing information almost seamlessly. It's magic. It also doesn't have a standard server, which makes it very confusing. That means multiple vendors can use a single agent and display the information it collects to various demographics:

A vendor focused on performance can show the timing of various parts in the system.
A vendor focused on troubleshooting can detect potential bugs and issues.
A vendor focused on security can detect potential risky access.

Background Developer Observability

I'm going to coin a term here since there isn't one: Background Developer Observability. What if the data you need was already here, and a system already collected it for you in the background? That's what Digma is doing. In Digma's terms, it's called Continuous Feedback. Essentially, they're collecting OpenTelemetry data, analyzing it, and displaying it as information that's useful for developers. If Lightrun is like a debugger, then Digma is like SonarQube based on actual runtime and production information. The cool thing is that you probably already use OpenTelemetry without even knowing it. DevOps probably installed that agent already, and the data is already there! Going back to my question, is anyone using this API? If you use Digma, you can see that right away. OpenTelemetry already collected the information in the background, and the DevOps team already paid the price of collection. We can benefit from that too.

Enough Exposition

I know, I go on… Let's get to the meat and potatoes of why this rocks. Notice that this is a demo; when running locally, the benefits are limited. The true value of these tools is in understanding production; still, they can provide a lot of insight even when running locally and even when running tests.

Digma has a simple and well-integrated setup wizard for IntelliJ/IDEA. You need to have Docker Desktop running for setup to succeed. Note that you don't need to run your application using Docker. This is simply for the Digma server process, where they collect the execution details. Once it is installed, we can run our application. In my case, I just ran the JPA unit test from my latest book, and it produced standard traces, which are already pretty cool.

When we click a trace for one of these, we get the standard trace view. This is nothing new, but it's really nice to see this information directly in the IDE and readily accessible. I can imagine the immense value this will have for figuring out CI execution issues.

But the real value, and where Digma becomes a "Developer Observability" tool instead of an Observability tool, is with the tool window. There is a strong connection to the code directly from the observability data and deeper analysis, which doesn't show in my particular overly simplistic hello world. This Toolwindow highlights problematic traces and errors and helps understand real-world issues.

How Does This Help at 2 AM?

Disasters happen because we aren't looking. I'd like to say I open my observability dashboard regularly, but I don't. Then when there's a failure, I take a while to get my bearings within it. The locality of the applicable data is important. It helps us notice issues when they happen, detect regressions before they turn to failures, and understand the impact of the code we just merged. Prevention starts with awareness, and as developers, we handed our situational awareness to the DevOps team.
When the failure actually happens, the locality and accessibility of the data make a big difference. Since we use tools that integrate into the IDE daily, this reduces the mean time to a fix. No, a background developer observability tool might not include the information we need to fix a problem. But if it does, then the information is already there, and we need nothing else. That is fantastic.

Final Word

With all the discussion about observability and OpenTelemetry, you would think everyone is using them. Unfortunately, the reality is far from that. Yes, there's some saturation and familiarity in the DevOps crowd. This is not the case for developers. This is a form of environmental blindness. How can our teams, who are so driven by data and facts, proceed with secondhand and often outdated data from OPS? Should I spend time further optimizing this method, or will I waste the effort since few people use it? We can benchmark things locally just fine, but real-world usage and impact are things that we all need to improve.

By Shai Almog CORE
Getting Started With Prometheus Workshop: Service Discovery

Are you interested in open-source observability but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is, what it is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack.

Previously, I shared an introduction to Prometheus, installing Prometheus, an introduction to the query language, exploring basic queries, using advanced queries, and relabeling metrics in Prometheus as free online labs. In this article, you'll learn all about discovering service targets in Prometheus. Your learning path takes you into the wonderful world of service discovery in Prometheus, where you explore the more realistic and dynamic world of cloud-native services that automatically scale up and down. Note this article is only a short summary, so please see the complete lab found online to work through it in its entirety yourself.

The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is as follows: this lab provides an understanding of how service discovery is used in Prometheus for locating and scraping targets for metrics collection. You're learning by setting up a service discovery mechanism to dynamically maintain a list of scraping targets.

You start in this lab exploring the service discovery architecture Prometheus provides and how it supports all manner of automated discovery of dynamically scaling targets in your infrastructure, the basic definitions of what service discovery needs to achieve, knowing what targets should exist, knowing how to pull metrics from those targets, and how to use the associated target metadata. You then dive into the two options for installing the lab demo environments, either using source projects or open-source containers, for the exercises later in this lab.

The Demo Environment

Whether you install it using source projects or containers, you'll be setting up the following architecture to support your service discovery exercises using the services demo, ensuring your local infrastructure contains the following:

Production 1 running at http://localhost:11111
Production 2 running at http://localhost:22222
Development running at http://localhost:44444

Note that if you have any port conflicts on your machine, you can map any free port numbers you like, making this exercise very flexible across your available machines. Next, you'll be setting up a file-based discovery integration with Prometheus that allows your applications and pipelines to modify a file for dynamic targeting of the infrastructure you want to scrape.
This file (targets.yml) in our exercise will look something like this if you are targeting the above infrastructure:

- targets:
    - "localhost:11111"
    - "localhost:22222"
  labels:
    job: "services"
    env: "production"
- targets:
    - "localhost:44444"
  labels:
    job: "services"
    env: "development"

Configuring your Prometheus instance requires a new file-based discovery section in your workshop-prometheus.yml file:

# workshop config
global:
  scrape_interval: 5s

scrape_configs:
  # Scraping Prometheus.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # File based discovery.
  - job_name: "file-sd-workshop"
    file_sd_configs:
      - files:
          - "targets.yml"

After saving your configuration and starting your Prometheus instance, you are then shown how to verify that the target infrastructure is now being scraped. Next up, you'll start adding dynamic changes to your target file and see that they are automatically discovered by Prometheus without having to restart your instance.

Exploring Dynamic Discovery

The rest of the lab walks through multiple exercises where you make dynamic changes and verify that Prometheus is able to automatically scale to the needs of your infrastructure. For example, you'll first change the infrastructure you have deployed by promoting the development environment to become the staging infrastructure for your organization. First, you update the targets file:

- targets:
    - "localhost:11111"
    - "localhost:22222"
  labels:
    job: "services"
    env: "production"
- targets:
    - "localhost:44444"
  labels:
    job: "services"
    env: "staging"

Then you verify that the changes are picked up, this time using a PromQL query and the Prometheus console without having to restart your Prometheus instance. Later in the lab, you are given exercises to fly solo and add a new testing environment so that the end results of your dynamically growing observability infrastructure contain production, staging, testing, and your Prometheus instance.

Missed Previous Labs?

This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop here if you missed anything previously: You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Perses to pick up where you left off.

Coming Up Next

I'll be taking you through the following lab in this workshop where you'll learn all about instrumenting your applications for collecting Prometheus metrics. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
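Because the lab leans on applications and pipelines modifying a file, it is worth noting how small that glue can be. The following is a minimal sketch, not part of the workshop itself; it assumes the targets.yml name, labels, and demo ports used above and requires the PyYAML package.

Python
# update_targets.py -- regenerate the file_sd targets file used in this lab.
# Minimal sketch: any process that can write this file (for example, a CI/CD
# pipeline step) can change what Prometheus scrapes.
import yaml  # pip install pyyaml

# Hypothetical mapping of environments to the demo service instances.
ENVIRONMENTS = {
    "production": ["localhost:11111", "localhost:22222"],
    "staging": ["localhost:44444"],
}

def render_targets() -> list:
    """Build the structure Prometheus expects in a file_sd targets file."""
    return [
        {"targets": hosts, "labels": {"job": "services", "env": env}}
        for env, hosts in ENVIRONMENTS.items()
    ]

if __name__ == "__main__":
    with open("targets.yml", "w") as fh:
        yaml.safe_dump(render_targets(), fh, sort_keys=False)
    print("targets.yml updated")

Prometheus re-reads the file on its own, so no restart or reload is needed after the script runs.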

By Eric D. Schabell CORE
Introduction to the Tower Library

One of the components of my OpenTelemetry demo is a Rust application built with the Axum web framework. In its description, axum mentions: axum doesn't have its own middleware system but instead uses tower::Service. This means axum gets timeouts, tracing, compression, authorization, and more, for free. It also enables you to share middleware with applications written using hyper or tonic. — axum README So far, I was happy to let this cryptic explanation lurk in the corner of my mind, but today is the day I want to understand what it means. Like many others, this post aims to explain to me and others how to do this. The tower crate offers the following information: Tower is a library of modular and reusable components for building robust networking clients and servers. Tower provides a simple core abstraction, the Service trait, which represents an asynchronous function taking a request and returning either a response or an error. This abstraction can be used to model both clients and servers. Generic components, like timeouts, rate limiting, and load balancing, can be modeled as Services that wrap some inner service and apply additional behavior before or after the inner service is called. This allows implementing these components in a protocol-agnostic, composable way. Typically, such services are referred to as middleware. — tower crate Tower is designed around Functional Programming and two main abstractions, Service and Layer. In its simplest expression, a Service is a function that reads an input and produces an output. It consists of two methods: One should call poll_ready() to ensure that the service can process requests call() processes the request and returns the response asynchronously Because calls can fail, the return value is wrapped in a Result. Moreover, since Tower deals with asynchronous calls, the Result is wrapped in a Future. Hence, a Service transforms a Self::Request into a Future<Result>, with Request and Response needing to be defined by the developer. The Layer trait allows composing Services together. Here's a slightly more detailed diagram: A typical Service implementation will wrap an underlying component; the component may be a service itself. Hence, you can chain multiple features by composing various functions. The call() function implementation usually executes these steps in order, all of them being optional: Pre-call Call the wrapped component Post-call For example, a logging service could log the parameters before the call, call the logged component, and log the return value after the call. Another example would be a throttling service, which limits the rate of calls of the wrapped service: it would read the current status before the call and, if above a configured limit, would return immediately without calling the wrapped component. It will call the component and increment the status if the status is valid. The role of a layer would be to take one service and wrap it into the other. With this in mind, it's relatively easy to check the axum-tracing-opentelemetry crate and understand what it does. It offers two services with their respective layers: one is to extract the trace and span IDs from an HTTP request, and another is to send the data to the OTEL collector. 
Note that Tower comes with several out-of-the-box services, each available via a feature crate:

balance: load-balance requests
buffer: MPSC buffer
discover: service discovery
filter: conditional dispatch
hedge: retry slow requests
limit: limit requests
load: load measurement
retry: retry failed requests
timeout: timeout requests

Finally, note that Tower comes in three crates: tower is the public crate, while tower-service and tower-layer are considered more stable. In this post, we have explained what the Tower library is: a Functional Programming library that provides function composition. If you come from the Object-Oriented Programming paradigm, it's similar to the Decorator pattern. It builds upon two abstractions: Service is the function, and Layer composes functions. It's widespread in the Rust ecosystem, and learning it is a good investment. To go further: Axum Tower documentation Tower crate Axum_tracing_opentelemetry documentation
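If the Decorator comparison helps, the shape of Service and Layer can be sketched outside Rust as well. The snippet below is emphatically not Tower code, just a minimal Python illustration of the same composition idea: a plain handler wrapped by a logging layer and a throttling layer, mirroring the two examples described above.

Python
# Not Tower, not Rust: a minimal sketch of the Service/Layer idea.
# A "service" is just request -> response; a "layer" wraps one service in another.
from typing import Callable

Service = Callable[[str], str]

def handler(request: str) -> str:
    """The innermost service being wrapped."""
    return f"response to {request!r}"

def logging_layer(inner: Service) -> Service:
    """Pre-call and post-call behavior around the wrapped component."""
    def service(request: str) -> str:
        print(f"-> {request}")
        response = inner(request)
        print(f"<- {response}")
        return response
    return service

def throttle_layer(inner: Service, limit: int = 2) -> Service:
    """Return early when over the limit; otherwise count the call and delegate."""
    state = {"calls": 0}
    def service(request: str) -> str:
        if state["calls"] >= limit:
            raise RuntimeError("limit reached; inner service not called")
        state["calls"] += 1
        return inner(request)
    return service

# Composition, outermost layer first: logging wraps throttling wraps the handler.
app = logging_layer(throttle_layer(handler))
print(app("GET /health"))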

By Nicolas Fränkel CORE
How to Configure Istio, Prometheus and Grafana for Monitoring

Intro to Istio Observability Using Prometheus Istio service mesh abstracts the network from the application layers using sidecar proxies. You can apply security and advanced networking policies to all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput. In this article, we will discuss how SREs can benefit from integrating three open-source tools: Istio, Prometheus, and Grafana. Istio is the most popular service mesh, Prometheus is the most widely used monitoring software, and Grafana is the most popular visualization tool. Note: The steps are tested for Istio 1.17.X. Watch the Video of Istio, Prometheus, and Grafana Configuration Watch the video if you want to follow the steps from the video: Step 1: Go to Istio Add-Ons and Apply the Prometheus and Grafana YAML Files First, go to the add-ons folder in the Istio directory. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons. You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus and Grafana by using the following commands: Shell kubectl apply -f prometheus.yaml Shell kubectl apply -f grafana.yaml Note that these add-on YAMLs are applied to the istio-system namespace by default. Step 2: Deploy a New Service and Port-Forward the Istio Ingress Gateway To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an object of the Istio ingress gateway to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port: 7777. You should see the below screen at localhost:7777. Step 3: Open the Prometheus and Grafana Dashboards You can open the Prometheus and Grafana dashboards by using the following commands: Shell istioctl dashboard prometheus Shell istioctl dashboard grafana Both Grafana and Prometheus will open on localhost. Step 4: Make HTTP Requests From Postman We will see how the httpbin service consumes CPU and memory when there is a traffic load. We will create a few GET and POST requests to localhost:7777 from the Postman app. Once you send GET or POST requests to the httpbin service multiple times, resources will be utilized, and we can see them in Grafana. But first, we need to configure the metrics for the httpbin service in Prometheus and Grafana. Step 5: Configuring Metrics in Prometheus One can select a range of metrics related to any Kubernetes resource, such as the API server, applications, workloads, Envoy, etc. We will select the container_memory_working_set_bytes metric for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term: container_memory_working_set_bytes{namespace="istio-telemetry"} (istio-telemetry is the name of our Istio-enabled namespace, where the httpbin service is deployed.) Note that simply running this gives us the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage of each pod, grouped by pod.
The following query will help us get the desired result: sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article was to showcase the ability of Istio to emit and send metrics to Prometheus for collection. Step 6: Configuring Istio Metrics Graphs in Grafana Now, you can simply take the query sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) from Prometheus and plot a graph over time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles for sharing with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana. You can choose to make all the customizations based on your enterprise needs. I have done a few experiments in the video; feel free to check it out. Conclusion Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have just offered a small use case of metrics scraping and visualization using Istio, Prometheus, and Grafana. You can also perform logging and tracing of real-time traffic using Istio; we will cover those topics in our subsequent blogs.
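As a side note, the same PromQL can be sanity-checked outside Grafana through Prometheus' HTTP API. Here is a small Python sketch; it assumes Prometheus is reachable at localhost:9090 (adjust the address to wherever istioctl dashboard prometheus exposes it) and that the namespace is istio-telemetry, as above.

Python
# Run the per-pod memory query against Prometheus' HTTP API and print the result.
import requests  # pip install requests

PROMETHEUS = "http://localhost:9090"  # assumed address of the Prometheus dashboard
QUERY = 'sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod)'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    pod = sample["metric"].get("pod", "<unknown>")
    mem_bytes = float(sample["value"][1])  # each value is [timestamp, "value"]
    print(f"{pod}: {mem_bytes / 1024 / 1024:.1f} MiB")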

By Md Azmal
Exploring OpenTelemetry Capabilities

It is impossible to know with certainty what is happening inside a remote or distributed system, even if they are running on a local machine. Telemetry is precisely what provides all the information about the internal processes. This data can take the form of logs, metrics, and Distributed Traces. In this article, I will explain all of the forms separately. I will also explain the benefits of OpenTelemetry protocol and how you can configure telemetry flexibly. Telemetry: Why Is It Needed? It provides insights into the current and past states of a system. This can be useful in various situations. For example, it can reveal that system load reached 70% during the night, indicating the need to add a new instance. Metrics can also show when errors occurred and which specific traces and actions within those traces failed. It demonstrates how users interact with the system. Telemetry allows us to see which user actions are popular, which buttons users frequently click, and so on. This information helps developers, for example, to add caching to actions. For businesses, this data is important for understanding how the system is actually being used by people. It highlights areas where system costs can be reduced. For instance, if telemetry shows that out of four instances, only one is actively working at 20% capacity, it may be possible to eliminate two unnecessary instances (with a minimum of two instances remaining). This allows for adjusting the system's capacity and reducing maintenance costs accordingly. Optimizing CI/CD pipelines. These processes can also be logged and analyzed. For example, if a specific step in the build or deployment process is taking a long time, telemetry can help identify the cause and when these issues started occurring. It can also provide insights for resolving such problems. Other purposes. There can be numerous non-standard use cases depending on the system and project requirements. You collect the data, and how it is processed and utilized depends on the specific circumstances. Logging everything may be necessary in some cases, while in others, tracing central services might be sufficient. No large IT system or certain types of businesses can exist without telemetry. Therefore, this process needs to be maintained and implemented in projects where it is not yet present. Types of Data in Telemetry Logs Logs are the simplest type of data in telemetry. There are two types of logs: Automatic logs: These are generated by frameworks or services (such as Azure App Service). With automatic logs, you can log details such as incoming and outgoing requests, as well as the contents of the requests. No additional steps are required to collect these logs, which is convenient for routine tasks. Manual logs: These logs need to be manually triggered. They are not as straightforward as automatic logs, but they are justified for logging important parts of the system. Typically, these logs capture information about resource-intensive processes or those related to specific business tasks. For example, in an education system, it would be crucial not to lose students' test data for a specific period. info: ReportService.ReportServiceProcess[0] Information: {"Id":0,"Name":"65", "TimeStamp":"2022-06-15T11:09:16.2420721Z"} info: ReportService. ReportServiceProcess[0] Information: {"Id":0,"Name":"85","TimeStamp":"2022-06-15T11:09:46.5739821Z"} Typically, there is no need to log all data. 
Evaluate the system and identify (if you haven't done so already) the most vulnerable and valuable parts of the system. Most likely, those areas will require additional logging. Sometimes, you will need to employ log-driven programming. In my experience, there was a desktop application project on WPW that had issues with multi-threading. The only way to understand what was happening was to log every step of the process. Metrics Metrics are more complex data compared to logs. They can be valuable for both development teams and businesses. Metrics can also be categorized as automatic or manual: Automatic metrics are provided by the system itself. For example, in Windows, you can see metrics such as CPU utilization, request counts, and more. The same principle applies to the Monitoring tab when deploying a virtual machine on AWS or Azure. There, you can find information about the amount of data coming into or going out of the system. Manual metrics can be added by you. For instance, when you need to track the current number of subscriptions to a service. This can be implemented using logs, but the metrics provide a more visual and easily understandable representation, especially for clients. Distributed Trace This data is necessary for working with distributed systems that are not running on a single instance. In such cases, we don't know which instance or service is handling a specific request at any given time. It all depends on the system architecture. Here are some possible scenarios: In the first diagram, the client sends a request to the BFF, which then separately forwards it to three services. In the center, we see a situation where the request goes from the first service to the second, and then to the third. The diagram on the right illustrates a scenario where a service sends requests to a Message Broker, which further distributes them between the second and third services. I'm sure you've come across similar systems, and there are countless examples. These architectures are quite different from monoliths. In systems with a single instance, we have visibility into the call stack from the controller to the database. Therefore, it is relatively easy to track what happened during a specific API call. Most likely, the framework provides this information. However, in distributed systems, we can't see the entire flow. Each service has its own logging system. When sending a request to the BFF, we can see what happens within that context. However, we don't know what happened within services 1, 2, and 3. This is where Distributed Trace comes in. Here is an example of how it works: Let's examine this path in more detail… The User Action goes to the API Gateway, then to Service A, and further to Service B, resulting in a call to the database. When these requests are sent to the system, we receive a trace similar to the one shown. Here, the duration of each process is clearly visible: from User Action to the Database. For example, we can see that the calls were made sequentially. The time between the API Gateway and Service A was spent on setting up the HTTP connection, while the time between Service B and the Database was needed for database setup and data processing. Therefore, we can assess how much time was spent on each operation. This is possible thanks to the Correlation ID mechanism. What is the essence of it? Typically, in monolithic applications, logs and actions are tied to process ID or thread ID during logging. Here, the mechanism is the same, but we manually add it to the requests. 
Let's look at an example: When the Order Service action starts in the Web Application, it sees the added Correlation ID. This allows the service to understand that it is part of a chain and passes the "marker" to the next services. They, in turn, see themselves as part of a larger process. As a result, each component logs data in a way that allows the system to see everything happening during a multi-stage action. The transmission of the Correlation ID can be done in different ways. For example, in HTTP, this data is often passed as one of the header parameters. In Message Broker services, it is typically written inside the message. However, there are likely SDKs or libraries available in each platform that can help implement this functionality. How OpenTelemetry Works Often, the telemetry format of an old system is not supported in a new one. This leads to many issues when transitioning from one system to another. For example, this was the case with AppInsight and CloudWatch. The data was not grouped properly, and something was not working as expected. OpenTelemetry helps overcome such problems. It is a data transfer protocol in the form of unified libraries from OpenCensus and OpenTracing. The former was developed by Google for collecting metrics and traces, while the latter was created by Uber experts specifically for traces. At some point, the companies realized that they were essentially working on the same task. Therefore, they decided to collaborate and create a universal data representation format. Thanks to the OTLP protocol, logs, metrics, and traces are sent in a unified format. According to the OpenTelemetry repository, prominent IT giants contribute to this project. It is in demand in products that collect and display data, such as Datadog and New Relic. It also plays a significant role in systems that require telemetry, including Facebook, Atlassian, Netflix, and others. Key Components of the OTLP Protocol Cross-language specification: This is a set of interfaces that need to be implemented to send logs, metrics, and traces to a telemetry visualization system. SDK: These are implemented parts in the form of automatic traces, metrics, and logs. Essentially, they are libraries connected to the framework. With them, you can view the necessary information without writing any code. There are many SDKs available for popular programming languages. However, they have different capabilities. Pay attention to the table. Tracing has stable versions everywhere except for the PHP and JS SDKs. On the other hand, metrics and logs are not yet well-implemented in many languages. Some have only alpha versions, some are experimental, and in some cases, the protocol implementation is missing altogether. From my experience, I can say that everything works fine with services on .NET. It provides easy integration and reliable logging. Collector: This is the main component of OpenTelemetry. It is a software package that is distributed as an exe, pkg, or Docker file. The collector consists of four components: Receivers: These are the data sources for the collector. Technically, logs, metrics, and traces are sent to the receivers. They act as access points. Receivers can accept OTLP from Jaeger or Prometheus. Processors: These can be launched for each data type. They filter data, add attributes, and customize the process for specific system or project requirements. Exporters: These are the final destinations for sending telemetry. From here, data can be sent to OTLP, Jaeger, or Prometheus. 
Extensions: These tools extend the functionality of the collector. One example is the health_check extension, which allows sending a request to an endpoint to check if the collector is working. Extensions provide various insights, such as the number of receivers and exporters in the system and their operation status. In this diagram, we have two types of data: metrics and logs (represented by different colors). Logs go through their processor to Jaeger, while metrics go through another processor, have their own filter, and are sent to two data sources: OTLP and Prometheus. This provides flexible data analysis capabilities, as different software has different ways of displaying telemetry. An interesting point: data can be received from OpenTelemetry and sent back to it. In certain cases, you can send the same data to the same collector. OTLP Deployment There are many ways to build a telemetry collection system. One of the simplest schemes is shown in the illustration below. It involves a single .NET service that sends OpenTelemetry directly to New Relic: If needed, the scheme can be enhanced with an agent. The agent can act as a host service or a background process within the service, collecting data and sending it to New Relic: Moving forward, let's add another application to the scheme (e.g., a Node.js application). It will send data directly to the collector, while the first application will do it through its own agent using OTLP. The collector will then send the data to two systems. For example, metrics will go to New Relic, and logs will go to Datadog: You can also add Prometheus as a data source here. For instance, when someone on the team prefers this tool and wants to use it. However, the data will still be collected in New Relic and Datadog: The telemetry system can be further complicated and adapted to your project. Here's another example: Here, there are multiple collectors, each collecting data in its own way. The agent in the .NET application sends data to both New Relic and the collector. One collector can send information to another because OTLP is sent to a different data source. It can perform any action with the data. As a result, the first collector filters the necessary data and passes it to the next one. The final collector distributes logs, metrics, and traces among New Relic, Datadog, and Azure Monitor. This mechanism allows you to analyze telemetry in a way that is convenient for you. Exploring OpenTelemetry Capabilities Let's dive into the practical aspects of OpenTelemetry and examine its features. For this test, I've created a project based on the following diagram: It all starts with an Angular application that sends HTTP requests to a Python application. The Python application, in turn, sends requests to .NET and Node.js applications, each working according to its own scenario. The .NET application sends requests to Azure Service Bus and handles them in the Report Service, also sending metrics about the processed requests. Additionally, .NET sends requests to MS SQL. The Node.js requests go to Azure Blob Queue and Google. This system emulates some workflow. All applications utilize automatic tracing systems to send traces to the collector. Let's begin by dissecting the docker-compose file. 
version: "2" services: postal-service: build: context: ../Postal Service dockerfile: Dockerfile ports: - "7120:80" environment: - AZURE_EXPERIMENTAL_ENABLE_ACTIVITY_SOURCE=true depends_on: - mssql report-service: build: context: ../Report dockerfile: Dockerfile -"7133:80" environment: - AZURE_EXPERIMENTAL_ENABLE_ACTIVITY_SOURCE=true depends_on: - mssql billing-service: The file contains the setup for multiple BFF (Backend For Frontend) services. Among the commented-out sections, we have Jaeger, which helps visualize traces. ports: - "5000:5000" #jaeger-all-in-one: # image: jaegertracing/all-in-one: latest # ports: # - "16686:16686" # - "14268" # - "14250" There is also Zipkin, another software for trace visualization. # Zipkin zipkin-all-in-one: image: openzipkin/zipkin:latest ports: "9411:9411" MS SQL and the collector are included as well. The collector specifies a config file and various ports to which data can be sent. # Collector otel-collector: image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.51.0 command: ["--config=/etc/otel-collector-config.yaml" ] volumes: - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml ports: - "1888:1888" # pprof extension - "13133:13133" # health_check_extension - "4317:4317" - "4318:4318" - "55678:55679" # zpages extension depends_on: - jaeger-all-in-one - zipkin-all-in-one The config file includes key topics: receivers, exporters, processors, extensions, and the service itself, which acts as the constructor for all of this. otel-collector-config.yaml receivers: otlp: protocols: grpc: http:/ cors: allowed_origins: - http://localhost:4200 max_age: 7200 exporters: prometheus: endpoint: "0.0.0.0:8889" const_labels: label1: value1 logging: There is a single receiver, otlp, which represents the OpenTelemetry Protocol. Other receivers can be added as well (such as Prometheus). The receiver can be configured, and in my example, I set up the allowed_origins. receivers: otrip: protocols: grpc: http: cors: allowed_origins: http://localhost:4200 max_age: 7200 Next are exporters. They allow metrics to be sent to Prometheus. exporters: prometheus: endpoint: "0.0.0.0:8889" const_labels: label1: value1 logging: Then come the extensions. In this case, there is a health_check extension, which serves as an endpoint to check the collector's activity. extensions: health_check: pprof: endpoint: :1888 zpages: endpoint: :55679 Lastly, we have a service with pipelines, traces, and metrics. This section clarifies the data type, its source, processing, and destination. In this example, traces from the receiver are sent for logging to two backends, while metrics are sent to Prometheus. service extensions: [pprof, zpages, health_check] pipelines: traces: receivers: [otlp] processors: [batch] exporters: [zipkin, jaeger] metrics: receivers: [otlp] processors: [batch] exporters: [prometheus] Now, let's see how it works in practice. The frontend sends requests to the backend, and the backend uses BFF to send requests. We create a trace and observe the results. Among them, we see some requests with a 500 status. To understand what went wrong, we look at the traces through Zipkin. The detailed description of the problematic request shows that the frontend called BFF, which then sent two synchronous requests, one after the other. Through the traces, we can learn where this request was directed, the URL it targeted, and the HTTP method used. All this information is generated based on automatic data. 
Additionally, manual traces can be added to make the infographic more informative. Additionally, we see that BFF called BILLINGSERVICE. In it, there are middleware processes, requests sent to Azure, and an HTTP POST request that was sent to Azure, resulting in a CREATED status. The system also sets up and sends requests to Google. There is also POSTALSERVICE, where one request failed. Taking a closer look, we see the error description: "ServiceBusSender has already been closed...". Therefore, one must be cautious with ServiceBusSender in the future. Here, we can also observe multiple requests being sent to MS SQL. Finally, we obtain a comprehensive infographic of all the processes in the system. However, I want to warn you that things are not always as transparent. In our case, two traces, as they say, are "out of context." Nothing is clear about them: where they are executed, what happens with them, and there are minimal details. Sometimes, this happens, and you need to be prepared. As an option, you can add manual traces. Let's take a look at how metrics are sent to Prometheus. The illustration shows that the additional request was successfully sent. There was one request, and now there are five. Therefore, metrics are working properly. In the .NET application, requests are sent to Azure Service Bus, and they are processed by the Report Service. However, in Zipkin, there was no Report Service. Nevertheless, the metrics show that it is functioning. So, remember that not everything in OTLP works as expected everywhere. I know libraries that add traces to message brokers by default, and you can see them in the stack. However, this functionality is still considered experimental. Let's not forget about health_check. It shows whether our collector is functioning. {["status":"Server available","upSince": "2022-06-17T15:49:00.320594Z","uptime": "56m4.4995003s"} Now let's send data to Jaeger as well (by adding a new trace resource). After starting it, we need to resend the requests since it does not receive previous data. We receive a list of services like this: We have similar traces to those in Zipkin, including ones with a 500 status. I personally like the System Architecture tab, which displays a system graph. It shows that everything starts with a request to BFF, which is then redirected to BillingService and PostalService. This exemplifies how different tools display data in their unique ways. Lastly, let's discuss the order. In it, we can find the request and the generated trace ID. If you specify this trace ID in the system, you can learn what happened in the request and thoroughly investigate the HTTP call. This way, the frontend learns that it is the first to receive the User Action. In the same way, the frontend understands that it needs to create a trace that will be passed along the chain and send data to the collector. The collector collects and sends the data to Jaeger, Zipkin, and Prometheus. Therefore, the advantages of using the OpenTelemetry Protocol are evident. It is a flexible system for collecting, processing, and sending telemetry. It is particularly convenient in combination with Docker, which I used in creating this demo. However, always remember the limitations of OTLP. When it comes to traces, everything works quite well. However, the feasibility of using this protocol for metrics and logs depends on the readiness of specific system libraries and SDKs.
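To tie this back to the manual traces mentioned above, here is a rough Python sketch of adding one. It uses the opentelemetry-sdk and opentelemetry-exporter-otlp packages and assumes the collector's OTLP gRPC receiver is listening on localhost:4317, as in the demo configuration; the service name and span attribute are illustrative only.

Python
# Add a manual span alongside the automatic instrumentation.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the SDK at the collector's OTLP gRPC receiver (localhost:4317 in the demo).
provider = TracerProvider(resource=Resource.create({"service.name": "report-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("report-service.manual")

# The span joins whatever trace is already active (e.g., one started by the
# automatic HTTP instrumentation), so it shows up in Zipkin/Jaeger in context.
with tracer.start_as_current_span("generate-report") as span:
    span.set_attribute("report.kind", "daily")  # illustrative attribute
    # ... do the actual work here ...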

By Dmitriy Bogdan
6 Effective Strategies for Kubernetes Observability in Hybrid Cloud Environments

2023 has seen rapid growth in native-cloud applications and platforms. Organizations are constantly striving to maximize the potential of their applications, ensure seamless user experiences, and drive business growth. The rise of hybrid cloud environments and the adoption of containerization technologies, such as Kubernetes, have revolutionized the way modern applications are developed, deployed, and scaled. In this digital arena, Kubernetes is the platform of choice for most cloud-native applications and workloads, which is adopted across industries. According to a 2022 report, 96% of companies are already either using or evaluating the implementation of Kubernetes in their cloud system. This popular open-source utility is helpful for container orchestration and discovery, load balancing, and other capabilities. However, with this transformation comes a new set of challenges. As the complexity of applications increases, so does the need for robust observability solutions that enable businesses to gain deep insights into their containerized workloads. Enter Kubernetes observability—a critical aspect of managing and optimizing containerized applications in hybrid cloud environments. In this blog post, we will delve into Kubernetes observability, exploring six effective strategies that can empower businesses to unlock the full potential of their containerized applications in hybrid cloud environments. These strategies, backed by industry expertise and real-world experiences, will equip you with the tools and knowledge to enhance the observability of your Kubernetes deployments, driving business success. Understanding Observability in Kubernetes Let us first start with the basics. Kubernetes is a powerful tool for managing containerized applications. But despite its powerful features, keeping track of what's happening in a hybrid cloud environment can be difficult. This is where observability comes in. Observability is collecting, analyzing, and acting on data in a particular environment. In the context of Kubernetes, observability refers to gaining insights into the behavior, performance, and health of containerized applications running within a Kubernetes cluster. Kubernetes Observability is based on three key pillars: 1. Logs: Logs provide valuable information about the behavior and events within a Kubernetes cluster. They capture important details such as application output, system errors, and operational events. Analyzing logs helps troubleshoot issues, understand application behavior, and identify patterns or anomalies. 2. Metrics: Metrics are quantitative measurements that provide insights into a Kubernetes environment's performance and resource utilization. They include CPU usage, memory consumption, network traffic, and request latency information. Monitoring and analyzing metrics help identify performance bottlenecks, plan capacity, and optimize resource allocation. 3. Traces: Traces enable end-to-end visibility into the flow of requests across microservices within a Kubernetes application. Distributed tracing captures timing data and dependencies between different components, providing a comprehensive understanding of request paths. Traces help identify latency issues, understand system dependencies, and optimize critical paths for improved application performance. Kubernetes observability processes typically involve collecting and analyzing data from various sources to understand the system's internal state and provide actionable intelligence. 
By implementing the right observability strategies, you can gain a deep understanding of your applications and infrastructure, which will help you to: Detect and troubleshoot problems quickly Improve performance and reliability Optimize resource usage Meet compliance requirements Observability processes are being adopted at a rapid pace by IT teams. By 2026, 70% of organizations will have successfully applied observability to achieve shorter latency for decision-making while increasing distributed, organized, and simplified data management processes. 1. Use Centralized Logging and Log Aggregation For gaining insights into distributed systems, centralized logging is an essential strategy. In Kubernetes environments, where applications span multiple containers and nodes, collecting and analyzing logs from various sources becomes crucial. Centralized logging involves consolidating logs from different components into a single, easily accessible location. The importance of centralized logging lies in its ability to provide a holistic view of your system's behavior and performance. With Kubernetes logging, you can correlate events and identify patterns across your Kubernetes cluster, enabling efficient troubleshooting and root-cause analysis. To implement centralized logging in Kubernetes, you can leverage robust log aggregation tools or cloud-native solutions like Amazon CloudWatch Logs or Google Cloud Logging. These tools provide scalable and efficient ways to collect, store, and analyze logs from your Kubernetes cluster. 2. Leverage Distributed Tracing for End-to-End Visibility In a complex Kubernetes environment with microservices distributed across multiple containers and nodes, understanding the flow of requests and interactions between different components becomes challenging. This is where distributed tracing comes into play, providing end-to-end visibility into the execution path of requests as they traverse through various services. Distributed tracing allows you to trace a request's journey from its entry point to all the microservices it touches, capturing valuable information about each step. By instrumenting your applications with tracing libraries or agents, you can generate trace data that reveals each service's duration, latency, and potential bottlenecks. The benefits of leveraging distributed tracing in Kubernetes are significant. Firstly, it helps you understand the dependencies and relationships between services, enabling better troubleshooting and performance optimization. When a request experiences latency or errors, you can quickly identify the service or component responsible and take corrective actions. Secondly, distributed tracing allows you to measure and monitor the performance of individual services and their interactions. By analyzing trace data, you can identify performance bottlenecks, detect inefficient resource usage, and optimize the overall responsiveness of your system. This information is invaluable with regard to capacity planning and ensuring scalability in your Kubernetes environment. Several popular distributed tracing solutions are available. These tools provide the necessary instrumentation and infrastructure to effectively collect and visualize trace data. By integrating these solutions into your Kubernetes deployments, you can gain comprehensive visibility into the behavior of your microservices and drive continuous improvement. 3. 
Integrate Kubernetes With APM Solutions To achieve comprehensive observability in Kubernetes, it is essential to integrate your environment with Application Performance Monitoring (APM) solutions. APM solutions provide advanced monitoring capabilities beyond traditional metrics and logs, offering insights into the performance and behavior of individual application components. One of the primary benefits of APM integration is the ability to detect and diagnose performance bottlenecks within your Kubernetes applications. With APM solutions, you can trace requests as they traverse through various services and identify areas of high latency or resource contention. Armed with this information, you can take targeted actions to optimize critical paths and improve overall application performance. Many APM solutions offer dedicated Kubernetes integrations that streamline the monitoring and management of containerized applications. These integrations provide pre-configured dashboards, alerts, and instrumentation libraries that simplify capturing and analyzing APM data within your Kubernetes environment. 4. Use Metrics-Based Monitoring Metrics-based monitoring forms the foundation of observability in Kubernetes. It involves collecting and analyzing key metrics that provide insights into your Kubernetes clusters and applications' health, performance, and resource utilization. When it comes to metrics-based monitoring in Kubernetes, there are several essential components to consider: Node-Level Metrics: Monitoring the resource utilization of individual nodes in your Kubernetes cluster is crucial for capacity planning and infrastructure optimization. Metrics such as CPU usage, memory usage, disk I/O, and network bandwidth help you identify potential resource bottlenecks and ensure optimal allocation. Pod-Level Metrics: Pods are the basic units of deployment in Kubernetes. Monitoring metrics related to pods allows you to assess their resource consumption, health, and overall performance. Key pod-level metrics include CPU and memory usage, network throughput, and request success rates. Container-Level Metrics: Containers within pods encapsulate individual application components. Monitoring container-level metrics helps you understand the resource consumption and behavior of specific application services or processes. Metrics such as CPU usage, memory usage, and file system utilization offer insights into container performance. Application-Specific Metrics: Depending on your application's requirements, you may need to monitor custom metrics specific to your business logic or domain. These metrics could include transaction rates, error rates, cache hit ratios, or other relevant performance indicators. Metric-based monitoring architecture diagram 5. Use Custom Kubernetes Events for Enhanced Observability Custom events communicate between Kubernetes components and between Kubernetes and external systems. They can signal important events, such as deployments, scaling operations, configuration changes, or even application-specific events within your containers. By leveraging custom events, you can achieve several benefits in terms of observability: Proactive Monitoring: Custom events allow you to define and monitor specific conditions that require attention. For example, you can create events to indicate when resources are running low, when pods experience failures, or when specific thresholds are exceeded. By capturing these events, you can proactively detect and address issues before they escalate. 
Contextual Information: Custom events can include additional contextual information that helps troubleshoot and analyze root causes. You can attach relevant details, such as error messages, timestamps, affected resources, or any other metadata that provides insights into the event's significance. This additional context aids in understanding and resolving issues more effectively. Integration with External Systems: Kubernetes custom events can be consumed by external systems, such as monitoring platforms or incident management tools. Integrating these systems allows you to trigger automated responses or notifications based on specific events. This streamlines incident response processes and ensures the timely resolution of critical issues. To leverage custom Kubernetes events, you can use Kubernetes event hooks, custom controllers, or even develop your event-driven applications using the Kubernetes API. By defining event triggers, capturing relevant information, and reacting to events, you can establish a robust observability framework that complements traditional monitoring approaches. 6. Incorporating Synthetic Monitoring for Proactive Observability Synthetic monitoring simulates user journeys or specific transactions that represent everyday interactions with your application. These synthetic tests can be scheduled to run regularly from various geographic locations, mimicking user behavior and measuring key performance indicators. There are several key benefits to incorporating synthetic monitoring in your Kubernetes environment: Proactive Issue Detection: Synthetic tests allow you to detect issues before real users are affected. By regularly simulating user interactions, you can identify performance degradations, errors, or unresponsive components. This early detection enables you to address issues proactively and maintain high application availability. Performance Benchmarking: Synthetic monitoring provides a baseline for performance benchmarking and SLA compliance. You can measure response times, latency, and availability under normal conditions by running consistent tests from different locations. These benchmarks serve as a reference for detecting anomalies and ensuring optimal performance. Geographic Insights: Synthetic tests can be configured to run from different geographic locations, providing insights into the performance of your application from various regions. This helps identify latency issues or regional disparities that may impact user experience. By optimizing your application's performance based on these insights, you can ensure a consistent user experience globally. You can leverage specialized tools to incorporate synthetic monitoring into your Kubernetes environment. These tools offer capabilities for creating and scheduling synthetic tests, monitoring performance metrics, and generating reports. An approach for gaining Kubernetes observability for traditional and microservice-based applications is by using third-party tools like Datadog, Splunk, Middleware, and Dynatrace. This tool captures metrics and events, providing several out-of-the-box reports, charts, and alerts to save time. Wrapping Up This blog explored six practical strategies for achieving Kubernetes observability in hybrid cloud environments. 
By utilizing centralized logging and log aggregation, leveraging distributed tracing, integrating Kubernetes with APM solutions, adopting metrics-based monitoring, incorporating custom Kubernetes events, and synthetic monitoring, you can enhance your understanding of the behavior and performance of your Kubernetes deployments. Implementing these strategies will provide comprehensive insights into your distributed systems, enabling efficient troubleshooting, performance optimization, proactive issue detection, and improved user experience. Whether you are operating a small-scale Kubernetes environment or managing a complex hybrid cloud deployment, applying these strategies will contribute to the success and reliability of your applications.
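To make strategy 6 a little more concrete: at its core, a synthetic check is just a scheduled probe that replays a key request and records latency and success. The sketch below is a minimal Python illustration with a placeholder URL and thresholds; dedicated tools add scheduling, multi-region probes, dashboards, and alerting on top of this idea.

Python
# Minimal synthetic check: replay a key request on a schedule and record the result.
import time
import requests  # pip install requests

ENDPOINT = "https://example.com/api/health"  # placeholder URL
LATENCY_BUDGET_S = 0.5                       # placeholder SLA threshold
INTERVAL_S = 60

def run_check() -> None:
    start = time.monotonic()
    try:
        response = requests.get(ENDPOINT, timeout=5)
        latency = time.monotonic() - start
        ok = response.ok and latency <= LATENCY_BUDGET_S
        print(f"synthetic_check ok={ok} status={response.status_code} latency_s={latency:.3f}")
    except requests.RequestException as exc:
        print(f"synthetic_check ok=False error={exc}")

if __name__ == "__main__":
    while True:
        run_check()
        time.sleep(INTERVAL_S)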

By Savan Kharod
Monitor API Health Check With Prometheus

APISIX has a health check mechanism that proactively checks the health status of the upstream nodes in your system. Also, APISIX integrates with Prometheus through its plugin, which exposes upstream node (multiple instances of a backend API service that APISIX manages) health check metrics on the Prometheus metrics endpoint, typically on the URL path /apisix/prometheus/metrics. In this article, we'll guide you on how to enable and monitor API health checks using APISIX and Prometheus. Prerequisite(s) This guide assumes the following tools are installed locally: Docker is used to install the containerized etcd and APISIX. Install cURL to send requests to the services for validation. Before you start, it is good to have a basic understanding of APISIX. Familiarity with API gateways and key concepts such as routes, upstreams, the Admin API, plugins, and the HTTP protocol will also be beneficial. Start the APISIX Demo Project This project leverages the existing pre-defined Docker Compose configuration file to set up, deploy, and run APISIX, etcd, Prometheus, and other services with a single command. First, clone the apisix-prometheus-api-health-check repo on GitHub, open it in your favorite editor, and start the project by simply running docker compose up from the project root folder. When you start the project, Docker downloads any images it needs to run. You can see the full list of services in the docker-compose.yaml file. Add Health Check API Endpoints in Upstream To check API health periodically, APISIX needs the HTTP path of the health endpoint of the upstream service. So, you first need to add a /health endpoint for your backend service. From there, you inspect the most relevant metrics for that service, such as memory usage, database connectivity, response duration, and more. Assume that we have two backend REST API services, web1 and web2, running in the demo project, and each has its own health check endpoint at the URL path /health. At this point, you do not need to make additional configurations. In reality, you can replace them with your own backend services. The simplest and most standardized way to validate the status of a service is to define a new health check endpoint like /health or /status. Setting Up Health Checks in APISIX This process involves checking the operational status of the 'upstream' nodes. APISIX provides two types of health checks: active checks and passive checks. Read more about health checks and how to enable them here. Use the Admin API to create an Upstream object. Here is an example of creating an Upstream object with two nodes (one per backend service we defined) and configuring the health check parameters in the upstream object: curl "http://127.0.0.1:9180/apisix/admin/upstreams/1" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d ' { "nodes": { "web1:80": 1, "web2:80": 1 }, "checks": { "active": { "timeout": 5, "type": "http", "http_path": "/health", "healthy": { "interval": 2, "successes": 1 }, "unhealthy": { "interval": 1, "http_failures": 2 } } } }' This example configures an active health check on the /health endpoint of each node. It considers a node healthy after one successful health check and unhealthy after two failed health checks. Note that sometimes you might need the IP addresses of upstream nodes, not their domain names (web1 and web2), if you are running services outside the Docker network. By design, the health check is started only if the number of nodes (resolved IPs) is bigger than one.
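For reference, the /health endpoint that the upstream nodes expose does not need to be elaborate. The demo's web1 and web2 containers already ship their own implementation; the following standard-library Python sketch only illustrates the contract APISIX's active checker relies on: return 200 when the service and its dependencies are healthy, and a non-2xx status otherwise.

Python
# A bare-bones /health endpoint of the kind APISIX's active health check probes.
# The demo services listen on port 80 inside their containers; 8080 is used here
# so the sketch can run without root privileges.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def dependencies_ok() -> bool:
    """Placeholder for real checks (database connectivity, queue depth, ...)."""
    return True

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = dependencies_ok()
        body = json.dumps({"status": "ok" if healthy else "degraded"}).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()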
Enable the Prometheus Plugin Create a global rule to enable the prometheus plugin on all routes by adding "prometheus": {} in the plugins option. APISIX gathers internal runtime metrics and exposes them through port 9091 and the URI path /apisix/prometheus/metrics by default, which Prometheus can scrape. It is also possible to customize the export port and URI path, add extra labels, change the frequency of the scrapes, and set other parameters by configuring them in the Prometheus configuration file /prometheus_conf/prometheus.yml. curl "http://127.0.0.1:9180/apisix/admin/global_rules" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '{ "id": "rule-for-metrics", "plugins": { "prometheus":{} } }' Create a Route Create a Route object to route incoming requests to upstream nodes: curl "http://127.0.0.1:9180/apisix/admin/routes/1" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d ' { "name": "backend-service-route", "methods": ["GET"], "uri": "/", "upstream_id": "1" }' Send Validation Requests to the Route To generate some metrics, send a few requests to the route we created in the previous step: curl -i -X GET "http://localhost:9080/" If you run the above request a couple of times, you can see from the responses that APISIX routes some requests to web1 and others to web2. That's how gateway load balancing works!

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 10
Connection: keep-alive
Date: Sat, 22 Jul 2023 10:16:38 GMT
Server: APISIX/3.3.0

hello web2

...

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 10
Connection: keep-alive
Date: Sat, 22 Jul 2023 10:16:39 GMT
Server: APISIX/3.3.0

hello web1

Collecting Health Check Data With the Prometheus Plugin Once the health checks and route are configured in APISIX, you can employ Prometheus to monitor health checks. APISIX automatically exposes health check metrics data for your APIs if the health check parameter is enabled for upstream nodes. You will see metrics in the response after fetching them from APISIX: curl -i http://127.0.0.1:9091/apisix/prometheus/metrics Example Output:

# HELP apisix_http_requests_total The total number of client requests since APISIX started
# TYPE apisix_http_requests_total gauge
apisix_http_requests_total 119740
# HELP apisix_http_status HTTP status codes per service in APISIX
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1",matched_uri="/",matched_host="",service="",consumer="",node="172.27.0.5"} 29
apisix_http_status{code="200",route="1",matched_uri="/",matched_host="",service="",consumer="",node="172.27.0.7"} 12
# HELP apisix_upstream_status Upstream status from health check
# TYPE apisix_upstream_status gauge
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="80"} 1
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="80"} 1

Health check data is represented by the metric apisix_upstream_status. It has attributes like upstream name, ip, and port. A value of 1 means the upstream node is healthy, and 0 means it is unhealthy. Visualize the Data in the Prometheus Dashboard Navigate to http://localhost:9090/, where the Prometheus instance is running in Docker, and type the expression apisix_upstream_status in the search bar.
You can also see the output of the health check statuses of upstream nodes on the Prometheus dashboard in the table or graph view: Cleanup Once you are done experimenting with Prometheus and APISIX Gateway health check metrics, you can use the following commands to stop and remove the services created in this guide: docker compose down Next Steps You have now learned how to set up and monitor API health checks with Prometheus and APISIX. APISIX Prometheus plugin is configured to connect Grafana automatically to visualize metrics. Keep exploring the data and customize the Grafana dashboard by adding a panel that shows the number of active health checks.
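If you want to act on this data outside the dashboard, the same metric can be pulled from Prometheus' HTTP API; in a real setup you would more likely express this as a Prometheus alerting rule. The following Python sketch assumes the demo's Prometheus at localhost:9090.

Python
# Poll apisix_upstream_status from Prometheus and report unhealthy nodes.
import requests  # pip install requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "apisix_upstream_status"},
    timeout=10,
)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    healthy = sample["value"][1] == "1"  # value is [timestamp, "value-as-string"]
    state = "healthy" if healthy else "UNHEALTHY"
    print(f'{labels.get("name")} {labels.get("ip")}:{labels.get("port")} -> {state}')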

By Bobur Umurzokov
Monitoring vs. Observability in 2023: An Honest Take

If you're running a software system, you need to know what's happening with it: how it's performing, whether it's running as expected, and whether any issues need your attention. And once you spot an issue, you need information so you can troubleshoot. A plethora of tools promise to help with this, from monitoring and APM to observability and everything in between. This has resulted in something of a turf war in the area of observability, where monitoring vendors claim they also do observability, while "observability-first" players disagree and accuse them of observability-washing. So let's take an unbiased look at this and answer a few questions: How are monitoring and observability different, if at all? How effective is each at solving the underlying problem? How does AI impact this space now, and what comes next?

What Is Monitoring? A monitoring solution performs three simple actions:

Pre-define some "metrics" in advance.
Deploy agents to collect these metrics.
Display these metrics in dashboards.

Note that a metric here is a simple number that captures a quantifiable characteristic of a system. We can then perform mathematical operations on metrics to get different aggregate views. Monitoring has existed for the past 40 years — since the rise of computing systems — and was originally how operations teams kept track of how their infrastructure was behaving.

Types of Monitoring Originally, monitoring was most heavily used in infrastructure to keep track of infrastructure behavior - this was infrastructure monitoring. Over time, as applications became more numerous and diverse, we wanted to monitor them as well, leading to the emergence of a category called APM (Application Performance Monitoring). In a modern distributed system, we have several components we want to monitor — infrastructure, applications, databases, networks, data streams, and so on — and the metrics we want differ depending on the component. For instance:

Infrastructure monitoring: uptime, CPU utilization, memory utilization.
Application performance monitoring: throughput, error rate, latency.
Database monitoring: number of connections, query performance, cache hit ratios.
Network monitoring: roundtrip time, TCP retransmits, connection churn.
...and so on.

These metrics are measures that are generally agreed upon as relevant for that system, and most monitoring tools come pre-built with agents that know which metric to collect and what dashboards to display. As the number of components in distributed systems multiplied, the volume and variety of metrics grew exponentially. To manage this complexity, a separate suite of tools and processes emerged that expanded upon traditional monitoring tools, with time-series databases, SLO systems, and new visualizations.

Distinguishing Monitoring Through all this, the core functioning of a monitoring system remains the same, and a monitoring system can be clearly distinguished if:

It captures predefined data.
The data being collected is a metric (a number).

The Goal of Monitoring The goal of a monitoring tool is to alert us when something unexpected is happening in a system. This is akin to an annual medical checkup - we measure a bunch of pre-defined values that will give us an overall picture of our body and let us know if any particular sub-system (organ) is behaving unexpectedly. And just like annual checkups, a monitoring tool may or may not provide any additional information about why something is unexpected.
Distinguishing Monitoring

Through all this, the core function of a monitoring system remains the same, and a monitoring system can be clearly distinguished by two traits:

  • It captures predefined data.
  • The data being collected is a metric (a number).

The Goal of Monitoring

The goal of a monitoring tool is to alert us when something unexpected is happening in a system. This is akin to an annual medical checkup: we measure a set of pre-defined values that give us an overall picture of our body and tell us whether any particular sub-system (organ) is behaving unexpectedly. And just like an annual checkup, a monitoring tool may or may not provide any additional information about why something is unexpected. For that, we'll likely need deeper, more targeted tests and investigations. An experienced physician might still be able to diagnose a condition based on just the overall test, but that is not what the test is designed for. The same holds for a monitoring solution.

What Is Observability?

Unlike monitoring, observability is much harder to define because its goal is fuzzier: to "help us understand why something is behaving unexpectedly." Logs are the original observability tool; we have been using them since the 70s. Until the late 2000s, the way we worked was that traditional monitoring systems would alert us when something went wrong, and logs would help us understand why. However, in the last 15 years our architectures have become significantly more complex, and it became nearly impossible to manually scour logs to figure out what happened. At the same time, our tolerance for downtime decreased dramatically as businesses became more digital, and we could no longer afford to spend hours understanding and fixing issues. We needed more data than we had so that we could troubleshoot issues faster. This led to the rise of the observability industry, whose purpose was to help us understand more easily why our systems were misbehaving. It started with the addition of a new data type called traces, and we said the three pillars of observability were metrics, logs, and traces. From there, we kept adding new data types to "improve our observability."

The Problem With Observability

The fundamental problem with observability is that we don't know beforehand what information we might need. The data we need depends on the issue, and the nature of production errors is that they are unexpected and long-tail: if they could have been foreseen, they would have been fixed already. This is what makes observability fuzzy: there is no clear scope around what and how much to capture. So observability became "any data that could potentially help us understand what is happening." Today, the most accurate way to describe observability as it is actually implemented is "everything outside of metrics, plus metrics."

Monitoring vs. Observability

A perfectly observable system would record everything that happens in production with no data gaps. In practice, that is prohibitively expensive, and 99% of the data would be irrelevant anyway, so an observability platform has to make complex choices about what and how much telemetry data to capture. Different vendors make these choices differently, and depending on who you ask, observability looks slightly different.

Commonly Cited Descriptions of Observability Are Unhelpful

Common articulations of observability, like "observability is being able to observe the internal states of a system through its external outputs," are vague: they give us neither a clear indication of what it is nor any guidance for deciding whether we have sufficient observability for our needs. In addition, most of the commonly cited markers that purport to distinguish observability from monitoring are also vague, if not outright misleading. Let's look at a few examples.

1. "Monitoring Is Predefined Data; Observability Is Not"

In reality, nearly everything we capture in an observability solution today is also predetermined. We define in advance what logs we want to capture, which distributed traces to record (including the sampling mechanism), what context to attach to each trace, and when to capture a stack trace. We have yet to enter the era of tools that selectively capture data based on what is actually happening in production.
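To illustrate just how predefined this capture is, here is a minimal sketch using the OpenTelemetry Python SDK. The sampling ratio, span name, attributes, and the log field that carries the trace ID are all chosen by an engineer ahead of time, not by the system at runtime; the service and attribute names are hypothetical.

```python
# Minimal sketch: everything captured here is decided up front by the developer.
# Assumes the opentelemetry-sdk package; names are illustrative.
import logging

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Predefined: sample 10% of traces and tag them with a fixed service name.
provider = TracerProvider(
    sampler=TraceIdRatioBased(0.1),
    resource=Resource.create({"service.name": "checkout-service"}),
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def place_order(order_id: str) -> None:
    # Predefined: the span name and attributes are chosen at development time.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # The trace ID is written into the log line so that a log entry and a
        # trace can later be joined on this shared identifier.
        trace_id = format(span.get_span_context().trace_id, "032x")
        log.info("processing order %s trace_id=%s", order_id, trace_id)

place_order("A-1001")
```

Note that even the log-to-trace "correlation" here is just a predefined ID written into two places; nothing in this setup decides on its own what extra data to collect when something goes wrong.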
2. "Monitoring Is Simple Dashboards; Observability Is More Complex Analysis and Correlation"

This is another promise that is still unmet in practice. Most observability platforms today also just have dashboards; theirs simply show more than metrics (for example, strings for logs) or can pull up different charts and views based on user instructions. We don't yet have tools that can perform meaningful correlation or add context on their own to help us understand problems faster. Being able to connect a log and a trace using a unique ID doesn't qualify as complex analysis or correlation, even though the effort required to do it may be non-trivial.

3. "Monitoring Is Reactive; Observability Is Proactive"

All the observability data we collect is pre-defined, and nearly everything we do in production today (including around observability) is reactive. The proactive part is what we did while testing. In production, if something breaks or looks unexpected, we respond and investigate. At best, we use SLO systems, which could potentially qualify as proactive: we predefine an acceptable amount of errors (an error budget) and take action before we exceed it. However, SLO systems are more tightly coupled with monitoring tools, so this is not a particularly useful distinction between a monitoring and an observability solution.
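For readers unfamiliar with error budgets, the arithmetic behind them is simple enough to sketch. The function and numbers below are purely illustrative and not taken from any particular SLO tool.

```python
# Minimal sketch of the error-budget math behind an SLO system.
# All names and numbers are illustrative.

def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative means blown)."""
    allowed_failures = (1.0 - slo_target) * total_requests  # e.g., 99.9% SLO -> 0.1% may fail
    if allowed_failures == 0:
        return 0.0
    return 1.0 - (failed_requests / allowed_failures)

# Example: a 99.9% availability SLO over 1,000,000 requests this month.
remaining = error_budget_remaining(slo_target=0.999,
                                   total_requests=1_000_000,
                                   failed_requests=400)
print(f"Error budget remaining: {remaining:.0%}")  # 400 of 1,000 allowed failures used -> 60% left
```

The "proactive" part is simply a policy attached to this number, for example freezing risky deployments once the remaining budget drops below an agreed threshold.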
4. "Monitoring Focuses on Individual Components; Observability Reveals Relationships Across Components"

This is a distinction created mainly to make observability synonymous with distributed tracing. Distributed tracing is just one more data type that shows us the relationships across components, and today it must be used in conjunction with other data to be useful.

In summary, we have a poorly defined category with no outer boundaries, and we have made up several vague, not very helpful markers to distinguish that category from the monitoring that existed before. This narrative is designed to tell us that there is always some distance to go before we reach "true observability," and always one more tool to buy. As a result, we keep expanding the scope of what we need within observability.

What Is the Impact of This?

Ever-Increasing List of Data Types for Observability

All telemetry data counts as observability because it helps us "observe" the state of our system. Do logs qualify as observability? Yes, because they help us understand what happened in production. Does distributed tracing qualify? Yes. How about error monitoring systems that capture stack traces for exceptions? Yes. How about live debugging systems? Yes. How about continuous profilers? Yes. How about metrics? Also yes, because they too help us understand the state of our systems.

Ever-Increasing Volume of Observability Data

How much data to capture is left to the customer to decide, especially outside of monitoring. How much you log, how many distributed traces you capture, how many events you capture and store, at what intervals, and for how long: everything is an open question, with limited guidance on how much is "reasonable" and at what point you might be capturing too much. Companies can spend anywhere from $1M to as much as $65M on observability; it all depends on who builds what business case.

Tool Sprawl and Spending Increase

All of the above has led to the amount spent on observability rising rapidly. Most companies today use five or more observability tools, and monitoring and observability together are typically the second-largest infrastructure spend in a company after cloud infrastructure itself, with a market size of roughly $17B.

Fear and Loss-Aversion Are the Underlying Drivers of Observability Expansion

The underlying human driver for the adoption of all these tools is fear: "What if something breaks and I don't have enough data to troubleshoot?" This is every engineering team's worst nightmare, and it naturally drives teams to capture more and more telemetry data every year so they feel more secure.

Yet MTTR Appears to Be Increasing Globally

One would expect that with the wide adoption of observability, and the aggressive capturing and storing of various types of observability data, MTTR would have dropped dramatically worldwide. On the contrary, it appears to be increasing, with 73% of companies taking more than an hour to resolve production issues (vs. 47% just two years ago). Despite all the investment, we seem to be making incremental progress at best.

Figure: Increasing production MTTRs

Where We Are Now

So far, we have kept collecting more and more telemetry data in the hope that processing and storage costs would keep dropping to support it. But with exploding data volumes, we ran into a new problem beyond cost: usability. It became impossible for a human to look directly at dozens of dashboards and arrive at conclusions quickly enough. So we created different data views and cuts to make it easier for users to test and validate their hypotheses. But these tools have become too complex for the average engineer to use, and we need specially trained "power users" (akin to data scientists) who are well versed in navigating this pool of data to identify an error. This is the approach many observability companies are taking today: capture more data, provide more analytics, and train power users who are capable of using these tools. But these specialized engineers do not have enough information about all the parts of the system to generate good-enough hypotheses, while the average engineer continues to rely largely on logs to debug software issues, and we make no meaningful improvement in MTTR. So all of observability looks like a high-effort, high-spend activity that merely lets us stay in the same place as our architectures rapidly grow in complexity. So what's next?

Figure: Monitoring, observability, and inferencing

Inferencing: The Next Stage After Observability?

To understand what the next generation will look like, let us start with the underlying goal of all these tools: to keep production systems healthy and running as expected and, if anything goes wrong, to let us quickly understand why and resolve the issue. If we start there, we can see that there are three distinct levels at which tools can support us:

  • Level 1: "Tell me when something is off in my system" (monitoring).
  • Level 2: "Tell me why something is off, and how to fix it" (let's call this inferencing).
  • Level 3: "Fix it yourself and tell me what you did" (auto-remediation).

Traditional monitoring tools do Level 1 reasonably well and help us detect issues. We have not yet reached Level 2, where a system can automatically tell us why something is breaking. So we introduced a set of tools called observability that sit somewhere between Level 1 and Level 2 to "help understand why something is breaking" by giving us more data.
Inferencing: Observability Plus AI

I'd argue the next step after observability is inferencing, where a platform can reasonably explain why an error occurred so that we can fix it. This is becoming possible in 2023 thanks to the rapid evolution of AI models over the last few months. Imagine a solution that:

  • Automatically surfaces just the errors that need immediate developer attention.
  • Tells the developer exactly what is causing the issue and where it is: this pod, this server, this code path, this line of code, for this type of request.
  • Guides the developer on how to fix it.
  • Uses the developer's actual actions to continuously improve its recommendations.

Avoiding the Pitfalls of AIOps

In any conversation around AI plus observability, it's important to remember that this has been attempted before with AIOps, with limited success, and inferencing solutions will need to avoid its pitfalls. To do that, they will have to be architected from the ground up for the AI use case, i.e., data collection, processing, storage, and the user interface all designed from the start for root-causing issues with AI. What it will probably not look like is AI added on top of existing observability tools and existing observability data, simply because that is what we already attempted, and failed at, with AIOps.

Conclusion

We explored monitoring and observability and how they differ. We looked at how observability is poorly defined today, with loose boundaries, which results in uncontrolled data, tool, and spend sprawl. Meanwhile, the latest progress in AI could resolve some of these issues with a new class of inferencing solutions based on AI. Watch this space for more on this topic!

By Samyukktha T

Top Monitoring and Observability Experts


Joana Carvalho

Site Reliability Engineering,
Virtuoso

Joana has been a performance engineer for the last ten years, analyzing root causes from user interaction down to bare metal, tuning performance, and evaluating new technologies. Her goal is to create solutions that empower development teams to own performance investigation, visualization, and reporting, so that they can own the quality of their services in a self-sufficient manner. She works as a Site Reliability Engineer at Virtuoso, a codeless end-to-end testing platform powered by AI/ML.

Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director of Technical Marketing & Evangelism. He's well known in the development community as a speaker, lecturer, author, and baseball expert. His current role allows him to help the world understand the challenges of cloud native observability. He brings a unique perspective to the stage, with a professional life dedicated to sharing his deep expertise in open source technologies and organizations, and he is a CNCF Ambassador. Follow him at https://www.schabell.org.

Chris Ward

Zone Leader,
DZone

Twitter: @ChrisChinch

Ted Young

Director of Open Source Development,
LightStep

The Latest Monitoring and Observability Topics

WordPress Deployment: Docker, Nginx, Apache, and SSL
Install and set up WordPress with Docker Compose, Nginx, Apache, and Let's Encrypt SSL on Ubuntu 22.04 LTS. This setup is tested on a Google Cloud Compute Engine VM.
September 25, 2023
by Pappin Vijak
· 1,158 Views · 1 Like
How TIBCO Is Evolving Integration for the Multi-Cloud Era
Single pane of glass end-to-end observability, a developer portal, model-driven data management, and virtualization boost productivity and performance.
September 22, 2023
by Tom Smith CORE
· 2,647 Views · 2 Likes
AWS Amplify: A Comprehensive Guide
AWS Amplify is a tool for building, shipping, and hosting apps on AWS. It offers authentication, data storage, API development, and more.
September 21, 2023
by Hardik Thakker
· 1,961 Views · 2 Likes
Exploring Edge Computing: Delving Into Amazon and Facebook Use Cases
Edge computing enhances latency, bandwidth utilization, security, and scalability in data processing for companies like Amazon and Facebook.
September 20, 2023
by Arun Pandey
· 4,227 Views · 2 Likes
Maximizing Uptime: How to Leverage AWS RDS for High Availability and Disaster Recovery
AWS RDS offers Multi-AZ deployments and Read Replicas to enable high availability and cross-region disaster recovery for databases.
September 20, 2023
by Raghava Dittakavi
· 2,529 Views · 1 Like
The Systemic Process of Debugging
Explore the academic theory of the debugging process, focusing on issue tracking, team communication, and the balance between unit-to-integration tests.
September 19, 2023
by Shai Almog CORE
· 1,555 Views · 3 Likes
The Convergence of Testing and Observability
While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. Find out more in this post.
September 18, 2023
by Mirco Hering
· 2,197 Views · 2 Likes
How To Repair Failed Installations of Exchange Cumulative and Security Updates
In this article, we list some common issues you may encounter when installing CUs and SUs, along with possible solutions to fix them.
September 15, 2023
by Shelly Bhardwaj
· 2,694 Views · 1 Like
Log Analysis: How to Digest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second
This article describes a large-scale data warehousing use case to provide reference for data engineers who are looking for log analytic solutions.
September 15, 2023
by Zaki Lu
· 3,034 Views · 2 Likes
Choosing the Appropriate AWS Load Balancer: ALB vs. NLB
Learn the key differences between Application Load Balancer (ALB) and Network Load Balancer (NLB) to make the right choice for your application.
September 14, 2023
by Satrajit Basu CORE
· 2,372 Views · 4 Likes
16 K8s Worst Practices That Are Causing You Pain (Or Will Soon)
The article emphasizes the importance of understanding and implementing best practices to avoid pitfalls and ensure efficient and secure Kubernetes operations.
September 14, 2023
by Augustinas Stirbis
· 3,402 Views · 2 Likes
Getting Started With Prometheus Workshop: Instrumenting Applications
Interested in open-source observability? Learn about instrumenting your applications for customized business metrics in cloud-native environments.
September 14, 2023
by Eric D. Schabell CORE
· 3,161 Views · 5 Likes
Fortifying the Cloud: A Look at AWS Shield's Scalable DDoS Protection
AWS Shield protects AWS cloud resources from disruptive DDoS attacks. It provides automated protection with real-time monitoring and mitigation.
September 13, 2023
by Raghava Dittakavi
· 2,267 Views · 2 Likes
Streamlining Salesforce Data Management: Migrating Attachments to AWS S3
In this blog, we'll delve into the significance of migrating attachments from Salesforce to AWS S3 using tools such as Informatica IICS.
September 13, 2023
by Srinivas Venkata
· 2,093 Views · 1 Like
The Essentials of Amazon S3 Glacier for Affordable and Compliant Long-Term Data Archiving
Amazon S3 Glacier is a secure, scalable cloud storage service for long-term data archiving. It offers affordable storage and retrieval of infrequently accessed data.
September 12, 2023
by Raghava Dittakavi
· 1,988 Views · 2 Likes
Eliminating Bugs Using the Tong Motion Approach
Delve into a two-pronged strategy that streamlines debugging, enabling developers to swiftly pinpoint and resolve elusive software glitches.
September 12, 2023
by Shai Almog CORE
· 1,261 Views · 2 Likes
PaaS4GenAI: Connecting Generative AI (WatsonX) On IBM Cloud Platform From Oracle Integration Cloud
A multi-cloud connectivity solution that leverages Oracle Integration Cloud with generative AI (WatsonX) on the IBM Cloud platform.
September 12, 2023
by Sandip Biswas
· 2,531 Views · 1 Like
Log Analysis Using grep
This post explores log analysis with the grep command, including syntax, examples, and efficiency tips for effective log file searching and filtering.
September 11, 2023
by Muhammad Raza
· 3,220 Views · 9 Likes
Protect Your Keys: Lessons from the Azure Key Breach
Learn how to better protect your organization from attacks by looking at how attackers compromised a Microsoft signing key. Secure your keys and review logs.
September 9, 2023
by Dwayne McDaniel
· 3,538 Views · 1 Like
Choosing a Container Platform
There are many container platforms to choose from. In this post, I help break down the suitability of each platform.
September 7, 2023
by Kit Dergilev
· 4,095 Views · 1 Like
