Performance

Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.

Latest Refcards and Trend Reports
Trend Report: Performance and Site Reliability
Refcard #385: Observability Maturity Model
Refcard #368: Getting Started With OpenTelemetry

DZone's Featured Performance Resources

[DZone Research] Join Us for Our Annual Performance + Observability Survey!

By Caitlin Candelmo
Do you have experience with observability, monitoring, and/or application performance? If so, look no further than our annual Observability + Application Performance research survey! Building our most performant and resilient applications not only requires monitoring or alerts — it stretches far deeper into a world of data, telemetry, and observability. In this year's survey, we will further investigate key performance topics, including observability challenges and techniques, application performance and monitoring, and site reliability considerations. Our community members have a direct hand in driving the research covered in the report, and this is where we could use your anonymous insights! We're asking for ~7 minutes of your time to share your experience (and enter for a chance to win 1 of 5 $100 gift cards). Over the coming weeks, we will compile and analyze data from hundreds of DZone members, and observations will be featured in the "Key Research Findings" of our November Trend Report, Observability and Application Performance: Building Resilient Systems. Your responses help shape the narrative of our Trend Reports, so we cannot do this without you. We thank you in advance for your help! – The DZone Publications team
The Convergence of Testing and Observability

By Mirco Hering
This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report One of the core capabilities that has seen increased interest in the DevOps community is observability. Observability improves monitoring in several vital ways, making it easier and faster to understand business flows and allowing for enhanced issue resolution. Furthermore, observability goes beyond an operations capability and can be used for testing and quality assurance. Testing has traditionally faced the challenge of identifying the appropriate testing scope. "How much testing is enough?" and "What should we test?" are questions each testing executive asks, and the answers have been elusive. There are fewer arguments about testing new functionality; while not trivial, you know the functionality you built in new features and hence can derive the proper testing scope from your understanding of the functional scope. But what else should you test? What is a comprehensive general regression testing suite, and what previous functionality will be impacted by the new functionality you have developed and will release? Observability can help us with this as well as the unavoidable defect investigation. But before we get to this, let's take a closer look at observability. What Is Observability? Observability is not monitoring with a different name. Monitoring is usually limited to observing a specific aspect of a resource, like disk space or memory of a compute instance. Monitoring one specific characteristic can be helpful in an operations context, but it usually only detects a subset of what is concerning. All monitoring can show is that the system looks okay, but users can still be experiencing significant outages. Observability aims to make us see the state of the system by making data flows "observable." This means that we can identify when something starts to behave out of order and requires our attention. Observability combines logs, metrics, and traces from infrastructure and applications to gain insights. Ideally, it organizes these around workflows instead of system resources and, as such, creates a functional view of the system in use. Done correctly, it lets you see what functionality is being executed and how frequently, and it enables you to identify performance characteristics of the system and workflow. Figure 1: Observability combines metrics, logs, and traces for insights One benefit of observability is that it shows you the actual system. It is not biased by what the designers, architects, and engineers think should happen in production. It shows the unbiased flow of data. The users, over time (and sometimes from the very first day), find ways to use the system quite differently from what was designed. Observability makes such changes in behavior visible. Observability is incredibly powerful in debugging system issues as it allows us to navigate the system to see where problems occur. Observability requires a dedicated setup and some contextual knowledge similar to traceability. Traceability is the ability to follow a system transaction over time through all the different components of our application and infrastructure architecture, which means you have to have common information like an ID that enables this. OpenTelemetry is an open standard that can be used and provides useful guidance on how to set this up. Observability makes identifying production issues a lot easier. And we can use observability for our benefit in testing, too. 
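To make the traceability point concrete, here is a minimal sketch of emitting a trace with the OpenTelemetry Python SDK; the service name, span names, attribute, and collector endpoint are illustrative assumptions, not something prescribed by the article.

Python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a tracer provider that exports spans to a local OTLP-compatible collector.
resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# The trace and its IDs are the common information that lets you follow one
# transaction across all the components it touches.
with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("reserve-stock"):
        pass  # call the downstream component here; context propagation carries the trace ID along

Spans created this way are what the observability backend stitches into the workflow view described above.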
Observability of Testing: How to Look Left Two aspects of observability make it useful in the testing context: Its ability to make the actual system usage observable and its usefulness in finding problem areas during debugging. Understanding the actual system behavior is most directly useful during performance testing. Performance testing is the pinnacle of testing since it tries to achieve as close to the realistic peak behavior of a system as possible. Unfortunately, performance testing scenarios are often based on human knowledge of the system instead of objective information. For example, performance testing might be based on the prediction of 10,000 customer interactions per hour during a sales campaign based on the information of the sales manager. Observability information can help define the testing scenarios by using the information to look for the times the system was under the most stress in production and then simulate similar situations in the performance test environment. We can use a system signature to compare behaviors. A system signature in the context of observability is the set of values for logs, metrics, and traces during a specific period. Take, for example, a marketing promotion for new customers. The signature of the system should change during that period to show more new account creations with its associated functionality and the related infrastructure showing up as being more "busy." If the signature does not change during the promotion, we would predict that we also don't see the business metrics move (e.g., user sign-ups). In this example, the business metrics and the signature can be easily matched. Figure 2: A system behaving differently in test, which shows up in the system signature In many other cases, this is not true. Imagine an example where we change the recommendation engine to use our warehouse data going forward. We expect the system signature to show increased data flows between the recommendation engine and our warehouse system. You can see how system signatures and the changes of the system signature can be useful for testing; any differences in signature between production and the testing systems should be explainable by the intended changes of the upcoming release. Otherwise, investigation is required. In the same way, information from the production observability system can be used to define a regression suite that reflects the functionality most frequently used in production. Observability can give you information about the workflows still actively in use and which workflows have stopped being relevant. This information can optimize your regression suite both from a maintenance perspective and, more importantly, from a risk perspective, making sure that core functionality, as experienced by the user, remains in a working state. Implementing observability in your test environments means you can use the power of observability for both production issues and your testing defects. It removes the need for debugging modes to some degree and relies upon the same system capability as production. This way, observability becomes how you work across both dev and ops, which helps break down silos. Observability for Test Insights: Looking Right In the previous section, we looked at using observability by looking left or backward, ensuring we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing. 
During the test cycles, we see how this new feature changes the workflows, which shows up in our observability solution. We can see the new features being used and other features changing in usage as a result. The signature of our application has changed when we consider the logs, traces, and metrics of our system in test. Once we go live, we predict that the signature of the production system will change in a very similar way. If that happens, we will be happy. But what if the signature of the production system does not change as predicted? Let's take an example: We created a new feature that leverages information from previous bookings to better serve our customers by allocating similar seats and menu options. During testing, we tested the new feature with our test data set, and we see an increase in accessing the bookings database while the customer booking is being collated. Once we go live, we realize that the workflows are not utilizing the customer booking database, and we leverage the information from our observability tooling to investigate. We have found a case where the users are not using our new features or are not using the features in the expected way. In either case, this information allows us to investigate further to see whether more change management is required for the users or whether our feature is just not solving the problem in the way we wanted it to. Another way to use observability is to evaluate the performance of your changes in test and the impact on the system signature — comparing this afterwards with the production system signature can give valuable insights and prevent overall performance degradation. Our testing efforts (and the associated predictions) have now become a valuable tool for the business to evaluate the success of a feature, which elevates testing to become a business tool and a real value investment. Figure 3: Using observability in test by looking left and looking right Conclusion While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. It will create objectiveness for defining testing efforts and results by evaluating them against the actual system behavior in production. It also provides value to developer, tester, and business communities, which makes it a valuable tool for breaking down barriers. Using the same practices and tools across communities drives a common culture — after all, culture is nothing but repeated behaviors. This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report More
Software Engineering in the Age of Climate Change: A Testing Perspective
By Stelios Manioudakis
Janus vs. MediaSoup: The Ultimate Guide To Choosing Your WebRTC Server
By Nilesh Savani
Choosing the Appropriate AWS Load Balancer: ALB vs. NLB
By Satrajit Basu
Unveiling Vulnerabilities via Generative AI

Code scanning to detect exposure of security-sensitive parameters is a crucial practice in MuleSoft API development. Code scanning involves the systematic analysis of MuleSoft source code to identify vulnerabilities. These vulnerabilities range from hardcoded secure parameters like password or accessKey to the exposure of password or accessKey in plain text in property files. They might be exploited by malicious actors to compromise the confidentiality, integrity, or availability of the applications.

Lack of Vulnerability Auto-Detection

Neither MuleSoft Anypoint Studio nor the Anypoint Platform provides a feature to govern the vulnerabilities mentioned above. This can be managed by design-time governance, where a manual review of the code is needed. However, there are many tools available that can scan the deployed code or the code repository to find such vulnerabilities. You can even write custom code or scripts in any language to perform the same task, although writing custom code adds another layer of complexity and manageability.

Using Generative AI To Review the Code for Detecting Vulnerabilities

In this article, I present how generative AI can be leveraged to detect such vulnerabilities. I have used the OpenAI foundation model "gpt-3.5-turbo" to demonstrate the code scan feature that finds the aforementioned vulnerabilities. However, any foundation model can be used to implement this use case, and the implementation can be written in Python or any other language. The Python code can be used in the following ways: it can be executed manually to scan the code repository; it can be integrated into the CICD build pipeline, which scans the code, reports the vulnerabilities, and fails the build if vulnerabilities are present; or it can be integrated into another program, such as a Lambda function, which runs periodically and executes the Python code to scan the code repository and report vulnerabilities.

High-Level Architecture

There are many ways to execute the Python code. A more appropriate and practical way is to integrate it into the CICD build pipeline: the CICD build pipeline executes the Python code; the Python code reads the MuleSoft XML files and property files; the Python code sends the MuleSoft code content and a prompt to the OpenAI gpt-3.5-turbo model; the OpenAI model returns the hardcoded and unencrypted values; and the Python code generates a report of the vulnerabilities found.

Implementation Details

The MuleSoft API project structure contains two major sections where security-sensitive parameters can be exposed as plain text. The src/main/mule folder contains all the XML files, which define process flows, connection details, and exception handling. A MuleSoft API project may also contain custom Java code; however, in this article, I have not considered custom Java code used in the MuleSoft API. The src/main/resources folder contains environment property files. These can be .properties or .yaml files for development, quality, and production, and they hold property key values, for example user, password, host, port, accessKey, and secretAccessKey, which should be kept in an encrypted format. Based on the MuleSoft project structure, implementation can be achieved in two steps.

MuleSoft XML File Scan

The actual code is defined as process flows in MuleSoft Anypoint Studio.
We can write Python code that uses the OpenAI foundation model with a prompt that scans the MuleSoft XML files containing the code implementation to find hardcoded parameter values. For example:

Global.xml/Config.xml file: This file contains all the connector configurations. This is the standard recommended by MuleSoft; however, it may vary depending on the standards and guidelines defined in your organization. A generative AI foundation model can use this content to find hardcoded values.

Other XML files: These files may contain custom code or process flows making API calls, DB calls, or other system calls. They may have connection credentials hard-coded by mistake. A generative AI foundation model can use this content to find hardcoded values.

I have provided a screenshot of a sample MuleSoft API code. This code has three XML files: api.xml, which contains the REST API flow; process.xml, which has a JMS-based asynchronous flow; and global.xml, which has all the connection configurations. For demonstration purposes, I have used the global.xml file. The code snippet has many hardcoded values for demonstration, highlighted in red boxes.

Python Code

The Python code below uses the OpenAI foundation model to scan the above XML files to find the hard-coded values.

Python
import openai, os, glob
from dotenv import load_dotenv

load_dotenv()
APIKEY = os.getenv('API_KEY')
openai.api_key = APIKEY

file_path = "C:/Work/MuleWorkspace/test-api/src/main/mule/global.xml"

try:
    with open(file_path, 'r') as file:
        file_content = file.read()
    print(file_content)
except FileNotFoundError:
    print("File not found.")
except Exception as e:
    print("An error occurred:", e)

message = [
    {"role": "system", "content": "You will be provided with xml as input, and your task is to list the non-hard-coded value and hard-coded value separately. Example: For instance, if you were to find the hardcoded values, the hard-coded value look like this: name=\"value\". if you were to find the non-hardcoded values, the non-hardcoded value look like this: host=\"${host}\" "},
    {"role": "user", "content": f"input: {file_content}"}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message,
    temperature=0,
    max_tokens=256
)

result = response["choices"][0]["message"]["content"]
print(result)

Once this code is executed, we get the result from the generative AI model. Similarly, we can provide api.xml and process.xml to scan for hard-coded values. You can even modify the Python code to read all the XML files iteratively and get the results in sequence for all the files; a sketch of this is included at the end of the article.

Scanning the Property Files

We can send another prompt to the AI model, which can find the plain-text passwords kept in property files. In the following screenshot, the dev-secure.yaml file has client_secret as an encrypted value, while db.password and jms.password are kept as plain text.

Python Code

The Python code below uses the OpenAI foundation model to scan config files to find the unencrypted values.
Python import openai,os,glob from dotenv import load_dotenv load_dotenv() APIKEY=os.getenv('API_KEY') openai.api_key= APIKEY file_path = "C:/Work/MuleWorkspace/test-api/src/main/resources/config/secure/dev-secure.yaml" try: with open(file_path, 'r') as file: file_content = file.read() except FileNotFoundError: print("File not found.") except Exception as e: print("An error occurred:", e) message = [ {"role": "system", "content": "You will be provided with xml as input, and your task is to list the encrypted value and unencrypted value separately. Example: For instance, if you were to find the encrypted values, the encrypted value look like this: ""![asdasdfadsf]"". if you were to find the unencrypted values, the unencrypted value look like this: ""sdhfsd"" "}, {"role": "user", "content": f"input: {file_content}"} ] response = openai.ChatCompletion.create( model="gpt-3.5-turbo", messages=message, temperature=0, max_tokens=256 ) result=response["choices"][0]["message"]["content"] print(result) Once this code is executed, we get the following outcome: result from Generative AI Impact of Generative AI on the Development Life Cycle We see a significant impact on the development lifecycle. We can think of leveraging Generative AI for different use cases related to the development life cycle. Efficient and Comprehensive Analysis Generative AI models like GPT-3.5 have the ability to comprehend and generate human-like text. When applied to code review, they can analyze code snippets, provide suggestions for improvements, and even identify patterns that might lead to bugs or vulnerabilities. This technology enables a comprehensive examination of code in a relatively short span of time. Automated Issue Identification Generative AI can assist in detecting potential issues such as syntax errors, logical flaws, and security vulnerabilities. By automating these aspects of code review, developers can allocate more time to higher-level design decisions and creative problem-solving. Adherence To Best Practices Through analysis of coding patterns and context, Generative AI can offer insights on adhering to coding standards and best practices. Learning and Improvement Generative AI models can "learn" from vast amounts of code examples and industry practices. This knowledge allows them to provide developers with contextually relevant recommendations. As a result, both the developers and the AI system benefit from a continuous learning cycle, refining their understanding of coding conventions and emerging trends. Conclusion In conclusion, conducting a code review to find security-sensitive parameters exposed as plain text using OpenAI's technology has proven to be a valuable and efficient process. Leveraging OpenAI for code review not only accelerated the review process but also contributed to producing more robust and maintainable code. However, it's important to note that while AI can greatly assist in the review process, human oversight and expertise remain crucial for making informed decisions and fully understanding the context of the code.
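Picking up the earlier suggestion to read all the files iteratively, here is a minimal sketch of how the scan could be wired into a CICD step. It reuses the pre-1.0 openai client interface shown above, while the file globs, the consolidated prompt, the NONE reply convention, and the non-zero exit code are illustrative assumptions rather than part of the original implementation.

Python
import glob
import os
import sys

import openai
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("API_KEY")

# One consolidated prompt for both XML flows and property files (illustrative wording).
PROMPT = ("You will be provided with a MuleSoft XML or YAML file as input. "
          "List any hard-coded credentials or unencrypted secret values you find, one per line. "
          "Reply with the single word NONE if there are none.")

def scan_file(path):
    with open(path, "r") as f:
        content = f.read()
    response = openai.ChatCompletion.create(   # pre-1.0 openai client, as in the snippets above
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": f"input: {content}"}],
        temperature=0,
        max_tokens=256,
    )
    return response["choices"][0]["message"]["content"]

findings = {}
# Assumed project layout: flows under src/main/mule, property files under src/main/resources.
for path in (glob.glob("src/main/mule/**/*.xml", recursive=True)
             + glob.glob("src/main/resources/**/*.yaml", recursive=True)):
    result = scan_file(path)
    if "NONE" not in result:
        findings[path] = result

for path, result in findings.items():
    print(f"--- {path}\n{result}")

# A non-zero exit code lets the CICD pipeline fail the build when vulnerabilities are reported.
sys.exit(1 if findings else 0)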

By Ajay Singh
Building Real-Time Applications to Process Wikimedia Streams Using Kafka and Hazelcast

In this tutorial, developers, solution architects, and data engineers can learn how to build high-performance, scalable, and fault-tolerant applications that react to real-time data using Kafka and Hazelcast. We will be using Wikimedia as a real-time data source. Wikimedia provides various streams and APIs (Application Programming Interfaces) to access real-time data about edits and changes made to their projects. For example, this source provides a continuous stream of updates on recent changes, such as new edits or additions to Wikipedia articles. Developers and solution architects often use such streams to monitor and analyze the activity on Wikimedia projects in real-time or to build applications that rely on this data, like this tutorial. Kafka is great for event streaming architectures, continuous data integration (ETL), and messaging systems of record (database). Hazelcast is a unified real-time stream data platform that enables instant action on data in motion by combining stream processing and a fast data store for low-latency querying, aggregation, and stateful computation against event streams and traditional data sources. It allows you to build resource-efficient, real-time applications quickly. You can deploy it at any scale from small edge devices to a large cluster of cloud instances. In this tutorial, we will guide you through setting up and integrating Kafka and Hazelcast to enable real-time data ingestion and processing for reliable streaming processing. By the end, you will have a deep understanding of how to leverage the combined capabilities of Hazelcast and Kafka to unlock the potential of streaming processing and instant action for your applications. So, let's get started! Wikimedia Event Streams in Motion First, let’s understand what we are building: Most of us use or read Wikipedia, so let’s use Wikipedia's recent changes as an example. Wikipedia receives changes from multiple users in real time, and these changes contain details about the change such as title, request_id, URI, domain, stream, topic, type, user, topic, title_url, bot, server_name, and parsedcomment. We will read recent changes from Wikimedia Event Streams. Event Streams is a web service that exposes streams of structured event data in real time. It does it over HTTP with chunked transfer encoding in accordance with the Server-Sent Events protocol (SSE). Event Streams can be accessed directly through HTTP, but they are more often used through a client library. An example of this is a “recentchange”. But what if you want to process or enrich changes in real time? For example, what if you want to determine if a recent change is generated by a bot or human? How can you do this in real time? There are actually multiple options, but here we’ll show you how to use Kafka to transport data and how to use Hazelcast for real-time stream processing for simplicity and performance. Here’s a quick diagram of the data pipeline architecture: Prerequisites If you are new to Kafka or you’re just getting started, I recommend you start with Kafka Documentation. If you are new to Hazelcast or you’re just getting started, I recommend you start with Hazelcast Documentation. For Kafka, you need to download Kafka, start the environment, create a topic to store events, write some events to your topic, and finally read these events. Here’s a Kafka Quick Start. For Hazelcast, you can use either the Platform or the Cloud. I will use a local cluster. 
Step #1: Start Kafka Run the following commands to start all services in the correct order: Markdown # Start the ZooKeeper service $ bin/zookeeper-server-start.sh config/zookeeper.properties Open another terminal session and run: Markdown # Start the Kafka broker service $ bin/kafka-server-start.sh config/server.properties Once all services have successfully launched, you will have a basic Kafka environment running and ready to use. Step #2: Create a Java Application Project The pom.xml should include the following dependencies in order to run Hazelcast and connect to Kafka: XML <dependencies> <dependency> <groupId>com.hazelcast</groupId> <artifactId>hazelcast</artifactId> <version>5.3.1</version> </dependency> <dependency> <groupId>com.hazelcast.jet</groupId> <artifactId>hazelcast-jet-kafka</artifactId> <version>5.3.1</version> </dependency> </dependencies> Step #3: Create a Wikimedia Publisher Class Basically, the class reads from a URL connection, creates a Kafka Producer, and sends messages to a Kafka topic: Java public static void main(String[] args) throws Exception { String topicName = "events"; URLConnection conn = new URL ("https://stream.wikimedia.org/v2/stream/recentchange").openConnection(); BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8)); try (KafkaProducer<Long, String> producer = new KafkaProducer<>(kafkaProps())) { for (long eventCount = 0; ; eventCount++) { String event = reader.readLine(); producer.send(new ProducerRecord<>(topicName, eventCount, event)); System.out.format("Published '%s' to Kafka topic '%s'%n", event, topicName); Thread.sleep(20 * (eventCount % 20)); } } } private static Properties kafkaProps() { Properties props = new Properties(); props.setProperty("bootstrap.servers", "127.0.0.1:9092"); props.setProperty("key.serializer", LongSerializer.class.getCanonicalName()); props.setProperty("value.serializer", StringSerializer.class.getCanonicalName()); return props; } Step #4: Create a Main Stream Processing Class This class creates a pipeline that reads from a Kafka source using the same Kafka topic, and then it filters out messages that were created by bots (bot:true), keeping only messages created by humans. It sends the output to a logger: Java public static void main(String[] args) { Pipeline p = Pipeline.create(); p.readFrom(KafkaSources.kafka(kafkaProps(), "events")) .withNativeTimestamps(0) .filter(event-> Objects.toString(event.getValue()).contains("bot\":false")) .writeTo(Sinks.logger()); JobConfig cfg = new JobConfig().setName("kafka-traffic-monitor"); HazelcastInstance hz = Hazelcast.bootstrappedInstance(); hz.getJet().newJob(p, cfg); } private static Properties kafkaProps() { Properties props = new Properties(); props.setProperty("bootstrap.servers", "127.0.0.1:9092"); props.setProperty("key.deserializer", LongDeserializer.class.getCanonicalName()); props.setProperty("value.deserializer", StringDeserializer.class.getCanonicalName()); props.setProperty("auto.offset.reset", "earliest"); return props; } Step #5: Enriching a Stream If you want to enrich real-time messages with batch or static data such as location details, labels, or some features, you can follow the next step: Create a Hazelcast Map and load static data into it. Use the Map to enrich the Message stream using mapUsingIMap. Conclusion In this post, we explained how to build a real-time application to process Wikimedia streams using Kafka and Hazelcast. 
Hazelcast allows you to quickly build resource-efficient, real-time applications. You can deploy it at any scale, from small-edge devices to a large cluster of cloud instances. A cluster of Hazelcast nodes shares the data storage and computational load, which can dynamically scale up and down. Referring to the Wikimedia example, it means that this solution is reliable, even when there are significantly higher volumes of users making changes to Wikimedia. We look forward to your feedback and comments about this blog post!

By Fawaz Ghali, PhD
Extracting Maximum Value From Logs

Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system. However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging. With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post. We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention. Let’s dive in with a look at log levels and formats. Logging Design Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats. Log Levels Log levels are used to categorize log messages based on their severity. Specific log levels used may vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest): TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change. DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments. INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution. WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary. ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality. Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively. When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently. Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels. Structured Versus Unstructured Logs Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically. Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems. On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly. However, programmatically extracting specific information from the resulting logs can be very challenging. Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. 
If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits. However, if all you need is simplicity and readability, then unstructured logs may be sufficient. In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages. For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning. Logging Challenges With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings. Disparate Destinations Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome. For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic. Varying Formats Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields. Unifying the information you get from a diversity of logs and formats requires the right tools. Inconsistent Log Levels Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels. One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation. You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose. Log Storage Cost Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system. Dealing With These Challenges While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices. Aggregate Your Logs When you run a distributed system, you should use a centralized logging solution. As you run log collection agents on each machine in your system, these collectors will send all the logs to your central observability platform. Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation. Move Toward a Unified Format Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format. The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down. 
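To make the unified-format idea more concrete, here is a minimal sketch, not tied to any particular logging product, that maps both a structured JSON entry and a legacy free-form line onto one shared schema; the field names and the legacy pattern are assumptions for illustration.

Python
import json
import re

# Unified target schema: timestamp, level, service, message (field names are illustrative).
LEGACY = re.compile(r"^(?P<timestamp>\S+T\S+) \[(?P<level>\w+)\] (?P<message>.+)$")

def normalize(line, service):
    """Map either a JSON log entry or a legacy free-form line onto the unified schema."""
    try:
        record = json.loads(line)                      # already structured
    except json.JSONDecodeError:
        m = LEGACY.match(line)                         # known legacy format
        record = m.groupdict() if m else {"level": "INFO", "message": line}
    record.setdefault("service", service)
    return {key: record.get(key) for key in ("timestamp", "level", "service", "message")}

print(json.dumps(normalize('2023-10-12T14:03:01Z [WARNING] disk usage at 91%', "billing")))
print(json.dumps(normalize('{"timestamp": "2023-10-12T14:03:02Z", "level": "ERROR", "message": "payment failed"}', "billing")))

Once every component's logs pass through a shim like this, the same queries and dashboards work across all of them.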
Establish a Logging Standard Across Your Applications For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics. If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard. If a migration is not feasible, treat your legacy applications like you would third-party applications. Enrich Logs From Third-Party Sources Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities. To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics). Manage Log Volume, Frequency, and Retention Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage. Volume: Monitoring generated log volume helps you control resource consumption and performance impacts. Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring. Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage. Rotation: Periodically archive or purge older log files to manage log file sizes effectively. Compression: Compress log files to reduce storage requirements. Log-Based Metrics Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working log-based metrics has its benefits and challenges. Benefits Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues. Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience. Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability. Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs. Challenges Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn’t make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task. This identification requires a deep understanding of system behavior and close alignment with your business objectives. Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next. Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge. Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. 
For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively. Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance. Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information, not contributing toward meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events Long-Term Log Retention After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area. How Long Should You Keep Logs Around? How long you should keep a log around depends on several factors, including: Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting. Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years. Company policy: Your company may have policies that dictate how long logs should be kept. Log size: If your logs are large, you may need to rotate them or delete them more frequently. Storage cost: Regardless of where you store your logs—on-premise or in the cloud—you will need to factor in the cost of storage. How Do You Reduce the Level of Detail and Cost of Older Logs? Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around. When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures: Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10). Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code. Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements. Conclusion In this article, we’ve looked at how to get the most out of logging in large-scale systems. Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources. Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system. And you can do this while keeping your logging costs at bay.

By Alvin Lee
Send Your Logs to Loki

One of my current talks focuses on Observability in general and Distributed Tracing in particular, with an OpenTelemetry implementation. In the demo, I show how you can see the traces of a simple distributed system consisting of the Apache APISIX API Gateway, a Kotlin app with Spring Boot, a Python app with Flask, and a Rust app with Axum. Earlier this year, I spoke and attended the Observability room at FOSDEM. One of the talks demoed the Grafana stack: Mimir for metrics, Tempo for traces, and Loki for logs. I was pleasantly surprised how one could move from one to the other. Thus, I wanted to achieve the same in my demo but via OpenTelemetry to avoid coupling to the Grafana stack. In this blog post, I want to focus on logs and Loki. Loki Basics and Our First Program At its core, Loki is a log storage engine: Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream. Loki Loki provides a RESTful API to store and read logs. Let's push a log from a Java app. Loki expects the following payload structure: I'll use Java, but you can achieve the same result with a different stack. The most straightforward code is the following: Java public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException { var template = "'{' \"streams\": ['{' \"stream\": '{' \"app\": \"{0}\" '}', \"values\": [[ \"{1}\", \"{2}\" ]]'}']'}'"; //1 var now = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant(); var nowInEpochNanos = NANOSECONDS.convert(now.getEpochSecond(), SECONDS) + now.getNano(); var payload = MessageFormat.format(template, "demo", String.valueOf(nowInEpochNanos), "Hello from Java App"); //1 var request = HttpRequest.newBuilder() //2 .uri(new URI("http://localhost:3100/loki/api/v1/push")) .header("Content-Type", "application/json") .POST(HttpRequest.BodyPublishers.ofString(payload)) .build(); HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); //3 } This is how we did String interpolation in the old days Create the request Send it The prototype works, as seen in Grafana: However, the code has many limitations: The label is hard-coded. You can and must send a single-label Everything is hard-coded; nothing is configurable, e.g., the URL The code sends one request for every log; it's hugely inefficient as there's no buffering HTTP client is synchronous, thus blocking the thread while waiting for Loki No error handling whatsoever Loki offers both gzip compression and Protobuf; none are supported with my code Finally, it's completely unrelated to how we use logs, e.g.: Java var logger = // Obtain logger logger.info("My message with parameters {}, {}", foo, bar); Regular Logging on Steroids To use the above statement, we need to choose a logging implementation. Because I'm more familiar with it, I'll use SLF4J and Logback. Don't worry; the same approach works for Log4J2. 
We need to add relevant dependencies: XML <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-api</artifactId> <!--1--> <version>2.0.7</version> </dependency> <dependency> <groupId>ch.qos.logback</groupId> <artifactId>logback-classic</artifactId> <!--2--> <version>1.4.8</version> <scope>runtime</scope> </dependency> <dependency> <groupId>com.github.loki4j</groupId> <artifactId>loki-logback-appender</artifactId> <!--3--> <version>1.4.0</version> <scope>runtime</scope> </dependency> SLF4J is the interface Logback is the implementation Logback appender dedicated to SLF4J Now, we add a specific Loki appender: XML <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender"> <!--1--> <http> <url>http://localhost:3100/loki/api/v1/push</url> <!--2--> </http> <format> <label> <pattern>app=demo,host=${HOSTNAME},level=%level</pattern> <!--3--> </label> <message> <pattern>l=%level h=${HOSTNAME} c=%logger{20} t=%thread | %msg %ex</pattern> <!--4--> </message> <sortByTime>true</sortByTime> </format> </appender> <root level="DEBUG"> <appender-ref ref="STDOUT" /> </root> The loki appender Loki URL As many labels as wanted Regular Logback pattern Our program has become much more straightforward: Java var who = //... var logger = LoggerFactory.getLogger(Main.class.toString()); logger.info("Hello from {}!", who); Grafana displays the following: Docker Logging I'm running most of my demos on Docker Compose, so I'll mention the Docker logging trick. When a container writes on the standard out, Docker saves it to a local file. The docker logs command can access the file content. However, other options than saving to a local file are available, e.g., syslog, Google Cloud, Splunk, etc. To choose a different option, one sets a logging driver. One can configure the driver at the overall Docker level or per container. Loki offers its own plugin. To install it: Shell docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions At this point, we can use it on our container app: YAML services: app: build: . logging: driver: loki #1 options: loki-url: http://localhost:3100/loki/api/v1/push #2 loki-external-labels: container_name={{.Name},app=demo #3 Loki logging driver URL to push to Additional labels The result is the following. Note the default labels. Conclusion From a bird's eye view, Loki is nothing extraordinary: it's a plain storage engine with a RESTful API on top. Several approaches are available to use the API. Beyond the naive one, we have seen a Java logging framework appender and Docker. Other approaches include scraping the log files, e.g., Promtail, via a Kubernetes sidecar. You could also add an OpenTelemetry Collector between your app and Loki to perform transformations. Options are virtually unlimited. Be careful to choose the one that fits your context the best. To go further: Push log entries to Loki via API Loki Clients

By Nicolas Fränkel
How to Configure Istio, Prometheus and Grafana for Monitoring

Intro to Istio Observability Using Prometheus Istio service mesh abstracts the network from the application layers using sidecar proxies. You can implement security and advance networking policies to all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of Site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput. In this article, we will discuss how SREs can benefit from integrating three open-source software- Istio, Prometheus, and Grafana. While Istio is the most famous service software, Prometheus is the most widely used monitoring software, and Grafana is the most famous visualization tool. Note: The steps are tested for Istio 1.17.X Watch the Video of Istio, Prometheus, and Grafana Configuration Watch the video if you want to follow the steps from the video: Step 1: Go to Istio Add-Ons and Apply Prometheus and Grafana YAML File First, go to the add-on folder in the Istio directory using the command. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus by using the following command: Shell kubectl apply -f prometheus.yaml Shell kubectl apply -f grafana.yaml Note these add-on YAMLs are applied to istio-system namespace by default. Step 2: Deploy New Service and Port-Forward Istio Ingress Gateway To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an object of the Istio ingress gateway to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port-7777. You should see the below screen at localhost:7777 Step 3: Open Prometheus and Grafana Dashboard You can open the Prometheus dashboard by using the following command. Shell istioctl dashboard prometheus Shell istioctl dashboard grafana Both the Grafana and Prometheus will open in the localhost. Step 4: Make HTTP Requests From Postman We will see how the httpbin service is consuming CPU or memory when there is a traffic load. We will create a few GET and POST requests to the localhost:7777 from the Postman app. Once you GET or POST requests to httpbin service multiple times, there will be utilization of resources, and we can see them in Grafana. But at first, we need to configure the metrics for httpbin service in Prometheus and Grafana. Step 5: Configuring Metrics in Prometheus One can select a range of metrics related to any Kubernetes resources such as API server, applications, workloads, envoy, etc. We will select container_memory_working_set_bytes metrics for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term: container_memory_working_set_bytes { namespace= “istio-telemetry”} (istio-telemetry is the name of our Istio-enabled namespace, where httpbin service is deployed) Note that, simply running this, we get the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage of each pod grouped by pod. 
The following query will help us get the desired result: sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod). Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article is to showcase the ability of Istio to emit and send metrics to Prometheus for collection. Step 6: Configuring Istio Metrics Graphs in Grafana Now, you can simply take the query sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) from Prometheus and plot a graph over time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles before sharing it with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana; you can make all of these customizations based on your enterprise needs. I have done a few experiments in the video; feel free to check it out. Conclusion Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have offered just a small use case of metrics scraping and visualization using Istio, Prometheus, and Grafana. You can also perform logging and tracing of real-time traffic using Istio; we will cover those topics in subsequent blogs.

By Md Azmal
Machine Learning Libraries For Any Project

There are many libraries out there that can be used in machine learning projects. Of course, some of them have gained considerable reputations through the years. Such libraries are the straight-away picks for anyone starting a new project that utilizes machine learning algorithms. However, choosing the correct set (or stack) may be quite challenging. The Why In this post, I would like to give you a general overview of the machine learning libraries landscape and share some of my thoughts about working with them. If you are starting your journey with machine learning libraries, my text can give you some general knowledge of them and provide a better starting point for learning more. The libraries described here will be divided by the role they can play in your project. The categories are as follows:

Model Creation - Libraries that can be used to create machine learning models
Working with data - Libraries that can be used for feature engineering, feature extraction, and all other operations that involve working with features
Hyperparameter optimization - Libraries and tools that can be used for optimizing model hyperparameters
Experiment tracking - Libraries and tools used for experiment tracking
Problem-specific libraries - Libraries that can be used for tasks like time series forecasting, computer vision, and working with spatial data
Utils - Not strictly machine learning libraries, but nevertheless ones I have found useful in my projects

Model Creation

PyTorch

Developed by people from Facebook and open-sourced in 2017, it is one of the market's most famous machine learning libraries, based on the open-source Torch package. The PyTorch ecosystem can be used for all types of machine learning problems and has a great variety of purpose-built libraries like torchvision or torchaudio. The basic data structure of PyTorch is the Tensor object, which is used to hold the multidimensional data utilized by our model. It is similar in its conception to the NumPy ndarray. PyTorch can also use computation accelerators, and it supports CUDA-capable NVIDIA GPUs, ROCm, the Metal API, and TPUs. The most important part of the core PyTorch library is the nn module, which contains layers and tools to easily build complex models layer by layer.

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

An example of a simple neural network with 3 linear layers in PyTorch

Additionally, PyTorch 2.0 has already been released, and it makes PyTorch even better. Moreover, PyTorch is used by a variety of companies like Uber, Tesla, and Facebook, just to name a few.

PyTorch Lightning

It is a sort of "extension" for PyTorch, which aims to greatly reduce the amount of boilerplate code needed to utilize our models. Lightning is based on the concept of hooks: functions called on specific phases of the model train/eval loop. Such an approach allows us to pass callback functions executed at a specific time, like the end of the training step. Lightning's Trainer automates many functions that one has to take care of in PyTorch, for example, the training loop, hardware calls, or zeroing gradients. Below are roughly equivalent code fragments of PyTorch (left) and PyTorch Lightning (right).
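The original side-by-side comparison is an image; as a rough sketch only, the Lightning half of that comparison could look like the code below, where the loss function, optimizer, and learning rate are illustrative assumptions.

Python
import torch
from torch import nn
import pytorch_lightning as pl

class LitNeuralNetwork(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        return self.linear_relu_stack(self.flatten(x))

    def training_step(self, batch, batch_idx):
        # Called by Lightning for every batch; no manual loop, device placement, or zero_grad needed.
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=3)
# trainer.fit(LitNeuralNetwork(), train_dataloaders=train_loader)  # train_loader is assumed to exist

The Trainer then owns the training loop, so the hooks above are essentially all the model-specific code you need to write.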
TensorFlow

A library developed by a team from Google Brain, it was originally released in 2015 under the Apache 2.0 License, and version 2.0 was released in 2019. It provides clients in Java, C++, Python, and even JavaScript. Similar to PyTorch, it is widely adopted throughout the market and used by companies like Google (surprise), Airbnb, and Intel. TensorFlow also has quite an extensive ecosystem built around it by Google. It contains tools and libraries such as an optimization toolkit, TensorBoard (more about it in the "Experiment Tracking" section below), and recommenders. The TensorFlow ecosystem also includes a web-based sandbox to play around with your model's visualization. Again, the tf.nn module plays the most vital part, providing all the building blocks required to build machine learning models. TensorFlow uses its own Tensor (flow ;p) object for holding data utilized by deep learning models. It also supports all the common computation accelerators like CUDA or ROCm (community), Metal API, and TPUs.

class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])

    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Note that in TensorFlow and Keras, we use the Dense layer instead of the Linear layer used in PyTorch. We also use the call method instead of the forward method to define the forward pass of the model.

Keras

It is a library similar in purpose and inception to PyTorch Lightning, but for TensorFlow: it offers a higher-level interface over TensorFlow. Developed by François Chollet and released in 2015, it provides only Python clients. Keras also has its own set of problem-specific libraries like KerasCV or KerasNLP for more specialized use cases. Before version 2.4, Keras supported more backends than just TensorFlow, but after that release, TensorFlow became the only supported backend. As Keras is just an interface for TensorFlow, it shares the same base concepts as its underlying backend. The same holds true for supported computation accelerators. Keras is used by companies like IBM, PayPal, and Netflix.

class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])

    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

As noted above, the Dense layer and the call method are used here as well.

PyTorch vs. TensorFlow

I wouldn't be fully honest if I did not introduce some comparison between these two. As you read a moment ago, both are quite similar in the features they offer and the ecosystems around them. Of course, there are some minor differences and quirks in how both work or the features they provide. In my opinion, these are more or less insignificant. The real difference between them comes from their approach to defining and executing the computational graphs of machine and deep learning models. PyTorch uses dynamic computational graphs, which means that the graph is defined on the fly during execution.
This allows for more flexibility and intuitive debugging, as developers can modify the graph at runtime and easily inspect intermediate outputs. On the other hand, this approach may be less efficient than static graphs, particularly for complex models. However, PyTorch 2.0 attempts to address these issues via torch.compile and FX graphs. TensorFlow uses static computational graphs, which are compiled before execution. This allows for more efficient execution, as the graph can be optimized and parallelized for the target hardware. However, it can also make debugging more difficult, as intermediate outputs are not readily accessible.

Another noticeable difference is that PyTorch seems to be more low level than Keras while being more high level than pure TensorFlow. Such a setting makes PyTorch more flexible and easier to use for building tailored models with many customizations. As a side note, I would like to add that both libraries are roughly on equal terms in terms of market share. Additionally, despite the fact that TensorFlow uses the call method and PyTorch uses the forward method, both libraries support call semantics, so model(x) works as a shorthand in both.

Working With Data

pandas

A library that you must have heard of if you are using Python, it is probably the most famous Python library for working with data of any type. It was originally released in 2008, and version 1.0 followed in 2020. It provides functions for filtering, aggregating, and transforming data, as well as merging multiple datasets. The cornerstone of this library is the DataFrame object, which represents a multidimensional table of any type of data. The library heavily focuses on performance, with some parts written in pure C to boost it. Besides being performance-focused, pandas provides a lot of features related to:

Data cleaning and preprocessing - removing duplicates and filling null/NaN values
Time series analysis - resampling, windowing, and time shifting

Additionally, it can perform a variety of input/output operations:

Reading from/writing to .csv or .xlsx files
Performing database queries
Loading data from GCP BigQuery (with the help of pandas-gbq)

NumPy

It is yet another famous library for working with data, mostly numeric data. The most famous part of NumPy is the ndarray, a structure representing a multidimensional array of numbers. Besides ndarray, NumPy provides a lot of high-level mathematical functions and operations used to work with this data. It is also probably the oldest library in this set, as the first version was released in 2005. It was implemented by Travis Oliphant, based on an even older library called Numeric (released in 1996). NumPy is extremely focused on performance, with contributors constantly trying to squeeze more out of the currently implemented algorithms to reduce the execution time of NumPy functions even further. Of course, as with all libraries described here, NumPy is also open source and uses a BSD license.

SciPy

It is a library focused on supporting scientific computations. Released in 2001, it is even older than NumPy (2005). It is built atop NumPy, with ndarray being the basic data structure used throughout SciPy. Among other things, the library adds functions for optimization, linear algebra, signal processing, interpolation, and sparse matrix support. In general, it is more high-level than NumPy and thus can provide more complex functions.
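Since these data libraries come up throughout the rest of the article, here is a short, self-contained sketch of my own (not from the original text) showing the two basic objects described above, a pandas DataFrame and a NumPy ndarray, with the kind of cleaning and aggregation operations mentioned; all values in it are made up.

import numpy as np
import pandas as pd

# pandas: filling missing values and aggregating by a key column
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Paris"],
    "sales": [10.0, None, 7.5, 12.0],
})
df["sales"] = df["sales"].fillna(0.0)          # fill null/NaN values
per_city = df.groupby("city")["sales"].sum()   # aggregate per city

# NumPy: an ndarray plus a vectorized mathematical operation
arr = np.arange(12).reshape(3, 4)
normalized = (arr - arr.mean()) / arr.std()

print(per_city)
print(normalized.shape)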
Hyperparameter Optimization

Ray Tune

It is part of the Ray toolset, a bundle of related libraries for building distributed applications focusing on machine learning and Python. The Tune part of the library is focused on providing hyperparameter optimization features through a variety of search algorithms; for example, grid search, Hyperband, or Bayesian optimization. Ray Tune can work with models created in most programming languages and libraries available on the market. All of the libraries described in the section about model creation are supported by Ray Tune.

The key concepts of Ray Tune are:

Trainables - Objects passed to Tune runs; they wrap the model whose parameters we want to optimize
Search space - Contains all the values of hyperparameters we want to check in the current trial
Tuner - An object responsible for executing runs; calling tuner.fit() starts the process of searching for the optimal hyperparameter set. It requires passing at least a trainable object and a search space
Trial - Each trial represents a particular run of a trainable object with an exact set of parameters from the search space. Trials are generated by the Ray Tune Tuner. As a trial represents the output of running the tuner, it contains a ton of information, such as the config used for that particular trial, the trial ID, and many others
Search algorithms - The algorithm used for a particular execution of Tuner.fit; if not provided, Ray Tune will use RandomSearch as the default
Schedulers - Objects that are in charge of managing runs. They can pause, stop, and rerun trials within the run, which can result in increased efficiency and a reduced run time. If none is selected, Tune will pick FIFO as the default - trials will be executed one by one, as in a classic queue
Result analysis - The object containing the results of a Tuner.fit execution in the form of a ResultGrid object. It contains all the data related to the run, like the best result among all trials or the data from all trials

BoTorch

BoTorch is a library built atop PyTorch and is part of the PyTorch ecosystem. It focuses solely on providing hyperparameter optimization with the use of Bayesian optimization. As the only library in this section designed to work with a specific model library, BoTorch may be problematic to use with libraries other than PyTorch. It is also the only library here that is still in beta and under intensive development, so some unexpected problems may occur. The key feature of BoTorch is its integration with PyTorch, which greatly eases interaction between the two.

Experiment Tracking

Neptune.ai

It is a web-based tool that serves both as an experiment tracking tool and a model registry. The tool is cloud-based in the classic SaaS model, but if you are determined enough, there is the possibility of using a self-hosted variant. It provides a dashboard where you can view and compare the results of training your model. It can also be used to store the parameters used for particular runs. Additionally, you can easily version the dataset used for particular runs and all the metadata you think may come in handy. Moreover, it enables easy version control for your models. Obviously, the tool is library agnostic and can host models created using any library. To make the integration possible, Neptune exposes a REST-style API with its own client. The client can be downloaded and installed via pip or any other Python dependency tool. The API is decently documented and quite easy to grasp.
The tool is paid and has a simple pricing plan divided into three categories. Yet if you need it just for a personal project, or you are in a research or academic unit, you can apply to use the tool for free. Neptune.ai is a new tool, so some features known from other experiment tracking tools may not be present. Despite that fact, Neptune.ai support is keen to react to user feedback and implement missing functionalities - at least, that was the case for us, as we use Neptune.ai extensively in The Codos Project.

Weights & Biases

Also known as WandB or W&B, this is a web-based tool that exposes all the functionality needed to be used as an experiment tracking tool and model registry. It exposes a more or less similar set of functionalities to Neptune.ai. However, Weights & Biases seems to have better visualizations and, in general, is a more mature tool than Neptune. Additionally, WandB seems to be more focused on individual projects and researchers, with less emphasis on collaboration. It also has a simple pricing plan divided into three categories, with a free tier for private use. W&B has the same approach as Neptune to researchers and academic units - they can always use Weights & Biases for free. Weights & Biases also exposes a REST-like API to ease the integration process. It seems to be better documented and offers more features than the one exposed by Neptune.ai. What is curious is that they also expose a client library written in Java - if, for some reason, you wrote your machine learning model in Java instead of Python.

TensorBoard

It is a dedicated visualization toolkit for the TensorFlow ecosystem. It is designed mostly to work as an experiment tracking tool with a focus on metrics visualization. Despite being a dedicated TensorFlow tool, it can also be used with Keras (not surprisingly) and PyTorch. Additionally, it is the only free tool of the three described in this section. Here you can host and track your experiments. However, TensorBoard is missing model registry functionality, which can be quite problematic and force you to use some third-party tool to cover this gap. Anyway, there is surely a tool for this somewhere in the TensorFlow ecosystem. As TensorBoard is directly part of the TensorFlow ecosystem, its integration with Keras or TensorFlow is much smoother than that of the previous two tools.

Problem-Specific Libraries

tsaug

One of the few libraries for augmentation of time series, tsaug is an open-source library created and maintained by a single person under the GitHub name nick tailaiw. It was released in 2019 and is currently at version 0.2.1. It provides a set of nine augmentations, Crop, AddNoise, and TimeWarp among them. The library is reasonably well documented for such a project and easy to use from a user perspective. Unfortunately, for reasons unknown (at least to me), the library seems dead and has not been updated for three years. There are many open issues, but they are not getting any attention. Such a situation is quite sad in my opinion, as there are not many other libraries that provide augmentation for time series data. However, if you are doing time series data mining or augmentation and want to use a more up-to-date library, tsfresh may be a good choice.

OpenCV

It is a library focused on providing functions for image processing and computer vision. Originally developed by Intel, it is now open source under the Apache 2 license.
OpenCV provides a set of functions related to image and video processing, image classification, data analysis, and tracking, alongside ready-made machine learning models for working with images and video. If you want to read more about OpenCV, my coworker, Kamil Rzechowski, wrote an article that quite extensively describes the topic.

GeoPandas

It is a library built atop pandas that aims to provide functions for working with spatial data and its data structures. It allows easy reading and writing of data in GeoJSON and shapefile formats, or reading data from PostGIS systems. Besides pandas, it has a lot of other dependencies on spatial data libraries like PyGEOS, GeoPy, or Shapely. The library's basic structures are:

GeoSeries - A column of geospatial data, such as a series of points, lines, or polygons
GeoDataFrame - A tabular structure holding a set of GeoSeries

Utils

Matplotlib

As the name suggests, it is a library for creating various types of plots ;p. Besides basic plots like lines or histograms, matplotlib allows us to create more complex plots: 3D shapes or polar plots. Of course, it also allows us to customize things like plot colors or labels. Despite being somewhat old (released in 2003, so 20 years old at the time of writing), it is actively maintained and developed. With around 17k stars on GitHub, it has quite a community around it and is probably the second pick for anyone needing a data visualization tool. It is well documented and reasonably easy to grasp for a newcomer. It is also worth noting that matplotlib is used as a base for more high-level visualization libraries.

Seaborn

Speaking of which, we have Seaborn as an example of such a library. Thus the set of functionalities provided by Seaborn is similar to the one provided by Matplotlib. However, the API is more high level and requires less boilerplate code to achieve similar results. As for other minor differences, the color palette provided by Seaborn is softer, and the design of plots is more modern and nicer looking. Additionally, Seaborn is easier to integrate with pandas, which may be a significant advantage. Below you can find the code used to create a heatmap in Matplotlib and Seaborn. Imports are common.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.rand(5, 5)

# Matplotlib version
fig, ax = plt.subplots()
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)
ax.set_xticks(np.arange(data.shape[0]) + 0.5, minor=False)
ax.set_yticks(np.arange(data.shape[1]) + 0.5, minor=False)
ax.set_xticklabels(np.arange(1, data.shape[0] + 1), minor=False)
ax.set_yticklabels(np.arange(1, data.shape[1] + 1), minor=False)
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
cbar = plt.colorbar(heatmap)
plt.show()

# Seaborn version
sns.heatmap(data, cmap="Blues", annot=True)
# Set plot title and axis labels
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
# Show plot
plt.show()

Hydra

In every project, sooner or later, there is a need to make something a configurable value. Of course, if you are using a tool like Jupyter, then the matter is pretty straightforward: you can just move a desired value to the .env file - et voila, it is configurable. However, if you are building a more standard application, things are not that simple. Here Hydra shows its ugly (but quite useful) head. It is an open-source tool for managing and running configurations of Python-based applications; a minimal usage sketch follows below.
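Here is a small sketch of my own (not from the article), assuming Hydra 1.2+ and a hypothetical conf/config.yaml file with a db section (for example, db.host and db.port keys).

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Values come from the composed config and can be overridden on the
    # command line, e.g.: python app.py db.port=6543
    print(f"Connecting to {cfg.db.host}:{cfg.db.port}")

if __name__ == "__main__":
    main()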
Hydra is based on the OmegaConf library, and quoting from its main page: "The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line." What proved quite useful for me is exactly the ability described in that quote: hierarchical configuration. In my case, it worked pretty well and allowed a clearer separation of config files.

coolname

Having a unique identifier for your training runs is always a good idea. If, for various reasons, you do not like UUIDs or just want your IDs to be humanly understandable, coolname is the answer. It generates unique, word-based identifiers of various lengths, from 2 to 4 words. As for the number of combinations, it looks more or less like this:

A 4-word identifier has 10^10 combinations
A 3-word identifier has 10^8 combinations
A 2-word identifier has 10^5 combinations

These numbers are significantly lower than in the case of UUIDs, so the probability of collision is also higher. However, comparing the two is not the point of this text. The vocabulary is hand-picked by the creators, and they describe it as positive and neutral (more about this here), so you will not see an identifier like you-ugly-unwise-human-being. Of course, the library is fully open source.

tqdm

This library provides progress bar functionality for your application. Having such information displayed is maybe not the most important thing you need, but it is still nice to look at and check the progress your application makes during the execution of an important task. tqdm also uses complex algorithms to estimate the remaining time of a particular task, which may be a game changer and help you organize your time around it. Additionally, tqdm states that it has barely noticeable performance overhead - in nanoseconds. What is more, it is totally standalone and needs only Python to run, so it will not download half of the internet to your disk.

Jupyter Notebook (+JupyterLab)

Notebooks are a great way to share results and work on a project. Through the concept of cells, it is easy to separate different fragments and responsibilities of your code. Additionally, the fact that a single notebook file can contain code, images, and complex text outputs (tables) together only adds to its existing advantages. Moreover, notebooks allow running pip install inside cells and using .env files for configuration. Such an approach moves a lot of software engineering complexity out of the way.

Summary

These are all the various libraries for machine learning that I wanted to describe for you. I aimed to provide a general overview of each library alongside its possible use cases, enriched by a quick note on my own experience using them. I hope that this goal was achieved and that this article deepens your knowledge of the machine learning libraries landscape.

By Bartłomiej Żyliński CORE
It’s 2 AM. Do You Know What Your Code Is Doing?
It’s 2 AM. Do You Know What Your Code Is Doing?

Once we press the merge button, that code is no longer our responsibility. If it performs sub-optimally or has a bug, it is now the problem of the DevOps team, the SRE, etc. Unfortunately, those teams work with a different toolset. If my code uses up too much RAM, they will increase RAM. When the code runs slower, they will increase CPU. In case the code crashes, they will increase the number of concurrent instances. If none of that helps, they will call you up at 2 AM.

A lot of these problems are visible before they become a disastrous middle-of-the-night call. Yes, DevOps should control production, but the information they gather from production is useful for all of us. This is at the core of developer observability, which is a subject I'm quite passionate about. I'm so excited about it that I dedicated a chapter to it in my debugging book. Back when I wrote that chapter, I dedicated most of it to active developer observability tools like Lightrun, Rookout, et al. These tools work like production debuggers, and they are fantastic in that regard. When I have a bug and know where to look, I can sometimes reach for one of these tools (I used to work at Lightrun, so I always use it). But there are other ways. Tools like Lightrun are active in their observability; we add a snapshot, similar to a breakpoint, and get the type of data we expect. I recently started playing with Digma, which takes a radically different approach to developer observability. To understand that, we might need to revisit some concepts of observability first.

Observability Isn't Pillars

I've been guilty of listing the pillars of observability just as much as the next guy. They're even in my book (sorry). To be fair, I also discussed what observability really means… Observability means we can ask questions about our system and get answers, or at least have a clearly defined path to get those answers. That sounds simple when running locally, but when you have a sophisticated production environment and someone asks you: is anyone even using that block of code? How do you know? You might have lucked out and had a log in that code, and you might be lucky enough that the log is at the right level and piped properly so you can check. The problem is that if you added too many logs or too much observability data, you might have created a cure worse than the disease: over-logging or over-observing. Both can bring down your performance and significantly impact the bank account, so ideally, we don't want too many logs (I discuss over-logging here), and we don't want too much observability.

Existing developer observability tools work actively. To answer the question of whether someone is using the code, I can place a counter on the line and wait for results. I can give it a week's timeout and find out in a week. Not a terrible situation, but not ideal either. I don't have that much patience.

Tracing and OpenTelemetry

It's a sad state of affairs that most developers don't use tracing in their day-to-day job. For those of you who don't know it, it is like a call stack for the cloud. It lets us see the stack across servers and through processes. No, not method calls. It works more at the entry-point level, but it often contains details like the database queries that were made and similarly deep insights. There's a lot of history with OpenTelemetry, which I don't want to get into. If you're an observability geek, you already know it, and if not, then it's boring. What matters is that OpenTelemetry is taking over the world of tracing.
It's a runtime agent, which means you just add it to the server, and you get tracing information almost seamlessly. It's magic. It also doesn't have a standard server, which makes it very confusing. That means multiple vendors can use a single agent and display the information it collects to various demographics:

A vendor focused on performance can show the timing of various parts of the system.
A vendor focused on troubleshooting can detect potential bugs and issues.
A vendor focused on security can detect potentially risky access.

Background Developer Observability

I'm going to coin a term here since there isn't one: Background Developer Observability. What if the data you need was already there, and a system had already collected it for you in the background? That's what Digma is doing. In Digma's terms, it's called Continuous Feedback. Essentially, they're collecting OpenTelemetry data, analyzing it, and displaying it as information that's useful for developers. If Lightrun is like a debugger, then Digma is like SonarQube based on actual runtime and production information. The cool thing is that you probably already use OpenTelemetry without even knowing it. DevOps probably installed that agent already, and the data is already there! Going back to my question: is anyone using this API? If you use Digma, you can see that right away. OpenTelemetry already collected the information in the background, and the DevOps team already paid the price of collection. We can benefit from that too.

Enough Exposition

I know, I go on… Let's get to the meat and potatoes of why this rocks. Notice that this is a demo; when running locally, the benefits are limited. The true value of these tools is in understanding production; still, they can provide a lot of insight even when running locally and even when running tests. Digma has a simple and well-integrated setup wizard for IntelliJ/IDEA. You need to have Docker Desktop running for setup to succeed. Note that you don't need to run your application using Docker; this is simply for the Digma server process, where the execution details are collected. Once it is installed, we can run our application. In my case, I just ran the JPA unit test from my latest book, and it produced standard traces, which are already pretty cool. We can see them listed below. When we click one of these traces, we get the standard trace view. This is nothing new, but it's really nice to see this information directly in the IDE and readily accessible; I can imagine the immense value this will have for figuring out CI execution issues. But the real value, and where Digma becomes a "Developer Observability" tool instead of an Observability tool, is the tool window: there is a strong connection to the code directly from the observability data, plus deeper analysis, which doesn't show in my particular, overly simplistic hello world. This tool window highlights problematic traces and errors and helps in understanding real-world issues.

How Does This Help at 2 AM?

Disasters happen because we aren't looking. I'd like to say I open my observability dashboard regularly, but I don't. Then, when there's a failure, it takes me a while to get my bearings within it. The locality of the applicable data is important. It helps us notice issues when they happen, detect regressions before they turn into failures, and understand the impact of the code we just merged. Prevention starts with awareness, and as developers, we handed our situational awareness to the DevOps team.
When the failure actually happens, the locality and accessibility of the data make a big difference. Since we use tools that integrate into the IDE daily, this reduces the mean time to fix. No, a background developer observability tool might not include the information we need to fix a problem. But if it does, then the information is already there, and we need nothing else. That is fantastic.

Final Word

With all the discussion about observability and OpenTelemetry, you would think everyone is using them. Unfortunately, the reality is far from that. Yes, there's some saturation and familiarity in the DevOps crowd, but this is not the case for developers. This is a form of environmental blindness. How can our teams, who are so driven by data and facts, proceed with secondhand and often outdated data from Ops? Should I spend time further optimizing this method, or will I waste the effort since few people use it? We can benchmark things locally just fine, but understanding real-world usage and impact is something we all need to improve.

By Shai Almog CORE
Java Parallel GC Tuning
Java Parallel GC Tuning

Parallel garbage collector (Parallel GC) is one of the oldest garbage collection algorithms introduced in the JVM to leverage the processing power of modern multi-core systems. Parallel GC aims to reduce the impact of GC pauses by utilizing multiple threads to perform garbage collection in parallel. In this article, we will delve into the realm of Parallel GC tuning specifically. However, if you want to learn more about the basics of garbage collection tuning, you may watch this JAX London conference talk.

When To Use Parallel GC

You can consider using Parallel GC for your application if you have any one of the following requirements:

Throughput emphasis: If your application has high transactional throughput requirements and can tolerate occasional long pauses for garbage collection, Parallel GC can be a suitable choice. It focuses on maximizing overall throughput by using multiple worker threads to complete each (stop-the-world) collection as quickly as possible.
Batch processing: Applications that involve batch processing or data analysis tasks can benefit from Parallel GC. These types of applications often perform extensive computations, and Parallel GC helps minimize the impact of garbage collection on overall processing time.
Heap size considerations: Parallel GC is well suited for applications with moderate to large heap sizes. If your application requires a substantial heap to accommodate its memory needs, Parallel GC can efficiently manage memory and reduce the impact of garbage collection pauses.

How To Enable Parallel GC

To explicitly configure your application to use Parallel GC, you can pass the following argument when launching your Java application:

-XX:+UseParallelGC

This JVM argument instructs the JVM to use the Parallel GC algorithm for garbage collection. However, please note that if you don't explicitly specify a garbage collection algorithm, in all server-class JVMs up to and including Java 8, the default garbage collector is Parallel GC.

Most Used Parallel GC JVM Arguments

In the realm of Java Parallel GC tuning, there are a few key JVM arguments that provide control over crucial aspects of the garbage collection process. We have grouped those JVM arguments into three buckets:

a. Heap and generation size parameters
b. Goal-based tuning parameters
c. Miscellaneous parameters

Let's get on to the details:

A. Heap and Generation Size Parameters

Garbage collection (GC) tuning for the Parallel Collector involves achieving a delicate balance between the size of the entire heap and the sizes of the Young and Old Generations. While a larger heap generally improves throughput, it also leads to longer pauses during GC. Consequently, finding the optimal size for the heap and generations becomes crucial. In this section, we will explore key JVM arguments that enable the adjustment of heap size and generation sizes to achieve an efficient GC configuration.

-Xmx: This argument sets the maximum heap size, which establishes the upper limit for memory allocation. By carefully selecting an appropriate value for -Xmx, developers can control the overall heap size to strike a balance between memory availability and GC performance.
-XX:NewSize and -XX:MaxNewSize or -XX:NewRatio: These arguments govern the size of the Young Generation, where new objects are allocated. -XX:NewSize sets the initial size, while -XX:MaxNewSize or -XX:NewRatio control the upper limit or the ratio between the young and tenured generations, respectively. Adjusting these values allows for fine-tuning the size and proportion of the Young Generation.
Here is a success story of a massive technology company that reduced its young generation size and saw significant improvement in its overall application response time. -XX:YoungGenerationSizeIncrement and -XX:TenuredGenerationSizeIncrement: These arguments define the size increments for the Young and Tenured Generations, respectively. The size increments of the young and tenured generations are crucial factors in memory allocation and garbage collection behavior. Growing and shrinking are done at different rates. By default, a generation grows in increments of 20% and shrinks in increments of 5%. The percentage for growth is controlled by the command-line option -XX:YoungGenerationSizeIncrement=<Y> for the young generation and -XX:TenuredGenerationSizeIncrement=<T> for the tenured generation. -XX:AdaptiveSizeDecrementScaleFactor: This argument determines the scale factor used when decrementing generation sizes during shrinking. The percentage by which a generation shrinks is adjusted by the command-line flag -XX:AdaptiveSizeDecrementScaleFactor=<D>. If the growth increment is X percent, then the decrement for shrinking is X/D percent. B. Goal-Based Tuning Parameters To achieve optimal performance in garbage collection, it is crucial to control GC pause times and optimize the GC throughput, which represents the amount of time dedicated to garbage collection compared to application execution. In this section, we will explore key JVM arguments that facilitate goal-based tuning, enabling developers to fine-tune these aspects of garbage collection. -XX:MaxGCPauseMillis: This argument enables developers to specify the desired maximum pause time for garbage collection in milliseconds. By setting an appropriate value, developers can regulate the duration of GC pauses, ensuring they stay within acceptable limits. -XX:GCTimeRatio: This argument sets the ratio of garbage collection time to application time using the formula 1 / (1 + N), where N is a positive integer value. The purpose of this parameter is to define the desired allocation of time for garbage collection compared to application execution time, for optimizing the GC throughput. For example, let’s consider the scenario where -XX:GCTimeRatio=19. Using the formula, the goal is to allocate 1/20th or 5% of the total time to garbage collection. This means that for every 20 units of time (e.g., milliseconds) of combined garbage collection and application execution, approximately 1 unit of time will be allocated to garbage collection, while the remaining 19 units will be dedicated to application execution. The default value is 99, which sets a goal of 1% of the time for garbage collection. -XX:GCTimePercentage: This argument allows developers to directly specify the desired percentage of time allocated to garbage collection in relation to application execution time (i.e., GC Throughput). For instance, setting ‘-XX:GCTimePercentage=5’ represents a goal of allocating 5% of the total time to garbage collection, with the remaining 95% dedicated to application execution. Note: Developers can choose to use either ‘-XX:GCTimeRatio‘ or ‘-XX:GCTimePercentage‘ as alternatives to each other. Both options provide flexibility in expressing the desired allocation of time for garbage collection. I would prefer using ‘-XX:GCTimePercentage’ over ‘-XX:GCTimeRatio’ because of its ease of understanding. C. 
Miscellaneous Parameters In addition to the previously discussed JVM arguments, there are a few other parameters that can be useful for tuning the Parallel GC algorithm. Let’s explore them. -XX:ParallelGCThreads: This argument allows developers to specify the number of threads used for garbage collection in the Parallel GC algorithm. By setting an appropriate value based on the available CPU cores, developers can optimize throughput by leveraging the processing power of multi-core systems. It’s important to strike a balance by avoiding too few or too many threads, as both scenarios can lead to suboptimal performance. -XX:-UseAdaptiveSizePolicy: By default, the ‘UseAdaptiveSizePolicy’ option is enabled, which allows for dynamic resizing of the young and old generations based on the application’s behavior and memory demands. However, this dynamic resizing can lead to frequent “Full GC – Ergonomics” garbage collections and increased GC pause times. To mitigate this, we can pass the -XX:-UseAdaptiveSizePolicy argument to disable the resizing and reduce GC pause times. Here is a real-world example and discussion around this JVM argument. Tuning Parallel GC Behavior Studying the performance characteristics of Parallel GC is best achieved by analyzing the GC log. The GC log contains detailed information about garbage collection events, memory usage, and other relevant metrics. There are several tools available that can assist in analyzing the GC log, such as GCeasy, IBM GC and Memory Visualizer, HP Jmeter, and Google Garbage Cat. By using these tools, you can visualize memory allocation patterns, identify potential bottlenecks, and assess the efficiency of garbage collection. This allows for informed decision-making when fine-tuning Parallel GC for optimal performance. Conclusion In conclusion, optimizing the Parallel GC algorithm through fine-tuning JVM arguments and studying its behavior enables developers to achieve efficient garbage collection and improved performance in Java applications. By adjusting parameters such as heap size, generation sizes, and goal-based tuning parameters, developers can optimize the garbage collection process. Continuous monitoring and adjustment based on specific requirements are essential for maintaining optimal performance. With optimized Parallel GC tuning, developers can maximize memory management, minimize GC pauses, and unlock the full potential of their Java applications.

By Ram Lakshmanan CORE
Log Analysis: Elasticsearch vs. Apache Doris
Log Analysis: Elasticsearch vs. Apache Doris

As a major part of a company's data asset, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine, where you can extract information that points to business growth. Logs are the sequential records of events in the computer system. If you think about how logs are generated and used, you will know what an ideal log analysis system should look like: It should have schema-free support. Raw logs are unstructured free texts and basically impossible for aggregation and calculation, so you needed to turn them into structured tables (the process is called "ETL") before putting them into a database or data warehouse for analysis. If there was a schema change, lots of complicated adjustments needed to be put into ETL and the structured tables. Therefore, semi-structured logs, mostly in JSON format, emerged. You can add or delete fields in these logs, and the log storage system will adjust the schema accordingly. It should be low-cost. Logs are huge, and they are generated continuously. A fairly big company produces 10~100 TBs of log data. For business or compliance reasons, it should keep the logs around for half a year or longer. That means storing a log size measured in PB, so the cost is considerable. It should be capable of real-time processing. Logs should be written in real-time, otherwise, engineers won't be able to catch the latest events in troubleshooting and security tracking. Plus, a good log system should provide full-text searching capabilities and respond to interactive queries quickly. The Elasticsearch-Based Log Analysis Solution A popular log processing solution within the data industry is the ELK stack: Elasticsearch, Logstash, and Kibana. The pipeline can be split into five modules: Log collection: Filebeat collects local log files and writes them to a Kafka message queue. Log transmission: Kafka message queue gathers and caches logs. Log transfer: Logstash filters and transfers log data in Kafka. Log storage: Logstash writes logs in JSON format into Elasticsearch for storage. Log query: Users search for logs via Kibana visualization or send a query request via Elasticsearch DSL API. The ELK stack has outstanding real-time processing capabilities, but frictions exist. Inadequate Schema-Free Support The Index Mapping in Elasticsearch defines the table scheme, which includes the field names, data types, and whether to enable index creation. Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds fields to the Mapping according to the input JSON data. This provides some sort of schema-free support, but it's not enough because: Dynamic Mapping often creates too many fields when processing dirty data, which interrupts the whole system. The data type of fields is immutable. To ensure compatibility, users often configure "text" as the data type, but that results in much slower query performance than binary data types such as integer. The index of fields is immutable, too. Users cannot add or delete indexes for a certain field, so they often create indexes for all fields to facilitate data filtering in queries. But too many indexes require extra storage space and slow down data ingestion. 
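To illustrate the mapping behavior described above, here is a small sketch of my own (not from the original text) using the official Elasticsearch Python client (elasticsearch 8.x); the cluster URL, index name, and fields are all hypothetical.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# An explicit Index Mapping: field names, data types, and indexing are fixed here.
es.indices.create(
    index="app-logs",
    mappings={
        "dynamic": True,  # Dynamic Mapping: unknown JSON fields are added automatically
        "properties": {
            "message": {"type": "text"},     # full-text searchable, but slower to filter than binary types
            "status": {"type": "integer"},   # once set, the field's type cannot be changed in place
        },
    },
)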
Inadequate Analytic Capability

Elasticsearch has its own unique Domain Specific Language (DSL), which is very different from the tech stack that most data engineers and analysts are familiar with, so there is a steep learning curve. Moreover, Elasticsearch has a relatively closed ecosystem, so there might be strong resistance to integration with BI tools. Most importantly, Elasticsearch only supports single-table analysis and lags behind modern OLAP demands for multi-table joins, sub-queries, and views.

High Cost and Low Stability

Elasticsearch users have been complaining about the computation and storage costs. The root reason lies in the way Elasticsearch works.

Computation cost: During data writing, Elasticsearch also executes compute-intensive operations, including inverted index creation, tokenization, and inverted index ranking. Under these circumstances, data is written into Elasticsearch at a speed of around 2MB/s per core. When CPU resources are tight, data writing requests often get rejected during peak times, which further leads to higher latency.
Storage cost: To speed up retrieval, Elasticsearch stores the forward indexes, inverted indexes, and docvalues of the original data, consuming a lot more storage space. The compression ratio of a single data copy is only 1.5:1, compared to 5:1 in most log solutions.

As data and cluster size grow, maintaining stability can be another issue:

During data writing peaks: Clusters are prone to overload during data writing peaks.
During queries: Since all queries are processed in memory, big queries can easily lead to JVM OOM.
Slow recovery: After a cluster failure, Elasticsearch must reload indexes, which is resource-intensive, so it will take many minutes to recover, and that challenges the service availability guarantee.

A More Cost-Effective Option

Reflecting on the strengths and limitations of the Elasticsearch-based solution, the Apache Doris developers have optimized Apache Doris for log processing.

Increase writing throughput: The performance of Elasticsearch is bottlenecked by data parsing and inverted index creation, so we improved Apache Doris in these areas: we quickened data parsing and index creation by using SIMD and CPU vector instructions; then we removed data structures that are unnecessary for log analysis scenarios, such as forward indexes, to simplify index creation.
Reduce storage costs: We removed forward indexes, which represented 30% of index data. We adopted columnar storage and the ZSTD compression algorithm and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part of the historical logs are rarely accessed, we introduced tiered storage to separate hot and cold data. Logs that are older than a specified time period are moved to object storage, which is much less expensive. This can reduce storage costs by around 70%.

Benchmark tests with ES Rally, the official testing tool for Elasticsearch, showed that Apache Doris was around five times as fast as Elasticsearch in data writing and 2.3 times as fast in queries, and it consumed only 1/5 of the storage space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a writing speed of 550 MB/s and a compression ratio of 10:1.

The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:

Ingestion: Apache Doris supports various ingestion methods for log data.
You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage into Doris via Routine Load and S3 Load.
Analysis: You can put log data in Doris and conduct join queries across logs and other data in the data warehouse.
Application: Apache Doris is compatible with the MySQL protocol, so you can integrate a wide variety of data analytics tools and clients with Doris, such as Grafana and Tableau. You can also connect applications to Doris via the JDBC and ODBC APIs. We are planning to build a Kibana-like system to visualize logs.

Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.

Native Support for Semi-Structured Data

Firstly, we worked on the data types. We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by four times. We also added a new data type for complicated data: Array Map. It can structuralize concatenated strings to allow for higher compression rates and faster queries.

Secondly, Apache Doris supports schema evolution. This means you can adjust the schema as your business changes: you can add or delete fields and indexes and change the data types of fields. Apache Doris provides Light Schema Change capabilities so that you can add or delete fields within milliseconds:

-- Add a column. Result will be returned in milliseconds.
ALTER TABLE lineitem ADD COLUMN l_new_column INT;

You can also add an index only for your target fields so that you can avoid the overhead of unnecessary index creation. After you add an index, by default, the system will generate the index for all incremental data, and you can specify which historical data partitions need the index.

-- Add inverted index. Doris will generate inverted index for all new data afterward.
ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;

-- Build index for the specified historical data partitions.
BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);

SQL-Based Analytic Engine

The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table joins, sub-queries, UDFs, logic views, and materialized views to serve their own needs. With MySQL compatibility, Apache Doris can be integrated with most GUI and BI tools in the big data ecosystem so that users can realize more complex and diversified data analysis.

Performance in Use Case

A gaming company transitioned from the ELK stack to the Apache Doris solution; their Doris-based log system used 1/6 of the storage space that they previously needed. A cybersecurity company that built its log analysis system utilizing the inverted index in Apache Doris supported a data writing speed of 300,000 rows per second with 1/5 of the server resources it formerly used.

Hands-On Guide

Now let's go through the three steps of building a log analysis system with Apache Doris. Before you start, download Apache Doris 2.0 or a newer version from the website and deploy your clusters.

Step 1: Create Tables

This is an example of table creation.
Explanations for the configurations:

The DATETIMEV2 time field is specified as the Key in order to speed up queries for the latest N log records.
Indexes are created for the frequently accessed fields, and fields that require full-text search are specified with Parser parameters.
PARTITION BY RANGE means to partition the data by RANGE based on time fields; Dynamic Partition is enabled for auto-management.
DISTRIBUTED BY RANDOM BUCKETS AUTO means to distribute the data into buckets randomly, and the system will automatically decide the number of buckets based on the cluster size and data volume.
log_policy_1day and log_s3 mean to move logs older than one day to S3 storage.

CREATE DATABASE log_db;
USE log_db;

CREATE RESOURCE "log_s3"
PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "your_endpoint_url",
    "s3.region" = "your_region",
    "s3.bucket" = "your_bucket",
    "s3.root.path" = "your_path",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

CREATE STORAGE POLICY log_policy_1day
PROPERTIES(
    "storage_resource" = "log_s3",
    "cooldown_ttl" = "86400"
);

CREATE TABLE log_table
(
    `ts` DATETIMEV2,
    `clientip` VARCHAR(20),
    `request` TEXT,
    `status` INT,
    `size` INT,
    INDEX idx_size (`size`) USING INVERTED,
    INDEX idx_status (`status`) USING INVERTED,
    INDEX idx_clientip (`clientip`) USING INVERTED,
    INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
PARTITION BY RANGE(`ts`) ()
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
    "replication_num" = "1",
    "storage_policy" = "log_policy_1day",
    "deprecated_dynamic_schema" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "7",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "AUTO",
    "dynamic_partition.replication_num" = "1"
);

Step 2: Ingest the Logs

Apache Doris supports various ingestion methods. For real-time logs, we recommend the following three methods:

Pull logs from a Kafka message queue: Routine Load
Logstash: write logs into Doris via HTTP API
Self-defined writing program: write logs into Doris via HTTP API

Ingest from Kafka

For JSON logs that are written into Kafka message queues, create a Routine Load so Doris will pull data from Kafka. The following is an example. The property.* configurations are optional:

-- Prepare the Kafka cluster and topic ("log_topic")
-- Create Routine Load, load data from Kafka log_topic to "log_table"
CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
COLUMNS(ts, clientip, request, status, size)
PROPERTIES (
    "max_batch_interval" = "10",
    "max_batch_rows" = "1000000",
    "max_batch_size" = "109715200",
    "strict_mode" = "false",
    "format" = "json"
)
FROM KAFKA (
    "kafka_broker_list" = "host:port",
    "kafka_topic" = "log_topic",
    "property.group.id" = "your_group_id",
    "property.security.protocol"="SASL_PLAINTEXT",
    "property.sasl.mechanism"="GSSAPI",
    "property.sasl.kerberos.service.name"="kafka",
    "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",
    "property.sasl.kerberos.principal"="xxx@yyy.com"
);

You can check how the Routine Load runs via the SHOW ROUTINE LOAD command.

Ingest via Logstash

Configure HTTP Output for Logstash, and then data will be sent to Doris via HTTP Stream Load.

1. Specify the batch size and batch delay in logstash.yml to improve data writing performance.

pipeline.batch.size: 100000
pipeline.batch.delay: 10000

2. Add HTTP Output to the log collection configuration file testlog.conf, with url set to the Stream Load address in Doris.
Since Logstash does not support HTTP redirection, you should use a backend address instead of a FE address. Authorization in the headers is HTTP basic auth, computed with echo -n 'username:password' | base64. The load_to_single_tablet header can reduce the number of small files in data ingestion.

output {
    http {
        follow_redirects => true
        keepalive => false
        http_method => "put"
        url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
        headers => [
            "format", "json",
            "strip_outer_array", "true",
            "load_to_single_tablet", "true",
            "Authorization", "Basic cm9vdDo=",
            "Expect", "100-continue"
        ]
        format => "json_batch"
    }
}

Ingest via a Self-Defined Program

This is an example of ingesting data into Doris via HTTP Stream Load. Notes:

Use basic auth for HTTP authorization, computed with echo -n 'username:password' | base64
HTTP header "format:json": the data type is specified as JSON
HTTP header "read_json_by_line:true": each line is a JSON record
HTTP header "load_to_single_tablet:true": write to one tablet each time

For the data writing clients, we recommend a batch size of 100MB~1GB. Future versions will enable Group Commit at the server end and reduce the batch size from clients.

curl \
--location-trusted \
-u username:password \
-H "format:json" \
-H "read_json_by_line:true" \
-H "load_to_single_tablet:true" \
-T logfile.json \
http://fe_host:fe_http_port/api/log_db/log_table/_stream_load

Step 3: Execute Queries

Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.

mysql -h fe_host -P fe_mysql_port -u root -Dlog_db

A few common queries in log analysis:

Check the latest ten records.

SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;

Check the latest ten records of client IP 8.8.8.8.

SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;

Retrieve the latest ten records with error or 404 in the "request" field. MATCH_ANY is a SQL syntax keyword for full-text search in Doris. It means to find the records that include any one of the specified keywords.

SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;

Retrieve the latest ten records with image and faq in the "request" field. MATCH_ALL is also a SQL syntax keyword for full-text search in Doris. It means to find the records that include all of the specified keywords.

SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;

Conclusion

If you are looking for an efficient log analytics solution, Apache Doris is friendly to anyone equipped with SQL knowledge; if you find friction with the ELK stack, give Apache Doris a try: it provides better schema-free support, enables faster data writing and queries, and brings a much smaller storage burden. But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to add more complicated data types to the inverted index and support the BKD index to make Apache Doris a fit for geo-data analysis. We also plan to expand capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any user feedback and development advice.

By Joseph Everrett
Snowflake Workload Optimization
Snowflake Workload Optimization

In the era of big data, efficient data management and query performance are critical for organizations that want to get the best operational performance from their data investments. Snowflake, a cloud-based data platform, has gained immense popularity for providing enterprises with an efficient way of handling big data tables and reducing complexity in data environments. Big data tables are characterized by their immense size, constantly increasing data sets, and the challenges that come with managing and analyzing vast volumes of information. With data pouring in at high volume from various sources in diverse formats, ensuring data reliability and quality is increasingly challenging but also critical. Extracting valuable insights from this diverse and dynamic data necessitates scalable infrastructure, powerful analytics tools, and a vigilant focus on security and privacy. Despite the complexities, big data tables offer immense potential for informed decision-making and innovation, making it essential for organizations to understand and address the unique characteristics of these data repositories to harness their full capabilities effectively. To achieve optimal performance, Snowflake leverages several essential concepts that are instrumental in handling and processing big data efficiently. One is data pruning, which plays a vital role by eliminating irrelevant data during query execution, leading to faster response times by reducing the amount of data that is scanned. Simultaneously, Snowflake's micro-partitions, small immutable segments typically 16 MB in size, allow for seamless scalability and efficient distribution across nodes. Micro-partitioning is an important differentiator for Snowflake. This innovative technique combines the advantages of static partitioning while avoiding its limitations, resulting in additional significant benefits. The beauty of Snowflake's architecture lies in its scalable, multi-cluster virtual warehouse technology, which automates the maintenance of micro-partitions. This process ensures efficient and automatic execution of re-clustering in the background, eliminating the need for manual creation, sizing, or resizing of virtual warehouses. The compute service actively monitors the clustering quality of all registered clustered tables and systematically performs clustering on the least clustered micro-partitions until reaching an optimal clustering depth. This seamless process optimizes data storage and retrieval, enhancing overall performance and user experience. How Micro-Partitioning Improves Data Storage and Processing This design enhances data storage and processing efficiency, further improving query performance. Additionally, Snowflake's clustering feature enables users to define clustering keys, arranging data within micro-partitions based on similarities. By colocating data with similar values for clustering keys, Snowflake minimizes data scans during queries, resulting in optimized performance. Together, these key concepts empower Snowflake to deliver unparalleled efficiency and performance in managing big data workloads. Inadequate table layouts can result in long-running queries, increased costs due to higher data scans, and diminished overall performance. It is crucial to tackle this challenge to fully harness the capabilities of Snowflake and maximize its potential. 
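To make the clustering discussion above concrete, here is a minimal, illustrative sketch of my own (not from the article) that defines a clustering key and inspects clustering quality using the Snowflake Python connector; the connection parameters, table name, and column names are all hypothetical placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",      # placeholder
    user="your_user",            # placeholder
    password="your_password",    # placeholder
    warehouse="your_warehouse",  # placeholder
    database="your_db",
    schema="your_schema",
)
cur = conn.cursor()

# Align the clustering key with the most commonly filtered columns (assumed here
# to be event_date and customer_id) so related rows land in the same micro-partitions.
cur.execute("ALTER TABLE web_events CLUSTER BY (event_date, customer_id)")

# Inspect clustering quality (average depth, overlaps) for the chosen columns.
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('web_events', '(event_date, customer_id)')")
print(cur.fetchone()[0])

cur.close()
conn.close()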
One major challenge in big data table management is the data ingestion team's lack of awareness regarding consumption workloads, leading to various issues that negatively impact system performance and cost-effectiveness. Long-running queries are a significant consequence, causing delays in delivering critical insights, especially in time-sensitive applications where real-time data analysis is vital for decision-making. Moreover, this lack of awareness can lead to increased operational costs, as inefficient table layouts consume more computational resources and storage, straining the organization's budget over time.

[Image: List of frequently accessed tables]

Optimize Snowflake Performance

The first step in optimizing Snowflake's performance is to analyze consumption workloads thoroughly. Acceldata’s Data Observability Cloud (ADOC) platform analyzes such historical workloads and provides table-level insights at the size, access, partitioning, and clustering level.

[Image: Stats for top frequently accessed tables]

Understanding the queries executed most frequently and the filtering patterns applied can provide valuable insights. Focus on tables that are large and frequently accessed, as they have the most significant impact on overall performance.

[Image: Most filtered columns for a table]

ADOC’s advanced query parsing technology can detect the columns that are accessed via WHERE or JOIN clauses. Use visualizations and analytics tools to identify which columns are accessed and filtered most frequently.

[Image: Micro-partitioning and clustering view for a column and table]

ADOC also fetches CLUSTERING_INFORMATION via the Snowflake table system functions and shows the table clustering metadata in a simple, easily interpretable visualization. This information can guide the decision-making process for optimizing the table layout.

[Image: Snowflake visual table clustering explorer]

Understand the extent of overlap and depth for filtered columns; this information is crucial for making informed decisions when defining clustering keys. The ultimate goal is to match clustering keys with the most commonly filtered columns. This alignment ensures that relevant data is clustered together, reducing data scans and improving query performance.

Snowflake's prowess in managing big data tables is unparalleled, but to fully reap its benefits, optimizing performance through data pruning and clustering is essential. Collaboration between the data ingestion team and the teams consuming the data is vital to ensure the best possible layout for tables. By understanding consumption workloads and matching clustering keys with filtered columns, organizations can achieve efficient queries, reduce costs, and make the most of Snowflake's capabilities in handling big data efficiently.
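For readers who want to inspect overlap and depth for candidate columns directly in Snowflake, the clustering system functions mentioned above can also be queried ad hoc. A minimal sketch, again with a hypothetical table and column:

    -- Average clustering depth for a frequently filtered column; lower values
    -- generally mean fewer micro-partitions scanned when filtering on it.
    SELECT SYSTEM$CLUSTERING_DEPTH('web_events', '(event_date)');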

By Ashwin Rajeeva

Top Performance Experts


Joana Carvalho

Site Reliability Engineering,
Virtuoso

Joana has been a performance engineer for the last ten years. Her work has spanned root cause analysis from user interaction down to bare metal, performance tuning, and new technology evaluation. Her goal is to create solutions that empower development teams to own performance investigation, visualization, and reporting, so that they can manage the quality of their services in a self-sufficient manner. She works as a Site Reliability Engineer at Virtuoso, a codeless end-to-end testing platform powered by AI/ML.

Greg Leffler

Observability Practitioner, Director,
Splunk

Greg Leffler heads the Observability Practitioner team at Splunk and is on a mission to spread the good word of Observability to the world. Greg's career has taken him from the NOC to SRE, from SRE to management, with side stops in security and editorial functions. In addition to Observability, Greg's professional interests include hiring, training, SRE culture and operating effective remote teams. Greg holds a Master's Degree in Industrial/Organizational Psychology from Old Dominion University.

Ted Young

Director of Open Source Development,
LightStep


Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director of Technical Marketing & Evangelism. He's renowned in the development community as a speaker, lecturer, author, and baseball expert. His current role allows him to help the world understand the challenges of cloud native observability. He brings a unique perspective to the stage, with a professional life dedicated to sharing his deep expertise in open source technologies and organizations, and he is a CNCF Ambassador. Follow him at https://www.schabell.org.

The Latest Performance Topics

Time-Series Forecasting With Recurrent Neural Networks
This article is a comprehensive guide to time-series forecasting with Recurrent Neural Networks (RNNs), covering the different aspects involved.
October 4, 2023
by Deepanshu Lulla
· 813 Views · 1 Like
Not a Single Trace
Observability is an orchestra, not a single instrument. By combining multiple data points, we form an accurate production narrative to resolve issues.
October 4, 2023
by Shai Almog CORE
· 952 Views · 2 Likes
Observability Pillars: Exploring Logs, Metrics and Traces
Explore the vital elements of observability in this insightful blog. Discover how logs, metrics, and traces form the three pillars of effective observability.
October 4, 2023
by Chitra Bisht
· 859 Views · 1 Like
Data Observability: Better Insights Through Reliable Data Practices
With data observability, data integrity is safeguarded, and its full potential is unlocked — leading to innovation, efficiency, and success.
October 4, 2023
by Joana Carvalho CORE
· 1,685 Views · 1 Like
PromCon EU 2023: Observability Recap in Berlin
Explore a recap of PromCon EU 2023, the community-organized event focused on the technology and implementations around the open-source Prometheus project.
October 3, 2023
by Eric D. Schabell CORE
· 1,148 Views · 1 Like
Real-Time Analytics
All data, any data, any scale, at any time: Learn why data pipelines need to embrace real-time data streams to harness the value of data as it is created.
October 3, 2023
by Tim Spann CORE
· 1,695 Views · 3 Likes
Unveiling Real-Time Operating Systems (RTOS): The Heartbeat of Modern Embedded Systems
Learn more about real-time operating systems (RTOS), the heartbeat of modern embedded systems, and much more.
October 2, 2023
by Manvendra Sharma
· 1,823 Views · 1 Like
How to Debug an Unresponsive Elasticsearch Cluster
Master debugging an unresponsive Elasticsearch cluster with our simple tutorial. Try this efficient solution for buggy or unstable Elasticsearch setups.
October 1, 2023
by Derric Gilling CORE
· 3,203 Views · 1 Like
Choreography Pattern: Optimizing Communication in Distributed Systems
Explore the Choreography Pattern and discover how it can optimize communication in distributed systems. Learn key concepts, best practices, and real-world scenarios.
September 30, 2023
by Gaurav Gaur CORE
· 3,193 Views · 2 Likes
Performance Optimization Strategies in Highly Scalable Systems
Optimizing digital applications involves Prefetching, Memoization, Concurrent Fetching, and Lazy Loading. These techniques enhance efficiency and user experience.
September 28, 2023
by Hemanth Murali
· 3,353 Views · 2 Likes
A New Era Has Come, and So Must Your Database Observability
We can't prevent bad code from reaching deployment, we lack database observability, and we can't troubleshoot automatically. We need a new approach: database guardrails.
September 28, 2023
by Adam Furmanek
· 2,709 Views · 1 Like
What Is Web App Penetration Testing?
Strengthen your web app's defenses with expert Web App Penetration Testing services. Identify vulnerabilities, protect data, and stay ahead of cyber threats.
September 28, 2023
by Jatin Patel
· 2,310 Views · 1 Like
Reflections From a DBA
This article takes a deep dive into clouds and discusses which are optimally suited to different types of organizational or individual needs.
September 26, 2023
by Aayushi Mangal
· 3,111 Views · 2 Likes
Logging to Infinity and Beyond: How To Find the Hidden Value of Your Logs
Though it can feel like trouble, sorting through logs is crucial. Logs hold info on performance problems, security issues, and user behavior.
September 26, 2023
by Michael Bogan CORE
· 2,328 Views · 3 Likes
Modbus Protocol: The Grandfather of IoT Communication
The Modbus protocol is simple and robust, making it a popular choice for industrial control systems.
September 26, 2023
by Weihong Zhang
· 2,342 Views · 1 Like
Resolving Log Corruption Detected During Database Backup in SQL Server
Solve log corruption in SQL Server caused by viruses, malware, or hardware failures. The article shows different solutions.
September 25, 2023
by Daniel Calbimonte
· 1,973 Views · 1 Like
Time Series Analysis: VAR-Model-As-A-Service Using Flask and MinIO
VAR-As-A-Service is an MLOps approach for a statistical model service implementation backed by an AWS S3-compatible object-based model storage like MinIO.
September 25, 2023
by Daniela Kolarova CORE
· 2,444 Views · 2 Likes
Navigating the Skies
Take an in-depth look at the primary cloud vendors and analyze the crucial factors that set them apart: scalability, security, and cloud services for cloud databases.
September 25, 2023
by Kellyn Gorman CORE
· 3,119 Views · 2 Likes
Implementing Stronger RBAC and Multitenancy in Kubernetes Using Istio
Learn how to use Istio service mesh on top of K8s auth to implement stronger RBAC and multitenancy for Kubernetes workloads.
September 25, 2023
by Debasree Panda
· 5,232 Views · 5 Likes
How TIBCO Is Evolving Integration for the Multi-Cloud Era
Single pane of glass end-to-end observability, a developer portal, model-driven data management, and virtualization boost productivity and performance.
September 22, 2023
by Tom Smith CORE
· 3,274 Views · 3 Likes
