Code scanning to detect the exposure of security-sensitive parameters is a crucial practice in MuleSoft API development. Code scanning involves the systematic analysis of MuleSoft source code to identify vulnerabilities. These vulnerabilities range from hardcoded secure parameters, such as a password or accessKey, to a password or accessKey stored in plain text in property files. They can be exploited by malicious actors to compromise the confidentiality, integrity, or availability of the applications.

Lack of Vulnerability Auto-Detection

MuleSoft Anypoint Studio and the Anypoint Platform do not provide a feature to enforce governance over the vulnerabilities mentioned above. This can be managed through design-time governance, where a manual review of the code is needed. However, many tools are available that can scan the deployed code or the code repository to find such vulnerabilities. You can even write custom code or scripts in any language to perform the same task, although writing custom code adds another layer of complexity and maintenance.

Using Generative AI To Review the Code for Detecting Vulnerabilities

In this article, I present how generative AI can be leveraged to detect such vulnerabilities. I have used the OpenAI foundation model "gpt-3.5-turbo" to demonstrate the code scan feature that finds the aforementioned vulnerabilities. However, any foundation model can be used to implement this use case, and the implementation can be written in Python or any other language. The Python code can be used in the following ways:

Python code can be executed manually to scan the code repository.
It can be integrated into the CI/CD build pipeline, which can scan and report the vulnerabilities and fail the build if vulnerabilities are present.
It can be integrated into another program, such as a Lambda function, which runs periodically, executes the Python code to scan the code repository, and reports the vulnerabilities.

High-Level Architecture

There are many ways to execute the Python code. A more appropriate and practical way is to integrate the Python code into the CI/CD build pipeline:

The CI/CD build pipeline executes the Python code.
The Python code reads the MuleSoft XML files and property files.
The Python code sends the MuleSoft code content and the prompt to the OpenAI gpt-3.5-turbo model.
The OpenAI model returns the hardcoded and unencrypted values.
The Python code generates a report of the vulnerabilities found.

Implementation Details

A MuleSoft API project structure contains two major sections where security-sensitive parameters can be exposed as plain text:

The src/main/mule folder contains all the XML files, which define process flows, connection details, and exception handling. A MuleSoft API project may also contain custom Java code; however, in this article, I have not considered custom Java code used in the MuleSoft API.
The src/main/resources folder contains environment property files. These can be .properties or .yaml files for development, quality, and production. They contain property key-value pairs, for example, user, password, host, port, accessKey, and secretAccessKey, in an encrypted format.

Based on the MuleSoft project structure, the implementation can be achieved in two steps.

MuleSoft XML File Scan

The actual code is defined as process flows in MuleSoft Anypoint Studio.
We can write Python code that uses the OpenAI foundation model with a prompt that scans the MuleSoft XML files containing the code implementation to find hardcoded parameter values. For example:

Global.xml/Config.xml file: This file contains all the connector configurations. This is the standard recommended by MuleSoft; however, it may vary depending on the standards and guidelines defined in your organization. A generative AI foundation model can use this content to find hardcoded values.
Other XML files: These files may contain custom code or process flows calling other systems through API calls, DB calls, or any other system call. These may have connection credentials hardcoded by mistake. A generative AI foundation model can use this content to find hardcoded values.

I have provided screenshots of a sample MuleSoft API project. This project has three XML files: api.xml, which contains the REST API flow; process.xml, which has a JMS-based asynchronous flow; and global.xml, which has all the connection configurations. (Screenshots: api.xml, process.xml, global.xml)

For demonstration purposes, I have used the global.xml file. The code snippet has many hardcoded values for demonstration; the hardcoded values are highlighted in red boxes.

Python Code

The Python code below uses the OpenAI foundation model to scan the above XML file to find the hardcoded values.

Python
import openai, os, glob
from dotenv import load_dotenv

load_dotenv()
APIKEY = os.getenv('API_KEY')
openai.api_key = APIKEY

file_path = "C:/Work/MuleWorkspace/test-api/src/main/mule/global.xml"
try:
    with open(file_path, 'r') as file:
        file_content = file.read()
        print(file_content)
except FileNotFoundError:
    print("File not found.")
except Exception as e:
    print("An error occurred:", e)

message = [
    {"role": "system", "content": "You will be provided with xml as input, and your task is to list the non-hard-coded value and hard-coded value separately. Example: For instance, if you were to find the hardcoded values, the hard-coded value look like this: name=\"value\". if you were to find the non-hardcoded values, the non-hardcoded value look like this: host=\"${host}\" "},
    {"role": "user", "content": f"input: {file_content}"}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message,
    temperature=0,
    max_tokens=256
)
result = response["choices"][0]["message"]["content"]
print(result)

Once this code is executed, we get the following outcome. (Screenshot: the result from the generative AI model)

Similarly, we can provide api.xml and process.xml to scan for hardcoded values. You can even modify the Python code to read all the XML files iteratively and get the results in sequence for all the files.

Scanning the Property Files

We can use the Python code to send another prompt to the AI model, which can find plain-text passwords kept in property files. In the following screenshot, the dev-secure.yaml file has client_secret as an encrypted value, while db.password and jms.password are kept as plain text. (Screenshot: config file)

Python Code

The Python code below uses the OpenAI foundation model to scan config files to find the unencrypted values.
Python
import openai, os, glob
from dotenv import load_dotenv

load_dotenv()
APIKEY = os.getenv('API_KEY')
openai.api_key = APIKEY

file_path = "C:/Work/MuleWorkspace/test-api/src/main/resources/config/secure/dev-secure.yaml"
try:
    with open(file_path, 'r') as file:
        file_content = file.read()
except FileNotFoundError:
    print("File not found.")
except Exception as e:
    print("An error occurred:", e)

message = [
    {"role": "system", "content": "You will be provided with xml as input, and your task is to list the encrypted value and unencrypted value separately. Example: For instance, if you were to find the encrypted values, the encrypted value look like this: \"![asdasdfadsf]\". if you were to find the unencrypted values, the unencrypted value look like this: \"sdhfsd\" "},
    {"role": "user", "content": f"input: {file_content}"}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message,
    temperature=0,
    max_tokens=256
)
result = response["choices"][0]["message"]["content"]
print(result)

Once this code is executed, we get the following outcome. (Screenshot: result from the generative AI model)

Impact of Generative AI on the Development Life Cycle

We see a significant impact on the development life cycle. We can think of leveraging generative AI for different use cases related to the development life cycle.

Efficient and Comprehensive Analysis

Generative AI models like GPT-3.5 have the ability to comprehend and generate human-like text. When applied to code review, they can analyze code snippets, provide suggestions for improvements, and even identify patterns that might lead to bugs or vulnerabilities. This technology enables a comprehensive examination of code in a relatively short span of time.

Automated Issue Identification

Generative AI can assist in detecting potential issues such as syntax errors, logical flaws, and security vulnerabilities. By automating these aspects of code review, developers can allocate more time to higher-level design decisions and creative problem-solving.

Adherence To Best Practices

Through analysis of coding patterns and context, generative AI can offer insights on adhering to coding standards and best practices.

Learning and Improvement

Generative AI models can "learn" from vast amounts of code examples and industry practices. This knowledge allows them to provide developers with contextually relevant recommendations. As a result, both the developers and the AI system benefit from a continuous learning cycle, refining their understanding of coding conventions and emerging trends.

Conclusion

In conclusion, conducting a code review to find security-sensitive parameters exposed as plain text using OpenAI's technology has proven to be a valuable and efficient process. Leveraging OpenAI for code review not only accelerated the review process but also contributed to producing more robust and maintainable code. However, it's important to note that while AI can greatly assist in the review process, human oversight and expertise remain crucial for making informed decisions and fully understanding the context of the code.
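As mentioned above, the scan can be wired into a CI/CD pipeline and extended to walk every XML and property file in the repository. The sketch below shows one way to do that; the project path, the abbreviated prompts, and the naive report-parsing gate are illustrative assumptions rather than part of the original implementation.

Python
import glob
import os
import sys

import openai
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("API_KEY")

PROJECT_ROOT = "C:/Work/MuleWorkspace/test-api"  # adjust to your repository checkout

def scan_file(path, system_prompt):
    # Send one file's content plus the prompt to the model and return its report.
    with open(path, "r", encoding="utf-8") as handle:
        content = handle.read()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"input: {content}"},
        ],
        temperature=0,
        max_tokens=256,
    )
    return response["choices"][0]["message"]["content"]

def main():
    # Abbreviated prompts; reuse the full prompts shown earlier in the article.
    xml_prompt = "List the hard-coded values and non-hard-coded values separately."
    yaml_prompt = "List the encrypted values and unencrypted values separately."
    failed = False
    for path in glob.glob(f"{PROJECT_ROOT}/src/main/mule/*.xml"):
        report = scan_file(path, xml_prompt)
        print(f"--- {path}\n{report}")
        failed = failed or "hard-coded" in report.lower()  # naive gate on the model's report
    for path in glob.glob(f"{PROJECT_ROOT}/src/main/resources/**/*.yaml", recursive=True):
        report = scan_file(path, yaml_prompt)
        print(f"--- {path}\n{report}")
        failed = failed or "unencrypted" in report.lower()
    return 1 if failed else 0  # a non-zero exit code fails the CI/CD build step

if __name__ == "__main__":
    sys.exit(main())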
In this tutorial, developers, solution architects, and data engineers can learn how to build high-performance, scalable, and fault-tolerant applications that react to real-time data using Kafka and Hazelcast. We will be using Wikimedia as a real-time data source. Wikimedia provides various streams and APIs (Application Programming Interfaces) to access real-time data about edits and changes made to their projects. For example, this source provides a continuous stream of updates on recent changes, such as new edits or additions to Wikipedia articles. Developers and solution architects often use such streams to monitor and analyze the activity on Wikimedia projects in real-time or to build applications that rely on this data, like this tutorial. Kafka is great for event streaming architectures, continuous data integration (ETL), and messaging systems of record (database). Hazelcast is a unified real-time stream data platform that enables instant action on data in motion by combining stream processing and a fast data store for low-latency querying, aggregation, and stateful computation against event streams and traditional data sources. It allows you to build resource-efficient, real-time applications quickly. You can deploy it at any scale from small edge devices to a large cluster of cloud instances. In this tutorial, we will guide you through setting up and integrating Kafka and Hazelcast to enable real-time data ingestion and processing for reliable streaming processing. By the end, you will have a deep understanding of how to leverage the combined capabilities of Hazelcast and Kafka to unlock the potential of streaming processing and instant action for your applications. So, let's get started! Wikimedia Event Streams in Motion First, let’s understand what we are building: Most of us use or read Wikipedia, so let’s use Wikipedia's recent changes as an example. Wikipedia receives changes from multiple users in real time, and these changes contain details about the change such as title, request_id, URI, domain, stream, topic, type, user, topic, title_url, bot, server_name, and parsedcomment. We will read recent changes from Wikimedia Event Streams. Event Streams is a web service that exposes streams of structured event data in real time. It does it over HTTP with chunked transfer encoding in accordance with the Server-Sent Events protocol (SSE). Event Streams can be accessed directly through HTTP, but they are more often used through a client library. An example of this is a “recentchange”. But what if you want to process or enrich changes in real time? For example, what if you want to determine if a recent change is generated by a bot or human? How can you do this in real time? There are actually multiple options, but here we’ll show you how to use Kafka to transport data and how to use Hazelcast for real-time stream processing for simplicity and performance. Here’s a quick diagram of the data pipeline architecture: Prerequisites If you are new to Kafka or you’re just getting started, I recommend you start with Kafka Documentation. If you are new to Hazelcast or you’re just getting started, I recommend you start with Hazelcast Documentation. For Kafka, you need to download Kafka, start the environment, create a topic to store events, write some events to your topic, and finally read these events. Here’s a Kafka Quick Start. For Hazelcast, you can use either the Platform or the Cloud. I will use a local cluster. 
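Before wiring up Kafka, it can help to peek at the raw Wikimedia recent-change stream. The short sketch below does this in Python with the requests library; it is only a quick exploration aid and is not part of the Java tutorial that follows.

Python
import json
import requests

URL = "https://stream.wikimedia.org/v2/stream/recentchange"

# The endpoint speaks Server-Sent Events; each event's JSON payload follows a "data: " prefix.
with requests.get(URL, stream=True, timeout=30) as response:
    seen = 0
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alive lines and SSE comments
        change = json.loads(line[len(b"data: "):])
        print(change.get("title"), "| bot:", change.get("bot"))
        seen += 1
        if seen >= 5:  # stop after a few events
            break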
Step #1: Start Kafka

Run the following commands to start all services in the correct order:

Shell
# Start the ZooKeeper service
$ bin/zookeeper-server-start.sh config/zookeeper.properties

Open another terminal session and run:

Shell
# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties

Once all services have successfully launched, you will have a basic Kafka environment running and ready to use.

Step #2: Create a Java Application Project

The pom.xml should include the following dependencies in order to run Hazelcast and connect to Kafka:

XML
<dependencies>
    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast</artifactId>
        <version>5.3.1</version>
    </dependency>
    <dependency>
        <groupId>com.hazelcast.jet</groupId>
        <artifactId>hazelcast-jet-kafka</artifactId>
        <version>5.3.1</version>
    </dependency>
</dependencies>

Step #3: Create a Wikimedia Publisher Class

Basically, the class reads from a URL connection, creates a Kafka producer, and sends messages to a Kafka topic:

Java
public static void main(String[] args) throws Exception {
    String topicName = "events";
    URLConnection conn = new URL("https://stream.wikimedia.org/v2/stream/recentchange").openConnection();
    BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
    try (KafkaProducer<Long, String> producer = new KafkaProducer<>(kafkaProps())) {
        for (long eventCount = 0; ; eventCount++) {
            String event = reader.readLine();
            producer.send(new ProducerRecord<>(topicName, eventCount, event));
            System.out.format("Published '%s' to Kafka topic '%s'%n", event, topicName);
            Thread.sleep(20 * (eventCount % 20));
        }
    }
}

private static Properties kafkaProps() {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "127.0.0.1:9092");
    props.setProperty("key.serializer", LongSerializer.class.getCanonicalName());
    props.setProperty("value.serializer", StringSerializer.class.getCanonicalName());
    return props;
}

Step #4: Create a Main Stream Processing Class

This class creates a pipeline that reads from a Kafka source using the same Kafka topic, and then it filters out messages that were created by bots (bot:true), keeping only messages created by humans. It sends the output to a logger:

Java
public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.readFrom(KafkaSources.kafka(kafkaProps(), "events"))
     .withNativeTimestamps(0)
     .filter(event -> Objects.toString(event.getValue()).contains("bot\":false"))
     .writeTo(Sinks.logger());
    JobConfig cfg = new JobConfig().setName("kafka-traffic-monitor");
    HazelcastInstance hz = Hazelcast.bootstrappedInstance();
    hz.getJet().newJob(p, cfg);
}

private static Properties kafkaProps() {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "127.0.0.1:9092");
    props.setProperty("key.deserializer", LongDeserializer.class.getCanonicalName());
    props.setProperty("value.deserializer", StringDeserializer.class.getCanonicalName());
    props.setProperty("auto.offset.reset", "earliest");
    return props;
}

Step #5: Enriching a Stream

If you want to enrich real-time messages with batch or static data, such as location details, labels, or some features, you can follow these steps:

Create a Hazelcast Map and load the static data into it.
Use the Map to enrich the message stream using mapUsingIMap.

Conclusion

In this post, we explained how to build a real-time application to process Wikimedia streams using Kafka and Hazelcast.
Hazelcast allows you to quickly build resource-efficient, real-time applications. You can deploy it at any scale, from small-edge devices to a large cluster of cloud instances. A cluster of Hazelcast nodes shares the data storage and computational load, which can dynamically scale up and down. Referring to the Wikimedia example, it means that this solution is reliable, even when there are significantly higher volumes of users making changes to Wikimedia. We look forward to your feedback and comments about this blog post!
Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system. However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging. With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post. We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention. Let’s dive in with a look at log levels and formats. Logging Design Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats. Log Levels Log levels are used to categorize log messages based on their severity. Specific log levels used may vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest): TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change. DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments. INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution. WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary. ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality. Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively. When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently. Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels. Structured Versus Unstructured Logs Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically. Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems. On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly. However, programmatically extracting specific information from the resulting logs can be very challenging. Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. 
If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits. However, if all you need is simplicity and readability, then unstructured logs may be sufficient. In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages. For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning. Logging Challenges With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings. Disparate Destinations Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome. For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic. Varying Formats Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields. Unifying the information you get from a diversity of logs and formats requires the right tools. Inconsistent Log Levels Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels. One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation. You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose. Log Storage Cost Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system. Dealing With These Challenges While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices. Aggregate Your Logs When you run a distributed system, you should use a centralized logging solution. As you run log collection agents on each machine in your system, these collectors will send all the logs to your central observability platform. Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation. Move Toward a Unified Format Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format. The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down. 
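As a simplified illustration of such a transformation, the sketch below normalizes two hypothetical component formats into one structured schema; the field names, severity mapping, and example log lines are assumptions, not a prescribed standard.

Python
import json
import re
from datetime import datetime, timezone

# Map component-specific severities onto one shared set of log levels.
LEVEL_MAP = {"WARN": "WARNING", "ERR": "ERROR", "FATAL": "ERROR"}

# Pattern for a hypothetical legacy component: "2023-07-01 10:00:01 [ERR] message".
LEGACY_PATTERN = re.compile(r"^(?P<ts>\S+ \S+) \[(?P<level>\w+)\] (?P<msg>.*)$")

def normalize(raw: str, component: str) -> dict:
    """Convert a raw log line into a unified structured record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component": component,
        "level": "INFO",
        "message": raw.strip(),
    }
    if raw.lstrip().startswith("{"):  # component that already emits JSON
        parsed = json.loads(raw)
        record["timestamp"] = parsed.get("time", record["timestamp"])
        severity = parsed.get("severity", "INFO")
        record["level"] = LEVEL_MAP.get(severity, severity)
        record["message"] = parsed.get("msg", "")
    else:  # unstructured legacy component
        match = LEGACY_PATTERN.match(raw)
        if match:
            record["timestamp"] = match.group("ts")
            record["level"] = LEVEL_MAP.get(match.group("level"), match.group("level"))
            record["message"] = match.group("msg")
    return record

print(json.dumps(normalize('{"time": "2023-07-01T10:00:00Z", "severity": "WARN", "msg": "disk almost full"}', "billing-api")))
print(json.dumps(normalize("2023-07-01 10:00:01 [ERR] connection refused", "legacy-batch")))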
Establish a Logging Standard Across Your Applications For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics. If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard. If a migration is not feasible, treat your legacy applications like you would third-party applications. Enrich Logs From Third-Party Sources Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities. To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics). Manage Log Volume, Frequency, and Retention Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage. Volume: Monitoring generated log volume helps you control resource consumption and performance impacts. Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring. Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage. Rotation: Periodically archive or purge older log files to manage log file sizes effectively. Compression: Compress log files to reduce storage requirements. Log-Based Metrics Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working log-based metrics has its benefits and challenges. Benefits Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues. Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience. Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability. Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs. Challenges Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn’t make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task. This identification requires a deep understanding of system behavior and close alignment with your business objectives. Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next. Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge. Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. 
For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively. Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance. Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information, not contributing toward meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events Long-Term Log Retention After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area. How Long Should You Keep Logs Around? How long you should keep a log around depends on several factors, including: Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting. Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years. Company policy: Your company may have policies that dictate how long logs should be kept. Log size: If your logs are large, you may need to rotate them or delete them more frequently. Storage cost: Regardless of where you store your logs—on-premise or in the cloud—you will need to factor in the cost of storage. How Do You Reduce the Level of Detail and Cost of Older Logs? Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around. When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures: Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10). Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code. Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements. Conclusion In this article, we’ve looked at how to get the most out of logging in large-scale systems. Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources. Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system. And you can do this while keeping your logging costs at bay.
One of my current talks focuses on observability in general and distributed tracing in particular, with an OpenTelemetry implementation. In the demo, I show how you can see the traces of a simple distributed system consisting of the Apache APISIX API Gateway, a Kotlin app with Spring Boot, a Python app with Flask, and a Rust app with Axum. Earlier this year, I spoke in and attended the Observability room at FOSDEM. One of the talks demoed the Grafana stack: Mimir for metrics, Tempo for traces, and Loki for logs. I was pleasantly surprised by how easily one could move from one to the other. Thus, I wanted to achieve the same in my demo, but via OpenTelemetry to avoid coupling to the Grafana stack. In this blog post, I want to focus on logs and Loki.

Loki Basics and Our First Program

At its core, Loki is a log storage engine:

"Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream." - Loki

Loki provides a RESTful API to store and read logs. Let's push a log from a Java app. Loki expects a specific payload structure. I'll use Java, but you can achieve the same result with a different stack. The most straightforward code is the following:

Java
public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
    var template = "'{' \"streams\": ['{' \"stream\": '{' \"app\": \"{0}\" '}', \"values\": [[ \"{1}\", \"{2}\" ]]'}']'}'"; //1
    var now = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant();
    var nowInEpochNanos = NANOSECONDS.convert(now.getEpochSecond(), SECONDS) + now.getNano();
    var payload = MessageFormat.format(template, "demo", String.valueOf(nowInEpochNanos), "Hello from Java App"); //1
    var request = HttpRequest.newBuilder() //2
        .uri(new URI("http://localhost:3100/loki/api/v1/push"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); //3
}

1. This is how we did String interpolation in the old days
2. Create the request
3. Send it

The prototype works, as seen in Grafana. However, the code has many limitations:

The label is hard-coded; you can and must send a single label.
Everything is hard-coded; nothing is configurable, e.g., the URL.
The code sends one request for every log; it's hugely inefficient, as there's no buffering.
The HTTP client is synchronous, thus blocking the thread while waiting for Loki.
There is no error handling whatsoever.
Loki offers both gzip compression and Protobuf; neither is supported by my code.

Finally, it's completely unrelated to how we use logs, e.g.:

Java
var logger = // Obtain logger
logger.info("My message with parameters {}, {}", foo, bar);

Regular Logging on Steroids

To use the above statement, we need to choose a logging implementation. Because I'm more familiar with it, I'll use SLF4J and Logback. Don't worry; the same approach works for Log4J2.
We need to add the relevant dependencies:

XML
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>            <!--1-->
    <version>2.0.7</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>      <!--2-->
    <version>1.4.8</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>com.github.loki4j</groupId>
    <artifactId>loki-logback-appender</artifactId> <!--3-->
    <version>1.4.0</version>
    <scope>runtime</scope>
</dependency>

1. SLF4J is the interface
2. Logback is the implementation
3. Loki appender for Logback

Now, we add a specific Loki appender and reference it from the root logger:

XML
<appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">        <!--1-->
    <http>
        <url>http://localhost:3100/loki/api/v1/push</url>                      <!--2-->
    </http>
    <format>
        <label>
            <pattern>app=demo,host=${HOSTNAME},level=%level</pattern>          <!--3-->
        </label>
        <message>
            <pattern>l=%level h=${HOSTNAME} c=%logger{20} t=%thread | %msg %ex</pattern> <!--4-->
        </message>
        <sortByTime>true</sortByTime>
    </format>
</appender>
<root level="DEBUG">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="LOKI" />
</root>

1. The Loki appender
2. Loki URL
3. As many labels as wanted
4. Regular Logback pattern

Our program has become much more straightforward:

Java
var who = //...
var logger = LoggerFactory.getLogger(Main.class.toString());
logger.info("Hello from {}!", who);

Grafana displays the following. (Screenshot)

Docker Logging

I'm running most of my demos on Docker Compose, so I'll mention the Docker logging trick. When a container writes to standard out, Docker saves it to a local file. The docker logs command can access the file content. However, options other than saving to a local file are available, e.g., syslog, Google Cloud, Splunk, etc. To choose a different option, one sets a logging driver. One can configure the driver at the overall Docker level or per container. Loki offers its own plugin. To install it:

Shell
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

At this point, we can use it on our container app:

YAML
services:
  app:
    build: .
    logging:
      driver: loki                                                  #1
      options:
        loki-url: http://localhost:3100/loki/api/v1/push            #2
        loki-external-labels: container_name={{.Name}},app=demo     #3

1. Loki logging driver
2. URL to push to
3. Additional labels

The result is the following. Note the default labels.

Conclusion

From a bird's-eye view, Loki is nothing extraordinary: it's a plain storage engine with a RESTful API on top. Several approaches are available to use the API. Beyond the naive one, we have seen a Java logging framework appender and Docker. Other approaches include scraping the log files, e.g., Promtail, via a Kubernetes sidecar. You could also add an OpenTelemetry Collector between your app and Loki to perform transformations. Options are virtually unlimited. Be careful to choose the one that fits your context the best.

To go further:
Push log entries to Loki via API
Loki Clients
Intro to Istio Observability Using Prometheus

Istio service mesh abstracts the network from the application layers using sidecar proxies. You can implement security and advanced networking policies for all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput. In this article, we will discuss how SREs can benefit from integrating three open-source tools: Istio, Prometheus, and Grafana. While Istio is the most famous service mesh, Prometheus is the most widely used monitoring software, and Grafana is the most famous visualization tool. Note: The steps are tested for Istio 1.17.X.

Watch the Video of the Istio, Prometheus, and Grafana Configuration

Watch the video if you want to follow the steps from it.

Step 1: Go to Istio Add-Ons and Apply the Prometheus and Grafana YAML Files

First, go to the add-ons folder in the Istio directory. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons. You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus and Grafana using the following commands:

Shell
kubectl apply -f prometheus.yaml
kubectl apply -f grafana.yaml

Note: these add-on YAMLs are applied to the istio-system namespace by default.

Step 2: Deploy a New Service and Port-Forward the Istio Ingress Gateway

To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an Istio ingress gateway object to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port: 7777. You should see the below screen at localhost:7777.

Step 3: Open the Prometheus and Grafana Dashboards

You can open the Prometheus and Grafana dashboards using the following commands:

Shell
istioctl dashboard prometheus
istioctl dashboard grafana

Both Grafana and Prometheus will open on localhost.

Step 4: Make HTTP Requests From Postman

We will see how the httpbin service consumes CPU and memory when there is a traffic load. We will create a few GET and POST requests to localhost:7777 from the Postman app. Once you send GET or POST requests to the httpbin service multiple times, resources will be utilized, and we can see that in Grafana. But first, we need to configure the metrics for the httpbin service in Prometheus and Grafana.

Step 5: Configuring Metrics in Prometheus

One can select a range of metrics related to any Kubernetes resources such as the API server, applications, workloads, Envoy, etc. We will select the container_memory_working_set_bytes metric for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term:

container_memory_working_set_bytes{namespace="istio-telemetry"}

(istio-telemetry is the name of our Istio-enabled namespace, where the httpbin service is deployed.) Note that by simply running this, we get the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage of each pod, grouped by pod.
The following query will help us in getting the desired result : sum(container_memory_working_set_bytes{namespace=”istio-telemetry”}) by (pod) Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article was to showcase the ability of Istio to emit and send metrics to Prometheus for collection Step 6: Configuring Istio Metrics Graphs in Grafana Now, you can simply take the query sum(container_memory_working_set_bytes{namespace=”istio-telemetry”}) by (pod) in Prometheus and plot a graph with time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles for sharing with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana. You can choose to make all the customization based on your enterprise needs. I have done a few experiments in the video; feel free to check it out. Conclusion Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have just offered a small use case of metrics scrapping and visualization using Istio, Prometheus, and Grafana. You can perform logging and tracing of logs and real-time traffic using Istio; we will cover those topics in our subsequent blogs.
There are many libraries out there that can be used in machine learning projects. Of course, some of them have gained considerable reputations through the years. Such libraries are the straight-away picks for anyone starting a new project which utilizes machine learning algorithms. However, choosing the correct set (or stack) may be quite challenging.

The Why

In this post, I would like to give you a general overview of the machine learning libraries landscape and share some of my thoughts about working with them. If you are starting your journey with a machine learning library, my text can give you some general knowledge of the machine learning libraries and provide a better starting point for learning more. The libraries described here will be divided by the role they can play in your project. The categories are as follows:

Model creation - Libraries that can be used to create machine learning models
Working with data - Libraries that can be used for feature engineering, feature extraction, and all other operations that involve working with features
Hyperparameter optimization - Libraries and tools that can be used for optimizing model hyperparameters
Experiment tracking - Libraries and tools used for experiment tracking
Problem-specific libraries - Libraries that can be used for tasks like time series forecasting, computer vision, and working with spatial data
Utils - Not strictly machine learning libraries, but nevertheless ones I found useful in my projects

Model Creation

PyTorch

Developed by people from Facebook and open-sourced in 2017, it is one of the market's most famous machine learning libraries, based on the open-source Torch package. The PyTorch ecosystem can be used for all types of machine learning problems and has a great variety of purpose-built libraries like torchvision or torchaudio. The basic data structure of PyTorch is the Tensor object, which is used to hold the multidimensional data utilized by our model. It is similar in its conception to the NumPy ndarray. PyTorch can also use computation accelerators, and it supports CUDA-capable NVIDIA GPUs, ROCm, Metal API, and TPUs. The most important part of the core PyTorch library is the nn module, which contains layers and tools to easily build complex models layer by layer.

Python
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

An example of a simple neural network with 3 linear layers in PyTorch

Additionally, PyTorch 2.0 is already released, and it makes PyTorch even better. Moreover, PyTorch is used by a variety of companies like Uber, Tesla, and Facebook, just to name a few.

PyTorch Lightning

It is a sort of "extension" for PyTorch which aims to greatly reduce the amount of boilerplate code needed to utilize our models. Lightning is based on the concept of hooks: functions called on specific phases of the model train/eval loop. Such an approach allows us to pass callback functions executed at a specific time, like the end of a training step. Lightning's Trainer automates many functions that one has to take care of in PyTorch, for example, the training loop, hardware calls, or zeroing gradients. Below are roughly equivalent code fragments of PyTorch (Left) and PyTorch Lightning (Right).
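The original article presents that comparison as a side-by-side image, which is not reproduced here. As a rough illustration of the Lightning style described above, here is a minimal, hedged sketch of a LightningModule for the same network; the random dataset and the Trainer flags are illustrative assumptions, not code from the article.

Python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 10),
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.linear_relu_stack(self.flatten(x))

    # Hook called by the Trainer for every batch; no manual loop, zero_grad, or backward() needed.
    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Dummy random data so the snippet runs end to end.
data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
trainer.fit(LitClassifier(), DataLoader(data, batch_size=32))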
TensorFlow

A library developed by a team from Google Brain, originally released in 2015 under the Apache 2.0 license; version 2.0 was released in 2019. It provides clients in Java, C++, Python, and even JavaScript. Similar to PyTorch, it is widely adopted throughout the market and used by companies like Google (surprise), Airbnb, and Intel. TensorFlow also has quite an extensive ecosystem built around it by Google. It contains tools and libraries such as an optimization toolkit, TensorBoard (more about it in the "Experiment Tracking" section below), and recommenders. The TensorFlow ecosystem also includes a web-based sandbox to play around with your model's visualization. Again, the tf.nn module plays the most vital part, providing all the building blocks required to build machine learning models. TensorFlow uses its own Tensor (flow ;p) object for holding the data utilized by deep learning models. It also supports all the common computation accelerators like CUDA, ROCm (community), Metal API, and TPUs.

Python
class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])

    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Note that in TensorFlow and Keras, we use the Dense layer instead of the Linear layer used in PyTorch. We also use the call method instead of the forward method to define the forward pass of the model.

Keras

It is a library similar in meaning and inception to PyTorch Lightning, but for TensorFlow. It offers a more high-level interface over TensorFlow. Developed by François Chollet and released in 2015, it provides only Python clients. Keras also has its own set of Python libraries and problem-specific libraries, like KerasCV or KerasNLP, for more specialized use cases. Before version 2.4, Keras supported more backends than just TensorFlow, but after that release, TensorFlow became the only supported backend. As Keras is just an interface for TensorFlow, it shares the same base concepts as its underlying backend. The same holds true for supported computation accelerators. Keras is used by companies like IBM, PayPal, and Netflix.

Python
class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])

    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

PyTorch vs. TensorFlow

I wouldn't be fully honest if I didn't introduce some comparison between these two. As you could read a moment before, both of them are quite similar in the features they offer and the ecosystems around them. Of course, there are some minor differences and quirks in how both work or the features they provide. In my opinion, these are more or less insignificant. The real difference between them comes from their approach to defining and executing the computational graphs of machine and deep learning models. PyTorch uses dynamic computational graphs, which means that the graph is defined on the fly during execution.
This allows for more flexibility and intuitive debugging, as developers can modify the graph at runtime and easily inspect intermediate outputs. On the other hand, this approach may be less efficient than static graphs, particularly for complex models. However, PyTorch 2.0 attempts to address these issues via torch.compile and FX graphs. TensorFlow uses static computational graphs, which are compiled before execution. This allows for more efficient execution, as the graph can be optimized and parallelized for the target hardware. However, it can also make debugging more difficult, as intermediate outputs are not readily accessible. Another noticeable difference is that PyTorch seems to be more low-level than Keras while being more high-level than pure TensorFlow. Such a setting makes PyTorch more elastic and easier to use for making tailored models with many customizations. As a side note, I would like to add that both libraries are on equal terms in terms of market share. Additionally, despite the fact that TensorFlow uses the call method and PyTorch uses the forward method, both libraries support call semantics as a shorthand for calling model(x).

Working With Data

pandas

A library that you must have heard of if you are using Python; it is probably the most famous Python library for working with data of any type. It was originally released in 2008, and version 1.0 followed in 2020. It provides functions for filtering, aggregating, and transforming data, as well as merging multiple datasets. The cornerstone of this library is the DataFrame object, which represents a multidimensional table of any type of data. The library heavily focuses on performance, with some parts written in pure C to boost it. Besides being performance-focused, pandas provides a lot of features related to:

Data cleaning and preprocessing (removing duplicates, filling null or NaN values)
Time series analysis (resampling, windowing, time shifts)

Additionally, it can perform a variety of input/output operations:

Reading from/writing to .csv or .xlsx files
Performing database queries
Loading data from GCP BigQuery (with the help of pandas-gbq)

NumPy

It is yet another famous library for working with data, mostly numeric data. The most famous part of NumPy is the ndarray, a structure representing a multidimensional array of numbers. Besides ndarray, NumPy provides a lot of high-level mathematical functions and operations used to work with this data. It is also probably the oldest library in this set, as the first version was released in 2005. It was implemented by Travis Oliphant, based on an even older library called Numeric (released in 1996). NumPy is extremely focused on performance, with contributors trying to optimize more and more of the currently implemented algorithms to reduce the execution time of NumPy functions even further. Of course, as with all libraries described here, NumPy is also open source and uses a BSD license.

SciPy

It is a library focused on supporting scientific computations. It is even older than NumPy (2005), having been released in 2001. It is built atop NumPy, with ndarray being the basic data structure used throughout SciPy. Among other things, the library adds functions for optimization, linear algebra, signal processing, interpolation, and sparse matrix support. In general, it is more high-level than NumPy and thus can provide more complex functions.
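To make the pandas feature list above concrete, here is a small sketch using made-up data; it shows deduplication, filling missing values, and time-series resampling, and is only an illustration, not code from the original article.

Python
import numpy as np
import pandas as pd

# Made-up sensor readings with one duplicate row and one missing value.
frame = pd.DataFrame(
    {
        "sensor": ["a", "a", "b", "b", "b"],
        "reading": [1.0, 1.0, np.nan, 2.5, 3.0],
    },
    index=pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00", "2023-01-01 00:10",
        "2023-01-01 00:40", "2023-01-01 01:05",
    ]),
)

cleaned = (
    frame.reset_index()
    .drop_duplicates()            # removing duplicates
    .set_index("index")
    .fillna({"reading": 0.0})     # filling null/NaN values
)

# Time series analysis: resample into 30-minute buckets and aggregate each bucket.
print(cleaned["reading"].resample("30min").mean())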
Hyperparameter Optimization

Ray Tune

It is part of the Ray toolset, a bundle of related libraries for building distributed applications with a focus on machine learning and Python. The Tune part of the library is focused on providing hyperparameter optimization features by offering a variety of search algorithms, for example, grid search, Hyperband, or Bayesian optimization. Ray Tune can work with models created in most programming languages and libraries available on the market. All of the libraries described in the section about model creation are supported by Ray Tune. The key concepts of Ray Tune are:

Trainables - Objects passed to Tune runs; they wrap the model whose parameters we want to optimize.
Search space - Contains all the values of hyperparameters we want to check in the current trial.
Tuner - An object responsible for executing runs; calling tuner.fit() starts the process of searching for the optimal hyperparameter set. It requires passing at least a trainable object and a search space.
Trial - Each trial represents a particular run of a trainable object with an exact set of parameters from the search space. Trials are generated by the Ray Tune Tuner. As it represents the output of running the tuner, a trial contains a lot of information, such as the config used for that trial, the trial ID, and many others.
Search algorithms - The algorithm used for a particular execution of Tuner.fit; if not provided, Ray Tune will use RandomSearch as the default.
Schedulers - Objects that are in charge of managing runs. They can pause, stop, and run trials within the run, which can result in increased efficiency and reduced run time. If none is selected, Tune will pick FIFO as the default: trials will be executed one by one, as in a classic queue.
Run analysis - The object containing the results of the Tuner.fit execution in the form of a ResultGrid object. It contains all the data related to the run, like the best result among all trials or the data from all trials.

BoTorch

BoTorch is a library built atop PyTorch and is part of the PyTorch ecosystem. It focuses solely on hyperparameter optimization with the use of Bayesian optimization. As the only library in this section designed to work with a specific model library, BoTorch may be problematic to use with libraries other than PyTorch. It is also the only one currently in beta and under intensive development, so some unexpected problems may occur. The key feature of BoTorch is its integration with PyTorch, which greatly impacts the ease of interaction between the two.

Experiment Tracking

Neptune.ai

It is a web-based tool that serves both as an experiment tracker and a model registry. The tool is cloud-based in the classic SaaS model, but if you are determined enough, there is the possibility of using a self-hosted variant. It provides a dashboard where you can view and compare the results of training your model. It can also be used to store the parameters used for particular runs. Additionally, you can easily version the dataset used for particular runs and all the metadata you think may come in handy. Moreover, it enables easy version control for your models. Obviously, the tool is library agnostic and can host models created using any library. To make the integration possible, Neptune exposes a REST-style API with its own client. The client can be downloaded and installed via pip or any other Python dependency tool. The API is decently documented and quite easy to grasp.
The tool is paid and has a simple pricing plan divided into three categories. Yet if you need it just for a personal project, or you are in a research or academic unit, then you can apply to use the tool for free. Neptune.ai is a new tool, so some features known from other experiment tracking tools may not be present. Despite that fact, Neptune.ai support is keen to react to user feedback and implement missing functionalities; at least, that was the case for us, as we use Neptune.ai extensively in The Codos Project.

Weights & Biases

Also known as WandB or W&B, this is a web-based tool that exposes all the functionality needed to be used as an experiment tracking tool and model registry. It exposes a more or less similar set of functionalities to Neptune.ai. However, Weights & Biases seems to have better visualizations and, in general, is a more mature tool than Neptune. Additionally, WandB seems to be more focused on individual projects and researchers, with less emphasis on collaboration. It also has a simple pricing plan divided into three categories, with a free tier for private use. Yet W&B has the same approach as Neptune to researchers and academic units: they can always use Weights & Biases for free. Weights & Biases also exposes a REST-like API to ease the integration process. It seems to be better documented and offers more features than the one exposed by Neptune.ai. What is curious is that they also expose a client library written in Java, if, for some reason, you wrote a machine learning model in Java instead of Python.

TensorBoard

It is a dedicated visualization toolkit for the TensorFlow ecosystem. It is designed mostly to work as an experiment tracking tool with a focus on metrics visualization. Despite being a dedicated TensorFlow tool, it can also be used with Keras (not surprisingly) and PyTorch. Additionally, it is the only free tool of the three described in this section. Here you can host and track your experiments. However, TensorBoard is missing the functionality responsible for a model registry, which can be quite problematic and force you to use some third-party tool to cover this missing feature. Anyway, there is for sure a tool for this in the TensorFlow ecosystem. As it is directly part of the TensorFlow ecosystem, its integration with Keras or TensorFlow is much smoother than with either of the previous two tools.

Problem-Specific Libraries

tsaug

One of only a few libraries for the augmentation of time series, it is an open-source library created and maintained by a single person under the GitHub name nick tailaiw, released in 2019, and currently in version 0.2.1. It provides a set of 9 augmentations, among them Crop, AddNoise, and TimeWarp. The library is reasonably well documented for such a project and easy to use from a user perspective. Unfortunately, for unknown reasons (at least to me), the library seems dead and has not been updated for 3 years. There are many open issues, but they are not getting any attention. Such a situation is quite sad in my opinion, as there are not many other libraries that provide augmentation for time series data. However, if you are looking for a library for time series data mining or augmentation and want a more up-to-date one, Tsfresh may be a good choice.

OpenCV

It is a library focused on providing functions for image processing and computer vision. Developed by Intel, it is now open source, based on the Apache 2 license.
OpenCV
It is a library focused on providing functions for working with image processing and computer vision. Developed by Intel, it is now open source under the Apache 2 license. OpenCV provides a set of functions related to image and video processing, image classification, data analysis, and tracking, alongside ready-made machine learning models for working with images and video. If you want to read more about OpenCV, my coworker, Kamil Rzechowski, wrote an article that quite extensively describes the topic.
GeoPandas
It is a library built atop pandas that aims to provide functions for working with geospatial data and structures. It allows easy reading and writing of data in GeoJSON and shapefile formats, or reading data from PostGIS systems. Besides pandas, it has a lot of other dependencies on spatial data libraries like PyGEOS, GeoPy, or Shapely. The library's basic structures are:
GeoSeries - A column of geospatial data, such as a series of points, lines, or polygons
GeoDataFrame - A tabular structure holding a set of GeoSeries
Utils
Matplotlib
As the name suggests, it is a library for creating various types of plots. Besides basic plots like lines or histograms, Matplotlib allows us to create more complex ones: 3D shapes or polar plots. Of course, it also allows us to customize things like plot colors or labels. Despite being somewhat old (released in 2003, so 20 years old at the time of writing), it is actively maintained and developed. With around 17k stars on GitHub, it has quite a community around it and is probably the second pick for anyone needing a data visualization tool. It is well documented and reasonably easy to grasp for a newcomer. It is also worth noting that Matplotlib is used as a base for more high-level visualization libraries.
Seaborn
Speaking of which, we have Seaborn as an example of such a library. Thus, the set of functionalities provided by Seaborn is similar to the one provided by Matplotlib. However, the API is more high-level and requires less boilerplate code to achieve similar results. As for other minor differences, the color palette provided by Seaborn is softer, and the design of the plots is more modern and nicer looking. Additionally, Seaborn is the easiest to integrate with pandas, which may be a significant advantage. Below you can find the code used to create a heatmap in Matplotlib and Seaborn, alongside their output plots. The imports are common to both.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.rand(5, 5)

fig, ax = plt.subplots()
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)
ax.set_xticks(np.arange(data.shape[0]) + 0.5, minor=False)
ax.set_yticks(np.arange(data.shape[1]) + 0.5, minor=False)
ax.set_xticklabels(np.arange(1, data.shape[0] + 1), minor=False)
ax.set_yticklabels(np.arange(1, data.shape[1] + 1), minor=False)
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
cbar = plt.colorbar(heatmap)
plt.show()

sns.heatmap(data, cmap="Blues", annot=True)
# Set plot title and axis labels
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
# Show plot
plt.show()

Hydra
In every project, sooner or later, there is a need to make something a configurable value. Of course, if you are using a tool like Jupyter, then the matter is pretty straightforward: you can just move the desired value to an .env file - et voila, it is configurable. However, if you are building a more standard application, things are not that simple. Here Hydra shows its ugly (but quite useful) head. It is an open-source tool for managing and running configurations of Python-based applications. It is based on the OmegaConf library, and quoting from their main page: "The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line." What proved quite useful for me is exactly the ability described in that quote: hierarchical configuration. In my case, it worked pretty well and allowed a clearer separation of config files.
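To make that concrete, here is a minimal, hedged sketch of a Hydra entry point, assuming Hydra 1.x; the config layout and values are invented for illustration:

# conf/config.yaml (hypothetical):
#   db:
#     host: localhost
#     port: 5432
#   training:
#     lr: 0.001
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # The composed, hierarchical config arrives as a single object
    print(OmegaConf.to_yaml(cfg))
    print("Learning rate:", cfg.training.lr)

if __name__ == "__main__":
    my_app()

Any value can then be overridden from the command line, e.g., python my_app.py training.lr=0.01 db.host=db.example.com, without touching the code.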
coolname
Having a unique identifier for your training runs is always a good idea. If, for various reasons, you do not like UUIDs or just want your IDs to be humanly readable, coolname is the answer. It generates unique, alphabetical, word-based identifiers of various lengths, from 2 to 4 words. As for the number of combinations, it looks more or less like this:
a 4-word identifier has 10^10 combinations
a 3-word identifier has 10^8 combinations
a 2-word identifier has 10^5 combinations
The number is significantly lower than in the case of UUIDs, so the probability of collision is also higher. However, comparing the two is not the point of this text. The vocabulary is hand-picked by the creators, yet they describe it as positive and neutral (more about this here), so you will not see an identifier like you-ugly-unwise-human-being. Of course, the library is fully open source.
tqdm
This library provides progress bar functionality for your application. Even though having such information displayed may not be the most important thing you need, it is still nice to look at and check the progress made by your application during the execution of an important task. Tqdm also uses complex algorithms to estimate the remaining time of a particular task, which may be a game changer and help you organize your time around it. Additionally, tqdm states that it has barely noticeable performance overhead - in nanoseconds. What is more, it is totally standalone and needs only Python to run, so it will not download half of the internet to your disk.
Jupyter Notebook (+JupyterLab)
Notebooks are a great way to share results and work on a project. Through the concept of cells, it is easy to separate different fragments and responsibilities of your code. Additionally, the fact that a single notebook file can contain code, images, and complex text outputs (tables) together only adds to its existing advantages. Moreover, notebooks allow running pip install inside cells and using .env files for configuration. Such an approach moves a lot of software engineering complexity out of the way.
Summary
These are all the various libraries for machine learning that I wanted to describe for you. I aimed to provide a general overview of all the libraries alongside their possible use cases, enriched by a quick note on my own experience with using them. I hope this goal was achieved and that this article deepens your knowledge of the machine learning libraries landscape.
Once we press the merge button, that code is no longer our responsibility. If it performs sub-optimally or has a bug, it is now the problem of the DevOps team, the SRE, etc. Unfortunately, those teams work with a different toolset. If my code uses up too much RAM, they will increase RAM. If the code runs slower, they will increase CPU. If the code crashes, they will increase the number of concurrent instances. If none of that helps, they will call you up at 2 AM. A lot of these problems are visible before they become a disastrous middle-of-the-night call. Yes, DevOps should control production, but the information they gather from production is useful for all of us. This is at the core of developer observability, which is a subject I'm quite passionate about. I'm so excited about it that I dedicated a chapter to it in my debugging book. Back when I wrote that chapter, I dedicated most of it to active developer observability tools like Lightrun, Rookout, et al. These tools work like production debuggers. They are fantastic in that regard. When I have a bug and know where to look, I can sometimes reach for one of these tools (I used to work at Lightrun, so I always use it). But there are other ways. Tools like Lightrun are active in their observability; we add a snapshot similarly to a breakpoint and get the type of data we expect. I recently started playing with Digma, which takes a radically different approach to developer observability. To understand that, we might need to revisit some concepts of observability first.
Observability Isn't Pillars
I've been guilty of listing the pillars of observability just as much as the next guy. They're even in my book (sorry). To be fair, I also discussed what observability really means... Observability means we can ask questions about our system and get answers, or at least have a clearly defined path to get those answers. Sounds simple when running locally, but when you have a sophisticated production environment and someone asks you, "Is anyone even using that block of code?", how do you know? You might have lucked out and had a log in that code, and you might be lucky still: the log is at the right level and piped properly so you can check. The problem is that if you added too many logs or too much observability data, you might have created a cure worse than the disease: over-logging or over-observing. Both can bring down your performance and significantly impact the bank account, so ideally, we don't want too many logs (I discuss over-logging here), and we don't want too much observability. Existing developer observability tools work actively. To answer the question of whether someone is using the code, I can place a counter on the line and wait for results. I can give it a week's timeout and find out in a week. Not a terrible situation, but not ideal either. I don't have that much patience.
Tracing and OpenTelemetry
It's a sad state of affairs that most developers don't use tracing in their day-to-day job. For those of you who don't know it, it is like a call stack for the cloud. It lets us see the stack across servers and through processes. No, not method calls. More at the entry point level, but this often contains details like the database queries that were made and similarly deep insights. There's a lot of history with OpenTelemetry, which I don't want to get into. If you're an observability geek, you already know it, and if not, then it's boring. What matters is that OpenTelemetry is taking over the world of tracing.
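The hands-on demo later in this piece is Java-based, but to make the tracing idea concrete, here is a minimal, hedged sketch of manual instrumentation with the OpenTelemetry Python SDK; the service and span names are invented, and a real deployment would typically rely on an auto-instrumentation agent instead:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to the console
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.route", "/checkout")
    with tracer.start_as_current_span("db-query"):
        pass  # the actual database call would go here

Each span becomes part of a trace that can cross process and server boundaries, which is exactly the "call stack for the cloud" view described above.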
OpenTelemetry is a runtime agent, which means you just add it to the server, and you get tracing information almost seamlessly. It's magic. It also doesn't have a standard server, which makes it very confusing. That means multiple vendors can use a single agent and display the information it collects to various demographics:
A vendor focused on performance can show the timing of various parts of the system.
A vendor focused on troubleshooting can detect potential bugs and issues.
A vendor focused on security can detect potentially risky access.
Background Developer Observability
I'm going to coin a term here since there isn't one: Background Developer Observability. What if the data you need was already there and a system had already collected it for you in the background? That's what Digma is doing. In Digma's terms, it's called Continuous Feedback. Essentially, they're collecting OpenTelemetry data, analyzing it, and displaying it as information that's useful for developers. If Lightrun is like a debugger, then Digma is like SonarQube based on actual runtime and production information. The cool thing is that you probably already use OpenTelemetry without even knowing it. DevOps probably installed that agent already, and the data is already there! Going back to my question: is anyone using this API? If you use Digma, you can see that right away. OpenTelemetry already collected the information in the background, and the DevOps team already paid the price of collection. We can benefit from that too.
Enough Exposition
I know, I go on... Let's get to the meat and potatoes of why this rocks. Notice that this is a demo; when running locally, the benefits are limited. The true value of these tools is in understanding production; still, they can provide a lot of insight even when running locally and even when running tests. Digma has a simple and well-integrated setup wizard for IntelliJ/IDEA. You need to have Docker Desktop running for the setup to succeed. Note that you don't need to run your application using Docker; this is simply for the Digma server process, where they collect the execution details. Once it is installed, we can run our application. In my case, I just ran the JPA unit test from my latest book, and it produced standard traces, which are already pretty cool. We can see them listed below. When we click one of these traces, we get the standard trace view; this is nothing new, but it's really nice to see this information directly in the IDE, readily accessible. I can imagine the immense value this will have for figuring out CI execution issues. But the real value, and where Digma becomes a "developer observability" tool instead of just an observability tool, is the tool window: there is a strong connection to the code directly from the observability data, plus deeper analysis, which doesn't show in my particular, overly simplistic hello world. This tool window highlights problematic traces and errors and helps understand real-world issues.
How Does This Help at 2 AM?
Disasters happen because we aren't looking. I'd like to say I open my observability dashboard regularly, but I don't. Then, when there's a failure, it takes me a while to get my bearings within it. The locality of the applicable data is important: it helps us notice issues when they happen, detect regressions before they turn into failures, and understand the impact of the code we just merged. Prevention starts with awareness, and as developers, we have handed our situational awareness to the DevOps team.
When the failure actually happens, the locality and accessibility of the data make a big difference. Since we use tools that integrate into the IDE daily, this reduces the mean time to a fix. No, a background developer observability tool might not include the information we need to fix a problem. But if it does, then the information is already there, and we need nothing else. That is fantastic.
Final Word
With all the discussion about observability and OpenTelemetry, you would think everyone is using them. Unfortunately, the reality is far from that. Yes, there's some saturation and familiarity in the DevOps crowd, but this is not the case for developers. This is a form of environmental blindness. How can our teams, who are so driven by data and facts, proceed with secondhand and often outdated data from Ops? Should I spend time further optimizing this method, or would I be wasting the effort since few people use it? We can benchmark things locally just fine, but real-world usage and impact are things we all need better visibility into.
Parallel garbage collector (Parallel GC) is one of the oldest garbage collection algorithms introduced in the JVM to leverage the processing power of modern multi-core systems. Parallel GC aims to reduce the impact of GC pauses by utilizing multiple threads to perform garbage collection in parallel. In this article, we will delve into the realm of Parallel GC tuning specifically. However, if you want to learn more about the basics of Garbage Collection tuning, you may watch this JAX London conference talk.
When To Use Parallel GC
You can consider using Parallel GC for your application if you have any one of the following requirements:
Throughput emphasis: If your application has high transactional throughput requirements and can tolerate occasional long pauses for garbage collection, Parallel GC can be a suitable choice. It focuses on maximizing throughput by spreading garbage collection work across multiple threads.
Batch processing: Applications that involve batch processing or data analysis tasks can benefit from Parallel GC. These types of applications often perform extensive computations, and Parallel GC helps minimize the impact of garbage collection on overall processing time.
Heap size considerations: Parallel GC is well-suited for applications with moderate to large heap sizes. If your application requires a substantial heap to accommodate its memory needs, Parallel GC can efficiently manage memory and reduce the impact of garbage collection pauses.
How To Enable Parallel GC
To explicitly configure your application to use Parallel GC, you can pass the following argument when launching your Java application: -XX:+UseParallelGC
This JVM argument instructs the JVM to use the Parallel GC algorithm for garbage collection. However, please note that if you don't explicitly specify a garbage collection algorithm, in all server-class JVMs up to Java 8, the default garbage collector is Parallel GC.
Most Used Parallel GC JVM Arguments
In the realm of Java Parallel GC tuning, there are a few key JVM arguments that provide control over crucial aspects of the garbage collection process. We have grouped those JVM arguments into three buckets:
a. Heap and generation size parameters
b. Goal-based tuning parameters
c. Miscellaneous parameters
Let's get on to the details:
A. Heap and Generation Size Parameters
Garbage collection (GC) tuning for the Parallel Collector involves achieving a delicate balance between the size of the entire heap and the sizes of the Young and Old Generations. While a larger heap generally improves throughput, it also leads to longer pauses during GC. Consequently, finding the optimal size for the heap and the generations becomes crucial. In this section, we will explore key JVM arguments that enable the adjustment of heap size and generation sizes to achieve an efficient GC configuration.
-Xmx: This argument sets the maximum heap size, which establishes the upper limit for memory allocation. By carefully selecting an appropriate value for -Xmx, developers can control the overall heap size to strike a balance between memory availability and GC performance.
-XX:NewSize and -XX:MaxNewSize or -XX:NewRatio: These arguments govern the size of the Young Generation, where new objects are allocated. -XX:NewSize sets the initial size, while -XX:MaxNewSize or -XX:NewRatio control the upper limit or the ratio between the young and tenured generations, respectively. Adjusting these values allows for fine-tuning the size and proportion of the Young Generation.
Here is a success story of a massive technology company that reduced its young generation size and saw significant improvement in its overall application response time. -XX:YoungGenerationSizeIncrement and -XX:TenuredGenerationSizeIncrement: These arguments define the size increments for the Young and Tenured Generations, respectively. The size increments of the young and tenured generations are crucial factors in memory allocation and garbage collection behavior. Growing and shrinking are done at different rates. By default, a generation grows in increments of 20% and shrinks in increments of 5%. The percentage for growth is controlled by the command-line option -XX:YoungGenerationSizeIncrement=<Y> for the young generation and -XX:TenuredGenerationSizeIncrement=<T> for the tenured generation. -XX:AdaptiveSizeDecrementScaleFactor: This argument determines the scale factor used when decrementing generation sizes during shrinking. The percentage by which a generation shrinks is adjusted by the command-line flag -XX:AdaptiveSizeDecrementScaleFactor=<D>. If the growth increment is X percent, then the decrement for shrinking is X/D percent. B. Goal-Based Tuning Parameters To achieve optimal performance in garbage collection, it is crucial to control GC pause times and optimize the GC throughput, which represents the amount of time dedicated to garbage collection compared to application execution. In this section, we will explore key JVM arguments that facilitate goal-based tuning, enabling developers to fine-tune these aspects of garbage collection. -XX:MaxGCPauseMillis: This argument enables developers to specify the desired maximum pause time for garbage collection in milliseconds. By setting an appropriate value, developers can regulate the duration of GC pauses, ensuring they stay within acceptable limits. -XX:GCTimeRatio: This argument sets the ratio of garbage collection time to application time using the formula 1 / (1 + N), where N is a positive integer value. The purpose of this parameter is to define the desired allocation of time for garbage collection compared to application execution time, for optimizing the GC throughput. For example, let’s consider the scenario where -XX:GCTimeRatio=19. Using the formula, the goal is to allocate 1/20th or 5% of the total time to garbage collection. This means that for every 20 units of time (e.g., milliseconds) of combined garbage collection and application execution, approximately 1 unit of time will be allocated to garbage collection, while the remaining 19 units will be dedicated to application execution. The default value is 99, which sets a goal of 1% of the time for garbage collection. -XX:GCTimePercentage: This argument allows developers to directly specify the desired percentage of time allocated to garbage collection in relation to application execution time (i.e., GC Throughput). For instance, setting ‘-XX:GCTimePercentage=5’ represents a goal of allocating 5% of the total time to garbage collection, with the remaining 95% dedicated to application execution. Note: Developers can choose to use either ‘-XX:GCTimeRatio‘ or ‘-XX:GCTimePercentage‘ as alternatives to each other. Both options provide flexibility in expressing the desired allocation of time for garbage collection. I would prefer using ‘-XX:GCTimePercentage’ over ‘-XX:GCTimeRatio’ because of its ease of understanding. C. 
Miscellaneous Parameters In addition to the previously discussed JVM arguments, there are a few other parameters that can be useful for tuning the Parallel GC algorithm. Let’s explore them. -XX:ParallelGCThreads: This argument allows developers to specify the number of threads used for garbage collection in the Parallel GC algorithm. By setting an appropriate value based on the available CPU cores, developers can optimize throughput by leveraging the processing power of multi-core systems. It’s important to strike a balance by avoiding too few or too many threads, as both scenarios can lead to suboptimal performance. -XX:-UseAdaptiveSizePolicy: By default, the ‘UseAdaptiveSizePolicy’ option is enabled, which allows for dynamic resizing of the young and old generations based on the application’s behavior and memory demands. However, this dynamic resizing can lead to frequent “Full GC – Ergonomics” garbage collections and increased GC pause times. To mitigate this, we can pass the -XX:-UseAdaptiveSizePolicy argument to disable the resizing and reduce GC pause times. Here is a real-world example and discussion around this JVM argument. Tuning Parallel GC Behavior Studying the performance characteristics of Parallel GC is best achieved by analyzing the GC log. The GC log contains detailed information about garbage collection events, memory usage, and other relevant metrics. There are several tools available that can assist in analyzing the GC log, such as GCeasy, IBM GC and Memory Visualizer, HP Jmeter, and Google Garbage Cat. By using these tools, you can visualize memory allocation patterns, identify potential bottlenecks, and assess the efficiency of garbage collection. This allows for informed decision-making when fine-tuning Parallel GC for optimal performance. Conclusion In conclusion, optimizing the Parallel GC algorithm through fine-tuning JVM arguments and studying its behavior enables developers to achieve efficient garbage collection and improved performance in Java applications. By adjusting parameters such as heap size, generation sizes, and goal-based tuning parameters, developers can optimize the garbage collection process. Continuous monitoring and adjustment based on specific requirements are essential for maintaining optimal performance. With optimized Parallel GC tuning, developers can maximize memory management, minimize GC pauses, and unlock the full potential of their Java applications.
As a major part of a company's data asset, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine, where you can extract information that points to business growth. Logs are the sequential records of events in the computer system. If you think about how logs are generated and used, you will know what an ideal log analysis system should look like: It should have schema-free support. Raw logs are unstructured free texts and basically impossible for aggregation and calculation, so you needed to turn them into structured tables (the process is called "ETL") before putting them into a database or data warehouse for analysis. If there was a schema change, lots of complicated adjustments needed to be put into ETL and the structured tables. Therefore, semi-structured logs, mostly in JSON format, emerged. You can add or delete fields in these logs, and the log storage system will adjust the schema accordingly. It should be low-cost. Logs are huge, and they are generated continuously. A fairly big company produces 10~100 TBs of log data. For business or compliance reasons, it should keep the logs around for half a year or longer. That means storing a log size measured in PB, so the cost is considerable. It should be capable of real-time processing. Logs should be written in real-time, otherwise, engineers won't be able to catch the latest events in troubleshooting and security tracking. Plus, a good log system should provide full-text searching capabilities and respond to interactive queries quickly. The Elasticsearch-Based Log Analysis Solution A popular log processing solution within the data industry is the ELK stack: Elasticsearch, Logstash, and Kibana. The pipeline can be split into five modules: Log collection: Filebeat collects local log files and writes them to a Kafka message queue. Log transmission: Kafka message queue gathers and caches logs. Log transfer: Logstash filters and transfers log data in Kafka. Log storage: Logstash writes logs in JSON format into Elasticsearch for storage. Log query: Users search for logs via Kibana visualization or send a query request via Elasticsearch DSL API. The ELK stack has outstanding real-time processing capabilities, but frictions exist. Inadequate Schema-Free Support The Index Mapping in Elasticsearch defines the table scheme, which includes the field names, data types, and whether to enable index creation. Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds fields to the Mapping according to the input JSON data. This provides some sort of schema-free support, but it's not enough because: Dynamic Mapping often creates too many fields when processing dirty data, which interrupts the whole system. The data type of fields is immutable. To ensure compatibility, users often configure "text" as the data type, but that results in much slower query performance than binary data types such as integer. The index of fields is immutable, too. Users cannot add or delete indexes for a certain field, so they often create indexes for all fields to facilitate data filtering in queries. But too many indexes require extra storage space and slow down data ingestion. 
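To see the Dynamic Mapping behavior described above in action, here is a small, hedged sketch using the elasticsearch Python client (8.x assumed); the endpoint and index name are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Index a semi-structured log without declaring any mapping up front
es.index(index="app-logs", document={
    "ts": "2023-09-01T12:00:00Z",
    "clientip": "8.8.8.8",
    "status": 404,
    "request": "GET /images/logo.png",
})

# Dynamic Mapping has silently created the fields and picked their types
print(es.indices.get_mapping(index="app-logs"))

Once a type has been inferred for a field, it cannot be changed without reindexing, which is exactly the rigidity (and the dirty-data risk) described above.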
Inadequate Analytic Capability
Elasticsearch has its own unique Domain Specific Language (DSL), which is very different from the tech stack that most data engineers and analysts are familiar with, so there is a steep learning curve. Moreover, Elasticsearch has a relatively closed ecosystem, so there might be strong resistance to integration with BI tools. Most importantly, Elasticsearch only supports single-table analysis and lags behind modern OLAP demands for multi-table joins, sub-queries, and views.
High Cost and Low Stability
Elasticsearch users have been complaining about the computation and storage costs. The root reason lies in the way Elasticsearch works.
Computation cost: In data writing, Elasticsearch also executes compute-intensive operations, including inverted index creation, tokenization, and inverted index ranking. Under these circumstances, data is written into Elasticsearch at a speed of around 2 MB/s per core. When CPU resources are tight, data writing requests often get rejected during peak times, which further leads to higher latency.
Storage cost: To speed up retrieval, Elasticsearch stores the forward indexes, inverted indexes, and docvalues of the original data, consuming a lot more storage space. The compression ratio of a single data copy is only 1.5:1, compared to 5:1 in most log solutions.
As data and cluster size grow, maintaining stability can be another issue:
During data writing peaks: Clusters are prone to overload during data writing peaks.
During queries: Since all queries are processed in memory, big queries can easily lead to a JVM OOM.
Slow recovery: After a cluster failure, Elasticsearch has to reload its indexes, which is resource-intensive, so it can take many minutes to recover. That challenges the service availability guarantee.
A More Cost-Effective Option
Reflecting on the strengths and limitations of the Elasticsearch-based solution, the Apache Doris developers have optimized Apache Doris for log processing.
Increase writing throughput: The performance of Elasticsearch is bottlenecked by data parsing and inverted index creation, so we improved Apache Doris on these fronts: we sped up data parsing and index creation with SIMD and CPU vector instructions, and we removed data structures unnecessary for log analysis scenarios, such as forward indexes, to simplify index creation.
Reduce storage costs: We removed forward indexes, which represented 30% of index data. We adopted columnar storage and the ZSTD compression algorithm and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part of the historical logs are rarely accessed, we introduced tiered storage to separate hot and cold data. Logs that are older than a specified time period are moved to object storage, which is much less expensive. This can reduce storage costs by around 70%.
Benchmark tests with ES Rally, the official testing tool for Elasticsearch, showed that Apache Doris was around five times as fast as Elasticsearch in data writing and 2.3 times as fast in queries, and it consumed only 1/5 of the storage space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a writing speed of 550 MB/s and a compression ratio of 10:1. The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:
Ingestion: Apache Doris supports various ingestion methods for log data.
You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage into Doris via Routine Load and S3 Load.
Analysis: You can put log data in Doris and conduct join queries across logs and other data in the data warehouse.
Application: Apache Doris is compatible with the MySQL protocol, so you can integrate a wide variety of data analytics tools and clients with Doris, such as Grafana and Tableau. You can also connect applications to Doris via the JDBC and ODBC APIs. We are planning to build a Kibana-like system to visualize logs.
Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.
Native Support for Semi-Structured Data
Firstly, we worked on the data types. We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by four times. We also added new data types for complicated data: Array and Map. They can structuralize concatenated strings to allow for higher compression rates and faster queries.
Secondly, Apache Doris supports schema evolution. This means you can adjust the schema as your business changes: you can add or delete fields and indexes and change the data types of fields. Apache Doris provides Light Schema Change capabilities so that you can add or delete fields within milliseconds:

-- Add a column. Result will be returned in milliseconds.
ALTER TABLE lineitem ADD COLUMN l_new_column INT;

You can also add an index only for your target fields so that you can avoid the overhead of unnecessary index creation. After you add an index, by default, the system will generate the index for all incremental data, and you can specify which historical data partitions need the index.

-- Add an inverted index. Doris will generate an inverted index for all new data afterward.
ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;

-- Build the index for the specified historical data partitions.
BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);

SQL-Based Analytic Engine
The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table joins, sub-queries, UDFs, logical views, and materialized views to serve their own needs. With MySQL compatibility, Apache Doris can be integrated with most GUI and BI tools in the big data ecosystem so that users can realize more complex and diversified data analysis.
Performance in Use Cases
A gaming company transitioned from the ELK stack to the Apache Doris solution; their Doris-based log system used 1/6 of the storage space they previously needed. A cybersecurity company built their log analysis system utilizing the inverted index in Apache Doris; they supported a data writing speed of 300,000 rows per second with 1/5 of the server resources they formerly used.
Hands-On Guide
Now let's go through the three steps of building a log analysis system with Apache Doris. Before you start, download Apache Doris 2.0 or a newer version from the website and deploy a cluster.
Step 1: Create Tables
This is an example of table creation.
Explanations for the configurations:
The DATETIMEV2 time field is specified as the Key in order to speed up queries for the latest N log records.
Indexes are created for the frequently accessed fields, and fields that require full-text search are specified with Parser parameters.
PARTITION BY RANGE means to partition the data by RANGE based on time fields; Dynamic Partition is enabled for auto-management.
DISTRIBUTED BY RANDOM BUCKETS AUTO means to distribute the data into buckets randomly, and the system will automatically decide the number of buckets based on the cluster size and data volume.
log_policy_1day and log_s3 mean to move logs older than one day to S3 storage.

CREATE DATABASE log_db;
USE log_db;

CREATE RESOURCE "log_s3"
PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "your_endpoint_url",
    "s3.region" = "your_region",
    "s3.bucket" = "your_bucket",
    "s3.root.path" = "your_path",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

CREATE STORAGE POLICY log_policy_1day
PROPERTIES(
    "storage_resource" = "log_s3",
    "cooldown_ttl" = "86400"
);

CREATE TABLE log_table
(
    `ts` DATETIMEV2,
    `clientip` VARCHAR(20),
    `request` TEXT,
    `status` INT,
    `size` INT,
    INDEX idx_size (`size`) USING INVERTED,
    INDEX idx_status (`status`) USING INVERTED,
    INDEX idx_clientip (`clientip`) USING INVERTED,
    INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
PARTITION BY RANGE(`ts`) ()
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
    "replication_num" = "1",
    "storage_policy" = "log_policy_1day",
    "deprecated_dynamic_schema" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "7",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "AUTO",
    "dynamic_partition.replication_num" = "1"
);

Step 2: Ingest the Logs
Apache Doris supports various ingestion methods. For real-time logs, we recommend the following three methods:
Pull logs from a Kafka message queue: Routine Load
Logstash: write logs into Doris via HTTP API
Self-defined writing program: write logs into Doris via HTTP API
Ingest from Kafka
For JSON logs that are written into Kafka message queues, create a Routine Load so Doris will pull data from Kafka. The following is an example. The property.* configurations are optional:

-- Prepare the Kafka cluster and topic ("log_topic")
-- Create Routine Load, load data from Kafka log_topic to "log_table"
CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
COLUMNS(ts, clientip, request, status, size)
PROPERTIES (
    "max_batch_interval" = "10",
    "max_batch_rows" = "1000000",
    "max_batch_size" = "109715200",
    "strict_mode" = "false",
    "format" = "json"
)
FROM KAFKA (
    "kafka_broker_list" = "host:port",
    "kafka_topic" = "log_topic",
    "property.group.id" = "your_group_id",
    "property.security.protocol"="SASL_PLAINTEXT",
    "property.sasl.mechanism"="GSSAPI",
    "property.sasl.kerberos.service.name"="kafka",
    "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",
    "property.sasl.kerberos.principal"="xxx@yyy.com"
);

You can check how the Routine Load runs via the SHOW ROUTINE LOAD command.
Ingest via Logstash
Configure HTTP Output for Logstash, and then data will be sent to Doris via HTTP Stream Load.
1. Specify the batch size and batch delay in logstash.yml to improve data writing performance.
pipeline.batch.size: 100000
pipeline.batch.delay: 10000
2. Add HTTP Output to the log collection configuration file testlog.conf, setting url => to the Stream Load address in Doris.
Since Logstash does not support HTTP redirection, you should use a backend address instead of a FE address. Authorization in the headers is HTTP basic auth; it is computed with echo -n 'username:password' | base64. The load_to_single_tablet header can reduce the number of small files in data ingestion.

output {
    http {
        follow_redirects => true
        keepalive => false
        http_method => "put"
        url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
        headers => [
            "format", "json",
            "strip_outer_array", "true",
            "load_to_single_tablet", "true",
            "Authorization", "Basic cm9vdDo=",
            "Expect", "100-continue"
        ]
        format => "json_batch"
    }
}

Ingest via a Self-Defined Program
This is an example of ingesting data into Doris via HTTP Stream Load. Notes:
Use basic auth for HTTP authorization; it is computed with echo -n 'username:password' | base64
HTTP header "format:json": the data type is specified as JSON
HTTP header "read_json_by_line:true": each line is a JSON record
HTTP header "load_to_single_tablet:true": write to one tablet each time
For the data writing clients, we recommend a batch size of 100 MB~1 GB. Future versions will enable Group Commit at the server end and reduce the batch size required from clients.

curl \
--location-trusted \
-u username:password \
-H "format:json" \
-H "read_json_by_line:true" \
-H "load_to_single_tablet:true" \
-T logfile.json \
http://fe_host:fe_http_port/api/log_db/log_table/_stream_load

Step 3: Execute Queries
Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.

mysql -h fe_host -P fe_mysql_port -u root -Dlog_db

A few common queries in log analysis:
Check the latest ten records.
SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
Check the latest ten records of client IP 8.8.8.8.
SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
Retrieve the latest ten records with error or 404 in the "request" field. MATCH_ANY is a SQL syntax keyword for full-text search in Doris; it means to find the records that include any one of the specified keywords.
SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;
Retrieve the latest ten records with image and faq in the "request" field. MATCH_ALL is also a SQL syntax keyword for full-text search in Doris; it means to find the records that include all of the specified keywords.
SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;
Conclusion
If you are looking for an efficient log analytics solution, Apache Doris is friendly to anyone equipped with SQL knowledge; if you find friction with the ELK stack, try Apache Doris: it provides better schema-free support, enables faster data writing and queries, and brings a much smaller storage burden. But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to add more complicated data types to the inverted index and support the BKD index to make Apache Doris a fit for geo-data analysis. We also plan to expand capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any user feedback and development advice.
In the era of big data, efficient data management and query performance are critical for organizations that want to get the best operational performance from their data investments. Snowflake, a cloud-based data platform, has gained immense popularity for providing enterprises with an efficient way of handling big data tables and reducing complexity in data environments. Big data tables are characterized by their immense size, constantly increasing data sets, and the challenges that come with managing and analyzing vast volumes of information. With data pouring in at high volume from various sources in diverse formats, ensuring data reliability and quality is increasingly challenging but also critical. Extracting valuable insights from this diverse and dynamic data necessitates scalable infrastructure, powerful analytics tools, and a vigilant focus on security and privacy. Despite the complexities, big data tables offer immense potential for informed decision-making and innovation, making it essential for organizations to understand and address the unique characteristics of these data repositories to harness their full capabilities effectively. To achieve optimal performance, Snowflake leverages several essential concepts that are instrumental in handling and processing big data efficiently. One is data pruning, which plays a vital role by eliminating irrelevant data during query execution, leading to faster response times by reducing the amount of data that is scanned. Simultaneously, Snowflake's micro-partitions, small immutable segments typically 16 MB in size, allow for seamless scalability and efficient distribution across nodes. Micro-partitioning is an important differentiator for Snowflake. This innovative technique combines the advantages of static partitioning while avoiding its limitations, resulting in additional significant benefits. The beauty of Snowflake's architecture lies in its scalable, multi-cluster virtual warehouse technology, which automates the maintenance of micro-partitions. This process ensures efficient and automatic execution of re-clustering in the background, eliminating the need for manual creation, sizing, or resizing of virtual warehouses. The compute service actively monitors the clustering quality of all registered clustered tables and systematically performs clustering on the least clustered micro-partitions until reaching an optimal clustering depth. This seamless process optimizes data storage and retrieval, enhancing overall performance and user experience. How Micro-Partitioning Improves Data Storage and Processing This design enhances data storage and processing efficiency, further improving query performance. Additionally, Snowflake's clustering feature enables users to define clustering keys, arranging data within micro-partitions based on similarities. By colocating data with similar values for clustering keys, Snowflake minimizes data scans during queries, resulting in optimized performance. Together, these key concepts empower Snowflake to deliver unparalleled efficiency and performance in managing big data workloads. Inadequate table layouts can result in long-running queries, increased costs due to higher data scans, and diminished overall performance. It is crucial to tackle this challenge to fully harness the capabilities of Snowflake and maximize its potential. 
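To make the clustering discussion concrete before moving on, here is a minimal, hedged sketch using the snowflake-connector-python package; the connection parameters, table, and columns are placeholders, while ALTER TABLE ... CLUSTER BY and SYSTEM$CLUSTERING_INFORMATION are the standard Snowflake statements for defining and inspecting a clustering key:

import snowflake.connector

# Connection parameters are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Align the clustering key with the most commonly filtered columns
cur.execute("ALTER TABLE orders CLUSTER BY (region, order_date)")

# Inspect clustering quality (overlap and depth) for those columns
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(region, order_date)')")
print(cur.fetchone()[0])

cur.close()
conn.close()

How to choose those columns from real consumption workloads is exactly what the analysis described next is for.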
One major challenge in big data table management is the data ingestion team's lack of awareness regarding consumption workloads, leading to various issues that negatively impact system performance and cost-effectiveness. Long-running queries are a significant consequence, causing delays in delivering critical insights, especially in time-sensitive applications where real-time data analysis is vital for decision-making. Moreover, the team's unawareness can lead to increased operational costs as inefficient table layouts consume more computational resources and storage, straining the organization's budget over time. List of frequently accessed tables Optimize Snowflake Performance The first step in optimizing Snowflake's performance is to analyze consumption workloads thoroughly. Acceldata’s Data Observability Cloud (ADOC) platform analyzes such historical workloads and provides table-level insights at the size, access, partitioning, and clustering level. Stats for top frequently accessed tables Understanding the queries executed most frequently and the filtering patterns applied can provide valuable insights. Focus on tables that are large and frequently accessed, as they have the most significant impact on overall performance. Most filtered columns for a table ADOC’s advanced query parsing technology has the ability to detect the columns that are accessed via WHERE or JOIN clauses. Utilize visualizations and analytics tools to identify which columns are accessed and filtered most frequently. Micro-partitioning and clustering view for a column+table ADOC also fetches CLUSTERING_INFORMATION via the Snowflake table system functions and shows the table clustering metadata in a simple and easily interpretable visualization. This information can guide the decision-making process for optimizing the table layout. Snowflake visual table clustering explorer Understand the extent of overlap and depth for filtered columns. This information is crucial for making informed decisions when defining clustering keys. The ultimate goal is to match clustering keys with the most commonly filtered columns. This alignment ensures that relevant data is clustered together, reducing data scans and improving query performance. Snowflake's prowess in managing big data tables is unparalleled, but to fully reap its benefits, optimizing performance through data pruning and clustering is essential. The collaboration between the data ingestion team and the teams using the data is vital to ensure the best possible layout for tables. By understanding consumption workloads and matching clustering keys with filtered columns, organizations can achieve efficient queries, reduce costs, and make the most of Snowflake's capabilities in handling big data efficiently.
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Greg Leffler
Observability Practitioner, Director,
Splunk
Ted Young
Director of Open Source Development,
LightStep
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere