Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
Data Persistence
At the core of every modern application is an endless, diverse stream of data and, with it, an inherent demand for scalability, increased speed, higher performance, and strengthened security. Although data management tools and strategies have matured rapidly in recent years, the complexity of architectural and implementation choices has intensified as well, creating unique challenges — and opportunities — for those who are designing data-intensive applications. DZone’s 2021 Data Persistence Trend Report examines the current state of the industry, with a specific focus on effective tools and strategies for data storage and persistence. Featured in this report are observations and analyses of survey results from our research, as well as an interview with industry leader Jenny Tsai-Smith. Readers will also find contributor insights written by DZone community members, who cover topics ranging from microservice polyglot persistence scenarios to data storage solutions and the Materialized Path pattern. Read on to learn more!
Data streaming is one of the most relevant buzzwords in tech for building scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted top 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about decentralized data mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance. Some followers might notice that this has become a series, with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays. Gartner Top Strategic Technology Trends for 2023 The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023 (source: Gartner). It is funny (but not surprising): Gartner’s predictions overlap and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized, optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products. Hence, here you go with the top 5 trends in data streaming for 2023. The Top 5 Data Streaming Trends for 2023 I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe: cloud-native lakehouses, decentralized data mesh, data sharing in real time, improved developer and user experience, and advanced data governance and policy enforcement. The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios, no matter if you use open-source Apache Kafka, a commercial platform, or a fully managed cloud service like Confluent Cloud. Kafka as Data Fabric for Cloud-Native Lakehouses Many data platform vendors pitch the lakehouse vision today. It is the same story as the data lake in the Hadoop era, with a few new nuances: put all your data into a single data store to save the world and solve every problem and use case. In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics, but it is not the right choice for every problem. Besides technical challenges, new challenges emerged: data governance, compliance issues, data privacy, and so on. Applying a best-of-breed enterprise architecture for real-time and batch data analytics, using the right tool for each job, is a much more successful, flexible, and future-ready approach: data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs. Data streaming increasingly becomes the real-time data fabric between all the different data platforms and other business applications, leveraging the real-time Kappa architecture instead of the much more batch-focused Lambda architecture.
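To make the idea of a real-time data fabric a bit more concrete, here is a minimal sketch of publishing a change event with the plain Apache Kafka Java client. The broker address, topic name, key, and payload are illustrative, and a real deployment would add security and schema settings.

Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker address; a managed cluster would also need security settings.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a change event as soon as it happens; real-time, near real-time,
            // and batch consumers can all read it from the same topic at their own pace.
            producer.send(new ProducerRecord<>("payments", "order-4711", "{\"amount\": 99.95}"));
            producer.flush();
        }
    }
}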
Decentralized Data Mesh With Valuable Data Products Focusing on business value by building data products in independent domains with various technologies is key to success in today's agile world with ever-changing requirements and challenges. Data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices. Vendors make two main proposals for building a data mesh: data integration with data streaming enables fully decentralized business products, while data virtualization provides centralized queries. Centralized queries are simple but do not provide a clean architecture with decoupled domains and applications. This approach might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures. Collaboration Within and Across Organizations in Real Time Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enables many innovative business models. The difference between data streaming and a database, data warehouse, or data lake is crucial: all these platforms enable data sharing at rest. The data is stored on a disk before it is replicated and shared within the organization or with partners. This is not real time, and you cannot connect a real-time consumer to data at rest. However, real-time data beats slow data. Hence, sharing data in real time with data streaming platforms like Apache Kafka or Confluent Cloud delivers accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models. AsyncAPI Spec for Apache Kafka API Schemas AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include the specification of API contracts (= schemas in the data streaming world), documentation of APIs, code generation for many programming languages, data governance, and much more. Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters. We don't know yet where the market is going. Will AsyncAPI become for data streaming what OpenAPI is for REST APIs? Maybe. I see increasing demand for this specification from customers. Let's review the status of AsyncAPI in a few quarters or years. But it has the potential. Improved Developer Experience With Low-Code/No-Code Tools for Apache Kafka Many analysts and vendors pitch low-code/no-code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDE or cloud applications. The significant benefit is time to market for developing applications and easier maintenance. At least in theory. These tools support various personas like developers, citizen integrators, and data scientists. At least in theory. The reality is that code is king, development is about evolution, and open platforms win. Low-code/no-code is great for some scenarios and personas, but it is just one option of many. Let's look at a few alternatives for building Kafka-native applications: these Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together.
Then deploy the data pipeline into a scalable, reliable, and fully managed streaming application. The difference from separate tools like Apache NiFi is that the generated code runs in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient. However, the simpler a tool is, the less flexible it is. It is that simple, no matter which product or vendor you look at, and it is not just true for Kafka-native tools. And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic. Domain-Driven Design With Dumb Pipes and Smart Endpoints The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware. Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled domain-driven design with dumb pipes and smart endpoints: data streaming with Apache Kafka is perfect for domain-driven design (DDD). By contrast, the point-to-point HTTP/REST web services often used in microservice architectures, or push-based message brokers like RabbitMQ, create much stronger dependencies between applications. Data Governance Across the Data Streaming Pipeline An enterprise architecture powered by data streaming enables easy access to data in real time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks. The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: unlock the data to enable innovation versus lock up the data to keep it safe. Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security. Policy Enforcement With Schemas and API Contracts The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including the data producer, server, and consumer. Additional data governance tools like data lineage, catalogs, or policy enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schemas from the beginning. They may seem unnecessary for the first pipeline, but the following producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products. Slides and Video for Data Streaming Use Cases in 2023 Here is the slide deck from my presentation, and here is the free on-demand video recording. Data Streaming Goes Up the Maturity Curve in 2023 It is still an early stage for data streaming in most enterprises. But the discussion goes beyond questions like "when to use Kafka?" or "which cloud service to use?" In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects. The new trends are often related to each other.
A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream: often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons. Scalability and elasticity need to be there out of the box. Fully managed data streaming is a brilliant opportunity for getting started in 2023 and moving up the maturity curve from single projects to a central nervous system of real-time data. What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What is your strategy and timeline?
Data engineering is the practice of managing large amounts of data efficiently, from storing and processing to analyzing and visualizing. Therefore, data engineers must be well-versed in data structures and algorithms that can help them manage and manipulate data efficiently. This article will explore some of the most important data structures and algorithms that data engineers should be familiar with, including their uses and advantages. Data Structures Relational Databases Relational databases are one of the most common data structures used by data engineers. A relational database consists of a set of tables with defined relationships between them. These tables are used to store structured data, such as customer information, sales data, and product inventory. Relational databases are typically used in transactional systems like e-commerce platforms or banking applications. They are highly scalable, provide data consistency and reliability, and support complex queries. NoSQL Databases NoSQL databases are a type of non-relational database used to store and manage unstructured or semi-structured data. Unlike relational databases, NoSQL databases do not use tables or relationships. Instead, they store data using documents, graphs, or key-value pairs. NoSQL databases are highly scalable and flexible, making them ideal for handling large volumes of unstructured data, such as social media feeds, sensor data, or log files. They are also highly resilient to failures, provide high performance, and are easy to maintain. Data Warehouses Data warehouses are specialized databases designed for storing and processing large amounts of data from multiple sources. Data warehouses are typically used for data analytics and reporting and can help streamline and optimize data processing workflows. Data warehouses are highly scalable, support complex queries, and perform well. They are also highly reliable and support data consolidation and normalization. Distributed File Systems Distributed file systems such as Hadoop Distributed File System (HDFS) are used to store and manage large volumes of data across multiple machines. In addition, these highly scalable file systems provide fault tolerance and support batch processing. Distributed file systems are used to store and process large volumes of unstructured data, such as log files or sensor data. They are also highly resilient to failures and support parallel processing, making them ideal for big data processing. Message Queues Message queues are used to manage the data flow between different components of a data processing pipeline. They help to decouple different parts of the system, improve scalability and fault tolerance, and support asynchronous communication. Message queues are used to implement distributed systems, such as microservices or event-driven architectures. They are highly scalable, support high throughput, and provide resilience to system failures. Algorithms Sorting Algorithms Sorting algorithms are used to arrange data in a specific order. Sorting is an essential operation in data engineering as it can significantly improve the performance of various operations such as search, merge, and join. Sorting algorithms can be classified into two categories: comparison-based sorting algorithms and non-comparison-based sorting algorithms. Comparison-based sorting algorithms such as bubble sort, insertion sort, quicksort, and mergesort compare elements in the data to determine the order. 
Efficient comparison-based algorithms such as quicksort and mergesort have an average time complexity of O(n log n), while simpler ones such as bubble sort and insertion sort are O(n^2) in the average and worst cases (quicksort also degrades to O(n^2) in its worst case). Non-comparison-based sorting algorithms such as counting sort, radix sort, and bucket sort do not compare elements to determine the order. As a result, these algorithms can achieve close to O(n) time complexity under suitable assumptions about the input, such as a bounded key range. Sorting algorithms are used in various data engineering tasks, such as data preprocessing, data cleaning, and data analysis. Searching Algorithms Searching algorithms are used to find specific elements in a dataset. Searching algorithms are essential in data engineering as they enable efficient retrieval of data from large datasets. Searching algorithms can be classified into two categories: linear search and binary search. Linear search is a simple algorithm that checks each element in a dataset until the target element is found. Linear search has a time complexity of O(n) in the worst case. Binary search is a more efficient algorithm that works on sorted datasets. Binary search divides the dataset in half at each step and compares the middle element to the target element. Binary search has a time complexity of O(log n) in the worst case. Searching algorithms are used in various data engineering tasks such as data retrieval, data querying, and data analysis. Hashing Algorithms Hashing algorithms are used to map data of arbitrary size to fixed-size values. Hashing algorithms are essential in data engineering as they enable efficient data storage and retrieval. Hashing algorithms can be classified into two categories: cryptographic hashing and non-cryptographic hashing. Cryptographic hashing algorithms such as SHA-256 (and the older, now-deprecated MD5) are used for secure data storage and transmission. These algorithms produce a fixed-size hash value that is, in practice, unique to the input data and cannot feasibly be reversed to obtain the original input data. Non-cryptographic hashing algorithms such as MurmurHash and CityHash are used for efficient data storage and retrieval. These algorithms produce a fixed-size hash value that is based on the input data. The hash value can be used to quickly search for the input data in a large dataset. Hashing algorithms are used in various data engineering tasks such as data storage, data retrieval, and data analysis. Graph Algorithms Graph algorithms are used to analyze data that can be represented as a graph. Graphs are used to represent relationships between data elements such as social networks, web pages, and molecules. Graph algorithms can be classified into two categories: traversal algorithms and pathfinding algorithms. Traversal algorithms such as breadth-first search (BFS) and depth-first search (DFS) are used to visit all the nodes in a graph. Traversal algorithms can be used to find connected components, detect cycles, and perform topological sorting. Pathfinding algorithms such as Dijkstra's algorithm and the A* algorithm are used to find the shortest path between two nodes in a graph. For example, pathfinding algorithms can be used to find the shortest path in a road network, find the optimal route for a delivery truck, and find the most efficient path for a robot. Data structures and algorithms are essential tools for data engineers, enabling them to build scalable, efficient, and optimized solutions for managing and processing large datasets.
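To make the gap between linear and binary search above a little more concrete, here is a short, self-contained sketch in Java. The array of IDs is purely illustrative, and the standard library's Arrays.binarySearch gives the same result as the hand-written version.

Java
import java.util.Arrays;

public class SearchDemo {
    // Binary search on a sorted array: O(log n) comparisons in the worst case.
    static int binarySearch(int[] sorted, int target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;       // unsigned shift avoids overflow on large indexes
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;                           // target not found
    }

    public static void main(String[] args) {
        int[] ids = {3, 8, 15, 23, 42, 57, 91};           // must be sorted for binary search
        System.out.println(binarySearch(ids, 42));        // prints 4
        System.out.println(Arrays.binarySearch(ids, 42)); // library equivalent, also prints 4
    }
}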
A data warehouse was defined by Bill Inmon as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions" over 30 years ago. However, the initial data warehouses were unable to store massive heterogeneous data, hence the creation of data lakes. More recently, the data lakehouse has emerged as a new paradigm. It is an open data management architecture characterized by strong data analytics and governance capabilities, high flexibility, and open storage. If I could only use one word to describe the next-gen data lakehouse, it would be unification: unified data storage to avoid the trouble and risks brought by redundant storage and cross-system ETL; unified governance of both data and metadata with support for ACID, Schema Evolution, and Snapshot; and unified data application that supports data access via a single interface for multiple engines and workloads. Let's look into the architecture of a data lakehouse. We will find that it is not only supported by table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, but, more importantly, it is powered by a high-performance query engine to extract value from data. Users are looking for a query engine that allows quick and smooth access to the most popular data sources. What they don't want is for their data to be locked in a certain database and rendered unavailable for other engines, or to spend extra time and computing costs on data transfer and format conversion. To turn these visions into reality, a data query engine needs to figure out the following questions: How to access more data sources and acquire metadata more easily? How to improve query performance on data coming from various sources? How to enable more flexible resource scheduling and workload management? Apache Doris provides a possible answer to these questions. It is a real-time OLAP database that aspires to build itself into a unified data analysis gateway. This means it needs to be easily connected to various RDBMS, data warehouses, and data lake engines (such as Hive, Iceberg, Hudi, Delta Lake, and Flink Table Store) and allow for quick data ingestion from and queries on these heterogeneous data sources. The rest of this article is an in-depth explanation of Apache Doris' techniques in the above three aspects: metadata acquisition, query performance optimization, and resource scheduling. Metadata Acquisition and Data Access Apache Doris 1.2.2 supports a wide variety of data lake formats and data access from various external data sources. In addition, via the Table Value Function, users can analyze files in object storage or HDFS directly. To support multiple data sources, Apache Doris puts effort into metadata acquisition and data access. Metadata Acquisition Metadata consists of information about the databases, tables, partitions, indexes, and files from the data source. Metadata from various data sources comes in different formats and patterns, which adds to the difficulty of metadata connection. An ideal metadata acquisition service should include the following: A metadata structure that can accommodate heterogeneous metadata. An extensible metadata connection framework that enables quick and low-cost data connection. Reliable and efficient metadata access that supports real-time metadata capture. Custom authorization services to interface with external privilege management systems and thus reduce migration costs. Metadata Structure Older versions of Doris supported a two-tiered metadata structure: database and table.
As a result, users need to create mappings for external databases and tables one by one, which is heavy work. Thus, Apache Doris 1.2.0 introduced the Multi-Catalog functionality. With this, you can map to external data at the catalog level, which means: You can map to the whole external data source and ingest all metadata from it. You can manage the properties of the specified data source at the catalog level, such as connection, privileges, and data ingestion details, and easily handle multiple data sources. Data in Doris falls into two types of catalogs: Internal Catalog: Existing Doris databases and tables all belong to the Internal Catalog. External Catalog: This is used to interface with external data sources. For example, HMS External Catalog can be connected to a cluster managed by Hive Metastore, and Iceberg External Catalog can be connected to an Iceberg cluster. You can use the SWITCH statement to switch catalogs. You can also conduct federated queries using fully qualified names. For example: SELECT * FROM hive.db1.tbl1 a JOIN iceberg.db2.tbl2 b ON a.k1 = b.k1; Extensible Metadata Connection Framework The introduction of the catalog level also enables users to add new data sources simply by using the CREATE CATALOG statement: CREATE CATALOG hive PROPERTIES ( 'type'='hms', 'hive.metastore.uris' = 'thrift://172.21.0.1:7004' ); In data lake scenarios, Apache Doris currently supports the following metadata services: Hive Metastore-compatible metadata services, Alibaba Cloud Data Lake Formation, and AWS Glue. This also paves the way for developers who want to connect to more data sources via External Catalog. All they need to do is implement the access interface. Efficient Metadata Access Access to external data sources is often hindered by network conditions and data resources. This requires extra effort from a data query engine to guarantee reliable, stable, and timely metadata access. Doris enables high efficiency in metadata access through its Meta Cache, which includes Schema Cache, Partition Cache, and File Cache. This means that Doris can respond to metadata queries on thousands of tables in milliseconds. In addition, Doris supports manual refresh of metadata at the Catalog/Database/Table level. Meanwhile, it enables auto synchronization of metadata in Hive Metastore by monitoring Hive Metastore events, so any changes can be updated within seconds. Custom Authorization External data sources usually come with their own privilege management services. Many companies use one single tool (such as Apache Ranger) to provide authorization for their multiple data systems. Doris supports a custom authorization plugin, which can be connected to the user's own privilege management system via the Doris Access Controller interface. As a user, you only need to specify the authorization plugin for a newly created catalog, and then you can readily perform authorization, audit, and data encryption on external data in Doris. Data Access Doris supports data access to external storage systems, including HDFS and S3-compatible object storage. Query Performance Optimization After clearing the way for external data access, the next step for a query engine is to accelerate data queries. In the case of Apache Doris, efforts are made in data reading, the execution engine, and the optimizer. Data Reading Reading data on remote storage systems is often bottlenecked by access latency, concurrency, and I/O bandwidth, so reducing read frequency is a better choice.
Native File Format Reader Improving data reading efficiency entails optimizing the reading of Parquet files and ORC files, which are the most commonly seen data files. Doris has refactored its File Reader, which is fine-tuned for each data format. Take the Native Parquet Reader as an example: Reduce format conversion: It can directly convert files to the Doris storage format or to a format of higher performance using dictionary encoding. Smart indexing of finer granularity: It supports Page Index for Parquet files, so it can utilize Page-level smart indexing to filter Pages. Predicate pushdown and late materialization: It reads the columns with filters first and then reads the other columns only for the filtered rows. This remarkably reduces file read volume since it avoids reading irrelevant data. Lower read frequency: Building on the high throughput and low concurrency of remote storage, it combines multiple data reads into one in order to improve overall data reading efficiency. File Cache Doris caches files from remote storage in local high-performance disks as a way to reduce overhead and increase performance in data reading. In addition, it has developed two new features that make queries on remote files as quick as those on local files: Block cache: Doris supports the block cache of remote files and can automatically adjust the block size from 4KB to 4MB based on the read request. The block cache method reduces read/write amplification and read latency in cold caches. Consistent hashing for caching: Doris applies consistent hashing to manage cache locations and schedule data scanning. By doing so, it prevents cache failures brought about by the onlining and offlining of nodes. It can also increase the cache hit rate and query service stability. Execution Engine Developers surely don't want to rebuild all the general features for every new data source. Instead, they hope to reuse the vectorized execution engine and all operators in Doris in the data lakehouse scenario. Thus, Doris has refactored the scan nodes: Layer the logic: All data queries in Doris, including those on internal tables, use the same operators, such as Join, Sort, and Agg. The only difference between queries on internal and external data lies in data access. In Doris, anything above the scan nodes follows the same query logic, while below the scan nodes, the implementation classes will take care of access to different data sources. Use a general framework for scan operators: Even for the scan nodes, different data sources have a lot in common, such as task splitting logic, scheduling of sub-tasks and I/O, predicate pushdown, and Runtime Filter. Therefore, Doris uses interfaces to handle them. Then, it implements a unified scheduling logic for all sub-tasks. The scheduler is in charge of all scanning tasks in the node. With global information about the node in hand, the scheduler is able to do fine-grained management. Such a general framework makes it easy to connect a new data source to Doris, which will only take a week of work for one developer. Query Optimizer Doris supports a range of statistical information from various data sources, including Hive Metastore, Iceberg Metafile, and Hudi MetaTable. It has also refined its cost model inference based on the characteristics of different data sources to enhance its query planning capability. Performance We tested Doris and Presto/Trino on HDFS in flat table scenarios (ClickBench) and multi-table scenarios (TPC-H).
Here are the results: As is shown, with the same computing resources and on the same dataset, Apache Doris takes much less time to respond to SQL queries in both scenarios, delivering 3 to 10 times higher performance than Presto/Trino. Workload Management and Elastic Computing Querying external data sources requires no internal storage in Doris. This makes elastic stateless computing nodes possible. Apache Doris 2.0 is going to implement the Elastic Compute Node, which is dedicated to supporting query workloads on external data sources. Stateless computing nodes are open for quick scaling, so users can easily cope with query workloads during peaks and valleys and strike a balance between performance and cost. In addition, Doris has optimized itself for Kubernetes cluster management and node scheduling. Now Master nodes can automatically manage the onlining and offlining of Elastic Compute Nodes, so users can govern their cluster workloads in cloud-native and hybrid cloud scenarios without difficulty. Use Case Apache Doris has been adopted by a financial institution for risk management. The user's high demands for data timeliness made their data mart, built on Greenplum and CDH and only able to process data from one day ago, no longer a great fit. In 2022, they incorporated Apache Doris into their data production and application pipeline, which allowed them to perform federated queries across Elasticsearch, Greenplum, and Hive. A few highlights from the user's feedback include: Doris allows them to create one Hive Catalog that maps to tens of thousands of external Hive tables and to conduct fast queries on them. Doris makes it possible to perform real-time federated queries using the Elasticsearch Catalog and achieve a response time of mere milliseconds. Doris enables the decoupling of daily batch processing and statistical analysis, resulting in lower resource consumption and higher system stability.
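Because Doris exposes a MySQL-compatible protocol, a client application can run the kind of catalog-qualified federated query shown earlier over a standard JDBC connection. The sketch below is illustrative only: it assumes a Doris frontend listening on its MySQL-protocol query port (commonly 9030), a root user with an empty password, the hive and iceberg catalogs from the earlier example, and the MySQL JDBC driver on the classpath.

Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisFederatedQuery {
    public static void main(String[] args) throws Exception {
        // Doris speaks the MySQL protocol, so the standard MySQL JDBC driver can connect.
        // Host, port, and credentials are illustrative placeholders.
        String url = "jdbc:mysql://127.0.0.1:9030/";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement();
             // Fully qualified names span two catalogs in a single federated query.
             ResultSet rs = stmt.executeQuery(
                     "SELECT a.k1, b.k1 FROM hive.db1.tbl1 a "
                     + "JOIN iceberg.db2.tbl2 b ON a.k1 = b.k1 LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " | " + rs.getString(2));
            }
        }
    }
}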
Recently, I posted a tutorial on how I monitored my Raspberry Pi-based "pihole" server using New Relic and Flex. Like many tutorials, what you read is the end result: a narrative of the perfect execution of a well-conceived idea where all steps and variables are foreseen beforehand. This, my friends, is not how a day in IT typically goes. The truth is that I had a lot of false starts and a couple of sleepless nights trying to get everything put together just the way I wanted it. Aspects of both the pihole itself and New Relic didn't work the way I initially expected, and I had to find workarounds. In the end, if it weren't for the help of several colleagues - including Zameer Fouzan, Kav Pather, and Haihong Ren - I never would have gotten it all working. Before you call for your fainting couch and smelling salts, shocked as I know you are to hear me imply that New Relic isn't pure perfection and elegant execution, I want to be clear: While no tool is perfect or does all things for all people, the issues I ran into were completely normal and the ultimate solutions were both simple to understand and easy to execute. What I initially struggled with was trying to make New Relic operate based on my biases of how I thought things ought to work rather than understanding and accepting how they did work. The Problem in a Nutshell But I'm being unnecessarily vague. Let me get to the specifics: On the pihole, you can query the API for data like this: http://pi.hole/admin/api.php?summary And it will give you output like this: JSON { "domains_being_blocked": "177,888", "dns_queries_today": "41,240", "ads_blocked_today": "2,802", "ads_percentage_today": "6.8", "unique_domains": "8,001", "queries_forwarded": "18,912", "queries_cached": "19,266", "clients_ever_seen": "34", "unique_clients": "28", "dns_queries_all_types": "41,240", "reply_UNKNOWN": "258", "reply_NODATA": "1,155", "reply_NXDOMAIN": "11,989", "reply_CNAME": "12,296", "reply_IP": "15,436", "reply_DOMAIN": "48", "reply_RRNAME": "0", "reply_SERVFAIL": "2", "reply_REFUSED": "0", "reply_NOTIMP": "0", "reply_OTHER": "0", "reply_DNSSEC": "0", "reply_NONE": "0", "reply_BLOB": "56", "dns_queries_all_replies": "41,240", "privacy_level": "0", "status": "enabled", "gravity_last_updated": { "file_exists": true, "absolute": 1676309149, "relative": { "days": 4, "hours": 0, "minutes": 27 } } } While it may not be immediately obvious, let me draw your attention to the issue: "domains_being_blocked": "177,888" Being surrounded by quotes, New Relic will treat the number "177,888" as a string (text) rather than as a number. NRQL to the Rescue! My first attempt to fix this leveraged the obvious (and ultimately incomplete) approach of changing the type of input using a function. In this case, numeric() is purpose-built to do just that — take data that is "typed" as a string and treat it as a number. Easy-peasy, right? If you've worked in IT for more than 15 minutes, you know the answer is, "of course not." This technique only worked for numbers less than 1,000. The reason for this is that numeric() can't handle formatted numbers — meaning strings with symbols for currency, percentage, or — to my chagrin — commas. At that point, my colleague and fellow DevRel Advocate Zameer Fouzan came to the rescue. He helped me leverage one of the newer capabilities in NRQL — the ability to parse out sub-elements in a table.
The feature is named aparse(), which stands for "anchor parse." You can find more information about it here, but in brief, it lets you name a field, describe how you want to separate it, and then rename the separated parts. Like this: aparse(unique_domains ,'*,*' ) As (n1,n2) In plain English, this says, "take the data in the unique_domains field, put everything before the comma into one variable (called n1), and everything after the comma into another variable (called n2)." Now I have the two halves of my number, and I can recombine them: numeric(concat(n1,n2)) The result looks like this: Which might have been the end of my problems, except that the numbers sometimes go into the millions. A More FLEX-able Approach The penultimate step for resolving this issue was to take it back to the source (in this case, the New Relic Flex integration) to see if I couldn't reformat the numbers before sending them into New Relic. Which is absolutely possible. Within the Flex YAML file, there are a lot of possibilities for parsing, re-arranging, and reformatting the data prior to passing it into the data store. One of the most powerful of these is jq. You can find the New Relic documentation on jq here. But for a deeper dive into the utility itself, you should go to the source. I can't describe how jq works any better than the author: "A jq program is a "filter": it takes an input, and produces an output. There are a lot of builtin filters for extracting a particular field of an object, or converting a number to a string, or various other standard tasks." To be a little more specific, jq will take JSON input, and for every key that matches your search parameters, it will output the value of that key and reformat it in the process if you tell it to. Therefore, I could create the most basic search filter like this: jq > .domains_being_blocked ...and it would output "177,888". But that wouldn't solve my issue. HOWEVER, using additional filters, you can split the output by its comma, join the two parts back together, set the output as a number, and come out the other side with a beautiful (and correct) output set. But I don't want you to think this solution occurred to me all on my own or that I was able to slap it all together with minimal effort. This was all as new to me as it may be to you, and what you're reading below comes from the amazing and generous minds of Kav Pather (who basically invented the Flex integration) and Senior Solutions Architect Haihong Ren. Unwinding the fullness of the jq string below is far beyond the scope of this blog. But Kav and Haihong have helped me to understand enough to summarize it as: Pull out the "status" key and the entire "gravity_last_updated" section and keep them as-is. For everything else, split the value on the comma, put the component parts back together (without the comma), and output it as a number rather than a string. Finally, output everything (status, gravity_last_updated, and all the other values) as a single data block, which Flex will pack up and send to New Relic.
The full YAML file looks like this:
YAML
integrations:
  - name: nri-flex
    config:
      name: pihole_test
      apis:
        - name: pihole_test
          url: http://pi.hole/admin/api.php?summary&auth=a11049ddbf38fc1b678f4c4b17b87999a35a1d56617a9e2dcc36f1cc176ab7ce
          jq: >
            .[]|with_entries( select(.key | test("^gravity_last_updated|status|api"))) as $xx | with_entries( select(.key | test("^gravity_last_updated|status|api")|not)) |to_entries|map({(.key):(.value|split(",")|join("")|tonumber)})|add+$xx
          headers:
            accept: application/json
          remove_keys:
            - timestamp
Which results in data that looks like this: Plain Text "domains_being_blocked": 182113 "dns_queries_today": 41258 "ads_blocked_today": 3152 (and so on) This is a wonderfully effective solution to the entire issue! A Summary, With a Plot Twist To sum everything up: We learned how to convert strings to numbers in NRQL using numeric(). We learned how to split strings in NRQL based on a delimiter (or even text) using aparse(). We learned how to put the split parts back together in NRQL using concat(). We learned how to use jq in Flex to perform fairly complex string and data manipulations. But most importantly, we learned that asking colleagues for help isn't a sign of weakness. It's a sign of maturity and that, as author and speaker Ken Blanchard said, "None of us is as smart as all of us." But... after all of my searching, all of my queries to coworkers, all of my testing and troubleshooting, and after all the associated tears, rage, and frustration - I discovered I didn't need any of it. Remember what I said at the start of this blog? What I initially struggled with was trying to make New Relic operate based on my biases of how I thought things ought to work, rather than understanding and accepting how they did work. In my race to force New Relic to do all the heavy lifting, in my egocentric need to force New Relic to accept any type of data I threw at it, I ignored the simplest and most elegant solution of all: Pihole has the option to output raw, unformatted data. In place of the standard URL: http://pi.hole/admin/api.php?summary If, instead, I had used: http://pi.hole/admin/api.php?summaryRaw ...all of the numbers come out in a way that New Relic can take and use without any additional manipulation. Plain Text { "domains_being_blocked": 182113, "dns_queries_today": 42825, "ads_blocked_today": 1846, "ads_percentage_today": 4.310566, "unique_domains": 9228, "queries_forwarded": 25224, "queries_cached": 15547, "clients_ever_seen": 36, (and so on) This simply goes to show that the solutions to our problems are out there as long as we have the patience and perseverance to find them.
Data Science is a rapidly developing discipline that has the power to completely change how one conducts business and addresses issues. In order to apply the most efficient techniques and tools available, it is crucial for data scientists to stay current with the most recent trends and technology. In this article, you will discover ways to keep up with the most recent data science trends and technologies. You will learn about the latest industry trends and make sure that you are keeping pace with the advancements in the field. By the end of this article, you will have the knowledge and resources to stay current in the world of data science. Latest Trends in Data Science Data Science is rapidly advancing, and the latest trends continue to bring together the worlds of data and technology. Artificial intelligence, machine learning, and deep learning are just some of the new-generation tools taking the industry by storm. With the ability to quickly gain insights from massive amounts of data, these innovative techniques are changing the game, giving organizations valuable new ways to manage their data and get ahead of the competition. 1. Automated Machine Learning Automated machine learning (AutoML) is an emerging field of data science that uses algorithms to automate the process of building and optimizing data models. It uses a combination of feature engineering, model selection, hyperparameter optimization, and model ensembling to obtain the best possible performance from a machine-learning model. Automated machine learning offers the potential for data scientists to streamline their workflows, reduce time to market, and increase model performance. 2. Blockchain Technology Blockchain technology has become a hot topic in data science circles lately. This technology allows data to be stored securely in a distributed and immutable ledger. It can support complex multi-party transactions and also potentially add a layer of data security by ensuring that data can only be accessed by authorized users. Blockchain technology and its applications are still in their early stages, but it holds promise for data science applications and could become an important tool for securing large datasets. 3. Immersive Experiences Immersive data science experiences, such as augmented reality and virtual reality, offer a new way for data scientists to interact with their data. By allowing users to navigate datasets in a 3D environment, immersive experiences can open up new ways of understanding complex data and uncovering insights. These experiences can also be used to create interactive data visualizations that better convey the importance of data science. 4. Robotic Process Automation Robotic process automation (RPA) is a form of automation that uses software robots to automate mundane, repetitive tasks. In the data science field, this technology can be used to automate data collection, cleansing, and preparation tasks, helping data scientists save time and focus on more advanced analytics. RPA can also be used to improve the accuracy of data collection, as it will reduce the potential for human error. 5. AI-Powered Virtual Assistants AI-powered virtual assistants are becoming increasingly popular in data science circles. These virtual assistants use natural language processing and machine learning algorithms to understand complex conversations and respond in appropriate ways.
They can be used to automate data analysis tasks and help data scientists spend less time on mundane tasks and more time on higher-value activities. 6. Natural Language Processing Natural language processing (NLP) is a sub-domain of artificial intelligence that focuses on enabling computers to understand and generate human speech. This technology is becoming increasingly important in data science as it allows machines to understand natural language queries better, making it easier for data scientists to ask complex questions. NLP can also be used to automatically generate questions based on the data and provide more detailed insight. 7. Graph Analytics Graph analytics is a branch of data science that uses graph theory to analyze interconnected data sets. It can be used to uncover relationships and patterns existing in the data sets, as well as to analyze the structure of networks and make data-driven decisions. Graph analytics is often used with other analytical techniques like machine learning and predictive analytics to get a more comprehensive view of the data sets. 8. Artificial Intelligence Artificial Intelligence (AI) is a broad field of data science that focuses on machines that are programmed to behave in a similar manner to humans. AI and data science systems can perform tasks more efficiently than humans, making them invaluable in various industries and applications. AI is used in tasks such as facial recognition, speech recognition, and driverless cars, as well as in data analysis, natural language processing, and robotics. 9. Image Processing Image processing is a branch of data science that deals with the analysis of digital images and videos. It is used for various purposes, such as detecting objects and recognizing faces and activities. Image processing techniques can be used to analyze large sets of image data and to extract information from digital images. 10. Text Mining Text mining is a field of data science that focuses on the analysis of text data. It is used to uncover patterns in text data and gain insights from it. Text mining techniques can be applied to large volumes of text data, such as from social media, web pages, and news articles. 11. Internet of Things The Internet of Things (IoT) is a term that describes a network of internet-connected devices that can share data with each other. These devices can collect data from different sources and can be integrated into data science projects. IoT technology can be used to monitor large datasets in real time, allowing data scientists to uncover insights quickly. 12. Big Data Big data describes datasets that are too complex and large to be managed by traditional databases. Big data sets pose a challenge to data scientists, as they require new tools and techniques to be able to process and analyze them. Fortunately, new technologies such as Apache Hadoop and Spark have made it easier to manage and analyze large datasets, making the process more efficient and increasing the potential for data scientists to uncover valuable insights. 13. Data Visualization Data visualization is one of the most important tools used in data science, both to explore and analyze data, as well as to communicate results to others. Data visualization tools are becoming increasingly powerful and user-friendly, allowing users to quickly and easily visualize and communicate even the most complex datasets. 
Popular visualization tools such as Tableau, Qlik, and Power BI are allowing data scientists to quickly and easily create interactive visualizations that can be easily shared and understood by stakeholders and colleagues. Additionally, tools such as Matplotlib, Seaborn, and Bokeh allow data scientists to create and customize more sophisticated visualizations for their own analysis and exploration. 14. Cloud Computing Cloud computing is an increasingly popular tool for data science. It provides data scientists with quick and easy access to the computing power and storage capacity needed to run large data analysis projects and the ability to share results with colleagues and stakeholders quickly and easily. By leveraging cloud computing, data scientists are able to access large datasets and distributions of computing resources that would otherwise be unavailable or too costly to access. 15. Predictive Analytics Predictive analytics is a data science method that uses data-driven algorithms to make predictions about the future. This technology can be used to anticipate customer behavior, detect patterns, and identify trends. By using predictive analytics, data scientists can gain valuable insights into the future, allowing them to make better-informed decisions. 16. Augmented Analytics Augmented analytics is a field of data science that uses machine learning and natural language processing to automate and enhance the data analysis process. Augmented analytics uses advanced analytics techniques, such as natural language queries and automated machine learning, to make data analysis easier and more efficient. It also enables data scientists to gain deeper insights into datasets and make better-informed decisions. These technologies in data science are essential for uncovering previously unknown insights from data and helping drive better decision-making. In addition, integrating these technologies into data analysis can result in more accurate insights and analysis and improve the accuracy of predictions and forecasts. As AI, image processing, text mining, and other technologies continue to expand, the potential applications for data science and analytics become even more exciting. Bottom Line The key to staying ahead in the data science field is to stay informed and knowledgeable of the latest trends and technologies in the field in order to ensure success. From reading industry publications and attending conferences to taking data science courses and using online resources, there are many ways to stay connected to the data science community and the most innovative technologies and trends in the field. With these tools, you can stay ahead of the pack and be the first to know about new opportunities and advancements to hone your skills and maximize your career potential.
Companies are in continuous motion: new requirements, new data streams, and new technologies are popping up every day. When designing new data platforms supporting the needs of your company, failing to perform a complete assessment of the options available can have disastrous effects on a company’s capability to innovate and make sure its data assets are usable and reusable in the long term. Having a standard assessment methodology is an absolute must to avoid personal bias and properly evaluate the various solutions across all the needed axes. The SOFT Methodology provides a comprehensive guide of all the evaluation points to define robust and future-proof data solutions. However, the original blog doesn’t discuss a couple of important factors: why is applying a methodology like SOFT important? And, even more, what risks can we encounter if we’re not doing so? This blog aims to cover both aspects. The Why Data platforms are here to stay: the recent history of technology has told us that data decisions made now have a long-lasting effect. We commonly see a frequent rework of the front end, but radical changes in the back-end data platforms used are rare. Front-end rework can radically change the perception of a product, but when the same is done on the backend, the changes do not immediately impact the end users. Changing the product provider is nowadays quite frictionless, but porting a solution across different backend tech stacks is, despite the eternal promise, very complex and costly, both financially and time-wise. Some options exist to ease the experience, but the code compatibility and performance are never a 100% match. Furthermore, when talking about data solutions, performance consistency is key. Any change in the backend technology is therefore seen as a high-risk scenario, and most of the time refused with the statement “don’t fix what isn’t broken." The fear of change blocks both new tech adoption and upgrades of existing solutions. In summary, the world has plenty of examples of companies using backend data platforms chosen ages ago, sometimes with old, unsupported versions. Therefore, any data decision made today needs to be robust and age well in order to support the companies in their future data growth. Having a standard methodology helps understand the playing field, evaluate all the possible directions, and accurately compare the options. The Risks of Being (Data) Stuck Ok, you’re in the long-term game now. Swapping back-end or data pipeline solutions is not easy; therefore, selecting the right one is crucial. But what problems will we face if we fail in our selection process? What are the risks of being stuck with a sub-optimal choice? Features When thinking about being stuck, it’s tempting to compare the chosen solution with the new and shiny tools available at the moment, and their promised future features. New options and functionalities could enhance a company’s productivity, system management, and integration, and remove friction at any point of the data journey. Being stuck with a suboptimal solution without a clear innovation path and without any capability to influence its direction puts the company in a potentially weak position regarding innovation. Evaluating the community and the vendors behind a certain technology could help decrease the risk of stagnating tools. It’s very important to evaluate which features and functionality are relevant/needed and define a list of “must haves” to reduce time spent on due diligence.
Scaling The SOFT methodology blog post linked above touches on several directions of scaling: human, technological, business case, and financial. Hitting any of these problems could mean that the identified solution could not be supported due to a lack of talent, could hit technical limits that prevent growth, could expose security/regulatory problems, or could be perfectly fine to run in a sandbox but financially impractical on production-size data volumes. Hitting scaling limits, therefore, means that companies adopting a specific technology could be forced to either slow down growth or completely rebuild solutions starting from a different technology choice. Support and Upgrade Path Sometimes the chosen technology advances, but companies are afraid or can’t find the time/budget to upgrade to the new version. The associated risk is that the older the software version, the more complex (and risky) the upgrade path will be. In exceptional circumstances, the upgrade path might not exist, forcing a complete re-implementation of the solution. Support needs a similar discussion: staying on a very old version could mean a premium support fee in the best case or a complete lack of vendor/community help in the vast majority of scenarios. Community and Talent The risk associated with talent shortage was already covered in the scaling section. New development and workload scaling heavily depend on the humans behind the tool. Moreover, not evaluating the community and talent pool behind a certain technology decision could create support problems once the chosen solution becomes mature and the first set of developers/supporters leave the company without proper replacement. The lack of a vibrant community around a data solution could rapidly decrease the talent pool, creating issues for new features, new developments, and existing support. Performance It’s impossible to know what the future will hold in terms of new technologies and integrations. But selecting a closed solution with limited (or no) integration capabilities forces companies to run only “at the speed of the chosen technology,” exposing them to the risk of not being able to unleash new use cases because of technical limitations. Moreover, not paying attention to the speed of development and recovery could expose limits on the innovation and resilience fronts. Black Box When defining new data solutions, an important aspect is the ability to make data assets and related pipelines discoverable and understandable. Dealing with a black box approach means exposing companies to repeated efforts and inconsistent results, which decrease trust in the solution and open the door to misalignments in results across departments. Overthinking The opposite risk is overthinking: the more time spent evaluating solutions, the more technologies, options, and needs will pile up, making the final decision process even longer. An inventory of the needs, timeframes, and acceptable performance is necessary to reduce the scope, make a decision, and start implementing. Conclusion When designing a data platform, it is very important to address the right questions and avoid the “risk of being stuck." The SOFT Methodology aims to provide all the important questions you should ask yourself in order to avoid pitfalls and create a robust solution. Do you feel all the risks are covered? Have a different opinion? Let me know!
With the Spring 6 and Spring Boot 3 releases, Java 17+ became the baseline framework version. So now is a great time to start using compact Java records as Data Transfer Objects (DTOs) for various database and API calls. Whether you prefer reading or watching, let’s review a few approaches for using Java records as DTOs that apply to Spring Boot 3 with Hibernate 6 as the persistence provider. Sample Database Follow these instructions if you’d like to install the sample database and experiment yourself. Otherwise, feel free to skip this section: 1. Download the Chinook Database dataset (music store) for the PostgreSQL syntax. 2. Start an instance of YugabyteDB, a PostgreSQL-compliant distributed database, in Docker: Shell mkdir ~/yb_docker_data docker network create custom-network docker run -d --name yugabytedb_node1 --net custom-network \ -p 7001:7000 -p 9000:9000 -p 5433:5433 \ -v ~/yb_docker_data/node1:/home/yugabyte/yb_data --restart unless-stopped \ yugabytedb/yugabyte:latest \ bin/yugabyted start \ --base_dir=/home/yugabyte/yb_data --daemon=false 3. Create the chinook database in YugabyteDB: Shell createdb -h 127.0.0.1 -p 5433 -U yugabyte -E UTF8 chinook 4. Load the sample dataset: Shell psql -h 127.0.0.1 -p 5433 -U yugabyte -f Chinook_PostgreSql_utf8.sql -d chinook Next, create a sample Spring Boot 3 application: 1. Generate an application template using Spring Boot 3+ and Java 17+ with Spring Data JPA as a dependency. 2. Add the PostgreSQL driver to the pom.xml file: XML <dependency> <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId> <version>42.5.4</version> </dependency> 3. Provide YugabyteDB connectivity settings in the application.properties file: Properties files spring.datasource.url = jdbc:postgresql://127.0.0.1:5433/chinook spring.datasource.username = yugabyte spring.datasource.password = yugabyte All set! Now, you’re ready to follow the rest of the guide. Data Model The Chinook Database comes with many relations, but two tables will be more than enough to show how to use Java records as DTOs. The first table is Track, and below is a definition of a corresponding JPA entity class: Java @Entity public class Track { @Id private Integer trackId; @Column(nullable = false) private String name; @ManyToOne(fetch = FetchType.LAZY) @JoinColumn(name = "album_id") private Album album; @Column(nullable = false) private Integer mediaTypeId; private Integer genreId; private String composer; @Column(nullable = false) private Integer milliseconds; private Integer bytes; @Column(nullable = false) private BigDecimal unitPrice; // Getters and setters are omitted } The second table is Album and has the following entity class: Java @Entity public class Album { @Id private Integer albumId; @Column(nullable = false) private String title; @Column(nullable = false) private Integer artistId; // Getters and setters are omitted } In addition to the entity classes, create a Java record named TrackRecord that stores short but descriptive song information: Java public record TrackRecord(String name, String album, String composer) {} Naive Approach Imagine you need to implement a REST endpoint that returns a short song description. The API needs to provide song and album names, as well as the author’s name. The previously created TrackRecord class can fit the required information. So, let’s create a record using the naive approach that gets the data via JPA Entity classes: 1. 
Add the following JPA Repository: Java public interface TrackRepository extends JpaRepository<Track, Integer> { } 2. Add Spring Boot’s service-level method that creates a TrackRecord instance from the Track entity class. The latter is retrieved via the TrackRepository instance: Java @Transactional(readOnly = true) public TrackRecord getTrackRecord(Integer trackId) { Track track = repository.findById(trackId).get(); TrackRecord trackRecord = new TrackRecord( track.getName(), track.getAlbum().getTitle(), track.getComposer()); return trackRecord; } The solution looks simple and compact, but it’s very inefficient because Hibernate needs to instantiate two entities first: Track and Album (see the track.getAlbum().getTitle()). To do this, it generates two SQL queries that request all the columns of the corresponding database tables: SQL Hibernate: select t1_0.track_id, t1_0.album_id, t1_0.bytes, t1_0.composer, t1_0.genre_id, t1_0.media_type_id, t1_0.milliseconds, t1_0.name, t1_0.unit_price from track t1_0 where t1_0.track_id=? Hibernate: select a1_0.album_id, a1_0.artist_id, a1_0.title from album a1_0 where a1_0.album_id=? Hibernate selects 12 columns across two tables, but TrackRecord needs only three columns! This is a waste of memory, computing, and networking resources, especially if you use distributed databases like YugabyteDB that scatters data across multiple cluster nodes. TupleTransformer The naive approach can be easily remediated if you query only the records the API requires and then transform a query result set to a respective Java record. The Spring Data module of Spring Boot 3 relies on Hibernate 6. That version of Hibernate split the ResultTransformer interface into two interfaces: TupleTransformer and ResultListTransformer. The TupleTransformer class supports Java records, so, the implementation of the public TrackRecord getTrackRecord(Integer trackId) can be optimized this way: Java @Transactional(readOnly = true) public TrackRecord getTrackRecord(Integer trackId) { org.hibernate.query.Query<TrackRecord> query = entityManager.createQuery( """ SELECT t.name, a.title, t.composer FROM Track t JOIN Album a ON t.album.albumId=a.albumId WHERE t.trackId=:id """). setParameter("id", trackId). unwrap(org.hibernate.query.Query.class); TrackRecord trackRecord = query.setTupleTransformer((tuple, aliases) -> { return new TrackRecord( (String) tuple[0], (String) tuple[1], (String) tuple[2]); }).getSingleResult(); return trackRecord; } entityManager.createQuery(...) - Creates a JPA query that requests three columns that are needed for the TrackRecord class. query.setTupleTransformer(...) - The TupleTransformer supports Java records, which means a TrackRecord instance can be created in the transformer’s implementation. This approach is more efficient than the previous one because you no longer need to create entity classes and can easily construct a Java record with the TupleTransformer. Plus, Hibernate generates a single SQL request that returns only the required columns: SQL Hibernate: select t1_0.name, a1_0.title, t1_0.composer from track t1_0 join album a1_0 on t1_0.album_id=a1_0.album_id where t1_0.track_id=? However, there is one very visible downside to this approach: the implementation of the public TrackRecord getTrackRecord(Integer trackId) method became longer and wordier. Java Record Within JPA Query There are several ways to shorten the previous implementation. One is to instantiate a Java record instance within a JPA query. 
First, expand the implementation of the TrackRepository interface with a custom query that creates a TrackRecord instance from requested database columns: Java public interface TrackRepository extends JpaRepository<Track, Integer> { @Query(""" SELECT new com.my.springboot.app.TrackRecord(t.name, a.title, t.composer) FROM Track t JOIN Album a ON t.album.albumId=a.albumId WHERE t.trackId=:id """) TrackRecord findTrackRecord(@Param("id") Integer trackId); } Next, update the implementation of the TrackRecord getTrackRecord(Integer trackId) method this way: Java @Transactional(readOnly = true) public TrackRecord getTrackRecord(Integer trackId) { return repository.findTrackRecord(trackId); } So, the method implementation became a one-liner that gets a TrackRecord instance straight from the JPA repository: as simple as possible. But that’s not all. There is one more small issue. The JPA query that constructs a Java record requires you to provide a full package name for the TrackRecord class: SQL SELECT new com.my.springboot.app.TrackRecord(t.name, a.title, t.composer)... Let’s find a way to bypass this requirement. Ideally, the Java record needs to be instantiated without the package name: SQL SELECT new TrackRecord(t.name, a.title, t.composer)... Hypersistence Utils Hypersistence Utils library comes with many goodies for Spring and Hibernate. One feature allows you to create a Java record instance within a JPA query without the package name. Let’s enable the library and this Java records-related feature in the Spring Boot application: 1. Add the library’s Maven artifact for Hibernate 6. 2. Create a custom IntegratorProvider that registers TrackRecord class with Hibernate: Java public class ClassImportIntegratorProvider implements IntegratorProvider { @Override public List<Integrator> getIntegrators() { return List.of(new ClassImportIntegrator(List.of(TrackRecord.class))); } } 3. Update the application.properties file by adding this custom IntegratorProvider: Properties files spring.jpa.properties.hibernate.integrator_provider=com.my.springboot.app.ClassImportIntegratorProvider After that you can update the JPA query of the TrackRepository.findTrackRecord(...) method by removing the Java record’s package name from the query string: Java @Query(""" SELECT new TrackRecord(t.name, a.title, t.composer) FROM Track t JOIN Album a ON t.album.albumId=a.albumId WHERE t.trackId=:id """) TrackRecord findTrackRecord(@Param("id") Integer trackId); It’s that simple! Summary The latest versions of Java, Spring, and Hibernate have a number of significant enhancements to simplify and make coding in Java more fun. One such enhancement is built-in support for Java records that can now be easily used as DTOs in Spring Boot applications. Enjoy!
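As a closing usage sketch, the repository-based approach can be exposed through a simple REST controller. The controller class, request mapping, and endpoint below are illustrative assumptions added for this write-up, not part of the original guide; only TrackRepository.findTrackRecord(...) and TrackRecord come from the article.
Java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/tracks")
public class TrackController {

    private final TrackRepository repository;

    public TrackController(TrackRepository repository) {
        this.repository = repository;
    }

    // Returns the three-field TrackRecord DTO produced directly by the JPA query.
    @GetMapping("/{id}")
    public TrackRecord getTrackRecord(@PathVariable("id") Integer trackId) {
        return repository.findTrackRecord(trackId);
    }
}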
The amount of data that needs to be processed, filtered, connected, and stored is constantly growing. Companies that can process the data quickly have an advantage. Ideally, it should happen in real time. Event-driven architectures are used for this. Such architectures allow exchanging messages in real time, storing them in a single database, and using distributed systems for sending and processing messages. In this article, we will talk about how event-driven architecture differs from the other popular approach, message-driven architecture. In addition, we will look at the most popular event-driven architecture patterns and their main advantages and disadvantages. Event-Driven vs. Message-Driven Architecture An event is some action that took place at a certain point in time. It is generated by a service and does not have specific recipients. Any system component can be an event consumer. A message is a fixed packet of data that is sent from one service to another. An event is a type of message that signals a change in the state of the system. In an event-driven architecture, the component that generates the event tells other components where it will be stored. That way, any component can save, process, and respond to an event, and the producer won’t know who the consumer is. In message-driven systems, components that create messages send them to a specific address. After sending a message, the component immediately receives control back and does not wait for the message to be processed. In a message-driven architecture, if a similar message needs to be sent to multiple recipients, the sender must send it to each recipient separately. In contrast, in an event-driven architecture, a producer generates an event once and sends it to a processing system. After that, the event can be consumed by any number of subscribers connecting to that system. Event-Driven Patterns There are different approaches to implementing an event-driven architecture. Often, when designing a program, several approaches are used together. In this section, we will talk about the most popular patterns that allow you to implement an event-driven architecture, their advantages, and their application areas. Global Event Streaming Platform It is vital for today’s companies to respond to events in real time. Customers expect the business to respond immediately to various events. So, there is a need for software architectures that meet modern business requirements and can process data as event streams, not only data at rest. A great solution is to use a global event streaming platform. It allows you to process business functions as event streams. In addition, this kind of platform is fault-tolerant and scalable. All events that occur in the system are recorded in the data streaming platform once. External systems read these events and process them in real time. Event streaming platforms consist of a diverse set of components, and their creation requires significant resources and engineering experience. This pattern is quite popular and is used in many industries. Central Event Store The central event store ensures the publication and storage of events in a single database. It is a single endpoint for events of various types. This allows applications and services to respond to events in real time without delays or data loss. Applications can easily subscribe to a variety of events, reducing development costs. 
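To make the idea concrete, the following minimal sketch (an in-memory illustration written for this report, not an example from the original article) shows the basic contract of a central event store: a producer appends an event once, the event is persisted, and any number of subscribers react to it without the producer knowing who they are. In practice, this role is usually played by a platform such as Apache Kafka or a dedicated event store rather than an in-memory class.
Java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal in-memory sketch of a central event store (illustration only).
public class CentralEventStore {

    public record Event(String type, String payload, long timestamp) {}

    private final List<Event> log = new CopyOnWriteArrayList<>();
    private final List<Consumer<Event>> subscribers = new CopyOnWriteArrayList<>();

    // Any component can subscribe; the producer never knows who the consumers are.
    public void subscribe(Consumer<Event> subscriber) {
        subscribers.add(subscriber);
    }

    // The producer appends the event once; every subscriber is notified.
    public void append(Event event) {
        log.add(event);
        subscribers.forEach(subscriber -> subscriber.accept(event));
    }

    // Stored events remain available, so new consumers can replay past events.
    public List<Event> history() {
        return List.copyOf(log);
    }
}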
The central event store is used for a variety of purposes: Publication of changes for consumer services. Reconstruction of past states and conducting business analytics. Searching for events. Saving all program changes as a sequence of events. A single endpoint for notifications based on application state changes. Monitoring system status. Using a central event store, you can create new applications and use existing events without having to republish them. Event-First and Event Streaming Applications Event streaming allows you to receive data from various sources, such as applications, databases, sensors, IoT devices, etc., and to process, clean, and use that data without first saving it. This method of event processing provides fast results and is very important for companies that transfer large amounts of data and need to receive information quickly. Key benefits of event streaming platforms: Improving the customer experience: Customers can instantly learn about changes in the status of their orders, which improves their experience and, accordingly, increases the company’s income. Risk reduction: Systems that use event streaming can detect online fraud, stop a suspicious transaction, or block a card. Reliability: Event streaming platforms allow for robust systems that can effectively handle subscriber failures. Feedback in real time: Users can see the results of their operations immediately after their execution, without having to wait. Event streaming systems require less infrastructure to support, so they are simple and fast to build. Applications that use this architecture pattern are used in various industries, such as financial trading, risk and fraud detection in financial systems, the Internet of Things, retail, etc. CQRS Command Query Responsibility Segregation (CQRS) is the principle of separating the data structures used for reading and writing information. It is used to increase the performance, security, and scalability of the software. CQRS is particularly useful in complex domains where a single model for reading and writing data becomes too complicated; separating the two greatly simplifies each model. This is especially noticeable when the number of read and write operations differs significantly. Different databases can be used for reading and for storing data. In this case, they must be kept in sync: the write model must publish an event every time it updates the database. Advantages of CQRS: Independent scaling of read and write workloads. Separate, optimized data schemas for reading and writing. More flexible and convenient models thanks to the separation of concerns. The ability to avoid complex queries by storing a materialized view in the read database. The CQRS pattern is useful in the following scenarios: In systems where many users simultaneously access the same data. When data read performance should be tuned separately from data write performance, which is especially important when the number of reads greatly exceeds the number of writes. When the read and write models are created by different development teams. With frequent changes in business rules. In cases where the logic of the system and the user interface are quite simple, this pattern is not recommended. Event Sourcing Event sourcing stores the system state as a sequence of events: whenever the system state changes, a new event is appended to the event list, as shown in the sketch below. 
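Below is a minimal, hypothetical sketch (a bank-account aggregate invented for illustration; it is not from the original article) of how state changes are captured as appended events and how the current state can be rebuilt by replaying them:
Java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Sketch of an event-sourced aggregate: the event list is the source of truth.
public class BankAccount {

    public sealed interface AccountEvent permits Deposited, Withdrawn {}
    public record Deposited(BigDecimal amount) implements AccountEvent {}
    public record Withdrawn(BigDecimal amount) implements AccountEvent {}

    private final List<AccountEvent> events = new ArrayList<>();
    private BigDecimal balance = BigDecimal.ZERO;

    public void deposit(BigDecimal amount)  { apply(new Deposited(amount)); }
    public void withdraw(BigDecimal amount) { apply(new Withdrawn(amount)); }

    // Every state change is appended as an event and then applied to the in-memory state.
    private void apply(AccountEvent event) {
        events.add(event);
        if (event instanceof Deposited d) balance = balance.add(d.amount());
        if (event instanceof Withdrawn w) balance = balance.subtract(w.amount());
    }

    // The current state can be rebuilt at any time by replaying the stored events.
    public static BankAccount replay(List<AccountEvent> history) {
        BankAccount account = new BankAccount();
        history.forEach(account::apply);
        return account;
    }

    public BigDecimal balance() { return balance; }

    public List<AccountEvent> events() { return List.copyOf(events); }
}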
The current system state can then be restored at any time by reprocessing those events. All events are stored in the event store, which is an event database. A good example of such an architecture is a version control system: its event store is the log of all commits, and a working copy of the source tree is the system state. The main advantages of event sourcing: Events can be published each time the system state changes. It provides a reliable audit log of changes. The state of the system can be determined at any point in time. It makes it easier to transition from a monolithic application to a microservice architecture through the use of loosely coupled objects that exchange events. However, reconstructing the state of business objects through standard queries is difficult and inefficient. Therefore, the system should use Command Query Responsibility Segregation (CQRS) to implement queries, which means that applications must be able to work with eventually consistent data. Automated Data Provisioning This pattern provides fully self-service data provisioning. The user determines what data they need, in what format, and where it should be stored: for example, in a database, distributed cache, microservice, etc. The selected repository can be used together with the central repository. The system provides the client with infrastructure along with pre-loaded data and manages the flow of events. The processor processes and filters data streams according to user requirements. The use of cloud infrastructure makes such a system faster and more practical. Systems that use automated data provisioning are found in finance, retail, and internet services, both on-premises and in the cloud. Advantages and Disadvantages of Event-Driven Architecture Even though event-driven architecture is quite popular now, is developing rapidly, and helps to solve many business problems, there are also some disadvantages to this approach. In this section, we list the main advantages and disadvantages of event-driven architecture. Advantages Autonomy: The loose coupling of components in this architecture allows event producers and consumers to function independently of each other. This decoupling allows you to use different programming languages and technologies to develop different components. In addition, producers and consumers can be added to and removed from the system without affecting the other participants. Fault tolerance: Events are published immediately after they occur. Various services and programs subscribe to these events. If a consumer shuts down, events continue to be published and queued; when the consumer reconnects, it will be able to handle these events. Real-time user interaction for a better user experience. Cost efficiency: The consumer receives the message immediately after the producer has published it, eliminating the need for constant polling to check for new events. This reduces CPU consumption and network bandwidth usage. Disadvantages Error handling is difficult: because there can be many loosely coupled event producers and consumers, it is difficult to trace the actions between them and identify the root cause of a malfunction. The order of events cannot be predicted: since events are asynchronous, the order in which they occur cannot be guaranteed. Additionally, there may be duplicates, each of which may require a contextual response. This requires additional testing time and deep system analysis to prevent data loss. 
The loose coupling between the systems that generate events and the systems that receive them can also be a point of vulnerability that attackers can exploit. Conclusion Event-driven architectures have been around for a long time and were originally used simply to pass messages. Driven by modern business needs, they are evolving rapidly and offer better and better options for connectivity and management. Event-driven architectures are indispensable for real-time event processing, as required by many modern systems. Several different patterns exist for implementing such an architecture, each with its own advantages and areas of application. When designing a system, you need to clearly define its functions and use cases in order to choose the right pattern. Although event-driven architecture is widely used and has many advantages, its characteristics can sometimes work against the system, which must be taken into account during its design.
In earlier days, it was easy to detect and debug a problem in monolithic applications because there was only one service running in the back end and front end. Now, we are moving toward microservices architecture, where applications are divided into multiple independently deployable services. These services have their own goals and logic to serve. In this kind of application architecture, it becomes difficult to observe how one service depends on or affects other services. To make the system observable, some logs, metrics, or traces must be emitted from the code, and this data must be sent to an observability back end. This is where OpenTelemetry and Jaeger come into the picture. In this article, we will see how to monitor application trace data (traces and spans) with the help of OpenTelemetry and Jaeger. A trace is used to observe requests as they propagate through the services in a distributed system. Spans are the basic unit of a trace; they represent a single event within the trace, and a trace can have one or multiple spans. A span consists of log messages, time-related data, and other attributes that provide information about the operation it tracks. We will use the distributed tracing method to observe requests moving across microservices, generating data about each request and making it available for analysis. The produced data will record the flow of requests through our microservices, and it will help us understand our application's performance. OpenTelemetry In observability, telemetry is the collection and transmission of data from its source using agents and protocols. The telemetry data includes logs, metrics, and traces, which help us understand what is happening in our application. OpenTelemetry (also known as OTel) is an open source framework comprising a collection of tools, APIs, and SDKs. OpenTelemetry makes generating, instrumenting, collecting, and exporting telemetry data easy. The data collected with OpenTelemetry is vendor-agnostic and can be exported in many formats. OpenTelemetry was formed by merging two projects: OpenCensus and OpenTracing. Instrumenting The process of adding observability code to your application is known as instrumentation. Instrumentation helps make our application observable, meaning the code must produce some metrics, traces, and logs. OpenTelemetry provides two ways to instrument our code: Manual instrumentation Auto instrumentation 1. Manual Instrumentation The user needs to add OpenTelemetry code to the application. Manual instrumentation provides more options for customizing spans and traces. Languages supported for manual instrumentation include C++, .NET, Go, Java, Python, and so on. 2. Automatic Instrumentation This is the easiest way of instrumenting, as it requires no code changes and no need to recompile the application. It uses an intelligent agent that gets attached to an application, reads its activity, and extracts the traces. Automatic instrumentation supports Java, NodeJS, Python, and so on. Difference Between Manual and Automatic Instrumentation Both manual and automatic instrumentation have advantages and disadvantages that you might consider while writing your code. A few of them are listed below. Manual instrumentation: code changes are required; it supports the widest range of programming languages; it consumes more time, as code changes are required; it provides more options for the customization of spans and traces, since you have more control over the telemetry data generated by your application; and the possibility of errors is higher, as manual changes are required. 
Automatic instrumentation: code changes are not required; currently .NET, Java, NodeJS, and Python are supported; it is easy to implement, as we do not need to touch the code; it offers fewer options for customization; and there is little room for error, as we don't have to touch our application code. To make the instrumentation process hassle-free, use automatic instrumentation, as it does not require any modification to the code and reduces the possibility of errors. Automatic instrumentation is done by an agent that reads your application's telemetry data, so no manual changes are required (a short Java sketch of manual instrumentation is shown later in this section for contrast). For the scope of this post, we will see how you can use automatic instrumentation in a Kubernetes-based microservices environment. Jaeger Jaeger is a distributed tracing tool initially built by Uber and released as open source in 2015. Jaeger is also a Cloud Native Computing Foundation graduate project and was influenced by Dapper and OpenZipkin. It is used for monitoring and troubleshooting microservices-based distributed systems. The Jaeger components which we have used for this blog are: Jaeger Collector Jaeger Query Jaeger UI / Console Storage Backend Jaeger Collector: The Jaeger distributed tracing system includes the Jaeger collector. It is in charge of gathering and keeping the information. After receiving spans, the collector adds them to a processing queue. Collectors need a persistent storage backend, hence Jaeger also provides a pluggable span storage mechanism. Jaeger Query: This is a service used to get traces out of storage. The web-based user interface for the Jaeger distributed tracing system is served by Jaeger Query. It provides various features and tools to help you understand the performance and behavior of your distributed application and enables you to search, filter, and visualize the data gathered by Jaeger. Jaeger UI/Console: Jaeger UI lets you view and analyze the traces generated by your application. Storage Back End: This is used to store the traces generated by an application for the long term. In this article, we are going to use Elasticsearch to store the traces. What Is the Need for Integrating OpenTelemetry With Jaeger? OpenTelemetry and Jaeger are tools that help us set up observability in microservices-based distributed systems, but they are intended to address different issues. OpenTelemetry provides an instrumentation layer for the application, which helps us generate, collect, and export telemetry data for analysis. In contrast, Jaeger is used to store and visualize telemetry data. OpenTelemetry can only generate and collect the data; it does not have a UI for visualization. So we need to integrate Jaeger with OpenTelemetry, as it has a storage back end and a web UI for the visualization of the telemetry data. With the help of the Jaeger UI, we can quickly troubleshoot microservices-based distributed systems. Note: OpenTelemetry can generate logs, metrics, and traces. Jaeger does not support logs and metrics. Now that you have an idea about OpenTelemetry and Jaeger, let's see how we can integrate them with each other to visualize the traces and spans generated by our application. Implementing OpenTelemetry Auto-Instrumentation We will integrate OpenTelemetry with Jaeger, where OpenTelemetry will act as an instrumentation layer for our application, and Jaeger will act as the back-end analysis tool to visualize the trace data. Jaeger will get the telemetry data from the OpenTelemetry agent. 
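Before walking through the setup, here is the manual-instrumentation sketch referenced above, using the OpenTelemetry Java API. The class, method, span, and attribute names are illustrative assumptions made for this report, not code from the original article.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderService {

    // With manual instrumentation, the developer obtains a tracer and manages spans explicitly.
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

    public void processOrder(String orderId) {
        Span span = tracer.spanBuilder("processOrder").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Custom attributes give full control over the generated telemetry data.
            span.setAttribute("order.id", orderId);
            // ... business logic ...
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            // The span must be ended explicitly, or it will never be exported.
            span.end();
        }
    }
}
With auto-instrumentation, which is what we use in the rest of this article, none of this code has to be written by hand; the agent creates equivalent spans for supported libraries and frameworks.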
Jaeger will store the data in the storage back end, from where we will query the stored data and visualize it in the Jaeger UI. Prerequisites for this article are: The target Kubernetes cluster is up and running. You have access to run the kubectl command against the Kubernetes cluster to deploy resources. cert-manager is installed and running; you can install it from cert-manager.io if it is not installed. We assume that you have all the prerequisites, so you are now ready for the installation. The files we have used for this post are available in this GitHub repo. Installation The installation part contains three steps: Elasticsearch installation Jaeger installation OpenTelemetry installation Elasticsearch By default, Jaeger uses in-memory storage to store spans, which is not a recommended approach for the production environment. There are various tools available to use as a storage back end in Jaeger; you can read about them in the official Jaeger documentation on span storage back ends. In this article, we will use Elasticsearch as the storage back end. You can deploy Elasticsearch in your Kubernetes cluster using the Elasticsearch Helm chart. While deploying Elasticsearch, ensure you have enabled password-based authentication and that you deploy it in the observability namespace. Once Elasticsearch is deployed in our Kubernetes cluster, you can check its status by running the following command. Shell $ kubectl get all -n observability NAME READY STATUS RESTARTS AGE pod/elasticsearch-0 1/1 Running 0 17m NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/elasticsearch ClusterIP None <none> 9200/TCP,9300/TCP 17m NAME READY AGE statefulset.apps/elasticsearch 1/1 17m Jaeger Installation We are going to use Jaeger to visualize the trace data. Let's deploy the Jaeger Operator on our cluster. Before proceeding with the installation, we will deploy a ConfigMap in the observability namespace. In this ConfigMap, we will pass the username and password of the Elasticsearch instance which we deployed in the previous step. Replace the credentials based on your setup. YAML kubectl -n observability apply -f - <<EOF apiVersion: v1 kind: ConfigMap metadata: name: jaeger-configuration labels: app: jaeger app.kubernetes.io/name: jaeger data: span-storage-type: elasticsearch collector: | es: server-urls: http://elasticsearch:9200 username: elastic password: changeme collector: zipkin: http-port: 9411 query: | es: server-urls: http://elasticsearch:9200 username: elastic password: changeme agent: | collector: host-port: "jaeger-collector:14267" EOF If you are going to deploy Jaeger in another namespace, or you have changed the Jaeger collector service name, then you need to adjust the host-port value under the agent's collector section accordingly. Jaeger Operator The Jaeger Operator is a Kubernetes operator for deploying and managing Jaeger, an open source, distributed tracing system. It works by automating the deployment, scaling, and management of Jaeger components on a Kubernetes cluster. The Jaeger Operator uses custom resources and custom controllers to extend the Kubernetes API with Jaeger-specific functionality. It manages the creation, update, and deletion of Jaeger components, such as the Jaeger collector, query, and agent components. When a Jaeger instance is created, the Jaeger Operator deploys the necessary components and sets up the required services and configurations. We are going to deploy the Jaeger Operator in the observability namespace. 
Use the below-mentioned command to deploy the operator. Shell $ kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.38.0/jaeger-operator.yaml -n observability We are using the latest version of Jaeger, which is 1.38.0 at the time of writing this article. By default, the Jaeger Operator manifest is provided for cluster-wide mode. Suppose you want to watch only a particular namespace. In that case, you need to change the ClusterRole to Role and the ClusterRoleBinding to RoleBinding in the operator manifest, and set the WATCH_NAMESPACE env variable on the Jaeger Operator deployment. To verify whether the Jaeger Operator is deployed successfully, run the following command: Shell $ kubectl get all -n observability NAME READY STATUS RESTARTS AGE pod/elasticsearch-0 1/1 Running 0 17m pod/jaeger-operator-5597f99c79-hd9pw 2/2 Running 0 11m NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/elasticsearch ClusterIP None <none> 9200/TCP,9300/TCP 17m service/jaeger-operator-metrics ClusterIP 172.20.220.212 <none> 8443/TCP 11m service/jaeger-operator-webhook-service ClusterIP 172.20.224.23 <none> 443/TCP 11m NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/jaeger-operator 1/1 1 1 11m NAME DESIRED CURRENT READY AGE replicaset.apps/jaeger-operator-5597f99c79 1 1 1 11m NAME READY AGE statefulset.apps/elasticsearch 1/1 17m As we can see in the above output, our Jaeger Operator is deployed successfully, and all of its pods are up and running; this means the Jaeger Operator is ready to create Jaeger instances (CRs). A Jaeger instance will contain the Jaeger components (Query, Collector, Agent); later, we will use these components to query the trace data collected by OpenTelemetry. Jaeger Instance A Jaeger instance is a deployment of the Jaeger distributed tracing system. It is used to collect and store trace data from microservices or distributed applications, and it provides a UI to visualize and analyze that trace data. To deploy the Jaeger instance, use the following command. 
Shell $ kubectl apply -f https://raw.githubusercontent.com/infracloudio/Opentelemertrywithjaeger/master/jaeger-production-template.yaml To verify the status of the Jaeger instance, run the following command: Shell $ kubectl get all -n observability NAME READY STATUS RESTARTS AGE pod/elasticsearch-0 1/1 Running 0 17m pod/jaeger-agent-27fcp 1/1 Running 0 14s pod/jaeger-agent-6lvp2 1/1 Running 0 15s pod/jaeger-collector-69d7cd5df9-t6nz9 1/1 Running 0 19s pod/jaeger-operator-5597f99c79-hd9pw 2/2 Running 0 11m pod/jaeger-query-6c975459b6-8xlwc 1/1 Running 0 16s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/elasticsearch ClusterIP None <none> 9200/TCP,9300/TCP 17m service/jaeger-collector ClusterIP 172.20.24.132 <none> 14267/TCP,14268/TCP,9411/TCP,14250/TCP 19s service/jaeger-operator-metrics ClusterIP 172.20.220.212 <none> 8443/TCP 11m service/jaeger-operator-webhook-service ClusterIP 172.20.224.23 <none> 443/TCP 11m service/jaeger-query LoadBalancer 172.20.74.114 a567a8de8fd5149409c7edeb54bd39ef-365075103.us-west-2.elb.amazonaws.com 80:32406/TCP 16s service/zipkin ClusterIP 172.20.61.72 <none> 9411/TCP 18s NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE daemonset.apps/jaeger-agent 2 2 2 2 2 <none> 16s NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/jaeger-collector 1/1 1 1 21s deployment.apps/jaeger-operator 1/1 1 1 11m deployment.apps/jaeger-query 1/1 1 1 18s NAME DESIRED CURRENT READY AGE replicaset.apps/jaeger-collector-69d7cd5df9 1 1 1 21s replicaset.apps/jaeger-operator-5597f99c79 1 1 1 11m replicaset.apps/jaeger-query-6c975459b6 1 1 1 18s NAME READY AGE statefulset.apps/elasticsearch 1/1 17m As we can see in the above screenshot, our Jaeger instance is up and running. OpenTelemetry To install the OpenTelemetry, we need to install the OpenTelemetry Operator. The OpenTelemetry Operator uses custom resources and custom controllers to extend the Kubernetes API with OpenTelemetry-specific functionality, making it easier to deploy and manage the OpenTelemetry observability stack in a Kubernetes environment. The operator manages two things: Collectors: It offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. Auto-instrumentation of the workload using OpenTelemetry instrumentation libraries. It does not require the end-user to modify the application source code. OpenTelemetry Operator To implement the auto-instrumentation, we need to deploy the OpenTelemetry operator on our Kubernetes cluster. To deploy the k8s operator for OpenTelemetry, follow the K8s operator documentation. You can verify the deployment of the OpenTelemetry operator by running the below-mentioned command: Shell $ kubectl get all -n opentelemetry-operator-system NAME READY STATUS RESTARTS AGE pod/opentelemetry-operator-controller-manager-7f479c786d-zzfd8 2/2 Running 0 30s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/opentelemetry-operator-controller-manager-metrics-service ClusterIP 172.20.70.244 <none> 8443/TCP 32s service/opentelemetry-operator-webhook-service ClusterIP 172.20.150.120 <none> 443/TCP 31s NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/opentelemetry-operator-controller-manager 1/1 1 1 31s NAME DESIRED CURRENT READY AGE replicaset.apps/opentelemetry-operator-controller-manager-7f479c786d 1 1 1 31s As we can see in the above output, the opentelemetry-operator-controller-manager deployment is running in the opentelemetry-operator-system namespace. 
OpenTelemetry Collector OpenTelemetry facilitates the collection of telemetry data via the OpenTelemetry Collector. The collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. The collector is made up of the following components: Receivers: manage how to get data into the collector. Processors: manage the processing of the data. Exporters: responsible for sending the received data onward. We also need to export the telemetry data to the Jaeger instance. Use the following manifest to deploy the collector. YAML kubectl apply -f - <<EOF apiVersion: opentelemetry.io/v1alpha1 kind: OpenTelemetryCollector metadata: name: otel spec: config: | receivers: otlp: protocols: grpc: http: processors: exporters: logging: jaeger: endpoint: "jaeger-collector.observability.svc.cluster.local:14250" tls: insecure: true service: pipelines: traces: receivers: [otlp] processors: [] exporters: [logging, jaeger] EOF In the above code, the Jaeger endpoint is the address of the Jaeger collector service running inside the observability namespace. We need to deploy this manifest in the same namespace where our application is deployed, so that it can fetch the traces from the application and export them to Jaeger. To verify the deployment of the collector, run the following command. Shell $ kubectl get deploy otel-collector NAME READY UP-TO-DATE AVAILABLE AGE otel-collector 1/1 1 1 41s OpenTelemetry Auto-Instrumentation Injection The above-deployed operator can inject and configure the OpenTelemetry auto-instrumentation libraries into an application's codebase as it runs. To enable auto-instrumentation on our cluster, we need to configure an Instrumentation resource with the configuration for the SDK and instrumentation. Use the below-given manifest to create the auto-instrumentation resource. YAML kubectl apply -f - <<EOF apiVersion: opentelemetry.io/v1alpha1 kind: Instrumentation metadata: name: my-instrumentation spec: exporter: endpoint: http://otel-collector:4317 propagators: - tracecontext - baggage - b3 sampler: type: parentbased_traceidratio argument: "0.25" EOF In the above manifest, we have used three things: an exporter, propagators, and a sampler. Exporter: Used to send data to the OpenTelemetry collector at the specified endpoint. In our scenario, it is "http://otel-collector:4317". Propagators: Carry trace context and baggage data between distributed tracing systems. They have three propagation mechanisms: tracecontext: This refers to the W3C Trace Context specification, which defines a standard way to propagate trace context information between services. baggage: This refers to the OpenTelemetry baggage mechanism, which allows for the propagation of arbitrary key-value pairs along with the trace context information. b3: This refers to the B3 header format, which is a popular trace context propagation format used by the Zipkin tracing system. Sampler: Uses a "parent-based trace ID ratio" strategy with a sample rate of 0.25 (25%). This means that a span follows the sampling decision of its parent when one exists; for new root traces, roughly 25% are sampled and the rest are not recorded. To verify whether our custom resource has been created, we can use the below-mentioned command. Shell $ kubectl get otelinst NAME AGE ENDPOINT SAMPLER SAMPLER ARG my-instrumentation 6s http://otel-collector:4317 parentbased_traceidratio 0.25 This means our custom resource is created successfully. 
We are using the OpenTelemetry auto-instrumented method, so we don’t need to write instrumentation code in our application. All we need to do is, add an annotation in the pod of our application for auto-instrumentation. As we are going to demo a Java application, the annotation that we will have to use here is: Shell instrumentation.opentelemetry.io/inject-java: "true" Note: The annotation can be added to a namespace as well so that all pods within that namespace will get instrumentation, or by adding the annotation to individual PodSpec objects, available as part of Deployment, Statefulset, and other resources. Below is an example of how your manifest will look after adding the annotations. In the below example, we are using annotation for a Java application. YAML apiVersion: apps/v1 kind: Deployment metadata: name: demo-sagar spec: replicas: 1 selector: matchLabels: app: demo-sagar template: metadata: labels: app: demo-sagar annotations: instrumentation.opentelemetry.io/inject-java: "true" instrumentation.opentelemetry.io/container-names: "spring" spec: containers: - name: spring image: sagar27/petclinic-demo ports: - containerPort: 8080 We have added instrumentation “inject-java” and “container-name” under annotations. If you have multiple container pods, you can add them in the same “container-names” annotation, separated by a comma. For example, “container-name1,container-name-2,container-name-3” etc. After adding the annotations, deploy your application and access it on the browser. Here in our scenario, we are using port-forward to access the application. Shell $ kubectl port-forward service/demo-sagar 8080:8080 To generate traces, either you can navigate through all the pages of this website or you can use the following Bash script: Shell while true; do curl http://localhost:8080/ curl http://localhost:8080/owners/find curl http://localhost:8080/owners?lastName= curl http://localhost:8080/vets.html curl http://localhost:8080/oups curl http://localhost:8080/oups sleep 0.01 done The above-given script will make a curl request to all the pages of the website, and we will see the traces of the request on the Jaeger UI. We are making curl requests to https://localhost:8080 because we use the port-forwarding technique to access the application. You can make changes in the Bash script according to your scenario. Now let’s access the Jaeger UI, as our service jaeger-query uses service type LoadBalancer, we can access the Jaeger UI on the browser by using the load balancer domain/IP. Paste the load balancer domain/IP on the browser and you will see the Jaeger UI there. We have to select our app from the service list and it will show us the traces it generates. In the above screenshot, we have selected our app name “demo-sagar” under the services option and its traces are visible on Jaeger UI. We can further click on the traces to get more details about it. Summary In this article, we have gone through how you can easily instrument your application using the OpenTelemetry auto-instrumentation method. We also learned how this telemetric data could be exported to the Elasticsearch backend and visualized it using Jaeger. Integrating OpenTelemetry with Jaeger will help you in monitoring and troubleshooting. It also helps perform root cause analysis of any bug/issues in your microservice-based distributed systems, performance/latency optimization, service dependency analysis, and so on. We hope you found this post informative and engaging. References OpenTelemetry Jaeger Tracing
The AngularPortfolioMgr project can import the SEC filings of listed companies. The importer class is the FileClientBean and imports the JSON archive from “Kaggle.” The data is provided by year, symbol, and period. Each JSON data set has keys (called concepts) and values with the USD value. For example, IBM’s full-year revenue in 2020 was $13,456. This makes two kinds of searches possible. A search for company data and a search for keys (concepts) over all entries. The components below “Company Query” select the company value year with operators like “=,” “>=,” and “<=” (values less than 1800 are ignored). The symbol search is implemented with an angular autocomplete component that queries the backend for matching symbols. The quarters are in a select component of the available periods. The components below “Available Sec Query Items” provide the Drag’n Drop component container with the items that can be dragged down into the query container. “Term Start” is a mathematical term that means “bracket open” as a logical operator. The term “end” comes from mathematics and refers to a closed bracket. The query item is a query clause of the key (concept). The components below “Sec Query Items” are the search terms in the query. The query components contain the query parameters for the concept and value with their operators for the query term. The terms are created with the bracket open/close wrapper to prefix collections of queries with “and,” and “or,” or “or not,” and “not or” operators. The query parameters and the term structure are checked with a reactive Angular form that enables the search button if they are valid. Creating the Form and the Company Query The create-query.ts class contains the setup for the query: TypeScript @Component({ selector: "app-create-query", templateUrl: "./create-query.component.html", styleUrls: ["./create-query.component.scss"], }) export class CreateQueryComponent implements OnInit, OnDestroy { private subscriptions: Subscription[] = []; private readonly availableInit: MyItem[] = [ ... ]; protected readonly availableItemParams = { ... } as ItemParams; protected readonly queryItemParams = { ... } as ItemParams; protected availableItems: MyItem[] = []; protected queryItems: MyItem[] = [ ... 
]; protected queryForm: FormGroup; protected yearOperators: string[] = []; protected quarterQueryItems: string[] = []; protected symbols: Symbol[] = []; protected FormFields = FormFields; protected formStatus = ''; @Output() symbolFinancials = new EventEmitter<SymbolFinancials[]>(); @Output() financialElements = new EventEmitter<FinancialElementExt[]>(); @Output() showSpinner = new EventEmitter<boolean>(); constructor( private fb: FormBuilder, private symbolService: SymbolService, private configService: ConfigService, private financialDataService: FinancialDataService ) { this.queryForm = fb.group( { [FormFields.YearOperator]: "", [FormFields.Year]: [0, Validators.pattern("^\\d*$")], [FormFields.Symbol]: "", [FormFields.Quarter]: [""], [FormFields.QueryItems]: fb.array([]), } , { validators: [this.validateItemTypes()] } ); this.queryItemParams.formArray = this.queryForm.controls[ FormFields.QueryItems ] as FormArray; //delay(0) fixes "NG0100: Expression has changed after it was checked" exception this.queryForm.statusChanges.pipe(delay(0)).subscribe(result => this.formStatus = result); } ngOnInit(): void { this.symbolFinancials.emit([]); this.financialElements.emit([]); this.availableInit.forEach((myItem) => this.availableItems.push(myItem)); this.subscriptions.push( this.queryForm.controls[FormFields.Symbol].valueChanges .pipe( debounceTime(200), switchMap((myValue) => this.symbolService.getSymbolBySymbol(myValue)) ) .subscribe((myValue) => (this.symbols = myValue)) ); this.subscriptions.push( this.configService.getNumberOperators().subscribe((values) => { this.yearOperators = values; this.queryForm.controls[FormFields.YearOperator].patchValue( values.filter((myValue) => myValue === "=")[0] ); }) ); this.subscriptions.push( this.financialDataService .getQuarters() .subscribe( (values) => (this.quarterQueryItems = values.map((myValue) => myValue.quarter)) ) ); } First, there are the arrays for the RxJs subscriptions and the available and query items for Drag’n Drop. The *ItemParams contain the default parameters for the items. The yearOperators and the quarterQueryItems contain the drop-down values. The “symbols” array is updated with values when the user types in characters (in the symbol) autocomplete. The FormFields are an enum with key strings for the local form group. The @Output() EventEmitter provides the search results and activate or deactivate the spinner. The constructor gets the needed services and the FormBuilder injected and then creates the FormGroup with the FormControls and the FormFields. The QueryItems FormArray supports the nested forms in the components of the queryItems array. The validateItemTypes() validator for the term structure validation is added, and the initial parameter is added. At the end, the form status changes are subscribed with delay(0) to update the formStatus property. The ngOnInit() method initializes the available items for Drag’n Drop. The value changes of the symbol autocomplete are subscribed to request the matching symbols from the backend and update the “symbols” property. The numberOperators and the “quarters” are requested off the backend to update the arrays with the selectable values. They are requested off the backend because that enables the backend to add new operators or new periods without changing the frontend. 
The template looks like this: HTML <div class="container"> <form [formGroup]="queryForm" novalidate> <div> <div class="search-header"> <h2 i18n="@@createQueryCompanyQuery">Company Query</h2> <button mat-raised-button color="primary" [disabled]="!formStatus || formStatus.toLowerCase() != 'valid'" (click)="search()" i18n="@@search" > Search </button> </div> <div class="symbol-financials-container"> <mat-form-field> <mat-label i18n="@@operator">Operator</mat-label> <mat-select [formControlName]="FormFields.YearOperator" name="YearOperator" > <mat-option *ngFor="let item of yearOperators" [value]="item">{{ item }}</mat-option> </mat-select> </mat-form-field> <mat-form-field class="form-field"> <mat-label i18n="@@year">Year</mat-label> <input matInput type="text" formControlName="{{ FormFields.Year }}" /> </mat-form-field> </div> <div class="symbol-financials-container"> <mat-form-field class="form-field"> <mat-label i18n="@@createQuerySymbol">Symbol</mat-label> <input matInput type="text" [matAutocomplete]="autoSymbol" formControlName="{{ FormFields.Symbol }}" i18n-placeholder="@@phSymbol" placeholder="symbol" /> <mat-autocomplete #autoSymbol="matAutocomplete" autoActiveFirstOption> <mat-option *ngFor="let symbol of symbols" [value]="symbol.symbol"> {{ symbol.symbol }} </mat-option> </mat-autocomplete> </mat-form-field> <mat-form-field class="form-field"> <mat-label i18n="@@quarter">Quarter</mat-label> <mat-select [formControlName]="FormFields.Quarter" name="Quarter" multiple > <mat-option *ngFor="let item of quarterQueryItems" [value]="item">{{ item }}</mat-option> </mat-select> </mat-form-field> </div> </div> ... </div> First, the form gets connected to the formgroup queryForm of the component. Then the search button gets created; it is disabled if the component property formStatus, which is updated by the formgroup, is not “valid.” Next, the two <mat-form-field> elements are created for the selection of the year operator and the year. The options for the operator are provided by the yearOperators property. The input for the year is of type “text,” but the reactive form has a regex validator that accepts only digits. Then, the symbol autocomplete is created, where the “symbols” property provides the returned options. The #autoSymbol template variable connects the input's matAutocomplete property with the options. The quarter select component gets its values from the quarterQueryItems property and supports multiple selection via checkboxes. Drag’n Drop Structure The template of the cdkDropListGroup looks like this: HTML <div cdkDropListGroup> <div class="query-container"> <h2 i18n="@@createQueryAvailableSecQueryItems"> Available Sec Query Items </h2> <h3 i18n="@@createQueryAddQueryItems"> To add a Query Item. Drag it down. </h3> <div cdkDropList [cdkDropListData]="availableItems" class="query-list" (cdkDropListDropped)="drop($event)"> <app-query *ngFor="let item of availableItems" cdkDrag [queryItemType]="item.queryItemType" [baseFormArray]="availableItemParams.formArray" [formArrayIndex]="availableItemParams.formArrayIndex" [showType]="availableItemParams.showType"></app-query> </div> </div> <div class="query-container"> <h2 i18n="@@createQuerySecQueryItems">Sec Query Items</h2> <h3 i18n="@@createQueryRemoveQueryItems"> To remove a Query Item. Drag it up. 
</h3> <div cdkDropList [cdkDropListData]="queryItems" class="query-list" (cdkDropListDropped)="drop($event)"> <app-query class="query-item" *ngFor="let item of queryItems; let i = index" cdkDrag [queryItemType]="item.queryItemType" [baseFormArray]="queryItemParams.formArray" [formArrayIndex]="i" (removeItem)="removeItem($event)" [showType]="queryItemParams.showType" ></app-query> </div> </div> </div> The cdkDropListGroup div contains the two cdkDropList divs. The items can be dragged and dropped between the droplists availableItems and queryItems and, on dropping, the method drop($event) is called. The droplist divs contain <app-query> components. The search functions of “term start,” “term end,” and “query item type” are provided by angular components. The baseFormarray is a reference to the parent formgroup array, and formArrayIndex is the index where you insert the new subformgroup. The removeItem event emitter provides the query component index that needs to be removed to the removeItem($event) method. If the component is in the queryItems array, the showType attribute turns on the search elements of the components (querItemdParams default configuration). The drop(...) method manages the item transfer between the cdkDropList divs: TypeScript drop(event: CdkDragDrop<MyItem[]>) { if (event.previousContainer === event.container) { moveItemInArray( event.container.data, event.previousIndex, event.currentIndex ); const myFormArrayItem = this.queryForm[ FormFields.QueryItems ].value.splice(event.previousIndex, 1)[0]; this.queryForm[FormFields.QueryItems].value.splice( event.currentIndex, 0, myFormArrayItem ); } else { transferArrayItem( event.previousContainer.data, event.container.data, event.previousIndex, event.currentIndex ); //console.log(event.container.data === this.todo); while (this.availableItems.length > 0) { this.availableItems.pop(); } this.availableInit.forEach((myItem) => this.availableItems.push(myItem)); } } First, the method checks if the event.container has been moved inside the container. That is handled by the Angular Components function moveItemInArray(...) and the fromgrouparray entries are updated. A transfer between cdkDropList divs is managed by the Angular Components function transferArrayItem(...). The availableItems are always reset to their initial content and show one item of each queryItemType. The adding and removing of subformgroups from the formgroup array is managed in the query component. Query Component The template of the query component contains the <mat-form-fields> for the queryItemType. They are implemented in the same manner as the create-query template. 
The component looks like this: TypeScript @Component({ selector: "app-query", templateUrl: "./query.component.html", styleUrls: ["./query.component.scss"], }) export class QueryComponent implements OnInit, OnDestroy { protected readonly containsOperator = "*=*"; @Input() public baseFormArray: FormArray; @Input() public formArrayIndex: number; @Input() public queryItemType: ItemType; @Output() public removeItem = new EventEmitter<number>(); private _showType: boolean; protected termQueryItems: string[] = []; protected stringQueryItems: string[] = []; protected numberQueryItems: string[] = []; protected concepts: FeConcept[] = []; protected QueryFormFields = QueryFormFields; protected itemFormGroup: FormGroup; protected ItemType = ItemType; private subscriptions: Subscription[] = []; constructor( private fb: FormBuilder, private configService: ConfigService, private financialDataService: FinancialDataService ) { this.itemFormGroup = fb.group( { [QueryFormFields.QueryOperator]: "", [QueryFormFields.ConceptOperator]: "", [QueryFormFields.Concept]: ["", [Validators.required]], [QueryFormFields.NumberOperator]: "", [QueryFormFields.NumberValue]: [ 0, [ Validators.required, Validators.pattern("^[+-]?(\\d+[\\,\\.])*\\d+$"), ], ], [QueryFormFields.ItemType]: ItemType.Query, } ); } This is the QueryComponent with the baseFormArray of the parent to add the itemFormGroup at the formArrayIndex. The queryItemType switches the query elements on or off. The removeItem event emitter provides the index of the component to remove from the parent component. The termQueryItems, stringQueryItems, and numberQueryItems are the select options of their components. The feConcepts are the autocomplete options for the concept. The constructor gets the FromBuilder and the needed services injected. The itemFormGroup of the component is created with the formbuilder. The QueryFormFields.Concept and the QueryFormFields.NumberValue get their validators. Query Component Init The component initialization looks like this: TypeScript ngOnInit(): void { this.subscriptions.push( this.itemFormGroup.controls[QueryFormFields.Concept].valueChanges .pipe(debounceTime(200)) .subscribe((myValue) => this.financialDataService .getConcepts() .subscribe( (myConceptList) => (this.concepts = myConceptList.filter((myConcept) => FinancialsDataUtils.compareStrings( myConcept.concept, myValue, this.itemFormGroup.controls[QueryFormFields.ConceptOperator] .value ) )) ) ) ); this.itemFormGroup.controls[QueryFormFields.ItemType].patchValue( this.queryItemType ); if ( this.queryItemType === ItemType.TermStart || this.queryItemType === ItemType.TermEnd ) { this.itemFormGroup.controls[QueryFormFields.ConceptOperator].patchValue( this.containsOperator ); ... } //make service caching work if (this.formArrayIndex === 0) { this.getOperators(0); } else { this.getOperators(400); } } private getOperators(delayMillis: number): void { setTimeout(() => { ... this.subscriptions.push( this.configService.getStringOperators().subscribe((values) => { this.stringQueryItems = values; this.itemFormGroup.controls[ QueryFormFields.ConceptOperator ].patchValue( values.filter((myValue) => this.containsOperator === myValue)[0] ); }) ); ... }, delayMillis); } First, the QueryFormFields.Concept form control value changes are subscribed to request (with a debounce) the matching concepts from the backend service. The results are filtered with compareStrings(...) and QueryFormFields.ConceptOperator (default is “contains”). 
Then, it is checked whether the queryItemType is TermStart or TermEnd to set default values in their form controls. Then, the getOperators(...) method is called to get the operator values from the backend service. The backend services cache the operator values so that they are loaded only once, and the cache is used after that. The first array entry requests the values from the backend, and the other entries wait 400 ms for the responses and then use the cache. The getOperators(...) method uses setTimeout(...) for the requested delay. Then, the ConfigService method getStringOperators() is called, and the subscription is pushed onto the subscriptions array. The results are put into the stringQueryItems property for the select options. The result value that matches the containsOperator constant is patched into the operator form control as the default value. All operator values are requested concurrently.

Query Component Type Switch

If the component is dropped into a new droplist, the form array entry needs an update. That is done in the showType(…) setter:

TypeScript
@Input()
set showType(showType: boolean) {
  this._showType = showType;
  if (!this.showType) {
    const formIndex =
      this?.baseFormArray?.controls?.findIndex(
        (myControl) => myControl === this.itemFormGroup
      ) || -1;
    if (formIndex >= 0) {
      this.baseFormArray.insert(this.formArrayIndex, this.itemFormGroup);
    }
  } else {
    const formIndex =
      this?.baseFormArray?.controls?.findIndex(
        (myControl) => myControl === this.itemFormGroup
      ) || -1;
    if (formIndex >= 0) {
      this.baseFormArray.removeAt(formIndex);
    }
  }
}

If the item has been added to the queryItems, the showType(…) setter sets the property and adds the itemFormGroup to the baseFormArray. The setter removes the itemFormGroup from the baseFormArray if the item has been removed from the queryItems.

Creating the Search Request

To create a search request, the search() method is used:

TypeScript
public search(): void {
  // Build the query parameters from the form values.
  const symbolFinancialsParams = {
    yearFilter: {
      operation: this.queryForm.controls[FormFields.YearOperator].value,
      value: !this.queryForm.controls[FormFields.Year].value
        ? 0
        : parseInt(this.queryForm.controls[FormFields.Year].value),
    } as FilterNumber,
    quarters: !this.queryForm.controls[FormFields.Quarter].value
      ? []
      : this.queryForm.controls[FormFields.Quarter].value,
    symbol: this.queryForm.controls[FormFields.Symbol].value,
    financialElementParams: !!this.queryForm.controls[FormFields.QueryItems]
      ?.value?.length
      ? this.queryForm.controls[FormFields.QueryItems].value.map(
          (myFormGroup) => this.createFinancialElementParam(myFormGroup)
        )
      : [],
  } as SymbolFinancialsQueryParams;
  this.showSpinner.emit(true);
  this.financialDataService
    .postSymbolFinancialsParam(symbolFinancialsParams)
    .subscribe((result) => {
      this.processQueryResult(result, symbolFinancialsParams);
      this.showSpinner.emit(false);
    });
}

private createFinancialElementParam(
  formGroup: FormGroup
): FinancialElementParams {
  return {
    conceptFilter: {
      operation: formGroup[QueryFormFields.ConceptOperator],
      value: formGroup[QueryFormFields.Concept],
    },
    valueFilter: {
      operation: formGroup[QueryFormFields.NumberOperator],
      value: formGroup[QueryFormFields.NumberValue],
    },
    operation: formGroup[QueryFormFields.QueryOperator],
    termType: formGroup[QueryFormFields.ItemType],
  } as FinancialElementParams;
}

The symbolFinancialsParams object is created from the values of the queryForm FormGroup, or the default values are set.
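For orientation, the shape of the request objects can be inferred from the code above. The following is only a sketch; the real type definitions in the project may differ in naming, optionality, and element types:

TypeScript
// Interfaces inferred from the search() code above; the real definitions may differ.
export interface FilterNumber {
  operation: string;
  value: number;
}

export interface FinancialElementParams {
  conceptFilter: { operation: string; value: string };
  valueFilter: { operation: string; value: number };
  operation: string;
  termType: ItemType; // the project's ItemType enum (Query, TermStart, TermEnd, ...)
}

export interface SymbolFinancialsQueryParams {
  yearFilter: FilterNumber;
  quarters: string[]; // element type assumed
  symbol: string;
  financialElementParams: FinancialElementParams[];
}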
The FormFields.QueryItems FormArray is mapped with the createFinancialElementParam(...) method. The createFinancialElementParam(...) method creates the conceptFilter and valueFilter objects with their operations and values for filtering. The operation and termType are set in the financial element parameter object, too. Then, the financialDataService.postSymbolFinancialsParam(...) method posts the object to the server and subscribes to the result. During the latency of the request, the spinner of the parent component is shown.

Conclusion

The Angular Components library support for drag and drop is very good. That makes the implementation much easier. The reactive forms of Angular enable flexible form validation that includes subcomponents with their own FormGroups. The custom validation functions allow the logical structure of the terms to be checked. Due to the features of the Angular framework and the Angular Components library, the implementation needed surprisingly little code.