Traditional internal developer platforms (IDPs) have transformed how organizations manage code and infrastructure. By standardizing workflows through tools like CI/CD pipelines and Infrastructure as Code (IaC), these platforms have enabled rapid deployments, reduced manual errors, and improved developer experience. However, their focus has primarily been on operational efficiency, often treating data as an afterthought. This omission becomes critical in today's AI-driven landscape. While traditional IDPs excel at managing infrastructure, they fall short when it comes to the foundational elements required for scalable and compliant AI innovation:

Governance: Ensuring data complies with policies and regulatory standards is often a manual or siloed effort.
Traceability: Tracking data lineage and transformations across workflows is inconsistent, if not entirely missing.
Quality: Validating data for reliability and AI readiness lacks automation and standardization.

To meet these challenges, data must be elevated to a first-class citizen within the IDP. A data-first IDP goes beyond IaC, directly embedding governance, traceability, quality, and Policy as Code (PaC) into the platform's core. This approach transforms traditional automation into a comprehensive framework that operationalizes data workflows alongside infrastructure, enabling Data Products as Code (DPaC). This architecture supports frameworks like the Open Data Product Specification (ODPS) and the Open Data Contract (ODC), which standardize how data products are defined and consumed. While resource identifiers (RIDs) are critical in enabling traceability and interoperability, the heart of the data-first IDP lies in meta-metadata, which provides the structure, rules, and context necessary for scalable and compliant data ecosystems.

The Data-First Approach: Extending Automation

Templates and recipes are critical technologies that enable the IDP to achieve a high level of abstraction and componentize the system landscape. A recipe is a parameterized configuration (IaC) that defines how specific resources or workloads are provisioned, deployed, or managed within the platform. Recipes are customized and reusable to fit particular contexts or environments, ensuring standardization while allowing flexibility for specific use cases. A template is a group of recipes forming a "Golden Path" for developers. Given an architectural design pattern, such as a data ingestion pattern for Streaming, API, or File, the template creates a manifest, which is built, validated, and executed in the delivery plane. A data-first IDP adds the "Data Product" specification as a component, a resource, and, therefore, a recipe to the IDP; this could be a parameterized version of the ODPS and ODC. A minimal sketch of such a recipe follows below.

The lifecycle and management of software are far more mature than that of data. The concept of DPaC goes a long way toward changing this; it aligns the maturity of data management with the well-established principles of software engineering. DPaC transforms data management by treating data as a programmable, enforceable asset, aligning its lifecycle with proven software development practices. By bridging the maturity gap between data and software, DPaC empowers organizations to scale data-driven operations with confidence, governance, and agility. As IaC revolutionized infrastructure, DPaC is poised to redefine how we manage and trust our data.
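To make the recipe idea concrete, here is a minimal, hypothetical sketch of a Data Products as Code definition expressed in Python. The class, field names, and manifest keys are illustrative assumptions loosely inspired by the spirit of ODPS/ODC, not part of any actual platform or specification.

Python

from dataclasses import dataclass, field

# Hypothetical, illustrative DPaC recipe -- field names are assumptions,
# not taken from the ODPS or ODC specifications.
@dataclass
class DataProductRecipe:
    name: str
    version: str
    owner: str
    ingestion_pattern: str                      # e.g., "Streaming", "API", or "File"
    quality_checks: list = field(default_factory=list)
    governance_policies: list = field(default_factory=list)

    def to_manifest(self) -> dict:
        """Render the parameterized recipe into a manifest the delivery plane could build and validate."""
        return {
            "apiVersion": "dataproduct/v1",     # illustrative value
            "kind": "DataProduct",
            "metadata": {"name": self.name, "version": self.version, "owner": self.owner},
            "spec": {
                "ingestion": self.ingestion_pattern,
                "quality": self.quality_checks,
                "policies": self.governance_policies,
            },
        }

# Example: a customer-transactions data product provisioned via the Streaming golden path
recipe = DataProductRecipe(
    name="customer-transactions",
    version="v1.0",
    owner="sales-domain",
    ingestion_pattern="Streaming",
    quality_checks=["not_null:transaction_id", "freshness:24h"],
    governance_policies=["pii-masking", "retention-7y"],
)
print(recipe.to_manifest())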
The Data Marketplace, discussed in the previous article, is a component, a resource, and a recipe, which may rely on other services such as observability, a data quality service, and a graph database; these are also components and part of the CI/CD pipeline.

Governance and Engineering Baseline

Governance and engineering baselines can be codified into policies that are managed, versioned, and enforced programmatically through PaC. By embedding governance rules and engineering standards into machine-readable formats (e.g., YAML, JSON, Rego), compliance is automated and consistency is enforced across resources.

Governance policies: Governance rules define compliance requirements, access controls, data masking, retention policies, and more. These ensure that organizational and regulatory standards are consistently applied.
Engineering baselines: Baselines establish the minimum technical standards for infrastructure, applications, and data workflows, such as resource configurations, pipeline validation steps, and security protocols.

The Role of RIDs

While meta-metadata drives the data-first IDP, RIDs operationalize its principles by providing unique references for all data-related resources. RIDs ensure the architecture supports traceability, quality, and governance across the ecosystem.

Facilitating lineage: RIDs are unique references for data products, storage, and compute resources, allowing external tools to trace dependencies and transformations.
Simplifying observability: RIDs allow objects to be tracked across the landscape.

Example RID Format

rid:<context>:<resource-type>:<resource-name>:<version>

Data product RID: rid:customer-transactions:data-product:erp-a:v1.0
Storage RID: rid:customer-transactions:storage:s3-bucket-a:v1.0

Centralized Management and Federated Responsibility With Community Collaboration

A data-first IDP balances centralized management, federated responsibility, and community collaboration to create a scalable, adaptable, and compliant platform. Centralized governance provides the foundation for consistency and control, while federated responsibility empowers domain teams to innovate and take ownership of their data products. Integrating a community-driven approach results in a dynamically evolving framework that meets real-world needs, leveraging collective expertise to refine policies, templates, and recipes.

Centralized Management: A Foundation for Consistency

Centralized governance defines global standards, such as compliance, security, and quality rules, and manages critical infrastructure like unique RIDs and metadata catalogs. This layer provides the tools and frameworks that enable decentralized execution.

Standardized Policies

Global policies are codified using PaC and integrated into workflows for automated enforcement.

Federated Responsibility: Shift-Left Empowerment

Responsibility and accountability are delegated to domain teams, enabling them to customize templates, define recipes, and manage data products closer to their sources.
This shift-left approach ensures compliance and quality are applied early in the lifecycle while maintaining flexibility:

Self-service workflows: Domain teams use self-service tools to configure resources, with policies applied automatically in the background.
Customization within guardrails: Teams can adapt central templates and policies to fit their context, such as extending governance rules for domain-specific requirements.
Real-time validation: Automated feedback ensures non-compliance is flagged early, reducing errors and fostering accountability.

Community Collaboration: Dynamic and Adaptive Governance

The environment encourages collaboration to evolve policies, templates, and recipes based on real-world needs and insights. This decentralized innovation layer ensures the platform remains relevant and adaptable:

Contributions and feedback: Domain teams contribute new recipes or propose policy improvements through version-controlled repositories or pull requests.
Iterative improvement: Cross-domain communities review and refine contributions, ensuring alignment with organizational goals.
Recognition and incentives: Teams are incentivized to share best practices and reusable artifacts, fostering a culture of collaboration.

Automation as the Enabler

Automation ensures that governance and standards are consistently applied across the platform, preventing deviation over time. Policies and RIDs are managed programmatically, enabling:

Compliance at scale: New policies are integrated seamlessly, validated early, and enforced without manual intervention.
Measurable outcomes

Extending the Orchestration and Adding the Governance Engine

A data-first IDP extends the orchestration engine to automate data-centric workflows and introduces a governance engine to enforce compliance and maintain standards dynamically.

Orchestration Enhancements

Policy integration: Validates governance rules (PaC) during workflows, blocking non-compliant deployments.
Resource awareness: Uses RIDs to trace and enforce lineage, quality, and compliance.
Data automation: Automates schema validation, metadata enrichment, and lineage registration.

Governance Engine

Centralized policies: Defines compliance rules as PaC and applies them automatically.
Dynamic enforcement: Monitors and remediates non-compliance, preventing drift from standards.
Real-time feedback: Provides developers with actionable insights during deployment.

Together, these engines ensure proactive compliance, scalability, and developer empowerment by embedding governance into workflows, automating traceability, and maintaining standards over time.

The Business Impact

Governance at scale: Meta-metadata and ODC ensure compliance rules are embedded and enforced across all data products.
Improved productivity: Golden paths reduce cognitive load, allowing developers to deliver faster without compromising quality or compliance.
Trust and transparency: ODPS and RIDs ensure that data products are traceable and reliable, fostering stakeholder trust.
AI-ready ecosystems: The framework enables reliable AI model training and operationalization by reducing data prep and commoditizing data with all the information that adds value and resilience to the solution.

The success of a data-first IDP hinges on meta-metadata, which provides the foundation for governance, quality, and traceability.
Supported by frameworks like ODPS and ODC and operationalized through RIDs, this architecture reduces complexity for developers while meeting the business's needs for scalable, compliant data ecosystems. The data-first IDP is ready to power the next generation of AI-driven innovation by embedding smart abstractions and modularity.
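As a closing illustration of the RID format described above (rid:<context>:<resource-type>:<resource-name>:<version>), here is a minimal sketch, in Python, of how a platform tool might parse and validate identifiers of that shape. The helper and its validation rules are assumptions for illustration only, not part of ODPS, ODC, or any particular IDP.

Python

import re
from typing import NamedTuple

# Hypothetical helper: parses the RID shape shown earlier in this article.
# rid:<context>:<resource-type>:<resource-name>:<version>
RID_PATTERN = re.compile(
    r"^rid:(?P<context>[a-z0-9-]+):(?P<resource_type>[a-z0-9-]+)"
    r":(?P<resource_name>[a-z0-9-]+):(?P<version>v[0-9]+\.[0-9]+)$"
)

class RID(NamedTuple):
    context: str
    resource_type: str
    resource_name: str
    version: str

def parse_rid(rid: str) -> RID:
    """Split an RID string into its components, raising if the shape is wrong."""
    match = RID_PATTERN.match(rid)
    if not match:
        raise ValueError(f"Not a valid RID: {rid}")
    return RID(**match.groupdict())

# The two example RIDs from the article parse cleanly:
print(parse_rid("rid:customer-transactions:data-product:erp-a:v1.0"))
print(parse_rid("rid:customer-transactions:storage:s3-bucket-a:v1.0"))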
When creating a new app or service, what begins as learning just one new tool can quickly turn into needing a whole set of tools and frameworks. For Python devs, jumping into HTML, CSS, and JavaScript to build a usable app can be daunting. For web devs, many Python-first backend tools work in JavaScript but are often outdated. You’re left with a choice: Stick with JavaScript or switch to Python for access to the latest features. FastHTML bridges the gap between these two groups. For Python devs, it makes creating a web app straightforward — no JavaScript required! For web devs, it makes creating a Python app quick and easy, with the option to extend using JavaScript — you’re not locked in. As a web developer, I’m always looking for ways to make Python dev more accessible. So, let’s see how quickly we can build and deploy a FastHTML app. I’ll follow the image generation tutorial and then deploy it to Heroku. Let’s go! Intro to FastHTML Never heard of FastHTML before? Here’s how FastHTML describes itself: FastHTML is a new next-generation web framework for fast, scalable web applications with minimal, compact code. It’s designed to be: Powerful and expressive enough to build the most advanced, interactive web apps you can imagine.Fast and lightweight, so you can write less code and get more done.Easy to learn and use, with a simple, intuitive syntax that makes it easy to build complex apps quickly. FastHTML promises to enable you to generate usable, lightweight apps quickly. Too many web apps are bloated and heavy, requiring a lot of processing and bandwidth for simple tasks. Most web apps just need something simple, beautiful, and easy to use. FastHTML aims to make that task easy. You may have heard of FastAPI, designed to make creating APIs with Python a breeze. FastHTML is inspired by FastAPI’s philosophy, seeking to do the same for front-end applications. Opinionated About Simplicity and Ease of Use Part of the FastHTML vision is to “make it the easiest way to create quick prototypes, and also the easiest way to create scalable, powerful, rich applications.” As a developer tool, FastHTML seems to be opinionated about the right things — simplicity and ease of use without limiting you in the future. FastHTML gets you up and running quickly while also making it easy for your users. It does this by selecting key core technologies such as ASGI and HTMX. The 'foundations page' from FastHTML introduces these technologies and gives the basics (though you don’t need to know about these to get started). Get Up and Running Quickly The tutorials from FastHTML offer several examples of different apps, each with its own use case. I was curious about the Image Generation App tutorial and wanted to see how quickly I could get a text-to-image model into a real, working app. The verdict? It was fast. Really fast. In less than 60 lines of code, I created a fully functioning web app where a user can type in a prompt and receive an image from the free Pollinations text-to-image model. Here’s a short demo of the tutorial app: In this tutorial app, I got a brief glimpse of the power of FastHTML. I learned how to: Submit data through a formInteract with external APIsDisplay some loading text while waiting What’s impressive is that it only took one tiny Python file to complete this, and the final app is lightweight and looks good. 
Here's the file I ended up with:

Python

from fastcore.parallel import threaded
from fasthtml.common import *
import os, uvicorn, requests, replicate
from PIL import Image

app = FastHTML(hdrs=(picolink,))

# Store our generations
generations = []
folder = f"gens/"
os.makedirs(folder, exist_ok=True)

# Main page
@app.get("/")
def home():
    inp = Input(id="new-prompt", name="prompt", placeholder="Enter a prompt")
    add = Form(Group(inp, Button("Generate")), hx_post="/", target_id='gen-list', hx_swap="afterbegin")
    gen_list = Div(id='gen-list')
    return Title('Image Generation Demo'), Main(H1('Magic Image Generation'), add, gen_list, cls='container')

# A pending preview keeps polling this route until we return the image preview
def generation_preview(id):
    if os.path.exists(f"gens/{id}.png"):
        return Div(Img(src=f"/gens/{id}.png"), id=f'gen-{id}')
    else:
        return Div("Generating...", id=f'gen-{id}', hx_post=f"/generations/{id}", hx_trigger='every 1s', hx_swap='outerHTML')

@app.post("/generations/{id}")
def get(id:int): return generation_preview(id)

# For images, CSS, etc.
@app.get("/{fname:path}.{ext:static}")
def static(fname:str, ext:str): return FileResponse(f'{fname}.{ext}')

# Generation route
@app.post("/")
def post(prompt:str):
    id = len(generations)
    generate_and_save(prompt, id)
    generations.append(prompt)
    clear_input = Input(id="new-prompt", name="prompt", placeholder="Enter a prompt", hx_swap_oob='true')
    return generation_preview(id), clear_input

# URL (for image generation)
def get_url(prompt):
    return f"https://image.pollinations.ai/prompt/{prompt.replace(' ', '%20')}?model=flux&width=1024&height=1024&seed=42&nologo=true&enhance=true"

@threaded
def generate_and_save(prompt, id):
    full_url = get_url(prompt)
    Image.open(requests.get(full_url, stream=True).raw).save(f"{folder}/{id}.png")
    return True

if __name__ == '__main__':
    uvicorn.run("app:app", host='0.0.0.0', port=int(os.getenv("PORT", default=5000)))

Looking for more functionality? The tutorial continues, adding some CSS styling, user sessions, and even payment tracking with Stripe. While I didn't go through it all the way, the potential is clear: lots of functionality and usability without a lot of boilerplate or using both Python and JavaScript.

Deploy Quickly to Heroku

Okay, so now that I have a pure Python app running locally, what do I need to do to deploy it? Heroku makes this easy. I added a single file called Procfile with just one line in it:

Shell

web: python app.py

This simple text file tells Heroku how to run the app. With the Procfile in place, I can use the Heroku CLI to create and deploy my app. And it's fast… from zero to done in less than 45 seconds. With two commands, I created my project, built it, and deployed it to Heroku. And let's just do a quick check. Did it actually work? And it's up for the world to see!

Conclusion

When I find a new tool that makes it easier and quicker to build an app, my mind starts spinning with the possibilities. If it's that easy, then maybe next time I need to spin up something, I can do it this way and integrate it with this tool and that other thing. So much of programming is assembling the right tools for the job. FastHTML has opened the door to a whole set of Python-based applications for me, and Heroku makes it easy to get those apps off my local machine and into the world. That said, several of the foundations of FastHTML are new to me, and I look forward to understanding them more deeply as I use it more. I hope you have fun with FastHTML and Heroku! Happy coding!
Snowflake Cortex enables seamless integration of Generative AI (GenAI) capabilities within the Snowflake Data Cloud. It allows organizations to use pre-trained large language models (LLMs) and create applications for tasks like content generation, text summarization, sentiment analysis, and conversational AI — all without managing external ML infrastructure.

Prerequisites for Snowflake Cortex Setup

Snowflake Environment
Enterprise Edition or higher is required as a baseline for using advanced features like External Functions and Snowpark.

Cortex Licensing
Specific license: Snowflake Cortex requires an additional license or subscription. Ensure the Cortex license is included as part of your Snowflake agreement.

External Integration and Data Preparation
Set up secure API access to LLMs (e.g., OpenAI or Hugging Face) for embedding and text generation.
Prepare clean data in Snowflake tables and configure networking for secure external function calls.

Key Features of Snowflake Cortex for GenAI

Pre-trained LLMs: Access to pre-trained models for text processing and generation, like OpenAI's GPT models or Snowflake's proprietary embeddings.
Text embeddings: Generate high-dimensional vector embeddings from textual data for semantic search, clustering, and contextual understanding.
Vector support: Native VECTOR data type to store embeddings, perform similarity comparisons, and optimize GenAI applications.
Integration with SQL: Leverage Cortex functions (e.g., EMBEDDINGS, MATCH, MATCH_SCORE) directly in SQL queries.

Use Case: Build a Product FAQ Bot With GenAI

Develop a GenAI-powered bot to answer product-related questions using Snowflake Cortex.

Step 1: Create a Knowledge Base Table

Start by storing your FAQs in Snowflake.

SQL

CREATE OR REPLACE TABLE product_faq (
    faq_id INT,
    question STRING,
    answer STRING,
    question_embedding VECTOR(768)
);

Step 2: Insert FAQ Data

Populate the table with sample questions and answers.

SQL

INSERT INTO product_faq (faq_id, question, answer) VALUES
    (1, 'How do I reset my password?', 'You can reset your password by clicking "Forgot Password" on the login page.'),
    (2, 'What is your return policy?', 'You can return products within 30 days of purchase with a receipt.'),
    (3, 'How do I track my order?', 'Use the tracking link sent to your email after placing an order.');

Step 3: Generate Question Embeddings

Generate vector embeddings for each question using Snowflake Cortex.

SQL

UPDATE product_faq
SET question_embedding = EMBEDDINGS('cortex_default', question);

What this does:
Converts each question into a 768-dimensional vector using Cortex's default LLM.
Stores the vector in the question_embedding column.

Step 4: Query for Answers Using Semantic Search

When a user asks a question, match it to the most relevant FAQ in the database.

SQL

SELECT question, answer,
       MATCH_SCORE(question_embedding, EMBEDDINGS('cortex_default', 'How can I reset my password?')) AS relevance
FROM product_faq
ORDER BY relevance DESC
LIMIT 1;

Explanation:
The user's query ('How can I reset my password?') is converted into a vector.
MATCH_SCORE calculates the similarity between the query vector and the FAQ embeddings.
The query returns the most relevant answer.

Step 5: Automate Text Generation

Use GenAI capabilities to auto-generate answers for queries the FAQ table does not cover.

SQL

SELECT GENERATE_TEXT('cortex_default', 'How do I update my email address?') AS generated_answer;

What this does:
Generates a text response for the query using the cortex_default LLM.
The response can be stored back in the FAQ table for future use.
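If you want to call the FAQ lookup from an application rather than a worksheet, a small Python client is enough. The following is a minimal sketch assuming the snowflake-connector-python package, the default %s parameter binding, and the same EMBEDDINGS/MATCH_SCORE functions used above; the connection parameters are placeholders.

Python

import snowflake.connector

# Placeholder credentials -- replace with your account details.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
    database="your_database",
    schema="your_schema",
)

FAQ_LOOKUP_SQL = """
SELECT question, answer,
       MATCH_SCORE(question_embedding,
                   EMBEDDINGS('cortex_default', %s)) AS relevance
FROM product_faq
ORDER BY relevance DESC
LIMIT 1
"""

def answer_question(user_question: str):
    """Return the best-matching FAQ answer for a user question."""
    with conn.cursor() as cur:
        cur.execute(FAQ_LOOKUP_SQL, (user_question,))
        row = cur.fetchone()
    return {"question": row[0], "answer": row[1], "relevance": row[2]} if row else None

print(answer_question("How can I reset my password?"))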
Advanced Use Cases

Document Summarization

Summarize lengthy product manuals or policy documents for quick reference.

SQL

SELECT GENERATE_TEXT('cortex_default', 'Summarize: Return policy allows refunds within 30 days...') AS summary;

Personalized Recommendations

Combine vector embeddings with user preferences to generate personalized product recommendations.

SQL

SELECT product_name,
       MATCH_SCORE(product_embedding, EMBEDDINGS('cortex_default', 'Looking for lightweight gaming laptops')) AS relevance
FROM product_catalog
ORDER BY relevance DESC
LIMIT 3;

Chatbot Integration

Integrate Cortex-powered GenAI into chat applications using frameworks like Streamlit or API connectors.

Best Practices

Optimize Embedding Generation
Use cleaned, concise text to improve embedding quality.
Preprocess input text to remove irrelevant data.

Use VECTOR Indexes
Speed up similarity searches for large datasets:

SQL

CREATE VECTOR INDEX faq_index
USING cortex_default
ON product_faq (question_embedding)

Monitor Model Performance
Track MATCH_SCORE to assess query relevance.
Fine-tune queries or improve data quality for low-confidence results.

Secure Sensitive Data
Limit access to tables and embeddings containing sensitive or proprietary information.

Batch Processing for Scalability
Process embeddings and queries in batches for high-volume use cases (see the sketch at the end of this article).

Benefits of Snowflake Cortex for GenAI

No infrastructure overhead: Use pre-trained LLMs directly within Snowflake without managing external systems.
Seamless integration: Combine GenAI capabilities with Snowflake's data analytics features.
Scalability: Handle millions of embeddings or GenAI tasks with Snowflake's scalable architecture.
Flexibility: Build applications like chatbots, recommendation engines, and content generators.
Cost-effective: Leverage on-demand GenAI capabilities without investing in separate ML infrastructure.

Next Steps

Extend: Add advanced use cases like multi-lingual support or real-time chat interfaces.
Explore: Try other Cortex features like clustering, sentiment analysis, and real-time text generation.
Integrate: Use external tools like Streamlit or Flask to build user-facing applications.

Snowflake Cortex makes it easy to bring the power of GenAI into your data workflows. Whether you're building a chatbot, summarizing text, or creating personalized recommendations, Cortex provides a seamless, scalable platform to achieve your goals.
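The batch-processing recommendation above can be as simple as splitting an embedding backfill into id ranges. Here is a minimal sketch, again assuming the snowflake-connector-python package and the EMBEDDINGS function from the examples; the batch size and connection details are placeholders.

Python

import snowflake.connector

# Placeholder connection -- same assumptions as the earlier client sketch.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_warehouse", database="your_database", schema="your_schema",
)

BATCH_SIZE = 10_000  # illustrative; tune to your warehouse size

def embed_in_batches():
    """Backfill question embeddings in id-range batches instead of one large UPDATE."""
    with conn.cursor() as cur:
        cur.execute("SELECT MIN(faq_id), MAX(faq_id) FROM product_faq")
        lo, hi = cur.fetchone()
        for start in range(lo, hi + 1, BATCH_SIZE):
            cur.execute(
                """
                UPDATE product_faq
                SET question_embedding = EMBEDDINGS('cortex_default', question)
                WHERE faq_id BETWEEN %s AND %s
                  AND question_embedding IS NULL
                """,
                (start, start + BATCH_SIZE - 1),
            )

embed_in_batches()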
GenAI Logic, built on ApiLogicServer, has recently introduced a workflow integration using n8n.io. The tool has over 250 existing integrations, and the developer community supplies prebuilt solutions called templates (over 1,000), including AI integrations to build chatbots. GenAI Logic can build the API transaction framework from a prompt and use natural language rules (and rule suggestions) to help get the user started on a complete system. Eventually, most systems require additional tooling to support features like email, push notifications, payment systems, or integration into corporate data stores. While ApiLogicServer is an existing API platform, writing 250 integration endpoints with all the nuances of security, transformations, logging, and monitoring — not to mention the user interface — would require a huge community effort. ApiLogicServer found the solution in n8n.io (one of many workflow engines on the market). What stands out is that n8n.io offers a community version using a native Node.js solution for local testing (npx n8n) as well as a hosted cloud version.

N8N Workflow

In n8n, you create a "Webhook from ApiLogicServer" node, which exposes a URL that can accept an HTTP GET, POST, PUT, or DELETE, with basic authentication added (user: admin, password: p) to test the webhook. The Convert to JSON block transforms the body (a string) into a JSON object using JavaScript. The Switch block allows routing based on different JSON payloads. The If Inserted block decides whether the Employee was an insert or an update (which is passed in the header). The SendGrid blocks register a SendGrid API key and format an email to send (selecting the email address from the JSON using drag-and-drop). Finally, the Respond to Webhook block returns a status code of 200 to the ApiLogicServer event. Employees, Customers, and Orders are all sent to the same Webhook.

Configuration

There are two parts to the configuration. The first is the installation of the workflow engine n8n.io (either on-premise, Docker, or cloud), and then the creation of the webhook object in the workflow diagram (http://localhost:5678). This will generate a unique name and path that are passed to the ApiLogicServer project in config/config.py; in this example, with simple basic authorization (user/password).

Note: In an ApiLogicServer project's integration/n8n folder, this sample JSON file is available to import this example into your own n8n project!

Webhook Output

ApiLogicServer Logic and Webhook

The real power of this is the ability to add a business logic rule to trigger the webhook, adding some configuration information (n8n server, port, key, and path, plus authorization). The actual rule (after_flush_row_event) is called any time an insert event occurs on an API endpoint. The implementation is simply a call to Python code that posts the payload (e.g., requests.post(url=n8n_webhook_url, json=payload, headers=headers)); see the sketch at the end of this article.

Configuration to call the n8n webhook in config/config.py:

Python

wh_scheme = "http"
wh_server = "localhost"  # or cloud.n8n.io...
wh_port = 5678
wh_endpoint = "webhook-test"  # from n8n Webhook URL
wh_path = "002fa0e8-f7aa-4e04-b4e3-e81aa29c6e69"  # from n8n Webhook URL
token = "YWRtaW46cA=="  # base64 encoding of user/password admin:p

N8N_PRODUCER = {"authorization": f"Basic {token}",
                "n8n_url": f"{wh_scheme}://{wh_server}:{wh_port}/{wh_endpoint}/{wh_path}"}

# Or enter the n8n_url directly:
N8N_PRODUCER = {"authorization": f"Basic {token}",
                "n8n_url": "http://localhost:5678/webhook-test/002fa0e8-f7aa-4e04-b4e3-e81aa29c6e69"}

# N8N_PRODUCER = None  # comment out to enable N8N producer

Call a business rule (after_flush_row_event) on the API entity:

Python

def call_n8n_workflow(row: Employee, old_row: Employee, logic_row: LogicRow):
    """ Webhook Workflow: When Employee is inserted = post to n8n webhook """
    if logic_row.is_inserted():
        status = send_n8n_message(logic_row=logic_row)
        logic_row.log(status)

Rule.after_flush_row_event(on_class=models.Employee, calling=call_n8n_workflow)

Declarative Logic (Rules)

ApiLogicServer is an open-source platform based on the SQLAlchemy ORM and Flask. SQLAlchemy provides a hook (before flush) that allows LogicBank (another open-source tool) to let developers declare "rules." These rules fall into three categories: derivations, constraints, and events. Derivations are similar to spreadsheet rules in that they operate on a selected column (cell): formula, sums, counts, and copy. Constraints operate on the API entity to validate the row and will roll back a multi-table event if the constraint test does not pass. Finally, the events (early, row, commit, and flush) allow the developer to call "user-defined functions" to execute code during the lifecycle of the API entity. The WebGenAI feature (a chatbot to build applications) was trained on these rules to use natural language prompts (this can also be done in the IDE using Copilot). Notice that the rules are declared and unordered. New rules can be added or changed and are not actually processed until a state change of the API or attribute is detected. Further, these rules can impact other API endpoints (e.g., sums, counts, or formula), which in turn can trigger constraints and events. Declarative rules can easily be 40x more concise than code.

Natural language rules generated by WebGenAI:

Use LogicBank to enforce the Check Credit requirement:
1. The Customer's balance is less than the credit limit
2. The Customer's balance is the sum of the Order amount_total where date_shipped is null
3. The Order's amount_total is the sum of the Item amount
4. The Item amount is the quantity * unit_price
5. The Item unit_price is copied from the Product unit_price

These become the following rules in logic/declare_logic.py:

Python

# ApiLogicServer: basic rules - 5 rules vs. 200 lines of code:
# logic design translates directly into rules

Rule.constraint(validate=Customer,
                as_condition=lambda row: row.Balance <= row.CreditLimit,
                error_msg="balance ({round(row.Balance, 2)}) exceeds credit ({round(row.CreditLimit, 2)})")

# adjust iff AmountTotal or ShippedDate or CustomerID changes
Rule.sum(derive=Customer.Balance,
         as_sum_of=Order.AmountTotal,
         where=lambda row: row.ShippedDate is None and row.Ready == True)

# adjust iff Amount or OrderID changes
Rule.sum(derive=Order.AmountTotal, as_sum_of=OrderDetail.Amount)

Rule.formula(derive=OrderDetail.Amount,
             as_expression=lambda row: row.UnitPrice * row.Quantity)

# get Product Price (e.g., on insert, or ProductId change)
Rule.copy(derive=OrderDetail.UnitPrice, from_parent=Product.UnitPrice)

SendGrid Email

N8N has hundreds of integration features that follow the same pattern. Add a node to your diagram and attach the input, configure the settings (here, a SendGrid API key is added), and test to see the output. SendGrid will respond with a messageId (which can be returned to the caller or stored in a database or Google Sheet). Workflows can be downloaded and stored in GitHub or uploaded into the cloud version.

SendGrid input and output (use drag and drop to build the email message)

AI Integration: A Chatbot Example

The community contributes workflow "templates" that anyone can pick up and use in their own workflow. One template has the ability to take documents from S3 and feed them to Pinecone (a vector data store). Then, use the AI block to link this to ChatGPT — the template even provides the code to insert into your webpage to make this a seamless end-to-end chatbot integration. Imagine taking your product documentation in Markdown and trying this out on a new website to help users understand how to chat and get answers to questions.

AI workflow to build a chatbot

Summary

GenAI Logic is the new kid on the block. It combines the power of AI chat, natural language rules, and an API automation framework to instantly deliver running applications. The source is easily downloaded into a local IDE, and the work for the dev team begins. With the API in place, the UI/UX team can use the Ontimize (Angular) framework to "polish" the front end. The developer team can add logic and security to handle the business requirements. Finally, the integration team can build the workflows to meet the business use case requirements.

ApiLogicServer also has a Kafka integration for producers and consumers. This extends real-time workflow integration: ApiLogicServer can produce a Kafka message, and a consumer can start the workflow (and log, track, and retry if needed). N8N provides an integration space that gives ApiLogicServer new tools to meet most system integration needs. I have also tested a Zapier webhook (a cloud-based solution), which works the same way. Try WebGenAI for free to get started building apps and logic from prompts.
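For completeness, here is the sketch referenced earlier: a minimal, hypothetical version of the send_n8n_message helper that the after_flush_row_event handler calls. It assumes the N8N_PRODUCER settings shown in config/config.py and the requests library; the payload shape, column names, and error handling are illustrative, not the platform's actual implementation.

Python

import requests

# Hypothetical helper -- a minimal sketch of posting a row event to the n8n webhook.
# N8N_PRODUCER is assumed to be the dict defined in config/config.py above.
from config.config import N8N_PRODUCER

def send_n8n_message(logic_row) -> str:
    """Post the inserted row to the n8n webhook and return a status string for logging."""
    payload = {
        "entity": logic_row.row.__class__.__name__,  # e.g., "Employee"
        "data": {c: getattr(logic_row.row, c, None) for c in ("Id", "Name", "Email")},  # illustrative columns
    }
    headers = {
        "Authorization": N8N_PRODUCER["authorization"],
        "Content-Type": "application/json",
    }
    response = requests.post(url=N8N_PRODUCER["n8n_url"], json=payload, headers=headers, timeout=10)
    return f"n8n webhook returned {response.status_code}"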
The industry's increasing focus on secure container images is undeniable. Companies like Chainguard — specializing in delivering container images free of CVEs — have demonstrated the demand by recently raising an impressive $140 million at a $1.1 billion valuation. In the open-source ecosystem, Cloud Native Buildpacks, an incubating CNCF project, and its vibrant communities deliver a comparable value proposition by automating the creation of optimized and secure container images. In this article, I'll explore Buildpacks' core concepts, comparing them with Docker to illustrate their functionality and highlight how they provide a community-driven alternative to the value Chainguard brings to container security.

What Are Buildpacks?

Buildpacks automate the process of preparing your application for deployment: detecting dependencies, building runtime artifacts, and packaging everything into a container image. They abstract away the manual effort of building images efficiently. In other words, if Docker allows you to define explicitly how a container is built through a Dockerfile, Buildpacks operate at a higher level of abstraction. They offer opinionated defaults that help developers ship production-ready images quickly.

Comparing a Few Concepts

Buildpacks do containerization differently and more efficiently. For those unfamiliar with the technology, let's review a few key Docker concepts and see how they translate to the Buildpacks world.

Entrypoint and Start Commands

In a Dockerfile, the ENTRYPOINT or CMD specifies the command that runs when the container starts. For example:

Dockerfile

CMD ["java", "-jar", "app.jar"]

Buildpacks abstract this step; you have nothing to do. They automatically detect the appropriate start command for your application based on the runtime and build process. For example, when using a Java Buildpack, the resulting image includes logic to start your application with java -jar app.jar or a similar command. You don't need to configure it explicitly; Buildpacks "just know" how to start applications based on best practices.

Writing a Dockerfile

The concept of not doing anything goes even further; you don't even need to write the equivalent of a Dockerfile. Buildpacks take care of everything needed to containerize your application into an OCI image.

Multi-Stage Builds

That abstraction does not come at the cost of optimization. For example, multi-stage builds are a common technique in Docker to create lean images by separating the build environment from the runtime environment. For instance, you might compile a Java binary in one stage and copy it to a minimal base image in the final stage:

Dockerfile

# Build stage
FROM maven:3.8-openjdk-11 as builder
WORKDIR /app
COPY . .
RUN mvn package

# Runtime stage
FROM openjdk:11
COPY --from=builder /app/target/app.jar /app.jar
CMD ["java", "-jar", "/app.jar"]

Buildpacks handle the equivalent of multi-stage builds behind the scenes. During the build process, they:

Detect your application's dependencies
Build artifacts (e.g., compiled binaries for Java)
Create a final image with only the necessary runtime components

This is, again, done automatically, requiring no explicit configuration.

About Security

Let's jump into the security part and explore a few ways that the Buildpacks ecosystem can be seen as an OSS alternative to Chainguard.

Non-Root Containers

Running containers as non-root users is a best practice to improve security. In Dockerfiles, this typically involves creating a new user and configuring permissions.
Buildpacks enforce non-root execution by default. The resulting container image is configured to run as an unprivileged user, with no extra effort required from the developer.

CVEs

Security is a significant focus for open-source Buildpacks communities like Paketo Buildpacks and Google Cloud. What these communities offer can be seen as the open-source alternative to Chainguard. By default, Buildpacks use pre-configured, community-maintained base images that are regularly updated to eliminate known vulnerabilities (CVEs). For example, Paketo Buildpacks stacks (build image and run image) are rebuilt whenever a package is patched to fix a CVE, and every stack is rebuilt weekly to ensure packages without CVEs are also kept up to date. The community releases stack updates that fix high and critical CVEs within 48 hours of the patch release, and within two weeks for low and medium CVEs.

SBOM

Buildpacks can provide an SBOM to describe the dependencies that they provide, and they support three ways to report SBOM data: CycloneDX, SPDX, or Syft. Paketo Buildpacks also uses SBOM generation to provide a detailed record of all dependencies in the images it provides, making it easier to track and audit components for vulnerabilities. A short sketch of consuming such an SBOM follows this section.

A Solid OSS Chainguard Alternative

Buildpacks offer a simple, secure, and standardized way to create production-ready container images, making them a potential cornerstone of a platform engineering strategy. By automating tasks like dependency management, non-root execution, and security updates, Buildpacks provide a community-driven alternative to commercial security solutions like Chainguard. For teams looking to streamline workflows and enhance container security without the complexity of Dockerfiles and the cost and limitations of Chainguard, Buildpacks can be a solid starting point.
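To show what working with one of those SBOM formats can look like in practice, here is a minimal Python sketch that lists the components recorded in a CycloneDX JSON document. The file name is a placeholder, and the fields read here (components, name, version, purl) are the common CycloneDX JSON fields; this is an illustration for auditing purposes, not part of the Buildpacks tooling itself.

Python

import json

# Placeholder path -- e.g., an SBOM exported for an application image.
SBOM_PATH = "sbom.cdx.json"

def list_components(path: str):
    """Print name, version, and purl for each component in a CycloneDX JSON SBOM."""
    with open(path) as f:
        sbom = json.load(f)
    for component in sbom.get("components", []):
        name = component.get("name", "<unknown>")
        version = component.get("version", "<unknown>")
        purl = component.get("purl", "")
        print(f"{name} {version} {purl}")

if __name__ == "__main__":
    list_components(SBOM_PATH)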
Performance tuning is a critical responsibility for Oracle database administrators, ensuring that SQL queries run efficiently across various environments. This guide details how to copy an SQL execution plan from one Oracle 19c database to another, a practical solution when a query performs inconsistently across environments. For example, if a query runs efficiently in a staging environment but poorly in production, transferring the execution plan can resolve performance issues without modifying the SQL code. Below are the steps to copy SQL execution plans.

Source Database Operations

Step 1: Identify the Plan Hash Value

To begin, identify the PLAN_HASH_VALUE of the SQL query in the source database where it performs well.

SQL

SELECT DISTINCT plan_hash_value
FROM v$sql
WHERE sql_id = 'abcd1234xyz';

Example output:

PLAN_HASH_VALUE
3456789012

Performance validation:
Query execution time in staging: ~0.5 seconds.
Query execution time in production: ~3.2 seconds.

Step 2: Load the Plan into SQL Plan Management (SPM)

Load the execution plan into the SPM repository using the DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE procedure.

SQL

DECLARE
  ret binary_integer;
BEGIN
  ret := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
    sql_id          => 'abcd1234xyz',
    plan_hash_value => 3456789012,
    fixed           => 'YES',
    enabled         => 'YES'
  );
END;
/

Verify the loaded plan:

SQL

SELECT sql_handle, plan_name FROM dba_sql_plan_baselines;

Example output:

SQL_HANDLE            PLAN_NAME
SQL_12345abcde67890   SQL_PLAN_xyz09876abcd

Step 3: Create a Staging Table

Create a table to store the plan for export.

SQL

BEGIN
  DBMS_SPM.CREATE_STGTAB_BASELINE(
    table_name      => 'STAGING_PLAN_TABLE',
    table_owner     => 'APPUSER',
    tablespace_name => 'USERS'
  );
END;
/

Step 4: Pack the Execution Plan

Pack the plan into the staging table.

SQL

DECLARE
  my_plans NUMBER;
BEGIN
  my_plans := DBMS_SPM.PACK_STGTAB_BASELINE(
    table_name  => 'STAGING_PLAN_TABLE',
    table_owner => 'APPUSER',
    plan_name   => 'SQL_PLAN_xyz09876abcd',
    sql_handle  => 'SQL_12345abcde67890'
  );
END;
/

Step 5: Export the Staging Table

Export the staging table using Oracle Data Pump:

Shell

expdp appuser/password@source_db \
  tables=APPUSER.STAGING_PLAN_TABLE \
  dumpfile=plan_export.dmp \
  logfile=plan_export.log

Transfer the plan_export.dmp file to the target database.

Target Database Operations

Step 6: Import the Staging Table

Import the table into the target database.

Shell

impdp appuser/password@target_db \
  tables=APPUSER.STAGING_PLAN_TABLE \
  dumpfile=plan_export.dmp \
  logfile=plan_import.log

Step 7: Unpack the Execution Plan

Unpack the execution plan into the SPM repository in the target database.

SQL

DECLARE
  plans_unpacked PLS_INTEGER;
BEGIN
  plans_unpacked := DBMS_SPM.UNPACK_STGTAB_BASELINE(
    table_name  => 'STAGING_PLAN_TABLE',
    table_owner => 'APPUSER'
  );
  DBMS_OUTPUT.PUT_LINE('Plans Unpacked: ' || plans_unpacked);
END;
/

Verify the unpacked plan:

SQL

SELECT sql_handle, plan_name, enabled, accepted, fixed
FROM dba_sql_plan_baselines;

Step 8: Fix the Execution Plan in Oracle 19c

Ensure the SQL optimizer consistently uses the imported execution plan by marking it as FIXED. A fixed plan tells the optimizer to prioritize it over other plans for the same SQL query, ensuring stable and predictable performance.
Code for Fixing the Plan

SQL

DECLARE
  plans_altered PLS_INTEGER;
BEGIN
  plans_altered := DBMS_SPM.ALTER_SQL_PLAN_BASELINE(
    sql_handle      => 'SQL_12345abcde67890',
    plan_name       => 'SQL_PLAN_xyz09876abcd',
    attribute_name  => 'fixed',
    attribute_value => 'YES'
  );
  DBMS_OUTPUT.PUT_LINE('Plans Altered: ' || plans_altered);
END;
/

Explanation of Input Parameters

1. sql_handle
Definition: A unique identifier for the SQL statement associated with the execution plan in the SPM repository.
Example value: 'SQL_12345abcde67890'. This corresponds to the SQL query whose execution plan you imported.
How to find it: Query the DBA_SQL_PLAN_BASELINES table to get the SQL_HANDLE:

SQL

SELECT sql_handle, sql_text
FROM dba_sql_plan_baselines
WHERE sql_text LIKE '%<your_query>%';

2. plan_name
Definition: The unique identifier for the specific execution plan you want to fix.
Example value: 'SQL_PLAN_xyz09876abcd'. This corresponds to the imported execution plan.
How to find it: Query the DBA_SQL_PLAN_BASELINES table to get the PLAN_NAME:

SQL

SELECT plan_name, sql_handle, enabled, accepted, fixed
FROM dba_sql_plan_baselines
WHERE sql_handle = 'SQL_12345abcde67890';

3. attribute_name
Definition: The attribute of the plan that you want to modify.
Allowed values: 'fixed', 'enabled', 'accepted', etc.
Example value: 'fixed'. In this context, it specifies that you want to modify the fixed status of the plan.

4. attribute_value
Definition: The new value for the attribute being modified.
Allowed values: 'YES', 'NO'.
Example value: 'YES'. This marks the execution plan as fixed, prioritizing it over other plans for the same query.

Expected Output

When executed, the procedure updates the specified plan's fixed attribute and outputs the number of plans altered. For example:

Plans Altered: 1

Verification

To confirm the plan is fixed, run the following query:

SQL

SELECT sql_handle, plan_name, fixed
FROM dba_sql_plan_baselines
WHERE sql_handle = 'SQL_12345abcde67890';

Expected output:

SQL_HANDLE            PLAN_NAME               FIXED
SQL_12345abcde67890   SQL_PLAN_xyz09876abcd   YES

When to Use Fixed Plans

Fixing a plan is particularly useful in scenarios like:
Stabilizing query performance in production environments.
Ensuring the optimizer does not deviate from a known efficient plan.
Addressing inconsistent performance across environments.

By carefully fixing execution plans, you can maintain predictable query behavior while mitigating performance risks.

Step 9: Test the Query

Run the query in the target database to confirm the plan is applied:

SQL

SELECT DISTINCT plan_hash_value
FROM v$sql
WHERE sql_id = 'abcd1234xyz';

Example output:

PLAN_HASH_VALUE
3456789012

Performance validation:
Query execution time after plan transfer: ~0.5 seconds.
Performance improvement: ~84.4% faster.

Conclusion

By following these steps, you can transfer SQL execution plans between Oracle 19c databases to address performance issues without altering SQL code. This method ensures consistent query behavior across environments, significantly improving performance.

Summary of Results

Environment                     | Query Execution Time | Improvement (%)
Staging                         | ~0.5 seconds         | Baseline
Production                      | ~3.2 seconds         | N/A
Production (Post Plan Transfer) | ~0.5 seconds         | ~84.4%
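If you repeat this transfer across many databases, the verification in Steps 8 and 9 can be scripted. Below is a minimal sketch using the python-oracledb driver; the connection details are placeholders, and the queries are the same verification queries shown above.

Python

import oracledb

# Placeholder connection details for the target database.
conn = oracledb.connect(user="appuser", password="password", dsn="target_db_host/service_name")

def verify_fixed_plan(sql_handle: str, sql_id: str):
    """Check that the imported baseline is FIXED and that the query now uses the expected plan."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT plan_name, fixed FROM dba_sql_plan_baselines WHERE sql_handle = :h",
            h=sql_handle,
        )
        for plan_name, fixed in cur:
            print(f"baseline {plan_name}: fixed={fixed}")

        cur.execute(
            "SELECT DISTINCT plan_hash_value FROM v$sql WHERE sql_id = :s",
            s=sql_id,
        )
        for (plan_hash_value,) in cur:
            print(f"current plan_hash_value: {plan_hash_value}")

verify_fixed_plan("SQL_12345abcde67890", "abcd1234xyz")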
SQL Server is a powerful relational database management system (RDBMS), but as datasets grow in size and complexity, optimizing their performance becomes critical. Leveraging AI can revolutionize query optimization and predictive maintenance, ensuring the database remains efficient, secure, and responsive. In this article, we will explore how AI can assist in these areas, providing code examples to tackle complex queries.

AI for Query Optimization

Complex queries can be slow due to inefficient execution plans or poor indexing strategies. AI can analyze query execution metrics, identify bottlenecks, and provide suggestions for optimization.

Example: Complex Query Optimization

Let's start with a slow-running query:

MS SQL

SELECT p.ProductID, SUM(o.Quantity) AS TotalQuantity
FROM Products p
JOIN Orders o ON p.ProductID = o.ProductID
WHERE o.OrderDate >= '2023-01-01'
GROUP BY p.ProductID
HAVING SUM(o.Quantity) > 1000
ORDER BY TotalQuantity DESC;

This query suffers from performance issues because of:
Unoptimized indexes on OrderDate and ProductID.
A high volume of unnecessary data being scanned.

Solution: AI-Based Query Plan Analysis

Using tools like SQL Server Query Store and integrating AI-based analytics, you can identify inefficiencies:

1. Enable Query Store

MS SQL

ALTER DATABASE AdventureWorks SET QUERY_STORE = ON;

2. Capture Query Performance Metrics

Use Python with a library like pyodbc and AI frameworks to analyze the query's execution statistics.

Python

import pyodbc
import pandas as pd
from sklearn.ensemble import IsolationForest

# Connect to SQL Server
conn = pyodbc.connect(
    "Driver={SQL Server};"
    "Server=your_server_name;"
    "Database=AdventureWorks;"
    "Trusted_Connection=yes;"
)

# Retrieve query execution stats
query = """
SELECT TOP 1000
    qs.query_id,
    qs.execution_type,
    qs.total_duration,
    qs.cpu_time,
    qs.logical_reads,
    qs.physical_reads
FROM sys.query_store_runtime_stats qs
"""
df = pd.read_sql(query, conn)

# Use AI for anomaly detection (e.g., identifying slow queries)
model = IsolationForest(n_estimators=100, contamination=0.1)
model.fit(df[['total_duration', 'cpu_time', 'logical_reads']])
df['anomaly'] = model.predict(df[['total_duration', 'cpu_time', 'logical_reads']])
print(df[df['anomaly'] == -1])  # Anomalous slow queries

3. Optimize the Query

Based on the analysis, add proper indexing:

MS SQL

CREATE NONCLUSTERED INDEX IDX_Orders_OrderDate_ProductID
ON Orders(OrderDate, ProductID);

Here is the updated query after applying the AI suggestions, which reduces unnecessary scans:

MS SQL

SELECT p.ProductID, SUM(o.Quantity) AS TotalQuantity
FROM Products p
JOIN Orders o ON p.ProductID = o.ProductID
WHERE o.OrderDate >= '2023-01-01'
  AND EXISTS (
      SELECT 1
      FROM Orders o2
      WHERE o2.ProductID = p.ProductID
        AND o2.Quantity > 1000
  )
GROUP BY p.ProductID
ORDER BY TotalQuantity DESC;

AI for Predictive Maintenance

AI can predict system issues before they occur, such as disk I/O bottlenecks or query timeouts.

Example: Predicting Performance Bottlenecks

1. Collect Performance Metrics

Use SQL Server's DMVs (Dynamic Management Views) to retrieve metrics.

MS SQL

SELECT
    database_id,
    io_stall_read_ms,
    io_stall_write_ms,
    num_of_reads,
    num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
2. Analyze Metrics With AI

Predict bottlenecks using Python and a regression model:

Python

from sklearn.linear_model import LinearRegression
import numpy as np

# Example I/O data
io_data = {
    'read_stall': [100, 150, 300, 500, 800],
    'write_stall': [80, 120, 280, 480, 750],
    'workload': [1, 2, 3, 4, 5]  # Hypothetical workload levels
}

X = np.array(io_data['workload']).reshape(-1, 1)
y = np.array(io_data['read_stall'])

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict for future workload levels
future_workload = np.array([6]).reshape(-1, 1)
predicted_stall = model.predict(future_workload)
print(f"Predicted read stall for workload 6: {predicted_stall[0]} ms")

3. Proactive Maintenance

Schedule optimizations based on predicted workloads.
Add resources (e.g., disk I/O capacity) or rebalance workloads to mitigate future issues (see the sketch after the conclusion of this article).

Analysis of SQL Server Before and After AI-Driven Query Optimization

Metric                            | Before Optimization   | After Optimization with AI | Improvement
Dataset Size                      | 50 million rows       | 50 million rows            | No change
Query Execution Time              | 120 seconds           | 35 seconds                 | ~70% reduction
CPU Utilization (%)               | 85%                   | 55%                        | ~35% reduction
I/O Read Operations (per query)   | 1,500,000             | 850,000                    | ~43% reduction
Logical Reads (pages)             | 120,000               | 55,000                     | ~54% reduction
Index Utilization                 | Minimal               | Fully optimized            | Improved indexing strategy
Latency for Concurrent Queries    | High (queries queued) | Low (handled in parallel)  | Significant reduction in wait time
Resource Contention               | Frequent              | Rare                       | Better query and resource management
Overall Throughput (queries/hour) | 20                    | 60                         | 3x improvement
Error Rate (timeouts or failures) | 5%                    | 1%                         | 80% reduction

Key Observations

1. Query Execution Time
Using AI to analyze execution plans and recommend indexes significantly reduced execution time for complex queries.

2. CPU and I/O Efficiency
Optimized indexing and improved query structure reduced resource consumption.

3. Concurrency Handling
Enhanced indexing and optimized execution plans improved the ability to handle concurrent queries, reducing latency.

4. Throughput
With reduced execution time and better resource utilization, the system processed more queries per hour.

5. Error Rate
AI-driven optimization reduced query timeouts and failures by minimizing resource contention and improving execution plans.

Conclusion

Incorporating AI-driven solutions into the optimization of SQL Server significantly enhances the management and querying of extensive datasets, particularly when dealing with millions of rows. A comparative analysis of performance metrics before and after optimization reveals marked improvements in execution times, resource efficiency, and overall system throughput. By utilizing AI tools for query optimization, indexing methodologies, and predictive analytics, organizations can achieve reduced latency, improved concurrency, and fewer errors, thereby ensuring a dependable and efficient database environment. The adoption of sophisticated indexing techniques and AI-based query analysis has led to a reduction in execution times by approximately 70%, a decrease in CPU and I/O resource consumption, and a tripling of query throughput. Furthermore, predictive maintenance has facilitated proactive resource management, significantly mitigating the potential for bottlenecks and system downtime. These enhancements improve performance and foster scalability and resilience for future expansion.
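As a follow-up to the proactive-maintenance step above, here is a minimal sketch that turns the regression prediction into a simple go/no-go signal. The threshold value and the alert action are illustrative assumptions, not recommendations from the article.

Python

from sklearn.linear_model import LinearRegression
import numpy as np

# Same toy I/O data as in the prediction example above.
workload = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
read_stall = np.array([100, 150, 300, 500, 800])

model = LinearRegression().fit(workload, read_stall)

STALL_THRESHOLD_MS = 600  # illustrative threshold for acceptable read stalls

def check_future_workload(level: int) -> None:
    """Flag workload levels whose predicted read stall exceeds the threshold."""
    predicted = model.predict(np.array([[level]]))[0]
    if predicted > STALL_THRESHOLD_MS:
        print(f"Workload {level}: predicted stall {predicted:.0f} ms -- schedule maintenance or add I/O capacity")
    else:
        print(f"Workload {level}: predicted stall {predicted:.0f} ms -- within threshold")

for level in (5, 6, 7):
    check_future_workload(level)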
Despite their remarkable capabilities in generating text, answering complex questions, and performing a wide range of tasks, Large Language Models (LLMs) have notable limitations that hinder their real-world applicability. One significant challenge is their inability to consistently provide precise, up-to-date responses. This issue is especially critical in fields like healthcare, law, and finance, where the accuracy and explainability of information are paramount. For instance, imagine a financial analyst querying the latest market trends or a doctor seeking updated medical guidelines. Retrieval-augmented generation (RAG) addresses these limitations by combining the strengths of LLMs with information retrieval systems, ensuring more accurate, reliable, and contextually grounded outputs. Limitations of LLMs and How RAG Helps Hallucination LLMs sometimes generate content that is "nonsensical or unfaithful to the provided source content" (Ji et al., 2023). This phenomenon, known as hallucination, occurs because the models rely on patterns in their training data rather than a solid understanding of the underlying facts. For example, they may produce inaccurate historical dates, fictitious citations, or nonsensical scientific explanations. RAG mitigates hallucination by grounding the generation process in trusted external sources. By retrieving information from verifiable knowledge bases or databases, RAG ensures that outputs are aligned with reality, reducing the occurrence of spurious or incorrect details. Outdated Knowledge LLMs are limited by the static nature of their training data, meaning their knowledge is frozen at the time of training. They lack the ability to access new information or keep up with fast-changing fields like technology, medicine, or global affairs. RAG overcomes this limitation by integrating dynamic retrieval mechanisms that allow the model to query up-to-date sources in real time. For instance, when integrated with APIs or live databases, RAG systems can provide accurate responses even in domains with frequent updates, such as real-time stock market analysis or the latest medical guidelines. Opaque and Untraceable Reasoning The reasoning process of LLMs is often non-transparent, making it difficult for users to understand or trust how answers are derived. This lack of traceability can be problematic in high-stakes scenarios where accountability is essential. RAG addresses this by incorporating citations and source traceability into its outputs. By linking retrieved information to authoritative sources, RAG not only enhances transparency but also builds user trust. This feature is particularly beneficial in legal, academic, or compliance-focused applications. How RAG Works A simple RAG system, as illustrated in the image below, consists of two main components: the Retriever and the Augmented Generator. Retriever Simple RAG Architecture Efficient retrieval methods are at the core of RAG systems, enabling them to identify the most relevant information from a knowledge base to contextualize and ground the outputs of the language model. This knowledge base can include structured or unstructured data sources, such as internal documents, document databases, or the broader internet. The retrieval process involves: Preprocessing and Indexing Chunking: Large documents are segmented into smaller, logically coherent chunks ensuring each chunk captures a meaningful context. 
For example, a 100-page technical report might be divided into 1-page sections, each covering a distinct topic.Knowledge representation: This step converts the chunks into mathematical representations that facilitate retrieval and comparison. It includes both dense and sparse approaches: Dense representations: Chunks are converted into dense vector representations (embeddings) using embedding models like Sentence-BERT, OpenAI's embedding models, or other transformer-based approaches.Sparse representations: Alternatively, sparse representations such as TF-IDF or BM25 can be computed to create a term-based representation of the chunks.Indexing: Indexing organizes the processed representations of document chunks into specialized data structures, enabling efficient and scalable retrieval.Different indexing techniques are used depending on the type of representation: Vector store indexing: Dense embeddings are stored in a vector database (e.g., FAISS, Pinecone, Weaviate) for similarity-based retrieval.Sparse indexing: TF-IDF or BM25 representations are indexed using traditional inverted indices to support lexical matching.Query embedding: The user's query is processed into a sparse or dense vector using the same model used for the knowledge base embeddings. This ensures both reside in the same vector space for meaningful comparisons.Similarity search: The query embedding is compared against the stored embeddings using similarity metrics such as cosine similarity or inner product to retrieve relevant data from the knowledge base. Augmented Generation The retrieved information is combined with the original user query and provided as input to a language model. The generation process ensures the response is: Contextual: Incorporates the user’s query and retrieved-context to generate outputs relevant to the question.Grounded: Based on reliable, retrieved knowledge rather than speculative extrapolation.Explainable: Traces back to the retrieved sources, enabling users to verify the response. Advanced RAG Techniques The simple RAG architecture presented earlier had several potential points of failure that can degrade the quality of the system’s output. One of the first challenges occurs at the user query stage, where the query might not be clearly articulated or precisely framed, leading to ineffective retrieval. Additionally, even when the query is well-formed, the query embedding process may fail to capture the full intent behind the query, especially when the query is complex. This can cause a mismatch between the query and the relevant information retrieved from the knowledge base. Furthermore, the retrieval stage may introduce additional challenges. The system may return irrelevant or low-confidence documents due to limitations in the retrieval model. If the retrieved context is not highly relevant to the query, the information passed to the LLM is likely to be misleading, incomplete, or out of date. This can lead to hallucinations or a generation of responses that are not grounded in the correct context, undermining the reliability of the system. To address these challenges, a strong and reliable RAG system must intervene at each of these stages to ensure optimal performance. This is where the concept of an Advanced RAG system comes into play. In an Advanced RAG framework, sophisticated mechanisms are introduced at every stage of the pipeline, from query formulation and embedding generation to retrieval and context utilization. 
These interventions are designed to mitigate the risks of poor query formulation, inaccurate embeddings, and irrelevant or outdated context, resulting in a more robust system that delivers highly relevant, accurate, and trustworthy outputs. By addressing these points of failure proactively, Advanced RAG ensures that the language model generates responses that are not only contextually grounded but also transparent and reliable. The Advanced RAG architecture introduces key improvements, as shown in the diagram below. Compared to the simple RAG model, Advanced RAG incorporates additional steps to address the potential points of failure. Advanced RAG Architecture Key Enhancements in Advanced RAG Data Preprocessing for LLM-Based RAG Systems Data preprocessing is the foundational step in a RAG system. It involves transforming diverse data formats, such as PDFs, Word documents, web pages, or code files, into a consistent structure that can be efficiently processed by the RAG pipeline. Proper preprocessing is critical for enhancing accuracy, efficiency, and relevance in information retrieval and response generation. Key steps in data preprocessing include: Tokenization and Text Cleaning Tokenization and cleaning ensure data is structured in a form suitable for the model. Key considerations include: Consistent Tokenization The tokenizer associated with the LLM being used should be employed to ensure compatibility and efficiency (Lewis et al., 2020). For example, GPT models use specialized tokenization schemes optimized for their architecture (OpenAI, 2024). Text Cleaning Irrelevant characters and noise, such as excessive whitespace or non-meaningful special symbols, should be removed. Semantically relevant characters, such as those in mathematical or programming contexts, should be preserved. Chunking for Efficient Retrieval Chunking involves segmenting large documents into smaller, meaningful units to address token limits and context window constraints inherent in LLMs. By organizing similar information into compact chunks, the retrieval process becomes more efficient and precise. Various chunking strategies can be adopted depending on the type of data and specific use case (Weights & Biases, 2024): Fixed-length chunking: This method divides text into segments based on a fixed number of tokens or characters. While straightforward, it risks disrupting semantic flow by splitting content arbitrarily.Semantic chunking: Text is divided along natural thematic breaks, much like separating a book into chapters. This method enhances interpretability but may require advanced NLP techniques for implementation (Lewis et al., 2020).Content-based chunking: Segmentation is tailored to the structure of the document. For instance, code files can be split by function definitions, and web pages can be divided based on HTML elements, ensuring relevance to the document’s format and purpose. Query Enhancement Advanced RAG systems employ LLMs and a variety of NLP techniques to enhance the understanding of user queries. These include: Intent Classification This helps the system identify the user’s intent, allowing it to better understand the purpose behind the query. For instance, is the query seeking factual information, a product feature comparison, or technical support? Intent classification ensures the retrieval process is customized to meet the user’s specific needs (Weights & Biases, 2024). Query Decomposition For complex or multi-faceted queries, the system breaks the query into smaller, more focused subqueries.
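As a rough illustration, subquery generation can be as simple as asking the model to split the question and retrieving evidence for each part. The llm() and retrieve() callables below are placeholders for an LLM client and the retriever sketched earlier, not a prescribed interface.

Python
# Sketch of LLM-based query decomposition; llm() and retrieve() are assumed callables.
def decompose(query, llm):
    prompt = (
        "Break the user question into at most three self-contained subquestions, "
        "one per line. If it is already simple, return it unchanged.\n\n"
        f"Question: {query}"
    )
    lines = llm(prompt).splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def retrieve_for_all(query, llm, retrieve, k=3):
    evidence = []
    for sub in decompose(query, llm):
        evidence.extend(retrieve(sub, k))       # gather context per subquestion
    return list(dict.fromkeys(evidence))        # de-duplicate, preserving order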
This approach ensures that the retrieval process comprehensively addresses all aspects of the user’s request and retrieves highly specific and relevant information (Rackauckas, 2024). Conversation History Utilization This technique enables the RAG system to enhance queries by leveraging past interactions. By maintaining context across multiple exchanges, the system analyzes and incorporates previous queries and responses. Through chat history condensation, key details from prior interactions are distilled to reduce noise and preserve relevance. Additionally, contextual query reformulation refines the current query by integrating historical context, allowing the system to retrieve more accurate and context-aware information. This ensures the delivery of coherent and tailored responses, even in complex or evolving conversational scenarios (Weights & Biases, 2024). Hypothetical Document Embeddings (HyDE) HyDE utilizes LLMs to create hypothetical documents that are relevant to the original query. Although these generated documents may include factual inaccuracies and are not real, they capture what a relevant answer should look like; their embeddings are then used in place of the raw query embedding to retrieve the actual documents (Gao et al., 2022). Advanced Retrieval In addition to utilizing the enhanced queries, the retrieval stage itself can draw on several families of information retrieval techniques: Classical Information Retrieval Classical information retrieval techniques focus on scoring and ranking documents based on their relevance to a given query. Two prominent methods are Term Frequency-Inverse Document Frequency (TF-IDF) and Best Matching 25 (BM25). While these approaches are effective for matching tokens, they often fall short when it comes to understanding the content or context of the query and document, relying primarily on surface-level token similarity rather than deeper semantic relationships. Neural Information Retrieval Neural information retrieval (NIR) methods represent documents using language models, producing dense numeric representations (embeddings) that capture the semantic and contextual meaning of the documents. These dense representations allow for more accurate and nuanced matching between queries and documents compared to traditional approaches. NIR techniques include (Khattab and Zaharia, 2020): Bi-Encoder Model In a Bi-Encoder model, the query and document are independently encoded into single fixed-length vector representations using separate neural networks, such as a pre-trained BERT model. These representations capture the overall semantic meaning of the query and document rather than token-level details. The similarity between the query and document is then computed using a similarity function, such as the dot product or cosine similarity, to identify relevant documents. This strategy can be further enhanced by fine-tuning the pre-trained BERT model to maximize the similarity between query and document embeddings. Fine-tuning is typically achieved by optimizing the negative log-likelihood of positive document pairs, as outlined by Karpukhin et al. (2020). This approach improves the alignment of query and document representations, leading to more accurate retrieval. The Bi-Encoder design is highly efficient for large-scale retrieval, as document embeddings can be precomputed and indexed, enabling fast nearest-neighbor searches.
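The fine-tuning objective mentioned above is often implemented with in-batch negatives: each query in a training batch is scored against every document in the batch, and the matching document is treated as the correct class. The sketch below uses PyTorch with illustrative shapes; it is a simplification of the training setup described by Karpukhin et al. (2020), not a faithful reproduction.

Python
# In-batch negatives: query i's positive passage is document i; all other
# documents in the batch act as negatives. Assumes PyTorch.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, d_emb):
    """q_emb, d_emb: [batch, dim] embeddings from the query and document encoders."""
    scores = q_emb @ d_emb.T                  # [batch, batch] similarity matrix
    targets = torch.arange(q_emb.size(0))     # the diagonal holds the positives
    return F.cross_entropy(scores, targets)   # negative log-likelihood of positives

# Example shapes: a batch of 8 query/document pairs with 768-dim embeddings.
loss = in_batch_negative_loss(torch.randn(8, 768), torch.randn(8, 768))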
However, the independent encoding of the query and document limits the ability to capture intricate interactions between them, which can reduce the retrieval quality, particularly for more complex queries. Cross-Encoder Model In contrast, a Cross-Encoder model processes both the query and document simultaneously by concatenating them and passing the combined input through a single transformer model. This allows for more complex interactions between the query and document, resulting in potentially higher-quality relevance scores. The fine-grained attention between the query and document typically leads to more accurate relevance assessments. However, this approach is computationally expensive, as every query-document pair must be passed through the model at inference time; scores cannot be precomputed. As a result, it is less scalable for large datasets. ColBERT ColBERT (contextualized late interaction over BERT) (Khattab and Zaharia, 2020) is an advanced retrieval model designed to strike a balance between the efficiency of Bi-Encoders and the accuracy of Cross-Encoders. Its core innovation lies in computing a matrix of similarity scores between query tokens and document tokens, enabling a fine-grained assessment of relevance. For each document, the overall relevance score is the sum, over query tokens, of each query token's maximum similarity to any document token. This token-wise interaction allows ColBERT to capture nuanced relationships while maintaining scalability. ColBERT achieves high efficiency because, like Bi-Encoder models, it allows documents to be pre-encoded and stored in advance. During query processing, only the query needs to be encoded, after which the maximum similarity scores are computed on the fly. This approach significantly reduces computational overhead while delivering high retrieval performance. This combination of pre-encoded document representations and late interaction makes ColBERT an effective choice for large-scale information retrieval tasks. Re-Ranker Neural information retrieval models offer significant advancements in information retrieval but come with considerable computational costs, particularly due to the need for forward inference on transformer-based architectures such as BERT. This is especially challenging in scenarios with stringent latency requirements. To make this practical, these models can be effectively utilized as re-rankers in a two-step retrieval pipeline: An initial retrieval phase uses efficient term-based models, such as BM25, to identify the top K candidate documents. These K documents are then re-ranked using neural information retrieval models, which assess their relevance through contextual scoring. In the re-ranking stage, advanced scoring mechanisms such as cross-encoders or ColBERT evaluate and reorder the retrieved documents based on their contextual relevance and quality. By incorporating this step, the re-ranking process reduces irrelevant information in the retrieved set, thereby improving the contextual input passed to downstream language models. This methodology not only mitigates computational overheads but also enhances the quality of responses generated in subsequent processing stages. Response Synthesis and Prompt Engineering Response synthesis is a critical component that bridges the gap between raw data retrieval and user-friendly output in the RAG system.
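Before turning to prompt construction, the late-interaction (MaxSim) scoring at the heart of ColBERT is compact enough to sketch directly. The PyTorch snippet below uses random, L2-normalized token embeddings as stand-ins for real ColBERT output, so it only illustrates the scoring rule, not the full model.

Python
# ColBERT-style late interaction (MaxSim): for each query token, take its best
# match among the document tokens, then sum those maxima. Assumes PyTorch and
# L2-normalized token embeddings, so dot product equals cosine similarity.
import torch

def maxsim_score(q_tokens, d_tokens):
    """q_tokens: [q_len, dim], d_tokens: [d_len, dim], both normalized."""
    sim = q_tokens @ d_tokens.T          # [q_len, d_len] token-level similarities
    return sim.max(dim=1).values.sum()   # best match per query token, then sum

# Illustrative usage: score one query against two pre-encoded documents.
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
docs = [torch.nn.functional.normalize(torch.randn(n, 128), dim=-1) for n in (80, 120)]
scores = [maxsim_score(q, d) for d in docs]   # rank documents by these scores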
Prompt engineering plays a crucial role in maximizing the performance of such systems, enabling the model to generate accurate, contextually appropriate, and reliable responses (Brown et al., 2020; Bender et al., 2021). By systematically providing clear instructions and examples and leveraging advanced language models, the quality, accuracy, and trustworthiness of generated responses can be significantly improved. Key Techniques for Effective Prompt Engineering Define the Role Clearly define the role the language model (LM) should play in the given task (e.g., "You are a helpful assistant"). This helps set expectations for how the model should behave and respond. Define the Goal Explicitly state the objective of the response (e.g., "Answer the following question"). A well-defined goal ensures that the LM's output aligns with user expectations and task requirements. Provide Context Contextual information is crucial for guiding the model’s output. Include relevant background data, domain-specific knowledge, or specific constraints that the model should consider while generating the response. Add Instructions Specify detailed instructions on how the model should generate the output (e.g., "Use bullet points to list the steps"). Clear guidance on the structure or format can improve the clarity and usability of the generated response. Few-Shot Learning Incorporating a small set of high-quality, diverse examples can significantly enhance the performance and adaptability of LMs in RAG systems (Brown et al., 2020). Few-shot prompting provides the model with a middle ground between zero-shot learning (where no examples are provided) and fine-tuning (where the model is retrained on a specific dataset). By embedding a few representative examples, the model learns the desired output behavior, improving its response accuracy. Representative Samples Choose examples that are reflective of the most common query types and desired response formats the RAG system will handle. Specificity and Diversity Include examples that balance specificity to your use case with diversity to address a wide range of queries. Dynamic Example Set As the system evolves, regularly update the set of examples to align with new query types or business needs. Balancing Performance and Token Usage LLMs have limitations on the amount of text they can process in a single prompt (context window). It's essential to find the right balance between including enough examples to guide the model effectively and not overloading the context window, which can degrade performance. Incorporate Model Reasoning To enhance the transparency of the RAG system, request the model to explain its thought process (Bender et al., 2021). This helps improve the trustworthiness of the model’s outputs and can assist users in understanding the reasoning behind specific responses, especially when dealing with complex or uncertain queries. By employing these strategies in prompt engineering, one can significantly improve the functionality, performance, and reliability of a RAG system, ensuring that the generated responses are not only accurate but also aligned with the user’s needs. Response Validation Before the response is presented to the user, an additional validation layer assesses the output generated by the language model. This step acts as a quality control mechanism to ensure that the response is accurate, appropriate, and grounded in the retrieved context. 
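One lightweight way to implement such a check is to ask a second model whether every claim in the draft answer is supported by the retrieved context, and to fall back when it is not. The judge prompt and the llm() callable below are illustrative assumptions, not part of any specific framework.

Python
# Sketch of a groundedness check before returning a RAG answer to the user.
# llm() is a placeholder for whatever model client is used as the "judge".
def validate_response(answer, context, llm):
    prompt = (
        "You are a strict fact checker. Reply with exactly SUPPORTED or UNSUPPORTED.\n"
        "Is every factual claim in the ANSWER supported by the CONTEXT?\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    return llm(prompt).strip().upper() == "SUPPORTED"

def safe_answer(query, answer, context, llm):
    if validate_response(answer, context, llm):
        return answer
    # Fall back rather than returning an ungrounded response.
    return "I could not verify this answer against the retrieved sources."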
It can involve tasks like factual consistency checks, appropriateness scoring, and cross-referencing with trusted sources. Conclusion The evolution of RAG systems marks a pivotal shift in how we harness LLMs to provide precise, context-aware, and reliable information. By integrating advanced techniques such as intelligent preprocessing, enhanced query understanding, sophisticated retrieval mechanisms, and effective response synthesis, modern RAG systems address the complexities of real-world applications with remarkable efficiency. As RAG continues to mature, its role in bridging the gap between vast unstructured data sources and actionable insights becomes increasingly indispensable. Whether applied to domains like customer support, research, or decision-making, RAG systems are poised to redefine how we interact with and benefit from AI-driven knowledge systems. References 1. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A. and Fung, P., 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), pp.1-38. 2. Lewis, P., Oguz, B., Rinott, R., Riedel, S. and Stenetorp, P., 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, pp.9459-9474. https://proceedings.neurips.cc 3. OpenAI, 2024. Tokenization and context window. [online] Available at: https://platform.openai.com 4. Weights & Biases, 2024. RAG in Production. [online] Available at: https://www.wandb.courses/courses/rag-in-production [Accessed 5 Oct. 2024]. 5. Gao, L., Ma, X., Lin, J. and Callan, J., 2022. Precise zero-shot dense retrieval without relevance labels. arXiv preprint arXiv:2212.10496. 6. Rackauckas, Z., 2024. RAG-Fusion: A new take on retrieval-augmented generation. arXiv preprint arXiv:2402.03367. 7. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. and Yih, W.T., 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. 8. Khattab, O. and Zaharia, M., 2020, July. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 39-48). 9. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, pp.1877-1901. 10. Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
Metaprogramming is a powerful programming paradigm that allows code to dynamically manipulate its behavior at runtime. JavaScript, with the introduction of Proxies and the Reflect API in ES6, has taken metaprogramming capabilities to a new level, enabling developers to intercept and redefine core object operations like property access, assignment, and function invocation. This blog post dives deep into these advanced JavaScript features, explaining their syntax, use cases, and how they work together to empower dynamic programming. What Are Proxies? A Proxy in JavaScript is a wrapper that allows developers to intercept and customize fundamental operations performed on an object. These operations include getting and setting properties, function calls, property deletions, and more. Proxy Syntax JavaScript const proxy = new Proxy(target, handler); target: The object being proxied.handler: An object containing methods, known as traps, that define custom behaviors for intercepted operations. Example: Logging Property Access JavaScript const user = { name: 'Alice', age: 30 }; const proxy = new Proxy(user, { get(target, property) { console.log(`Accessing property: ${property}`); return target[property]; } }); console.log(proxy.name); // Logs: Accessing property: name → Output: Alice Key Proxy Traps (trap name: operation intercepted) get: Accessing a property (obj.prop or obj['prop']); set: Assigning a value to a property (obj.prop = value); deleteProperty: Deleting a property (delete obj.prop); has: Checking property existence (prop in obj); apply: Function invocation (obj()); construct: Creating new instances with new (new obj()) Advanced Use Cases With Proxies 1. Input Validation JavaScript const user = { age: 25 }; const proxy = new Proxy(user, { set(target, property, value) { if (property === 'age' && typeof value !== 'number') { throw new Error('Age must be a number!'); } target[property] = value; return true; } }); proxy.age = 30; // Works fine proxy.age = '30'; // Throws Error: Age must be a number! In this example, the set trap ensures type validation before allowing assignments. 2. Reactive Systems (Similar to Vue.js Reactivity) JavaScript const data = { price: 5, quantity: 2 }; let total = 0; const proxy = new Proxy(data, { set(target, property, value) { target[property] = value; total = target.price * target.quantity; console.log(`Total updated: ${total}`); return true; } }); proxy.price = 10; // Logs: Total updated: 20 proxy.quantity = 3; // Logs: Total updated: 30 This code dynamically recalculates values whenever dependent properties are updated, mimicking the behavior of modern reactive frameworks. What Is Reflect? The Reflect API complements Proxies by providing methods that perform default behaviors for object operations, making it easier to integrate them into Proxy traps. Key Reflect Methods (method: description) Reflect.get(target, prop): Retrieves the value of a property. Reflect.set(target, prop, val): Sets a property value. Reflect.has(target, prop): Checks property existence (prop in obj). Reflect.deleteProperty(target, prop): Deletes a property. Reflect.apply(func, thisArg, args): Calls a function with a specified this context. Reflect.construct(target, args): Creates a new instance of a constructor.
Example: Using Reflect for Default Behavior JavaScript const user = { age: 25 }; const proxy = new Proxy(user, { set(target, property, value) { if (property === 'age' && typeof value !== 'number') { throw new Error('Age must be a number!'); } return Reflect.set(target, property, value); // Default behavior } }); proxy.age = 28; // Sets successfully console.log(user.age); // Output: 28 Using Reflect simplifies the code by maintaining default operations while adding custom logic. Real-World Use Cases Security wrappers: Restrict access to sensitive properties.Logging and debugging: Track object changes.API data validation: Ensure strict rules for API data. Conclusion Metaprogramming with Proxies and Reflect enables developers to dynamically control and modify application behavior. Master these tools to elevate your JavaScript expertise. Happy coding!
Welcome to 2025! A new year is the perfect time to learn new skills or refine existing ones, and for software developers, staying ahead means continuously improving your craft. Software design is not just a cornerstone of creating robust, maintainable, and scalable applications but also vital for your career growth. Mastering software design helps you write code that solves real-world problems effectively, improves collaboration with teammates, and showcases your ability to handle complex systems — a skill highly valued by employers and clients alike. Understanding software design equips you with the tools to: Simplify complexity in your projects, making code easier to understand and maintain.Align your work with business goals, ensuring the success of your projects.Build a reputation as a thoughtful and practical developer prioritizing quality and usability. To help you on your journey, I’ve compiled my top five favorite books on software design. These books will guide you through simplicity, goal-oriented design, clean code, practical testing, and mastering Java. 1. A Philosophy of Software Design This book is my top recommendation for understanding simplicity in code. It dives deep into how to write simple, maintainable software while avoiding unnecessary complexity. It also provides a framework for measuring code complexity with three key aspects: Cognitive Load: How much effort and time are required to understand the code?Change Amplification: How many layers or parts of the system need to be altered to achieve a goal?Unknown Unknowns: What elements of the code or project are unclear or hidden, making changes difficult? The book also discusses the balance between being strategic and tactical in your design decisions. It’s an insightful read that will change the way you think about simplicity and elegance in code. Link: A Philosophy of Software Design 2. Learning Domain-Driven Design: Aligning Software Architecture and Business Strategy Simplicity alone isn’t enough — your code must achieve client or stakeholders' goals. This book helps you bridge the gap between domain experts and your software, ensuring your designs align with business objectives. This is the best place to start if you're new to domain-driven design (DDD). It offers a practical and approachable introduction to DDD concepts, setting the stage for tackling Eric Evans' original work later. Link: Learning Domain-Driven Design 3. Clean Code: A Handbook of Agile Software Craftsmanship Once you’ve mastered simplicity and aligned with client goals, the next step is to ensure your code is clean and readable. This classic book has become a must-read for developers worldwide. From meaningful naming conventions to object-oriented design principles, “Clean Code” provides actionable advice for writing code that’s easy to understand and maintain. Whether new to coding or a seasoned professional, this book will elevate your code quality. Link: Clean Code 4. Effective Software Testing: A Developer’s Guide No software design is complete without testing. Testing should be part of your “definition of done.” This book focuses on writing practical tests that ensure your software meets its goals and maintains high quality. This book covers techniques like test-driven development (TDD) and data-driven testing. It is a comprehensive guide for developers who want to integrate testing seamlessly into their workflows. It’s one of the best software testing resources available today. Link: Effective Software Testing 5. 
Effective Java (3rd Edition) For Java developers, this book is an essential guide to writing effective and idiomatic Java code. From enums and collections to encapsulation and concurrency, “Effective Java” provides in-depth examples and best practices for crafting elegant and efficient Java programs. Even if you’ve been writing Java for years, you’ll find invaluable insights and tips to refine your skills and adopt modern Java techniques. Link: Effective Java (3rd Edition) Bonus: Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software As a bonus, I highly recommend this book to anyone looking to deepen their understanding of design patterns. In addition to teaching how to use design patterns, this book explains why you need them and how they contribute to building extensible and maintainable software. With its engaging and visually rich style, this book is an excellent resource for developers of any level. It makes complex concepts approachable and practical. Link: Head First Design Patterns These five books and the bonus recommendation provide a roadmap to mastering software design. Whether you’re just starting your journey or looking to deepen your expertise, each offers a unique perspective and practical advice to take your skills to the next level. Happy learning and happy coding!