Over a decade ago, DZone welcomed the arrival of its first ever data-centric publication. Since then, the trends surrounding the data movement have held many titles — big data, data science, advanced analytics, business intelligence, data analytics, and quite a few more. Despite its varying vernacular, the purpose has remained the same: to build intelligent, data-driven systems. The industry has come a long way from organizing unstructured data and driving cultural acceptance to adopting today's modern data pipelines and embracing business intelligence capabilities.

This year's Data Engineering Trend Report draws all former terminology, advancements, and discoveries into the larger picture, illustrating where we stand today along our unique, evolving data journeys. Within these pages, readers will find the keys to successfully build a foundation for fast and vast data intelligence across their organization. Our goal is for the contents of this report to help guide individual contributors and businesses alike as they strive for mastery of their data environments.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics.

Data engineering and software engineering have long been at odds, each with their own unique tools and best practices. A key differentiator has been the need for dedicated orchestration when building data products. In this article, we'll explore the role data orchestrators play and how recent trends in the industry may be bringing these two disciplines closer together than ever before.

The State of Data Orchestration

One of the primary goals of investing in data capabilities is to unify knowledge and understanding across the business. The value of doing so can be immense, but it involves integrating a growing number of systems with often increasing complexity. Data orchestration serves to provide a principled approach to composing these systems, with complexity coming from:

- Many distinct sources of data, each with their own semantics and limitations
- Many destinations, stakeholders, and use cases for data products
- Heterogeneous tools and processes involved with creating the end product

There are several components in a typical data stack that help organize these common scenarios.

The Components

The prevailing industry pattern for data engineering is known as extract, load, and transform, or ELT. Data is (E) extracted from upstream sources, (L) loaded directly into the data warehouse, and only then (T) transformed into various domain-specific representations. Variations exist, such as ETL, which performs transformations before loading into the warehouse. What all approaches have in common are three high-level capabilities: ingestion, transformation, and serving. Orchestration is required to coordinate between these three stages, but also within each one.

Ingestion

Ingestion is the process that moves data from a source system (e.g., a database) into a storage system that allows transformation stages to more easily access it. Orchestration at this stage typically involves scheduling tasks to run when new data is expected upstream or actively listening for notifications from those systems when it becomes available.

Transformation

Common examples of transformations include unpacking and cleaning data from its original structure as well as splitting or joining it into a model more closely aligned with the business domain. SQL and Python are the most common ways to express these transformations, and modern data warehouses provide excellent support for them. The role of orchestration in this stage is to sequence the transformations in order to efficiently produce the models used by stakeholders.

Serving

Serving can refer to a very broad range of activities. In some cases, where the end user can interact directly with the warehouse, this may only involve data curation and access control. More often, downstream applications need access to the data, which, in turn, requires synchronization with the warehouse's models. Loading and synchronization is where orchestrators play a role in the serving stage.

Figure 1. Typical flow of data from sources, through the data warehouse, out to end-user apps

Ingestion brings data in, transformation occurs in the warehouse, and data is served to downstream apps. These three stages comprise a useful mental model for analyzing systems, but what's important to the business is the capabilities they enable.
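As a rough illustration of how an orchestrator sequences these stages, here is a minimal sketch of a daily pipeline using Apache Airflow, a workflow engine discussed later in the article. The task functions, IDs, and schedule are placeholders rather than a real implementation, and a recent Airflow 2.x installation is assumed:

Python

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for real ingestion, transformation,
# and serving logic.
def ingest():
    print("pull new records from the source system")

def transform():
    print("build domain-specific models in the warehouse")

def serve():
    print("sync the latest models to downstream applications")

with DAG(
    dag_id="elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # daily schedule (Airflow 2.4+); could also be event-triggered
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    serve_task = PythonOperator(task_id="serve", python_callable=serve)

    # Explicit dependencies: ingestion must finish before transformation,
    # and transformation before serving.
    ingest_task >> transform_task >> serve_task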
Data orchestration helps coordinate the processes needed to take data from source systems, which are likely part of the core business, and turn it into data products. These processes are often heterogeneous and were not necessarily built to work together. This can put a lot of responsibility on the orchestrator, tasking it with making copies, converting formats, and other ad hoc activities to bring these capabilities together. The Tools At their core, most data systems rely on some scheduling capabilities. When only a limited number of services need to be managed on a predictable basis, a common approach is to use a simple scheduler such as cron. Tasks coordinated in this way can be very loosely coupled. In the case of task dependencies, it is straightforward to schedule one to start some time after the other is expected to finish, but the result can be sensitive to unexpected delays and hidden dependencies. As processes grow in complexity, it becomes valuable to make dependencies between them explicit. This is what workflow engines such as Apache Airflow provide. Airflow and similar systems are also often referred to as "orchestrators," but as we'll see, they are not the only approach to orchestration. Workflow engines enable data engineers to specify explicit orderings between tasks. They support running scheduled tasks much like cron and can also watch for external events that should trigger a run. In addition to making pipelines more robust, the bird's-eye view of dependencies they offer can improve visibility and enable more governance controls. Sometimes the notion of a "task" itself can be limiting. Tasks will inherently operate on batches of data, but the world of streaming relies on units of data that flow continuously. Many modern streaming frameworks are built around the dataflow model — Apache Flink being a popular example. This approach forgoes the sequencing of independent tasks in favor of composing fine-grained computations that can operate on chunks of any size. From Orchestration to Composition The common thread between these systems is that they capture dependencies, be it implicit or explicit, batch or streaming. Many systems will require a combination of these techniques, so a consistent model of data orchestration should take them all into account. This is offered by the broader concept of composition that captures much of what data orchestrators do today and also expands the horizons for how these systems can be built in the future. Composable Data Systems The future of data orchestration is moving toward composable data systems. Orchestrators have been carrying the heavy burden of connecting a growing number of systems that were never designed to interact with one another. Organizations have built an incredible amount of "glue" to hold these processes together. By rethinking the assumptions of how data systems fit together, new approaches can greatly simplify their design. Open Standards Open standards for data formats are at the center of the composable data movement. Apache Parquet has become the de facto file format for columnar data, and Apache Arrow is its in-memory counterpart. The standardization around these formats is important because it reduces or even eliminates the costly copy, convert, and transfer steps that plague many data pipelines. Integrating with systems that support these formats natively enables native "data sharing" without all the glue code. 
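The following minimal PyArrow sketch shows that write-once, share-by-reference pattern. The file path and columns are illustrative, and in a real pipeline the file would land in object storage rather than on local disk:

Python

import pyarrow as pa
import pyarrow.parquet as pq

# Ingestion side: land the data once as Parquet.
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [9.99, 14.50, 3.25],
})
path = "orders-2024-01-01.parquet"  # illustrative path; typically an object-storage key
pq.write_table(table, path)

# Downstream side: any Arrow-aware engine can read the same file directly,
# so "sharing" the dataset amounts to sharing the path.
shared = pq.read_table(path)
print(shared.schema)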
For example, an ingestion process might write Parquet files to object storage and then simply share the path to those files. Downstream services can then access those files without needing to make their own internal copies. If a workload needs to share data with a local process or a remote server, it can use Arrow IPC or Arrow Flight with close to zero overhead. Standardization is happening at all levels of the stack. Apache Iceberg and other open table formats are building upon the success of Parquet by defining a layout for organizing files so that they can be interpreted as tables. This adds subtle but important semantics to file access that can turn a collection of files into a principled data lakehouse. Coupled with a catalog, such as the recently incubating Apache Polaris, organizations have the governance controls to build an authoritative source of truth while benefiting from the zero-copy sharing that the underlying formats enable. The power of this combination cannot be overstated. When the business' source of truth is zero-copy compatible with the rest of the ecosystem, much orchestration can be achieved simply by sharing data instead of building cumbersome connector processes. Figure 2. A data system composed of open standards Once data is written to object storage as Parquet, it can be shared without any conversions. The Deconstructed Stack Data systems have always needed to make assumptions about file, memory, and table formats, but in most cases, they've been hidden deep within their implementations. A narrow API for interacting with a data warehouse or data service vendor makes for clean product design, but it does not maximize the choices available to end users. Consider Figure 1 and Figure 2, which depict data systems aiming to support similar business capabilities. In a closed system, the data warehouse maintains its own table structure and query engine internally. This is a one-size-fits-all approach that makes it easy to get started but can be difficult to scale to new business requirements. Lock-in can be hard to avoid, especially when it comes to capabilities like governance and other services that access the data. Cloud providers offer seamless and efficient integrations within their ecosystems because their internal data format is consistent, but this may close the door on adopting better offerings outside that environment. Exporting to an external provider instead requires maintaining connectors purpose-built for the warehouse's proprietary APIs, and it can lead to data sprawl across systems. An open, deconstructed system standardizes its lowest-level details. This allows businesses to pick and choose the best vendor for a service while having the seamless experience that was previously only possible in a closed ecosystem. In practice, the chief concern of an open data system is to first copy, convert, and land source data into an open table format. Once that is done, much orchestration can be achieved by sharing references to data that has only been written once to the organization's source of truth. It is this move toward data sharing at all levels that is leading organizations to rethink the way that data is orchestrated and build the data products of the future. Conclusion Orchestration is the backbone of modern data systems. In many businesses, it is the core technology tasked with untangling their complex and interconnected processes, but new trends in open standards are offering a fresh take on how these dependencies can be coordinated. 
Instead of pushing greater complexity into the orchestration layer, systems are being built from the ground up to share data collaboratively. Cloud providers have been adding compatibility with these standards, which is helping pave the way for the best-of-breed solutions of tomorrow. By embracing composability, organizations can position themselves to simplify governance and benefit from the greatest advances happening in our industry.

This is an excerpt from DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics.
Continuing from my previous article, I want to discuss how Jenkins remains relevant in 2024 and will continue to be crucial in the microservices and container world over the next decade by adopting principles from other open-source projects.

Why Jenkins Still Has a Place in 2024

Let us start with some stats. Despite the rise of modern, cloud-native CI/CD tools like GitHub Actions and CircleCI, Jenkins remains a heavyweight in the continuous integration and delivery space. Holding an estimated 44-46% of the global CI/CD market in 2023, Jenkins continues to be widely adopted, with more than 11 million developers and over 200,000 active installations across various industries (CD Foundation, CloudBees). This widespread usage reflects Jenkins' strong position in enterprise environments, where its robust plugin ecosystem and extensive customization options continue to deliver value.

One of Jenkins' major strengths is its extensibility. With over 1,800 plugins, Jenkins can integrate deeply with legacy systems, internal workflows, and various third-party tools, making it an essential part of many large-scale and complex projects (CloudBees). In industries where infrastructure and application delivery rely on specialized or customized workflows — such as finance, healthcare, and manufacturing — Jenkins' ability to adapt to unique requirements remains unmatched. This flexibility is a key reason why Jenkins is still preferred in enterprises that have heavily invested in their CI/CD pipelines.

Moreover, Jenkins continues to see substantial growth in its usage. Between 2021 and 2023, Jenkins Pipeline usage increased by 79%, while overall job workloads grew by 45% (CD Foundation, CloudBees). These numbers indicate that, even in the face of newer competition, Jenkins is being used more frequently to automate complex software delivery processes.

Another factor contributing to Jenkins' staying power is its open-source nature and community support. With thousands of active contributors and corporate backing from major players like AWS, IBM, and CloudBees, Jenkins benefits from a large knowledge base and ongoing development (CD Foundation, CloudBees). This ensures that Jenkins remains relevant and adaptable to emerging trends, even if its architecture is not as cloud-native as some of its newer competitors.

While Jenkins may not be the go-to for modern Kubernetes or GitOps-focused workflows, it plays a critical role in on-premise and hybrid environments where companies require greater control, customization, and integration flexibility. Its deep entrenchment in enterprise systems and ongoing improvements ensure that Jenkins still has a crucial place in the CI/CD ecosystem in 2024 and beyond.

Moving Jenkins to a Prometheus-like architecture and strengthening the plugin system could address several of Jenkins' current challenges, especially regarding scalability, plugin complexity, and reliability. Let's break it down and explore how this shift could help Jenkins remain competitive in the future.

1. Prometheus-Like Architecture: Moving to Statelessness

Currently, Jenkins operates in a stateful architecture, which complicates horizontal scaling and cloud-native operations. Adopting a Prometheus-like architecture — which is stateless and relies on external storage — could drastically improve Jenkins' scalability. Prometheus, a popular monitoring tool, uses an exporter model, gathering data without retaining state internally.
Jenkins could benefit from this model by offloading state management to external databases or services.

Benefits

- Improved horizontal scaling: By decoupling Jenkins' state from its core operations, it would become easier to spin up and scale multiple Jenkins instances without worrying about syncing the state across nodes.
- Resilience: In a stateless system, individual failures (e.g., job failures or plugin crashes) wouldn't affect the entire server. This makes the system more resilient to downtime.
- Kubernetes-native: A Prometheus-like architecture would align better with Kubernetes, where stateless microservices are the norm, allowing Jenkins to thrive in modern cloud-native environments.

2. Strengthen Plugins to Work With Web Servers

One of Jenkins' key strengths is its extensive plugin ecosystem, but plugin management can be a nightmare. Strengthening plugins to work more autonomously — perhaps interfacing directly with web servers or microservices — could alleviate many of Jenkins' pain points.

Proposal for Plugin Architecture

- Segregated plugin operations: Plugins should be run in isolated, self-contained environments or sandboxes, much like how Docker containers operate. This would ensure that plugin failures do not bring down the main server, enhancing reliability.
- Web server-driven: Plugins could act as services that register with a central web server (Jenkins core). If a plugin is needed for a specific build step, the server could query the plugin service, much like how microservices communicate through APIs. This would allow each plugin to manage its own state and dependencies independently of Jenkins' core.
- Clearer API contracts: By having plugins act like web-based microservices, Jenkins could enforce clear API contracts that streamline interactions between the core and plugins, reducing the risk of misconfigurations.

3. Segregation of Duties: Isolate Plugin Failures

One of the most frequent complaints about Jenkins is that plugin failures can cause major disruptions. Segregating duties so that plugins do not affect the overall server is therefore crucial.

How This Could Work

- Plugin sandboxing: Each plugin could be run in an isolated process or container. If a plugin fails, the failure is contained to that specific process, and Jenkins' core server remains unaffected. This is akin to how web browsers handle tabs: if one tab crashes, the browser stays up.
- Service mesh: Jenkins could adopt a service mesh architecture similar to tools like Istio. Each plugin operates as a service, with a central Jenkins server acting as the orchestrator. This would allow the core to handle orchestration and state management without being tied down by plugin-related failures.

4. CI/CD as a Cloud-Native Service

With Jenkins' existing adoption in enterprise settings, transitioning to a cloud-native CI/CD service with this new architecture would allow it to better compete with tools like GitHub Actions and Argo CD, both of which are tailored for cloud-native environments.

Additional Considerations

Native Support for Containers and Microservices

With plugins functioning independently, Jenkins could be fully optimized for containerized environments. For example, each build step could be executed as a separate container, making it easier to manage resources and scale as needed.
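To make the plugin-as-a-service idea from the sections above a bit more tangible, here is a purely conceptual Python sketch of plugins registering with a core and being queried per build step; the class names, fields, and behavior are invented for illustration and do not correspond to Jenkins internals or any existing plugin API:

Python

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class PluginService:
    """A hypothetical plugin running as its own service."""
    name: str
    version: str
    run_step: Callable[[dict], dict]  # the plugin's build-step entry point

class CoreRegistry:
    """Stands in for a core server that orchestrates registered plugin services."""

    def __init__(self) -> None:
        self._plugins: Dict[str, PluginService] = {}

    def register(self, plugin: PluginService) -> None:
        # In a real system this would be an HTTP call governed by an API contract.
        self._plugins[plugin.name] = plugin

    def run_build_step(self, plugin_name: str, payload: dict) -> dict:
        plugin = self._plugins.get(plugin_name)
        if plugin is None:
            return {"status": "error", "reason": f"plugin {plugin_name!r} not registered"}
        try:
            return plugin.run_step(payload)
        except Exception as exc:  # a crashing plugin cannot take down the core
            return {"status": "error", "reason": str(exc)}

# Example usage: a hypothetical "git-checkout" plugin registers itself,
# then the core queries it when a pipeline needs that step.
registry = CoreRegistry()
registry.register(PluginService(
    name="git-checkout",
    version="1.0.0",
    run_step=lambda payload: {"status": "ok", "checked_out": payload["repo"]},
))
print(registry.run_build_step("git-checkout", {"repo": "https://example.com/app.git"}))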
Enhanced Monitoring and Observability

Borrowing from Prometheus' architecture, Jenkins could adopt a more observability-focused model, with built-in metrics and monitoring services. This would give DevOps teams more insight into the health of their pipelines, helping to reduce the complexity of managing large installations.

In Conclusion

Transitioning Jenkins towards a Prometheus-like, stateless architecture and restructuring plugins to operate independently would provide significant improvements in scalability, resilience, and deployment flexibility. This approach would enable seamless scaling by decoupling state from individual Jenkins nodes, reducing dependency on centralized databases and simplifying failure recovery. Additionally, independent plugins would facilitate faster development cycles, enhance fault isolation, and improve overall system performance by allowing each plugin to function autonomously without impacting others.
In recent years, deep learning models have steadily improved performance in Natural Language Processing (NLP) and Computer Vision benchmarks. While part of these gains comes from improvements in architecture and learning algorithms, a significant driver has been the increase in dataset sizes and model parameters. The following figure shows the top-1 ImageNet classification accuracy as a function of GFLOPS, which can be used as a metric for model complexity.

Scaling up data and model complexity seems to be the dominant trend, and multi-billion or even trillion-parameter models are not uncommon. While these large models have impressive performance, their sheer scale makes it impossible to use them on edge devices or for latency-critical applications. This is where model compression comes in. The goal of model compression is to reduce the model's parameter count and/or latency while minimizing the drop in performance. There are several approaches, but they can be grouped into three main categories:

- Pruning
- Quantization
- Knowledge Distillation (KD)

There are other approaches as well, like low-rank tensor decomposition, but we won't cover them in this article. Let's go over these three primary techniques in detail.

Pruning

With pruning, we can make a model smaller by removing less important weights (neuron connections) or layers from a neural network. A simple strategy can be to remove a neuron connection if the magnitude of its weight falls below a threshold. This is called weight pruning, and it ensures that we remove connections that are redundant or can be removed without affecting the final result much. Similarly, we can remove a neuron itself based on some metric of its importance, e.g., the L2 norm of outgoing weights. This is called neuron pruning and is generally more efficient than weight pruning.

Another advantage of neuron pruning over weight pruning is that the latter leads to a sparse network, which can be hard to optimize on hardware like GPUs. Though it would reduce the memory footprint and FLOPS, it might not translate to reduced latency. The idea of pruning can be extended to CNNs as well, where the relative importance of a filter/kernel can be determined based on its L1/L2 norm and only the important filters can be retained. In practice, pruning is an iterative process where we alternate between pruning and fine-tuning the model. Using this approach, we can achieve a minimal drop in performance while cutting down network parameters by over 50%, as shown in the image below:

Quantization

The main idea behind quantization-based model compression is to reduce the precision of model weights to reduce memory and latency. Generally, deep learning models during or after training have their weights stored as 32-bit floating point (FP32). With quantization, these are generally converted to 16- (FP16) or 8-bit (INT8) precision for runtime deployment. Quantization can be split into two categories:

Post-Training Quantization (PTQ)

This involves quantization of weights and activations after training and is achieved through a process called calibration. The goal of this process is to figure out a mapping from original to target precision while minimizing information loss. To achieve this, we use a set of samples from our dataset and run inference on the model, tracking the dynamic range for different activations in the model to determine the mapping function.
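A rough sketch of what that calibration step can look like in plain NumPy follows; the "model" is a single stand-in layer, and the batches, min/max range tracking, and INT8 target are illustrative simplifications rather than a production PTQ flow:

Python

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16)).astype(np.float32)  # stand-in for one layer's weights

def layer_activations(batch: np.ndarray) -> np.ndarray:
    """Stand-in for running inference and capturing one layer's activations."""
    return np.tanh(batch @ weights)

# Track the observed dynamic range over a handful of calibration batches.
min_val, max_val = np.inf, -np.inf
for _ in range(10):
    batch = rng.normal(size=(32, 8)).astype(np.float32)
    acts = layer_activations(batch)
    min_val = min(min_val, float(acts.min()))
    max_val = max(max_val, float(acts.max()))

# Derive an INT8 mapping from the calibrated range, using the same
# scale/zero-point formulation shown later in this article.
qmin, qmax = 0, 255
scale = (max_val - min_val) / (qmax - qmin)
zero_point = round((0 - min_val) / scale)

def quantize(x: np.ndarray) -> np.ndarray:
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)

print("scale:", scale, "zero_point:", zero_point)
print(quantize(layer_activations(rng.normal(size=(4, 8)).astype(np.float32))))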
Quantization Aware Training (QAT)

The main problem with training with lower precision weights and activations is that the gradients are not properly defined, so we can't do backpropagation. To solve this problem using QAT, the model simulates target precision during the forward pass but uses the original precision for the backward pass to compute gradients.

While PTQ is easy to implement and doesn't involve retraining the model, it can lead to degradation in performance. QAT, on the other hand, generally has better accuracy than PTQ but is not as straightforward to implement and increases training code complexity. From a mathematical standpoint, quantization and calibration for a given weight/activation involve determining two values: the scale and zero point. Say we wanted to convert from FP32 to INT8:

Python

# max_int for INT8 would be 255 and min_int 0
# max_float, min_float are determined in the calibration process
scale = (max_float - min_float) / (max_int - min_int)

# to allow for both positive and negative values to be quantized
zero_point = round((0 - min_float) / scale)

int8_value = round(fp32_value / scale) + zero_point

Knowledge Distillation

Knowledge distillation (KD), as the name suggests, tries to distill or transfer the knowledge of the original model, in this case called the teacher model, to a smaller model called the student model. There are several ways to achieve this, but the most common ones try to match the output or the intermediate feature representation of the teacher model with the student model. Interestingly, a student model trained on a combination of ground truth labels and soft labels from the teacher model's output performs better than one trained on ground truth labels alone, and at times it can even match the performance of the teacher model. One hypothesis for this behavior is that since the soft labels contain more information than ground truth labels (hard labels, e.g., one-hot), they help the student model generalize better.

Knowledge distillation is one of the more flexible model compression techniques because the resulting model can have a different architecture than the original model and has potential for larger memory and latency reduction compared to pruning or quantization. However, it is also the most complex to train since it involves training the teacher model, followed by designing and training the student model.

Conclusion

In practice, it's common to combine multiple compression techniques together — e.g., KD followed by PTQ or pruning — to achieve the desired result in terms of compression and accuracy.
What Is Apache Guacamole?

Apache Guacamole is an open-source framework created by the Apache Foundation that provides an HTML5 application that acts as a remote desktop gateway to enable access to remote desktops via the RDP, SSH, and VNC protocols without the use of any other third-party software. The Guacamole solution includes many individual components, such as libguac, guacamole-common, and guacamole-ext. While these projects are beyond the scope of this article, we'll home in on guacamole-common-js within the Guacamole ecosystem.

What Is guacamole-common-js?

The Guacamole project provides a JavaScript API for interfacing with components designed to meet Guacamole specifications. The guacamole-common-js API offers a JavaScript implementation of a Guacamole Client and tunneling mechanisms to transfer protocol data from JavaScript to the server side of the application. The server side typically runs a machine with guacd, or the Guacamole Daemon. The guacamole-common-js library provides mouse and keyboard abstraction objects to translate JavaScript mouse and keyboard events into data that Guacamole can easily digest.

Using guacamole-common-js to Create a Custom Guacamole Client

Prerequisites: Install guacamole-common-js via any package manager of your choice. In this example, we will use npm.

Shell

npm i guacamole-common-js

Step 1: Create a Guacamole Tunnel

Creating a Guacamole Tunnel allows you to stream data effortlessly between the server and your client.

JavaScript

const Guacamole = require('guacamole-common-js')

let tunnel = new Guacamole.Tunnel("path/to/your/tunnel");

You can pass additional parameters to your server via the tunnel URL using query parameters. For example, your tunnel URL can look like path/to/your/tunnel?param1=value1&param2=value2.

Step 2: Use the Tunnel Object to Create a Guacamole Client

You can create a Guacamole Client object by passing the Tunnel object you just created to the Guacamole.Client constructor.

JavaScript

let guacClient = new Guacamole.Client(tunnel)

Step 3: Call the Connect Function to Establish the Connection

So, we have the Guacamole Tunnel instance and the Guacamole Client instance. These are all we need to establish a connection to our remote machine.

JavaScript

guacClient.connect()

Just one thing to remember: The Guacamole.Tunnel object passed into the Guacamole.Client constructor must not already be connected. This is because, internally, the guacClient.connect() method will call the tunnel.connect() method, and if the tunnel is already connected, this operation will fail. Now, the astute amongst you will find that you still don't see the contents of your remote machine on your client. That's because we're still missing one crucial step.

Step 4: Get the Guacamole Display and Attach It to the DOM

Once you have established the connection by calling guacClient.connect(), you can view the remote machine's display by attaching the Guacamole display (an HTMLDivElement) to the DOM. Let's see how we can do that. Imagine you have an HTML page where you wish to show the display of your remote machine.

HTML

<html> <body id="guacCanvas"> </body> </html>

Next, let's get the HTMLDivElement, which needs to be displayed to the user, from the guacClient.
JavaScript // Get the display element from the guacClient let displayElement = guacClient.getDisplay().getElement(); // Get the element from the DOM and attach the displayElement to it let guacCanvas = document.getElementById('guacCanvas'); // Attach the displayElement to the canvas guacCanvas.appendChild(displayElement); Et voila! You now see the contents of your remote machine on your DOM. But wait, something isn't right. Your keyboard input does nothing, and neither does your mouse. How do we address that? Step 5: Configure Keyboard and Mouse Events To configure keyboard and mouse events, you need to set up the input handlers provided by guacamole-common-js. Let's first look at how we can configure mouse events. JavaScript let mouse = new Guacamole.Mouse(guacElement) // Primarily you need to handle 3 events. onmousedown, onmouseup, onmousemove. // The high level idea is to send the current state of the mouse to guacamole // whenever the mouse moves, the mouse gets clicked or unclicked. const sendMouseState = (mouseState) => { guacClient.sendMouseState(mouseState); } // Essentially sending mouse state for all the individual events mouse.onmousedown = mouse.onmouseup = mouse.onmousemove = sendMouseState; Keyboard configuration is even simpler because there are just two events that need to be configured. JavaScript let keyboard = new Guacamole.Keyboard(guacElement); // you need to pass in the HTMLElement here where you want the keyboard events to // be passed into Guacamole. For example, if you pass in the document object instead // of guacElement, you'll send all the events in the entire DOM to Guacamole, which may // or may not be something you want. If you don't have any other UI elements in your // DOM, you should be fine sending document, but if you have other UI elements in addition // to your guacCanvas, you'd be better off passing just the guacCanvas. // You need to configure 2 events. onkeyup and onkeydown keyboard.onkeydown = (keysym: number) => { guacClient.sendKeyEvent(1, keysym); // Send keydown event to the remote server }; keyboard.onkeyup = (keysym: number) => { guacClient.sendKeyEvent(0, keysym); // Send keyup event to the remote server }; Step 6: Configure Touch Events (Optional) Optionally, you can also configure your client for touch inputs, but it will be translated to Guacamole.Mouse events. JavaScript let touch = new Guacamole.Touch(guacCanvas); // You need to configure 3 events here ontouchstart, ontouchend, ontouchmove touch.onmousedown = touch.onmouseup = touch.onmousemove = (touchState) => guacClient.sendMouseState(touchState); const handleTouchEvent = (event) => { event.preventDefault(); let touchState = touch.getMouseState(event); guacClient.sendMouseState(touchState); } touch.ontouchstart = touch.ontouchend = touch.ontouchmove = handleTouchEvent; As you can see, we're translating touch events into Guacamole mouse events, and this step is entirely optional. You need to configure touch events only if you intend to use your custom client on a touchscreen device. Step 7: Disconnect From Your Remote Machine Finally, we've reached the last step, which is disconnecting from your remote machine, and it is as simple as calling a method on your client. JavaScript guacClient.disconnect(); Conclusion To summarize, Apache Guacamole is a powerful and versatile open-source framework that offers a seamless way to access remote desktops through the RDP, SSH, or VNC protocols. 
The guacamole-common-js library allows developers to create custom Guacamole clients that can interface with other Guacamole components like guacamole-common, libguac, and guacamole-ext. By following the steps outlined in this article, you can set up a basic custom Guacamole client that can connect to your remote servers and handle keyboard, mouse, and touch events.
Java Virtual Machine (JVM) tuning is the process of adjusting the default parameters to match our application needs. This ranges from simple adjustments, such as the size of the heap, through choosing the right garbage collector, to using optimized versions of getters.

Understanding the Java Virtual Machine (JVM)

What Is JVM?

The Java Virtual Machine (JVM) is a key component in the Java ecosystem that enables Java applications to be platform-independent. It interprets Java bytecode and executes it as machine code on various operating systems, making it possible to "write once, run anywhere."

Optimizing Garbage Collection

Garbage Collection

The Java application creates many objects to handle incoming requests. After these requests are serviced, the objects become 'garbage' and must be cleaned up. Garbage Collection (GC) is essential for freeing up memory but can slow response times and increase CPU usage. Therefore, tuning GC is important for optimizing performance.

ZGC Algorithm

The main feature of Z Garbage Collector (ZGC) is its focus on minimizing GC pause times. It is designed to keep pause times low, typically a few milliseconds, even with large heap sizes. Choose ZGC for multi-threaded or memory-intensive applications that require low-latency performance, with large heaps or high-throughput, low-latency use cases. It minimizes pause times, scales with large memory sizes, and improves predictability.

JVM Arguments

To achieve high throughput and low latency in multi-threaded or memory-intensive services, you have to tune the JVM arguments.

-Xss256k

The -Xss option in Java is used to set the thread stack size for each thread in the Java Virtual Machine (JVM). The -Xss256k option specifically sets the thread stack size to 256 kilobytes (KB). This value is useful when tuning Java applications, particularly in multi-threaded scenarios.

Pros:
- In highly concurrent applications, such as servers handling many simultaneous requests (e.g., web servers, message brokers, etc.), reducing the thread stack size can help prevent memory exhaustion by allowing more threads to be created without hitting memory limits.

-Xms<Size>g

The -Xms JVM option in Java specifies the initial heap size for the JVM when an application starts.

Pros:
- A large-scale web application, a data processing system, or an in-memory database might benefit from an initial large heap allocation to accommodate its memory needs without constantly resizing the heap during startup.

-Xmx<MaxSize>g

The -Xmx JVM option in Java sets the maximum heap size for the JVM. The max size could be up to 90% of RAM.

Pros:
- Avoids OutOfMemoryError
- Optimizes GC performance
- Optimizes memory-intensive applications (in-memory databases, caching, etc.)
- Avoids frequent heap resizing
- Improves performance for multi-threaded applications

-XX:+UseZGC

The -XX:+UseZGC option in Java enables ZGC in the JVM. ZGC is a low-latency garbage collector designed to minimize pause times during garbage collection, even for applications running with very large heaps (up to terabytes of memory).

Pros:
- Low-latency GC
- Scalability with large heaps
- Concurrent and incremental GC
- Suitable for containerized and cloud-native environments
- Low pause time, even during full GC
- Better for multi-core systems

-XX:+ZGenerational

The -XX:+ZGenerational option is used in Java to enable a generational mode for the ZGC.
By default, ZGC operates as a non-generational garbage collector, meaning it treats the entire heap as a single, unified region when performing garbage collection.

Pros:
- Improves performance for applications with many short-lived objects
- Reduces GC pause times by collecting young generation separately
- Improves heap management
- Reduces the cost of full GCs

Using -XX:+ZGenerational enables generational garbage collection in ZGC, which improves performance for applications with a mix of short-lived and long-lived objects by segregating these into different regions of the heap. This can lead to better memory management, reduced pause times, and improved scalability, particularly for large-scale applications that deal with large amounts of data.

-XX:SoftMaxHeapSize=<Size>g

The -XX:SoftMaxHeapSize JVM option is used to set a soft limit on the maximum heap size for a Java application. When you use -XX:SoftMaxHeapSize=4g, you're telling the JVM to aim for a heap size that does not exceed 4 gigabytes (GB) under normal conditions, but the JVM is allowed to exceed this limit if necessary.

Pros:
- Memory management with flexibility
- Handling memory surges without crashing
- Effective for containerized or cloud environments
- Performance tuning and resource optimization
- Preventing overuse of memory

-XX:+UseStringDeduplication

The -XX:+UseStringDeduplication option in Java enables string deduplication as part of the garbage collection process. String deduplication is a technique that allows the JVM to identify and remove duplicate string literals or identical string objects in memory, effectively reducing memory usage by storing only one copy of the string value, even if it appears multiple times in the application.

Pros:
- Reduces memory usage for duplicate strings
- Optimizes memory in large applications with many strings
- Applicable to interned strings
- Helps with large textual data handling
- Automatic deduplication with minimal configuration

Using the -XX:+UseStringDeduplication flag can be a great way to optimize memory usage in Java applications, especially those that deal with large numbers of repeated string values. By enabling deduplication, you allow the JVM to eliminate redundant copies of strings in the heap, which can lead to significant memory savings and improved performance in memory-intensive applications.

-XX:+ClassUnloadingWithConcurrentMark

The -XX:+ClassUnloadingWithConcurrentMark option in Java is used to enable the concurrent unloading of classes during the concurrent marking phase of garbage collection. This option is particularly useful when running applications that dynamically load and unload classes, such as application servers (e.g., Tomcat, Jetty), frameworks, or systems that rely on hot-swapping or dynamic class loading.

Pros:
- Reduces GC pause times
- Improves memory management in long-lived applications
- Better scalability for servers and containers
- Hot-swapping and frameworks

-XX:+UseNUMA

The -XX:+UseNUMA JVM option is used to optimize the JVM for systems with a Non-Uniform Memory Access (NUMA) architecture. NUMA is a memory design used in multi-processor systems where each processor has its own local memory, but it can also access the memory of other processors, albeit with higher latency. The -XX:+UseNUMA option enables the JVM to optimize memory allocation and garbage collection on NUMA-based systems to improve performance.

Pros:
- Improves memory access latency and performance
- Better GC efficiency
- Optimizes memory allocation
- Better scalability on multi-socket systems

What is NUMA?
In a NUMA system, processors are connected to local memory, and each processor can access its own local memory more quickly than the memory that is attached to other processors. In contrast to Uniform Memory Access (UMA) systems, where all processors have the same access time to all memory, NUMA systems have asymmetric memory access due to the varying latency to local versus remote memory. NUMA architectures are commonly used in large-scale servers with multiple processors (or sockets), where performance can be improved by ensuring that memory is accessed locally as much as possible.

What Does -XX:+UseNUMA Do?

When you enable the -XX:+UseNUMA option, the JVM is configured to optimize memory access by considering the NUMA topology of the system. Specifically, the JVM will:

- Allocate memory from the local NUMA node associated with the processor executing a given task (whenever possible).
- Keep thread-local memory close to the processor where the thread is running, reducing memory access latency.
- Improve garbage collection performance by optimizing how the JVM manages heap and other memory resources across multiple NUMA nodes.

-XX:ConcGCThreads=<size>

The -XX:ConcGCThreads option in Java allows you to control the number of concurrent GC threads that the JVM uses during the concurrent phases of garbage collection.

Pros:
- Controls the degree of parallelism in GC
- Minimizes garbage collection pause times
- Optimizes performance based on hardware resources
- Improves throughput for multi-threaded applications

What Does -XX:ConcGCThreads Do?

When the JVM performs garbage collection, certain collectors (e.g., G1 GC or ZGC) can execute phases of garbage collection concurrently, meaning they run in parallel with application threads to minimize pause times and improve throughput. The -XX:ConcGCThreads option allows you to specify how many threads the JVM should use during these concurrent GC phases.

-XX:+ZUncommit

The -XX:+ZUncommit JVM option is used to control the behavior of memory management in the ZGC, specifically related to how the JVM releases (or uncommits) memory from the operating system after it has been allocated for the heap.

Pros:
- Reduces memory footprint
- Dynamic memory reclamation in low-memory environments
- Avoids memory fragmentation
- Optimizes GC overhead

-XX:+AlwaysPreTouch

The -XX:+AlwaysPreTouch JVM option is used to pre-touch the memory pages that the JVM will use for its heap, meaning that the JVM will touch each page of memory (i.e., access it) as soon as the heap is allocated rather than lazily touching pages when they are actually needed.

Pros:
- Reduces latency during application runtime
- Prevents OS page faults during initial execution
- Preloads virtual memory in large heap applications
- Improves memory allocation efficiency in multi-core systems
- Avoids memory swapping during startup

-XX:MaxGCPauseMillis=<size>

The -XX:MaxGCPauseMillis option in Java is used to set a target for the maximum acceptable pause time during GC. When you specify -XX:MaxGCPauseMillis=100, you are instructing the JVM's garbage collector to aim for a maximum GC pause time of 100 milliseconds.

Pros:
- Minimize application latency
- Control and balance throughput vs. pause time
- Optimized for interactive or high-throughput systems
- Garbage collection tuning in G1 GC
- Improved user experience in web and server applications
- Use with ZGC and other low-latency garbage collectors

-XX:+UseLargePages

The -XX:+UseLargePages JVM option is used to enable the JVM to use large memory pages (also known as huge pages or Superpages) for the Java heap and other parts of the memory, such as the metaspace and JIT (Just-In-Time) compilation caches.

Pros:
- Improves memory access performance
- Reduces operating system overhead
- Better performance for memory-intensive applications
- Lower memory fragmentation
- Reduces memory paging activity

What Does -XX:+UseLargePages Do?

Operating systems typically manage memory in pages, which are the basic unit of memory allocation and management. The size of a memory page is usually 4 KB by default on many systems, but some systems support larger page sizes—commonly 2 MB (on x86-64 Linux and Windows systems) or even 1 GB for certain processors and configurations.

-XX:+UseTransparentHugePages

The -XX:+UseTransparentHugePages JVM option enables the use of Transparent Huge Pages (THP) for memory management in the Java Virtual Machine (JVM). Transparent Huge Pages are a Linux kernel feature designed to automatically manage large memory pages, improving performance for memory-intensive applications.

Bonus: How to Use JVM Arguments in a Dockerfile for Java and Spring Boot Services

Dockerfile

ENTRYPOINT [ "java", "-Xss256k", "-Xms1g", "-Xmx4g", "-XX:+UseZGC", "-XX:+UseStringDeduplication", "-XX:+ZGenerational", "-XX:SoftMaxHeapSize=4g", "-XX:+ClassUnloadingWithConcurrentMark", "-XX:+UseNUMA", "-XX:ConcGCThreads=4", "-XX:+ZUncommit", "-XX:+AlwaysPreTouch", "-XX:MaxGCPauseMillis=100", "-XX:+UseLargePages", "-XX:+UseTransparentHugePages", "org.springframework.boot.loader.launch.JarLauncher" ]

Conclusion

Performance tuning in the JVM is essential for optimizing multi-threaded and memory-intensive applications, especially when aiming for high throughput and low latency. The process involves fine-tuning garbage collection, optimizing memory management, and adjusting concurrency settings.
If you’re a Node.js developer, then you’re familiar with npm and Yarn. You might even have a strong opinion about using one over the other. For years, developers have been struggling with the bloat — in disk storage and build time — when working with Node.js package managers, especially npm. Source: Reddit Then, along came pnpm, a package manager that handles package storage differently, saving users space and reducing build time. Here’s how pnpm describes the difference: “When you install a package, we keep it in a global store on your machine, then we create a hard link from it instead of copying. For each version of a module, there is only ever one copy kept on disk. When using npm or yarn for example, if you have 100 packages using lodash, you will have 100 copies of lodash on disk. pnpm allows you to save gigabytes of disk space!” It’s no surprise that pnpm is gaining traction, with more and more developers making it their package manager of choice. Along with that growing adoption rate, many developers who run their apps on Heroku (like I do) wanted to see pnpm supported. Fortunately, pnpm is available via Corepack, which is distributed with Node.js. So, as of May 2024, pnpm is now available in Heroku! In this post, we’ll cover what it takes to get started with pnpm on Heroku. And, we’ll also show off some of the storage and build-time benefits you get from using it. A Quick Primer on pnpm pnpm was created to solve the longstanding Node.js package manager issue of redundant storage and inefficiencies in dependency handling. npm and Yarn copy dependencies into each project’s node_modules. In contrast, pnpm keeps all the packages for all projects in a single global store, then creates hard links to these packages rather than copying them. What does this mean? Let’s assume we have a Node.js project that uses lodash. Naturally, the project will have a node_modules folder, along with a subfolder called lodash, filled with files. To be exact, lodash@4.17.21 has 639 files and another subfolder called fp, with another 415 files. That’s over a thousand files for lodash alone! I created six Node.js projects: two with pnpm, two with npm, and two with Yarn. Each of them uses lodash. Let’s take a look at information for just one of the files in the lodash dependency folder. Shell ~/six-projects$ ls -i npm-foo/node_modules/lodash/lodash.js 14754214 -rw-rw-r-- 544098 npm-foo/node_modules/lodash/lodash.js ~/six-projects$ ls -i npm-bar/node_modules/lodash/lodash.js 14757384 -rw-rw-r-- 544098 npm-bar/node_modules/lodash/lodash.js ~/six-projects$ ls -i yarn-foo/node_modules/lodash/lodash.js 14760047 -rw-r--r-- 544098 yarn-foo/node_modules/lodash/lodash.js ~/six-projects$ ls -i yarn-bar/node_modules/lodash/lodash.js 14762739 -rw-r--r-- 544098 yarn-bar/node_modules/lodash/lodash.js ~/six-projects$ ls -i pnpm-foo/node_modules/lodash/lodash.js 15922696 -rw-rw-r-- 544098 pnpm-foo/node_modules/lodash/lodash.js ~/six-projects$ ls -i pnpm-bar/node_modules/lodash/lodash.js 15922696 -rw-rw-r-- 544098 pnpm-bar/node_modules/lodash/lodash.js The lodash.js file is a little over half a megabyte in size. We’re not seeing soft links, so at first glance, it really looks like each project has its own copy of this file. However, that’s not actually the case. I used ls with the -i flag to display the inode of lodash.js file. You can see in the pnpm-foo and pnpm-bar projects, both files have the same inode (15922696). They’re pointing to the same file! That’s not the case for npm or Yarn. 
So, if you have a dozen projects that use npm or Yarn, and those projects use lodash, then you'll have a dozen different copies of lodash, along with copies from other dependencies in those projects which themselves use lodash. In pnpm, every project and dependency that requires this specific version of lodash points to the same, single, global copy. The code for lodash@4.17.21 is just under 5 MB in size. Would you rather have 100 redundant copies of it on your machine, or just one global copy?

At the end of the day, dependency installation with pnpm is significantly faster, requiring less disk space and fewer resources. For developers working across multiple projects or managing dependencies on cloud platforms, pnpm offers a leaner, faster way to manage packages. This makes pnpm ideal for a streamlined deployment environment like Heroku. Are you ready to start using it? Let's walk through how.

Getting Started With pnpm

Here's the version of Node.js we're working with on our machine:

Shell

$ node --version
v20.18.0

Enable and Use pnpm

As we mentioned above, Corepack comes with Node.js, so we simply need to use corepack to enable and use pnpm. We create a folder for our project. Then, we run these commands:

Shell

~/project-pnpm$ corepack enable pnpm
~/project-pnpm$ corepack use pnpm@latest
Installing pnpm@9.12.2 in the project...

Already up to date
Done in 494ms

This generates a package.json file that looks like this:

JSON

{
  "packageManager": "pnpm@9.12.2+sha512.22721b3a11f81661ae1ec68ce1a7b879425a1ca5b991c975b074ac220b187ce56c708fe5db69f4c962c989452eee76c82877f4ee80f474cebd61ee13461b6228"
}

This also generates a pnpm-lock.yaml file. Next, we add dependencies to our project. For demonstration purposes, we're copying the list of dependencies and devDependencies found in this benchmarking package.json file on GitHub. Now, our package.json file looks like this:

JSON

{
  "version": "0.0.1",
  "dependencies": {
    "animate.less": "^2.2.0",
    "autoprefixer": "^10.4.17",
    "babel-core": "^6.26.3",
    "babel-eslint": "^10.1.0",
    ...
    "webpack-split-by-path": "^2.0.0",
    "whatwg-fetch": "^3.6.20"
  },
  "devDependencies": {
    "nan-as": "^1.6.1"
  },
  "packageManager": "pnpm@9.12.2+sha512.22721b3a11f81661ae1ec68ce1a7b879425a1ca5b991c975b074ac220b187ce56c708fe5db69f4c962c989452eee76c82877f4ee80f474cebd61ee13461b6228"
}

Then, we install the packages.

Shell

~/project-pnpm$ pnpm install

Comparing Common Commands

The usage for pnpm is fairly similar to npm or yarn, and so it should be intuitive. Below is a table that compares the different usages for common commands (taken from this post).

| npm | yarn | pnpm |
|---|---|---|
| npm init | yarn init | corepack use pnpm@latest |
| npm install | yarn | pnpm install |
| npm install [pkg] | yarn add [pkg] | pnpm add [pkg] |
| npm uninstall [pkg] | yarn remove [pkg] | pnpm remove [pkg] |
| npm update | yarn upgrade | pnpm update |
| npm list | yarn list | pnpm list |
| npm run [scriptName] | yarn [scriptName] | pnpm [scriptName] |
| npx [command] | yarn dlx [command] | pnpm dlx [command] |
| npm exec [commandName] | yarn exec [commandName] | pnpm exec [commandName] |
| npm init [initializer] | yarn create [initializer] | pnpm create [initializer] |

Heroku Build Speed Comparison

Now that we've shown how to get a project up and running with pnpm (it's pretty simple, right?), we wanted to compare the build times for different package managers when running on Heroku. We set up three projects with identical dependencies — using npm, Yarn, and pnpm. First, we log in to the Heroku CLI (heroku login). Then, we create an app for a project. We'll show the steps for the npm project.
Shell ~/project-npm$ heroku apps:create --stack heroku-24 npm-timing Creating ⬢ npm-timing... done, stack is heroku-24 https://npm-timing-5d4e30a1c656.herokuapp.com/ | https://git.heroku.com/npm-timing.git We found a buildpack that adds timestamps to the build steps in the Heroku log, so that we can calculate the actual build times for our projects. We want to add that buildpack to our project, and have it run before the standard buildpack for Node.js. We do that with the following two commands: Shell ~/project-npm$ heroku buildpacks:add \ --index=1 \ https://github.com/edmorley/heroku-buildpack-timestamps.git \ --app pnpm-timing ~/project-npm$ heroku buildpacks:add \ --index=2 heroku/nodejs \ --app npm-timing Buildpack added. Next release on npm-timing will use: 1. https://github.com/edmorley/heroku-buildpack-timestamps.git 2. heroku/nodejs Run git push heroku main to create a new release using these buildpacks. That’s it! Then, we push up the code for our npm-managed project. Shell ~/project-npm$ git push heroku main ... remote: Updated 4 paths from 5af8e67 remote: Compressing source files... done. remote: Building source: remote: remote: -----> Building on the Heroku-24 stack remote: -----> Using buildpacks: remote: 1. https://github.com/edmorley/heroku-buildpack-timestamps.git remote: 2. heroku/nodejs remote: -----> Timestamp app detected remote: -----> Node.js app detected ... remote: 2024-10-22 22:31:29 -----> Installing dependencies remote: 2024-10-22 22:31:29 Installing node modules remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 added 1435 packages, and audited 1436 packages in 11s remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 184 packages are looking for funding remote: 2024-10-22 22:31:41 run `npm fund` for details remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 96 vulnerabilities (1 low, 38 moderate, 21 high, 36 critical) remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 To address issues that do not require attention, run: remote: 2024-10-22 22:31:41 npm audit fix remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 To address all issues possible (including breaking changes), run: remote: 2024-10-22 22:31:41 npm audit fix --force remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 Some issues need review, and may require choosing remote: 2024-10-22 22:31:41 a different dependency. remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 Run `npm audit` for details. remote: 2024-10-22 22:31:41 npm notice remote: 2024-10-22 22:31:41 npm notice New minor version of npm available! 
10.8.2 -> 10.9.0 remote: 2024-10-22 22:31:41 npm notice Changelog: https://github.com/npm/cli/releases/tag/v10.9.0 remote: 2024-10-22 22:31:41 npm notice To update run: npm install -g npm@10.9.0 remote: 2024-10-22 22:31:41 npm notice remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 -----> Build remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 -----> Caching build remote: 2024-10-22 22:31:41 - npm cache remote: 2024-10-22 22:31:41 remote: 2024-10-22 22:31:41 -----> Pruning devDependencies remote: 2024-10-22 22:31:44 remote: 2024-10-22 22:31:44 up to date, audited 1435 packages in 4s remote: 2024-10-22 22:31:44 remote: 2024-10-22 22:31:44 184 packages are looking for funding remote: 2024-10-22 22:31:44 run `npm fund` for details remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 96 vulnerabilities (1 low, 38 moderate, 21 high, 36 critical) remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 To address issues that do not require attention, run: remote: 2024-10-22 22:31:45 npm audit fix remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 To address all issues possible (including breaking changes), run: remote: 2024-10-22 22:31:45 npm audit fix --force remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 Some issues need review, and may require choosing remote: 2024-10-22 22:31:45 a different dependency. remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 Run `npm audit` for details. remote: 2024-10-22 22:31:45 npm notice remote: 2024-10-22 22:31:45 npm notice New minor version of npm available! 10.8.2 -> 10.9.0 remote: 2024-10-22 22:31:45 npm notice Changelog: https://github.com/npm/cli/releases/tag/v10.9.0 remote: 2024-10-22 22:31:45 npm notice To update run: npm install -g npm@10.9.0 remote: 2024-10-22 22:31:45 npm notice remote: 2024-10-22 22:31:45 remote: 2024-10-22 22:31:45 -----> Build succeeded! ...

We looked at the timing for the following steps, up until the Build succeeded message near the end:

- Installing dependencies
- Build
- Pruning devDependencies
- Caching build

In total, with npm, this build took 16 seconds. We ran the same setup for the pnpm-managed project, also using the timings buildpack.

Shell

~/project-pnpm$ heroku apps:create --stack heroku-24 pnpm-timing ~/project-pnpm$ heroku buildpacks:add \ --index=1 \ https://github.com/edmorley/heroku-buildpack-timestamps.git \ --app pnpm-timing ~/project-pnpm$ heroku buildpacks:add \ --index=2 heroku/nodejs \ --app pnpm-timing ~/project-pnpm$ git push heroku main … remote: 2024-10-22 22:38:34 -----> Installing dependencies remote: 2024-10-22 22:38:34 Running 'pnpm install' with pnpm-lock.yaml … remote: 2024-10-22 22:38:49 remote: 2024-10-22 22:38:49 dependencies: remote: 2024-10-22 22:38:49 + animate.less 2.2.0 remote: 2024-10-22 22:38:49 + autoprefixer 10.4.20 remote: 2024-10-22 22:38:49 + babel-core 6.26.3 … remote: 2024-10-22 22:38:51 -----> Build succeeded!

For the same build with pnpm, it took only 7 seconds. The time savings, we found, isn't just for that initial installation. Subsequent builds, which use the dependency cache, are also faster with pnpm.

|  | npm | yarn | pnpm |
|---|---|---|---|
| First build | 16 seconds | 28 seconds | 7 seconds |
| Second build (using cache) | 15 seconds | 12 seconds | 7 seconds |

Conclusion

When I first started Node.js development, I used npm. Several years ago, I switched to Yarn, and that's what I had been using. . . until recently. Now, I've made the switch to pnpm. On my local machine, I'm able to free up substantial disk space. Builds are faster too.
And now, with Heroku support for pnpm, this closes the loop so that I can use it exclusively from local development all the way to deployment in the cloud. Happy coding!
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Businesses today rely significantly on data to drive customer engagement, make well-informed decisions, and optimize operations in the fast-paced digital world. For this reason, real-time data and analytics are becoming increasingly necessary as the volume of data continues to grow. Real-time data enables businesses to respond instantly to changing market conditions, providing a competitive edge in various industries. Because of their robust infrastructure, scalability, and flexibility, cloud data platforms have become the best option for managing and analyzing real-time data streams. This article explores the key aspects of real-time data streaming and analytics on cloud platforms, including architectures, integration strategies, benefits, challenges, and future trends. Cloud Data Platforms and Real-Time Data Streaming Cloud data platforms and real-time data streaming have changed the way organizations manage and process data. Real-time streaming processes data as it is generated from different sources, unlike batch processing, where data is stored and processed at scheduled intervals. Cloud data platforms provide the necessary scalable infrastructure and services to ingest, store, and process these real-time data streams. Some of the key features that make cloud platforms efficient in handling the complexities of real-time data streaming include:

Scalability. Cloud platforms can automatically scale resources to handle fluctuating data volumes. This allows applications to perform consistently, even at peak loads.
Low latency. Real-time analytics systems are designed to minimize latency, providing near-real-time insights and enabling businesses to react quickly to new data.
Fault tolerance. Cloud platforms provide fault-tolerant systems to ensure continuous data processing without any disturbance, whether caused by hardware malfunctioning or network errors.
Integration. These platforms are integrated with cloud services for storage, AI/ML tooling, and various data sources to create comprehensive data ecosystems.
Security. Advanced security features, including encryption, access controls, and compliance certifications, ensure that real-time data remains secure and meets regulatory requirements.
Monitoring and management tools. Cloud-based platforms offer dashboards, notifications, and additional monitoring instruments that enable enterprises to observe data flow and processing efficiency in real time.
This table highlights key tools from AWS, Azure, and Google Cloud, focusing on their primary features and the importance of each in real-time data processing and cloud infrastructure management:

Table 1

Cloud service | Key features | Importance
AWS Auto Scaling | Automatic scaling of resources; predictive scaling; fully managed | Cost-efficient resource management; better fault tolerance and availability
Amazon CloudWatch | Monitoring and logging; customizable alerts and dashboards | Provides insights into system performance; helps with troubleshooting and optimization
Google Pub/Sub | Stream processing and data integration; seamless integration with other GCP services | Low latency and high availability; automatic capacity management
Azure Data Factory | Data workflow orchestration; support for various data sources; customizable data flows | Automates data pipelines; integrates with diverse data sources
Azure Key Vault | Identity management; secrets and key management | Centralized security management; protecting and managing sensitive data

Cloud providers offer various features for real-time data streaming. When selecting a platform, consider factors like scalability, availability, and compatibility with data processing tools. Select a platform that fits your organization’s setup, security requirements, and data transfer needs. To support your cloud platform and real-time data streaming, here are some key open-source technologies and frameworks:

Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming applications.
Apache Flink is a stream processing framework that supports complex event processing and stateful computations.
Apache Spark Streaming is an extension of Apache Spark for handling real-time data.
Kafka Connect is a framework that helps connect Kafka with different data sources and storage options. Connectors can be set up to transfer data between Kafka and outside systems.

Real-Time Data Architectures on Cloud Data Platforms The implementation of real-time data analytics requires choosing the proper architecture that fits the special needs of an organization. Common Architectures Different data architectures offer various ways to manage real-time data. Here’s a comparison of the most popular real-time data architectures:

Table 2. Data architecture patterns and use cases

Architecture | Description | Ideal use cases
Lambda | Hybrid approach that combines batch and real-time processing; uses a batch layer to process historical data and a real-time layer for real-time data, merging the results for comprehensive analytics | Applications that need historical and real-time data
Kappa | Simplifies the Lambda architecture, focuses purely on real-time data processing, and removes the need for batch processing | Instances where only real-time data is required
Event driven | Processes data based on events triggered by specific actions or conditions, enabling real-time response to changes in data | Situations when instant notifications on data changes are needed
Microservices | Modular approach wherein the individual microservices handle specific tasks within the real-time data pipeline, lending scalability and flexibility | Complex systems that need to be modular and scalable

These architectures offer adaptable solutions for different real-time data issues, whether the requirement is combining past data, concentrating on current data streams, responding to certain events, or handling complicated systems with modular services. Figure 1.
Common data architectures for real-time streaming Integration of Real-Time Data in Cloud Platforms Integrating real-time data with cloud platforms is changing how companies handle and understand their data. It offers quick insights and enhances decision making by using up-to-date information. For the integration process to be successful, you must select the right infrastructure, protocols, and data processing tools for your use case. Key integration strategies include:

Integration with on-premises systems. Many organizations combine cloud platforms with on-premises systems to operate in hybrid environments. To ensure data consistency and availability, it is necessary to have efficient real-time data transfer and synchronization between these systems.
Integration with third-party APIs and software. The integration of real-time analytics solutions with third-party APIs — such as social media streams, financial data providers, or customer relationship management systems — can improve the quality of insights generated.
Data transformation and enrichment. Before analysis, real-time data often needs to be transformed and enriched. Cloud platforms offer tools to make sure the data is in the right format and context for analysis.
Ingestion and processing pipelines. Set up automated pipelines that manage data flow from the source to the target, improving real-time data handling without latency. These pipelines can be adjusted and tracked on the cloud platform, providing flexibility and control.

Integration of real-time data in cloud platforms involves data ingestion from different data sources and processing in real time by using stream processing frameworks like Apache Flink or Spark Streaming. Data integration can also be used on cloud platforms that support scalable and reliable stream processing. Finally, results are archived in cloud-based data lakes or warehouses, enabling users to visualize and analyze streaming data in real time. Figure 2. Integration of real-time data streams Here are the steps to set up real-time data pipelines on cloud platforms:

Select the cloud platform that fits your organization’s needs best.
Determine the best data ingestion tool for your goals and requirements. One of the most popular data ingestion tools is Apache Kafka due to its scalability and fault tolerance. If you’re planning to use a managed Kafka service, setup might be minimal. For self-managed Kafka, follow these steps:
Identify the data sources to connect, like IoT devices, web logs, app events, social media feeds, or external APIs.
Create virtual machines or instances on your cloud provider to host Kafka brokers. Install Kafka and adjust the configuration files as per your requirements.
Create Kafka topics for different data streams and set up the partitions to distribute the topics across Kafka brokers. Here is the sample command to create topics using the command line interface (CLI). The command below creates a topic stream_data with 2 partitions and a replication factor of 2:

Shell
kafka-topics.sh --create --topic stream_data --bootstrap-server your-broker:9092 --partitions 2 --replication-factor 2

Configure Kafka producers to push real-time data to Kafka topics from various data sources:

Utilize the Kafka Producer API to develop producer logic.
Adjust batch settings for better performance (e.g., linger.ms, batch.size).
Set a retry policy to manage temporary failures.
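To make the producer step concrete, here is a minimal sketch using the open-source kafka-python client. The broker address, the JSON event shape, and the use of the stream_data topic created above are illustrative assumptions rather than part of the original setup; the keyword arguments mirror the sample configuration properties shown next.

Python
# Minimal Kafka producer sketch (assumes the kafka-python package and a
# reachable broker; the event payload below is hypothetical).
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["your-kafka-broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # dicts -> JSON bytes
    batch_size=16384,  # maximum batch size in bytes
    linger_ms=5,       # wait up to 5 ms so batches can fill
    retries=2,         # retry transient send failures
    acks="all",        # acknowledge only after full replication
)

# Hypothetical event; in practice this comes from IoT devices, web logs, app events, etc.
event = {"source": "web", "user_id": 42, "action": "click", "ts": time.time()}
producer.send("stream_data", value=event)
producer.flush()  # block until buffered records are delivered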
Sample Kafka producer configuration properties:

Shell
bootstrap.servers=your-kafka-broker:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
batch.size=15350
linger.ms=5
retries=2
acks=all

batch.size sets the max size (bytes) of batch records, linger.ms controls the wait time, and the acks=all setting ensures that data is confirmed only after it has been replicated. Consume messages from Kafka topics by setting up Kafka consumers that subscribe to a topic and process the streaming messages. Once data is added to Kafka, you can use stream processing tools like Apache Flink, Apache Spark, or Kafka Streams to transform, aggregate, and enrich data in real time. These tools operate simultaneously and send the results to other systems. For data storage and retention, create a real-time data pipeline connecting your stream processing engine to analytics services like BigQuery, Redshift, or other cloud storage services. After you collect and save data, use tools such as Grafana, Tableau, or Power BI for analytics and visualization in near real time to enable data-driven decision making. Effective monitoring, scaling, and security are essential for a reliable real-time data pipeline:

Use Kafka's metrics and monitoring tools or Prometheus with Grafana for visual displays.
Set up autoscaling for Kafka or message brokers to handle sudden increases in load.
Leverage Kafka's built-in features or integrate with cloud services to manage access.
Enable TLS for data encryption in transit and use encrypted storage for data at rest.

Combining Cloud Data Platforms With Real-Time Data Streaming: Benefits and Challenges The real-time data and analytics capabilities of cloud platforms provide several advantages, including:

Improved decision making. Having instant access to data provides real-time insights, helping organizations to make proactive and informed decisions that can affect their business outcomes.
Improved customer experience. Through personalized interactions, organizations can engage with customers in real time to improve customer satisfaction and loyalty.
Operational efficiency. Automation and real-time monitoring help find and fix issues faster, reducing manual work and streamlining operations.
Flexibility and scalability. Cloud platforms allow organizations to adjust their resources according to demand, so they only pay for the services they use while keeping their operations running smoothly.
Cost effectiveness. Pay-as-you-go models help organizations use their resources more efficiently by lowering spending on infrastructure and hardware.

Despite the advantages, there are many challenges in implementing real-time data and analytics on cloud platforms, including:

Data latency and consistency. Applications need to find a balance between how fast they process data and how accurate and consistent that data is, which can be challenging in complex settings.
Scalability concerns. Even though cloud platforms offer scalability, handling large-scale real-time processing in practice can be quite challenging in terms of planning and optimization.
Integration complexity. Integration of real-time data streaming processes with legacy systems, on-prem infrastructure, or previously implemented solutions can be difficult, especially in hybrid environments; it may need a lot of customization.
Data security and privacy. Data security must be maintained throughout the entire process, from collection to storage and analysis.
It is important to ensure that real-time data complies with regulations like GDPR and to keep security strong across different systems.
Cost management. Cloud platforms are cost effective; however, managing costs can become challenging when processing large volumes of data in real time. It’s important to regularly monitor and manage expenses.

Future Trends in Real-Time Data and Analytics in Cloud Platforms The future of real-time data and analytics in cloud platforms is promising, with several trends set to shape the landscape. A few of these trends are outlined below:

Innovations in AI and machine learning will have a significant impact on cloud data platforms and real-time data streaming. By integrating AI/ML models into data pipelines, decision-making processes can be automated, predictive insights can be obtained, and data-driven applications can be improved.
More real-time data processing is needed closer to the source of data generation as a result of the growth of edge computing and IoT devices. In order to lower latency and minimize bandwidth usage, edge computing allows data to be processed on devices located at the network's edge.
Serverless computing is streamlining the deployment and management of real-time data pipelines, reducing the operational burden on businesses. Because of its scalability and affordability, serverless computing models — where the cloud provider manages the infrastructure — are becoming increasingly common for processing data in real time.

In order to support the growing complexity of real-time data environments, these emerging technology trends will offer more flexible and decentralized approaches to data management. Conclusion Real-time data and analytics are changing how systems are built, and cloud data platforms offer the scalable tools and infrastructure needed to efficiently manage real-time data streams. Businesses that use real-time data and analytics on their cloud platforms will be better positioned to thrive in an increasingly data-driven world as technology continues to advance. Emerging trends like serverless architectures, AI integration, and edge computing will further enhance the value of real-time data analytics. These improvements will lead to new ideas in data processing and system performance, influencing the future of real-time data management. This is an excerpt from DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. This article explores the essential strategies for leveraging real-time data streaming to drive actionable insights while future proofing systems through AI automation and vector databases. It delves into the evolving architectures and tools that empower businesses to stay agile and competitive in a data-driven world. Real-Time Data Streaming: The Evolution and Key Considerations Real-time data streaming has evolved from traditional batch processing, where data was processed in intervals that introduced delays, to continuously handle data as it is generated, enabling instant responses to critical events. By integrating AI, automation, and vector databases, businesses can further enhance their capabilities, using real-time insights to predict outcomes, optimize operations, and efficiently manage large-scale, complex datasets. Necessity of Real-Time Streaming There is a need to act on data as soon as it is generated, particularly in scenarios like fraud detection, log analytics, or customer behavior tracking. Real-time streaming enables organizations to capture, process, and analyze data instantaneously, allowing them to react swiftly to dynamic events, optimize decision making, and enhance customer experiences in real time. Sources of Real-Time Data Real-time data originates from various systems and devices that continuously generate data, often in vast quantities and in formats that can be challenging to process. Sources of real-time data often include:

IoT devices and sensors
Server logs
App activity
Online advertising
Database change events
Website clickstreams
Social media platforms
Transactional databases

Effectively managing and analyzing these data streams requires a robust infrastructure capable of handling unstructured and semi-structured data; this allows businesses to extract valuable insights and make real-time decisions. Critical Challenges in Modern Data Pipelines Modern data pipelines face several challenges, including maintaining data quality, ensuring accurate transformations, and minimizing pipeline downtime:

Poor data quality can lead to flawed insights.
Data transformations are complex and require precise scripting.
Frequent downtime disrupts operations, making fault-tolerant systems essential.

Additionally, data governance is crucial to ensure data consistency and reliability across processes. Scalability is another key issue as pipelines must handle fluctuating data volumes, and proper monitoring and alerting are vital for avoiding unexpected failures and ensuring smooth operation. Advanced Real-Time Data Streaming Architectures and Application Scenarios This section demonstrates the capabilities of modern data systems to process and analyze data in motion, providing organizations with the tools to respond to dynamic events in milliseconds. Steps to Build a Real-Time Data Pipeline To create an effective real-time data pipeline, it's essential to follow a series of structured steps that ensure smooth data flow, processing, and scalability. Table 1, shared below, outlines the key steps involved in building a robust real-time data pipeline:

Table 1. Steps to build a real-time data pipeline

Step | Activities performed
1. Data ingestion | Set up a system to capture data streams from various sources in real time
2. Data processing | Cleanse, validate, and transform the data to ensure it is ready for analysis
3. Stream processing | Configure consumers to pull, process, and analyze data continuously
4. Storage | Store the processed data in a suitable format for downstream use
5. Monitoring and scaling | Implement tools to monitor pipeline performance and ensure it can scale with increasing data demands

Leading Open-Source Streaming Tools To build robust real-time data pipelines, several leading open-source tools are available for data ingestion, storage, processing, and analytics, each playing a critical role in efficiently managing and processing large-scale data streams. Open-source tools for data ingestion:

Apache NiFi, with its latest 2.0.0-M3 version, offers enhanced scalability and real-time processing capabilities.
Apache Airflow is used for orchestrating complex workflows.
Apache StreamSets provides continuous data flow monitoring and processing.
Airbyte simplifies data extraction and loading, making it a strong choice for managing diverse data ingestion needs.

Open-source tools for data storage:

Apache Kafka is widely used for building real-time pipelines and streaming applications due to its high scalability, fault tolerance, and speed.
Apache Pulsar, a distributed messaging system, offers strong scalability and durability, making it ideal for handling large-scale messaging.
NATS.io is a high-performance messaging system, commonly used in IoT and cloud-native applications, that is designed for microservices architectures and offers lightweight, fast communication for real-time data needs.
Apache HBase, a distributed database built on top of HDFS, provides strong consistency and high throughput, making it ideal for storing large amounts of real-time data in a NoSQL environment.

Open-source tools for data processing:

Apache Spark stands out with its in-memory cluster computing, providing fast processing for both batch and streaming applications.
Apache Flink is designed for high-performance distributed stream processing and supports batch jobs.
Apache Storm is known for its ability to process more than a million records per second, making it extremely fast and scalable.
Apache Apex offers unified stream and batch processing.
Apache Beam provides a flexible model that works with multiple execution engines like Spark and Flink.
Apache Samza, developed by LinkedIn, integrates well with Kafka and handles stream processing with a focus on scalability and fault tolerance.
Heron, developed by Twitter, is a real-time analytics platform that is highly compatible with Storm but offers better performance and resource isolation, making it suitable for high-speed stream processing at scale.

Open-source tools for data analytics:

Apache Kafka allows high-throughput, low-latency processing of real-time data streams.
Apache Flink offers powerful stream processing, ideal for applications requiring distributed, stateful computations.
Apache Spark Streaming, integrated with the broader Spark ecosystem, handles real-time and batch data within the same platform.
Apache Druid and Pinot serve as real-time analytical databases, offering OLAP capabilities that allow querying of large datasets in real time, making them particularly useful for dashboards and business intelligence applications.

Implementation Use Cases Real-world implementations of real-time data pipelines showcase the diverse ways in which these architectures power critical applications across various industries, enhancing performance, decision making, and operational efficiency.
Financial Market Data Streaming for High-Frequency Trading Systems In high-frequency trading systems, where milliseconds can make the difference between profit and loss, Apache Kafka or Apache Pulsar are used for high-throughput data ingestion. Apache Flink or Apache Storm handle low-latency processing to ensure trading decisions are made instantly. These pipelines must support extreme scalability and fault tolerance as any system downtime or processing delay can lead to missed trading opportunities or financial loss. IoT and Real-Time Sensor Data Processing Real-time data pipelines ingest data from IoT sensors, which capture information such as temperature, pressure, or motion, and then process the data with minimal latency. Apache Kafka is used to handle the ingestion of sensor data, while Apache Flink or Apache Spark Streaming enable real-time analytics and event detection. Figure 1 shared below shows the steps of stream processing for IoT from data sources to dashboarding: Figure 1. Stream processing for IoT Fraud Detection From Transaction Data Streaming Transaction data is ingested in real time using tools like Apache Kafka, which handles high volumes of streaming data from multiple sources, such as bank transactions or payment gateways. Stream processing frameworks like Apache Flink or Apache Spark Streaming are used to apply machine learning models or rule-based systems that detect anomalies in transaction patterns, such as unusual spending behavior or geographic discrepancies. How AI Automation Is Driving Intelligent Pipelines and Vector Databases Intelligent workflows leverage real-time data processing and vector databases to enhance decision making, optimize operations, and improve the efficiency of large-scale data environments. Data Pipeline Automation Data pipeline automation enables the efficient handling of large-scale data ingestion, transformation, and analysis tasks without manual intervention. Apache Airflow ensures that tasks are triggered in an automated way at the right time and in the correct sequence. Apache NiFi facilitates automated data flow management, enabling real-time data ingestion, transformation, and routing. Apache Kafka ensures that data is processed continuously and efficiently. Pipeline Orchestration Frameworks Pipeline orchestration frameworks are essential for automating and managing data workflows in a structured and efficient manner. Apache Airflow offers features like dependency management and monitoring. Luigi focuses on building complex pipelines of batch jobs. Dagster and Prefect provide dynamic pipeline management and enhanced error handling. Adaptive Pipelines Adaptive pipelines are designed to dynamically adjust to changing data environments, such as fluctuations in data volume, structure, or sources. Apache Airflow or Prefect allow for real-time responsiveness by automating task dependencies and scheduling based on current pipeline conditions. These pipelines can leverage frameworks like Apache Kafka for scalable data streaming and Apache Spark for adaptive data processing, ensuring efficient resource usage. Streaming Pipelines A streaming pipeline for populating a vector database for real-time retrieval-augmented generation (RAG) can be built entirely using tools like Apache Kafka and Apache Flink. The processed streaming data is then converted into embeddings and stored in a vector database, enabling efficient semantic search. 
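As a rough illustration of such a streaming pipeline, the sketch below consumes text events from a Kafka topic, embeds them, and adds the vectors to an in-memory FAISS index. The topic name, the embedding model, and the choice of kafka-python, sentence-transformers, and FAISS are assumptions made for illustration, not a prescribed stack.

Python
# Minimal streaming-to-vector-store sketch for RAG (assumed stack:
# kafka-python for consumption, sentence-transformers for embeddings,
# FAISS as an in-memory vector index).
import json

import faiss
import numpy as np
from kafka import KafkaConsumer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # produces 384-dimensional embeddings
index = faiss.IndexFlatIP(384)                    # inner-product index for similarity search
documents = []                                    # keep raw text alongside vector positions

consumer = KafkaConsumer(
    "docs_stream",                                # hypothetical topic carrying text events
    bootstrap_servers=["your-kafka-broker:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    text = message.value["text"]
    vector = model.encode([text], normalize_embeddings=True)  # shape (1, 384)
    index.add(np.asarray(vector, dtype="float32"))
    documents.append(text)  # index position i maps back to documents[i]
    # An LLM application can now call index.search() over fresh data for retrieval.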
This real-time architecture ensures that large language models (LLMs) have access to up-to-date, contextually relevant information, improving the accuracy and reliability of RAG-based applications such as chatbots or recommendation engines. Data Streaming as Data Fabric for Generative AI Real-time data streaming enables real-time ingestion, processing, and retrieval of vast amounts of data that LLMs require for generating accurate and up-to-date responses. While Kafka helps in streaming, Flink processes these streams in real time, ensuring that data is enriched and contextually relevant before being fed into vector databases. The Road Ahead: Future Proofing Data Pipelines The integration of real-time data streaming, AI automation, and vector databases offers transformative potential for businesses. For AI automation, integrating real-time data streams with frameworks like TensorFlow or PyTorch enables real-time decision making and continuous model updates. For real-time contextual data retrieval, leveraging databases like Faiss or Milvus helps in fast semantic searches, which are crucial for applications like RAG. Conclusion Key takeaways include the critical role of tools like Apache Kafka and Apache Flink for scalable, low-latency data streaming, along with TensorFlow or PyTorch for real-time AI automation, and FAISS or Milvus for fast semantic search in applications like RAG. Ensuring data quality, automating workflows with tools like Apache Airflow, and implementing robust monitoring and fault-tolerance mechanisms will help businesses stay agile in a data-driven world and optimize their decision-making capabilities. Additional resources:

AI Automation Essentials by Tuhin Chattopadhyay, DZone Refcard
Apache Kafka Essentials by Sudip Sengupta, DZone Refcard
Getting Started With Large Language Models by Tuhin Chattopadhyay, DZone Refcard
Getting Started With Vector Databases by Miguel Garcia, DZone Refcard

This is an excerpt from DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. As businesses collect more data than ever before, the ability to manage, integrate, and access this data efficiently has become crucial. Two major approaches dominate this space: extract, transform, and load (ETL) and extract, load, and transform (ELT). Both serve the same core purpose of moving data from various sources into a central repository for analysis, but they do so in different ways. Understanding the distinctions, similarities, and appropriate use cases is key to perfecting your data integration and accessibility practice. Understanding ETL and ELT The core of efficient data management lies in understanding the tools at your disposal. The ETL and ELT processes are two prominent methods that streamline the data journey from its raw state to actionable insights. Although ETL and ELT have their distinctions, they also share common ground in their objectives and functionalities. Data integration lies at the heart of both approaches, requiring teams to unify data from multiple sources for analysis. Automation is another crucial aspect, with modern tools enabling efficient, scheduled workflows and minimizing manual oversight. Data quality management is central to ETL and ELT, ensuring clean, reliable data, though transformations occur at different stages. These commonalities emphasize the importance of scalability and automation for developers, helping them build adaptable data pipelines. Recognizing these shared features allows flexibility in choosing between ETL and ELT, depending on project needs, to ensure robust, efficient data workflows. Key Differences Between and Considerations for Choosing ETL or ELT ETL is traditionally suited for on-premises systems and structured data, while ELT is optimized for cloud-based architectures and complex data. Choosing between ETL and ELT depends on storage, data complexity, and specific business needs, making the decision crucial for developers and engineers.

Table 1. Infrastructure considerations for ETL vs. ELT

Aspect | ETL | ELT
Infrastructure location | On-premise systems | Cloud-based systems
Data storage environment | Traditional data warehouses | Modern cloud data warehouses
Cost model | Substantial upfront investment in hardware and software | Lower upfront cost with the pay-as-you-go model
Scalability | Fixed capacity: scale by adding more servers | Elastic scaling: automatic resource allocation
Data type compatibility | Suited for structured, relational databases with defined schemas | Suited for unstructured or semi-structured data
Data volume | Small- to medium-scale datasets | Large-scale datasets across distributed systems
Processing power | Limited by on-prem hardware | Virtually unlimited from cloud services
Data transformation process | Data transformation before loading | Data loaded first, transformations occur after in the cloud

The order of operations is the fundamental distinction between ETL and ELT processes:

In ETL, the data is extracted from the source, then transformed according to predefined rules and schemas, and finally loaded into the target storage location. This ensures that only structured and validated data enters the warehouse.
In contrast, ELT loads data first and transforms it afterward, relying on data lakes for raw data storage, modern data warehouses that accommodate both raw and transformed data, NoSQL databases for unstructured data analysis, and analytics platforms for real-time insights.
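To make the ordering difference concrete, here is a small, self-contained sketch contrasting the two flows with pandas and SQLite. The CSV source, table names, and the trivial transformation are hypothetical stand-ins chosen for illustration, not examples from the original article.

Python
# Tiny ETL vs. ELT sketch (illustrative; assumes pandas and a local SQLite file
# standing in for a warehouse or data lake, plus a hypothetical transactions.csv).
import sqlite3

import pandas as pd

con = sqlite3.connect("warehouse.db")

# --- ETL: transform in the pipeline, then load only validated rows ---
raw = pd.read_csv("transactions.csv")                 # hypothetical source extract
clean = raw.dropna(subset=["account_id", "amount"])   # validate before loading
clean["amount_usd"] = clean["amount"].round(2)        # standardize in flight
clean.to_sql("transactions_curated", con, if_exists="replace", index=False)

# --- ELT: load raw data as-is, transform later inside the target system ---
raw.to_sql("transactions_raw", con, if_exists="replace", index=False)
con.execute("DROP TABLE IF EXISTS transactions_curated_elt")
con.execute("""
    CREATE TABLE transactions_curated_elt AS
    SELECT account_id, ROUND(amount, 2) AS amount_usd
    FROM transactions_raw
    WHERE account_id IS NOT NULL AND amount IS NOT NULL
""")
con.commit()

The ETL branch applies its checks before anything lands in the target, while the ELT branch defers the same logic to SQL inside the target system, which mirrors the trade-offs discussed next.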
Processing time is determined by the sequence of operations:

With its up-front transformations, ETL might experience longer processing times before data is ready for analysis. Using an ETL process, a company can transform data to standardized formats, validate customer identities, and filter out incomplete transactions. It can take several hours to prepare the data before an analytics team can start their work. If a sudden change in customer behavior occurs (e.g., during a sale), the delay in processing might hinder timely decisions.
By loading data first and transforming it later, ELT can offer faster initial loading times, although the overall processing time might depend on the complexity of transformations. For example, a company can load raw transaction and customer behavior data directly into a cloud-based data lake without upfront transformations. While the initial loading is fast, they need robust error handling to ensure that the subsequent transformations yield accurate and meaningful insights.

When it comes to data storage:

ETL typically relies on staging areas or intermediate data stores to store the transformed data before it's loaded into the final destination. Using an ETL process, an organization can first stage data from various sources in an intermediate data warehouse, and then they can perform transformations.
ELT, on the other hand, often loads raw data directly into a data lake or cloud data stores, capitalizing on their vast storage capabilities. Transformations then happen within this environment. For example, a company loads raw data directly into a cloud-based data lake, which allows researchers to begin analyzing the data immediately.

The data complexity and your flexibility needs also determine which process will work best for your use case:

ETL is well suited for structured data that adheres to predefined schemas, making it ideal for traditional relational databases. Due to its predefined transformation rules, ETL might offer limited flexibility once the pipeline is set up.
ELT shines when dealing with large volumes of unstructured or semi-structured data, which are common in modern data landscapes, and leverages the flexibility of cloud environments. By applying transformations after loading, ELT provides greater flexibility for iterative and exploratory data analysis, allowing for schema changes and evolving business requirements.

Data analysis requirements are important considerations when deciding between ETL and ELT:

ETL is favored in scenarios requiring strict data governance and quality control, such as transactional processing where timely and accurate data is essential.
ELT is more suited to exploratory data analysis and iterative processes as transformations can be applied after the data has been loaded, offering greater flexibility.

The timing of error handling differs in each case:

In ETL, error handling is typically incorporated during the transformation phase, ensuring data quality before loading. For example, the data transformation phase checks for errors like invalid account numbers or missing transaction details. Any records with errors are either corrected or rejected before the clean data is loaded into the final database for analysis.
In ELT, when an organization loads raw transaction data directly into a cloud data lake, error handling and validation occur during the transformation phase after the data is already stored.
Therefore, ELT might require more robust error handling and data validation processes after the data is loaded into the target system. When to Use ETL vs. ELT: Use Cases Developers and engineers must choose between ETL and ELT based on their project needs.

Table 2. Use cases for ETL vs. ELT

Extract, Transform, Load | Extract, Load, Transform
Legacy systems: Existing on-prem infrastructure set up for ETL; structured data, batch processing | Real-time processing: Need real-time or near-real-time processing
Smaller datasets: Low volume, low complexity; batch processing meets needs | Complex data types: Unstructured or semi-structured data; flexible, scalable processing after loading
Data governance: Strict regulatory compliance in industries (e.g., finance, healthcare); data quality is paramount and requires validation before loading | Big data and cloud environments: Cloud-native infrastructure; big data platforms, distributed processing (e.g., Apache Hadoop or Spark)

ETL Example: Financial Reporting System for a Bank In a traditional financial institution, accurate, structured data is critical for regulatory reporting and compliance. Imagine a bank that processes daily transactions from multiple branches:

Extract. Data from various sources — such as transactional databases, loan processing systems, and customer accounts — is pulled into the pipeline. These are often structured databases like SQL.
Transform. The data is cleaned, validated, and transformed. For example, foreign transactions may need currency conversion, while all dates are standardized to the same format (e.g., DD/MM/YYYY). This step also removes duplicates and ensures that only verified, structured data moves forward.
Load. After the transformation, the data is loaded into the bank's centralized data warehouse, a structured, on-premises system designed for financial reporting. This ensures that only clean, validated data is stored and ready for reporting.

Figure 1. ETL process for financial reporting in a bank The bank's focus is on data governance and quality control, making ETL ideal for this scenario where accuracy is non-negotiable. ELT Example: Real-Time Analysis for a Social Media Platform A social media company dealing with massive amounts of unstructured data (e.g., user posts, comments, reactions) would leverage an ELT process, particularly within a cloud-based architecture. The company uses ELT to quickly load raw data into a data lake for flexible, real-time analysis and machine learning tasks.

Extract. The platform extracts raw data from various sources, including weblogs, user activity, and interaction metrics (likes, shares, etc.). This data is often semi-structured (JSON, XML) or unstructured (text, images).
Load. Instead of transforming the data before storage, the platform loads raw data into a cloud-based data lake. This allows the company to store vast amounts of unprocessed data quickly and efficiently.
Transform. Once the data is loaded, transformations are applied for different use cases. For example, data scientists might transform subsets of this data to train machine learning models, or analysts might apply business rules to prepare reports on user engagement. These transformations happen dynamically, often using the cloud's scalable computing resources.

In this ELT scenario, the platform benefits from the flexibility and scalability of the cloud, allowing for real-time analysis of massive datasets without the upfront need to transform everything.
This makes ELT perfect for handling big data, especially when the structure and use of data can evolve. To further illustrate the practical applications of ETL and ELT, consider the following diagram: Figure 2. ELT process for real-time analysis on a social media platform Conclusion Both ETL and ELT play vital roles in data integration and accessibility, but the right approach depends on your infrastructure, data volume, and business requirements. While ETL is better suited for traditional on-premises systems and well-structured data, ELT excels in handling large, complex data in cloud-based systems. Mastering these approaches can unlock the true potential of your data, enabling your business to derive insights faster, smarter, and more effectively. As data ecosystems evolve, ELT will likely dominate in large-scale, cloud-based environments where real-time analysis is key. ETL, however, will remain vital in sectors that prioritize data governance and accuracy, like finance and healthcare. Hybrid solutions may emerge, combining the strengths of both methods. To get started, here are some next steps:

Assess your infrastructure. Determine whether ETL or ELT better suits your data needs.
Try new tools. Explore different platforms to streamline your pipelines.
Stay flexible. Adapt your strategy as your data requirements grow.

By staying agile and informed, you can ensure your data integration practices remain future ready. This is an excerpt from DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Remarkable advances in deep learning, combined with the exponential increase in computing power and the explosion of available data, have catalyzed the emergence of generative artificial intelligence (GenAI). Consequently, huge milestones have propelled this technology to greater potential, such as the introduction of the Transformer architecture in 2017 and the launch of GPT-2 in 2019. The arrival of GPT-3 in 2020 then demonstrated astounding capabilities in text generation, translation, and question answering, marking a decisive turning point in the field of AI. In 2024, organizations are devoting more resources to their AI strategy, seeking not only to optimize their decision-making processes, but also to generate new products and services while saving precious time to create more value. In this article, we plan to assess strategic practices for building a foundation of data intelligence systems. The emphasis will center around transparency, governance, and the ethical and responsible exploitation of cutting-edge technologies, particularly GenAI. An Introduction to Identifying and Extracting Data for AI Systems Identifying and extracting data are fundamental steps for training AI systems. As data is the primary resource for these systems, it is a priority to identify the best sources and use effective extraction methods and tools. Here are some common sources:

Legacy systems contain valuable historical data that can be difficult to extract. These systems are often critical to day-to-day operations. They require specific approaches to extract data without disrupting their functioning.
Data warehouses (DWHs) facilitate the search and analysis of structured data. They are designed to store large quantities of historical data and are optimized for complex queries and in-depth analysis.
Data lakes store raw structured and unstructured data. Their flexibility means they can store a wide variety of data, providing fertile ground for exploration and the discovery of new insights.
Data lakehouses cleverly combine the structure of DWHs with the flexibility of data lakes. They offer a hybrid approach that allows them to benefit from the advantages of both worlds, providing performance and flexibility.

Other important sources include NoSQL databases, IoT devices, social media, and APIs, which broaden the spectrum of resources available to AI systems. Importance of Data Quality Data quality is indispensable for training accurate AI models. Poor data quality can distort the learning process and lead to biased or unreliable results. Data validation is, therefore, a crucial step, ensuring that input data meets quality standards such as completeness, consistency, and accuracy. Similarly, data versioning enables engineers to understand the impact of data changes on the performance of AI models. This practice facilitates the reproducibility of experiments and helps to identify sources of improvement or degradation in model performance. Finally, data tracking ensures visibility of the flow of data through the various processing stages. This traceability lets us understand where data comes from, how it is transformed, and how it is used, thereby contributing to transparency and regulatory compliance. Advanced Data Transformation Techniques Advanced data transformation techniques prepare raw data for AI models.
These techniques include:

Feature scaling and normalization. These methods ensure that all input variables have a similar amplitude. They are crucial for many machine learning algorithms that are sensitive to the scale of the data.
Handling missing data. Using imputation techniques to estimate missing values, this step is fundamental to maintaining the integrity and representativeness of datasets.
Detection and processing of outliers. This technique is used to identify and manage data that deviate significantly from the other observations, thus preventing these outliers from biasing the models.
Dimensionality reduction. This method helps reduce the number of features used by the AI model, which can improve performance and reduce overfitting.
Data augmentation. This technique artificially increases the size of the dataset by creating modified versions of existing data, which is particularly useful when training data is limited.

These techniques are proving important because of their ability to enhance data quality, manage missing values effectively, and improve predictive accuracy in AI models. Imputation methods, such as those found in libraries like Fancyimpute and MissForest, can fill in missing data with statistically derived values. This is particularly useful in areas where outcomes are often predicted on the basis of historical and incomplete data. Key Considerations for Building AI-Driven Data Environments Data management practices are evolving under the influence of AI and the increasing integration of open-source technologies within companies. GenAI is now playing a central role in the way companies are reconsidering their data and applications, profoundly transforming traditional approaches. Let's take a look at the most critical considerations for building AI-driven data systems. Leveraging Open-Source Databases for AI-Driven Data Engineering The use of open-source databases for AI-driven data engineering has become a common practice in modern data ecosystems. In particular, vector databases are increasingly used in large language model (LLM) optimization. The synergy between vector databases and LLMs makes it possible to create powerful and efficient AI systems. In Table 1, we explore common open-source databases for AI-driven data engineering so that you can better leverage your own data when building intelligent systems:

Table 1. Open-source databases for AI-driven data engineering

Category | Capability | Technology
Relational and NoSQL | Robust functionality for transactional workloads | PostgreSQL and MySQL
Relational and NoSQL | Large-scale unstructured data management | MongoDB, Cassandra
Relational and NoSQL | Real-time performance and caching | Redis
Relational and NoSQL | Support for big data projects on Hadoop; large-scale storage and analysis capabilities | Apache HBase, Apache Hive
Vector databases and LLMs | Rapid search and processing of vectors | Milvus, Pinecone
Vector databases and LLMs | Support for search optimization | Faiss, Annoy, Vespa
Emerging technologies | Homomorphic databases | SEAL, TFHE
Emerging technologies | Differential privacy solutions | OpenDP, differential privacy
Emerging technologies | Sensitive data protection via isolated execution environments | Intel SGX, ARM TrustZone

Emerging Technologies New database technologies, such as distributed, unified, and multi-model databases, offer developers greater flexibility in managing complex datasets. Data-intensive AI applications need these capabilities to bring greater flexibility in data management. Additionally, privacy-oriented databases enable computations on encrypted data. This enhances security and compliance with regulations such as GDPR.
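As a toy illustration of the differential privacy idea referenced in Table 1, the snippet below applies the classic Laplace mechanism to a count query with NumPy; the epsilon value and the query are arbitrary examples, and production systems would rely on libraries such as OpenDP rather than hand-rolled noise.

Python
# Minimal Laplace-mechanism sketch (illustrative only; assumes NumPy).
import numpy as np

def private_count(values, epsilon=1.0):
    """Return a differentially private count of records.

    A count query has sensitivity 1 (adding or removing one record changes
    the result by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical usage: noisy count of users in a sensitive cohort.
cohort = ["user_%d" % i for i in range(1200)]
print(private_count(cohort, epsilon=0.5))  # roughly 1200, plus or minus a few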
These advances enable developers to build more scalable and secure AI solutions. Industries handling sensitive data need these capabilities to ensure flexibility, security, and regulatory compliance. As shown in Table 1, homomorphic encryption and differential privacy solutions will prove impactful for advanced applications, particularly in industries that deal with sensitive data. For example, homomorphic encryption lets developers perform computations on encrypted data without ever decrypting it. Ethical Considerations Ethical considerations related to training models on large datasets raise important questions about bias, fairness, and transparency of algorithms and the applications that use them. Therefore, in order to create AI systems that are more transparent, explainable AI is becoming a major requirement for businesses because the complexity of LLMs often makes it difficult, sometimes even impossible, to understand the decisions or recommendations produced by these systems. For developers, the consequence is that they not only have to work on performance, but also ensure that their models can be interpreted and validated by non-technical stakeholders, which requires extra time and effort when designing models. For example, developers need to build in transparency mechanisms, such as attention maps or interpretable results, so that decisions can be traced back to the specific data. Building a Scalable AI Infrastructure Building a scalable AI infrastructure is based on three main components:

Storage. Flexible solutions, such as data lakes or data lakehouses, enable massive volumes of data to be managed efficiently. These solutions offer the scalability needed to adapt to the exponential growth in data generated and consumed by AI systems.
Computing. GPU or TPU clusters provide the processing power required by deep neural networks and LLMs. These specialized computing units speed up the training and inference of AI models.
Orchestration. Orchestration tools (e.g., Apache Airflow, Dagster, Kubernetes, Luigi, Prefect) optimize the management of large-scale AI tasks. They automate workflows, manage dependencies between tasks, and optimize resource use.

Figure 1. Scalable AI architecture layers Hybrid Cloud Solutions Hybrid cloud solutions offer flexibility, resilience, and redundancy by combining public cloud resources with on-premises infrastructure. They enable the public cloud to be used for one-off requirements such as massive data processing or complex model training. At the same time, they retain the ability to keep sensitive data on local servers. This approach offers a good balance between performance, security, and costs because hybrid cloud solutions enable organizations to make the most of both environments. Ensuring Future-Proof AI Systems To ensure the future proofing of AI systems, it is essential to:

Design flexible and modular systems. This makes it easy to adapt systems to new technologies and changing business needs.
Adopt data-centric approaches. Organizations must ensure that their AI systems remain relevant and effective. To achieve that, they have to place data at the heart of strategy.
Integrate AI into a long-term vision. AI should not be seen as an isolated project since technology for technology's sake is of little interest. Instead, it should be seen as an integral component of a company's digital strategy.
Focus on process automation. Automation optimizes operational efficiency and frees up resources for innovation.
Consider data governance.
Solid governance is essential to guarantee the quality, security, and compliance of the data used by AI systems.
Prioritize ethics and transparency. These aspects are crucial for maintaining user confidence and complying with emerging regulations.

Collaboration Between Data Teams and AI/ML Engineers Collaboration between data engineers, AI/ML engineers, and data scientists is critical to the success of AI projects. Data engineers manage the infrastructure and pipelines that allow data scientists and AI/ML engineers to focus on developing and refining models, while AI/ML engineers operationalize these models to deliver business value. To promote effective collaboration, organizations need to implement several key strategies:

Clearly define the roles and responsibilities of each team; everyone must understand their part in the project.
Use shared tools and platforms to facilitate seamless interaction and data sharing among team members.
Encourage regular communication and knowledge sharing through frequent meetings and the use of shared documentation platforms.

These practices help create a cohesive work environment where information flows freely, leading to more efficient and successful AI projects. For example, in a recommendation engine used by an e-commerce platform, data engineers collect and process large volumes of customer data. This includes historical browsing data and purchasing behavior. AI/ML engineers then develop algorithms that predict product preferences, and developers integrate the algorithms into the website or application. When an update to the recommendation model is ready, MLOps pipelines then automate testing and deployment. Conclusion Beyond tool implementation, strategic considerations must be accounted for in the same way as purely technical ones:

Projects based on AI technologies must be built on a foundation of high-quality, well-managed data. The quality of AI systems depends in particular on the diversity and richness of their data sources, whether these are existing systems or data lakes.
Ensuring AI models are interpretable and ethically compliant is essential to nurture trust and compliance with regulatory frameworks.
The success of all AI initiatives is also directly dependent on the level of collaboration between data engineers, AI/ML specialists, and DevOps teams.
AI applications, generative models, and hardware infrastructures are evolving rapidly to meet market demands, which requires companies to adopt scalable infrastructures that can support these advancements.

As organizations move forward, they need to focus on data engineering automation, cross-functional collaboration, and alignment with ethical and regulatory standards in order to maximize the value of their AI investments. This is an excerpt from DZone's 2024 Trend Report, Data Engineering: Enriching Data Pipelines, Expanding AI, and Expediting Analytics. Read the Free Report