What Is Data Governance?

Data governance is a framework developed through the collaboration of individuals with various roles and responsibilities. This framework aims to establish the processes, policies, procedures, standards, and metrics that help organizations achieve their goals. These goals include providing reliable data for business operations, establishing accountability and authoritativeness, developing accurate analytics to assess performance, complying with regulatory requirements, safeguarding data, ensuring data privacy, and supporting the data management life cycle.

Creating a Data Governance Board or Steering Committee is a good first step when introducing a data governance program and framework. An organization's governance framework should be circulated to all staff and management so that everyone understands the changes taking place.

The basic concepts needed to successfully govern data and analytics applications are:

- A focus on business value and the organization's goals
- Agreement on who is responsible for data and who makes decisions
- A model emphasizing data curation and data lineage for data governance
- Decision-making that is transparent and includes ethical principles
- Core governance components, including data security and risk management
- Ongoing training, with monitoring and feedback on its effectiveness
- A collaborative workplace culture, using data governance to encourage broad participation

What Is Data Integration?

Data integration is the process of combining and harmonizing data from multiple sources into a unified, coherent format that various users can consume, for example for operational, analytical, and decision-making purposes. The data integration process consists of four primary components.

1. Source Systems

Source systems, such as databases, file systems, Internet of Things (IoT) devices, media repositories, and cloud data storage, provide the raw information that must be integrated. The heterogeneity of these source systems results in data that can be structured, semi-structured, or unstructured.

- Databases: Centralized or distributed repositories designed to store, organize, and manage structured data. Examples include relational database management systems (RDBMS) like MySQL, PostgreSQL, and Oracle. Data is typically stored in tables with predefined schemas, ensuring consistency and ease of querying.
- File systems: Hierarchical structures that organize and store files and directories on disk drives or other storage media. Common file systems include NTFS (Windows), APFS (macOS), and EXT4 (Linux). Data can be of any type, including structured, semi-structured, or unstructured.
- Internet of Things (IoT) devices: Physical devices (sensors, actuators, etc.) that are embedded with electronics, software, and network connectivity. IoT devices collect, process, and transmit data, enabling real-time monitoring and control. Data generated by IoT devices can be structured (e.g., sensor readings), semi-structured (e.g., device configuration), or unstructured (e.g., video footage).
- Media repositories: Platforms or systems designed to manage and store various types of media files. Examples include content management systems (CMS) and digital asset management (DAM) systems. Data in media repositories can include images, videos, audio files, and documents.
- Cloud data storage: Services that provide on-demand storage and management of data online. Popular cloud data storage platforms include Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Data in cloud storage can be accessed and processed from anywhere with an internet connection.
2. Data Acquisition

Data acquisition involves extracting and collecting information from source systems. Different methods can be employed depending on the nature of the source system and specific requirements, including batch processes, ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), APIs (Application Programming Interfaces), streaming, virtualization, data replication, and data sharing.

- Batch processes: Batch processes are commonly used for structured data. In this method, data is accumulated over a period of time and processed in bulk. This approach is advantageous for large datasets and ensures data consistency and integrity (a minimal sketch of this pattern appears after this list).
- Application Programming Interface (API): APIs serve as a communication channel between applications and data sources. They allow for controlled and secure access to data. APIs are commonly used to integrate with third-party systems and enable data exchange.
- Streaming: Streaming involves continuous data ingestion and processing. It is commonly used for real-time data sources such as sensor networks, social media feeds, and financial markets. Streaming technologies enable immediate analysis and decision-making based on the latest data.
- Virtualization: Data virtualization provides a logical view of data without physically moving or copying it. It enables seamless access to data from multiple sources, irrespective of their location or format. Virtualization is often used for data integration and reducing data silos.
- Data replication: Data replication involves copying data from one system to another. It enhances data availability and redundancy. Replication can be synchronous, where data is copied in real time, or asynchronous, where data is copied at regular intervals.
- Data sharing: Data sharing involves granting authorized users or systems access to data. It facilitates collaboration, enables insights from multiple perspectives, and supports informed decision-making. Data sharing can be implemented through various mechanisms such as data portals, data lakes, and federated databases.
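To make the batch acquisition pattern concrete, here is a minimal, hedged Python sketch that pulls records from a hypothetical REST endpoint page by page and lands them as newline-delimited JSON for later processing. The URL, parameters, and file path are illustrative assumptions, not part of any specific product.

```python
import json
import requests

# Hypothetical REST endpoint exposing paginated order records (illustrative only).
SOURCE_URL = "https://example.com/api/orders"
OUTPUT_FILE = "orders_batch.jsonl"

def acquire_batch(page_size: int = 500) -> int:
    """Pull all available pages from the source API and append them to a local NDJSON file."""
    page, total = 1, 0
    with open(OUTPUT_FILE, "a", encoding="utf-8") as sink:
        while True:
            response = requests.get(
                SOURCE_URL, params={"page": page, "page_size": page_size}, timeout=30
            )
            response.raise_for_status()
            records = response.json()  # assume the API returns a JSON list per page
            if not records:
                break                  # no more pages: the batch is complete
            for record in records:
                sink.write(json.dumps(record) + "\n")
            total += len(records)
            page += 1
    return total

if __name__ == "__main__":
    print(f"Acquired {acquire_batch()} records in this batch run")
```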
3. Data Storage

Upon data acquisition, storing data in a repository is crucial for efficient access and management. Various data storage options are available, each tailored to specific needs:

- Database Management Systems (DBMS): Relational Database Management Systems (RDBMS) are software systems designed to organize, store, and retrieve data in a structured format. These systems offer advanced features such as data security, data integrity, and transaction management. Examples of popular RDBMS include MySQL, Oracle, and PostgreSQL. NoSQL databases, such as MongoDB and Cassandra, are designed to store and manage semi-structured data. They offer flexibility and scalability, making them suitable for handling large amounts of data that may not fit well into a relational model.
- Cloud storage services: Cloud storage services offer scalable and cost-effective storage solutions in the cloud. They provide on-demand access to data from anywhere with an internet connection. Popular cloud storage services include Amazon S3, Microsoft Azure Storage, and Google Cloud Storage.
- Data lakes: Data lakes are large repositories of raw and unstructured data in their native format. They are often used for big data analytics and machine learning. Data lakes can be implemented using the Hadoop Distributed File System (HDFS) or cloud-based storage services.
- Delta lakes: Delta lakes are a type of data lake that supports ACID transactions and schema evolution. They provide a reliable and scalable data storage solution for data engineering and analytics workloads.
- Cloud data warehouses: Cloud data warehouses are cloud-based data storage solutions designed for business intelligence and analytics. They provide fast query performance and scalability for large volumes of structured data. Examples include Amazon Redshift, Google BigQuery, and Snowflake.
- Big data files: Large collections of data stored in a single file, often used for data analysis and processing tasks. Common big data file formats include Parquet, Apache Avro, and Apache ORC (see the short example after this list).
- On-premises Storage Area Networks (SAN): SANs are dedicated high-speed networks designed for data storage. They offer fast data transfer speeds and provide centralized storage for multiple servers. SANs are typically used in enterprise environments with large storage requirements.
- Network Attached Storage (NAS): NAS devices are file-level storage systems that connect to a network and provide shared storage space for multiple clients. They are often used in small and medium-sized businesses and offer easy access to data from various devices.

Choosing the right data storage option depends on factors such as data size, data type, performance requirements, security needs, and cost considerations. Organizations may use a combination of these storage options to meet their specific data management needs.
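As a small illustration of the "big data files" option above, the following PySpark sketch writes a DataFrame to partitioned Parquet files. The column names and output path are assumptions for the example; the same write call would work against a cloud object store URI.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetStorageExample").getOrCreate()

# A tiny illustrative dataset; in practice this would come from the acquisition step.
orders = spark.createDataFrame(
    [("o-1001", "2024-05-01", 25.40),
     ("o-1002", "2024-05-01", 99.99),
     ("o-1003", "2024-05-02", 12.00)],
    ["order_id", "order_date", "amount"],
)

# Write columnar Parquet files partitioned by date, a common layout for data lakes.
# The local path is a stand-in; an s3a:// or abfss:// URI would target cloud storage.
orders.write.mode("overwrite").partitionBy("order_date").parquet("/tmp/datalake/orders")

spark.stop()
```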
4. Consumption

This is the final stage of the data integration life cycle, where the integrated data is consumed by various applications, data analysts, business analysts, data scientists, AI/ML models, and business processes. The data can be consumed in various forms and through various channels, including:

- Operational systems: The integrated data can be consumed by operational systems using APIs (Application Programming Interfaces) to support day-to-day operations and decision-making. For example, a customer relationship management (CRM) system may consume data about customer interactions, purchases, and preferences to provide personalized experiences and targeted marketing campaigns.
- Analytics: The integrated data can be consumed by analytics applications and tools for data exploration, analysis, and reporting. Data analysts and business analysts use these tools to identify trends, patterns, and insights from the data, which can help inform business decisions and strategies.
- Data sharing: The integrated data can be shared with external stakeholders, such as partners, suppliers, and regulators, through data-sharing platforms and mechanisms. Data sharing enables organizations to collaborate and exchange information, which can lead to improved decision-making and innovation.
- Kafka: Kafka is a distributed streaming platform that can be used to consume and process real-time data. Integrated data can be streamed into Kafka, where it can be consumed by applications and services that require real-time data processing capabilities.
- AI/ML: The integrated data can be consumed by AI (Artificial Intelligence) and ML (Machine Learning) models for training and inference. AI/ML models use the data to learn patterns and make predictions, which can be used for tasks such as image recognition, natural language processing, and fraud detection.

The consumption of integrated data empowers businesses to make informed decisions, optimize operations, improve customer experiences, and drive innovation. By providing a unified and consistent view of data, organizations can unlock the full potential of their data assets and gain a competitive advantage.

What Are Data Integration Architecture Patterns?

In this section, we will delve into an array of integration patterns, each tailored to provide seamless integration solutions. These patterns act as structured frameworks, facilitating connections and data exchange between diverse systems. Broadly, they fall into three categories:

- Real-Time Data Integration
- Near Real-Time Data Integration
- Batch Data Integration

1. Real-Time Data Integration

In various industries, real-time data ingestion serves as a pivotal element. Some practical, real-life examples of its applications:

- Social media feeds display the latest posts, trends, and activities.
- Smart homes use real-time data to automate tasks.
- Banks use real-time data to monitor transactions and investments.
- Transportation companies use real-time data to optimize delivery routes.
- Online retailers use real-time data to personalize shopping experiences.

Understanding real-time data ingestion mechanisms and architectures is vital for choosing the best approach for your organization. There is a wide range of real-time data integration architectures to choose from. Among them, the most commonly used are:

- Streaming-Based Architecture
- Event-Driven Integration Architecture
- Lambda Architecture
- Kappa Architecture

Each of these architectures offers unique advantages and use cases, catering to specific requirements and operational needs.

a. Streaming-Based Data Integration Architecture

In a streaming-based architecture, data streams are continuously ingested as they arrive. Tools like Apache Kafka are employed for real-time data collection, processing, and distribution. This architecture is ideal for handling high-velocity, high-volume data while ensuring data quality and low-latency insights. It involves continuous data ingestion, handles large volumes of data, and prioritizes data quality and low-latency insights. The diagram below illustrates the various components involved in a streaming data integration architecture.
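As a complement, here is a hedged Python sketch of the consuming side of such a streaming pipeline, using the kafka-python client to read JSON events from a Kafka topic as they arrive. The broker address and the "transactions" topic name are placeholders, not assumptions about any particular deployment.

```python
import json
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical broker and topic names; replace with your own cluster settings.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    group_id="streaming-integration-demo",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Continuously ingest events as they arrive and hand them to downstream processing.
for message in consumer:
    event = message.value
    # In a real pipeline this is where enrichment, validation, or routing would happen.
    print(f"partition={message.partition} offset={message.offset} event={event}")
```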
b. Event-Driven Integration Architecture

An event-driven architecture is a highly scalable and efficient approach for modern applications and microservices. This architecture responds to specific events or triggers within a system by ingesting data as the events occur, enabling the system to react quickly to changes. This allows for efficient handling of large volumes of data from various sources.

c. Lambda Integration Architecture

The Lambda architecture embraces a hybrid approach, blending the strengths of batch and real-time data ingestion. It comprises two parallel data pipelines, each with a distinct purpose: the batch layer handles the processing of historical data, while the speed layer addresses real-time data. This design ensures low-latency insights while upholding data accuracy and consistency, even in extensive distributed systems.

d. Kappa Data Integration Architecture

Kappa architecture is a simplified variation of the Lambda architecture designed specifically for real-time data processing. It employs a single stream processing engine, such as Apache Flink or Apache Kafka Streams, to manage both historical and real-time data, streamlining the data ingestion pipeline. This approach minimizes complexity and maintenance expenses while delivering rapid and precise insights.

2. Near Real-Time Data Integration

In near real-time data integration, the data is processed and made available shortly after it is generated, which is critical for applications requiring timely data updates. Several patterns are used for near real-time data integration; a few of them are highlighted below.

a. Change Data Capture (CDC)

Change Data Capture (CDC) is a method of capturing changes that occur in a source system's data and propagating those changes to a target system.

b. Data Replication

With the data replication integration architecture, two databases can seamlessly and efficiently replicate data based on specific requirements. This architecture ensures that the target database stays in sync with the source database, providing both systems with up-to-date and consistent data. As a result, the replication process is smooth, allowing for effective data transfer and synchronization between the two databases.

c. Data Virtualization

In data virtualization, a virtual layer integrates disparate data sources into a unified view. It eliminates data replication, dynamically routes queries to source systems based on factors like data locality and performance, and provides a unified metadata layer. The virtual layer simplifies data management, improves query performance, and facilitates data governance and advanced integration scenarios. It empowers organizations to leverage their data assets effectively and unlock their full potential.

3. Batch Data Integration

Batch data integration involves consolidating and conveying a collection of messages or records in a batch to minimize network traffic and overhead. Batch processing gathers data over a period of time and then processes it in batches. This approach is particularly beneficial when handling large data volumes or when the processing demands substantial resources. Additionally, this pattern enables the replication of master data to replica storage for analytical purposes. The advantage of this process is the transmission of refined results. The traditional batch data integration patterns are described below.

Traditional ETL Architecture

This design follows the conventional Extract, Transform, and Load (ETL) process and has three components:

- Extract: Data is obtained from a variety of source systems.
- Transform: Data undergoes a transformation process to convert it into the desired format.
- Load: Transformed data is then loaded into the designated target system, such as a data warehouse.

Incremental Batch Processing

This architecture optimizes processing by focusing only on new or modified data since the previous batch cycle. This approach is more efficient than full batch processing and alleviates the burden on the system's resources. A minimal sketch of a watermark-based incremental extract follows.
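The following Python sketch illustrates the incremental idea under simple assumptions: a source table with an updated_at column, a SQLite database standing in for both source and target, and a persisted watermark that records how far the last batch got. Table and column names are illustrative.

```python
import sqlite3
from pathlib import Path

DB_PATH = "integration_demo.db"             # stand-in for a real source/target database
WATERMARK_FILE = Path("orders.watermark")   # stores the high-water mark of the last run

def read_watermark() -> str:
    return WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1970-01-01T00:00:00"

def incremental_batch() -> int:
    """Copy only rows modified since the last batch cycle from source_orders to target_orders."""
    last_seen = read_watermark()
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT order_id, amount, updated_at FROM source_orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_seen,),
        ).fetchall()
        if not rows:
            return 0
        conn.executemany(
            "INSERT OR REPLACE INTO target_orders (order_id, amount, updated_at) VALUES (?, ?, ?)",
            rows,
        )
    # Advance the watermark so the next cycle skips everything processed here.
    WATERMARK_FILE.write_text(rows[-1][2])
    return len(rows)

if __name__ == "__main__":
    print(f"Processed {incremental_batch()} changed rows in this cycle")
```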
Micro Batch Processing

In micro batch processing, small batches of data are processed at regular, frequent intervals. It strikes a balance between traditional batch processing and real-time processing, and it significantly reduces latency compared to conventional batch processing techniques.

Partitioned Batch Processing

In the partitioned batch processing approach, voluminous datasets are divided into smaller, manageable partitions. These partitions can then be processed independently, frequently leveraging the power of parallelism. This methodology significantly reduces processing time, making it an attractive choice for handling large-scale data. A small sketch of partition-level parallelism follows.
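Here is a hedged Python sketch of the partitioned approach: a dataset is split by a partition key (daily date partitions in this assumed example) and each partition is processed in parallel with a process pool. The process_partition logic is a placeholder for real transformation work.

```python
from concurrent.futures import ProcessPoolExecutor

# Assumed partition keys, e.g., one partition per day of data.
PARTITIONS = ["2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04"]

def process_partition(partition_key: str) -> tuple:
    """Placeholder for real work: read the partition, transform it, and load it to the target."""
    # e.g., read /data/orders/date=<partition_key>/*.parquet, clean, aggregate, write out
    rows_processed = 1000  # stand-in result
    return partition_key, rows_processed

if __name__ == "__main__":
    # Each partition is independent, so they can run on separate worker processes.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for key, count in pool.map(process_partition, PARTITIONS):
            print(f"partition {key}: processed {count} rows")
```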
Conclusion

Here are the main points to take away from this article:

- It's important to have a strong data governance framework in place when integrating data from different source systems.
- Data integration patterns should be selected based on the use case, considering factors such as volume, velocity, and veracity.
- There are three styles of data integration (real-time, near real-time, and batch), and the appropriate model should be chosen based on these parameters.

Credit card fraud is a significant concern for financial institutions, as it can lead to considerable monetary losses and damage customer trust. Real-time fraud detection systems are essential for identifying and preventing fraudulent transactions as they occur. Apache Flink is an open-source stream processing framework that excels at handling real-time data analytics. In this deep dive, we'll explore how to implement a real-time credit card fraud detection system using Apache Flink on AWS.

Apache Flink Overview

Apache Flink is a distributed stream processing engine designed for high-throughput, low-latency processing of real-time data streams. It provides robust stateful computations, exactly-once semantics, and a flexible windowing mechanism, making it an excellent choice for real-time analytics applications such as fraud detection.

System Architecture

Our fraud detection system will consist of the following components:

- Kinesis Data Streams: For ingesting real-time transaction data.
- Apache Flink on Amazon Kinesis Data Analytics: For processing the data streams.
- Amazon S3: For storing reference data and checkpoints.
- AWS Lambda: For handling alerts and notifications.
- Amazon DynamoDB: For storing transaction history and fraud detection results.

Setting up the Environment

Before we begin, ensure that you have an AWS account and the AWS CLI installed and configured.

Step 1: Set up Kinesis Data Streams

Create a Kinesis data stream to ingest transaction data:

```shell
aws kinesis create-stream --stream-name CreditCardTransactions --shard-count 1
```

Step 2: Set up S3 Bucket

Create an S3 bucket to store reference data and Flink checkpoints:

```shell
aws s3 mb s3://flink-fraud-detection-bucket
```

Upload your reference datasets (e.g., historical transaction data, customer profiles) to the S3 bucket.

Step 3: Set up DynamoDB

Create a DynamoDB table to store transaction history and fraud detection results:

```shell
aws dynamodb create-table --table-name FraudDetectionResults \
  --attribute-definitions AttributeName=TransactionId,AttributeType=S \
  --key-schema AttributeName=TransactionId,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10
```

Step 4: Set up Lambda Function

Create a Lambda function to handle fraud alerts. Use the AWS Management Console or the AWS CLI to create a function with the necessary permissions to write to the DynamoDB table and send notifications.

Implementing the Flink Application

Dependencies

Add the following dependencies to your Maven pom.xml file:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.12.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kinesis_2.11</artifactId>
    <version>1.12.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-dynamodb_2.11</artifactId>
    <version>1.12.0</version>
  </dependency>
  <!-- Add other necessary dependencies -->
</dependencies>
```

Flink Application Code

Create a Flink streaming application that reads from the Kinesis data stream, processes the transactions, and writes the results to DynamoDB.
```java
import java.util.Properties;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.flink.streaming.util.serialization.JSONDeserializationSchema;
import org.apache.flink.util.Collector;

// Note: in a real project, each public class below lives in its own source file.

// Define your transaction class
public class Transaction {
    public String transactionId;
    public String creditCardId;
    public double amount;
    public long timestamp;
    // Add other relevant fields and methods
}

public class Alert {
    public String alertId;
    public String transactionId;
    // Add other relevant fields and methods
}

// Keyed, stateful fraud detector: keeps a per-card flag in Flink managed state.
public class FraudDetector extends RichFlatMapFunction<Transaction, Alert> {

    private transient ValueState<Boolean> flagState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> descriptor =
            new ValueStateDescriptor<>("flag", Boolean.class);
        flagState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Transaction transaction, Collector<Alert> out) throws Exception {
        // Implement your fraud detection logic here
        // Set flagState value based on detection
        // Output an alert if fraud is detected
    }
}

public class FraudDetectionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Configure the Kinesis consumer
        Properties inputProperties = new Properties();
        inputProperties.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
        inputProperties.setProperty(AWSConfigConstants.AWS_ACCESS_KEY_ID, "your_access_key_id");
        inputProperties.setProperty(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "your_secret_access_key");
        inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        DataStream<Transaction> transactionStream = env.addSource(
            new FlinkKinesisConsumer<>(
                "CreditCardTransactions",
                new JSONDeserializationSchema<>(Transaction.class),
                inputProperties
            )
        );

        // Process the stream: key by card so state is scoped per credit card
        DataStream<Alert> alerts = transactionStream
            .keyBy(transaction -> transaction.creditCardId)
            .flatMap(new FraudDetector());

        // Configure the Kinesis producer
        Properties outputProperties = new Properties();
        outputProperties.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
        outputProperties.setProperty(AWSConfigConstants.AWS_ACCESS_KEY_ID, "your_access_key_id");
        outputProperties.setProperty(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "your_secret_access_key");

        FlinkKinesisProducer<String> kinesisProducer = new FlinkKinesisProducer<>(
            new SimpleStringSchema(),
            outputProperties
        );
        kinesisProducer.setDefaultStream("FraudAlerts");
        kinesisProducer.setDefaultPartition("0");

        // Serialize alerts to simple strings before writing them to the alert stream
        alerts
            .map(alert -> alert.alertId + "," + alert.transactionId)
            .addSink(kinesisProducer);

        // Execute the job
        env.execute("Fraud Detection Job");
    }
}
```

Deploying the Flink Application

To deploy the Flink application on Amazon Kinesis Data Analytics, follow these steps:

1. Package your application into a JAR file.
2. Upload the JAR file to an S3 bucket.
3. Create a Kinesis Data Analytics application in the AWS Management Console.
4. Configure the application to use the uploaded JAR file.
5. Start the application.
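To smoke-test the pipeline end to end, you may want to push a few synthetic transactions into the CreditCardTransactions stream. The sketch below uses boto3 for this; the field names mirror the Transaction class above, and the card IDs and amounts are made up for illustration.

```python
import json
import random
import time
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_test_transactions(count: int = 10) -> None:
    """Publish synthetic transaction records to the CreditCardTransactions stream."""
    for _ in range(count):
        transaction = {
            "transactionId": str(uuid.uuid4()),
            "creditCardId": random.choice(["card-001", "card-002", "card-003"]),
            "amount": round(random.uniform(1.0, 500.0), 2),
            "timestamp": int(time.time() * 1000),
        }
        kinesis.put_record(
            StreamName="CreditCardTransactions",
            Data=json.dumps(transaction),
            PartitionKey=transaction["creditCardId"],  # keeps a card's events in order
        )

if __name__ == "__main__":
    send_test_transactions()
```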
Monitoring and Scaling

Once your Flink application is running, you can monitor its performance through the Kinesis Data Analytics console. If you need to scale up the processing capabilities, you can increase the number of Kinesis shards or adjust the parallelism settings in your Flink job.

Conclusion

In this deep dive, we've explored how to implement a real-time credit card fraud detection system using Apache Flink on AWS. By leveraging the power of Flink's stream processing capabilities and AWS's scalable infrastructure, we can detect and respond to fraudulent transactions as they occur, providing a robust solution to combat credit card fraud. Remember to test thoroughly and handle edge cases, such as network failures and unexpected data formats, to ensure your system is resilient and reliable.
In today's data-driven environment, mastering the profiling of large datasets with Apache Spark and Deequ is crucial for any professional dealing with data analysis, SEO optimization, or similar fields requiring a deep dive into digital content. Apache Spark offers the computational power necessary for handling vast amounts of data, while Deequ provides a layer for quality assurance, setting benchmarks for what could be termed "unit tests for data." This combination ensures that business users gain confidence in their data's integrity for analysis and reporting purposes.

Have you ever encountered challenges in maintaining the quality of large datasets or found it difficult to ensure the reliability of data attributes used in your analyses? If so, integrating Deequ with Spark could be the solution you're looking for. This article is designed to guide you through the process, from installation to practical application, with a focus on enhancing your workflow and outcomes. By exploring the functionalities and benefits of Deequ and Spark, you will learn how to apply these tools effectively in your data projects, ensuring that your datasets not only meet but exceed quality standards. Let's delve into how these technologies can transform your approach to data profiling and quality control.

Introduction to Data Profiling With Apache Spark and Deequ

Understanding your datasets deeply is crucial in data analytics, and this is where Apache Spark and Deequ shine. Apache Spark is renowned for its fast processing of large datasets, which makes it indispensable for data analytics. Its architecture is adept at handling vast amounts of data efficiently, which is critical for data profiling. Deequ complements Spark by focusing on data quality. This synergy provides a robust solution for data profiling, allowing for the identification and correction of issues like missing values or inconsistencies, which are vital for accurate analysis.

What exactly makes Deequ an invaluable asset for ensuring data quality? At its core, Deequ is built to implement "unit tests for data," a concept that might sound familiar if you have a background in software development. These tests are not for code, however; they're for your data. They allow you to set specific quality benchmarks that your datasets must meet before being deemed reliable for analysis or reporting.

Imagine you're handling customer data. With Deequ, you can easily set up checks to ensure that every customer record is complete, that email addresses follow a valid format, or that no duplicate entries exist. This level of scrutiny is what sets Deequ apart: it transforms data quality from a concept into a measurable, achievable goal.

The integration of Deequ with Apache Spark leverages Spark's scalable data processing framework to apply these quality checks across vast datasets efficiently. This combination does not merely flag issues; it provides actionable insights that guide the correction process. For instance, if Deequ detects a high number of incomplete records in a dataset, you can then investigate the cause, be it a flaw in data collection or an error in data entry, and rectify it, thus enhancing the overall quality of your data. Below is a high-level diagram (source: AWS) that illustrates the Deequ library's usage within the Apache Spark ecosystem.

Setting up Apache Spark and Deequ for Data Profiling

To begin data profiling with Apache Spark and Deequ, setting up your environment is essential.
Ensure Java and Scala are installed, as they are prerequisites for running Spark, which you can verify through Spark's official documentation. For Deequ, which works atop Spark, add the library to your build manager. If you're using Maven, it's as simple as adding the Deequ dependency to your pom.xml file. For SBT, include it in your build.sbt file, and make sure it matches your Spark version.

Python users, you're not left out. PyDeequ is your go-to for integrating Deequ's capabilities into your Python environment. Install it with pip:

```shell
pip install pydeequ
```

After installation, conduct a quick test to ensure everything is running smoothly:

```python
import pydeequ

# Simple test to verify installation
print(pydeequ.__version__)
```

This quick test prints the installed version of PyDeequ, confirming that your setup is ready for action. With these steps, your system is now equipped to perform robust data quality checks with Spark and Deequ, paving the way for in-depth data profiling in your upcoming projects.

Practical Guide To Profiling Data With Deequ

Once your environment is prepared with Apache Spark and Deequ, you're ready to engage in the practical side of data profiling. Let's focus on some of the key metrics that Deequ provides for data profiling: Completeness, Uniqueness, and Correlation. First is Completeness; this metric ensures data integrity by verifying the absence of null values in your data. Uniqueness identifies and eliminates duplicate records, ensuring data distinctiveness. Finally, Correlation quantifies the relationship between two variables, providing insights into data dependencies.

Let's say you have a dataset from IMDb with the following structure:

```
root
 |-- tconst: string (nullable = true)
 |-- titleType: string (nullable = true)
 |-- primaryTitle: string (nullable = true)
 |-- originalTitle: string (nullable = true)
 |-- isAdult: integer (nullable = true)
 |-- startYear: string (nullable = true)
 |-- endYear: string (nullable = true)
 |-- runtimeMinutes: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- averageRating: double (nullable = true)
 |-- numVotes: integer (nullable = true)
```

We'll use the following Scala script to profile the dataset. This script applies various Deequ analyzers to compute metrics such as the size of the dataset, the completeness of the averageRating column, and the uniqueness of the tconst identifier.

```scala
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Deequ Profiling Example")
  .getOrCreate()

val data = spark.read.format("csv").option("header", "true").load("path_to_imdb_dataset.csv")

val runAnalyzer: AnalyzerContext = {
  AnalysisRunner
    .onData(data)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("averageRating"))
    .addAnalyzer(Uniqueness("tconst"))
    .addAnalyzer(Mean("averageRating"))
    .addAnalyzer(StandardDeviation("averageRating"))
    .addAnalyzer(Compliance("top rating", "averageRating >= 7.0"))
    .addAnalyzer(Correlation("numVotes", "averageRating"))
    .addAnalyzer(Distinctness("tconst"))
    .addAnalyzer(Maximum("averageRating"))
    .addAnalyzer(Minimum("averageRating"))
    .run()
}

val metricsResult = successMetricsAsDataFrame(spark, runAnalyzer)
metricsResult.show(false)
```
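For Python users, roughly the same profiling run can be expressed with PyDeequ. The sketch below follows PyDeequ's published analyzer examples and mirrors the Scala script above; the dataset path is a placeholder, only a subset of analyzers is shown, and the details should be checked against your installed PyDeequ version.

```python
import pydeequ
from pydeequ.analyzers import (
    AnalysisRunner,
    AnalyzerContext,
    Completeness,
    Correlation,
    Mean,
    Size,
    Uniqueness,
)
from pyspark.sql import SparkSession

# PyDeequ pulls in the Deequ JAR via Maven coordinates exposed by the package.
spark = (
    SparkSession.builder
    .appName("PyDeequ Profiling Example")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

data = spark.read.format("csv").option("header", "true").load("path_to_imdb_dataset.csv")

result = (
    AnalysisRunner(spark)
    .onData(data)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("averageRating"))
    .addAnalyzer(Uniqueness("tconst"))
    .addAnalyzer(Mean("averageRating"))
    .addAnalyzer(Correlation("numVotes", "averageRating"))
    .run()
)

# Convert the computed metrics into a DataFrame for inspection.
AnalyzerContext.successMetricsAsDataFrame(spark, result).show(truncate=False)
```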
Executing the Scala script provides a DataFrame output that reveals several insights about our data. From the output, we observe:

- The dataset has 7,339,583 rows.
- The tconst column's distinctness and uniqueness of 1.0 indicate that every value in the column is unique.
- The averageRating spans from a minimum of 1 to a maximum of 10, averaging 6.88 with a standard deviation of 1.39, highlighting the data's rating variation.
- A completeness score of 0.148 for the averageRating column reveals that only about 15% of the dataset's records have a specified average rating.
- The Pearson correlation coefficient between numVotes and averageRating, which stands at 0.01, indicates an absence of correlation between these two variables, aligning with expectations.

These metrics equip us with insights to navigate the dataset's intricacies, supporting informed decisions and strategic planning in data management.

Advanced Applications and Strategies for Data Quality Assurance

Data quality assurance is an ongoing process, vital for any data-driven operation. With tools like Deequ, you can implement strategies that not only detect issues but also prevent them. By employing data profiling on incremental data loads, we can detect anomalies and maintain consistency over time. For instance, utilizing Deequ's AnalysisRunner, we can observe historical trends and set up checks that capture deviations from expected patterns.

For example, if the usual output of your ETL jobs is around 7 million records, a sudden increase or decrease in this count could be a telltale sign of underlying issues. It's crucial to investigate such deviations, as they may indicate problems with data extraction or loading processes. Utilizing Deequ's Check function allows you to verify compliance with predefined conditions, such as expected record counts, to flag these issues automatically.

Attribute uniqueness, crucial to data integrity, also requires constant vigilance. Imagine discovering a change in the uniqueness score of a customer ID attribute, which should be unwaveringly unique. This anomaly could indicate duplicate records or data breaches. Timely detection through profiling with Deequ's Uniqueness metric will help you maintain the trustworthiness of your data.

Historical consistency is another pillar of quality assurance. Should the averageRating column, which historically fluctuates between 1 and 10, suddenly exhibit values outside this range, that raises questions: is this a data input error or an actual shift in user behavior? Profiling with Deequ helps you discern the difference and take appropriate measures. The AnalysisRunner can be configured to track the historical distribution of averageRating and alert you to any anomalies.
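These guardrails can be expressed as Deequ checks. The following PyDeequ sketch encodes the three expectations discussed above (a record count near 7 million, a unique tconst, and averageRating within 1 to 10); the exact thresholds are illustrative, and the method names are assumed to mirror the Scala Check API as shown in PyDeequ's examples.

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# `spark` and `data` are the session and DataFrame from the profiling example above.
check = (
    Check(spark, CheckLevel.Warning, "Incremental load review")
    .hasSize(lambda rows: 6_500_000 <= rows <= 7_500_000)  # expected record count band
    .isUnique("tconst")                                    # identifier must stay unique
    .hasMin("averageRating", lambda v: v >= 1.0)           # ratings stay in the historical range
    .hasMax("averageRating", lambda v: v <= 10.0)
)

result = (
    VerificationSuite(spark)
    .onData(data)
    .addCheck(check)
    .run()
)

# Each constraint's status (Success/Failure) lands in this DataFrame for alerting.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```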
Business Use Case for Anomaly Detection Using an Aggregated Metric From Deequ

Consider a business use case where a process is crawling the pages of websites, and it requires a mechanism to identify whether the crawling process is working as expected. To put anomaly detection in place for this process, we can use the Deequ library to capture record counts at particular intervals and feed them into more advanced anomaly detection techniques. For example, suppose a crawl has been identifying 9,500 to 10,500 pages daily on a website over a period of two months. In this case, if the crawl count goes above or below this range, we may want to raise an alert to the team. The diagram below displays the daily calculated record count of pages seen on the website. Using basic statistical techniques like rate of change (how the record count changes day to day), one can see that the changes always oscillate around zero, as shown in the image below.

The diagram below displays the normal distribution of the rate of change, and based on the shape of the bell curve, it is evident that the anticipated change for this data point is around 0%, with a standard deviation of 2.63%. This indicates that for this website, page additions and deletions fall within a range of roughly -5.26% to +5.26% with 90% confidence. Based on this indicator, one can set up a rule on the page record count to raise an alert if the change falls outside this guideline. This is a basic example of using a statistical method over data to identify anomalies in aggregated numbers. Depending on historical data availability and factors such as seasonality, methodologies such as Holt-Winters forecasting can be used for more efficient anomaly detection.

The fusion of Apache Spark and Deequ emerges as a powerful combination that will help you elevate the integrity and reliability of your datasets. Through the practical applications and strategies demonstrated above, we've seen how Deequ not only identifies but prevents anomalies, ensuring the consistency and accuracy of your data. So, if you want to unlock the full potential of your data, I advise you to leverage the power of Spark and Deequ. With this toolset, you will safeguard your data's quality, dramatically enhance your decision-making processes, and make your data-driven insights both robust and reliable.
In the evolving landscape of data engineering, reverse ETL has emerged as a pivotal process for businesses aiming to leverage their data warehouses and other data platforms beyond traditional analytics. Reverse ETL, or "Extract, Transform, Load" in reverse, is the process of moving data from a centralized data warehouse or data lake to operational systems and applications within your data pipeline. This enables businesses to operationalize their analytics, making data actionable by feeding it back into the daily workflows and systems that need it most.

How Does Reverse ETL Work?

Reverse ETL can be visualized as a cycle that begins with data aggregated in a data warehouse. The data is then extracted, transformed (to fit the operational systems' requirements), and finally loaded into various business applications such as a CRM, marketing platforms, or other customer support tools. These concepts can be further explored in this resource on the key components of a data pipeline.

Key Components of Reverse ETL

To effectively implement reverse ETL, it's essential to understand its foundational elements. Each component plays a specific role in ensuring that data flows smoothly from the data warehouse to operational systems, maintaining integrity and timeliness. Here's a closer look at the key components that make reverse ETL an indispensable part of modern data architecture:

- Connectors: Connectors are the bridges between the data warehouse and target applications. They are responsible for the secure and efficient transfer of data.
- Transformers: Transformers modify the data into the appropriate format or structure required by the target systems, ensuring compatibility and maintaining data integrity.
- Loaders: Loaders are responsible for inserting the transformed data into the target applications, completing the cycle of data utilization.
- Data quality: Data quality is paramount in reverse ETL, as it ensures that the data being utilized in operational systems is accurate, consistent, and trustworthy. Without high-quality data, business decisions made based on this data could be flawed, leading to potential losses and inefficiencies.
- Scheduling: Scheduling is crucial for the timeliness of data in operational systems. It ensures that the reverse ETL process runs at optimal times to update the target systems with the latest data, which is essential for maintaining real-time or near-real-time data synchronization across the business.

Evolution of Data Management and ETL

The landscape of data management has undergone significant transformation over the years, evolving to meet the ever-growing demands for accessibility, speed, and intelligence in data handling. ETL processes have been at the core of this evolution, enabling businesses to consolidate and prepare data for strategic analysis and decision-making.

Understanding Traditional ETL

Traditional ETL (Extract, Transform, Load) is a foundational process in data warehousing that involves three key steps, illustrated by the sketch after this list:

- Extract: Data is collected from various operational systems, such as transactional databases, CRM systems, and other business applications.
- Transform: The extracted data is cleansed, enriched, and reformatted to fit the schema and requirements of the data warehouse. This step may involve sorting, summarizing, deduplicating, and validating to ensure the data is consistent and ready for analysis.
- Load: The transformed data is then loaded into the data warehouse, where it is stored and made available for querying and analysis.
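As a toy illustration of those three steps, here is a hedged Python sketch that extracts order rows from a CSV export, applies a small transformation, and loads the result into a SQLite table standing in for the warehouse. File names, column names, and the schema are assumptions made for the example.

```python
import csv
import sqlite3

SOURCE_CSV = "orders_export.csv"   # hypothetical export from an operational system
WAREHOUSE_DB = "warehouse.db"      # SQLite file standing in for the data warehouse

def run_etl() -> int:
    # Extract: read raw rows from the operational export.
    with open(SOURCE_CSV, newline="", encoding="utf-8") as f:
        raw_rows = list(csv.DictReader(f))

    # Transform: clean types, drop incomplete records, and normalize country codes.
    cleaned = [
        (row["order_id"], row["country"].strip().upper(), float(row["amount"]))
        for row in raw_rows
        if row.get("order_id") and row.get("amount")
    ]

    # Load: write the conformed records into the warehouse table.
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS fact_orders "
            "(order_id TEXT PRIMARY KEY, country TEXT, amount REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO fact_orders (order_id, country, amount) VALUES (?, ?, ?)",
            cleaned,
        )
    return len(cleaned)

if __name__ == "__main__":
    print(f"Loaded {run_etl()} rows into the warehouse")
```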
Challenges With Traditional ETL

Traditional ETL has been a staple in data processing and analytics for many years; however, it presents several challenges that can hinder an organization's ability to access and utilize data efficiently.

Data Accessibility

Efficient data access is crucial for timely decision-making, yet traditional ETL can create barriers that impede this flow, such as:

- Data silos: Traditional ETL processes often lead to data silos where information is locked away in the data warehouse, making it less accessible for operational use.
- Limited integration: Integration of new data sources and operational systems can be complex and time-consuming, leading to difficulties in accessing a holistic view of the data landscape.
- Data governance: While governance is necessary, it can also introduce access controls that, if overly restrictive, limit timely data accessibility for users and systems that need it.

Latency

The agility of data-driven operations hinges on the promptness of data delivery, but traditional ETL processes can introduce delays that affect the currency of data insights, for example:

- Batch processing: ETL processes are typically batch-based, running during off-peak hours. This means that data can be outdated by the time it's available in the data warehouse for operational systems, reporting, and analysis.
- Heavy processing loads: Transformation processes can be resource-intensive, leading to delays, especially when managing large volumes of data.
- Pipeline complexity: Complex data pipelines with numerous sources and transformation steps can increase the time it takes to process and load data.

An Introduction to Reverse ETL

Reverse ETL emerged as organizations began to recognize the need not only to make decisions based on their data but to operationalize these insights directly within their business applications. The traditional ETL process focused on aggregating data from operational systems into a central data warehouse for analysis. However, as analytics matured, the insights derived from this data needed to be put into action; this gave rise to differing methods of data transformation based on use case: ETL vs. ELT vs. reverse ETL.

The next evolutionary step was to find a way to move the data and insights from the data warehouse back into the operational systems, effectively turning these insights into direct business outcomes. Reverse ETL was the answer, creating a feedback loop from the data warehouse to operational systems; a minimal sketch of such a loop appears after the list of benefits below. By transforming the data already aggregated, processed, and enriched within the data warehouse and then loading it back into operational tools (the "reverse" of ETL), organizations can enrich their operational systems with valuable, timely insights, thus complementing the traditional data analytics lifecycle.

Benefits of Reverse ETL

As part of the evolution of traditional ETL, reverse ETL presents two key advantages:

- Data accessibility: With reverse ETL, data housed in a data warehouse can be transformed and seamlessly merged back into day-to-day business tools, breaking down silos and making data more accessible across the organization.
- Real-time data synchronization: By moving data closer to the point of action, operational systems get updated with the most relevant, actionable insights, often in near-real-time, enhancing decision-making processes.
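The sketch below closes the loop from the ETL example earlier: it reads an aggregate from the SQLite "warehouse," reshapes it for an operational tool, and pushes it to a hypothetical CRM REST endpoint. The endpoint, credential, and payload shape are all assumptions; a real implementation would use the target system's actual API or a dedicated reverse ETL product.

```python
import sqlite3

import requests

WAREHOUSE_DB = "warehouse.db"  # same stand-in warehouse as in the earlier sketch
CRM_ENDPOINT = "https://example.com/crm/api/customer-attributes"  # hypothetical CRM API
API_TOKEN = "replace-me"       # placeholder credential

def reverse_etl() -> int:
    # Extract: pull an aggregated insight out of the warehouse.
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        rows = conn.execute(
            "SELECT country, SUM(amount) AS lifetime_value FROM fact_orders GROUP BY country"
        ).fetchall()

    # Transform: reshape the rows into the payload the operational tool expects.
    payload = [{"segment": country, "lifetime_value": round(total, 2)} for country, total in rows]

    # Load: push the enriched attributes into the CRM so business teams can act on them.
    response = requests.post(
        CRM_ENDPOINT,
        json={"attributes": payload},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return len(payload)

if __name__ == "__main__":
    print(f"Synced {reverse_etl()} segments to the CRM")
```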
Common Challenges of Reverse ETL

Despite the key benefits of reverse ETL, there are several common challenges to consider:

- Data consistency and quality: Ensuring the data remains consistent and high-quality as it moves back into varied operational systems requires rigorous checks and ongoing maintenance.
- Performance impact on operational systems: Introducing additional data loads to operational systems can impact their performance, which must be carefully managed to avoid disruption to business processes.
- Security and regulatory compliance: Moving data out of the data warehouse raises concerns about security and compliance, especially when dealing with sensitive or regulated data.

Understanding these challenges and benefits helps organizations effectively integrate reverse ETL into their data-driven workflow, enriching operational systems with valuable insights and enabling more informed decisions across the entire business.

Reverse ETL Use Cases and Applications

Reverse ETL unlocks the potential of data warehouses by bringing analytical insights directly into the operational tools that businesses use every day. Here are some of the most impactful ways that reverse ETL is being applied across various business functions:

- Customer Relationship Management (CRM): Reverse ETL tools transform and sync demographic and behavioral data from the data warehouse into CRM systems, providing sales teams with enriched customer profiles for improved engagement strategies.
- Marketing automation: Utilize reverse ETL's transformation features to tailor customer segments based on data warehouse insights and sync them with marketing platforms, enabling targeted campaigns and in-depth performance reporting.
- Customer support: Transform and integrate product usage patterns and customer feedback from the data warehouse into support tools, equipping agents with actionable data to personalize customer interactions.
- Product development: Usage-driven development leverages reverse ETL to transform and feed feature interaction data back into product management tools, guiding the development of features that align with user engagement and preferences.

In each of these use cases, reverse ETL tools not only move data but also apply the necessary transformations to ensure that the data fits the operational context of the target systems, enhancing the utility and applicability of the insights provided.

Five Factors to Consider Before Implementing Reverse ETL

When considering the implementation of reverse ETL at your organization, it's important to evaluate several factors that can impact the success and efficiency of the process. Here are some key considerations:

1. Data Volume

Assess the volume of data that will be moved to ensure that the reverse ETL tool can handle the load without performance degradation. Determine the data throughput needs, considering peak times and whether the tool can process large batches of data efficiently.

2. Data Integration Complexity

Consider the variety of data sources and target systems and whether the reverse ETL tool supports all necessary connectors. Evaluate the complexity of the data transformations required and whether the tool provides the necessary functionality to implement these transformations easily.

3. Scalability

Ensure that the reverse ETL solution can scale with your business needs, handling increased data loads and additional systems over time.

4. Application Deployment and Maintenance

Verify that the tool is accessible through preferred web browsers like Chrome and Safari.
Determine whether the tool can be cloud-hosted or self-hosted, and understand the hosting preferences of your enterprise customers (on-prem vs. cloud). Look for built-in integration with version control systems like GitHub for detecting and applying configuration changes.

5. Security

When implementing reverse ETL, ensure robust security by confirming the tool's adherence to SLAs with uptime monitoring, a clear process for regular updates and patches, and compliance with data protection standards like GDPR. Additionally, verify the tool's capability for data tokenization, its encryption standards for data at rest, and its possession of key certifications like SOC 2 Type 2 and EU/US Privacy Shield.

By weighing these factors, organizations can ensure that the reverse ETL tool they select not only meets their data processing needs but also aligns with their technical infrastructure, security standards, and regulatory compliance requirements.

Reverse ETL Best Practices

To maximize the benefits of reverse ETL, it's essential to adhere to best practices that ensure the process is efficient, secure, and scalable. These practices lay the groundwork for a robust data infrastructure:

- Data governance: Establish clear data governance policies to maintain data quality and compliance throughout the reverse ETL process.
- Monitoring and alerting: Implement comprehensive monitoring and alerting to quickly identify and resolve issues with data pipelines.
- Scalability and performance: Design reverse ETL workflows with scalability in mind to accommodate future growth and ensure that they do not negatively impact the performance of source or target systems.

Top Three Reverse ETL Tools

Choosing the right reverse ETL tool is crucial for success. Here's a brief overview of three popular platforms:

- Hightouch: A platform that specializes in syncing data from data warehouses directly to business tools, offering a wide range of integrations and a user-friendly interface.
- Census: Known for its strong integration capabilities, Census allows businesses to operationalize their data warehouse content across their operational systems.
- Segment: Known for its customer data platform (CDP), Segment provides reverse ETL features that allow businesses to use their customer data in marketing, sales, and customer service applications effectively.

To help select the most suitable reverse ETL tool for your organization's needs, here's a comparison table that highlights key features and differences between example solutions:

Reverse ETL Tool Comparison

| Feature | Hightouch | Census | Segment |
|---|---|---|---|
| Core Offering | Reverse ETL | Reverse ETL | CDP + limited reverse ETL |
| Connectors | Extensive | Broad | Broad |
| Custom Connector | Yes | Yes | Yes |
| Real-Time Sync | Yes | Yes | Yes |
| Transformation Layer | Yes | Yes | Only available on customer data |
| Security & Compliance | Strong | Strong | Strong |
| Pricing Model | Rows-based | Fields-based | Tiered |

Bottom Line: Is Reverse ETL Right for Your Business?

Reverse ETL can be a game-changer for businesses looking to leverage their data warehouse insights in operational systems and workflows. If your organization requires real-time data access, enhanced customer experiences, or more personalized marketing efforts, reverse ETL could be the right solution. However, it's essential to consider factors such as data volume, integration complexity, and security requirements to ensure that a reverse ETL tool aligns with your business objectives and technical requirements.
JavaScript is a pivotal technology for web applications. With the emergence of Node.js, JavaScript became relevant for both client-side and server-side development, enabling a full-stack development approach with a single programming language. Both Node.js and Apache Kafka are built around event-driven architectures, making them naturally compatible for real-time data streaming. This blog post explores open-source JavaScript clients for Apache Kafka and discusses the trade-offs and limitations of JavaScript Kafka producers and consumers compared to stream processing technologies such as Kafka Streams or Apache Flink.

JavaScript: A Pivotal Technology for Web Applications

JavaScript is a pivotal technology for web applications, serving as the backbone of interactive and dynamic web experiences. Here are several reasons JavaScript is essential for web applications:

- Interactivity: JavaScript enables the creation of highly interactive web pages. It responds to user actions in real time, allowing for the development of features such as interactive forms, animations, games, and dynamic content updates without the need to reload the page.
- Client-side scripting: Running in the user's browser, JavaScript reduces server load by handling many tasks on the client's side. This can lead to faster web page loading times and a smoother user experience.
- Universal browser support: All modern web browsers support JavaScript, making it a universally accessible programming language for web development. This wide support ensures that JavaScript-based features work consistently across different browsers and devices.
- Versatile frameworks and libraries: The JavaScript ecosystem includes a vast array of frameworks and libraries (such as React, Angular, and Vue.js) that streamline the development of web applications, from single-page applications to complex web-based software. These tools offer reusable components, two-way data binding, and other features that enhance productivity and maintainability.
- Real-time applications: JavaScript is ideal for building real-time applications, such as chat apps and live streaming services, thanks to technologies like WebSockets and frameworks that support real-time communication.
- Rich web APIs: JavaScript can access a wide range of web APIs provided by browsers, allowing for the development of complex features, including manipulating the Document Object Model (DOM), making HTTP requests (AJAX or Fetch API), handling multimedia, and tracking user geolocation.
- SEO and performance optimization: Modern JavaScript frameworks and server-side rendering solutions help in building fast-loading web pages that are also search engine friendly, addressing one of the traditional criticisms of JavaScript-heavy applications.

In short, JavaScript's capabilities offer the tools and flexibility needed to build everything from simple websites to complex, high-performance web applications.

Full-Stack Development: JavaScript for the Server Side With Node.js

With the advent of Node.js, JavaScript is no longer used only for the client side of web applications. It serves both client-side and server-side development, enabling a full-stack approach with a single programming language. This simplifies the development process and allows for seamless integration between the frontend and backend.
Using JavaScript for backend applications, especially with Node.js, offers several advantages:

- Unified language for frontend and backend: JavaScript on the backend allows developers to use the same language across the entire stack, simplifying development and reducing context switching. This can lead to more efficient development processes and easier maintenance.
- High performance: Node.js is a popular JavaScript runtime built on Chrome's V8 engine, which is known for its speed and efficiency. Node.js uses a non-blocking, event-driven architecture, which makes it particularly suitable for I/O-heavy operations and real-time applications like chat applications and online gaming.
- Vast ecosystem: JavaScript has one of the largest ecosystems, powered by npm (Node Package Manager). npm provides a vast library of modules and packages that can be easily integrated into your projects, significantly reducing development time.
- Community support: The JavaScript community is one of the largest and most active, offering a wealth of resources, frameworks, and tools. This community support can be invaluable for solving problems, learning new skills, and staying up to date with the latest technologies and best practices.
- Versatility: JavaScript with Node.js can be used for developing a wide range of applications, from web and mobile applications to serverless functions and microservices. This versatility makes it a go-to choice for many developers and companies.
- Real-time data processing: JavaScript is well-suited for applications requiring real-time data processing and updates, such as live chats, online gaming, and collaboration tools, because of its non-blocking nature and efficient handling of concurrent connections.
- Cross-platform development: Tools like Electron and React Native allow JavaScript developers to build cross-platform desktop and mobile applications, respectively, further extending JavaScript's reach beyond the web.

Node.js's efficiency and scalability, combined with the ability to use JavaScript for both frontend and backend development, have made it a popular choice among developers and companies around the world. Its non-blocking, event-driven I/O characteristics are a perfect match for an event-driven architecture.

JavaScript and Apache Kafka for Event-Driven Applications

Using Node.js with Apache Kafka offers several benefits for building scalable, high-performance applications that require real-time data processing and streaming capabilities. Here are several reasons integrating Node.js with Apache Kafka is helpful:

- Unified language for full-stack development: Node.js allows developers to use JavaScript across both the client and server sides, simplifying development workflows and enabling seamless integration between frontend and backend systems, including Kafka-based messaging or event streaming architectures.
- Event-driven architecture: Both Node.js and Apache Kafka are built around event-driven architectures, making them naturally compatible. Node.js can efficiently handle Kafka's real-time data streams, processing events asynchronously and without blocking.
- Scalability: Node.js is known for its ability to handle concurrent connections efficiently, which complements Kafka's scalability. This combination is ideal for applications that require handling high volumes of data or requests simultaneously, such as IoT platforms, real-time analytics, and online gaming.
Large ecosystem and community support: Node.js's extensive npm ecosystem includes Kafka libraries and tools that facilitate the integration. This support speeds up development, offering pre-built modules for connecting to Kafka clusters, producing and consuming messages, and managing topics. Real-time data processing: Node.js is well-suited for building applications that require real-time data processing and streaming, a core strength of Apache Kafka. Developers can leverage Node.js to build responsive and dynamic applications that process and react to Kafka data streams in real time. Microservices and cloud-native applications: The combination of Node.js and Kafka is powerful for developing microservices and cloud-native applications. Kafka serves as the backbone for inter-service communication. Node.js is used to build lightweight, scalable service components. Flexibility and speed: Node.js enables rapid development and prototyping. Kafka environments can implement new streaming data pipelines and applications quickly. In summary, using Node.js with Apache Kafka leverages the strengths of both technologies to build efficient, scalable, and real-time applications. The combination is an attractive choice for many developers. Open Source JavaScript Clients for Apache Kafka Various open-source JavaScript clients exist for Apache Kafka. Developers use them to build everything from simple message production and consumption to complex streaming applications. When choosing a JavaScript client for Apache Kafka, consider factors like performance requirements, ease of use, community support, commercial support, and compatibility with your Kafka version and features. For working with Apache Kafka in JavaScript environments, several clients and libraries can help you integrate Kafka into your JavaScript or Node.js applications. Here are some of the notable JavaScript clients for Apache Kafka from recent years: kafka-node: One of the original Node.js clients for Apache Kafka, kafka-node provides a straightforward and comprehensive API for interacting with Kafka clusters, including producing and consuming messages. node-rdkafka: This client is a high-performance library for Apache Kafka that wraps the native librdkafka library. It's known for its robustness and is suitable for heavy-duty operations. node-rdkafka offers advanced features and high throughput for both producing and consuming messages. KafkaJS: An Apache Kafka client for Node.js written entirely in JavaScript, KafkaJS focuses on simplicity and ease of use and supports the latest Kafka features. It is designed to be lightweight and flexible, making it a good choice for applications that require a simple and efficient way to interact with a Kafka cluster. Challenges With Open Source Projects In General Open source projects are only successful if an active community maintains them. Therefore, familiar issues with open source projects include: Lack of documentation: Incomplete or outdated documentation can hinder new users and contributors. Complex contribution process: A complicated process for contributing can deter potential contributors. This is not just a disadvantage, as it also guarantees code reviews and quality checks of new commits. Limited support: Relying on community support can lead to slow issue resolution times. Critical projects often require commercial support from a vendor. Project abandonment: Projects can become inactive if maintainers lose interest or lack time.
Code quality and security: Ensuring high code quality and addressing security vulnerabilities can be challenging if nobody is clearly responsible or bound by critical SLAs. Governance issues: Disagreements on project direction or decisions can lead to forks or conflicts. Issues With Kafka's JavaScript Open Source Clients Some of the above challenges apply to the available open-source JavaScript clients for Kafka. We have seen maintenance inactivity and quality issues as the biggest challenges in these projects. And be aware that it is difficult for maintainers to keep up not only with issues but also with new KIPs (Kafka Improvement Proposals). The Apache Kafka project is active and ships new features in releases two to three times a year. kafka-node, KafkaJS, and node-rdkafka are all on different parts of the "unmaintained" spectrum. For example, kafka-node has not had a commit in 5 years. KafkaJS had an open call for maintainers around a year ago. Additionally, commercial support was not available for enterprises to get guaranteed response times and support in case of production issues. Unfortunately, production issues happened regularly in critical deployments. For this reason, Confluent open-sourced a new JavaScript client for Apache Kafka with guaranteed maintenance and commercial support. Confluent's Open Source JavaScript Client for Kafka, Powered by librdkafka Confluent provides a Kafka client for JavaScript. This client works with Confluent Cloud (fully managed service) and Confluent Platform (self-managed deployments). But it is an open-source project and works with any Apache Kafka environment. The JavaScript client for Kafka comes with a long-term support and development strategy. The source code is available now on GitHub, and the client is available via npm. npm (Node Package Manager) is the default package manager for Node.js. This JavaScript client is a librdkafka-based library (derived from node-rdkafka) with API compatibility with the very popular KafkaJS library. Users of KafkaJS can easily migrate their code over (details in the migration guide in the repo). At the time of writing in February 2024, the new Confluent JavaScript Kafka Client is in early access and not for production usage. GA follows later in 2024. Please review the GitHub project, try it out, and share feedback and issues when you build new projects or migrate from other JavaScript clients. What About Stream Processing? Keep in mind that Kafka clients only provide a produce and consume API. However, the real potential of event-driven architectures comes with stream processing. This is a computing paradigm that allows for the continuous ingestion, processing, and analysis of data streams in real time. Event stream processing enables immediate responses to incoming data without the need to store and process it in batches. Stream processing frameworks like Kafka Streams or Apache Flink offer several key features that enable real-time data processing and analytics: State management: Stream processing systems can manage state across data streams, allowing for complex event processing and aggregation over time. Windowing: They support processing data in windows, which can be based on time, data size, or other criteria, enabling temporal data analysis. Exactly-once processing: Advanced systems provide guarantees for exactly-once processing semantics, ensuring data is processed once and only once, even in the event of failures.
Integration with external systems: They offer connectors for integrating with various data sources and sinks, including databases, message queues, and file systems. Event time processing: They can handle out-of-order data based on the time events actually occurred, not just when they are processed. Stream processing frameworks are NOT available for most programming languages, including JavaScript. Therefore, if you live in the JavaScript world, you have three options: Build all the stream processing capabilities by yourself. Trade-off: A lot of work! Leverage a stream processing framework in SQL (or another programming language). Trade-off: This is not JavaScript! Don't do stream processing and stay with APIs and databases. Trade-off: Cannot solve many innovative use cases. Apache Flink provides APIs for Java, Python, and ANSI SQL. SQL is an excellent option to complement JavaScript code. In a fully managed data streaming platform like Confluent Cloud, you can leverage serverless Flink SQL for stream processing and combine it with your JavaScript applications. One Programming Language Does NOT Solve All Problems JavaScript has broad adoption and sweet spots for client and server development. The new Kafka Client for JavaScript from Confluent is open source and has a long-term development strategy, including commercial support. Easy migration from KafkaJS makes the adoption very simple. If you can live with the dependency on librdkafka (which is acceptable for most situations), then this is the way to go for JavaScript Node.js development with Kafka producers and consumers. JavaScript is NOT an all-rounder. The data streaming ecosystem is broad, open, and flexible. Modern enterprise architectures leverage microservices or data mesh principles. You can choose the right technology for your application. Learn how to build data streaming applications using your favorite programming language and open-source Kafka client by looking at Confluent's developer examples: JavaScript/Node.js Java HTTP/REST C/C++/.NET Kafka Connect DataGen Go Spring Boot Python Clojure Groovy Kotlin Ruby Rust Scala Which JavaScript Kafka client do you use? What are your experiences? Or do you already develop most applications with stream processing using Kafka Streams or Apache Flink? Let’s connect on LinkedIn and discuss it!
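To make the client discussion above concrete, here is a minimal, hedged sketch of producing and consuming with the widely used KafkaJS-style API, which Confluent's new client aims to stay compatible with; the broker address, topic name, and group ID are placeholders, and the exact import path for Confluent's package is documented in its GitHub repository rather than assumed here. JavaScript
// Minimal sketch using the KafkaJS-style API (the API surface Confluent's new
// client targets for compatibility). Broker, topic, and group ID are placeholders.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'demo-app', brokers: ['localhost:9092'] });

async function run() {
  // Produce a single message to a hypothetical 'orders' topic.
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'orders',
    messages: [{ key: 'order-1', value: JSON.stringify({ total: 42.5 }) }],
  });
  await producer.disconnect();

  // Consume and log messages from the same topic.
  const consumer = kafka.consumer({ groupId: 'demo-group' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['orders'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log(`${topic}[${partition}] ${message.key}: ${message.value.toString()}`);
    },
  });
}

run().catch(console.error);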
Lately, I have been working with Polars and PySpark, which brings me back to the days when Spark fever was at its peak and every data processing solution seemed to revolve around it. This prompts me to question: was it really necessary? Let's delve into my experiences with various data processing technologies. Background During my final degree project on sentiment analysis, Pandas was just beginning to emerge as the primary tool for feature engineering. It was user-friendly and seamlessly integrated with several machine learning libraries, such as scikit-learn. Then, as I started working, Spark became a part of my daily routine. I used it for ETL processes in a nascent data lake to implement business logic, although I wondered if we were over-engineering the process. Typically, the data volumes we handled were not substantial enough to necessitate using Spark, yet it was employed every time new data entered the system: we would set up a cluster and proceed with processing using Spark. In only a few instances did I genuinely feel that Spark was the right tool for the job. This experience pushed me to develop a lightweight ingestion framework using Pandas. However, this framework did not perform as expected, struggling with medium to large files. Recently, I've started using Polars for some tasks and have been impressed by its performance in processing datasets with several million rows. This led me to set up a benchmark comparing all of these tools. Let's dive into it! A Little Bit of Context Pandas We shouldn't forget that Pandas has been the dominant tool for data manipulation, exploration, and analysis. Pandas rose in popularity among data scientists thanks to its similarities with R's data frame grid view. Moreover, it integrates closely with other Python libraries in the machine learning field: NumPy is a mathematical library for linear algebra and standard numerical calculations. Pandas is built on top of NumPy. Scikit-learn is the reference library for machine learning applications. Normally, all the data used for a model has been loaded, visualized, and analyzed with Pandas or NumPy. PySpark Spark is a free, distributed platform that changed the paradigm of big data processing, with PySpark as its Python library. It offers a unified computing engine with exceptional features: In-memory processing: Spark's major feature is its in-memory architecture, which is fast because it keeps data in memory rather than on disk. Fault tolerance: The fault tolerance mechanisms built into the engine ensure dependable data processing. Resilient Distributed Datasets (RDDs) track data lineage and allow for automatic recovery in case of node failures. Scalability: Spark's horizontally scalable architecture distributes large datasets across a cluster and processes them in parallel, using the combined power of all nodes. Polars Polars is a Python library built on top of Rust, combining the flexibility and user-friendliness of Python with the speed and scalability of Rust. Rust is a low-level language that prioritizes performance, reliability, and productivity. It is memory efficient and delivers performance on par with C and C++. Polars, in turn, builds on Apache Arrow's columnar memory format to execute vectorized queries. Apache Arrow is a cross-language development platform for fast in-memory processing.
Polars executes tabular data manipulation, analysis, and transformation operations with near-instant speed, which favors its use with large datasets. Moreover, its expression syntax is SQL-like, making even complex data processing easy to express. Another capability is its laziness, which defers query evaluation and applies query optimization. Benchmarking Setup Here is a link to the GitHub project with all the information. There are four notebooks, one per tool (two for Polars, to test eager and lazy evaluation). The code measures execution time for the following tasks: Reading Filtering Aggregations Joining Writing There are five datasets of increasing size: 50,000, 250,000, 1,000,000, 5,000,000, and 25,000,000 rows. The idea is to test different scenarios and sizes. The data used for this test is a financial dataset from Kaggle. The tests were executed on: macOS Sonoma, Apple M1 Pro, 32 GB RAM. Table of Execution Times

Row Size | Pandas | Polars Eager | Polars Lazy | PySpark
50,000 Rows | 0.368 | 0.132 | 0.078 | 1.216
250,000 Rows | 1.249 | 0.096 | 0.156 | 0.917
1,000,000 Rows | 4.899 | 0.302 | 0.300 | 1.850
5,000,000 Rows | 24.320 | 1.605 | 1.484 | 7.372
25,000,000 Rows | 187.383 | 13.001 | 11.662 | 44.724

Analysis Pandas performed poorly, especially as dataset sizes increased. However, it could handle small datasets with decent performance. PySpark, while executed on a single machine, shows considerable improvement over Pandas as the dataset size grows. Polars, both in eager and lazy configurations, significantly outperforms the other tools, showing improvements of up to 95-97% compared to Pandas and 70-75% compared to PySpark, confirming its efficiency in handling large datasets on a single machine. Visual Representations These visual aids help underline the relative efficiencies of the different tools across various test conditions. Conclusion The benchmarking results offer clear insight into the performance scalability of four widely used data processing configurations across varying dataset sizes. From the analysis, several critical conclusions emerge: Pandas performance scalability: Popular for data manipulation on smaller datasets, it struggles significantly as the data volume increases, indicating it is not the best choice for high-volume data. However, its integration with many machine learning and statistical libraries makes it indispensable for data science teams. Efficiency of Polars: Both configurations of Polars (eager and lazy) demonstrate exceptional performance across all tested scales, outperforming Pandas and PySpark by a wide margin, making Polars an efficient tool capable of processing large datasets. However, Polars has not yet released a major version of its Python library, and until it does, I don't recommend it for production systems. Tool selection strategy: The findings underscore the importance of selecting the right tool based on the specific needs of the project and the available resources. For small to medium-sized datasets, Polars offers a significant performance advantage. For large-scale distributed processing, PySpark remains a robust option. Future considerations: As dataset sizes continue to grow and processing demands increase, the choice of data processing tools will become more critical. Tools like Polars, built on Rust, are emerging, and their results have to be taken into account. Also, the tendency to use Spark as a solution for processing everything is fading, and these tools are taking its place when there is no need for large-scale distributed systems.
Use the right tool for the right job!
Kafka is a powerful streaming platform used for building real-time data streaming applications. When data is streamed into a Kafka broker, Kafka can provide metadata about each message published to a Kafka topic. This metadata can be retrieved using Kafka's built-in RecordMetadata class as an acknowledgment, which can be used to build a guaranteed message delivery mechanism. This article will explore RecordMetadata, its attributes, and how it can be leveraged in various use cases. What Is RecordMetadata? Before delving into the details of the RecordMetadata class, let's establish some key Kafka concepts. In Kafka, the producer sends streams of data to the Kafka broker, which receives, stores, and serves messages to consumers. The Kafka Consumer API allows applications to read streams of data from the cluster. When a producer or publisher sends a message or payload to a Kafka topic, the broker processes this message and returns a response to the producer. This response includes a RecordMetadata object acknowledged by the server. A RecordMetadata instance contains details such as the topic name, partition number, offset, timestamp, and more. RecordMetadata Attributes Topic: The name of the topic to which the message was delivered. Partition: A topic can be divided into partitions to handle data volume, and this partition number within the topic indicates where the message was stored. Offset: A unique identifier for the record within the partition that helps find the exact location of the message. Timestamp: The timestamp when a Kafka broker receives the message or payload. Serialized key and value size: The size of the serialized key and value of the message, useful for efficient storage. Checksum: Brokers utilize it to make sure messages haven't been corrupted during storage or transmission. Use Cases for RecordMetadata The RecordMetadata class provides essential information that can be used in various scenarios. Here are some common use cases: 1. Monitoring and Logging In one of my past projects, we had to make sure that external data ingested into our ecosystem went to a data lake, routed through a Kafka broker, for monitoring, audit, and reporting purposes. The initial deployment went really well, but we started noticing some glitches with overall Kafka availability, and these issues were mainly associated with the underlying cloud provider's network. We leveraged RecordMetadata to log and monitor the details of produced records. We captured metadata such as the topic, partition, and offset to keep track of the messages flowing successfully through the Kafka infrastructure. For successful scenarios, the topic, partition, and offset details were inserted into the database. Java
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic-name", "key", "value");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.println("Record delivered to topic " + metadata.topic()
                + " partition number: " + metadata.partition()
                + " with offset details: " + metadata.offset());
        // Save the above info in the database
    } else {
        exception.printStackTrace();
    }
});
2. Handle Errors With a Retry Process As mentioned above, network issues and broker downtime can cause message production failures. By leveraging RecordMetadata, data producers can implement intelligent error handling and retry mechanisms to make sure each and every message is published, fulfilling audit and regulatory requirements.
For instance, if a message fails to be produced, the producer can log the metadata and attempt to resend the message to the broker on the fly. If the issue persists, a separate process can pick this message up from the data store and retry it at a later time based on the flag status saved in the database. Example: Java
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        System.err.println("Error publishing the message: " + exception.getMessage());
        // Implement retry logic here
        // Add a flag in the database if it is not successful after the "Nth" retry.
    }
});
3. Performance Metrics Performance metrics can be generated using RecordMetadata attributes. By evaluating these attributes, developers can write a few lines of code to measure the latency and throughput of their Kafka message delivery operations. This information is important for optimizing performance and adhering to SLAs. Preventive measures can be implemented for high-latency scenarios to contain the issue locally and reduce the overall blast radius. Example: Java
long initialTime = System.currentTimeMillis();
producer.send(record, (metadata, exception) -> {
    long timeToDeliverMessage = System.currentTimeMillis() - initialTime;
    if (exception == null) {
        System.out.println("Message delivered successfully in " + timeToDeliverMessage + " ms");
    } else {
        // log errors
        exception.printStackTrace();
    }
});
Conclusion The RecordMetadata class provides information about responses from a Kafka broker that developers can use to implement monitoring, error handling, and performance metrics. Data producers can utilize these attributes to build a more reliable and efficient guaranteed data streaming implementation that fulfills organizational needs. References Kafka Documentation Producer API: RecordMetadata Kafka Message Acknowledgement: A Deep Dive
Snowflake brings unbeatable uniqueness to the tech world: it is a cloud-based data warehousing solution that aims to remove the nightmares associated with business data storage, management, and analytics. Essentially, it looks much like an all-in-one platform for data utilization in ways traditional setups could only have wished for. Before we go deep into Snowflake, let's first clarify what a data warehouse is: an extensive system that stores and analyzes large data sets from many sources. The main objective? To ensure businesses can make decisions based on insights from their data. Traditional data solutions are hardware-dependent, complex to deploy, and limited in scalability. Cloud solutions, on the other hand, offer flexibility, scalability, and lower upfront costs, and these are the qualities driving the shift toward technologies that adapt to business needs, such as Snowflake. Core Components of Snowflake Database Storage Snowflake's unique architecture dynamically manages data, making it accessible and secure across multiple cloud platforms and for all data types. The intelligent architecture ensures low storage costs through very effective data compression and partitioning. Snowflake provides robust security capabilities, such as always-on encryption and finely tuned access controls, for the highest levels of data integrity and compliance. Query Snowflake's query engines, called 'Virtual Warehouses,' process queries independently and provide real-time data processing without lag. Independent compute clusters: These help scale and optimize performance. The clusters work independently, allowing users to scale up or down based on performance needs without affecting other operations. Snowflake also offers job prioritization, enabling the smooth running of critical queries while less important ones wait in line. Cloud Services Layer This layer supports all core operations, from storage to query execution, ensuring seamless performance and security. Snowflake enables best-in-class live data sharing between groups without moving the data, enabling maximum collaboration. Background processes are managed to ensure no impact on business operations. Data Management Within Snowflake Structured and Semi-Structured Data Loading and Transformation Processes Snowflake minimizes manual effort and improves data loading and transformation accuracy through automation. It can easily process many data formats, both structured and semi-structured (such as JSON), without users needing standalone data transformation tools. Snowflake Architecture for Warehouse Scalability and Flexibility Vertical and horizontal scaling: Whether you need more computing power (vertical) or need to handle more operations simultaneously (horizontal), Snowflake scales smoothly. Adapting performance requirements to costs: This means scaling resources up or down with a few clicks, optimizing performance, and controlling costs more effectively. Elasticity: Snowflake automatically adapts to changes in workload without manual intervention, consistently maintaining high performance even during unexpected surges in workload.
Data Cloning and Time Travel Benefits of zero-copy cloning for developers: This capability enables developers to clone databases or tables without adding storage costs, resulting in shorter testing and development timeframes. Data retrieval through Time Travel: Time Travel allows you to access and restore data from any given point within a configurable past window, which is critical for unexpected data recovery needs. Implementation of cloning and Time Travel features: From basic error correction to historical analysis, these features provide the crucial tools to manipulate and manage data effectively. Integration and Compatibility Integration With Other Services Connecting Snowflake with BI tools and ETL systems: Snowflake integrates with a host of third-party BI and ETL tools, making data workflows easy and improving overall productivity. API and driver support: Enjoy full API and driver support for popular programming languages to easily integrate Snowflake into your tech stack. Collaboration across diverse platforms and cloud providers: Thanks to Snowflake's enterprise-grade, cloud-agnostic framework, running solutions across Amazon AWS, Microsoft Azure, and Google Cloud is possible without compatibility issues. Supported Programming Languages Examples of Using Language-Specific Features Now, let's look at some language-specific features. Custom libraries: Snowflake provides custom libraries for languages such as Java and many more, which makes the developer experience much more accessible. Optimization tips for Python, Java, and SQL: Data caching and batch querying can optimize performance and reduce latency. Additional optimization techniques include using compressed data formats and appropriate fetch sizes to ensure smooth and efficient data flow. Security and Compliance Built-in security features: Snowflake supports automatic encryption, network policies, and multi-factor authentication to secure data. International security compliance: Snowflake adheres to international standards in all its practices to meet regulatory requirements for data handling, including GDPR. Best practices in data privacy and security: To enhance security, organizations should adopt best practices such as regular audits, role-based access control, and continuous monitoring. Practical Implementation and Use Cases Setting up your first Snowflake environment: The step-by-step initial setup process is very user-friendly, making it accessible even for beginners. This includes setting up user roles and implementing security measures. Configure initial settings and permissions: Access can be tailored per team when configuring settings and permissions, ensuring security measures are maintained. Tips for efficient data loading and querying: Schedule bulk data loading during designated hours to avoid overloading the system, while keeping queries efficient. Cost Management and Optimization Controlling costs in Snowflake: With Snowflake, you pay for what you use, enabling effective cost management and avoiding resource overcommitment. Built-in analytic tools can track and optimize usage patterns, ensuring cost-effective operations. Optimizing Snowflake pricing: You can activate Virtual Warehouses with auto-suspend features and improve data clustering to increase query performance efficiency.
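Before moving on to analytics, here is a minimal sketch tying together two of the capabilities above, language drivers and Time Travel, using Snowflake's Node.js connector (snowflake-sdk); the account identifier, credentials, warehouse, and table name are placeholders, not values from this guide. JavaScript
// Minimal sketch: connect with the snowflake-sdk Node.js driver and run a
// Time Travel query. All identifiers and credentials below are placeholders.
const snowflake = require('snowflake-sdk');

const connection = snowflake.createConnection({
  account: 'your_account_identifier',
  username: 'your_user',
  password: 'your_password',
  warehouse: 'COMPUTE_WH',
  database: 'ANALYTICS',
  schema: 'PUBLIC',
});

connection.connect((err) => {
  if (err) {
    console.error('Unable to connect: ' + err.message);
    return;
  }
  connection.execute({
    // Query a hypothetical ORDERS table as it looked one hour ago.
    sqlText: 'SELECT * FROM orders AT(OFFSET => -3600) LIMIT 10',
    complete: (execErr, stmt, rows) => {
      if (execErr) {
        console.error('Query failed: ' + execErr.message);
      } else {
        console.log(rows);
      }
    },
  });
});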
Analytical Insights and Business Intelligence How companies use Snowflake: Companies of all sizes, from small startups to large enterprises, use Snowflake for scalable analytics, which provides their staff with actionable insights. Analytic features to facilitate smart decisions: Snowflake offers features such as data sharing and secure views, fostering a data-driven culture by empowering teams with real-time insights. Predictive analytics and machine learning: Snowflake supports predictive analytics and machine learning integration. For example, it integrates seamlessly with Spark, enabling you to incorporate AI and machine learning capabilities into your workflows. Future Outlook and Enhancements Features and Development Roadmap Continuous innovation will bring performance, security, and usability enhancements in future versions, often changing how teams work and resulting in better productivity and safety. Community and Support The Snowflake community is vibrant and active, allowing users to share ideas, solutions, and insights. Snowflake also provides extensive resources, including documentation, tutorials, and forums, to flatten the learning curve and support operational excellence. Snowflake Customization Snowflake is a flexible and powerful platform for developers to build robust custom applications that meet specific needs. End-to-end implementations and case studies of custom solutions built on the Snowflake platform show this in practice: from custom data models to bespoke analytics solutions, developers use Snowflake's features to build uniquely tailored applications. Conclusion This guide examined Snowflake's architecture, which is powerful, flexible, and user-friendly. From data storage and management to business intelligence, Snowflake addresses multiple business issues around data handling. Whether you're just starting out or already have a lot of data, Snowflake grows with you through its flexible, cost-effective solutions. Consider Snowflake a pivotal addition to your data strategy, with seamless elasticity, robust security features, and built-in support.
At a high level, bad data is data that doesn’t conform to what is expected. For example, an email address without the “@”, or a credit card expiry where the MM/YY format is swapped to YY/MM. “Bad” can also include malformed and corrupted data, such that it’s completely indecipherable and effectively garbage. In any case, bad data can cause serious issues and outages for all downstream data users, such as data analysts, scientists, engineers, and ML and AI practitioners. In this blog, we’ll take a look at how bad data may come to be, and how we can deal with it when it comes to event streams. Event Streams in Apache Kafka are predicated on an immutable log, where data, once written, cannot be edited or deleted (outside of expiry or compaction — more on this later). The benefit is that consumers can read the events independently, at their own pace, and not worry about data being modified after they have already read it. The downside is that it makes it trickier to deal with “bad data,” as we can’t simply reach in and edit it once it’s in there. In this post, we look at bad data in relation to event streams. How does bad data end up in an event stream? What can we do about it? What’s the impact on our downstream consumers, and how can we fix it? First, let’s take a look at the batch processing world, to see how they handle bad data and what we can learn from them. Bad Data in Batch Processing What is batch processing? Let’s quote Databricks, and go with: Batch Processing is a process of running repetitive, high volume data jobs in a group on an ad-hoc or scheduled basis. Simply put, it is the process of collecting, storing and transforming the data at regular intervals. Batch processing jobs typically rely on extracting data from a source, transforming it in some way, and then loading it into a database (ETL). Alternately, you can load it into the destination before you transform any of the data, in a recently trendy mode of operations known as ELT (by the way, data lake/warehouse people LOVE this pattern as they get all the $$$ from the transform). A friend and former colleague wrote more about ELTs and ETLs here, so take a look if you want to see another data expert’s evaluation. The gist, though, is that we get data from “out there” and bring it into “here”, our data lake or warehouse. In this figure, a periodic batch job kicks off, processes the data that lands in the landing table, and does something useful with it — like figure out how much money the company is owed (or how much your Datalake is costing you). Accomplishing this requires a reliable source of data — but whose job is it to ensure that the data powering the data lake is trustworthy and high quality? To cut to the chase, the data (or analytics) engineers in the data lake are responsible for getting data from across the company, pulling it in, and then sorting it out into a trustworthy and reliable format. They have little to no control over any of the changes made in production land. Data engineers typically engage in significant break-fix work keeping the lights on and the pipelines up and running. I would know, as I did this type of work for nearly 10 years. We’d typically apply schemas to the data once it lands in the data lake, meaning that changes to the source database table in production land may (likely) break the data sources in the data lake. Data engineers spend nights and weekends fixing malfunctioning data pipelines, broken just hours ago by the 5 pm database migration. 
Why are the data engineers responsible for applying schemas? The operational system owners have historically had no responsibility for data once it has crossed out of the source application boundary. Additionally, the data engineers taking the data out of the operational system are performing a “smash ‘n grab”, taking the data wholesale from the underlying database tables. It’s no surprise then that the operational team, with no responsibility for data modeling outside of their system, causes a breakage through a perfectly reasonable database change. The following figure shows how a broken ETL job (1) can cause bad data to show up in the Landing table (2), eventually resulting in (3) bad results. The exact reason for a broken ETL can vary, but let’s just say in this case it’s due to a change in types in the source database table (an int is now a string) that causes type-checking errors downstream. Once the data engineers spring from their beds and leap into action at 3 in the morning (when all data problems naturally occur), they can proceed to fix the ETL (4) by adding logic to handle the unexpected change. Next, they reprocess the failed batch to fix the landing table data (5), then rerun the job (6) that recomputes the results table (7) for the affected rows. For instance, say (7 — above) is a hive-declared table containing daily sales aggregate. I’m only going to delete and recompute the aggregates that I know (or think) are incorrect. I won’t drop the whole table and delete all the data if only a known subset is affected. I’ll just surgically remove the bad days (eg, April 19 to 21, 2024), reprocess (6 — above) for that time range, and then move on to reprocessing the affected downstream dependencies. Batch processing relies extensively on cutting out bad data and selectively replacing it with good data. You reach right into that great big data set, rip out whatever you deem as bad, and then fill in the gap with good data — via reprocessing or pulling it back out of the source. Bad Data Contaminated Data Sets The jobs that are downstream of this now-fixed data set must also be rerun, as their own results are also based on bad input data. The downstream data sets are contaminated, as are any jobs that run off of a contaminated data set. This is actually a pretty big problem in all of data engineering, and is why tools like dbt, and services like Databrick’s Delta Tables and Snowflake’s Dynamic Tables are useful — you can force recomputation of all dependent downstream jobs, including dumping bad data sets and rebuilding them from the source. But I digress. The important thing to note here is that once you get bad data into your data lake, it spreads quickly and easily and contaminates everything it touches. I’m not going to solve this problem here for you in this blog, however, but I do want you to be aware that it’s not always as simple as “cut out the bad, put in the good!” for fixing bad data sets. The reality is a lot messier, and that’s a whole other can of worms that I’m just going to acknowledge as existing and move on. Incremental Batch Processing One more word about processing data in batches. Many people have correctly figured out that it’s cheaper, faster, and easier to process your data in small increments, and named it incremental processing. An incremental processing job reads in new data and then applies it to its current state based on its business logic. 
For example, computing the most popular advertisements in 2024 would simply require a running tally of (advertisementId, clickCount), merging in the new events as they arrive. However, let’s say that you had bad data as input to your incremental job — say we’ve incorrectly parsed some of the click data and attributed them to the wrong advertisementId. To fix our downstream computations we’d have to issue unclick data, telling them “remove X clicks from these ads, then add X clicks to these other ads”. While it’s possible we could wire up some code to do that, the reality is we’re going to keep it simple: Stop everything, blow all the bad data away, rebuild it with good data, and then reprocess all the jobs that were affected by it. “Hold on”, you might say. “That’s nonsense! Why not just code in removals in your jobs? It can’t be that hard”. Well… kinda. For some jobs with no dependencies, you may be correct. But consider a moment if you have a process computing state beyond just simple addition and subtraction, as pretty much all businesses do. Let’s say you’re computing taxes owed for a corporation, and you’re dealing with dozens of different kinds of data sets. The logic for generating the final state of your corporate taxes is winding, treacherous, and not easily reversible. It can be very challenging and risky to code mechanisms to reverse every conceivable state, and the reality is that there will be cases that you simply don’t foresee and forget to code. Instead of trying to account for all possible reversal modes, just do what we do with our misbehaving internet routers. Just unplug it, wipe the state, and start it over. Heck, even dbt encourages this approach for misbehaving incremental jobs, calling it a full_refresh. Here are the important takeaways as we head into the streaming section. There is little prevention against bad data: The data engineering space has typically been very reactive. Import data of any and all quality now, and let those poor data engineers sort it out later. Enforced schemas, restrictions on production database migrations, and formalized data contracts between the operations and data plane are rarely used. The batch world relies on deleting bad data and reprocessing jobs: Data is only immutable until it causes problems, then it’s back to the drawing board to remutate it into a stable format. This is true regardless of incremental or full refresh work. I am often asked, “How do we fix bad data in our Kafka topic?” This is one of the big questions I myself asked as I got into event streaming, as I was used to piping unstructured data into a central location to fix up after the fact. I’ve definitely learned a lot of what not to do over the years, but the gist is that the strategies and techniques we use for batch-processed data at rest don’t transfer well to event streams. For these, we need a different set of strategies for addressing bad data. But before we get to those strategies, let’s briefly examine what happens to your business when you have bad data in your system. Beware the Side Effects of Processing Bad Data Bad data can lead to bad decisions, both by humans and by services. Regardless of batch processing or streaming, bad data can cause your business to make incorrect decisions. Some decisions are irreversible, but other decisions may not be. For one, reports and analytics built on bad data will disagree with those built on good data. Which one is wrong? 
While you’re busy trying to figure it out, your customer is losing confidence in your business and may choose to pull out completely from your partnership. While we may call these false reports a side effect, in effect, they can seriously affect the affectations of our customers. Alternatively, consider a system that tabulates vehicle loan payments, but incorrectly flags a customer as non-paying. Those burly men that go to repossess the vehicle don’t work for free, and once you figure out you’ve made a mistake, you’ll have to pay someone to go give it back to them. Any decision-making that relies on bad data, whether batch or streaming, can lead to incorrect decisions. The consequences can vary from negligible to catastrophic, and real costs will accrue regardless of if it’s possible to issue corrective action. You must understand there can be significant negative impacts from using bad data in stream processing, and only some results may be reversible. With all that being said, I won’t be able to go into all of the ways you can undo bad decisions made by using bad data. Why? Well, it’s primarily a business problem. What does your business do if it makes a bad decision with bad data? Apologize? Refund? Partial Refund? Take the item back? Cancel a booking? Make a new booking? You’re just going to have to figure out what your business requirements are for fixing bad data. Then, you can worry about the technology to optimize it. But let’s just get to it and look at the best strategies for mitigating and dealing with bad data in event streams. I’m Streaming My Life Away Event streams are immutable (aside from compaction and expiry). We can’t simply excise the bad data and inject corrected data into the space it used to occupy. So what else can we do? The most successful strategies for mitigating and fixing bad data in streams include, in order: Prevention: Prevent bad data from entering the stream in the first place: Schemas, testing, and validation rules. Fail fast and gracefully when data is incorrect. Event design: Use event designs that let you issue corrections, overwriting previous bad data. Rewind, rebuild, and retry: When all else fails. In this blog, we’re going to look primarily at prevention, covering the remaining strategies in a follow-up post. But to properly discuss these solutions, we need to explore what kind of bad we’re dealing with and where it comes from. So let’s take a quick side trip into the main types of bad data you can expect to see in an event stream. The Main Types of Bad Data in Event Streams As we go through the types, you may notice a recurring reason for how bad data can get into your event stream. We’ll revisit that at the end of this section. 1. Corrupted Data The data is simply indecipherable. It’s garbage. It turned into a sequence of bytes with no possible way to retrieve the original data. Data corruption is relatively rare but may be caused by faulty serializers that convert data objects into a plain array of bytes for Kafka. Luckily, you can test for that. 2. Event Has No Schema Someone has decided to send events with no schema. How do you know what’s “good data” and what’s “bad data”, if there are no structure, types, names, requirements, or limitations? 3. Event Has an Invalid Schema Your event’s purported schema can’t be applied to the data. For example, you’re using the Confluent Schema Registry with Kafka, but your event’s Schema Id doesn’t correspond to a valid schema. 
It is possible you deleted your schema, or that your serializer has inserted the wrong Schema Id (perhaps for a different schema registry, in a staging or testing environment?). 4. Incompatible Schema Evolution You're using a schema (hooray!), but the consumer cannot convert the schema into a suitable format. The event is deserializable, but not mappable to the schema that the consumer expects. This is usually because your source has undergone breaking schema evolution (note that evolution rules vary per schema type), but your consumers have not been updated to account for it. 5. Logically Invalid Value in a Field Your event has a field with a value that should never occur. For example, an array of integers for "first_name", or a null in a field that should never be null, the classic NullPointerException waiting to happen (see below). This error type arises when you are not using a well-defined schema, but simply a set of implicit conventions. It can also arise if you are using an invalid, incomplete, old, or homemade library for serialization that ignores parts of your serialization protocol. 6. Logically Valid but Semantically Incorrect These types of errors are a bit trickier to catch. For example, you may have a serializable string for a "first_name" field (good!), but the name is "Robert'); DROP TABLE Students; --". While little Bobby Tables here is a logically valid answer for a first_name field, it is highly improbable that this is another one of Elon Musk's kids. The data in the entry may even be downright damaging. The following shows an event with a negative "cost". What is the consumer supposed to do with an order where the cost is negative? This could be a case of a simple bug that slipped through into production, or something more serious. But since it doesn't meet expectations, it's bad data. Some event producer systems are more prone to these types of errors. For example, a service that parses and converts NGINX server logs or customer-submitted YAML/XML files of product inventory into individual events. Malformed sources may be partially responsible for these types of errors. 7. Missing Events This one is pretty easy. No data was produced, but there should have been something. Right? The nice thing about this type of bad data is that it's fairly easy to prevent via testing. However, it can have quite an impact if only some of the data is missing, making it harder to detect. More on this in a bit. 8. Events That Should Not Have Been Produced There is no undo button to call back an event once it is published to an event stream. We can fence out one source of duplicates with idempotent production, meaning that intermittent failures and producer retries won't accidentally create duplicates. However, we cannot fence out duplicates that are logically indistinguishable from other events. These types of bad events are typically created due to bugs in your producer code. For example, you may have a producer that creates a duplicate of: An event that indicates a change or delta ("add 1 to sum"), such that an aggregation of the data leads to an incorrect value. An analytics event, such as tracking which advertisements a user clicked on. This will also lead to an overinflation of engagement values. An e-commerce order with its own unique order_id (see below). It may cause a duplicate order to be shipped (and billed) to a customer. While there are likely more types of bad data in event streams that I may have missed, this should give you a good idea of the types of problems we typically run into.
Now let's look at how we can solve these types of bad data, starting with our first strategy: Prevention. Preventing Bad Data With Schemas, Validation, and Tests Preventing the entry of bad data into your system is the number one approach to making your life better. Diet and exercise are great, but there's no better feeling than watching well-structured data seamlessly propagate through your systems. First and foremost are schemas. Confluent Schema Registry supports Avro, Protobuf, and JSON Schema. Choose one and use it (I prefer Avro and Protobuf myself). Do yourself, your colleagues, and your future selves a favor. It's the best investment you'll ever make. There are also other schema registries available, though I personally have primarily used the Confluent one over the years (and also, I work at Confluent). But the gist is the same — make it easy to create, test, validate, and evolve your event schemas. Preventing Bad Data Types 1–5 With Schemas and Schema Evolution Schemas significantly reduce your error incident rates by preventing your producers from writing bad data, making it far easier for your consumers to focus on using the data instead of making best-effort attempts to parse its meaning. Schemas form a big part of preventing bad data, and it's far, far, far easier to simply prevent bad data from getting into your streams than it is to try to fix it after the damage has already started. JSON is a lightweight data-interchange format. It is a common yet poor choice for events, as it doesn't enforce types, mandatory and optional fields, default values, or schema evolution. While JSON has its uses, it's not for event-driven architectures. Use an explicitly defined schema such as Avro, Protobuf, or JSON Schema. Going schemaless (aka using JSON) is like going around naked in public. Sure, you're "free" of the constraints, boundaries, and limitations, but at what expense? Everyone else has to figure out what the hell is going on, and chaos (and the police) will follow. But reeling back in the hyperbole, the reality is that your consumers need well-defined data. If you send data with a weak or loose schema, it just puts the onus on the consumer to try to figure out what you actually mean. Let's say we have 2 topics with no schemas and 4 consumers consuming them. So many chances to screw up the data interpretation! There are 8 possible chances that a consumer will misinterpret the data from an event stream. And the more consumers and topics you have, the greater the chance they misinterpret data compared to their peers. Not only will your consumers get loud, world-stopping exceptions, but they may also get silent errors — miscalculating sums and misattributing results, leading to undetected divergence of consumer results. These discrepancies regularly pop up in data engineering, such as when one team's engagement report doesn't match the other team's billing report due to divergent interpretations of unstructured data. It's worth contrasting this multi-topic, multi-consumer approach with the typical ETL/ELT pipeline into the data plane. In this streaming model, we're not differentiating who uses the data for what purposes. A consumer is a consumer. In contrast, with ETLs, we're typically moving data into one major destination, the data lake (or warehouse), so it's a lot easier to apply a schema to the data after it lands but before any dependent jobs consume it. With streaming, once it's in the stream, it's locked in.
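To ground this, here is a minimal sketch of producer-side validation, assuming a hypothetical 'orders' topic and using the ajv JSON Schema validator as a stand-in for whatever schema tooling you adopt; in practice you would more likely use Avro or Protobuf with a schema registry, as argued above. JavaScript
// Minimal sketch: validate an event against an explicit schema before it is
// ever produced, so bad data never reaches the immutable stream. Topic name,
// schema, and broker address are illustrative placeholders.
const Ajv = require('ajv');
const { Kafka } = require('kafkajs');

const ajv = new Ajv();
const validateOrder = ajv.compile({
  type: 'object',
  required: ['order_id', 'first_name', 'cost'],
  properties: {
    order_id: { type: 'string' },
    first_name: { type: 'string' },
    cost: { type: 'number', minimum: 0 }, // rejects the "negative cost" case
  },
  additionalProperties: false,
});

const kafka = new Kafka({ clientId: 'order-producer', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function produceOrder(order) {
  if (!validateOrder(order)) {
    // Fail fast and loudly instead of writing bad data into the stream.
    throw new Error('Invalid order event: ' + ajv.errorsText(validateOrder.errors));
  }
  await producer.send({
    topic: 'orders',
    messages: [{ key: order.order_id, value: JSON.stringify(order) }],
  });
}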
Implicit schemas, historical conventions, and tribal knowledge are unsuitable for providing data integrity. Use a schema, make it strict, and reduce your consumer's exposure to unintentional data issues. Once adopted, you can rely on your CI/CD pipelines to perform schema, data, and evolution validation before deploying. The result? No more spewing bad data into your production streams. Data Quality Rules: Handling Type 6 (Logically Valid but Semantically Incorrect) While many of the "bad data" problems can be avoided by using schemas, they are only a partial solution for this type. Sure, we can enforce the correct type (so no more storing Strings in Integer fields), but we can't guarantee the specific semantics of the data. So what can we do, aside from using a schema? Write producer unit tests. Throw exceptions if data is malformed (e.g., if the phone number is longer than X digits). Rely on data contracts and data quality rules (note: JSON Schema also has some built-in data quality rules). Here's an example of a Confluent data quality rule for a US Social Security Number (SSN). JSON
{
  "schema": "…",
  "ruleSet": {
    "domainRules": [
      {
        "name": "checkSsnLen",
        "kind": "CONDITION",
        "type": "CEL",
        "mode": "WRITE",
        "expr": "size(message.ssn) == 9"
      }
    ]
  }
}
This rule enforces an exact length of 9 characters for the SSN. If it's an Integer, we could also enforce that it must be positive, and if it is a string, it must only contain numeric characters. The data quality checks are applied when the producer attempts to serialize data into a Kafka record. If the message.ssn field is not exactly 9 characters in length, then the serializer will throw an exception. Alternatively, you can also send the record to a dead-letter queue (DLQ) upon failure. Approach DLQ usage with caution. Simply shunting the data into a side stream means that you'll still have to deal with it later, typically by repairing it and resending it. DLQs work best where each event is completely independent, with no relation to any other event in the stream, and ordering is not important. Otherwise, you run the risk of presenting an error-free yet incomplete stream of data, which can also lead to its own set of miscalculations and errors! Don't get me wrong. DLQs are a good choice in many scenarios, but they should truly be a last-ditch effort at preventing bad data from getting into a stream. Try to ensure that you test, trial, and foolproof your producer logic to publish your record to Kafka correctly the first time. Testing — Handling Types 7 (Missing Data) and 8 (Data That Shouldn't Have Been Produced) Third in our trio of prevention heroes is testing. Write unit and integration tests that exercise your serializers and deserializers, including schema formats (validate against your production schema registry), data validation rules, and the business logic that powers your applications. Integrate producer testing with your CI/CD pipeline so that your applications go through a rigorous evaluation before they're deployed to production. Both Type 7: Missing Data and Type 8: Data that shouldn't have been produced are actually pretty easy to test against. One of the beautiful things about event-driven systems is that it's so easy to test them. For integration purposes, you simply produce events on the inputs, wait to see what comes out of the outputs, and evaluate accordingly. Once you find the bug in your logic, write another test to ensure that you don't get a regression. Summary Bad data can creep into your data sets in a variety of ways.
Data at rest consists of partitioned files typically backed by a Parquet or ORC format. The data is both created and read by periodically executed batch processes. The files are mutable, which means that bad data can be fixed in place and overwritten, or it can be deleted and regenerated by the upstream batch job. Kafka topics, in contrast, are immutable. Once a bad event is written into the event stream, it cannot be surgically removed, altered, or overwritten. Immutability is not a bug but a feature — every consumer gets the same auditable data. But this feature requires you to be careful and deliberate about creating your data. Good data practices prevent you from getting into trouble in the first place. Write tests, use schemas, use data contracts, and follow schema evolution rules. After your initial investment, you and your colleagues will save so much time, effort, and break-fix work that you’ll actually have time to do some of the fun stuff with data — like getting actual work done. Prevention is the single most effective strategy for dealing with bad data. Much like most things in life, an ounce of prevention is worth a pound of cure (or 28.3g of prevention and 454g of cure for those of us on the metric system). In the next post, we’ll take a look at leveraging event design as part of handling bad data in event streams. There are many ways you can design your events, and we’ll look at a few of the most popular ways and their tradeoffs. Don’t touch that dial, we’ll be right back (yes, TVs used to have dials, and needed time to warm up).
Real-time data is no longer a nice-to-have, but a must-have when creating relevant and engaging user experiences. Most industries today have grown accustomed to consuming instant updates, so if you're a front-end developer looking to break into real-time app development, you'll need to master the flow of real-time data. As a developer advocate at Redpanda, my job is to help developers leverage streaming data in their applications. Part of that involves introducing developers to better technologies and showcasing them in practical and fun use cases. So, in this post, I'll demonstrate how I used three modern technologies — Redpanda Serverless, Pusher, and Vercel — to create a compelling real-time frontend application, which I hope will spark your own ideas for how you can implement this powerful trio in your world.

The Cupcake Conundrum

Imagine a bustling cupcake business in NYC. To reel in customers in such a competitive market, they'll need to make their location readily visible to nearby cupcake fans. They'll also need to engage their customers via immediate feedback to build trust, enhance the overall user experience, and drive repeat business and customer loyalty. However, developing real-time applications has traditionally been difficult because most web applications are designed to respond to user inputs rather than continuously listen for and process incoming data streams. The latter requires a robust and complex infrastructure to manage persistent connections and handle high volumes of data with minimal latency. For the visual learners, there's a quick video explaining the use case.

Selecting the Right Technologies

I chose Redpanda Serverless as the streaming data platform since traditional streaming data solutions, like Apache Kafka®, can be complex and resource-intensive, making them a massive hurdle for teams with limited time and resources. Some considerations when running the platform:

Eliminates infrastructure overhead: It manages the backbone of streaming data, allowing me to focus on application logic.
Simplifies scalability: Effortlessly scales with my application's needs, accommodating spikes in data without manual intervention.
Reduces time to market: With a setup time of seconds, it speeds up development for quicker iterations and feedback.
Pay as you grow: It adapts to my usage, ensuring costs align with my actual data processing needs, which is ideal for startups and small projects.

This takes care of the complex infrastructure for dealing with high volumes of data and the low latency that's expected of real-time applications. Now, I need to establish a single, long-lived connection between the browser and the server, typically done through WebSocket, which sets up a full-duplex communication channel over a single TCP connection (established via an HTTP upgrade). This allows the server to push updates to the browser client without needing periodic requests. However, Vercel's serverless functions don't support WebSocket connections, so I needed an alternative solution. Here's where Pusher pops up. Pusher lets me create real-time channels between the server and client, hiding the complexity associated with managing WebSocket connections directly. When deploying real-time frontend applications, Vercel stands out for its seamless Git repository integration that makes deployments easy. With every push, changes are deployed automatically, and I can pull in website statistics and data from other solutions (like databases) when needed.
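The walkthrough below focuses on consuming from the topic, so as a point of reference, here is a minimal, hypothetical sketch of the other half of the pipeline: a producer publishing cupcake inventory updates to the inv-count topic created later in Step 1. It assumes the kafkajs client and my own environment variable names (REDPANDA_BROKERS, REDPANDA_USERNAME, REDPANDA_PASSWORD); Redpanda Serverless clusters typically authenticate with SASL/SCRAM over TLS, but verify the exact connection settings against your cluster.

JavaScript

// Minimal producer sketch (my own illustration, not from the original app).
// Assumes: npm install kafkajs, plus the REDPANDA_* environment variables.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'cupcake-inventory-producer',
  brokers: [process.env.REDPANDA_BROKERS], // e.g., the Serverless bootstrap address
  ssl: true,
  sasl: {
    mechanism: 'scram-sha-256',            // verify against your cluster settings
    username: process.env.REDPANDA_USERNAME,
    password: process.env.REDPANDA_PASSWORD,
  },
});

async function publishInventoryUpdate(update) {
  const producer = kafka.producer();
  await producer.connect();
  // One JSON event per stock change, keyed by store so per-store ordering holds.
  await producer.send({
    topic: 'inv-count',
    messages: [{ key: update.store, value: JSON.stringify(update) }],
  });
  await producer.disconnect();
}

// Example: store "store_001" now has 12 blueberry and 7 strawberry cupcakes.
publishInventoryUpdate({ store: 'store_001', blueberry: 12, strawberry: 7 })
  .catch(console.error);

The field names (store, blueberry, strawberry) match what the front end reads later; the rest of the wiring is an assumption for illustration.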
Preparing the Application

In my application, mapview.js acts as a Vercel serverless function, which plays the most important role by consuming data from the topic I created in Redpanda Serverless and then updating the inventory status. Before using Pusher to relay these updates to the front end, the serverless function maps the store IDs in store_nyc.csv to their physical locations and then adds the location information (latitude and longitude) that the client needs to render.

JavaScript

await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    const messageData = JSON.parse(message.value.toString());
    const location = storeLocations[messageData.store];
    const { store, ...rest } = messageData;
    for (let store in inventory) {
      inventory[store].latest = false;
    }
    inventory[messageData.store] = { ...rest, ...location, latest: true };
    try {
      pusher.trigger("my-channel", channelId, JSON.stringify(inventory));
    } catch (error) {
      console.error('Error:', error);
    }
  },
})

Note: Vercel serverless functions have a maximum duration limit, which varies depending on your subscription plan, so I set MAX_BLOCK_TIME to five seconds. The Pro plan allows up to 300 seconds of execution for a better user experience.

JavaScript

await new Promise(resolve =>
  setTimeout(resolve, MAX_BLOCK_TIME)
);

On the front end, index.html renders the real-time map with the LeafletJS library and applies the inventory updates, giving end users a dynamic and interactive experience.

JavaScript

channel.bind('cupcake-inv', function(data) {
  var inventory = data;
  tableBody.innerHTML = '';
  for (var store in inventory) {
    var storeData = inventory[store];
    if (markers[store]) {
      markers[store].setLatLng([storeData.lat, storeData.lng])
        .setPopupContent(`<b>${storeData.store}</b><br>Blueberry: ${storeData.blueberry}<br>Strawberry: ${storeData.strawberry}`);
    } else {
      markers[store] = L.marker([storeData.lat, storeData.lng]).addTo(map)
        .bindPopup(`<b>${storeData.store}</b><br>Blueberry: ${storeData.blueberry}<br>Strawberry: ${storeData.strawberry}`);
    }
  }
});

It also generates a unique session ID per session to create channels in Pusher, so each session has its own channel on which to receive updates.

JavaScript

channel.bind(uniqueChannelId, function(data) {
  var inventory = data;
  for (var store in inventory) {
    var storeData = inventory[store];
    ……

document.addEventListener('DOMContentLoaded', () => {
  fetch(`/api/mapview?channelId=${encodeURIComponent(uniqueChannelId)}`)

The Recipe: Real-Time Cupcake Updates With Redpanda Serverless, Vercel, and Pusher

It's time to start cooking! Here's a step-by-step breakdown of how I brought this vision to life, which you can follow. If you want to skip ahead, you can find all the code in this GitHub repository.

Step 1: Set up Redpanda Serverless

Sign up and create the cluster: After signing up, click the Create Cluster button and select a region close to your workload, ensuring low latency for your data.
Create the user and set permissions: Under the Security tab, create a new user and set the necessary permissions.
Create the topic: Create a topic called inv-count that's dedicated to tracking cupcake stock updates.

Step 2: Integrate Pusher for Real-Time Updates

Register the application: After creating an app within Pusher, copy the application credentials, including the app_id, key, secret, and cluster information, and store them for use in your application (a minimal sketch of wiring these up follows below).
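For reference, here is a minimal, hypothetical sketch of how those credentials might be wired up, assuming the official pusher (server) and pusher-js (browser) packages and my own environment variable names (PUSHER_APP_ID, PUSHER_KEY, PUSHER_SECRET, PUSHER_CLUSTER); the actual repository may organize this differently.

JavaScript

// Server side (e.g., inside the Vercel function): initialize the Pusher client
// from environment variables so secrets stay out of the codebase.
// Assumes: npm install pusher
const Pusher = require('pusher');

const pusher = new Pusher({
  appId: process.env.PUSHER_APP_ID,
  key: process.env.PUSHER_KEY,
  secret: process.env.PUSHER_SECRET,
  cluster: process.env.PUSHER_CLUSTER,
  useTLS: true,
});

// Browser side (e.g., in index.html): subscribe with the public key only.
// Assumes pusher-js is loaded via a <script> tag or a bundler.
// const pusherClient = new Pusher('<public-key>', { cluster: '<cluster>' });
// const channel = pusherClient.subscribe('my-channel');
// channel.bind(uniqueChannelId, (data) => { /* update the map and table */ });

Only the public key and cluster belong in the browser; the app_id and secret should stay server-side as environment variables, which ties directly into the Vercel configuration in the next step.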
Step 3: Deploy With Vercel

Integrate with GitHub: Push the updated codebase to a GitHub repository, ensuring your changes are version-controlled and ready for deployment.
Import and set up the project in Vercel: Navigate to Vercel and import the project by selecting the "cupcakefanatic" repository. Specify cupcake-pusher as the root directory for the deployment.
Configure the environment: Enter the project-specific environment variables.

With that, I can establish a seamless real-time connection between the server and clients, enhancing the store's online presence and user engagement — without the heavy lifting traditionally associated with real-time streaming data. Below is a screenshot of the resulting real-time data in our cupcake app.

With the winning combination of Redpanda Serverless, Pusher, and Vercel, I easily created a dynamic, responsive application that keeps customers informed and engaged with live inventory updates. If you have questions, ask me in the Redpanda Community on Slack; I'm there most of the time. :)