IoT Resources

DZone's Featured IoT Resources

Understanding WebRTC Security Architecture and IoT

By Carsten Rhod Gregersen

In the IoT world, security is one of the biggest challenges. When you connect multiple devices over a network, each device and each data transmission adds another door that can be left ajar to security threats. Still, data transmissions are an integral part of IoT: they let devices share various types of data, including notifications and media files, with one another. This ability is essential for IoT ecosystems, in which devices need to communicate efficiently to perform complex tasks. Access to the data channel, however, must be both restricted and encrypted to maintain security.

WebRTC is one approach to establishing secure data channels over an IoT network. WebRTC establishes direct peer-to-peer connections, allowing data to flow directly between devices instead of through a separate server. The basic security consists of three mandatory WebRTC encryption protocols: secure real-time protocol (SRTP), secure encryption key exchange, and secure signaling. These protocols encrypt the data sent through WebRTC, protect the encryption keys, and secure the web server connection. Here, we'll explain further how WebRTC security works to protect your IoT network.

A Look at SRTP for WebRTC Security

One of the primary concerns in IoT security is the potential for data interception. WebRTC mitigates this risk with secure real-time protocol (SRTP), which encrypts media streams and data packets during transfer. Such protocols are widely used in systems such as video surveillance, smart home devices, healthcare IoT, industrial IoT, and connected vehicles, making them essential for securing real-time data transfer across various IoT applications.

SRTP builds on the basic real-time protocol (RTP) by adding encryption and authentication layers. Each data packet is encrypted using a unique key shared exclusively between communicating devices.
This ensures that even if a packet is intercepted, its content cannot be accessed without the decryption key. WebRTC achieves secure key exchange through DTLS-SRTP, which integrates Datagram Transport Layer Security (DTLS) with SRTP to establish secure connections.

In addition to encryption, SRTP includes mechanisms for data integrity verification. Every packet carries an authentication tag, a keyed hash (message authentication code) that confirms the packet has not been tampered with during transmission. If a packet's tag fails verification, the packet is discarded, protecting communication from interference.

Encryption Key Exchange

While SRTP encrypts the data itself, WebRTC employs secure encryption key exchange mechanisms to protect the keys that control access to data streams. These keys, often referred to as session keys, are unique, temporary codes used to encrypt and decrypt the data exchanged between devices. Without these keys, intercepted data cannot be read or modified.

Key exchange begins with a DTLS "handshake," a process that verifies the identities of communicating devices and securely transfers encryption keys. This step ensures that only authenticated devices can participate in the communication. Essentially, DTLS plays a critical role in WebRTC by confirming the credentials of both the sender and receiver (similar to verifying IDs) to ensure all participants in the media stream are who they claim to be.

A crucial part of this process involves the exchange and validation of certificate fingerprints. WebRTC provides a mechanism to generate fingerprints of certificates, which act as unique identifiers for each device in the connection.

Secure Signaling

In WebRTC, signaling (the process that helps establish a peer-to-peer connection) is a crucial security component. Signaling mechanisms are used to set up, modify, and terminate WebRTC connections.
Although WebRTC doesn't define a specific signaling protocol, developers typically rely on secure channels (like HTTPS or secure WebSockets, WSS) to manage signaling messages.

To understand the differences between SRTP, secure encryption key exchange, and secure signaling, think of them as three roles in building a secure house:

SRTP: SRTP is like the lock on the doors and windows in WebRTC security. It ensures that once people (here, data) are inside the house, they are safe and cannot be accessed by unauthorized individuals. It encrypts media streams (audio, video, or data packets) and ensures they remain private and untampered during transmission.
Encryption key exchange: This is like the locksmith who provides and secures the keys to the locks. DTLS verifies the identities of the participants (like showing ID to ensure you're the homeowner) and securely delivers the session keys that control access to the encrypted data.
Secure signaling: Secure signaling is like the blueprint and construction crew that set up the house and its security features. Signaling manages the negotiation of how the connection will function, determining the structure (e.g., codecs, ICE candidates, and connection parameters) while ensuring the plans (signaling messages) are not intercepted or altered during setup.

So, while SRTP and DTLS focus on protecting the data itself and the keys that enable encryption, secure signaling ensures that the initial connection setup process remains private and free from interference. By securing the signaling messages, WebRTC prevents attackers from tampering with the connection parameters or hijacking the session during its setup phase.

Additional WebRTC Security Considerations

While SRTP, encryption key exchange, and secure signaling are foundational to WebRTC security, several other safeguards ensure that WebRTC operates within a robust security framework.
Browser trust and security updates: Since WebRTC is a browser-based technology, security depends heavily on the browser's integrity and update cycle. Trusted browsers like Chrome and Firefox automatically receive security patches, reducing the likelihood of vulnerabilities. However, downloading the browser from a trusted source is critical; a compromised browser could weaken WebRTC's security.
User permissions and access control: WebRTC requires explicit user permission to access local resources like cameras and microphones. This permission-based access prevents unauthorized apps from using a device's hardware and informs users when an application is accessing these resources.
TURN servers and data routing: When direct peer-to-peer connections are not possible, WebRTC falls back on TURN servers, which relay data but cannot access its content due to encryption. This fallback option ensures secure communication even in network-restricted environments.

Final Thoughts

While WebRTC provides robust security features, its effectiveness depends heavily on how it is implemented in applications. The protocols discussed earlier (SRTP for encrypting data streams, DTLS for secure key exchange, and secure signaling for safeguarding the connection setup) form a strong foundation. However, if developers cut corners or mismanage these elements, the data channel can still be left vulnerable to attack.

For example, using insecure signaling mechanisms, such as unencrypted HTTP or plain WebSockets (ws://) instead of HTTPS or secure WebSockets (wss://), undermines the secure signaling process and exposes the connection setup to interception. Similarly, failing to implement proper DTLS key exchange or neglecting to update SRTP configurations to the latest security standards can compromise the integrity of the encrypted data streams.
By adhering to WebRTC security best practices (ensuring secure signaling channels, maintaining updated encryption standards, and leveraging the inherent strengths of SRTP and DTLS), IoT developers can create applications that are both highly functional and secure. These measures are critical to protecting sensitive data and ensuring the reliability of IoT ecosystems in a world where security threats continue to evolve.
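To make the packet authentication step from the SRTP section more concrete, here is a minimal Python sketch of tagging and verifying a packet with a truncated HMAC. It is a simplification under stated assumptions: the 80-bit HMAC-SHA1 tag mirrors a common SRTP profile, but the key here is random rather than derived through DTLS-SRTP, the packet is a stand-in byte string, and real SRTP also covers header fields and a rollover counter. The function names are hypothetical.

```python
import hashlib
import hmac
import os

TAG_LEN = 10  # 80-bit tag, as in the HMAC-SHA1-80 profile (assumed here)


def make_tag(auth_key: bytes, packet: bytes) -> bytes:
    """Return a truncated HMAC-SHA1 tag over the (already encrypted) packet."""
    return hmac.new(auth_key, packet, hashlib.sha1).digest()[:TAG_LEN]


def protect(auth_key: bytes, packet: bytes) -> bytes:
    """Append the authentication tag to the packet."""
    return packet + make_tag(auth_key, packet)


def verify(auth_key: bytes, protected: bytes):
    """Return the packet if its tag checks out, otherwise None (discard)."""
    if len(protected) < TAG_LEN:
        return None
    packet, tag = protected[:-TAG_LEN], protected[-TAG_LEN:]
    expected = make_tag(auth_key, packet)
    # compare_digest avoids leaking information through timing differences
    return packet if hmac.compare_digest(tag, expected) else None


# Hypothetical key; in WebRTC it would come out of the DTLS-SRTP handshake
auth_key = os.urandom(20)
sent = protect(auth_key, b"sensor-frame-0001")

# Flip one bit to simulate tampering in transit
tampered = bytes([sent[0] ^ 0x01]) + sent[1:]

print(verify(auth_key, sent))      # the original payload
print(verify(auth_key, tampered))  # None: the packet is discarded
```

The receiver never needs to know what changed; any mismatch between the recomputed and received tag is enough to drop the packet.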

Unsupervised Learning Methods for Analyzing Encrypted Network Traffic

By Anurag Agrawal

Unsupervised learning methods have emerged as invaluable tools for analyzing encrypted network traffic. These techniques are particularly useful because they don't require labeled data, which is often difficult or impossible to obtain for encrypted communications. Let's explore how unsupervised learning methods are applied to encrypted traffic analysis.

Clustering Algorithms

Clustering algorithms are widely used for encrypted traffic analysis due to their ability to group similar traffic flows without prior knowledge of their classification.

K-Means

K-means groups traffic flows into K clusters based on similarity in features like packet size, inter-arrival times, and flow duration. It can help identify different types of encrypted traffic (e.g., streaming, browsing, file transfer) based on their behavioral patterns. However, determining the optimal number of clusters (K) can be challenging and may require domain expertise.

DBSCAN (Density-Based Spatial Clustering of Applications With Noise)

DBSCAN is particularly useful for encrypted traffic analysis because:

It can identify clusters of arbitrary shape, capturing complex traffic patterns.
It's effective at detecting outliers, which could represent anomalous or malicious encrypted traffic.
It doesn't require specifying the number of clusters beforehand, making it more flexible for diverse traffic patterns.

HDBSCAN (Hierarchical DBSCAN)

HDBSCAN extends DBSCAN's capabilities by handling clusters of varying densities and providing a hierarchical clustering structure. This allows for multi-level analysis of encrypted traffic patterns, which is useful for analyzing different types of encrypted traffic with varying characteristics.
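As a rough illustration of the K-means idea described above, the following sketch clusters a handful of flows using only the Python standard library. The flow values are invented, and the two features (mean packet size and mean inter-arrival time) are assumed to be already scaled to the 0..1 range; a real analysis would more likely use a library implementation and many more features.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: returns (labels, centroids) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k flows as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each flow joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical flows: (scaled mean packet size, scaled mean inter-arrival time)
flows = [
    (0.90, 0.10), (0.95, 0.12), (0.88, 0.08),  # large packets, short gaps
    (0.30, 0.60), (0.25, 0.70), (0.35, 0.65),  # small packets, long gaps
]

labels, centroids = kmeans(flows, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves carry no meaning; what matters is that flows with similar behavior end up sharing a label, which an analyst can then inspect and name.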
Dimensionality Reduction

Dimensionality reduction techniques are crucial for handling the high-dimensional nature of encrypted traffic data.

Principal Component Analysis (PCA)

PCA is widely used in encrypted traffic analysis to:

Identify the most important features, reducing noise and computational complexity.
Reveal underlying patterns that may not be apparent in the original high-dimensional space.
Visualize encrypted traffic data in lower dimensions, aiding in the identification of clusters or anomalies.

Autoencoders

Autoencoders, a type of neural network, are increasingly used for dimensionality reduction in encrypted traffic analysis:

They learn compact representations of encrypted traffic features, capturing complex non-linear relationships.
They are effective at noise reduction, helping to isolate the most relevant characteristics of encrypted traffic flows.
The reconstruction error of autoencoders can be used to detect anomalies in encrypted traffic.

Anomaly Detection

Unsupervised learning methods are particularly valuable for detecting anomalies in encrypted traffic.

Isolation Forest

This algorithm is effective for identifying outliers in encrypted traffic:

It isolates anomalies by repeatedly selecting a random feature and a random split value; anomalous flows tend to be separated from the rest in fewer splits.
It is computationally efficient and works well with high-dimensional data, making it suitable for encrypted traffic analysis.

One-Class SVM

One-Class SVM is used for novelty detection in encrypted traffic:

It learns a decision boundary around normal encrypted traffic patterns.
Any traffic falling outside this boundary is flagged as potentially anomalous.
This method is particularly useful when the majority of the training data represents normal encrypted traffic.

Applications and Case Studies

Researchers have successfully applied unsupervised learning methods to various encrypted traffic analysis tasks:

1. Protocol Identification

Clustering algorithms have been used to group encrypted traffic flows based on their behavioral characteristics, enabling the identification of different protocols without decryption.

2. Malware Detection

Autoencoders and anomaly detection techniques have been employed to identify malicious encrypted traffic by learning the normal behavior of encrypted communications and flagging deviations.

3. User Behavior Analysis

Unsupervised learning methods have been used to profile user behavior in encrypted traffic, helping to detect account compromises or insider threats.

4. Network Performance Optimization

By clustering encrypted traffic flows, network administrators can identify patterns and optimize network resources without compromising user privacy.

Challenges and Considerations

While unsupervised learning methods offer significant advantages for encrypted traffic analysis, there are some challenges to consider:

1. Interpretability

The results of unsupervised learning can sometimes be difficult to interpret, especially in the context of encrypted traffic where the ground truth is not always available. Researchers are working on developing more explainable models to address this issue.

2. Feature Selection

Choosing the right features for analysis is crucial, as encrypted traffic limits the available information. Researchers must carefully select and engineer features that capture relevant behavioral patterns without compromising encryption. Common features include packet sizes, inter-arrival times, and flow duration statistics.

3. Evolving Traffic Patterns

Encrypted traffic patterns can change over time due to new protocols or applications. Unsupervised learning methods need to be adaptable to these changes. Some researchers are exploring online learning techniques to address this challenge.

4. Privacy Concerns

Even though the payload is encrypted, there are still privacy considerations when analyzing metadata and traffic patterns.
Researchers must ensure that their analysis methods respect user privacy and comply with regulations. Techniques such as differential privacy are being explored to enhance privacy protection in traffic analysis.

5. Scalability

As network speeds increase and the volume of encrypted traffic grows, unsupervised learning methods must be optimized for real-time analysis. Distributed and streaming algorithms are being developed to address this challenge.

Conclusion

By leveraging these unsupervised learning techniques, researchers and network administrators can gain valuable insights into encrypted traffic patterns, detect anomalies, and improve network security without the need for decryption or access to payload data. As encryption becomes more prevalent, these methods will play an increasingly important role in maintaining network security while preserving user privacy.
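The features named under Feature Selection (packet sizes, inter-arrival times, and flow duration) can all be derived from packet metadata alone. The sketch below shows one possible way to compute them for a single flow; the input format, a list of (timestamp, size) pairs, and the feature names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarize one flow given (timestamp_seconds, size_bytes) pairs
    in arrival order. No payload is inspected."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival times: gaps between consecutive packets
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "duration": times[-1] - times[0],
    }


# Hypothetical flow: four packets observed over 0.4 seconds
flow = [(0.00, 120), (0.05, 1400), (0.10, 1400), (0.40, 80)]
features = flow_features(flow)
print(features)
```

Feature vectors like this one are what the clustering and anomaly detection methods above would consume, typically after scaling each feature to a comparable range.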

Data Processing With Python: Choosing Between MPI and Spark

By Anil Kumar Moka

Understanding Neural Networks

By Akash Lomas

Getting Started With Snowflake Snowpark ML: A Step-by-Step Guide

By Sevinthi Kali Sankar Nagarajan

Kubernetes Deployments With DMZ Clusters: An Essential Guide

As organizations increasingly adopt Kubernetes for managing microservices and containerized workloads, securing these deployments becomes paramount. A Demilitarized Zone (DMZ) cluster, a proven security architecture that isolates public-facing services from sensitive internal resources, helps protect against external threats. In this article, we'll explore the concept of DMZ clusters in Kubernetes, their importance, and how to implement these security measures effectively.

What Is a DMZ Cluster in Kubernetes?

A DMZ is a network boundary that exposes specific services to external traffic while safeguarding the internal network. In Kubernetes, this architecture is implemented by creating separate clusters for public-facing applications and internal workloads, ensuring limited and tightly controlled communication between them.

Key Features of a DMZ Cluster

Isolation: Public-facing services are isolated in the DMZ cluster, preventing direct access to the internal network.
Controlled Access: Secure communication is established between the DMZ and internal clusters using firewalls, service meshes, or ingress rules.
Scalability: DMZ clusters can scale independently of internal resources, ensuring high availability for public-facing workloads.

Why Use a DMZ Cluster?

Modern applications often require exposing APIs, websites, or services to external users. However, exposing these directly from the internal cluster introduces significant security risks. DMZ clusters address these challenges by:

Minimizing attack surface: Public-facing services are isolated from sensitive workloads.
Improving security posture: Network policies and firewalls restrict unauthorized access.
Simplifying compliance: Regulatory requirements often mandate segregating external and internal services.
Key Components of a Kubernetes DMZ Cluster

Ingress Controller: Handles external traffic and routes it to appropriate services in the DMZ cluster (e.g., NGINX or Traefik).
Network Policies: Restrict communication between DMZ and internal clusters.
Firewall Rules: Block unauthorized traffic between external users and internal networks.
Service Mesh: Tools like Istio or Linkerd provide secure and observable service-to-service communication.
Monitoring and Logging: Tools like Prometheus and Grafana ensure visibility into cluster activities.

Implementing a DMZ Cluster in Kubernetes

Here's a step-by-step guide to setting up a DMZ cluster in Kubernetes:

Step 1: Plan the Architecture

Design a multi-cluster environment with:

A DMZ cluster for public-facing services.
An internal cluster for private workloads.

Step 2: Deploy the DMZ Cluster

Set up the cluster: Use Kubernetes deployment tools like ClusterAPI or managed Kubernetes services (e.g., GKE, EKS, AKS).
Configure ingress: Deploy an ingress controller to handle traffic.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dmz-ingress
spec:
  rules:
    - host: public-service.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: public-service
                port:
                  number: 80
```

Step 3: Enforce Network Policies

Restrict traffic reaching the DMZ workloads. The policy below admits only TCP port 80 to the public-service pods. Because a NetworkPolicy applies within a single cluster, traffic between the DMZ and internal clusters must additionally be limited by firewall rules or egress policies:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: limit-dmz-access
  namespace: dmz
spec:
  podSelector:
    matchLabels:
      app: public-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 80
```

Step 4: Secure Communication With Service Mesh

Deploy a service mesh like Istio to secure traffic between DMZ and internal clusters:

Encrypt all communications using mutual TLS (mTLS).
Define traffic policies to restrict access.

Step 5: Monitor and Audit

Use tools like Prometheus and Grafana to track traffic patterns.
Log cluster activity using the ELK stack (Elasticsearch, Logstash, Kibana).
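The ingress policy in Step 3 controls what may reach the DMZ pods; the opposite direction, what those pods may reach, can be limited with an egress policy. The Python sketch below generates such a manifest as JSON, which kubectl also accepts. The CIDR 10.20.0.0/24 and port 8443 are hypothetical placeholders for an internal API endpoint, and enforcement assumes the cluster's network plugin supports NetworkPolicy.

```python
import json


def dmz_egress_policy(namespace, app_label, internal_cidr, port):
    """Build a NetworkPolicy that lets the selected pods reach
    only one CIDR block on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{app_label}-egress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app_label}},
            # Declaring Egress makes all other outbound traffic denied
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": internal_cidr}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical internal endpoint; replace with the real address range
policy = dmz_egress_policy("dmz", "public-service", "10.20.0.0/24", 8443)
print(json.dumps(policy, indent=2))
```

Note that a policy this strict also blocks DNS lookups from the selected pods, so a real deployment would typically add an egress rule for the cluster DNS service.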
Best Practices for DMZ Clusters

Least Privilege Access: Grant minimum permissions between DMZ and internal clusters.
Zero-Trust Architecture: Continuously authenticate and validate all traffic.
Regular Audits: Periodically review firewall rules, ingress policies, and service configurations.
Resilience Testing: Perform chaos engineering experiments (e.g., using LitmusChaos) to validate system robustness.

Conclusion

DMZ clusters in Kubernetes are essential for securing public-facing applications while protecting internal resources. Organizations can create a secure and scalable infrastructure by isolating workloads, enforcing strict access controls, and leveraging tools like service meshes and network policies. Implementing a DMZ cluster might seem complex, but with the proper planning and tools, your Kubernetes deployments will be secure and high-performing.

Author's Note: Adopt DMZ clusters today to build a more resilient and secure Kubernetes environment!

By Sai Sandeep Ogety


Efficiently Processing Billions of Rows Daily With Presto

In a world where companies rely heavily on data for insights about their performance, potential issues, and areas for improvement, comprehensive logging is crucial, but it comes at a cost. If not stored properly, log data becomes cumbersome to maintain, slow to query, and expensive overall. Logging detailed user activities such as time spent on various apps, the interface on which users are active, navigation paths, app start-up times, crash reports, and country of login can be vital to understanding user behavior, but we can easily end up with billions of rows of data, which quickly becomes a problem if scalable solutions are not in place at the time of logging.

In this article, we will discuss how to store data efficiently in an HDFS system and use some of Presto's functionality to query massive datasets with ease, drastically reducing compute costs in data pipelines.

Partitioning

Partitioning is a technique in which logically related data is grouped and stored together, making retrieval quicker. For example, consider an app like YouTube. It would be useful to group data belonging to the same date and country into one file, which results in multiple smaller files that are easier to scan. Just by looking at the metadata, Presto can work out which files need to be scanned for the query the user provides.

Internally, a folder called youtube_user_data is created, and within it one subfolder per partition by date and country (e.g., date=2023-10-01/country=US). If the app has launched in 2 countries and has been active for 2 days, the number of files generated is 2×2 = 4 (the Cartesian product of the unique values in the partition columns). Hence, choosing columns with low cardinality is essential. For example, adding interface as another partition column, with three possible values (ios, android, desktop), increases the number of files to 2×2×3 = 12.
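The partition arithmetic above can be checked with a few lines of Python. This sketch only enumerates the partition folder names as a Cartesian product; the second date and the country CA are assumed values chosen to match the two-day, two-country example, and nothing here touches HDFS or Presto.

```python
from itertools import product

# Unique values per partition column (assumed sample values)
partition_values = {
    "date": ["2023-10-01", "2023-10-02"],
    "country": ["US", "CA"],
}


def partition_paths(table, values):
    """List one folder path per combination of partition values."""
    keys = list(values)
    return [
        table + "/" + "/".join(f"{k}={v}" for k, v in zip(keys, combo))
        for combo in product(*(values[k] for k in keys))
    ]


paths = partition_paths("youtube_user_data", partition_values)
print(len(paths))  # 4
for path in paths:
    print(path)

# Adding a third partition column multiplies the count again
partition_values["interface"] = ["ios", "android", "desktop"]
print(len(partition_paths("youtube_user_data", partition_values)))  # 12
```

Because the count is a product, a single high-cardinality column (user_id, for instance) would multiply the number of folders by millions, which is why such keys are better handled by bucketing than by partitioning.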
Based on the partitioning strategy described, the data would be stored in a directory structure with one subfolder per partition.

(Figure: directory tree under youtube_user_data, e.g., date=2023-10-01/country=US.)

Below is an example query showing how to create a table with login_date and country as partition columns:

```sql
CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    -- …
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60
);
```

Ad Hoc Querying

When querying a partitioned table, specifying only the needed partitions can greatly reduce your query wall time.

```sql
SELECT
    SUM(1) AS total_users_above_30
FROM youtube_user_data
WHERE
    login_date = '2023-10-01'
    AND country = 'US'
    AND age > 30
```

By specifying the partition columns as filters in the query, Presto jumps directly to the folder for 2023-10-01 and US and retrieves only the file within that folder, skipping the other files completely.

Scheduling Jobs

If the source table is partitioned by country, then setting up daily ETL jobs also becomes easier, as we can now run them in parallel. For example:

```python
# Sample Dataswarm job scheduling that does parallel processing,
# taking advantage of partitions in the source table.
# WaitforOperator, PrestoOperator, input, and output are assumed to be
# provided by the Dataswarm environment.
insert_task = {}
wait_for = {}
for country in ["US", "CA"]:
    # wait for job
    wait_for[country] = WaitforOperator(
        table="youtube_user_data",
        partitions=f"login_date=<DATEID>/country={country}",
    )

    # insert job
    insert_task[country] = PrestoOperator(
        dep_list=[wait_for[country]],
        input_data={
            "in": input.table("youtube_user_data")
            .col("login_date").eq("<DATEID>")
            .col("country").eq(country)
        },
        output_data={
            "out": output.table("output_table_name")
            .col("login_date").eq("<DATEID>")
            .col("country").eq(country)
        },
        select="""
            SELECT
                user_id,
                SUM(1) AS total_count
            FROM <in:youtube_user_data>
            GROUP BY user_id
        """,
    )
```

Note: The above uses Dataswarm as an example for processing/inserting data.
Here, there will be two tasks running in parallel, insert_task[US] and insert_task[CA], which query only the data in their own partitions and load it into a target table that is also partitioned on country and date. Another benefit is that WaitforOperator can be set up to check whether the particular partition of interest has landed rather than waiting for the whole table. If, say, CA data is delayed but US data has landed, we can start the US insert task first and kick off the CA insert job later, once the CA upstream data lands.

(Figure: a simple DAG showing the sequence of events that would be run.)

Bucketing

If frequent GROUP BY and JOIN operations are to be performed on a table, we can further optimize the storage using bucketing. Bucketing organizes data into smaller chunks within a file based on a key column (e.g., user_id), so when querying, Presto knows in which bucket a specific ID is present.

How to Implement Bucketing

Choose a bucketing column: Pick a key column that is commonly used for joins and group bys.
Define buckets: Specify the number of buckets to divide the data into.

```sql
CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    -- …
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60,
    bucket_count = 1024,
    bucketed_by = ARRAY['user_id']
);
```

Note: The bucket count should be a power of 2. In the above example, we chose 1024 (2^10).

Before Bucketing

Data for a partition is stored in a single file, requiring a full scan to locate a specific user_id.

After Bucketing

User IDs are put into smaller buckets based on which range they fall under. (Engines such as Presto typically assign buckets by hashing the key; contiguous ranges are used here to keep the illustration simple.) You'll notice that user IDs are assigned to specific buckets based on their value.
For example, a new user ID of 1567 would be placed in Bucket 1:

Bucket 1: 1000 to 1999
Bucket 2: 2000 to 2999
Bucket 3: 3000 to 3999
Etc.

When performing a join with another table, say, to retrieve user attributes like gender and birthdate for a particular user (e.g., 4592), the lookup is much quicker: Presto knows which bucket (bucket 4) that user falls in, so it can jump directly to that bucket and skip scanning the others. It still needs to search for the user within that bucket. We can speed up that step as well by sorting the data on the key ID within each bucket, which we explore in a later section.

```sql
SELECT
    a.user_id,
    b.gender,
    b.birthdate
FROM youtube_user_data a
JOIN dim_user_info b
    ON a.user_id = b.user_id
WHERE
    a.login_date = '<DATEID>'
    AND a.country = 'US'
    AND b.date = '<DATEID>'
```

Hidden $bucket Column

For bucketed tables, there is a hidden column that lets you specify the buckets you want to read data from. For example, the following query counts over bucket #17 (bucket IDs start from 0).

```sql
SELECT
    SUM(1) AS total_count
FROM youtube_user_data
WHERE
    login_date = '2023-05-01'
    AND "$bucket" = 17
```

The following query counts over roughly 10% of the data for a table with 1024 buckets:

```sql
SELECT
    SUM(1) AS total_count
FROM youtube_user_data
WHERE
    login_date = '2023-05-01'
    AND "$bucket" BETWEEN 0 AND 100
```

Sorting

To further optimize the buckets, we can sort them while inserting the data so that query speeds improve further: Presto can jump directly to the specific index within a specific bucket within a specific partition to fetch the data needed.

How to Enable Sorting

Choose a sorting column: Typically, this is the same column used for bucketing, such as user_id.
Sort data during insertion: Ensure that data is sorted as it is inserted into each bucket.
```sql
CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    -- …
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60,
    bucket_count = 1024,
    bucketed_by = ARRAY['user_id'],
    sorted_by = ARRAY['user_id']
);
```

In a sorted bucket, the user IDs are stored in order, which makes retrieval efficient. This becomes very handy when we have to join large tables or perform aggregations across billions of rows of data.

Conclusion

Partitioning: For large datasets, partition the table on low-cardinality columns like date, country, and interface, which results in smaller HDFS files. Presto can then query only the needed files by looking up the metadata/file name.
Bucketing and sorting: If a table is used frequently in joins or group bys, it is beneficial to bucket and sort the data within each partition, further reducing key lookup time.
Caveat: There is an initial compute cost for bucketing and sorting, as Presto has to maintain the order of the key while inserting. However, this one-time cost can be justified by savings in repeated downstream queries.
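To see why bucketing and sorting shorten a key lookup, consider the following in-memory sketch. It is only an analogy for what a query engine does on disk: the modulo function stands in for the engine's hash function, the bucket count of 8 is chosen for readability (the table above uses 1024), and the user IDs are made up.

```python
from bisect import bisect_left

BUCKET_COUNT = 8  # small for illustration; assumed, not the article's 1024


def bucket_for(user_id):
    """Stand-in for the engine's hash: modulo on the integer key."""
    return user_id % BUCKET_COUNT


def build_buckets(user_ids):
    """Distribute keys into buckets, then sort each bucket on the key."""
    buckets = [[] for _ in range(BUCKET_COUNT)]
    for uid in user_ids:
        buckets[bucket_for(uid)].append(uid)
    for bucket in buckets:
        bucket.sort()  # plays the role of sorted_by
    return buckets


def contains(buckets, user_id):
    """Look up a key by visiting one bucket and binary-searching it."""
    bucket = buckets[bucket_for(user_id)]  # every other bucket is skipped
    i = bisect_left(bucket, user_id)       # O(log n) inside the bucket
    return i < len(bucket) and bucket[i] == user_id


buckets = build_buckets([1567, 4592, 2001, 3999, 1000, 4600])
print(bucket_for(4592))
print(contains(buckets, 4592))
print(contains(buckets, 4593))
```

Bucketing removes most of the data from consideration, and sorting turns the remaining scan into a binary search; the two optimizations compound.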

By Ajay Krishnan Prabhakaran

Delta Live Tables in Databricks: A Guide to Smarter, Faster Data Pipelines

Data pipelines are the main arteries of any organization that operates in the data economy. However, building and maintaining them can be complex, time-consuming, and frustrating for data engineers. Maintaining data quality, managing data processes, and processing data in real time are challenges that can complicate projects and degrade the quality of the resulting information.

Delta Live Tables (DLT) from Databricks takes a different approach. By automating data validation, simplifying pipeline management, and handling real-time processing, DLT lets data engineers design more efficient pipelines with fewer issues. This article introduces DLT and shows how it can make data pipeline management easier and more efficient.

What Are Delta Live Tables?

DLT is a capability in Databricks for creating data pipelines. It allows data engineers to build pipelines with a few lines of SQL or Python, which makes it usable by people with different levels of programming experience.

DLT helps automate most of the routine work associated with data pipelines. It performs data validation checks and manages dependencies, which reduces both the time spent and the probability of errors. In simple terms, DLT helps establish efficient, high-quality pipelines that are less likely to break down and need manual attention.

In the Databricks context, DLT is used in conjunction with other services such as data storage and Delta Lake. While Delta Lake is about data storage and structure, DLT is about making data movement and transformation much simpler, particularly in real time. This combination enables users to work on data from the input stage to the output stage without much difficulty.

Key Benefits of Delta Live Tables

Enhanced Data Quality and Validation

One of the standout features of DLT is its ability to automatically check and enforce data quality.
DLT performs data validation at each step, ensuring only clean, reliable data moves through the pipeline. It can even detect schema changes and handle errors without requiring constant oversight. This built-in quality control reduces the risk of bad data impacting your analytics or machine learning models.

Simplified Pipeline Management

Managing data pipelines can often be complex, with dependencies and tasks that need careful coordination. DLT simplifies this by automatically handling dependencies within the pipeline, making the setup easier and less prone to errors. Data engineers can use a straightforward, declarative syntax to define pipelines in SQL or Python, which makes them more accessible to teams with varied skill sets. This approach allows for faster setup and easier maintenance, reducing the overall workload.

Real-Time Data Processing

DLT supports both batch and real-time (streaming) data, giving data engineers flexibility depending on their needs. With DLT's real-time processing capabilities, users can gain immediate insights, which is especially valuable for applications like fraud detection, customer personalization, or any scenario requiring instant data updates. This ability to handle data instantly makes Delta Live Tables a strong choice for companies looking to move from batch to real-time analytics.

Use Cases and Examples

DLT offers solutions for a range of real-world data challenges across different industries. Here are a few practical ways DLT can be applied:

Banking Fraud Detection

Banks and financial institutions need fast, accurate, and cost-effective means of identifying fraud. With DLT, banks can process transaction data in real time and identify suspicious patterns as they occur. This allows fraud to be prevented more often, protecting customers and minimizing losses.
Customer Personalization in Retail

In retail, firms seek to tailor experiences to consumers according to their buying behavior. DLT enables retail organizations to analyze customer behavioral data in real time and deliver relevant recommendations and offers to each customer. Such instant personalization can help increase engagement and sales.

Healthcare Data Processing

Healthcare providers manage massive volumes of patient data, where timely access is vital. DLT supports real-time processing of patient records, lab data, and similar sources. This can help speed up diagnosis, improve patient care, and ease the flow of data across healthcare facilities.

Example Configuration

To illustrate how DLT works, here’s a simple example configuration in SQL. This code snippet demonstrates setting up a basic DLT pipeline to validate and clean incoming data:

SQL
CREATE OR REFRESH LIVE TABLE customer_data
AS SELECT customer_id, name, age, purchase_history
FROM streaming_data_source
WHERE age IS NOT NULL AND purchase_history IS NOT NULL;

In this example, we create a table called customer_data that pulls from a live data source, filtering out records where age or purchase_history is missing. This is just a basic use case, but it highlights how DLT can help automate data cleaning and validation, ensuring only quality data flows through the pipeline.

These use cases and examples show the versatility of DLT, making it useful for any organization that needs real-time, reliable data insights.

Future Implications of Delta Live Tables

As data demands grow, DLT could transform how organizations manage and use data. In the future, DLT may integrate more closely with machine learning workflows, enabling faster data preparation for complex models. This could streamline processes for AI-driven projects.

DLT’s impact on real-time analytics will also expand.
With businesses increasingly dependent on immediate data, DLT could play a key role in sectors like IoT, where constant, live data streams drive automation. This would make industries like manufacturing and logistics more efficient and responsive.

Lastly, DLT could make data workflows accessible to a broader range of users. By simplifying pipeline creation, DLT may allow data analysts and business teams to manage their own data workflows. This shift could foster a more data-driven culture, where more teams can leverage insights without relying on engineering support.

Challenges and Considerations

While DLT offers many benefits, there are some challenges to consider. There may be an initial learning curve for new users, especially those unfamiliar with Databricks or declarative pipeline design. Adapting to DLT’s setup may require some training or practice.

Cost is another factor. Real-time processing and continuous monitoring in DLT can increase operational expenses, especially for organizations managing large data volumes. Teams should evaluate their budget and choose processing options carefully to control costs.

Data governance and security are also important considerations. Because DLT pipelines may process sensitive data in real time, organizations handling such data may be subject to data protection laws like GDPR or HIPAA. Strong security measures should be a priority to ensure data security and procedural compliance.

Final Words

Delta Live Tables (DLT) simplifies data pipeline management, enhancing data quality, real-time processing, and overall workflow efficiency. By automating complex tasks and supporting scalable, reliable data operations, DLT helps organizations make faster, data-driven decisions with confidence. As data demands increase, tools like DLT are essential for building flexible, future-ready data systems. For those looking to explore more, understanding how DLT integrates with other Databricks features could be a valuable next step.
Frequently Asked Questions

Here are answers to some questions you may have about Delta Live Tables:

What types of tables does Delta Live Tables support?
It supports streaming tables, materialized views, and views, each suited for different processing needs like continuous data ingestion or pre-computed results.

How does Delta Live Tables enhance data quality?
DLT allows users to set data quality rules called “expectations” to filter or flag problematic data, helping ensure accuracy throughout the pipeline.

Can Delta Live Tables process both batch and streaming data?
Yes, DLT handles both types, allowing for either scheduled batch updates or continuous real-time processing based on needs.

What is Delta Live Tables' relationship with Delta Lake?
DLT builds on Delta Lake’s storage, adding features like pipeline automation and data validation to streamline workflows.

How can I monitor Delta Live Tables pipelines?
DLT includes monitoring tools to track pipeline health, detect errors, and review processing times in real time.
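The “expectations” mentioned in the FAQ above (rules that either filter or flag problematic records) can be illustrated outside Databricks. The following is a minimal plain-Python sketch, not the DLT API: the `Expectation` class, the `apply_expectations` function, the rule names, the plausibility range, and the sample records are all hypothetical. It reuses the two null checks from the SQL example configuration and adds one flag-only rule to show the difference between dropping and flagging.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    """A named data quality rule (hypothetical stand-in, not a DLT class)."""
    name: str
    predicate: Callable[[dict], bool]  # returns True when the record passes
    drop_on_fail: bool                 # True: filter the record; False: only flag it


def apply_expectations(records, expectations):
    """Return (kept_records, violation_counts) after applying every rule."""
    kept = []
    violations = {e.name: 0 for e in expectations}
    for record in records:
        drop = False
        for e in expectations:
            if not e.predicate(record):
                violations[e.name] += 1
                drop = drop or e.drop_on_fail
        if not drop:
            kept.append(record)
    return kept, violations


# Rules: the two null checks mirror the WHERE clause of the SQL example;
# the plausibility range is an assumption added for illustration.
EXPECTATIONS = [
    Expectation("age_present", lambda r: r.get("age") is not None, True),
    Expectation("history_present", lambda r: r.get("purchase_history") is not None, True),
    Expectation("age_plausible", lambda r: r.get("age") is None or 0 <= r["age"] <= 120, False),
]

# Made-up sample records, for illustration only.
SAMPLE = [
    {"customer_id": 1, "name": "Ada", "age": 36, "purchase_history": ["book"]},
    {"customer_id": 2, "name": "Bo", "age": None, "purchase_history": ["pen"]},
    {"customer_id": 3, "name": "Cy", "age": 150, "purchase_history": ["ink"]},
    {"customer_id": 4, "name": "Di", "age": 41, "purchase_history": None},
]

clean, counts = apply_expectations(SAMPLE, EXPECTATIONS)
print([r["customer_id"] for r in clean])
print(counts)
```

In this sketch, records 2 and 4 are dropped because a required field is missing, while record 3 is kept but counted as a violation of the flag-only rule. Keeping per-rule counts is one way to make data quality visible without silently discarding everything that looks unusual.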

By Kiran Polimetla

Secure Data Stack: Navigating Adoption Challenges of Data Encryption

Encryption is one of the most effective data security strategies, alongside access control, software updates, and network segmentation. However, adding encryption to an existing tech stack can be challenging. Data pipelines aren’t typically designed to handle encryption and decryption, leading to additional migration work, operational complexity, and higher compute costs.

Data encryption also presents challenges of its own. While at-rest and in-transit encryption are common, they often fall short of preventing data breaches within the data store itself. In TLS, AES and RSA are effective for securing data during transmission, but applying the same methods to long-term data storage introduces unique challenges. Currently, there is no clear strategy for maintaining long-lived encryption keys. Managing keys and performing key rotations can be error-prone and costly. The high computational cost of processing encrypted data further complicates maintaining secure encryption over time.

This highlights the need for data-specific security strategies that go beyond encryption in transit or at rest and protect data throughout its entire lifecycle.

Challenges of Better Data Security

Technical Complexity

Data pipelines are often built without the flexibility to handle encryption and decryption processes. For example, a pipeline designed to compute feature values from event data may need to be re-engineered to include encryption and decryption layers, adding cost and complexity to the implementation.

Adding data encryption can also require updates to APIs, databases, and storage systems to support encrypted data formats. This involves updating the codebase, rewriting queries to handle encrypted fields, and ensuring that encryption keys are properly managed and accessible to services and compute environments.

Performance

Encrypting and decrypting data in real time introduces higher latency.
In high-throughput use cases, where large volumes of data are processed, the extra time needed to encrypt or decrypt each piece can slow down data processing. Encryption and decryption involve mathematical operations that require additional computational resources. When handling large datasets, this can be both costly and time-consuming, potentially deterring organizations from adopting encryption even though it is the better data security practice.

Key Management

Currently, there’s no unified key management system specifically for data encryption. Common use cases like TLS are designed for short-lived keys, and long-lived key management lacks standardized solutions.

Like any security component, key management needs visibility: understanding when and how a key was created, when and where it has been accessed, and how long it has been in use before being rotated. Without this visibility, maintaining secure and effective data encryption practices can be risky and error-prone.

The complexity increases when different teams within a company, or across partner organizations, need to securely share or manage keys. Gaps in key management practices can lead to security breaches, misconfigurations, and unauthorized access.

Cost

Adding encryption to an existing tech stack can involve non-trivial development work. The additional encryption and decryption processes require more computational resources, which can lead to higher operational and cloud infrastructure costs.

Like many security practices, the cost of data encryption must be balanced against the potential costs of a data breach, which can be enormous: fines for non-compliance with regulations, legal fees, incident response and remediation expenses, and reputational damage to the brand.
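The key management visibility described above (when a key was created, when and where it was accessed, and how long it has been in use) can be pictured as a small metadata record kept alongside each key. The following is a minimal sketch under stated assumptions, not a key management system: the `KeyRecord` class, its field names, the key and service names, and the 90-day rotation threshold are hypothetical, and no key material is stored or handled here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class KeyRecord:
    """Audit metadata for one long-lived key (hypothetical structure)."""
    key_id: str
    created_at: datetime
    created_by: str
    access_log: list = field(default_factory=list)  # entries are (when, accessor)

    def record_access(self, accessor: str, when: datetime) -> None:
        """Remember which service touched the key and when."""
        self.access_log.append((when, accessor))

    def age(self, now: datetime) -> timedelta:
        """How long the key has been in use."""
        return now - self.created_at

    def needs_rotation(self, now: datetime, max_age: timedelta) -> bool:
        """True once the key has been in use for at least max_age."""
        return self.age(now) >= max_age

    def last_access(self):
        """Timestamp of the most recent access, or None if never accessed."""
        return max((when for when, _ in self.access_log), default=None)


# Fixed timestamps keep the example deterministic; all values are illustrative.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 4, 15, tzinfo=timezone.utc)

key = KeyRecord("orders-field-key-v1", created, created_by="platform-team")
key.record_access("feature-pipeline", datetime(2024, 2, 1, tzinfo=timezone.utc))
key.record_access("reporting-job", datetime(2024, 3, 10, tzinfo=timezone.utc))

# The 90-day threshold is an assumed policy, not a recommendation from the article.
print(key.age(now).days, key.needs_rotation(now, timedelta(days=90)))
```

Even a record this small answers the three visibility questions, which is why sharing keys across teams or partner organizations without such metadata tends to be error-prone.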
Expertise Gaps

Existing data security practices are increasingly insufficient to defend against the growing frequency and scale of data breaches. As the technology evolves, data and software engineers will need to build the corresponding skills and knowledge to implement data security effectively. This requires a commitment to continuous learning and adaptation from companies and organizations.

Data Security Strategy: A Gradual Approach

Technology is ever-evolving, and so is software implementation. A complete rewrite is not only costly but can also be detrimental to businesses. A less risky and more common approach is to identify high-impact, low-interruption service areas in which to roll out the change. By starting small, organizations can try out new technologies and processes, minimizing risks and disruptions to the business while gradually scaling up.

High Impact Areas

Field-Level Encryption

Field-level encryption targets specific sensitive information within a data object. Whole message or full file encryption, by contrast, encrypts entire data objects in one shot, making the entire content inaccessible without the proper decryption key.

For real-time or data processing pipelines, field-level encryption allows the system to continue operating while engineering teams gradually update dependencies or downstream steps. In complex systems, a hard, full cutover can be challenging and risky. Field-level encryption lowers the adoption barrier by avoiding the need for an immediate, full migration, allowing a more manageable, phased approach.

One advantage of starting with field-level encryption is that it opens the door to techniques such as data masking, redaction, deterministic encryption, and privacy-preserving encryption. These solutions can be implemented based on the specific security needs and the existing tech stack.
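To make the field-level idea concrete, here is a minimal sketch that protects only the sensitive fields of a record and leaves everything else readable, so downstream steps that do not need those fields keep working. It is an illustration under stated assumptions, not an encryption implementation: it uses masking plus a keyed HMAC token from the Python standard library, which is deterministic but one-way (the original value cannot be recovered), so it stands in for the masking and deterministic techniques mentioned above rather than for reversible encryption. The field names, the `protect_record` helper, and the demo key are hypothetical; a real deployment would use an authenticated cipher from a vetted library and a managed key.

```python
import hashlib
import hmac

# Which fields count as sensitive is an assumption made for this example.
SENSITIVE_FIELDS = ("email", "card_number")


def deterministic_token(value: str, key: bytes) -> str:
    """Keyed, one-way token: equal inputs give equal tokens under the same key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def protect_record(record: dict, key: bytes) -> dict:
    """Return a copy with sensitive fields replaced; other fields are untouched."""
    protected = dict(record)
    for name in SENSITIVE_FIELDS:
        if name in record:
            value = str(record[name])
            protected[name] = {
                "token": deterministic_token(value, key),
                "masked": mask(value),
            }
    return protected


demo_key = b"demo-key-not-for-production"  # placeholder key, never hard-code real keys
event = {
    "user_id": 42,
    "email": "ada@example.com",
    "card_number": "4111111111111111",
    "amount": 19.99,
}

protected_event = protect_record(event, demo_key)
print(protected_event["card_number"]["masked"])
print(protected_event["user_id"], protected_event["amount"])
```

Because the token is deterministic, two events from the same customer can still be joined or grouped on the protected field, which is the property that makes a phased rollout workable: consumers that only need equality comparisons do not have to change at the same time as the producer.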
Whole File Encryption in Data Security Vulnerability Remediation

While field-level encryption is useful in more complex real-time systems and data pipelines, whole file encryption can be very effective for data security vulnerability remediation. Encrypting files that contain sensitive information makes it much easier for security teams to handle vulnerabilities on their own. Instead of having to consult with engineering or legal teams on whether potentially exposed data can be deleted, security teams can simply encrypt log files or data archives that contain sensitive information. This simplifies remediation and ensures that sensitive information is both protected, reducing the risk of unauthorized access, and recoverable when needed.

Summary

While data encryption is effective for protecting sensitive information, it can be costly to implement. Taking a gradual approach, starting with high-impact areas like field-level and whole file encryption, makes adoption more manageable. As technology and security threats evolve, adopting new security strategies is essential. With the right migration strategy, you can keep your tech stack up to date without undue risk of disrupting your day-to-day business operations.

By Lulu Cheng

Digitalization of Airport and Airlines With IoT and Data Streaming Using Kafka and Flink

The digitalization of airports faces challenges such as integrating diverse legacy systems, ensuring cybersecurity, and managing the vast amounts of data generated in real time. The vision for a digitalized airport includes seamless passenger experiences, optimized operations, consistent integration with airlines and retail stores, and enhanced security through the use of advanced technologies like IoT, AI, and real-time data analytics. This blog post shows the relevance of data streaming with Apache Kafka and Flink in the aviation industry to enable data-driven business process automation and innovation while modernizing the IT infrastructure with a cloud-native hybrid cloud architecture. Schiphol Group, which operates Amsterdam Airport, provides a few real-world deployments.

The Digitalization of Airports and the Aviation Industry

Digitalization transforms airport operations and improves the experience of employees and passengers. It affects various aspects of airport operations, passenger experiences, and overall efficiency.

Schiphol Group is a Dutch company that owns and operates airports in the Netherlands. The company is primarily known for operating Amsterdam Airport Schiphol, one of the busiest and most important airports in Europe. Schiphol Group is involved in a range of activities related to airport management, including aviation and non-aviation services. Schiphol Group describes its journey toward becoming a leading autonomous airport by 2050.

Data Streaming With Kafka and Flink for Airport Operations and Passenger Experience

Data streaming with Apache Kafka and Apache Flink enables airport and aviation systems to process and analyze real-time data from various sources, such as flight information, passenger movements, and baggage tracking, enhancing operational efficiency and passenger experience.
These technologies facilitate predictive maintenance, personalized services, and improved security measures through the continuous flow and immediate, reliable processing of critical data at any scale.

Continuous Stream Processing in an Event-Driven Architecture With Kafka Streams, KSQL, or Flink

Continuous processing of incoming events in real time enables transparency and context-specific decision-making. OpenCore, an IT consultancy in Germany, presented as early as 2018 at Kafka Summit San Francisco how stream processing with technologies like Kafka Streams, KSQL, or Apache Flink serves the real-time needs of an airport.

Think about the technical IoT events ingested from aircraft, gates, retail stores, passenger mobile apps, and many other interfaces, and how continuous correlation of that data in real time enables use cases such as predictive forecasting, planning, and maintenance, plus scenarios like cross-organization loyalty platforms, advertisement, and recommendation engines for improving the customer experience and increasing revenue (source: OpenCore).

Airport Digitalization With Data Streaming Using Apache Kafka and Flink

Real-time data beats slow data. That's true for almost any use case in the aviation industry, including airports, airlines, and other involved organizations. Additionally, data consistency matters across organizations.

Here are key areas where digitalization affects airports. While compiling this list, I realized I wrote about many of these scenarios in the past because other industries already deployed these use cases. Hence, each section includes a reference to another article where data streaming with Kafka and Flink is already applied in this context.

1. Passenger Experience

As a frequent traveler myself, I put this at the beginning of the list.
Examples:

- Self-service kiosks: Check-in, baggage drop, and boarding processes have become faster and more efficient.
- Mobile applications: Passengers can book tickets, receive real-time flight updates, and access boarding passes.
- Biometric systems: Facial recognition and fingerprint scanning expedite security checks and boarding.

The past decade has already significantly improved the passenger experience, but it still needs to get better. And data consistency matters. Today, a flight delay or cancellation is not shared consistently across the customer mobile app, airport screens, and customer service of the airline and airport.

2. Operational Efficiency

Automation with IoT sensors, paperless processes, and software innovation enables more cost-efficient and reliable airport operations.

Examples:

- Automated baggage handling: RFID tags and automated systems track and manage luggage, reducing errors and lost baggage.
- Predictive maintenance: IoT sensors and data analytics predict equipment failures before they occur, ensuring smoother operations.
- Air traffic management: Advanced software systems enhance the coordination and efficiency of air traffic control.

3. Security, Safety and Health Enhancements

Safety and health are among the most important aspects of any airport. Airports have continuously improved security, monitoring, and surveillance in response to terrorist attacks, the COVID-19 pandemic, and many other dangerous scenarios.

Examples:

- Advanced screening technologies: AI-powered systems and improved scanning technologies detect threats more effectively.
- Cybersecurity: Protecting sensitive data and systems from cyber threats is crucial, requiring robust digital security measures.
- Health monitoring: Temperature measurements and people tracking were introduced during the COVID-19 pandemic in many airports.

4. Sustainability and Energy Management

Sustainability and energy management in airports involve optimizing energy use and reducing environmental impact through efficient resource management and eco-friendly technologies.

Examples:

- Smart lighting and HVAC systems: Automated systems reduce energy consumption and enhance sustainability.
- Data analytics: Monitoring and optimizing resource usage helps reduce the carbon footprint of airports.

Sustainability and energy management in an airport can be significantly enhanced by using Apache Kafka and Apache Flink to stream and analyze real-time data from smart meters and HVAC systems, optimizing energy consumption and reducing environmental impact.

5. Customer Service and Communication

Customer service is crucial for each airport. While lots of information comes from airlines (like delays, cancellations, seat changes, etc.), the airport provides the critical communication backend with displays, lounges, service personnel, and so on.

Examples to improve the customer experience:

- AI chatbots: Provide 24/7 customer support for inquiries and assistance with Generative AI (GenAI) embedded into the existing business processes.
- Digital signage: Real-time updates on flight information, gate changes, and other announcements improve communication.
- Loyalty integration: Airports do not provide a loyalty platform, but they integrate more and more with airlines (e.g., to reward miles for shopping).

6. Revenue Management

Airport revenue management involves optimizing income from aviation and non-aviation sources through demand forecasting and strategic resource allocation.

Examples:

- Dynamic pricing: Algorithms adjust prices for parking, retail spaces, and other services based on demand and other factors.
- Personalized marketing: Data analytics help target passengers with tailored offers and promotions.

7. Emergency Response and Safety

Emergency response and safety at the airport involve coordinating real-time monitoring, quick decision-making, and efficient resource deployment to ensure the safety and security of passengers, staff, and infrastructure during emergencies.

Examples:

- Real-time monitoring: IoT devices and sensors provide live data on airport conditions, aiding in faster response times.
- Digital simulation and training: Virtual reality and simulation technologies enhance training for emergency scenarios.
- Seamless connectivity: Stable Wi-Fi and 5G networks with good latency and network slicing support safety-critical use cases.

Data Sharing With Kafka between Airport, Airlines, and other B2B Partners like Retail Stores

Cross-organization data sharing is crucial for any airport and airline. Today, most integrations are implemented with APIs (usually HTTP/REST) or even still with file-based systems. This works well for some use cases. But data streaming is, by nature, well suited to sharing streaming data like transactions, sensor data, and location-based services in real time between organizations.

As Apache Kafka is the de facto standard for data streaming, many companies directly replicate data to partners using the Kafka protocol. AsyncAPI as an open standard (beyond Kafka) and integration via HTTP on top of Kafka (via Kafka Connect API connectors) are other common patterns.

Real-World Success Stories for Data Streaming in the Aviation Industry

Several real-world success stories exist for deployments of data streaming with Apache Kafka and Flink in airports and airlines. Let's explore a few case studies and refer to further material.

Schiphol Group (Amsterdam Airport)

Roel Donker and Christiaan Hoogendoorn from Schiphol Group presented at the Data in Motion Tour 2024 in Utrecht, Netherlands.
This was an excellent presentation with various data streaming use cases across fields like application integration, data analytics, the Internet of Things, and artificial intelligence.

On Schiphol's journey to an autonomous airport by 2050, digitalization involves many technologies and software/cloud services. Schiphol Group transitioned from open-source Apache Kafka to Confluent Cloud for cost-efficiency, elasticity, and multi-tenancy.

The company runs operational and analytical data streaming workloads with different SLAs. The integration team uses the data streaming platform to integrate with both legacy and new systems, as well as third parties such as airlines, GDS, and police (all point-to-point and with different interfaces).

Here are a few examples of the scenarios Schiphol Group explored:

Schiphol Group: Data Platform with Apache Kafka

Schiphol uses Apache Kafka as a core integration platform. The various use cases require different Kafka clusters depending on the uptime SLA, scalability, security, and latency requirements. Confluent Cloud fully manages the data streaming platform, including connectors to various data sources and sinks (source: Schiphol Group).

Kafka connects critical PostgreSQL databases, the Databricks analytics platform, applications running in containers on Red Hat OpenShift, and others.

3Scale is used as a complementary API gateway for request-response communication. That is not a surprise; it is very common. HTTP/REST APIs and Apache Kafka complement each other. API management solutions such as 3Scale, MuleSoft, Apigee, or Kong connect to Kafka via HTTP or other interfaces.

Schiphol Group: IoT With Apache Kafka

Some use cases at Schiphol Group require connectivity and processing of IoT data.
That's not really a big surprise in the aviation industry, where airports and airlines rely on data-driven business processes (source: Schiphol Group).

Kafka Connect and stream processing connect and combine IoT data and feed relevant context into other IT applications. Connectivity covers various infrastructures and networks, including:

- Private LoRa networks
- Passenger flow management system (FMS)
- BLIP (the supplier delivering IoT devices in the terminal that measure in real time how crowded areas are, so people can be redirected when needed)
- Wi-Fi location services (like heatmaps for crowd management)

Schiphol Group: AI and Machine Learning With Apache Kafka

Artificial intelligence (AI) requires various technologies and concepts to add business value. Predictive analytics, active learning, batch model training, debugging and testing the entire pipeline, and many other challenges need to be solved. Apache Kafka is the data fabric of many AI/ML infrastructures.

Kafka provides the foundation of an event-driven AI architecture at Schiphol Group (source: Schiphol Group). The combination of Apache Kafka and AI/ML technologies enables various valuable use cases at Schiphol Group, including:

- Analysis of historical data (root cause analysis, critical path and process analysis, reporting)
- Insights on real-time data (insight on turnaround process with one shared truth, real-time insight on ramp capacity and turnaround progress per ramp, real-time insight on ramp safety, input for E2E insight Airside)
- Predictions (input for dynamic gate management, input for autonomous vehicles, input for predicting delays)

Lufthansa, Southwest, Cathay Pacific, and Many Other Airlines

I met plenty of airlines that already use data streaming in production for different scenarios.
Fortunately, a few of these airlines were happy to share their stories with the public:

- Southwest Airlines (Data in Motion Tour 2024 in Dallas): Single pane of glass with the ability to view all flight operations and sync their three key schedules: aircraft, passengers, and workforce.
- Cathay Pacific (Data in Motion Tour 2024 in Singapore): Rebranded to Cathay because of transitioning from focusing on passenger transport to adding cargo and lifestyle/shopping experiences.
- Lufthansa (Webinar 2023): Operations steering, IT modernization (from MQ and ESB to Confluent), and real-time analytics with AI/ML.

The Lufthansa success story is available in its own blog post (including a video recording). For even more examples, including Singapore Airlines, Air France, and Amadeus, check out the overview article "Apache Kafka in the Airline, Aviation and Travel Industry".

Apache Kafka and Flink as Data Fabric for Operational and Analytical Airport Use Cases

Schiphol Group's vision of an autonomous Amsterdam Airport in 2050 shows where the aviation industry is going: automated business processes, continuous monitoring and processing of IoT infrastructure, and data-driven decision-making and passenger experiences.

Airports like Amsterdam, similar to airlines such as Lufthansa, Southwest, or Cathay, modernize existing IT infrastructure, transition to hybrid cloud architectures, and innovate with new use cases (often learning from other industries like financial services, retail, or manufacturing).

Data streaming with Apache Kafka and Flink plays a crucial role in this journey. Processing data at any scale to provide consistent, good-quality data in real time enables any downstream application (including batch and API) to build reliable operational and analytical systems.

How do you leverage data streaming with Kafka and Flink in the aviation industry? Let’s connect on LinkedIn and discuss it!

By Kai Wähner

Leveraging Apache Flink Dashboard for Real-Time Data Processing in AWS Apache Flink Managed Service

The Apache Flink Managed Service in AWS, offered through Amazon Kinesis Data Analytics for Apache Flink, allows developers to run Flink-based stream processing applications without the complexities of managing the underlying infrastructure. This fully managed service simplifies the deployment, scaling, and operation of real-time data processing pipelines, enabling users to concentrate on building applications rather than handling cluster setup and maintenance. With seamless integration into AWS services such as Kinesis and S3, it provides automatic scaling, monitoring, and fault tolerance, making it well suited to real-time analytics, event-driven applications, and large-scale data processing in the cloud.

This guide explains how to use the Apache Flink dashboard for monitoring and managing real-time data processing applications within AWS-managed services, ensuring efficient and reliable stream processing.

The Apache Flink Dashboard

The Apache Flink dashboard offers an intuitive interface for managing real-time data services on AWS, enabling developers to monitor, debug, and optimize Flink applications effectively. AWS-managed services like Amazon Kinesis Data Analytics leverage the dashboard’s insights into job statuses, task performance, and resource usage, assisting developers in monitoring live data streams and assessing job health through metrics such as throughput, latency, and error rates.

The Flink dashboard facilitates real-time debugging and troubleshooting by providing access to logs and task execution metrics. This visibility is essential for identifying performance bottlenecks and errors, ensuring high availability and low latency for AWS-managed real-time data processing services. Overall, the dashboard gives users the transparency they need to maintain the health and efficiency of these services.
Accessing the Apache Flink Dashboard

To begin analyzing Flink applications, access the Apache Flink dashboard, which provides real-time insights into job performance and health.

Code Example

Consider the following code snippet, in which an Apache Flink application processes streaming data from Amazon Kinesis using Flink’s data stream API:

Java
// Source: consume string records from the Kinesis input stream
DataStream<String> dataStream = env.addSource(
    new FlinkKinesisConsumer<>(
        INPUT_STREAM_NAME,
        new SimpleStringSchema(),
        setupInputStreamProperties(streamRole, streamRegion)));

// Map each record to a tuple, then aggregate over 5-minute tumbling windows
SingleOutputStreamOperator<ArrayList<TreeMap<String, TreeMap<String, Integer>>>> result =
    dataStream
        .map(Application::toRequestEventTuple)
        .returns(Types.TUPLE(Types.LIST(Types.STRING), Types.LIST(Types.STRING), Types.LIST(Types.INT)))
        .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .aggregate(new EventObservationAggregator());

// Count errors and faults per region on the aggregated result
REGIONS.forEach(region -> {
    result.flatMap(new CountErrorsForRegion(region)).name("CountErrors(" + region + ")");
    result.flatMap(new CountFaultsForRegion(region)).name("CountFaults(" + region + ")");
});

env.execute("Kinesis Analytics Application Job");

This Apache Flink application processes real-time data from an Amazon Kinesis stream using Flink's data stream API. The execution environment is established, retrieving AWS-specific properties such as the role ARN and region to access the Kinesis stream. The data stream is consumed and deserialized as strings, which are then mapped to tuples for further processing. The application uses 5-minute tumbling windows to aggregate events, applying custom functions to count errors and faults for various AWS regions. The job runs continuously, processing and analyzing real-time data from Kinesis to provide scalable, region-specific error and fault tracking.
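The windowing step is the part of this job that is easiest to misread, so here is a small plain-Python sketch of what a 5-minute tumbling window does to a stream of events. It is not Flink code and makes several assumptions: each event carries its own timestamp in seconds (the Flink job above uses processing time, that is, the clock of the machine running the operator), the event shape `(timestamp, region, kind)` is hypothetical, and the region names are placeholders. The point is only that every event falls into exactly one non-overlapping window, and that errors and faults are then counted per region within each window.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # matches the 5-minute window size used in the job


def window_start(timestamp: int) -> int:
    """Start of the tumbling window that contains the given timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def count_per_window(events):
    """Count errors and faults per (region, kind) inside each tumbling window.

    `events` is an iterable of (timestamp_seconds, region, kind) tuples,
    where kind is "error", "fault", or anything else (ignored).
    """
    windows = defaultdict(Counter)
    for timestamp, region, kind in events:
        if kind in ("error", "fault"):
            windows[window_start(timestamp)][(region, kind)] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Illustrative events only; region names are placeholders.
EVENTS = [
    (0, "region-a", "error"),
    (10, "region-a", "ok"),
    (120, "region-a", "error"),
    (299, "region-b", "fault"),   # last second of the first window
    (300, "region-a", "fault"),   # first second of the second window
    (610, "region-b", "error"),
]

result = count_per_window(EVENTS)
for start, counts in result.items():
    print(start, counts)
```

Events at 299 and 300 seconds land in different windows even though they are one second apart, which is the defining property of tumbling (non-overlapping, fixed-size) windows. In the Flink job, the equivalent grouping is done by `windowAll(...)` before the aggregate function runs, and the per-region counting is done by the separate flatMap operators.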
Summary

- Source: Reads data from a Kinesis stream, using a Flink Kinesis consumer with a specified region and role.
- Transformation: The data is converted into tuples and aggregated in 5-minute windows.
- Counting: Errors and faults are counted for each AWS region.
- Execution: The job runs indefinitely, processing data in real time as it streams from Kinesis.

Job Graph

The job graph in the Flink Dashboard visually represents the execution of an Apache Flink job, highlighting the data processing flow across different regions while tracking errors and faults.

Explanation

- Source: Custom Source -> Map: The initial component is the source, where data is ingested from Amazon Kinesis. The custom source processes data in parallel with two tasks (shown as Parallelism: 2 in the job graph).
- Trigger window (TumblingProcessingTimeWindows): The next step applies a tumbling window with a 5-minute processing time, grouping incoming data into 5-minute intervals for batch-like processing of streaming data. The aggregation function combines data within each window (as represented by AllWindowedStream.aggregate()), with Parallelism: 1 indicating a single task performing this aggregation.
- Regional processing (CountErrors/CountFaults): Following window aggregation, the data is rebalanced and distributed across tasks responsible for processing different regions. Each region has two operators, one counting errors and one counting faults, each operating with Parallelism: 2 so that each region's data is processed concurrently.

Summary

The data flows from a custom source, is mapped and aggregated in 5-minute tumbling windows, and is processed to count errors and faults for different regions. The parallel processing of each region ensures efficient handling of real-time streaming data across regions, as depicted in the job graph.

Operator/Task Data Flow Information

The dashboard provides a quick overview of the data flow within the Flink job, showcasing the processing status and data volume at each step.
It displays information about various operators or tasks in the Flink job. Here’s a breakdown of what the table shows:

- Name: Lists operators or processing steps in the Flink job, such as "Source: Custom Source -> Map," "TriggerWindow," and various "CountErrors" and "CountFaults" operators for different regions.
- Status: Displays the status of tasks. All listed operators are in "RUNNING" status with green labels.
- Bytes Received: Displays the amount of data received by each operator; for example, the "TriggerWindow" operator received 31.6 MB of data.
- Records Received: Indicates the number of records processed by each operator, again with the "TriggerWindow" operator leading (148,302).
- Bytes Sent: Shows the amount of data sent by each operator; for example, the "Source: Custom Source -> Map" operator sent the most (31.6 MB).
- Records Sent: Displays the number of records sent by each operator, with the "Source: Custom Source -> Map" operator also sending the most (148,302).
- Tasks: Indicates the number of parallel tasks for each operator; all operators have a parallelism of 2 except the "TriggerWindow" operator, which has a parallelism of 1.

This configuration view provides insights into the Flink job manager setup, encompassing cluster behavior, Java options, and exception handling. Understanding and potentially adjusting these parameters is crucial for optimizing the Flink environment's behavior.

Conclusion

In this guide, we explored several key views of the Apache Flink Dashboard that enhance the understanding and management of data pipelines. These include the Job Graph, which visually represents data processing flow; the Operator/Task Data Flow Information Table, which provides detailed insights into the flow between tasks and operators; and the Configuration Tab, which offers control over job manager settings.
The dashboard provides numerous additional features that help developers gain a deeper understanding of their Apache Flink applications, facilitating the monitoring, debugging, and optimization of real-time data processing pipelines within AWS-managed services.

By Sneha Murganoor

How to Optimize Edge Devices for AI Processing

Edge computing allows data to be processed on devices rather than transferred to the cloud. Besides offering security-related benefits, this option can overcome the latency associated with moving information. As artificial intelligence (AI) has become more prominent in various industries, more people are interested in meeting edge AI computing goals by combining the two technologies for mutual benefits. Many are also exploring how to design for edge AI, making careful tweaks that result in the desired optimization. How can you follow their lead?

Take an All-Encompassing Design Approach

Creating edge devices to process AI content requires evaluating all design aspects, from hardware and software to power sources. Many artificial intelligence processing tasks are already resource-intensive, so those who want to make AI-friendly edge devices must apply forward-thinking decision-making to overcome known challenges. From a hardware perspective, edge devices should have dedicated AI chips that offer the necessary processing capabilities. Then, as people review how the device’s software will function, they should heavily scrutinize all proposed features to determine which are essential. That is a practical way to conserve battery life and ensure the device can maximize the resources for handling AI data. Rather than using trial-and-error approaches, people should strongly consider relying on industrial digital twins, which enable designers to see the likely impacts of decisions before committing to them.

Collaborative project management tools allow leaders to assign tasks to specific parties and encourage an accountability culture. Comment threads are similarly useful for determining when individual changes occurred and why. Then, reverting to another iteration when necessary is more straightforward and efficient.

Become Familiar With the Tiny AI Movement

Knowing how to design for edge AI means understanding that some enhancements occur outside the devices themselves.
One popular movement is Tiny AI, which integrates algorithms into specialized hardware to improve latency and reduce power consumption. Those furthering Tiny AI efforts generally take at least one of several approaches. Sometimes, they aim to shorten the algorithms, minimizing the computational capabilities required to handle them. Another possibility is to build devices with small but optimized hardware that can continue working with the most complex algorithms while getting energy-efficient results. Finally, people consider new ways of training machine learning algorithms that require less energy. Answering application-specific questions, such as the kind of AI data processed by the edge device or the amount of information associated with the particular use case, will help product designers determine which Tiny AI aim is most valuable.

Create a List of Must-Have Characteristics and Capabilities

An essential optimization in edge AI computing involves determining the device’s crucial performance attributes. Then, creators can identify the steps to achieving those outcomes. One practical way to start is to consider how specific materials may have desirable properties. Silicon and silicon carbide are two popular semiconductor materials that may come up in discussions about an edge device’s internal components. Silicon carbide has become a popular option for high-performance applications due to its tolerance to higher voltages and temperatures.

Knowing how to design for edge AI also requires the responsible parties to consider data storage specifics and built-in security measures. Since many users rely on AI to process information about everything from customer purchases to process improvement results, it’s critical to protect sensitive data from cybercriminals. A fundamental step is to encrypt all data. However, device-level administrator controls are also important for restricting which parties can interact with the information and how.
What steps must users take to update or configure their edge device? Making the product as user-friendly as possible will enable people to set up and update their devices, a critical security-related step. It’s also important to keep future design needs in mind. How likely is it that the business will process more or a different type of information within the next several years? Do its developers intend to create and implement additional algorithms that could increase processing needs?

Stay Aware of Relevant Efforts to Pair Edge Computing and AI

Estimates suggest that three-quarters of enterprise data creation and processing will happen outside the traditional cloud by 2025. That estimate drives home how important it is for professionals to keep exploring how to make purpose-built edge computing devices that can handle large quantities of data, including AI data. Although some companies and clients will have specific requests that design teams, engineers, and others should follow, it is also valuable to stay abreast of events and innovations in the wider industry. Collaboration between skilled, knowledgeable parties can speed up progress more than when people work independently without bouncing ideas off each other.

One example is a European Union-funded project called EdgeAI. It involves coordinated activities from 48 research and development organizations within Europe. The three-year project will center on edge computing and the intelligent processing required to handle AI applications on those devices. Participants will develop hardware and software frameworks, electronic components, and systems, all while remaining focused on edge AI computing. The long-term goal is for Europe to become a leading region in intelligent edge computing applications. Those involved will use the solutions they have developed for real-life applications, demonstrating their broad potential. Such efforts will be instrumental in showing leaders how edge AI can get them closer to their goals.
Record Details for How to Design for Edge AI

Beyond considering these actionable strategies, you should also carefully document your processes and include detailed notes about your rationale and results. Besides assisting knowledge transfer to your colleagues and others interested in this topic, your records will allow you to refer to what you have learned, paving the way for applying those details to new projects.

By Emily Newton

The Future of Data Lies in Transformer Models vs. Big Data Transformations

Last year, we witnessed the explosive rise of large models, generating global enthusiasm and making AI seem like a solution to all problems. This year, as the hype subsides, large models have entered a deeper phase, aiming to reshape the foundational logic of various industries. In the realm of big data processing, the collision between large models and traditional ETL (Extract, Transform, Load) processes has sparked new debates. Large models feature “Transformers,” while ETL relies on “Transform” processes: similar names representing vastly different paradigms. Some voices boldly predict: "ETL will be completely replaced in the future, as large models can handle all data!" Does this signal the end of the decades-old ETL framework underpinning data processing? Or is it merely a misunderstood prediction? Behind this conflict lies a deeper contemplation of technology's future.

Will Big Data Processing (ETL) Disappear?

With the rapid development of large models, some have begun to speculate whether traditional big data processing methods, including ETL, are still necessary. Large models, capable of autonomously learning rules and discovering patterns from vast datasets, are undeniably impressive. However, my answer is clear: ETL will not disappear. Large models still fail to address several core data challenges:

1. Efficiency Issues

Despite their outstanding performance in specific tasks, large models incur enormous computational costs. Training a large-scale Transformer model may take weeks and consume vast energy and financial resources. By contrast, ETL, which relies on predefined rules and logic, is efficient, resource-light, and excels at processing structured data.
For everyday enterprise data tasks, many operations remain rule-driven, such as:

- Data Cleaning: Removing anomalies using clear rules or regular expressions.
- Format Conversion: Standardizing formats to facilitate data transmission and integration across systems.
- Aggregation and Statistics: Categorizing, aggregating, and calculating data daily, weekly, or monthly.

These tasks can be swiftly handled by ETL tools without requiring the complex inference capabilities of large models.

2. Ambiguity in Natural Language

Large models have excelled in natural language processing (NLP) but have also exposed inherent challenges: ambiguity and vagueness in human language. For example:

- A single input query may yield varied interpretations depending on the context, with no guaranteed accuracy.
- Differences in data quality may lead models to generate results misaligned with real-world requirements.

By contrast, ETL is deterministic, processing data based on pre-defined rules to produce predictable, standardized outputs. In high-demand sectors like finance and healthcare, ETL's reliability and precision remain critical advantages.

3. Strong Adaptability to Structured Data

Large models are adept at extracting insights from unstructured data (e.g., text, images, videos) but often struggle with structured data tasks. For instance:

- Traditional ETL efficiently processes relational databases, handling complex operations like JOINs and GROUP BYs.
- Large models require data to be converted into specific formats before processing, introducing redundancy and delays.

In scenarios dominated by structured data (e.g., tables, JSON), ETL remains the optimal choice.

4. Explainability and Compliance

Large models are often referred to as “black boxes.” Even when data processing is complete, their internal workings and decision-making mechanisms remain opaque:

- Unexplainable Results: In regulated industries like finance and healthcare, predictions from large models may be unusable due to their lack of transparency.
- Compliance Challenges: Many industries require full auditing of data flows and processing logic. Large models, with their complex data pipelines and decision mechanisms, pose significant auditing challenges.

ETL, in contrast, provides highly transparent processes, with every data handling step documented and auditable, ensuring compliance with corporate and industry standards.

5. Data Quality and Input Standardization

Large models are highly sensitive to data quality. Noise, anomalies, or non-standardized inputs can severely affect their performance:

- Data Noise: Large models cannot automatically identify erroneous data, potentially using it as "learning material" and producing biased predictions.
- Lack of Standardization: Feeding raw, uncleaned data into large models can result in inconsistencies and missing values, requiring preprocessing tools like ETL.

ETL ensures data is cleaned, deduplicated, and standardized before being fed into large models, maintaining high data quality.

Despite the excellence of large models in many areas, their complexity, reliance on high-quality data, hardware demands, and practical limitations ensure they cannot entirely replace ETL. As a deterministic, efficient, and transparent tool, ETL will continue to coexist with large models, providing dual safeguards for data processing.

CPU vs. GPU: A Parallel to ETL vs. Large Models

While ETL cannot be replaced, the rise of large models in data processing is an inevitable trend. For decades, computing systems were CPU-centric, with other components considered peripherals.
GPUs were primarily used for gaming, but today, data processing relies on the synergy of CPUs and GPUs (or NPUs). This paradigm shift reflects broader changes mirrored in the stock trends of Intel and NVIDIA.

From Single-Center to Multi-Center Computing

Historically, data processing architectures evolved from "CPU-centric" to "CPU+GPU (and even NPU) collaboration." This transition, driven by changes in computing performance requirements, has deeply influenced the choice of data processing tools. During the CPU-centric era, early ETL processes heavily relied on CPU logic for operations like data cleaning, formatting, and aggregation. These tasks were well-suited to CPUs’ sequential processing capabilities. However, the rise of complex data formats (audio, video, text) and exponential storage growth revealed the limitations of CPU power. GPUs, with their unparalleled parallel processing capabilities, have since taken center stage in data-intensive tasks like training large Transformer models.

From Traditional ETL to Large Models

Traditional ETL processes, optimized for "CPU-centric" computing, excel at handling rule-based, structured data tasks. Examples include:

- Data validation and cleaning.
- Format standardization.
- Aggregation and reporting.

Large models, in contrast, require GPU power for high-dimensional matrix computations and large-scale parameter optimization:

- Preprocessing: Real-time normalization and data segmentation.
- Model training: Compute-heavy tasks involving floating-point operations.
- Inference services: Optimized batch processing for low latency and high throughput.

This reflects a shift from logical computation to neural inference, broadening data processing to include reasoning and knowledge extraction.

Toward a New Generation of ETL Architecture for Large Models

The rise of large models highlights inefficiencies in traditional data processing, necessitating a more advanced, unified architecture.
Pain Points in Current Data Processing

- Complex, Fragmented Processes: Data cleaning, annotation, and preprocessing remain highly manual and siloed.
- Low Reusability: Teams often recreate data pipelines, leading to inefficiencies.
- Inconsistent Quality: The lack of standardized tools results in varying data quality.
- High Costs: Separate development and maintenance for each team inflate costs.

Solutions: AI-Enhanced ETL Tools

Future ETL tools will embed AI capabilities, merging traditional strengths with modern intelligence:

- Embedding Generation: Built-in support for text, image, and audio vectorization.
- LLM Knowledge Extraction: Automated structuring of unstructured data.
- Dynamic Cleaning Rules: Context-aware optimization of data cleaning strategies.
- Unstructured Data Handling: Support for keyframe extraction, OCR, and speech-to-text.
- Automated Augmentation: Intelligent data generation and enhancement.

The Ultimate Trend: Transformers + Transform

With the continuous advancement of technology, large models and traditional ETL processes are gradually converging. The next generation of ETL architectures is expected to blend the intelligence of large models with the efficiency of ETL, creating a comprehensive framework capable of processing diverse data types.

Hardware: Integration of Data Processing Units

The foundation of data processing is shifting from CPU-centric systems to a collaborative approach involving CPUs and GPUs:

- CPU for foundational tasks: CPUs excel at basic operations like preliminary data cleaning, integration, and rule-based processing, such as extracting, transforming, and loading structured data.
- GPU for advanced analytics: With powerful parallel computing capabilities, GPUs handle large model training and inference tasks on pre-processed data.
This trend is reflected not only in technical innovation but also in industry dynamics: Intel is advancing AI accelerators for CPU-AI collaboration, while NVIDIA is expanding GPU applications into traditional ETL scenarios. The synergy between CPUs and GPUs promises higher efficiency and intelligent support for next-generation data processing.

Software: Integration of Data Processing Architectures

As ETL and large model functionalities become increasingly intertwined, data processing is evolving into a multifunctional, collaborative platform where ETL serves as a data preparation tool for large models. Large models require high-quality input data during training, and ETL provides the preliminary processing to create ideal conditions:

- Noise removal and cleaning: Eliminates noisy data to improve dataset quality.
- Formatting and standardization: Converts diverse data formats into a unified structure suitable for large models.
- Data augmentation: Expands data scale and diversity through preprocessing and rule-based enhancements.

Emergence of AI-Enhanced ETL Architectures

The future of ETL tools lies in embedding AI capabilities to achieve smarter data processing:

1. Embedding Capabilities

- Integrating modules for generating embeddings to support vector-based data processing.
- Producing high-dimensional representations for text, images, and audio; using pre-trained models for semantic embeddings in downstream tasks.
- Performing embedding calculations directly within ETL workflows, reducing dependency on external inference services.

2. LLM Knowledge Extraction

- Leveraging large language models (LLMs) to efficiently process unstructured data, extracting structured information like entities and events.
- Completing and inferring complex fields, such as filling in missing values or predicting future trends.
- Enabling multi-language data translation and semantic alignment during data integration.

3. Unstructured Data Recognition and Keyframe Extraction

- Supporting video, image, and audio data natively, enabling automatic keyframe extraction for annotation or training datasets.
- Extracting features from images (e.g., object detection, OCR) and performing audio-to-text conversion, sentiment analysis, and more.

4. Dynamic Cleaning Rules

- Dynamically adjusting cleaning and augmentation strategies based on data context to enhance efficiency and relevance.
- Detecting anomalies in real-time and generating adaptive cleaning rules.
- Optimizing cleaning strategies for specific domains (e.g., finance, healthcare).

5. Automated Data Augmentation and Generation

- Dynamically augmenting datasets through AI models (e.g., synonym replacement, data back-translation, adversarial sample generation).
- Expanding datasets for low-sample scenarios and enabling cross-language or cross-domain data generation.

AI-enhanced ETL represents a transformative leap from traditional ETL, offering embedding generation, LLM-based knowledge extraction, unstructured data processing, and dynamic rule generation to significantly improve efficiency, flexibility, and intelligence in data processing.
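The division of labor described above, deterministic ETL rules first and a model-backed step afterward, can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the email rule is a deliberately simple regular expression, and `toyEmbedding` is a stand-in that produces a fixed-length numeric vector without any semantic meaning. A real pipeline would call an actual embedding model at that point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Hypothetical sketch: rule-based cleaning followed by a pluggable enrichment step. */
final class AiEnhancedEtlSketch {

    /** Simple validation rule (an assumption for this example, not a full email grammar). */
    private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    /** Standardization rule: trim whitespace and lower-case the value. */
    static String standardize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Deterministic transform: standardize, drop rows that fail the rule, de-duplicate. */
    static List<String> clean(List<String> rawRows) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order
        for (String raw : rawRows) {
            if (raw == null) {
                continue; // missing values are dropped
            }
            String row = standardize(raw);
            if (EMAIL.matcher(row).matches()) {
                seen.add(row);
            }
        }
        return new ArrayList<>(seen);
    }

    /** Pluggable enrichment step; a real pipeline might call an embedding model or LLM here. */
    static <T> List<T> enrich(List<String> cleanRows, Function<String, T> model) {
        List<T> out = new ArrayList<>();
        for (String row : cleanRows) {
            out.add(model.apply(row));
        }
        return out;
    }

    /** Toy stand-in for an embedding model: sums character codes into 4 slots. */
    static double[] toyEmbedding(String text) {
        double[] vector = new double[4];
        for (int i = 0; i < text.length(); i++) {
            vector[i % 4] += text.charAt(i);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                " Ada@Example.com ", "ada@example.com", "not-an-email", null, "bob@example.org");
        List<String> cleaned = clean(raw);
        List<double[]> vectors = enrich(cleaned, AiEnhancedEtlSketch::toyEmbedding);
        System.out.println(cleaned + " -> " + vectors.size() + " vectors");
    }
}
```

The cleaning stage is fully predictable and auditable, which is the property the article attributes to ETL, while the enrichment stage is passed in as a function so a model can be swapped in without touching the rules.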
Case Study: Apache SeaTunnel – A New Generation AI-Enhanced ETL Architecture

As an example, the open-source Apache SeaTunnel project is breaking traditional ETL limitations by supporting innovative data formats and advanced processing capabilities, showcasing the future of data processing:

- Native support for unstructured data: The SeaTunnel engine supports text, video, and audio processing for diverse model training needs.
- Vectorized data support: Enables seamless compatibility with deep learning and large-model inference tasks.
- Embedding large model features: SeaTunnel v2.3.8 supports embedding generation and LLM transformations, bridging traditional ETL with AI inference workflows.
- “Any-to-Any” transformation: Transforms data from any source (e.g., databases, binlogs, PDFs, SaaS, videos) to any target format, delivering unmatched versatility.

Tools like SeaTunnel illustrate how modern data processing has evolved into an AI+Big Data full-stack collaboration system, becoming central to enterprise AI and data strategies.

Conclusion

Large model transformers and big data transforms are not competitors but allies. The future of data processing lies in the deep integration of ETL and large models, as illustrated below:

- Collaborative data processing units: Leveraging CPU-GPU synergy for both structured and unstructured data processing.
- Dynamic data processing architecture: Embedding AI capabilities into ETL for embedding generation, LLM knowledge extraction, and intelligent decision-making.
- Next-gen tools: Open-source solutions like Apache SeaTunnel highlight this trend, enabling "Any-to-Any" data transformation and redefining ETL boundaries.

The convergence of large models and ETL will propel data processing into a new era of intelligence, standardization, and openness. By addressing enterprise demands, this evolution will drive business innovation and intelligent decision-making, becoming a core engine for the future of data-driven enterprises.

By William Guo

Apache Iceberg: The Open Table Format for Lakehouses and Data Streaming

Every data-driven organization has operational and analytical workloads. A best-of-breed approach emerges with various data platforms, including data streaming, data lake, data warehouse and lakehouse solutions, and cloud services. An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets, and cost-efficient storage while providing strong support for ACID transactions and time travel queries. This article explores market trends; adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake, and XTable; and the product strategy of some of the leading vendors of data platforms such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka/Flink), Amazon Athena, and Google BigQuery.

What Is an Open Table Format for a Data Platform?

An open table format helps in maintaining data integrity, optimizing query performance, and ensuring a clear understanding of the data stored within the platform. The open table format for data platforms typically includes a well-defined structure with specific components that ensure data is organized, accessible, and easily queryable. A typical table format contains a table name, column names, data types, primary and foreign keys, indexes, and constraints. This is not a new concept. Your favorite decades-old database (like Oracle, IBM DB2, even on the mainframe, or PostgreSQL) uses the same principles. However, the requirements and challenges changed a bit for cloud data warehouses, data lakes, and lakehouses regarding scalability, performance, and query capabilities.

Benefits of a "Lakehouse Table Format" Like Apache Iceberg

Every part of an organization becomes data-driven. The consequence is extensive data sets, data sharing with data products across business units, and new requirements for processing data in near real-time.
Apache Iceberg provides many benefits for enterprise architecture:

- Single storage: Data is stored once (coming from various data sources), which reduces cost and complexity.
- Interoperability: Access without integration efforts from any analytical engine.
- All data: Unify operational and analytical workloads (transactional systems, big data logs/IoT/clickstream, mobile APIs, third-party B2B interfaces, etc.).
- Vendor independence: Work with any favorite analytics engine (no matter if it is near real-time, batch, or API-based).

Apache Hudi and Delta Lake provide the same characteristics. However, Delta Lake is mainly driven by Databricks as a single vendor.

Table Format and Catalog Interface

It is important to understand that discussions about Apache Iceberg or similar table format frameworks include two concepts: table format and catalog interface! As an end user of the technology, you need both! The Apache Iceberg project implements the format but only provides a specification (not an implementation) for the catalog:

- The table format defines how data is organized, stored, and managed within a table.
- The catalog interface manages the metadata for tables and provides an abstraction layer for accessing tables in a data lake.

The Apache Iceberg documentation explores the concepts in much more detail, based on this diagram:

Source: Apache Iceberg documentation

Organizations use various implementations for Iceberg's catalog interface. Each integrates with different metadata stores and services. Key implementations include:

- Hadoop catalog: Uses the Hadoop Distributed File System (HDFS) or other compatible file systems to store metadata. Suitable for environments already using Hadoop.
- Hive catalog: Integrates with Apache Hive Metastore to manage table metadata. Ideal for users leveraging Hive for their metadata management.
- AWS Glue catalog: Uses AWS Glue Data Catalog for metadata storage. Designed for users operating within the AWS ecosystem.
- REST catalog: Provides a RESTful interface for catalog operations via HTTP. Enables integration with custom or third-party metadata services.
- Nessie catalog: Uses Project Nessie, which provides a Git-like experience for managing data.

The momentum and growing adoption of Apache Iceberg motivates many data platform vendors to implement their own Iceberg catalog. I discuss a few strategies in the section below about data platform and cloud vendor strategies, including Snowflake's Polaris, Databricks' Unity, and Confluent's Tableflow.

First-Class Iceberg Support vs. Iceberg Connector

Please note that supporting Apache Iceberg (or Hudi/Delta Lake) means much more than just providing a connector and integration with the table format via API. Vendors and cloud services differentiate by advanced features like automatic mapping between data formats, critical SLAs, time travel, intuitive user interfaces, and so on.

Let's look at an example: Integration between Apache Kafka and Iceberg. Various Kafka Connect connectors have already been implemented. However, here are the benefits of using a first-class integration with Iceberg (e.g., Confluent's Tableflow) compared to just using a Kafka Connect connector:

- No connector config
- No consumption through connector
- Built-in maintenance (compaction, garbage collection, snapshot management)
- Automatic schema evolution
- External catalog service synchronization
- Simpler operations (in a fully-managed SaaS solution, it is serverless with no need for any scale or operations by the end user)

Similar benefits apply to other data platforms and potential first-class integration compared to providing simple connectors.
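To picture the split between table format and catalog described above, the following toy sketch models a catalog as nothing more than a map from table name to the location of the table's current metadata file, with commits done as an atomic compare-and-swap of that pointer. This is a simplified illustration, not the Apache Iceberg API: the class, method names, table name, and file paths are all hypothetical, and real catalogs (Hive, Glue, REST, Nessie) add namespaces, authentication, and much more.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy in-memory catalog: table name -> location of the current table metadata file. */
final class ToyCatalogSketch {

    private final Map<String, String> currentMetadata = new ConcurrentHashMap<>();

    /** Registers a new table; fails if the name is already taken. */
    void createTable(String name, String metadataLocation) {
        if (currentMetadata.putIfAbsent(name, metadataLocation) != null) {
            throw new IllegalStateException("Table already exists: " + name);
        }
    }

    /** Returns the metadata location a query engine would read to plan a scan. */
    String loadTable(String name) {
        String location = currentMetadata.get(name);
        if (location == null) {
            throw new IllegalArgumentException("No such table: " + name);
        }
        return location;
    }

    /**
     * Optimistic commit: the pointer moves to the new metadata file only if it still
     * points at the version the writer started from. A concurrent writer that lost
     * the race gets false and would have to retry on top of the latest version.
     */
    boolean commit(String name, String expectedLocation, String newLocation) {
        return currentMetadata.replace(name, expectedLocation, newLocation);
    }

    public static void main(String[] args) {
        ToyCatalogSketch catalog = new ToyCatalogSketch();
        catalog.createTable("db.events", "metadata/v1.json");

        // Two writers both start from v1; only the first commit wins.
        boolean first = catalog.commit("db.events", "metadata/v1.json", "metadata/v2.json");
        boolean second = catalog.commit("db.events", "metadata/v1.json", "metadata/v3.json");

        System.out.println(first + " " + second + " " + catalog.loadTable("db.events"));
    }
}
```

Because readers always resolve a table through the single pointer, they see either the old or the new table state and never a half-written one. This is one way to think about why the catalog matters for the consistency guarantees that table formats advertise.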
Open Table Format for a Data Lake/Lakehouse using Apache Iceberg, Apache Hudi, and Delta Lake

The general goal of table format frameworks such as Apache Iceberg, Apache Hudi, and Delta Lake is to enhance the functionality and reliability of data lakes by addressing common challenges associated with managing large-scale data. These frameworks help to:

- Improve data management
  - Facilitate easier handling of data ingestion, storage, and retrieval in data lakes.
  - Enable efficient data organization and storage, supporting better performance and scalability.
- Ensure data consistency
  - Provide mechanisms for ACID transactions, ensuring that data remains consistent and reliable even during concurrent read and write operations.
  - Support snapshot isolation, allowing users to view a consistent state of data at any point in time.
- Support schema evolution
  - Allow for changes in data schema (such as adding, renaming, or removing columns) without disrupting existing data or requiring complex migrations.
- Optimize query performance
  - Implement advanced indexing and partitioning strategies to improve the speed and efficiency of data queries.
  - Enable efficient metadata management to handle large datasets and complex queries effectively.
- Enhance data governance
  - Provide tools for better tracking and managing data lineage, versioning, and auditing, which are crucial for maintaining data quality and compliance.

By addressing these goals, table format frameworks like Apache Iceberg, Apache Hudi, and Delta Lake help organizations build more robust, scalable, and reliable data lakes and lakehouses. Data engineers, data scientists and business analysts leverage analytics, AI/ML, or reporting/visualization tools on top of the table format to manage and analyze large volumes of data.

Comparison of Apache Iceberg, Hudi, Paimon, and Delta Lake

I won't do a comparison of the table format frameworks Apache Iceberg, Apache Hudi, Apache Paimon, and Delta Lake here. Many experts have written about this already. Each table format framework has unique strengths and benefits. But any comparison needs updating every month because these frameworks evolve quickly, adding new improvements and capabilities. Here is a summary of what I see in various blog posts about the four options:

- Apache Iceberg: Excels in schema and partition evolution, efficient metadata management, and broad compatibility with various data processing engines.
- Apache Hudi: Best suited for real-time data ingestion and upserts, with strong change data capture capabilities and data versioning.
- Apache Paimon: A lake format that enables building a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations.
- Delta Lake: Provides robust ACID transactions, schema enforcement, and time travel features, making it ideal for maintaining data quality and integrity.

A key decision point might be that Delta Lake is not driven by a broad community like Iceberg and Hudi, but mainly by Databricks as a single vendor.

Apache XTable as Interoperable Cross-Table Framework Supporting Iceberg, Hudi, and Delta Lake

Users have lots of choices. XTable, formerly known as OneTable, is yet another incubating table framework under the Apache open-source license that interoperates seamlessly across Apache Hudi, Delta Lake, and Apache Iceberg tables. Apache XTable:

- Provides cross-table omnidirectional interoperability between lakehouse table formats.
- Is not a new or separate format. Apache XTable provides abstractions and tools for the translation of lakehouse table format metadata.

Maybe Apache XTable is the answer to provide options for specific data platforms and cloud vendors while still providing simple integration and interoperability. But be careful: A wrapper on top of different technologies is not a silver bullet. We saw this years ago when Apache Beam emerged.
Apache Beam is an open-source, unified model and set of language-specific SDKs for defining and executing data ingestion and data processing workflows. It supports a variety of stream processing engines, such as Flink, Spark, and Samza. Google is the primary driver behind Apache Beam, which allows the migration of workflows into Google Cloud Dataflow. However, the limitations are huge, as such a wrapper needs to find the least common denominator of supported features. And the key benefit of most frameworks is the 20% that does not fit into such a wrapper. For these reasons, for instance, Kafka Streams intentionally does not support Apache Beam because it would have required too many design limitations.

Market Adoption of Table Format Frameworks

First of all, we are still in the early stages. We are still at the innovation trigger in terms of the Gartner Hype Cycle, coming to the peak of inflated expectations. Most organizations are still evaluating these table formats and have not yet adopted them in production across the organization.

Flashback: The Container Wars of Kubernetes vs. Mesosphere vs. Cloud Foundry

The debate around Apache Iceberg reminds me of the container wars a few years ago. The term "Container Wars" refers to the competition and rivalry among different containerization technologies and platforms in the realm of software development and IT infrastructure. The three competing technologies were Kubernetes, Mesosphere, and Cloud Foundry. Here is where it went: Cloud Foundry and Mesosphere were early, but Kubernetes still won the battle. Why? I never understood all the technical details and differences. In the end, if the three frameworks are pretty similar, it is all about:

- Community adoption
- Right timing of feature releases
- Good marketing
- Luck
- And a few other factors

But it is good for the software industry to have one leading open-source framework to build solutions and business models on instead of three competing ones.

Present: The Table Format Wars of Apache Iceberg vs. Hudi vs. Delta Lake

Obviously, Google Trends is no statistical evidence or sophisticated research. But I used it a lot in the past as an intuitive, simple, free tool to analyze market trends. Therefore, I also used this tool to see if Google searches overlap with my personal experience of the market adoption of Apache Iceberg, Hudi, and Delta Lake (Apache XTable is still too small to be added).

We obviously see a pattern similar to the one the container wars showed a few years ago. I have no idea where this is going. The future will show whether one technology wins or whether the frameworks differentiate enough to prove that there is no silver bullet. My personal opinion? I think Apache Iceberg will win the race. Why? I cannot argue with any technical reasons. I just see many customers across all industries talk about it more and more. And more and more vendors start supporting it. But we will see. I actually do not care who wins. However, similar to the container wars, I think it is good to have a single standard and vendors differentiating with features around it, as is the case with Kubernetes. With this in mind, let's explore the current strategy of the leading data platforms and cloud providers regarding table format support in their platforms and cloud services.

Data Platform and Cloud Vendor Strategies for Apache Iceberg

I won't do any speculation in this section. The table format frameworks evolve quickly, and vendor strategies change quickly. Please refer to the vendors' websites for the latest information. But here is the status quo of the data platform and cloud vendor strategies regarding the support and integration of Apache Iceberg.
Snowflake:
- Has supported Apache Iceberg for quite some time already
- Adding better integrations and new features regularly
- Internal and external storage options (with trade-offs) like Snowflake's storage or Amazon S3
- Announced Polaris, an open-source catalog implementation for Iceberg, with commitment to support community-driven, vendor-agnostic bi-directional integration

Databricks:
- Focuses on Delta Lake as the table format and (now open sourced) Unity as catalog
- Acquired Tabular, the leading company behind Apache Iceberg
- Unclear future strategy of supporting open Iceberg interface (in both directions) or only to feed data into its lakehouse platform and technologies like Delta Lake and Unity Catalog

Confluent:
- Embeds Apache Iceberg as a first-class citizen into its data streaming platform (the product is called Tableflow)
- Converts a Kafka Topic and related schema metadata (i.e., data contract) into an Iceberg table
- Bi-directional integration between operational and analytical workloads
- Analytics with embedded serverless Flink and its unified batch and streaming API or data sharing with third-party analytics engines like Snowflake, Databricks, or Amazon Athena

More data platforms and open-source analytics engines:
- The list of technologies and cloud services supporting Iceberg grows every month
- A few examples: Apache Spark, Apache Flink, ClickHouse, Dremio, Starburst using Trino (formerly PrestoSQL), Cloudera using Impala, Imply using Apache Druid, Fivetran

Cloud service providers (AWS, Azure, Google Cloud, Alibaba):
- Different strategies and integrations, but all cloud providers increase Iceberg support across their services these days, for instance:
  - Object Storage: Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage
  - Catalogs: Cloud-specific like AWS Glue Catalog or vendor agnostic like Project Nessie or Hive Catalog
  - Analytics: Amazon Athena, Azure Synapse Analytics, Microsoft Fabric, Google BigQuery

Shift Left Architecture With Kafka, Flink, and Iceberg to Unify Operational and Analytical Workloads

The shift left architecture moves data processing closer to the data source, leveraging real-time data streaming technologies like Apache Kafka and Flink to process data in motion directly after it is ingested. This approach reduces latency and improves data consistency and data quality. Unlike ETL and ELT, which involve batch processing with the data stored at rest, shift left architecture enables real-time data capture and transformation. It aligns with the zero-ETL concept by making data immediately usable. But in contrast to zero-ETL, shifting data processing to the left side of the enterprise architecture avoids a complex, hard-to-maintain spaghetti architecture with many point-to-point connections. Shift left architecture also reduces the need for reverse ETL by ensuring data is actionable in real time for both operational and analytical systems. Overall, this architecture enhances data freshness, reduces costs, and speeds up the time-to-market for data-driven applications. Learn more about this concept in my blog post about "The Shift Left Architecture."

Apache Iceberg as Open Table Format and Catalog for Seamless Data Sharing Across Analytics Engines

An open table format and catalog introduces enormous benefits into the enterprise architecture:

- Interoperability
- Freedom of choice of the analytics engines
- Faster time-to-market
- Reduced cost

Apache Iceberg seems to be becoming the de facto standard across vendors and cloud providers. However, it is still at an early stage, and competing and wrapper technologies like Apache Hudi, Apache Paimon, Delta Lake, and Apache XTable are trying to gain momentum, too.
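To make the shift left idea from the previous section a bit more concrete, here is a minimal sketch in plain Python. It is not Kafka, Flink, or Iceberg code: the `Order` type, the `curate` and `fan_out` functions, and the two in-memory lists standing in for an operational consumer and an analytical table are all hypothetical names invented for illustration. The point it tries to show is only that validation and normalization happen once, right after ingestion, and that both sides then read the same curated records.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Order:
    """Hypothetical curated record (the 'data contract' of this sketch)."""
    order_id: str
    amount_cents: int
    currency: str


def curate(raw_events: Iterable[dict]) -> Iterator[Order]:
    """Validate and normalize each raw event once, directly after ingestion."""
    for event in raw_events:
        try:
            order = Order(
                order_id=str(event["order_id"]),
                amount_cents=round(float(event["amount"]) * 100),
                currency=str(event["currency"]).upper(),
            )
        except (KeyError, TypeError, ValueError):
            # Malformed event: skipped here; a real pipeline would more likely
            # route it to some kind of dead-letter destination.
            continue
        if order.amount_cents >= 0:
            yield order


def fan_out(orders: Iterable[Order],
            operational_sink: List[Order],
            analytical_sink: List[Order]) -> None:
    """Deliver the same curated record to both consumers (lists as stand-ins)."""
    for order in orders:
        operational_sink.append(order)
        analytical_sink.append(order)


if __name__ == "__main__":
    raw = [
        {"order_id": 1, "amount": "19.99", "currency": "eur"},
        {"order_id": 2, "amount": "oops", "currency": "usd"},  # invalid amount
        {"amount": 5, "currency": "usd"},                      # missing key
    ]
    operational, analytical = [], []
    fan_out(curate(raw), operational, analytical)
    print(operational == analytical, len(analytical))
```

Because the cleaning logic lives in one place upstream, neither consumer needs its own copy of it, which is the consistency argument made above. In a real deployment the curation step would run in a stream processor and the analytical sink would be a table in an open table format rather than a Python list.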
Iceberg and other open table formats are not just a huge win for single storage and integration with multiple analytics/data/AI/ML platforms such as Snowflake, Databricks, Google BigQuery, et al., but also for the unification of operational and analytical workloads using data streaming with technologies such as Apache Kafka and Flink. Shift left architecture is a significant benefit because it reduces effort, improves data quality and consistency, and enables real-time instead of batch applications and insights.

Finally, if you still wonder what the differences are between data streaming and lakehouses (and how they complement each other), check out this ten-minute video.

What is your table format strategy? Which technologies and cloud services do you connect? Let’s connect on LinkedIn and discuss it!

By Kai Wähner

The Benefits of Using Cloud for Big Data Processing

The quantity of data generated per second is astonishing in today's digital world. Big data allows organizations and businesses to create new products and services, make better decisions, and enhance customer experiences. However, processing and analyzing large volumes of data can be quite challenging. This is where cloud computing comes into play. Having worked as a cloud computing engineer, I have witnessed how much the adoption of cloud technology has improved big data processing capabilities. This post discusses some advantages of cloud solutions for big data processing and how they contribute to the success of organizations.

10 Reasons to Use Cloud for Big Data Processing

1. Scalability

One of the major advantages of cloud computing is scalability. In most cases, traditional data processing systems require large investments in hardware and software to handle increased loads. With cloud-based services, you can scale up or down according to your needs. This scalability also helps businesses manage resources efficiently, since they pay only for what they use. Whether terabytes of data must be streamed in minutes for a short project or steady data streams must be handled over time, the cloud can meet the requirement with far less infrastructure change.

2. Cost-Effectiveness

Implementing big data solutions can be costly for any organization, especially small and medium-sized enterprises. Cloud platforms offer a pay-per-use pricing model, so an organization need not pay in advance for hardware and software. This helps it use its budget effectively and make more valuable investments while still benefiting from full-featured data-processing capabilities. Moreover, maintenance and updates are usually included in the services provided by cloud service providers. This further reduces the overall costs for companies.

3. Advanced Tools and Technologies

Cloud service providers offer many advanced tools and technologies that simplify big data processing. Most of the time, these tools come equipped with the latest features and updates, which allows an organization to use recent technologies without having to manage them. Cloud platforms offer an extensive list of services, from data storage and processing to machine learning, analytics, and more, enabling cloud computing engineers to build and deploy their solutions rapidly. Access to these advanced tools can greatly boost productivity and innovation.

4. Improved Collaboration

Success today depends on collaboration in a work environment that is ever more remote and global. Cloud-based solutions help make this a reality: multiple users can access and analyze the same data in real time. That feature is particularly useful for big data projects, where insights might come from large, diverse teams with different areas of expertise. Moving to the cloud lets an organization ensure all team members have access to the same data and tools for better communication and collaboration.

5. Security and Compliance

Data security is among the major concerns for businesses working with big data. This is why cloud providers invest heavily in security measures to protect their infrastructure and client data. They offer features such as encryption, identity management, and regular security audits. In addition, many cloud services meet industry standards and regulations, making it easier for businesses to meet compliance requirements. When an organization handles sensitive information, this added layer of security gives clients assurance and helps reduce risk.

6. Speed and Performance

Cloud computing enables organizations to process their data much faster and more efficiently.
With access to high-performance computing resources, cloud platforms can process bulk volumes of data and complex computations much faster than most in-house solutions. This speed is of the essence for big data applications, where real-time analysis leads to timely insights and informed decisions. Businesses can use such resources to improve performance and responsiveness to changing market conditions.

7. Simplified Data Management

Data management can become very cumbersome when volumes are large. Cloud solutions often have embedded tools that make data management easier. These tools organize and store data so that it can be retrieved efficiently, enabling the cloud computing engineer to analyze the data rather than wrestle with it. By offering automated backups, data replication, and flexible access controls, cloud platforms make data management seamless and help an organization ensure data integrity and availability.

8. Disaster Recovery and Backup Solutions

A reliable backup and disaster recovery plan covers circumstances where data is lost or a system fails. Cloud services offer some of the strongest backup solutions to ensure that data is well-secured and can be recovered quickly. Most cloud providers incorporate disaster recovery into their services, enabling an organization to limit data loss and reduce downtime. This becomes particularly crucial in big data processing, where losing large amounts of data can significantly distort analysis and results.

9. Leveraging Global Resources

Organizations can access global resources in the cloud, helping them reduce the friction and effort required to analyze and process data from different locations. This global reach is also one of the major drivers for businesses with a distributed workforce or those working in multiple regions. Based on cloud infrastructure, organizations can analyze data from different sources and gain a far more complete view of their market. This global perspective then enables better decision-making and strategic planning.

10. Continuous Innovation

Finally, the cloud enables continuous innovation. Cloud service providers continuously update their services so that organizations can keep up with the latest technologies and new features. This continuous improvement cycle lets businesses stay competitive and agile in a fast-changing market. Cloud computing engineers can regularly refine and enhance their big data processing solutions and make use of new advancements.

Summary

The advantages of cloud computing for big data processing are numerous. From scalability and cost-effectiveness to improved collaboration and security, cloud solutions provide organizations with what they need to prosper in a data-driven environment. As big data gets bigger each day, the role of cloud architecture will grow further and become more important, shaping the future of data processing and analytics. For organizations that want to harness big data for their benefit, leveraging cloud technologies is no longer an option but an imperative. An investment in cloud solutions lets the organization unlock the full value of its data to drive meaningful insights that will lead to successful outcomes.

By Job Ready Program

Top IoT Experts

The Latest IoT Topics