Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
My data catalog project was the third time in my career that I had led a catalog implementation. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right.

We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us.

Why Two Tools?

Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually maintained catalog falls behind the actual state of the data within months.

Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately.

Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview.

The Purview Setup

Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines.
Scans run daily for production data sources and weekly for development. Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which tables it reads from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset.

Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect, about 85% accurate in our testing, but it surfaces PII-candidate columns that manual review would miss.

Python

    # Purview scan configuration (REST API)
    import requests

    def create_purview_scan(account_name, collection, data_source):
        url = (f"https://{account_name}.purview.azure.com/scan/datasources/"
               f"{data_source}/scans/daily-production-scan")
        body = {
            "kind": "AzureStorageMsi",
            "properties": {
                "scanRulesetName": "custom-pii-ruleset",
                "scanRulesetType": "Custom",
                "collection": {"referenceName": collection},
                "credential": {
                    "referenceName": "managed-identity",
                    "credentialType": "ManagedIdentity"
                }
            },
            "trigger": {
                "recurrence": {
                    "frequency": "Day",
                    "interval": 1,
                    "startTime": "2024-01-01T02:00:00Z",
                    "timezone": "UTC"
                }
            }
        }
        # get_auth_headers() is assumed to be defined elsewhere and to
        # return the bearer-token headers for the Purview account.
        resp = requests.put(url, json=body, headers=get_auth_headers())
        resp.raise_for_status()  # fail loudly instead of parsing an error body
        return resp.json()

    # Custom classifier for internal account numbers
    custom_classifier = {
        "kind": "Custom",
        "properties": {
            "classificationName": "INTERNAL_ACCOUNT_NUMBER",
            "description": "Internal account number format: ACC prefix plus 9 digits (12 characters)",
            "classificationRule": {
                "kind": "Regex",
                "pattern": "^ACC[0-9]{9}$",
                "minimumPercentageMatch": 75
            }
        }
    }

The Collibra Integration

We built a nightly sync that reads technical metadata from Purview via its REST API and
creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views.

Building this sync took about six weeks of engineering time. It's the part of the implementation where I most seriously considered an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance.

Python

    # Nightly Purview-to-Collibra metadata sync (Python)
    from datetime import datetime, timezone

    def sync_purview_to_collibra(purview_client, collibra_client):
        """Sync technical metadata from Purview to Collibra assets."""
        # Fetch all cataloged assets from Purview
        purview_assets = purview_client.discovery.query(
            keywords="*",
            filter={"and": [
                {"entityType": "azure_datalake_gen2_path"},
                {"classification": ["confidential", "restricted"]}
            ]},
            limit=1000
        )
        for asset in purview_assets['value']:
            collibra_asset = collibra_client.find_or_create_asset(
                name=asset['name'],
                domain="Data Lake Assets",
                type="Data Set"
            )
            # Sync technical metadata as attributes.
            # get_lineage_summary() is assumed to be defined elsewhere.
            collibra_client.update_attributes(collibra_asset['id'], {
                "Technical Schema": asset.get('schema', ''),
                "Data Classification": asset.get('classification', []),
                "Purview Asset Link": asset['id'],
                "Last Scanned": asset.get('lastScanTime', ''),
                "Lineage Summary": get_lineage_summary(
                    purview_client, asset['id']),
                "Sync Timestamp": datetime.now(timezone.utc).isoformat()
            })
        return {"synced": len(purview_assets['value']),
                "timestamp": datetime.now(timezone.utc).isoformat()}

What Adoption Looked Like

Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo.
After three months, about 30% of our target user base was actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it.

Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change.

Python

    # Collibra asset mapping schema for access request workflow
    {
        "asset_type": "Data Set",
        "domain": "Data Lake Assets",
        "attributes": {
            "Technical Name": {"type": "text", "source": "purview"},
            "Business Name": {"type": "text", "source": "steward"},
            "Data Classification": {
                "type": "single_select",
                "values": ["public", "internal", "confidential", "restricted"],
                "source": "purview"
            },
            "Owner Team": {"type": "text", "source": "steward"},
            "PII Columns": {"type": "multi_select", "source": "purview"},
            "Quality Certification": {
                "type": "single_select",
                "values": ["certified", "provisional", "uncertified"],
                "source": "steward"
            },
            "Access Request URL": {
                "type": "url",
                "template": "https://collibra.internal/access/{asset_id}"
            }
        },
        "workflow": {
            "access_request": {
                "approvers": ["asset_owner", "data_governance_lead"],
                "sla_hours": 48,
                "auto_revoke_days": 365
            }
        }
    }

The Honest Caveat

A data catalog requires ongoing investment that is easy to underestimate. The automated parts (Purview's scanning and discovery) take care of themselves. The business governance parts (glossary maintenance, stewardship assignments, and quality certifications) require human effort that must be budgeted and owned.
Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.
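To make the custom classifier rule from the Purview setup more concrete, here is a minimal sketch of how a regex pattern combined with a minimum-percentage threshold could decide whether a column gets tagged. It is not Purview's implementation. The function name `column_matches` and the sample values are hypothetical, and it assumes the threshold is compared against the share of distinct, non-empty values that match the pattern.

```python
import re

# Pattern copied from the classifier definition in the article.
ACCOUNT_PATTERN = re.compile(r"^ACC[0-9]{9}$")


def column_matches(values, pattern=ACCOUNT_PATTERN, minimum_percentage_match=75):
    """Return (matched, percentage) for a sample of column values.

    Assumption: the threshold applies to the share of distinct,
    non-empty values that fully match the pattern.
    """
    distinct = {v.strip() for v in values if v and v.strip()}
    if not distinct:
        return False, 0.0
    hits = sum(1 for v in distinct if pattern.fullmatch(v))
    percentage = 100.0 * hits / len(distinct)
    return percentage >= minimum_percentage_match, percentage


# Hypothetical column sample: 3 of 5 distinct values match (60%),
# which is below the 75% threshold, so the column is not classified.
sample = ["ACC123456789", "ACC000000001", "ACC987654321", "N/A", "ACC12345"]
matched, pct = column_matches(sample)
print(matched, pct)
```

A threshold below 100% tolerates placeholder values such as "N/A", which is one reason automated classification surfaces candidates rather than certainties.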
Artificial Intelligence is rapidly becoming a part of everyday devices: smartphones, cars, cameras, and even home appliances. Traditionally, these systems send data to cloud servers for processing and analysis before making decisions, which adds latency and delays responses.

However, many applications require instant decision-making, where even a slight delay can be critical. In such scenarios, relying on network connectivity is not always practical, and decisions need to be made locally on the device itself. This has led to a growing shift toward running intelligence directly on devices, making real-time local processing more important than ever. In this article, we’ll explore why this shift matters and how it is shaping the future of modern intelligent systems.

What Is Edge AI?

Edge AI refers to running AI models directly on devices such as IoT systems, smartphones, autonomous cars, drones, and sensors, right where the data is generated. With this approach, there is no need to transfer data to cloud servers or centralized systems. Edge AI enables faster, real-time decision-making by processing data locally, without sending it elsewhere.

For example, in fraud detection, instead of sending every transaction to a central server for analysis, the system can analyze transaction patterns locally in real time. If any unusual activity is detected, such as an abnormal withdrawal amount, location mismatch, or suspicious behavior, the system can instantly block the transaction or trigger an alert.

Why Real-Time Processing Matters

Real-time processing means a system can process data instantly and make decisions without delay. Even small delays in decision-making can create critical situations and lead to serious consequences. For example, an autonomous car must detect obstacles and react within milliseconds. If it relies on the cloud, the network round trip alone could make that reaction too late.
By processing data locally, Edge AI enables immediate decisions, such as braking or steering, making the system safer and more efficient.

Reduced Latency and Faster Decisions

Latency is the time it takes for data to travel to the cloud and back. Even a delay of a few milliseconds can be too slow for certain applications. With Edge AI:

- Data is processed instantly on the device itself.
- There’s no need to wait for a network response.
- Performance is faster, more reliable, and less dependent on connectivity.

For example, a voice recognition system on a smartphone can respond much faster when speech processing runs locally on the device, rather than relying on cloud or centralized servers.

Improved Privacy and Data Security

Sending sensitive data to the cloud raises privacy concerns, as it can be exposed during transmission or storage. Edge AI minimizes these risks by processing data directly on the local device instead of sending it to the cloud. This approach enhances data security and helps maintain user privacy, since sensitive information never leaves the device. It also supports compliance with data protection regulations and reduces the chances of unauthorized access or data breaches.

For example, a healthcare wearable that monitors heart activity should not transmit sensitive personal health data to external servers. Instead, it can analyze patterns locally on the device and instantly alert the user if any irregularities are detected. This approach not only protects patient privacy but also enables faster, real-time responses in critical situations. Such local processing is especially important in industries like banking, healthcare, finance, and smart homes, where data security and immediate decision-making are essential.

Reliability Without Internet Dependency

Edge devices can operate even without an internet connection, making them more stable and reliable in remote areas or environments with poor network coverage.
This ensures continuous performance without interruptions or delays caused by connectivity issues. As a result, critical applications can function smoothly regardless of network availability.

For example, a drone used in disaster rescue operations cannot depend on internet connectivity. It must process images locally and detect survivors in real time, enabling faster and more effective rescue efforts.

Lower Bandwidth Usage and Reduced Infrastructure Costs

Sending large amounts of data to the cloud consumes significant bandwidth and increases operational costs. Edge AI helps reduce these costs by processing data locally on the device. This minimizes the need for constant data transmission and optimizes network usage. Only relevant or critical information is sent to the cloud, making the system more efficient and cost-effective.

For example, a factory machine monitoring system can analyze sensor data locally and send alerts only when an issue is detected, instead of continuously streaming all the data.

Scalability and Cost Efficiency

Cloud processing for millions of devices can become expensive and resource-intensive. Edge AI addresses this by distributing computations across devices, reducing the load on central servers. This decentralized approach lowers infrastructure costs, improves scalability, and enhances overall system performance. It also reduces latency by minimizing the need for constant communication with the cloud.

For example, in a smart city, thousands of cameras can process data locally instead of sending everything to a central cloud system. This not only saves bandwidth and infrastructure costs but also enables faster, real-time insights and responses.

Better User Experience

Real-time processing significantly improves user experience by making systems feel faster, smoother, and more responsive. Quicker responses lead to higher user satisfaction and a more seamless interaction.
With Edge AI, data is processed instantly on the device, eliminating network delays and ensuring consistent performance. This is especially important for applications that require immediate feedback.

For example, in gaming or augmented reality (AR), local AI can render objects and interactions in real time, creating a smoother, more immersive, and engaging user experience.

An edge-based platform helps by enabling data processing and decision-making directly on devices, rather than relying entirely on centralized cloud systems. It supports faster, real-time responses by analyzing data locally, which is essential for applications that require immediate action. This leads to improved performance and reliability, especially in environments with limited or unstable internet connectivity. It also enhances data privacy and security by keeping sensitive information on the device, reducing the need for data transmission. Additionally, it optimizes bandwidth usage and lowers infrastructure costs by sending only meaningful insights or alerts to central systems instead of continuous raw data. Overall, this approach helps build systems that are faster, more efficient, secure, and scalable by bringing intelligence closer to where data is generated.

Conclusion

Edge AI is transforming modern systems by bringing intelligence closer to where data is created, enabling faster, real-time decision-making. It reduces latency and improves performance by processing data locally instead of relying on the cloud. This approach also enhances privacy and minimizes dependence on constant internet connectivity. Additionally, it helps reduce bandwidth usage and lowers infrastructure costs. From smart cities to healthcare and industrial automation, edge computing is driving a new era of faster, smarter, and more efficient systems.
Edge AI brings intelligence closer to where data is created, enabling real-time decisions, faster performance, enhanced privacy, and reliable operation without depending on constant connectivity.
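The factory monitoring example earlier in the article (analyze sensor data locally, send an alert only when something looks wrong) can be sketched in a few lines. This is an illustration under stated assumptions, not a production detector: the class name `LocalAnomalyDetector`, the rolling z-score rule, the window size, and the temperature readings are all hypothetical choices made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class LocalAnomalyDetector:
    """Keeps a rolling window of readings on the device and flags outliers."""

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.readings = deque(maxlen=window)
        self.threshold = threshold  # z-score above which a reading is anomalous
        self.warmup = warmup        # readings needed before checks start

    def process(self, value):
        """Return an alert dict if the value is anomalous, otherwise None."""
        alert = None
        if len(self.readings) >= self.warmup:
            mu = mean(self.readings)
            sigma = pstdev(self.readings)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                alert = {"value": value, "baseline": round(mu, 2)}
        if alert is None:
            # Only normal readings update the baseline.
            self.readings.append(value)
        return alert


# Hypothetical temperature stream with a single spike.
stream = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 95.0, 70.1]
detector = LocalAnomalyDetector()
alerts = []
for reading in stream:
    result = detector.process(reading)
    if result is not None:
        alerts.append(result)  # in a real device, this is what gets sent upstream

print(f"{len(alerts)} alert(s) sent for {len(stream)} readings")
```

In this sketch only the spike leaves the device, while the other readings are handled locally, which is the bandwidth and latency argument in miniature.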
Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance, and they often struggle to adapt to changing data and query patterns.

Databricks Liquid Clustering, introduced with Delta Lake 3.0, goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management.

Recap: Partitioning and Z-Order Limitations

Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables:

- Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult, and the choice is usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts, and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.
- Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.
- Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain: it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.
- Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning.
Data engineers must monitor query patterns and manually decide how to partition or when to re-run Z-Ordering. This ongoing maintenance is time-consuming and error-prone.

In summary, traditional partitioning and Z-Ordering yield performance benefits, but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution.

What Is Liquid Clustering?

Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name "liquid" signifies flexibility: data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include:

- Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.
- Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys, or file size tuning; the platform handles optimal file sizing and clustering internally. Even high-cardinality columns, which would be impractical as partition keys, can be used effectively.
- Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys.
There’s no massive upfront cost of re-partitioning the entire dataset; past data doesn’t need an immediate rewrite.
- Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.
- Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks.

Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering through "set it and forget it" clustering. In fact, thousands of customers have already adopted it: as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables.

Liquid Clustering vs Traditional Methods

Liquid Clustering addresses the limitations of partitions and Z-Ordering in several ways:

- No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.
- Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data.
Maintenance operations like OPTIMIZE still play a role, but they can operate more efficiently since the incoming data is already sorted/clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE + ZORDER: it only rearranges data that isn’t well clustered yet rather than always rewriting everything.
- Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.
- Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.
- Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less I/O and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset was 2.5× faster to optimize (cluster) with Liquid Clustering than with Z-Ordering, and yielded significantly better query performance than either partitioning or Z-Ordering.
In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly.

How Liquid Clustering Works (Under the Hood)

At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms:

- Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.
- Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files; it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.
- Data Skipping with Statistics: Delta Lake maintains per-file statistics, such as the minimum and maximum value of each column, that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters.

Enabling Automatic Clustering

To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations).
Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward:

SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering:

SQL

    -- Creating a Unity Catalog managed table with Automatic Liquid Clustering
    CREATE TABLE main.analytics.user_events (
      user_id STRING,
      event_type STRING,
      event_date DATE,
      details STRING
    )
    CLUSTER BY AUTO;  -- enables automatic liquid clustering on this table

To enable it on an existing table instead:

SQL

    ALTER TABLE main.analytics.user_events CLUSTER BY AUTO;

This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.)

PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark:

Python

    # df is a DataFrame we want to save as a Delta table with auto clustering
    df.write.format("delta") \
        .option("clusterByAuto", "true") \
        .mode("overwrite") \
        .saveAsTable("main.analytics.user_events_auto")

The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required; the system will figure it out if you leave it open.)

Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on serverless compute in the background.
The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times.

Using Manual Liquid Clustering (Custom Clustering Keys)

In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it:

Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example:

SQL

    -- Create a Delta table clustered by specific columns (manual clustering)
    CREATE OR REPLACE TABLE main.analytics.sales_data (
      sale_id BIGINT,
      region STRING,
      product STRING,
      sale_date DATE,
      amount DECIMAL(10,2)
    )
    CLUSTER BY (region, sale_date);

In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date.

Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance:

SQL

    ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date);

This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize.

Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example:

SQL

    OPTIMIZE main.analytics.sales_data;

The above will compact small files and cluster data incrementally.
If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data FULL. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time.

PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example:

Python

    # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys
    df.write.format("delta") \
        .mode("append") \
        .clusterBy("region", "sale_date") \
        .saveAsTable("main.analytics.sales_data")

Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys.

Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version to one that older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries.

Conclusion

Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidity of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For data engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further: clustering becomes a self-driving process, leveraging query insights to continuously improve performance.
In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start: it can simplify your architecture and help your tables stay optimized as your data (and its use cases) grow.
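The data-skipping mechanism described above can be illustrated without Spark. The sketch below is a toy model, not Delta Lake code. It assumes each file records only the minimum and maximum of one column, and it uses a plain sort as a stand-in for clustering (the real feature clusters on multiple keys with its own algorithm). The helper names `write_files` and `files_to_scan` and the data are hypothetical.

```python
import random


def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max stats per file."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": len(chunk)})
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for f in files if f["max"] >= lo and f["min"] <= hi)


random.seed(7)
values = [random.randint(1, 10_000) for _ in range(10_000)]

# Arrival order: every file spans almost the full value range.
unclustered = write_files(values, rows_per_file=500)
# Sorted on the filter column: each file covers a narrow value range.
clustered = write_files(sorted(values), rows_per_file=500)

# Equivalent of: WHERE key BETWEEN 4000 AND 4100
print("unclustered files scanned:", files_to_scan(unclustered, 4000, 4100))
print("clustered files scanned:  ", files_to_scan(clustered, 4000, 4100))
```

With data in arrival order, the min/max statistics cannot rule out any file. Once the same rows are organized by the filter column, only the file or two whose range overlaps the predicate needs to be read.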
The convergence of IoT, real-time data streaming, and modern frontend frameworks is redefining how engineers build enterprise monitoring systems. Over the course of designing and leading the Device IoT Platform (an enterprise-grade solution for real-time monitoring, configuration, and diagnostics of thousands of distributed network devices), I encountered and solved a core architectural challenge: how do you build a frontend dashboard that can handle hundreds of concurrent device telemetry streams without sacrificing performance, maintainability, or user experience? This article shares the architectural patterns, technology decisions, and hard-won lessons from that journey, covering the full stack from MQTT ingestion to Vue 3 reactivity to Kafka-backed event processing.

The Core Problem: Real-Time at Scale
Most developers are familiar with polling: periodically fetching data from an API endpoint. For IoT, polling is fundamentally broken:
Latency: A 5-second polling interval means up to 5 seconds of stale state.
Inefficiency: You're requesting data even when nothing has changed.
Scale: 1,000 devices × 1 request/5s = 200 requests/second just to read status, before any user interaction.

The solution is event-driven architecture: devices push telemetry when something changes, and the platform reacts. This requires rethinking both backend ingestion and frontend state management.
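The scale arithmetic above is easy to reproduce for other fleet sizes. The following Python sketch is illustrative only: the function name `polling_load` is made up here, and it assumes every device is polled exactly once per interval by a single client.

```python
def polling_load(devices: int, interval_s: float) -> dict:
    """Steady-state cost of fixed-interval polling.

    Assumes each device is polled once per interval by one client;
    every additional dashboard client multiplies the request rate.
    """
    if devices < 0 or interval_s <= 0:
        raise ValueError("devices must be >= 0 and interval_s must be > 0")
    return {
        # Requests issued per second across the whole fleet
        "requests_per_second": devices / interval_s,
        # A change that happens right after a poll is not seen until the next poll
        "worst_case_staleness_s": interval_s,
    }


if __name__ == "__main__":
    for fleet_size in (100, 1_000, 10_000):
        stats = polling_load(fleet_size, interval_s=5.0)
        print(fleet_size, stats)
```

Shortening the interval to reduce staleness raises the request rate proportionally, which is the trade-off a push-based design avoids.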
Architecture Overview

Plain Text
[IoT Devices]
      |
MQTT Broker (Mosquitto / AWS IoT Core)
      |
[Node.js Telemetry Microservice]
      |
[Kafka Topic: device.telemetry.raw]
      |  (stream processor)
[Kafka Topic: device.telemetry.enriched]
      |
[WebSocket Server (Node.js)]
      |
[Vue 3 Dashboard Frontend]

Each layer has a distinct responsibility:
MQTT Broker handles lightweight publish/subscribe with devices using minimal overhead.
Node.js microservices bridge MQTT → Kafka, performing initial validation and normalization.
Kafka provides durable, replayable event streams, which are critical for audit trails and late-joining consumers.
WebSocket server fans out enriched telemetry to connected dashboard clients in real time.
Vue 3 handles reactive rendering, ensuring only the affected UI components re-render when new data arrives.

Backend: MQTT → Kafka Bridge in Node.js
The heart of the ingestion pipeline is a lightweight Node.js service using the mqtt and kafkajs libraries.

TypeScript
import mqtt from 'mqtt';
import { Kafka } from 'kafkajs';

const mqttClient = mqtt.connect(process.env.MQTT_BROKER_URL!, {
  clientId: `telemetry-bridge-${process.pid}`,
  username: process.env.MQTT_USERNAME,
  password: process.env.MQTT_PASSWORD,
  clean: true,
});

const kafka = new Kafka({ clientId: 'iot-bridge', brokers: [process.env.KAFKA_BROKER!] });
const producer = kafka.producer();

mqttClient.on('connect', async () => {
  await producer.connect();
  mqttClient.subscribe('devices/+/telemetry', { qos: 1 });
  console.log('MQTT → Kafka bridge active');
});

mqttClient.on('message', async (topic, payload) => {
  // Topic shape is devices/<deviceId>/telemetry
  const deviceId = topic.split('/')[1];
  try {
    const data = JSON.parse(payload.toString());
    await producer.send({
      topic: 'device.telemetry.raw',
      messages: [
        {
          key: deviceId,
          value: JSON.stringify({ deviceId, timestamp: Date.now(), ...data }),
        },
      ],
    });
  } catch (err) {
    // A malformed payload or a failed send should not crash the bridge
    console.error(`Dropping telemetry from ${deviceId}:`, err);
  }
});

Key design decisions here:
QoS Level 1: ensures at-least-once delivery for telemetry messages.
For command acknowledgments, we use QoS 2.
Device ID as Kafka partition key: guarantees ordering per device while allowing parallel processing across partitions.
Separation of raw vs. enriched topics: the device.telemetry.raw topic contains the bare payload; a downstream stream processor enriches it with device metadata, geolocation, and alert thresholds before publishing to device.telemetry.enriched.

WebSocket Fan-Out Server
The WebSocket layer subscribes to Kafka's enriched topic and pushes updates to connected browser clients. We use Kafka consumer groups to allow horizontal scaling of the WebSocket tier.

TypeScript
// WebSocket is imported as well: it is needed for the type annotation and the OPEN constant
import { WebSocketServer, WebSocket } from 'ws';
import { Kafka } from 'kafkajs';

const wss = new WebSocketServer({ port: 8080 });
const kafka = new Kafka({ clientId: 'ws-fanout', brokers: [process.env.KAFKA_BROKER!] });
const consumer = kafka.consumer({ groupId: 'websocket-fanout-group' });

// Track subscriptions: deviceId → Set<WebSocket>
const deviceSubscriptions = new Map<string, Set<WebSocket>>();

wss.on('connection', (ws) => {
  ws.on('message', (msg) => {
    const { action, deviceId } = JSON.parse(msg.toString());
    if (action === 'subscribe') {
      if (!deviceSubscriptions.has(deviceId)) {
        deviceSubscriptions.set(deviceId, new Set());
      }
      deviceSubscriptions.get(deviceId)!.add(ws);
    }
  });

  ws.on('close', () => {
    // Remove the closed socket from every device it was subscribed to
    deviceSubscriptions.forEach((clients) => clients.delete(ws));
  });
});

async function startKafkaConsumer() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'device.telemetry.enriched' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const payload = JSON.parse(message.value!.toString());
      const clients = deviceSubscriptions.get(payload.deviceId);
      clients?.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) {
          client.send(JSON.stringify(payload));
        }
      });
    },
  });
}

startKafkaConsumer();

This design enables selective subscription: a dashboard user viewing 50 devices only receives telemetry for those 50 devices, not the full
firehose. This is critical for performance at scale.

Frontend: Vue 3 Reactive Architecture
The frontend is built with Vue 3 Composition API + Pinia for state management. The goal is to update only the UI components bound to a specific device when its telemetry arrives, not to re-render the entire dashboard.

WebSocket Composable

TypeScript
// composables/useDeviceTelemetry.ts
import { onMounted, onUnmounted } from 'vue';
import { useDeviceStore } from '@/stores/deviceStore';

export function useDeviceTelemetry(deviceIds: string[]) {
  const store = useDeviceStore();
  let ws: WebSocket | null = null;
  let reconnectAttempts = 0;
  let stopped = false; // set on unmount so a deliberate close does not trigger a reconnect

  const connect = () => {
    ws = new WebSocket(import.meta.env.VITE_WS_URL);

    ws.onopen = () => {
      reconnectAttempts = 0; // a successful connection resets the backoff
      deviceIds.forEach((id) => {
        ws!.send(JSON.stringify({ action: 'subscribe', deviceId: id }));
      });
    };

    ws.onmessage = (event) => {
      const telemetry = JSON.parse(event.data);
      store.updateDeviceTelemetry(telemetry.deviceId, telemetry);
    };

    ws.onclose = () => {
      if (stopped) return;
      // Exponential backoff reconnection: 1s, 2s, 4s, ... capped at 30s
      setTimeout(connect, Math.min(1000 * 2 ** reconnectAttempts++, 30000));
    };
  };

  onMounted(connect);
  onUnmounted(() => {
    stopped = true;
    ws?.close();
  });
}

Pinia Store with Fine-Grained Reactivity

TypeScript
// stores/deviceStore.ts
import { defineStore } from 'pinia';
import { reactive } from 'vue';

interface DeviceTelemetry {
  deviceId: string;
  status: 'online' | 'offline' | 'degraded';
  signalStrength: number;
  latency: number;
  lastSeen: number;
  alerts: string[];
}

export const useDeviceStore = defineStore('devices', () => {
  const telemetryMap = reactive<Record<string, DeviceTelemetry>>({});

  function updateDeviceTelemetry(deviceId: string, data: Partial<DeviceTelemetry>) {
    if (!telemetryMap[deviceId]) {
      telemetryMap[deviceId] = {} as DeviceTelemetry;
    }
    // Merge in place so only the changed properties trigger dependents
    Object.assign(telemetryMap[deviceId], data);
  }

  return { telemetryMap, updateDeviceTelemetry };
});

Using reactive() with a map structure means Vue's dependency tracking is at the property level: a component subscribed to telemetryMap['device-001'].signalStrength won't
re-render when device-002's data changes. This is the key to dashboard scalability.

Device Card Component

Vue
<!-- components/DeviceCard.vue -->
<template>
  <div class="device-card" :class="statusClass">
    <div class="device-header">
      <span class="device-id">{{ deviceId }}</span>
      <StatusBadge :status="telemetry?.status" />
    </div>
    <div class="metrics">
      <MetricBar label="Signal" :value="telemetry?.signalStrength" unit="dBm" />
      <MetricBar label="Latency" :value="telemetry?.latency" unit="ms" />
    </div>
    <AlertList :alerts="telemetry?.alerts ?? []" />
  </div>
</template>

<script setup lang="ts">
import { computed } from 'vue';
import { useDeviceStore } from '@/stores/deviceStore';

const props = defineProps<{ deviceId: string }>();
const store = useDeviceStore();

// Only this device's slice of state, so re-renders stay targeted
const telemetry = computed(() => store.telemetryMap[props.deviceId]);

const statusClass = computed(() => ({
  'status-online': telemetry.value?.status === 'online',
  'status-offline': telemetry.value?.status === 'offline',
  'status-degraded': telemetry.value?.status === 'degraded',
}));
</script>

Performance Optimizations

1. Virtual Scrolling for Large Device Lists
When monitoring 500+ devices, rendering all device cards simultaneously tanks performance. We use vue-virtual-scroller to render only the visible cards:

Vue
<RecycleScroller
  class="device-list"
  :items="filteredDevices"
  :item-size="120"
  key-field="deviceId"
  v-slot="{ item }"
>
  <DeviceCard :device-id="item.deviceId" />
</RecycleScroller>

2. Debounced Batch Updates
Devices can emit bursts of telemetry. Updating the Pinia store on every single message causes excessive re-renders.
We batch incoming messages within a 100ms window:

TypeScript
// `store` is the Pinia device store returned by useDeviceStore()
let pendingUpdates: Record<string, Partial<DeviceTelemetry>> = {};
let batchTimeout: ReturnType<typeof setTimeout> | null = null;

function queueUpdate(deviceId: string, data: Partial<DeviceTelemetry>) {
  // Merge with any update already queued for this device in the current window
  pendingUpdates[deviceId] = { ...(pendingUpdates[deviceId] ?? {}), ...data };

  if (!batchTimeout) {
    batchTimeout = setTimeout(() => {
      Object.entries(pendingUpdates).forEach(([id, update]) => {
        store.updateDeviceTelemetry(id, update);
      });
      pendingUpdates = {};
      batchTimeout = null;
    }, 100);
  }
}

3. Lazy Loading and Code Splitting
Device diagnostic panels (charts, event logs, configuration editors) are loaded on demand:

TypeScript
const DeviceDiagnostics = defineAsyncComponent(
  () => import('@/components/DeviceDiagnostics.vue')
);

Combined with route-level code splitting, the initial bundle stays under 200KB gzipped.

Security Architecture: OAuth 2.0 + RBAC
Device management platforms require fine-grained access control. Not every user should be able to issue firmware update commands to production devices.
JWT Claims-Based RBAC
We encode role information directly in the JWT access token:

JSON
{
  "sub": "user-123",
  "roles": ["device:read", "device:configure"],
  "scope": "region:us-east",
  "exp": 1699999999
}

The frontend reads these claims to conditionally render action buttons, and the backend validates them on every API call:

TypeScript
// middleware/rbac.ts
// Types are assumed to come from Express; verifyJWT is the project's own helper (not shown)
import type { Request, Response, NextFunction } from 'express';

export function requirePermission(permission: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const token = req.headers.authorization?.split(' ')[1];
    if (!token) {
      return res.status(401).json({ error: 'Missing bearer token' });
    }

    let decoded;
    try {
      decoded = verifyJWT(token);
    } catch {
      // Invalid signature or expired token
      return res.status(401).json({ error: 'Invalid token' });
    }

    if (!decoded.roles?.includes(permission)) {
      return res.status(403).json({ error: 'Insufficient permissions' });
    }
    next();
  };
}

// Route definition
router.post('/devices/:id/firmware', requirePermission('device:firmware'), handleFirmwareUpdate);

Deployment: CI/CD on AWS
The entire platform is containerized and deployed via a GitLab CI/CD pipeline to AWS ECS with Fargate.

YAML
# .gitlab-ci.yml (excerpt)
stages:
  - test
  - build
  - deploy

build-and-push:
  stage: build
  script:
    - docker build -t $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA .
    - docker push $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA

deploy-production:
  stage: deploy
  script:
    - aws ecs update-service --cluster iot-platform --service frontend --force-new-deployment
  environment: production
  only:
    - main

Blue-green deployments ensure zero downtime for this 24/7 critical infrastructure platform.

Results and Key Metrics
After migrating from a polling-based architecture to this event-driven stack:
Dashboard latency: reduced from 5–10 seconds (polling) to under 200ms (WebSocket push).
Backend API load: reduced by ~78%, as telemetry pushes replaced constant polling.
Frontend bundle size: kept under 220KB gzipped through lazy loading and tree-shaking.
Throughput: validated at 10,000 concurrent telemetry events/second through Kafka partitioning.
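To make the claims-based check from the RBAC section concrete, here is a language-neutral sketch in Python. It is not the platform's implementation: the function name `is_allowed` is hypothetical, it assumes the token's signature has already been verified, and it assumes the `scope` claim always has the `region:<name>` form shown in the sample token.

```python
import time
from typing import Any, Mapping, Optional


def is_allowed(
    claims: Mapping[str, Any],
    permission: str,
    region: Optional[str] = None,
    now: Optional[float] = None,
) -> bool:
    """Decide whether already-verified JWT claims grant `permission`.

    Assumes the signature was checked elsewhere; this only inspects claims.
    """
    now = time.time() if now is None else now

    # Reject tokens with a missing or past expiry
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return False

    # The permission must be listed explicitly in the roles claim
    if permission not in claims.get("roles", []):
        return False

    # Optional region check, assuming a "region:<name>" scope claim
    if region is not None and claims.get("scope") != f"region:{region}":
        return False

    return True


if __name__ == "__main__":
    sample = {
        "sub": "user-123",
        "roles": ["device:read", "device:configure"],
        "scope": "region:us-east",
        "exp": 1699999999,
    }
    print(is_allowed(sample, "device:configure", region="us-east", now=1_600_000_000))
    print(is_allowed(sample, "device:firmware", now=1_600_000_000))
```

Passing `now` explicitly keeps the check deterministic in tests. The sample user can configure devices but cannot trigger firmware updates, which matches the intent of guarding the firmware route separately.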
Conclusion Building a real-time IoT dashboard at enterprise scale requires rethinking the entire data flow — from device communication protocols through streaming pipelines to fine-grained frontend reactivity. The combination of MQTT for lightweight device communication, Kafka for durable event streaming, WebSockets for real-time push to browsers, and Vue 3's targeted reactivity model creates a system that scales gracefully without sacrificing developer ergonomics. The patterns described here — selective WebSocket subscriptions, batched Pinia updates, virtual scrolling, and JWT-based RBAC — have been validated in production on a platform serving critical network infrastructure. They are broadly applicable to any domain requiring real-time monitoring at scale: energy management, fleet tracking, industrial automation, and beyond. Github: Real-Time-IoT-Dashboards-Vue-3-MQTT-Kafka
Starting a Google data migration usually feels easy at first. You set it up, watch the progress bar move, and assume everything is fine. Then the migration reaches 99% and just stays there. Users start wondering if something went wrong, if emails are missing, or if the process will ever finish. The good news is that this issue is common, and most of the time it’s not serious. In this blog, we’ll look at why Google data migration gets stuck at 99% and what you can do to fix it, including backing up your Google data.

Why Is Your Google Data Migration Pending at 99%?
There can be many reasons for a Google data migration getting stuck at 99%. Here are some possibilities:
Migrating a large amount of data takes time and can slow things down.
A slow internet connection can interrupt the data export.
There can be compatibility issues with certain data types.
Insufficient storage on either the source or target servers can hinder migration progress.
If the data being transferred is corrupted or has an integrity issue, the migration may be unable to complete successfully.

What to Do When Google Data Migration Is Stuck at 99%?
When a Google data migration gets stuck at 99%, users can feel confused and a bit stressed. Everything looks almost done, but nothing is transferring anymore. You keep checking the status, hoping it will finish on its own. In this section, we’ll go through some simple steps you can take to move things forward when Google Data Migration is not completing:

Solution 01: Wait, Then Refresh
The first thing to try is simply waiting. A lot of migrations finish on their own after a few hours. The progress bar isn’t always honest about what’s happening.
Wait for around 30-60 minutes.
Do not close the web browser or log out of the Google Admin Console.
If the problem persists, refresh the migration page.
Solution 02: Test Your Internet Connection
A poor internet connection is the most common reason why Google Data Migration does not complete.
Do not use VPNs or proxy networks, as they can slow down the transfer process.
During the migration process, do not download heavy files.
Once your internet connection has stabilized, refresh the migration page.

Solution 03: Manage Mailbox Size
Oversized mailboxes can affect the Google data migration process. If the migration includes very large attachments, it can get stuck.
Log in to the source account, find the unnecessary emails and attachments, and delete them.
Empty the Trash and Spam folders to reduce the total size.
You can also try splitting the migration into two batches for a faster and more efficient migration.

Solution 04: Resume Migration for Pending Data
When the Google Data Migration status is stuck at 99%, it usually means only a few emails are still pending.
Check the migration status from the Google Admin migration dashboard.
If the migration is stuck, pause it and then resume the process. Also, if there is an option to migrate only the new or remaining items, choose that option.

Backup Google Data to Fix Google Data Migration Stuck at 99%
The Cigait Gmail Backup Tool is easy to use and does what it claims. It helps you back up Gmail emails when the Google data migration is stuck at 99%. The process is simple and doesn’t take much effort. You can select which data or emails you want to transfer. The advanced filters let you skip emails you don’t need, which saves time and keeps things clean. The tool doesn’t only work with emails; it can also back up Google Drive files, contacts, calendars, photos, and other Google data.
Follow the given steps to fix the Google data migration pending at 99%:
First, install this tool, then launch it on your system.
After that, enter your Email ID and App password and click Sign In.
Now, select and preview your files and hit Next.
Then, select PDF from the given list of file formats.
Afterward, apply any optional features and filters you need and click Next.
After that, click Save Path, choose the location, and save your file.
Lastly, press Download to begin the process.

Final Words
Google data migration stuck at 99% is a common issue and can be really frustrating, but it usually isn’t serious. Most of the time, the last few files or emails just need extra time to complete. If the issue keeps coming back, the Gmail Backup Tool discussed above can help you back up the Google data. With a little patience, the transfer process can be completed without losing data.

Frequently Asked Questions
Q1. Why does Google data migration get stuck at 99%?
Ans. Google data migration usually gets stuck at 99% because a few emails or files take time to transfer. Old messages, attachments, or items in the Spam and Trash folders can slow down the final step. Most of the time, the migration is still running in the background and needs more time to finish.
Q2. How long can Google's data migration stay at 99%?
Ans. Google’s data migration can stay at 99% for a few hours and sometimes up to a day. The progress bar doesn’t always update properly near the end, even though the migration is still running in the background.
Q3. What should I do if the Google data migration is not completing?
Ans. If the Google data migration is not completing, first wait for some time, then check the status again. Often it completes on its own. If it stays stuck, try restarting the migration process. If the issue keeps recurring, using the Gmail backup tool can help move the remaining emails and other data without getting stuck.
Q4.
Why does Google data migration stay pending at 99% for large mailboxes? Ans. Google data migration stays pending at 99% for large mailboxes because there is more data to process. A few large emails, old messages, or attachments can take extra time to transfer, which slows down the final part of the migration.
In data analytics, the way tables are laid out matters: it affects how well systems perform and how far they can scale. Open Table Formats (OTFs) are now a core part of how modern data systems are built. They make it easier to work across systems and to get more value out of data. In this blog post we will look at what Open Table Formats are, walk through some popular examples, and help you figure out which one best fits your data analytics needs.

What Are Open Table Formats?
Open Table Formats are open specifications for organizing data in tables. Because no single vendor owns them, they are designed to work with many tools and systems. The goal is interoperability: data can be shared and used smoothly regardless of the engine or platform involved.

Popular Open Table Formats
Here are some widely used Open Table Formats:

Apache Iceberg
Apache Iceberg is a table format for working with large datasets in a controlled way. It provides ACID transactions, which guarantee that changes to the data are applied correctly, and snapshot isolation, so readers can query the data without being affected by concurrent writes.
Apache Iceberg also supports schema evolution, which means the structure of a table can change without rebuilding the table from scratch. It is particularly useful for teams that manage large datasets in data lakes.
Advantages: Ensures data consistency, supports efficient queries, and enables schema evolution without data loss.
Use Cases: Ideal for data lakes requiring transactional guarantees and schema flexibility.

Delta Lake
Delta Lake is an open-source storage layer that brings reliability to data lakes. It uses ACID transactions so that concurrent readers and writers do not conflict, handles table metadata at scale, and unifies streaming and batch data processing.
Advantages: Provides data reliability, supports data versioning (you can go back and look at earlier versions of the data), and works well with existing data processing tools.
Use Cases: Suitable for data lakes requiring reliability, data versioning, and unified data processing.
Apache Hudi
Apache Hudi is an open-source data management framework. It simplifies incremental data processing, that is, applying new and changed records to the data you already have, and it makes data pipelines on data lakes easier to build.
Advantages: Supports incremental data processing, keeps track of data versions, and simplifies data ingestion and querying.
Use Cases: Ideal for data lakes requiring incremental data processing and data pipeline management.

Choosing the Right Open Table Format
Picking an Open Table Format for your analytics needs means weighing several factors: your use cases, the characteristics of your data, compatibility with the systems you use, and the performance you require. Here are the key considerations:

Use Cases and Workloads
Transactional guarantees: If you need safe transactions and consistent data, consider formats such as Apache Iceberg or Delta Lake, both of which provide ACID transactions.
Incremental data processing: For teams that process data incrementally and manage data pipelines, Apache Hudi is an option to consider.

Data Characteristics
Data volume: Consider how much data you will have to store and process. Some formats handle large datasets better than others, and the difference matters most at scale.
Data complexity: Assess how complex your data is, including the data types involved and whether you expect to change how the data is organized over time. Formats such as Apache Iceberg and Delta Lake let you make schema changes without much trouble.

Ecosystem Compatibility
Integration: When you choose an Open Table Format, make sure it works well with the data processing frameworks and tools you already use. For example, Delta Lake works with Apache Spark.
The format and your existing tools should work together smoothly.
Cloud platforms: Check that the OTF is compatible with the cloud platform you want to use, or with your on-premises infrastructure.

Performance Requirements
Query performance: Evaluate how well the OTF handles queries for your analytical workloads. Test it against the kind of work you actually run to see whether it returns results fast enough.
Data ingestion: Assess how well the format supports data ingestion, especially for real-time or near-real-time analytics. Consider how it behaves with large volumes of data and how quickly it can process them.

Open Table Formats play an important role in modern data systems. They make it easier to work across systems and offer a wide range of capabilities. If you understand what distinguishes formats such as Apache Iceberg, Delta Lake, and Apache Hudi, you can pick the one that best fits your organization. Consider what your data is like, what you want to do with it, which tools you use with it, and how it needs to perform, and choose accordingly.
The Day Everything Looked Fine, Until It Wasn’t
The dashboards were green. Every test passed. And yet, by morning, the company’s revenue had mysteriously dropped by roughly $1 million. The data team huddled together, blinking at their screens.
Schema checks? It looked good.
Nulls? Checks passed, and everything appeared to be in order.
Completeness? It looked good.
Nothing looked wrong, except that something was causing the business to bleed. What they didn’t know yet was that an innocent iOS app update had quietly scrambled the order of user events. To the system, customers were suddenly purchasing before browsing. The models didn’t break in code; they broke in meaning. The team discovered a crucial lesson: even flawless data systems can mislead without true observability.

Why “Good Data” Isn’t Good Enough Anymore
There was a time when data quality was the gold standard and a measure of success. DQ checks meant your dataset was protected. If your dataset was clean, complete, and validated, your insights were gold. But that was back when pipelines were simple, ETL jobs ran once a night, and life was predictable.
Back then, most data was read by people, not systems. Analysts looked at dashboards after the fact, asked questions when numbers felt off, and applied judgment before anyone made a real decision. If a table landed late or a metric looked strange, someone usually noticed, often before it caused real damage. Data quality checks were designed for this world: static, batch-oriented, and tolerant of human interpretation. But as technology changed, so did expectations.
Today’s world is different. This shift matters most for data engineers, analytics engineers, and platform teams responsible for the reliability of downstream dashboards, APIs, and machine learning systems. Modern cloud-native companies run thousands of interdependent batch and streaming pipelines, constantly feeding dashboards, APIs, and machine learning systems.
A single column rename, a delayed partition, or an unnoticed schema tweak can quietly throw everything off course. Traditional data quality is like checking your car’s oil once a month. Data observability involves installing a dashboard that provides real-time alerts when the engine is overheating.

The Shift: From Data Quality to Data Observability
Data quality answers the question: “Is this dataset correct right now?” Data observability asks something deeper: “Is my data behaving as it should?”

Aspect | Data Quality | Data Observability
Focus | Data-at-rest | Data-in-motion
Checks | Accuracy, completeness, validity | Freshness, volume, distribution, schema, lineage
When | Point-in-time | Continuous
Goal | Ensure correctness | Ensure reliability
View | Local | End-to-end

The Five Pillars of Data Observability
Freshness: Is data arriving on time relative to SLAs?
Volume: Are record counts within expected ranges?
Distribution: Have key statistics (e.g., averages, percentiles) drifted unexpectedly?
Schema: Did upstream fields change without notice?
Lineage: What depends on what, and who owns it?
Together, these pillars act as an early-warning system for your data ecosystem, sensing changes before they cause downstream impact.

The Story Behind the $1M Drop
Our e-commerce company’s recommendation engine accounted for 40% of revenue. After a routine app update, click-throughs fell by 15%, conversions by 22%, and revenue tumbled. And yet, all quality checks still passed.

Check | Status | Missed Insight
Schema | ✅ | Timestamps changed meaning
Nulls | ✅ | Events arrived out of sequence
Ranges | ✅ | Valid values, wrong order

Data quality confirmed the structure. It missed the story.
Event order sounds like a minor detail, but for recommendation models, it’s foundational. Browsing before purchasing means something very different than purchasing before browsing. When that sequence flipped, nothing crashed; the model simply learned the wrong story about customers.
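A sequence check of the kind that was missing here can be quite small. The Python sketch below is an illustration under stated assumptions, not the team's actual monitor: the function name `purchase_before_view_rate`, the `(session_id, event_type)` tuple shape, and the event type strings are all invented for the example. It walks events in arrival order and reports the share of purchasing sessions whose purchase showed up before any view.

```python
from typing import Iterable, Tuple

# Hypothetical event shape: (session_id, event_type), listed in arrival order
Event = Tuple[str, str]


def purchase_before_view_rate(events: Iterable[Event]) -> float:
    """Fraction of purchasing sessions whose first purchase arrived before any view."""
    sessions_with_view = set()
    purchasing_sessions = set()
    inverted_sessions = set()

    for session_id, event_type in events:
        if event_type == "view":
            sessions_with_view.add(session_id)
        elif event_type == "purchase":
            purchasing_sessions.add(session_id)
            # A purchase with no earlier view in this session is out of sequence
            if session_id not in sessions_with_view:
                inverted_sessions.add(session_id)

    if not purchasing_sessions:
        return 0.0
    return len(inverted_sessions) / len(purchasing_sessions)


if __name__ == "__main__":
    healthy = [("s1", "view"), ("s1", "add_to_cart"), ("s1", "purchase")]
    broken = [("s2", "purchase"), ("s2", "view"), ("s2", "add_to_cart")]
    print(purchase_before_view_rate(healthy + broken))
```

Every individual event in the broken session is valid, which is why null, range, and schema checks pass; only a check that looks across events can see the inversion. An alert threshold on this rate would be a judgment call for each team.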
Since the data remained complete, valid, and schema-compliant, every traditional check passed, even as the model’s understanding of user behavior quietly unraveled.

The Hidden Issue

The iOS app began batching events. They arrived six hours late and out of order.

Before (Healthy) | After (Broken)
View → Add to Cart → Purchase | Purchase → View → Add to Cart

The model interpreted chaos as logic, and that’s when recommendations became noise.

How Observability Would Have Saved the Day

Within two hours, an observability system would have screamed:

Freshness Alert: Event lag jumped from 5 minutes to 360 minutes
Distribution Alert: 78% of events out of sequence
Lineage Alert: iOS v1.3.0 deployed, impacting 47 tables and degrading 12 ML models

Approach | Detection | Root Cause | Resolution Time
Data Quality | Missed | Undetected | 3 days
Data Observability | Caught early | iOS v1.3.0 deployment | 6 hours

Observability didn’t just find the broken data; it connected the dots to the moment things went wrong. The real win wasn’t just catching the issue faster. It was knowing exactly what changed, when it changed, and how far the damage spread. That made it possible to roll back quickly and explain what happened without guesswork. Without observability, teams debate symptoms. With it, they start acting on causes.

Building Observability Step by Step

So how does a modern data team move from reactive firefighting to proactive confidence?

1. Define Data Contracts

Every dataset has a clear, versioned schema (YAML, Avro, Protobuf). Contracts live in code and are automatically validated before pipelines run and before new data is added to the dataset. Data contracts are often the first thing teams skip. They feel slow, bureaucratic, and unnecessary, right up until a breaking change slips through and every downstream table starts lying.

2. Add Freshness & Volume Monitors

Track how long data takes to arrive and whether counts fall outside norms. Each row’s updated-at timestamp should fall within the defined SLO.
Define SLOs such as “99% of partitions land within 10 minutes.” Without explicit SLOs, delays are discovered only after dashboards update, or fail to. By then, decisions have already been made on stale data.

3. Strengthen Tests

Layer dbt’s `not_null` and `unique` tests with drift tests, e.g., “average session_length stays within 10% of baseline,” or “count of new orders placed stays within 10% of the baseline.” Basic checks are good at catching broken tables, but they don’t tell you when data starts behaving differently. Drift tests exist for the uncomfortable cases where everything looks valid but isn’t.

4. Emit Lineage

Integrate OpenLineage with Airflow or dbt to visualize dependencies and trace impact instantly. Without lineage, every alert triggers a manual investigation. With it, teams can immediately see blast radius and ownership.

5. Centralize Visibility

Bring all signals into one pane of glass. When freshness lives in one tool, lineage in another, and alerts in Slack, every incident turns into a scavenger hunt. Pulling those signals together is what turns alerts into answers. Now, when an alert fires, you know what broke, where, and who’s responsible.

A Familiar Pattern

If this story sounds familiar, it’s because it’s happening everywhere.

Teams at Netflix have described recommendation quality degrading after upstream data schemas changed without downstream safeguards.
Uber has publicly discussed timezone-related bugs that impacted time-based systems, including pricing and incentives.
Airbnb has shared incidents where aggressive deduplication and data-cleaning logic removed valid records.
Stripe has written extensively about how tiny currency-rounding errors can quietly compound into material financial discrepancies at scale.

Different problems, same root cause: great data quality, no visibility.

Let’s Distill the Lesson: Quality Validates. Observability Protects.

Data quality ensures your data is correct.
Data observability ensures your system stays trustworthy. In today’s interconnected world, where every pipeline is a domino, observability isn’t a luxury; it’s a seatbelt. So the next time your dashboard shows that comforting little green badge labeled “Fresh & Verified,” remember: behind that glow lies a safety net of observability quietly keeping your business upright.
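The freshness and ordering alerts described in the incident above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitor: the event shape `(session_id, occurred_at, arrived_at)`, the function names, and the 10-minute and 5% thresholds are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical event shape: (session_id, occurred_at, arrived_at)
Event = Tuple[str, datetime, datetime]


def max_lag_minutes(events: List[Event]) -> float:
    """Largest gap, in minutes, between when an event occurred and when it arrived."""
    return max((arrived - occurred).total_seconds() / 60 for _, occurred, arrived in events)


def out_of_order_fraction(events: List[Event]) -> float:
    """Share of events that arrive after a later event from the same session."""
    latest_seen: Dict[str, datetime] = {}
    late = 0
    for session, occurred, _ in sorted(events, key=lambda e: e[2]):  # replay in arrival order
        if session in latest_seen and occurred < latest_seen[session]:
            late += 1
        else:
            latest_seen[session] = occurred
    return late / len(events)


def check(events: List[Event], lag_slo_minutes: float = 10.0,
          max_out_of_order: float = 0.05) -> List[str]:
    """Return alert messages; both thresholds are illustrative assumptions."""
    alerts = []
    lag = max_lag_minutes(events)
    if lag > lag_slo_minutes:
        alerts.append(f"freshness: max lag {lag:.0f} min exceeds {lag_slo_minutes:.0f} min SLO")
    fraction = out_of_order_fraction(events)
    if fraction > max_out_of_order:
        alerts.append(f"ordering: {fraction:.0%} of events arrived out of sequence")
    return alerts


t0 = datetime(2025, 1, 1, 12, 0)
minute = timedelta(minutes=1)
# View -> Add to Cart -> Purchase, each arriving 5 minutes after it happened
healthy = [("s1", t0 + i * minute, t0 + (i + 5) * minute) for i in range(3)]
# Same events, batched: the purchase arrives first, about six hours late
broken = [
    ("s1", t0 + 2 * minute, t0 + 360 * minute),  # purchase
    ("s1", t0, t0 + 361 * minute),               # view
    ("s1", t0 + minute, t0 + 362 * minute),      # add to cart
]
print(check(healthy))
print(check(broken))
```

Both batches would pass null, range, and schema checks; only the lag and ordering signals tell them apart, which is the gap between quality and observability the article describes.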
Large messaging platforms rarely collapse because authentication is broken. They collapse because authorization quietly expands, then stays expanded. The failure mode is not a single bug but a system property: credentials that were created for one narrow purpose become reusable, long-lived, and operationally too useful, until they function as capability grants far beyond the original intent. The industry has spent a decade hardening identity proofing and login defenses, yet incident reports keep circling back to the same operational reality: leaked tokens, misconfigured partner integrations, and automation scripts that inherit privileges no one remembers granting. What turns these common events into major incidents is blast radius. A single credential ends up authorizing too much surface area across assets, APIs, and workflows that were never meant to be coupled. That coupling is not malicious. It is entropy. In large platforms, shortcuts accumulate because they reduce friction for onboarding, rollout, and support. A token minted for setup becomes a token used for management. A scope added temporarily remains because removing it might break revenue-critical traffic. Over time, the platform’s authorization model stops describing reality and starts describing what teams wish were true. This is why overprivileged tokens should be treated as a platform failure, not a security bug. A platform that cannot bound token damage will repeatedly trade safety for continuity during pressure, and continuity will win every time. Assume Compromise: A Design Constraint Security guidance often says to assume compromise, but many systems still behave as if compromise is an edge case. An authorization design that truly assumes compromise treats every token as potentially leaked and optimizes for containment, not prevention. That changes the objective function: you are no longer trying to stop every unauthorized access. You are trying to make every credential failure cheap. 
In practice, this pushes a platform toward three invariants:

Tokens must be purpose-specific and asset-bound.
Authorization must be enforceable at runtime, not only at mint time.
Migration must preserve business continuity, or it will be bypassed.

If any one of these is missing, the platform will drift back toward one token that works everywhere, because it is operationally convenient.

Granular Tokens: Turning Credentials Into Bounded Capabilities

A granular token is not a JWT with scopes. It is a capability grant with explicit boundaries that survive refactors. At a minimum, you want the token to encode:

Subject: who the token represents (partner, service, automation identity)
Assets: which specific resources it can act on (business account, phone number, template namespace, etc.)
Actions: what it can do (send message, read profile, manage templates, rotate keys)
Context: how it was minted and intended to be used (channel, onboarding version, risk tier)

A minimal JSON representation (conceptual) looks like this:

JSON
{
  "sub": "partner:acme",
  "aud": "messaging-api",
  "exp": 1767225600,
  "scopes": ["message.send", "profile.read"],
  "assets": ["acct:WABA_12345"],
  "context": {
    "channel": "api",
    "onboarding_version": "v2",
    "risk_tier": "standard"
  }
}

The containment story is straightforward. If this token leaks, the worst-case impact is bounded by the assets and scopes embedded in the token. You do not need an emergency revocation that breaks unrelated integrations because the token never had cross-asset authority in the first place. That is the first half of the fix. The second half is where most platforms fail.
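To make the containment claim concrete, here is a minimal Python sketch of the token above as a bounded capability. The `GranularToken` class and its `permits` and `blast_radius` methods are hypothetical names introduced for illustration; only the claim names and values come from the JSON example.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class GranularToken:
    """Hypothetical in-memory view of the claims shown in the JSON example."""
    sub: str
    scopes: FrozenSet[str]
    assets: FrozenSet[str]
    exp: int  # expiry as Unix seconds

    def permits(self, asset: str, scope: str, now: int) -> bool:
        """Allow only unexpired use of a listed scope on a listed asset."""
        return now < self.exp and asset in self.assets and scope in self.scopes

    def blast_radius(self) -> Set[Tuple[str, str]]:
        """Every (asset, scope) pair a leaked copy of this token could exercise."""
        return set(product(self.assets, self.scopes))


token = GranularToken(
    sub="partner:acme",
    scopes=frozenset({"message.send", "profile.read"}),
    assets=frozenset({"acct:WABA_12345"}),
    exp=1767225600,
)
print(token.permits("acct:WABA_12345", "message.send", now=1767000000))
print(token.permits("acct:WABA_99999", "message.send", now=1767000000))
print(sorted(token.blast_radius()))
```

The worst case is enumerable: the blast radius is the product of the token's assets and scopes, so a leak of this token exposes two capabilities on one account rather than the whole platform.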
Static Permissions Do Not Survive Platform Reality

Even with granular tokens, the platform still needs to answer questions the token cannot predict:

Is this token suddenly being used from a new environment or automation pipeline?
Is the request pattern anomalous relative to the identity’s baseline?
Is the target asset in a degraded state or under investigation?
Is the subject verified, suspended, or constrained by policy changes?

If those conditions matter, and in large platforms they always do, then authorization cannot be “token is valid → allow.” It must be a runtime decision that incorporates policy, state, and signals. A typical evaluation path is a policy engine that receives a normalized request context, the parsed token, and a small set of risk signals.

Kotlin-style pseudocode:

Kotlin
data class RequestContext(
    val subject: String,
    val requiredScope: String,
    val targetAsset: String,
    val channel: String,
    val requestIp: String,
    val userAgent: String
)

data class TokenClaims(
    val active: Boolean,
    val scopes: Set<String>,
    val assets: Set<String>,
    val riskTier: String
)

enum class Decision { ALLOW, DENY, CHALLENGE }

fun authorize(ctx: RequestContext, token: TokenClaims, risk: Double): Decision {
    if (!token.active) return Decision.DENY
    if (ctx.requiredScope !in token.scopes) return Decision.DENY
    if (ctx.targetAsset !in token.assets) return Decision.DENY

    // Risk gating: throttle, step-up, or challenge instead of global revocation
    if (risk >= 0.85) return Decision.CHALLENGE

    return Decision.ALLOW
}

Two details matter here. First, the challenge is not a UX flourish. It is an operational safety valve that lets you contain suspicious use without detonating the entire integration ecosystem. In partner-heavy platforms, blanket revocation often costs more than the incident you are trying to stop, which is how systems end up tolerating risk. Second, this logic must be uniform. If each service re-implements its own checks, drift returns through inconsistency.
The enforcement layer must be a shared middleware or gateway component, not a set of best-practice docs.

Shared Enforcement Libraries Prevent Policy Drift

At platform scale, ad hoc checks become a reliability problem. One forgotten endpoint becomes the bypass. One outdated library becomes the weakest link. The correct abstraction is a shared enforcement module that every API integrates with, so policy changes do not require coordinated redeploys across dozens of teams.

Kotlin middleware sketch:

Kotlin
class AuthzMiddleware(private val policy: PolicyEngine) {
    fun enforce(ctx: RequestContext, token: TokenClaims, risk: Double) {
        when (policy.evaluate(ctx, token, risk)) {
            Decision.ALLOW -> return
            Decision.CHALLENGE -> throw TooManyRequestsException("Risk threshold exceeded")
            Decision.DENY -> throw ForbiddenException("Not authorized")
        }
    }
}

interface PolicyEngine {
    fun evaluate(ctx: RequestContext, token: TokenClaims, risk: Double): Decision
}

This shifts authorization from scattered conventions to programmable governance. It also makes audits feasible. You can explain what rule allowed or denied a request, because the rule is centralized and versioned.

Migration: The Part Everyone Underestimates

The technical design is not the hard part. Migration is. Most large platforms cannot revoke legacy tokens quickly without breaking high-value partners or revenue-critical flows. If the migration plan assumes immediate compliance, teams will invent exceptions, and exceptions become the new default. A safe migration path looks less like a rewrite and more like controlled containment:

Phase 1: Parity Audit
Ensure every legacy capability exists in the new model. Missing parity guarantees shadow workarounds.

Phase 2: Dual-Path Issuance
New onboarding flows mint granular tokens. Legacy flows continue, but you instrument usage to learn what those tokens actually do.
Phase 3: Progressive Restriction
Start restricting the highest-risk scopes and the widest asset access first, while leaving low-risk functionality untouched.

Phase 4: Deprecation Based on Observed Usage
Deprecate legacy tokens only after usage drops below an agreed threshold and partner replacements are proven.

This is not slow for the sake of caution. It is a recognition that platforms are socio-technical systems. Authorization controls that ignore operational incentives will be bypassed.

Verification Data Is Not a Badge. It Is an Input Signal

Verification systems are often framed as UX trust indicators, but their deeper value is policy. Verified entities can have different scope ceilings, different rate limits, different escalation paths, and different anomaly thresholds. That only works if the verification state is consistent and centralized. Multiple sources of truth for verification create two failures: increased attack surface and unpredictable enforcement. Consolidating verification data is therefore not merely hygiene. It is a prerequisite infrastructure for consistent authorization.

Observability: Authorization Decisions Must Be Explainable

If authorization is a runtime decision, observability becomes part of the authorization system. You need structured events that allow you to reconstruct “what was allowed, why, and under which policy version.”

A compact event schema:

JSON
{
  "token_id": "tok_abc123",
  "subject": "partner:acme",
  "asset": "acct:WABA_12345",
  "scope": "message.send",
  "decision": "ALLOW",
  "policy_version": "2026-01-28.3",
  "risk_score": 0.12,
  "timestamp": "2026-01-28T10:42:00Z"
}

Without this, incident response degrades into guesswork. Teams become afraid to tighten policy because they cannot predict impact, and the platform returns to permissive defaults.

Why This Matters Now

Messaging platforms have become commerce rails, identity brokers, and customer support infrastructure. Tokens do not merely send messages.
They trigger workflows, expose regulated data, and create downstream consequences that are hard to unwind. In that environment, overprivileged tokens are not a theoretical risk. They are latent incidents waiting for scale and human error to align. The durable systems are not the ones with the most complicated policy language. They are the ones that assume credentials fail and make failure cheap. Overprivileged tokens are rarely a single mistake. They are the result of authorization drift under operational pressure. The fix is not a lecture about least privilege. The fix is an architecture that enforces least privilege at runtime, uses shared libraries to prevent divergence, migrates without breaking continuity, and emits evidence for every decision. At platform scale, trust is not maintained by perfect prevention. It is maintained by designing for containment.
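The "emits evidence for every decision" requirement can be sketched as a small helper that builds the event schema shown earlier and appends it to an audit sink. The function names `build_decision_event` and `emit`, and the in-memory list standing in for a real log pipeline, are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_decision_event(token_id: str, subject: str, asset: str, scope: str,
                         decision: str, policy_version: str, risk_score: float,
                         now: Optional[datetime] = None) -> dict:
    """Build one audit record using the field names from the event schema above."""
    now = now or datetime.now(timezone.utc)
    return {
        "token_id": token_id,
        "subject": subject,
        "asset": asset,
        "scope": scope,
        "decision": decision,
        "policy_version": policy_version,
        "risk_score": risk_score,
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def emit(event: dict, sink: List[str]) -> None:
    """Serialize with sorted keys so identical decisions produce identical lines."""
    sink.append(json.dumps(event, sort_keys=True))


audit_log: List[str] = []  # stand-in for a real log stream or event bus
event = build_decision_event(
    token_id="tok_abc123",
    subject="partner:acme",
    asset="acct:WABA_12345",
    scope="message.send",
    decision="ALLOW",
    policy_version="2026-01-28.3",
    risk_score=0.12,
    now=datetime(2026, 1, 28, 10, 42, tzinfo=timezone.utc),
)
emit(event, audit_log)
print(audit_log[0])
```

Recording the policy version alongside each decision is what lets a team replay past traffic against a stricter policy before rolling it out.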
LLM-powered agents are gaining popularity, and 2026 is set to be the year of agents, just as 2025 was. Generative AI applications have moved beyond ordinary chatbots and search-and-retrieve systems toward autonomous agents that break larger tasks into smaller sub-tasks and achieve a goal while interacting with their environment. Before diving deeper into LLM-powered agents and the tools used to create them, let's start by answering the most important question.

What is an Agent?

According to the gold-standard definition from the textbook Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig, "An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators."

A vacuum cleaning robot is a good example of an agent. It uses sensors such as cameras, dirt detectors, bump sensors, and infrared sensors to gather information about its surroundings. To interact with and act upon its environment, it relies on actuators like wheels for movement, brushes for sweeping, and a suction motor for collecting dirt. This agent also performs a perception-action cycle to achieve its goals.

Plain Text
1. PERCEIVE → Sensors detect dirty floor ahead
2. DECIDE → Agent decides to move forward and clean
3. ACT → Wheels move, brushes spin, motor sucks dirt
4. REPEAT → Continue until floor is clean

The term percept refers to the input the agent receives at a given moment; the percept sequence is the history of everything the agent has perceived. Broadly, agents can be divided into the following categories:

Simple Reflex Agents: These are the simplest kind of agent and follow a rule-based, "if this, then that" approach to problem solving.
These agents act only on the current input, or percept, and ignore everything else in the percept history.

Model-Based Reflex Agents: These are also reflex agents, but they make more informed decisions by maintaining an internal state that tracks the parts of the environment they have visited but cannot observe right now. If the environment changes, that internal state, and with it the agent's behavior, needs to be updated.

Goal-Based Agents: These agents work backwards from a desired goal and plan their actions in accordance with it. Unlike simple reflex agents, their decision making considers what will happen if an action is performed, that is, the impact of the current choice on future states.

Utility-Based Agents: Utility-based agents also work backwards from a desired goal, but they additionally optimize for a metric. Performance is measured by a utility function, and the agent tries to achieve its goal while maximizing that function. For example, for an agent designed to find the shortest path between two points, the goal is to find a path between the two points, and the utility is higher the shorter that path is.

Learning Agents: The most advanced type of agent. In Russell and Norvig's description it has four main elements: the learning element, the performance element, the critic, and the problem generator. Imagine an agent that decomposes a task, critiques its current set of actions, and finally suggests actions that will lead to new experiences.

We will look at and construct simple examples of most of these agents. While a vacuum uses physical sensors, an LLM agent 'perceives' through text inputs and API responses, and 'acts' by generating text or calling functions. Just as a vacuum can be a simple bump-and-turn robot or a sophisticated room-mapper, LLM agents vary in intelligence, and we can categorize them into the same five levels of sophistication. For an LLM-powered agent, the "sensor" is the user input.
These agents usually have four main components:

The agent core, or the agent brain
The planning module
Memory
Tools

Theoretical definitions are fine, but how do we build them? We will use the LangChain and LangGraph frameworks to build our first LLM-powered agent. Both are open source frameworks used to build LLM-powered applications. The choice depends on the type of agent we are building and the level of control we want over the agentic architecture.

LangChain is an open source framework with a pre-built agent architecture and integrations for any model or tool, so you can build agents that adapt as fast as the ecosystem evolves. LangChain is built around the idea of a chain, or pipeline: a sequence of steps executed in linear order. A LangChain workflow can be treated as a DAG (Directed Acyclic Graph) in which tasks flow in one direction without any cycles or loops.

LangGraph is great at handling complex workflows with loops, decision branches, and complex decision trees. It also provides great flexibility and control over each component of the agent setup.

In this article we will create agents using both LangChain and LangGraph to understand the pros, cons, and usage of both frameworks. For Simple Reflex Agents, which follow a straight line (Input -> Output), LangChain is our go-to framework. However, as we move toward Model-Based, Goal-Based, and Utility-Based Agents that require loops, self-correction, and state management, we need the flexibility of LangGraph.

Let’s see this evolution in action by building a 'Chef Agent' that grows smarter with every iteration. We will see how we can go from a simple reflex agent for this task to a utility-based agent as we add complexity.

Let's set up our environment for this task. We will use Groq to call a gpt-oss model through its API. You will need to get your API access key from here; we will be using open-source models for this exercise.
We will start by installing the required libraries.

Python
pip install langchain langchain-groq langgraph langchain-core pydantic python-dotenv

Next, we will set up a few variables that we will use across all our agents: the API key and the model name.

Python
# config
import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from a local .env file, if one exists

# Set your API key (left untouched if GROQ_API_KEY is already set, e.g. via .env)
os.environ.setdefault("GROQ_API_KEY", "YOUR_API_KEY")
model = 'openai/gpt-oss-safeguard-20b'

Now let's start by building a simple reflex agent. A simple reflex agent maps the current input directly to an action: in this case, if the user asks for a recipe, it suggests one. Conceptually it behaves like an if-else block. We use LangChain to create this agent, and the agent suggests whatever recipe the user requests.

Python
######### llm brain #############
import os
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup Groq LLM
llm = ChatGroq(
    temperature=0,
    model_name=model,
    api_key=os.environ.get("GROQ_API_KEY")
)

# 1. Perception -> Action (Direct Chain)
reflex_prompt = ChatPromptTemplate.from_template(
    "You are a chef. Given a request: {input}, provide a single recipe immediately."
)
reflex_agent = reflex_prompt | llm | StrOutputParser()

# invoke the brain
print(reflex_agent.invoke({"input": "I want a spicy pasta."}))

Imagine you start using this agent to get a recipe for your dinner tonight, but you lack the ingredients needed to prepare the food. Such a bummer! Wouldn't it be nice to have an agent that knows the ingredients in your pantry and your dietary preferences before suggesting a recipe?

Our Reflex Agent is fast, but it’s 'forgetful.' It suggests a spicy pasta even if you have no pasta in your pantry or a gluten allergy.
To make it a Model-Based Reflex Agent, we must give it a way to track the 'internal state' of its world: specifically, your pantry inventory and dietary needs. For this, we move to LangGraph, which allows the agent to maintain a persistent memory (State) and use tools to 'sense' its environment.

Python
import operator
import os
from typing import Annotated, List, Literal

from langchain.tools import tool
from langchain.messages import AnyMessage, HumanMessage, SystemMessage
from langchain_groq import ChatGroq
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, InjectedState
from typing_extensions import TypedDict

# Step 1: Define tools and model
llm = ChatGroq(
    temperature=0,
    model_name=model,
    api_key=os.environ.get("GROQ_API_KEY")
)

@tool
def get_inventory(state: Annotated[dict, InjectedState]):
    "Get current user inventory"
    return state["inventory"]

@tool
def get_dietary_prefs(state: Annotated[dict, InjectedState]):
    "Get user dietary preferences"
    return state["dietary_prefs"]

@tool
def get_history(state: Annotated[dict, InjectedState]):
    "Get user history"
    # History is optional in the state, so fall back to an empty list
    return state.get("history", [])

# Augment the LLM with tools
tools = [get_inventory, get_dietary_prefs, get_history]
tools_by_name = {t.name: t for t in tools}
model_with_tools = llm.bind_tools(tools)

# Step 2: Define state
class RecipeState(TypedDict):
    inventory: List[str]       # What's in the fridge
    dietary_prefs: List[str]   # e.g., "Vegetarian"
    history: List[str]         # Past requests (optional)
    suggestion: str            # The output
    messages: Annotated[List[AnyMessage], operator.add]

# Step 3: Define model node
def llm_call(state: dict):
    """LLM decides whether to call a tool or not"""
    # Combine system message with user messages
    messages_to_send = [
        SystemMessage(
            content="""You are a helpful chef agent. When the user asks for a recipe, look at the user's current inventory and dietary preferences to suggest the recipe. You can use the available tools whenever needed."""
        )
    ] + state["messages"]
    response = model_with_tools.invoke(messages_to_send)
    return {
        "messages": [response],
        "inventory": state.get("inventory", []),
        "dietary_prefs": state.get("dietary_prefs", [])
    }

# Step 4: Define tool node
tool_node = ToolNode(tools)

# Step 5: Define logic to determine whether to end
# Conditional edge function to route to the tool node or end,
# based upon whether the LLM made a tool call
def should_continue(state: RecipeState) -> Literal["tool_node", END]:
    """Decide if we should continue the loop or stop based upon whether the LLM made a tool call"""
    messages = state["messages"]
    last_message = messages[-1]
    # If the LLM makes a tool call, then perform an action
    if last_message.tool_calls:
        return "tool_node"
    # Otherwise, we stop (reply to the user)
    return END

# Step 6: Build agent
# Build workflow
agent_builder = StateGraph(RecipeState)

# Add nodes
agent_builder.add_node("llm_call", llm_call)
agent_builder.add_node("tool_node", tool_node)

# Add edges to connect nodes
agent_builder.add_edge(START, "llm_call")
agent_builder.add_conditional_edges(
    "llm_call",
    should_continue,
    ["tool_node", END]
)
agent_builder.add_edge("tool_node", "llm_call")

# Compile the agent
agent = agent_builder.compile()

# Show the agent (works in a notebook environment)
from IPython.display import Image, display
display(Image(agent.get_graph(xray=True).draw_mermaid_png()))

# Invoke
messages = [HumanMessage(content="Suggest a recipe for pasta")]
current_inventory = ["pasta", "water", "tomatoes", "salt", "parmesan"]
current_dietary_prefs = ["vegetarian"]
messages = agent.invoke({
    "inventory": current_inventory,
    "dietary_prefs": current_dietary_prefs,
    "messages": messages
})
for m in messages["messages"]:
    m.pretty_print()

This agent considers not just the current user request, but also takes into account the user environment, in
this case the pantry, and suggests recipes based on the items actually available there.

Having memory is a start, but a truly intelligent chef doesn't just look at what's in the fridge; it works toward a specific outcome. This brings us to Goal and Utility-Based Agents. In this next evolution, the agent doesn't just suggest any recipe; it must satisfy a 'Goal': suggesting a meal that fits within a specific calorie budget. This requires a Planning/Verification loop where the agent critiques its own suggestion before presenting it to you.

Python
# Step 1: Define tools and model
import operator
import os
from typing import Annotated, List, Literal

from langchain.tools import tool
from langchain.messages import AnyMessage, HumanMessage, SystemMessage
from langchain_groq import ChatGroq
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, InjectedState
from typing_extensions import TypedDict

llm = ChatGroq(
    temperature=0,
    model_name=model,
    api_key=os.environ.get("GROQ_API_KEY")
)

@tool
def get_inventory(state: Annotated[dict, InjectedState]):
    "Get current user inventory"
    return state["inventory"]

@tool
def get_dietary_prefs(state: Annotated[dict, InjectedState]):
    "Get user dietary preferences"
    return state["dietary_prefs"]

@tool
def get_remaining_calories_range(state: Annotated[dict, InjectedState]):
    "Get the (minimum, maximum) range of remaining calories"
    remaining = state["total_calories"] - state["consumed_calories"]
    margin = state["error_margin_calories"]
    return (remaining - margin, remaining + margin)

# Augment the LLM with tools (the calorie tool must be registered to be callable)
tools = [get_inventory, get_dietary_prefs, get_remaining_calories_range]
tools_by_name = {t.name: t for t in tools}
model_with_tools = llm.bind_tools(tools)
tool_node = ToolNode(tools)

# Step 2: Define state
class RecipeState(TypedDict):
    inventory: List[str]         # What's in the fridge
    dietary_prefs: List[str]     # e.g., "Vegetarian"
    suggestion: str              # The output
    total_calories: int          # The total number of calories to consume
    consumed_calories: int       # The number of calories already consumed
    error_margin_calories: int   # Calories that can be added to or removed from the total
    num_tries: int               # The number of planner calls made so far
    messages: Annotated[List[AnyMessage], operator.add]

# Step 3: Define Planner Node
def planner_node(state: RecipeState):
    """Suggests a recipe and estimates calories."""
    messages_to_send = [
        SystemMessage(content=(
            "You are a chef. Suggest a recipe based on inventory and dietary prefs. "
            "IMPORTANT: You MUST provide an estimated calorie count for the meal. "
            "You can use the available tools whenever needed."
        ))
    ] + state["messages"]
    response = model_with_tools.invoke(messages_to_send)
    # Count every planner call (including tool round trips) so the loop stays bounded
    return {"messages": [response], "num_tries": state.get("num_tries", 0) + 1}

def verify_search_node(state: RecipeState):
    """Checks if the suggested meal fits the calorie constraints."""
    last_message = state["messages"][-1].content
    # Simple logic: ask the LLM to extract or verify,
    # or use regex/logic if you want to be strict.
    prompt = f"""
    The user's goal is {state['total_calories']} calories (margin: +/- {state['error_margin_calories']}).
    Current consumed: {state['consumed_calories']}.
    The chef suggested: {last_message}
    Does this recipe fit the remaining calorie budget?
    If yes, reply 'VALID'. If no, explain why.
    """
    verification_response = llm.invoke([HumanMessage(content=prompt)])
    # We add this to messages to keep track of the critique
    return {"messages": [verification_response]}

def should_verify(state: RecipeState) -> Literal["verify_node", "tool_node"]:
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tool_node"
    # If no tool calls, the LLM has made a suggestion. Now verify it.
    return "verify_node"

def is_it_valid(state: RecipeState) -> Literal["planner_node", END]:
    verdict = state["messages"][-1].content.strip().upper()
    # Match the start of the reply so "INVALID" or "not valid" is not read as a pass
    if verdict.startswith("VALID"):
        return END
    # Give up after too many attempts instead of looping forever
    if state.get("num_tries", 0) > 5:
        return END
    # If not valid, loop back to the planner to try again
    return "planner_node"

# Build workflow
agent_builder = StateGraph(RecipeState)
agent_builder.add_node("planner_node", planner_node)
agent_builder.add_node("tool_node", tool_node)
agent_builder.add_node("verify_node", verify_search_node)
agent_builder.add_edge(START, "planner_node")

# Route from Planner
agent_builder.add_conditional_edges(
    "planner_node",
    should_verify,
    {"tool_node": "tool_node", "verify_node": "verify_node"}
)

# Route from Tool back to Planner
agent_builder.add_edge("tool_node", "planner_node")

# Route from Verifier back to Planner OR End
agent_builder.add_conditional_edges(
    "verify_node",
    is_it_valid,
    {"planner_node": "planner_node", END: END}
)
agent = agent_builder.compile()

# Show the agent (works in a notebook environment)
from IPython.display import Image, display
display(Image(agent.get_graph(xray=True).draw_mermaid_png()))

# Invoke
messages = [HumanMessage(content="Suggest a recipe for pasta")]
current_inventory = ["pasta", "water", "tomatoes", "salt", "parmesan"]
current_dietary_prefs = ["vegetarian"]
messages = agent.invoke({
    "inventory": current_inventory,
    "dietary_prefs": current_dietary_prefs,
    "total_calories": 1000,
    "consumed_calories": 800,
    "error_margin_calories": 100,
    "num_tries": 0,
    "messages": messages
})
for m in messages["messages"]:
    m.pretty_print()

We have seen how an agent evolves from a simple 'If-Then' reflex into a sophisticated system capable of maintaining state and verifying its own goals. By moving from LangChain’s linear chains to LangGraph’s cyclic workflows, we’ve bridged the gap between basic automation and autonomous reasoning.
However, the true power of agents lies in their ability to optimize for complex preferences and learn from their mistakes. Because Utility-Based Agents and Learning Agents involve more intricate scoring functions and feedback loops, we will dedicate our next entire article to mastering those advanced architectures.
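The verification node in the goal-based agent asks an LLM whether the recipe fits the budget, and its comment notes that regex or plain logic could be used for a stricter check. The following is a minimal sketch of that stricter alternative. It assumes the recipe text states its estimate as a number followed by "kcal", "cal", or "calories"; the function names and the pattern are illustrative and not part of LangChain or LangGraph.

```python
import re
from typing import Optional, Tuple

# Assumption: the estimate appears as "<number> kcal", "<number> cal", or "<number> calories"
CALORIE_PATTERN = re.compile(r"(\d[\d,]*)\s*(?:k?cal\b|calories\b)", re.IGNORECASE)


def extract_calories(text: str) -> Optional[int]:
    """Return the first calorie figure found in the text, or None if there is none."""
    match = CALORIE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def remaining_range(total: int, consumed: int, margin: int) -> Tuple[int, int]:
    """(minimum, maximum) calories the next meal may contain."""
    remaining = total - consumed
    return (max(remaining - margin, 0), remaining + margin)


def fits_budget(text: str, total: int, consumed: int, margin: int) -> bool:
    """True only if a calorie figure is present and falls inside the remaining range."""
    calories = extract_calories(text)
    if calories is None:
        return False  # no estimate means the planner should try again
    low, high = remaining_range(total, consumed, margin)
    return low <= calories <= high


# Same budget as the invoke call above: 1000 total, 800 consumed, margin 100
print(remaining_range(1000, 800, 100))
print(fits_budget("Simple tomato pasta, roughly 250 kcal per serving.", 1000, 800, 100))
print(fits_budget("Creamy parmesan pasta, about 650 calories.", 1000, 800, 100))
```

A check like this could back the routing decision after the verify step, leaving the LLM critique to explain why a suggestion failed rather than to decide whether it did.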
Most discussions about AI model training focus on architecture choices, compute budgets, and evaluation benchmarks. The data pipeline that feeds those models? It gets a paragraph, maybe two. Maybe a diagram with an arrow labeled "data ingestion." That gap is a real problem. In practice, data engineering is where most AI projects quietly fall apart. Not at the model level. Not at inference. At the pipeline. I've spent the last two years building multimodal data infrastructure at Abaka AI, delivering datasets to frontier AI companies training next-generation reasoning and conversational models. The lessons below come from that work, specifically from the parts that broke in unexpected ways and forced us to rethink assumptions we didn't realize we were making. Multimodal Means Multiple Failure Modes A text pipeline has one content type to worry about. A multimodal pipeline has many: scanned books, handwritten documents, photographs, structured tables, diagrams, audio transcripts, and video frames are all possible inputs, and each one breaks differently. The first mistake most teams make is treating multimodal ingestion as a collection of separate pipelines that happen to feed the same model. That sounds reasonable until you need consistency across modalities, and you realize your text preprocessing strips the metadata that your image pipeline needs to correctly associate figures with captions. Now you have two clean pipelines producing corrupted outputs. The second mistake is assuming format-level parsing solves the content problem. A PDF parser that correctly extracts text may still produce garbage if the source document used a two-column layout, footnotes interspersed with body text, or embedded mathematical notation. Correct extraction is necessary but not sufficient. What actually works is treating each document type as a first-class object with its own quality contract. For each modality, define what "clean" means before writing a single line of processing code. 
For scanned text, clean might mean a character error rate below 2% with section headers preserved. For images, it might mean resolution above a minimum threshold with alt-text that accurately describes content rather than just naming the file. Write those contracts down. They become your QC spec, and more importantly, they give you something to argue about before problems appear rather than after.

Pipeline Speed Is a Product Feature, Not a Bonus

When we delivered our first dataset to a large enterprise AI client, the turnaround from contract signing to delivery was 21 days. That wasn't a coincidence or a heroic sprint. It was a consequence of pipeline decisions made months earlier: batch size calibration, parallelized QC, and pre-built validation tooling that didn't require human review at every stage.

The conventional wisdom in data engineering is that quality gates slow you down. That's true if the gates are manual. Automated quality checks, built early and run on every batch, are what make speed possible in the first place. If a document fails your character error rate threshold, it gets routed to a remediation queue immediately, not discovered weeks later during model evaluation.

There's also a less obvious payoff. When you can tell a client exactly how long processing takes for a given document type and volume, you become predictable. Predictability is what turns a one-time data vendor relationship into a sustained one. Clients who can plan around your delivery cadence will plan around it. Clients who can't will find someone else.

The Annotation Problem Nobody Wants to Solve

Annotation is the part of multimodal data engineering that everyone wishes could be automated, and almost never fully can be. For some tasks, models are good enough at self-annotation that human review can be reduced to sampling.
For others, especially tasks involving nuanced reasoning, spatial relationships in images, or domain-specific knowledge, you still need human annotators who understand the material.

The failure mode I see most often is annotation pipelines that treat all tasks as equivalent. Same workflow, same annotator pool, same quality threshold. That breaks when a task requires specialized knowledge. A general annotator pool might correctly label object presence in images but produce low-quality labels for questions about whether a scanned diagram accurately represents a logic circuit.

Segment your annotation tasks by complexity and required expertise before building the workflow. For high-complexity tasks, route to specialists and build in consensus checking. For lower-complexity tasks, use larger annotator pools with statistical agreement thresholds. Keep the two workflows separate, even if they feed the same output schema.

Version Control for Data Is Not Optional

This sounds obvious. Most teams still don't do it properly. The issue is that data versioning gets treated as a documentation problem: label your dataset with a date, note what changed, move on. But if you can't reproduce a specific dataset version exactly, including its preprocessing parameters, source document selection, and annotation schema, you can't debug model regressions that trace back to data changes.

We run full lineage tracking on every dataset we produce. Every document has a source identifier, an ingestion timestamp, a processing version, and a list of applied transformations with parameters. When a client reports unexpected model behavior, we can trace it to a specific data batch and often to a specific preprocessing decision.

The tooling for this isn't exotic. A well-structured metadata store and a deterministic transformation pipeline are mostly sufficient. The discipline of actually using them consistently across every pipeline stage is the harder part.
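To show how little tooling the record itself needs, here is a minimal sketch of a per-document lineage entry carrying the four fields listed above. The class names, field names, and the idea of fingerprinting the record with SHA-256 over canonical JSON are assumptions for illustration, not the schema we actually run.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Transformation:
    """One applied processing step and the parameters it ran with."""
    name: str
    params: dict


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical per-document lineage entry."""
    source_id: str
    ingested_at: str          # ISO 8601 timestamp, stored as text
    processing_version: str
    transformations: tuple[Transformation, ...] = ()

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) makes the hash
        # independent of dict key order, so equal records hash equally.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def records_using(records: list[LineageRecord], step_name: str) -> list[LineageRecord]:
    """Find every document that passed through a given transformation."""
    return [r for r in records if any(t.name == step_name for t in r.transformations)]
```

Two records with the same content produce the same fingerprint regardless of parameter ordering, and any changed parameter produces a different one. That property is what lets a regression be narrowed to a specific preprocessing decision rather than a date label.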
What Skipping Ingestion QC Actually Costs You

Here's a failure pattern I've seen in multiple production pipelines. A team ingests a large corpus of scanned documents. OCR looks reasonable on a spot check. They run embeddings and deliver the dataset. Three months later, the client reports degraded model output on a specific class of questions involving tables and structured data.

Pull the relevant training documents. About 15% of the scanned tables were ingested with columns transposed, because the OCR engine misread the column separator characters. The model learned from that corrupted structure. The degradation was there from day one, and nobody caught it because the spot check didn't include table-heavy documents at a meaningful sample rate.

The fix is not more spot checks. The fix is structured validation at ingestion: schema-aware quality checks that specifically test the document features most likely to corrupt model training, run automatically before any document enters the processing queue.

Catching problems at ingestion is ten times cheaper than catching them during evaluation. Catching them during evaluation is still far cheaper than catching them after deployment.

Building for a Model Team You Don't Control

One dynamic that rarely gets discussed in data engineering: you often don't know exactly how the consuming model team will use the dataset you produce. They may apply additional filtering. They may upsample certain document types. They may concatenate your dataset with other sources in ways that create distribution shifts you never anticipated. All of this affects whether your data, however well-produced, actually helps the model.

The practical response is to over-document. Deliver metadata that tells the model team what's in the dataset in granular detail: distribution across document types, languages, topic areas, source characteristics, and annotation confidence scores. Build your delivery format around their needs, not yours.
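As a rough illustration of that kind of delivery metadata, the sketch below summarizes a batch into a small manifest. The `summarize` function and the per-document keys (`document_type`, `language`, `annotation_confidence`) are hypothetical names chosen for the example; a real delivery would cover more dimensions and follow whatever format the model team asks for.

```python
from collections import Counter
from statistics import mean


def summarize(records: list[dict]) -> dict:
    """Build a small delivery manifest from per-document metadata."""
    total = len(records)

    def share(key: str) -> dict:
        # Fraction of documents per value; missing values are counted
        # explicitly as "unknown" rather than dropped.
        counts = Counter(r.get(key, "unknown") for r in records)
        return {value: round(n / total, 4) for value, n in sorted(counts.items())}

    confidences = [r["annotation_confidence"] for r in records
                   if "annotation_confidence" in r]

    return {
        "document_count": total,
        "document_type_share": share("document_type"),
        "language_share": share("language"),
        "annotation_confidence": (
            {"mean": round(mean(confidences), 4), "min": min(confidences)}
            if confidences else None
        ),
    }
```

Reporting "unknown" as its own bucket is a deliberate choice here: a model team that upsamples by document type needs to know how much of the batch has no type at all.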
If you can establish a feedback loop where model evaluation results flow back to the data team, build it early and protect it. That loop is how you build datasets that improve over time instead of delivering batches and hoping.

A Note on What This Work Actually Is

Multimodal data engineering tends to get framed as infrastructure work, a supporting function that enables the "real" AI work to happen. I think that framing causes teams to underinvest in it and then be surprised when things go wrong.

The pipeline decisions that look like implementation details (annotation segmentation, version lineage, ingestion-time QC, and delivery format conventions) are the decisions that determine whether your data actually improves the models it trains. A model trained on well-engineered data outperforms one trained on carelessly processed data with the same architecture. That's been true in every project I've worked on.

What you're really building isn't a pipeline. It's a quality assurance system that happens to move data through stages. The sooner teams internalize that distinction, the fewer late-stage surprises they'll face, and the fewer post-mortems they'll have to write.
Salman Khan
Director Data Science,
Afiniti
Fawaz Ghali, PhD