DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Data

Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.

icon
Latest Premium Content
Trend Report
Data Engineering
Data Engineering
Refcard #269
Getting Started With Data Quality
Getting Started With Data Quality
Refcard #398
Open-Source Data Management Practices and Patterns
Open-Source Data Management Practices and Patterns

DZone's Featured Data Resources

Why Your DLP Policies Fall Short the Moment AI Agents Enter the Picture

Why Your DLP Policies Fall Short the Moment AI Agents Enter the Picture

By Priyanka Neelakrishnan
I have been working in enterprise data security for a while now, and I have watched the threat landscape shift many times. Ransomware, phishing, insider threats, and cloud misconfigurations. Each wave brought new problems, and organizations learned, adapted, and invested. But what is happening today with AI agents feels different. It is not just a new attack vector. It is a fundamental change in how data moves inside an organization, and most security teams are not ready for it. Let me explain what I mean. Traditional Data Loss Prevention (DLP) was designed with a pretty clear mental model: a human employee sits at a computer, touches sensitive data, and either accidentally or intentionally tries to move it somewhere they should not. Your DLP policy watches for that. It flags the email with the credit card numbers, blocks the USB upload, or quarantines the cloud sync. It works because there is a human in the loop, and human behavior has patterns that security tools can learn. AI agents break that model entirely. An agent does not hesitate before accessing a file. It does not trigger behavioral anomalies because it was granted permission to do exactly what it is doing. It can read thousands of documents in the time it takes a human to open one. And if it is compromised, misconfigured, or simply pointed at the wrong thing, it can exfiltrate data at a scale and speed that no human attacker could match. That is the invisible threat, and it is sitting inside enterprise environments right now. Why AI Agents Are Different From Every Other Threat Before getting into the specific risks, it is worth stepping back to understand what makes agentic AI architecturally different from previous automation tools. Traditional automation scripts or bots were narrow. They did one thing. A script that pulled a report from a database every morning did not have the context or capability to go read your HR files or send data to an external API. The attack surface was small and well-defined. AI agents, by contrast, are designed to be general-purpose. They use large language models to reason about tasks, and they are given tools: the ability to read files, call APIs, browse the web, write to databases, send messages, and interact with other services. This is what makes them powerful for automation. It is also what makes them dangerous from a security standpoint. When you give an agent access to your document store to help employees find information faster, you have also given it, in principle, the ability to read everything in that store. When you connect it to your email system so it can draft replies, you have opened a channel through which data can flow. The agent is not malicious. It is doing exactly what it was built to do. The problem is that the existing security infrastructure was never designed to supervise something that behaves like a trusted user but operates at machine scale. The 5 Data Security Risks Unique to AI Agents 1. Over-Permissioned OAuth Scopes and Shadow Data Access This is the one I see most often in enterprise deployments, and it is almost always accidental. When development teams integrate an AI agent with a SaaS platform, whether it is SharePoint, Google Drive, Salesforce, or Slack, they need to grant the agent API access. The path of least resistance is to grant broad OAuth scopes. Read all files. Access all channels. The agent needs it for the use case, so the scope gets approved, and nobody revisits it. What this creates is a situation where the agent has access to data it will never actually need for its intended job, but which it can reach if something goes wrong. A prompt injection attack, a bug in the agent's reasoning logic, or a malicious instruction buried in a document the agent was asked to summarize could all redirect the agent to access and transmit that shadow data. The NASA ITAR filtering issue from 2019 is a useful reference here, even though it predates AI agents. A security control that was too broad caused operational disruption. The same principle applies in reverse: an agent granted too-broad access can cause a data exposure that was never intended by anyone in the organization. 2. Prompt Injection Leading to Data Leakage Prompt injection is probably the most discussed AI security risk right now, and for good reason. The basic idea is that an attacker can embed instructions inside content that the agent will read, effectively hijacking the agent's behavior. Here is a concrete scenario. An enterprise deploys an AI agent that monitors incoming emails and summarizes them for executives. An attacker sends a carefully crafted email that contains, embedded in normal-looking text, instructions telling the agent to forward all emails it reads to an external address. If the agent's output layer is not properly sandboxed, this kind of attack can succeed without the attacker ever breaking into any system. They just sent an email. This is qualitatively different from phishing. Phishing targets humans and relies on human error. Prompt injection targets the agent and relies on the agent doing exactly what it was designed to do, which is to follow instructions in its input. From a DLP perspective, the data exfiltration looks like authorized activity because the agent was authorized to send data. 3. Retrieval-Augmented Generation Pipelines Pulling Sensitive Context RAG systems, where an agent retrieves documents from an internal knowledge base to ground its responses, are becoming standard in enterprise AI deployments. They are genuinely useful. They are also a data security problem that most teams have not fully thought through. When a user asks a RAG-enabled agent a question, the system searches the knowledge base and pulls in relevant documents as context for the model. The model then uses that context to generate a response. The issue is that the retrieval step is often not governed by the same access controls as direct document access. An employee who does not have permission to read a particular HR policy document might be able to ask the agent a question that causes the agent to retrieve and summarize that document for them. This is not a hypothetical. It is a real architectural gap that exists in many early-stage enterprise RAG deployments. The knowledge base was indexed without granular access metadata, and the retrieval system does not know whether the person asking the question should have access to the documents it is about to surface. 4. Agent-to-Agent Data Passing With No Human Review The next wave of enterprise AI is multi-agent systems, where specialized agents hand off tasks to each other. An orchestrator agent receives a request, breaks it into subtasks, delegates those subtasks to specialized agents, and aggregates the results. This is efficient. It is also a chain of data handling that has no human checkpoint anywhere in the middle. From a security standpoint, this creates what I would call a provenance problem. When data moves through three or four agent hops before producing a final output, it becomes very difficult to audit what data was accessed, what was transmitted between agents, and where the output ended up. Traditional DLP watches data at egress points, but in a multi-agent pipeline, the egress points are not always obvious, and intermediate agent-to-agent communication may not be captured at all. The Capital One breach in 2019 demonstrated how a chain of access privileges, even if each individual link looks authorized, can result in catastrophic data exposure. Multi-agent pipelines create the same kind of daisy-chained access, but at a speed and scale that makes the Capital One incident look slow. 5. AI as a Supply Chain Risk This one is less talked about but deserves attention. Enterprise organizations are increasingly building agents on top of third-party foundation models and agent frameworks. When you do that, you are trusting not just the model's capabilities but also the data handling practices of the model provider and the framework maintainers. If a third-party agent framework has a vulnerability, or if a model provider's logging and telemetry captures inputs in ways that are not disclosed, your sensitive enterprise data could be at risk in ways that your internal DLP policies have no visibility into. The SolarWinds breach in 2020 showed exactly how supply chain trust can be weaponized. AI infrastructure is the new software supply chain, and most enterprises have not started treating it that way yet. What Breaks in Your Existing DLP Policies Most enterprise DLP policies were designed around a set of assumptions that AI agents violate by default. It is worth being specific about this because the gaps are not immediately obvious. First, DLP systems use behavioral baselines. They learn what normal data access looks like for a given user or endpoint and flag deviations. An AI agent does not behave like a human user. Its access patterns are bursty, high-volume, and systematic in a way that looks suspicious to a human but is entirely normal for an agent. Tuning DLP to accommodate agent behavior without opening holes for actual attackers is genuinely difficult. Second, many DLP policies focus on content inspection at egress: checking what is in an email attachment, what is being uploaded to a cloud service, and what is being printed. They are less equipped to inspect data that is being passed between internal systems or that is loaded into an LLM's context window. The context window is, in effect, a temporary data store that existing DLP tools cannot see into. Third, agent actions are often attributed to the agent's service account rather than the human who initiated the request. If something goes wrong, the audit trail points to a service identity, not a person, which makes incident response significantly harder. In my earlier article on DLP policy tuning, I wrote about the importance of finding the balance between protection and usability. With AI agents, that balance has to be rethought from scratch. The old tuning frameworks assume a human actor. Agents are a different category. Mapping the Gap: What Your DLP Covers vs. What Agents Require Traditional DLP AssumptionReality With AI Agents Human actor with behavioral patterns Machine actor with high-volume, systematic patterns Data moves at human speed Data moves at API call speed, thousands of operations per second Egress inspection catches exfiltration Exfiltration can happen inside the context window or between agents Access is tied to user identity Access is tied to service account or OAuth scope Anomaly detection flags unusual behavior Agent behavior looks normal because it was authorized Audit trails point to a person Audit trails point to a service identity Practical Controls: What to Do Today I want to be clear that I am not suggesting organizations should slow down their AI agent deployments. The productivity and operational efficiency gains are real, and the competitive pressure to adopt these technologies is not going away. What I am suggesting is that security needs to be built into the deployment architecture from the start, not layered on afterward. Enforce Least-Privilege Agent Identities Every AI agent should have its own identity, with access scoped to the exact data and systems it needs for its specific function. Not a shared service account. Not a developer's credentials. Not an admin-level OAuth token granted for convenience. This sounds obvious, but in my experience, it is violated in the majority of early enterprise agent deployments because speed of deployment takes priority over access hygiene. Work with your identity team to define agent personas the same way you define human user roles. An agent that summarizes customer support tickets should have read access to the support ticket system and nothing else. If it later needs to write back to the system, that permission should be explicitly granted and reviewed, not assumed. Implement Output Inspection Layers If you cannot yet see inside the context window, you can at least inspect what comes out of it. Treat agent outputs the same way you treat email or file uploads in your DLP system. Apply content detection to the agent's final responses and any data it writes to downstream systems. This will not catch everything, but it will catch cases where sensitive data that should not have been surfaced ends up in an agent's output. Security vendors are beginning to build agent-aware DLP capabilities, and this is an area where the product landscape is evolving quickly. Evaluate whether your current DLP vendor has a roadmap for agent output inspection, and if not, that is a conversation worth having with them. Tag Sensitive Data Before It Enters Agent Context This is where classification infrastructure, which I covered in my DLP policies article, becomes even more critical. If your sensitive documents are properly classified and tagged before an agent can access them, you have the foundation for enforcing context-aware retrieval controls. A RAG system that knows a document is tagged as confidential can check whether the requester has access rights before pulling it into context. This requires investment in tagging infrastructure and close collaboration between your data governance team and the teams building the AI systems. It is not trivial. But it is the most durable defense against the RAG access control gap I described earlier. Build Agent Activity Logging Into the Architecture Every action an agent takes should be logged with enough context to reconstruct what happened. Which documents were accessed, what queries were sent to external APIs, what data was written where, and who or what triggered the agent's actions. This logging should be centralized and tamper-resistant, and it should be integrated with your security information and event management (SIEM) system. The goal is to ensure that when something goes wrong, and at some point, something will, your incident response team has the information they need to understand what data was exposed and how. Without this, you are flying blind. Treat Third-Party Agent Frameworks as Supply Chain Risk Apply the same vendor security review process to AI frameworks and model providers that you apply to any third-party software vendor. Ask about data handling practices, logging and telemetry, vulnerability disclosure processes, and compliance certifications. If a vendor cannot answer these questions clearly, that is a signal worth paying attention to. For federal customers, this intersects directly with FedRAMP and FISMA requirements, which I covered in my earlier piece on federal data security. The compliance overlay does not change the fundamental architecture question, but it does add a layer of formal verification that can be useful. A Note on Vendor Responsibility I want to end with something I feel strongly about, because it reflects what I have seen in my work with enterprise customers. Security vendors have a responsibility here that goes beyond selling products. Right now, most enterprise security products are not ready for the AI agent threat landscape. DLP tools that work beautifully for human-driven data flows struggle with agent-generated activity. SIEM systems that are great at correlating human behavioral signals have not been updated to understand agent orchestration patterns. Identity platforms that manage human identities well are still figuring out how to handle non-human agent identities at scale. This is not a criticism. It is a statement of where the industry is. The technology moved faster than the security tooling, which is how it usually goes. But vendors need to be honest with their customers about these gaps and invest now in the capabilities that enterprise organizations will need over the next 12 to 24 months. The enterprises that will navigate this well are the ones that start the conversation with their security vendors today, before a breach forces the conversation. Ask your DLP vendor how their product handles agent service accounts. Ask your SIEM vendor what their roadmap looks like for multi-agent pipeline visibility. Ask your identity vendor how they plan to support agent persona management. These are not theoretical questions. They are operational requirements. Conclusion AI agents are not going away, and they should not. They represent a genuine step forward in what organizations can accomplish with their data and their people. But every significant capability expansion in enterprise technology has also expanded the attack surface, and this one is no different. The threat is invisible right now because agents look like trusted users. They have credentials, they have permissions, and they perform authorized actions. Traditional security controls are not built to be suspicious of authorized behavior. That is the gap that adversaries will eventually learn to exploit, if they have not already started. The answer is not to slow down AI adoption. The answer is to build the security architecture around it properly: least-privilege agent identities, output inspection, classified data tagging, comprehensive logging, and supply chain rigor for third-party frameworks. None of these is a novel security concept. They are well-understood principles being applied to a new context. Your DLP policies were written for a world where humans moved data. That world still exists, but it now shares space with a world where agents move data faster, on a larger scale, and with less friction than any human ever could. It's time to update the playbook. More
AI Paradigm Shift: Analytics Without SQL

AI Paradigm Shift: Analytics Without SQL

By Haricharan Shivram Suresh Chandra Kumar
The idea of “asking data questions in plain English” has been around for a while, but most implementations never made it into production in a serious way. The usual reason is not the language model itself but everything around it: security boundaries, schema ambiguity, cost control, and the fact that analytics systems are rarely clean enough for unconstrained natural language to work reliably. What has changed in the last couple of years is not that natural language is suddenly perfect, but that data platforms have started bringing computation, metadata, and AI into the same controlled environment. One example of this approach is the way agents are being built directly inside data warehouses like Snowflake. The important detail is not the brand itself, but the architectural pattern: the model, the data, and the execution layer sit together rather than being stitched across multiple systems. That shift changes how analytics tools are designed. Instead of building external “AI layers” on top of a warehouse, teams are embedding the agent logic inside the warehouse itself using tools like Snowpark and managed LLM services such as Snowflake Cortex. The result is a system where natural language is just another input format, not a separate application tier. From Dashboards to Agent-Driven Querying Traditional analytics workflows are structured around predefined models: dashboards, semantic layers, and curated datasets. A user question is usually translated into one of these prebuilt views. If the question does not fit, someone writes SQL or updates the dashboard. Agent-based systems invert this flow. Instead of forcing questions into predefined structures, they attempt to generate the structure dynamically. At a high level, the flow looks like this: A user submits a natural language questionThe system enriches the prompt with schema and access contextA model generates SQL or an execution planThe query runs inside the warehouseResults are returned in a structured form or visualized output The key difference from earlier “text-to-SQL” experiments is that steps two and three are tightly grounded in the database context. The model is not guessing a schema from generic training data. It is being provided with actual table definitions, column descriptions, and sometimes usage statistics. This context injection is what makes the system usable in production. Without it, SQL generation tends to fail in subtle ways: incorrect joins, wrong aggregations, or hallucinated columns. Agent Architecture Inside the Warehouse A practical implementation of an analytics agent inside a warehouse typically has three layers. 1. Context and Permission Layer Before any model is called, the system resolves what the user is allowed to see. This includes: Role-based access controlRow-level filtersColumn masking rulesAvailable schemas and tables This step is often underestimated, but it is what makes the system safe enough for real usage. Without it, natural language becomes a bypass mechanism for data access control. 2. Language Model Translation Layer Once context is assembled, the prompt is passed into an LLM hosted within the data platform. In Snowflake’s case, this is handled through Cortex services, but the pattern is not unique to any vendor. The model’s job is not just to produce SQL but to produce SQL that is: Syntactically validAligned with schema constraintsConsistent with security rulesOptimized for warehouse execution patterns For example, a question like, “Show top 10 products by revenue in Q1 2024 grouped by region,” might become: SQL SELECT region, product_name, SUM(revenue) AS total_revenue FROM sales.fact_sales WHERE transaction_date >= '2024-01-01' AND transaction_date < '2024-04-01' GROUP BY region, product_name ORDER BY total_revenue DESC LIMIT 10; The challenge here is not generating SQL that looks correct, but ensuring it respects business definitions. Revenue, for example, might need to be net of returns or adjusted for currency conversion, depending on the organization. 3. Execution and Governance Layer Once SQL is generated, it is executed inside the warehouse engine. This is where the architecture becomes important: nothing leaves the system. The same security policies that apply to human-written queries apply here as well. Because execution happens inside the warehouse, audit logs remain consistent. Every agent action can be traced as a query event, which is important for compliance-heavy environments. Why Snowpark Matters in This Setup Tools like Snowpark extend this model beyond SQL generation. Instead of limiting the agent to query rewriting, Snowpark allows it to execute Python-based logic directly next to the data. This becomes useful when the question is not purely relational. For example: “Forecast next month’s sales for product X.” A simple SQL query cannot answer this. The agent can instead generate a Snowpark Python job that: Extracts historical time series dataConverts it into a DataFrameApplies a forecasting model such as ARIMA or ProphetWrites predictions back into a table The important point is that the data is never exported to an external notebook environment. The compute moves to the data, not the other way around. This pattern also applies to machine learning inference. Pretrained models can be registered as user-defined functions, and the agent can call them like regular SQL functions: SQL SELECT feedback_text, predict_sentiment(feedback_text) AS sentiment_score FROM customer_feedback; From a systems perspective, the agent becomes a planner rather than just a translator. It decides whether SQL is sufficient or whether a Python-based workflow is required. The Streamlit Layer: Turning Queries Into Applications While the warehouse handles computation and the agent handles reasoning, users still need an interface. One of the simpler ways to build this layer is with Streamlit. Streamlit is often used because it reduces the overhead of building internal analytics tools. Instead of designing full frontend systems, teams can wrap agent logic into lightweight interactive apps. A minimal pattern looks like this: Python import streamlit as st st.title("Data Agent Interface") query = st.text_input("Ask a question about your data") if query: result = run_agent(query) st.subheader("Generated SQL") st.code(result["sql"]) st.subheader("Results") st.dataframe(result["data"]) In more mature setups, the Streamlit layer becomes more than a query box. It evolves into a dynamic dashboard generator: Charts are generated based on query intentFilters are derived from schema metadataResults can be saved into reusable viewsUsers can refine queries conversationally This reduces dependency on static dashboards, which are often slow to update and hard to maintain. Governance Is the Real Constraint, Not AI Capability A common misconception is that the main challenge in building these systems is model accuracy. In practice, governance is the harder problem. Three constraints usually define whether an agent system is viable: Data access control must remain intact. Natural language cannot become a bypass layer for restricted data.Query cost must be predictable. Poorly generated queries can become expensive quickly in large warehouses.Results must be reproducible. Two identical questions should not produce different interpretations unless the underlying data changes. Warehouse-native architectures help with this because they centralize execution. There is no separate “AI data layer” that can drift from governance rules. Limitations of the Current Approach Despite progress, these systems are not fully autonomous analytics engines. There are still recurring issues: Ambiguous business definitions lead to incorrect aggregationsComplex joins across poorly modeled schemas still fail frequentlyLLMs may overgeneralize metrics like revenue or churnLatency increases when multi-step reasoning is required In practice, most teams deploy agents as assistants rather than replacements for BI systems. They are good at exploration, not final reporting. Closing Thoughts What is emerging is not a replacement for SQL or dashboards, but a new interface layer on top of them. Natural language becomes a routing mechanism that decides how to query or compute over structured data. The interesting architectural shift is that the intelligence layer is moving closer to the data itself. Whether implemented in Snowflake or other platforms, the pattern is consistent: context-aware models, governed execution, and embedded compute through tools like Snowpark and Cortex. Streamlit or similar tools then complete the stack by providing a lightweight interface that can evolve from simple query boxes into full analytical applications. The result is not “analytics without SQL” but something more realistic: analytics where SQL is still present, but no longer the only way in. More
Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down
Ingesting Fixed-Width Mainframe Files Into Delta Lake: The Details Nobody Writes Down
By Jeevan Krishna Paruchuri
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
By Erkin Karanlık
Stop Running Two Data Systems for One Agent Query
Stop Running Two Data Systems for One Agent Query
By Varun Srinivas
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me

My data catalog project was the third time in my career that I had led a catalog implementation. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right. We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us. Why Two Tools? Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically-maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually-maintained catalog falls behind the actual state of the data within months. Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately. Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview. The Purview Setup Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines. Scans run daily for production data sources and weekly for development. Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which tables it reads from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset. Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect, about 85% accurate in our testing, but it surfaces PII-candidate columns that manual review would miss. Python # Purview scan configuration (REST API) import requests def create_purview_scan(account_name, collection, data_source): url = (f"https://{account_name}.purview.azure.com/scan/datasources/" f"{data_source}/scans/daily-production-scan") body = { "kind": "AzureStorageMsi", "properties": { "scanRulesetName": "custom-pii-ruleset", "scanRulesetType": "Custom", "collection": {"referenceName": collection}, "credential": { "referenceName": "managed-identity", "credentialType": "ManagedIdentity" } }, "trigger": { "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2024-01-01T02:00:00Z", "timezone": "UTC" } } } resp = requests.put(url, json=body, headers=get_auth_headers()) return resp.json() # Custom classifier for internal account numbers custom_classifier = { "kind": "Custom", "properties": { "classificationName": "INTERNAL_ACCOUNT_NUMBER", "description": "Internal 12-digit account number format", "classificationRule": { "kind": "Regex", "pattern": "^ACC[0-9]{9}$", "minimumPercentageMatch": 75 } } } The Collibra Integration We built a nightly sync that reads technical metadata from Purview via its REST API and creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views. Building this sync took about six weeks of engineering time. It's the part of the implementation I considered most for an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance. Python # Nightly Purview-to-Collibra metadata sync (Python) import requests from datetime import datetime def sync_purview_to_collibra(purview_client, collibra_client): """Sync technical metadata from Purview to Collibra assets.""" # Fetch all cataloged assets from Purview purview_assets = purview_client.discovery.query( keywords="*", filter={"and": [ {"entityType": "azure_datalake_gen2_path"}, {"classification": ["confidential", "restricted"]} ]}, limit=1000 ) for asset in purview_assets['value']: collibra_asset = collibra_client.find_or_create_asset( name=asset['name'], domain="Data Lake Assets", type="Data Set" ) # Sync technical metadata as attributes collibra_client.update_attributes(collibra_asset['id'], { "Technical Schema": asset.get('schema', ''), "Data Classification": asset.get('classification', []), "Purview Asset Link": asset['id'], "Last Scanned": asset.get('lastScanTime', ''), "Lineage Summary": get_lineage_summary( purview_client, asset['id']), "Sync Timestamp": datetime.utcnow().isoformat() }) return {"synced": len(purview_assets['value']), "timestamp": datetime.utcnow().isoformat()} What Adoption Looked Like Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo. After three months, we'd had about 30% of our target user base actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it. Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change. Python # Collibra asset mapping schema for access request workflow { "asset_type": "Data Set", "domain": "Data Lake Assets", "attributes": { "Technical Name": {"type": "text", "source": "purview"}, "Business Name": {"type": "text", "source": "steward"}, "Data Classification": { "type": "single_select", "values": ["public", "internal", "confidential", "restricted"], "source": "purview" }, "Owner Team": {"type": "text", "source": "steward"}, "PII Columns": {"type": "multi_select", "source": "purview"}, "Quality Certification": { "type": "single_select", "values": ["certified", "provisional", "uncertified"], "source": "steward" }, "Access Request URL": { "type": "url", "template": "https://collibra.internal/access/{asset_id}" } }, "workflow": { "access_request": { "approvers": ["asset_owner", "data_governance_lead"], "sla_hours": 48, "auto_revoke_days": 365 } } } The Honest Caveat A data catalog requires ongoing investment that is easy to underestimate. Automated parts, Purview's scanning and discovery, take care of themselves. Business governance parts, glossary maintenance, stewardship assignments, and quality certifications require human effort that must be budgeted and owned. Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.

By Kuladeep Sandra
Bringing Intelligence Closer to the Source: Why Real-Time Processing is the Heart of Edge AI
Bringing Intelligence Closer to the Source: Why Real-Time Processing is the Heart of Edge AI

Artificial Intelligence is rapidly becoming a part of everyday devices — smartphones, cars, cameras, and even home appliances. Traditionally, these systems rely on cloud servers to send, process, and analyze data before making decisions, which increases latency and delays responses. However, many applications require instant decision-making, where even a slight delay can be critical. In such scenarios, relying on network connectivity is not always practical, and decisions need to be made locally on the device itself. This has led to a growing shift toward running intelligence directly on devices, making real-time local processing more important than ever. In this article, we’ll explore why this shift matters and how it is shaping the future of modern intelligent systems. What is Edge AI? Edge AI refers to running AI models directly on devices such as IoT systems, smartphones, autonomous cars, drones, and sensors — right where the data is generated. With this approach, there is no need to transfer data to cloud servers or centralized systems. Edge AI enables faster, real-time decision-making by processing data locally, without sending it elsewhere. For example, Instead of sending every transaction to a central server for analysis, the system can analyze transaction patterns locally in real time. If any unusual activity is detected — such as an abnormal withdrawal amount, location mismatch, or suspicious behavior — the system can instantly block the transaction or trigger an alert. Why Real-Time Processing Matters? Real-time processing means a system can process data instantly and make decisions without delay. Even small delays in decision-making can create critical situations and lead to serious consequences. For example, an autonomous car must detect obstacles and react within milliseconds. If it relies on the cloud, even a small delay could lead to serious consequences. By processing data locally, Edge AI enables immediate decisions — such as braking or steering — making the system safer and more efficient. Reduce Latency and Faster Decisions Latency is the time it takes for data to travel to the cloud and back. Even a delay of a few milliseconds can be too slow for certain applications. With Edge AI: Data is processed instantly on the device itself.There’s no need to wait for a network response.Performance is faster, more reliable, and less dependent on connectivity. For example, a voice recognition system on a smartphone can respond much faster when speech processing runs locally on the device, rather than relying on cloud or centralized servers. Improved Privacy and Data Security Sending sensitive data to the cloud raises privacy concerns, as it can be exposed during transmission or storage. Edge AI minimizes these risks by processing data directly on the local device instead of sending it to the cloud. This approach enhances data security and helps maintain user privacy, since sensitive information never leaves the device. It also supports compliance with data protection regulations and reduces the chances of unauthorized access or data breaches. For example, a healthcare wearable that monitors heart activity should not transmit sensitive personal health data to external servers. Instead, it can analyze patterns locally on the device and instantly alert the user if any irregularities are detected. This approach not only protects patient privacy but also enables faster, real-time responses in critical situations. Such local processing is especially important in industries like banking, healthcare, finance, and smart homes, where data security and immediate decision-making are essential. Reliability Without Internet Dependency Edge devices can operate even without an internet connection, making them more stable and reliable in remote areas or environments with poor network coverage. This ensures continuous performance without interruptions or delays caused by connectivity issues. As a result, critical applications can function smoothly regardless of network availability. For example, a drone used in disaster rescue operations cannot depend on internet connectivity. It must process images locally and detect survivors in real time, enabling faster and more effective rescue efforts. Lower Bandwidth Usage and Reduce Infrastructure Costs Sending large amounts of data to the cloud consumes significant bandwidth and increases operational costs. Edge AI helps reduce these costs by processing data locally on the device. This minimizes the need for constant data transmission and optimizes network usage. Only relevant or critical information is sent to the cloud, making the system more efficient and cost-effective. For example, a factory machine monitoring system can analyze sensor data locally and send alerts only when an issue is detected, instead of continuously streaming all the data. Scalability and Cost Efficiency Cloud processing for millions of devices can become expensive and resource-intensive. Edge AI addresses this by distributing computations across devices, reducing the load on central servers. This decentralized approach lowers infrastructure costs, improves scalability, and enhances overall system performance. It also reduces latency by minimizing the need for constant communication with the cloud. For example, in a smart city, thousands of cameras can process data locally instead of sending everything to a central cloud system. This not only saves bandwidth and infrastructure costs but also enables faster, real-time insights and responses. Better User Experience Real-time processing significantly improves user experience by making systems feel faster, smoother, and more responsive. Quicker responses lead to higher user satisfaction and a more seamless interaction. With Edge AI, data is processed instantly on the device, eliminating delays and ensuring consistent performance. This is especially important for applications that require immediate feedback. For example, in gaming or augmented reality (AR), local AI can render objects and interactions in real time, creating a smoother, more immersive, and engaging user experience. An edge-based platform helps by enabling data processing and decision-making directly on devices, rather than relying entirely on centralized cloud systems. It supports faster, real-time responses by analyzing data locally, which is essential for applications that require immediate action. This leads to improved performance and reliability, especially in environments with limited or unstable internet connectivity. It also enhances data privacy and security by keeping sensitive information on the device, reducing the need for data transmission. Additionally, it optimizes bandwidth usage and lowers infrastructure costs by sending only meaningful insights or alerts to central systems instead of continuous raw data. Overall, this approach helps build systems that are faster, more efficient, secure, and scalable by bringing intelligence closer to where data is generated. Conclusion Edge AI is transforming modern systems by bringing intelligence closer to where data is created, enabling faster and real-time decision-making. It reduces latency and improves performance by processing data locally instead of relying on the cloud. This approach also enhances privacy and minimizes dependence on constant internet connectivity. Additionally, it helps reduce bandwidth usage and lowers infrastructure costs. From smart cities to healthcare and industrial automation, edge computing is driving a new era of faster, smarter, and more efficient systems. Edge AI brings intelligence closer to where data is created, enabling real-time decisions, faster performance, enhanced privacy, and reliable operation without depending on constant connectivity.

By Jitendra Bafna
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering introduced with Delta Lake 3.0 goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management. Recap: Partitioning and Z-Order Limitations Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables: Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult and usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning. Data engineers must monitor query patterns and manually decide how to partition or when to re-Zorder. This ongoing maintenance is time-consuming and error-prone. In summary, traditional partitioning/Z-ordering yields performance benefits but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution. What Is Liquid Clustering? Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name liquid signifies flexibility data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include: Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys or file size tuning the platform handles optimal file sizing and clustering internally. Even high-cardinality columns can be used effectively, which would be impractical as partition keys.Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys. There’s no massive upfront cost of re-partitioning the entire dataset past data doesn’t need an immediate rewrite.Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks. Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by set it and forget it clustering. In fact, thousands of customers have already adopted it as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables. Liquid Clustering vs Traditional Methods Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways: No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data. Maintenance operations like OPTIMIZE still play a role but they can operate more efficiently since the incoming data is already sorted/clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE+ZORDER it only rearranges data that isn’t well clustered yet rather than always rewriting everything.Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset clustered with Liquid Clustering ran 2.5× faster to optimize (cluster) than using Z-Ordering, and yielded significantly better query performance than both partitioning or Z-Order. In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly. How Liquid Clustering Works (Under the Hood) At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms: Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.Data Skipping with Statistics: Delta Lake maintains statistics that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters. Enabling Automatic Clustering To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations). Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward: SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering: SQL -- Creating a Unity Catalog managed table with Automatic Liquid Clustering CREATE TABLE main.analytics.user_events ( user_id STRING, event_type STRING, event_date DATE, details STRING ) CLUSTER BY AUTO; -- enables automatic liquid clustering on this table SQL ALTER TABLE main.analytics.user_events CLUSTER BY AUTO; This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.) PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark: Python # df is a DataFrame we want to save as a Delta table with auto clustering df.write.format("delta") \ .option("clusterByAuto", "true") \ .mode("overwrite") \ .saveAsTable("main.analytics.user_events_auto") The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required – the system will figure it out if you leave it open.) Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on a serverless compute in the background. The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times. Using Manual Liquid Clustering (Custom Clustering Keys) In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it: Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example: SQL -- Create a Delta table clustered by specific columns (manual clustering) CREATE OR REPLACE TABLE main.analytics.sales_data ( sale_id BIGINT, region STRING, product STRING, sale_date DATE, amount DECIMAL(10,2) ) CLUSTER BY (region, sale_date); In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date. Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance: SQL ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date); This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize. Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example: Python OPTIMIZE main.analytics.sales_data; The above will compact small files and cluster data incrementally. If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data **FULL**. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time. PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example: Python # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys df.write.format("delta") \ .mode("append") \ .clusterBy("region", "sale_date") \ .saveAsTable("main.analytics.sales_data"); Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys. Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version which older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries. Conclusion Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidness of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For Data Engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further clustering becomes a self-driving process, leveraging query insights to continuously improve performance. In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start it can simplify your architecture and ensure your tables automatically stay optimized as your data (and its use cases) grow.

By Seshendranath Balla Venkata
Building Enterprise-Grade Real-Time IoT Dashboards with Vue 3, MQTT, and Kafka
Building Enterprise-Grade Real-Time IoT Dashboards with Vue 3, MQTT, and Kafka

The convergence of IoT, real-time data streaming, and modern frontend frameworks is redefining how engineers build enterprise monitoring systems. Over the course of designing and leading the Device IoT Platform — an enterprise-grade solution for real-time monitoring, configuration, and diagnostics of thousands of distributed network devices — I encountered and solved a core architectural challenge: how do you build a frontend dashboard that can handle hundreds of concurrent device telemetry streams without sacrificing performance, maintainability, or user experience? This article shares the architectural patterns, technology decisions, and hard-won lessons from that journey — covering the full stack from MQTT ingestion to Vue 3 reactivity to Kafka-backed event processing. The Core Problem: Real-Time at Scale Most developers are familiar with polling — periodically fetching data from an API endpoint. For IoT, polling is fundamentally broken: Latency: A 5-second polling interval means 5 seconds of stale state.Inefficiency: You're requesting data even when nothing has changed.Scale: 1,000 devices × 1 request/5s = 200 requests/second just to read status — before any user interaction. The solution is event-driven architecture: devices push telemetry when something changes, and the platform reacts. This requires a rethinking of both backend ingestion and frontend state management. Architecture Overview Plain Text [IoT Devices] | MQTT Broker (Mosquitto / AWS IoT Core) | [Node.js Telemetry Microservice] | [Kafka Topic: device.telemetry.raw] | (stream processor) [Kafka Topic: device.telemetry.enriched] | [WebSocket Server (Node.js)] | [Vue 3 Dashboard Frontend] Each layer has a distinct responsibility: MQTT Broker handles lightweight publish/subscribe with devices using minimal overhead.Node.js microservices bridge MQTT → Kafka, performing initial validation and normalization.Kafka provides durable, replayable event streams — critical for audit trails and late-joining consumers.WebSocket server fans out enriched telemetry to connected dashboard clients in real time.Vue 3 handles reactive rendering, ensuring only the affected UI components re-render when new data arrives. Backend: MQTT → Kafka Bridge in Node.js The heart of the ingestion pipeline is a lightweight Node.js service using the mqtt and kafkajs libraries. Plain Text import mqtt from 'mqtt'; import { Kafka } from 'kafkajs'; const mqttClient = mqtt.connect(process.env.MQTT_BROKER_URL!, { clientId: `telemetry-bridge-${process.pid}`, username: process.env.MQTT_USERNAME, password: process.env.MQTT_PASSWORD, clean: true, }); const kafka = new Kafka({ clientId: 'iot-bridge', brokers: [process.env.KAFKA_BROKER!] }); const producer = kafka.producer(); mqttClient.on('connect', async () => { await producer.connect(); mqttClient.subscribe('devices/+/telemetry', { qos: 1 }); console.log('MQTT → Kafka bridge active'); }); mqttClient.on('message', async (topic, payload) => { const deviceId = topic.split('/')[1]; const data = JSON.parse(payload.toString()); await producer.send({ topic: 'device.telemetry.raw', messages: [ { key: deviceId, value: JSON.stringify({ deviceId, timestamp: Date.now(), ...data }), }, ], }); }); Key design decisions here: QoS Level 1 — ensures at-least-once delivery for telemetry messages. For command acknowledgments, we use QoS 2.Device ID as Kafka partition key — guarantees ordering per device while allowing parallel processing across partitions.Separation of raw vs. enriched topics — the device.telemetry.raw topic contains the bare payload; a downstream stream processor enriches it with device metadata, geolocation, and alert thresholds before publishing to device.telemetry.enriched. WebSocket Fan-Out Server The WebSocket layer subscribes to Kafka's enriched topic and pushes updates to connected browser clients. We use Kafka consumer groups to allow horizontal scaling of the WebSocket tier. Plain Text import { WebSocketServer } from 'ws'; import { Kafka } from 'kafkajs'; const wss = new WebSocketServer({ port: 8080 }); const kafka = new Kafka({ clientId: 'ws-fanout', brokers: [process.env.KAFKA_BROKER!] }); const consumer = kafka.consumer({ groupId: 'websocket-fanout-group' }); // Track subscriptions: deviceId → Set<WebSocket> const deviceSubscriptions = new Map<string, Set<WebSocket>>(); wss.on('connection', (ws) => { ws.on('message', (msg) => { const { action, deviceId } = JSON.parse(msg.toString()); if (action === 'subscribe') { if (!deviceSubscriptions.has(deviceId)) { deviceSubscriptions.set(deviceId, new Set()); } deviceSubscriptions.get(deviceId)!.add(ws); } }); ws.on('close', () => { deviceSubscriptions.forEach((clients) => clients.delete(ws)); }); }); async function startKafkaConsumer() { await consumer.connect(); await consumer.subscribe({ topic: 'device.telemetry.enriched' }); await consumer.run({ eachMessage: async ({ message }) => { const payload = JSON.parse(message.value!.toString()); const clients = deviceSubscriptions.get(payload.deviceId); clients?.forEach((client) => { if (client.readyState === WebSocket.OPEN) { client.send(JSON.stringify(payload)); } }); }, }); } startKafkaConsumer(); This design enables selective subscription — a dashboard user viewing 50 devices only receives telemetry for those 50 devices, not the full firehose. This is critical for performance at scale. Frontend: Vue 3 Reactive Architecture The frontend is built with Vue 3 Composition API + Pinia for state management. The goal is to update only the UI components bound to a specific device when its telemetry arrives — not re-render the entire dashboard. WebSocket Composable Plain Text // composables/useDeviceTelemetry.ts import { ref, onMounted, onUnmounted } from 'vue'; import { useDeviceStore } from '@/stores/deviceStore'; export function useDeviceTelemetry(deviceIds: string[]) { const store = useDeviceStore(); let ws: WebSocket | null = null; const connect = () => { ws = new WebSocket(import.meta.env.VITE_WS_URL); ws.onopen = () => { deviceIds.forEach((id) => { ws!.send(JSON.stringify({ action: 'subscribe', deviceId: id })); }); }; ws.onmessage = (event) => { const telemetry = JSON.parse(event.data); store.updateDeviceTelemetry(telemetry.deviceId, telemetry); }; ws.onclose = () => { // Exponential backoff reconnection setTimeout(connect, Math.min(1000 * 2 ** reconnectAttempts++, 30000)); }; }; onMounted(connect); onUnmounted(() => ws?.close()); } Pinia Store with Fine-Grained Reactivity Plain Text // stores/deviceStore.ts import { defineStore } from 'pinia'; import { reactive } from 'vue'; interface DeviceTelemetry { deviceId: string; status: 'online' | 'offline' | 'degraded'; signalStrength: number; latency: number; lastSeen: number; alerts: string[]; } export const useDeviceStore = defineStore('devices', () => { const telemetryMap = reactive<Record<string, DeviceTelemetry>>({}); function updateDeviceTelemetry(deviceId: string, data: Partial<DeviceTelemetry>) { if (!telemetryMap[deviceId]) { telemetryMap[deviceId] = {} as DeviceTelemetry; } Object.assign(telemetryMap[deviceId], data); } return { telemetryMap, updateDeviceTelemetry }; }); Using reactive() with a map structure means Vue's dependency tracking is at the property level — a component subscribed to telemetryMap['device-001'].signalStrength won't re-render when device-002's data changes. This is the key to dashboard scalability. Device Card Component Plain Text <!-- components/DeviceCard.vue --> <template> <div class="device-card" :class="statusClass"> <div class="device-header"> <span class="device-id">{{ deviceId }</span> <StatusBadge :status="telemetry?.status" /> </div> <div class="metrics"> <MetricBar label="Signal" :value="telemetry?.signalStrength" unit="dBm" /> <MetricBar label="Latency" :value="telemetry?.latency" unit="ms" /> </div> <AlertList :alerts="telemetry?.alerts ?? []" /> </div> </template> <script setup lang="ts"> import { computed } from 'vue'; import { useDeviceStore } from '@/stores/deviceStore'; const props = defineProps<{ deviceId: string }>(); const store = useDeviceStore(); // Only this device's slice of state — targeted re-renders only const telemetry = computed(() => store.telemetryMap[props.deviceId]); const statusClass = computed(() => ({ 'status-online': telemetry.value?.status === 'online', 'status-offline': telemetry.value?.status === 'offline', 'status-degraded': telemetry.value?.status === 'degraded', })); </script> Performance Optimizations 1. Virtual Scrolling for Large Device Lists When monitoring 500+ devices, rendering all device cards simultaneously tanks performance. We use vue-virtual-scrollerto only render visible cards: Plain Text <RecycleScroller class="device-list" :items="filteredDevices" :item-size="120" key-field="deviceId" v-slot="{ item }" > <DeviceCard :device-id="item.deviceId" /> </RecycleScroller> 2. Debounced Batch Updates Devices can emit bursts of telemetry. Updating the Pinia store on every single message causes excessive re-renders. We batch incoming messages within a 100ms window: Plain Text let pendingUpdates: Record<string, Partial<DeviceTelemetry>> = {}; let batchTimeout: ReturnType<typeof setTimeout> | null = null; function queueUpdate(deviceId: string, data: Partial<DeviceTelemetry>) { pendingUpdates[deviceId] = { ...(pendingUpdates[deviceId] ?? {}), ...data }; if (!batchTimeout) { batchTimeout = setTimeout(() => { Object.entries(pendingUpdates).forEach(([id, update]) => { store.updateDeviceTelemetry(id, update); }); pendingUpdates = {}; batchTimeout = null; }, 100); } } 3. Lazy Loading and Code Splitting Device diagnostic panels (charts, event logs, configuration editors) are loaded on demand: Plain Text const DeviceDiagnostics = defineAsyncComponent( () => import('@/components/DeviceDiagnostics.vue') ); Combined with route-level code splitting, the initial bundle stays under 200KB gzipped. Security Architecture: OAuth 2.0 + RBAC Device management platforms require fine-grained access control. Not every user should be able to issue firmware update commands to production devices. JWT Claims-Based RBAC We encode role information directly in the JWT access token: Plain Text { "sub": "user-123", "roles": ["device:read", "device:configure"], "scope": "region:us-east", "exp": 1699999999 } The frontend reads these claims to conditionally render action buttons, and the backend validates them on every API call: Plain Text // middleware/rbac.ts export function requirePermission(permission: string) { return (req: Request, res: Response, next: NextFunction) => { const token = req.headers.authorization?.split(' ')[1]; const decoded = verifyJWT(token!); if (!decoded.roles.includes(permission)) { return res.status(403).json({ error: 'Insufficient permissions' }); } next(); }; } // Route definition router.post('/devices/:id/firmware', requirePermission('device:firmware'), handleFirmwareUpdate); Deployment: CI/CD on AWS The entire platform is containerized and deployed via a GitLab CI/CD pipeline to AWS ECS with Fargate. Plain Text # .gitlab-ci.yml (excerpt) stages: - test - build - deploy build-and-push: stage: build script: - docker build -t $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA . - docker push $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA deploy-production: stage: deploy script: - aws ecs update-service --cluster iot-platform --service frontend --force-new-deployment environment: production only: - main Blue-green deployments ensure zero downtime for this 24/7 critical infrastructure platform. Results and Key Metrics After migrating from a polling-based architecture to this event-driven stack: Dashboard latency: reduced from 5–10 seconds (polling) to under 200ms (WebSocket push).Backend API load: reduced by ~78% — telemetry pushes replaced constant polling.Frontend bundle size: kept under 220KB gzipped through lazy loading and tree-shaking.Throughput: validated at 10,000 concurrent telemetry events/second through Kafka partitioning. Conclusion Building a real-time IoT dashboard at enterprise scale requires rethinking the entire data flow — from device communication protocols through streaming pipelines to fine-grained frontend reactivity. The combination of MQTT for lightweight device communication, Kafka for durable event streaming, WebSockets for real-time push to browsers, and Vue 3's targeted reactivity model creates a system that scales gracefully without sacrificing developer ergonomics. The patterns described here — selective WebSocket subscriptions, batched Pinia updates, virtual scrolling, and JWT-based RBAC — have been validated in production on a platform serving critical network infrastructure. They are broadly applicable to any domain requiring real-time monitoring at scale: energy management, fleet tracking, industrial automation, and beyond. Github: Real-Time-IoT-Dashboards-Vue-3-MQTT-Kafka

By Venkata Sandeep Dhullipalla
Why Google Data Migration Gets Stuck at 99%: Causes and Proven Fixes
Why Google Data Migration Gets Stuck at 99%: Causes and Proven Fixes

Starting a Google data migration usually feels easy at first. You set it up, watch the progress bar move, and assume everything is fine. Then, the migration reaches 99% and just stays there. Users start wondering if something went wrong, if emails are missing, or if the process will ever finish. The good news is that this issue is common, and most of the time, it’s not serious. In this blog, we’ll look at why Google data migration stuck at 99% and what you can do to fix it is to backup the Google data. Why is your Google Data Migration Pending at 99%? There can be many reasons for Google data migration getting stuck at 99%. Here are some possibilities listed below: Migrating a large amount of data takes time and can slow things down.A slow internet connection can interrupt the data export.There can be some compatibility issues with certain data types.Insufficient system storage on either the source or target servers can hinder migration progress.If the data being transferred contains corruption or an integrity issue, the migration process may encounter some issues that prevent it from completing successfully. What to Do When Google Data Migration Stuck at 99%? When Google data migration gets stuck at 99%, users can feel confused and a bit stressed. Everything looks almost done, but nothing is transferring anymore. You keep checking the status, hoping it will finish on its own. In this section, we’ll see some simple steps you can take to help move things forward when Google Data Migration is not completing: Solution 01: Wait, Then Refresh The first thing to try is simply waiting. A lot of migrations finish on their own after a few hours. The progress bar isn’t always honest about what’s happening. Wait for around 30-60 minutes.Do not close any web browser or log out of the Google Admin Console.If the problem persists, refresh the migration page. Solution 02: Test your Internet Connection A poor internet connection is the most common reason why Google Data Migration not completing. Do not use VPNs or proxy networks, as they can slow down the transfer process.During the migration process, do not download heavy files.Once your internet gets stabilized, refresh the migration page. Solution 03: Manage Mailbox Size Oversized mailboxes can affect the Google data migration process. If the migration includes large attachments or huge attachments, it can get stuck. Log in to the source account, find the unnecessary emails and attachments, and delete them.Empty the Trash or Spam folders to reduce the total size.You can also try to split the entire migration into two for a faster and more efficient migration. Solution 04: Resume Migration for Pending Data When Google Data Migration status is stuck at 99%, it means only a few emails are pending for migration. Check the migration status from the Google Admin migration dashboard.If the migration gets stuck, pause it and then resume the process. Also, if there is an option to select the new items or the remaining items to be migrated, choose that option. Backup Google Data to Fix Google Data Migration Stuck at 99% The professional Cigait Gmail Backup Tool is easy to use for users and does what it claims. It helps you to backup Gmail emails when the Google data migration is stuck at 99%. The process is simple and doesn’t take much effort. You can select which data or emails you want to transfer. The advanced filters let you skip emails you don’t need. This saves time and keeps things clean. The tool doesn’t only work with emails; it can also back up Google Drive files, contacts, calendars, photos, and other Google data. Follow the given steps to fix the Google data migration pending at 99%: First, install this tool, then launch it on your system.After that, enter your Email ID and App password and click Sign In.Now, select and preview your files and hit Next.Then, select PDF from the given list of file formats.Afterward, apply any optional features and filters you need and click Next.After that, click Save Path, choose the location, and save your fileLastly, press Download to begin the process. Final Words Google data migration stuck at 99% is a common issue and can be really frustrating, but it usually isn’t serious. Most of the time, the last few files or emails just need extra time to complete. If this issue keeps coming back here, we discussed the Gmail Backup Tool that can help to backup the Google data. With a little patience, the transfer process can be completed without losing data. Frequently Asked Questions Q1. Why does Google data migration get stuck at 99%? Ans. Google data migration usually gets stuck at 99% because a few emails or files take time to transfer. Old messages, attachments, or items in spam and trash folders can slow down the final step. Most of the time, the migration is still running in the background and needs more time to finish. Q2. How long can Google's data migration stay at 99%? Ans. Google’s data migration can stay at 99% for a few hours and sometimes up to a day. The progress bar doesn’t always update properly near the end, even though the migration is still going in the background. Q3. What should I do if the Google data migration is not completing? Ans. If the Google data migration is not completing, first wait for some time, then check the status again. Many times, it gets completed. If it stays stuck, try restarting the migration process. If the issue keeps occurring again, using the Gmail backup tool can help move the remaining emails and other data without getting stuck. Q4. Why does Google data migration stay pending at 99% for large mailboxes? Ans. Google data migration stays pending at 99% for large mailboxes because there is more data to process. A few large emails, old messages, or attachments can take extra time to transfer, which slows down the final part of the migration.

By Aryan Malhotra
Scaling Cloud Data Automation: A Practical Guide to Open Table Formats
Scaling Cloud Data Automation: A Practical Guide to Open Table Formats

When we talk about data analytics the way we set up our tables is really important. This is because it can make a difference, in how well our systems work and how fast they can grow. Data analytics and Open Table Formats go hand in hand. Open Table Formats are a part of how we build our data systems today. They make it easy to work with systems. Get more out of our data. In this blog post we will talk about what Open Table Formatsre. We will discuss data analytics and Open Table Formats in detail. We will also look at some examples. Help you figure out which Open Table Format is best for your data analytics needs. We want to help organizations choose the Open Table Format for their data systems because the Open Table Format is very important, for organizations. The Open Table Format is what organizations need to make their data systems work well. What Are Open Table Formats? Open Table Formats are really good at keeping data neat and tidy, in tables. Nobody owns Open Table Formats so they are made to work with lots of tools and systems. This is great because Open Table Formats can be used by people and computers and they all work together. The goal of Open Table Formats is to make it easy for people to share data and use it so everyone can work together smoothly no matter what kind of computer or system they use, with Open Table Formats. Popular Open Table Formats People really, like using Open Table Formats when they are dealing with data. Here are some popular Open Table Formats that people use a lot when they are working with Open Table Formats: Apache Iceberg Apache Iceberg is a way to organize tables. It helps people work with sets of data in an controlled way. Apache Iceberg gives us things like ACID transactions, which's, like a guarantee that Apache Iceberg will handle our data correctly. Apache Iceberg also has isolation so we can look at our data without worrying about people changing Apache Iceberg data at the same time. Apache Iceberg allows for schema evolution, which means we can change the way our Apache Iceberg data is organized without having to start over again with Apache Iceberg. I think Apache Iceberg is really useful for people who deal with datasets in data lakes. Apache Iceberg is very helpful because it makes working with amounts of data a lot easier for people who do this kind of work, with Apache Iceberg. Advantages The main advantages of this system are that it makes sure the data is consistent. It helps with queries. This system also allows the database schema to change and evolve over time without losing any of the data, from the database schema. The system ensures data consistency. It supports queries and it enables the database schema evolution. Use Cases: Ideal for data lakes requiring transactional guarantees and schema flexibility. Delta Lake Delta Lake is a way to store data that's free for anyone to use. It helps make sure the Delta Lake data is reliable. When many people use the Delta Lake data at the time Delta Lake makes sure there are no problems. The Delta Lake also keeps track of a lot of information, about the Delta Lake data. Delta Lake makes it easy to use data that is coming in all the time and old data that is already stored in the Delta Lake. The Delta Lake does all this by using something called ACID transactions to help the Delta Lake work properly. Delta Lake is really great when it comes to dealing with an amount of data. Delta Lake works well with data that is coming in all the time and Delta Lake also works well with data that comes in big groups. This thing has a lot of points. It makes sure the data is good and reliable. You can also go back. Look at old versions of the data. The data works well with the tools that use the data. The tools that process the data, like it when the data is set up this way. Use Cases: Suitable for data lakes requiring reliability, data versioning, and unified data processing. Apache Hudi Apache Hudi is a tool for working with data. It helps you add data to the data you already have. Apache Hudi also makes it easier to build systems that can move data around. This is really helpful when you have a lot of data in a data lake. Anyone can use Apache Hudi because it is source. The best thing about Apache Hudi is that it makes handling data processing and building data pipelines on data lakes simpler. Apache Hudi is very useful, for people who work with data lakes and need to process a lot of data. This system is good because it helps with processing data a little at a time. It also keeps track of versions of the data. The data system makes it easy to get the data in and to ask questions about the data. The data system is really helpful when you want to ask questions, about the data. Use Cases: Ideal for data lakes requiring incremental data processing and data pipeline management. Choosing the Right Open Table Format When you are trying to pick the Open Table Format for the data analytics you need you have to think about a lot of things. You have to think about what you will be using the Open Table Format for. What is your data, like? Will the Open Table Format work with the systems you use? How well does the Open Table Format need to perform for your data analytics? Here are some important things to think about when you're trying to decide on an Open Table Format for your data analytics needs: Use Cases and Workloads When you want to make sure your transactions are safe and your data is consistent you should think about using formats like Apache Iceberg or Delta Lake. These formats give you something called ACID transactions which's, like a promise that everything will work correctly. Apache Iceberg and Delta Lake are options because they help you keep your data safe and make sure everything is consistent. If you are looking for something that will guarantee your data is safe Apache Iceberg and Delta Lake are the way to go because Apache Iceberg and Delta Lake give you this guarantee. When we talk about Incremental Data Processing we need to think about how to handle Incremental Data Processing. For people who work with Incremental Data Processing and manage data pipelines Apache Hudi is an option to consider for their Incremental Data Processing needs. Apache Hudi can really help with tasks related to Incremental Data Processing. Data Characteristics When you are working with data think about how data you will have to deal with. You have to store data. Some ways of storing data are better for sets of data. Data volume is something you should think about because some formats can handle lots of data better, than others. This is really important when you are working with a lot of data. If you are working with data data volume can be a problem if you are not using the format for your data. Data Complexity You have to find out how complicated your data is. This means you need to look at the types of data you have. You should think about if you will need to make changes to how your data's organized. Some data formats, like Apache Iceberg and Delta Lake are very helpful. They are helpful because they let you make changes to your data easily. You can change your data without a lot of trouble when you use Apache Iceberg and Delta Lake. Ecosystem Compatibility When you choose an Open Testing Framework you need to make sure it works well with the data processing tools you already use. For example Delta Lake works with Apache Spark. This is really important because you want your Open Testing Framework to be compatible with your existing data processing frameworks and tools, like your Open Testing Framework and your data processing tools. You want your Open Testing Framework to work smoothly with the tools you have so your Open Testing Framework and your data processing tools work together perfectly. When you think about Cloud Platforms you need to think about how the OTF works with the Cloud Platform you want to use. You have to see if the OTF is compatible with the Cloud Platform you like.. You have to check if it works with the infrastructure you have at home or in your office. This is really important for Cloud Platforms, like the ones you use every day. You need to make sure the OTF and the Cloud Platform work together. The Cloud Platform you choose should be able to work with the OTF. Performance Requirements Let us take a look at the On The Fly system and see how it works when we have to handle queries. The On The Fly system has to be able to handle our queries. We need to check how well the On The Fly system does when it comes to query performance. This is important because we do a lot of work. The On The Fly system has to be good, at handling the kind of work we do. We have to test the On The Fly system to see how it performs with our workloads. The On The Fly system needs to be able to handle these workloads. * We are going to take a look, at how the On The Fly system works when it comes to answering queries. We want to see how the On The Fly system does its job. The On The Fly system is what we are focusing on. * We are going to use this for the work we do when we analyze things for our workloads. This will help us with our workloads. The main thing we want to figure out is how good the On The Fly system is at doing our work. We need to see if the On The Fly system can give us the results we need fast. This will help us decide if the On The Fly system is really good, for the kind of work we do with the On The Fly system. Data Ingestion We need to check how well our Data Ingestion processes are working, especially when we are getting Data Ingestion done on time or really close to time for analytics. This is really important, for Data Ingestion because it helps us understand what is happening now with our Data Ingestion. We need to see how Data Ingestion works with a lot of information. We have to know how fast Data Ingestion can process this information. For Data Ingestion to be really useful it has to be able to handle all this information. Data Ingestion is only good if it can do this. Open Table Formats are really important for working with data these days. They make it easy to work with systems and Open Table Formats can do a lot of things. If you know what makes Open Table Formats like Apache Iceberg, Delta Lake and Apache Hudi special you can pick the Open Table Format that's best, for your company. You need to think about your data. What is your data like? You should figure out what you want to do with your data and what tools you are using with your data. You should also think about what you want your data to be like. Then you can pick the Open Table Format that's best for your data and what you want to do with your data. Open Table Formats are important for your data so choosing the Open Table Format is important, for your data needs.

By Sandeep Batchu
When Perfect Data Breaks: The Journey from Data Quality to Data Observability
When Perfect Data Breaks: The Journey from Data Quality to Data Observability

The Day Everything Looked Fine — Until It Wasn’t The dashboards were green. Every test passed. And yet, by morning, the company’s revenue had mysteriously dropped by roughly $1 million. The data team huddled together, blinking at their screens. Schema checks? It looked good.Nulls? Checks passed, and everything appeared to be in order.Completeness? It looked good. Nothing looked wrong, except that something was causing the business to bleed. What they didn’t know yet was that an innocent iOS app update had quietly scrambled the order of user events. To the system, customers were suddenly purchasing before browsing. The models didn’t break in code; they broke in meaning. The team discovered a crucial lesson: even flawless data systems can mislead without true observability. Why “Good Data” Isn’t Good Enough Anymore There was a time when data quality was the gold standard and a measure of success. DQ checks meant your dataset is protected. If your dataset were clean, complete, and validated, your insights would be gold. But that was back when pipelines were simple, ETL jobs ran once a night, and life was predictable. Back then, most data was read by people, not systems. Analysts looked at dashboards after the fact, asked questions when numbers felt off, and applied judgment before anyone made a real decision. If a table landed late or a metric looked strange, someone usually noticed; often before it caused real damage. Data quality checks were designed for this world: static, batch-oriented, and tolerant of human interpretation. But as technology changed, so did expectations. Today’s world is different. This shift matters most for data engineers, analytics engineers, and platform teams responsible for the reliability of downstream dashboards, APIs, and machine learning systems. Modern cloud-native companies run thousands of interdependent batch and streaming pipelines, constantly feeding dashboards, APIs, and machine learning systems. A single column rename, a delayed partition, or an unnoticed schema tweak can quietly throw everything off course. Traditional data quality is like checking your car’s oil once a month. Data observability involves installing a dashboard that provides real-time alerts when the engine is overheating. The Shift: From Data Quality to Data Observability Data quality answers the question: “Is this dataset correct right now?” Data observability asks something deeper: “Is my data behaving as it should?” Aspect Data Quality Data Observability Focus Data-at-rest Data-in-motion Checks Accuracy, completeness, validity Freshness, volume, distribution, schema, lineage When Point-in-time Continuous Goal Ensure correctness Ensure reliability View Local End-to-end The Five Pillars of Data Observability Freshness: Is data arriving on time relative to SLAs?Volume: Are record counts within expected ranges?Distribution: Have key statistics (e.g., averages, percentiles) drifted unexpectedly?Schema: Did upstream fields change without notice?Lineage: What depends on what, and who owns it? Together, these pillars act as an early-warning system for your data ecosystem, sensing changes before they cause downstream impact. The Story Behind the $1M Drop Our e-commerce company’s recommendation engine accounted for 40% of revenue. After a routine app update, click-throughs fell by 15%, conversions by 22%, and revenue tumbled. And yet, all quality checks still passed. Check Status Missed Insight Schema ✅ Timestamps changed meaning Nulls ✅ Events arrived out of sequence Ranges ✅ Valid values, wrong order Data quality confirmed the structure. It missed the story. Event order sounds like a minor detail, but for recommendation models, it’s foundational. Browsing before purchasing means something very different than purchasing before browsing. When that sequence flipped, nothing crashed; the model simply learned the wrong story about customers. Since the data remained complete, valid, and schema-compliant, every traditional check passed, even as the model’s understanding of user behavior quietly unraveled. The Hidden Issue The iOS app began batching events. They arrived six hours late and out of order. Before (Healthy) After (Broken) View → Add to Cart → Purchase Purchase → View → Add to Cart The model interpreted chaos as logic, and that’s when recommendations became noise. How Observability Would Have Saved the Day Within two hours, an observability system would have screamed: Freshness Alert: Event lag jumped from 5 mins to 360 minsDistribution Alert: 78% of events out of sequenceLineage Alert: iOS v1.3.0 deployed, impacting 47 tables and degrading 12 ML models Approach Detection Root Cause Resolution Time Data Quality Missed Undetected 3 days Data Observability Caught early iOS v1.3.0 deployment 6 hours Observability didn’t just find the broken data; it connected the dots to the moment things went wrong. The real win wasn’t just catching the issue faster. It was knowing exactly what changed, when it changed, and how far the damage spread. That made it possible to roll back quickly and explain what happened without guesswork. Without observability, teams debate symptoms. With it, they start acting on causes. Building Observability Step by Step So how does a modern data team move from reactive firefighting to proactive confidence? 1. Define Data Contracts Every dataset has a clear, versioned schema (YAML, Avro, Protobuf). Contracts live in code and are automatically validated before pipeline runs and new data is added to the dataset. Data contracts are often the first thing teams skip. They feel slow, bureaucratic, and unnecessary, right up until a breaking change slips through and every downstream table starts lying. 2. Add Freshness & Volume Monitors Track how long data takes to arrive and whether counts fall outside norms. Row updated at timestamp should be within the defined SLO. Define SLOs such as “99% of partitions land within 10 minutes.” Without explicit SLAs, delays are only discovered after dashboards update or don’t. By then, decisions have already been made on stale data. 3. Strengthen Tests Layer dbt checks for `not_null` and `uniqueness` with drift tests — e.g., “average session_length stays within 10% of baseline,” or “count of new orders placed stays within 10% of the baseline.” Basic checks are good at catching broken tables, but they don’t tell you when data starts behaving differently. Drift tests exist for the uncomfortable cases where everything looks valid but isn’t. 4. Emit Lineage Integrate OpenLineage with Airflow or dbt to visualize dependencies and trace impact instantly. Without lineage, every alert triggers a manual investigation. With it, teams can immediately see blast radius and ownership. 5. Centralize Visibility Bring all signals into one pane of glass. When freshness lives in one tool, lineage in another, and alerts in Slack, every incident turns into a scavenger hunt. Pulling those signals together is what turns alerts into answers. Now, when an alert fires, you know what broke, where, and who’s responsible. A Familiar Pattern If this story sounds familiar, it’s because it’s happening everywhere. Teams at Netflix have described recommendation quality degrading after upstream data schemas changed without downstream safeguards.Uber has publicly discussed timezone-related bugs that impacted time-based systems, including pricing and incentives.Airbnb has shared incidents where aggressive deduplication and data-cleaning logic removed valid records.Stripe has written extensively about how tiny currency-rounding errors can quietly compound into material financial discrepancies at scale.Different problems, same root cause: great data quality, no visibility. Let’s Distill the Lesson: Quality Validates. Observability Protects. Data quality ensures your data is correct. Data observability ensures your system stays trustworthy. In today’s interconnected world, where every pipeline is a domino, observability isn’t a luxury; it’s a seatbelt. So the next time your dashboard shows that comforting little green badge labeled “Fresh & Verified,” remember: behind that glow lies a safety net of observability quietly keeping your business upright.

By Divyakumar Savla
The Hidden Cost of Overprivileged Tokens: Designing Messaging Platforms That Assume Compromise
The Hidden Cost of Overprivileged Tokens: Designing Messaging Platforms That Assume Compromise

Large messaging platforms rarely collapse because authentication is broken. They collapse because authorization quietly expands, then stays expanded. The failure mode is not a single bug but a system property: credentials that were created for one narrow purpose become reusable, long-lived, and operationally too useful, until they function as capability grants far beyond the original intent. The industry has spent a decade hardening identity proofing and login defenses, yet incident reports keep circling back to the same operational reality: leaked tokens, misconfigured partner integrations, and automation scripts that inherit privileges no one remembers granting. What turns these common events into major incidents is blast radius. A single credential ends up authorizing too much surface area across assets, APIs, and workflows that were never meant to be coupled. That coupling is not malicious. It is entropy. In large platforms, shortcuts accumulate because they reduce friction for onboarding, rollout, and support. A token minted for setup becomes a token used for management. A scope added temporarily remains because removing it might break revenue-critical traffic. Over time, the platform’s authorization model stops describing reality and starts describing what teams wish were true. This is why overprivileged tokens should be treated as a platform failure, not a security bug. A platform that cannot bound token damage will repeatedly trade safety for continuity during pressure, and continuity will win every time. Assume Compromise: A Design Constraint Security guidance often says to assume compromise, but many systems still behave as if compromise is an edge case. An authorization design that truly assumes compromise treats every token as potentially leaked and optimizes for containment, not prevention. That changes the objective function: you are no longer trying to stop every unauthorized access. You are trying to make every credential failure cheap. In practice, this pushes a platform toward three invariants: Tokens must be purpose-specific and asset-bound.Authorization must be enforceable at runtime, not only at mint time.Migration must preserve business continuity, or it will be bypassed. If any one of these is missing, the platform will drift back toward one token that works everywhere, because it is operationally convenient. Granular Tokens: Turning Credentials Into Bounded Capabilities A granular token is not a JWT with scopes. It is a capability grant with explicit boundaries that survive refactors. At a minimum, you want the token to encode: Subject: who the token represents (partner, service, automation identity)Assets: which specific resources it can act on (business account, phone number, template namespace, etc.)Actions: what it can do (send message, read profile, manage templates, rotate keys)Context: how it was minted and intended to be used (channel, onboarding version, risk tier) A minimal JSON representation (conceptual) looks like this: JSON { "sub": "partner:acme", "aud": "messaging-api", "exp": 1767225600, "scopes": ["message.send", "profile.read"], "assets": ["acct:WABA_12345"], "context": { "channel": "api", "onboarding_version": "v2", "risk_tier": "standard" } } The containment story is straightforward. If this token leaks, the worst-case impact is bounded by the assets and scopes embedded in the token. You do not need an emergency revocation that breaks unrelated integrations because the token never had cross-asset authority in the first place. That is the first half of the fix. The second half is where most platforms fail. Static Permissions Do Not Survive Platform Reality Even with granular tokens, the platform still needs to answer questions the token cannot predict: Is this token suddenly being used from a new environment or automation pipeline?Is the request pattern anomalous relative to the identity’s baseline?Is the target asset in a degraded state or under investigation?Is the subject verified, suspended, or constrained by policy changes? If those conditions matter — and in large platforms they always do — then authorization cannot be “token is valid → allow.” It must be a runtime decision that incorporates policy, state, and signals. A typical evaluation path is a policy engine that receives a normalized request context, the parsed token, and a small set of risk signals. Kotlin-style pseudocode: Kotlin data class RequestContext( val subject: String, val requiredScope: String, val targetAsset: String, val channel: String, val requestIp: String, val userAgent: String ) data class TokenClaims( val active: Boolean, val scopes: Set<String>, val assets: Set<String>, val riskTier: String ) enum class Decision { ALLOW, DENY, CHALLENGE } fun authorize(ctx: RequestContext, token: TokenClaims, risk: Double): Decision { if (!token.active) return Decision.DENY if (ctx.requiredScope !in token.scopes) return Decision.DENY if (ctx.targetAsset !in token.assets) return Decision.DENY // Risk gating: throttle, step-up, or challenge instead of global revocation if (risk >= 0.85) return Decision.CHALLENGE return Decision.ALLOW } Two details matter here. First, the challenge is not a UX flourish. It is an operational safety valve that lets you contain suspicious use without detonating the entire integration ecosystem. In partner-heavy platforms, blanket revocation often costs more than the incident you are trying to stop, which is how systems end up tolerating risk. Second, this logic must be uniform. If each service re-implements its own checks, drift returns through inconsistency. The enforcement layer must be a shared middleware or gateway component, not a set of best-practice docs. Shared Enforcement Libraries Prevent Policy Drift At platform scale, ad hoc checks become a reliability problem. One forgotten endpoint becomes the bypass. One outdated library becomes the weakest link. The correct abstraction is a shared enforcement module that every API integrates with, so policy changes do not require coordinated redeploys across dozens of teams. Kotlin middleware sketch: Kotlin class AuthzMiddleware(private val policy: PolicyEngine) { fun enforce(ctx: RequestContext, token: TokenClaims, risk: Double) { when (policy.evaluate(ctx, token, risk)) { Decision.ALLOW -> return Decision.CHALLENGE -> throw TooManyRequestsException("Risk threshold exceeded") Decision.DENY -> throw ForbiddenException("Not authorized") } } } interface PolicyEngine { fun evaluate(ctx: RequestContext, token: TokenClaims, risk: Double): Decision } This shifts authorization from scattered conventions to programmable governance. It also makes audits feasible. You can explain what rule allowed or denied a request, because the rule is centralized and versioned. Migration: The Part Everyone Underestimates The technical design is not the hard part. Migration is. Most large platforms cannot revoke legacy tokens quickly without breaking high-value partners or revenue-critical flows. If the migration plan assumes immediate compliance, teams will invent exceptions, and exceptions become the new default. A safe migration path looks less like a rewrite and more like controlled containment: Phase 1: Parity Audit Ensure every legacy capability exists in the new model. Missing parity guarantees shadow workarounds. Phase 2: Dual-Path Issuance New onboarding flows mint granular tokens. Legacy flows continue, but you instrument usage to learn what those tokens actually do. Phase 3: Progressive Restriction Start restricting the highest-risk scopes and the widest asset access first, while leaving low-risk functionality untouched. Phase 4: Deprecation Based on Observed Usage Deprecate legacy tokens only after usage drops below an agreed threshold and partner replacements are proven. This is not slow for the sake of caution. It is a recognition that platforms are socio-technical systems. Authorization controls that ignore operational incentives will be bypassed. Verification Data Is Not a Badge. It Is an Input Signal Verification systems are often framed as UX trust indicators, but their deeper value is policy. Verified entities can have different scope ceilings, different rate limits, different escalation paths, and different anomaly thresholds. That only works if the verification state is consistent and centralized. Multiple sources of truth for verification create two failures: increased attack surface and unpredictable enforcement. Consolidating verification data is therefore not merely hygiene. It is a prerequisite infrastructure for consistent authorization. Observability: Authorization Decisions Must Be Explainable If authorization is a runtime decision, observability becomes part of the authorization system. You need structured events that allow you to reconstruct “what was allowed, why, and under which policy version.” A compact event schema: JSON { "token_id": "tok_abc123", "subject": "partner:acme", "asset": "acct:WABA_12345", "scope": "message.send", "decision": "ALLOW", "policy_version": "2026-01-28.3", "risk_score": 0.12, "timestamp": "2026-01-28T10:42:00Z" } Without this, incident response degrades into guesswork. Teams become afraid to tighten policy because they cannot predict impact, and the platform returns to permissive defaults. Why This Matters Now Messaging platforms have become commerce rails, identity brokers, and customer support infrastructure. Tokens do not merely send messages. They trigger workflows, expose regulated data, and create downstream consequences that are hard to unwind. In that environment, overprivileged tokens are not a theoretical risk. They are latent incidents waiting for scale and human error to align. The durable systems are not the ones with the most complicated policy language. They are the ones who assume credentials fail and make failure cheap. Overprivileged tokens are rarely a single mistake. They are the result of authorization drift under operational pressure. The fix is not a lecture about least privilege. The fix is an architecture that enforces least privilege at runtime, uses shared libraries to prevent divergence, migrates without breaking continuity, and emits evidence for every decision. At platform scale, trust is not maintained by perfect prevention. It is maintained by designing for containment.

By Prakash Wagle
LLM Agents and Getting Started with Them
LLM Agents and Getting Started with Them

LLM-powered agents are gaining popularity and 2026 is set to be the year of agents just like 2025. Generative AI applications have now moved from normal chatbot applications, search and retrieve systems to building more of autonomous agents that can break bigger tasks down in smaller sub-tasks, achieving a goal while also interacting with environment. Before diving deeper into LLM powered agents and tools to create one let's start by answering the most important question What is an Agent? According to the gold standard definition that comes from Artificial Intelligence: A Modern Approach textbook by Stuart Russell and Peter Norvig "An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators." A vacuum cleaning robot is a good example of an agent. It uses sensors such as cameras, dirt detectors, bump sensors, and infrared sensors to gather information about its surroundings. To interact with and act upon its environment, it relies on actuators like wheels for movement, brushes for sweeping, and a suction motor for collecting dirt. This agent also performs a perception-action cycle to achieve its goals. Plain Text 1. PERCEIVE → Sensors detect dirty floor ahead 2. DECIDE → Agent decides to move forward and clean 3. ACT → Wheels move, brushes spin, motor sucks dirt 4. REPEAT → Continue until floor is clean The term percept refers to the input that the agent receives and perceives, percept sequence is a history of everything that the agent has received or perceived. Broadly the agents can be divided into the following categories: Simple Reflex Agents: These are the most simplest kind of agents, and can be considered as following a rule-based approach. If this then that kind of logical approach to problem solving. These agents take action only considering the current input or percept ignoring everything from the percept historyModel Based Reflex Agents: These agents are also reflex agents, however they take informed decisions by maintaining an internal state or storage that tracks the part of environment that it has visited, but cannot observe right now. If an environment changes then the agent behavior needs to be updated.Goal Based Agents: These agents work backwards from a desired goal, take and plan actions in accordance with this goal. This is different from simple reflex based agents, as decision making considers what will happen if an action is performed. Hence, considering the impact of their current choice on future state.Utility Based Agents: Utility based agents are also working backwards from a desired goal, but are also optimizing for a metric. The performance is tracked based on an utility function, the agent tries to achieve goals while maximizing the utility function. For example, an agent that is designed to find shortest path between two points, the goal is to find a path between the two points, while also considering that the length of path is shortest. Learning Agents: The most advanced type of agent. This has three main elements: the learning element, the performance element and the problem generator. Imagine an agent with decomposes a tasks, critics the current set of actions and finally also suggest actions that will lead to new experience. We will look at and construct simple examples of most of these agents. While a vacuum uses physical sensors, an LLM agent 'perceives' through text inputs and API responses, and 'acts' by generating text or calling functions. Just as a vacuum can be a simple bump-and-turn robot or a sophisticated room-mapper, LLM agents vary in intelligence. We can categorize them into five levels of sophistication:, for a LLM-powered agent the "sensor" is the user input. These agents usually have four main components: The agent core or the agent brainThe planning moduleMemoryTools Theoretical definitions are fine, but how do we build them? We will use LangChain and LangGraph frameworks to build our first LLM powered agent. Both are open source frameworks that are used to build LLM powered applications. The choice depends on the type of agent we are building and the intended level of control we wish to have on the agentic architecture. LangChain is an open source framework with a pre-built agent architecture and integrations for any model or tool — so you can build agents that adapt as fast as the ecosystem evolve. LangChain is used to build on ideas of chain or a pipeline, a sequence of steps executed in a linear order. Every LangChain workflow is treated as a DAG (Direct Acyclic Graph) where tasks flow in one direction without any cycles or loops. LangGraph is great at handling complex workflows, loops, decision branches in workflows, complex decision trees etc, it also provides great flexibility and control into each component of agent setup. In this article we will create an agent using both LangChain and LangGraph to understand the pros and cons and usage of both these frameworks. For Simple Reflex Agents, that follow a straight line (Input -> Output), LangChain is our go-to framework. However, as we move toward Model based, Goal and Utility Agents that require loops, self-correction, and state management, we need the flexibility of LangGraph. Let’s see this evolution in action by building a 'Chef Agent' that grows smarter with every iteration.We will see how we can go from building a simple reflex agent for this task to a utility based agent, based on the complexities we add for this agent. Let's set up our environment for this task. We will be using Groq to invoke gpt-oss api. You will need to get your API access key from here, we will be using open-source models for this exercise. We will start by installing the required libraries Python pip install langchain langchain-groq langgraph langchain-core pydantic Next we will set up few variables that we will use across all our agents, these include the API key, model name. Python #config import os from dotenv import load_dotenv load_dotenv() # Set your API key GROQ_API_KEY="YOUR_API_KEY" os.environ["GROQ_API_KEY"] = GROQ_API_KEY model = 'openai/gpt-oss-safeguard-20b' Now let's start with building a simple reflex agent. A simple reflex agent is a very simple agent, that in this case if the user asks for a recipe suggests the recipe. It works with if else logic blocks. We use langchain to create this agent, the agent suggests whatever recipe the user requests. Python ######### llm brain ############# import os from langchain_groq import ChatGroq # Setup Groq LLM llm = ChatGroq( temperature=0, model_name=model, api_key=os.environ.get("GROQ_API_KEY") ) from langchain_core.prompts import ChatPromptTemplate from langchain_core.output_parsers import StrOutputParser # 1. Perception -> Action (Direct Chain) reflex_prompt = ChatPromptTemplate.from_template( "You are a chef. Given a request: {input}, provide a single recipe immediately." ) reflex_agent = reflex_prompt | llm | StrOutputParser() #invoke the brain print(reflex_agent.invoke({"input": "I want a spicy pasta."})) Imagine you start using this agent to get a recipe for your dinner tonight, however you lack the ingredients that are needed to prepare the food. Such a bummer! Wouldn't it be nice to have an agent that has knowledge of the ingredients in your pantry or your dietary preferences before suggesting a recipe? Our Reflex Agent is fast, but it’s 'forgetful.' It suggests a spicy pasta even if you have no pasta in your pantry or a gluten allergy. To make it a Model-Based Reflex Agent, we must give it a way to track the 'internal state' of its world specifically, your pantry inventory and dietary needs.For this, we move to LangGraph, which allows the agent to maintain a persistent memory (State) and use tools to 'sense' its environment Python from langchain.tools import tool from langchain.chat_models import init_chat_model import operator from langgraph.prebuilt import ToolNode, InjectedState import operator from typing import Annotated, List, Literal llm = ChatGroq( temperature=0, model_name=model, api_key=os.environ.get("GROQ_API_KEY") ) @tool def get_inventory(state:Annotated[dict, InjectedState]): "Get cuurent user inventory" return state["inventory"] @tool def get_dietary_prefs(state:Annotated[dict, InjectedState]): "Get user dietary preferences" return state["dietary_prefs"] @tool def get_history(state:Annotated[dict, InjectedState]): "Get user history" return state["history"] # Augment the LLM with tools tools = [get_inventory, get_dietary_prefs, get_history] tools_by_name = {tool.name: tool for tool in tools} model_with_tools = llm.bind_tools(tools) # Step 2: Define state from langchain.messages import AnyMessage from typing_extensions import TypedDict import operator class RecipeState(TypedDict): inventory: List[str] # What's in the fridge dietary_prefs: List[str] # e.g., "Vegetarian" suggestion: str # The output messages: Annotated[List[AnyMessage], operator.add] # Step 3: Define model node from langchain.messages import SystemMessage def llm_call(state: dict): """LLM decides whether to call a tool or not""" # Combine system message with user messages messages_to_send = [ SystemMessage( content="""You are a helpful chef agent. When user asks for a recipe, look at user's current inventory, dietary prefrences to suggest the recipe. You can use the available tools whenever needed.""" ) ] + state["messages"] response = model_with_tools.invoke(messages_to_send) return { "messages": [response], "inventory": state.get("inventory", []), "dietary_prefs": state.get("dietary_prefs", []) } # Step 4: Define tool node from langchain.messages import ToolMessage tool_node = ToolNode(tools) # Step 5: Define logic to determine whether to end from langgraph.graph import StateGraph, START, END # Conditional edge function to route to the tool node or end based upon whether the LLM made a tool call def should_continue(state: RecipeState) -> Literal["tool_node", END]: """Decide if we should continue the loop or stop based upon whether the LLM made a tool call""" messages = state["messages"] last_message = messages[-1] # If the LLM makes a tool call, then perform an action if last_message.tool_calls: return "tool_node" # Otherwise, we stop (reply to the user) return END # Step 6: Build agent # Build workflow agent_builder = StateGraph(RecipeState) # Add nodes agent_builder.add_node("llm_call", llm_call) agent_builder.add_node("tool_node", tool_node) # Add edges to connect nodes agent_builder.add_edge(START, "llm_call") agent_builder.add_conditional_edges( "llm_call", should_continue, ["tool_node", END] ) agent_builder.add_edge("tool_node", "llm_call") # Compile the agent agent = agent_builder.compile() from IPython.display import Image, display # Show the agent display(Image(agent.get_graph(xray=True).draw_mermaid_png())) # Invoke from langchain.messages import HumanMessage messages = [HumanMessage(content="Suggest a recipe for pasta")] current_inventory = ["pasta", "water", "tomatoes", "salt","parmesan"] current_dietary_prefs = ["vegetarian"] messages = agent.invoke({"inventory":current_inventory, "dietary_prefs":current_dietary_prefs, "messages": messages}) for m in messages["messages"]: m.pretty_print() This agent considers not just the current user request, but also takes into account the user environment, in this case the pantry and suggest recipes based on the items actually available in the pantry. Having memory is a start, but a truly intelligent chef doesn't just look at what's in the fridge, it works toward a specific outcome. This brings us to Goal and Utility-Based Agents. In this next evolution, the agent doesn't just suggest any recipe; it must satisfy a 'Goal': suggesting a meal that fits within a specific calorie budget. This requires a Planning/Verification loop where the agent critiques its own suggestion before presenting it to you. Python # Step 1: Define tools and model from langchain.tools import tool from langchain.chat_models import init_chat_model import operator from langgraph.prebuilt import ToolNode, InjectedState llm = ChatGroq( temperature=0, model_name=model, api_key=os.environ.get("GROQ_API_KEY") ) @tool def get_inventory(state:Annotated[dict, InjectedState]): "Get cuurent user inventory" return state["inventory"] @tool def get_dietary_prefs(state:Annotated[dict, InjectedState]): "Ger user dietary prefrences" return state["dietary_prefs"] @tool def get_remaining_calories_range(state:Annotated[dict, InjectedState]): "Get range of remaining calories" return (state["total_calories"]- state["consumed_calories"]+state["error_margin_calories"], state["total_calories"]- state["consumed_calories"]-state["error_margin_calories"]) # Augment the LLM with tools tools = [get_inventory, get_dietary_prefs, get_history] tools_by_name = {tool.name: tool for tool in tools} model_with_tools = llm.bind_tools(tools) # Step 2: Define state from langchain.messages import AnyMessage,HumanMessage, SystemMessage from typing_extensions import TypedDict, Annotated import operator from typing import List class RecipeState(TypedDict): inventory: List[str] # What's in the fridge dietary_prefs: List[str] # e.g., "Vegetarian" suggestion: str # The output total_calories: int # The total number of calories to consume consumed_calories: int # The number of calories already_consumed error_margin_calories: int # The number of calories that can be added or deleted from total calories num_tries: int # The number of tries the agent has made messages: Annotated[List[AnyMessage], operator.add] # Step 3: Define Planner Node def planner_node(state: RecipeState): """Suggests a recipe and estimates calories.""" messages_to_send = [ SystemMessage(content=( "You are a chef. Suggest a recipe based on inventory and dietary prefs. " "IMPORTANT: You MUST provide an estimated calorie count for the meal." "You can use the available tools whenever needed." )) ] + state["messages"] response = model_with_tools.invoke(messages_to_send) return {"messages": [response], "num_tries":state.get("num_tries",0)} def verify_search_node(state: RecipeState): """Checks if the suggested meal fits the calorie constraints.""" last_message = state["messages"][-1].content # Simple logic: Ask LLM to extract or verify, # or use regex/logic if you want to be strict. prompt = f""" The user's goal is {state['total_calories']} calories (margin: +/- {state['error_margin_calories']}). Current consumed: {state['consumed_calories']}. The chef suggested: {last_message} Does this recipe fit the remaining calorie budget? If yes, reply 'VALID'. If no, explain why. You can use the avilable tools whenever needed. """ verification_response = llm.invoke([HumanMessage(content=prompt)]) # We can add this to messages to keep track of the critique return {"messages": [verification_response]} def should_verify(state: RecipeState) -> Literal["verify_node", "tool_node", END]: last_message = state["messages"][-1] if last_message.tool_calls: return "tool_node" # If no tool calls, it means the LLM has made a suggestion. Now verify it. return "verify_node" def is_it_valid(state: RecipeState) -> Literal["planner_node", END]: last_message = state["messages"][-1].content if "VALID" in last_message.upper(): return END # If not valid, loop back to the planner to try again return "planner_node" def num_tries_exceeded(state: RecipeState) -> Literal["planner_node",END]: if state["num_tries"] > 5: return END return "planner_node" # Build workflow agent_builder = StateGraph(RecipeState) agent_builder.add_node("planner_node", planner_node) agent_builder.add_node("tool_node", tool_node) agent_builder.add_node("verify_node", verify_search_node) agent_builder.add_edge(START, "planner_node") # Route from Planner agent_builder.add_conditional_edges( "planner_node", should_verify, {"tool_node": "tool_node", "verify_node": "verify_node", END: END} ) # Route from Tool back to Planner agent_builder.add_edge("tool_node", "planner_node") # Route from Verifier back to Planner OR End agent_builder.add_conditional_edges( "verify_node", is_it_valid, {"planner_node": "planner_node", END: END} ) agent = agent_builder.compile() from IPython.display import Image, display # Show the agent display(Image(agent.get_graph(xray=True).draw_mermaid_png())) # Invoke from langchain.messages import HumanMessage messages = [HumanMessage(content="Suggest a recipe for pasta")] current_inventory = ["pasta", "water", "tomatoes", "salt","parmesan"] current_dietary_prefs = ["vegetarian"] messages = agent.invoke({"inventory":current_inventory, "dietary_prefs":current_dietary_prefs, "total_calories": 1000, "consumed_calories": 800, "error_margin_calories": 100, "messages": messages}) for m in messages["messages"]: m.pretty_print() We have seen how an agent evolves from a simple 'If-Then' reflex into a sophisticated system capable of maintaining state and verifying its own goals. By moving from LangChain’s linear chains to LangGraph’s cyclic workflows, we’ve bridged the gap between basic automation and autonomous reasoning. However, the true power of agents lies in their ability to optimize for complex preferences and learn from their mistakes. Because Utility-Based Agents and Learning Agents involve more intricate scoring functions and feedback loops, we will dedicate our next entire article to mastering those advanced architectures.

By Harshita Asnani
What Nobody Tells You About Multimodal Data Pipelines for AI Training
What Nobody Tells You About Multimodal Data Pipelines for AI Training

Most discussions about AI model training focus on architecture choices, compute budgets, and evaluation benchmarks. The data pipeline that feeds those models? It gets a paragraph, maybe two. Maybe a diagram with an arrow labeled "data ingestion." That gap is a real problem. In practice, data engineering is where most AI projects quietly fall apart. Not at the model level. Not at inference. At the pipeline. I've spent the last two years building multimodal data infrastructure at Abaka AI, delivering datasets to frontier AI companies training next-generation reasoning and conversational models. The lessons below come from that work, specifically from the parts that broke in unexpected ways and forced us to rethink assumptions we didn't realize we were making. Multimodal Means Multiple Failure Modes A text pipeline has one content type to worry about. A multimodal pipeline has many: scanned books, handwritten documents, photographs, structured tables, diagrams, audio transcripts, and video frames are all possible inputs, and each one breaks differently. The first mistake most teams make is treating multimodal ingestion as a collection of separate pipelines that happen to feed the same model. That sounds reasonable until you need consistency across modalities, and you realize your text preprocessing strips the metadata that your image pipeline needs to correctly associate figures with captions. Now you have two clean pipelines producing corrupted outputs. The second mistake is assuming format-level parsing solves the content problem. A PDF parser that correctly extracts text may still produce garbage if the source document used a two-column layout, footnotes interspersed with body text, or embedded mathematical notation. Correct extraction is necessary but not sufficient. What actually works is treating each document type as a first-class object with its own quality contract. For each modality, define what "clean" means before writing a single line of processing code. For scanned text, clean might mean a character error rate below 2% with section headers preserved. For images, it might mean resolution above a minimum threshold with alt-text that accurately describes content rather than just naming the file. Write those contracts down. They become your QC spec, and more importantly, they give you something to argue about before problems appear rather than after. Pipeline Speed Is a Product Feature, Not a Bonus When we delivered our first dataset to a large enterprise AI client, the turnaround from contract signing to delivery was 21 days. That wasn't a coincidence or a heroic sprint. It was a consequence of pipeline decisions made months earlier: batch size calibration, parallelized QC, and pre-built validation tooling that didn't require human review at every stage. The conventional wisdom in data engineering is that quality gates slow you down. That's true if the gates are manual. Automated quality checks, built early and run on every batch, are what make speed possible in the first place. If a document fails your character error rate threshold, it gets routed to a remediation queue immediately, not discovered weeks later during model evaluation. There's also a less obvious payoff. When you can tell a client exactly how long processing takes for a given document type and volume, you become predictable. Predictability is what turns a one-time data vendor relationship into a sustained one. Clients who can plan around your delivery cadence will plan around it. Clients who can't will find someone else. The Annotation Problem Nobody Wants to Solve Annotation is the part of multimodal data engineering that everyone wishes could be automated, and almost never fully can be. For some tasks, models are good enough at self-annotation that human review can be reduced to sampling. For others, especially tasks involving nuanced reasoning, spatial relationships in images, or domain-specific knowledge, you still need human annotators who understand the material. The failure mode I see most often is annotation pipelines that treat all tasks as equivalent. Same workflow, same annotator pool, same quality threshold. That breaks when a task requires specialized knowledge. A general annotator pool might correctly label object presence in images but produce low-quality labels for questions about whether a scanned diagram accurately represents a logical circuit. Segment your annotation tasks by complexity and required expertise before building the workflow. For high-complexity tasks, route to specialists and build in consensus checking. For lower-complexity tasks, use larger annotator pools with statistical agreement thresholds. Keep the two workflows separate, even if they feed the same output schema. Version Control for Data Is Not Optional This sounds obvious. Most teams still don't do it properly. The issue is that data versioning gets treated as a documentation problem: label your dataset with a date, note what changed, move on. But if you can't reproduce a specific dataset version exactly, including its preprocessing parameters, source document selection, and annotation schema, you can't debug model regressions that trace back to data changes. We run full lineage tracking on every dataset we produce. Every document has a source identifier, an ingestion timestamp, a processing version, and a list of applied transformations with parameters. When a client reports unexpected model behavior, we can trace it to a specific data batch and often to a specific preprocessing decision. The tooling for this isn't exotic. A well-structured metadata store and a deterministic transformation pipeline are mostly sufficient. The discipline of actually using them consistently across every pipeline stage is the harder part. What Skipping Ingestion QC Actually Costs You Here's a failure pattern I've seen in multiple production pipelines. A team ingests a large corpus of scanned documents. OCR looks reasonable on spot check. They run embeddings and deliver the dataset. Three months later, the client reports degraded model output on a specific class of questions involving tables and structured data. Pull the relevant training documents. About 15% of the scanned tables were ingested with columns transposed, because the OCR engine misread the column separator characters. The model learned from that corrupted structure. The degradation was there from day one, and nobody caught it because the spot check didn't include table-heavy documents at a meaningful sample rate. The fix is not more spot checks. The fix is structured validation at ingestion: schema-aware quality checks that specifically test the document features most likely to corrupt model training, run automatically before any document enters the processing queue. Catching problems at ingestion is ten times cheaper than catching them during evaluation. Catching them during evaluation is still far cheaper than catching them after deployment. Building for a Model Team You Don't Control One dynamic that rarely gets discussed in data engineering: you often don't know exactly how the consuming model team will use the dataset you produce. They may apply additional filtering. They may upsample certain document types. They may concatenate your dataset with other sources in ways that create distribution shifts you never anticipated. All of this affects whether your data, however well-produced, actually helps the model. The practical response is to over-document. Deliver metadata that tells the model team what's in the dataset in granular detail: distribution across document types, languages, topic areas, source characteristics, and annotation confidence scores. Build your delivery format around their needs, not yours. If you can establish a feedback loop where model evaluation results flow back to the data team, build it early and protect it. That loop is how you build datasets that improve over time instead of delivering batches and hoping. A Note on What This Work Actually Is Multimodal data engineering tends to get framed as infrastructure work, a supporting function that enables the "real" AI work to happen. I think that framing causes teams to underinvest in it and then be surprised when things go wrong. The pipeline decisions that look like implementation details, annotation segmentation, version lineage, ingestion-time QC, and delivery format conventions are the decisions that determine whether your data actually improves the models it trains. A model trained on well-engineered data outperforms one trained on carelessly processed data with the same architecture. That's been true in every project I've worked on. What you're really building isn't a pipeline. It's a quality assurance system that happens to move data through stages. The sooner teams internalize that distinction, the fewer late-stage surprises they'll face, and the fewer post-mortems they'll have to write.

By Yunfei Zhao

Top Data Experts

expert thumbnail

Salman Khan

Director Data Science,
Afiniti

Salman Khan is the Director of Data Science at Afiniti, where he drives innovative solutions to complex business challenges through data science. With a specialization in machine learning, statistical modelling, and a strong focus on generative AI, Salman leads multiple teams of data scientists and engineers in the development and deployment of cutting-edge AI-driven applications. Salman has led AI projects delivering measurable business value, including real-time prediction systems, advanced language models, semantic search platforms, and generative AI applications. Salman’s expertise spans deep learning, probabilistic modelling, and a broad range of data science techniques, with advanced proficiency in Python, R, and SQL.
expert thumbnail

Fawaz Ghali, PhD

Fawaz Ghali is an independent applied AI technologist with over 25 years of experience spanning AI research, higher education, and industry delivery. He works at the point where enthusiasm has faded, tools have been tried, and AI initiatives are expected to deliver real value. Holding a PhD in Computer Science, Fawaz has led and contributed to applied AI, data engineering, and distributed systems across academia and industry. He has published 45+ peer-reviewed papers and delivered over 300 talks worldwide, focusing on what actually works once AI meets real data, real constraints, and real organisational pressure. Fawaz speaks and teaches independently, without vendor affiliation or sales agendas. His sessions are direct, evidence-based, and deliberately free of marketing, product promotion, or speculative claims, offering audiences clarity, critical evaluation, and hard-earned insight grounded in practice rather than hype.

The Latest Data Topics

article thumbnail
Stop Loading Everything into Redshift: A Spectrum + Iceberg Pattern for Hybrid Analytics
Store large and cold datasets in Iceberg on S3, query them through Spectrum, and reserve Redshift local tables for workloads that need low latency or high concurrency.
June 12, 2026
by Vivek Venkatesan
· 341 Views
article thumbnail
Operationalizing Enterprise AI at Scale: Architecture, Governance, and Adoption
Enterprise AI success depends on scalable architecture, governance automation, AI operations, observability, and developer-first enablement strategies.
June 12, 2026
by Aravind Nuthalapati DZone Core CORE
· 355 Views · 1 Like
article thumbnail
Native SQL in Java Without JDBC Boilerplate — Meet Ujorm3
Ujorm3 eliminates JDBC boilerplate without a full ORM. Write native SQL with named parameters, get objects back — including nested relations.
June 11, 2026
by Pavel Ponec
· 683 Views · 2 Likes
article thumbnail
Rust-Native Alternatives to Spark SQL and DataFrame Workloads
Sail is an open-source computation framework that serves as a drop-in replacement for Apache Spark (SQL and DataFrame API) in both single-host and distributed settings.
June 11, 2026
by Srinivasarao Rayankula
· 501 Views · 1 Like
article thumbnail
Orchestrating Zero-Downtime Deployments With Temporal
Temporal provides the durable control plane for safe zero-downtime deployments across canaries, approvals, retries, and rollbacks.
June 10, 2026
by Akhil Madineni
· 491 Views
article thumbnail
Amazon OpenSearch Vector Search Explained for RAG Systems
Use Amazon OpenSearch k-NN as your RAG vector store. Build a small Python example: create the index, embed docs, search by meaning.
June 9, 2026
by Jubin Abhishek Soni DZone Core CORE
· 689 Views
article thumbnail
Token Attribution Framework for Agentic AI in CI/CD
A practical framework for tracking attribution, setting budgets, and circuit-breaking spending on LLM in your CI/CD pipeline by using an OpenTelemetry implementation.
June 9, 2026
by Intiaz Shaik
· 6,034 Views
article thumbnail
The Big Data Architecture Blueprint: Core Storage, Integration, and Governance Patterns
This comprehensive technical guide breaks down the essential architectural, storage, and integration patterns required to scale enterprise big data platforms.
June 8, 2026
by Ram Ghadiyaram DZone Core CORE
· 1,211 Views
article thumbnail
Production-Grade RAG: Why Vector Search Isn't Enough (and How Hybrid Search Fills the Gaps)
RAG pipelines are getting more and more popular with vector search at the core of them. However, vector search might not be just enough for high-quality retrieval.
June 8, 2026
by Alejandro Duarte DZone Core CORE
· 920 Views · 1 Like
article thumbnail
From 24 Hours to 2 Hours: How We Fixed a Broken BI System With Apache Airflow
Broken pipelines, inaccurate data, frustrated stakeholders. Here is what we did about it and what I wish I had known before we started.
June 5, 2026
by Chinni krishna Abburi
· 1,872 Views
article thumbnail
Is the Data Warehouse Dead? 3 Patterns From Enterprise Architecture That Answer This Question
No, but its role has fundamentally changed. Here is what I have seen work, after building data platforms at enterprise scale across multiple industries.
June 5, 2026
by Nabarun Bandyopadhyay
· 2,862 Views · 1 Like
article thumbnail
Why Round-Robin Won't Save You: Load Balancing Challenges in Data Streaming Services With Heterogeneous Traffic
Throughput-based load balancing breaks down when streaming messages have heterogeneous processing costs — the fix is balancing on actual per-partition resource usage.
June 5, 2026
by Semyon Slepov
· 1,928 Views
article thumbnail
Good Data, Bad Metric: A Mutation Testing Pattern for Analytics Engineering
A mutation testing pattern for analytics metrics that checks if validation catches realistic business logic errors early.
June 4, 2026
by Prateek Arora
· 2,071 Views · 1 Like
article thumbnail
A System Cannot Protect What It Does Not Understand
Inside the system, there is always a boundary between incoming data and stored state, and that boundary is not passive. It acts like a gatekeeper.
June 4, 2026
by Jan Nilsson
· 1,810 Views
article thumbnail
Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
This article details a resilient pseudo-labeling architecture. It combines Redis ingestion, Matryoshka embeddings, XGBoost to neutralize self-training confirmation bias.
June 4, 2026
by Harshith Narasimhan Srivatsa
· 1,906 Views
article thumbnail
Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch
STIX/TAXII in, ECS normalized, provenance preserved deterministic IDs, correct bulk writes, ingest pipelines keep threat indicator data reliable and queryable under load.
June 3, 2026
by Krishnaveni Musku
· 2,456 Views
article thumbnail
How to Save Money Using Custom LLMs for Specific Tasks
MCP transforms AI from "chatbot" to "capable agent" by managing the messy details of tool integration and execution. With local models.
June 3, 2026
by Max Tcvetkov
· 1,862 Views · 2 Likes
article thumbnail
Using LLMs to Automate Data Cleaning and Transformation Pipelines
Data cleaning is brittle and time-consuming; LLMs introduce a semantic layer that makes workflows more resilient and easier to maintain.
June 3, 2026
by David Taiwo Balogun
· 1,832 Views
article thumbnail
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
Glue failures scatter evidence across logs, metadata, and table state. A triage layer pulls it together and flags whether a rerun is safe.
June 2, 2026
by Vivek Venkatesan
· 1,793 Views · 1 Like
article thumbnail
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
Many MVPs get too big because teams treat several user-facing systems and vendor-dependent workflows as one app instead of planning one complete path first.
June 2, 2026
by Kajol Shah
· 1,333 Views
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×