Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability of a computer system to mimic human intelligence through math and logic, and ML builds on AI by developing methods that "learn" through experience and do not require explicit instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
AI coding assistants are becoming increasingly capable at generating code, explaining systems, and accelerating development workflows. But in real engineering environments, the biggest blocker is often not the model’s ability to write code. The bigger issue is whether the assistant has the right context before it starts making changes.

A developer rarely works from a single source of truth. A Jira ticket may describe the implementation task. A Google Doc may contain the detailed requirements. A slide deck may explain the business goal. A meeting summary may include key decisions, open questions, and next steps that never made it back into the ticket.

For a human developer, this creates friction. For an AI coding assistant, it creates risk. The assistant may generate code that looks correct, passes basic syntax checks, and follows existing patterns, but still implements the wrong behavior because the actual feature context was fragmented across multiple places.

This is where a PARA-style context workspace becomes useful. PARA (Projects, Areas, Resources, and Archives) is commonly used to organize knowledge by actionability. Applied to AI-assisted software development, it can become a practical architecture pattern for preparing scattered engineering knowledge before an AI coding assistant touches code.

The goal is not to dump every document into the model. The goal is to organize scattered context so the assistant can reason with the right information for the task.

The Problem: AI Coding Assistants Often See Only Part of the Work

Consider a developer asked to build a new data pipeline that calculates a generic quality score. The implementation sounds straightforward:

Build a pipeline that joins multiple input tables, applies business rules, and produces a quality score output table.
But the actual context may be spread across several sources:

| Source | What It May Contain |
|---|---|
| Ticket | Implementation scope, acceptance criteria, due date |
| Requirements doc | Business rules, scoring logic, data definitions |
| Slide deck | Business goal, stakeholder alignment, expected impact |
| Meeting summary | Final decisions, open questions, changed thresholds |
| Existing code | Pipeline patterns, naming conventions, dependency structure |
| Older documents | Previous decisions, deprecated approaches, known constraints |

If the AI coding assistant only sees the ticket, it may miss the deeper context needed to implement the feature correctly. This is especially risky for data pipelines and analytics features, where correctness depends not only on code structure but also on interpretation: which source tables to use, how freshness should be handled, how business rules are applied, and how downstream consumers will use the output.

What Can Go Wrong If the Agent Only Reads the Ticket?

A ticket often captures the visible work, but not the full reasoning behind the work. If the assistant only uses the ticket, it may:

- Implement the task but miss business rules from the requirements document
- Ignore key decisions captured in meeting summaries
- Use a technically available source table that is not the approved source for this feature
- Miss freshness expectations for the output table
- Produce a score that does not match how downstream dashboards or reports will consume it
- Follow an outdated implementation pattern because it found old but similar code
- Generate a pull request that looks reasonable but fails product or data-quality expectations

This is the core issue: The AI assistant may know how to write code, but it may not know which code should be written.

That distinction matters. For coding agents to become more reliable, developers need a better way to prepare context before code generation begins.
Reframing PARA for AI Coding Agents

PARA can be adapted from a personal knowledge organization method into a context classification pattern for AI-assisted development. In a PARA-style context workspace:

| PARA Category | Engineering Meaning | Agent Context Role |
|---|---|---|
| Projects | Active work being delivered | Current feature scope, ticket, task goal |
| Areas | Ongoing responsibilities | Standards, ownership, governance, quality expectations |
| Resources | Reusable knowledge | Docs, runbooks, design patterns, pipeline examples |
| Archives | Completed or inactive knowledge | Historical decisions, old approaches, past incidents |

This structure helps the AI assistant understand the role of each piece of information. A current requirement should not be treated the same way as an old design decision. A meeting decision should not be buried behind a generic document search. A reusable pipeline pattern should be available to guide implementation, while archived material should be used carefully as historical context.

The value of PARA is not just organization. It gives the assistant a way to distinguish between active task context, long-running rules, reusable references, and historical information.

This classification changes how the assistant approaches implementation. Instead of asking:

“What code should I generate from this ticket?”

The assistant can reason from a richer question:

“What is the active feature goal, what rules must be followed, what reusable references apply, and what historical context should be considered before changing code?”

That shift is small, but important.

Applying PARA to a Quality Score Pipeline

Now apply this to the quality score pipeline example. The feature requires a pipeline that joins multiple input tables, applies business rules, and writes a quality score output table. The exact business logic is intentionally generic, but the pattern is common across analytics engineering, data engineering, machine learning platforms, and reporting systems.
A PARA-style workspace could organize the context like this:

Project Context

This is the active feature work. It may include:

- The current ticket
- Feature scope
- Acceptance criteria
- Current implementation status
- Target output table
- Expected delivery milestone
- Known blockers or open questions

For the coding assistant, this answers: “What am I being asked to build right now?”

Area Context

This represents ongoing expectations that apply beyond this one feature. It may include:

- Data quality standards
- Freshness expectations
- Ownership rules
- Privacy or compliance constraints
- Naming conventions
- Release process
- Testing expectations

For the coding assistant, this answers: “What rules and standards must this implementation follow?”

Resource Context

This is reusable technical knowledge. It may include:

- Existing pipeline patterns
- Similar transformation logic
- Data model documentation
- Dashboard dependency notes
- Common test patterns
- Runbooks
- Data validation examples

For the coding assistant, this answers: “What reusable references should guide the implementation?”

Archive Context

This is historical information that may still be useful, but should not automatically drive the implementation. It may include:

- Older design decisions
- Deprecated scoring logic
- Past pipeline migrations
- Previous quality metric experiments
- Historical meeting notes
- Old RCA or incident learnings

For the coding assistant, this answers: “What historical context may explain why the system works this way?”

The critical point is that archived context should be used for awareness, not blindly copied into the current implementation.

Why Meeting Summaries Matter

Meeting summaries are often underestimated in AI-assisted development. In many teams, the final decision is not always reflected immediately in the ticket or requirements document.
A meeting summary may contain important details such as:

- A threshold was changed after stakeholder discussion
- A source table was rejected because of data freshness concerns
- A metric definition was clarified
- A downstream dashboard dependency was identified
- A launch decision was postponed
- An open question was assigned to another team
- A temporary workaround was approved only for the first release

For a human developer, these details may be remembered from the meeting. For an AI coding assistant, they are invisible unless they are included in context.

This is one reason a PARA-style workspace can be valuable. It gives meeting summaries a place in the feature context without treating them as random notes. A meeting summary tied to an active feature belongs in the Project context. A recurring decision about data freshness may become Area context. A reusable explanation of metric calculation may become Resource context. Once the feature is complete, the same meeting summary may eventually move into the Archive context.

How the Coding Assistant Should Use Context Before Changing Code

Before generating code, the AI coding assistant should use the structured context to form an implementation understanding. For a quality score pipeline, it should first understand:

- What the feature is trying to accomplish
- Which input data sources are approved
- Which business rules define the score
- Which decisions were finalized in meetings
- What freshness or latency expectations exist
- Which existing pipeline patterns should be followed
- What downstream dashboards, reports, or consumers depend on the output
- Which historical approaches should be avoided

Only after that should it propose an implementation plan or modify code.

This changes the assistant’s role. It is no longer simply a code generator responding to a ticket. It becomes a context-aware engineering assistant that can reason across requirements, decisions, standards, and existing system patterns.
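One way to picture this is as a small data structure that tags each piece of context with its PARA category and renders a briefing with active work first and archives last. The sketch below is illustrative only: `Category`, `ContextItem`, `ContextWorkspace`, and the sample entries are hypothetical names chosen for this example, not part of any existing tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(Enum):
    # Definition order doubles as briefing order: active work first, history last.
    PROJECT = "project"
    AREA = "area"
    RESOURCE = "resource"
    ARCHIVE = "archive"


# Hypothetical headings that tell the assistant how to treat each group.
HEADINGS = {
    Category.PROJECT: "PROJECT (what to build now)",
    Category.AREA: "AREA (rules and standards to follow)",
    Category.RESOURCE: "RESOURCE (reusable references)",
    Category.ARCHIVE: "ARCHIVE (history, for awareness only; do not copy)",
}


@dataclass
class ContextItem:
    title: str
    category: Category
    summary: str


@dataclass
class ContextWorkspace:
    items: List[ContextItem] = field(default_factory=list)

    def add(self, title: str, category: Category, summary: str) -> None:
        self.items.append(ContextItem(title, category, summary))

    def briefing(self) -> str:
        """Render all items grouped by category, in Category definition order."""
        lines: List[str] = []
        for category in Category:
            matching = [item for item in self.items if item.category is category]
            if not matching:
                continue
            lines.append(HEADINGS[category])
            lines.extend(f"- {item.title}: {item.summary}" for item in matching)
        return "\n".join(lines)


# Sample entries are invented for illustration.
workspace = ContextWorkspace()
workspace.add("Deprecated scoring logic", Category.ARCHIVE, "older formula, replaced")
workspace.add("Quality score ticket", Category.PROJECT, "join inputs, apply rules, write output table")
workspace.add("Freshness expectations", Category.AREA, "output table must meet the agreed freshness")
workspace.add("Existing pipeline pattern", Category.RESOURCE, "follow the current join-and-score layout")

text = workspace.briefing()
print(text)
```

The ordering is the design choice that matters here: whatever order the material was collected in, the assistant reads the active task first and sees archived material last, explicitly marked as awareness-only.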
The Bigger Shift: From Prompting to Context Preparation

Prompting is still useful, but it is not enough for complex engineering work. A good prompt cannot fully compensate for missing requirements, outdated context, or scattered decisions.

For AI coding assistants, the quality of the result depends heavily on the quality of the context that comes before the prompt. This is especially true when the task involves business logic, analytics definitions, data contracts, or cross-team decisions.

In those cases, the question is not:

“How do we write a better prompt?”

The better question is:

“How do we prepare the right engineering context before asking the assistant to write code?”

For developers building with AI coding agents, this may become one of the most important habits: do not ask the agent to write code first. Prepare the context first.

The future of AI-assisted development will not belong only to teams with the most powerful coding models. It will belong to teams that know how to structure knowledge so those models can make better engineering decisions.
In mid-September 2025, engineers inside Anthropic's threat intelligence team noticed something that didn't fit the usual pattern of automated probing on their platform. Ten days of digging later, they had a name for it: GTG-1002, a Chinese state-sponsored group that had turned Claude Code into the operational core of a cyber-espionage campaign against roughly thirty organizations: banks, chemical manufacturers, tech firms, and government agencies.

When Anthropic published its account of the intrusion on November 14, the detail that made security teams sit up wasn't the target list. It was the autonomy ratio: by the company's own estimate, the AI agent executed somewhere between 80 and 90 percent of the operation (reconnaissance, vulnerability discovery, exploit development, lateral movement, exfiltration), with humans stepping in only at a handful of strategic checkpoints. Jacob Klein, who heads threat intelligence at Anthropic, called it an escalation that lowers the bar for who can run a sophisticated intrusion at all.

I've spent the better part of this year watching that bar keep dropping, one disclosure at a time. And the thing I keep coming back to is this: the security industry built thirty years of tooling around the assumption that the dangerous actor inside your network is a person: a careless employee, a disgruntled admin, a phished contractor. That assumption is now wrong often enough to be a liability. The dangerous actor increasingly has no payroll record, no badge, no manager to flag erratic behavior. It's a process. And it's already inside.

Skeleton Keys for Software

Here's the uncomfortable arithmetic. CyberArk's 2025 Identity Security Landscape study found machine identities now outnumber human ones by more than 80 to 1 inside the average enterprise, with AI specifically named as the biggest driver of new privileged accounts this year.
Other measurements land in a wide band (Rubrik Zero Labs put it at 82 to 1; Entro Labs measured DevOps-heavy environments at 144 to 1), but every credible estimate points in the same direction, and the gap is widening faster than anyone's governance program.

What makes this dangerous isn't the count. It's the habit. Most teams I've talked with over the past eighteen months reached for the path of least resistance when they first wired an agent into production: they handed it a copy of a human's API key, or a service account with the same standing privileges everyone else in that pipeline already had. It's the software equivalent of cutting a spare house key and leaving it under the mat: convenient until the day someone you didn't intend finds it.

That convenience is exactly what blew up Salesloft and its customers in August 2025. Attackers tracked as UNC6395 didn't breach Salesforce. They stole OAuth tokens belonging to Drift, a chatbot integration plugged into it, and used those long-lived, broadly scoped tokens to walk into Salesforce, Slack, AWS, and Google Workspace environments at more than 700 downstream organizations, Cloudflare and Google among them, over roughly a ten-day window. Nobody compromised the platform. They compromised the credential that the integration was trusted with, and that credential opened far more doors than the integration's actual job required. Swap "chatbot integration" for "AI agent," and you've described the exact failure mode every analyst is now warning about for 2026.
The fix that keeps surfacing in serious architecture conversations isn't exotic. It's the same zero-trust logic that's been preached at humans for a decade, finally pointed at software:

| | Skeleton-key model | Scoped-identity model |
|---|---|---|
| Credential | Copied human API key or shared service account | Unique identity per agent, issued via OAuth client credentials or a workload-identity standard like SPIFFE |
| Lifetime | Static, often unrotated for months or years | Short-lived, reissued per session or task |
| Blast radius if stolen | Everything that account can touch | Only what that specific agent was scoped to do |
| Auditability | "Someone" did this | This agent, acting on this task, did this |

None of this is theoretical anymore. Gartner is telling boards that by 2028, roughly a third of enterprise applications will carry embedded agentic AI, and 15 percent of day-to-day work decisions will be made without a human in the loop. You cannot run that volume of autonomous action on credentials designed for an employee who logs in, does a job, and logs out.

When the Prompt Is the Payload

If identity is the slower-burning problem, prompt injection is the one that's already setting things on fire. OWASP's 2025 Top 10 for LLM Applications kept it at the number-one slot for a second consecutive edition, and for good reason: an LLM has no architectural separation between "instructions I should obey" and "data I should merely read." Feed it both in the same channel, and a sufficiently clever attacker can make the model treat the second as the first.

The cleanest public demonstration of how bad this gets in practice is CamoLeak, the vulnerability researcher Omer Mayraz disclosed through Legit Security in October 2025, tracked as CVE-2025-59145 with a CVSS score of 9.6.
The setup was almost playful: hide an instruction inside a pull request's invisible comment field, wait for a developer to ask GitHub Copilot Chat to review that PR, and let Copilot, operating with that developer's own repository privileges, quietly search the codebase for strings like "AWS_KEY," then exfiltrate whatever it found one character at a time. Each character got mapped to its own GitHub-hosted image URL, routed through GitHub's own trusted Camo proxy so the outbound traffic looked like nothing more than a chat window rendering a picture. Legit Security's CTO, Liav Caspi, put the core problem plainly: a vigilant network monitor might catch the unusual request pattern, but the average user or maintainer almost certainly wouldn't. GitHub closed the hole in August by disabling image rendering in Copilot Chat entirely: a blunt fix, but an honest acknowledgment that there was no elegant patch for the underlying design flaw.

What should worry you is that CamoLeak is GitHub-specific plumbing wrapped around a generic problem. Any agent that reads untrusted content and can also take action (summarize an inbox, browse a webpage, query a ticketing system) has the same exposed nerve. The attack surface isn't the code. It's the fact that the model can't reliably tell an instruction from a sentence describing one.

MCP Didn't Invent the Confused Deputy. It Industrialized It.

The Model Context Protocol turned eighteen months old this past spring, and in agent circles it's already being described, only half-jokingly, as the USB-C of AI tooling: a single standard that lets an agent plug into dozens of databases, SaaS platforms, and internal systems without custom integration code for each one. That convenience is precisely why it became 2025's most interesting new attack surface. CVE-2025-49596 let attackers run arbitrary commands through unauthenticated MCP Inspector instances, rated 9.4.
CVE-2025-6514, found in the widely used mcp-remote project, hit 9.6 and gave attackers OS-level command execution simply by getting an MCP client to connect to a malicious server. Researchers at Invariant Labs separately showed they could pull private repository data and WhatsApp message history out through MCP integrations that trusted server-supplied tool descriptions a little too much.

That last detail is the one practitioners now call tool poisoning, and it deserves more attention than it gets. An MCP server doesn't just expose a function; it ships a natural-language description of that function for the model to read. Bury a hidden instruction inside that description, and the agent absorbs it as context with the same credulity it would extend to legitimate documentation. Layer in what researchers call a rug pull (a tool that behaved safely last week, silently swapping in malicious behavior this week, with no re-approval prompt) and you've got a supply chain risk that traditional dependency scanning has no vocabulary for.

Underneath all of it sits the same architectural sin the original insider-threat literature has been naming for years: authorization quietly divorcing from authentication. An MCP server executing a database query on an agent's behalf needs to know not just that the agent is who it claims to be, but what the human or task behind that request was actually authorized to do. Skip that check, and you've built a confused deputy that will dutifully escalate its own privileges on a stranger's behalf.

Where the Policy Engine Has to Live

The architecture pattern that's converging across the vendors and practitioners I trust most isn't subtle, and that's its strength.
You insert a policy decision point (Cerbos, Open Policy Agent, or an equivalent) directly in the path between the agent's tool calls and the systems those calls touch, so that nothing executes on trust alone:

Plain Text

      User
       |
       v
    AI Agent ----(declares identity + intent)----> Policy Engine (PDP)
       ^                                                  |
       |                                          allow?  |  deny?
       |                                                  v
       |                                             MCP Server -----> Database / API
       |                                                  |
       +---------------------(action result)--------------+

The point of that middle box is to ask a boring, specific question on every single call: which agent is this, what was it actually asked to do, and does this particular action fall inside that scope? "Only SalesBot may call lookup_customer." "Any transfer above a threshold requires a human approval step before the MCP server executes it." None of that logic lives in the model's good judgment, because the model's judgment is exactly what prompt injection is designed to corrupt. The enforcement has to sit somewhere a crafted sentence can't reach it.

This is also, not coincidentally, where the Cloud Security Alliance's "toxic cloud trilogy" (a public workload, a real vulnerability, and standing high-level privilege, all present at once) actually gets defused. CSA's own telemetry shows the combination present in 38 percent of workloads in early 2024, falling to 29 percent by mid-2025, as organizations started pulling standing privilege out of the equation. That's real progress. It's also nowhere near fast enough for the rate at which agents are being deployed.

What 2026 Actually Requires

I don't think the next twelve months are going to be defined by a single dramatic breach, although there will probably be one anyway.
I think they'll be defined by something quieter and more structural: the slow, overdue migration of agents off static, shared credentials and onto something closer to what SPIFFE and SPIRE were originally built for in the service-mesh world — short-lived, cryptographically verifiable, per-workload identity that can be issued, scoped, and revoked without anyone touching a spreadsheet of API keys. OWASP published a dedicated Non-Human Identity Top 10 in 2025 for exactly this reason; the existing application-security and human-IAM playbooks simply don't have entries for credentials that never sleep, never request access, and inherit whatever standing permission happens to be sitting there. The governance gap is still wide open. Recent industry surveys put the share of organizations with mature agent-governance programs below one in five, even as more than ninety percent of security leaders rate the problem as critical. That mismatch — high anxiety, low operational maturity — is usually the exact condition under which the expensive breach happens. My honest read, after a year of watching this space accelerate: the organizations that treat their agents as first-class, individually identified, least-privileged principals from day one will look unremarkable in hindsight. The ones that didn't will be writing the incident reports everyone else cites in 2027.
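To make the policy-decision-point idea from this article concrete, here is a minimal sketch in plain Python. It is not the Cerbos or Open Policy Agent API; `AgentIdentity`, `ToolCall`, `decide`, the `transfer_funds` tool name, the five-minute lifetime, and the threshold value are all hypothetical stand-ins chosen for illustration. It only shows the shape of the check: a short-lived, per-agent identity with an explicit tool scope, evaluated outside the model on every call.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    allowed_tools: FrozenSet[str]  # explicit scope, not inherited privileges
    expires_at: float              # Unix timestamp; short-lived by design


@dataclass(frozen=True)
class ToolCall:
    tool: str
    amount: float = 0.0
    human_approved: bool = False


APPROVAL_THRESHOLD = 1_000.0  # hypothetical limit for unattended transfers


def decide(identity: AgentIdentity, call: ToolCall,
           now: Optional[float] = None) -> Tuple[bool, str]:
    """Return (allowed, reason) for one tool call. Deny on the first failed check."""
    now = time.time() if now is None else now
    if now >= identity.expires_at:
        return False, "credential expired"
    if call.tool not in identity.allowed_tools:
        return False, f"{identity.agent_id} is not scoped for {call.tool}"
    if (call.tool == "transfer_funds"
            and call.amount > APPROVAL_THRESHOLD
            and not call.human_approved):
        return False, "human approval required above threshold"
    return True, "allowed"


# Two illustrative agents, each with its own five-minute identity.
sales_bot = AgentIdentity("SalesBot", frozenset({"lookup_customer"}),
                          expires_at=time.time() + 300)
payments_bot = AgentIdentity("PaymentsBot", frozenset({"transfer_funds"}),
                             expires_at=time.time() + 300)

print(decide(sales_bot, ToolCall("lookup_customer")))
print(decide(sales_bot, ToolCall("transfer_funds", amount=50.0)))
print(decide(payments_bot, ToolCall("transfer_funds", amount=5_000.0)))
print(decide(payments_bot, ToolCall("transfer_funds", amount=5_000.0, human_approved=True)))
```

Because every rule is ordinary code keyed on the declared identity and the requested action, a prompt-injected instruction can change what the agent asks for, but not what the check allows.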
The 3:00 AM Incident That Changed Everything

It was a Tuesday morning when the alerts started firing. Our recommendation engine, the one that drives 30% of our revenue, had tanked. Accuracy dropped from 94% to 58% overnight. The data science team immediately blamed the model. They started tweaking hyperparameters, re-training on new data, and running diagnostics. Nothing worked.

I got pulled into the war room at 3:00 AM. The first thing I asked wasn't "What's wrong with the model?" It was "What changed in the data pipeline?"

Turns out, everything. A vendor had pushed a schema change upstream. A field that used to be required became optional. Null values started flowing through our pipeline. Our feature engineering code didn't handle nulls gracefully; it just propagated them downstream. By the time the data reached the model, 40% of our feature vectors were corrupted.

The model wasn't broken. The data was.

We spent six hours manually rolling back the schema change, re-running the pipeline, and restoring service. The incident report was brutal: "Lack of data validation caught a breaking change too late."

That's when I realized we needed observability in our data pipeline, not just in our models.

The Problem: Data Quality is Invisible Until It Breaks

Here's the uncomfortable truth about data pipelines: they fail silently. Your ETL job completes successfully. Your Spark cluster finishes transformations. Your data warehouse loads without errors. Everything looks green in the monitoring dashboard.

But the data itself? Garbage in, garbage out.

There are three categories of failures that break AI models in production:

Missing Values: A source system stops populating a field. Your pipeline doesn't validate it. The model gets NaN values it never saw during training. Predictions become random noise.

Schema Changes: An upstream team adds a new column, renames an existing one, or changes data types. Your pipeline doesn't expect these changes.
Either it crashes, or worse, it silently maps data to the wrong columns.

Distribution Shifts: The statistical properties of your data change. A field that was always between 0 and 100 suddenly has values of 50,000. Your model's scaling assumptions break. Predictions become nonsensical.

None of these show up in traditional infrastructure monitoring. Your CPU is fine. Memory is fine. Network is fine. But your data is on fire.

The Solution: Observability at Every Layer

I started building a three-layer observability framework using dbt, Great Expectations, and custom validation logic. The goal was simple: catch data quality issues before they reach the model.

Layer 1: dbt Tests (The First Line of Defense)

dbt tests are your cheapest, fastest way to catch obvious data quality issues. They run after every transformation and fail the entire pipeline if something's wrong.

Here's what we implemented:

YAML

    # models/staging/stg_user_events.yml
    version: 2

    models:
      - name: stg_user_events
        columns:
          - name: user_id
            tests:
              - not_null
              - unique
          - name: event_timestamp
            tests:
              - not_null
              # At column level, dbt_utils prepends the column name,
              # so the expression starts with the operator.
              - dbt_utils.expression_is_true:
                  expression: "<= current_timestamp()"
          - name: event_value
            tests:
              - not_null
              - dbt_utils.expression_is_true:
                  expression: "> 0"

These tests are simple but powerful. They catch:

- Missing required fields (not_null)
- Duplicate records (unique)
- Impossible values (event_timestamp in the future)
- Out-of-range values (negative prices)

We run these tests on every pipeline run. If any test fails, the pipeline stops. No data reaches the model. No silent corruption.

The beauty of dbt tests is that they're version-controlled, documented, and part of your transformation code. When a schema change happens, you update the test, commit it, and everyone knows what changed.

Layer 2: Great Expectations (The Statistical Validator)

dbt tests catch structural issues. Great Expectations catches statistical anomalies, the subtle shifts that break models.
Here's a real scenario: our user_age column had a distribution of 18-65 for two years. Then one day, we started getting ages of 200, 500, 1000. A data entry bug upstream. dbt tests wouldn't catch this because the values are technically valid integers. But Great Expectations would.

Python

    # great_expectations/expectations/user_events_expectations.py
    # Written against the legacy (pre-1.0) Great Expectations API.
    from great_expectations.core.batch import RuntimeBatchRequest
    from great_expectations.data_context import DataContext

    context = DataContext()

    suite = context.create_expectation_suite(
        expectation_suite_name="user_events_suite",
        overwrite_existing=True,
    )

    validator = context.get_validator(
        batch_request=RuntimeBatchRequest(
            datasource_name="my_spark_datasource",
            data_connector_name="default_runtime_data_connector",
            data_asset_name="user_events",
            # A runtime batch also needs the data and a batch identifier.
            # `user_events_df` (a DataFrame loaded earlier) and the identifier
            # key are assumptions; match them to your datasource config.
            runtime_parameters={"batch_data": user_events_df},
            batch_identifiers={"default_identifier_name": "user_events_batch"},
        ),
        expectation_suite_name="user_events_suite",
    )

    # Expect user_age to be between 18 and 120
    validator.expect_column_values_to_be_between(
        column="user_age", min_value=18, max_value=120
    )

    # Expect event_value to have a mean between 50 and 200
    validator.expect_column_mean_to_be_between(
        column="event_value", min_value=50, max_value=200
    )

    # Expect less than 5% missing values in critical columns
    validator.expect_column_values_to_not_be_null(column="user_id", mostly=0.95)

    # Expect the event_type distribution to match historical patterns.
    # The weights are placeholders; substitute your own historical shares.
    validator.expect_column_kl_divergence_to_be_less_than(
        column="event_type",
        partition_object={
            "values": ["click", "view", "purchase"],
            "weights": [0.6, 0.3, 0.1],
        },
        threshold=0.1,
    )

    validator.save_expectation_suite(discard_failed_expectations=False)

Great Expectations runs after dbt tests. It validates:

- Value ranges (age between 18 and 120)
- Statistical properties (mean event value between 50 and 200)
- Null rates (less than 5% missing in critical columns)
- Distribution shifts (event_type distribution matches historical patterns)

If Great Expectations detects an anomaly, it alerts us. We investigate before the data reaches the model.
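For intuition about what a distribution-shift check measures, the sketch below computes a KL divergence between a baseline category mix and the current one using only the standard library. It is an illustration of the idea, not Great Expectations' internal implementation; the baseline shares, the sample event lists, and the `shares` and `kl_divergence` helpers are hypothetical. The 0.1 threshold mirrors the one used in the expectation above.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def shares(values: Iterable[str]) -> Dict[str, float]:
    """Convert raw category values into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def kl_divergence(current: Dict[str, float], baseline: Dict[str, float],
                  epsilon: float = 1e-9) -> float:
    """KL(current || baseline) in nats; epsilon smoothing avoids log(0) and division by zero."""
    total = 0.0
    for category in set(current) | set(baseline):
        p = current.get(category, 0.0) + epsilon
        q = baseline.get(category, 0.0) + epsilon
        total += p * math.log(p / q)
    return total


THRESHOLD = 0.1

# Hypothetical historical mix of event types.
baseline = {"click": 0.6, "view": 0.3, "purchase": 0.1}

# A day that looks like history, and a day where purchases suddenly dominate.
normal_day = ["click"] * 60 + ["view"] * 30 + ["purchase"] * 10
shifted_day = ["click"] * 10 + ["view"] * 10 + ["purchase"] * 80

normal_score = kl_divergence(shares(normal_day), baseline)
shifted_score = kl_divergence(shares(shifted_day), baseline)

print("normal day exceeds threshold:", normal_score > THRESHOLD)
print("shifted day exceeds threshold:", shifted_score > THRESHOLD)
```

A score near zero means the current mix matches the baseline; the further the mix drifts, the larger the score, which is why a single threshold can gate the pipeline.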
Layer 3: Custom Validation (The Domain Expert)

dbt and Great Expectations are generic. Your domain is specific. We added custom validation logic that understands our business.

Python

    # pipelines/validation/custom_validators.py
    from datetime import datetime

    import pandas as pd


    def validate_feature_engineering(df: pd.DataFrame) -> dict:
        """
        Custom validation for features before they reach the model.
        Returns a dict of validation results.
        """
        results = {}

        # Validate 1: Feature completeness
        # We need at least 95% of feature cells populated
        feature_cols = [col for col in df.columns if col.startswith('feature_')]
        total_cells = len(df) * len(feature_cols)
        # Guard against an empty frame; treat "nothing to check" as a failure
        null_rate = (
            float(df[feature_cols].isnull().sum().sum()) / total_cells
            if total_cells else 1.0
        )
        results['feature_completeness'] = {
            'passed': null_rate < 0.05,
            'null_rate': null_rate,
            'threshold': 0.05
        }

        # Validate 2: Feature scaling
        # After normalization, features should sit roughly within 3 sigma;
        # anything at or beyond +/-10 is flagged as a scaling failure
        for col in feature_cols:
            max_val = df[col].max()
            min_val = df[col].min()
            results[f'{col}_scaling'] = {
                'passed': bool(max_val < 10 and min_val > -10),
                'max': max_val,
                'min': min_val
            }

        # Validate 3: Temporal consistency
        # The newest event should be recent (within last 30 days)
        if 'event_date' in df.columns:
            # Parse into a local Series so the caller's DataFrame is not mutated
            event_dates = pd.to_datetime(df['event_date'])
            days_old = (datetime.now() - event_dates.max()).days
            results['temporal_freshness'] = {
                'passed': days_old < 30,
                'days_old': days_old,
                'threshold_days': 30
            }

        # Validate 4: Business logic
        # Revenue should never be negative
        if 'revenue' in df.columns:
            negative_revenue = int((df['revenue'] < 0).sum())
            results['business_logic_revenue'] = {
                'passed': negative_revenue == 0,
                'negative_count': negative_revenue
            }

        return results


    def validate_and_alert(df: pd.DataFrame, validation_results: dict) -> bool:
        """
        Check all validations and alert if any fail.
        Returns True if all pass, False otherwise.
        """
        all_passed = True

        for check_name, check_result in validation_results.items():
            if not check_result['passed']:
                all_passed = False
                print(f"ALERT: {check_name} failed")
                print(f"Details: {check_result}")
                # Send to monitoring system (Datadog, New Relic, etc.)
                # send_alert(check_name, check_result)

        return all_passed

This custom validation runs after Great Expectations. It checks:

- Feature completeness (95% of features populated)
- Feature scaling (normalized features in the expected range)
- Temporal freshness (data is recent)
- Business logic (revenue is not negative)

If any check fails, we block the pipeline and alert the team.

The Real-World Gotchas We Discovered

Gotcha 1: Validation Overhead

Running dbt tests, Great Expectations, and custom validation on every pipeline run adds latency. We went from 15-minute runs to 25-minute runs. The trade-off was worth it (catching one data quality issue saved us more time than we lost), but you need to plan for it.

Gotcha 2: False Positives

Great Expectations' distribution shift detection is sensitive. Legitimate business changes (a marketing campaign causing a spike in user_age distribution) triggered false alerts. We had to tune thresholds carefully and add context to alerts.

Gotcha 3: Schema Changes Are Sneaky

A vendor added a new column to an upstream table. Our pipeline didn't break; it just ignored the new column. But the data science team expected it. We added schema validation to catch new columns and alert us.

Gotcha 4: Null Handling Varies

Python treats null as None. SQL treats it as NULL. Spark treats it as null. When data flows between systems, nulls get lost or misinterpreted. We had to standardize null handling across the entire pipeline.
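As one illustration of the null-handling problem in Gotcha 4, the sketch below normalizes the different spellings of "missing" into a single Python `None` at a system boundary. It uses only the standard library and operates on plain dict rows; the sentinel list, the `normalize_null` and `normalize_rows` helpers, and the sample rows are assumptions made for this example, not the pipeline's actual code.

```python
import math
from typing import Any, Dict, List

# Hypothetical list of strings that upstream systems use to mean "missing".
# Compared case-insensitively after stripping whitespace.
NULL_SENTINELS = {"", "null", "none", "nan", "n/a"}


def normalize_null(value: Any) -> Any:
    """Map every known representation of a missing value to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_SENTINELS:
        return None
    return value


def normalize_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return new rows with nulls standardized; the input rows are left untouched."""
    return [{key: normalize_null(val) for key, val in row.items()} for row in rows]


# Sample rows invented for illustration: three different spellings of "missing".
raw_rows = [
    {"user_id": "u1", "revenue": 10.0},
    {"user_id": "NULL", "revenue": float("nan")},
    {"user_id": "", "revenue": 7.5},
]

clean_rows = normalize_rows(raw_rows)
missing_user_ids = sum(row["user_id"] is None for row in clean_rows)
print(missing_user_ids)
```

The sentinel list is a judgment call: if a string such as "N/A" is a legitimate value in one of your columns, it should be excluded for that column rather than normalized globally.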
The Framework: A Decision Matrix

Here's how we decide which validation layer to use:

| Issue Type | Caught By | Example | Action |
| --- | --- | --- | --- |
| Missing required field | dbt tests | user_id is null | Fail pipeline immediately |
| Duplicate records | dbt tests | Same user_id appears twice | Fail pipeline immediately |
| Impossible values | dbt tests | event_timestamp in future | Fail pipeline immediately |
| Out-of-range values | Great Expectations | age > 150 | Alert, investigate, fail if severe |
| Distribution shift | Great Expectations | event_value mean changes 50% | Alert, investigate, continue if acceptable |
| Business logic violation | Custom validation | revenue is negative | Alert, investigate, fail |
| Schema change | Custom validation | New column added upstream | Alert, investigate, update tests |

The Results: From Chaos to Confidence

After implementing this three-layer framework:

- Incident reduction: We went from 2-3 data quality incidents per month to 0 in six months.
- Time to resolution: When issues do occur, we catch them within minutes instead of hours.
- Model stability: Model accuracy stopped fluctuating. It's now consistently 93-95%.
- Team confidence: Data scientists trust the data. Engineers trust the pipeline.

The best part? We caught the schema change incident before it happened. Great Expectations detected the distribution shift, we investigated, found the upstream change, and coordinated with the vendor team before any data reached production.

Getting Started: The Minimal Viable Observability

You don't need to implement everything at once. Start here:

- Week 1: Add dbt tests for not_null and unique on critical columns.
- Week 4: Set up alerting so you're notified when validations fail.

That's it. You now have observability in your data pipeline.

Conclusion: Observability Saves Models

Your AI model isn't failing because it's bad. It's failing because the data feeding it is bad. And you won't know the data is bad until you look.
The best models in the world can't save you from garbage data. But good observability can. dbt tests, Great Expectations, and custom validation aren't fun. They don't make it into conference talks. But they'll save your production system at 3:00 AM. Start small. Test early. Validate often.
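The schema check from Gotcha 3 (a new upstream column that the pipeline silently ignored) can be sketched in a few lines. This is a minimal illustration, not the validation the team actually shipped: the function name `detect_schema_drift` and the column names are hypothetical, and the expected column list is assumed to live somewhere under version control.

```python
import pandas as pd


def detect_schema_drift(df: pd.DataFrame, expected_columns: list) -> dict:
    """Compare the columns that arrived with the columns the pipeline expects.

    Hypothetical helper: returns a result dict in the same shape as the
    other custom validators ('passed' plus supporting details).
    """
    actual = set(df.columns)
    expected = set(expected_columns)
    return {
        "passed": actual == expected,
        # Columns that appeared upstream but are not in the contract
        "unexpected_columns": sorted(actual - expected),
        # Columns the contract requires but the data does not contain
        "missing_columns": sorted(expected - actual),
    }


# Example: the vendor adds 'vendor_flag' to an upstream table
incoming = pd.DataFrame(
    {"user_id": [1, 2], "event_value": [0.4, 0.9], "vendor_flag": [True, False]}
)
drift = detect_schema_drift(incoming, ["user_id", "event_value"])
print(drift)
```

Because the result has a `passed` key, it can be merged into the dict that `validate_and_alert` iterates over, so a new column raises the same kind of alert as any other failed check.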
Every CISO I talk to right now is juggling two deadlines that feel unrelated and aren't. One is the slow-motion arrival of quantum computers capable of breaking the public-key cryptography that underpins basically everything — TLS, SSH, JWTs, code-signing. The other is the much faster arrival of AI-assisted coding tools that are shipping security-critical code nobody has fully reviewed. I used to think of these as separate beats. I don't anymore, because the same root failure shows up in both: organizations adopting powerful new capability faster than they're building the visibility and discipline to govern it. Post-Quantum Planning: The Inventory Problem Comes First NIST finalized its first three post-quantum cryptography standards on August 13, 2024, after an eight-year, multi-round public competition: FIPS 203 (ML-KEM, the lattice-based key encapsulation mechanism formerly known as Kyber), FIPS 204 (ML-DSA, the signature scheme formerly known as Dilithium), and FIPS 205 (SLH-DSA, the hash-based fallback formerly known as SPHINCS+). In March 2025, NIST added a fourth algorithm, HQC, specifically chosen because it rests on a different mathematical hardness assumption than the lattice problems underneath ML-KEM and ML-DSA — a deliberate hedge in case lattice-based cryptography turns out to have a weakness nobody's found yet. The NSA's CNSA 2.0 guidance sets 2030 as the mandatory PQC migration deadline for national security systems, and NIST's broader timeline calls for deprecating RSA and ECDSA entirely by 2035. Gartner's framing of where most organizations actually stand is the line I keep sending to clients verbatim: many organizations are already prototyping PQC and improving crypto-agility, but visibility gaps persist. 
That's the polite analyst version of what I see in the field, which is teams that can tell you they've tested ML-KEM in a lab environment but cannot tell you how many of their production TLS endpoints, SSH host keys, or embedded device certificates are still running plain RSA-2048 with no migration path at all. Gartner's own recommendation sequence is the right one: start a cryptographic inventory, stand up a cryptographic center of excellence, push vendors for their PQC roadmaps, and prioritize migration for whatever data needs to stay confidential the longest. That last point matters more than people give it credit for — "harvest now, decrypt later" only threatens data that's still sensitive when a quantum computer capable of breaking it eventually shows up, so a database of last quarter's marketing metrics is not your priority. Decades-long medical records, government communications, and long-lived intellectual property are. The actual transition is happening faster than most security teams realize, which is encouraging, but it's happening unevenly. Cloudflare's 2025 Radar Year in Review reported that post-quantum-encrypted TLS 1.3 traffic nearly doubled across the year, from 29% in January to 52% by early December — driven heavily by browser vendors enabling hybrid post-quantum key exchange by default and by Apple's iOS 26 release in September 2025, after which the share of post-quantum-capable requests from iOS devices jumped from under 2% to 11% in four days and passed 25% by December. That's the client side. The server side is lagging noticeably: Cloudflare's own measurements put post-quantum-preferred key agreement on the origin server side at roughly 10% as of early 2026, up from under 1% a year earlier — a tenfold increase, but still a small minority. Browsers adopted PQC essentially invisibly. 
Backend infrastructure, predictably, is the harder problem, because it's full of legacy TLS terminators, hardcoded cipher suites, and vendor appliances nobody wants to touch. Quantum-Resistant Identity: Don't Wait for "Done" The identity layer is where crypto-agility gets concrete rather than theoretical. A PQC-ready JWT issuer isn't exotic engineering — it means your signing service can issue tokens using ML-DSA instead of (or alongside) RS256 or ES256, and your verification logic can check either signature type without a code change every time the algorithm preference shifts. The same logic applies to your internal certificate authority: if your CA can only issue RSA or ECDSA certs today, you don't have crypto-agility; you have a single point of future failure with a five-to-ten-year fuse on it. NIST has indicated that commercially available post-quantum certificates from public CAs likely won't be common until sometime in 2026, which means internal PKI teams building their own quantum-aware issuance now are ahead of the commercial market, not behind some imaginary deadline. It's worth being honest that the early implementations of these algorithms have already had real bugs. In late 2023, researchers disclosed "KyberSlash," a timing side-channel in several Kyber/ML-KEM implementations caused by non-constant-time arithmetic during decapsulation — an attacker with precise enough timing measurements could, in principle, recover a private key. The reference implementations were patched by December 2023, and it's a useful reminder that a mathematically sound post-quantum algorithm is not automatically a secure deployment; the implementation needs the same constant-time discipline that took classical cryptography decades to get right, except this time the industry doesn't have decades to learn the lesson slowly. 
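The crypto-agile verification described earlier in this section (accept either signature type, with no code change when the preference shifts) comes down to looking up the verifier by algorithm name instead of hardcoding it. The sketch below is an assumption-heavy illustration: the Python standard library has no ML-DSA or RSA signature support, so both registered "algorithms" are HMAC stand-ins, and `VerifierRegistry` and the algorithm labels are hypothetical names, not part of any JWT library.

```python
import hmac
from typing import Callable, Dict

# A verifier takes (key, message, signature) and returns True or False
Verifier = Callable[[bytes, bytes, bytes], bool]


def _hmac_verifier(digest: str) -> Verifier:
    """Build a stand-in verifier. A real deployment would wrap RS256/ES256
    or ML-DSA verification from a vetted cryptography library here."""
    def verify(key: bytes, message: bytes, signature: bytes) -> bool:
        expected = hmac.new(key, message, digest).digest()
        # Constant-time comparison, so timing does not leak the expected value
        return hmac.compare_digest(expected, signature)
    return verify


class VerifierRegistry:
    """Hypothetical abstraction layer: algorithm choice lives in one place."""

    def __init__(self) -> None:
        self._verifiers: Dict[str, Verifier] = {}

    def register(self, alg: str, verifier: Verifier) -> None:
        self._verifiers[alg] = verifier

    def retire(self, alg: str) -> None:
        self._verifiers.pop(alg, None)

    def verify(self, alg: str, key: bytes, message: bytes, signature: bytes) -> bool:
        verifier = self._verifiers.get(alg)
        if verifier is None:
            return False  # unknown or retired algorithm: fail closed
        return verifier(key, message, signature)


# During the transition window both algorithms are accepted
registry = VerifierRegistry()
registry.register("CLASSICAL-STANDIN", _hmac_verifier("sha256"))
registry.register("PQC-STANDIN", _hmac_verifier("sha512"))
```

Retiring the classical algorithm later is a single `registry.retire("CLASSICAL-STANDIN")` call; callers of `verify` do not change.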
AI/Vibe Coding Risk: The Other Deadline Andrej Karpathy coined the term "vibe coding" on February 2, 2025, to describe a development style where a programmer describes what they want in plain language, accepts the AI's output largely on faith, and iterates through follow-up prompts rather than reading the generated code line by line. Collins English Dictionary named it Word of the Year for 2025, which tells you how fast the practice spread — and the security data on what it's producing is not encouraging. Veracode's 2025 GenAI Code Security Report tested more than 100 large language models across multiple languages and found that AI-generated code failed basic secure-coding benchmarks roughly 45% of the time, containing on the order of 2.74 times more vulnerabilities than comparable human-written code, with Java the worst performer at a 72% failure rate. Georgia Tech's Systems Software and Security Lab has been tracking this concretely since launching its Vibe Security Radar project in May 2025: CVEs directly attributable to AI coding tools went from six in January 2026 to fifteen in February to thirty-five in March — more in that single month than the entire second half of 2025 combined. Hanqing Zhao, the graduate researcher leading the project, made the point that's stuck with me most: when an AI agent ships something without an authentication check, that's not a typo slipping through — it's a design flaw built in from the start, because the model was never reasoning about access control as a requirement in the first place. The concrete incident I'd point a skeptical engineering lead to is the "Rules File Backdoor," disclosed by Pillar Security on March 18, 2025. AI coding assistants like Cursor and GitHub Copilot let developers drop configuration files — .cursor/rules and similar — into a repository to steer the assistant's behavior and style. 
Pillar's researchers found that an attacker could embed hidden Unicode characters (zero-width joiners, bidirectional text-direction markers, invisible to a human skimming the file) inside those configuration files. The AI assistant parses and follows the hidden instructions anyway and silently generates backdoored code that looks completely clean in a normal code review because the part doing the steering was never visible to the reviewer in the first place. That's the vibe-coding risk model in one sentence: the attack surface isn't just "the model might write a bug." It's "the model is now a thing an attacker can prompt-inject without ever touching your repository's visible diff."

What I'd Actually Build

PRE-COMMIT / CI LAYER
  → Static analysis + secret scanning on every AI-assisted commit, no exceptions for "just a quick fix"
  → Configuration-file integrity checks: scan .cursor/rules, Copilot instructions, and similar files for non-printable/invisible Unicode before they're trusted by any assistant
  → Flag any AI-generated auth, crypto, or payment-handling code for mandatory human review; never auto-merge

CRYPTO-AGILITY LAYER (build-time)
  → Centralize all algorithm selection behind a crypto abstraction layer / feature flag, never hardcoded cipher suites or signature algorithms scattered through the codebase
  → CI step that fails the build if a new dependency introduces a hardcoded RSA/ECDSA-only code path with no PQC fallback registered

DEPLOY LAYER (quantum-aware)
  → TLS termination points support hybrid key exchange (e.g., X25519+ML-KEM) by default
  → Internal CA issues hybrid or PQC-capable certs for anything with a multi-year expected lifetime
  → JWT issuers support dual-algorithm signing (classical + ML-DSA) during the transition window, with verification accepting either until classical is formally retired

The pre-commit layer is aimed at the faster clock: it's the thing that would have caught the Rules File Backdoor pattern before it shipped, by
treating AI-assistant configuration as untrusted input rather than developer intent. The crypto-agility and deploy layers are aimed at the slower clock, and they're cheaper to build now than to retrofit in 2029 when public certificate lifespans are down to 47 days, and nobody can find every RSA-2048 endpoint in a hurry. Neither layer replaces human judgment. Both exist because human judgment, applied once at design time, doesn't scale to a world where code gets generated in seconds, and algorithms need to rotate on a schedule measured in weeks, not years. The End-to-End Scenario, Compressed A developer asks an AI assistant to add a new payment-confirmation endpoint. The assistant generates working code, plus a JWT validation routine that happens to hardcode RS256. CI catches the hardcoded algorithm against the crypto-agility policy and fails the build, not because RS256 is currently insecure, but because the policy says nothing security-critical ships without going through the abstraction layer. A human reviews the auth logic specifically because the pipeline flagged it as AI-generated and security-sensitive. It merges with dual-algorithm signing support intact. None of this required the developer to become a post-quantum cryptography expert or to read every line the model produced. It required the pipeline to assume, by default, that AI-generated code and classical-only cryptography are both temporary conveniences that need a forcing function to age out gracefully — because left to their own momentum, neither one ages out on its own. The teams that get hurt by both of these trends at once aren't unlucky. They're the ones that treated "we'll deal with that later" as a plan for two clocks that were never going to wait.
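The configuration-file integrity check from the pre-commit layer can be sketched with the standard library alone. This is a minimal illustration under one stated assumption: it treats every character in Unicode general category `Cf` (format characters, which include zero-width joiners and bidirectional controls) as suspicious. The function name `find_invisible_characters` is hypothetical, and a real hook would also need an allowlist for files where such characters are legitimate.

```python
import unicodedata
from typing import List, Tuple


def find_invisible_characters(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, codepoint) for every Unicode format character.

    Category 'Cf' covers zero-width joiners/non-joiners, zero-width space,
    and bidirectional embedding/override/isolate controls.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


# A rules file that looks harmless on screen but carries hidden characters
rules_text = "use tabs\u200d\nok\u202e"
for line_no, col, codepoint in find_invisible_characters(rules_text):
    print(f"line {line_no}, col {col}: {codepoint}")
```

In a pre-commit hook, a non-empty result for any assistant configuration file would fail the commit, which is what treating that file as untrusted input amounts to in practice.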
When I decided to move into AI infrastructure, nobody warned me that I had to relearn how to think about compute. I proceeded with the usual steps, such as spinning up VMs, configuring networking, and managing costs. But then a moment came, and I watched, slightly horrified. I had misconfigured the inter-node networking. The result was that an eight-node GPU cluster ran a training job at just 11% GPU utilization. It was a wake-up call for me. AI workloads aren't just different in a marketing sense. They're different where it counts: in the architecture, in how you build and run things.

The ML engineers on that project immediately assumed the model was the problem. They decided to redesign the model and spent a couple of days tweaking the architecture, like chasing a ghost. The real issue surfaced only when someone checked the network telemetry: the cluster nodes were using standard Ethernet, not InfiniBand. The model had no issues. The infrastructure configuration was incorrect. After years of working with Azure and a period on AWS before that, I wish someone had given me a cheat sheet before starting that project.

Compute: Breaking Down the Model

Many cloud engineers assume that AI infrastructure just means larger VMs: add more cores and more memory, and the workload will run. This approach is insufficient. While right-sizing CPUs remains relevant, it now accounts for only about 20% of considerations. The remaining 80% is driven by GPUs, which operate fundamentally differently from CPUs and significantly impact the infrastructure. A GPU isn't just a faster CPU; it's a collection of thousands of smaller cores working together to handle large datasets. If any part of your system (storage speed, network bandwidth, or data preprocessing) can't keep up, the GPU sits idle while still running up the bill. On Azure, idle GPUs cost as much as active ones. Usually, the main limitation in AI infrastructure isn't the GPU itself, but the upstream systems that supply data to it.
When working with Azure, you'll mostly use two main GPU families. The NC-series gives you a single A100 per VM at about $3.60 per hour on demand, making it the go-to choice for fine-tuning and inference tasks. The ND-series has eight A100s per VM, connected through NVLink within the node and InfiniBand between nodes, which is perfect for distributed training. If your cluster uses regular Ethernet instead of InfiniBand between nodes, inter-GPU bandwidth can drop by 60 to 70 percent, and Azure may not warn you about this. It's smart to double-check that your cluster is set up with InfiniBand before starting a multi-node run and to make sure your GPU quota is ready ahead of time.

Storage: Where Training Jobs Are Exhausted

When you're training a language model, expect to chew through the dataset over and over. Think of it as laps around a track, not a sprint. If you try to pipe 500GB of text straight from regular Azure Blob Storage, you'll quickly find yourself staring at a progress bar that barely budges. Each blob tops out at about 60 megabytes per second, but an A100 GPU can eat data for breakfast at several gigabytes per second. There's a massive mismatch. If you want to keep your GPUs busy (and not just waiting around), you'll need something beefier. Azure Managed Lustre fits the bill, since it can dish out data to your training jobs at speeds regular storage can't dream of. I'll admit, the first time I ran into this, I wasted hours on model tweaks before realizing the bottleneck was staring me in the face the whole time.

Model checkpoints are a cost trap that is often overlooked. A single checkpoint for a 7B parameter model is around 28GB (7 billion parameters at 4 bytes each). Saving checkpoints every 30 minutes over 72 hours generates more than 4TB of data. Configure a Blob lifecycle policy before you start to avoid unexpected storage costs.

Networking: Two Problems, One Person Responsible

During training, each GPU shares gradient updates with the others in the cluster via AllReduce.
The efficiency of the cluster is directly determined by the bandwidth and latency of this communication. If this communication is disrupted, GPU utilization drops. Machine Learning teams often attribute this to model architecture issues, such as an excessive number of parameters or an incorrect batch size, but the network is usually the cause. First, assess network performance and address any issues before the job runs to avoid unnecessary model design, as ML engineers may not consider this when monitoring loss curves. The second networking problem is well known among cloud engineers. Many enterprise clients in financial services and healthcare require AI services that avoid the public internet. Azure AI services, such as Azure OpenAI, Azure ML, and Azure AI Search, all support Private Link, and the configuration process is identical to that of other PaaS services. The key consideration is to integrate private endpoint DNS zones with existing private DNS or manage them manually. ML engineers may interpret a generic “connection refused” error caused by an incorrect DNS configuration as an API issue. Both inter-GPU bandwidth and private network isolation — critical infrastructure concerns — typically fall under the same person’s responsibility. The Azure AI Services Stack: Known Infrastructure, Unknown Branding Recent Azure services such as OpenAI Service, Machine Learning, and AKS with GPU node pools might sound new, but for most infrastructure teams, the actual work remains familiar. The phrase “managed service” sometimes suggests that everything is taken care of, but in reality, only the AI model is managed. Everyday responsibilities like network security, permissions, cost tracking, and system monitoring still rest with your team, no matter how polished the portal looks. Azure OpenAI Service works much like other managed API endpoints, supporting private connections, role-based access, managed identities, and API Management for controlling usage rates. 
The main distinction is its use of Provisioned Throughput Units (PTUs) — these reserve GPU resources to guarantee performance. If you see HTTP 429 errors, it’s almost always a sign of resource bottlenecks rather than issues in your code, although the latter is a common assumption. Azure Machine Learning sits on top of other infrastructure stacks, such as Blob Storage, ACR, Key Vault, and compute, which you already manage. The failure mode is unique to Azure ML: the compute cluster lifecycle. Ensure clusters auto-scale to zero when idle. Unfortunately, this is not the default setting. When a bill arrives with huge costs due to a cluster running overnight because of an unset idle timeout, everyone looks to the cloud engineer first. While it’s tempting to go with Azure Container Apps for their apparent simplicity, most real-world inference workloads ultimately end up on AKS with GPU node pools. The reason? Container Apps are easy—that is, until you’re hit with cold start lag during actual user traffic and realize spinning up a GPU container on the fly just isn’t fast enough to meet your SLA. With AKS, you get far more say over things like keeping node pools warm, tuning autoscaling, and controlling scheduling—options that simply aren’t available with Container Apps. Costs: Higher Stakes, Faster Exposure Eight GPUs on an ND-series cluster aren’t cheap — about $27 an hour adds up quickly. A few long training runs and you’re already close to $2,000, and if you’re running a batch of experiments, $20,000 can disappear before anything launches. The price tag often slips by until accounting points it out. When models underperform, it’s easy to blame the architecture, but I’ve learned to glance at GPU usage first. If you’re seeing less than 60% during distributed runs, chances are the bottleneck is in the infrastructure, not the model itself. If you want to slash costs, spot VMs can drop your bill by as much as 90%. The catch? 
Your training jobs must be able to handle abrupt interruptions—so regular checkpointing and clean restarts are a must. If that’s not in place, spot isn’t the way to go—sort it out with your ML team before finance starts asking questions. Reserving GPU resources is a whole different equation than CPUs: GPU supply changes from region to region, and with how quickly AI hardware evolves, locking in a three-year reservation on today’s gear is a real gamble. Security: Same Toolkit, New Attack Surface For AI projects, you still need the basics like private networks, Managed Identity, strong RBAC, and encryption. But now there’s a twist: prompt injection. It’s like the old trick with SQL injection, but for language models. Someone might simply ask a chatbot to show its system prompt. If you haven’t set up protections, it could actually answer. Firewalls won’t help here. Azure Content Safety can block some of these risky requests, but most teams don’t use it until after trouble starts. If you’re in a regulated industry, logging every inference is a must. In finance or healthcare, you need to record inputs, outputs, who did what, and when, so auditors have all the details they need. Decide on your schema and retention policy before going live, because adding it later, after compliance comes calling, is always a headache. The ML engineers on these teams know the models well. But when infrastructure acts up, causing higher costs, slowdowns, or new risks, they're often the last to spot the cause. Closing that gap is the real challenge. For cloud engineers, "architecturally different" isn’t a red flag; it’s a chance to improve.
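The back-of-the-envelope numbers quoted in this article (checkpoint storage and ND-series run cost) are easy to reproduce before a job starts. The sketch below only restates the article's own figures as arithmetic; the function names are hypothetical, and real prices and checkpoint sizes will differ by region, precision, and optimizer state.

```python
def checkpoint_storage_gb(checkpoint_gb: float, interval_minutes: int, run_hours: int) -> float:
    """Total storage written if every checkpoint is kept for the whole run."""
    checkpoints = (run_hours * 60) // interval_minutes
    return checkpoints * checkpoint_gb


def run_cost_usd(hourly_rate_usd: float, hours: float) -> float:
    """On-demand cost of keeping a cluster up for the given number of hours."""
    return hourly_rate_usd * hours


# 28 GB checkpoints every 30 minutes over a 72-hour run: 144 checkpoints
print(checkpoint_storage_gb(28, 30, 72))  # 4032 GB, i.e. just over 4 TB

# About $27/hour for an eight-GPU ND-series node: 74 hours is close to $2,000
print(run_cost_usd(27, 74))  # 1998
```

Running the same two functions with your own interval, checkpoint size, and hourly rate gives a quick estimate to share with finance before the first run rather than after it.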
Current enterprise AI governance relies on "software hope," or the belief that probabilistic models can accurately police their own authority through mutable instructions. We've spent years treating system prompts and configuration files as if they're physical vaults. They aren't. They're suggestions that can be bypassed by a single misconfigured line of code. The most dangerous failure modes in modern systems aren't human errors; they're structural. In February 2026, the MITRE ATLAS OpenClaw investigation (CVE-2026-25253) provided a definitive autopsy of our current security models. A controlled red-team exercise demonstrated how a malicious prompt could trigger an unrestricted execution tool, allowing an agent to escape its sandbox and gain broad system access in fewer than two hours. This wasn't a perimeter breach — it was a failure of the architecture's self-concept. When we treat a non-deterministic model as a trusted operator, we're building the future of the autonomous enterprise on a substrate of suggestions. If your agentic governance exists only in the software layer, you're just hosting a crash. Figure 1: Moving the decision boundary from the policy manual (intent) to the hardware substrate (iron) via the Sovereign Spine architecture. The Structural Deficit in Agentic Security The cycle where we audit the vibes of a model and hope the alignment holds has reached its technical limit. True resilience requires a transition to Hardware Truth. Software-defined governance is insufficient for autonomous agency because it can't prevent the "God-mode" vulnerability — where a perfectly valid OAuth token is used to execute an illegitimate intent. To secure an agent, we must externalize and fix its logic path. This requires a technological stack that physically governs AI, termed the sovereign spine. 
By anchoring intent in silicon, we're eliminating the translation drift common in human bureaucratic governance and moving the decision boundary from the policy manual directly into the hardware substrate. The Sovereign Spine — A Dual-Stack Substrate The sovereign spine establishes a deterministic floor where an instruction physically cannot cycle unless its legitimacy is cryptographically witnessed and hardware-verified. This framework is built on two non-substitutable layers. 1. Reasoning Truth — The Ledger Substrate An agent's intent must be treated as an untrusted execution path until it's validated. We require a substrate capable of capturing an immutable, third-party record of the reasoning that led to an agentic proposal. The industry standard is shifting toward Proof of Reasoning (PoR), where the agent's internal weights and decision logic are hashed and anchored to a distributed ledger. This ensures the reasoning path can't be retroactively altered during a forensic audit. Implementations like the Ontologic framework generate a cryptographic identity for a decision committed to the ledger at the moment of intent. This prevents data tourism by ensuring decision logic is anchored to consensus before reaching the execution layer. 2. Execution Truth — The Citadel Protocol If the reasoning substrate provides the "why," the Citadel protocol provides the "how." Execution Truth requires a physical choke-point that operates independently of the model layer. The foundation of the Citadel protocol is the use of Trusted Execution Environments (TEEs). These hardware-isolated enclaves ensure that governance logic is protected from the host operating system and the agent itself. The protocol defines an intent airlock — a pre-execution stage where an agent's payload is held in a suspended state. The airlock is a non-bypassable gate that evaluates the semantic intent of a request against a sovereign mandate. 
| Feature | Software Hope (Current) | Hardware Truth (Sovereign Spine) |
| --- | --- | --- |
| Primary Mechanism | System Prompts / Guardrails | Cryptographic Enforcement / TEEs |
| Trust Model | Trust, then Audit | Verify, then Execute |
| Failure Mode | Fail Open (Bypassable) | Fail Closed (Deterministic) |
| Forensic Audit | Log-Based (Mutable) | Ledger-Based (Immutable) |
| Authority Root | OAuth Token / Policy PDF | Hardware Root of Trust / TEE |

The Sovereign Handshake

The sovereign handshake is the protocol-level weld between the reasoning hash and the hardware gate. It enforces a suspended handoff where the execution path is physically blocked until two conditions are met.

- Reasoning truth: The reasoning path is ledger-verified.
- Execution truth: The execution intent is mandate-aligned.

Functional Logic of the Sovereign Handshake

The following sequence details the transition from probabilistic intent to deterministic execution within the Sovereign Spine.

Figure 2: The Sovereign Handshake: The protocol-level verification of ledger-based reasoning hashes within a Trusted Execution Environment (TEE).

1 - 3: Intent and Witnessing
The Autonomous Agent submits a reasoning intent to Hologlass. This intent is structured as rules, inputs, outputs, and meaning (RIOM) morphemes of the request. A human witness verifies the attestation within the Hologlass loop, ensuring accountability before the intent is committed to the hashgraph ledger, which returns the unique Auth_Hash.

4 - 5: Suspended Handoff
The Agent submits the payload and the RIOM-based Auth_Hash to the Citadel hardware witness. The intent airlock immediately suspends execution, holding the instruction in a non-executable state.

6 - 8: Cryptographic and Semantic Audit
The Hardware Witness performs a remote attestation check against the ledger to verify the hash's validity. Once the cryptographic proof is received, the witness performs a semantic audit.
This is a sovereign mandate check where the hardware witness compares the ruleHash within the RIOM morpheme against the authorized sovereign mandate hosted in the ontologic rule registry. This ensures the agent is not only following a rule, but specifically the current, immutable version of the mandate. 9 - 10: Admissibility (Success Path) If both the human-witnessed hash and semantic audit succeed, the hardware witness opens the gate to the target iron, allowing the instruction to cycle. 11: Terminal Refusal (Failure Path) If the cryptographic witness fails or the intent violates the sovereign mandate, the hardware witness issues a terminal refusal, physically locking the hardware gate and preventing execution. Practical Implementation — The Intent Airlock The intent airlock is more than a simple filter; it's a semantic validator. In a practical enterprise setting, this involves pre-defined business constraints — the sovereign mandate — that are loaded into the TEE at boot time. For instance, in a high-latency financial environment, the Mandate might state: "No single agent may authorize a transfer exceeding $10,000 without a human signature." When the agent attempts a $15,000 transfer, the airlock identifies the violation at the silicon level. Because the airlock resides in the TEE, even a compromised root user on the host system cannot modify the mandate or bypass the check. Governance as Physics The industry has reached its "TCP/IP moment" for AI trust. We must stop building bespoke Python wrappers and start building a unified substrate. You cannot audit a vibe, and you can't protect the enterprise with a PDF policy. The era of software hope is over. By anchoring agentic reasoning on the ledger and enforcing execution in the silicon, we're establishing a substrate of certainty. The future of the autonomous enterprise doesn't rely on better prompts — it's forged in the sovereign spine.
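The decision logic of the handshake (both conditions must hold, and anything else is refused) can be written down as a small fail-closed function. This is only a software model of the logic, under loud assumptions: it does not run in a TEE and so provides none of the hardware guarantees the article argues for, the ledger is modeled as an in-memory set of hashes, and `TransferIntent` and `airlock_admits` are hypothetical names. The $10,000 limit is the example mandate from the text.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class TransferIntent:
    auth_hash: str      # hash returned by the ledger when the intent was witnessed
    amount_usd: float   # requested transfer amount
    human_signed: bool  # whether a human signature accompanies the request


def airlock_admits(intent: TransferIntent, ledger_hashes: Set[str],
                   limit_usd: float = 10_000) -> bool:
    """Admit the instruction only if both handshake conditions hold."""
    try:
        # Reasoning truth: the hash must already exist on the ledger
        reasoning_ok = intent.auth_hash in ledger_hashes
        # Execution truth: transfers above the limit need a human signature
        mandate_ok = intent.amount_usd <= limit_usd or intent.human_signed
        return reasoning_ok and mandate_ok
    except Exception:
        # Fail closed: any unexpected error is a refusal, never an admission
        return False


ledger = {"abc123"}
# The article's example: a $15,000 transfer without a human signature
print(airlock_admits(TransferIntent("abc123", 15_000, False), ledger))
```

The point of the sketch is the shape of the check, not its placement: the article's claim is that the same two-condition gate has to live below the software layer to be non-bypassable.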
I often find myself in conversations where the same words keep popping up again and again: Agents, MCP, and A2A. Everyone seems excited about them. But the funny part is that when the topic shifts to MCP (Model Context Protocol), the explanations start to vary. One day, someone confidently said, “An MCP server is basically a tool.” Another person immediately disagreed and replied, “No, no — MCP is more like a client.” Before that debate could settle, someone else joined the conversation and said, “Actually, MCP is just a protocol.” And then another perspective appeared: “Think of it as middleware that sits between an agent and APIs.” At that moment, I realized something interesting: we were all talking about the same concept, yet each of us understood it a little differently. These conversations made me curious. If experienced developers and architects describe MCP in different ways, how confusing must it be for someone who is just starting to explore this space? The more I listened, the more I noticed a pattern — people weren’t wrong, but they were often describing only one piece of the puzzle. That realization is what inspired this blog. In this article, I want to step back from the buzzwords and walk through the concepts in a simple way. What exactly is MCP? Is it a server? A tool? A client? Or something else entirely? And how does it relate to the agents that everyone keeps talking about? Is it applicable only to agents, or is it applicable to assistants also? We will also explore MuleSoft's capability in this space. By the end of this post, my goal is to bring clarity to these terms and show how they connect. Instead of hearing multiple interpretations in different conversations, you’ll be able to see the complete picture of how MCP fits into modern AI and integration architectures. Let's Understand What Anthropic Says About MCP MCP (Model Context Protocol) is an open-source standard for connecting AI applications to external systems. 
Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect electronic devices, MCP provides a standardized way to connect AI applications to external systems.

MCP at high level

Now let's break down each component and understand it in the simplest way possible.

AI Application

An AI application can be any application that consists of an LLM, orchestration, and tools (you can think of it as an assistant), or it may consist of more complex components such as agent orchestration, specialized agents, and tools (you can think of it as an agentic application). Tools can be a payment gateway, a data retrieval API, a weather API, a file system, web search, etc.

MCP

Model Context Protocol is an open protocol that enables seamless integration between AI applications (LLM applications) and external data sources and tools. MCP provides a standardized way to connect LLMs with the context they need. MCP follows a client-server architecture. The key components of this architecture are the MCP Host, the MCP Client, and the MCP Server. Let's extend our previous architecture.

MCP architecture

MCP Host: The AI application itself, which coordinates and manages one or more MCP clients.
MCP Client: A component that establishes a connection with an MCP Server and gets the context for the MCP Host to use.
MCP Server: A program that exposes external services and provides context to MCP clients.

Model Context Protocol consists of two layers:

Data layer: The data layer implements a JSON-RPC 2.0 based exchange protocol that defines the message structure and semantics for client-server communication.
Transport layer: The transport layer manages communication channels and authentication between clients and servers. It handles connection establishment, message framing, and secure communication between MCP participants.

MCP supports two transport mechanisms:

Stdio transport: Uses standard input/output streams for direct process communication between local processes on the same machine, providing optimal performance with no network overhead.
Streamable HTTP transport: Uses HTTP POST for client-to-server messages with optional Server-Sent Events for streaming capabilities. This transport enables remote server communication and supports standard HTTP authentication methods, including bearer tokens, API keys, and custom headers. MCP recommends using OAuth to obtain authentication tokens.

Use Case

Consider a "Weather Intelligence Agent" that uses the MCP server to call a tool that provides weather information based on a city name. This is a simple use case just to demonstrate how an API is called as a tool using MCP. We will use Postman and Cursor to play the role of the agent/assistant, which will call the Weather API. Let's see how we can implement this use case using MuleSoft:

Step 1: MuleSoft provides the MCP Server - Tool Listener connector. We will configure the MCP Server.
MuleSoft code

Refer to the code:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:ee="http://www.mulesoft.org/schema/mule/ee/core"
      xmlns:http="http://www.mulesoft.org/schema/mule/http"
      xmlns:mcp="http://www.mulesoft.org/schema/mule/mcp"
      xmlns="http://www.mulesoft.org/schema/mule/core"
      xmlns:doc="http://www.mulesoft.org/schema/mule/documentation"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
        http://www.mulesoft.org/schema/mule/mcp http://www.mulesoft.org/schema/mule/mcp/current/mule-mcp.xsd
        http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd
        http://www.mulesoft.org/schema/mule/ee/core http://www.mulesoft.org/schema/mule/ee/core/current/mule-ee.xsd">

  <!-- HTTP listener that the MCP server is exposed on -->
  <http:listener-config name="HTTP_Listener_config" doc:name="HTTP Listener config" doc:id="251f2d7c-e84b-4974-a1e8-96d9779bc9e9">
    <http:listener-connection host="0.0.0.0" port="8081" />
  </http:listener-config>

  <!-- MCP server using the Streamable HTTP transport -->
  <mcp:server-config name="MCP_Server" doc:name="MCP Server" doc:id="289fb886-e732-4274-990e-9876aca405a6" serverName="mule-mcp-server" serverVersion="1.0.0">
    <mcp:streamable-http-server-connection listenerConfig="HTTP_Listener_config" />
  </mcp:server-config>

  <!-- Outbound connection to the weather API -->
  <http:request-config name="HTTP_Request_config" doc:name="HTTP Request config" doc:id="b31d7d79-b45b-42ec-a970-50eb19a0a702">
    <http:request-connection protocol="HTTPS" host="api.weatherstack.com" />
  </http:request-config>

  <flow name="mcp-weahter-intelligence-apiFlow" doc:id="b1c21d3c-18f0-4eac-bb4e-3cf789608580">
    <!-- Tool definition: name, description, and input schema advertised to MCP clients -->
    <mcp:tool-listener doc:name="MCP Server - Tool Listener" doc:id="4c42c1cb-898d-4fb9-8d0e-edc541fffb75" config-ref="MCP_Server" name="get_weather_information">
      <mcp:description><![CDATA[This tool gets weather information. Check weather details for device by providing the city name as input or paramValue. Please use the query.]]></mcp:description>
      <mcp:parameters-schema><![CDATA[{
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "city for querying weather data"
          }
        },
        "required": ["query"],
        "additionalProperties": false
      }]]></mcp:parameters-schema>
      <mcp:responses>
        <mcp:text-tool-response-content text="#[payload.^raw]" priority="1">
          <mcp:audience>
            <mcp:audience-item value="ASSISTANT" />
          </mcp:audience>
        </mcp:text-tool-response-content>
      </mcp:responses>
    </mcp:tool-listener>

    <!-- The access key is read from a configuration property instead of being hard-coded.
         The property name weatherstack.accessKey is a placeholder; define it in your own secure properties. -->
    <http:request doc:name="Request" doc:id="d10760de-5f93-4f63-aadc-9bfc491f94e0" config-ref="HTTP_Request_config" path="/current">
      <http:query-params><![CDATA[#[output application/java
---
{
  "access_key" : p('weatherstack.accessKey'),
  "query" : payload.query,
  "units" : "m"
}]]]></http:query-params>
    </http:request>
  </flow>
</mule>
```

Let's run this code and test it. The MCP server started successfully:

Deployment log

Step 2: Let's use Postman as the MCP client to test it and see if it is working as expected:

MCP server and available tools

Step 3: Click on Connect:

Connected to MCP Server

Step 4: Now the MCP client is connected to the MCP server. You need to pass the city name as the query parameter, and you will get the weather details. I am writing this blog from Goa (the beach capital of India), so I will use Goa as the city name to retrieve its weather information.

Use the tool

Step 5: Click on Run, and you will get the response as shown below:

Response

I have demonstrated this with my local version of the code, deployed in Anypoint Studio. Let's test the same after deploying it to Runtime Manager. I have deployed the code to Runtime Manager.

Deployed in the Anypoint platform

Test result

I have demonstrated this using Postman, where Postman worked as an MCP client to connect to the MCP server.
We can extend it further and use Cursor to mimic agentic behavior, where the agent uses the MCP tool to get the answer.

Cursor to use MCP

I have used a no-code/low-code tool, MuleSoft. In the next blog, I will use Python code to demonstrate the same. Watch the video for more details. Let me know if you liked it!
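To make the data layer concrete, here is a minimal sketch of the JSON-RPC 2.0 message an MCP client such as Postman or Cursor sends when it invokes the weather tool. The `tools/call` method and the `name` / `arguments` parameters come from the MCP specification; the `McpToolCall` class and its helpers are hypothetical and written in plain Java purely for illustration. A real client also performs the `initialize` handshake and usually a `tools/list` call first, and the endpoint path and headers depend on how the server is configured, which this sketch does not cover.

```java
/** Hypothetical helper: builds the JSON-RPC 2.0 body for an MCP tools/call request. */
final class McpToolCall {

    /** Escapes characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds a tools/call request with a single string argument. */
    static String toolsCall(int id, String tool, String argName, String argValue) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
             + ",\"method\":\"tools/call\""
             + ",\"params\":{\"name\":\"" + jsonEscape(tool) + "\""
             + ",\"arguments\":{\"" + jsonEscape(argName) + "\":\"" + jsonEscape(argValue) + "\"}}}";
    }

    public static void main(String[] args) {
        // Tool name and argument match the flow above: get_weather_information(query)
        System.out.println(toolsCall(1, "get_weather_information", "query", "Goa"));
    }
}
```

The `arguments` object must satisfy the tool's parameters schema, which is why the only key here is `query`.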
TL;DR: The AI Definition of Done Your team has a Definition of Done for a product increment. It has none for the 20-plus AI-supported outputs that leave the team each week: status reports, stakeholder emails, release notes, and updates for the C-level. Each one carries your team’s name. “I know quality when I see it” is the standard most teams actually run by, and you cannot audit it, teach it to a new colleague, or defend it when a claim turns out to be wrong. The AI Definition of Done fixes that with one page per task class, agreed by the team, before the output ships. Your Increment Has a Standard; Does Your AI Output? A model turns the Jira board into a Friday status update, and the update tells an enterprise prospect that the security feature is in production. Unfortunately, it is not. The feature was descoped three months ago, but the old ticket title persisted because no one felt responsible. So the model reported the title instead of the reality. Nobody checked the claim against the release notes because nobody had agreed that someone should. The email was sent with the team’s name on the cover. A functioning agile team should be able to tell you what “done” means for a product increment. Few can tell you what “done” means for that status update. No agreed standard governs it, and it ships every week. The product increment passes through a standard that the team argued over and agreed on. The AI-assisted output passes through one person’s gut feeling at the moment they clicked send. One of those you can defend to a stakeholder, an auditor, or a new hire. The other you cannot. The AI Definition of Done closes that gap without adding a governance department, which is exactly why it survives in organizations where “AI governance” earns eye rolls. It takes a practice every agile practitioner already owns and points it at the work you have started handing to a model. 
It is not for everything: skip it for private brainstorming, throwaway prompts, or personal sensemaking, unless the output later informs a decision or leaves the team. The Four Questions Every AI Definition of Done Answers The Concept Verification Level Which claims get checked, by whom, against what source, and how? “Looks good” is not a method. A method names the claim, the checker, the source, and the test: every factual claim about product status gets checked against the release notes by the sender before sending, every time. Where teams get stuck: approval gets mistaken for review. Someone skims a draft, clicks send, and the team’s name now sits on a claim nobody verified. Provenance Disclosure What does the team declare about how the output was produced? Three labels cover practice: a) Human means no material AI contribution to the content, claims, or structure (a spellchecker does not count), b) AI-assisted means AI contributed to drafting, summarizing, or analysis, and a named human reviewed the output and decided, and c) AI-automated means AI produced and sent the output under predefined rules, without human review before release, audited at a set cadence. The line that matters runs through “reviewed”: clicking send on an unread draft is approval, never review. An output approved without reading is AI-automated, whatever the team tells itself. Data Hygiene What never enters a model on the way to this output? Name the exclusions concretely: personal data from team surveys, customer-identifiable information, anything your organization’s AI policy restricts. If the input rules in your A3 Handoff Canvas already cover this, point to them. Do not keep two versions of the same rule. Where teams get stuck: nobody wrote the exclusions down, so each person guesses, and the guesses differ. Sufficiency Tier and Environment Which model, plan, and data boundary are good enough for this task class, and why? 
A top-notch frontier model drafting a calendar invitation fails this test in one direction: it is far more than the task needs. The cheapest model, run locally on an old Mac mini, likely fails in the other direction when asked to write a board update. Capability is only half of it: a board update may need an enterprise plan with a no-training guarantee or an approved connector, even when a mid-tier model is plenty. If your team has a routing policy, point to the tier and the environment it mandates. If it does not yet, name the model and the plan, and explain in one sentence why both are enough.

The AI Definition of Done Template

Four questions, plus two operating controls, one page. Here is the template a team fills in per task class:

| Dimension | Your Standard for This Task Class |
|---|---|
| Task class | |
| Verification level: What is checked, by whom, against what, how | |
| Provenance label: Human (Avoid) / Assist / Automate from the A3 Delegation Framework, and where the label appears | |
| Data hygiene: What never enters the model | |
| Sufficiency tier and environment: Which model, plan, and data boundary, and why they are enough | |
| Sign-off: Who agreed, on what date, and the review date | |
| Stop rule: When the delegation is paused, downgraded, or returned to manual work | |

The last two rows are operational, not definitional: Sign-off records who agreed and when, and the stop rule names the condition that pauses the delegation, because this standard should say not only when an output may ship but when the task class stops being eligible for AI at all. Without it, teams keep tuning the prompt or skill long after the delegation has proven unfit.

A Worked Example: External Status Communication

The status update failure that opened this article maps to one task class: status communication leaving the company.
Here is the team’s first AI Definition of Done for it:

| Dimension | Standard |
|---|---|
| Task class | Status communication leaving the company |
| Verification level | Every claim about feature status is checked against the release notes by the sending manager, before sending, every time |
| Provenance label | AI-assisted; footer states “Drafted with AI, reviewed by [name]”; Automate is not permitted for this task class |
| Data hygiene | No customer names, no security-finding details, no internal financials enter the model |
| Sufficiency tier and environment | Mid-tier model on an enterprise plan with no model training; drafting from structured release data needs no frontier model |
| Sign-off | Team agreed, dated; review after the next four status updates |
| Stop rule | If two updates in a review cycle need a factual correction after sending, the task class returns to manual drafting until the standard is revised |

The standard costs the sending manager about four minutes a week, set against an error that can put a flagship deal at risk.

Write Your AI Definition of Done in 75 Minutes

An AI Definition of Done that one person downloads and pastes into the wiki doesn’t change anything. The argument over the standard is where the standard takes hold. Run it as a workshop:

Pick three task classes (10 minutes): Choose from work the team actually shipped in the last two weeks, never hypotheticals. The best candidates are outputs that leave the team.
Draft in pairs (20 minutes): Each pair fills the template for one task class. Pairs work without comparing notes; divergence is the point.
Argue the differences (25 minutes): Compare drafts. Where pairs disagree on verification level or provenance, the team has found an unspoken assumption. Resolve each disagreement with a decision, never with “both are fine.”
Set the labels (10 minutes): Agree where provenance labels appear: email footers, document headers, report covers. Visible beats buried.
Adopt and date (10 minutes): Sign off each AI Definition of Done with a review date, and add the adoption to your AI working agreement.

Ownership stays with the team running the delegation. Compliance, security, or legal may constrain the standard, but they do not write it for the team. When someone says, “We do not need this for internal outputs,” ask what happened the last time an internal draft got forwarded outside the team. Every team has that story.

The Record You Get for Free

Each signed-off AI Definition of Done is a dated, versioned, one-page record. Stack them, and they answer the due diligence question enterprise buyers increasingly ask, “How do you control AI-generated output?” with documents instead of assurances. Nobody wrote a governance report. The records came out of normal work.

That answer is already part of procurement and due diligence conversations. Article 4 of the EU AI Act has applied since February 2, 2025, and requires providers and deployers to ensure a sufficient level of AI literacy among staff and others operating AI systems on their behalf. The EU Commission’s Q&A places supervision and enforcement under national market surveillance authorities, with the enforcement rules applying from early August 2026. The practical question underlying the regulation is simpler, and a prospect’s procurement team will ask it before any regulator does: can you show the standard that underlies the output you sent us?

Three Ways It Fails

The downloaded standard: A template adopted without the workshop. Nobody argued, so nobody owns it. An AI Definition of Done that nobody argued about is one nobody will follow.
The universal standard: One AI Definition of Done for all work. Verification sized for external communication suffocates internal brainstorming, and the team abandons the practice within a month. One page per task class. Contrary to the classic Definition of Done, there is no one-size-fits-all in our use case.
The static standard: Written once, reviewed never. Models change, people change, task classes change. The review date is part of the artifact, and your next delegation inspection enforces it.

Conclusion: Pick One Output This Week

Pick one AI-assisted output your team ships regularly: the Friday status update, the Sprint summary, or the stakeholder email. Walk it through the four questions out loud in your next Retrospective: what gets checked and by whom, how we label it, what never enters the model, and which tier is enough. You will likely find at least one question where the honest answer is “nobody decided that.” Write the one-page response for that task class, argue it, sign it, and date it. One standard, agreed by the team, is the difference between a team that uses AI and a team that a customer can trust with it. Which of your AI-assisted outputs has a standard behind it right now, and which one is merely a habit?

Key Questions This Article Answers

What Is an AI Definition of Done?

An AI Definition of Done is a one-page, team-agreed standard that an AI-assisted output must meet before it leaves the team. Teams write one per task class, such as external status communication or data analysis summaries, never one per task. It answers four questions: what gets verified, how the output is labeled, what data never enters the model, and which model and environment are sufficient. It borrows the discipline of the Scrum Definition of Done and applies it to work a model has touched.

What Is the Difference Between Approval and Review for AI Output?

Review means a named human reads the AI-generated output and checks its claims against a source before it ships. Approval means someone clicked send. Clicking send on an unread draft is approval, not review, whatever the team calls it. An output approved without reading is effectively AI-automated, and it should carry that provenance label rather than the AI-assisted label, which implies a human verified it.
How Do You Write an AI Definition of Done? Run a 75-minute team workshop, not a solo download. Pick three task classes from work shipped in the last two weeks, draft the standard in pairs, then compare and resolve every disagreement with a decision. Agree where provenance labels appear, set a stop rule that returns the task class to manual drafting when outputs repeatedly fail, sign off each standard with a review date, and add the adoption to your AI working agreement. The argument over the standard is what makes the team own it. How Do Agile Teams Prove They Govern AI Output? Each signed-off AI Definition of Done is a dated, one-page record. Together, a team’s standards answer the procurement and due diligence question “how do you control AI-generated output” with documents rather than assurances. The records are a byproduct of normal work, so no separate governance report is needed. This matters because buyers and regulators, including under the EU AI Act Article 4, increasingly require evidence of controlled AI adoption. What Are the Four Dimensions of an AI Definition of Done? Verification level (which claims get checked, by whom, against what source, and how), provenance disclosure (Human, AI-assisted, or AI-automated, and where the label appears), data hygiene (what never enters the model), and sufficiency tier and environment (which model, plan, and data boundary are good enough and why). Each dimension fits on one line of a one-page template, signed off with an adoption date and a stop rule that pauses the delegation when outputs repeatedly fail.
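The stop rule from the worked example ("if two updates in a review cycle need a factual correction after sending, the task class returns to manual drafting") is mechanical enough to write down. The sketch below is one possible encoding, assuming a review cycle of four status updates as in the worked example's sign-off row; the `StopRule` class, the `Mode` enum, and the input representation are hypothetical.

```java
import java.util.List;

/** Hypothetical encoding of the worked example's stop rule. */
final class StopRule {
    enum Mode { AI_ASSISTED, MANUAL }

    // Threshold taken from the worked example: two corrections per review cycle.
    static final int CORRECTION_LIMIT = 2;

    /**
     * Each list entry represents one update sent in the review cycle;
     * true means it needed a factual correction after sending.
     */
    static Mode evaluate(List<Boolean> correctedAfterSending) {
        long corrections = correctedAfterSending.stream().filter(Boolean::booleanValue).count();
        return corrections >= CORRECTION_LIMIT ? Mode.MANUAL : Mode.AI_ASSISTED;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(List.of(false, true, false, false))); // AI_ASSISTED
        System.out.println(evaluate(List.of(true, false, true, false)));  // MANUAL
    }
}
```

Writing the rule this way forces the team to settle the details the prose leaves open, such as what counts as a correction and how long a review cycle is.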
This is the second follow-up to June 5's release post. It covers the platform APIs that moved into the framework core this release. There are two headline pieces (AI/LLM and the modern OAuth/OIDC stack) and two smaller pieces (WiFi/connectivity and share-sheet result callbacks).

This continues the direction the previous release set when we moved NFC, biometrics, and cryptography into the framework core. The full background on that earlier set is in NFC, Crypto, Biometrics, And A New Build Cloud.

AI: A First-Class LLM Client and a ChatView Component

PR #5035 lands the com.codename1.ai package, the ChatView UI component, the speech and TTS additions, and the build-time dependency injection that wires the native pieces in. PR #5057 lands the developer-guide chapter and the agent-skill addition, so any project generated from the Initializr inherits the new APIs through its bundled AGENTS.md.

LlmClient: The Basic Chat Request

com.codename1.ai.LlmClient is the entry point. The simplest possible use:

```java
LlmClient client = LlmClient.openai(apiKey);
ChatRequest req = new ChatRequest.Builder()
    .model("gpt-4o-mini")
    .system("You are a helpful assistant.")
    .user("What is the capital of France?")
    .temperature(0.7)
    .build();

client.chat(req).onResult((resp, err) -> {
    if (err != null) { Log.e(err); return; }
    Log.p(resp.firstChoice().content());
});
```

LlmClient.openai(...), LlmClient.anthropic(...), LlmClient.gemini(...), LlmClient.ollama(...), and LlmClient.openAiCompatible(baseUrl, apiKey) are the factories. All five are fully implemented native clients. The OpenAI client also drives Ollama, vLLM, llama.cpp, and any other endpoint that speaks the OpenAI wire format, so most local-model stacks plug in through LlmClient.openAiCompatible(...) without a separate driver.

Streaming Chat (What You Actually Want for Chat UIs)

For any UI that types responses out token-by-token, the streaming entry point is the one to reach for.
The callback fires on the EDT, so you can append directly to a text component:

```java
client.chatStream(req, new ChatStreamListener() {
    @Override public void onDelta(ChatDelta d) {
        responseLabel.setText(responseLabel.getText() + d.contentDelta());
        responseLabel.getParent().revalidateLater();
    }
    @Override public void onComplete(ChatResponse fin) {
        sendButton.setEnabled(true);
    }
    @Override public void onError(Throwable t) {
        Log.e(t);
        sendButton.setEnabled(true);
    }
});
```

Under the hood this is a custom ConnectionRequest subclass that parses SSE line-by-line and dispatches each delta through Display.callSerially. AsyncResource.cancel() kills the socket, so a cancel button in a chat UI is a one-line cancellation.

Tool Calls

If you want the model to call back into your app, Tool / ToolChoice give you OpenAI-style function calling. Define the tool, hand the model your request and the available tools, and the response surfaces structured ToolCall objects you dispatch:

```java
Tool getWeather = Tool.builder()
    .name("get_weather")
    .description("Look up the current weather for a city.")
    .parameter("city", "string", "The city name, e.g. \"Paris\".")
    .build();

ChatRequest req = new ChatRequest.Builder()
    .model("gpt-4o-mini")
    .user("Is it raining in Tel Aviv right now?")
    .tool(getWeather)
    .toolChoice(ToolChoice.AUTO)
    .build();

client.chat(req).onResult((resp, err) -> {
    if (err != null) return;
    for (ToolCall call : resp.firstChoice().toolCalls()) {
        if ("get_weather".equals(call.name())) {
            String city = call.argument("city").asString();
            String json = lookupWeather(city);
            // Loop the result back into the conversation
            client.chat(req.replyWithToolResult(call, json))
                .onResult((followUp, e) -> updateUi(followUp));
        }
    }
});
```

The shape mirrors the OpenAI function-calling contract one for one, so anything you have written against the OpenAI API directly maps across without rethinking.

Embeddings

LlmClient.embed(...) returns a vector for any input string.
Useful for similarity search against a local SQLite store (tomorrow's post will cover the new ORM that pairs with this):

```java
EmbeddingRequest er = new EmbeddingRequest.Builder()
    .model("text-embedding-3-small")
    .input("Codename One is a cross-platform mobile framework.")
    .build();

client.embed(er).onResult((emb, err) -> {
    if (err != null) { Log.e(err); return; }
    float[] vector = emb.firstVector();
    // store, search, compare
});
```

Image Generation

DALL-E and a Replicate scaffold are surfaced through ImageGenerator:

```java
ImageGenerator gen = ImageGenerator.openAiDallE(apiKey);
gen.generate("A red bicycle leaning against an olive tree", "1024x1024")
    .onResult((img, err) -> {
        if (err != null) return;
        myImageComponent.setIcon(img);
    });
```

Working Against Ollama in the Simulator (No API Charges)

JavaSEPort pings localhost:11434 at startup. If it finds Ollama, it sets the cn1.ai.ollamaDetected property. With cn1.ai.simulatorRedirect=auto (or =ollama) every LlmClient.openai(...) call routes through the local Ollama endpoint instead of OpenAI's. Production code does not change. The iteration loop, your tests, and your offline debugging stop costing money and stop needing an internet connection.

In common/codenameone_settings.properties:

```properties
simulator.cn1.ai.simulatorRedirect=auto
```

(The simulator. prefix scopes the property to the JavaSE simulator path.) Then run Ollama locally with whichever model your code expects (ollama run llama3.2 or similar) and your existing LlmClient.openai(...) calls go to localhost.

How to Handle API Keys

A direct word on credentials before any of the above sees production. LLM provider API keys (OpenAI, Anthropic, Gemini, your Auth0 / Firebase configs) are bearer tokens with a budget attached. They must never be checked into source control, embedded in your app binary, or hard-coded in code. A leaked key can be extracted from any APK or IPA in minutes and used to drain your account.
The correct shape is to fetch the key from your own backend over an authenticated request, then store it on the device using the platform's keychain / keystore. The framework provides both pieces:

com.codename1.crypto.SecureStorage (from the previous release) is the cross-platform wrapper over iOS Keychain Services and Android EncryptedSharedPreferences. Values are encrypted at rest using the platform's hardware-backed protection class where one is available.
This release adds get / set / remove(account, ...) overloads next to the existing biometric-gated methods. The new overloads store the value without a per-read Face ID / Touch ID prompt, which is what you want for an LLM API key (you read it on every network call; a biometric prompt every time is not workable). The biometric-gated methods are still there for credentials you do want to gate per use.

A reasonable shape:

```java
private static AsyncResource<String> getOpenAiKey() {
    // Serve the key from the keychain when it is already cached
    String cached = SecureStorage.get("openai_api_key");
    if (cached != null) {
        return AsyncResource.complete(cached);
    }
    // Otherwise fetch it from your own backend and cache it
    return Rest.get(myServer + "/v1/credentials/openai")
        .bearerToken(userSessionToken())
        .fetchAsString()
        .onResult((key, err) -> {
            if (err == null) {
                SecureStorage.set("openai_api_key", key);
            }
        });
}
```

Your server gates the credential request behind the user's session, your app caches the result on the keychain, and the key never sits anywhere a reverse-engineering pass could find it. If your server rotates the key, invalidate the cache and refetch.

Existing biometric-gated SecureStorage calls keep working unchanged. The new overloads are additive.

ChatView: A Ready-Made Streaming Chat UI

com.codename1.components.ChatView is the matching UI component. Scrollable message list, ChatBubble for the per-message bubble (theme-aware UIIDs so it picks up the iOS Modern / Material 3 native themes consistently), ChatInput for the bottom input bar, and a one-line bindToLlm(...)
that wires the input to a streaming chat request:

```java
ChatView view = new ChatView();
getOpenAiKey().onResult((key, err) -> {
    view.bindToLlm(LlmClient.openai(key),
        new ChatRequest.Builder()
            .model("gpt-4o-mini")
            .system("You are a friendly tutor for "
                + "Codename One developers.")
            .build());
});

Form f = new Form("Chat", new BorderLayout());
f.add(BorderLayout.CENTER, view);
```

The result is a standard mobile chat layout, picked up from whichever native theme the project uses.

If you want more control than bindToLlm(...) gives you (custom message styling, a "thinking" placeholder, hand-rolled retry, persistence to your own model class), drive the view by hand:

```java
ChatView view = new ChatView();
ConversationStore store = ConversationStore.open("tutor-thread");
view.setMessages(store.load());

LlmClient client = LlmClient.openai(apiKeyFromKeychain);
view.setInputListener(userText -> {
    ChatMessage userMsg = ChatMessage.user(userText);
    view.appendMessage(userMsg);
    store.append(userMsg);

    // Empty assistant bubble that the stream fills in
    ChatMessage assistant = ChatMessage.assistant("");
    view.appendMessage(assistant);

    ChatRequest req = new ChatRequest.Builder()
        .model("gpt-4o-mini")
        .messages(store.load())
        .build();

    client.chatStream(req, new ChatStreamListener() {
        @Override public void onDelta(ChatDelta d) {
            view.appendToLastMessage(d.contentDelta());
        }
        @Override public void onComplete(ChatResponse fin) {
            store.append(ChatMessage.assistant(view.lastMessage().content()));
            view.setInputEnabled(true);
        }
        @Override public void onError(Throwable t) {
            view.appendToLastMessage(" [error: " + t.getMessage() + "]");
            view.setInputEnabled(true);
        }
    });
});
```

appendToLastMessage(...) is the streaming entry point; it marshals through callSerially so deltas land on the EDT in order. ConversationStore persists the thread (the default backing is Storage; pluggable via a custom implementation if you would rather keep it in SQLite or push it to your server).
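For readers curious about what "parses SSE line-by-line" means on the streaming path, here is a minimal plain-Java sketch, assuming the OpenAI-style wire format in which each event arrives as a `data:` line and the stream ends with a `data: [DONE]` sentinel. It is not the framework's parser: the `SseDataLines` class is hypothetical, uses no Codename One APIs, and leaves the JSON decoding of each payload out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: extracts data payloads from an OpenAI-style SSE stream. */
final class SseDataLines {
    static final String DONE = "[DONE]";

    /** Returns the data payloads in arrival order, stopping at the [DONE] sentinel. */
    static List<String> payloads(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Skip blank event separators, ":" comment lines, and other SSE fields
            if (!line.startsWith("data:")) continue;
            String data = line.substring("data:".length()).trim();
            if (DONE.equals(data)) break; // end of stream
            out.add(data);                // one JSON chunk per delta
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stream = List.of(
            "data: {\"delta\":\"Hel\"}",
            "",
            ": keep-alive",
            "data: {\"delta\":\"lo\"}",
            "",
            "data: [DONE]");
        for (String p : payloads(stream)) System.out.println(p);
    }
}
```

Because payloads are emitted in arrival order, appending each decoded delta to the last message reproduces the text in the order the model produced it.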
The AI cn1libs

The core LLM stack is paired with a set of opt-in cn1libs that wrap specific on-device capabilities: Google ML Kit features, the TensorFlow Lite runtime, a local Whisper transcription engine, and an on-device Stable Diffusion model. Thirteen new cn1libs ship this release.

These cn1libs are not yet listed in the Codename One Preferences cn1lib picker, so for the moment they are added by hand. Drop the matching dependency block into your project's common/pom.xml and rebuild. The build-time scanner does the rest: the iOS pod or Swift Package, the Android Gradle dependency, the plist usage strings (NSCameraUsageDescription for the vision libraries, NSSpeechRecognitionUsageDescription for Whisper, etc.), and the Android permissions (android.permission.RECORD_AUDIO for audio capture) are all injected automatically the first time the scanner sees the matching class on the classpath.

For each cn1lib below, the dependency block is identical in shape; only the <artifactId> changes. The shared pattern is:

```xml
<dependency>
    <groupId>com.codenameone</groupId>
    <artifactId><!-- cn1lib artifact id from below --></artifactId>
    <version>${cn1.version}</version>
</dependency>
```

cn1-ai-mlkit-text: Text Recognition (OCR)

TL;DR. Pull printed or handwritten text out of an image (a photo of a page, a sign, a receipt) entirely on-device.
Platforms. iOS bridges to GoogleMLKit/TextRecognition. Android bridges to com.google.mlkit:text-recognition. The JavaSE simulator returns an unsupported error.
Use cases. Receipt scanning, sign translation pipelines (combine with cn1-ai-mlkit-translate), accessibility tools that read printed text aloud, automated form ingestion.

```java
byte[] jpeg = capturePhotoBytes();
TextRecognizer.recognize(jpeg).onResult((text, err) -> {
    if (err == null) Log.p("OCR: " + text);
});
```

cn1-ai-mlkit-barcode: Barcode and QR Scanning

TL;DR. Decodes QR, EAN, UPC, Data Matrix, PDF417, and the rest of the common 1D / 2D code families from a captured image.
Platforms.
iOS bridges to MLKitBarcodeScanning. Android bridges to com.google.mlkit:barcode-scanning. The JavaSE simulator returns an unsupported error.
Use cases. Inventory scanning, ticket / boarding-pass readers, QR-driven onboarding flows, retail loyalty cards.

```java
byte[] jpeg = capturePhotoBytes();
BarcodeScanner.scan(jpeg).onResult((codes, err) -> {
    if (err == null) {
        for (String code : codes) Log.p("Found: " + code);
    }
});
```

cn1-ai-mlkit-face: Face Detection

TL;DR. Returns bounding boxes for human faces detected in an image. Each face is reported as a packed int[4] (x, y, width, height).
Platforms. iOS bridges to MLKitFaceDetection. Android bridges to com.google.mlkit:face-detection.
Use cases. Auto-crop a contact photo, mosaic / blur bystanders in a group shot, drive a face-tracked overlay for AR-lite filters.

```java
FaceDetector.detect(jpeg).onResult((boxes, err) -> {
    if (err != null) return;
    // Four ints per face: x, y, width, height
    for (int i = 0; i < boxes.length; i += 4) {
        Log.p("face at " + boxes[i] + "," + boxes[i + 1]
            + " " + boxes[i + 2] + "x" + boxes[i + 3]);
    }
});
```

cn1-ai-mlkit-labeling: Image Labeling

TL;DR. "What is in this picture." Returns a list of descriptive labels for the image content.
Platforms. iOS bridges to MLKitImageLabeling. Android bridges to com.google.mlkit:image-labeling.
Use cases. Auto-tagging uploaded photos, content moderation pre-filters, content-based image search.

```java
ImageLabeler.label(jpeg).onResult((labels, err) -> {
    if (err == null) Log.p("labels: " + String.join(", ", labels));
});
```

cn1-ai-mlkit-translate: On-Device Translation

TL;DR. Translate short text between supported language pairs entirely on-device; no server round-trip, no API key, works offline.
Platforms. iOS bridges to MLKitTranslate. Android bridges to com.google.mlkit:translate. Languages are identified by their ISO 639-1 codes (en, fr, es, ...).
Use cases. Offline travel assistants, chat translation, accessibility readers for foreign signage (combine with cn1-ai-mlkit-text).
Java Translator.translate("Where is the train station?", "en", "fr") .onResult((fr, err) -> { if (err == null) Log.p(fr); // "Où est la gare ?" }); cn1-ai-mlkit-smartreply: Short Reply Suggestions TL;DR. Generates short suggested replies for chat conversations, similar to Gmail's Smart Reply chips. Platforms. iOS bridges to MLKitSmartReply. Android bridges to com.google.mlkit:smart-reply. The input is a JSON array of {role, message, timestamp, userId} objects. Use cases. A "quick reply" row above the keyboard in your in-app chat, response suggestions in a CRM inbox. Java String thread = "[{\"role\":\"remote\",\"message\":\"See you at 6?\"," + "\"timestamp\":" + System.currentTimeMillis() + "," + "\"userId\":\"u42\"}]"; SmartReply.suggest(thread).onResult((suggestions, err) -> { if (err == null) { for (String s : suggestions) Log.p("suggestion: " + s); } }); cn1-ai-mlkit-langid: Language Identification TL;DR. Returns the most likely ISO 639-1 code for a given text, or und (undetermined) when the input is too short or ambiguous. Platforms. iOS bridges to MLKitLanguageID. Android bridges to com.google.mlkit:language-id. Use cases. Auto-route a customer-support message to the right team, pick the correct TTS voice for an arbitrary string, pre-screen input before running an expensive translation. Java LanguageIdentifier.identify("Bonjour le monde").onResult((code, err) -> { if (err == null) Log.p(code); // "fr" }); cn1-ai-mlkit-pose: Pose Detection TL;DR. Returns 33 skeletal landmarks per detected pose as a packed float[3 * 33] (x, y, confidence triples). Platforms. iOS bridges to MLKitPoseDetection. Android bridges to com.google.mlkit:pose-detection. Use cases. Fitness apps with form correction, dance/yoga timing analysis, gesture-driven controls. 
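The packed layout described above (33 landmarks, each an x, y, confidence triple) can be read through a small accessor instead of raw index arithmetic. The following is a minimal sketch in plain Java; PoseLandmarks and its methods are hypothetical helpers, not part of the cn1lib, and it assumes a single pose of exactly 33 landmarks in the order x, y, confidence.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of cn1-ai-mlkit-pose) for the packed float[3 * 33] layout. */
final class PoseLandmarks {
    static final int COUNT = 33;  // landmarks per pose, as stated in the text
    static final int STRIDE = 3;  // x, y, confidence

    static float x(float[] packed, int landmark) {
        return packed[base(packed, landmark)];
    }

    static float y(float[] packed, int landmark) {
        return packed[base(packed, landmark) + 1];
    }

    static float confidence(float[] packed, int landmark) {
        return packed[base(packed, landmark) + 2];
    }

    /** Indices of the landmarks whose confidence is at least minConfidence. */
    static List<Integer> confidentLandmarks(float[] packed, float minConfidence) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < COUNT; i++) {
            if (confidence(packed, i) >= minConfidence) {
                result.add(i);
            }
        }
        return result;
    }

    /** Validates the array and landmark index, then returns the offset of the triple. */
    private static int base(float[] packed, int landmark) {
        if (packed == null || packed.length < COUNT * STRIDE) {
            throw new IllegalArgumentException("expected at least " + (COUNT * STRIDE) + " floats");
        }
        if (landmark < 0 || landmark >= COUNT) {
            throw new IndexOutOfBoundsException("landmark " + landmark);
        }
        return landmark * STRIDE;
    }

    public static void main(String[] args) {
        float[] packed = new float[COUNT * STRIDE]; // stand-in for a real detection result
        packed[0] = 120f; packed[1] = 80f; packed[2] = 0.9f;
        System.out.println("landmark 0 at (" + x(packed, 0) + ", " + y(packed, 0)
                + ") conf=" + confidence(packed, 0));
    }
}
```

Filtering by confidence before drawing or measuring is a common way to avoid acting on landmarks the detector was unsure about; the threshold itself is an application choice.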
Java PoseDetector.detect(jpeg).onResult((landmarks, err) -> { if (err != null || landmarks.length < 99) return; float noseX = landmarks[0], noseY = landmarks[1], noseConf = landmarks[2]; Log.p("nose at (" + noseX + ", " + noseY + ") conf=" + noseConf); }); cn1-ai-mlkit-segmentation: Selfie Segmentation TL;DR. Returns a per-pixel mask separating the person in the foreground from the background as byte[width * height] (0 = background, 255 = foreground). Platforms. iOS bridges to MLKitSegmentationSelfie. Android bridges to com.google.mlkit:segmentation-selfie. Use cases. Background replacement for video calls, sticker / portrait-mode effects, blur-the-background privacy filters. Java SelfieSegmenter.segment(jpeg).onResult((mask, err) -> { if (err == null) applyBackgroundReplacement(mask); }); cn1-ai-mlkit-docscan: Document Scanner TL;DR. Detects a rectangular document in a photo, perspective-corrects it, and writes the cropped JPEG to a temporary file. Returns the file path. Platforms. iOS uses Apple's VisionKit + Core Image rectangle detection (no extra pod). Android uses com.google.android.gms:play-services-mlkit-document-scanner. Use cases. "Scan to PDF" flows, expense apps that capture receipts, contract signing flows, ID-document capture. Java DocumentScanner.scanToFile(jpeg).onResult((path, err) -> { if (err == null) uploadDocument(path); }); cn1-ai-tflite: TensorFlow Lite Interpreter TL;DR. A general-purpose on-device inference engine. Bring your own .tflite model and run it against a float32 input tensor. Platforms. iOS uses TensorFlowLiteSwift (Pods or Swift Package). Android uses org.tensorflow:tensorflow-lite + tensorflow-lite-support. Use cases. Any custom on-device ML model your team trains or pulls from TF Hub. Image classification, simple regression, recommendation pre-filters. 
Java byte[] modelBytes = Util.readFully(Display.getInstance().getResourceAsStream(null, "/model.tflite")); float[] input = featureVector(); Interpreter.run(modelBytes, input).onResult((output, err) -> { if (err == null) Log.p("model returned " + output.length + " values"); }); cn1-ai-whisper: Speech-to-Text via whisper.cpp TL;DR. On-device transcription of a 16 kHz mono WAV file using a ggml-format Whisper model. The cn1lib bundles libwhisper.a. Platforms. iOS uses the Accelerate framework; Android uses a JNI build of the same whisper.cpp core. Models (e.g. ggml-base.bin) are not bundled; ship the one your app expects under the app's resources or download on first launch. Use cases. Voice notes, accessibility transcription, offline dictation, podcast indexing. Java String modelPath = SecureStorage.getFilePath("ggml-base.bin"); String audioPath = recordWavToFile(); WhisperRecognizer.transcribe(modelPath, audioPath) .onResult((text, err) -> { if (err == null) Log.p("heard: " + text); }); cn1-ai-stablediffusion: On-Device Image Generation TL;DR. Generates a JPEG from a text prompt using a bundled Stable Diffusion model. Multi-gigabyte payload, local build only. Platforms. iOS uses Core ML pipelines compiled from the bundled model. Android uses ONNX Runtime. Both configurations exceed the cloud build server's 2 GB upload limit, so this cn1lib triggers the cn1.ai.requiresBigUpload guard and the cloud build aborts with a "build this one locally" message. Add it to a project you build via mvn cn1:buildAndroid / mvn cn1:buildIosXcodeProject on the developer machine. Use cases. Avatar generation in apps where shipping to a cloud API is undesirable (offline-first apps, regulated industries, privacy-sensitive products). 
Java StableDiffusion.generate("a teal hot-air balloon over Lisbon, watercolour", 512, 512, /* steps */ 25) .onResult((jpeg, err) -> { if (err == null) display(Image.createImage(jpeg, 0, jpeg.length)); }); Why These Are cn1libs and Not Part of the Core The core gets the AI plumbing every app that adopts AI at all wants: the LLM client, streaming, the chat UI, the secure storage primitive for credentials, the simulator Ollama redirect for offline iteration. The cn1libs above are specialized verticals. Barcode scanning, document scanning, face detection, smart reply, pose detection, on-device translation, transcription, and on-device image generation are genuinely useful, but only for some apps. They also each bring a non-trivial native dependency. The Google ML Kit Android frameworks are large; the iOS pods carry their own weight; the bundled libwhisper.a and the Stable Diffusion model are big. Pulling all of them into the core would tax every app, whether the feature is used or not. The Stable Diffusion cn1lib in particular is large enough that the cloud build server cannot accept the upload at all (it trips the 2 GB pre-upload guard). That kind of opt-in does not belong in a dependency every app inherits. The corresponding chapter, including the full LlmClient API table, the ChatView reference, the SecureStorage overloads, the simulator Ollama redirect, and the full cn1lib coverage, is at AI, Chat UI, and Speech in the developer guide. OAuth and OIDC: The Modern Identity Stack The in-app-WebView Oauth2 flow that Codename One has shipped since approximately forever was the way every cross-platform mobile framework solved "sign in with Google / Facebook / Microsoft" in the 2010s. It is also the way every one of those identity providers stopped wanting you to solve it. Google has been blocking embedded user agents for years. Apple does not want third-party apps wrapping the Apple ID flow in a WKWebView. Microsoft and Facebook joined the chorus. 
The right answer is the system browser: ASWebAuthenticationSession on iOS, Custom Tabs on Android, with PKCE on the wire. That is what PR #5018 lands. PR #5039 adds a portable WebAuthn / passkey client on top. Sign In With Google (or Any OIDC Provider) com.codename1.io.oidc.OidcClient is the entry point. Point it at the discovery URL of an OIDC provider, hand it the client id and the redirect URI you registered with the provider, ask for tokens: Java OidcConfiguration cfg = OidcConfiguration.discover("https://accounts.google.com"); OidcClient client = OidcClient.builder() .configuration(cfg) .clientId("123-abc.apps.googleusercontent.com") .redirectUri("com.example.myapp:/oauthredirect") .scopes("openid", "email", "profile") .build(); client.signIn().onResult((tokens, err) -> { if (err != null) { OidcException oe = (OidcException) err; if (oe.getCode() == OidcException.USER_CANCELLED) return; Log.e(oe); return; } String idToken = tokens.getIdToken().raw(); String email = tokens.getIdToken().getClaim("email").asString(); proceed(email, idToken); }); Discovery JSON parsed and cached. PKCE S256 challenge generated and verified. State and nonce checked on the callback. ID-token claims decoded for you (we deliberately do not verify the signature client-side; the dev guide is explicit about why and points at the "re-validate on your backend" remedy). Refresh and revoke are first-class. The token store is pluggable via TokenStore; the default is Storage-backed, but a Keychain-backed or in-memory variant is a small class. On iOS the system-browser piece routes through ASWebAuthenticationSession. On Android through androidx.browser.customtabs with a plain ACTION_VIEW fallback for the rare device with no Custom Tabs provider. AuthenticationServices.framework and androidx.browser:browser are auto-linked when the classpath scanner sees OidcClient in use.
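For readers curious what "PKCE S256 challenge generated" means on the wire, the following is a minimal sketch of the derivation defined in RFC 7636, written in plain Java. It is not Codename One's implementation (OidcClient does this for you); PkceSketch and its method names are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

/** Hypothetical illustration of RFC 7636 PKCE values; not part of Codename One. */
final class PkceSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Random code_verifier: 32 random bytes, base64url-encoded without padding (43 characters). */
    static String newVerifier() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** S256 code_challenge = BASE64URL(SHA-256(ASCII(code_verifier))), without padding. */
    static String challengeS256(String verifier) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(verifier.getBytes(StandardCharsets.US_ASCII));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on standard Java platforms.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String verifier = newVerifier();
        System.out.println("code_verifier  = " + verifier);
        System.out.println("code_challenge = " + challengeS256(verifier));
    }
}
```

The client sends the challenge with the authorization request and the verifier with the token request; the provider recomputes the hash and rejects the exchange if the two do not match, which is what stops an intercepted authorization code from being redeemed by another app.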
Provider Wrappers: Google, Apple, Microsoft, Facebook, Auth0, Firebase If you would rather not configure OIDC by hand, the existing social classes get a signIn(...) method that drives the same stack with the provider's issuer URL pre-wired: Java GoogleConnect.signIn(googleClientId, "com.example.myapp:/oauthredirect", "openid", "email", "profile") .onResult((tokens, err) -> { /* ... */ }); MicrosoftConnect.signIn(entraClientId, "msauth.com.example.myapp://auth", "User.Read") .onResult((tokens, err) -> { /* ... */ }); Auth0Connect.signIn("tenant.auth0.com", clientId, redirectUri, "openid profile email") .onResult((tokens, err) -> { /* ... */ }); FacebookConnect.signIn(...) follows the same shape against the Facebook OIDC endpoint. FirebaseAuth covers the REST-based Firebase auth surface (email/password, IdP token exchange, refresh) which sits underneath any provider hand-off you might want to drive from app code. Sign In With Apple Sign in with Apple is required on iOS for apps that offer any other social login, and on Android it must fall through to a web flow. com.codename1.social.AppleSignIn handles both transparently: Java AppleSignIn.signIn() .onResult((result, err) -> { if (err != null) return; String idToken = result.getIdToken(); String code = result.getAuthorizationCode(); proceedToBackend(idToken, code); }); On iOS 13 and later this drops directly into the native Apple sheet via ASAuthorizationAppleIDProvider. On non-iOS platforms it falls through to the same OIDC web flow as everything else, so a single line of app code does the right thing on every port. The Maven plugin injects the com.apple.developer.applesignin entitlement on iOS when it sees AppleSignIn in use; Android does not see it because it is not there. Migration From the Legacy Oauth2 com.codename1.io.Oauth2 is now deprecated. 
Existing code still compiles, but the migration is short and almost always shorter than what it replaces: Java // Before Oauth2 oauth = new Oauth2("https://accounts.google.com/o/oauth2/auth", clientId, redirectUri); oauth.setClientSecret(clientSecret); oauth.setScope("openid email profile"); oauth.setBrowserComponent(myBrowserComponent); // tied to a WKWebView String token = oauth.authenticate(); // blocks, opens the web view Java // After OidcClient.builder() .configuration(OidcConfiguration.discover("https://accounts.google.com")) .clientId(clientId) .redirectUri(redirectUri) .scopes("openid", "email", "profile") .build() .signIn() .onResult((tokens, err) -> proceed(tokens.getIdToken().raw())); You stop owning the browser. The OS owns it. The cookies live in the platform's authentication session. The user gets the same login experience they have everywhere else on their device. WebAuthn/Passkeys PR #5039 layers a portable WebAuthn client on top: Java WebAuthnClient client = WebAuthnClient.getInstance(); if (!client.isAvailable()) { fallbackToPassword(); return; } PublicKeyCredentialCreationOptions opts = PublicKeyCredentialCreationOptions.fromServerJson(serverJson); client.create(opts).onResult((cred, err) -> { if (err == null) postToRelyingParty(cred.toJson()); }); W3C JSON wire format in both directions, so the response can be POSTed verbatim to any standard server-side WebAuthn library. iOS 16+ routes through ASAuthorizationPlatformPublicKeyCredentialProvider; Android API 28+ through androidx.credentials.CredentialManager. Provider helpers: Auth0Connect.signInWithPasskey(...) / .registerPasskey(...) and FirebaseAuth.signInWithPasskey(...) / .registerPasskey(...). One thing worth pulling out before you reach for it: if you sign in via OIDC against Google, Apple, Microsoft, Auth0, or Firebase, you usually already get passkeys for free. The identity provider runs the WebAuthn ceremony inside the system browser; OIDC just hands you the resulting tokens. 
So you do not need WebAuthnClient for that case. You need it for apps that run their own relying-party backend, and for apps driving the Auth0 or Firebase passkey grants directly. Full chapter: Authentication and Identity. Connectivity: WiFi, Bonjour, USB, network-type listeners PR #5021 lands four packages for apps that need to do more with the network than open an HTTP socket. The shape: Java WiFi wifi = WiFi.getInstance(); String ssid = wifi.getCurrentSSID(); String bssid = wifi.getBSSID(); String gateway = wifi.getGateway(); String ip = wifi.getIp(); wifi.scan(new ScanOptions().setTimeoutMillis(5000)) .onResult((results, err) -> { /* ... */ }); wifi.connect("MyNetwork", "hunter2", Security.WPA2_PSK) .onResult((success, err) -> { /* ... */ }); com.codename1.io.wifi for WiFi info, scan, and connect. com.codename1.io.wifi.WiFiDirect for peer-to-peer (Android only by platform reality). com.codename1.io.bonjour for mDNS / Zeroconf via BonjourBrowser and BonjourPublisher. com.codename1.io.usb for USB host (Android only). And NetworkManager.addNetworkTypeListener(...) plus NETWORK_TYPE_* constants so an app can react to a transition between cellular, WiFi, ethernet, or "none": Java NetworkManager.getInstance().addNetworkTypeListener(evt -> { int type = evt.getNetworkType(); if (type == NetworkManager.NETWORK_TYPE_NONE) showOfflineBanner(); else if (type == NetworkManager.NETWORK_TYPE_CELLULAR) suppressLargeBackgroundDownloads(); else clearOfflineBanner(); }); iOS does not expose programmatic WiFi scanning to third-party apps; scan() throws UnsupportedOperationException on iOS. iOS also does not expose WiFi Direct or general USB host. None of those are Codename One limitations; they are Apple's. The dev guide is explicit about each platform's limits. Three new compile-time defines (CN1_INCLUDE_WIFI_INFO, CN1_INCLUDE_HOTSPOT, CN1_INCLUDE_BONJOUR) wrap the iOS native code, set only when the classpath scanner sees the matching Java API in use. 
Apps that do not use these APIs do not pay for them at App Store review time. Same pattern as the NFC gating from the previous release. Full reference: Network Connectivity. Share-Sheet Result Callbacks PR #5036 closes a small but persistent gap: Display.share(...) and ShareButton finally tell you what the user did with the share sheet: Java ShareButton btn = new ShareButton(); btn.setTextToShare("Look at this fox"); btn.setImageToShare("/fox.jpg"); btn.setShareResultListener(result -> { switch (result.getStatus()) { case SHARED_TO: track("share_completed", result.getTargetPackage()); break; case DISMISSED: track("share_dismissed"); break; case FAILED: track("share_failed", result.getError()); break; } }); iOS routes through UIActivityViewController.completionWithItemsHandler; Android through Intent.createChooser with an IntentSender callback (API 22+). The framework normalizes the platform values into SHARED_TO(packageName), DISMISSED, or FAILED. Appearing in Other Apps' Share Menus The other half of sharing is the inverse direction: not "let the user share from your app", but "let your app receive content other apps share". If a user is in Safari, Photos, or Mail and taps the share icon, your app should be able to appear as a target there alongside Messages, WhatsApp, and Instagram. On iOS that requires a separate Share Extension target inside the .ipa, with its own bundle, its own Info.plist, an App Group string that links it to the host app, and a ShareViewController that handles the incoming payload. Historically the recommendation was to bootstrap that target by hand in Xcode, copy the resulting files into the Codename One project under ios/app_extensions/, and let the build server's extractor consume them. It worked, but it was a workflow most teams put off because the setup is fiddly. The same PR ships an IOSShareExtensionBuilder Mojo that does all of that for you. 
A typical setup is one Maven command and a one-time configuration block: XML <plugin> <groupId>com.codenameone</groupId> <artifactId>codenameone-maven-plugin</artifactId> <configuration> <iosShareExtension> <bundleIdentifier>com.example.myapp.share</bundleIdentifier> <displayName>MyApp</displayName> <appGroup>group.com.example.myapp</appGroup> <acceptedContent> <content>PUBLIC_URL</content> <content>PUBLIC_IMAGE</content> <content>PUBLIC_TEXT</content> </acceptedContent> </iosShareExtension> </configuration> </plugin> Run mvn cn1:generate-ios-share-extension and the Mojo writes a complete .ios.appext bundle into ios/app_extensions/: the Info.plist with the right NSExtension activation rules for the content types you declared, the App Group entitlement, a minimal ShareViewController.swift that lands the payload in the App Group's UserDefaults(suiteName:), and the matching buildSettings.properties. The result feeds straight into the existing IPhoneBuilder.extractAppExtensions pipeline, so apps that already have a hand-rolled extension keep working unchanged. On the host-app side, you read the payload on launch: Java // Anywhere after Display.init has run String shared = (String) Storage.getInstance() .readObject("ios.shareExtension.lastPayload"); if (shared != null) { handleSharedPayload(shared); } After the next cloud or local build, your app appears in the iOS share sheet for the content types you declared. No Xcode work, no hand-rolled plist, no App Group string typed in three places. The build-time tooling owns it. Wrapping Up Tomorrow's post covers the architectural change in this release: a build-time bytecode annotation framework, the declarative router that is its first consumer, the SQLite ORM and JSON / XML mappers and component binder built on the same SPI, and the build-time SVG / Lottie transcoder that ships in the same release for related reasons.
Most data architectures don't fail all of a sudden. They show clear warning signs for months, or sometimes years, before anyone takes action. By that time, the damage is already done. I have spent 20 years building and reviewing data platforms across industries (from CPG to healthcare to consumer tech), and I've learned to identify these signals early. The good news is that you can fix them before they become a disaster. The bad news is that most organizations ignore these signs until an AI initiative gets stuck, executives lose trust in reports and dashboards, or new hires quit because the system is too complex to understand and maintain. Here are five critical signs that your data architecture needs a redesign, along with what to do about each one.

Sign 1: Your AI Initiatives Keep Stalling at the Data Layer

You've got the right team. You've picked the best models. You've invested in the necessary infrastructure. Still, your AI projects keep hitting the same blocker: they can't move past experimentation. The problem isn't your models. It's your data.

What's Actually Happening

AI systems need three things that most legacy data architectures don't provide:
- Semantic layer: Clear definitions of what your data means
- Data lineage: Traceability of where data came from and how it was transformed
- Governed access: Controlled, policy-driven data access at scale

Without these, your AI models are working with incomplete or inconsistent information. They might produce results, but you can't trust them. And when business leaders ask, "Why did the model make this decision?" you can't answer.

The Architecture Gap

This is what an AI-ready architecture looks like. Most architectures skip the middle layers. They ingest raw data and may have some curated/gold-layer tables, but nothing in between. That's why AI fails.

What to Fix?
- Add a semantic layer that defines business metrics consistently across teams
- Implement active metadata that tracks lineage automatically
- Build governed access into your architecture, not as a separate policy document

AI readiness starts in the architecture, not in the model you picked.

Sign 2: Different Teams Get Different Answers From the Same Data

The marketing department says revenue is $10M. The finance department says it's $9.2M. The CEO's dashboard shows $10.5M. Everyone's using the same source data. Yet nobody agrees. This isn't a reporting problem; it is a semantic layer problem. When you don't have a centralized definition of what "revenue" (or any other business metric) means, every team creates its own version. Marketing might count revenue when a campaign is launched. Finance counts it when payment is recognized. The executive dashboard might include projected revenue. All are "correct," but they don't match.

The Cost of Inconsistency

The Architecture You Need

When everyone uses the same semantic definitions, numbers align. Trust returns. Decisions happen faster.

What to Fix?
- Define business metrics once in a centralized semantic layer
- Enforce those definitions across all reporting tools
- Document the logic in a central place like Confluence so anyone can trace how a number was calculated

Sign 3: Your Governance Lives in a Document That Nobody Reads

You have a data governance policy. It is a .docx or .pdf file sitting in a SharePoint or Confluence site. No one has opened it for a very long time. Meanwhile, your team handles data access requests manually, someone forgot to tag the sensitive data, and nobody knows which downstream systems are consuming PII. Governance in 2026 is embedded in the architecture, not sitting in a document somewhere. Real governance is not about people remembering to follow rules. It's about systems that enforce them automatically.
Old way (broken):
- Policy documents
- Training sessions
- Manual access reviews
- Periodic audits

Modern way (embedded):
- Automated lineage tracking
- Active metadata that tags sensitive data
- Policy enforcement at the query level
- Continuous compliance monitoring

Embedded Governance

Every query is checked against policies. Sensitive data is tagged automatically. Lineage is tracked without human input. Governance happens by design, not by reminders.

What to Fix?
- Move governance from documents to code (policy engines, access controls)
- Implement active metadata that automatically tags and classifies data
- Build lineage tracking into your pipeline tooling
- Enforce policies at the query layer, not as a post-check

Sign 4: Security Was Designed for Humans, Not for AI Agents

Your security model works great for analysts querying dashboards, data engineers running pipelines, and data scientists building models. But here is what it was not built for: AI agents that query your data autonomously, all the time, at scale, without a human in the loop (HITL).

The New Access Pattern

Old access:
- A human queries the data
- A human reviews the results
- A human decides what to do with the results

AI agent access:
- Agents query the data continuously
- Agents process thousands of rows automatically
- Agents make decisions without a human reviewing them
- Agents scale across multiple data sets

The Security Gap

If your security model assumes humans are always involved, you end up with a growing security gap.

Security for AI Agents

You need fine-grained, policy-driven security that works for both human and machine users.

What to Fix?
- Implement column-level security (not just object-level)
- Add rate limits and quotas for AI agents
- Log all access in real time with anomaly detection
- Use context-aware policies that consider the intent of the query, not just the user role

Sign 5: A New Engineer Needs Months to Understand the Architecture of the System

A new data engineer joined your team.
They are smart, experienced, and highly motivated. But two months in, with onboarding complete, they still can't confidently answer "Where does this metric come from?" or "What happens if I change this part of the pipeline?" Do you think it is a hiring problem? No, certainly not. It's an architecture problem.

Great Architecture Is Maintainable

If onboarding an engineer takes longer than it should, the design is the issue, not the engineer. Here are the red flags to watch for:
- No clear data modelling standards
- Missing or incomplete metadata
- Poorly defined ownership (who owns this table?)
- Fragmented pipeline design (no standard, consistent pattern)
- Documentation that is missing or outdated

Maintainable Architecture Principles

When these are part of your architecture, new hires can navigate the system in weeks instead of months.

What to Fix?
- Standardize data modelling across the organization
- Generate metadata automatically (don't depend on manual documentation)
- Use a consistent pattern for pipelines (same design, same tools, same naming standards)
- Assign clear ownership for every data product and data domain
- Auto-generate documentation for newly created pipelines and code

The Fundamentals Never Change, but the Layers Around Them Have

After working for 20 years in this space, I have noticed that the core principles of data architecture remain the same.

What never changes:
- Data modeling
- Schema design
- Aligning with business outcomes and requirements

What has matured over time:
- Governance (embedded rather than document-driven)
- Metadata (active rather than passive)
- Semantic layer (centralized rather than scattered across teams)
- Security (AI-aware, not human-only)
- AI readiness (architecture first, not model first)

If these modern layers are missing from your architecture, now is the time to add them. Not when AI initiatives stall, not when executive leaders lose trust in the data, not when your best engineers quit because the system became too complex.
Which Sign Resonates Most With You?

I have worked with companies that have faced all five of these signs. Some are dealing with one. Most are dealing with three or four. The question is not whether you have these problems. The question is: which one is costing you the most right now?
- Is it AI initiatives that can't move forward?
- Is it teams that can't agree on basic metrics?
- Is it governance that exists only in a document?
- Is it a security gap that you're discovering too late?
- Is it engineers who can't navigate your architecture?

Pick the one that's most urgent and start there. You don't need to solve everything at once. But do start, before the warning signs become breaking points.