AI coding assistants are becoming increasingly capable at generating code, explaining systems, and accelerating development workflows. But in real engineering environments, the biggest blocker is often not the model’s ability to write code. The bigger issue is whether the assistant has the right context before it starts making changes.

A developer rarely works from a single source of truth. A Jira ticket may describe the implementation task. A Google Doc may contain the detailed requirements. A slide deck may explain the business goal. A meeting summary may include key decisions, open questions, and next steps that never made it back into the ticket.

For a human developer, this creates friction. For an AI coding assistant, it creates risk. The assistant may generate code that looks correct, passes basic syntax checks, and follows existing patterns, but still implements the wrong behavior because the actual feature context was fragmented across multiple places.

This is where a PARA-style context workspace becomes useful. PARA (Projects, Areas, Resources, and Archives) is commonly used to organize knowledge by actionability. Applied to AI-assisted software development, it can become a practical architecture pattern for preparing scattered engineering knowledge before an AI coding assistant touches code. The goal is not to dump every document into the model. The goal is to organize scattered context so the assistant can reason with the right information for the task.

The Problem: AI Coding Assistants Often See Only Part of the Work

Consider a developer asked to build a new data pipeline that calculates a generic quality score. The implementation sounds straightforward:

Build a pipeline that joins multiple input tables, applies business rules, and produces a quality score output table.
But the actual context may be spread across several sources:

| Source | What It May Contain |
|---|---|
| Ticket | Implementation scope, acceptance criteria, due date |
| Requirements doc | Business rules, scoring logic, data definitions |
| Slide deck | Business goal, stakeholder alignment, expected impact |
| Meeting summary | Final decisions, open questions, changed thresholds |
| Existing code | Pipeline patterns, naming conventions, dependency structure |
| Older documents | Previous decisions, deprecated approaches, known constraints |

If the AI coding assistant only sees the ticket, it may miss the deeper context needed to implement the feature correctly. This is especially risky for data pipelines and analytics features, where correctness depends not only on code structure but also on interpretation: which source tables to use, how freshness should be handled, how business rules are applied, and how downstream consumers will use the output.

What Can Go Wrong If the Agent Only Reads the Ticket?

A ticket often captures the visible work, but not the full reasoning behind the work. If the assistant only uses the ticket, it may:

- Implement the task but miss business rules from the requirements document
- Ignore key decisions captured in meeting summaries
- Use a technically available source table that is not the approved source for this feature
- Miss freshness expectations for the output table
- Produce a score that does not match how downstream dashboards or reports will consume it
- Follow an outdated implementation pattern because it found old but similar code
- Generate a pull request that looks reasonable but fails product or data-quality expectations

This is the core issue: The AI assistant may know how to write code, but it may not know which code should be written. That distinction matters. For coding agents to become more reliable, developers need a better way to prepare context before code generation begins.
Reframing PARA for AI Coding Agents

PARA can be adapted from a personal knowledge organization method into a context classification pattern for AI-assisted development. In a PARA-style context workspace:

| PARA Category | Engineering Meaning | Agent Context Role |
|---|---|---|
| Projects | Active work being delivered | Current feature scope, ticket, task goal |
| Areas | Ongoing responsibilities | Standards, ownership, governance, quality expectations |
| Resources | Reusable knowledge | Docs, runbooks, design patterns, pipeline examples |
| Archives | Completed or inactive knowledge | Historical decisions, old approaches, past incidents |

This structure helps the AI assistant understand the role of each piece of information. A current requirement should not be treated the same way as an old design decision. A meeting decision should not be buried behind a generic document search. A reusable pipeline pattern should be available to guide implementation, while archived material should be used carefully as historical context.

The value of PARA is not just organization. It gives the assistant a way to distinguish between active task context, long-running rules, reusable references, and historical information.

This classification changes how the assistant approaches implementation. Instead of asking:

“What code should I generate from this ticket?”

The assistant can reason from a richer question:

“What is the active feature goal, what rules must be followed, what reusable references apply, and what historical context should be considered before changing code?”

That shift is small, but important.

Applying PARA to a Quality Score Pipeline

Now apply this to the quality score pipeline example. The feature requires a pipeline that joins multiple input tables, applies business rules, and writes a quality score output table. The exact business logic is intentionally generic, but the pattern is common across analytics engineering, data engineering, machine learning platforms, and reporting systems.
A PARA-style workspace could organize the context like this:

Project Context

This is the active feature work. It may include:

- The current ticket
- Feature scope
- Acceptance criteria
- Current implementation status
- Target output table
- Expected delivery milestone
- Known blockers or open questions

For the coding assistant, this answers: “What am I being asked to build right now?”

Area Context

This represents ongoing expectations that apply beyond this one feature. It may include:

- Data quality standards
- Freshness expectations
- Ownership rules
- Privacy or compliance constraints
- Naming conventions
- Release process
- Testing expectations

For the coding assistant, this answers: “What rules and standards must this implementation follow?”

Resource Context

This is reusable technical knowledge. It may include:

- Existing pipeline patterns
- Similar transformation logic
- Data model documentation
- Dashboard dependency notes
- Common test patterns
- Runbooks
- Data validation examples

For the coding assistant, this answers: “What reusable references should guide the implementation?”

Archive Context

This is historical information that may still be useful, but should not automatically drive the implementation. It may include:

- Older design decisions
- Deprecated scoring logic
- Past pipeline migrations
- Previous quality metric experiments
- Historical meeting notes
- Old RCA or incident learnings

For the coding assistant, this answers: “What historical context may explain why the system works this way?”

The critical point is that archived context should be used for awareness, not blindly copied into the current implementation.

Why Meeting Summaries Matter

Meeting summaries are often underestimated in AI-assisted development. In many teams, the final decision is not always reflected immediately in the ticket or requirements document.
A meeting summary may contain important details such as:

- A threshold was changed after stakeholder discussion
- A source table was rejected because of data freshness concerns
- A metric definition was clarified
- A downstream dashboard dependency was identified
- A launch decision was postponed
- An open question was assigned to another team
- A temporary workaround was approved only for the first release

For a human developer, these details may be remembered from the meeting. For an AI coding assistant, they are invisible unless they are included in context.

This is one reason a PARA-style workspace can be valuable. It gives meeting summaries a place in the feature context without treating them as random notes. A meeting summary tied to an active feature belongs in the Project context. A recurring decision about data freshness may become the Area context. A reusable explanation of metric calculation may become the Resource context. Once the feature is complete, the same meeting summary may eventually move into the Archive context.

How the Coding Assistant Should Use Context Before Changing Code

Before generating code, the AI coding assistant should use the structured context to form an implementation understanding. For a quality score pipeline, it should first understand:

- What the feature is trying to accomplish
- Which input data sources are approved
- Which business rules define the score
- Which decisions were finalized in meetings
- What freshness or latency expectations exist
- Which existing pipeline patterns should be followed
- What downstream dashboards, reports, or consumers depend on the output
- Which historical approaches should be avoided

Only after that should it propose an implementation plan or modify code.

This changes the assistant’s role. It is no longer simply a code generator responding to a ticket. It becomes a context-aware engineering assistant that can reason across requirements, decisions, standards, and existing system patterns.
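The checklist above can be sketched as a small gate that runs before any code is generated. This is only an illustration under stated assumptions: the `Need` enum, `ContextItem`, `missing_context`, and `ready_for_codegen` are hypothetical names invented here, and the rule that archived items count only toward historical awareness is one possible reading of the "awareness, not blindly copied" guidance.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// The four PARA categories.
enum class Para { Project, Area, Resource, Archive };

// Hypothetical labels for the eight checklist questions.
enum class Need {
    FeatureGoal, ApprovedSources, BusinessRules, MeetingDecisions,
    FreshnessExpectations, PipelinePatterns, DownstreamConsumers,
    HistoricalApproaches
};

// One piece of classified context (assumed shape, for illustration only).
struct ContextItem {
    Para category;
    Need answers;        // which checklist question this item helps answer
    std::string source;  // e.g. "ticket", "meeting summary"
};

// An item covers a need only if it answers it; archived items are treated
// as awareness-only, so they can satisfy HistoricalApproaches and nothing else.
inline bool covers(const ContextItem& item, Need need) {
    assert(!item.source.empty());  // every item should record where it came from
    if (item.answers != need) return false;
    if (item.category == Para::Archive) return need == Need::HistoricalApproaches;
    return true;
}

// Returns the checklist questions that no current item answers.
inline std::vector<Need> missing_context(const std::vector<ContextItem>& items) {
    static const Need checklist[] = {
        Need::FeatureGoal, Need::ApprovedSources, Need::BusinessRules,
        Need::MeetingDecisions, Need::FreshnessExpectations,
        Need::PipelinePatterns, Need::DownstreamConsumers,
        Need::HistoricalApproaches};
    std::vector<Need> missing;
    for (Need need : checklist) {
        const bool found = std::any_of(
            items.begin(), items.end(),
            [need](const ContextItem& item) { return covers(item, need); });
        if (!found) missing.push_back(need);
    }
    return missing;
}

// The assistant proposes a plan only when nothing is missing.
inline bool ready_for_codegen(const std::vector<ContextItem>& items) {
    return missing_context(items).empty();
}
```

In use, a workspace holding only the ticket and a deprecated scoring document would report seven open questions, which signals that the requirements doc and meeting summaries should be pulled in before the assistant touches code.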
The Bigger Shift: From Prompting to Context Preparation

Prompting is still useful, but it is not enough for complex engineering work. A good prompt cannot fully compensate for missing requirements, outdated context, or scattered decisions.

For AI coding assistants, the quality of the result depends heavily on the quality of the context that comes before the prompt. This is especially true when the task involves business logic, analytics definitions, data contracts, or cross-team decisions. In those cases, the question is not:

“How do we write a better prompt?”

The better question is:

“How do we prepare the right engineering context before asking the assistant to write code?”

For developers building with AI coding agents, this may become one of the most important habits: do not ask the agent to write code first. Prepare the context first.

Because the future of AI-assisted development will not belong only to teams with the most powerful coding models. It will belong to teams that know how to structure knowledge so those models can make better engineering decisions.
In mid-September 2025, engineers inside Anthropic's threat intelligence team noticed something that didn't fit the usual pattern of automated probing on their platform. Ten days of digging later, they had a name for it: GTG-1002, a Chinese state-sponsored group that had turned Claude Code into the operational core of a cyber-espionage campaign against roughly thirty organizations, among them banks, chemical manufacturers, tech firms, and government agencies. When Anthropic published its account of the intrusion on November 14, the detail that made security teams sit up wasn't the target list. It was the autonomy ratio: by the company's own estimate, the AI agent executed somewhere between 80 and 90 percent of the operation (reconnaissance, vulnerability discovery, exploit development, lateral movement, exfiltration), with humans stepping in only at a handful of strategic checkpoints. Jacob Klein, who heads threat intelligence at Anthropic, called it an escalation that lowers the bar for who can run a sophisticated intrusion at all.

I've spent the better part of this year watching that bar keep dropping, one disclosure at a time. And the thing I keep coming back to is this: the security industry built thirty years of tooling around the assumption that the dangerous actor inside your network is a person: a careless employee, a disgruntled admin, a phished contractor. That assumption is now wrong often enough to be a liability. The dangerous actor increasingly has no payroll record, no badge, no manager to flag erratic behavior. It's a process. And it's already inside.

Skeleton Keys for Software

Here's the uncomfortable arithmetic. CyberArk's 2025 Identity Security Landscape study found machine identities now outnumber human ones by more than 80 to 1 inside the average enterprise, with AI specifically named as the biggest driver of new privileged accounts this year.
Other measurements land in a wide band (Rubrik Zero Labs put it at 82 to 1, Entro Labs measured DevOps-heavy environments at 144 to 1), but every credible estimate points in the same direction, and the gap is widening faster than anyone's governance program.

What makes this dangerous isn't the count. It's the habit. Most teams I've talked with over the past eighteen months reached for the path of least resistance when they first wired an agent into production: they handed it a copy of a human's API key, or a service account with the same standing privileges everyone else in that pipeline already had. It's the software equivalent of cutting a spare house key and leaving it under the mat: convenient until the day someone you didn't intend finds it.

That convenience is exactly what blew up Salesloft and its customers in August 2025. Attackers tracked as UNC6395 didn't breach Salesforce. They stole OAuth tokens belonging to Drift, a chatbot integration plugged into it, and used those long-lived, broadly scoped tokens to walk into Salesforce, Slack, AWS, and Google Workspace environments at more than 700 downstream organizations, Cloudflare and Google among them, over roughly a ten-day window. Nobody compromised the platform. They compromised the credential that the integration was trusted with, and that credential opened far more doors than the integration's actual job required. Swap "chatbot integration" for "AI agent," and you've described the exact failure mode every analyst is now warning about for 2026.
The fix that keeps surfacing in serious architecture conversations isn't exotic. It's the same zero-trust logic that's been preached at humans for a decade, finally pointed at software:

| | Skeleton-key model | Scoped-identity model |
|---|---|---|
| Credential | Copied human API key or shared service account | Unique identity per agent, issued via OAuth client credentials or a workload-identity standard like SPIFFE |
| Lifetime | Static, often unrotated for months or years | Short-lived, reissued per session or task |
| Blast radius if stolen | Everything that account can touch | Only what that specific agent was scoped to do |
| Auditability | "Someone" did this | This agent, acting on this task, did this |

None of this is theoretical anymore. Gartner is telling boards that by 2028, roughly a third of enterprise applications will carry embedded agentic AI, and 15 percent of day-to-day work decisions will be made without a human in the loop. You cannot run that volume of autonomous action on credentials designed for an employee who logs in, does a job, and logs out.

When the Prompt Is the Payload

If identity is the slower-burning problem, prompt injection is the one that's already setting things on fire. OWASP's 2025 Top 10 for LLM Applications kept it at the number-one slot for a second consecutive edition, and for good reason: an LLM has no architectural separation between "instructions I should obey" and "data I should merely read." Feed it both in the same channel, and a sufficiently clever attacker can make the model treat the second as the first.

The cleanest public demonstration of how bad this gets in practice is CamoLeak, the vulnerability researcher Omer Mayraz disclosed through Legit Security in October 2025, tracked as CVE-2025-59145 with a CVSS score of 9.6.
The setup was almost playful: hide an instruction inside a pull request's invisible comment field, wait for a developer to ask GitHub Copilot Chat to review that PR, and let Copilot, operating with that developer's own repository privileges, quietly search the codebase for strings like "AWS_KEY," then exfiltrate whatever it found one character at a time. Each character got mapped to its own GitHub-hosted image URL, routed through GitHub's own trusted Camo proxy so the outbound traffic looked like nothing more than a chat window rendering a picture. Legit Security's CTO, Liav Caspi, put the core problem plainly: a vigilant network monitor might catch the unusual request pattern, but the average user or maintainer almost certainly wouldn't. GitHub closed the hole in August by disabling image rendering in Copilot Chat entirely, a blunt fix, but an honest acknowledgment that there was no elegant patch for the underlying design flaw.

What should worry you is that CamoLeak is GitHub-specific plumbing wrapped around a generic problem. Any agent that reads untrusted content and can also take action (summarize an inbox, browse a webpage, query a ticketing system) has the same exposed nerve. The attack surface isn't the code. It's the fact that the model can't reliably tell an instruction from a sentence describing one.

MCP Didn't Invent the Confused Deputy. It Industrialized It.

The Model Context Protocol turned eighteen months old this past spring, and in agent circles it's already being described, only half-jokingly, as the USB-C of AI tooling: a single standard that lets an agent plug into dozens of databases, SaaS platforms, and internal systems without custom integration code for each one. That convenience is precisely why it became 2025's most interesting new attack surface. CVE-2025-49596 let attackers run arbitrary commands through unauthenticated MCP Inspector instances, rated 9.4.
CVE-2025-6514, found in the widely used mcp-remote project, hit 9.6 and gave attackers OS-level command execution simply by getting an MCP client to connect to a malicious server. Researchers at Invariant Labs separately showed they could pull private repository data and WhatsApp message history out through MCP integrations that trusted server-supplied tool descriptions a little too much.

That last detail is the one practitioners now call tool poisoning, and it deserves more attention than it gets. An MCP server doesn't just expose a function; it ships a natural-language description of that function for the model to read. Bury a hidden instruction inside that description, and the agent absorbs it as context with the same credulity it would extend to legitimate documentation. Layer in what researchers call a rug pull (a tool that behaved safely last week, silently swapping in malicious behavior this week, with no re-approval prompt) and you've got a supply chain risk that traditional dependency scanning has no vocabulary for.

Underneath all of it sits the same architectural sin the original insider-threat literature has been naming for years: authorization quietly divorcing from authentication. An MCP server executing a database query on an agent's behalf needs to know not just that the agent is who it claims to be, but what the human or task behind that request was actually authorized to do. Skip that check, and you've built a confused deputy that will dutifully escalate its own privileges on a stranger's behalf.

Where the Policy Engine Has to Live

The architecture pattern that's converging across the vendors and practitioners I trust most isn't subtle, and that's its strength.
You insert a policy decision point (Cerbos, Open Policy Agent, or an equivalent) directly in the path between the agent's tool calls and the systems those calls touch, so that nothing executes on trust alone:

    User
      |
      v
    AI Agent ----(declares identity + intent)----> Policy Engine (PDP)
      ^                                                 |
      |                                          allow? | deny?
      |                                                 v
      |                                            MCP Server -----> Database / API
      |                                                 |
      +---------------------(action result)-------------+

The point of that middle box is to ask a boring, specific question on every single call: which agent is this, what was it actually asked to do, and does this particular action fall inside that scope? "Only SalesBot may call lookup_customer." "Any transfer above a threshold requires a human approval step before the MCP server executes it." None of that logic lives in the model's good judgment, because the model's judgment is exactly what prompt injection is designed to corrupt. The enforcement has to sit somewhere a crafted sentence can't reach it.

This is also, not coincidentally, where the Cloud Security Alliance's "toxic cloud trilogy" (a public workload, a real vulnerability, and standing high-level privilege, all present at once) actually gets defused. CSA's own telemetry shows that the combination was present in 38 percent of workloads in early 2024 and had fallen to 29 percent by mid-2025, as organizations started pulling standing privilege out of the equation. That's real progress. It's also nowhere near fast enough for the rate at which agents are being deployed.

What 2026 Actually Requires

I don't think the next twelve months are going to be defined by a single dramatic breach, although there will probably be one anyway.
I think they'll be defined by something quieter and more structural: the slow, overdue migration of agents off static, shared credentials and onto something closer to what SPIFFE and SPIRE were originally built for in the service-mesh world, namely short-lived, cryptographically verifiable, per-workload identity that can be issued, scoped, and revoked without anyone touching a spreadsheet of API keys. OWASP published a dedicated Non-Human Identity Top 10 in 2025 for exactly this reason; the existing application-security and human-IAM playbooks simply don't have entries for credentials that never sleep, never request access, and inherit whatever standing permission happens to be sitting there.

The governance gap is still wide open. Recent industry surveys put the share of organizations with mature agent-governance programs below one in five, even as more than ninety percent of security leaders rate the problem as critical. That mismatch (high anxiety, low operational maturity) is usually the exact condition under which the expensive breach happens.

My honest read, after a year of watching this space accelerate: the organizations that treat their agents as first-class, individually identified, least-privileged principals from day one will look unremarkable in hindsight. The ones that don't will be writing the incident reports everyone else cites in 2027.
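To make the policy-decision-point idea from "Where the Policy Engine Has to Live" a little more concrete, here is a minimal sketch of the per-call check, written as plain C++ rather than in the policy language of Cerbos or Open Policy Agent. Everything in it is an assumption for illustration: the `AgentIdentity` and `ToolCall` shapes, the `decide` function, the `"transfer"` tool name, and the threshold value are hypothetical and do not come from any of those products.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

using Clock = std::chrono::system_clock;

// Hypothetical scoped identity: unique per agent, explicit tool scope,
// short lifetime (the "scoped-identity model" column of the table).
struct AgentIdentity {
    std::string agent_id;
    std::set<std::string> allowed_tools;
    Clock::time_point expires_at;
};

// Hypothetical tool call as seen by the policy decision point.
struct ToolCall {
    std::string tool;
    double amount;  // only meaningful for transfer-style tools
};

enum class Decision { Allow, Deny, RequireHumanApproval };

// Illustrative threshold only; no specific value is given in the text.
constexpr double kApprovalThreshold = 1000.0;

// Runs outside the model, so a crafted prompt cannot rewrite these rules.
inline Decision decide(const AgentIdentity& who, const ToolCall& call,
                       Clock::time_point now) {
    assert(!who.agent_id.empty());  // every call must carry an identity
    if (now >= who.expires_at) return Decision::Deny;  // expired credential
    if (who.allowed_tools.count(call.tool) == 0) {
        return Decision::Deny;  // tool is outside this agent's scope
    }
    if (call.tool == "transfer" && call.amount > kApprovalThreshold) {
        return Decision::RequireHumanApproval;  // large transfers need a human
    }
    return Decision::Allow;
}
```

The design choice worth noticing is that the decision takes only the declared identity, the requested action, and the current time as inputs. Nothing the model says about its own intentions feeds into the result, which is what keeps the check out of reach of prompt injection.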
The bug looked simple. Resume from sleep, the screen flashes, the display server segfaults. About one in twelve resumes. The device, a Yocto-based industrial Linux box, ARM64, running a custom Wayland compositor on top of wlroots, would log nothing useful, drop to a black screen, and require a reboot. The customer’s complaint was three lines long. The fix took eleven weeks.

This article is what I learned, in the order I learned it, debugging a null pointer dereference inside a Wayland compositor’s lifecycle. It’s specific to wlroots and a custom compositor based on it, but most of the techniques transfer to any C++ system at this layer.

The Platform

Custom industrial device, ARM Cortex-A72, Linux 5.15 LTS, Wayland 1.21, wlroots 0.16. The compositor is roughly 12,000 lines of C++ that we wrote on top of wlroots’ Tinywl example. It exposes a kiosk-style interface, a single fullscreen surface, an IR-driven UI, and no windowing at all. Mali GPU with proprietary user-space drivers. A nightmare to debug because half the symbols are stripped and the GPU stack vendor is contractually unable to give us source.

The reproducer: suspend (systemctl suspend), wait 30 seconds, resume. About 8% of resumes crash. No pattern that mapped to load, time-since-boot, surface contents, or anything we could see from logs.

What Didn’t Help

The first three weeks were spent on things that did not work:

- dmesg/journalctl. Showed a clean suspend/resume cycle. The compositor exited with signal 11, and that was the only entry. No backtrace.
- The customer’s bug report. “Screen flashes black sometimes after sleep.” Six words and a video that didn’t show anything we could use.
- Reading the wlroots source. Useful in retrospect for understanding the lifecycle, but not for spotting the bug. The bug was in our code, not theirs.
- Trying to repro on a developer workstation.
Workstations are x86, run different kernel versions, have different DRM drivers, and don’t actually suspend in a way that exercises the same code paths. Lakshmi on the team spent two weeks trying to repro on a NUC. It never crashed. She finally said, in a one-on-one, “I think I’m wasting your time.” She wasn’t, but the strategy was wrong; we needed to repro on the device.
- Adding printf logs. This is what I’d been doing for six days when I realized the timing-sensitive nature of the bug meant that adding logs changed the timing enough to make the crash either much rarer or, occasionally, much more frequent. The Heisenberg debugging problem is real.

The thing that finally moved us was getting gdb running against the compositor on the device, with full debug symbols, and reproducing the crash live.

Setting Up gdb on the Target

The device runs Yocto. Yocto can produce a debug-symbol package for any recipe. Our compositor recipe didn’t include dbg-pkgs in the image features. Adding it, in the local.conf for the developer image:

    IMAGE_FEATURES += "dbg-pkgs"
    EXTRA_IMAGE_FEATURES += "tools-debug"
    # For our compositor specifically:
    INHIBIT_PACKAGE_STRIP = "1"
    # Keep symbols in the binary:
    INHIBIT_PACKAGE_DEBUG_SPLIT = "1"

Note: This nearly doubles the image size on flash. We built a separate dev image for debugging.

Then on the device:

    $ gdb /usr/bin/our-compositor
    (gdb) handle SIGUSR1 nostop noprint pass
    (gdb) handle SIGPIPE nostop noprint pass
    (gdb) set follow-fork-mode child
    (gdb) run

SIGUSR1 is what wlroots uses for some internal signaling, and you don’t want gdb stopping on it. SIGPIPE shows up on broken Wayland client connections, which is normal during a sleep/resume cycle.

Once the compositor was running under gdb I would systemctl suspend from a second terminal, wait, resume by pressing the front-panel button, and watch. After about 40 minutes of attempts, the compositor crashed. gdb caught the SIGSEGV.
Backtrace:

    Program received signal SIGSEGV, Segmentation fault.
    0x0000aaaaaab43a8c in our::OutputManager::handleOutputDestroy (
        listener=0xaaaaaaab8a920, data=0xaaaaaaaadcd60)
        at src/output_manager.cpp:284
    284         output_state_t *state = output->user_data->state;
    (gdb) print output->user_data
    $1 = (our::OutputUserData *) 0x0

The crash was an output->user_data dereference where user_data was null. Specifically, this code:

    void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
        struct wlr_output *output = static_cast<struct wlr_output *>(data);
        output_state_t *state = output->user_data->state;  // crash here
        // ... cleanup
    }

output->user_data was null. Why?

Reading the wlroots Lifecycle

The relevant pattern: when a wlr_output is added to the compositor, we allocate an OutputUserData structure and stuff it into output->user_data. When the output is destroyed, we look up the user_data to clean up. Standard.

The lifecycle events the compositor cares about:

1. wlr_output_create, output appears (e.g., HDMI plugged in, or DRM connector activated post-resume).
2. We attach a destroy listener and allocate user_data.
3. The output is configured, framebuffers attached, frames rendered.
4. Suspend: DRM connectors are deactivated. The kernel may or may not destroy the wlr_output depending on the driver.
5. Resume: DRM connectors come back. New wlr_output may be created or the old one reused.
6. If the old one is destroyed, our handleOutputDestroy is called.

The bug, after a few hours of staring: during step 4, on this hardware, the Mali driver was tearing down the connector, but the wlroots backend was firing the destroy event for a wlr_output that had been partially torn down already. Specifically, the user_data pointer had been freed by an earlier teardown path that we hadn’t finished migrating from a previous architecture.

The Actual Bug

Here’s the code path that bit us.
Simplified:

    // In our compositor init, we attach two listeners:
    void OutputManager::onNewOutput(struct wlr_output *output) {
        auto user_data = new OutputUserData();
        user_data->state = new output_state_t();
        output->user_data = user_data;

        user_data->destroy.notify = handleOutputDestroy;
        wl_signal_add(&output->events.destroy, &user_data->destroy);

        user_data->frame.notify = handleOutputFrame;
        wl_signal_add(&output->events.frame, &user_data->frame);
    }

    // On suspend, we explicitly tear down GPU resources
    void OutputManager::onSuspend() {
        for (auto &output : outputs) {
            if (output->user_data) {
                delete output->user_data->state;  // <-- problem
                delete output->user_data;         // <-- problem
                output->user_data = nullptr;
            }
        }
    }

    // On destroy event from wlroots:
    void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
        struct wlr_output *output = static_cast<struct wlr_output *>(data);
        output_state_t *state = output->user_data->state;  // crash if user_data is null
        // ...
    }

The flow was:

1. Suspend triggers onSuspend.
2. We free user_data and set it to null.
3. Wlroots, deeper in its own teardown logic, fires the destroy event on the same output.
4. Our handleOutputDestroy runs, dereferences user_data, crashes.

Why hadn’t we seen this before? Because on the NUC and earlier development hardware, the DRM driver kept the connector alive across suspend/resume; it didn’t destroy the output. The Mali driver on the production hardware destroyed it. So the crash only manifested in production.

The original onSuspend was added during a refactor in 2022 to free GPU memory during suspend. The hypothesis, at the time, was that we needed to free explicitly because the driver wouldn’t. The hypothesis was wrong on Mali hardware, where the driver does free things, and freeing twice causes this crash.

The Fix

Two changes. First, the handleOutputDestroy had to defend against null user_data.
This is good practice anyway; the wlroots event can fire for any reason at any time, and you can’t assume your user_data is still valid:

    void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
        struct wlr_output *output = static_cast<struct wlr_output *>(data);
        if (!output->user_data) {
            // Already cleaned up via our suspend path; nothing to do
            wl_list_remove(&listener->link);
            return;
        }
        OutputUserData *ud = static_cast<OutputUserData *>(output->user_data);
        delete ud->state;
        wl_list_remove(&ud->destroy.link);
        wl_list_remove(&ud->frame.link);
        delete ud;
        output->user_data = nullptr;
    }

Second, the onSuspend should not be freeing user_data at all. The wlr_output lifecycle is owned by wlroots, and our user_data should be freed by handleOutputDestroy exclusively. The original “free GPU memory” goal can be achieved by freeing the state object (which holds the GPU buffer references) without freeing the user_data wrapper:

    void OutputManager::onSuspend() {
        for (auto &output : outputs) {
            if (output->user_data) {
                OutputUserData *ud = static_cast<OutputUserData *>(output->user_data);
                // Free GPU-attached state but keep the user_data wrapper.
                // Wlroots will destroy the output and our handleOutputDestroy
                // will fire if the kernel actually destroys the connector.
                delete ud->state;
                ud->state = nullptr;
            }
        }
    }

After this change, handleOutputDestroy had to additionally null-check ud->state:

    output_state_t *state = ud->state;  // may be null after suspend
    if (state) {
        // GPU cleanup
    }

Eleven weeks of debugging, six lines of fix.

Tools That Earned Their Keep

After this bug, I added some habits that I’d skipped before.

AddressSanitizer on every CI build. Not just the production build, every PR build runs ASan. The onSuspend/handleOutputDestroy interaction would have shown up as a use-after-free under ASan in seconds, with a clear stack trace of both the allocation and the free.
The reason we hadn’t been running ASan: it adds about 2x overhead, and the team had it disabled “for performance.” Re-enabled. The performance hit on CI is fine; the cost of three-week bugs is not.

CMake
# In our CMake:
if(SANITIZE_ADDRESS)
    target_compile_options(compositor PRIVATE -fsanitize=address -fno-omit-frame-pointer -O1)
    target_link_options(compositor PRIVATE -fsanitize=address)
endif()

Valgrind on a small test harness. We can’t valgrind the full compositor on the device; it’s too slow to even boot. We can valgrind a unit-test harness that exercises the OutputManager in isolation against a mock wlroots backend. Worth setting up; would have caught this bug at unit-test time.

A wlroots-aware logging macro. Wlroots logs to its own facility. We’d been routing those to journalctl but not capturing them in our own crash dumps. After this bug, we wrote a wrapper that prepends [WLROOTS] to all wlroots log lines and dumps the last 200 to a file on crash:

C++
static void wlroots_log_handler(enum wlr_log_importance level, const char *fmt, va_list args) {
    char buf[1024];
    vsnprintf(buf, sizeof(buf), fmt, args);
    g_log_ring.push_back({level, buf});
    if (g_log_ring.size() > 200) g_log_ring.pop_front();
    // Also forward to stderr / journal. Reuse buf here: args was already
    // consumed by vsnprintf and must not be traversed a second time.
    fprintf(stderr, "[WLROOTS] %s\n", buf);
}

// In main():
wlr_log_init(WLR_DEBUG, wlroots_log_handler);

// In our SIGSEGV handler:
void crash_handler(int sig) {
    FILE *f = fopen("/var/log/compositor-crash-trail.log", "w");
    if (f) { // the log directory may be unwritable; still abort below
        for (auto &entry : g_log_ring) {
            fprintf(f, "[%d] %s\n", entry.level, entry.msg.c_str());
        }
        fclose(f);
    }
    abort();
}

The crash trail captured the wlroots-internal lifecycle events we couldn’t see before. On the next class of bugs we caught, we had context within minutes instead of weeks.

Smart pointers, finally. This bug was a delete problem. Mixed manual new/delete with C library lifecycles is a category of pain.
We’ve been migrating OutputUserData and similar structures to std::unique_ptr with custom deleters that null out the wlroots user_data field. It’s not free; wlroots is a C library, and many of its callbacks pass raw pointers, but the structures we own should be unique_ptrs. What I’d Tell My 11-Week-Ago Self Three things. The bug was not in wlroots. It was in our suspend cleanup. I spent ten days reading wlroots source code looking for “the wlroots suspend bug.” There was no such bug. Suspect your own code first, especially the parts you wrote during a refactor. The repro environment matters more than the debugger. Lakshmi’s two weeks on a NUC produced no signal because the NUC’s DRM driver doesn’t do what the Mali driver does. As soon as I got gdb running on the actual device, the bug fell out within a few hours. If you can’t reproduce on the target hardware, you are not actually debugging. Add the safety nets before you need them. ASan, valgrind, log ringbuffers, smart pointers. None of these would have prevented this bug from being written, but each of them would have shortened the time to find it. We added them all after. The next display-stack crash took two days to diagnose, not eleven weeks. The compositor has been stable for 14 months since this fix. The team has switched away from raw new for any wlroots-related allocations. We test suspend/resume in a hardware-in-the-loop nightly job, 200 cycles, no crashes for 9 months running. I still remember the line number of the segfault. output_manager.cpp:284. There’s a comment on it now that says // see https://internal-wiki/wayland-suspend-bug-2022-Q3 for why this nullcheck exists. The wiki page has 4,300 words and three diagrams and is the closest thing to a war story I’ve published anywhere. Now I guess this is the next closest thing.
Picture this: two features are being developed in parallel.
- One has already been tested in lower environments, but is still awaiting business approval
- The other is fully validated and ready to go live

Naturally, you want to release the second feature to production. But you can’t, because your deployment model forces you to release everything together. If you’ve worked with Azure Data Factory (ADF), this situation probably sounds familiar.

Azure Data Factory (ADF) is a cloud-based data integration service from Microsoft that helps you build and orchestrate data pipelines across systems. It works extremely well for managing data workflows, but when it comes to deployments at scale, things get tricky.

As our ADF usage grew across multiple teams and environments, we started running into a recurring problem:
- We had control over development, but very little control over what actually got deployed
- A simple pipeline fix could unintentionally introduce unrelated changes
- Parallel feature development became harder to manage
- Production releases became riskier than they needed to be

That’s when we realized: the issue wasn’t ADF itself; it was the deployment model we were relying on.

This article walks through how we addressed that challenge by implementing a selective deployment pattern, allowing us to promote only intended changes without impacting everything else.

The Real Problem: Parallel Feature Releases in ADF

Before diving into the solution, let’s look at a scenario that frequently occurs in real-world teams.
What This Diagram Represents

This diagram shows two features progressing across environments:

Feature 100
- Developed earlier, successfully deployed to Dev and Test
- Currently in UAT (User Acceptance Testing)
- Still awaiting business approval before production

Feature 200
- Developed later, successfully completed across Dev → Test → UAT
- Fully validated and ready for production

Expected Behavior

At this stage, the expectation is straightforward: “Let’s release Feature 200 to production.” Feature 100 is still under testing, so it should remain in UAT.

What Actually Happens in ADF

Azure Data Factory follows a full-state deployment model. That means when you deploy, you are not deploying a feature; you are deploying the entire factory state. So when you attempt to release Feature 200:
- Feature 100 gets included automatically
- You cannot isolate Feature 200
- You lose control over what reaches production

Why This Becomes a Real Problem

This isn’t an edge case; it becomes a recurring pattern in larger environments. You’ll encounter this when:
- Multiple teams are working in parallel
- Features move at different speeds
- UAT cycles vary
- Production fixes need to be released quickly

It becomes even more complex when:
- Existing production pipelines are modified
- Partial updates are required
- Dependencies overlap across features

The Core Limitation: ADF promotes state, not intent. It does not differentiate between what is ready for production and what is still under testing.

Why We Had to Rethink Deployment

This limitation introduced real risks:
- Accidental promotion of incomplete features
- Delayed production releases
- Increased coordination overhead
- Higher chances of breaking stable pipelines

We needed a way to:
- Promote only Feature 200
- Keep Feature 100 in UAT
- Avoid impacting unrelated artifacts
- Reduce production risk

Architecture Overview

To address this challenge, we introduced a selective packaging layer between build and deployment.
Flow

Feature Branch → PR → Validate → Selective Packaging → ARM Export → Incremental Deploy → Trigger Control

Key Idea: Instead of exporting ARM templates from the full ADF repository, we export from a filtered staging folder containing only the required artifacts.

Understanding Default ADF Deployment Behavior

Before implementing selective deployment, it’s important to understand how Azure Data Factory works by default. ADF follows a full-state deployment model.

How Default ADF Deployment Works

When you use ADF with Git integration:
- Developers work in a collaboration branch (typically main)
- Changes are committed and merged via pull requests
- ADF provides a Publish button in the UI

When you click Publish, ADF generates ARM templates representing the entire factory state. These templates are stored in the adf_publish branch.

In modern setups, instead of clicking Publish manually, teams often use @microsoft/azure-data-factory-utilities (npm-based export). This allows pipelines to validate ADF resources and export ARM templates programmatically.
YAML
- name: Validate ADF resources
  run: |
    set -euo pipefail
    FACTORY_ID="/subscriptions/${{ env.SUBSCRIPTION_ID }}/resourceGroups/${{ env.RESOURCE_GROUP }}/providers/Microsoft.DataFactory/factories/${{ env.SOURCE_FACTORY_NAME }}"
    npm run build validate "${{ github.workspace }}" "$FACTORY_ID"

YAML
- name: Export ARM templates (CI publish)
  run: |
    set -euo pipefail
    FACTORY_ID="/subscriptions/${{ env.SUBSCRIPTION_ID }}/resourceGroups/${{ env.RESOURCE_GROUP }}/providers/Microsoft.DataFactory/factories/${{ env.DEV_FACTORY_NAME }}"
    npm run build export "${{ github.workspace }}" "$FACTORY_ID" "${{ env.ARM_OUTPUT_DIR }}"

Whether you click Publish manually or use npm export in CI/CD, the outcome is the same:
- Full factory deployment
- No control over individual features
- All changes get bundled together

Selective Deployment Layer (Core Design)

We address this by introducing a workflow driven by a manifest that defines the deployment scope, plus a program that identifies all the ADF dependencies each manifest requires. As a developer, I can now control which release is promoted to production, without worrying about releasing other features that are not ready.

The manifest controls which pipelines to deploy and which optional categories to include. Below is an example of a manifest file:

JSON
{
  "pipelines": ["pl_ingest_population_selective"],
  "includeTriggers": false,
  "includeIntegrationRuntimes": false,
  "includeAllGlobalParameters": true,
  "includeLinkedServices": true,
  "validateLinkedServicesExist": true,
  "includeManagedVirtualNetwork": false,
  "includeManagedPrivateEndpoints": false
}

Workflow Explanation

Let's understand the crux of the selective deployment workflow now. I am working in the release branch on my feature branch directly in ADF Studio. Since ADF Studio is integrated with Git, my development changes will be saved to my branch. Here are the steps I can take to promote my change to a higher environment.
1) ADF validation on pull requests

This is an early validation step and a guardrail: if the PR check fails, it is because objects are invalid or misaligned. It is equivalent to the "Validate all" button in the ADF UI.

Trigger: Pull requests targeting the branch selective_deployment.

Purpose: Validate that the ADF JSON in the PR is valid in the context of the target factory.

Main steps:
- Checkout
- Set up Node.js 20
- npm install
- Azure login using OIDC (azure/login@v2)
- Validate with ADF Utilities:

YAML
FACTORY_ID="/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourceGroups/${AZURE_RESOURCE_GROUP}/providers/Microsoft.DataFactory/factories/${DEV_FACTORY_NAME}"
npm run build validate "$GITHUB_WORKSPACE" "$FACTORY_ID"

2) Release build + selective deploy to DEV

Workflow file: adf-release-build-selective-deploy.yml

Triggers:
- Push to selective_deployment
- Manual run (workflow_dispatch) with optional manifest input
- Default: deploy/manifests/release.json

This workflow has two jobs:

Job A: adf-build (staging + export + sanitize + artifacts)
- Checkout (full history)
- Azure login using OIDC
- Set up Node.js 20
- Install build dependencies inside build/ (npm install in build)
- Stage selective subset: python scripts/select_adf_subset.py <manifest>. An excerpt follows; for the complete script, refer to the linked GitHub repository.

Python
import json
import re
import shutil
import sys
from pathlib import Path
from typing import Dict, Set, Tuple, List
from collections import defaultdict

# Your repo layout has pipeline/, dataset/, linkedService/ at ROOT.
REPO_ROOT = Path(".")
STAGE_ROOT = Path("build/adf_subset")

RESOURCE_DIRS = {
    "pipeline": REPO_ROOT / "pipeline",
    "dataset": REPO_ROOT / "dataset",
    "linkedService": REPO_ROOT / "linkedService",
    "dataflow": REPO_ROOT / "dataflow",
    "trigger": REPO_ROOT / "trigger",
    "integrationRuntime": REPO_ROOT / "integrationRuntime",
    "credential": REPO_ROOT / "credential",
    "managedVirtualNetwork": REPO_ROOT / "managedVirtualNetwork",
}

# Copy these if present so ADF utilities behave the same on staged subset.
ROOT_FILES_TO_COPY = [
    "publish_config.json",
    "arm-template-parameters-definition.json",
    "arm_template_parameters-definition.json",
    "package.json",
    "package-lock.json",
]

  Produces:
  - build/adf_subset/ (staged tree)
  - build/adf_subset_report.json (dependency report)
  Refer to logs below (showing output of stage selective subset and debug to view output generated after select_adf_subset.py)
- Export ARM templates from the staged subset via ADF Utilities: npm --prefix build run build -- export "adf_subset" "$FACTORY_ID" "ArmTemplate"
  Produces:
  - build/ArmTemplate/ARMTemplateForFactory.json
  - build/ArmTemplate/ARMTemplateParametersForFactory.json
- Strip infra-owned resources with scripts/strip_arm_resources.py to produce a safe template: build/ArmTemplate/ARMTemplateForFactory.safe.json

⚠️ Note on Infrastructure Components (Refer to the “Future Work & Next Steps” section for follow-up topics in this series)

The step above intentionally strips infrastructure-dependent components from the generated subset to avoid overwriting existing shared resources such as linked services.
This implementation focuses on developer-owned artifacts (pipelines, datasets, and triggers) and assumes that infrastructure components, such as Integration Runtimes, managed private endpoints, and linked services, are pre-provisioned and managed outside of this deployment workflow.

- Upload artifacts:
  - ARM templates (adf-arm)
  - metadata (adf-release-meta)
  - subset report (adf-subset-report)

Job B: deploy_dev (deploy safe template)
- Download ARM artifact
- Azure login using OIDC
- Ensure az Data Factory extension is installed
- Validate JSON files exist/parse
- Deploy via azure/arm-deploy@v2 (Incremental) to DEV RG/factory:
  - Template: ARMTemplateForFactory.safe.json
  - Parameters: ARMTemplateParametersForFactory.json + factoryName=<DEV_FACTORY_NAME>

Lessons Learned

Setting up selective deployment in ADF was more than a technical task. It made us rethink our approach to deployments, ownership, and CI/CD design. Here are the main things we learned:

1. The Problem Is Not Tooling; It’s Deployment Granularity

At first, we thought the limitation came from the tools we used, like UI publish or npm export. However, both methods yielded the same result: full factory templates. The real problem was that we couldn’t control the scope of deployments, not how the templates were made.

2. Dependency Awareness Is Critical

Selective deployment only works when every dependency is found and included. We learned that:
- Pipelines often reference multiple datasets and linked services.
- Missing even one dependency results in deployment failure.
- You must automate dependency discovery.

3. “Incremental” Is Often Misunderstood

Incremental deployment is important, but it doesn’t work like a patch. It reapplies the full configuration for all included resources. This means:
- Your generated templates need to be complete for all the artifacts you include.
- If you use partial definitions, deployments can fail.

4. Separation of Concerns Is Key

Not all ADF artifacts are the same. We began to separate them into different groups:
- Application-owned artifacts: pipelines, datasets, triggers
- Infrastructure-owned artifacts: linked services, managed virtual networks, managed private endpoints, and integration runtimes, among others

This separation proved crucial for safe, scalable deployments.

5. Selective Deployment Adds Complexity, But It’s Worth It

It’s true that implementing this approach brings in additional scripts, manifest management, and CI/CD complexity. But in exchange, we gained precise control over releases, reduced production risk, and faster hotfix deployments.

Future Work and Next Steps

While selective deployment solved a major gap in ADF CI/CD, it also opened up new areas for improvement and standardization.

1. Defining Infrastructure vs Application Ownership

One of the biggest follow-up areas is clearly defining ownership boundaries. In our experience:
- Application teams should own pipelines, datasets, and triggers
- Platform or infrastructure teams should own linked services, managed virtual networks, and managed private endpoints, among other things

Future work can focus on:
- Enforcing this separation in CI/CD
- Preventing accidental deployment of infrastructure components
- Integrating Terraform or platform pipelines for infrastructure provisioning

2. Governance Around Linked Services

Linked services are often shared across multiple pipelines and teams. Future improvements include:
- Centralizing linked service management
- Using Key Vault and Managed Identity consistently
- Preventing direct modifications through application pipelines
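Lesson 2 above calls for automated dependency discovery. As a minimal sketch (not the actual select_adf_subset.py, whose full source lives in the linked repository), the following walks a pipeline's JSON and collects every object shaped like {"referenceName": ..., "type": "...Reference"}, which is how ADF resource JSON points at datasets, linked services, and other pipelines. The function name collect_references and the sample pipeline are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Set


def collect_references(
    node: Any, found: Optional[Dict[str, Set[str]]] = None
) -> Dict[str, Set[str]]:
    """Recursively gather ADF references, grouped by reference type."""
    if found is None:
        found = defaultdict(set)
    if isinstance(node, dict):
        name = node.get("referenceName")
        ref_type = node.get("type")
        # A reference is a dict carrying both a name and a "*Reference" type.
        if isinstance(name, str) and isinstance(ref_type, str) and ref_type.endswith("Reference"):
            found[ref_type].add(name)
        for value in node.values():
            collect_references(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_references(item, found)
    return found


# Hypothetical pipeline definition, shaped like the JSON ADF stores in Git.
sample_pipeline = {
    "name": "pl_ingest_population_selective",
    "properties": {
        "activities": [
            {
                "name": "CopyPopulation",
                "type": "Copy",
                "inputs": [{"referenceName": "ds_source_csv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "ds_sink_parquet", "type": "DatasetReference"}],
            },
            {
                "name": "RunChild",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "pl_child", "type": "PipelineReference"}
                },
            },
        ]
    },
}

refs = collect_references(sample_pipeline)
for ref_type in sorted(refs):
    print(ref_type, sorted(refs[ref_type]))
```

A fuller version would presumably repeat the same walk on each discovered dataset and child pipeline until no new names appear, since linked services are usually referenced from datasets rather than from the pipeline itself.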
The 3:00 AM Incident That Changed Everything It was a Tuesday morning when the alerts started firing. Our recommendation engine, the one that drives 30% of our revenue, had tanked. Accuracy dropped from 94% to 58% overnight. The data science team immediately blamed the model. They started tweaking hyperparameters, re-training on new data, and running diagnostics. Nothing worked. I got pulled into the war room at 3:00 AM. The first thing I asked wasn't "What's wrong with the model?" It was "What changed in the data pipeline?" Turns out, everything. A vendor had pushed a schema change upstream. A field that used to be required became optional. Null values started flowing through our pipeline. Our feature engineering code didn't handle nulls gracefully; it just propagated them downstream. By the time the data reached the model, 40% of our feature vectors were corrupted. The model wasn't broken. The data was. We spent six hours manually rolling back the schema change, re-running the pipeline, and restoring service. The incident report was brutal: "Lack of data validation caught a breaking change too late." That's when I realized we needed observability in our data pipeline, not just in our models. The Problem: Data Quality is Invisible Until It Breaks Here's the uncomfortable truth about data pipelines: they fail silently. Your ETL job completes successfully. Your Spark cluster finishes transformations. Your data warehouse loads without errors. Everything looks green in the monitoring dashboard. But the data itself? Garbage in, garbage out. There are three categories of failures that break AI models in production: Missing Values: A source system stops populating a field. Your pipeline doesn't validate it. The model gets NaN values it never saw during training. Predictions become random noise. Schema Changes: An upstream team adds a new column, renames an existing one, or changes data types. Your pipeline doesn't expect these changes. 
Either it crashes, or worse, it silently maps data to the wrong columns.

Distribution Shifts: The statistical properties of your data change. A field that was always between 0 and 100 suddenly has values of 50,000. Your model's scaling assumptions break. Predictions become nonsensical.

None of these show up in traditional infrastructure monitoring. Your CPU is fine. Memory is fine. Network is fine. But your data is on fire.

The Solution: Observability at Every Layer

I started building a three-layer observability framework using dbt, Great Expectations, and custom validation logic. The goal was simple: catch data quality issues before they reach the model.

Layer 1: dbt Tests (The First Line of Defense)

dbt tests are your cheapest, fastest way to catch obvious data quality issues. They run after every transformation and fail the entire pipeline if something's wrong. Here's what we implemented:

YAML
# models/staging/stg_user_events.yml
version: 2

models:
  - name: stg_user_events
    columns:
      - name: user_id
        tests:
          - not_null
          - unique
      - name: event_timestamp
        tests:
          - not_null
          # At column level, dbt_utils evaluates the expression against the column
          - dbt_utils.expression_is_true:
              expression: "<= current_timestamp()"
      - name: event_value
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"

These tests are simple but powerful. They catch:
- Missing required fields (not_null)
- Duplicate records (unique)
- Impossible values (event_timestamp in the future)
- Out-of-range values (zero or negative event_value)

We run these tests on every dbt run. If any test fails, the pipeline stops. No data reaches the model. No silent corruption.

The beauty of dbt tests is that they're version-controlled, documented, and part of your transformation code. When a schema change happens, you update the test, commit it, and everyone knows what changed.

Layer 2: Great Expectations (The Statistical Validator)

dbt tests catch structural issues. Great Expectations catches statistical anomalies, the subtle shifts that break models.
Here's a real scenario: our user_age column had a distribution of 18-65 for two years. Then one day, we started getting ages of 200, 500, 1000. A data entry bug upstream. dbt tests wouldn't catch this because the values are technically valid integers. But Great Expectations would.

Python
# great_expectations/expectations/user_events_expectations.py
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import DataContext

context = DataContext()
suite = context.create_expectation_suite(
    expectation_suite_name="user_events_suite",
    overwrite_existing=True
)

validator = context.get_validator(
    batch_request=RuntimeBatchRequest(
        datasource_name="my_spark_datasource",
        data_connector_name="default_runtime_data_connector",
        data_asset_name="user_events",
        # Assumed: user_events_df is the Spark DataFrame produced upstream, and
        # the identifier key matches the data connector's configuration.
        runtime_parameters={"batch_data": user_events_df},
        batch_identifiers={"default_identifier_name": "default_identifier"},
    ),
    expectation_suite_name="user_events_suite"
)

# Expect user_age to be between 18 and 120
validator.expect_column_values_to_be_between(
    column="user_age",
    min_value=18,
    max_value=120
)

# Expect event_value to have a mean between 50 and 200
validator.expect_column_mean_to_be_between(
    column="event_value",
    min_value=50,
    max_value=200
)

# Expect less than 5% missing values in critical columns
validator.expect_column_values_to_not_be_null(
    column="user_id",
    mostly=0.95
)

# Expect the event_type distribution to match historical patterns.
# The weights are placeholders; use the historical share of each event type.
validator.expect_column_kl_divergence_to_be_less_than(
    column="event_type",
    partition_object={
        "values": ["click", "view", "purchase"],
        "weights": [0.6, 0.3, 0.1],
    },
    threshold=0.1
)

validator.save_expectation_suite(discard_failed_expectations=False)

Great Expectations runs after dbt tests. It validates:
- Value ranges (age between 18 and 120)
- Statistical properties (mean event value between 50 and 200)
- Null rates (less than 5% missing in critical columns)
- Distribution shifts (event_type distribution matches historical patterns)

If Great Expectations detects an anomaly, it alerts us. We investigate before the data reaches the model.
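To make the distribution-shift check less of a black box, here is a minimal standard-library sketch of the underlying idea: compare the observed share of each event_type against a historical baseline using KL divergence, and flag the batch when the divergence exceeds a threshold. The baseline shares, the sample batches, and the function name are illustrative assumptions, and Great Expectations' own implementation may differ in details such as smoothing.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def kl_divergence(observed: Iterable[str], baseline: Dict[str, float], eps: float = 1e-9) -> float:
    """Return KL(observed || baseline) for a categorical column."""
    counts = Counter(observed)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("observed batch is empty")
    divergence = 0.0
    for category, count in counts.items():
        p = count / total
        # eps guards against categories the baseline has never seen.
        q = max(baseline.get(category, 0.0), eps)
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical historical shares of each event type.
baseline = {"click": 0.6, "view": 0.3, "purchase": 0.1}

stable_batch = ["click"] * 60 + ["view"] * 30 + ["purchase"] * 10
shifted_batch = ["click"] * 20 + ["view"] * 20 + ["purchase"] * 60

THRESHOLD = 0.1
for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    score = kl_divergence(batch, baseline)
    print(name, round(score, 4), "ALERT" if score > THRESHOLD else "ok")
```

Because the score grows quickly when a rare category suddenly dominates, a legitimate business change can trip the same alarm as a bug, which is why the threshold needs tuning against real history rather than a fixed default.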
Layer 3: Custom Validation (The Domain Expert)

dbt and Great Expectations are generic. Your domain is specific. We added custom validation logic that understands our business.

Python
# pipelines/validation/custom_validators.py
import pandas as pd
from datetime import datetime


def validate_feature_engineering(df: pd.DataFrame) -> dict:
    """
    Custom validation for features before they reach the model.
    Returns a dict of validation results.
    """
    results = {}

    # Validate 1: Feature completeness
    # We need at least 95% of features populated
    feature_cols = [col for col in df.columns if col.startswith('feature_')]
    null_rate = df[feature_cols].isnull().sum().sum() / (len(df) * len(feature_cols))
    results['feature_completeness'] = {
        'passed': null_rate < 0.05,
        'null_rate': null_rate,
        'threshold': 0.05
    }

    # Validate 2: Feature scaling
    # After normalization, features should sit roughly within -3..3 (3 sigma);
    # anything beyond -10..10 is treated as a scaling failure
    for col in feature_cols:
        max_val = df[col].max()
        min_val = df[col].min()
        results[f'{col}_scaling'] = {
            'passed': max_val < 10 and min_val > -10,
            'max': max_val,
            'min': min_val
        }

    # Validate 3: Temporal consistency
    # Events should be recent (within last 30 days)
    if 'event_date' in df.columns:
        # Parse into a local Series so the caller's DataFrame is not modified
        event_dates = pd.to_datetime(df['event_date'])
        days_old = (datetime.now() - event_dates.max()).days
        results['temporal_freshness'] = {
            'passed': days_old < 30,
            'days_old': days_old,
            'threshold_days': 30
        }

    # Validate 4: Business logic
    # Revenue should never be negative
    if 'revenue' in df.columns:
        negative_revenue = (df['revenue'] < 0).sum()
        results['business_logic_revenue'] = {
            'passed': negative_revenue == 0,
            'negative_count': negative_revenue
        }

    return results


def validate_and_alert(df: pd.DataFrame, validation_results: dict) -> bool:
    """
    Check all validations and alert if any fail.
    Returns True if all pass, False otherwise.
    """
    all_passed = True
    for check_name, check_result in validation_results.items():
        if not check_result['passed']:
            all_passed = False
            print(f"ALERT: {check_name} failed")
            print(f"Details: {check_result}")
            # Send to monitoring system (Datadog, New Relic, etc.)
            # send_alert(check_name, check_result)
    return all_passed

This custom validation runs after Great Expectations. It checks:
- Feature completeness (95% of features populated)
- Feature scaling (normalized features in the expected range)
- Temporal freshness (data is recent)
- Business logic (revenue is not negative)

If any check fails, we block the pipeline and alert the team.

The Real-World Gotchas We Discovered

Gotcha 1: Validation Overhead

Running dbt tests, Great Expectations, and custom validation on every pipeline run adds latency. We went from 15-minute runs to 25-minute runs. The trade-off was worth it (catching one data quality issue saved us more time than we lost), but you need to plan for it.

Gotcha 2: False Positives

Great Expectations' distribution shift detection is sensitive. Legitimate business changes (a marketing campaign causing a spike in user_age distribution) triggered false alerts. We had to tune thresholds carefully and add context to alerts.

Gotcha 3: Schema Changes Are Sneaky

A vendor added a new column to an upstream table. Our pipeline didn't break; it just ignored the new column. But the data science team expected it. We added schema validation to catch new columns and alert us.

Gotcha 4: Null Handling Varies

Python treats null as None. SQL treats it as NULL. Spark treats it as null. When data flows between systems, nulls get lost or misinterpreted. We had to standardize null handling across the entire pipeline.
The Framework: A Decision Matrix

Here's how we decide which validation layer to use:

| Issue Type | Caught By | Example | Action |
| Missing required field | dbt tests | user_id is null | Fail pipeline immediately |
| Duplicate records | dbt tests | Same user_id appears twice | Fail pipeline immediately |
| Impossible values | dbt tests | event_timestamp in future | Fail pipeline immediately |
| Out-of-range values | Great Expectations | age > 150 | Alert, investigate, fail if severe |
| Distribution shift | Great Expectations | event_value mean changes 50% | Alert, investigate, continue if acceptable |
| Business logic violation | Custom validation | revenue is negative | Alert, investigate, fail |
| Schema change | Custom validation | New column added upstream | Alert, investigate, update tests |

The Results: From Chaos to Confidence

After implementing this three-layer framework:
- Incident reduction: We went from 2-3 data quality incidents per month to 0 in six months.
- Time to resolution: When issues do occur, we catch them within minutes instead of hours.
- Model stability: Model accuracy stopped fluctuating. It's now consistently 93-95%.
- Team confidence: Data scientists trust the data. Engineers trust the pipeline.

The best part? We caught the schema change incident before it happened. Great Expectations detected the distribution shift, we investigated, found the upstream change, and coordinated with the vendor team before any data reached production.

Getting Started: The Minimal Viable Observability

You don't need to implement everything at once. Start here:
- Week 1: Add dbt tests for not_null and unique on critical columns.
- Week 4: Set up alerting so you're notified when validations fail.

That's it. You now have observability in your data pipeline.

Conclusion: Observability Saves Models

Your AI model isn't failing because it's bad. It's failing because the data feeding it is bad. And you won't know the data is bad until you look.
The best models in the world can't save you from garbage data. But good observability can. dbt tests, Great Expectations, and custom validation aren't fun. They don't make it into conference talks. But they'll save your production system at 3:00 AM. Start small. Test early. Validate often.
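Gotcha 3 and the "Schema change" row of the decision matrix both rely on a schema check that the article describes but does not show. As a minimal sketch of what such a check might look like, the function below compares an expected column-to-type mapping against what actually arrived and reports added, removed, and retyped columns. The function name diff_schema, the column names, and the type strings are assumptions made for illustration, not the team's actual code.

```python
from typing import Dict, List


def diff_schema(expected: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare an expected column->type mapping against the observed one."""
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "type_changed": sorted(col for col in shared if expected[col] != actual[col]),
    }


# Hypothetical contract for the staging table versus what the vendor sent.
expected_schema = {
    "user_id": "string",
    "event_value": "double",
    "event_timestamp": "timestamp",
}
actual_schema = {
    "user_id": "string",
    "event_value": "string",        # type changed upstream
    "event_timestamp": "timestamp",
    "campaign_id": "string",        # new column the pipeline would silently ignore
}

drift = diff_schema(expected_schema, actual_schema)
if any(drift.values()):
    print("ALERT: schema drift detected:", drift)
```

Reporting added columns separately from removed or retyped ones matters here: an added column can be an alert-and-continue event, while a removed or retyped column is usually a reason to stop the run.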
For years, search technology meant one thing: type in a keyword, and the system goes hunting for an exact match. That works fine for product SKUs or error codes, but it falls apart the moment someone asks a real question. If your knowledge base is full of manuals, support tickets, transcripts, and reports, a person searching for "why does the machine shut down during startup" shouldn't have to guess the exact phrase the original author used. This is the gap that vector search closes. Instead of matching words, it matches meaning. And on Databricks, building this kind of system is more accessible than most teams expect, once you understand the moving pieces. Why Vector Databases Work Differently A vector database doesn't store text the way a traditional database does. It stores text as numbers, specifically, as long lists of numerical values that represent the meaning of a piece of content. Two sentences that say the same thing in different words end up with similar number patterns, even if they don't share a single word in common. This unlocks three distinct ways of searching: Similarity search finds content that's conceptually related, even when the wording is completely different. Hybrid search blends that conceptual matching with traditional keyword scoring, so you get the best of both worlds. Full-text search sticks to exact matches, which still matters when precision is non-negotiable. Together, these give developers the tools to build something that feels less like a search box and more like a colleague who actually understands what you're asking. Getting Your Data Ready Before any of this works, your data needs to be in the right shape. On Databricks, that means your source table needs Change Data Feed turned on. Think of this as a way for the vector index to "listen" for changes, so when documents get updated, added, or removed, the index stays in sync automatically rather than going stale. You'll also need a unique identifier for every row. 
This becomes the primary key that ties each chunk of text back to its source, which matters later when you're filtering or tracing results back to the original document. Turning Text Into Embeddings Embeddings are the numerical fingerprints mentioned earlier, the representations that let the system compare meaning instead of matching strings. Databricks gives you two paths here. With managed embeddings, Databricks handles the entire process: it generates the embeddings and keeps them updated as your data changes. With manual embeddings, you generate them yourself using an external tool and store the results in a column. For the vast majority of projects, managed embeddings are the easier and more reliable choice. There's less to maintain, and compatibility with the platform is guaranteed out of the box. One question that comes up constantly: what does it mean when someone says an embedding has 1,024 dimensions? It simply means each chunk of text is represented by 1,024 numbers. That number isn't arbitrary; it's baked into whichever embedding model you choose, such as GTE-large. If you want a different dimensionality, you'd need to switch models entirely; it's not a setting you can tweak independently. Building the Index Once your embeddings are in place, you create the actual vector search index. Databricks gives you two routes: the SDK, using the databricks-vectorsearch library for programmatic, repeatable setups, or the UI, which walks you through configuration visually. A few decisions matter most here. The index type determines whether you're doing pure semantic search or hybrid search; for most real-world use cases, hybrid is the safer default since it catches both conceptual matches and exact terminology. The embedding model, like databricks-gte-large-en, determines how your text gets converted into vectors. 
And the sync mode controls how fresh your index stays: continuous sync keeps things updated automatically, while triggered sync gives you manual control over when refreshes happen. Choosing the Right Search Method With the index built, you have three retrieval modes to choose from, and picking the right one depends entirely on what your users are asking. Similarity search shines when people ask natural-language questions or when the same concept might be described using different terminology across documents. Hybrid search becomes valuable when domain-specific terms carry real weight, think compliance codes or technical standards like ISO 13849-1, where an exact match matters just as much as conceptual relevance. Full-text search is your fallback when precision trumps everything else, and you need exact keyword hits, no exceptions. Don't Skip Metadata Filtering Here's a piece of advice that's easy to overlook: don't make your search work harder than it needs to. If a user only cares about PDFs from the last quarter, let the system know that upfront. Filtering by document path, page number ranges, or document type narrows the search space before the heavy lifting even starts. The result is faster queries and more relevant results, because the system isn't wasting effort sifting through content that was never going to be useful anyway. When Re-Ranking Earns Its Keep Sometimes the top results from a semantic search are technically "close" in meaning but miss the point of the question. That's where re-ranking comes in, a second pass that re-scores your top candidates using something more sophisticated, like a cross-encoder or an LLM. This extra step is worth the computational cost when queries are nuanced, when domain context really matters, or when the stakes for getting the right answer are high. It's not something you need everywhere, but used selectively, it can be the difference between a good answer and the right one. 
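To make the filter-then-search idea concrete, here is a minimal, library-free sketch. It is not the Databricks API: the `Chunk` type, the `doc_type` field, the toy three-dimensional vectors, and the `search` helper are all assumptions made for illustration. It narrows the candidates by metadata first and only then ranks the survivors by cosine similarity.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical indexed chunk: primary key, metadata, and embedding."""
    id: str
    doc_type: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: list[float], chunks: list[Chunk], doc_type: str,
           num_results: int = 10) -> list[tuple[str, float]]:
    """Filter by metadata first, then rank the remaining chunks by similarity."""
    candidates = [c for c in chunks if c.doc_type == doc_type]
    scored = [(c.id, cosine(query, c.vector)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Toy vectors standing in for real embeddings (assumption: 3 dims, not 1,024).
INDEX = [
    Chunk("manual-1", "pdf", [0.9, 0.1, 0.0]),
    Chunk("ticket-7", "ticket", [0.95, 0.05, 0.0]),
    Chunk("manual-2", "pdf", [0.0, 0.2, 0.9]),
]

results = search([1.0, 0.0, 0.0], INDEX, doc_type="pdf", num_results=2)
print(results)
```

Note that `ticket-7` is the closest vector overall, yet it never reaches the scoring step because the filter removed it first. A re-ranking pass, if one is used, would operate on the short list that `search` returns.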
A Few Practical Tips A handful of best practices can save you headaches down the road. Don't over-invest in embedding dimensionality. If a smaller model performs nearly as well as a larger one, take the smaller one and enjoy the lower latency. Keep your num_results parameter reasonable; pulling back 10 to 100 results is usually plenty, and larger sets just slow things down. Match your endpoint SKU to your scale; standard tiers work fine under roughly 2 million vectors, while storage-optimized tiers make sense beyond that. And lean on metadata filters wherever possible; they're one of the simplest ways to boost both speed and relevance. The Bigger Picture Vector search isn't just a buzzword bolted onto a database. It's the connective tissue between how humans naturally ask questions and how systems find answers. Get the fundamentals right (solid embeddings, a well-configured index, smart filtering, and selective re-ranking) and you're not just building a search feature. You're building something that genuinely understands what people are looking for.
Every CISO I talk to right now is juggling two deadlines that feel unrelated and aren't. One is the slow-motion arrival of quantum computers capable of breaking the public-key cryptography that underpins basically everything — TLS, SSH, JWTs, code-signing. The other is the much faster arrival of AI-assisted coding tools that are shipping security-critical code nobody has fully reviewed. I used to think of these as separate beats. I don't anymore, because the same root failure shows up in both: organizations adopting powerful new capability faster than they're building the visibility and discipline to govern it. Post-Quantum Planning: The Inventory Problem Comes First NIST finalized its first three post-quantum cryptography standards on August 13, 2024, after an eight-year, multi-round public competition: FIPS 203 (ML-KEM, the lattice-based key encapsulation mechanism formerly known as Kyber), FIPS 204 (ML-DSA, the signature scheme formerly known as Dilithium), and FIPS 205 (SLH-DSA, the hash-based fallback formerly known as SPHINCS+). In March 2025, NIST added a fourth algorithm, HQC, specifically chosen because it rests on a different mathematical hardness assumption than the lattice problems underneath ML-KEM and ML-DSA — a deliberate hedge in case lattice-based cryptography turns out to have a weakness nobody's found yet. The NSA's CNSA 2.0 guidance sets 2030 as the mandatory PQC migration deadline for national security systems, and NIST's broader timeline calls for deprecating RSA and ECDSA entirely by 2035. Gartner's framing of where most organizations actually stand is the line I keep sending to clients verbatim: many organizations are already prototyping PQC and improving crypto-agility, but visibility gaps persist. 
That's the polite analyst version of what I see in the field, which is teams that can tell you they've tested ML-KEM in a lab environment but cannot tell you how many of their production TLS endpoints, SSH host keys, or embedded device certificates are still running plain RSA-2048 with no migration path at all. Gartner's own recommendation sequence is the right one: start a cryptographic inventory, stand up a cryptographic center of excellence, push vendors for their PQC roadmaps, and prioritize migration for whatever data needs to stay confidential the longest. That last point matters more than people give it credit for — "harvest now, decrypt later" only threatens data that's still sensitive when a quantum computer capable of breaking it eventually shows up, so a database of last quarter's marketing metrics is not your priority. Decades-long medical records, government communications, and long-lived intellectual property are. The actual transition is happening faster than most security teams realize, which is encouraging, but it's happening unevenly. Cloudflare's 2025 Radar Year in Review reported that post-quantum-encrypted TLS 1.3 traffic nearly doubled across the year, from 29% in January to 52% by early December — driven heavily by browser vendors enabling hybrid post-quantum key exchange by default and by Apple's iOS 26 release in September 2025, after which the share of post-quantum-capable requests from iOS devices jumped from under 2% to 11% in four days and passed 25% by December. That's the client side. The server side is lagging noticeably: Cloudflare's own measurements put post-quantum-preferred key agreement on the origin server side at roughly 10% as of early 2026, up from under 1% a year earlier — a tenfold increase, but still a small minority. Browsers adopted PQC essentially invisibly. 
Backend infrastructure, predictably, is the harder problem, because it's full of legacy TLS terminators, hardcoded cipher suites, and vendor appliances nobody wants to touch. Quantum-Resistant Identity: Don't Wait for "Done" The identity layer is where crypto-agility gets concrete rather than theoretical. A PQC-ready JWT issuer isn't exotic engineering — it means your signing service can issue tokens using ML-DSA instead of (or alongside) RS256 or ES256, and your verification logic can check either signature type without a code change every time the algorithm preference shifts. The same logic applies to your internal certificate authority: if your CA can only issue RSA or ECDSA certs today, you don't have crypto-agility; you have a single point of future failure with a five-to-ten-year fuse on it. NIST has indicated that commercially available post-quantum certificates from public CAs likely won't be common until sometime in 2026, which means internal PKI teams building their own quantum-aware issuance now are ahead of the commercial market, not behind some imaginary deadline. It's worth being honest that the early implementations of these algorithms have already had real bugs. In late 2023, researchers disclosed "KyberSlash," a timing side-channel in several Kyber/ML-KEM implementations caused by non-constant-time arithmetic during decapsulation — an attacker with precise enough timing measurements could, in principle, recover a private key. The reference implementations were patched by December 2023, and it's a useful reminder that a mathematically sound post-quantum algorithm is not automatically a secure deployment; the implementation needs the same constant-time discipline that took classical cryptography decades to get right, except this time the industry doesn't have decades to learn the lesson slowly. 
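One way to picture the "either signature type without a code change" property is a verifier registry keyed by algorithm name. The sketch below is an illustration under heavy assumptions, not a JWT library: HMAC-SHA256 and HMAC-SHA512 stand in for the classical and post-quantum signature schemes (a real deployment would plug in RS256/ES256 and ML-DSA verifiers), and `VERIFIERS`, `ACCEPTED`, and `verify` are hypothetical names.

```python
import hmac
from typing import Callable

Verifier = Callable[[bytes, bytes, bytes], bool]


def _hmac_verifier(digest: str) -> Verifier:
    """Stand-in verifier (assumption): HMAC replaces a real signature scheme
    so the sketch runs with the standard library only."""
    def check(key: bytes, message: bytes, signature: bytes) -> bool:
        expected = hmac.new(key, message, digest).digest()
        return hmac.compare_digest(expected, signature)  # constant-time compare
    return check


# Registry: adding an algorithm means adding an entry, not editing call sites.
VERIFIERS: dict[str, Verifier] = {
    "classical-standin": _hmac_verifier("sha256"),
    "pqc-standin": _hmac_verifier("sha512"),
}

# Policy is data: retiring the classical algorithm later means removing it here.
ACCEPTED = {"classical-standin", "pqc-standin"}


def verify(alg: str, key: bytes, message: bytes, signature: bytes) -> bool:
    """Reject unknown or retired algorithms before touching the signature."""
    if alg not in ACCEPTED or alg not in VERIFIERS:
        return False
    return VERIFIERS[alg](key, message, signature)


key, msg = b"demo-key", b"header.payload"
sig_new = hmac.new(key, msg, "sha512").digest()
print(verify("pqc-standin", key, msg, sig_new))
print(verify("none", key, msg, b""))
```

The design point is that the algorithm choice lives in one table and one allow-list, so the transition window (accept both, then retire one) becomes a configuration change rather than a code change.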
AI/Vibe Coding Risk: The Other Deadline Andrej Karpathy coined the term "vibe coding" on February 2, 2025, to describe a development style where a programmer describes what they want in plain language, accepts the AI's output largely on faith, and iterates through follow-up prompts rather than reading the generated code line by line. Collins English Dictionary named it Word of the Year for 2025, which tells you how fast the practice spread — and the security data on what it's producing is not encouraging. Veracode's 2025 GenAI Code Security Report tested more than 100 large language models across multiple languages and found that AI-generated code failed basic secure-coding benchmarks roughly 45% of the time, containing on the order of 2.74 times more vulnerabilities than comparable human-written code, with Java the worst performer at a 72% failure rate. Georgia Tech's Systems Software and Security Lab has been tracking this concretely since launching its Vibe Security Radar project in May 2025: CVEs directly attributable to AI coding tools went from six in January 2026 to fifteen in February to thirty-five in March — more in that single month than the entire second half of 2025 combined. Hanqing Zhao, the graduate researcher leading the project, made the point that's stuck with me most: when an AI agent ships something without an authentication check, that's not a typo slipping through — it's a design flaw built in from the start, because the model was never reasoning about access control as a requirement in the first place. The concrete incident I'd point a skeptical engineering lead to is the "Rules File Backdoor," disclosed by Pillar Security on March 18, 2025. AI coding assistants like Cursor and GitHub Copilot let developers drop configuration files — .cursor/rules and similar — into a repository to steer the assistant's behavior and style. 
Pillar's researchers found that an attacker could embed hidden Unicode characters (zero-width joiners and bidirectional text-direction markers, invisible to a human skimming the file) inside those configuration files. The AI assistant parses and follows the hidden instructions anyway and silently generates backdoored code that looks completely clean in a normal code review because the part doing the steering was never visible to the reviewer in the first place. That's the vibe-coding risk model in one sentence: the attack surface isn't just "the model might write a bug." It's "the model is now a thing an attacker can prompt-inject without ever touching your repository's visible diff." What I'd Actually Build
Plain Text
PRE-COMMIT / CI LAYER
→ Static analysis + secret scanning on every AI-assisted commit, no exceptions for "just a quick fix"
→ Configuration-file integrity checks: scan .cursor/rules, Copilot instructions, and similar files for non-printable/invisible Unicode before they're trusted by any assistant
→ Flag any AI-generated auth, crypto, or payment-handling code for mandatory human review; never auto-merge
CRYPTO-AGILITY LAYER (build-time)
→ Centralize all algorithm selection behind a crypto abstraction layer / feature flag, never hardcoded cipher suites or signature algorithms scattered through the codebase
→ CI step that fails the build if a new dependency introduces a hardcoded RSA/ECDSA-only code path with no PQC fallback registered
DEPLOY LAYER (quantum-aware)
→ TLS termination points support hybrid key exchange (e.g., X25519+ML-KEM) by default
→ Internal CA issues hybrid or PQC-capable certs for anything with a multi-year expected lifetime
→ JWT issuers support dual-algorithm signing (classical + ML-DSA) during the transition window, with verification accepting either until classical is formally retired
The pre-commit layer is aimed at the faster clock: it's the thing that would have caught the Rules File Backdoor pattern before it shipped, by
treating AI-assistant configuration as untrusted input rather than developer intent. The crypto-agility and deploy layers are aimed at the slower clock, and they're cheaper to build now than to retrofit in 2029 when public certificate lifespans are down to 47 days, and nobody can find every RSA-2048 endpoint in a hurry. Neither layer replaces human judgment. Both exist because human judgment, applied once at design time, doesn't scale to a world where code gets generated in seconds, and algorithms need to rotate on a schedule measured in weeks, not years. The End-to-End Scenario, Compressed A developer asks an AI assistant to add a new payment-confirmation endpoint. The assistant generates working code, plus a JWT validation routine that happens to hardcode RS256. CI catches the hardcoded algorithm against the crypto-agility policy and fails the build, not because RS256 is currently insecure, but because the policy says nothing security-critical ships without going through the abstraction layer. A human reviews the auth logic specifically because the pipeline flagged it as AI-generated and security-sensitive. It merges with dual-algorithm signing support intact. None of this required the developer to become a post-quantum cryptography expert or to read every line the model produced. It required the pipeline to assume, by default, that AI-generated code and classical-only cryptography are both temporary conveniences that need a forcing function to age out gracefully — because left to their own momentum, neither one ages out on its own. The teams that get hurt by both of these trends at once aren't unlucky. They're the ones that treated "we'll deal with that later" as a plan for two clocks that were never going to wait.
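The configuration-file integrity check from the pre-commit layer can be sketched in a few lines. This is a minimal illustration, assuming that any Unicode "format" character (general category `Cf`, which includes zero-width joiners and bidirectional control marks) is suspicious in an assistant rules file; the `find_invisible` helper and the sample strings are hypothetical, and a real check might need an allow-list for legitimate uses.

```python
import unicodedata


def find_invisible(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each invisible format character.

    Assumption: category "Cf" is a reasonable proxy for "invisible to a
    reviewer". Lines and columns are 1-based.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                findings.append((line_no, col, name))
    return findings


# Hypothetical rules-file contents: one clean, one with hidden characters.
clean = "Prefer small functions.\nUse type hints."
tainted = "Prefer small functions.\nUse type\u200d hints.\u202e"

print(find_invisible(clean))
for finding in find_invisible(tainted):
    print(finding)
```

In a pipeline, a non-empty result for any assistant configuration file would fail the commit, which is what treating that file as untrusted input looks like in practice.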
Mainframe modernization is once again at the center of enterprise conversations. Not because something suddenly broke, but because the environment around it has changed. Organizations are being asked to move faster, integrate more easily with newer platforms, and support initiatives like cloud and AI that weren’t part of the equation a decade ago. At the same time, experienced teams are shrinking, costs are under scrutiny, and expectations from the business are higher than ever. The way organizations are approaching modernization is evolving as well. Instead of treating it as a one-time, large-scale effort, many are taking a more incremental path and making changes over time. Many are introducing more modern, agile development practices and working to bring mainframe development closer in line with how the rest of the enterprise builds and delivers code changes and manages their development cycles. Even with that shift, the same challenges still tend to surface. The Clarity Most Organizations Are Missing Most organizations approaching modernization are not lacking motivation. What’s often missing is clarity around what’s really broken, what needs to change, and what success should look like. There’s a general sense that systems are too slow, processes are inefficient, or teams are struggling to keep up. But those issues aren’t always clearly defined before decisions are made. Instead, the focus shifts quickly to solutions (new platforms, new tooling, AI) without fully understanding the root of the problem. If the issue is how work flows through the organization (how decisions are made, how teams interact, and how long it takes to move from development to production, etc.), then changing the technology alone won’t solve it. In many cases, it simply exposes the problem more quickly. Where Modernization Efforts Start to Break Down When that lack of clarity carries into execution, the gaps become much harder to ignore. 
Processes are often more complex than expected, approval chains are longer than they need to be, and workarounds have developed over time to compensate for inefficiencies in the official process. Introducing new tools into that environment doesn’t remove those issues; it highlights them. A faster system makes bottlenecks more obvious, and a more connected environment exposes gaps between teams. What may have been tolerated before now becomes difficult to ignore. There’s also a persistent belief in what many teams jokingly call the “magic factor.” The idea that a new platform, a new vendor, or even AI will come in and solve everything. It’s an appealing story, especially when teams are under pressure. But it sets expectations that reality can’t meet. Timelines add another layer of tension. Modernization is often scoped as a short-term project, when in reality it requires sustained effort. Training, testing, and adoption all take time, and organizations are rarely able to move as quickly as initial plans assume. Perhaps most critically, many organizations lack a true internal owner of the effort. Vendors and partners can guide the work, but they can’t drive internal adoption. When no one inside the organization is accountable for the outcome, progress slows, decisions get delayed, and momentum fades. All of this plays out against a backdrop of uncertainty. For experienced mainframe professionals, modernization can feel like a threat to years of hard-earned expertise. For newer developers, it can feel unfamiliar and difficult to navigate. Without clear communication and support, both groups can disengage. At that point, modernization doesn’t fail outright; it just never quite delivers what it promised. What Changes When It’s Done Right When organizations take a step back and approach modernization more thoughtfully, the picture can look very different. 
Instead of treating the mainframe as something separate, they start to bring it into the same ecosystem as the rest of their development environment. Tools like Git, modern IDEs, and CI/CD pipelines become part of the workflow. Developers no longer have to switch contexts or work in isolation. That shift alone changes how teams operate. Historically, mainframe teams have operated separately from distributed, web, and mobile teams. Each team had different tools, different workflows, and limited visibility into each other’s work. Modernization, particularly when it introduces more unified workflows, begins to break down those silos. Teams gain a clearer view of how their work connects, collaboration becomes more natural, and knowledge starts to move more freely across the organization. That has a real impact, especially as experienced team members retire and newer developers step in. Instead of relying on formal handoffs or last-minute knowledge transfer, learning becomes part of the day-to-day work. A more modern development experience also makes it easier to bring in new talent and help existing teams work more effectively, which is becoming increasingly important as experienced developers retire. There are financial benefits as well, though they tend to follow rather than lead. As organizations adopt more flexible tooling and, in some cases, open-source solutions, they gain options. They are no longer as tightly bound to a single vendor or licensing model. Over time, that flexibility can translate into meaningful cost improvements. What Successful Organizations Do Differently Those outcomes don’t happen by accident. The organizations that get real value out of modernization tend to have leadership teams that approach it differently from the start. They don’t treat it as a tool decision or a one-time project. They treat it as an effort to improve how their environment operates, and they’re deliberate about how they go about it. 
That shows up in a few consistent ways:
They get specific about the problem before looking for a solution. They take the time to determine why they’re modernizing before deciding how. Whether it’s speed, cost, talent, or competitiveness, that clarity shapes every decision that follows. A clearly defined objective keeps the effort grounded and helps teams prioritize what matters, measure progress, and avoid getting pulled in directions that don’t support the end goal.
They take a hard look at how work flows today. Not how it’s documented or expected to work, but how it actually plays out in practice. That means mapping out the full path from development through deployment, including where work slows down, where approvals stack up, and where teams have created workarounds just to keep things moving. This step often surfaces issues that aren’t visible at a leadership level.
They involve the people closest to the work. The most useful insights tend to come from the teams working in the process every day. Developers, operators, and support teams see where the friction is and what would make the biggest difference. Bringing those voices in early leads to better decisions and fewer surprises later.
They establish clear ownership inside the organization. Modernization efforts move faster and more consistently when there’s a clear internal owner. Someone who understands the goal, can make decisions, and is accountable for keeping the work moving.
They plan for adoption, not just implementation. Even when the technical work is straightforward, the transition isn’t. Teams need time to adjust to new workflows, learn new tools, and build confidence in the changes. Organizations that plan for that upfront tend to avoid the frustration that comes from trying to move too quickly.
They start with a focused effort and build from there. Rather than trying to modernize everything at once, they begin with a smaller, well-defined scope.
A pilot or targeted initiative creates a chance to test the approach, learn what works, and make adjustments before expanding more broadly. It also helps build internal support as people start to see tangible results. Making Modernization Work At its core, modernization isn’t about replacing one system with another. It’s about improving how the organization operates. Technology matters, but it only works when it’s built on a process that makes sense. Without that, modernization becomes another expensive layer on top of existing problems. When done well, modernization doesn’t just improve systems. It changes how teams work and how quickly the business can respond to what comes next, and it turns a technical effort into a true business advantage.
Most infrastructure teams have a moment where someone says “we should build a platform.” The motivation is real: teams are duplicating work, the current setup is hard to use consistently, and a more structured approach would help. A few months later, the platform is a Terraform module collection, a GitLab CI template, a shared repository of scripts, and a README that several people have tried to keep current. That is a useful thing. It is not a platform. The distinction is worth being clear about, not to dismiss the work, but because the word “platform” creates expectations. When internal teams hear “we have a platform,” they assume stability, a usable interface, a versioning model, and some mechanism for raising problems when things break. A toolchain with documentation does not deliver those things by default. What Makes Something a Platform A platform is defined by its contract, not its technology. The contract describes what the consumer can expect: what they call, what parameters they provide, what outputs they receive, and what stability guarantees apply to that interface. A Terraform module with a published interface is closer to a platform primitive than a pipeline that provisions the same resources through environment variables, undocumented flags, and positional arguments. The module has a contract. The pipeline has a process. The contract does not have to be formal. It needs three things.
A stable surface. Consumers should be able to call the same interface next month and receive the same type of result. Internal changes to how it works do not break consumers.
A versioning model. When the interface changes, that change is communicated, and consumers are not silently broken. A git tag is enough to start with. Semantic versioning is better.
A feedback path. Consumers can report when the contract is violated or the interface does not behave as documented. Someone is responsible for responding.
A Terraform module with these three properties is a platform primitive. A set of modules with a shared versioning model, a stable registry entry, and a team responsible for maintaining the contract is starting to look like a platform. What Teams Actually Experience The gap between a toolchain and a platform shows up in how teams actually use it. With a toolchain, onboarding a new team means pointing them at the repository and telling them to read the README. Anything not in the README requires asking someone who has been around for a while. Changes to the toolchain break existing consumers silently because there is no versioning model. The team that maintains the toolchain treats every consumer as having kept up with the latest state of the repository. With a platform, onboarding means pointing teams at interface documentation with a working example. Changes go through a version increment. Consuming teams that pin to a version are not broken by changes they did not ask for.
Plain Text
# Consuming a module with a pinned version
module "vm" {
  source  = "registry.example.com/hybridops/vm/proxmox"
  version = "~> 2.1"

  name   = "web-01"
  cores  = 2
  memory = 4096
}
This looks like a small detail. For teams consuming infrastructure modules across a growing estate, it is the difference between a managed dependency and a shared folder everyone is afraid to touch. When a Toolchain Is the Right Call Not every infrastructure system needs to be a platform. A toolchain is appropriate when the team is small and holds the full mental model, the surface area is limited, and the rate of change is low enough that everyone stays current without a formal versioning model. When those conditions hold, the overhead of maintaining a platform contract is not justified. The problem is not having a toolchain. The problem is calling it a platform when it is not, and then finding that the expectations it created are not being met.
Teams told they have a stable platform, then hit with a broken workflow from an unannounced change, lose confidence quickly. That confidence is hard to rebuild. HybridOps has been working in this space: publishing Terraform modules to a registry, versioning releases, and treating module interfaces as contracts. It is not a finished platform. It is a direction, and being explicit about that direction changes how the work gets done. A Simple Test If a consuming team pins to the current version of your toolchain today, will it still work in three months without any changes on their side? If you cannot answer yes with confidence, you have a toolchain, not a platform. Both are useful. Only one creates the kind of trust that makes a growing engineering organisation move faster rather than slower. Knowing which one you have is the first step toward building the right one.
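To illustrate what a pinned constraint such as `~> 2.1` promises a consumer, here is a small sketch of the pessimistic constraint check. It is an approximation written for this article, not Terraform's implementation: it assumes purely numeric dotted versions (no pre-release tags), and `parse` and `satisfies_pessimistic` are hypothetical helper names.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '2.1.0' into (2, 1, 0). Assumes numeric components only."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a '~> X.Y' style constraint.

    Only the rightmost component named in the constraint may increase:
    '~> 2.1' allows 2.1 up to (but not including) 3.0, and
    '~> 2.1.0' allows 2.1.0 up to (but not including) 2.2.0.
    """
    base = parse(constraint.removeprefix("~>"))
    if len(base) < 2:
        raise ValueError("constraint needs at least two components")
    candidate = parse(version)
    # Upper bound: drop the last component and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    return candidate >= base and candidate[: len(upper)] < upper


for v in ["2.1.0", "2.9.3", "3.0.0", "2.0.9"]:
    print(v, satisfies_pessimistic(v, "~> 2.1"))
```

Under this reading, a consumer pinned to `~> 2.1` picks up compatible releases automatically but is never moved onto a new major version, which is the property the simple test above is asking about.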
High availability is a non-negotiable requirement for mission-critical SAP HANA deployments. When a primary database node goes down without an automated failover in place, the business impact is immediate. RHEL Pacemaker has long been the standard cluster manager for SAP HANA High Availability (HA) on Linux; it detects failures, fences misbehaving nodes, promotes secondaries, and orchestrates the full recovery sequence without manual intervention. The standard Pacemaker playbook for SAP HANA HA, as described in the official documentation, relies on a virtual IP address (VIP) as the single stable network endpoint for all database traffic. Pacemaker keeps that VIP tied to whichever node is currently the active primary. When a failover happens, the VIP moves. Applications reconnect to the same address and reach the new primary without configuration changes. The problem is that this approach breaks down on many cloud platforms. Hyperscalers and private cloud environments frequently do not support traditional floating VIPs in the way bare-metal or on-premises networking does. The official RHEL Pacemaker documentation covers the VIP setup in detail and stops there. When VIPs are not available, practitioners are left to work out an alternative on their own. This article defines a production-ready alternative for exactly this scenario. The approach replaces the floating VIP with a network load balancer (NLB) and uses a Pacemaker-managed health check listener to tell the load balancer which node is the active primary at any given time. This article explains the problem, positions it against existing cloud provider approaches, and walks through the implementation step by step. How Cloud Providers Address This The challenge of replacing a floating VIP with a load balancer while still routing traffic exclusively to the active HANA primary is not new. There is published guidance on how to approach it, and the core pattern is consistent across that guidance.
One such approach is to use an internal passthrough Network Load Balancer alongside a socat-based health check listener managed as a Pacemaker resource. The listener opens on a dedicated port in the private range (49152–65535), and the NLB probes that port to determine which backend is the primary. The approach uses the Open Cluster Framework (OCF) 'anything' resource agent to manage the socat process inside Pacemaker. The second approach is to use an Internal Load Balancer with a health probe on port 625XX (where XX is the HANA instance number). A listener on each HANA node responds to the probe, but only the primary has the listener active. In some configurations, HAProxy is used rather than socat as the listener. The implementation discussed in this article adds to this landscape a clean approach using a native systemd service registered directly as a Pacemaker resource instead of the OCF 'anything' agent or HAProxy, and it targets RHEL specifically. The systemd approach keeps the setup self-contained, auditable, and consistent with how most RHEL administrators already manage services. It works on any cloud provider or private cloud environment that supports network load balancers. Architecture Overview The diagram below shows the two-node SAP HANA cluster, the network load balancer, and how the health check listener connects them. The NLB's backend pool includes both HANA nodes on the standard HANA port (3XX15), but the health probe targets a separate port, 62500, that only the active primary exposes.
Figure: Overall cluster architecture
The NLB sees both nodes as members of its backend pool. Because only the primary node has anything listening on port 62500, the NLB marks the secondary as unhealthy for routing purposes and sends all traffic to the primary. When Pacemaker promotes the secondary during a failover, it starts the listener on the new primary as part of the same orchestration sequence.
The NLB detects the change on its next health check cycle and shifts all traffic accordingly. Failover Sequence The diagram below shows the sequence of events from the moment the primary node fails to the moment applications reconnect through the load balancer.
Figure: Failover sequence from node failure to reconnection
Two timing factors govern the total recovery window. The first is Pacemaker's fencing and promotion sequence, typically 30 to 90 seconds, depending on the STONITH method and HANA replication state. The second is the NLB health check interval, which determines how quickly the load balancer detects the new primary after Pacemaker completes its promotion. For production environments, tuning both values together is worth the effort. Pacemaker Resource Model The diagram below maps the Pacemaker resource hierarchy and constraints used in this setup. Understanding the resource model helps clarify why both the colocation and ordering constraints are necessary. The colocation constraint (score=INFINITY) tells Pacemaker that lb_healthcheck must always run on the same node as the promoted HANA primary. If the promoted primary moves, the health check listener moves with it. The ordering constraint ensures the listener does not start until HANA has fully completed its promotion, preventing the load balancer from routing traffic to a node that is still finishing its takeover sequence.
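The health check mechanism that ties the load balancer to the cluster can be emulated with a plain TCP connection attempt. The sketch below is only an approximation of what a load balancer probe does, under the assumption that a successful TCP connect counts as healthy; real probes add intervals and thresholds that vary by provider. The `is_primary` and `pick_backends` helpers are hypothetical names, and the demo uses a throwaway local listener rather than real HANA nodes.

```python
import socket


def is_primary(host: str, port: int = 62500, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the health check port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backends(nodes: list[str], port: int = 62500) -> list[str]:
    """Mimic the load balancer: keep only nodes whose probe succeeds."""
    return [node for node in nodes if is_primary(node, port)]


# Demo: a local socket plays the role of the Pacemaker-managed listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen()
demo_port = listener.getsockname()[1]

print(is_primary("127.0.0.1", demo_port))  # listener is up
listener.close()
print(is_primary("127.0.0.1", demo_port))  # listener is gone
```

Running `pick_backends` against both node addresses from an administrative host is a quick way to confirm that exactly one node answers on the probe port, before and after a test failover.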
Prerequisites

The following must be in place before starting the implementation:

- Two RHEL virtual servers with access to the Red Hat High Availability Add-On repository
- SAP HANA installed on both servers with HANA System Replication configured
- Pacemaker installed and configured through section 5.7 of the official Red Hat SAP HANA HA guide; sections 5.8 and 5.9 (virtual IP configuration) are intentionally skipped
- A network load balancer provisioned with both HANA nodes in the backend pool, backend port set to 3XX15 (where XX is the HANA instance number)
- socat installed on both HANA nodes
- Firewall rules permitting TCP traffic on port 62500 from the NLB health check source addresses

socat is available in standard RHEL repositories. Install it with:

```shell
sudo dnf install socat -y
```

Step-by-Step Implementation

Step 1: Create the Systemd Health Check Service

Run the following command on both HANA nodes. It creates a systemd unit file that uses socat to open a TCP listener on port 62500. The listener accepts any connection and returns success immediately; that response is all the load balancer needs.

```shell
cat <<EOF > /etc/systemd/system/lb-healthcheck.service
[Unit]
Description=LB healthcheck listener for active SAP HANA primary
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/socat TCP4-LISTEN:62500,reuseaddr,fork EXEC:/bin/true
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
EOF
```

Do not enable this service manually. Pacemaker will control its lifecycle entirely.

Step 2: Reload Systemd

After writing the unit file, reload systemd on both nodes so it registers the new service:

```shell
systemctl daemon-reload
```

Step 3: Prevent the Service From Starting Automatically

Explicitly disable and stop the service. If both nodes have the listener running simultaneously, the load balancer will consider both healthy and will route traffic to either node, which defeats the entire purpose of the setup.
```shell
systemctl disable lb-healthcheck
systemctl stop lb-healthcheck
```

Step 4: Create the Pacemaker Resource

Register the systemd service as a Pacemaker-managed resource. From this point forward, Pacemaker owns the start, stop, and monitoring of the listener.

```shell
pcs resource create lb_healthcheck \
  systemd:lb-healthcheck \
  op monitor interval=10s timeout=20s
```

Pacemaker will now monitor the listener every 10 seconds and automatically relocate it during failover events.

Step 5: Add the Colocation Constraint

This constraint ensures the listener always runs on the same node as the promoted SAP HANA primary. Without it, Pacemaker might place the resource on either node.

```shell
pcs constraint colocation add lb_healthcheck \
  with Promoted cln_SAPHanaCon_P01_HDB01 \
  score=INFINITY
```

Replace P01_HDB01 with the actual SID and instance number for the environment. For example, if the SID is PRD and the instance number is 00, use PRD_HDB00.

Step 6: Add the Ordering Constraint

The ordering constraint prevents the health check listener from starting until after the HANA promotion is fully complete. Without this, a race condition could cause the load balancer to route traffic to a node that is still mid-promotion.

```shell
pcs constraint order promote cln_SAPHanaCon_P01_HDB01 \
  then start lb_healthcheck
```

Step 7: Validate the Pacemaker Configuration

Verify that both constraints are correctly registered in the cluster:

```shell
pcs constraint config
```

The output should contain both of the following entries:

```text
Colocation Constraints:
  Started resource 'lb_healthcheck' with Promoted resource 'cln_SAPHanaCon_P01_HDB01'
    score=INFINITY
Order Constraints:
  promote resource 'cln_SAPHanaCon_P01_HDB01' then start resource 'lb_healthcheck'
```

Step 8: Verify Listener Placement

Confirm that only the active primary node is listening on port 62500. Run this command on each node:

```shell
ss -lntp | grep 62500
```

On the primary node, the output should show a LISTEN entry on 0.0.0.0:62500.
On the secondary node, the command should return nothing.

```text
# Expected on PRIMARY node:
LISTEN 0 5 0.0.0.0:62500 0.0.0.0:*

# Expected on SECONDARY node:
# (no output)
```

If both nodes show the listener, the colocation constraint is either missing or incorrect. If neither node shows it, check that the HANA clone resource is in the Promoted state with:

```shell
pcs status
```

Comparison: VIP Approach vs. NLB Health Check Approach

The diagram below summarizes the trade-offs between the traditional VIP approach and the NLB health check approach described in this article.

Figure: Comparison

The VIP approach cuts over faster because there is no dependency on an external health check interval; the IP simply moves to the new primary node. However, it requires the underlying network to support IP address mobility, which cloud environments typically do not.

The NLB approach works across any cloud or private cloud environment that supports network load balancers. The trade-off is that traffic cutover depends on the NLB's health check interval in addition to Pacemaker's promotion time.

Documentation from major cloud providers acknowledges this trade-off explicitly: using an NLB with a health check listener is their recommended approach for all SAP HANA HA deployments, and they provide the same socat-based pattern using the OCF 'anything' resource agent. The approach documented here achieves the same outcome using a systemd service, which many operators find more familiar and easier to audit.

Operational Notes and Tuning

A few things are worth keeping in mind when running this setup in production.

NLB health check interval: The shorter the health check interval, the shorter the window between Pacemaker completing its promotion and the NLB redirecting traffic. A 5-second interval is common in cloud provider SAP HA documentation. Setting this too low can cause false positives during normal HANA replication lag.
STONITH configuration: This solution assumes STONITH (fencing) is configured as part of the base Pacemaker setup. Without STONITH, Pacemaker will not promote the secondary during a primary failure. STONITH ensures the failed node is definitively powered off before promotion proceeds, preventing split-brain.

Port 62500 vs. 625XX convention: Some cloud providers use the convention 625XX (where XX is the instance number) for their SAP HANA health check ports. Other cloud documentation recommends using any port in the private range 49152 to 65535. Port 62500, used in this setup, falls within that range and does not conflict with standard HANA ports. Teams following a provider's 625XX convention can substitute that port if they prefer consistency across environments.

Testing failover: After setup, test the full failover sequence by killing the primary HANA process (not the OS) and verifying that the NLB redirects traffic to the new primary within the expected time window. The pcs status command is the primary tool for watching the Pacemaker side of the transition.

Conclusion

The standard RHEL Pacemaker documentation for SAP HANA HA assumes a virtual IP is available, but not all hyperscalers support one. The solution described here fills that gap cleanly: replace the VIP with a network load balancer hostname, and use a Pacemaker-managed socat listener to tell the load balancer which node is the primary at any given time.

The core pattern, an NLB health probe targeting a Pacemaker-owned listener, is the same pattern major cloud providers use in their own SAP HA documentation. What this implementation adds is a clean systemd service approach for RHEL, without needing the OCF 'anything' resource agent or additional proxy software.

The setup comes down to eight steps: write a systemd service, disable it from auto-starting, register it as a Pacemaker resource, and add two constraints.
The constraints, one for colocation and one for ordering, are what tie the listener's lifecycle to the HANA primary promotion sequence and make the whole thing work reliably across failovers. For teams running SAP HANA on RHEL in environments where VIPs are not an option, this is a production-ready path forward that relies entirely on standard RHEL tooling.
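For the failover test mentioned in the operational notes, it can help to watch the health check port from a separate host while the primary is being killed. The following is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command; the function names and host names are hypothetical, and only port 62500 comes from this setup.

```shell
#!/usr/bin/env bash
# Sketch of a TCP probe for failover testing. Function names and host
# names are hypothetical; port 62500 is the health check port used above.

# probe_port HOST PORT
#   Prints "open" if a TCP connection succeeds within 2 seconds,
#   otherwise prints "closed".
probe_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# watch_listener HOST PORT COUNT
#   Probes once per second, COUNT times, printing a timestamped
#   state line for each attempt.
watch_listener() {
  local host=$1 port=$2 count=$3 i
  for (( i = 0; i < count; i++ )); do
    printf '%s %s:%s %s\n' "$(date +%T)" "$host" "$port" "$(probe_port "$host" "$port")"
    sleep 1
  done
}

# Example with hypothetical host names, run from a host that is allowed
# to reach port 62500 on both nodes:
#   watch_listener hana-node-1 62500 120 &
#   watch_listener hana-node-2 62500 120
```

The timestamps at which one node flips to "closed" and the other to "open" give a direct reading of the Pacemaker side of the cutover, which can then be compared with when applications reconnect through the load balancer.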