The Breach Was Never at the Door

OAuth tokens and AI agents can bypass traditional security. Learn from Microsoft and Salesloft breaches why behavioral monitoring matters.

Igboanugo David Ugochukwu

Jun. 23, 26 · Analysis

I've lost count of how many breach disclosures I've read where the first sentence is some version of "no evidence the perimeter was compromised." It used to strike me as corporate hedging. Now I read it as the whole story, hiding in plain sight. The perimeter wasn't compromised because, increasingly, nobody bothers attacking it. Why would they, when the back door is propped open by a token nobody's looked at since the engineer who set it up left the company?

That's the pattern I want to walk through here: not as a hypothetical, but as something that has now happened, in public, with named victims and dated timelines, twice since the start of 2024, at a scale too big to wave away.

Microsoft, Eating Its Own Dog Food Problem

Start with the one that should have ended the "we're different" conversation in every boardroom that's ever had it: Microsoft's own corporate environment, breached by the Russian state-linked group tracked as Midnight Blizzard — also known, depending on which vendor's report you're reading, as APT29, Cozy Bear, or Nobelium. Microsoft disclosed it on January 19, 2024, after detecting the intrusion a week earlier. The detail that's easy to skip past, but shouldn't be: this is the same group behind the SolarWinds compromise and the original DNC intrusion. Nation-state patience, applied to a target that builds the identity infrastructure half the internet runs on.

The mechanics, as Microsoft laid them out in its own responder guidance, are almost insultingly simple. A legacy, non-production test tenant — the kind of thing every large engineering org has somewhere, unloved and unreviewed — got hit with a low-and-slow password spray. No MFA on the account. That alone is a known failure mode, the kind every pen-test report flags and every roadmap deprioritizes. But the part that turned a stale test account into a corporate-wide incident was what that account could reach: an old OAuth test application with elevated standing access into Microsoft's actual production environment. From there, the attackers minted new OAuth apps, granted themselves the full_access_as_app Exchange Online role — full read access to mailboxes, org-wide — and pulled emails from senior leadership and Microsoft's own security and legal teams for roughly six weeks before anyone noticed.

Six weeks. Inside Microsoft. I keep coming back to that number because it's not a story about Microsoft being careless in some uniquely embarrassing way — it's a story about what happens when the entire monitoring apparatus is pointed at the login event and almost nothing is pointed at what an already-authenticated OAuth app actually does once it's inside. Zscaler's threat research team, reviewing the incident afterward, made the point bluntly: there's an unimaginable sprawl of forgotten applications and permissions in most large tenants, and that sprawl is exactly where the blind spots accumulate.
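
One modest way to start shrinking that sprawl is a periodic sweep of the app inventory. The sketch below is illustrative only: the `OAuthApp` record, its field names, the sample apps, and the 180-day review threshold are all assumptions made for the example, not fields from any real tenant export. The only name taken from the incident is the `full_access_as_app` role.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumption: roles your organization treats as high-privilege. Only
# full_access_as_app comes from the incident write-up; extend as needed.
HIGH_PRIVILEGE_ROLES = {"full_access_as_app"}


@dataclass
class OAuthApp:
    """Hypothetical inventory record; field names are illustrative."""
    app_id: str
    tenant: str                   # e.g. "test" or "production"
    roles: set = field(default_factory=set)
    owner_active: bool = True     # is the owning engineer still employed?
    last_reviewed: date = date(1970, 1, 1)


def stale_privileged_apps(apps, today, max_age_days=180):
    """Return (app_id, reasons) for apps that hold a high-privilege role
    and are either unowned or overdue for review."""
    findings = []
    cutoff = today - timedelta(days=max_age_days)
    for app in apps:
        if not (app.roles & HIGH_PRIVILEGE_ROLES):
            continue
        reasons = []
        if not app.owner_active:
            reasons.append("owner_left")
        if app.last_reviewed < cutoff:
            reasons.append("review_overdue")
        if reasons:
            findings.append((app.app_id, sorted(reasons)))
    return findings


# Made-up inventory for demonstration.
inventory = [
    OAuthApp("legacy-test-app", "test", {"full_access_as_app"},
             owner_active=False, last_reviewed=date(2021, 3, 1)),
    OAuthApp("mail-archiver", "production", {"full_access_as_app"},
             owner_active=True, last_reviewed=date(2024, 1, 2)),
    OAuthApp("status-page", "production", {"read_status"},
             owner_active=False, last_reviewed=date(2020, 1, 1)),
]

for app_id, reasons in stale_privileged_apps(inventory, today=date(2024, 1, 19)):
    print(app_id, reasons)
```

A sweep like this catches nothing about behavior; it only narrows down which forgotten apps deserve a look first.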

Drift, and the Ten Days Nobody Was the Wiser

If Midnight Blizzard was a single, sophisticated actor going after one very large target, the Salesloft Drift incident is the opposite shape — opportunistic, automated, and absolutely massive in blast radius. Google's Threat Intelligence Group put out the advisory on August 26, 2025, tracking the activity under the name UNC6395. Between roughly August 8 and August 18 of that year, the actor used stolen OAuth and refresh tokens belonging to Drift — a third-party AI chat and lead-gen tool that plugs into Salesforce — to authenticate directly into more than 700 connected Salesforce environments. Not "vulnerable to." Authenticated into, as the application itself, using tokens that were entirely legitimate from the platform's point of view.

The list of organizations caught in it reads like a cybersecurity vendor directory, which is its own kind of dark comedy: Cloudflare, Zscaler, Palo Alto Networks, PagerDuty, Proofpoint — companies whose entire business is telling other people how to avoid exactly this. Cloudflare wrote up its own exposure publicly on September 2. Palo Alto's Unit 42 published a threat brief days later, flagging something I find more interesting than the breach itself: a chunk of the malicious traffic carried the user-agent string Python/3.11 aiohttp/3.12.15 — a perfectly valid, unremarkable signature for an automated script, sitting in logs next to months of equally unremarkable integration traffic. Nothing about the request itself was wrong. It was the pattern — systematic SOQL queries enumerating Accounts, Contacts, Cases, and Opportunities, then exfiltrating in bulk, then actively searching the stolen data for AWS keys, Snowflake tokens, and passwords that customers had pasted into support tickets — that should have tripped something, and didn't, for ten days.
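
That kind of pattern, one token sweeping across several object types in quick succession, can be flagged without inspecting any single request. What follows is a rough sketch under stated assumptions: the `QueryEvent` shape, the `chat-integration` token name, the 15-minute window, and the threshold of three never-before-seen objects are invented for illustration and would need tuning against real audit logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueryEvent:
    """Hypothetical audit-log row: which object a token queried, and when."""
    token_id: str
    timestamp: datetime
    sobject: str   # e.g. "Contact", "Case", "Opportunity"


def enumeration_alerts(events, known_objects, window=timedelta(minutes=15),
                       max_new_objects=3):
    """Flag tokens that touch several objects outside their history
    within one sliding window.

    known_objects maps token_id -> set of objects seen during baselining.
    Returns a dict of token_id -> sorted list of the new objects.
    """
    alerts = {}
    recent = {}  # token_id -> list of (timestamp, sobject) for new objects
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.sobject in known_objects.get(ev.token_id, set()):
            continue
        hits = recent.setdefault(ev.token_id, [])
        hits.append((ev.timestamp, ev.sobject))
        # Keep only hits that fall inside the sliding window.
        hits[:] = [(t, o) for t, o in hits if ev.timestamp - t <= window]
        distinct = {o for _, o in hits}
        if len(distinct) >= max_new_objects:
            alerts[ev.token_id] = sorted(distinct)
    return alerts


# Made-up traffic: one query per minute across four object types.
start = datetime(2025, 8, 8, 4, 0)
events = [
    QueryEvent("chat-integration", start + timedelta(minutes=i), obj)
    for i, obj in enumerate(["Contact", "Account", "Case", "Opportunity"])
]
baseline = {"chat-integration": {"Contact"}}

print(enumeration_alerts(events, baseline))
```

The user-agent never enters into it. The signal is breadth: a token that has only ever read Contacts reaching for three new object types inside a quarter of an hour.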

Salesloft and Salesforce revoked the tokens on August 20. By October, a ransomware-adjacent group calling itself Scattered Lapsus$ Hunters was trying to extort Salesforce directly over the stolen data; Salesforce, to its credit, said no and went public about saying no. FINRA put out a member alert. The FBI followed with indicators of compromise on September 12. None of that changes the part that should keep security leaders up at night: every control that supposedly stands between an attacker and your CRM — MFA, SSO, conditional access policies — was irrelevant here, because none of it operates downstream of a token that's already been issued.

What Both Incidents Are Actually Telling You

Strip away the attribution and the company names, and Microsoft and Salesloft are the same failure, twice. Identity was verified correctly, once, a long time before anything bad happened. After that, nothing was watching.

This is the part I think most security programs still get wrong, not out of negligence but out of inherited architecture. Identity and access management was built for humans sitting at keyboards — a single moment of proof, a session, a logout. Machine identity doesn't work that way. An OAuth token doesn't get tired, doesn't take weekends off, doesn't have a typing cadence or a face on a badge photo. It has a bearer string and a scope, and the entire security model treats whoever holds it as the thing it represents, indefinitely, until somebody remembers to revoke it. Nobody remembered, in either case, until the damage was already counted in weeks.

The Behavioral Layer Nobody Wants to Build Until They Have To

Here's the uncomfortable engineering truth underneath all of this: the signals that would have caught both incidents early aren't exotic. They're statistical. A token that's read four hundred contact records a month for a year and then reads forty thousand records in an afternoon doesn't need a machine-learning breakthrough to flag — it needs someone to have built the baseline and wired up the alert.
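
To put rough numbers on that gap, here is a small sketch using the illustrative figures from the paragraph above. The twelve monthly totals are made-up values hovering around four hundred, and comparing an afternoon's volume to a daily average is a simplification chosen for the example.

```python
import statistics

# Assumed history: twelve monthly read counts hovering around 400.
monthly_reads = [380, 410, 395, 420, 405, 390, 400, 415, 385, 400, 410, 390]

daily_mean = statistics.mean(monthly_reads) / 30   # rough per-day baseline
afternoon_reads = 40_000

ratio = afternoon_reads / daily_mean
print(f"baseline ~{daily_mean:.1f} reads/day, burst is {ratio:,.0f}x that")

# The same comparison as a z-score against the monthly totals: even if the
# burst had been spread over a whole month, it would sit far outside
# anything in the token's history.
mean = statistics.mean(monthly_reads)
stdev = statistics.pstdev(monthly_reads)
z = (afternoon_reads - mean) / stdev
print(f"z-score against monthly totals: {z:,.0f}")
```

Any sensible threshold separates those two numbers. The hard part is having the history on hand when the burst arrives.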

That's the system worth sketching out, so let's actually build the skeleton of it rather than just gesturing at the idea.

from dataclasses import dataclass
from datetime import datetime, timedelta
from collections import defaultdict, deque
import statistics


@dataclass
class ApiEvent:
    token_id: str
    timestamp: datetime
    scope: str
    resource: str
    source_ip: str
    asn: str
    record_count: int = 1


class TokenBaseline:
    """What 'normal' looks like for one specific token, built from its own history."""

    def __init__(self, token_id: str, window_size: int = 500):
        self.token_id = token_id
        self.recent_events: deque = deque(maxlen=window_size)
        self.known_scopes: set = set()
        self.known_asns: set = set()
        self.hourly_counts: defaultdict = defaultdict(int)

    def update(self, event: ApiEvent):
        self.recent_events.append(event)
        self.known_scopes.add(event.scope)
        self.known_asns.add(event.asn)
        self.hourly_counts[event.timestamp.hour] += 1

    def per_minute_rate(self, window_minutes: int = 10) -> float:
        if not self.recent_events:
            return 0.0

        cutoff = self.recent_events[-1].timestamp - timedelta(minutes=window_minutes)
        recent = [e for e in self.recent_events if e.timestamp >= cutoff]
        return len(recent) / max(window_minutes, 1)

    def historical_rate_stats(self):
        buckets = defaultdict(int)

        for e in self.recent_events:
            buckets[e.timestamp.replace(second=0, microsecond=0)] += 1

        values = list(buckets.values())

        if len(values) < 5:
            return None

        return statistics.mean(values), statistics.pstdev(values) or 1.0


class BehavioralRiskEngine:
    def __init__(self):
        self.baselines: dict = {}

    def score(self, event: ApiEvent) -> dict:
        baseline = self.baselines.setdefault(
            event.token_id,
            TokenBaseline(event.token_id),
        )

        risk, reasons = 0, []

        if baseline.known_scopes and event.scope not in baseline.known_scopes:
            risk += 25
            reasons.append(f"new_scope:{event.scope}")

        if baseline.known_asns and event.asn not in baseline.known_asns:
            risk += 20
            reasons.append(f"new_network_origin:{event.asn}")

        total_seen = sum(baseline.hourly_counts.values())

        if total_seen > 50:
            hour_share = baseline.hourly_counts.get(event.timestamp.hour, 0) / total_seen

            if hour_share < 0.01:
                risk += 15
                reasons.append(f"off_pattern_hour:{event.timestamp.hour}")

        stats = baseline.historical_rate_stats()

        if stats:
            mean_rate, stdev_rate = stats
            z = (baseline.per_minute_rate() - mean_rate) / stdev_rate

            if z > 4:
                risk += 30
                reasons.append(f"volume_spike_z:{round(z, 1)}")

        if event.record_count > 200:
            risk += 25
            reasons.append(f"bulk_record_pull:{event.record_count}")

        baseline.update(event)

        return {
            "token_id": event.token_id,
            "risk_score": min(risk, 100),
            "reasons": reasons,
        }


def respond(result: dict) -> str:
    score = result["risk_score"]

    if score >= 70:
        return "AUTO_REVOKE_AND_PAGE_ONCALL"

    if score >= 40:
        return "REQUIRE_STEP_UP_VERIFICATION"

    if score >= 20:
        return "LOG_AND_WATCH"

    return "NO_ACTION"
  

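Feed this engine a simulated version of the Drift pattern (a token with a year of light, business-hours, single-record Contact reads suddenly pulling several hundred Cases and Opportunities in a single request at 4 a.m., from an ASN it has never used and under a scope it has never exercised) and it crosses the auto-revoke threshold on the very first anomalous event. The new scope matters: the engine tracks scopes rather than resources, so the same pull under a familiar scope stops at step-up verification. Either way, the response comes not on day ten but on the first request.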
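
The threshold behavior can be reproduced without generating a year of synthetic events by restating the additive part of the scoring compactly. This is a condensed sketch, not the engine itself: it reuses the same weights and cut-offs, leaves out the rolling-rate z-score (which is computed from prior traffic and so stays quiet on a first anomalous event), and the baseline values, scope names, and ASNs are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Condensed stand-in for TokenBaseline; the values used below are assumed."""
    known_scopes: set
    known_asns: set
    hourly_counts: dict  # hour of day -> number of events seen


def additive_score(baseline, scope, asn, hour, record_count):
    """Same weights as the engine above, minus the rate z-score."""
    risk, reasons = 0, []
    if baseline.known_scopes and scope not in baseline.known_scopes:
        risk += 25
        reasons.append("new_scope")
    if baseline.known_asns and asn not in baseline.known_asns:
        risk += 20
        reasons.append("new_network_origin")
    total_seen = sum(baseline.hourly_counts.values())
    if total_seen > 50 and baseline.hourly_counts.get(hour, 0) / total_seen < 0.01:
        risk += 15
        reasons.append("off_pattern_hour")
    if record_count > 200:
        risk += 25
        reasons.append("bulk_record_pull")
    return min(risk, 100), reasons


def respond(score):
    if score >= 70:
        return "AUTO_REVOKE_AND_PAGE_ONCALL"
    if score >= 40:
        return "REQUIRE_STEP_UP_VERIFICATION"
    if score >= 20:
        return "LOG_AND_WATCH"
    return "NO_ACTION"


# Assumed history: business-hours reads only, one scope, one network.
history = Baseline(
    known_scopes={"contacts.read"},
    known_asns={"AS64500"},
    hourly_counts={hour: 100 for hour in range(9, 18)},
)

# A 4 a.m. bulk pull from an unseen ASN, under a new scope vs. the usual one.
for scope in ("cases.read", "contacts.read"):
    score, reasons = additive_score(history, scope, "AS64511",
                                    hour=4, record_count=500)
    print(scope, score, respond(score), reasons)
```

The second iteration is the one worth noticing: a real deployment would probably want to baseline resources as well as scopes, so that an attacker reusing an existing broad scope does not slip under the hard-revoke line.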

Building It For Real, Not Just in a Gist

A scoring function in a markdown file is the easy ten percent. The other ninety, in the order I'd actually tackle it if this were my problem to ship:

1. Pull from the systems that already log everything, instead of building a new collector. Okta, Entra ID, your API gateway, and the native audit logs Salesforce and Google Workspace already produce. If a token can act without leaving a trace somewhere in that set, that's the gap to close first, not the model.
2. Anchor on identity, not on the current secret. Tokens rotate. The behavioral history of "the Drift integration" or "the legacy test OAuth app" shouldn't reset to zero every time its credential does.
3. Run new integrations in observation-only mode before you ever alert on them. A baseline built from zero history is a coin flip dressed up as a detection rule.
4. Score as events arrive, not in a nightly job. Ten days of undetected exfiltration is what a daily batch buys you. Streaming evaluation is what turns "we technically had the data" into "we caught it."
5. Make the response graduated, not binary. Auto-revoking on every anomaly guarantees outages and guarantees someone disables the system in a fit of frustration three weeks in. Step-up checks and scope throttling first; hard revocation reserved for the scores that actually warrant it.
6. Feed outcomes back in. A legitimately expanded integration should widen its own baseline, not get permanently flagged because the business changed and the model didn't.
7. Audit the auditor. A system with visibility into every token's behavior across the company is itself a target. It needs least privilege and logging on its own service account, full stop.
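
Two of the items above, anchoring on identity and observation-only warm-up, fit in a few lines. This is a minimal sketch, assuming a hypothetical `IdentityRegistry` that is updated whenever a credential is issued or rotated, and an arbitrary warm-up length; none of the names come from a real product.

```python
from collections import defaultdict


class IdentityRegistry:
    """Maps short-lived token ids to a stable integration identity so that
    behavioral history survives credential rotation. Hypothetical helper."""

    def __init__(self, warmup_events=500):
        self.warmup_events = warmup_events
        self.token_to_identity = {}
        self.events_seen = defaultdict(int)

    def register(self, token_id, identity):
        """Call whenever a credential is issued or rotated."""
        self.token_to_identity[token_id] = identity

    def observe(self, token_id):
        """Record one event and return (identity, mode).

        mode is "observe_only" until the identity has accumulated enough
        history, then "enforce". Unregistered tokens map to themselves
        and therefore start with no history.
        """
        identity = self.token_to_identity.get(token_id, token_id)
        self.events_seen[identity] += 1
        if self.events_seen[identity] <= self.warmup_events:
            return identity, "observe_only"
        return identity, "enforce"


registry = IdentityRegistry(warmup_events=3)
registry.register("tok-2024-a", "chat-integration")

for _ in range(3):
    registry.observe("tok-2024-a")

# The credential rotates; the history does not reset.
registry.register("tok-2025-b", "chat-integration")
print(registry.observe("tok-2025-b"))
print(registry.observe("never-registered"))
```

Keying the baseline on the returned identity rather than the raw token id is the whole trick; the warm-up counter just decides whether a score is allowed to trigger a response yet.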

The Next Trust Crisis Won't Be OAuth; It Will Be AI Agents

OAuth taught us that authentication is not trust. Enterprise AI agents are about to make that lesson painfully expensive.

Everything above is about machine identities that are, in the end, fairly dumb. An OAuth token is a bearer string with a scope attached. It does exactly what it's told, by whoever holds it, and the danger is entirely about who's holding it and what it's been granted. That's already hard enough to monitor, as Microsoft and Salesloft both found out the expensive way.

AI agents are the same category of problem wearing a much more dangerous shape.

A customer-service agent wired into a CRM, a billing system, and a ticketing platform isn't a static credential anymore. It's a decision-maker. It reads a customer message, reasons about intent, picks a tool, takes an action, and chains that action into the next one — issue a refund, then update the billing record, then close the ticket, then maybe escalate to a human, maybe not. Depending on how it's scoped, that agent can plausibly touch more systems in an afternoon than most new hires touch in their first quarter. And unlike a token, it isn't just executing a permission — it's deciding which permission to reach for, based on a model's read of unstructured, attacker-reachable input.

That second part is what makes this a genuinely different problem, not just OAuth-with-extra-steps. A stolen OAuth token does what the attacker tells it to because the attacker is holding it directly. A compromised agent can do what an attacker wants without the attacker ever touching its credentials at all — by manipulating what the agent reads. A support ticket, a customer email, a scraped web page the agent was asked to summarize: any of those can carry an instruction the model wasn't supposed to take seriously, and sometimes does. The industry calls this prompt injection, and it matters here because it turns "was this request authenticated" into a meaningless question. The agent is authenticated. It has every right to be touching the CRM. The compromise is in what it decided to do with that right, not in how it got it.

So the question that mattered for OAuth tokens (is this credential being used the way it normally is?) has to evolve into something closer to: is this agent's sequence of decisions consistent with what this kind of agent normally decides? Not just rate and scope anymore, but the shape of the reasoning chain itself. A refund-handling agent that suddenly starts exporting customer PII before issuing refunds, at sizes it has never issued, to accounts it has never touched, is behaving anomalously in a way that has nothing to do with whether its API key is valid.

The good news is that the underlying engineering instinct doesn't need to be reinvented — it needs to be extended one layer up, from individual API calls to entire action sequences:

from dataclasses import dataclass
from datetime import datetime
from collections import defaultdict, deque


@dataclass
class AgentAction:
    agent_id: str
    timestamp: datetime
    action_type: str       # e.g. "issue_refund", "export_customer_data"
    target_system: str     # e.g. "billing", "crm", "ticketing"
    triggering_input_hash: str
    value_at_risk: float = 0.0  # dollar amount, record count, etc.


class AgentBehaviorBaseline:
    """Tracks the *sequences* an agent normally executes, not just individual calls."""

    def __init__(self, agent_id: str, window: int = 1000):
        self.agent_id = agent_id
        self.action_history: deque = deque(maxlen=window)
        self.known_sequences: defaultdict = defaultdict(int)  # bigrams of action_type

    def record(self, action: AgentAction):
        if self.action_history:
            prev = self.action_history[-1].action_type
            self.known_sequences[(prev, action.action_type)] += 1

        self.action_history.append(action)

    def is_novel_transition(self, prev_action: str, next_action: str) -> bool:
        total = sum(self.known_sequences.values())

        return (
            total > 30
            and self.known_sequences.get((prev_action, next_action), 0) == 0
        )
  

The point isn't that this snippet is production-ready — it's that the move from "watch the token" to "watch the action chain" is a direct, almost mechanical extension of the behavioral lesson OAuth abuse already taught us. The vendors building agentic systems for enterprise — the Typewises and the dozen companies chasing the same category — are walking straight into the version of this problem that Microsoft and Salesloft already lived through, except with agents that act faster, touch more systems at once, and can be redirected by language instead of stolen credentials. Bounded autonomy, audit-first execution, human approval gates on irreversible actions — all of it is the same architecture this article has been arguing for, aimed one layer higher up the stack.
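
As one illustration of the approval-gate idea, the sketch below routes an agent's proposed action to a human when it is both irreversible and unusual for that agent. Everything in it is assumed for the example: the `IRREVERSIBLE` set, the action names, the `AgentProfile` fields, and the rule that any value above the largest seen so far counts as unusual.

```python
from dataclasses import dataclass, field

# Assumption: which action types cannot be undone once executed.
IRREVERSIBLE = {"issue_refund", "export_customer_data", "delete_record"}


@dataclass
class AgentProfile:
    """Hypothetical per-agent history consulted by the gate."""
    seen_transitions: set = field(default_factory=set)  # (prev, next) pairs
    max_value_seen: float = 0.0


def gate(profile, prev_action, action, value_at_risk):
    """Decide how to handle a proposed action.

    Returns "execute", "execute_and_log", or "hold_for_human".
    """
    novel = (prev_action, action) not in profile.seen_transitions
    oversized = value_at_risk > profile.max_value_seen
    if action in IRREVERSIBLE and (novel or oversized):
        return "hold_for_human"
    if novel or oversized:
        return "execute_and_log"
    return "execute"


refund_agent = AgentProfile(
    seen_transitions={("read_ticket", "issue_refund"),
                      ("issue_refund", "close_ticket")},
    max_value_seen=250.0,
)

print(gate(refund_agent, "read_ticket", "issue_refund", 40.0))
print(gate(refund_agent, "read_ticket", "export_customer_data", 5000.0))
print(gate(refund_agent, "issue_refund", "escalate_to_human", 0.0))
```

The routine refund goes straight through, the never-before-seen data export waits for a person, and the harmless but novel escalation runs with extra logging. The gate does not care whether the agent was talked into the export by an injected instruction or arrived there on its own.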

Where This Leaves the Industry

I don't think the fix here is a new product category, though plenty of vendors will sell it as one. It's a mindset correction that's overdue: authorization isn't a single decision made at login and then forgotten. It's a standing claim that has to keep justifying itself, request by request, against a record of what that identity has actually done before.

Microsoft had MFA gaps on a test account nobody was watching. Salesloft had OAuth tokens with more reach than anyone had reason to grant a chat widget. Different failures, same root cause — trust granted once and never re-examined. The next version of this failure is already being built, one enterprise AI agent deployment at a time, by teams who are solving for capability first and asking the trust question second — the same order Microsoft and Salesloft's OAuth ecosystems got built in. The only real question is whether the system watching for it exists before the agent ships, or whether it gets built afterward, in the write-up.
