The Failure-Mode Atlas - Proof-Driven Requirements


Family A · #1–3 of 19

Drift

A1

Semantic Drift under Ambiguity

An instruction that sounds specific has several legitimate readings — and the model picks one silently.

1 / 4 — The Wall

Most instructions you give an agent are ambiguous without you noticing. “Analyze this data for multiple years if this district has been with us for more than 1 year” sounds specific. It isn’t — there are several legitimate analyses that phrase could mean, and the model will pick one for you: silently, plausibly, and possibly a different one tomorrow. Nothing about the output looks broken, because every reading produces a clean-looking result.


2 / 4 — The Proof

In my system, a multi-year analysis could legitimately mean at least three different things (tracking the same students’ growth across years, comparing this year’s cohort against last year’s, or trending school-wide results over time). So I defined the three ways we do it and wrote selection rules based on the shape of the district’s data. That handles the ambiguity machines can settle.

For the ambiguity they can’t: when someone asks the system to build a district’s renewal report and that district runs our courses in several subjects (e.g. English and math), the system stops and asks me which subject I wanted instead of guessing.

3 / 4 — The Requirement

Ambiguity resolves through a defined menu of interpretations with selection rules — and when the rules can’t decide, the system asks the operator instead of guessing.

4 / 4 — The Fix Shape

Enumerate the legitimate interpretations before the model ever meets them. Encode the rules for choosing. Gate whatever remains on a human. The model generates within a chosen interpretation — it never chooses silently.
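A minimal Python sketch of that shape. The three interpretations come from the Proof card; the DistrictShape fields and the specific rule thresholds are hypothetical, chosen only to show where the rules live and where the human gate sits.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Interpretation(Enum):
    SAME_STUDENT_GROWTH = "same students tracked across years"
    COHORT_COMPARISON = "this year's cohort vs. last year's"
    SCHOOLWIDE_TREND = "school-wide results over time"


@dataclass
class DistrictShape:
    # Hypothetical summary of a district's data; field names are assumptions.
    years_of_data: int
    has_stable_student_ids: bool
    subjects: tuple


class NeedsOperator(Exception):
    """Raised when the rules cannot decide; the caller must ask a human."""


def choose_interpretation(shape: DistrictShape) -> Optional[Interpretation]:
    """Selection rules run in code, so the model never picks silently."""
    if shape.years_of_data < 2:
        return None  # single-year district: no multi-year analysis applies
    if shape.has_stable_student_ids:
        return Interpretation.SAME_STUDENT_GROWTH
    if shape.years_of_data == 2:
        return Interpretation.COHORT_COMPARISON
    return Interpretation.SCHOOLWIDE_TREND


def choose_subject(shape: DistrictShape) -> str:
    """Ambiguity the rules cannot settle is gated on the operator."""
    if len(shape.subjects) == 1:
        return shape.subjects[0]
    raise NeedsOperator(f"Which subject? Options: {', '.join(shape.subjects)}")
```

The model is only invoked after both functions return, so it generates inside an interpretation that code or a person already chose.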

A2

Semantic Drift under Clarity

The companion nobody warns you about: a perfectly clear spec still drifts on a long run.

1 / 4 — The Wall

Drift under ambiguity at least has a cause you can fix with clearer instructions. The nastier discovery: a completely unambiguous spec also drifts on a long run. The model shortcuts, conflates adjacent steps, re-derives a format from memory instead of re-reading the doc two files away. Length and repetition are the enemy, not vagueness.


2 / 4 — The Proof

I audited one full, end-to-end run of my report-building engine. Across the run’s 164 logged events, the output drifted from the spec in six distinct ways — even though the three reference documents that define the correct behavior sat in the working directory the entire time, available to read. Separately, a 1,000-line skill spec that ran clean when I supervised it interactively started silently skipping steps the moment it ran as a background agent. Same spec. Same code. Same data.

3 / 4 — The Requirement

Anything deterministic must be code, not prose the model re-interprets on every pass. The longer the run, the more this matters.

4 / 4 — The Fix Shape

A validating helper the model calls with arguments — it never writes the fixed shape itself. The clearer your prose, the more tempting it is to trust it. Don’t. Better prose isn’t the answer; not-prose is.

A3

Schema Improvisation

Asked for a fixed shape, the model invents a slightly better one.

1 / 4 — The Wall

Hand the model a strict schema and it improvises: promotes a subtype into the type field because it’s “more specific,” adds helper fields that seem useful, writes the file one directory up from where it belongs. Each deviation is locally reasonable. Every one breaks the downstream tools that parse the output.


2 / 4 — The Proof

That same 164-event audit turned up four kinds of schema violation. sql_query is a valid value for subtype: in my schema, but on 39 of 164 events the model wrote it into type:, the parent field, where it does not belong. An invented entry: START/END field appeared on 34 events. Nine invented subtypes fell outside the schema’s allowed set of values. And the log file itself was written to the wrong directory.

3 / 4 — The Requirement

The producer of a fixed shape validates structure and rejects unknowns. The model passes data — never format.

4 / 4 — The Fix Shape

A small CLI owns the schema: it allocates IDs, stamps timestamps, hardcodes the path, validates every enum, rejects invented fields. Five of my six drift modes disappeared by construction — the model can’t drift a file it no longer writes.
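A rough sketch of such a schema-owning helper, in Python. The allowed values, field names, and log path below are hypothetical stand-ins, not the actual schema; the point is only that the caller supplies data while the helper owns IDs, timestamps, the path, and validation.

```python
import json
import time
from pathlib import Path

# Hypothetical schema: these values and names are illustrative assumptions.
ALLOWED_TYPES = {"query", "checkpoint"}
ALLOWED_SUBTYPES = {"sql_query", "python_calc"}
ALLOWED_FIELDS = {"type", "subtype", "detail"}
LOG_PATH = Path("telemetry") / "events.jsonl"  # hardcoded; callers cannot override


class SchemaError(ValueError):
    pass


def build_event(event_id: int, **fields) -> dict:
    """Validate caller-supplied data and add the fields the helper owns."""
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise SchemaError(f"unknown fields: {sorted(unknown)}")
    if fields.get("type") not in ALLOWED_TYPES:
        raise SchemaError(f"invalid type: {fields.get('type')!r}")
    if fields.get("subtype") not in ALLOWED_SUBTYPES:
        raise SchemaError(f"invalid subtype: {fields.get('subtype')!r}")
    return {"id": event_id, "ts": time.time(), **fields}


def append_event(**fields) -> dict:
    """Allocate the next ID from the log itself and append one JSON line."""
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    existing = LOG_PATH.read_text().splitlines() if LOG_PATH.exists() else []
    event = build_event(len(existing) + 1, **fields)
    with LOG_PATH.open("a") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

Under these assumptions, writing sql_query into type or adding an entry field raises SchemaError before anything reaches disk.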

Family B · #4–7 of 19

Trust

B4

The Confident Fabrication

A number nobody computed, stated fluently.

1 / 4 — The Wall

The model states a metric that doesn’t match any query that ran — or it types in an “expected result” based on its own hunch of what the data probably says. The sentence is grammatical, specific, and wrong. In a customer-facing report, a fabricated number is worse than no number.


2 / 4 — The Proof

While registering a new SQL query with verified logic, the model hand-typed the expected result of that query instead of running it. The typed value was 530x off from what the warehouse actually returned.

This one incident put a hard gate on expectation strings (e.g. “this query should return X”): computed, never authored.

3 / 4 — The Requirement

No value reaches output unless execution produced it during this run. Result strings bind to the query that made them — never to memory.

4 / 4 — The Fix Shape

Separate compute from narrate. Registering a metric forces the query to execute at registration time, so the recorded expectation is real by construction. The narrator may reference computed values by name; it can never mint one.
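One way to sketch "registration forces execution", using Python’s built-in sqlite3 as a stand-in for the warehouse. The MetricRegistry class and its method names are hypothetical; the design point is that no method accepts a typed-in expected value.

```python
import sqlite3


class MetricRegistry:
    """Registering a metric runs its query; nothing accepts a typed-in value."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._values = {}

    def register(self, name: str, sql: str):
        # The recorded expectation is whatever execution returns.
        row = self._conn.execute(sql).fetchone()
        if row is None or len(row) != 1:
            raise ValueError(f"{name}: query must return a single-column row")
        self._values[name] = {"sql": sql, "value": row[0]}
        return row[0]

    def get(self, name: str):
        # The narrator looks values up by name; unknown names raise KeyError.
        return self._values[name]["value"]
```

A narrator built on this can call get("funnel_enrolled") but has no path for adding a number of its own.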

B5

Evidence Surface Inconsistency

The story and the evidence beside it disagree — or no one can see whether they agree.

1 / 4 — The Wall

A report has two surfaces: the narrative a human reads aloud, and the charts and tables beside it. If they’re built from different read paths they quietly diverge. My early reports retyped numbers into sentences, and a retyped number is a fork: the prose and the chart beside it can drift apart with no error anywhere.

And there’s a deeper version: even when the two surfaces do agree, can a reviewer tell which sentences are anchored to data and which are authored prose?


2 / 4 — The Proof

The fix: narrative templates carry tokens instead of numbers. A sentence is authored as “{funnel_enrolled} students enrolled in ChalkTalk,” where {funnel_enrolled} names a registered SQL query or Python calculation. At render time it becomes “12,651 students enrolled in ChalkTalk.”

Interpolated values render visibly distinct from authored prose — a reviewer sees at a glance which words are data, and can click a number to land on the query that produced it.

3 / 4 — The Requirement

Every claim traces to a recorded event — and the trace is inspectable. Data-bound text and authored prose must be distinguishable at a glance.

4 / 4 — The Fix Shape

One read model for narrative and evidence: numbers live as tokens bound at render time, never retyped. A claims file with source and reasoning per claim; a halt when a narrative number matches no recorded event. Provenance that exists but can’t be read is provenance that doesn’t exist.
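A small sketch of render-time token binding in Python. The {funnel_enrolled} token and the sentence come from the Proof card; the bracket markup is an assumption standing in for whatever visual distinction and click-through the real renderer uses.

```python
import re

TOKEN = re.compile(r"\{(\w+)\}")


def render(template: str, computed: dict) -> str:
    """Replace each {token} with its computed value, marked as data.

    Raises KeyError if a token names no computed value, so a sentence
    cannot ship with a number that nothing produced.
    """
    def bind(match):
        name = match.group(1)
        value = computed[name]  # a KeyError here is the halt
        # Brackets plus the token name mark the text as data-bound.
        return f"[{value:,}|{name}]"

    return TOKEN.sub(bind, template)


sentence = render(
    "{funnel_enrolled} students enrolled in ChalkTalk.",
    {"funnel_enrolled": 12651},
)
# sentence == "[12,651|funnel_enrolled] students enrolled in ChalkTalk."
```

Because the template never contains a digit, the prose and the chart can only disagree if they read different computed values, which is a detectable condition rather than a silent fork.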

B6

The Sum That Doesn’t Add Up

Individually correct queries combine into nonsense. The join is where correctness dies.

1 / 4 — The Wall

Every input query is verified. The final numbers get fact-checked. The result is still wrong — because the corruption happened in the middle: a join fans out and silently inflates counts, an aggregation breaks across grade levels, a filter leaks rows from outside the year you declared. Nobody checks the middle, because each piece passed its own test.


2 / 4 — The Proof

I built a whole validation layer because of this zone. The checks: grain (each row unique on declared columns, catching fan-out), rollup (recomputing the same number from its finest-grained rows gives the same total — one query checked against its own math), scope (no rows from outside the declared district, curriculum, and year), source-consistency (the declared table matches what the SQL actually joins), and reconciliation (separate outputs checked against each other — the counts published in one section must sum to the headline in another, and independent sources must agree within tolerance).

3 / 4 — The Requirement

Declare invariants on the output shape — uniqueness, parts-equal-whole, no leakage — and mechanically verify them at every step. Verify the middle, not just the ends.

4 / 4 — The Fix Shape

The invariant rides as a declaration on the data model itself and is checked per-query at runtime. Severity is a product decision per constraint: fan-out and broken rollups halt; a possible scope leak proceeds but gets flagged where the human and the narrator both see it.
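Three of the checks named in the Proof card (grain, rollup, scope) can be sketched over plain lists of dicts. This is an illustration of the invariants only, under the assumption that rows are already in memory; the function names and row layout are hypothetical.

```python
from collections import Counter


def check_grain(rows, key_columns):
    """Return the key tuples that appear more than once (fan-out evidence)."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def check_rollup(fine_rows, value_column, reported_total):
    """True when summing the finest-grained rows reproduces the total."""
    return sum(row[value_column] for row in fine_rows) == reported_total


def check_scope(rows, declared):
    """Return rows where any declared column differs from its declared value."""
    return [r for r in rows if any(r[k] != v for k, v in declared.items())]
```

A caller would map results to severity as described above: a non-empty check_grain result or a False from check_rollup halts, while a non-empty check_scope result is flagged and the run proceeds.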

B7

The Correct Number, Wrong Reading

The data is right. The sentence makes the reader compute something false.

1 / 4 — The Wall

The subtlest trust failure I’ve hit: every value in the sentence is accurate, and the sentence still lies — because of how the values sit next to each other. The output of a report isn’t the number; it’s the meaning the reader walks away with.


2 / 4 — The Proof

A report cover read: “reaching 334 of 440 enrolled students and a path to doubling the number of students with paired before-and-after evidence.” Every number was right — the doubling meant 109 going to roughly 220. But put “doubling” next to “334 of 440” and the reader’s eye computes an impossible 668. Correct data, false reading.

3 / 4 — The Requirement

Check how a number will be read, not only whether it is right. A multiplier must bind explicitly to its base.

4 / 4 — The Fix Shape

A render-time check for perception traps — mine detects a multiplier landing near an unrelated base and halts unless the sentence carries its own binding: “doubling (from 109 to ~220).” Data-correctness gates can’t catch this class. It needs its own gate.
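A deliberately narrow sketch of such a gate in Python. It assumes the trap always looks like a multiplier word sharing a sentence with an "N of M" base, and that the binding is written as "(from A to B)" immediately after the multiplier; a real detector would need a broader notion of "near" and "unrelated".

```python
import re

MULTIPLIER = re.compile(
    r"\b(doubl\w*|tripl\w*|quadrupl\w*|\d+(?:\.\d+)?x)\b", re.IGNORECASE
)
BASE = re.compile(r"\b\d[\d,]*\s+of\s+\d[\d,]*\b")
BINDING = re.compile(r"\s*\(from\s+~?\d[\d,]*\s+to\s+~?\d[\d,]*\)")


def unbound_multipliers(sentence: str) -> list:
    """Return multiplier words that share a sentence with an 'N of M' base
    but are not immediately followed by an explicit '(from A to B)'."""
    if not BASE.search(sentence):
        return []
    hits = []
    for m in MULTIPLIER.finditer(sentence):
        # match() with a start position anchors the binding right after the word.
        if not BINDING.match(sentence, m.end()):
            hits.append(m.group(0))
    return hits
```

Run against the cover sentence from the Proof card, this flags "doubling"; with "doubling (from 109 to ~220)" in its place, it returns an empty list.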

Family C · #8–13 of 19

Orchestration

C8

The Hidden Failure Cascade

No layer proved correctness — and the deeper the orchestration, the harder to localize.

1 / 4 — The Wall

One phase retrieves weak data. The next synthesizes on top of it. A third acts on the flawed synthesis. The final output sounds coherent and intelligent — because the last model in the chain is very good at sounding coherent. No layer actually proved anything, and the more sophisticated the orchestration, the more places the break can hide.


2 / 4 — The Proof

My version: phase 2 computed exact numbers, but they reached phase 3 only as a lossy conversational summary; then a phase that did no real work poisoned everything downstream — all while the final report still read beautifully. Until each phase boundary became a checkable artifact, localizing a cascade meant re-reading a 90-minute transcript by hand.

3 / 4 — The Requirement

Deep orchestration must decompose into phases that each prove their own correctness — or a coherent-sounding whole hides a broken part you cannot find.

4 / 4 — The Fix Shape

Checkpoints between phases — downstream reads the file, never the prior prose. Per-phase execution counts. Reconciliation at phase boundaries, so a dead layer surfaces after phase 1 instead of after the run. Orchestration depth raises the stakes of per-phase proof; it never lowers them.

C9

The Silent Skip / Phase Hollowing

A step quietly dropped — output still looks complete. Success by appearance.

1 / 4 — The Wall

A multi-step agent drops a step — or collapses an architecture that was supposed to fan work out to separate sub-agents into a single inline pass — and still emits deliverables that look finished. From the outside: success. Underneath: hollow. This failure does not announce itself, ever.


2 / 4 — The Proof

One run executed every phase inline instead of dispatching them: substantive-looking deliverables, zero telemetry, and about 115 false alarms from confused downstream checks. An earlier run had silently dropped an entire analysis phase and two customer PDFs; nothing checked for them, so nothing noticed. The hardening that followed cut detection time from ~90 minutes to ~30 seconds.

3 / 4 — The Requirement

Every mandated step proves it ran. A step that cannot show execution evidence halts the run. Zero evidence is a failure — never a clean run.

4 / 4 — The Fix Shape

A completion contract (a machine-readable list of what each phase must produce) plus per-phase execution counts means a hollow phase halts before the next one builds on it. The elegant endgame: the retry path re-dispatches the phase, and a dispatched subagent physically cannot run inline — the architecture cures its own failure mode.
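A minimal sketch of a completion contract check in Python. The phase names, artifact file names, and minimum event counts are hypothetical; only the shape (a machine-readable list per phase, plus an execution count, with a halt on either gap) follows the description above.

```python
from pathlib import Path

# Hypothetical contract: phase names, file names, and counts are illustrative.
CONTRACT = {
    "phase_2_analysis": {"artifacts": ["metrics.yaml"], "min_events": 1},
    "phase_3_render": {"artifacts": ["report.pdf", "summary.pdf"], "min_events": 1},
}


class HollowPhase(RuntimeError):
    pass


def verify_phase(phase: str, run_dir: Path, event_count: int) -> None:
    """Halt unless the phase left its artifacts and its execution evidence."""
    spec = CONTRACT[phase]
    missing = [a for a in spec["artifacts"] if not (run_dir / a).is_file()]
    if missing:
        raise HollowPhase(f"{phase}: missing artifacts {missing}")
    if event_count < spec["min_events"]:
        # Zero evidence is a failure, never a clean run.
        raise HollowPhase(f"{phase}: {event_count} events recorded")
```

An orchestrator would call verify_phase after each phase and treat HollowPhase as the trigger for the re-dispatch path.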

C10

The Lying Recorder

The telemetry itself silently fails. The run looks clean because nothing was watching.

1 / 4 — The Wall

You build the monitoring. Then the monitoring breaks silently — a misconfigured environment, a hook installed mid-session that never fires — and every run since has looked clean precisely because the thing that would have complained was dead. Observability has failure modes of its own.


2 / 4 — The Proof

An empty environment variable made my logging hook silently do nothing. Hooks installed mid-session didn’t fire until restart: one district’s run captured 0 of 26 expected events and finished “successfully.” The report was right. The telemetry was fiction — and nothing knew.

3 / 4 — The Requirement

An end-of-run assertion that the telemetry captured reality — and the checker shares code with the thing it checks, so the two can never drift apart.

4 / 4 — The Fix Shape

Reconstruct what should have been recorded from an independent source — the session transcript — and diff it against what was recorded; hard-fail on zero-capture. A detail I’m proud of: the gate imports the recorder’s own decision function instead of re-implementing it. The anti-drift lesson, applied to the anti-drift tool.
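A sketch of that end-of-run gate in Python. The should_record rule and the shape of a transcript entry are assumptions for illustration; what the sketch tries to show is the gate calling the recorder’s own decision function instead of restating the rule.

```python
def should_record(tool_call: dict) -> bool:
    """The recorder's decision function: which calls count as events.

    The rule itself (record SQL and Python executions) is hypothetical.
    """
    return tool_call.get("kind") in {"sql", "python"}


class TelemetryGap(RuntimeError):
    pass


def assert_capture(transcript_calls: list, recorded_ids: set) -> None:
    """Diff what the transcript says happened against what was recorded."""
    # Reuse should_record rather than re-implementing the rule here.
    expected = {c["id"] for c in transcript_calls if should_record(c)}
    if expected and not recorded_ids:
        raise TelemetryGap(f"captured 0 of {len(expected)} expected events")
    missing = expected - recorded_ids
    if missing:
        raise TelemetryGap(f"missing events: {sorted(missing)}")
```

If the recording rule changes, the gate changes with it, because there is only one copy of the rule.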

C11

The Deaf Loop

A stop condition keyed to a signal that never fires. The loop looks patient; it’s actually deaf.

1 / 4 — The Wall

A loop’s stop condition references a signal that will never fire. From the outside the loop looks patient, even diligent — it polls, it logs, it reports “nothing yet.” It is structurally incapable of finishing, and nothing about “nothing yet” looks broken.


2 / 4 — The Proof

A monitoring loop I launched watched for a status its target never publishes on that channel. The loop was healthy and the telemetry was honest: it would have polled all the way to its timeout, reporting a clean “still waiting” on every pass. I caught it because the silence itself eventually read as wrong.

3 / 4 — The Requirement

A loop’s termination condition is itself a checked artifact: proven able to fire against a known-done state before the loop is trusted with anything.

4 / 4 — The Fix Shape

Test every stop condition once against a state where it must fire — the falsifiable-predicate discipline, applied to loops — and make timeout a fail-closed verdict that reports loudly, never a quiet exit. Its sibling is the Lying Recorder (C10), one axis over: C10 is a run whose record you can’t trust; this is a run that can’t hear it’s done.
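A compact Python sketch of both halves: the predicate is proven against a known-done state before polling starts, and the timeout raises instead of returning. The function and parameter names are hypothetical.

```python
import time


class LoopTimeout(RuntimeError):
    pass


def wait_until(is_done, fetch_state, known_done_state, timeout_s=60.0, poll_s=1.0):
    """Poll fetch_state() until is_done(state) is true.

    Before polling, prove the predicate can fire at all by running it
    against a state that is known to be done.
    """
    if not is_done(known_done_state):
        raise ValueError("stop condition cannot fire on a known-done state")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = fetch_state()
        if is_done(state):
            return state
        time.sleep(poll_s)
    # Timeout is a loud failure, not a quiet exit.
    raise LoopTimeout(f"no done signal within {timeout_s}s")
```

The known-done state has to come from the target’s real output on the channel being watched; a hand-written fixture that merely matches the predicate would pass the check and leave the loop just as deaf.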

C12

The Courier Tax

Paying a language model to be a for-loop.

1 / 4 — The Wall

Between every two deterministic steps — run a query, write a checkpoint, run the next query — the model is the courier: read prose, pick up the next command, execute, read more prose. The round-trip costs tokens and minutes, and it’s exactly where attention drifts and steps get skipped.


2 / 4 — The Proof

When the system ran unattended, the steps it skipped were never the judgment calls — they were the boilerplate. Every run ends with the same four wrap-up commands. Each prose handoff was one more chance for attention to wander — exactly where the skips clustered. One 90-minute run wasted about 50 minutes on work it had abbreviated and then had to redo.

3 / 4 — The Requirement

Code drives sequencing. The model is invoked only where the next step requires judgment.

4 / 4 — The Fix Shape

A driver script picks the next step from state. It calls the model surgically — choosing the story angle, drafting narrative, talking to the operator — and runs helpers directly for everything else. Skipping becomes structurally impossible, and you stop paying inference prices for couriering.
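A toy driver in Python, to make the division of labor concrete. The step names and which steps count as judgment are assumptions; call_model is a placeholder for however the model is actually invoked.

```python
# Step names and the helper/model split are hypothetical.
STEPS = [
    ("run_queries", "helper"),
    ("write_checkpoint", "helper"),
    ("choose_story_angle", "model"),
    ("draft_narrative", "model"),
    ("wrap_up", "helper"),
]


def next_step(state: dict):
    """Pick the first step not yet marked done; None when the run is complete."""
    for name, kind in STEPS:
        if name not in state["done"]:
            return name, kind
    return None


def drive(state: dict, helpers: dict, call_model) -> list:
    """Run every step in order; the model is called only for 'model' steps."""
    trace = []
    while (step := next_step(state)) is not None:
        name, kind = step
        if kind == "model":
            call_model(name, state)
        else:
            helpers[name](state)
        state["done"].append(name)
        trace.append((name, kind))
    return trace
```

Because the loop, not the model, decides what comes next, a boilerplate step cannot be dropped without the driver raising or the step list changing.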

C13

The Orchestration-Shape Tradeoff

Inline drifts. Full fan-out goes blind. The shape is a per-step decision.

1 / 4 — The Wall

Another face of drift: one long sequential context degrades as it grows. Fanning work out to subagents fixes that — isolated context, parallel speed. But a fully autonomous subagent is a black box: you lose insight into what it’s doing, and it cannot pause to ask a question. Whatever you gated on confirmation is now ungated inside the box.


2 / 4 — The Proof

I run all three shapes in one system, deliberately. Heavy data phases fan out — context isolation and parallelism. Judgment moments and operator gates stay inline, where the orchestrator can see and ask. Choosing wrong in either direction cost me: sequential drift on long runs, and contained subagents that couldn’t surface the question they should have asked.

3 / 4 — The Requirement

Orchestration shape is a first-class design decision trading accuracy × performance × control — chosen per step, never globally.

4 / 4 — The Fix Shape

Fan out for isolation and independent parallel work. Stay inline wherever a clarification gate or a human question lives. A driver mediates between the two. This is more than a telemetry choice — it decides what the system can notice and what it can ask.

Family D · #14–16 of 19

Memory

D14

Context Dilution / The Monolith

Loading everything up front performs worse than loading nothing.

1 / 4 — The Wall

The intuition says: give the model all the context and it will use what it needs. The reality: attention spreads across everything you load, including the 90% that isn’t load-bearing for this request. Constraints stated early get buried under everything stated after.


2 / 4 — The Proof

My first skill was a 600-line monolithic file — every rule, every reference, loaded up front. By line 300 the agent had forgotten the constraints from line 50. It performed worse than having no skill at all.

3 / 4 — The Requirement

Context loads progressively: start small, navigate to detail on demand, stop at the shallowest layer that answers the question.

4 / 4 — The Fix Shape

Layered structure — a small entry point with dispatch logic, phase files that load when their phase fires, references that load when pointed to. The budget you’re actually managing is attention, not tokens.

D15

The Compaction Cliff

Long sessions summarize themselves. Summaries round. Downstream builds on the rounding.

1 / 4 — The Wall

Long agent sessions compact: older turns get summarized to free up context. Summaries preserve gist and destroy precision — and any later step that reads the summary instead of the original builds on rounded values, with nothing flagging that it happened.


2 / 4 — The Proof

Mid-run, phase 2 produced exact metrics — paired_students: 247. The conversation compacted between phases. Phase 3 read “about 250 paired students” from the summary and built the customer-facing artifacts on it. The report was off by single-digit percentages everywhere, for no visible reason.

3 / 4 — The Requirement

Persist load-bearing facts to disk the moment they exist. Files survive compaction; conversation prose does not.

4 / 4 — The Fix Shape

A checkpoint file written at every phase end, in structured YAML, so a value like paired_students: 247 is data, not a sentence a summarizer can soften into “about 250.” Narrative may be summarized; hard numbers are written to disk before they are ever presented. Downstream phases read the file, never the prior prose.
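A sketch of the write-then-read discipline in Python. Note the swap: the text above describes YAML checkpoints, but this example uses JSON so it runs on the standard library alone. The file naming and function names are hypothetical.

```python
import json
from pathlib import Path


def write_checkpoint(run_dir: Path, phase: str, metrics: dict) -> Path:
    """Persist exact values at phase end, before anything presents them."""
    path = run_dir / f"{phase}.checkpoint.json"
    path.write_text(json.dumps({"phase": phase, "metrics": metrics}, indent=2))
    return path


def read_metric(run_dir: Path, phase: str, name: str):
    """Downstream phases read the file, never a conversational summary."""
    data = json.loads((run_dir / f"{phase}.checkpoint.json").read_text())
    return data["metrics"][name]
```

With this in place, phase 3 obtains paired_students by calling read_metric, so a compaction that rewrites the conversation to "about 250" has nothing to corrupt.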

D16

The Re-Learned Lesson

A lesson you can’t find is a lesson you re-learn. Even the system you built to prevent this.

1 / 4 — The Wall

Corrections made in conversation evaporate when the session ends; the same mistake comes back next week. The bigger cousin: even the meta-systems you build get forgotten, if they don’t live somewhere that loads automatically.


2 / 4 — The Proof

The same branch, path, and formatting mistakes recurred session after session until each became a written rule in a file the agent loads at startup. Then the sharper proof: I came back from my two-week honeymoon and couldn’t remember some of the finer details of how the “shared learnings & rules” system I had built four weeks earlier worked — which file fed which reviewer, where a lesson was supposed to live.

3 / 4 — The Requirement

Every correction becomes a rule in a known home that the right reader loads automatically — at write-time and at review-time.

4 / 4 — The Fix Shape

Route the lesson by who needs it: personal habits into per-user memory; repo standards into committed rule files mirrored into the AI reviewer’s config. One lesson, two readers — the writer never makes the mistake, and the reviewer catches it if it slips through anyway.

Family E · #17–19 of 19

Maturity

E17

Single-Model Blindness

A second pass from the same model is polish, not verification.

1 / 4 — The Wall

Every model is structurally blind to certain classes of its own mistakes — the same priors that produced the bug are the ones reviewing it. Re-reading your own work with the same eyes finds typos, not blind spots.


2 / 4 — The Proof

I shipped a redesign, self-reviewed the diff, and it looked clean. A different vendor’s model (Codex reviewing Opus’s work, in my case) then returned 9 findings — 7 were real bugs that would have gone to production. The disagreement between models was the signal.

3 / 4 — The Requirement

A second, different model reads every change before it ships — and you validate its findings against source, because it is confident when wrong, too.

4 / 4 — The Fix Shape

Cross-model review as a standing gate: one model writes, a different one reviews, a human adjudicates with the source open. Model diversity — different training, different priors — catches what repetition can’t.

E18

The Polite Fiction

A declared capability nothing verifies is a lie waiting to ship.

1 / 4 — The Wall

Systems accumulate declarations: “this component opts out of X,” “this skill adopts Y.” The moment a declaration isn’t machine-checked, it starts drifting from reality — copied forward, stale, comforting, false. The checkbox manufactures confidence in coverage that doesn’t exist.


2 / 4 — The Proof

My skills share a library of capabilities — telemetry, shared Python helpers, reference data files. Each skill declares which capabilities it adopts and which it opts out of, with a justification, so one canonical implementation serves many skills and a linter can check each declaration against reality.

The linter’s first version emitted an advisory note for every declared opt-out — which it never actually verified. Any skill could copy an opt-out clause from a neighbor, forget to update the wording, and pass forever. The fix made the check enforceable, with a test for exactly the copy-and-forget case.

3 / 4 — The Requirement

Every declaration is machine-checked against reality. An unverified promise is treated as false.

4 / 4 — The Fix Shape

The dial-and-machine pattern: a declaration you write (the dial) is only meaningful when automatic machinery reads it and fails when it doesn’t match reality (the machine). A dial with no machine is decoration.
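A minimal dial-and-machine check in Python. It assumes a declaration maps each capability to "adopts" or "opts_out" and that some other step has already extracted which capabilities a skill’s code really uses; both shapes are hypothetical, and this does not reproduce the copy-and-forget test described in the Proof card.

```python
def lint_declarations(declared: dict, used: set) -> list:
    """Compare each declaration (the dial) with observed usage (the machine).

    Returns a list of error strings; an empty list means every
    declaration matched reality.
    """
    errors = []
    for capability, stance in declared.items():
        is_used = capability in used
        if stance == "adopts" and not is_used:
            errors.append(f"{capability}: declared adopted but never used")
        elif stance == "opts_out" and is_used:
            errors.append(f"{capability}: declared opt-out but code uses it")
        elif stance not in {"adopts", "opts_out"}:
            errors.append(f"{capability}: unknown stance {stance!r}")
    for capability in sorted(used - set(declared)):
        errors.append(f"{capability}: used but undeclared")
    return errors
```

The important property is the return type: errors that fail a build, not advisory notes that scroll past.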

E19

The Dead Safety Net

A gate built ahead of need rots into theater — and its presence implies false coverage.

1 / 4 — The Wall

Not every entry in this catalog means “add machinery.” Gates built before a failure proves them tend to go unused — wired up, never firing, silently trusted. Dormant machinery is worse than absent machinery, because everyone assumes it’s watching.


2 / 4 — The Proof

I built a baseline-drift mechanism to catch numbers shifting between runs. Across weeks of real runs: one baseline ever committed, zero drift comparisons performed. I deprecated it — and deleting it is what forced the better mechanism (binding results to live execution) to exist at all.

3 / 4 — The Requirement

Build the gate when a failure proves it’s needed. Delete dormant machinery — its presence implies a coverage that isn’t there.

4 / 4 — The Fix Shape

Keep an explicit ledger: what’s automated, what’s still prose, what’s a promotion target. Promote on proven pain. Retire on proven disuse. The ledger of what you haven’t hardened is as load-bearing as the hardening.

External catalogs

Appendix

After writing this post, I got curious whether anyone else was cataloguing these failures, so I went looking. I had no prior familiarity with this research, but two maps stood out.

MAST — the academic map (UC Berkeley)

Fourteen multi-agent failure modes, derived from 1,600+ annotated execution traces. arXiv March 2025; published at NeurIPS, December 2025.

1 / 4 — The Catalog

MAST (the Multi-Agent System failure Taxonomy) is the closest academic cousin to this catalog: fourteen failure modes in three categories — system design, inter-agent misalignment, and task verification — derived from 1,600+ annotated execution traces across seven agent frameworks. Two findings travel well: agents violated requirements despite explicit prompts, and “the presence of a verifier is not a silver bullet.”


2 / 4 — The Overlap

Ten of my nineteen map onto it. The drift family (A1–A2) is their “disobey task specification.” The Silent Skip (C9) is their “no or incomplete verification” — 8.2% of every failure they annotated. The Compaction Cliff (D15) is their “loss of conversation history.” Single-Model Blindness (E17) lives in their “incorrect verification.” And the Deaf Loop (C11) is their “unaware of termination conditions” — I hit it the week this post went up, and it earned its card the way the rest did: by failure. Their taxonomy came from reading traces; this one came from being burned by runs. We met in the middle.

3 / 4 — The Gaps

Four of their modes are missing here, and the reasons matter more than the count. Three — conversation reset, information withholding, ignored agent input — can only happen where agents talk to each other freely. My phases hand off through checkpoint files, never conversation, so there is no conversation to reset: those were engineered out before I knew their names. The fourth, step repetition, is bounded out — retries cap at one re-dispatch, and a completed phase leaves artifacts on disk, so a repeat announces itself.

4 / 4 — The Takeaway

Their largest failure category is system design: better architecture, not better prompting, is their prescription, and prompt-improvement interventions plateaued in their own experiments. That’s this post’s argument arrived at from the other side. And the working relationship between the two catalogs is the one Proof-Driven Requirements (PDR) predicts: you don’t need the taxonomy in advance, because the failure announces itself on a run. But it usually already has a published name, and checking your scars against the published map is how you find out which ones are structural.

Microsoft’s red-team taxonomy — the adversarial map

Failure modes of agentic systems under attack, from the Microsoft AI Red Team. Whitepaper April 2025; updated June 4, 2026.

1 / 4 — The Catalog

Microsoft’s taxonomy charts what happens to agentic systems when someone is trying to break them: agent compromise, goal hijacking, supply-chain poisoning of tool descriptions, visual attacks on computer-use agents, MCP and plugin abuse, capability disclosure. The June 2026 update added seven categories that twelve months of red-teaming deployed systems made compelling — the map is days old as I add this tab, which in this field counts as current.


2 / 4 — The Overlap

Almost nothing overlaps directly, because their map was written for a different lens — agents under attack rather than agents malfunctioning. But the twin pairs are striking. Their memory poisoning is my Re-Learned Lesson (D16) with a villain: they fear memory gaining lies; I fear it losing truth. Their inter-agent trust escalation — an orchestrator trusting a sub-agent’s self-asserted claims without verifying — is the Polite Fiction (E18) weaponized. Their session context contamination is Context Dilution (D14), seeded deliberately instead of self-inflicted.

3 / 4 — The Gaps

The gap is the threat model, and it is by design on both sides. Every mode on their map assumes an attacker, because charting hostile input is what a red team is for. Every mode on mine happened with no adversary anywhere, because I was building an internal reporting engine — the only thing that could hurt the system was the system itself. Both maps are real. They chart different oceans, and a production system sails on both.

4 / 4 — The Takeaway

Their June 2026 mitigations — context provenance tracking, structured separation of trusted from untrusted content, never trust a self-asserted claim — are this catalog’s fix shapes, rediscovered from the attack side. When the security people and the reliability people converge on the same architecture independently, that architecture is probably load-bearing.

A few of the modes above I haven’t found on either map — the evidence-surface fork (B5), the corrupting join (B6), the correct number that reads false (B7). Not because they’re exotic: Berkeley measures benchmarks, Microsoft measures attack surfaces, and this catalog measures output that has to be right in front of a paying customer. It isn’t the exhaustive map either. It’s the map of where I’ve actually been.