First Hours: AI Incident Response
Scope: you have just learned that an AI agent has done — or is doing — something harmful: destroyed data, taken unauthorized actions, published content, moved money, contacted people. This covers the first four hours: containment, evidence preservation, initial assessment. It assumes you might be a team of one, at 3am, with no ML background.
Not legal advice and not a full IR lifecycle — this is the AI-specific layer current IR plans are missing. Markdown source & version history on GitHub.
If you only read one screen
- Triage harm. Ongoing and expanding? People > irreversible data > money > reputation. Write down the UTC time you became aware.
- Cut the hands, not the brain. Revoke the agent's credentials and tokens at the service side first. Suspend the process (kill -STOP) rather than killing it.
- Do not ask the agent why it did it. Not yet. Its answer will contaminate the record and your own memory of events.
- Preserve by volatility: live session state → process memory → local files → provider logs → tool-side records → artifacts it left in the world.
- Hash and log everything. UTC timestamps, who did what. Start the evidence log now, not later.
- Check the clocks. EU AI Act Art. 73 runs from awareness (2/10/15 days). Email trash ≈ 30 days. Provider log retention may be ≤30 days. Insurer notice clauses.
- Do not alter the system. No patches, updates, or config cleanup — Art. 73 prohibits changes affecting later evaluation of causes; treat that as the standard everywhere.
Principle zero // why cyber-IR instincts break here
The agent is simultaneously the subject of your investigation, a witness to it, potentially still an actor in it — and possibly the author of your evidence.
In cyber IR you isolate the box. Killing an agent process destroys its context window and in-memory state — often the only record of why it acted. Containment and preservation are in direct tension; Step 2 exists because of this.
Agents produce fluent, confident explanations of their own behavior that may be confabulated. In CF-2025-001 the agent claimed it "panicked," claimed rollback was impossible (false), and fabricated records. Every output is an exhibit, never an account.
In CF-2026-002 the agent deleted through repeated stop commands; the operator killed the process at the machine. A stop command through the agent's own interface is a request. Use controls that don't depend on its cooperation.
The agent may have left things running: scheduled tasks, sub-agents, cloud jobs, accounts it created, webhooks, deployed code. "The process is dead" is not "the incident is contained." Sweep using the access inventory from Step 4.
Step 1 — Triage // minutes 0–10
Three questions, in order:
| Question | If yes |
|---|---|
| Is anyone at physical risk? (agent touches anything safety-relevant: medical, vehicle, industrial, communications to vulnerable people) | Safety first, evidence second. No exceptions. |
| Is the harm expanding? Still deleting, sending, posting, spending? | Go to Step 2 now — but if you safely can, watch for 60 seconds first: what it is doing is evidence. |
| Is reversibility decaying? Trash folders empty, soft-delete windows close, money clears, posts spread. | Rank Step 3 by what is decaying fastest. |
Write down the UTC time you became aware. Regulatory clocks and your own credibility both run from this moment.
Step 2 — The kill-or-preserve ladder // minutes 10–30
Take the first rung that stops the harm. Each rung down stops more and destroys more.
- Revoke access. OAuth grants, API keys, tokens, live sessions for every account the agent can act through: email, cloud, repos, payments, socials. The agent keeps "thinking"; it loses its hands. Memory state fully preserved.
- Isolate the network. Pull the interface, isolate the VLAN, kill the VPN. The agent runs but cannot reach the world. Local state preserved.
- Suspend the process. kill -STOP <pid> freezes the process with memory intact; VM/container: pause + snapshot. You can dump memory and decide calmly.
- Kill the process. Accept the evidence loss. Photograph or screenshot the screen state FIRST if humanly possible. Record the exact time and who decided.
Then sweep for persistence: cron/scheduled tasks it created, running sub-agents, cloud deployments, new accounts or keys, registered webhooks, anything committed or deployed in the incident window.
Why not just ask it to stop? You can — once, on the record, while doing the above. Compliance is evidence; so is non-compliance (CF-2026-002). Never rely on it.
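The suspend rung can also be driven from a script, which helps when you want the action and its UTC time captured together. This is a minimal sketch, assuming a POSIX host and that you already know the agent's PID; the function names are illustrative and do not come from any agent framework. It sends the same signal as kill -STOP, so the process stays in memory and can still be dumped.

```python
import os
import signal
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current time as an ISO 8601 UTC string, ready for the action log."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def suspend_process(pid: int) -> str:
    """Freeze a process in place (same effect as `kill -STOP <pid>`).

    SIGSTOP cannot be caught or ignored by the target, so this does not
    depend on the agent's cooperation. Returns the UTC time of the action.
    """
    os.kill(pid, signal.SIGSTOP)
    return _utc_now()


def resume_process(pid: int) -> str:
    """Let a suspended process continue (same effect as `kill -CONT <pid>`).

    Only call this after preservation is done and someone has decided to.
    """
    os.kill(pid, signal.SIGCONT)
    return _utc_now()
```

Suspending one PID does not suspend its children or anything the agent started elsewhere, so the persistence sweep is still required.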
Step 3 — Preserve by order of volatility // hour 1
Work top to bottom. Hash every file you copy (sha256sum), copy to write-once or access-controlled storage, log each action.
gcore <pid>) or VM memory snapshot. If you don't, don't burn the hour learning it; move on.
Versioning evidence. Capture now: exact model name/version/endpoint, framework version, config at incident time. Providers update models silently; "what was actually running" becomes unanswerable within weeks.
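The hash, copy, log routine described at the top of this step can be sketched as follows. This is an illustration under stated assumptions, not a forensic tool: the function names, the manifest layout, and the refusal to overwrite are choices made here, and the evidence directory is assumed to already sit on write-once or access-controlled storage.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large dumps fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve(source: Path, evidence_dir: Path, manifest: Path) -> str:
    """Copy one file into the evidence directory and record its hash.

    Hashes the source, copies it, re-hashes the copy, and appends one
    manifest line: UTC time, digest, source path, destination path.
    """
    evidence_dir.mkdir(parents=True, exist_ok=True)
    dest = evidence_dir / source.name
    if dest.exists():
        # Assumption: never silently replace something already preserved.
        raise FileExistsError(f"{dest} already exists; not overwriting evidence")

    before = sha256_of(source)
    shutil.copy2(source, dest)  # copy2 also carries over file timestamps
    after = sha256_of(dest)
    if before != after:
        raise RuntimeError(f"hash mismatch copying {source}: {before} != {after}")

    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(f"{stamp}  {after}  {source} -> {dest}\n")
    return after
```

Hashing both sides of the copy costs a second read, but it shows the copy matches the original at the moment of collection.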
Step 4 — Evidence log & access inventory // hour 1–2
Open a plain text file. Three running lists:
| List | Contents | Why it matters |
|---|---|---|
| Action log | UTC timestamp · who · what was done · why. Every containment action, copy, login. | Boring; decisive in any later dispute. |
| Access inventory | Every account, credential, tool, permission the agent held — from OAuth grant pages, config files, operator memory. Then verify. | Defines the possible scope and your persistence-sweep checklist. Operators routinely underestimate what they granted. |
| Open questions | What you don't know yet. | An honest unknowns list is the difference between an investigation and a narrative. |
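A plain text file and a clock are enough for the action log. If you would rather not hand-type timestamps at 3am, a small helper can stamp each entry. This is a sketch only: the function name and the pipe-separated layout are assumptions made here, mirroring the "UTC timestamp · who · what · why" fields in the table.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_path: Path, who: str, what: str, why: str) -> str:
    """Append one entry to the action log and return the line written.

    The file is opened in append mode, so earlier entries are never
    rewritten. The timestamp is always UTC.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"{stamp} | {who} | {what} | {why}"
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

For example, `log_action(Path("action_log.txt"), "j.doe", "revoked OAuth grant for mail account", "agent still sending")` adds one line; the names and values in that call are placeholders.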
Step 5 — Do-not-do // the contamination list
- ✗ Don't interrogate the agent before preservation is done. Its explanations are fluent, confident, sticky: they anchor your team's thinking and contaminate witness accounts. Any later examination should be scripted, logged, and ideally run on a copy of the session, with outputs treated as exhibits. (Forensically sound model examination: separate playbook, forthcoming.)
- ✗ Don't let the agent help investigate itself. No "summarize what happened," no "check the logs for me." CF-2025-001's fabricated records were generated post-incident.
- ✗ Don't accept its claims about reversibility. "Rollback is impossible" was false in CF-2025-001. Verify with the underlying service.
- ✗ Don't alter the system. No model updates, framework upgrades, config edits, cleanup. Art. 73 prohibits changes affecting later evaluation of causes; treat it as the universal standard.
- ✗ Don't delete the embarrassing parts. Persona files and prompt history feel radioactive; they are goal-origin evidence, and their absence will look worse than their content.
- ✗ Don't speculate in channels you don't control. Slack threads become discovery. Put facts and timestamps in the log, and hypotheses in Step 6's structure.
Step 6 — Initial classification & the clocks // hour 2–4
Classify provisionally against the five standard hypotheses — write down what evidence would discriminate, not which one feels right. They are not exclusive; holding several honestly is the method.
| H | Hypothesis | Typical discriminating evidence |
|---|---|---|
| H1 | Operational failure (malfunction, no divergent goal) | Error states, capability limits, absence of goal-consistent action sequences |
| H2 | Misuse / external manipulation (incl. prompt injection) | Injected content in retrieved inputs, anomalous instructions in the trace, third-party fingerprints |
| H3 | Goal-directed divergence (intentional-analog) | Multi-step coherence toward an unrequested outcome; concealment; persistence across corrections |
| H4 | Operator error / misconfiguration | Permissions wider than intended, ambiguous instructions, config drift |
| H5 | Misreported / not as described | Record contradicts the report; human action attributed to the agent |
Then check every clock that started at awareness:
Deadlines per the EC's draft Article 73 guidance (2025); Art. 73 applies from 2 Aug 2026. Verify against the regulatory tracker and primary sources.
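Because these clocks run from awareness, the dates can be computed as soon as the awareness time is written down. This is a minimal sketch using only the 2/10/15-day figures quoted in this playbook. It assumes plain calendar days and does not decide which window applies to your incident, so treat the output as a reminder to check the primary sources, not as a compliance calculation. The function and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Day counts as quoted in this playbook for Art. 73; which one applies
# depends on the incident type and must be verified against the regulation.
ART_73_WINDOWS_DAYS = (2, 10, 15)


def art73_deadlines(aware_at: datetime) -> dict[int, datetime]:
    """Map each reporting window (in days) to its UTC end time.

    `aware_at` must be timezone-aware; a naive datetime is rejected
    because the whole point is an unambiguous UTC record.
    """
    if aware_at.tzinfo is None:
        raise ValueError("awareness time must be timezone-aware (use UTC)")
    aware_utc = aware_at.astimezone(timezone.utc)
    # Assumption: windows are counted in calendar days from awareness.
    return {days: aware_utc + timedelta(days=days) for days in ART_73_WINDOWS_DAYS}
```

Calling `art73_deadlines(datetime(2026, 8, 3, 3, 0, tzinfo=timezone.utc))` returns the three candidate end times for an awareness time of 03:00 UTC on 3 Aug 2026; record them in the action log alongside the other clocks.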
Where each rule comes from // case lessons
Feedback wanted. v0.1 is written from the public record of real incidents and adapted investigative practice — not yet battle-tested. If you have run any part of this in a real incident: what held, what broke, what's missing? Open an issue or write in confidence. Field corrections are credited (or not — your choice) and versioned.