What a Production Runbook for an Agent Fleet Looks Like
An AI agent runbook covers monitoring, permissions, rotation, escalation, and decommissioning. See a real example and learn what breaks without one.
An AI agent runbook is a structured reference document that tells the operator what each agent does, what credentials it holds, what to check when something looks wrong, and how to shut it down without breaking the rest of the fleet. Armada Works writes one for every agent fleet we deploy, and it is the single most referenced document after the handoff.
Most teams that run AI agents in production skip the runbook entirely. They rely on the prompt file to explain what the agent should do, the state file to show what it did, and their own memory to fill the gap. That works for the first month. It stops working the moment a credential expires on a Friday evening and nobody remembers which token the agent uses, where to rotate it, or what breaks downstream.
The Five Sections Every Agent Runbook Covers
A production runbook for an agent fleet is not a single massive document. It is one page per agent, each following the same five-section structure. The consistency matters: when something breaks at 7 AM and you are scanning for the relevant section, you should not have to hunt for it.
- Monitoring. What the agent produces each run, where it writes its output, and what a healthy run looks like. For a Content agent, a healthy run means one new markdown file committed under docs/content/blog/ and a brief posted to the dashboard. For an SEO agent, a healthy run means a state file update with current ranking data and no credential errors.
- Permissions. Every credential the agent can reach, classified by blast radius. Account-level tokens (ones that can delete the entire project) should not exist in the agent's environment. Project-level tokens get listed with their renewal cadence and the exact steps to rotate them. This section also documents which PreToolUse hooks are configured and what they block.
- Rotation. Token expiry dates and the steps to renew each credential before it lapses. API keys, OAuth tokens, and service-role credentials all expire on different schedules. The runbook lists each one with its expected lifetime and a renewal procedure short enough to complete in under five minutes.
- Escalation. What to do when the agent fails in a way you cannot fix by editing a prompt or rotating a credential. This section defines two paths: self-service fixes (restart the scheduled task, re-run with updated context) and escalation to whoever built the fleet. For Armada engagements, the escalation path during the first 90 days is the optional $1,500/month support tier.
- Decommissioning. How to pause or permanently remove the agent without breaking the rest of the fleet. Agents coordinate through shared state files. Removing one without updating the synthesizer's read list means the CMO agent will keep looking for a brief that never arrives and will flag a "missing report" anomaly every run.
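Because every page follows the same five sections, completeness can be checked mechanically. The sketch below is a minimal illustration, not part of any Armada tooling: it assumes each page is plain text with the section names in capitals on lines of their own (the layout used in the sample page later in this post), and `missing_sections` is a hypothetical helper name.

```python
REQUIRED_SECTIONS = (
    "MONITORING",
    "PERMISSIONS",
    "ROTATION",
    "ESCALATION",
    "DECOMMISSIONING",
)


def missing_sections(page_text: str) -> list[str]:
    """Return the required section headers that a runbook page lacks.

    Assumption: each section starts with its name in capitals on a line
    of its own. Adjust the matching if your pages use another layout.
    """
    headers = {line.strip() for line in page_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headers]


if __name__ == "__main__":
    page = "Agent: Content\nMONITORING\nPERMISSIONS\nROTATION\nESCALATION\n"
    print(missing_sections(page))  # ['DECOMMISSIONING']
```

Run against each agent's page before a commit, a check like this would catch a page that was copied from the template but never finished.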
A Sample Runbook Page: The Content Agent
Below is what a runbook page looks like for a content agent running on a Mon/Wed/Fri cadence. This is the format Armada Works delivers at the end of a transfer engagement.
Agent: Content
Cadence: Mon/Wed/Fri 9:00 AM PT
Prompt file: docs/agents/content-agent-prompt.md
State file: docs/agents/state/content-agent-state.md
Brief output: POST /api/agent-reports (agentPrefix: "content")
              + mirror at docs/agents/state/content-brief-YYYY-MM-DD.md

MONITORING
  Healthy run:  1 new file under docs/content/blog/ OR state-only update
                Brief posted to dashboard before push
                Git commit + push to main
  Drift signal: Brief says "queue empty" for 3+ consecutive runs
                Output word count <800 or >3000 (outside style range)
                Commit touches files outside docs/content/ or docs/agents/state/

PERMISSIONS
  Credentials used:
    AGENT_REPORT_KEY  scope: POST /api/agent-reports only
                      blast radius: row-level (single table insert)
                      rotation: manual, no expiry unless revoked
  Hooks:
    PreToolUse blocks: rm -rf, git push --force, supabase db push
    config: .claude/settings.json + ~/.claude/settings.json

ROTATION
  AGENT_REPORT_KEY  no automatic expiry; rotate if compromised
                    steps: generate new key in Supabase dashboard,
                           update .env.local, update Vercel env vars

ESCALATION
  Self-service: re-run scheduled task manually
                edit prompt file if output drifted
  Escalate to:  founder (or Armada support if active)
  Trigger:      3 consecutive failed runs, credential error,
                agent writes outside its designated paths

DECOMMISSIONING
  1. Remove scheduled task
  2. Remove "content" from CMO agent's read list
  3. Archive prompt file (do not delete; git history is the record)
  4. Final state file entry: "Decommissioned YYYY-MM-DD, reason: [X]"
This page fits on a single screen. That is deliberate. Robert Cowherd, founder of Armada Works, follows a rule for runbooks: if the operator has to scroll to find the section they need, the page is too long. One screen per agent, five sections, no prose.
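The drift signals on the sample page are concrete enough to check in code. The following is a rough sketch under stated assumptions: `RunRecord` and its fields are hypothetical (the post does not describe how run data is stored), while the thresholds and allowed paths are copied from the sample page's MONITORING section.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds copied from the sample page's "Drift signal" lines.
ALLOWED_PREFIXES = ("docs/content/", "docs/agents/state/")
MIN_WORDS, MAX_WORDS = 800, 3000
EMPTY_QUEUE_LIMIT = 3


@dataclass
class RunRecord:
    """Hypothetical summary of one run; the field names are assumptions."""
    word_count: Optional[int]  # None for a state-only update
    changed_files: list[str]
    queue_empty: bool


def drift_signals(runs: list[RunRecord]) -> list[str]:
    """Return drift signals for the latest run; `runs` is ordered oldest first."""
    if not runs:
        return []
    latest = runs[-1]
    signals = []
    # Word count is only meaningful when the run produced a post.
    if latest.word_count is not None and not (MIN_WORDS <= latest.word_count <= MAX_WORDS):
        signals.append(f"word count {latest.word_count} outside {MIN_WORDS}-{MAX_WORDS}")
    # Any changed file outside the agent's designated paths is a signal.
    stray = [path for path in latest.changed_files if not path.startswith(ALLOWED_PREFIXES)]
    if stray:
        signals.append(f"commit touches files outside allowed paths: {stray}")
    # "3+ consecutive runs" holds whenever the last three runs were all empty.
    recent = runs[-EMPTY_QUEUE_LIMIT:]
    if len(recent) == EMPTY_QUEUE_LIMIT and all(run.queue_empty for run in recent):
        signals.append(f"queue empty for {EMPTY_QUEUE_LIMIT}+ consecutive runs")
    return signals
```

An empty list means the latest run matches the healthy-run description. Anything else is worth a look before the next scheduled run.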
What Breaks Without a Runbook
The failure modes that hurt most are the ones that accumulate silently. Three patterns recur in fleets that ship without documentation.
Credential expiration goes unnoticed. A pattern from Armada's own fleet: a service account token required periodic reauthentication, but the agent's recovery instructions misdiagnosed the problem. The agent flagged the same error on fifteen consecutive runs. Each time, the fix appeared to work for one session, then failed again the next day. A runbook entry with the correct token type, its actual expiry behavior, and the right recovery procedure would have reduced fifteen days of stale data to one. The fix was not a code change. It was a documentation correction: the credential's actual authentication policy did not match what was written down.
Decommissioning leaves orphaned references. When a team pauses an agent without updating the synthesizer's read list, the CMO agent flags a "missing brief" every run. Over a few weeks, the operator learns to ignore that flag. Once they start ignoring one anomaly, they start ignoring others. The monitoring surface degrades not because it broke, but because it cried wolf. A decommissioning checklist (four steps in the sample page above, under a minute) prevents this entirely.
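An orphaned reference is also easy to detect before it turns into a recurring false alarm. This is a minimal sketch, assuming the synthesizer's read list is available as a list of agent prefixes and the set of agents that still have a scheduled task is known. Both inputs and the `orphaned_reads` name are hypothetical; the post does not specify how the read list is stored.

```python
def orphaned_reads(read_list: list[str], active_agents: set[str]) -> list[str]:
    """Return read-list entries that no longer match an active agent.

    Assumption: both arguments use the same agent prefixes, such as
    "content" in the sample runbook page.
    """
    return [name for name in read_list if name not in active_agents]


if __name__ == "__main__":
    # "content" was paused, but the synthesizer still expects its brief.
    print(orphaned_reads(["content", "seo"], {"seo"}))  # ['content']
```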
Escalation defaults to "ask the person who built it." Without a documented escalation path, every agent failure becomes a support request to whoever set up the fleet. That works for the first month. It stops working when that person is unavailable, changes roles, or when the fleet has grown past what one person can debug from memory. The runbook's escalation section is the difference between "restart the task and re-run" (self-service, two minutes) and "wait for someone to look at this" (blocked, unknown timeline).
These patterns are not hypothetical. They come from operating a seven-agent fleet against a production codebase for months. Every one of them was preventable with a document that took less time to write than the incident took to resolve. For more on what the first weeks of ownership look like, see what happens after the AI consultancy leaves.
Where the Runbook Lives
The runbook is a markdown file in the repo, committed to main, versioned like everything else. Not a Notion page, not a Google Doc, not a wiki behind a separate login. The same repository that holds the agent prompts and state files holds the runbook.
Armada delivers it as docs/agents/runbook.md in every engagement. The file is editable with any text editor. When you add a new agent, you copy the five-section template, fill in the specifics, and commit. When you rotate a credential, you update the rotation section in the same commit. The runbook stays current because updating it is the same workflow as maintaining the fleet.
The permissions audit covers the credential scoping that feeds the runbook's permissions section. The monitoring post covers the daily brief pattern that feeds the monitoring section. Together with this post, those three form the operational baseline for running an agent fleet you actually own.
If you want help building a runbook for your specific stack, book a 30-minute discovery call. Armada Works includes a production runbook in the deliverables of every Pilot and Transfer engagement.
Frequently Asked Questions
What is an AI agent runbook?
An AI agent runbook is a structured reference document that covers monitoring, permissions, credential rotation, escalation paths, and decommissioning steps for each agent in a fleet. The prompt tells the agent what to do. The runbook tells the human operator what to do when something goes wrong or needs maintenance.
How often should you update an agent runbook?
Update the runbook whenever you change something it documents: a new credential, a changed cadence, a modified escalation path, a paused or added agent. The easiest way to keep it current is to update it in the same git commit as the change it describes. Quarterly reviews catch anything that drifted between updates.
What happens when an agent fails without a runbook?
The operator has to reconstruct the agent's credential set, expected behavior, and recovery steps from memory or by reading the prompt file. For simple failures (a missed run, a brief that looks wrong), this costs ten minutes of investigation. For complex failures (expired credentials, cross-agent coordination issues), it can cost hours or days, especially if the person who built the fleet is unavailable.
Can you automate runbook checks?
Partially. Credential expiry dates can be monitored with a scheduled script that checks token validity and alerts before expiration. Brief format consistency can be validated with a linter that checks for required sections. But the highest-value runbook checks (is the agent's output still aligned with the prompt? is the escalation path still correct?) require a human reading the document against the current state of the fleet.
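As an illustration of the first kind of check, here is a small sketch that flags credentials nearing expiry. It assumes the expiry dates are copied by hand from the runbook's rotation section, with `None` for credentials that have no automatic expiry. It compares recorded dates only and does not call any provider API to verify that a token still works. `EXAMPLE_OAUTH_TOKEN`, its date, and the 14-day warning window are invented for the example.

```python
from datetime import date, timedelta
from typing import Optional


def expiring_soon(
    credentials: dict[str, Optional[date]],
    today: date,
    warn_days: int = 14,
) -> list[str]:
    """Return names of credentials that expire within `warn_days` of `today`.

    Already-expired credentials are included. A value of None means the
    credential has no automatic expiry and is never flagged.
    """
    cutoff = today + timedelta(days=warn_days)
    return sorted(
        name
        for name, expires in credentials.items()
        if expires is not None and expires <= cutoff
    )


if __name__ == "__main__":
    creds = {
        "AGENT_REPORT_KEY": None,                  # no automatic expiry
        "EXAMPLE_OAUTH_TOKEN": date(2025, 1, 10),  # hypothetical credential
    }
    print(expiring_soon(creds, today=date(2025, 1, 1)))  # ['EXAMPLE_OAUTH_TOKEN']
```

A scheduled task could run this daily and post the result alongside the agent briefs, so rotation happens before a credential lapses rather than after.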
How is a runbook different from an agent prompt?
The prompt is instructions for the agent. The runbook is instructions for the human operator. The prompt tells the Content agent to draft blog posts from a queue, write briefs, and commit to main. The runbook tells the operator what a healthy run looks like, which credential the agent uses, how to rotate that credential, and how to shut it down without breaking the rest of the fleet. Both are necessary. Neither replaces the other.