How to Monitor Your AI Agent Fleet in Production

AI agent monitoring is not APM. Learn what to watch for, how daily briefs replace dashboards, and the patterns that catch drift before it compounds.

[Hero image: maritime navigation screen displaying electronic charts and route data in a dark ship bridge. Photo by Leon Bredella on Unsplash.]

AI agent monitoring is the practice of tracking what autonomous agents do, whether they are drifting from their instructions, and whether their output still matches what you intended. It is not application performance monitoring. It is not log aggregation. Armada Works runs seven AI agents against its own codebase on a Mon/Wed/Fri cadence, and the monitoring approach that works is not the one most engineering teams reach for first.

Traditional observability answers a binary question: is the service running? AI agent management asks a harder one: is the agent still doing the right thing? An agent can run successfully, produce zero errors, commit clean code, and still be doing the wrong work. Catching that requires a different set of tools entirely.

What "Monitoring" Means for Autonomous Agents

When most teams hear "monitoring," they think Datadog, Grafana, PagerDuty. Uptime checks, latency percentiles, error rate alerts. Those tools answer whether software is running. They do not answer whether an autonomous agent is producing useful output.

An AI agent fleet introduces failure modes that traditional monitoring cannot detect. The Content agent might draft a blog post that contradicts your positioning. The SEO agent might chase a keyword that conflicts with your brand. The Outbound agent might send prospect research that misreads the ICP. None of these trigger alerts. None of them appear in error logs. They are not bugs. They are drift.

Agent fleet observability means watching for drift, not just failures. The tools you need look more like audit logs and synthesis reports than dashboards and alert rules.

| Dimension | Traditional Monitoring | AI Agent Monitoring |
| --- | --- | --- |
| Primary question | "Is it running?" | "Is it doing the right thing?" |
| Failure mode | Crashes, timeouts, 5xx errors | Drift, misalignment, stale context |
| Detection tool | APM dashboards, error trackers | Git diffs, daily briefs, synthesis reports |
| Resolution | Restart, rollback, hotfix | Prompt tuning, queue adjustment, context refresh |
| Audit trail | Logs, metrics, traces | Git commits, state files, brief history |

The Daily Brief Pattern

The most effective monitoring pattern for an agent fleet is the daily brief. Each agent writes a structured summary of what it did, what it shipped, and what is blocking it. A synthesizer agent (the CMO in Armada's fleet) reads all of them and produces a single document for the human operator.

This is the pattern described in the earlier post on how six AI agents coordinate without a Slack channel. The coordination model doubles as the monitoring model. The same state files and briefs that keep agents aligned also give you the audit trail you need to catch problems early.

What makes the daily brief work as a monitoring tool:

  • Structured format. Every brief follows the same template: what shipped, what is in progress, what is blocked, any integrity flags. Consistent structure means you can scan in under a minute and spot anomalies by shape alone.

  • Git-backed history. Briefs are committed to the repo. You can diff Tuesday's brief against Monday's and see exactly what changed. The commit history is the audit trail, not a separate system.

  • Synthesis layer. The CMO agent reads all sub-agent briefs and writes a single summary. If you read only one thing each morning, read the synthesis. It reduces seven reports to three or four action items.

Robert Cowherd, founder of Armada Works, reviews the fleet each morning in about five minutes. Most days, nothing requires a response. The brief is the monitoring surface, and it is a document you read, not a chart you interpret.
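To make the pattern concrete, here is a minimal sketch of a brief writer in Python. The section names and file paths are illustrative, not Armada's exact format; the point is that every agent emits the same structure to a predictable location.

```python
# daily_brief.py - minimal sketch of the daily brief pattern.
# Section names and paths are illustrative, not Armada's exact format.
from datetime import date
from pathlib import Path

BRIEF_DIR = Path("agent-state/briefs")  # hypothetical state directory

TEMPLATE = """# {agent} brief — {day}

## Shipped
{shipped}

## In progress
{in_progress}

## Blocked
{blocked}

## Integrity flags
{flags}
"""

def write_brief(agent: str, shipped: list[str], in_progress: list[str],
                blocked: list[str], flags: list[str]) -> Path:
    """Render the standard template and write it to a predictable path."""
    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- none"

    body = TEMPLATE.format(
        agent=agent,
        day=date.today().isoformat(),
        shipped=bullets(shipped),
        in_progress=bullets(in_progress),
        blocked=bullets(blocked),
        flags=bullets(flags),
    )
    path = BRIEF_DIR / agent / f"{date.today().isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body)
    return path
```

Because the file lands in the repo, committing it gives you the diffable, git-backed history described above for free.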

What to Watch For

Not all drift looks the same. Here are the specific signals that matter when you are running an agent fleet in production.

Prompt drift. The agent's output gradually shifts away from its instructions. This happens when context windows accumulate stale information or when the agent makes a style decision it carries forward across sessions. Prompt drift is the most common failure mode and the hardest to detect automatically, because each individual output looks reasonable in isolation. The fix: compare the agent's prompt against its last three outputs every week. If the gap is widening, tune the prompt.
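Automating the detection is hard, but automating the setup for that review is easy. A small helper like the following (the paths and directory layout are hypothetical) pulls the prompt and the last three committed outputs so the side-by-side read takes seconds to start:

```python
# drift_review.py - gather the prompt and the last three outputs for manual review.
# Paths and directory layout are hypothetical.
import subprocess
from pathlib import Path

def last_outputs(output_dir: str, count: int = 3) -> list[str]:
    """Return the most recently committed files under output_dir, newest first."""
    log = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:", f"-{count * 5}", "--", output_dir],
        capture_output=True, text=True, check=True,
    ).stdout
    seen: list[str] = []
    for line in log.splitlines():
        if line and line not in seen:
            seen.append(line)
        if len(seen) == count:
            break
    return seen

def print_review(prompt_path: str, output_dir: str) -> None:
    print("=== PROMPT ===")
    print(Path(prompt_path).read_text())
    for path in last_outputs(output_dir):
        if not Path(path).exists():
            continue  # file was since moved or deleted
        print(f"\n=== OUTPUT: {path} ===")
        print(Path(path).read_text())

if __name__ == "__main__":
    print_review("agents/seo/prompt.md", "agents/seo/output")
```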

Cost spikes. Agents that call LLM APIs cost money per run. A sudden increase in token usage usually means the agent is looping, processing unnecessarily large context, or retrying failed operations. Track token counts per agent per run. If a session that normally uses 20,000 tokens suddenly uses 200,000, something changed.
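The check itself can be a few lines: keep each agent's historical token counts, and flag any run that lands far above its baseline. The 5x multiplier matches the checklist later in this post; the run-log structure here is illustrative:

```python
# token_watch.py - flag runs whose token usage spikes far above the agent's baseline.
# The run-log structure is illustrative; adapt to however you record usage.
from statistics import median

def flag_token_spikes(runs: dict[str, list[int]], multiplier: float = 5.0) -> dict[str, int]:
    """runs maps agent name -> token counts per run, oldest first.
    Returns agents whose latest run exceeds multiplier x their historical median."""
    flagged = {}
    for agent, counts in runs.items():
        if len(counts) < 2:
            continue  # not enough history to establish a baseline
        baseline = median(counts[:-1])
        latest = counts[-1]
        if latest > multiplier * baseline:
            flagged[agent] = latest
    return flagged

# Example: an agent that normally uses ~20,000 tokens suddenly uses 200,000.
print(flag_token_spikes({"seo": [21_000, 19_500, 20_300, 200_000]}))
# -> {'seo': 200000}
```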

Permission escalation. A well-configured agent should never request new permissions mid-run. If your agent suddenly asks for access to a tool it did not previously use, investigate before approving. It may be legitimate, or it may indicate prompt injection or drift into an unintended workflow. The agent security guardrails post covers the hook-based pattern that catches this at the shell level before the command executes.

Stale context. Agents that read from state files or external APIs can silently work with outdated information. A real example from Armada's own fleet: the SEO agent relied on expired Google Search Console credentials for fifteen consecutive runs. The agent kept running, kept producing briefs, but the data was stale. The monitoring layer that caught it was a credential-expiry flag in the state file, surfaced by the CMO synthesis.
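That kind of flag is easy to build yourself. A minimal sketch, assuming each agent keeps a JSON state file that records when its credentials and inputs were last refreshed (the field names are hypothetical):

```python
# freshness_check.py - flag state files whose credentials or inputs look stale.
# Assumes each agent writes a JSON state file with ISO-8601 timestamps; field names are hypothetical.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_AGE = timedelta(days=7)

def stale_entries(state_path: str) -> list[str]:
    state = json.loads(Path(state_path).read_text())
    now = datetime.now(timezone.utc)
    stale = []
    for key in ("credentials_refreshed_at", "inputs_refreshed_at"):
        stamp = state.get(key)
        if stamp is None:
            stale.append(f"{key}: missing")
            continue
        then = datetime.fromisoformat(stamp)
        if then.tzinfo is None:
            then = then.replace(tzinfo=timezone.utc)  # assume UTC if no offset was recorded
        age = now - then
        if age > MAX_AGE:
            stale.append(f"{key}: {age.days} days old")
    return stale
```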

Queue exhaustion. When an agent's input queue empties and no new work arrives, the agent should idle cleanly. Monitor for agents that start generating self-assigned tasks or producing output nobody requested. An agent that invents its own work is an agent that has drifted past its mandate.
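A queue check is similarly small: if the queue is empty but the latest brief still reports shipped work, flag it. A sketch, assuming a per-agent directory with a queue of task files and dated briefs (the layout and parsing are hypothetical):

```python
# queue_check.py - flag agents that produced work while their queue was empty.
# Directory layout and brief parsing are hypothetical.
from pathlib import Path

def invented_work(agent_dir: str) -> bool:
    """True if the agent's queue is empty but its latest brief still lists shipped items."""
    queue = list(Path(agent_dir, "queue").glob("*.md"))
    briefs = sorted(Path(agent_dir, "briefs").glob("*.md"))
    if queue or not briefs:
        return False  # work was queued, or there is nothing to inspect
    latest = briefs[-1].read_text()
    shipped_section = latest.split("## Shipped", 1)[-1].split("##", 1)[0]
    return any(line.strip().startswith("- ") and "none" not in line.lower()
               for line in shipped_section.splitlines())
```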

Here is what to check on a regular cadence:

  • Prompt alignment: weekly manual review (read the prompt, then read the output)
  • Token usage per run: per-agent, per-session, flagged if 5x above baseline
  • Permission requests: should be stable run to run, not growing
  • State file freshness: are inputs current or stale?
  • Queue depth: empty queues should produce idle reports, not invented work
  • Brief consistency: does the format match the template? Are required sections present?
  • Output quality: read one full draft per week, not just the brief summary
  • Commit frequency: sudden changes in volume signal something shifted

Tools and Patterns That Work

You do not need a dedicated observability platform to monitor an agent fleet. The tools most teams already have are enough.

Git log. The simplest and most underrated monitoring tool for agents. Running git log --oneline --since="7 days ago" on the agent state directory gives you a week of activity. If the commit history looks different from last week (more commits, fewer commits, different file patterns), investigate. Git is the single source of truth for what actually happened, not what the agent says happened.
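If you want the week-over-week comparison automated, a few lines of Python will do it. This sketch compares this week's commit count against last week's for the agent state directory (the directory path is an assumption):

```python
# activity_check.py - compare this week's agent commit volume against last week's.
# The state-directory path is an assumption; point it at wherever your agents commit.
import subprocess

def commit_count(since: str, until: str, path: str = "agent-state/") -> int:
    out = subprocess.run(
        ["git", "rev-list", "--count", f"--since={since}", f"--until={until}", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

this_week = commit_count("7 days ago", "now")
last_week = commit_count("14 days ago", "7 days ago")
print(f"commits this week: {this_week}, last week: {last_week}")
if last_week and (this_week > 2 * last_week or this_week < last_week / 2):
    print("commit volume shifted sharply -- worth a look")
```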

Product analytics. If your agents produce public-facing output (blog posts, landing pages, email sequences), analytics tell you whether that output is performing. Armada uses PostHog to track which blog posts get traffic and which pages convert to booking calls. This is not agent monitoring in the narrow sense, but it is the feedback loop that tells you whether the fleet's work matters.

Admin dashboard. A simple internal page that shows each agent's latest brief, last run time, and status. Armada's lives at /admin/marketing and reads from an agent_reports table. Building one takes an afternoon. It replaces the need to dig through state files manually on most days.
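The read side of a dashboard like that is a single query over the reports table. A minimal sketch against the agent_reports table mentioned above, with the column names assumed and SQLite standing in for whatever database you actually use:

```python
# latest_reports.py - pull each agent's most recent brief for an admin page.
# The table name comes from the post; column names are assumptions. Swap sqlite3 for your driver.
import sqlite3

QUERY = """
SELECT agent, status, last_run_at, brief
FROM agent_reports AS r
WHERE last_run_at = (
    SELECT MAX(last_run_at) FROM agent_reports WHERE agent = r.agent
)
ORDER BY agent;
"""

def latest_reports(db_path: str) -> list[tuple]:
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchall()
```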

Pre-commit hooks and permission gates. Monitoring does not have to be retrospective. The security guardrails described in a previous post are proactive monitoring: they block dangerous actions before they execute. A PreToolUse hook that rejects destructive commands without explicit approval is real-time monitoring at the shell level.
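The exact hook interface depends on your agent runtime, but the shape is simple: the hook sees the proposed command before it runs and refuses anything on a deny list. A minimal sketch, assuming a runtime that passes the tool call as JSON on stdin and treats a non-zero exit as a block:

```python
#!/usr/bin/env python3
# pre_tool_hook.py - block destructive shell commands before they execute.
# Assumes the agent runtime passes the proposed tool call as JSON on stdin
# and treats a non-zero exit code as "deny"; adapt to your runtime's hook contract.
import json
import re
import sys

DENY_PATTERNS = [
    r"\brm\s+-rf\b",            # recursive force delete
    r"\bgit\s+push\s+--force",  # history rewrite on a shared branch
    r"\bdrop\s+table\b",        # destructive SQL
]

def main() -> int:
    call = json.load(sys.stdin)
    command = call.get("tool_input", {}).get("command", "")
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            print(f"blocked: command matches deny pattern {pattern!r}", file=sys.stderr)
            return 2  # non-zero exit tells the runtime to reject the call
    return 0

if __name__ == "__main__":
    sys.exit(main())
```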

The pattern that ties these together: git for the audit trail, the daily brief for synthesis, the dashboard for at-a-glance status, and hooks for the hard guardrails. Each layer catches a different class of problem. None of them alone is sufficient. Together, they give you the same confidence in your agent fleet that a good CI/CD pipeline gives you in your code.

Where to Start

If you are running agents with no monitoring in place, start with the daily brief pattern. It costs nothing to implement (each agent writes a structured markdown file to a known path) and gives you the audit trail that every other layer builds on.

From there, add the synthesis agent. One agent that reads all the others and writes you a single summary. This is the highest-leverage addition to any multi-agent system, and it is the piece most teams build last when they should build it first.
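A first version of the synthesizer does not even need a model: collect today's briefs, stack them into one document, and pull the blocked and flagged sections to the top as action items. A sketch, reusing the hypothetical brief layout from earlier; an LLM pass can rewrite the result later:

```python
# synthesize.py - fold every agent's daily brief into one summary for the operator.
# Reuses the hypothetical brief layout from earlier in this post.
from datetime import date
from pathlib import Path

BRIEF_DIR = Path("agent-state/briefs")

def synthesize(day: str | None = None) -> str:
    day = day or date.today().isoformat()
    sections = []
    action_items = []
    for brief_path in sorted(BRIEF_DIR.glob(f"*/{day}.md")):
        agent = brief_path.parent.name
        text = brief_path.read_text()
        sections.append(f"## {agent}\n{text}")
        for heading in ("## Blocked", "## Integrity flags"):
            if heading in text:
                body = text.split(heading, 1)[1].split("##", 1)[0].strip()
                if body and "- none" not in body:
                    action_items.append(f"{agent}: {body}")
    summary = [f"# Fleet brief — {day}", "", "## Action items"]
    summary += [f"- {item}" for item in action_items] or ["- nothing needs a response today"]
    return "\n".join(summary + ["", *sections])

if __name__ == "__main__":
    print(synthesize())
```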

For teams that want a fleet deployed with monitoring already configured, that is what the Operate tier delivers. The daily brief, the CMO synthesis, the admin dashboard, and the security hooks all come wired on day one. You review the output. The fleet runs. Book a 30-minute discovery call to walk through the actual system.

Frequently Asked Questions

How do you monitor AI agents?

Monitor AI agents by implementing a daily brief pattern where each agent writes a structured report of what it shipped, what is blocked, and any anomalies. A synthesizer agent reads all reports and produces a single summary for the human operator. Back everything with git commits for a full audit trail.

What metrics matter for an agent fleet?

The most important metrics are prompt alignment (is the agent still following instructions), token usage per run (cost and complexity), state file freshness (is the agent working with current data), and output quality (verified through periodic spot checks). Traditional uptime metrics matter less because agents can be "up" while doing the wrong work.

How do you know if an agent is drifting?

Drift shows up as a gradual divergence between the agent's prompt and its actual output. The most reliable detection method is a weekly manual comparison: read the agent's prompt, then read its last three outputs side by side. If the gap is widening, the prompt needs tuning. Automated detection is possible but harder, because each individual output may look reasonable on its own.

What happens when an agent breaks?

Agent failures fall into two categories: hard failures (crashes, API errors, permission denials) and soft failures (drift, stale context, invented work). Hard failures surface through error logs and non-zero exit codes. Soft failures require the synthesis layer: a CMO agent that reads all briefs and flags inconsistencies. The fix for soft failures is usually prompt adjustment, context refresh, or queue correction, not a code change.

Do you need a dedicated observability platform for AI agents?

Not for fleets under ten agents. The tools most teams already have (git, product analytics, a simple admin dashboard) cover the critical monitoring needs. Dedicated platforms add value at scale, with dozens of agents across multiple teams, but they are unnecessary at the size where most teams start. The daily brief pattern and git history handle the fundamentals.

How often should you review agent output?

Read the daily synthesis every morning (five minutes). Spot-check one full agent output per week (fifteen minutes). Re-read each agent's prompt against its recent output monthly (thirty minutes per agent). This cadence catches drift early without turning monitoring into a full-time job.

Written by
Robert Cowherd