Documenting Agent Chains: What Future You Needs to Know
Mentiko Team
You built a chain three months ago. It runs on a cron. It works. Today it broke, and you have no idea why because you left zero documentation. You open the chain definition, read the agent names, and spend an hour reverse-engineering what each one does.
This is the most predictable failure in agent orchestration. Not bad prompts, not API outages -- undocumented chains that nobody can maintain. Here's how to fix it.
Why agent chains need their own documentation
Traditional software has function signatures, type systems, and tests that serve as implicit documentation. Agent chains have none of that. A prompt is a blob of natural language. An event file is a JSON structure that only makes sense if you know the context. The chain definition tells you the order of agents but not the intent.
Without documentation, every chain becomes a black box within weeks. The person who built it remembers the reasoning. Everyone else sees agent names and guesses.
Agent chain documentation serves three audiences:
- You in three months -- when the chain breaks and you've forgotten the edge cases
- Your teammates -- when they need to modify a chain they didn't build
- On-call engineers -- when the chain fails at 2am and someone needs to decide whether to restart, skip, or escalate
The chain README
Every chain should have a README in its directory. Not a novel. A single file with five sections.
Purpose
One paragraph. What does this chain do, why does it exist, and what business process does it replace or support?
## Purpose
Generates a weekly competitive intelligence brief from 15 competitor
websites. Replaces the manual process where Sarah spent Monday mornings
reading competitor blogs. Delivers to #sales-intel Slack channel by 8am.
Bad example: "This chain processes data." That tells you nothing. Good documentation answers "why does this exist?" not just "what does this do?"
Chain topology
A visual representation of the agent flow. ASCII art is fine. The goal is to show the shape of the chain at a glance.
## Topology
WebMonitor -> ChangeDetector -> Analyst -> BriefWriter -> SlackSender
                                   |
                                   v (if critical)
                               PagerAgent
Include conditional branches, quality gates, and retry loops. If the chain has fan-out/fan-in, draw it. This diagram is the first thing someone reads when they open the chain for the first time.
Agent descriptions
For each agent in the chain, document:
## Agents
### WebMonitor
- **Purpose:** Checks competitor websites for changes
- **Input:** List of URLs from config/competitors.json
- **Output:** Raw HTML diffs for changed pages
- **Model:** gpt-5.4-mini (low cost, high volume)
- **Failure mode:** If a site is down, skips it and notes the skip
- **Cost per run:** ~$0.20
### ChangeDetector
- **Purpose:** Filters noise from raw diffs
- **Input:** HTML diffs from WebMonitor
- **Output:** List of meaningful changes with categories
- **Model:** gpt-5.4 (needs reasoning for relevance filtering)
- **Failure mode:** If uncertain, includes the change (false positive > false negative)
- **Cost per run:** ~$0.50
The failure mode documentation is the most important part. When a chain produces unexpected output, knowing how each agent handles edge cases tells you where to look.
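The documented failure modes translate directly into code. Here is a minimal sketch of WebMonitor's "skip it and note the skip" behavior; the function names and event shape are illustrative assumptions, not the actual Mentiko implementation. The fetch function is injected so the skip logic can be tested without network access.

```python
def monitor(urls, fetch):
    """Fetch each URL; on failure, skip the site and record the skip,
    mirroring WebMonitor's documented failure mode (hypothetical sketch)."""
    pages, skipped = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # covers connection errors and timeouts
            skipped.append({"url": url, "reason": str(exc)})
    return {"pages": pages, "skipped": skipped}
```

Because skips are recorded rather than swallowed, the runbook entry for "WebMonitor timeout" has something concrete to point at.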
Trigger and schedule
## Schedule
Cron: `0 7 * * 1-5` (weekdays at 7am EST)
Overlap prevention: enabled (skip if previous run still active)
Timeout: 15 minutes per agent, 45 minutes total
Document the timezone. Document what happens if a run overlaps. Document the timeout values and why they're set where they are.
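One way to implement "skip if previous run still active" is an exclusive lock file, sketched below. The lock path is a hypothetical example, not a Mentiko convention; the point is that overlap prevention is simple enough that there's no excuse for leaving its behavior undocumented.

```python
import os

def try_acquire(lock_path):
    """Atomically create the lock file; return False if a run holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())  # record owner for debugging
        os.close(fd)
        return True
    except FileExistsError:
        return False  # previous run still active: skip this one

def release(lock_path):
    os.remove(lock_path)
```

A stale lock left by a crashed run is the classic failure mode here, which is exactly the kind of detail the runbook should mention.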
Runbook
The runbook is for on-call engineers at 2am. It should answer: what do I do when this breaks?
## Runbook
### Chain failed: WebMonitor timeout
**Likely cause:** A competitor website is slow or blocking our IP.
**Action:** Check events/web-monitor-error.event for the failing URL.
Remove it from config/competitors.json temporarily. Re-run the chain.
### Chain failed: Analyst output empty
**Likely cause:** ChangeDetector found zero changes (slow news day).
**Action:** This is normal. The chain will produce a "no significant
changes" brief. If this happens 5+ days in a row, check if
ChangeDetector's threshold is too high.
### Chain running longer than expected
**Normal duration:** 8-12 minutes
**Investigate at:** 20 minutes
**Action:** Check which agent is stalled in the run detail view.
Most common cause: Analyst agent hitting rate limits on the LLM API.
Write runbook entries for every failure you've seen. When you fix a new failure, add it to the runbook before you close the ticket.
Documenting agent prompts
The prompt is the agent's brain. It changes over time as you tune it. Document the reasoning behind prompt decisions, not just the prompt itself.
## Prompt notes: Analyst
v1 (2026-01): Initial prompt. Too verbose, produced 2000-word analysis.
v2 (2026-02): Added "limit to 3 key changes" instruction. Reduced to
500 words but started missing important changes.
v3 (2026-03): Changed to "identify up to 5 key changes, ranked by
business impact." Good balance of coverage and brevity.
Known sensitivity: If you remove the "ranked by business impact" clause,
the agent ranks by recency instead, which produces less useful briefs.
This history prevents the next person from making the same prompt changes you already tried and reverted.
Documenting event contracts
Events are the interfaces between agents. Document them like API contracts.
## Event: research-complete
Produced by: Researcher
Consumed by: Writer
Schema:
{
"topic": "string (the article topic)",
"sources": ["array of source objects"],
"key_points": ["array of strings, 5-10 items"],
"statistics": ["array of stat objects with source attribution"],
"word_count_target": "integer"
}
Breaking changes:
- 2026-02-15: Added "statistics" field. Writer prompt updated same day.
- 2026-03-01: Renamed "bullets" to "key_points". Required coordinated
update to Writer prompt.
When you change an event's structure, you need to update every downstream agent that reads it. The contract documentation tells you exactly which agents to update.
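A documented contract can also be enforced. This is a lightweight sketch of validating a research-complete event before the Writer consumes it; the field names come from the schema above, but the validation style itself is an assumption (a schema library would work equally well).

```python
# Required fields and their types, taken from the research-complete contract.
REQUIRED = {
    "topic": str,
    "sources": list,
    "key_points": list,
    "statistics": list,
    "word_count_target": int,
}

def validate_event(event):
    """Return a list of contract violations; empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if not errors and not 5 <= len(event["key_points"]) <= 10:
        errors.append("key_points: expected 5-10 items")
    return errors
```

Running a check like this at the boundary between agents turns a silent downstream failure into an immediate, named error.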
The documentation checklist
When you ship a new chain or modify an existing one:
- [ ] Chain README exists with all five sections
- [ ] Each agent has a description with purpose, input, output, model, and failure mode
- [ ] Event contracts are documented with schema and breaking change history
- [ ] Prompt change history is logged
- [ ] Runbook covers all known failure modes
- [ ] Design decisions are documented for non-obvious choices
- [ ] Schedule includes timezone and overlap behavior
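The first checklist item can be automated. The sketch below verifies that a chain README contains the five sections; the section names follow this article, and treating them as `## ` headings is an assumption about the README format.

```python
# The five README sections described in this article.
REQUIRED_SECTIONS = ["Purpose", "Topology", "Agents", "Schedule", "Runbook"]

def missing_sections(readme_text):
    """Return the required sections absent from a chain README."""
    return [s for s in REQUIRED_SECTIONS if f"## {s}" not in readme_text]
```

Wired into CI, this turns "README exists with all five sections" from a checkbox into a failing build.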
Making documentation stick
Documentation rots when it's disconnected from the code. Keep chain docs in the same directory as the chain definition. When someone modifies a chain, the documentation is right there.
In Mentiko, chain documentation lives alongside the chain JSON:
chains/
  sales-intel/
    chain.json
    README.md
    runbook.md
    prompts/
      web-monitor.md
      change-detector.md
      analyst.md
      brief-writer.md
Review documentation during chain code review. If a PR changes an agent prompt, the reviewer should check that prompt notes are updated. If a PR changes an event schema, the reviewer should check that the event contract is updated.
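That review rule can be backed by a pre-merge check. Here is a minimal sketch assuming the directory layout above: if a file changes, its paired documentation must change in the same diff. The pairing table is an illustrative assumption; extend it to match your own layout.

```python
# Hypothetical pairing: a change to the left file requires a change to the right.
PAIRS = {
    "chain.json": "README.md",  # topology/schedule changes need README updates
}

def missing_doc_updates(changed_files):
    """Given the files changed in a PR, list required doc updates that are absent."""
    changed = set(changed_files)
    return [f"{src} changed without updating {doc}"
            for src, doc in PAIRS.items()
            if src in changed and doc not in changed]
```

Feed it the output of `git diff --name-only` in CI and fail the build on a non-empty result.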
The cost of documentation is measured in minutes. The cost of undocumented chains is measured in hours of debugging, incorrect fixes, and broken trust when on-call engineers can't resolve incidents.
Start now
Pick your most critical chain. The one that runs in production every day. Spend 30 minutes writing its README. You'll immediately discover things about the chain you'd forgotten. That's the point.
Need a framework for your chain? See the design patterns guide or build your first documented chain.