Logging Best Practices for Multi-Agent Pipelines
Mentiko Team
A 5-agent chain runs 200 times a day. That's 1,000 agent executions producing logs. Within a week, that's 7,000 executions -- tens of thousands of log entries. Without structure, this is a wall of text that nobody will ever read. With structure, it's a searchable, filterable, auditable record that makes debugging fast and compliance audits painless.
Most teams treat logging as an afterthought -- something they add when a chain breaks and they realize they have no idea what happened. Here's how to do it right from the start.
The three purposes of agent chain logs
Agent chain logs serve three distinct audiences with different needs:
Debugging. When the chain produces wrong output, you need to find which agent went wrong, what input it received, what it produced, and why. This requires detailed, per-agent logs with timestamps, inputs, outputs, and reasoning.
Auditing. When a stakeholder asks "why did the system make this decision," you need to show the chain of reasoning across all agents. This requires a complete run trace that connects agent outputs to the final result.
Compliance. When a regulator asks "how do you ensure your AI systems operate correctly," you need to demonstrate systematic logging, retention, and review processes. This requires structured logs with consistent schemas, retention policies, and access controls.
Every logging decision should serve at least one of these purposes. If a log entry doesn't help debugging, auditing, or compliance, it's noise.
Per-agent log isolation
The single most important logging decision: every agent gets its own log context. Don't dump all agent logs into one stream.
```
runs/
  2026-03-19-001/
    chain.log           # Chain-level events (start, complete, timing)
    agents/
      extractor/
        input.json      # What the agent received
        output.json     # What the agent produced
        execution.log   # Agent's internal log (reasoning, decisions)
        metrics.json    # Duration, token count, model, cost
      analyzer/
        input.json
        output.json
        execution.log
        metrics.json
      formatter/
        input.json
        output.json
        execution.log
        metrics.json
```
This structure mirrors Mentiko's file-based event system. Each agent's work is isolated in its own directory. When the formatter produces bad output, you go straight to runs/2026-03-19-001/agents/formatter/ and see exactly what happened -- the input it received, the output it produced, and its internal log.
Compare this to a single chain.log file with interleaved messages from all agents. Finding the formatter's logs means grepping through thousands of lines from other agents. Per-agent isolation eliminates this noise immediately.
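Creating this layout takes a few lines of code. Here's a minimal sketch in Python -- the `init_agent_dir` and `record_agent_io` helper names are illustrative, not part of any framework:

```python
import json
from pathlib import Path

def init_agent_dir(runs_root: str, run_id: str, agent: str) -> Path:
    """Create the per-agent log directory for one chain run."""
    agent_dir = Path(runs_root) / run_id / "agents" / agent
    agent_dir.mkdir(parents=True, exist_ok=True)
    return agent_dir

def record_agent_io(agent_dir: Path, input_payload: dict,
                    output_payload: dict, metrics: dict) -> None:
    """Write input, output, and metrics as separate JSON files,
    so each artifact can be inspected (or diffed) on its own."""
    (agent_dir / "input.json").write_text(json.dumps(input_payload, indent=2))
    (agent_dir / "output.json").write_text(json.dumps(output_payload, indent=2))
    (agent_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
```

Each agent calls `init_agent_dir` once at startup and `record_agent_io` on completion; nothing is shared between agents, so there's no interleaving to untangle later.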
Structured log format
Free-text logs are easy to write and hard to search. Structured logs are marginally harder to write and dramatically easier to analyze.
Every log entry should include:
```json
{
  "timestamp": "2026-03-19T14:23:45.123Z",
  "run_id": "2026-03-19-001",
  "agent": "analyzer",
  "level": "info",
  "event": "analysis_complete",
  "message": "Identified 3 risk factors in contract",
  "data": {
    "risk_factors_count": 3,
    "processing_time_ms": 1245,
    "input_tokens": 3200,
    "output_tokens": 890,
    "model": "claude-sonnet-4-20250514",
    "confidence": 0.87
  }
}
```
The key fields: run_id ties the entry to a specific chain execution. agent identifies which agent produced it. level enables filtering by severity. event is a machine-readable event type for automated analysis. data contains the structured payload.
This format works with every log aggregation tool: ELK, Datadog, Loki, CloudWatch, Splunk. You can filter by agent, search by run ID, aggregate by event type, and alert on specific conditions. Try doing that with print("analysis done").
What to log at each level
Not everything needs the same log level. Over-logging at high severity creates alert fatigue. Under-logging at critical points creates blind spots.
ERROR -- Something broke.
- Agent execution failed (API timeout, rate limit, crash)
- Event schema validation failed
- Required input missing
- Output quality below minimum threshold
WARN -- Something is off but the chain continued.
- Agent took longer than expected
- Fallback path triggered
- Output near quality threshold
- Retry needed (and succeeded)
- Token count approaching context window limit
INFO -- Normal operations worth recording.
- Agent started / completed
- Input received / output produced
- Key decisions made (which branch taken, classification result)
- Checkpoint written
DEBUG -- Detailed internals for deep diagnosis.
- Full prompt sent to model
- Full model response (before parsing)
- Intermediate reasoning steps
- Event file read/write operations
In production, run at INFO level. When debugging a specific issue, enable DEBUG for the suspect agent only. Never run DEBUG for all agents in production -- the log volume will be enormous and the storage costs will surprise you.
Logging model interactions
The most valuable debug information in an agent chain is what went to the model and what came back. Log the full model interaction for every agent:
```json
{
  "timestamp": "2026-03-19T14:23:44.001Z",
  "run_id": "2026-03-19-001",
  "agent": "classifier",
  "level": "debug",
  "event": "model_request",
  "data": {
    "model": "llama3.1:8b",
    "provider": "ollama",
    "system_prompt_tokens": 245,
    "user_prompt_tokens": 1890,
    "temperature": 0,
    "max_tokens": 100
  }
}

{
  "timestamp": "2026-03-19T14:23:45.123Z",
  "run_id": "2026-03-19-001",
  "agent": "classifier",
  "level": "debug",
  "event": "model_response",
  "data": {
    "output_tokens": 12,
    "latency_ms": 1122,
    "finish_reason": "stop",
    "response_preview": "CONTRACT_TYPE: NDA"
  }
}
```
Note the response_preview instead of the full response in the structured log. Store the full response in the agent's output.json file, not in the log stream. Log entries should be compact enough to aggregate efficiently. Full model responses -- which can be thousands of tokens -- belong in the per-agent output files.
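The split is easy to enforce at the point of logging. A sketch, assuming a per-agent directory as above (the `log_model_response` helper and the 80-character preview length are illustrative choices):

```python
import json
from pathlib import Path

PREVIEW_CHARS = 80  # assumed local convention, not a standard

def log_model_response(agent_dir: Path, full_response: str) -> dict:
    """Persist the full response to the agent's output.json;
    return a compact entry (preview only) for the log stream."""
    agent_dir.mkdir(parents=True, exist_ok=True)
    (agent_dir / "output.json").write_text(
        json.dumps({"response": full_response}, indent=2))
    return {
        "event": "model_response",
        "data": {"response_preview": full_response[:PREVIEW_CHARS]},
    }
```

The log stream stays small and aggregation-friendly, while the full response is always one file away when you need it.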
Correlation: tying it all together
A single chain run produces logs from 5+ agents. A single business event might trigger multiple chain runs. Without correlation IDs, connecting these logs requires manual timestamp matching.
Use three levels of correlation:
Run ID. Every log entry from a single chain execution shares the same run ID. This is the primary correlation key for debugging a specific run.
Chain ID. If multiple chains are triggered by the same business event (a contract upload triggers a review chain AND a compliance check chain), they share a chain group ID. This connects related chain runs.
Request ID. If the chain was triggered by an external event (a webhook, an API call, a scheduled job), propagate the external request ID into the chain logs. This connects the chain run to the upstream system.
```json
{
  "run_id": "2026-03-19-001",
  "chain_group_id": "contract-upload-abc123",
  "external_request_id": "webhook-xyz789",
  "agent": "extractor",
  "event": "extraction_complete"
}
```
With all three IDs, you can trace from a customer complaint ("my contract review was wrong") to the external trigger to the specific chain runs to the specific agent that produced the bad output. Without them, you're matching timestamps by hand.
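In Python, `contextvars` lets you set the three IDs once at chain start and stamp them onto every entry without threading them through every function signature. A sketch (variable and helper names are hypothetical):

```python
import contextvars

# Set once when the chain starts; visible to every log call in that context.
run_id_var = contextvars.ContextVar("run_id", default=None)
chain_group_var = contextvars.ContextVar("chain_group_id", default=None)
request_id_var = contextvars.ContextVar("external_request_id", default=None)

def correlated(entry: dict) -> dict:
    """Stamp all three correlation IDs onto a log entry."""
    entry.update({
        "run_id": run_id_var.get(),
        "chain_group_id": chain_group_var.get(),
        "external_request_id": request_id_var.get(),
    })
    return entry
```

Because context variables are isolated per async task, concurrent chain runs can't leak each other's IDs.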
Retention policies
Logs cost money to store. The question is how long to keep them.
DEBUG logs: 7 days. These are only needed for active debugging. Once the issue is resolved, they're noise.
INFO logs: 30-90 days. Long enough to investigate issues that surface on a delay (a client notices a bad output a week later) and to do trend analysis on chain performance.
ERROR and WARN logs: 1 year. These are your incident history. They reveal patterns: does this agent fail every Tuesday at 3 AM? Is this error rate increasing over time?
Compliance-relevant logs (model inputs, outputs, decisions): Whatever your industry requires. Healthcare (HIPAA): 6 years. Financial services (SOX): 7 years. Legal: varies by jurisdiction. If in doubt, ask your compliance team before you deploy, not after.
Full model interactions (DEBUG-level prompt and response): 30 days unless compliance requires longer. These are the largest logs by volume and the most expensive to retain.
Implement tiered storage. Recent logs in hot storage (searchable, fast access). Older logs in cold storage (compressed, archived, slow access). Compliance logs in immutable storage (write-once, tamper-evident).
Practical implementation
Here's a minimal logging setup for a new chain:
- Enable per-agent directory structure (Mentiko does this by default with file-based events)
- Add structured log output to each agent's prompt: "Log your reasoning and decisions in structured format"
- Capture model request/response metadata (tokens, latency, model version)
- Set run ID on chain start, propagate to all agents
- Configure retention: 7 days for debug, 30 days for info, 1 year for errors
This takes about 30 minutes of configuration work up front. It saves hours of blind debugging the first time something goes wrong. And when a stakeholder asks "why did the system do that," you have the answer in seconds instead of guesses.
Log early, log structured, log consistently. Your future self -- the one debugging a chain failure at 11 PM -- will thank you.
Building observable chains? See monitoring best practices or the debugging guide.