
Observability for AI Agent Pipelines: Beyond console.log

Mentiko Team

You deployed your agent chain. It ran three times and you watched the output. It looked good. Then you went to bed. By morning it had run 47 more times, burned $180 in API calls, and the last 12 runs produced empty output because a rate limit kicked in at 3am.

Nobody was watching. That's the problem.

Agent pipelines need observability -- real observability, not console.log("chain started") scattered through your scripts. Here's how to build it.

Why agent observability is harder than traditional monitoring

Traditional application monitoring tracks request/response cycles. A request comes in, processing happens, a response goes out. You measure latency, error rate, throughput. The happy path is well-defined and the failure modes are understood: timeouts, 500s, connection refused.

Agent pipelines break every assumption that traditional monitoring relies on.

First, duration is unpredictable. A chain run might take 30 seconds or 15 minutes depending on what the agents find. There's no "expected response time" to alert against -- you need adaptive baselines.

Second, success is ambiguous. An agent can return HTTP 200 with confidently wrong output. A writing agent can produce a grammatically perfect article that completely ignores the source material. The run "succeeded" in every technical sense while producing garbage.

Third, costs are variable and opaque. One run costs $0.03. The next costs $2.40 because the research agent found a long document and fed 90k tokens to the analysis step. You don't know this until you check the bill.

Fourth, agents interact with external systems in unpredictable ways. They call APIs, scrape pages, execute shell commands. Each agent is a miniature application with its own dependency graph. A monitoring system that only watches the orchestrator misses everything interesting.

You need a monitoring approach built for this reality. We break it into three pillars.

Pillar 1: Events (what happened)

Every significant moment in a chain run should produce a structured event. Not a log line -- a structured JSON document that's machine-parseable and human-readable.

A chain run in Mentiko produces events like this:

{
  "run_id": "run_a8f3c2e1",
  "chain": "content-pipeline",
  "event": "agent:complete",
  "agent": "researcher",
  "status": "success",
  "started_at": "2026-03-19T02:14:33Z",
  "completed_at": "2026-03-19T02:14:58Z",
  "duration_ms": 25012,
  "input_tokens": 1240,
  "output_tokens": 3891,
  "model": "claude-sonnet-4-20250514",
  "cost_usd": 0.0183,
  "output_path": ".outputs/researcher/findings.md",
  "output_bytes": 4210
}

Every field matters. run_id lets you correlate across agents. duration_ms feeds your performance baselines. input_tokens and output_tokens power your cost tracking. output_bytes catches empty output without reading the file.

These events are files. They live in .events/ alongside the coordination events that trigger downstream agents. You can grep them, jq them, feed them to any log aggregator, or just ls -lt to see what happened in order.

# What happened in the last run?
jq -r '[.agent, .status, .duration_ms, .cost_usd] | @tsv' \
  .events/run_a8f3c2e1/*.event

# researcher   success   25012   0.0183
# writer       success   41209   0.0291
# editor       success   18443   0.0097
# publisher    success    3201   0.0000

File-based events have a property that centralized logging doesn't: they survive infrastructure failures. If your log aggregator goes down, the events are still on disk. If the orchestrator crashes mid-run, the events already written tell you exactly where it stopped.
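An event writer built on this idea should make each event file appear atomically, so a reader never sees half-written JSON. Here's a minimal sketch; the `emit_event` helper and the `.events/<run_id>/` layout follow the structure shown above, but the function name and file-naming scheme are our own illustration, not a Mentiko API:

```python
import json
import os
import time
import uuid
from pathlib import Path

def emit_event(run_id: str, payload: dict, events_dir: str = ".events") -> Path:
    """Write one structured event as its own JSON file.

    Writes to a temp file first, then renames it into place. On POSIX
    filesystems the rename is atomic, so a concurrent reader (or a jq
    one-liner) never observes a partially written event.
    """
    run_dir = Path(events_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp prefix keeps `ls` ordering == event ordering.
    name = f"{time.time_ns()}_{uuid.uuid4().hex[:8]}.event"
    tmp = run_dir / (name + ".tmp")
    final = run_dir / name
    tmp.write_text(json.dumps(payload, indent=2))
    os.rename(tmp, final)  # atomic on the same filesystem
    return final
```

The timestamp-prefixed filename is what makes `ls -lt` a usable timeline viewer: no index file, no database, just lexical order.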

Pillar 2: Logs (agent output)

Events tell you what happened. Logs tell you what the agent was thinking.

The critical design decision: each agent writes to its own log file. Not a shared log. Not stdout interleaved with other agents. One file per agent per run.

.runs/run_a8f3c2e1/
  researcher.log
  researcher.stdout
  writer.log
  writer.stdout
  editor.log
  editor.stdout

The .log file contains the agent's structured execution trace -- what prompt was sent, what the model returned, what tools were called. The .stdout file captures raw standard output from whatever the agent executed (shell commands, scripts, API calls).

This isolation matters when you're debugging. You don't search through a 10,000-line combined log trying to figure out which output belongs to which agent. You open one file. The context for that agent's entire execution is right there.

It also means you can ship different agents' logs to different places. Your security-sensitive agents' logs go to an audit trail. Your noisy research agents' logs go to cold storage. Your critical publisher agent's logs go to real-time monitoring.

Pillar 3: Metrics (cost and time)

Metrics are aggregated numbers derived from events. The essential ones:

Per-run cost. Sum of all API costs for every agent in the run. This is the number that tells you if something went wrong financially.

{
  "run_id": "run_a8f3c2e1",
  "chain": "content-pipeline",
  "total_cost_usd": 0.0571,
  "total_duration_ms": 87865,
  "total_input_tokens": 14320,
  "total_output_tokens": 12408,
  "agent_costs": {
    "researcher": 0.0183,
    "writer": 0.0291,
    "editor": 0.0097,
    "publisher": 0.0000
  }
}
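A summary like this is just a fold over the per-agent completion events from Pillar 1. A sketch, assuming events shaped like the `agent:complete` example earlier (the `summarize_run` name is ours):

```python
def summarize_run(events: list[dict]) -> dict:
    """Roll per-agent completion events up into a run-level metrics record."""
    done = [e for e in events if e.get("event") == "agent:complete"]
    return {
        "run_id": done[0]["run_id"] if done else None,
        "total_cost_usd": round(sum(e["cost_usd"] for e in done), 4),
        "total_duration_ms": sum(e["duration_ms"] for e in done),
        "total_input_tokens": sum(e["input_tokens"] for e in done),
        "total_output_tokens": sum(e["output_tokens"] for e in done),
        # Per-agent breakdown: this is the field you sort when hunting
        # for the expensive agent in an expensive run.
        "agent_costs": {e["agent"]: e["cost_usd"] for e in done},
    }
```

Because events are files on disk, this can run after the fact over any historical run -- no instrumentation you forgot to enable, no retention window you aged out of.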

Rolling averages. The last 20 runs averaged $0.05 and 82 seconds. This becomes your baseline. Anything 3x over baseline triggers investigation.
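The baseline check itself is a few lines. A sketch of the 3x-over-rolling-average rule described above (function name and window size are illustrative):

```python
def flag_anomaly(history: list[float], current: float,
                 window: int = 20, factor: float = 3.0) -> bool:
    """Flag a run whose cost (or duration) exceeds `factor` times the
    rolling average of the last `window` runs."""
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet -- can't judge the first run
    baseline = sum(recent) / len(recent)
    return current > factor * baseline
```

The same function works for duration: pass `duration_ms` values instead of costs. The important property is that the threshold adapts as the chain's normal behavior drifts.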

Token efficiency. Output tokens divided by input tokens. Every chain has a characteristic ratio -- watch its stability, not its absolute value. If the ratio suddenly collapses (input tokens spike while output stays flat), an agent is processing bloated input and producing nothing useful.

Cost per agent over time. Track which agents are getting more expensive. Model providers change pricing. Prompts drift. Input data grows. Per-agent cost trends catch these before your monthly bill does.

Alert design: when to page a human

Bad alerts are worse than no alerts. If you page someone at 3am for a non-critical issue, you've trained them to ignore pages. Design your alert tiers deliberately.

Critical (page immediately):

  • Chain run failed 3+ times consecutively
  • Single run cost exceeds 10x the rolling average
  • An agent has been running for more than 5x its P95 duration
  • Watchdog detected a stalled agent (see below)

Warning (check within hours):

  • Run cost exceeds 3x rolling average
  • Output quality gate failed (but retried successfully)
  • Agent duration trending upward over last 20 runs

Info (review weekly):

  • Per-agent cost breakdown changed significantly
  • New error types appeared in events
  • Token efficiency dropped below threshold

The key principle: alert on trends and thresholds, not individual failures. A single failed run might be a transient API error. Three consecutive failures is a real problem.
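The tiers above can be encoded as a small classifier that runs after each run summary. A sketch with illustrative names, covering the failure-count and cost-ratio rules (the duration and quality-gate rules would slot in the same way):

```python
def alert_tier(consecutive_failures: int, cost_usd: float,
               rolling_avg_cost: float) -> str:
    """Map run signals to an alert tier: critical, warning, or info.

    Rules are checked most-severe first, mirroring the tier list above:
    3+ consecutive failures or 10x cost is a page; 3x cost is a warning.
    """
    if consecutive_failures >= 3 or cost_usd > 10 * rolling_avg_cost:
        return "critical"
    if cost_usd > 3 * rolling_avg_cost:
        return "warning"
    return "info"
```

Keeping the rules in one pure function like this makes the alert policy testable -- you can assert that a transient single failure never pages anyone.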

The watchdog pattern

Agents get stuck. An LLM call hangs indefinitely. A shell command blocks on input. A network request waits for a server that will never respond. The orchestrator thinks the agent is still running. It's not. It's dead.

The watchdog is a separate process that monitors agent liveness:

{
  "event": "watchdog:stall_detected",
  "run_id": "run_a8f3c2e1",
  "agent": "researcher",
  "started_at": "2026-03-19T02:14:33Z",
  "last_activity": "2026-03-19T02:16:01Z",
  "stall_duration_ms": 180000,
  "threshold_ms": 120000,
  "action": "terminated",
  "reason": "no stdout/event output for 3 minutes"
}

The watchdog checks two signals: event output (has the agent written anything to .events/?) and process activity (is the agent's process producing stdout?). If both are silent beyond the configured threshold, the agent is stalled.

On detection, the watchdog kills the stalled agent, emits a stall event, and the orchestrator's error handling takes over -- retry, skip, or fail the run depending on configuration.

Without a watchdog, stalled agents silently block the pipeline. Scheduled runs queue up behind the stuck one. By the time anyone notices, you have 40 pending runs and a chain that's been dead for six hours.

Dashboard design for agent monitoring

A useful agent monitoring dashboard answers four questions at a glance:

"Is anything broken right now?" Top of the dashboard. Active runs with status indicators. Failed runs in the last hour. Active alerts. This panel should be red, yellow, or green -- nothing else.

"What happened recently?" Run history as a timeline. Each run shows duration, cost, pass/fail, and the number of agents that completed. Click to drill into per-agent events and logs. Failures are visually distinct.

"Are we trending in the right direction?" Cost over time, duration over time, success rate over time. Line charts with 7-day and 30-day windows. Anomalies marked automatically. This is where you catch slow degradation that per-run alerts miss.

"What costs the most?" Cost breakdown by chain, by agent, by model. Sorted descending. The most expensive chain is at the top. The most expensive agent in that chain is highlighted. This is where you find optimization targets.

Skip the vanity metrics. Total runs executed, total tokens processed, uptime percentage -- these tell you nothing actionable. Every panel on your dashboard should answer a question that leads to a decision.

Getting started

You don't need all of this on day one. Build in layers:

  1. Structured events for every agent completion (this is free with Mentiko's file-based architecture)
  2. Per-agent log isolation (configure output directories per agent)
  3. A cost tracking script that sums token usage from events
  4. Watchdog process for long-running chains
  5. Alerts for consecutive failures and cost spikes
  6. Dashboard when you have enough history to make trends visible

The biggest mistake is shipping an agent pipeline with no observability and adding it later. "Later" means "after the first production incident where you have no data to diagnose it." Start with events. Everything else builds on top of them.


Want observability built in? See how Mentiko handles agent monitoring, or read about the event system.
