6 min read

Rollback Strategies for Agent Chains: When Things Go Wrong

Mentiko Team

Your agent chain was working fine on Tuesday. You updated one agent's prompt on Wednesday. By Thursday, the output is wrong and three downstream systems are ingesting garbage data. You need to roll back, and you need to do it without making things worse.

Rollback in agent orchestration is harder than rollback in traditional software. There's no binary to revert. Prompts are text, models have versions, and the chain may have already produced bad output that propagated to other systems. Here's how to handle it.

Why agent chains are harder to roll back

In traditional deployment, a rollback means deploying the previous build. The code is deterministic -- same input, same output. Rolling back restores the exact previous behavior.

Agent chains have three problems that make rollback harder:

Non-deterministic output. Even with the same prompt and input, an LLM can produce different output. Rolling back to a previous prompt doesn't guarantee you'll get the same output you were getting before.

Propagated side effects. If an agent sent emails, updated a database, or posted to Slack, those actions already happened. Rolling back the chain doesn't undo the side effects.

Model version drift. You might not have changed anything. The model provider updated their model, and your chain's behavior changed anyway. Rolling back your chain definition won't fix this.

Version pinning: The foundation

Before you can roll back, you need something to roll back to. Version everything.

Pin your chain definitions

Store chain definitions in version control. Every change to a chain JSON file gets a commit. When you need to roll back, you check out the previous version.

# Current chain definition
chains/sales-intel/chain.json  (commit abc123)

# Roll back to yesterday's version
git log --oneline chains/sales-intel/chain.json
# abc123 Update analyst prompt for brevity
# def456 Add quality gate after writer
# ghi789 Initial chain

git checkout def456 -- chains/sales-intel/chain.json

In Mentiko, chain definitions are JSON files. Treat them like code. Review changes in PRs. Tag releases. Never edit production chains directly.

Pin your prompts

Prompts change more often than chain structure. Version them separately.

chains/sales-intel/
  chain.json
  prompts/
    analyst-v3.md    (active)
    analyst-v2.md    (previous)
    analyst-v1.md    (original)

When you update a prompt, create a new version file and update the chain definition to reference it. The old version stays in the directory. Rolling back means changing one line in chain.json.
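Concretely, the rollback is a one-line diff in chain.json:

```diff
 {
   "agent": "analyst",
   "model": "gpt-5.4-2026-03-15",
-  "prompt": "prompts/analyst-v3.md"
+  "prompt": "prompts/analyst-v2.md"
 }
```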

Pin your model versions

If your LLM provider offers model version pinning, use it. gpt-5.4-2026-03-15 is safer than gpt-5.4. The generic alias can change under you without notice.

{
  "agent": "analyst",
  "model": "gpt-5.4-2026-03-15",
  "prompt": "prompts/analyst-v3.md"
}

When a new model version is available, test it on your chain before switching. Treat model upgrades like dependency upgrades in software.

Rollback patterns

Pattern 1: Prompt rollback

The most common rollback. You changed a prompt and the output degraded.

Steps:

  1. Identify which agent's output changed. Check the run history and compare recent output to historical output.
  2. Revert the agent's prompt reference in chain.json to the previous version.
  3. Run the chain manually and validate the output.
  4. If output is good, deploy the reverted chain definition.

Gotcha: If you changed the prompt to fix a bug, reverting means the bug comes back. Consider whether the original bug is less harmful than the new regression.
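The revert in step 2 can be scripted. A minimal Python sketch, assuming chain.json holds an `agents` array of objects shaped like the snippets above (the helper name and the `agents` key are hypothetical; adjust to your schema):

```python
import json

def rollback_prompt(chain_path, agent_name, previous_prompt):
    """Point one agent back at a previous prompt version.

    Assumes chain.json contains an "agents" array of objects like
    {"agent": ..., "model": ..., "prompt": ...}.
    """
    with open(chain_path) as f:
        chain = json.load(f)
    for agent in chain["agents"]:
        if agent["agent"] == agent_name:
            agent["prompt"] = previous_prompt
    with open(chain_path, "w") as f:
        json.dump(chain, f, indent=2)
```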

Pattern 2: Chain structure rollback

You added or removed agents, changed the flow, or modified event contracts.

Steps:

  1. Revert chain.json to the previous commit.
  2. Check if any new event files or configs were added. Revert those too.
  3. Verify that all agents referenced in the old chain definition still exist.
  4. Run the chain manually end-to-end.

Gotcha: If you deleted an agent that the old chain definition references, rolling back the definition isn't enough. You need the agent's prompt and config back too. This is why you version everything and never delete old files.

Pattern 3: Gradual rollback

When you're not sure which change caused the regression, roll back incrementally.

Steps:

  1. List all changes since the last known-good state (prompt changes, config changes, model version updates).
  2. Revert the most recent change.
  3. Run the chain and check output.
  4. If still broken, revert the next most recent change.
  5. Repeat until output is good.

This is slow but safe. It also tells you exactly which change caused the problem, which is valuable information for the fix.
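The loop above can be sketched as a small driver. `revert_change` and `run_and_check` are hypothetical hooks into your own tooling:

```python
def gradual_rollback(changes, revert_change, run_and_check):
    """Revert changes newest-first until the chain's output passes.

    `changes` is ordered newest-first. Returns the change whose revert
    fixed the output (the likely culprit), or None if none did.
    """
    for change in changes:
        revert_change(change)
        if run_and_check():
            return change
    return None
```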

Pattern 4: Side-by-side comparison

Run the old and new chain versions in parallel. Compare their outputs to identify divergence.

Input -> Old Chain (v2) -> Output A
Input -> New Chain (v3) -> Output B

Compare A and B. Which is correct?

In Mentiko, you can clone a chain and run both versions on the same input. This is the safest way to validate a rollback before committing to it.
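A sketch of the comparison harness; `run_chain` is a hypothetical hook that executes a pinned chain version on an input and returns its output:

```python
def compare_versions(run_chain, input_doc, old_version, new_version):
    """Run both chain versions on the same input and summarize divergence."""
    old_out = run_chain(old_version, input_doc)
    new_out = run_chain(new_version, input_doc)
    return {
        "identical": old_out == new_out,
        "length_delta": len(new_out) - len(old_out),
    }
```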

Output validation as rollback prevention

The best rollback is the one you never need. Output validation catches regressions before they hit production.

Schema validation

Every event between agents should have a schema. If an agent produces output that doesn't match the schema, the chain stops before the bad data propagates.

{
  "event": "analysis-complete",
  "schema": {
    "required": ["summary", "key_changes", "risk_level"],
    "properties": {
      "summary": { "type": "string", "minLength": 100 },
      "key_changes": { "type": "array", "minItems": 1 },
      "risk_level": { "type": "string", "enum": ["low", "medium", "high"] }
    }
  }
}

Quality gates

A quality gate is a dedicated agent that scores the output of the agent before it. If the score falls below a threshold, the chain pauses and notifies a human instead of continuing with bad output.

Writer -> QualityGate -> (score >= 0.8) -> Publisher
                      -> (score < 0.8)  -> Human Review Queue

Quality gates are cheap insurance. A $0.10 quality check prevents a $100 incident cleanup.
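The routing logic itself is tiny; `score_fn` stands in for a hypothetical scoring agent that returns a value between 0.0 and 1.0:

```python
def route_output(draft, score_fn, threshold=0.8):
    """Send the draft onward if it scores at or above the threshold,
    otherwise park it in the human review queue."""
    return "publisher" if score_fn(draft) >= threshold else "human-review-queue"
```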

Baseline comparison

Compare each run's output statistics to a rolling baseline. If output length, sentiment, structure, or key metrics deviate significantly from the baseline, flag the run for review.

Average output length: 500 words (baseline)
This run's output: 50 words

ALERT: Output length 90% below baseline. Run flagged for review.
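The length check against a rolling baseline can be sketched like this (the 50% deviation threshold is illustrative):

```python
from statistics import mean

def check_length(current, recent, max_deviation=0.5):
    """Flag the run if output length deviates from the rolling mean
    by more than max_deviation (expressed as a fraction of baseline)."""
    baseline = mean(recent)
    deviation = abs(current - baseline) / baseline
    return {"baseline": baseline, "deviation": deviation,
            "flagged": deviation > max_deviation}
```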

The rollback playbook

When something goes wrong, follow this sequence:

1. Stop the bleeding. Pause the chain's schedule. No more bad runs while you investigate.

2. Assess the blast radius. How many runs produced bad output? Which downstream systems consumed it? Who saw it?

3. Quarantine if needed. If bad output reached customers or critical systems, quarantine it immediately.

4. Identify the cause. Check the change log. What changed since the last known-good run? Prompt? Chain structure? Model version? Input data?

5. Roll back. Use the appropriate pattern from above.

6. Validate. Run the rolled-back chain and confirm output is correct.

7. Re-process if needed. If bad output was quarantined, re-run the chain for the affected period and replace the quarantined data.

8. Resume the schedule. Turn the cron back on.

9. Post-mortem. Document what went wrong, what the blast radius was, and what validation would have caught it. Add that validation to the chain.

Building rollback into your workflow

Rollback shouldn't be an emergency procedure. It should be part of your standard deployment process.

Before deploying a chain change:

  • Tag the current version as known-good
  • Document what you're changing and why
  • Define what "broken" looks like (how will you know if this change is bad?)
  • Plan your rollback steps before you deploy

After deploying:

  • Monitor the first 3 runs closely
  • Compare output to the previous version
  • Check downstream systems for anomalies

Automate where possible:

  • Automatic schema validation on every event
  • Quality gates on high-stakes agent output
  • Baseline deviation alerts
  • One-command rollback scripts
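A one-command rollback script can be as small as a wrapper around the git checkout shown earlier (the tag name and chain path are illustrative):

```python
import subprocess

def rollback_to_tag(tag, chain_dir="chains/sales-intel"):
    """Restore a chain directory from a known-good git tag, then remind
    the operator to validate before resuming the schedule."""
    subprocess.run(["git", "checkout", tag, "--", chain_dir], check=True)
    return f"Restored {chain_dir} from {tag}; validate before resuming."
```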

The teams that recover fastest from chain failures aren't the ones with the best debugging skills. They're the ones who planned for failure before it happened.


Learn more about production-grade chains: Monitoring agents in production or debugging agent chains.
