
Idempotency in Agent Chains: Safe Retries Without Side Effects

Mentiko Team

Your agent chain runs nightly. It processes invoices, generates reports, and emails them to clients. Monday night it fails halfway through -- the email agent crashed after sending 40 of 100 emails. You fix the bug and re-run the chain Tuesday morning.

Now 40 clients got the report twice.

This is the idempotency problem. Every production agent chain will eventually need to be retried -- after crashes, timeouts, API failures, or bad deploys. If your chain isn't designed for safe retries, every recovery creates a new problem.

What idempotency means for agent chains

In traditional software, idempotency means calling the same operation multiple times produces the same result as calling it once. An HTTP PUT is idempotent. An HTTP POST is not. A database upsert is idempotent. An insert is not.

For agent chains, idempotency means: re-running a chain with the same input produces the same side effects as running it once, even if the previous run partially completed.

This is harder than it sounds. Agents produce outputs, send notifications, write files, call APIs, and update databases. Each of these side effects needs to be handled differently on retry.

The idempotency key pattern

The foundation of safe retries is an idempotency key -- a unique identifier for each chain run that all agents reference before performing side effects.

{
  "name": "invoice-processor",
  "config": {
    "idempotency_key": "{RUN_ID}",
    "checkpoint_dir": "./checkpoints/{RUN_ID}"
  },
  "agents": [
    {
      "name": "invoice-parser",
      "prompt": "Parse invoices from the input directory. For each invoice, check if a checkpoint exists at {CHECKPOINT_DIR}/{invoice_id}.parsed. Skip any invoice that already has a checkpoint. Write parsed output and create checkpoint file.",
      "triggers": ["chain:start"],
      "emits": ["parsing:complete"]
    },
    {
      "name": "report-generator",
      "prompt": "Generate a report from parsed invoices. Check if {CHECKPOINT_DIR}/report.generated exists. If it does, skip generation and emit the existing report.",
      "triggers": ["parsing:complete"],
      "emits": ["report:ready"]
    },
    {
      "name": "email-sender",
      "prompt": "Send reports to clients. For each recipient, check if {CHECKPOINT_DIR}/sent-{recipient_id}.log exists. Skip any recipient with an existing send log. Create the log file after each successful send.",
      "triggers": ["report:ready"],
      "emits": ["chain:complete"]
    }
  ]
}

Every agent checks for a checkpoint before doing work and writes one after completing. On retry, already-completed work is skipped. The RUN_ID is reused across retries of the same logical run.

Checkpoint strategies

Not all checkpoints are the same. The strategy depends on what the agent does.

File-based checkpoints

The simplest approach. Each agent writes a marker file when it completes its work.

checkpoints/run-2026-03-19-001/
  invoice-001.parsed
  invoice-002.parsed
  invoice-003.parsed    # chain crashed here
  # invoice-004.parsed  # not yet processed

On retry, the parser sees that invoices 001-003 are already done and starts with 004. File-based checkpoints work well for Mentiko because the platform already uses file-based events. The checkpoint directory is just another directory in the workspace.

Advantages: simple, inspectable, works with any tool. Disadvantages: not atomic -- if the agent crashes between doing the work and writing the checkpoint, you get a gap.
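The pattern can be sketched in a few lines of Python (the function and marker-file names are illustrative, not Mentiko APIs):

```python
import os

def process_with_checkpoints(items, checkpoint_dir, do_work):
    """Process each item at most once, skipping items with a marker file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    processed = []
    for item_id in items:
        marker = os.path.join(checkpoint_dir, f"{item_id}.parsed")
        if os.path.exists(marker):
            continue  # completed on a previous run; skip
        do_work(item_id)
        # The non-atomic gap: a crash between do_work and this write
        # means the item runs again on retry.
        open(marker, "w").close()
        processed.append(item_id)
    return processed
```

Re-running with the same checkpoint_dir skips every item that already has a marker, so only the tail of the batch is reprocessed.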

Database-backed checkpoints

For chains that interact with databases, use the database itself as the checkpoint store. Insert a record with the idempotency key before performing the operation.

INSERT INTO processed_items (idempotency_key, item_id, status, processed_at)
VALUES ('run-2026-03-19-001', 'invoice-004', 'processing', NOW())
ON CONFLICT (idempotency_key, item_id) DO NOTHING;

If the insert succeeds, proceed with processing. If it conflicts, the item was already handled. After processing completes, update the status to 'completed'. On retry, query for items where the idempotency key matches but status isn't 'completed'.

Advantages: atomic, queryable, survives workspace resets. Disadvantages: requires database access from the agent workspace.
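The claim-then-complete flow can be sketched with Python's sqlite3 module, using SQLite's INSERT OR IGNORE in place of the Postgres ON CONFLICT ... DO NOTHING above (table and column names follow the SQL example; the in-memory database is a stand-in):

```python
import sqlite3

def open_store():
    """In-memory stand-in for the processed_items checkpoint table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_items ("
        " idempotency_key TEXT, item_id TEXT, status TEXT,"
        " PRIMARY KEY (idempotency_key, item_id))"
    )
    return conn

def claim(conn, run_id, item_id):
    """Atomically claim an item. Returns False if it was already claimed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_items VALUES (?, ?, 'processing')",
        (run_id, item_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows inserted: a prior run got here first

def complete(conn, run_id, item_id):
    """Mark an item finished so retries can distinguish done from in-flight."""
    conn.execute(
        "UPDATE processed_items SET status = 'completed'"
        " WHERE idempotency_key = ? AND item_id = ?",
        (run_id, item_id),
    )
    conn.commit()
```

The primary key on (idempotency_key, item_id) is what makes the claim atomic: two concurrent retries cannot both insert the same row.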

Event-based checkpoints

Use Mentiko's event system itself as the checkpoint. Each agent emits a granular event per item processed, not just one event at the end.

{
  "name": "invoice-parser",
  "prompt": "Parse each invoice individually. Emit a parsed event per invoice.",
  "triggers": ["chain:start"],
  "emits": ["invoice:parsed"]
}

On retry, the orchestrator checks which invoice:parsed events already exist for this run and tells the parser to skip those items. The event log becomes the checkpoint log.

Advantages: built into the platform, no extra infrastructure. Disadvantages: event directory can grow large for high-volume chains.
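The retry check can be sketched as follows, assuming one event file per parsed invoice (the file-per-event naming here is an assumption for illustration, not the platform's actual layout):

```python
import os

def unparsed_invoices(all_invoices, events_dir, prefix="invoice-parsed-"):
    """Return invoices with no emitted event file yet for this run."""
    done = {
        name[len(prefix):-len(".json")]
        for name in os.listdir(events_dir)
        if name.startswith(prefix) and name.endswith(".json")
    }
    return [inv for inv in all_invoices if inv not in done]
```

The orchestrator passes the result to the parser as its work list, so already-emitted items are never reprocessed.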

Output deduplication

Checkpoints prevent duplicate work, but some side effects need deduplication at the receiver level too. Your chain might be idempotent, but the external system it talks to might not be.

API calls with provider idempotency keys

Many APIs support idempotency keys natively. Stripe, for example, accepts an Idempotency-Key header. If you send the same key twice, the second call returns the first call's result without executing again.

Build this into your agent prompts:

When calling the payment API, include the header:
Idempotency-Key: {RUN_ID}-{invoice_id}

This ensures that retrying the chain never creates duplicate payments.
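A sketch of building such a request (the header name follows Stripe's convention; the endpoint and payload are placeholders):

```python
import json
import urllib.request

def payment_request(api_url, run_id, invoice_id, payload):
    """Build a POST whose idempotency key is stable across retries."""
    key = f"{run_id}-{invoice_id}"  # same run + same invoice -> same key
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Idempotency-Key": key},
        method="POST",
    )
```

Because the key is derived from the run ID rather than a timestamp, a retried chain sends the exact same key and the provider deduplicates the call.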

Email deduplication

Email is the classic non-idempotent side effect. Once sent, you can't unsend. The checkpoint pattern from above handles this -- check the send log before sending. But add a second layer of defense: include the run ID in the email's message ID or a custom header. Your email provider's deduplication may catch duplicates that slip past your checkpoint.

File output deduplication

If your chain writes files to S3, GCS, or a local directory, use deterministic naming with the idempotency key:

output/{run_id}/report-{client_id}.pdf

Same run, same client, same file path. Writing the file again on retry overwrites the identical content. No duplicates.
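A minimal sketch of the idempotent write (the output root is illustrative):

```python
import os

def write_report(run_id, client_id, content, root="output"):
    """Deterministic path: a retry overwrites the same file, never duplicates."""
    path = os.path.join(root, run_id, f"report-{client_id}.pdf")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)
    return path
```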

Handling non-idempotent agents

Some agents are inherently non-idempotent. They call APIs that don't support idempotency keys. They send notifications that can't be unsent. They modify external state in ways that can't be repeated safely.

For these agents, use the "execute once, record, skip on retry" pattern:

  1. Before executing, check the execution log for this run ID + agent combination
  2. If found, skip execution and return the recorded result
  3. If not found, execute, record the result and all side effects, then proceed

{
  "name": "non-idempotent-notifier",
  "prompt": "Before sending any notification, check if notifications.log contains an entry for run {RUN_ID}. If it does, skip sending and emit the previous result. If it doesn't, send the notification, log it with the run ID, then emit the result.",
  "triggers": ["data:processed"],
  "emits": ["notification:sent"]
}

This is manual idempotency -- the agent implements the check-and-record logic itself. It's more fragile than platform-level checkpoints because the agent has to get the logic right. But for genuinely non-idempotent operations, it's the only option.
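The check-and-record logic can be sketched as a small wrapper (the JSON log file here is an illustrative stand-in for notifications.log):

```python
import json
import os

def run_once(log_path, run_id, action):
    """Execute action once per run_id; on retry, return the recorded result."""
    log = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    if run_id in log:
        return log[run_id]          # skip: already executed for this run
    result = action()               # the non-idempotent side effect
    log[run_id] = result
    with open(log_path, "w") as f:  # record before proceeding
        json.dump(log, f)
    return result
```

Note the same non-atomic gap as file checkpoints: a crash between the side effect and the log write means one duplicate on retry, which is why this pattern is a last resort.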

Testing idempotency

Here's how to verify your chain handles retries correctly:

  1. Run the chain to completion. Record all outputs and side effects.
  2. Run the exact same chain with the same input and same run ID.
  3. Verify: no duplicate side effects, same final output, all checkpoints correctly detected.

Then test the partial failure case:

  1. Run the chain with a kill switch -- crash it at a specific point.
  2. Resume with the same run ID.
  3. Verify: only unprocessed items are handled, no duplicates, correct final state.

Automate these tests. Run them on every chain definition change. Idempotency bugs are silent -- they don't crash your chain, they just produce wrong results. You won't catch them in normal testing.
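A toy version of the partial-failure test, assuming file-based send logs like the email-sender's (the kill switch is a simulated crash after a fixed number of sends):

```python
import os

def send_reports(recipients, checkpoint_dir, sent, crash_after=None):
    """Toy email step: one send log per recipient; can crash mid-run."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    sends = 0
    for r in recipients:
        marker = os.path.join(checkpoint_dir, f"sent-{r}.log")
        if os.path.exists(marker):
            continue  # already sent on a previous run
        sent.append(r)  # the side effect
        open(marker, "w").close()
        sends += 1
        if crash_after is not None and sends >= crash_after:
            raise RuntimeError("simulated crash")
```

Crash it after two sends, resume with the same checkpoint directory, and assert that no recipient appears twice in the side-effect log.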

Common mistakes

Mistake: Using timestamps as idempotency keys. Timestamps aren't stable across retries -- a retry started at a different time gets a different key, so checkpoint checks never match and everything re-runs. Use a stable run ID that persists across retries.

Mistake: Checkpointing after side effects. If you send an email and then write the checkpoint, a crash between send and checkpoint means the email goes out again on retry. Use a two-phase approach: checkpoint the intent first, execute the side effect, then mark the checkpoint as completed.
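The two-phase approach, sketched with intent and completion markers (file names are illustrative):

```python
import os

def send_once(checkpoint_dir, recipient, send):
    """Checkpoint the intent first, execute, then mark completed."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    intent = os.path.join(checkpoint_dir, f"{recipient}.intent")
    done = os.path.join(checkpoint_dir, f"{recipient}.completed")
    if os.path.exists(done):
        return "skipped"  # definitely sent already
    if os.path.exists(intent):
        # Crashed between send and completion: the send may or may not
        # have happened. Flag for review instead of silently re-sending.
        return "needs-review"
    open(intent, "w").close()  # phase 1: record intent
    send(recipient)            # the side effect
    os.replace(intent, done)   # phase 2: mark completed
    return "sent"
```

The intent marker narrows the ambiguous window to exactly one item: on retry you know precisely which send, if any, needs a human decision.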

Mistake: Ignoring partial agent output. An agent that processes 50 items but crashes while writing the output file loses all 50 items of work. Flush output incrementally, not all at once at the end.

Mistake: Assuming APIs are idempotent. Most are not. POST endpoints almost never are. Check the API documentation and use idempotency keys when available.

The investment pays for itself the first time a production chain fails at 3 AM and you re-run it without worrying about duplicate invoices, double emails, or corrupted state.


Building production chains? Learn the debugging process or see monitoring best practices.
