
Testing AI Agent Chains: Strategies That Actually Work

Mentiko Team

Testing AI agent chains is hard. The output is non-deterministic. The "correct" answer isn't always a single expected value. And a test that passes today might fail tomorrow because the model was updated.

But "it's hard" isn't an excuse to skip testing. Untested chains break in production and nobody knows why. Here are the testing strategies that actually work.

Why traditional testing fails

Unit testing works when you can assert: given input X, expect output Y. Language models don't work that way. The same prompt with the same input can produce different output on different runs.

This means:

  • Exact string matching is useless
  • Snapshot testing is fragile
  • Mocking the LLM defeats the purpose

You need testing strategies designed for probabilistic systems.

Strategy 1: Schema validation tests

You can't predict the exact content, but you can predict the structure. If your agent should output JSON with specific fields, test for that:

import json

def test_classifier_output_schema():
    result = run_agent("classifier", input=sample_ticket)
    data = json.loads(result)

    assert "priority" in data
    assert data["priority"] in ["P0", "P1", "P2", "P3", "P4"]
    assert "category" in data
    assert isinstance(data["category"], str)
    assert len(data["category"]) > 0

This test passes regardless of what the model classifies the ticket as. It only fails if the output structure is wrong -- which is the most common and most damaging failure mode.
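A variation on the test above: factor the structural checks into a reusable validator that collects every violation instead of stopping at the first failed assert. This is a sketch; the field names and allowed priority values mirror the hypothetical classifier output, so adjust them to your schema.

```python
import json

# Allowed values for the hypothetical classifier's "priority" field.
PRIORITIES = {"P0", "P1", "P2", "P3", "P4"}

def schema_violations(raw: str) -> list[str]:
    """Return every structural problem in a classifier response, not just the first."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if data.get("priority") not in PRIORITIES:
        problems.append(f"bad priority: {data.get('priority')!r}")
    category = data.get("category")
    if not isinstance(category, str) or not category:
        problems.append(f"bad category: {category!r}")
    return problems
```

A test then reduces to `assert schema_violations(result) == []`, and a failure report shows everything that is structurally wrong at once.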

Strategy 2: Boundary tests

Test edge cases that are likely to break agents:

  • Empty input
  • Very long input (context window limits)
  • Input in unexpected language
  • Input with special characters or code
  • Input that looks like prompt injection

def test_empty_input_handling():
    result = run_agent("classifier", input="")
    data = json.loads(result)
    assert data["category"] == "unprocessable"

def test_very_long_input():
    long_text = "word " * 50000  # way over context limit
    result = run_agent("summarizer", input=long_text)
    assert len(result) > 0  # agent should still produce output
    assert len(result) < len(long_text)  # output should be shorter

Strategy 3: Quality gate tests

Run the chain with known-good input and check if the quality gates pass:

def test_content_pipeline_quality():
    result = run_chain("content-pipeline", vars={"TOPIC": "cloud computing"})

    assert result.status == "completed"
    assert result.quality_score >= 0.7
    assert result.revision_count <= 2
    assert len(result.final_output) > 500  # meaningful content

If your quality gates are well calibrated, they double as test assertions: a chain that passes its own gates is behaving as specified. The caveat is that this only holds as long as the gates themselves are trustworthy, so revisit them when the chain or model changes.

Strategy 4: Regression tests with golden outputs

Save the output of a "known good" run. On future runs, compare:

  • Output length within 20% of golden (not exact match)
  • Key sections present (introduction, conclusion, etc.)
  • No quality gate failures that the golden run passed
  • Similar structure (same number of sections, similar formatting)

This catches regressions without requiring exact matches.
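The checks above can be sketched as a single comparison helper. The section-marker strings and the markdown-heading heuristic are assumptions; substitute whatever headings and structure your chain actually emits.

```python
def golden_regressions(output: str, golden: str,
                       length_tolerance: float = 0.2,
                       required_sections: tuple = ("Introduction", "Conclusion")) -> list[str]:
    """Compare a new run against a saved golden output; return regressions found."""
    problems = []
    # Length within tolerance of the golden run, not an exact match.
    lo = len(golden) * (1 - length_tolerance)
    hi = len(golden) * (1 + length_tolerance)
    if not (lo <= len(output) <= hi):
        problems.append(f"length {len(output)} outside {int(lo)}-{int(hi)}")
    # Key sections that the golden run had must still be present.
    for section in required_sections:
        if section in golden and section not in output:
            problems.append(f"missing section: {section}")
    # Similar structure: same number of markdown headings.
    def headings(text: str) -> int:
        return sum(1 for line in text.splitlines() if line.startswith("#"))
    if headings(output) != headings(golden):
        problems.append("heading count changed")
    return problems
```

A regression test is then `assert golden_regressions(result, load_golden()) == []`, with the golden output checked into the repository alongside the tests.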

Strategy 5: Event flow tests

Test the chain's orchestration separately from agent logic:

def test_chain_event_flow():
    # Run the chain
    run = execute_chain("research-pipeline")

    # Check all expected events were emitted
    events = get_events(run.id)
    assert "research:complete" in events
    assert "analysis:complete" in events
    assert "report:complete" in events
    assert "chain:complete" in events

    # Check ordering
    assert events.index("research:complete") < events.index("analysis:complete")

This tests the orchestration layer without caring about agent output quality. If events fire in the right order, the chain's structure is correct.

Strategy 6: Cost tests

Protect against runaway costs:

def test_chain_cost_within_budget():
    run = execute_chain("daily-research")

    assert run.total_cost < 1.00  # max $1 per run
    assert run.agent_count <= 5   # no unexpected agent spawning
    assert run.duration < 300     # max 5 minutes

A chain that suddenly costs 10x its normal amount is broken, even if the output looks fine.

Testing workflow

For a new chain:

  1. Write schema tests first (before the chain works perfectly)
  2. Add boundary tests for known edge cases
  3. Run the chain 5 times and save a golden output
  4. Add regression tests against the golden output
  5. Add event flow tests to verify orchestration
  6. Add cost tests to prevent budget surprises
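Step 3 can be automated with a small capture helper. Here `run_fn` is any callable that executes your chain and returns its output as a string; keeping the median-length run is one reasonable way to pick a baseline, not the only one.

```python
import statistics

def capture_golden(run_fn, runs: int = 5) -> str:
    """Run the chain several times and keep the median-length output as golden.

    Picking the median length avoids enshrining an unusually short or
    unusually long run as the baseline.
    """
    outputs = [run_fn() for _ in range(runs)]
    target = statistics.median(len(o) for o in outputs)
    # Return the output whose length is closest to the median length.
    return min(outputs, key=lambda o: abs(len(o) - target))
```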

Run tests on every prompt change. Prompt changes are code changes and should be treated with the same rigor.

CI/CD for agent chains

Integrate chain testing into your deployment pipeline:

# .github/workflows/test-chains.yml
- name: Test chain schemas
  run: python tests/test_schemas.py

- name: Test event flows
  run: python tests/test_events.py

- name: Test cost bounds
  run: python tests/test_costs.py

- name: Run regression suite
  run: python tests/test_regression.py

Block deployments when chain tests fail, just like you'd block on failing unit tests.


Building testable chains? Learn the chain patterns or see the debugging guide.
