Performance Optimization for Agent Chains: Faster Pipelines, Lower Cost
Mentiko Team
A 4-agent chain that takes 45 seconds per run is fine when you're testing. When it's running 500 times a day across your team, those 45 seconds become 6+ hours of cumulative wait time and a significant API bill. Performance optimization for agent chains isn't premature -- it's the difference between a tool people tolerate and one they rely on.
Here's how to make your chains faster and cheaper without sacrificing output quality.
Parallel vs sequential execution
The single biggest performance win is running agents in parallel when they don't depend on each other's output.
Most chains are written sequentially by default because that's how people think about workflows: step 1, step 2, step 3. But many chains contain agents that are independently executable. A lead enrichment chain doesn't need to scrape LinkedIn before checking the CRM -- those can run simultaneously.
Sequential (slow):
{
"agents": [
{ "name": "linkedin-scraper", "triggers": ["chain:start"], "emits": ["linkedin:done"] },
{ "name": "crm-lookup", "triggers": ["linkedin:done"], "emits": ["crm:done"] },
{ "name": "news-scanner", "triggers": ["crm:done"], "emits": ["news:done"] },
{ "name": "synthesizer", "triggers": ["news:done"], "emits": ["chain:complete"] }
]
}
Parallel (fast):
{
"agents": [
{ "name": "linkedin-scraper", "triggers": ["chain:start"], "emits": ["linkedin:done"] },
{ "name": "crm-lookup", "triggers": ["chain:start"], "emits": ["crm:done"] },
{ "name": "news-scanner", "triggers": ["chain:start"], "emits": ["news:done"] },
{ "name": "synthesizer", "triggers": ["linkedin:done", "crm:done", "news:done"], "emits": ["chain:complete"] }
]
}
The parallel version runs three agents at once and collects their outputs in a fan-in pattern. If each agent takes ~8 seconds, the sequential version takes 32 seconds. The parallel version takes ~16 seconds: 8 for the fan-out agents plus 8 for the synthesizer. That's a 50% reduction from rearranging triggers.
Rule of thumb: Draw your chain as a dependency graph. If two agents don't read each other's output, they can run in parallel.
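If you're orchestrating chains yourself rather than through Mentiko's trigger system, the same fan-out/fan-in shape is a few lines of asyncio. A minimal sketch, with short sleeps standing in for the real ~8-second agent calls; all function names here are hypothetical:

```python
import asyncio

# Hypothetical stand-ins for the three independent agents.
async def linkedin_scraper(lead):
    await asyncio.sleep(0.01)  # stands in for an ~8s API call
    return {"linkedin": f"profile for {lead}"}

async def crm_lookup(lead):
    await asyncio.sleep(0.01)
    return {"crm": f"record for {lead}"}

async def news_scanner(lead):
    await asyncio.sleep(0.01)
    return {"news": f"articles about {lead}"}

async def synthesizer(parts):
    # Fan-in: runs only once all three upstream results have arrived.
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

async def run_chain(lead):
    # Fan-out: the three independent agents start at the same time.
    parts = await asyncio.gather(
        linkedin_scraper(lead), crm_lookup(lead), news_scanner(lead)
    )
    return await synthesizer(parts)

result = asyncio.run(run_chain("acme"))
```

The wall-clock time of the gather step is the slowest agent, not the sum of all three, which is exactly where the 32-second-to-16-second improvement comes from.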
Model right-sizing
Not every agent needs GPT-5.4 or Claude Opus. This is the most common source of wasted spend in agent chains.
Think about what each agent actually does. A classification agent that routes tickets into three categories doesn't need frontier-model reasoning. A data extraction agent pulling structured fields from a known format doesn't need creative capability. Match the model to the task:
- Routing, classification, extraction: Use the fastest, cheapest model available. Claude Haiku, GPT-5.4 Mini, Gemini Flash. These tasks have constrained output and low ambiguity.
- Summarization, rewriting, formatting: Mid-tier models handle these well. Claude Sonnet, GPT-5.4. The task requires language facility but not deep reasoning.
- Analysis, judgment, complex writing: This is where you use the expensive models. Claude Opus, GPT-5.4. The output quality depends on reasoning capability.
A chain that uses Haiku for the first two agents and Opus for the final analysis agent can cost 70% less than using Opus everywhere. The output quality is identical because the cheap agents weren't doing work that required expensive reasoning.
In Mentiko, you set the model per agent in the chain definition:
{
"name": "classifier",
"model": "claude-haiku",
"prompt": "Classify this ticket as: bug, feature, question."
}
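If you generate chain definitions programmatically, the tier guidance above can be encoded as a simple lookup so agents never default to the most expensive model. A sketch using the model names from the examples above; the table and function are illustrative, not a Mentiko API:

```python
# Illustrative tier table -- task types and model names follow the
# examples in this post, not a fixed Mentiko list.
MODEL_FOR_TASK = {
    "classify": "claude-haiku",
    "extract": "claude-haiku",
    "summarize": "claude-sonnet",
    "analyze": "claude-opus",
}

def pick_model(task_type: str) -> str:
    # Default unlisted task types to the mid-tier model rather than
    # silently paying frontier-model prices.
    return MODEL_FOR_TASK.get(task_type, "claude-sonnet")

model = pick_model("classify")
```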
Prompt caching and deduplication
If your chain processes similar inputs repeatedly, you're paying to send the same context to the same model over and over.
System prompt caching. Most LLM providers now support prompt caching. If your agent's system prompt is 2,000 tokens and it runs 100 times a day, caching that system prompt saves 200,000 input tokens daily. Anthropic's prompt caching gives you a 90% discount on cached tokens. Enable it.
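If your orchestration calls Anthropic's Messages API directly, for example, marking the system prompt cacheable is a matter of attaching a `cache_control` block to it. The request body looks roughly like this (prompt text elided):

```json
{
  "model": "claude-haiku",
  "system": [
    {
      "type": "text",
      "text": "...your 2,000-token system prompt...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "..." }]
}
```

The first request pays a small premium to write the cache; subsequent requests that reuse the same prompt prefix read it at the discounted rate. Check your provider's documentation for minimum cacheable lengths and cache lifetimes.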
Output deduplication. If an agent processes the same input twice, it should return the cached result instead of making another API call. This is especially valuable for enrichment chains where the same lead might enter the pipeline multiple times.
{
"name": "enricher",
"cache": {
"enabled": true,
"key": "input.email",
"ttl": 86400
}
}
This tells Mentiko to cache the enricher's output keyed by the input email, with a 24-hour TTL. Same email enters the chain within 24 hours? Skip the API call, return the cached result.
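If you're rolling your own orchestration, the same dedup behavior is a small TTL cache keyed by an input field. A minimal sketch of the idea, not Mentiko's actual implementation:

```python
import time

class OutputCache:
    """TTL cache keyed by one field of the agent's input -- a sketch
    of the dedup behavior described above."""

    def __init__(self, key_field: str, ttl_seconds: float):
        self.key_field = key_field
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, output)

    def get(self, agent_input: dict):
        key = agent_input[self.key_field]
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # cache hit: skip the API call entirely
        return None

    def put(self, agent_input: dict, output):
        key = agent_input[self.key_field]
        self._store[key] = (time.monotonic() + self.ttl, output)

cache = OutputCache(key_field="email", ttl_seconds=86400)
cache.put({"email": "a@example.com"}, {"company": "Acme"})
hit = cache.get({"email": "a@example.com"})   # served from cache
miss = cache.get({"email": "b@example.com"})  # unseen key -> None
```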
Context trimming. Each agent in a chain receives upstream output. By default, that means later agents get all previous agents' full outputs as context. If Agent 4 only needs a summary from Agent 1 and the full output from Agent 3, tell it that explicitly. Don't send 10,000 tokens of intermediate output when 500 tokens of relevant context will do.
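Context trimming amounts to an explicit declaration of what each downstream agent needs, instead of forwarding everything. A sketch of the selection step, with hypothetical agent names and a made-up "summary vs full" convention:

```python
def trim_context(upstream: dict, needs: dict) -> dict:
    """Forward each downstream agent only what it declared.
    `needs` maps upstream agent name -> "summary" or "full" --
    an illustrative convention, not a Mentiko API."""
    context = {}
    for agent, mode in needs.items():
        output = upstream[agent]
        context[agent] = output["summary"] if mode == "summary" else output
    return context

upstream = {
    "agent1": {"summary": "3 key points", "full": "10,000 tokens of detail"},
    "agent3": {"summary": "short recap", "full": "full report"},
}
# Agent 4 wants only agent1's summary but agent3's full output.
ctx = trim_context(upstream, {"agent1": "summary", "agent3": "full"})
```

Note that `upstream` here omits Agent 2 entirely: not forwarding an output at all is the cheapest trim.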
Chain architecture: minimize round trips
Every agent invocation is a network round trip: serialize input, send to API, wait for response, deserialize output. The latency floor for a single LLM call is 1-3 seconds even for simple prompts. Chain architecture should minimize the number of sequential calls.
Combine agents that always run together. If two sequential agents always process the same data and their combined prompt stays comfortably within the model's context window, merge them into one agent with a multi-step prompt. Two 3-second calls become one 4-second call.
Eliminate pass-through agents. Some chains have agents that exist only to reformat data between two other agents. This is a sign that the upstream agent's output format should match what the downstream agent expects. Fix the prompt, remove the middle agent.
Use streaming for user-facing output. If the final agent in your chain produces output that a human reads, stream it. The user sees the first token in 500ms instead of waiting 8 seconds for the complete response. The total time is the same, but perceived performance improves dramatically.
Workspace colocation
Where your agents execute matters for latency. If your chain triggers a workspace on a remote server, you're adding SSH connection overhead on every agent invocation. Mentiko supports three workspace types: local, SSH, and Docker.
For performance-critical chains, use local or Docker workspaces. SSH workspaces add 200-500ms of connection overhead per invocation. Over a 6-agent chain, that's 1-3 extra seconds of pure network latency.
If you must use SSH workspaces, use persistent connections. Mentiko keeps SSH sessions alive between agent invocations within the same chain run, but the initial connection still costs time. Colocate your workspace with your chain runner when possible.
Lazy agent initialization
Not every agent in a chain will fire on every run. Conditional branches mean some agents only execute when specific conditions are met. Loading model connections, workspace environments, and tool configurations for agents that might not run is waste.
Mentiko initializes agents lazily by default: the workspace spins up and the model connection opens when the agent's trigger fires, not when the chain starts. This means a chain with 10 agents but only 4 active branches doesn't pay initialization cost for the 6 idle agents.
If you're building your own orchestration, implement this pattern. Pre-initializing everything is simpler to code but creates unnecessary overhead in branching chains.
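The pattern itself is small: wrap the expensive setup in a property that runs at most once, on first trigger. A sketch of the idea, not Mentiko's internals:

```python
class LazyAgent:
    """Defer expensive setup (workspace, model connection) until the
    agent's trigger actually fires."""

    def __init__(self, name, init_fn):
        self.name = name
        self._init_fn = init_fn   # expensive setup, run at most once
        self._resources = None

    @property
    def resources(self):
        if self._resources is None:  # first trigger pays the cost
            self._resources = self._init_fn()
        return self._resources

    def run(self, payload):
        return f"{self.name} handled {payload} using {self.resources}"

init_count = 0
def init_workspace():
    global init_count
    init_count += 1
    return "workspace+model"

agent = LazyAgent("classifier", init_workspace)
# Chain start: nothing has been initialized yet.
agent.run("ticket-1")
out = agent.run("ticket-2")  # second run reuses the same resources
```

An agent on a branch that never fires this run pays zero initialization cost, which is the whole point in a 10-agent chain with 4 active branches.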
Measuring performance
You can't optimize what you don't measure. Track these metrics per chain and per agent:
- Time per agent: How long each agent takes from trigger to output. Identifies bottlenecks.
- Cost per run: Total API spend for one chain execution. Tracks the impact of optimizations.
- Cache hit rate: What percentage of invocations are served from cache. Low hit rate on a high-volume chain means your cache key is wrong.
- Parallel efficiency: (Sum of individual agent times) / (Actual wall-clock time). A ratio of 2.0 means your parallelization is giving you 2x speedup. If it's close to 1.0, your chain is effectively sequential despite having parallel agents.
- Token utilization: Input tokens vs output tokens per agent. High input-to-output ratios suggest context trimming opportunities.
Mentiko exposes these metrics per run in the dashboard and via the /api/runs/{id}/metrics endpoint. Export them to your monitoring stack and set alerts for cost spikes or latency regressions.
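Parallel efficiency is the one metric on that list people tend to eyeball rather than compute. It's a one-liner over per-agent timings, sketched here with the numbers from the parallel-execution example earlier in this post:

```python
def parallel_efficiency(agent_seconds, wall_clock_seconds):
    """(Sum of individual agent times) / (actual wall-clock time).
    Close to 1.0 means the chain ran effectively sequentially."""
    return sum(agent_seconds) / wall_clock_seconds

# Three 8s fan-out agents plus an 8s synthesizer, finishing in 16s
# of wall-clock time: 32 / 16 = 2.0, i.e. a 2x speedup.
eff = parallel_efficiency([8, 8, 8, 8], 16.0)
```

Alert when this ratio drops toward 1.0 on a chain you believe is parallel: it usually means a trigger change quietly re-serialized the fan-out.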
The optimization checklist
Before deploying a chain to production, run through this:
- Draw the dependency graph. Identify agents that can run in parallel. Restructure triggers.
- Audit model assignments. Downgrade agents doing simple tasks to cheaper models.
- Enable prompt caching. Especially for agents with large system prompts.
- Add output caching. For agents that process repeated inputs.
- Trim context passing. Each agent should receive only the upstream output it needs.
- Eliminate unnecessary agents. Merge pass-throughs, combine always-sequential pairs.
- Choose the right workspace. Local > Docker > SSH for latency.
- Set up metrics. Cost per run, time per agent, cache hit rate.
Most chains can be made 40-60% faster and 50-70% cheaper by applying these techniques. The optimizations compound: parallelizing three agents that each use a cheaper model with cached prompts is multiplicative, not additive.
Start with the dependency graph and model right-sizing. Those two changes alone usually account for 80% of the improvement. The rest is incremental -- worth doing for high-volume chains, but not where you should spend your first hour.