
LLM Model Selection for Agent Chains: Matching Models to Tasks

Mentiko Team

Most teams pick one model and use it everywhere. GPT-5.4 for the whole chain, or Claude Sonnet for every agent, or whatever they started prototyping with. This is like hiring a senior engineer to answer phones, sort mail, and also architect your distributed system. It works, technically. It's also a massive waste of money and a bottleneck you don't need.

The better approach: match the model to the task. Every agent in your chain has different requirements for reasoning depth, output quality, latency, and cost. Treating model selection as a per-agent decision instead of a global one can cut your chain costs by 60-80% while improving overall quality.

The model tier framework

Think of available models in three tiers. The names change every few months, but the tiers are stable.

Tier 1: Reasoning engines. Claude Opus, GPT-5.4, Gemini Ultra. These models handle complex multi-step reasoning, nuanced writing, ambiguous instructions, and tasks that require maintaining coherent context across long outputs. They cost anywhere from 10x to 60x more per token than tier 3 models. Latency is highest here.

Tier 2: Workhorse models. Claude Sonnet, GPT-5.4 Mini, Gemini Pro. Good balance of capability and cost. They handle most structured tasks well: summarization, code generation from clear specs, document analysis, data extraction from well-formatted inputs.

Tier 3: Speed models. Claude Haiku, Gemini Flash, local models like Llama 3 and Mistral. These are cheap and fast. They handle classification, routing, simple extraction, formatting, validation, and any task where the correct output is highly constrained.

The mistake is assuming that a more capable model means better results across the board. A classification agent that routes support tickets to the right queue doesn't need Opus-level reasoning. A Haiku-class model will classify nearly as accurately, return results 5x faster, and cost 20x less.

Mapping agents to model tiers

Here's how common agent types map to model tiers in practice.

Tier 3 (speed models) for:

  • Classification and routing agents. The output space is small (3-10 categories). Even small models nail this.
  • Validation agents. Checking whether output matches a schema, whether required fields are present, whether a value falls in range.
  • Formatting agents. Converting markdown to HTML, restructuring JSON, normalizing dates and addresses.
  • Extraction agents with clear schemas. Pulling structured data from well-formatted documents.

Tier 2 (workhorse models) for:

  • Content generation from detailed specs. Blog posts, documentation, email drafts where the requirements are clear.
  • Code generation with defined interfaces. Implementing functions when the types and tests are provided.
  • Summarization agents. Condensing long documents while preserving key points.
  • Analysis agents with moderate complexity. Analyzing data with clear criteria, generating reports from structured inputs.

Tier 1 (reasoning engines) for:

  • Decision agents. When the agent needs to weigh tradeoffs, handle ambiguity, or make judgment calls that affect downstream agents.
  • Planning agents. Breaking complex tasks into subtasks, creating execution plans, determining agent routing for novel inputs.
  • Complex code review. Finding architectural issues, security vulnerabilities, or subtle logic bugs -- not just linting.
  • Synthesis agents. Combining outputs from multiple agents into a coherent whole when the inputs are messy or contradictory.
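The mapping above can be captured as a simple lookup table. Here's one sketch in Python; the agent-type names and tier labels are illustrative shorthand from this post, not a fixed API:

```python
# Illustrative mapping of agent types to model tiers, following the
# three-tier framework above. Names are examples, not an exhaustive list.
AGENT_TIERS = {
    "classifier": 3, "router": 3, "validator": 3, "formatter": 3,
    "extractor": 3,
    "content-writer": 2, "code-generator": 2, "summarizer": 2,
    "analyst": 2,
    "decision-maker": 1, "planner": 1, "code-reviewer": 1,
    "synthesizer": 1,
}

# One example model per tier, drawn from the tiers described above.
TIER_MODELS = {1: "claude-opus", 2: "claude-sonnet", 3: "claude-haiku"}

def default_model(agent_type: str) -> str:
    """Pick a starting model for an agent; unknown types default to tier 2."""
    tier = AGENT_TIERS.get(agent_type, 2)
    return TIER_MODELS[tier]
```

So `default_model("classifier")` starts you on a Haiku-class model, and anything you haven't categorized lands on the tier 2 workhorse until you benchmark it.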

The cost math

A real example. We have a content pipeline chain with five agents: researcher, outliner, writer, editor, and publisher.

All Opus (the lazy approach):

  • 5 agents, each at Opus pricing: ~$15/M input tokens, ~$75/M output tokens
  • Per run: ~$0.90 at typical token volumes
  • 100 runs/day: $90/day, $2,700/month

Model-matched approach:

  • Researcher: Sonnet ($3/$15 per M) -- needs good search synthesis
  • Outliner: Haiku ($0.25/$1.25 per M) -- structured output, constrained task
  • Writer: Opus ($15/$75 per M) -- creative, nuanced, quality matters most
  • Editor: Sonnet ($3/$15 per M) -- checking against clear style rules
  • Publisher: Haiku ($0.25/$1.25 per M) -- formatting and API calls

Per run: ~$0.35. That's a 61% cost reduction. And the output quality is the same because the agents that needed Opus got Opus, and the ones that didn't weren't paying for capabilities they don't use.

Scale this to 100 chains across an organization and you're looking at the difference between a $50,000/month LLM bill and an $18,000/month one.
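The arithmetic is easy to reproduce. The sketch below assumes a uniform 2,000 input and 2,000 output tokens per agent -- a hypothetical volume chosen purely for illustration, since real chains vary per agent:

```python
# Prices in $ per million tokens (input, output), from the bullets above.
PRICES = {"opus": (15.0, 75.0), "sonnet": (3.0, 15.0), "haiku": (0.25, 1.25)}

# Hypothetical uniform token volumes per agent, for illustration only.
IN_TOK, OUT_TOK = 2000, 2000

def agent_cost(model: str) -> float:
    """Cost of one agent call at the assumed token volumes."""
    price_in, price_out = PRICES[model]
    return (IN_TOK * price_in + OUT_TOK * price_out) / 1_000_000

# All-Opus chain vs the model-matched chain from the example.
all_opus = 5 * agent_cost("opus")
matched = sum(agent_cost(m) for m in
              ["sonnet", "haiku", "opus", "sonnet", "haiku"])

print(f"all-Opus: ${all_opus:.2f}/run, matched: ${matched:.2f}/run")
print(f"savings: {1 - matched / all_opus:.0%}")
```

With these flattened volumes the matched chain comes out around 70% cheaper -- even more than the article's 61% -- because in a real chain the writer emits disproportionately many output tokens, and the writer is the one agent still on Opus.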

When to use local models

Local models (Llama 3, Mistral, Phi-3, Qwen) have a specific place in agent chains. They're not a general replacement for cloud models, but they solve three problems that cloud models can't.

Data sensitivity. If your agent processes PII, financial records, medical data, or trade secrets, sending that data to an external API is a compliance risk. A local Llama 3 running on your infrastructure keeps the data in your perimeter. Use local models for extraction and classification agents that handle sensitive inputs, then pass only sanitized summaries to cloud models for reasoning.

Latency requirements. A local model on a decent GPU returns results in 50-200ms. Cloud API calls take 500-3000ms. For agents in user-facing loops, the latency difference matters.

Cost at scale. If you're running an agent thousands of times per hour, cloud API costs add up. A single A100 running Llama 3 70B costs about $1.50/hour and handles roughly 30 requests per second. The cloud equivalent would cost $50-100/hour.

The tradeoff: local models need infrastructure. Don't run them for 10 requests per day. Consider them past 10,000 requests per day or when data can't leave your network.
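Those numbers imply a rough break-even point. The sketch below assumes the $1.50/hour GPU figure from above and a hypothetical $0.002 cloud cost per request (roughly a Sonnet-class call at small token volumes); both inputs are assumptions, not measurements:

```python
# $/day for a single GPU at the $1.50/hour figure cited above.
GPU_COST_PER_DAY = 1.50 * 24

# Hypothetical per-request cloud cost; substitute your own benchmark number.
CLOUD_COST_PER_REQUEST = 0.002

# Requests/day at which dedicated hardware becomes cheaper than the cloud API
# (ignoring ops overhead, which pushes the real break-even higher).
break_even = GPU_COST_PER_DAY / CLOUD_COST_PER_REQUEST
print(f"break-even: {break_even:,.0f} requests/day")
```

At these assumed prices the break-even lands at 18,000 requests/day, which is why the guidance above says to start considering local models past 10,000 -- by then you're close enough that the compliance and latency benefits tip the scale.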

Benchmarking model performance per agent

Don't guess which model works for each agent. Measure it. Here's the process.

Step 1: Define evaluation criteria per agent. For a classification agent, accuracy on a labeled test set. For a writing agent, human preference scores or automated quality metrics. For a code generation agent, test pass rate.

Step 2: Create a test harness. Run each agent against a fixed test set using three models (one from each tier). Log: output quality score, latency, token count, cost.

{
  "agent": "ticket-classifier",
  "test_set_size": 200,
  "results": {
    "opus": { "accuracy": 0.96, "avg_latency_ms": 1200, "avg_cost": 0.008 },
    "sonnet": { "accuracy": 0.95, "avg_latency_ms": 600, "avg_cost": 0.002 },
    "haiku": { "accuracy": 0.93, "avg_latency_ms": 180, "avg_cost": 0.0003 }
  }
}

In this case, Haiku at 93% accuracy for $0.0003 per classification is almost certainly the right choice. The 3% accuracy gap rarely justifies a 26x cost increase.

Step 3: Set quality thresholds. Decide the minimum acceptable quality for each agent. If Haiku meets it, use Haiku. Don't pay for capabilities you're not using.
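Steps 2 and 3 combine naturally: given benchmark records like the JSON above, pick the cheapest model that clears the quality bar. A minimal sketch:

```python
def cheapest_passing(results, min_accuracy):
    """Return the cheapest model meeting the accuracy threshold, or None."""
    passing = {m: r for m, r in results.items()
               if r["accuracy"] >= min_accuracy}
    if not passing:
        return None
    return min(passing, key=lambda m: passing[m]["avg_cost"])

# Numbers from the ticket-classifier benchmark above.
results = {
    "opus":   {"accuracy": 0.96, "avg_cost": 0.008},
    "sonnet": {"accuracy": 0.95, "avg_cost": 0.002},
    "haiku":  {"accuracy": 0.93, "avg_cost": 0.0003},
}
print(cheapest_passing(results, 0.92))  # haiku clears a 92% bar
print(cheapest_passing(results, 0.94))  # sonnet is cheapest above 94%
```

The threshold is the business decision; the selection is mechanical. If your product can tolerate 93% routing accuracy, the function hands you Haiku and the 26x savings automatically.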

Step 4: Re-benchmark on model updates. When providers release new model versions, re-run the benchmarks. Model capabilities shift. Today's tier 2 model might outperform last quarter's tier 1 on your specific tasks.
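Putting the four steps together, a minimal harness might look like the sketch below. `call_model` is a stand-in for your actual provider client, and the one-example test set is a stub so the skeleton runs; nothing here is a Mentiko API:

```python
import time

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call; replace with your client."""
    return "billing"  # stub answer so the sketch is runnable

def benchmark(agent_prompt: str, test_set: list, models: list) -> dict:
    """Run each model over a fixed test set; log accuracy and latency."""
    report = {}
    for model in models:
        correct, total_ms = 0, 0.0
        for example in test_set:
            start = time.perf_counter()
            answer = call_model(model, agent_prompt + example["input"])
            total_ms += (time.perf_counter() - start) * 1000
            correct += answer == example["label"]
        report[model] = {
            "accuracy": correct / len(test_set),
            "avg_latency_ms": total_ms / len(test_set),
        }
    return report

test_set = [{"input": "Card was charged twice", "label": "billing"}]
print(benchmark("Classify this ticket: ", test_set, ["haiku", "sonnet"]))
```

Re-running this on every model release is cheap once the test set exists -- the expensive part is labeling the 200 examples, and you only pay that once per agent.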

Per-agent model configuration in Mentiko

Mentiko lets you set the model at the agent level in your chain definition:

{
  "name": "content-pipeline",
  "agents": [
    {
      "name": "researcher",
      "model": "claude-sonnet",
      "prompt": "Research {TOPIC} and compile findings."
    },
    {
      "name": "outliner",
      "model": "claude-haiku",
      "prompt": "Create a structured outline from the research."
    },
    {
      "name": "writer",
      "model": "claude-opus",
      "prompt": "Write the article following the outline."
    },
    {
      "name": "editor",
      "model": "claude-sonnet",
      "prompt": "Edit for clarity, accuracy, and tone."
    },
    {
      "name": "publisher",
      "model": "claude-haiku",
      "prompt": "Format the article for publication."
    }
  ]
}

Each agent runs on its own model. No code changes, no separate deployments. You can swap models by editing one field and re-running the chain.
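Because the model is just a field in the chain definition, swaps can even be scripted with standard JSON tooling. A sketch (the chain snippet is abbreviated for illustration):

```python
import json

# An abbreviated chain definition; in practice you'd load this from a file.
chain = json.loads("""{
  "name": "content-pipeline",
  "agents": [{"name": "editor", "model": "claude-sonnet"}]
}""")

# Swap the editor down a tier, e.g. to A/B test cost against quality.
for agent in chain["agents"]:
    if agent["name"] == "editor":
        agent["model"] = "claude-haiku"

print(json.dumps(chain, indent=2))
```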

For local models, point the agent at your own endpoint:

{
  "name": "pii-extractor",
  "model": "local:llama3-70b",
  "endpoint": "http://gpu-cluster:8080/v1",
  "prompt": "Extract and classify PII from the input document."
}

The chain execution engine handles the differences in API formats, token counting, and error handling between providers.

Model selection is a living process

Your initial model assignments are a starting point. Models improve, pricing changes, your data distribution shifts. Review per-agent cost and quality metrics monthly. A/B test model swaps before committing. When a new model drops, benchmark it against your current assignments.

The goal isn't to use the cheapest model everywhere. It's to use the right model everywhere. The chain doesn't care whether it's Opus or a local Llama, as long as each agent gets a model that matches its task.
