6 min read · strategy

Choosing the Right Model for Your Agents

model-selection · llm · cost-optimization · agents

Mentiko Team

The model you choose for each agent in a chain is the single biggest lever you have over cost and quality. Using the same model for every step is like using a sledgehammer for every nail -- it works, but it's expensive and often unnecessary.

Here's a framework for matching models to tasks.

The three dimensions

Every model choice is a trade-off between three things:

Quality -- How well does the model handle the task? Does it follow instructions precisely? Is the output accurate and well-structured?

Speed -- How fast is first-token latency and total generation time? For agent chains, this compounds. A 2-second delay per agent in a 6-agent chain is 12 seconds of just model time.

Cost -- What do you pay per million tokens? For high-volume chains, this is the dominant factor.

No model wins on all three. The art is matching the right trade-off to the right task.

Model tiers and what they're good at

Frontier tier: Claude Opus, GPT-4.5

These are the most capable models available. They excel at complex reasoning, nuanced writing, ambiguous instructions, and tasks where getting it wrong has real consequences.

Use for:

  • Final review and quality assurance steps
  • Complex analysis that requires multi-step reasoning
  • Tasks with ambiguous or underspecified instructions
  • Customer-facing content where quality is non-negotiable
  • Architecture decisions, code review, strategic analysis

Cost: $15-75/1M input tokens, $75-150/1M output tokens

When it's overkill: Classification, extraction, formatting, routing, summarization of straightforward content. These models are wasted on simple tasks.

Workhorse tier: Claude Sonnet, GPT-4o, Llama 4 Maverick

The sweet spot for most production workloads. These models handle complex tasks well without frontier-tier pricing. They're fast enough for interactive use and capable enough for most agent chain steps.

Use for:

  • Content generation (reports, summaries, responses)
  • Code generation and refactoring
  • Data analysis and trend identification
  • Conversational agents and chat flows
  • Most "thinking" steps in a chain

Cost: $2.50-3/1M input, $10-15/1M output (API). Self-hosted Maverick has no per-token cost -- you pay only for GPU time.

The default choice. If you're not sure which tier to use, start here and adjust based on results.

Speed tier: Claude Haiku, GPT-4o-mini, Llama 4 Scout, Qwen 3.5 14B

Fast and cheap. These models handle well-defined tasks reliably at a fraction of the cost. They're the right choice for any step where the task is clear-cut and the model doesn't need to "think hard."

Use for:

  • Classification (sentiment, priority, category, routing)
  • Data extraction from structured or semi-structured text
  • Format conversion (markdown to JSON, CSV parsing, template filling)
  • Simple summarization
  • Input validation and preprocessing

Cost: $0.25-0.80/1M input, $1-4/1M output (API). Self-hosted options run at GPU cost only.

Underused in most chains. Teams default to the workhorse tier for tasks that a speed-tier model handles just as well. This is the most common source of overspending.

The decision framework

For each agent in your chain, ask these questions in order:

1. Is the task well-defined with clear success criteria?

If the output format is known, the input is structured, and you can mechanically verify correctness, use the speed tier. Classification, extraction, and formatting are speed-tier tasks.

2. Does the task require reasoning or judgment?

If the model needs to weigh trade-offs, handle ambiguity, or produce novel analysis, use the workhorse tier. Most "thinking" steps fall here.

3. Are the consequences of a wrong answer significant?

If a mistake means a bad customer experience, a wrong business decision, or a compliance issue, consider the frontier tier. Use it as a guardrail or final validator, not for every step.

4. What's the volume?

High-volume steps should use the cheapest model that meets quality requirements. If an agent runs 1,000 times per day, the difference between Haiku ($0.25/1M) and Opus ($15/1M) is $14.75 per million tokens. At scale, that adds up to thousands of dollars per month.
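That arithmetic is worth making concrete. Here's a minimal cost estimator using the illustrative per-million-token prices quoted in this post (not current published rates -- plug in your own):

```python
# Rough monthly cost estimator for a single agent step.
# Prices ($ per 1M tokens) are illustrative figures from this post,
# not current published rates.
PRICES = {
    "haiku":  {"input": 0.25,  "output": 1.00},
    "sonnet": {"input": 3.00,  "output": 15.00},
    "opus":   {"input": 15.00, "output": 75.00},
}

def monthly_cost(model, runs_per_day, in_tokens, out_tokens, days=30):
    """Estimated monthly spend for one agent step."""
    p = PRICES[model]
    per_run = (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000
    return per_run * runs_per_day * days

# 1,000 runs/day, 2,000 input + 500 output tokens per run:
# at these prices, haiku lands around $30/month and opus around $2,025/month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 1000, 2000, 500):,.2f}/month")
```

Running the same step on Opus instead of Haiku at this volume is a ~67x cost difference for identical chain logic.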

Practical examples

Content pipeline (4 agents)

| Step | Task | Model | Why |
|---|---|---|---|
| 1. Research | Pull and summarize source material | Claude Haiku | Well-defined extraction task |
| 2. Draft | Write the content piece | Claude Sonnet | Requires quality writing |
| 3. Edit | Revise for tone, accuracy, grammar | Claude Sonnet | Judgment required |
| 4. SEO optimize | Add metadata, keywords, structure | Claude Haiku | Formulaic task |

Estimated cost per run: ~$0.02-0.04 vs ~$0.15-0.30 using Sonnet for everything.
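A pipeline like this can be written down as plain data, with the model choice as a per-step field. The `ChainStep` structure below is a hypothetical sketch, not Mentiko's actual chain format:

```python
# Sketch: a chain definition where the model is per-step configuration.
# ChainStep is a hypothetical structure for illustration only.
from dataclasses import dataclass

@dataclass
class ChainStep:
    name: str
    task: str
    model: str  # identifier passed to the provider for this step

CONTENT_PIPELINE = [
    ChainStep("research", "Pull and summarize source material", "claude-haiku"),
    ChainStep("draft",    "Write the content piece",            "claude-sonnet"),
    ChainStep("edit",     "Revise for tone, accuracy, grammar", "claude-sonnet"),
    ChainStep("seo",      "Add metadata, keywords, structure",  "claude-haiku"),
]

def models_used(chain):
    """Distinct models a chain depends on -- useful for key/quota audits."""
    return sorted({step.model for step in chain})
```

Because the model is data rather than logic, downgrading the research step or upgrading the edit step is a one-line change.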

Support triage (3 agents)

| Step | Task | Model | Why |
|---|---|---|---|
| 1. Classify | Categorize ticket by type and priority | GPT-4o-mini | Simple classification |
| 2. Respond | Draft customer response | GPT-4o | Needs to sound human and helpful |
| 3. Route | Assign to team based on classification | GPT-4o-mini | Rule-based routing |

Data processing (5 agents)

| Step | Task | Model | Why |
|---|---|---|---|
| 1. Parse | Extract fields from raw data | Qwen 3.5 14B (self-hosted) | High-volume extraction |
| 2. Validate | Check extracted data against rules | Qwen 3.5 14B (self-hosted) | Deterministic validation |
| 3. Enrich | Add context from external sources | Llama 4 Scout (self-hosted) | Moderate reasoning needed |
| 4. Analyze | Identify patterns and anomalies | Claude Sonnet | Complex analysis |
| 5. Report | Generate human-readable summary | Claude Sonnet | Quality writing required |

Steps 1-3 are self-hosted, costing only GPU time. Steps 4-5 use APIs for their superior reasoning and writing. Total cost is 70-80% less than using API models for everything.

The bring-your-own-keys advantage

Model selection only works if your platform supports it. If you're locked into a single provider or a single model per chain, you can't optimize.

Mentiko uses a bring-your-own-keys model. Each agent in a chain can connect to a different model endpoint. You configure API keys for Anthropic, OpenAI, and any OpenAI-compatible endpoint (including self-hosted models via vLLM or TGI). The chain definition specifies which model each agent uses.

This means you can:

  • Mix proprietary and open-source models in the same chain
  • Point high-volume agents at self-hosted models and low-volume agents at APIs
  • Swap models without changing chain logic -- just update the endpoint
  • A/B test models by running the same chain with different model configurations

The platform doesn't care where the model lives. It just calls the endpoint you configured.
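With an OpenAI-compatible setup, that per-agent routing reduces to a config lookup. The endpoint URLs, model names, and `client_config` helper below are placeholders for illustration -- not Mentiko's actual API -- but any OpenAI-compatible server (vLLM, TGI) fits this shape:

```python
# Sketch: per-agent endpoint routing as a config lookup.
# URLs, model names, and this helper are placeholders, not a real API.
AGENT_ENDPOINTS = {
    "parse":   {"base_url": "http://vllm.internal:8000/v1", "model": "qwen-3.5-14b"},
    "analyze": {"base_url": "https://api.example.com/v1",   "model": "claude-sonnet"},
}

def client_config(agent, api_keys):
    """Build the settings an OpenAI-compatible client needs for one agent."""
    cfg = AGENT_ENDPOINTS[agent]
    return {
        "base_url": cfg["base_url"],
        # Self-hosted vLLM servers accept any placeholder key.
        "api_key": api_keys.get(cfg["base_url"], "EMPTY"),
        "model": cfg["model"],
    }
```

Swapping one agent's model is then a one-line edit to `AGENT_ENDPOINTS`; the chain logic never changes.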

How to validate your model choices

Don't guess. Test.

  1. Build a test set. Collect 50-100 representative inputs for each agent step. Include edge cases.
  2. Run each tier. Process your test set through speed, workhorse, and frontier models.
  3. Score the outputs. Use automated metrics where possible (exact match for classification, ROUGE for summarization) and human review for quality-sensitive steps.
  4. Find the floor. For each step, the cheapest model that meets your quality bar is the right choice.
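The loop in steps 1-4 can be sketched with the model calls stubbed out. The scoring and floor-finding logic below is the runnable part; `results_by_model` stands in for real API outputs collected in step 2:

```python
# Sketch: score each tier on a test set, then pick the cheapest model
# that clears the quality bar. results_by_model stands in for outputs
# you'd collect from real API calls.
def exact_match_score(outputs, expected):
    """Fraction of outputs that exactly match the expected labels."""
    hits = sum(o == e for o, e in zip(outputs, expected))
    return hits / len(expected)

def cheapest_passing(results_by_model, expected, tiers_cheapest_first, bar=0.95):
    """Return the first (cheapest) model meeting the quality bar, else None."""
    for model in tiers_cheapest_first:
        if exact_match_score(results_by_model[model], expected) >= bar:
            return model
    return None

expected = ["bug", "billing", "bug", "feature"]
results = {
    "haiku":  ["bug", "billing", "bug", "billing"],  # 75% -- below the bar
    "sonnet": ["bug", "billing", "bug", "feature"],  # 100% -- passes
}
print(cheapest_passing(results, expected, ["haiku", "sonnet"]))
```

For quality-sensitive steps, swap `exact_match_score` for human review scores; the floor-finding loop stays the same.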

Most teams discover that 60-70% of their agent steps can use speed-tier models with no quality loss. The savings from those steps fund the frontier-tier models where they actually matter.

Common mistakes

Using the same model everywhere. The most expensive mistake. Differentiate by task.

Optimizing too early. Start with the workhorse tier for everything, get the chain working, then optimize individual steps downward. Optimize prematurely and you'll waste time debugging quality issues that are really just an underpowered model.

Ignoring latency. A chain with 6 agents using a slow model feels broken. Speed-tier models aren't just cheaper -- they're faster. Use them for steps where latency matters.

Not re-evaluating. The model landscape changes every few months. A model that was best-in-class six months ago might be outperformed by a model at half the price today. Review your model choices quarterly.

The goal isn't to find the perfect model. It's to find the cheapest model that's good enough for each specific task, and to have the flexibility to change when something better comes along.
