
Running Local LLMs in Agent Chains: Ollama, Privacy, and Performance

Mentiko Team

Every conversation about AI agents assumes you're sending data to OpenAI or Anthropic. For a lot of teams, that assumption is wrong. Legal teams processing privileged documents. Healthcare companies handling PHI. Financial firms with strict data residency requirements. Defense contractors. Any company whose compliance officer says "no data leaves our network."

Local LLMs solve this. Run the model on your own hardware, keep every byte on-premises, and still get the benefits of agent orchestration. The tradeoff is performance -- local models are smaller and less capable than the latest cloud offerings. But for many agent chain tasks, they're more than enough.

When local models make sense

Not every agent needs the most capable model available. Agent chains typically contain a mix of tasks with varying complexity:

Good for local models:

  • Classification (route this ticket to the right team)
  • Extraction (pull these fields from this document)
  • Summarization (condense this into a paragraph)
  • Formatting (convert this data into a report template)
  • Validation (does this output match the expected schema)

Better with cloud models:

  • Complex multi-step reasoning
  • Long-context analysis (processing 50-page documents)
  • Creative generation (marketing copy, original content)
  • Tasks requiring the latest training data
  • Anything where quality directly impacts revenue

The key insight: most agent chains have both types of tasks. A 5-agent chain might have 3 agents doing classification, extraction, and formatting (local is fine) and 2 agents doing analysis and synthesis (cloud models add real value). You don't have to choose one or the other.

Setting up Ollama for agent chains

Ollama is the most straightforward way to run local models. It handles model management, provides an OpenAI-compatible API, and runs on Linux, macOS, and Windows.

Install and pull a model:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull models suited for different agent tasks
ollama pull llama3.1:8b        # General purpose, fast
ollama pull llama3.1:70b       # More capable, slower
ollama pull mistral:7b         # Good at structured output
ollama pull codellama:13b      # Code-related tasks

Ollama exposes an API on localhost:11434. Mentiko connects to it the same way it connects to cloud APIs -- you just point the model config at your local endpoint:

{
  "name": "contract-classifier",
  "model": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "endpoint": "http://localhost:11434"
  },
  "prompt": "Classify this contract as: NDA, MSA, SOW, SLA, or OTHER. Respond with only the category name.",
  "triggers": ["chain:start"],
  "emits": ["classification:complete"]
}

The agent works identically whether the model is running on your machine or in a data center. The chain definition is the same. The events are the same. Only the model config changes.
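Under the hood, that config is just an HTTP call to Ollama's /api/generate endpoint. A minimal sketch of what the classifier request looks like, using only Python's standard library (the helper names here are illustrative, not part of any Mentiko API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str, document: str) -> dict:
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return {
        "model": model,
        "prompt": f"{prompt}\n\n{document}",
        "stream": False,  # return one complete JSON response, not a token stream
        "options": {"temperature": 0},  # deterministic output for classification
    }

def classify(document: str) -> str:
    """Send contract text to the local model and return its label."""
    body = build_request(
        "llama3.1:8b",
        "Classify this contract as: NDA, MSA, SOW, SLA, or OTHER. "
        "Respond with only the category name.",
        document,
    )
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Swapping in a cloud provider means changing the URL and auth, not the shape of the chain.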

Hardware requirements

Local model performance depends entirely on your hardware. Here's what to expect:

8B parameter models (Llama 3.1 8B, Mistral 7B):

  • Minimum: 8GB RAM, any modern CPU. Runs, but slowly (5-10 tokens/sec).
  • Recommended: 16GB RAM, Apple Silicon or NVIDIA GPU with 8GB+ VRAM. Fast enough for production (30-80 tokens/sec).
  • These models handle classification, extraction, and formatting well.

70B parameter models (Llama 3.1 70B):

  • Minimum: 64GB RAM or GPU with 40GB+ VRAM (A100, A6000).
  • Recommended: Multiple GPUs or a high-RAM system with quantization.
  • These approach cloud model quality for most tasks but need serious hardware.

Quantized models (Q4, Q5, Q8):

  • Reduce memory requirements by 50-75% with modest quality loss.
  • Q5 quantization is the sweet spot for most agent tasks -- minimal quality degradation, significant memory savings.
  • Ollama handles quantization automatically when you pull models.
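A rough rule of thumb for sizing hardware: weight memory is parameter count times bytes per weight. This sketch ignores KV cache and runtime overhead, so treat it as a floor, not a budget:

```python
# Approximate bytes per weight at each precision / quantization level.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q5": 0.625, "q4": 0.5}

def model_memory_gb(params_billions: float, quant: str) -> float:
    """Rough weight-memory estimate; real usage adds KV cache and overhead."""
    return params_billions * BYTES_PER_PARAM[quant]

# An 8B model: ~16 GB at fp16 but ~5 GB at Q5 -- the 50-75% savings above.
print(model_memory_gb(8, "fp16"))  # 16.0
print(model_memory_gb(8, "q5"))    # 5.0
print(model_memory_gb(70, "q4"))   # 35.0 -- why a 40GB A100 can hold a 70B model
```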

For teams starting out: an M-series Mac with 32GB RAM runs 8B models fast enough for production agent chains. A single NVIDIA A100 handles 70B models comfortably. You don't need a data center to run local models, but you do need to size your hardware to your model choice.

The hybrid architecture

The real power isn't choosing local or cloud -- it's using both. A hybrid chain routes each agent to the right model based on the task requirements and data sensitivity.

{
  "name": "hybrid-document-processor",
  "agents": [
    {
      "name": "classifier",
      "model": { "provider": "ollama", "model": "llama3.1:8b" },
      "prompt": "Classify this document type: invoice, contract, memo, report, or other.",
      "triggers": ["chain:start"],
      "emits": ["classified"]
    },
    {
      "name": "pii-scanner",
      "model": { "provider": "ollama", "model": "llama3.1:8b" },
      "prompt": "Scan for PII: names, SSNs, addresses, phone numbers, email addresses. Redact all PII by replacing with [REDACTED-TYPE]. Output the redacted document.",
      "triggers": ["classified"],
      "emits": ["redacted"]
    },
    {
      "name": "analyzer",
      "model": { "provider": "anthropic", "model": "claude-sonnet-4-20250514" },
      "prompt": "Analyze the redacted document for key insights, anomalies, and action items.",
      "triggers": ["redacted"],
      "emits": ["analysis:complete"]
    },
    {
      "name": "summary-writer",
      "model": { "provider": "anthropic", "model": "claude-sonnet-4-20250514" },
      "prompt": "Write a concise executive summary from the analysis.",
      "triggers": ["analysis:complete"],
      "emits": ["chain:complete"]
    }
  ]
}

In this chain, the first two agents run locally. The classifier and PII scanner handle the raw document -- the sensitive data never leaves your network. After PII is redacted, the sanitized document goes to cloud models for deeper analysis. The cloud models are better at synthesis and reasoning, but they only see redacted data.

This is the pattern that unlocks AI agent chains for regulated industries. You get the privacy guarantees of local processing for sensitive data handling, combined with the quality of cloud models for the reasoning-heavy tasks.

Performance tuning for local models

Local models in agent chains have different performance characteristics than cloud APIs. Here's what to tune.

Context window management. Local models typically have smaller effective context windows than cloud models. A cloud model might handle 100K tokens effortlessly. Your local 8B model works best under 4K tokens. Design your agents to work with shorter inputs: summarize upstream output before passing it to a local agent, or chunk large documents and process them in batches.
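One simple way to respect that budget is to chunk on paragraph boundaries before handing text to a local agent. This sketch uses a crude chars-per-token heuristic, not a real tokenizer:

```python
def chunk_text(text: str, max_tokens: int = 4000, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks that fit a small model's effective context.

    The chars-per-token ratio is a rough English-text heuristic; swap in a
    real tokenizer for accurate counts.
    """
    budget = max_tokens * chars_per_token
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > budget:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes one request to the local agent, with a downstream agent merging the results.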

Batch processing. Cloud APIs handle concurrent requests gracefully because they're running on massive GPU clusters. Your local Ollama instance processes requests sequentially by default. For agent chains with fan-out patterns (multiple agents running in parallel), this creates a bottleneck. Solutions: run multiple Ollama instances on different ports, or serialize the fan-out into sequential processing and accept the latency.
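Running multiple instances is mostly a matter of starting each one on its own port (Ollama reads the OLLAMA_HOST environment variable) and spreading requests across them. A minimal round-robin sketch, with hypothetical helper names:

```python
import itertools

class OllamaPool:
    """Round-robin requests across several local Ollama endpoints."""

    def __init__(self, ports: list[int], host: str = "localhost"):
        self._endpoints = [f"http://{host}:{p}" for p in ports]
        self._cycle = itertools.cycle(self._endpoints)

    def next_endpoint(self) -> str:
        """Return the endpoint the next request should be sent to."""
        return next(self._cycle)

# e.g. after starting each instance with OLLAMA_HOST=127.0.0.1:<port> ollama serve
pool = OllamaPool([11434, 11435, 11436])
print(pool.next_endpoint())  # http://localhost:11434
print(pool.next_endpoint())  # http://localhost:11435
```

Note each instance loads its own copy of the model weights, so this trades memory for concurrency.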

Temperature and sampling. Local models are more sensitive to temperature settings than cloud models. For agent tasks that require consistent, deterministic output (classification, extraction, formatting), set temperature to 0. For generation tasks, stay low -- 0.3-0.5. High temperatures on smaller models produce more randomness and less useful variation than the same temperature on larger models.

System prompts. Keep system prompts short and direct for local models. A 500-word system prompt that works perfectly on a frontier model might confuse a 7B model. Shorter, more structured instructions with explicit output format examples work better.

Model selection per agent type

A practical mapping:

  • Classifiers and formatters: 7-8B models (Llama 3.1 8B, Mistral 7B). They're picking from fixed options or restructuring data.
  • Extractors: 8-13B for structured documents, but 70B or cloud for messy unstructured text.
  • Analyzers and generators: 70B or cloud. Analysis requires reasoning that smaller models struggle with, and generation quality scales with model size more than almost any other task.
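That mapping can live in a simple routing table that your chain builder consults when assigning models. A sketch (the tiers are the suggestions above, not a Mentiko API):

```python
# Route each agent type to the smallest model that handles it well.
MODEL_BY_AGENT_TYPE = {
    "classifier": {"provider": "ollama", "model": "llama3.1:8b"},
    "formatter":  {"provider": "ollama", "model": "llama3.1:8b"},
    "extractor":  {"provider": "ollama", "model": "llama3.1:70b"},
    "analyzer":   {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    "generator":  {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
}

def model_for(agent_type: str) -> dict:
    """Look up a model config, defaulting unknown agent types to the cloud tier."""
    return MODEL_BY_AGENT_TYPE.get(agent_type, MODEL_BY_AGENT_TYPE["analyzer"])
```

Defaulting unknown types to the stronger tier errs on the side of quality rather than cost.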

Cost comparison

Cloud API costs scale linearly with usage. Local model costs are mostly fixed (hardware) with marginal operational costs (electricity, maintenance).

A team running 1,000 executions per day at $0.10 each spends $3,000/month on API costs. The same workload on a dedicated GPU server (~$2,000/month cloud or ~$15,000 one-time purchase) has near-zero marginal cost per execution.

The breakeven depends on volume. Below 500 executions/day, cloud is usually cheaper when you factor in ops overhead. Above 2,000 executions/day, local hardware wins decisively. For hybrid architectures, the math is even better: run high-volume simple agents locally and send only complex tasks to cloud models.
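The arithmetic behind those numbers, as a quick sketch (the figures are the illustrative ones from this section, not benchmarks):

```python
def monthly_api_cost(execs_per_day: float, cost_per_exec: float) -> float:
    """Cloud spend scales linearly with volume (30-day month)."""
    return execs_per_day * cost_per_exec * 30

def breakeven_execs_per_day(hardware_monthly: float, cost_per_exec: float) -> float:
    """Daily volume at which fixed hardware cost matches cloud API spend."""
    return hardware_monthly / (cost_per_exec * 30)

print(monthly_api_cost(1000, 0.10))         # 3000.0 -- the example above
print(breakeven_execs_per_day(2000, 0.10))  # ~667 executions/day for a $2,000/mo server
```

The breakeven lands inside the 500-2,000/day band precisely because real deployments add ops overhead on the local side and volume discounts on the cloud side.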

Getting started

  1. Install Ollama and pull llama3.1:8b to start.
  2. Pick one agent in an existing chain -- the simplest one, probably a classifier or formatter.
  3. Switch that agent to use the local model. Run the chain. Compare output quality.
  4. If quality holds, switch the next simplest agent. Keep going until you find the boundary where local models aren't good enough.
  5. Everything above that boundary stays on cloud models. Everything below runs locally.

This incremental approach lets you find the right split for your specific workload without committing to an all-or-nothing decision.


Want to run Mentiko fully self-hosted? See why self-hosting matters or build your first chain.
