
We Built a Multi-Agent Platform Without Kubernetes

Mentiko Team

We spent the first two months of Mentiko building the "right" way. Kubernetes cluster, containerized agents, service mesh for inter-agent communication, a message broker for events, health checks, liveness probes, the works. We had 14 YAML files just for the orchestration layer.

Then we threw all of it away and replaced it with bash scripts.

This isn't a contrarian flex. We didn't start with bash because we're luddites. We started with Kubernetes because that's what you're supposed to do. We ended with bash because agents aren't web services, and the tooling built for web services makes agents worse.

Here's what happened and why.

Agents need real terminals

The first thing that broke our Kubernetes setup was a simple requirement: agents need to run CLI tools. git clone, npm install, python script.py, claude --chat. These aren't HTTP endpoints. They're interactive programs that expect a terminal -- stdin, stdout, stderr, TTY signals, environment variables, the full POSIX interface.

We tried running agents in containers. It works for basic tasks. But the moment an agent needs to install a dependency, clone a repo, or use a tool that checks isatty(), you're fighting the abstraction. Docker containers give you isolated filesystems and cgroups. What agents need is a real PTY session -- a pseudo-terminal that behaves like a human sitting at a keyboard.

So we built pty-manager, a lightweight daemon that allocates PTY sessions for agents. Each agent gets a real terminal. It can run git, pipe output through jq, use ssh with a key agent, run interactive CLIs. No shims, no wrappers, no "we'll intercept the syscall and fake the response." Just a terminal.

# launch-agent.sh spawns an agent in a real PTY session
pty-manager allocate --session "$agent_id"
pty-manager exec --session "$agent_id" -- \
  claude --model sonnet --system-prompt "$prompt" \
  < "$input_file" > "$output_dir/$agent_id.output"

This is the kind of thing that's trivial on a single machine and painful in a container orchestrator. Kubernetes wants to manage your processes. Agents need to manage their own.

The orchestration layer is four bash scripts

The entire orchestration layer that replaced our Kubernetes deployment is four scripts:

  • chain-runner.sh reads a chain.json definition and orchestrates the full pipeline. It resolves dependencies, manages execution order, and handles the lifecycle of a chain run.
  • launch-agent.sh takes an agent config and spawns it in a PTY session via pty-manager. It sets up the environment, input files, and output directory.
  • event-trigger.sh watches for file-based events and triggers dependent agents when their preconditions are met.
  • complete-agent.sh handles agent completion -- capturing output, writing the event file, and signaling downstream agents.
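
For a concrete picture, a chain definition along these lines drives a simple pipeline -- the field names here are simplified for illustration, not the exact schema:

```json
{
  "chain": "blog-pipeline",
  "agents": [
    { "id": "researcher", "depends_on": [] },
    { "id": "editor",     "depends_on": ["researcher"] },
    { "id": "publisher",  "depends_on": ["editor"] }
  ]
}
```

chain-runner.sh walks this graph: agents with no unmet dependencies launch immediately, and everything else waits for the corresponding event files.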

There's also watchdog.sh, which detects stalled runs by monitoring PTY sessions for activity timeouts. If an agent hasn't produced output in N minutes, watchdog escalates -- first a retry, then a kill, then an alert.
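
The staleness check itself is small. A sketch, with simplified stand-in names (the real watchdog.sh differs in detail):

```shell
# Timeout before an agent counts as stalled (illustrative default).
TIMEOUT_SECS="${TIMEOUT_SECS:-300}"

# An agent is "stalled" if its output file hasn't changed recently.
is_stalled() {
  local output_file="$1" now last_mod age
  now=$(date +%s)
  # GNU stat first, BSD stat as fallback (Linux vs macOS)
  last_mod=$(stat -c %Y "$output_file" 2>/dev/null || stat -f %m "$output_file")
  age=$(( now - last_mod ))
  [ "$age" -ge "$TIMEOUT_SECS" ]
}

# Escalation ladder: retry first, then kill, then page a human.
escalate() {
  local agent_id="$1" attempt="$2"
  case "$attempt" in
    1) echo "retrying $agent_id" ;;   # re-run via launch-agent.sh
    2) echo "killing $agent_id" ;;    # e.g. tear down the PTY session
    *) echo "alerting on $agent_id" ;;
  esac
}
```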

Total line count across all five scripts (the four above plus watchdog.sh): around 800 lines of bash.

The Kubernetes equivalent was 2,400 lines of YAML, a custom operator written in Go, a Redis instance for pub/sub, and a Celery-like task queue. It had more infrastructure for managing agents than logic for actually running them.

File-based events as the communication layer

Agents communicate through event files. When an agent completes, complete-agent.sh writes a JSON file:

// .events/researcher.event
{
  "agent": "researcher",
  "status": "complete",
  "output_path": "/runs/abc123/researcher/output",
  "tokens_used": 14200,
  "duration_seconds": 34,
  "timestamp": "2026-03-19T09:41:22Z"
}

event-trigger.sh watches the .events/ directory using inotifywait (Linux) or fswatch (macOS). When a new event file appears, it checks the chain definition for agents that depend on that event and launches them.
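
The Linux path of that watch loop fits in a few lines. A sketch -- the fswatch branch is omitted, and trigger_dependents is a simplified stand-in for the real dependency check:

```shell
# Map an event file path to the agent that produced it.
event_agent() {
  local f
  f=$(basename "$1")
  printf '%s' "${f%.event}"
}

# Stand-in: the real script consults the chain definition and calls
# launch-agent.sh for each agent whose preconditions are now met.
trigger_dependents() {
  echo "would launch dependents of: $1"
}

watch_events() {
  local events_dir="$1"
  # -m: keep watching; -e create: fire when a new file lands
  inotifywait -m -e create --format '%w%f' "$events_dir" |
    while read -r path; do
      case "$path" in
        *.event) trigger_dependents "$(event_agent "$path")" ;;
      esac
    done
}
```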

This is intentionally low-tech. Here's what we get from that:

Debuggability. When something goes wrong, you cat the event file. You don't need a trace ID, a log aggregation service, or a dashboard. The state of the system is a directory of JSON files.

Auditability. Event files are just files. You can git diff them between runs, grep for error patterns across a month of runs, or write a one-liner that shows you the average token usage per agent.
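
That last one really is a one-liner wrapped in a function -- a sketch that assumes the flat event schema shown above, with naive line-based JSON parsing on purpose:

```shell
# avg_tokens: average "tokens_used" per agent across the given event files.
avg_tokens() {
  awk -F'[:,]' '
    /"agent"/       { gsub(/[" ]/, "", $2); agent = $2 }
    /"tokens_used"/ { sum[agent] += $2; n[agent]++ }
    END { for (a in sum) printf "%s %.0f\n", a, sum[a] / n[a] }
  ' "$@"
}

# Usage:
#   avg_tokens runs/*/.events/*.event
```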

Replayability. Want to re-run the pipeline from a specific point? Delete the event files from that point forward and re-trigger. No need to reset queues, clear caches, or reconstruct state.

# Debug: why did the editor fail?
$ cat .events/editor.event
{"agent": "editor", "status": "error", "error": "context_length_exceeded"}

# Replay: re-run from editor onward
$ rm .events/editor.event .events/publisher.event
$ bash event-trigger.sh --chain runs/abc123

We compared this against our previous setup where events went through Redis pub/sub. Redis is faster. It's also opaque -- messages are transient, you need a separate system to persist them, and debugging requires connecting to the broker and hoping the message hasn't expired.

Why not Kubernetes? Because agents aren't web services

Kubernetes solves specific problems brilliantly: scaling stateless HTTP services, rolling deployments, service discovery, load balancing. Agents have none of these requirements.

Agents are stateful. Each agent has an ongoing conversation context, working files, and environment state. Kubernetes' model of disposable pods that can be killed and rescheduled doesn't fit. When K8s evicts a pod running an agent mid-conversation, that work is lost.

Agents are long-running. A chain might run for 20 minutes or 2 hours. Kubernetes is optimized for request-response workloads measured in milliseconds to seconds. Long-running pods trigger health check timeouts, resource reclamation warnings, and autoscaler confusion.

Agent orchestration is I/O-bound, not compute-bound. An agent spends 95% of its time waiting for LLM API responses. The CPU sits idle. A single 4-core VPS can run hundreds of concurrent agent chains because the bottleneck is API latency, not local compute. Kubernetes' horizontal scaling doesn't help when there's nothing to scale.

The operational overhead is real. Running a Kubernetes cluster -- even a managed one -- means maintaining the control plane, managing node pools, configuring networking, setting up monitoring. For a workload that runs fine on a single machine, that's a tax you pay every day for benefits you never use.

We did the math. A managed Kubernetes cluster on any major cloud provider costs $70-150/month for the control plane alone, before any workloads. A VPS that handles our agent workload costs $20-40/month. And the VPS is simpler to operate by an order of magnitude.

What we gained

Deployment is scp and systemctl. No container registry, no image builds, no rolling deployment strategy. Copy the scripts, restart the services. Time from commit to production: under 30 seconds.

Debugging is ssh and cat. No kubectl, no log aggregation, no distributed tracing. SSH into the machine, read the event files, check the PTY session logs. Time from alert to root cause: under 5 minutes, consistently.

Onboarding is reading bash. New engineers don't need to understand Kubernetes concepts, CRDs, operators, or Helm charts. The entire orchestration layer is readable bash with comments. The learning curve is a few hours, not a few weeks.

Customer isolation is real isolation. Each customer gets their own VPS with their own instance of the platform. Not namespace isolation within a shared cluster -- actual separate machines. This is simpler to reason about, simpler to secure, and simpler to explain to customers who ask where their data lives.

What we gave up

We'd be lying if we said there were no tradeoffs. There are real ones.

Single-machine means single point of failure. If the VPS goes down, that customer's agents stop. We mitigate this with monitoring, automated restarts via systemd, and the fact that agent work is resumable -- chains can pick up from the last completed event. But it's not the same as Kubernetes automatically rescheduling pods across nodes.
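
The systemd side of that mitigation is an ordinary Restart= policy. A unit along these lines does it -- names, paths, and the serve subcommand are illustrative:

```ini
# /etc/systemd/system/pty-manager.service (illustrative)
[Unit]
Description=PTY session manager for agents
After=network.target

[Service]
ExecStart=/opt/mentiko/bin/pty-manager serve
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```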

Bash is harder to unit test. You can test bash scripts, but the tooling is nowhere near what you get with Go or Python. We lean on integration tests that run real chains end-to-end, and watchdog.sh as a safety net for runtime failures. It works, but it's not as rigorous as a typed codebase with property-based tests.

File-based events don't guarantee ordering. If two agents complete within the same millisecond, filesystem timestamps might not preserve order. We mitigate with monotonic sequence numbers in event filenames and conventions about conflict resolution. In practice this has caused exactly one bug in six months, but the theoretical risk is real.
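
The sequence-number convention is what keeps lexicographic filename order equal to completion order. A sketch of the numbering helper -- hypothetical, the real convention may differ in detail:

```shell
# next_seq: return the next zero-padded sequence number for an events dir,
# so filenames like 000042-editor.event sort in completion order.
next_seq() {
  local events_dir="$1" last
  last=$(ls "$events_dir" 2>/dev/null \
         | sed -n 's/^\([0-9]\{6\}\)-.*\.event$/\1/p' \
         | sort -n | tail -1)
  # 10# forces base 10 so leading zeros aren't read as octal
  printf '%06d' $(( 10#${last:-0} + 1 ))
}
```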

No horizontal scaling within a single instance. If a customer needs more than one machine can provide, we can't transparently spread their workload across nodes. The answer is vertical scaling (bigger VPS) or running separate instances for different workloads. For our current scale this is fine. For 10x our current scale, we'll need to revisit.

The meta-lesson

The infrastructure industry has a complexity ratchet. You start with a problem ("I need to run agents"), reach for the standard solution ("Kubernetes"), discover it doesn't quite fit ("agents aren't web services"), add abstractions to make it fit ("custom operators, sidecars, init containers"), and end up with a system whose accidental complexity dwarfs its essential complexity.

The alternative is to start from the problem and work outward. Agents need terminals? Give them terminals. Agents communicate through events? Use files. Agents need orchestration? Bash can orchestrate. At each step, ask whether the added complexity is solving a problem you actually have or one you might have someday.

We might need Kubernetes eventually. If we're running 10,000 concurrent instances and need automated multi-region failover, bash scripts won't cut it. But that's a future problem with future context. Today, our four bash scripts and a PTY daemon handle everything our customers need, and we spend our engineering time on making agents better instead of making infrastructure work.

The simplest thing that works isn't a consolation prize. Sometimes it's the best architecture.

If you're building agent workflows and this approach resonates, join the waitlist for early access. You'll get your own instance -- bash scripts and all.
