
Agent Orchestration: Lessons from Year One

Mentiko Team

We've been building and running Mentiko, an agent orchestration platform that lets teams define, execute, and monitor multi-agent chains, for a year. In that year, we made good decisions and bad ones. We built things that worked and things that didn't. Here's what we learned.

Lesson 1: File-based events were the right call

The most controversial decision we made was using the filesystem for inter-agent communication. Agent A writes an event file. Agent B watches for it and starts. No message queue. No database. Just files.

Engineers told us it wouldn't scale. They're right -- it won't scale to millions of events per second. But agent chains don't need that. A busy platform produces maybe 10,000 events per day.

What file-based events give you: debuggability (ls the events directory to see exactly what happened), transparency (users inspect every piece of data flowing through their chains), and simplicity (no RabbitMQ, no Kafka, no Redis pub/sub to manage).
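The pattern is simple enough to sketch in a few lines of bash. This is an illustrative sketch, not Mentiko's actual engine: the `EVENTS_DIR` layout and the `emit_event` / `wait_for_event` names are assumptions. The one detail that matters is atomicity -- write to a temp name, then `mv` into place, so a watcher never reads a half-written file.

```shell
#!/usr/bin/env bash
EVENTS_DIR="${EVENTS_DIR:-./events}"
mkdir -p "$EVENTS_DIR"

# Agent A emits an event: write to a temp file, then mv atomically
# so watchers never observe a partial write.
emit_event() {
  local name="$1" payload="$2"
  local tmp
  tmp=$(mktemp "$EVENTS_DIR/.${name}.XXXXXX")
  printf '%s\n' "$payload" > "$tmp"
  mv "$tmp" "$EVENTS_DIR/$name"
}

# Agent B blocks until the event file exists, then reads it.
# (inotifywait avoids polling on Linux, but polling is portable.)
wait_for_event() {
  local name="$1"
  until [ -f "$EVENTS_DIR/$name" ]; do sleep 0.2; done
  cat "$EVENTS_DIR/$name"
}

emit_event "agent-a.done" '{"status":"ok"}'
wait_for_event "agent-a.done"
```

Debugging really is just `ls "$EVENTS_DIR"` and `cat`-ing whichever file looks wrong.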

We'd make this decision again. Agent chains are I/O-bound on LLM API calls. The chain spends 95% of its time waiting for the model and 0.01% writing event files.

Lesson 2: Bash as the orchestration layer was right

Chains are defined in YAML, but the execution engine is bash. Every server has bash. Every developer knows enough bash to debug a problem: SSH into a server, read the scripts, and understand what's happening without learning a framework.

A custom runtime in Python or Go would give better performance and typing. It would also give users a runtime they can't inspect or debug without our help. We chose transparency over sophistication. The support questions we don't get are about mysterious runtime behavior, because there's no mysterious runtime.
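To make the transparency argument concrete, here is a toy sequential executor in the spirit of that design. The real engine does far more (events, retries, logging); the `run_chain` function and one-script-per-agent convention are ours, invented for illustration. Each agent reads the previous agent's output on stdin and writes its own to a file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy sequential chain executor: run each agent script in order,
# piping the previous agent's output file into the next one.
# Prints the path of the final output file.
run_chain() {
  local workdir="$1"; shift
  local prev=""
  for agent in "$@"; do
    local out="$workdir/${agent##*/}.out"
    if [ -n "$prev" ]; then
      bash "$agent" < "$prev" > "$out"
    else
      bash "$agent" > "$out"
    fi
    prev="$out"
  done
  printf '%s\n' "$prev"
}
```

Anyone who can read twenty lines of bash can see exactly what the orchestrator does -- which is the whole point.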

Lesson 3: No Kubernetes was essential

An agent chain needs: a process, internet access for LLM APIs, and a filesystem for events. That's a single server.

Kubernetes gives you container scheduling, service mesh, ingress controllers, PVCs, RBAC, namespaces, operators, CRDs, and a learning curve measured in months. Every team we've talked to that deployed agent orchestration on Kubernetes spent more time managing Kubernetes than managing their agents.

Our architecture: one server per workspace, Docker for isolation, SSH for remote execution. A $10/month VPS runs hundreds of chains. If someone tells you agent orchestration needs Kubernetes, ask them: scale to what? You need Kubernetes at thousands of concurrent executions. You probably have dozens.

Lesson 4: Multi-tenancy is harder than you think

When we launched, we assumed multi-tenancy would be straightforward. Each tenant gets a workspace. Workspaces are isolated. Done.

It's not done. Multi-tenancy in agent orchestration has unique challenges:

Secret isolation. One bug in environment variable scoping and you've leaked a production API key between tenants. Strict directory permissions and separate env namespaces are non-negotiable.

Resource contention. Without per-tenant resource limits, one team's 20-agent chain saturates the server and everyone else's chains time out. Noisy neighbors destroy multi-tenant platforms.

Cost attribution. Build cost tracking from day one. Adding it later means instrumenting every execution path retroactively.
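The secret-isolation point is worth sketching, because the mechanism is ordinary UNIX hygiene rather than anything exotic. This is a simplified illustration, not Mentiko's implementation: the `TENANTS_ROOT` layout and function names are assumptions. The key ideas are mode-700 per-tenant directories and launching each tenant's chain with `env -i`, so only that tenant's secrets are in the child environment.

```shell
#!/usr/bin/env bash
set -euo pipefail

TENANTS_ROOT="${TENANTS_ROOT:-./tenants}"

# Create a tenant's secrets directory, readable only by its owner.
provision_tenant() {
  local tenant="$1"
  install -d -m 700 "$TENANTS_ROOT/$tenant/secrets"
}

# Store one secret as a file; umask 077 keeps it owner-only.
put_secret() {
  local tenant="$1" key="$2" value="$3"
  (umask 077; printf '%s' "$value" > "$TENANTS_ROOT/$tenant/secrets/$key")
}

# Run a command with ONLY this tenant's secrets in its environment.
# env -i starts from an empty environment, so nothing can leak across.
run_as_tenant() {
  local tenant="$1"; shift
  local -a envs=()
  for f in "$TENANTS_ROOT/$tenant/secrets/"*; do
    [ -e "$f" ] || continue
    envs+=("$(basename "$f")=$(cat "$f")")
  done
  if [ "${#envs[@]}" -gt 0 ]; then
    env -i "${envs[@]}" "$@"
  else
    env -i "$@"
  fi
}
```

Nothing here prevents every other bug, but it makes the worst one -- a tenant reading another tenant's API key -- a permissions failure instead of a silent leak.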

We underestimated the effort for proper multi-tenancy by about 3x.

Lesson 5: Decision flow changed everything

We built Mentiko as an automated pipeline system. Agents execute in sequence, events flow between them, output comes out the other end. Fully automated.

Then users asked: "Can I review the output of Agent 2 before Agent 3 runs?"

This is the decision flow pattern. A chain pauses at a decision point, presents options to a human, and continues based on their choice. It turns a fully automated pipeline into a human-in-the-loop workflow.

We initially resisted this. It felt like a step backward -- the whole point was automation. But users were right. Many workflows need human judgment at critical junctures. A research chain should be fully automated. An action chain that sends emails or deploys code should have a human checkpoint.

Decision flow became one of our most-used features. The Tinder-style card interface (swipe to approve, reject, or request changes) makes it fast for humans to participate without becoming bottlenecks.
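Mechanically, a decision point fits the same file-based model as everything else. The sketch below is illustrative -- the `DECISIONS_DIR` convention and function names are assumptions, and in practice a UI writes the choice file -- but it shows how a chain can pause and branch without any new infrastructure.

```shell
#!/usr/bin/env bash
set -euo pipefail

DECISIONS_DIR="${DECISIONS_DIR:-./decisions}"
mkdir -p "$DECISIONS_DIR"

# The chain writes a pending decision for a human to review.
request_decision() {
  local id="$1" prompt="$2"
  printf '%s\n' "$prompt" > "$DECISIONS_DIR/$id.pending"
}

# Block until a choice file appears (written by the UI or a human),
# then print its contents: e.g. approve / reject / changes.
await_decision() {
  local id="$1"
  until [ -f "$DECISIONS_DIR/$id.choice" ]; do sleep 0.2; done
  cat "$DECISIONS_DIR/$id.choice"
}
```

The chain script then branches on the returned string: `case "$(await_decision d1)" in approve) ... ;; esac`.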

The lesson: don't be dogmatic about full automation. The best orchestration systems let humans participate when it matters and step back when it doesn't.

Lesson 6: Prompts are the new config

In traditional software, configuration files control behavior. In agent orchestration, prompts control behavior. This has implications we didn't fully appreciate at first.

Prompts need version control. When a chain's output quality degrades, you need to diff the prompt against the last known-good version. Without version control, you're debugging blind.

Prompts need testing. A small prompt change can dramatically change output. "Summarize this document" and "Summarize the key points of this document" produce meaningfully different results. Test prompt changes against a set of reference inputs before deploying.
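A minimal harness for that kind of prompt regression test might look like the sketch below. `RUN_PROMPT` stands in for whatever command invokes your model with a prompt file -- it is a hypothetical name, not a Mentiko tool -- and the `.in` / `.golden` file convention is likewise an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# RUN_PROMPT <prompt-file> < input  ->  model output on stdout (hypothetical)
# Compare the candidate prompt's output against stored golden outputs
# for every reference input. Returns nonzero if any case mismatches.
check_prompt() {
  local prompt="$1" cases_dir="$2" fail=0
  for input in "$cases_dir"/*.in; do
    local golden="${input%.in}.golden"
    if ! "$RUN_PROMPT" "$prompt" < "$input" | diff -q - "$golden" >/dev/null; then
      echo "MISMATCH: $input"
      fail=1
    fi
  done
  return "$fail"
}
```

Exact-match goldens are the crudest possible check -- for nondeterministic outputs you'd compare against a rubric or a similarity threshold instead -- but even this catches the "small edit, wildly different output" failures before they ship.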

Prompts need review. A prompt change in a production chain should go through the same review process as a code change. Someone other than the author should read it and verify it says what it means.

Prompts are not code. Developers instinctively want to DRY up prompts, parameterize them, build prompt templates with inheritance. This usually makes things worse. A prompt that's been abstracted through three layers of templates is harder to understand and debug than a prompt that says exactly what it means in plain language.

We now treat prompt changes as first-class deployments. They go through version control, review, staging validation, and production monitoring -- the same pipeline as code changes.

Lesson 7: Monitoring agent chains requires different metrics

Traditional monitoring tracks uptime, latency, and error rates. Agent chain monitoring needs those plus:

Output quality. A chain can complete successfully and produce garbage. "Success" in traditional monitoring means the process didn't crash. "Success" in agent monitoring means the output is correct, complete, and useful. You need quality checks, not just health checks.

Cost per run. A chain that gradually gets more expensive (longer prompts, more retries, larger context) is a slow-moving incident. Track cost per run over time and alert on drift.
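Cost-drift alerting needs almost nothing: a log of per-run costs and a comparison of the latest run against the trailing average. The sketch below is a toy version of the idea -- `COST_LOG`, `DRIFT_FACTOR`, and the function names are illustrative, not our monitoring layer's actual interface.

```shell
#!/usr/bin/env bash
set -euo pipefail

COST_LOG="${COST_LOG:-./cost.log}"
DRIFT_FACTOR="${DRIFT_FACTOR:-1.5}"

# Append one run's cost (in dollars) to the log.
record_cost() { printf '%s\n' "$1" >> "$COST_LOG"; }

# Exit 0 if the latest run is within DRIFT_FACTOR times the average
# of all earlier runs; exit 1 if it drifted.
check_drift() {
  awk -v factor="$DRIFT_FACTOR" '
    { sum += $1; last = $1; n++ }
    END {
      if (n < 2) exit 0                 # not enough history yet
      avg = (sum - last) / (n - 1)      # average of earlier runs
      exit (last > factor * avg) ? 1 : 0
    }' "$COST_LOG"
}
```

Run `check_drift` after each chain execution and page someone when it fails; a chain whose cost creeps from $0.10 to $0.45 per run is an incident even though every run "succeeded."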

Model dependency health. Your chain depends on external LLM APIs. If OpenAI's API latency doubles, your chain duration doubles. Monitor the APIs your chains depend on, not just your own infrastructure.

Prompt drift. The same prompt can produce different results over time as model providers update their models. If your chain's output quality drops without any changes on your end, the model changed. Track output consistency over time.

We built a monitoring layer that tracks all four. It caught problems that traditional monitoring would have missed: a chain that ran successfully every day but started producing lower-quality output because the model provider pushed an update.

Lesson 8: The marketplace took longer than the platform

Building a marketplace for sharing chains seemed natural. The marketplace itself was simple. Everything around it was hard: packaging chains for other workspaces (abstracting environment dependencies), trust (what can a marketplace chain access in your workspace?), versioning (does an update break your workflow?), and support (who helps when the publisher moves on?).

If we were starting over, we'd ship the platform for six months before building the marketplace. It's a multiplier -- but it multiplies problems too.

Lesson 9: Self-hosting wins for this category

Agent chains handle sensitive data, call APIs with your credentials, and make decisions affecting your business. Most teams aren't comfortable handing that to a third-party SaaS.

Self-hosting eliminates per-execution pricing that makes managed platforms expensive at scale. You pay for your server and API keys. We charge a flat rate for the software. The downside is more setup friction, but a single Docker command gets you running.

Lesson 10: Start simple, stay simple

The most important lesson after a year: complexity is the enemy. Every abstraction layer, every configuration option, every "advanced feature" adds cognitive load for users and maintenance burden for the team.

The features that users love are the simple ones: define a chain in YAML, run it, see the results. The features that cause the most support tickets are the complex ones: custom triggers, chain composition, dynamic agent selection.

When we're evaluating a new feature, the first question isn't "is this useful?" It's "does this make the simple case simpler?" If a feature makes the power user's life easier but the beginner's life harder, we usually don't build it.

Agent orchestration is complex enough inherently. The platform's job is to reduce that complexity, not add to it.

What we'd change

If we started over today:

  1. Multi-tenancy from day one. Not bolted on after launch.
  2. Cost tracking from day one. Same reason.
  3. Decision flow from day one. It's core to how people actually use agent chains, not an afterthought.
  4. No marketplace for the first six months. Get the core right first.
  5. More integration tests, fewer unit tests. Agent chains are integration-heavy. Unit tests on individual components missed most of the real bugs, which lived in the handoffs between agents.

Everything else -- file-based events, bash orchestration, no Kubernetes, self-hosting -- we'd do exactly the same.


Want to get started? Build your first chain in five minutes or read why we built Mentiko.
