Running Your Own Models on RunPod, Lambda Labs, and Beyond
Mentiko Team
There are good reasons to run your own models. Cost control at scale. Data sovereignty. Latency requirements. Freedom from rate limits. The infrastructure to do it has gotten dramatically easier in the past year.
This guide covers the practical steps: picking a GPU provider, deploying an inference server, and wiring it into your agent workflows.
Why self-host at all
The API model is simple. You send tokens, you get tokens back, you pay per million. For prototyping and low-volume use, it's the right call.
The math changes at scale. If you're running agent chains that make hundreds of LLM calls per day, the API bill adds up fast. A single GPT-4-class call might cost $0.03-0.10 in tokens. Multiply that across a 6-agent chain running 50 times daily, and you're looking at $300-1,000/month in API costs alone.
A self-hosted model on a rented GPU can handle the same volume for $200-400/month with no per-token charges. The breakeven point is usually around 500-1,000 calls per day, depending on the model and provider.
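The breakeven arithmetic is simple enough to script. This is an illustrative sketch, not a quote from any provider -- plug in your own GPU rental rate and average per-call cost:

```python
def monthly_api_cost(calls_per_day: float, cost_per_call: float, days: int = 30) -> float:
    """Per-call API spend accumulated over a month."""
    return calls_per_day * cost_per_call * days

def breakeven_calls_per_day(gpu_monthly: float, cost_per_call: float, days: int = 30) -> float:
    """Calls per day at which a flat GPU rental matches per-call API spend."""
    return gpu_monthly / (cost_per_call * days)

# A $300/month GPU against calls that average $0.02 each:
print(breakeven_calls_per_day(300, 0.02))  # 500.0
```

Above 500 calls per day, the flat rental wins; below it, the API does.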
Beyond cost, there are other reasons:
- No rate limits. Your GPU, your throughput. No 429 errors during peak usage.
- Data never leaves your network. Prompts and completions stay on your infrastructure.
- Consistent latency. No shared queue. First-token latency is predictable.
- Model pinning. The model doesn't change until you change it. No surprise capability regressions.
Picking a GPU provider
The GPU cloud market has matured. Here's what's available:
RunPod is the most popular for inference workloads. They offer on-demand and spot A100s, H100s, and consumer GPUs at competitive rates. The serverless endpoint feature is particularly useful -- you deploy a model and get an API endpoint that scales to zero when idle. On-demand A100 80GB runs about $1.64/hour. Spot pricing can drop to $0.80/hour.
Lambda Labs focuses on training but works well for inference too. Their H100 instances start at $2.49/hour. The advantage is simplicity -- you get a clean Ubuntu box with CUDA pre-installed and full root access. No container abstractions to deal with.
Vast.ai is the budget option. It's a marketplace of individual GPU owners renting hardware. Prices can be 50-70% lower than RunPod, but reliability varies. Good for experimentation, less ideal for production workloads.
Together AI sits between self-hosted and API. They run open-source models on their infrastructure and charge per token, but at 3-5x lower rates than proprietary APIs. You don't manage GPUs, but you get open-source model pricing.
For most teams, RunPod is the safest starting point. The serverless endpoints handle scaling, the pricing is transparent, and the community has documented most failure modes.
Deploying an inference server
The standard stack for self-hosted inference is vLLM or Text Generation Inference (TGI). Both turn a model checkpoint into an OpenAI-compatible API.
vLLM (recommended)
vLLM is the most widely deployed inference engine. It supports continuous batching, PagedAttention for efficient memory use, and OpenAI-compatible endpoints out of the box.
On RunPod, the fastest path is their vLLM serverless template:
- Create a serverless endpoint in the RunPod dashboard
- Select the vLLM worker template
- Set your model (e.g., meta-llama/Llama-4-Scout-17B-16E-Instruct)
- Pick your GPU tier (A100 80GB for 70B models, A10G for 7-8B models)
- Deploy
You'll get an HTTPS endpoint that accepts OpenAI-format requests.
For a persistent deployment on a pod:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
This gives you an OpenAI-compatible server at http://your-pod-ip:8000/v1/chat/completions.
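A quick way to smoke-test the server. This is a stdlib-only sketch; the address is a placeholder for your own pod, and the request is built but not sent so you can inspect it first:

```python
import json
import urllib.request

def build_chat_request(url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /chat/completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://your-pod-ip:8000/v1/chat/completions",  # placeholder address
    "not-needed-for-a-private-pod",  # vLLM only checks the key if launched with --api-key
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "Say hello in one word.",
)
# Send with urllib.request.urlopen(req, timeout=60), then read
# choices[0].message.content from the JSON response body.
```

The same payload shape works against a RunPod serverless endpoint or any other OpenAI-compatible server -- only the URL and key change.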
TGI (alternative)
Hugging Face's Text Generation Inference is the other major option. It's Docker-native and integrates well if you're already in the HF ecosystem:
docker run --gpus all -p 8080:80 \
-v /data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
--max-input-length 4096 \
--max-total-tokens 8192
Both engines handle quantized models (AWQ, GPTQ), which lets you run larger models on smaller GPUs.
Model sizing and GPU requirements
The rule of thumb: a model needs roughly 2x its parameter count in GB of VRAM when loaded in FP16 (two bytes per parameter). A 70B model needs ~140GB for the weights alone, which means two A100 80GBs or two H100s -- a single 80GB card won't hold it.
Quantization changes the equation:
| Model size | FP16 VRAM | 4-bit quantized VRAM | GPU options |
|---|---|---|---|
| 7-8B | ~16GB | ~5GB | A10G, RTX 4090 |
| 13B | ~26GB | ~8GB | A10G, A100 40GB |
| 34B | ~68GB | ~20GB | A100 40GB |
| 70B | ~140GB | ~40GB | A100 80GB, H100 |
For agent workloads, 4-bit quantization (AWQ or GPTQ) is usually the right trade-off. Quality loss is minimal for most tasks, and you cut GPU costs by 60-75%.
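The rule of thumb can be wrapped in a rough estimator. The 20% overhead factor below is a crude allowance for KV cache and activations -- in practice those scale with context length and batch size, so treat the output as a floor, not a guarantee:

```python
def vram_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Weight memory (bytes-per-param * param count) plus ~20% runtime overhead."""
    return params_billion * (bits / 8) * overhead

print(round(vram_gb(70)))     # FP16 70B: ~168 GB with overhead
print(round(vram_gb(70, 4)))  # 4-bit 70B: ~42 GB -- fits one A100 80GB
print(round(vram_gb(8, 4)))   # 4-bit 8B: ~5 GB -- fits an A10G easily
```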
Connecting self-hosted models to Mentiko
Mentiko uses a bring-your-own-keys architecture. You configure which API endpoint your agents call. Since vLLM and TGI both expose OpenAI-compatible APIs, the setup is straightforward.
In your chain definition, set the model endpoint to your self-hosted server:
{
"agent": "research-analyst",
"model": {
"provider": "openai-compatible",
"base_url": "https://your-runpod-endpoint.runpod.ai/v1",
"api_key": "your-runpod-api-key",
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct"
}
}
Because Mentiko runs on your infrastructure too, the network path is clean. Your Mentiko instance calls your inference server directly. No tokens pass through any third-party service.
You can also mix models within a chain. Use a self-hosted Llama for high-volume classification and extraction steps, and route the final analysis step to Claude or GPT-4 via their APIs. This hybrid approach optimizes cost without sacrificing quality where it matters.
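One way to sketch that routing outside any particular framework -- the step names and endpoints below are illustrative, not Mentiko's actual configuration schema:

```python
# Per-step routing table: cheap self-hosted model for high-volume steps,
# a hosted frontier model for the final analysis. Endpoints are placeholders.
ENDPOINTS = {
    "extract":  {"base_url": "http://your-pod-ip:8000/v1", "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct"},
    "classify": {"base_url": "http://your-pod-ip:8000/v1", "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct"},
    "analyze":  {"base_url": "https://api.openai.com/v1",  "model": "gpt-4o"},
}

def endpoint_for(step: str) -> dict:
    """Default any unlisted step to the cheap self-hosted endpoint."""
    return ENDPOINTS.get(step, ENDPOINTS["extract"])
```

Because every endpoint speaks the same OpenAI-compatible format, the calling code doesn't change per step -- only the base URL and model name do.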
Production considerations
Running inference in production is different from running a demo. A few things to get right early:
Health checks. vLLM exposes a /health endpoint. Poll it. GPU processes crash silently more often than you'd expect.
Request timeouts. Set aggressive timeouts on your agent's HTTP client. A hung inference server should fail fast so the chain can retry or fall back, not block for 5 minutes.
Monitoring GPU utilization. If your GPU is consistently below 40% utilization, you're overpaying. Consider a smaller instance or RunPod's serverless endpoints that scale to zero.
Model updates. Pin your model checkpoint hash. Don't pull latest -- a model update mid-production can change behavior in ways that break downstream agents.
Cold starts. Loading a 70B model takes 2-4 minutes. RunPod serverless handles this with pre-warmed workers, but if you're managing your own pod, keep it running or accept the startup latency.
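The first two points combine into a small probe. A stdlib-only sketch: /health is vLLM's liveness route, and the short timeout makes a wedged server fail fast instead of hanging the chain:

```python
import urllib.error
import urllib.request

def inference_server_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # Connection refused, DNS failure, or timeout: treat all as unhealthy.
        return False
```

Poll this on a schedule and restart the worker (or fail over to an API model) when it returns False.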
Cost comparison
Here's a real-world comparison for a 4-agent chain running 200 times per day, with each run averaging 8,000 input tokens and 2,000 output tokens per agent:
| Approach | Monthly cost | Notes |
|---|---|---|
| GPT-4o API | ~$960 | $2.50/1M input, $10/1M output |
| Claude Sonnet API | ~$1,300 | $3/1M input, $15/1M output |
| Self-hosted Llama 70B (RunPod A100, spot) | ~$580 | Always-on at ~$0.80/hour |
| Self-hosted Llama 70B (RunPod serverless) | ~$120 | Scales to zero, pay per second |
| Self-hosted Llama 8B (RunPod A10G) | ~$80 | Smaller model, cheaper GPU |
The savings compound as volume increases. The API bill scales linearly -- at 1,000 runs per day it reaches roughly $4,800-6,500/month -- while an always-on GPU costs the same flat rate until you saturate the card.
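The API rows follow directly from per-token prices, so it's worth scripting the arithmetic and plugging in current rates (prices change; the values below are the per-1M-token rates quoted in the table):

```python
def monthly_api_cost(runs_per_day, agents, in_tok, out_tok,
                     in_price, out_price, days=30):
    """in_price/out_price are dollars per 1M tokens."""
    in_millions = runs_per_day * agents * in_tok / 1e6 * days
    out_millions = runs_per_day * agents * out_tok / 1e6 * days
    return in_millions * in_price + out_millions * out_price

# 200 runs/day, 4 agents, 8k input / 2k output tokens per agent:
print(monthly_api_cost(200, 4, 8000, 2000, 2.50, 10))  # $2.50/$10 rates -> 960.0
print(monthly_api_cost(200, 4, 8000, 2000, 3, 15))     # $3/$15 rates    -> 1296.0
```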
When not to self-host
Self-hosting isn't always the right call.
If you're making fewer than 200 LLM calls per day, the API model is cheaper and simpler. If you need frontier-tier reasoning (complex multi-step analysis, nuanced writing), proprietary models still lead. If your team doesn't have someone comfortable with GPU infrastructure, the ops burden may not be worth the savings.
The best approach for most teams is hybrid: self-host for high-volume commodity tasks, use APIs for low-volume high-stakes tasks. Mentiko's chain definitions make this easy -- each agent in a chain can point to a different model endpoint.