
Mentiko for DevOps: Automate Incident Response

Mentiko Team

Your on-call engineer's phone buzzes at 2 AM. They open their laptop, squint at a PagerDuty alert, and start the same ritual they've done a hundred times: check the dashboard, grep the logs, correlate with recent deploys, run the same diagnostic commands, and -- if it's a known issue -- execute the same runbook they ran last week. Forty-five minutes later, the service is back up and they're writing an incident report they'll finish tomorrow (or never).

This is the incident response loop at most companies. It's manual, repetitive, and completely automatable for the majority of alerts. Mentiko doesn't replace your on-call rotation. It handles the 70% of incidents that follow known patterns so your humans can focus on the 30% that actually need a brain.

Alert fatigue is the real incident

Before we talk about agents, let's talk about the actual problem. It's not that incidents happen. It's that most alerts aren't incidents at all.

Industry data consistently shows that 60-70% of production alerts are noise. Transient spikes, flapping health checks, threshold-breached metrics that self-resolve in 90 seconds. Your on-call engineer gets paged, opens a terminal, investigates for 15-20 minutes, and concludes: nothing is wrong.

That's not just wasted time. It's trust erosion. After enough false alarms, engineers start ignoring alerts. They mute channels. They add increasingly aggressive snooze rules. By the time a real P0 fires, the response is slower because the team has been conditioned to assume it's noise.

The fix isn't better thresholds. You've already tuned those three times. The fix is a system that triages automatically, resolves known patterns without waking anyone up, and only pages a human when something genuinely novel is happening.

A 4-agent incident response chain

Here's how a Mentiko chain handles an incoming alert. Four agents, each specialized, running in sequence through Mentiko's event system.

Agent 1: AlertClassifier. This agent receives the raw alert payload -- from PagerDuty, OpsGenie, Datadog, or any webhook-capable monitoring tool. It parses the payload, determines severity (P0 through P4), and categorizes the incident: infrastructure, application, network, or database. The classification isn't just keyword matching. The agent understands context. A 500 error spike on a single endpoint after a deploy is categorized differently than a 500 spike across all endpoints with no recent changes. You train the classifier by giving it your historical incident data in its system prompt -- past incidents, their categories, their root causes. It gets smarter as you feed it more examples.
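To make the classifier's job concrete, here is a minimal sketch of the kind of deterministic pre-classification that could run before the payload reaches the LLM. The payload fields, category keywords, and severity rules are illustrative assumptions, not Mentiko's actual schema:

```python
# Sketch of pre-classification for an AlertClassifier-style agent.
# Payload shape and category hints are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Classification:
    severity: str   # "P0" through "P4"
    category: str   # infrastructure | application | network | database

CATEGORY_HINTS = {
    "database": ("deadlock", "connection pool", "replication lag"),
    "network": ("timeout", "dns", "packet loss"),
    "infrastructure": ("oom", "disk full", "node notready"),
}

def classify(payload: dict) -> Classification:
    """Map a raw monitoring webhook payload to severity + category."""
    text = (payload.get("title", "") + " " + payload.get("description", "")).lower()
    category = "application"  # default bucket when no hint matches
    for cat, hints in CATEGORY_HINTS.items():
        if any(h in text for h in hints):
            category = cat
            break
    # Escalate when the blast radius is service-wide rather than a single
    # endpoint -- the same context-sensitivity described above.
    severity = "P1" if payload.get("scope") == "service" else "P2"
    if payload.get("customer_impact"):
        severity = "P0"
    return Classification(severity, category)
```

In practice the LLM handles the ambiguous cases; a cheap heuristic pass like this just gives it a structured starting point.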

Agent 2: LogAnalyzer. Once the alert is classified, this agent pulls relevant logs from the affected service. It queries your logging stack -- ELK, Datadog Logs, CloudWatch, Loki, whatever you run. It's looking for error patterns, stack traces, rate changes, and correlations with recent deploys. If a deploy went out 12 minutes before the alert fired, it flags that. If the same error pattern appeared three days ago and self-resolved, it flags that too. The output is a structured analysis: here's what's failing, here's the likely cause, here's the confidence level.
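The deploy-correlation step is the easiest piece to sketch. Assuming (hypothetically) that deploy records and alert times are available as datetimes, a check like the "deploy 12 minutes before the alert" flag could look like this; the 15-minute window and record shape are assumptions:

```python
# Illustrative deploy-correlation check for a LogAnalyzer-style agent:
# flag any deploy that finished shortly before the alert fired.
from datetime import datetime, timedelta

def correlate_deploys(alert_time: datetime, deploys: list[dict],
                      window: timedelta = timedelta(minutes=15)) -> list[dict]:
    """Return deploys that finished within `window` before the alert."""
    return [
        d for d in deploys
        if timedelta(0) <= alert_time - d["finished_at"] <= window
    ]
```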

Agent 3: RunbookExecutor. This is where it gets real. Based on the category from Agent 1 and the analysis from Agent 2, this agent executes the matching runbook. Restart the service. Scale up the replica count. Roll back the last deploy. Flush the cache. Rotate the connection pool. These are the same actions your on-call engineer would take -- codified as executable steps the agent runs in a Mentiko workspace with SSH access to your infrastructure. You define the runbooks. You control what the agent can and can't do. It runs in a sandboxed workspace with only the permissions you grant. This isn't an AI with root access to production. It's an AI with the same restricted access your junior on-call has.
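The "you define what the agent can and can't do" boundary can be as simple as an allowlist: the agent requests a runbook by name, and anything not pre-approved is refused. The runbook names and commands below are illustrative, not a Mentiko format:

```python
# Minimal sketch of an allowlisted runbook registry. The agent can only
# request actions that an operator has codified here.
RUNBOOKS = {
    "restart-service": ["kubectl rollout restart deploy/{service}"],
    "scale-up": ["kubectl scale deploy/{service} --replicas={replicas}"],
    "rollback": ["kubectl rollout undo deploy/{service}"],
}

def render_runbook(name: str, **params) -> list[str]:
    """Resolve a runbook into concrete commands; refuse unknown runbooks."""
    if name not in RUNBOOKS:
        raise PermissionError(f"runbook {name!r} is not allowlisted")
    return [step.format(**params) for step in RUNBOOKS[name]]
```

The point of the design: the LLM chooses *which* approved action fits the incident, never *what* commands exist.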

Agent 4: IncidentReporter. After the runbook executes, this agent compiles the full incident timeline. Alert received at 02:14. Classified as P1/application at 02:14. Logs analyzed, root cause identified as OOM on checkout-service at 02:15. Runbook executed: service restarted, memory limit increased at 02:16. Service healthy at 02:17. Total resolution time: 3 minutes. The report includes root cause analysis, actions taken, whether human intervention was needed, and recommendations for preventing recurrence. It posts to Slack or Teams, creates a Jira or Linear ticket, and updates your incident tracking system.
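The timeline in that example can be compiled mechanically once each agent emits timestamped events. A sketch, with an assumed `(timestamp, description)` event shape:

```python
# Sketch of timeline compilation for an IncidentReporter-style agent.
# Assumes events arrive in order and the incident doesn't cross midnight.
from datetime import datetime

def summarize(events: list[tuple[str, str]]) -> str:
    """events: ("HH:MM", description) pairs, in chronological order."""
    fmt = "%H:%M"
    start = datetime.strptime(events[0][0], fmt)
    end = datetime.strptime(events[-1][0], fmt)
    minutes = int((end - start).total_seconds() // 60)
    lines = [f"{t} {desc}" for t, desc in events]
    lines.append(f"Total resolution time: {minutes} minutes")
    return "\n".join(lines)
```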

The whole chain runs in minutes. Not the 45-minute average you're living with now.

Alert routing by severity

Not every alert gets the same treatment. The chain routes based on the severity classification from Agent 1:

  • P0 (critical): Execute the runbook immediately AND page the on-call human. The AI handles the initial response -- your engineer wakes up to a partially or fully mitigated incident with a complete analysis already written.
  • P1 (high): Execute the runbook and notify the team channel. No page unless the runbook fails.
  • P2 (medium): Execute the runbook and log it for next-business-day review.
  • P3/P4 (low/informational): Log only. Batch these for the weekly ops review.
  • Known auto-resolvable patterns: Resolve without any human notification. The weekly report shows how many were handled silently.
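The routing table above reduces to a small lookup. The return values here describe the side effects a real chain would trigger; the field names are hypothetical:

```python
# The severity routing table as a sketch. Keys and values are
# illustrative, not a Mentiko configuration format.
def route(severity: str, auto_resolvable: bool = False) -> dict:
    if auto_resolvable:
        return {"run_runbook": True, "page": False, "notify": "weekly-report"}
    table = {
        "P0": {"run_runbook": True, "page": True, "notify": "oncall"},
        "P1": {"run_runbook": True, "page": False, "notify": "team-channel"},
        "P2": {"run_runbook": True, "page": False, "notify": "next-day-review"},
        "P3": {"run_runbook": False, "page": False, "notify": "weekly-report"},
        "P4": {"run_runbook": False, "page": False, "notify": "weekly-report"},
    }
    return table[severity]
```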

The P0 case is the most important. Your engineer still gets paged for critical incidents. But instead of starting from zero -- "what service is this, what changed recently, what do the logs say" -- they open their laptop to a complete incident brief. The diagnostic work is done. The initial mitigation may already be in progress. They're making decisions, not gathering information.

Integration with your existing stack

Mentiko doesn't require you to rip out your monitoring tools. The chain plugs into what you already run:

Trigger: A webhook from PagerDuty, OpsGenie, Datadog, Grafana, or any system that can POST JSON.

LogAnalyzer queries: Elasticsearch, Datadog Logs, CloudWatch Logs, Loki, Splunk. You configure the connection in the agent's workspace. The agent writes the queries -- you don't need to pre-define every possible log search.

RunbookExecutor: Runs in a Mentiko workspace with SSH access or API credentials you provide. It can kubectl, it can aws-cli, it can hit your internal APIs. The workspace is isolated. You define the boundary.

IncidentReporter output: Posts to Slack or Microsoft Teams via webhook. Creates tickets in Jira, Linear, or any system with an API. Writes to your incident management platform (PagerDuty, Rootly, incident.io).

Setup is a webhook URL and credentials for your logging and infrastructure tools. Most teams get the chain running in an afternoon.
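Since every vendor POSTs a different JSON shape, a thin adapter in front of the chain keeps the agents vendor-agnostic. The field paths below are illustrative guesses, not exact vendor schemas:

```python
# Sketch of normalizing vendor webhook payloads into one input shape.
# The per-vendor field paths are assumptions, not documented schemas.
def normalize(source: str, payload: dict) -> dict:
    """Map a vendor webhook payload to the chain's (assumed) input shape."""
    if source == "pagerduty":
        data = payload.get("event", {}).get("data", {})
        return {"title": data.get("title", ""), "source": "pagerduty"}
    if source == "grafana":
        return {"title": payload.get("title", ""), "source": "grafana"}
    # Fallback for any webhook-capable tool: take the most common fields.
    return {"title": payload.get("title", payload.get("message", "")),
            "source": source}
```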

The numbers

Here's what changes when you put an incident response chain in front of your monitoring:

| Metric | Before | After |
|---|---|---|
| Mean time to resolution (MTTR) | 45 minutes | 8 minutes (known patterns) |
| Alert noise reaching humans | 70% of all alerts | ~30% (auto-classified, auto-resolved) |
| Alerts auto-resolved without human | 0% | ~40% |
| Incident reports completed | ~60% (often skipped) | 100% (auto-generated) |
| On-call engineer wakeups (monthly) | 15-20 | 5-8 |

Cost: Mentiko's flat-rate plan at $29/month covers the orchestration. LLM API costs run roughly $0.50-1.00 per incident, depending on log volume and model choice. A team handling 200 incidents/month spends $100-200 on API costs. Compare that to the on-call engineer's hourly rate times the hours you're saving -- and the cost of extended outages from slow response.
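The arithmetic behind that estimate, as a one-liner you can rerun with your own incident volume (the $0.50-1.00 per-incident range is the assumption from above):

```python
# Monthly spend range: flat orchestration fee + per-incident LLM API cost.
def monthly_cost(incidents: int, flat: float = 29.0,
                 low: float = 0.50, high: float = 1.00) -> tuple[float, float]:
    return (flat + incidents * low, flat + incidents * high)
```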

The compounding effect

Here's the part that matters most long-term: the chain gets better over time.

Every incident that runs through the chain is data. When a runbook fails and a human has to intervene, you update the runbook. When the classifier miscategorizes an alert, you add the example to its training data. When the LogAnalyzer misses a correlation, you refine its search patterns.

After three months, the chain has seen your infrastructure's failure modes. After six months, it's encoding institutional knowledge that would take a new hire months to absorb. After a year, your alert noise is down significantly because the classifier has learned what's real and what's transient.

This is the real value proposition for SRE teams. It's not "AI replaces your on-call." It's "your on-call knowledge compounds instead of walking out the door when someone changes jobs." New engineers joining the on-call rotation get AI-augmented support from day one. They're not alone at 2 AM trying to remember which service talks to which database. The chain has that context encoded.

Getting started

You can build this chain in Mentiko's visual builder or write the JSON directly. The four agents are straightforward -- the real work is writing the runbooks and configuring the log queries for your specific stack.

We've published a starter template for a basic alert-response chain. Start with P3/P4 alerts (low risk), validate that the classification and log analysis are accurate, then gradually move up to P1 and P0 as you build trust.

If your team is spending more time responding to alerts than preventing them, join the waitlist and we'll get you set up with a dedicated instance.

Your on-call engineers have better things to do than restarting the same service at 2 AM every week.
