DevOps / Infrastructure · Stealth Startup
Self-Healing RCA Agent for Production Incidents
SRE teams were overwhelmed by alert fatigue from thousands of non-critical logs, missing actual root causes during production outages.
Tags: Self-Healing · RCA · Deployment Correlation · LangChain · Claude 3 Haiku
Business Impact
65% reduction in MTTR
The Problem
Alert fatigue is real. An SRE team was drowning in thousands of daily alerts—most non-critical, many duplicates, a few genuinely urgent. During outages, they’d spend precious minutes scrolling through logs trying to find the needle in the haystack. By the time they identified root cause, customers had already churned.
The Architecture
```mermaid
flowchart TB
  subgraph sources [Signal Sources]
    Datadog[Datadog Alerts]
    Logs[Application Logs]
    Deploys[Deployment Events]
    Metrics[System Metrics]
  end
  subgraph processing [Processing Layer]
    Vectorizer[Log Vectorizer]
    Correlator[Deployment Correlator]
    Anomaly[Anomaly Detector]
  end
  subgraph agents [RCA Agent System]
    Triage[Triage Agent]
    RCA[Root Cause Agent]
    Remediation[Remediation Agent]
  end
  subgraph output [Actionable Output]
    Diagnosis[Incident Diagnosis]
    Fix[Suggested Fix]
    Rollback[Rollback Command]
  end
  Datadog --> Triage
  Logs --> Vectorizer
  Vectorizer --> Triage
  Deploys --> Correlator
  Correlator --> RCA
  Metrics --> Anomaly
  Anomaly --> RCA
  Triage -->|"Critical Only"| RCA
  RCA --> Remediation
  Remediation --> Diagnosis
  Remediation --> Fix
  Remediation --> Rollback
```
Figure: Self-Healing Root Cause Analysis
The system operates in three phases:
- Triage Agent: Filters the noise. Uses Claude 3 Haiku (fast, cheap) to classify incoming alerts and suppress duplicates. Only genuinely critical signals proceed.
- Root Cause Agent: Correlates anomalies with deployment timestamps. “Error spike started 3 minutes after deploy #4521 to service-auth” is the kind of insight it surfaces automatically.
- Remediation Agent: Goes beyond diagnosis to suggest specific fixes—code snippets, config changes, or ready-to-execute rollback commands.
The “self-healing” aspect: for known failure patterns, the system can execute rollbacks automatically with human approval via Slack.
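The deployment-correlation step above is the part of the pipeline that does not need a language model at all: detect when the error rate leaves its baseline, then find the most recent deploy that landed shortly before that moment. The following is an added example, not the startup's code, of that logic. It assumes per-minute error-rate samples per service, deployment events carrying an id, a service name and a timestamp, and Kubernetes rollouts as the rollback mechanism (the `kubectl rollout undo` command is an assumption); the sample data is constructed so that the result matches the "3 minutes after deploy #4521 to service-auth" insight quoted above. The rollback command is only suggested and flagged as requiring approval, mirroring the Slack approval gate described for the self-healing path.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Added example (not the startup's code): a minimal deployment correlator.
# Assumptions: one error-rate sample per minute, deploy events with id/service/
# timestamp, and Kubernetes rollouts as the rollback mechanism (kubectl below).


@dataclass(frozen=True)
class Deploy:
    deploy_id: int
    service: str
    at: datetime


@dataclass(frozen=True)
class Correlation:
    spike_start: datetime
    deploy: Optional[Deploy]           # None when no deploy falls inside the window
    lag: Optional[timedelta]
    rollback_command: Optional[str]    # suggested only; never executed here
    requires_approval: bool = True     # the human-in-the-loop gate stays on


Sample = Tuple[datetime, float]  # (timestamp, error rate in [0, 1])


def detect_spike_start(samples: List[Sample], baseline_n: int = 5,
                       factor: float = 3.0, min_rate: float = 0.02) -> Optional[datetime]:
    """Return the first timestamp whose error rate exceeds `factor` x baseline.
    The baseline is the mean of the first `baseline_n` samples; `min_rate` stops a
    near-zero baseline from flagging harmless blips as a spike."""
    if len(samples) <= baseline_n:
        return None
    baseline = sum(rate for _, rate in samples[:baseline_n]) / baseline_n
    threshold = max(baseline * factor, min_rate)
    for ts, rate in samples[baseline_n:]:
        if rate > threshold:
            return ts
    return None


def correlate(spike_start: datetime, deploys: List[Deploy],
              window: timedelta = timedelta(minutes=15)) -> Correlation:
    """Pick the most recent deploy that landed within `window` before the spike."""
    candidates = [d for d in deploys
                  if d.at <= spike_start and spike_start - d.at <= window]
    if not candidates:
        return Correlation(spike_start, None, None, None)
    deploy = max(candidates, key=lambda d: d.at)
    lag = spike_start - deploy.at
    # Assumed rollback mechanism: undo the Kubernetes rollout of the deployed service.
    cmd = f"kubectl rollout undo deployment/{deploy.service}"
    return Correlation(spike_start, deploy, lag, cmd)


def describe(corr: Correlation) -> str:
    """Render the one-line insight the Root Cause Agent is meant to surface."""
    if corr.deploy is None or corr.lag is None:
        return f"Error spike started at {corr.spike_start:%H:%M} with no deploy in the window"
    minutes = int(corr.lag.total_seconds() // 60)
    return (f"Error spike started {minutes} minutes after deploy "
            f"#{corr.deploy.deploy_id} to {corr.deploy.service}")


# Constructed sample data: a quiet baseline, then the error rate jumps at 12:08.
T0 = datetime(2024, 1, 15, 12, 0)
SAMPLES: List[Sample] = [(T0 + timedelta(minutes=i), r) for i, r in enumerate(
    [0.004, 0.005, 0.004, 0.006, 0.005, 0.005, 0.006, 0.005, 0.21, 0.34, 0.33, 0.30])]
DEPLOYS = [
    Deploy(4519, "service-billing", T0 - timedelta(minutes=40)),  # outside the window
    Deploy(4520, "service-search", T0 + timedelta(minutes=1)),    # in window, but older
    Deploy(4521, "service-auth", T0 + timedelta(minutes=5)),      # most recent before spike
]


def run_example() -> Correlation:
    spike = detect_spike_start(SAMPLES)
    if spike is None:
        return Correlation(T0, None, None, None)
    return correlate(spike, DEPLOYS)


if __name__ == "__main__":
    result = run_example()
    print(describe(result))
    if result.rollback_command:
        print("Suggested rollback (pending approval):", result.rollback_command)
```

The window and the threshold factor are the two knobs worth tuning per service: too wide a window blames unrelated deploys, too narrow misses slow-burn regressions such as cache warm-up failures; the baseline floor (`min_rate`) is what keeps a service with a near-zero error rate from paging on every single failed request.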
Tech Stack
- LangChain — Agent orchestration and tool use
- Datadog API — Alert ingestion and metric retrieval
- Vectorized Log Storage — Semantic search across historical incidents
- Anthropic Claude 3 Haiku — Fast classification and correlation
The Impact
| Metric | Before | After |
|---|---|---|
| Mean Time to Recovery | 45 min | 16 min |
| Alert-to-RCA Time | 20 min | 2 min |
| Manual Log Auditing | 80% of incidents | 20% of incidents |
| False Positive Alerts | 60% | 10% |
The SRE team now sleeps better. Critical incidents get immediate attention while noise is automatically suppressed.