DevOps / Infrastructure · Stealth Startup

Self-Healing RCA Agent for Production Incidents

SRE teams were suffering alert fatigue from thousands of non-critical log alerts, and missed the actual root causes during production outages.

Self-Healing RCA · Deployment Correlation · LangChain · Claude 3 Haiku
Business Impact
Roughly 65% reduction in MTTR (45 min to 16 min)

The Problem

Alert fatigue is real. An SRE team was drowning in thousands of daily alerts: most non-critical, many duplicates, a few genuinely urgent. During outages, they’d spend precious minutes scrolling through logs trying to find the needle in the haystack. By the time they identified the root cause, customers had already churned.

The Architecture

flowchart TB
  subgraph sources [Signal Sources]
      Datadog[Datadog Alerts]
      Logs[Application Logs]
      Deploys[Deployment Events]
      Metrics[System Metrics]
  end
  
  subgraph processing [Processing Layer]
      Vectorizer[Log Vectorizer]
      Correlator[Deployment Correlator]
      Anomaly[Anomaly Detector]
  end
  
  subgraph agents [RCA Agent System]
      Triage[Triage Agent]
      RCA[Root Cause Agent]
      Remediation[Remediation Agent]
  end
  
  subgraph output [Actionable Output]
      Diagnosis[Incident Diagnosis]
      Fix[Suggested Fix]
      Rollback[Rollback Command]
  end
  
  Datadog --> Triage
  Logs --> Vectorizer
  Vectorizer --> Triage
  Deploys --> Correlator
  Correlator --> RCA
  Metrics --> Anomaly
  Anomaly --> RCA
  
  Triage -->|"Critical Only"| RCA
  RCA --> Remediation
  Remediation --> Diagnosis
  Remediation --> Fix
  Remediation --> Rollback
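
The "Critical Only" edge in the diagram amounts to a filter in front of the Root Cause Agent. Here is a minimal sketch of that gate, assuming alerts carry a service, a message, and a severity label. The `Alert` type, the fingerprint rule, and `keyword_classifier` are all hypothetical; the classifier is a rule-based stand-in for the LLM classification call, which is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Alert:
    service: str
    message: str
    severity: str  # hypothetical labels: "info", "warning", "critical"


def keyword_classifier(alert: Alert) -> str:
    """Rule-based stand-in for the LLM classifier (an assumption, not the real prompt)."""
    if alert.severity == "critical" or "outage" in alert.message.lower():
        return "critical"
    return "noise"


def triage(
    alerts: Iterable[Alert],
    classify: Callable[[Alert], str] = keyword_classifier,
) -> list[Alert]:
    """Suppress exact duplicates, then keep only alerts classified as critical."""
    seen: set[tuple[str, str]] = set()
    critical: list[Alert] = []
    for alert in alerts:
        fingerprint = (alert.service, alert.message)
        if fingerprint in seen:
            continue  # duplicate of an alert already handled
        seen.add(fingerprint)
        if classify(alert) == "critical":
            critical.append(alert)
    return critical


alerts = [
    Alert("service-auth", "5xx error rate above threshold", "critical"),
    Alert("service-auth", "5xx error rate above threshold", "critical"),
    Alert("service-cache", "cache hit ratio dipped", "info"),
]
escalated = triage(alerts)
print(escalated)
```

Passing the classifier in as a function keeps the dedup logic testable without a model call; a model-backed classifier could be swapped in behind the same signature.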

Self-Healing Root Cause Analysis

The system operates in three phases:

  1. Triage Agent: Filters the noise. Uses Claude 3 Haiku (fast, cheap) to classify incoming alerts and suppress duplicates. Only genuinely critical signals proceed.
  2. Root Cause Agent: Correlates anomalies with deployment timestamps. “Error spike started 3 minutes after deploy #4521 to service-auth” is the kind of insight it surfaces automatically.
  3. Remediation Agent: Goes beyond diagnosis to suggest specific fixes—code snippets, config changes, or ready-to-execute rollback commands.
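
The correlation step in phase 2 can be illustrated with a small time-window lookup. This is a sketch under stated assumptions rather than the system's actual logic: the `Deploy` record, the `correlate` and `describe` helpers, and the 15-minute window are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    deploy_id: str
    service: str
    deployed_at: datetime


def correlate(
    anomaly_start: datetime,
    service: str,
    deploys: list[Deploy],
    window: timedelta = timedelta(minutes=15),  # assumed lookback window
) -> Deploy | None:
    """Return the latest deploy to `service` that landed within `window` before the anomaly."""
    candidates = [
        d
        for d in deploys
        if d.service == service
        and timedelta(0) <= anomaly_start - d.deployed_at <= window
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def describe(anomaly_start: datetime, deploy: Deploy) -> str:
    """Render the correlation as a one-line, human-readable finding."""
    minutes = int((anomaly_start - deploy.deployed_at).total_seconds() // 60)
    return (
        f"Error spike started {minutes} minutes after "
        f"deploy #{deploy.deploy_id} to {deploy.service}"
    )


deploys = [
    Deploy("4519", "service-auth", datetime(2024, 1, 1, 9, 0)),
    Deploy("4520", "service-billing", datetime(2024, 1, 1, 11, 58)),
    Deploy("4521", "service-auth", datetime(2024, 1, 1, 12, 0)),
]
spike = datetime(2024, 1, 1, 12, 3)
match = correlate(spike, "service-auth", deploys)
if match is not None:
    print(describe(spike, match))
```

A fixed window is the simplest possible heuristic. It says nothing about causation, so a finding like this is a lead for the Root Cause Agent to weigh against the anomaly data, not a verdict.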

The “self-healing” aspect: for known failure patterns, the system can run the rollback itself once a human approves it in Slack.
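
The approval gate can be sketched as a function that refuses to run anything until a reviewer says yes. Everything named here is hypothetical: the `KNOWN_PATTERNS` set, the `RollbackPlan` type, and the two callbacks. In particular, `request_approval` stands in for the Slack interaction, which is not shown, and the command string is a placeholder rather than a real CLI invocation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical catalogue of failure patterns considered safe to auto-remediate.
KNOWN_PATTERNS = {"error_spike_after_deploy"}


@dataclass(frozen=True)
class RollbackPlan:
    pattern: str
    command: str  # shown to the reviewer; executed only after approval


def maybe_rollback(
    plan: RollbackPlan,
    request_approval: Callable[[RollbackPlan], bool],
    execute: Callable[[str], None],
) -> str:
    """Run the rollback only for known patterns and only after explicit approval."""
    if plan.pattern not in KNOWN_PATTERNS:
        return "escalated"  # unknown pattern: hand over to a human
    if not request_approval(plan):
        return "rejected"
    execute(plan.command)
    return "executed"


executed_commands: list[str] = []
plan = RollbackPlan(
    pattern="error_spike_after_deploy",
    command="<rollback command for service-auth>",  # placeholder, not a real command
)
status = maybe_rollback(
    plan,
    request_approval=lambda p: True,  # stub: pretend the reviewer approved
    execute=executed_commands.append,  # stub: record instead of running
)
print(status, executed_commands)
```

Keeping approval and execution as injected callbacks means the gate can be exercised in tests without touching a chat API or a production cluster.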

Tech Stack

  • LangChain — Agent orchestration and tool use
  • Datadog API — Alert ingestion and metric retrieval
  • Vectorized Log Storage — Semantic search across historical incidents
  • Anthropic Claude 3 Haiku — Fast classification and correlation
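
To illustrate what semantic search across historical incidents involves, here is a toy nearest-neighbour lookup. It is only a sketch: word counts stand in for real embeddings, an in-memory dict stands in for the vector store, and the incident IDs and texts are invented for the example.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query: str, incidents: dict[str, str]) -> str:
    """Return the ID of the historical incident closest to the query text."""
    query_vec = embed(query)
    return max(incidents, key=lambda key: cosine(query_vec, embed(incidents[key])))


# Hypothetical incident summaries, keyed by made-up IDs.
history = {
    "INC-1": "auth service token validation timeout after deploy",
    "INC-2": "disk full on logging node",
}
closest = most_similar("token validation timeout in auth service", history)
print(closest)
```

With real embeddings the ranking step stays the same; only `embed` and the storage layer change.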

The Impact

| Metric | Before | After |
| --- | --- | --- |
| Mean Time to Recovery | 45 min | 16 min |
| Alert-to-RCA Time | 20 min | 2 min |
| Manual Log Auditing | 80% of incidents | 20% of incidents |
| False Positive Alerts | 60% | 10% |
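
As a quick check on the table, the relative improvements can be computed from the before and after values. The helper name `pct_reduction` is just for illustration.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage rounded to one decimal."""
    return round((before - after) / before * 100, 1)


mttr_reduction = pct_reduction(45, 16)  # Mean Time to Recovery, minutes
rca_reduction = pct_reduction(20, 2)    # Alert-to-RCA Time, minutes
print(mttr_reduction, rca_reduction)
```

The MTTR figure comes out just under the 65% headline number, which is a rounded value.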

The SRE team now sleeps better. Critical incidents get immediate attention while noise is automatically suppressed.