AI Incident Intelligence

       
Company
   Personal Project
    Engine
   N/A
Platform
   Node.js / Docker Compose
    Skills
   Node.js, Apache Kafka, Cassandra 5, Google Gemini, GraphQL, Docker
Role
   Software Engineer
    Responsibilities
   Architecture, Implementation

Overview

An event-sourced incident intelligence platform built around Kafka, Cassandra, MinIO, and GraphQL, with a Gemini-backed AI triage layer. The write path produces events to Kafka — the source of truth — and two independent consumer groups project those events into query-shaped Cassandra tables. Apollo Server serves the read side over GraphQL and a small REST surface.

Architecture

Kafka is the source of truth. The projection consumer subscribes to incident-events and incident-enriched and writes into query-optimized Cassandra tables (incident-by-id, by-team, by-severity, service-health, artifacts, embeddings). The enrichment consumer is an independent group on the same source topic — it computes team assignment, normalizes severity, and emits INCIDENT_ENRICHED to a separate topic. Decoupling the two means an enrichment outage cannot block the core read path.

End-to-End Demo

A single npm run demo script posts an incident, polls until both consumer groups have caught up, and exercises all three AI endpoints — the kind of trace I’d hand to a reviewer to prove the platform actually works end-to-end.

════════════════════════════════════════════════════════════════════════
  AI Incident Intelligence — End-to-End Demo
════════════════════════════════════════════════════════════════════════

▶ Step 1/7  GET /health
            ✓ API reachable (12ms)

▶ Step 2/7  POST /incidents
            incidentId:  demo-1730000000000-pay
            service:     payments-api
            severity:    CRITICAL
            ✓ event published (id 7af3c1b2…) (38ms)

▶ Step 3/7  Wait for projection (Query.incidentTimeline)
            → projection consumer writes incident_events_by_id
            ✓ event landed in incident_events_by_id (612ms)

▶ Step 4/7  Wait for enrichment (Query.incidentsBySeverity)
            → enrichment consumer derives teamId, severityBucket, dayBucket
            → partition: (CRITICAL, 2026-04-27)
            ✓ row landed in incidents_by_severity (847ms)

▶ Step 5/7  POST /ai/incident-summary
            → Gemini structured-JSON output
            ✓ summary generated (1842ms)

▶ Step 6/7  POST /ai/similar-incidents (k=3)
            → Cassandra 5 SAI ANN cosine search over incident_embeddings
            ✓ returned 3 matches (489ms)

▶ Step 7/7  POST /ai/recommended-actions (k=3)
            → RAG: similar incidents + runbook chunks → Gemini
            ✓ returned 3 actions (2103ms)

════════════════════════════════════════════════════════════════════════
  Demo complete in 6.2s
════════════════════════════════════════════════════════════════════════

Key Features

Structured AI incident summary

Gemini 2.5 Flash with a structured-JSON response schema — customer impact, likely root cause, a numeric confidence, ranked next actions, and the raw signals it pulled out of the timeline.

{
  "summary": "Payments API timing out after 14:02 deploy; 18% error rate, p95 4.2s.",
  "customer_impact": "Checkout funnel impacted; users unable to complete purchases.",
  "likely_root_cause": "Postgres connection pool exhaustion after deploy.",
  "confidence": 0.87,
  "next_actions": [
    "Roll back payments-api to the previous release",
    "Inspect pg connection-pool sizing on payments-api",
    "Page database on-call to check postgres-primary saturation"
  ],
  "signals": [
    "error_rate=18%",
    "p95=4.2s",
    "postgres-primary connection timeouts"
  ]
}

Similar-incident retrieval

Cosine ANN over 768-dim Gemini embeddings, indexed with Cassandra 5 SAI. The query message is embedded on the fly and the self-match is filtered out.

Top matches:
  [score 0.912]  inc-2024-08-payments-pool-exhaustion   payments-api / CRITICAL
    "Payments API connection pool exhausted after release; checkout failing."
  [score 0.874]  inc-2024-06-checkout-timeouts          payments-api / HIGH
    "Checkout endpoint timing out under load; postgres-primary saturated."
  [score 0.831]  inc-2024-03-pg-pool-leak               orders-api   / HIGH
    "Connection pool leak in orders-api after switchover."

Similar incidents and the nearest runbook chunks are retrieved in parallel, then Gemini is asked for exactly three priority-sorted actions, each with a confidence and a risk rating.

{
  "actions": [
    { "priority": 1, "action": "Roll back payments-api to the previous release",
      "confidence": "HIGH",   "risk": "LOW",
      "reason": "Two retrieved runbooks and the closest similar incident all point at the deploy as the trigger." },
    { "priority": 2, "action": "Raise pg connection pool size on payments-api",
      "confidence": "MEDIUM", "risk": "MEDIUM",
      "reason": "Mitigates the symptom while the rollback is in flight; runbook db-pool-exhaustion.md recommends this as a holding action." },
    { "priority": 3, "action": "Page database on-call to inspect postgres-primary",
      "confidence": "MEDIUM", "risk": "LOW",
      "reason": "Repeated connection timeouts in the signal set warrant a primary health check." }
  ],
  "notes": "Two of three retrieved runbooks point at pool exhaustion as the dominant pattern."
}

GraphQL + REST surface

Apollo Server (with Sandbox in dev) for the read side, Express REST endpoints for ingestion and the AI features. Both surfaces share the same resolvers underneath.

Implementation Highlights

  • CQRS-style projections from a single Kafka source-of-truth topic, with two independent consumer groups
  • Idempotent consumers with linear-backoff retry (3 attempts), LWT-backed dedup in processed_messages, DLQ on exhaustion, and a manual replay tool
  • Compound partition key on incidents_by_severity (severity_bucket, day_bucket) so the CRITICAL bucket cannot grow into an unbounded partition
  • Best-effort embedding writes inside the projection consumer — a Gemini outage logs and skips, never blocking the core projection
  • Per-message correlation IDs (incidentId, eventId, topic, partition, offset) via pino, plus partition-aware kafka_consumer_lag_messages gauge refreshed every 10s on Prometheus

The vector index lives directly in Cassandra 5 — no separate vector store, no extra moving piece in the deployment:

CREATE TABLE incident_embeddings (
  incident_id text PRIMARY KEY,
  org_id text,
  service_name text,
  severity text,
  message text,
  embedding VECTOR<FLOAT, 768>,
  indexed_at timestamp
);

CREATE INDEX incident_embeddings_ann_idx
  ON incident_embeddings (embedding)
  USING 'sai'
  WITH OPTIONS = {'similarity_function': 'cosine'};

Tech Stack

  • Runtime: Node.js 18+, Express, Apollo Server
  • Streaming: Apache Kafka (KRaft, single-node for local dev)
  • Storage: Cassandra 5 with SAI vector index, MinIO for artifact blobs
  • AI: Google Gemini 2.5 Flash + embedding-001
  • Observability: pino structured logging, prom-client metrics on three ports
  • Tests: Jest integration test against the compose harness, plus per-feature smoke scripts
  • Infra: Docker Compose (Kafka, Cassandra, Kafka UI, MinIO)