FigJam Diagram: Alert Responder — AI-Powered SRE Agent (expires 2026-04-13)
AI-powered SRE agent that receives Prometheus Alertmanager webhooks, summarizes alerts via AWS Bedrock, posts to Slack, and performs agentic remediation using Claude 3.5 Sonnet with cluster-wide read + scoped write RBAC.
| Namespace | alert-responder |
| Port | :8007 |
| Image | harbor.k3s.internal.strommen.systems/production/alert-responder:sha-0bbd4a9 |
| Part-of | observability |
http://alert-responder.alert-responder:8007/webhook#k3s-alerts with the summary and a remediation option| Role | Model | Purpose |
|---|---|---|
| Alert summary | us.amazon.nova-micro-v1:0 |
Fast, cheap triage + channel message draft |
| Agentic remediation | us.anthropic.claude-3-5-sonnet-20241022-v2:0 |
Multi-step diagnosis + corrective actions |
The alert-responder-agent ServiceAccount has a cluster-wide ClusterRole:
Read access (diagnosis): pods, pod/log, services, endpoints, configmaps, events, nodes, namespaces, PVCs/PVs, deployments, statefulsets, daemonsets, jobs, ingresses, Longhorn volumes/replicas, Prometheus rules/monitors, metrics
Write access (remediations):
pods/exec — exec into containers for diagnosticspods: delete — delete crashing podsnodes: patch — cordon/uncordondeployments/statefulsets: patch,update — restart, scaleconfigmaps,services,jobs: create,update,patch,delete — config-level fixesSecrets are explicitly excluded from the ClusterRole — the agent cannot read Kubernetes Secrets.
From alert-responder-config ConfigMap:
| Key | Value |
|---|---|
SLACK_ALERTS_CHANNEL |
#k3s-alerts |
AWS_REGION |
us-east-1 |
BEDROCK_MODEL_ID |
us.amazon.nova-micro-v1:0 |
REMEDIATION_MODEL_ID |
us.anthropic.claude-3-5-sonnet-20241022-v2:0 |
CACHE_TTL_SECONDS |
21600 (6 hours) |
CLUSTER_NAME |
k3s-homelab |
REMEDIATION_MAX_STEPS |
20 |
PORT |
8007 |
| Secret | Keys | Purpose |
|---|---|---|
alert-responder-slack |
SLACK_BOT_TOKEN |
Slack bot token for posting to #k3s-alerts |
alert-responder-slack-app |
SLACK_APP_TOKEN |
Socket Mode app token — optional |
alert-responder-aws |
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
Bedrock IAM credentials |
Bootstrap:
kubectl create secret generic alert-responder-slack \
-n alert-responder \
--from-literal=SLACK_BOT_TOKEN="<xoxb-...>"
kubectl create secret generic alert-responder-aws \
-n alert-responder \
--from-literal=AWS_ACCESS_KEY_ID="<iam-access-key-id>" \
--from-literal=AWS_SECRET_ACCESS_KEY="<iam-secret-access-key>"
| Volume | Type | Size | Purpose |
|---|---|---|---|
alert-responder-cache |
Longhorn RWO | 1Gi | SQLite alert dedup database |
Prometheus metrics at :8007/metrics. Pod annotation: prometheus.io/scrape: "true", prometheus.io/port: "8007".
kubernetes/apps/alert-responder/alert-responder.yaml — Namespace, ServiceAccount, ClusterRole,
ClusterRoleBinding, PVC, ConfigMap,
Deployment, Service, Ingress, ServiceMonitor