FigJam Diagram: Cluster Health Monitor — Alert Processing & Remediation (expires 2026-04-13)
OpenClaw-integrated cluster health automation. Monitors AlertManager for firing alerts, attempts auto-remediation for known patterns, and runs scheduled wellness jobs (nightly wiki review, morning digest, weekly skill tests).
| Primary namespace | cluster-health-monitor |
| Secondary namespace | open-webui (webhook handler + scheduled CronJobs) |
| Part of | OpenClaw platform |
| Manifests | kubernetes/apps/cluster-health-monitor/ |
The primary alert processing loop. Runs every hour in the cluster-health-monitor namespace.
Schedule: 0 * * * * (every hour)
Flow:
Auto-remediation patterns:
| Alert Pattern | Auto-fix |
|---|---|
| `PersistentVolumeSpaceHigh` / `StorageAlmost*` | Expand PVC by 20% via kubectl patch |
| `CrashLoop*` / `*NotReady` | Delete pod (Kubernetes restarts it) |
| `Certificate*Expir*` | Delete TLS secret (cert-manager re-issues) |
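
The pattern table above reads like shell-style globs, so one way to sketch the lookup is with Python's `fnmatch` module. This is a minimal illustration, not the actual monitor script: the action labels (`expand_pvc`, `delete_pod`, `delete_tls_secret`) and the function name are hypothetical, and it assumes patterns are matched case-sensitively against the alert name.

```python
from fnmatch import fnmatchcase

# Glob patterns -> action label, mirroring the table above.
# The action labels are hypothetical names chosen for this sketch.
REMEDIATIONS = [
    (("PersistentVolumeSpaceHigh", "StorageAlmost*"), "expand_pvc"),
    (("CrashLoop*", "*NotReady"), "delete_pod"),
    (("Certificate*Expir*",), "delete_tls_secret"),
]


def pick_remediation(alert_name):
    """Return the action label for the first matching pattern, or None."""
    for patterns, action in REMEDIATIONS:
        if any(fnmatchcase(alert_name, pattern) for pattern in patterns):
            return action
    return None  # unknown alert: no auto-fix, leave it for a human


if __name__ == "__main__":
    for name in ("StorageAlmostFull", "KubeNodeNotReady", "HighCPU"):
        print(name, "->", pick_remediation(name))
```

Returning `None` for unmatched alerts keeps the auto-fix path opt-in: only alerts that match a known pattern trigger a change to the cluster.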
Receives push-mode webhooks from AlertManager/Prometheus. Deployed in open-webui namespace (2 replicas).
- :8080 (ClusterIP)
- claw-auto-remediation scripts for complex fixes

| Job | Schedule (UTC) | Local time (ET) | Purpose |
|---|---|---|---|
| nightly-wiki-review | `30 4 * * *` | 11:30 PM ET | Review repos, check docs, catch contradictions |
| morning-mood-boost | `0 12 * * *` | 7:00 AM ET | Compile uplifting news via NewsAPI → Slack |
| daily-error-report | `0 13 * * *` | 8:00 AM ET | Report on critical issues from nightly review |
| weekly-skill-tests | `0 11 * * 0` | Sundays 6:00 AM ET | Test AI skill effectiveness |
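
The local times in the schedule table above match the UTC cron expressions when Eastern Standard Time (UTC-5) is in effect. A cron expression evaluated in UTC does not follow daylight saving, so during Eastern Daylight Time (UTC-4) the same jobs fire one hour later on the local clock. The sketch below works through that arithmetic with fixed offsets. The function name is hypothetical, and it assumes the CronJobs are evaluated in UTC, as the table header indicates.

```python
def utc_to_local(hour_utc, minute, utc_offset_hours):
    """Convert a UTC wall-clock time to local time for a fixed UTC offset.

    Returns (hour, minute, day_shift); day_shift is -1 when the local time
    falls on the previous calendar day and 0 when it is the same day.
    """
    day_shift, hour_local = divmod(hour_utc + utc_offset_hours, 24)
    return hour_local, minute, day_shift


EST, EDT = -5, -4  # fixed offsets; which one applies depends on the date

SCHEDULES_UTC = {
    "nightly-wiki-review": (4, 30),
    "morning-mood-boost": (12, 0),
    "daily-error-report": (13, 0),
    "weekly-skill-tests": (11, 0),
}

if __name__ == "__main__":
    for job, (hour, minute) in SCHEDULES_UTC.items():
        print(job,
              "EST:", utc_to_local(hour, minute, EST),
              "EDT:", utc_to_local(hour, minute, EDT))
```

For `nightly-wiki-review`, 04:30 UTC is 23:30 on the previous day under EST but 00:30 on the same day under EDT, which is worth keeping in mind when reading the "11:30 PM ET" entry.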
CronJob duplication note: Two definitions exist for nightly-wiki-review:

- kubernetes/apps/nightly-wiki-review/cronjob.yaml: 30 4 * * * (correct). This is the authoritative deployment.
- kubernetes/apps/cluster-health-monitor/nightly-cronjobs.yaml: 30 23 * * * (stale, superseded).

Since both target the open-webui namespace with the same CronJob name, only one can be active. The nightly-wiki-review/ manifest is canonical. The stale one in cluster-health-monitor/ should be deleted.
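
A collision like this can be caught mechanically by grouping CronJob manifests on their (namespace, name) pair. The following is a sketch under stated assumptions: it works on already-parsed manifest dictionaries (YAML loading is left out to keep it dependency-free), and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_cronjobs(manifests):
    """Return CronJobs that are defined more than once.

    `manifests` is a list of (source_path, parsed_manifest_dict) pairs.
    The result maps (namespace, name) to a list of (source_path, schedule).
    """
    seen = defaultdict(list)
    for path, doc in manifests:
        if doc.get("kind") != "CronJob":
            continue
        meta = doc.get("metadata", {})
        key = (meta.get("namespace", "default"), meta.get("name"))
        seen[key].append((path, doc.get("spec", {}).get("schedule")))
    return {key: defs for key, defs in seen.items() if len(defs) > 1}


if __name__ == "__main__":
    manifests = [
        ("kubernetes/apps/nightly-wiki-review/cronjob.yaml",
         {"kind": "CronJob",
          "metadata": {"name": "nightly-wiki-review", "namespace": "open-webui"},
          "spec": {"schedule": "30 4 * * *"}}),
        ("kubernetes/apps/cluster-health-monitor/nightly-cronjobs.yaml",
         {"kind": "CronJob",
          "metadata": {"name": "nightly-wiki-review", "namespace": "open-webui"},
          "spec": {"schedule": "30 23 * * *"}}),
    ]
    for key, defs in find_duplicate_cronjobs(manifests).items():
        print(key, defs)
```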
All scheduled CronJobs fetch their scripts at runtime from GitHub (zolty-mat/home_k3s_cluster) — no container rebuild needed for script changes.
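
As a rough illustration of the runtime fetch, a job could download its script from GitHub's raw content host before running it. This is only a sketch: the branch name (`main`), the example path, and both function names are assumptions, and it presumes the repository is readable without credentials from inside the cluster (a private repository would need a token, which is not shown).

```python
from urllib.parse import quote
from urllib.request import urlopen

REPO = "zolty-mat/home_k3s_cluster"


def raw_script_url(path, ref="main"):
    """Build a raw.githubusercontent.com URL for a file in the repository.

    `ref` (branch, tag, or commit) defaults to "main", which is an assumption.
    """
    return f"https://raw.githubusercontent.com/{REPO}/{quote(ref)}/{quote(path)}"


def fetch_script(path, ref="main", timeout=10):
    """Download the script text; raises on HTTP errors or timeouts."""
    with urlopen(raw_script_url(path, ref), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "scripts/example.py" is a placeholder path, not a file known to exist.
    print(raw_script_url("scripts/example.py"))
```

Fetching from a branch means every run picks up the latest commit. Pinning `ref` to a tag or commit SHA would trade that convenience for reproducibility.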
| Secret | Namespace | Keys | Purpose |
|---|---|---|---|
| cluster-health-secrets | cluster-health-monitor | slack-webhook, NEWSAPI_KEY | CronJob notifications + NewsAPI |
| claw-alert-monitor-secrets | cluster-health-monitor | slack-webhook-url | Alert monitor Slack |
kubectl create secret generic claw-alert-monitor-secrets \
-n cluster-health-monitor \
--from-literal=slack-webhook-url="<slack-webhook-url>"
kubectl create secret generic cluster-health-secrets \
-n cluster-health-monitor \
--from-literal=slack-webhook="<slack-webhook-url>" \
--from-literal=NEWSAPI_KEY="<api-key>"
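
Once the secret exists, a job can read the webhook URL and post a message. The sketch below assumes the secret key is exposed to the container as an environment variable named `SLACK_WEBHOOK_URL`; that variable name and the function names are hypothetical. It relies on the standard Slack incoming-webhook format, a JSON body with a `text` field.

```python
import json
import os
from urllib.request import Request, urlopen


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(text, timeout=10):
    """Send `text` to Slack and return the HTTP status code."""
    # SLACK_WEBHOOK_URL is an assumed env var name populated from the secret.
    url = os.environ["SLACK_WEBHOOK_URL"]
    with urlopen(build_slack_request(url, text), timeout=timeout) as resp:
        return resp.status
```

Keeping request construction separate from sending makes the payload easy to check without network access.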
The claw-alert-monitor ServiceAccount has a ClusterRole with:
Known RBAC issue: Two ServiceAccounts exist: claw-alert-monitor (narrow ClusterRole) and cluster-health-monitor (broader ClusterRole). Unclear which is active. Needs reconciliation.
Duplicate webhook handler: alert-webhook-handler.yaml (replicas: 2, /alerts endpoint) and alert-webhook-handler-corrected.yaml (replicas: 1, /webhook endpoint) both exist. The -corrected.yaml has an invalid namespace: open-webui inside a secretKeyRef. Needs cleanup.
Schedule bug: morning-mood-boost and daily-error-report in nightly-cronjobs.yaml use UTC-naive cron expressions with ET comments, so they fire at the wrong times. The canonical schedules in kubernetes/apps/morning-mood-boost/cronjob.yaml are correct.
Namespace fragmentation: Several manifests deploy resources into the open-webui namespace rather than cluster-health-monitor. Consider consolidating in a future cleanup.
kubernetes/apps/cluster-health-monitor/
namespace-and-secrets.yaml — Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding
claw-alert-monitor.yaml — Hourly CronJob + RBAC + Python script ConfigMap
claw-auto-remediation.yaml — Auto-remediation script ConfigMap (open-webui ns)
alert-webhook-handler.yaml — Webhook receiver Deployment + Service (open-webui ns)
nightly-cronjobs.yaml — STALE — superseded by nightly-wiki-review/cronjob.yaml
remediation-jobs.yaml — Remediation job templates + AI diagnosis scripts
weekly-skill-tests.yaml — weekly-skill-tests CronJob + test definitions
alert-tuning.yaml — AlertManager rule tuning
kubernetes/apps/nightly-wiki-review/cronjob.yaml — authoritative nightly-wiki-review deployment