FigJam Diagram: Cluster Dashboard — Service Health Architecture (expires 2026-04-13)
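A lightweight Python Flask app that serves an HTML dashboard with live health checks for the cluster services, the Proxmox hosts, and the external Ollama host. A background thread probes every service every 30 seconds and caches the results. The frontend polls /api/health every 30 seconds to refresh its status indicators, so a status change can take up to roughly a minute to appear in the browser.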
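The probe-and-cache pattern described above can be sketched with the standard library alone. This is a minimal sketch, not the dashboard's actual code. The names `HealthCache`, `refresh_once`, and `start_background_probes` are hypothetical. The probe function is passed in as a parameter so the pattern can be shown without network access.

```python
import threading
import time
from typing import Callable, Dict


class HealthCache:
    """Thread-safe holder for the latest probe results (hypothetical name)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, bool] = {}
        self._updated_at = 0.0

    def update(self, results: Dict[str, bool]) -> None:
        # Replace the whole result set under the lock so readers never see
        # a mix of old and new values.
        with self._lock:
            self._results = dict(results)
            self._updated_at = time.time()

    def snapshot(self) -> Dict[str, object]:
        # Return a copy taken under the lock; a /api/health handler can
        # serialize it without holding the lock.
        with self._lock:
            return {"updated_at": self._updated_at, "services": dict(self._results)}


def refresh_once(cache: HealthCache, services: Dict[str, str],
                 probe: Callable[[str], bool]) -> None:
    # Probe outside the lock: network I/O can be slow and must not block readers.
    results = {name: probe(url) for name, url in services.items()}
    cache.update(results)


def start_background_probes(cache: HealthCache, services: Dict[str, str],
                            probe: Callable[[str], bool],
                            interval: float = 30.0) -> threading.Thread:
    def loop() -> None:
        while True:
            refresh_once(cache, services, probe)
            time.sleep(interval)

    # daemon=True lets the process exit without waiting for this loop.
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

In a Flask app, a `/api/health` route could return `jsonify(cache.snapshot())`. Because the probes run outside the lock, a slow or hanging service delays only the next cache update. Request handlers are not blocked.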
```mermaid
flowchart TD
    Browser["Browser\n(Vanilla JS)"]
    Dashboard["cluster-dashboard\nPod (default ns)"]
    API["/api/health\nCached results"]
    Thread["Background Thread\n(30s interval)"]
    Browser -->|"GET /api/health\nevery 30s"| API
    API --> Dashboard
    Thread -->|"Updates cache\nwith lock"| Dashboard
    subgraph Monitoring ["Monitoring & Observability"]
        G[Grafana :80]
        P[Prometheus :9090]
        AM[Alertmanager :9093]
        L[Loki :3100]
    end
    subgraph Infra ["Infrastructure"]
        TR[Traefik :443]
        LH[Longhorn :80]
        AR[Alert Responder :80]
    end
    subgraph Apps ["Applications"]
        CB[Cardboard :80]
        TB[Trade Bot :80]
        HA[Home Assistant :8123]
        WK[Wiki.js :80]
        DS[Digital Signage :80]
        AJ[Aja Recipes :3000]
        AB[Auto Brand :80]
        MP[Media Profiler :80]
        OC[OpenClaw :80]
        GH[GHA Dashboard :80]
    end
    subgraph Media ["Media Stack"]
        JF[Jellyfin :8096]
        JS[Jellyseerr :5055]
        RA[Radarr :7878]
        SO[Sonarr :8989]
        PR[Prowlarr :9696]
        BZ[Bazarr :6767]
    end
    subgraph Proxmox ["Proxmox Hosts"]
        PV1[pve1 :8006]
        PV2[pve2 :8006]
        PV3[pve3 :8006]
        PV4[pve4 :8006]
    end
    subgraph External ["External"]
        OL["Ollama Host\n192.168.1.214:9100"]
    end
    Thread -->|HTTP probe| Monitoring
    Thread -->|"HTTP(S) probe"| Infra
    Thread -->|HTTP probe| Apps
    Thread -->|HTTP probe| Media
    Thread -->|"HTTPS probe\n(ssl.CERT_NONE)"| Proxmox
    Thread -->|HTTP probe| External
```
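Key behaviors:
- HTTP 401, 403, and 404 responses count as healthy. The server answered, even if the endpoint requires authentication (401/403) or the path does not exist (404).
- Self-signed certificates (Proxmox, Traefik) are accepted via `ssl.CERT_NONE`.
- The cached results are guarded by a `threading.Lock`, so writes from the background thread cannot race with reads from the `/api/health` handler.
- Prometheus metrics are exposed at `/metrics`.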
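A probe that follows these rules could look like the sketch below. It uses only `urllib` and `ssl`. The helper names and the 5-second timeout are assumptions. Counting every 2xx/3xx status as healthy is also an assumption, since the tables list 200 as the expected status.

```python
import ssl
import urllib.error
import urllib.request

# Error statuses that still prove the server answered (per the rule above).
REACHABLE_ERROR_CODES = {401, 403, 404}


def is_healthy_status(status: int) -> bool:
    # Assumption: any 2xx/3xx is healthy; 401/403/404 are healthy too.
    return 200 <= status < 400 or status in REACHABLE_ERROR_CODES


def insecure_context() -> ssl.SSLContext:
    # Accept self-signed certificates (Proxmox, Traefik). check_hostname
    # must be disabled before verify_mode can be set to CERT_NONE.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe(url: str, timeout: float = 5.0) -> bool:
    context = insecure_context() if url.startswith("https://") else None
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return is_healthy_status(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still on err.code.
        # This clause must come before URLError, its parent class.
        return is_healthy_status(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or TLS error: service down.
        return False
```

The main pitfall with `urllib` is that a 401 from Home Assistant arrives as an `HTTPError` exception, not as a normal response. Without the dedicated `except` clause, an authenticated service would show as down.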

Monitoring & Observability:

| Service | Probe URL | Expected |
|---|---|---|
| Grafana | prometheus-grafana.monitoring:80/api/health | 200 |
| Prometheus | prometheus-kube-prometheus-prometheus.monitoring:9090/-/ready | 200 |
| Alertmanager | prometheus-kube-prometheus-alertmanager.monitoring:9093/-/ready | 200 |
| Loki | loki.monitoring:3100/ready | 200 |

Infrastructure:

| Service | Probe URL | Expected |
|---|---|---|
| Traefik | traefik.kube-system:443/ | Any (HTTPS) |
| Longhorn | longhorn-frontend.longhorn-system:80/ | 200 |
| Alert Responder | alert-responder.alert-responder:80/healthz | 200 |

Applications:

| Service | Probe URL | Expected |
|---|---|---|
| Cardboard | cardboard.cardboard:80/api/stats | 200 |
| Trade Bot | trade-bot.trade-bot:80/ | 200 |
| Home Assistant | home-assistant.home-assistant:8123/ | 200/401 |
| Wiki.js | wiki.wiki:80/healthz | 200 |
| Digital Signage | ds-frontend.digital-signage:80/healthz | 200 |
| Aja Recipes | aja-recipes.aja-recipes:3000/api/health | 200 |
| Auto Brand | auto-brand-web-ui.auto-brand:80/healthz | 200 |
| Media Profiler | media-profiler.media-profiler:80/api/health | 200 |
| OpenClaw | openclaw.open-webui:80/ | 200 |
| GHA Dashboard | gha-dashboard.gha-dashboard:80/healthz | 200 |

TODO: Verify OpenClaw probe — openclaw.open-webui:80/ may have the wrong service name/namespace. OpenClaw ops agent and Open WebUI are separate services; check against kubernetes/apps/openclaw-ops/ manifests.

Media Stack:

| Service | Probe URL | Expected |
|---|---|---|
| Jellyfin | jellyfin.media:8096/health | 200 |
| Jellyseerr | jellyseerr.media:5055/api/v1/status | 200 |
| Radarr | radarr.media:7878/ping | 200 |
| Sonarr | sonarr.media:8989/ping | 200 |
| Prowlarr | prowlarr.media:9696/ping | 200 |
| Bazarr | bazarr.media:6767/api | 200 |

Proxmox Hosts:

| Host | IP | Probe URL |
|---|---|---|
| pve1 | 192.168.1.105 | https://192.168.1.105:8006/ |
| pve2 | 192.168.1.106 | https://192.168.1.106:8006/ |
| pve3 | 192.168.1.107 | https://192.168.1.107:8006/ |
| pve4 | 192.168.1.108 | https://192.168.1.108:8006/ |

External:

| Service | Probe URL | Expected |
|---|---|---|
| Ollama Host | http://192.168.1.214:9100/metrics | 200 |

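192.168.1.214 is the dedicated Ollama LAN inference host. The Lima VM that previously used this IP was removed on 2025-06-25. Port 9100 is the conventional Prometheus node_exporter port, and Ollama's own API defaults to 11434. This probe therefore most likely confirms that the host is up, not that Ollama itself is serving.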

Not yet monitored:
- Harbor: health check at `/api/v2.0/ping`
- Gitea: health check at `/api/healthz`
- Authentik: health check at `/api/v3/-/health/live/`
- HAM: AI habit tracker
- WireGuard: VPN hub status
- Open WebUI: LLM front-end (Authentik-protected)
- Steve Lee Portfolio: ceramics portfolio

TODO: Add the above services to the SERVICES dict in the cluster-dashboard/dashboard.yaml ConfigMap and re-apply it.
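The sketch below shows one possible shape for the SERVICES dict. It holds a few entries copied from the tables above plus the pending services that have known health paths. The layout (display name mapped to probe URL) is an assumption, and the real ConfigMap may also store groups and expected statuses. Only the Harbor, Gitea, and Authentik paths come from the list above. Their host names are hypothetical placeholders that must be replaced with the real Service DNS names. HAM, WireGuard, Open WebUI, and Steve Lee Portfolio are left out because no probe path is recorded for them.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical SERVICES layout: display name -> probe URL.
SERVICES: Dict[str, str] = {
    "Grafana": "http://prometheus-grafana.monitoring:80/api/health",
    "Traefik": "https://traefik.kube-system:443/",
    "Home Assistant": "http://home-assistant.home-assistant:8123/",
    "pve1": "https://192.168.1.105:8006/",
    "Ollama Host": "http://192.168.1.214:9100/metrics",
    # Pending additions: the *_HOST names are placeholders, not real Services.
    "Harbor": "http://HARBOR_HOST/api/v2.0/ping",
    "Gitea": "http://GITEA_HOST/api/healthz",
    "Authentik": "http://AUTHENTIK_HOST/api/v3/-/health/live/",
}


def entries_needing_attention(services: Dict[str, str]) -> List[str]:
    """Names whose URL is malformed or still uses a placeholder host."""
    flagged = []
    for name, url in services.items():
        parts = urlsplit(url)
        # urlsplit lowercases .hostname, so placeholders end in "_host" here.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            flagged.append(name)
        elif parts.hostname.endswith("_host"):
            flagged.append(name)
    return flagged
```

Running `entries_needing_attention(SERVICES)` before re-applying the ConfigMap is a cheap way to catch a placeholder or a missing scheme that was never filled in.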

Manifests in kubernetes/apps/cluster-dashboard/:
- dashboard.yaml: Namespace, ConfigMap (Flask app + HTML), Deployment, Service, Ingress, ServiceMonitor
- service-ingresses.yaml: Ingress resources for Prometheus, Alertmanager, and Longhorn (internal subdomain routing)
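The ServiceMonitor in dashboard.yaml lets the Prometheus Operator scrape the dashboard, presumably at the `/metrics` endpoint mentioned earlier. The sketch below renders cached probe results in the Prometheus text exposition format using only the standard library. The metric name `cluster_dashboard_service_up` and its `service` label are hypothetical. The running app may use the `prometheus_client` library and different names.

```python
from typing import Dict

# Hypothetical metric name; check the pod's /metrics output for the real one.
METRIC = "cluster_dashboard_service_up"


def escape_label_value(value: str) -> str:
    # The text format requires escaping backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(results: Dict[str, bool]) -> str:
    """Render one gauge sample per service: 1 if the last probe passed, else 0."""
    lines = [
        f"# HELP {METRIC} 1 if the last probe of the service succeeded, else 0.",
        f"# TYPE {METRIC} gauge",
    ]
    for name in sorted(results):
        value = 1 if results[name] else 0
        lines.append(f'{METRIC}{{service="{escape_label_value(name)}"}} {value}')
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

A gauge with one label per service lets an alert rule fire on any series whose value is 0, without a separate rule for each service.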