FigJam Diagram: Proxmox Watchdog — Health Check & Power Cycle Flow (expires 2026-04-13)
A Kubernetes-deployed watchdog that monitors all four Proxmox hosts (pve1–pve4) and power-cycles them via a Kasa HS300 smart power strip when they become unresponsive. Runs with hostNetwork: true so it can reach the Kasa outlet even if the pod overlay network fails.

| Property | Value |
|---|---|
| Namespace | proxmox-watchdog |
| Image | harbor.k3s.internal.strommen.systems/production/proxmox-watchdog:latest |
| Network | hostNetwork: true (uses the host IP directly, bypasses Traefik/MetalLB) |
| Metrics port | 8000 |
| Kasa outlet | 192.168.1.205 (HS300 strip, KLAP protocol on port 80) |
| PSS level | privileged (hostNetwork requires it) |

| Parameter | Value |
|---|---|
| Check interval | 30 seconds |
| Consecutive failures to trigger | 3 |
| Boot wait after power cycle | 600 seconds (10 minutes) |
| Maximum power cycle attempts | 5 (then give up, fire ProxmoxMaxPowerCycles alert) |
Cycle sequence: 3 failed health checks → power off outlet → wait 10 minutes → power on → restart health check counter. If a host does not recover after 5 cycles, the watchdog stops cycling and fires a critical alert requiring manual intervention.
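
The cycle sequence can be sketched as a small per-host state machine. This is a minimal illustration rather than the watchdog's actual code: `HostState` and `record_check` are hypothetical names, and clearing both counters on a healthy check is an assumption. The thresholds repeat the values from the table above.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failed checks before a power cycle
MAX_POWER_CYCLES = 5   # power cycles attempted before giving up


@dataclass
class HostState:
    """Per-host bookkeeping (hypothetical structure for illustration)."""
    consecutive_failures: int = 0
    power_cycles: int = 0
    gave_up: bool = False


def record_check(state: HostState, healthy: bool) -> str:
    """Update state after one health check.

    Returns the action the caller should take:
    "none", "power_cycle", or "give_up".
    """
    if healthy:
        # Assumption: a healthy check clears all counters.
        state.consecutive_failures = 0
        state.power_cycles = 0
        state.gave_up = False
        return "none"

    if state.gave_up:
        # Already handed over to manual intervention; do nothing further.
        return "none"

    state.consecutive_failures += 1
    if state.consecutive_failures < FAILURE_THRESHOLD:
        return "none"

    # Threshold reached: restart the health check counter.
    state.consecutive_failures = 0
    if state.power_cycles >= MAX_POWER_CYCLES:
        state.gave_up = True
        return "give_up"

    state.power_cycles += 1
    return "power_cycle"
```

On `"power_cycle"` the caller would switch the outlet off, wait for the boot wait time, and switch it back on. On `"give_up"` it would stop cycling and leave recovery to the alert.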

ConfigMap (watchdog-config):

| Key | Value | Purpose |
|---|---|---|
| KASA_IP | 192.168.1.205 | Kasa HS300 outlet strip IP |
| CHECK_INTERVAL | 30 | Seconds between health checks |
| FAILURE_THRESHOLD | 3 | Consecutive failures before power cycle |
| BOOT_WAIT_TIME | 600 | Seconds to wait after power-off before power-on |
| MAX_POWER_CYCLES | 5 | Max cycles before giving up |
| METRICS_PORT | 8000 | Prometheus metrics port |

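
A minimal sketch of how a process might read these keys, assuming they are injected into the container as environment variables (for example via `envFrom` on the ConfigMap). `WatchdogConfig` and `load_config` are hypothetical names; the fallback defaults only repeat the documented values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WatchdogConfig:
    kasa_ip: str
    check_interval: int     # seconds between health checks
    failure_threshold: int  # consecutive failures before a power cycle
    boot_wait_time: int     # seconds between power-off and power-on
    max_power_cycles: int   # cycles before giving up
    metrics_port: int       # Prometheus metrics port


def load_config(env: Optional[Mapping[str, str]] = None) -> WatchdogConfig:
    """Read the config keys from env, falling back to the documented values."""
    env = os.environ if env is None else env
    return WatchdogConfig(
        kasa_ip=env.get("KASA_IP", "192.168.1.205"),
        check_interval=int(env.get("CHECK_INTERVAL", "30")),
        failure_threshold=int(env.get("FAILURE_THRESHOLD", "3")),
        boot_wait_time=int(env.get("BOOT_WAIT_TIME", "600")),
        max_power_cycles=int(env.get("MAX_POWER_CYCLES", "5")),
        metrics_port=int(env.get("METRICS_PORT", "8000")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"CHECK_INTERVAL": "10"})`.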

| Secret | Keys | Purpose |
|---|---|---|
| proxmox-watchdog-kasa | KASA_USERNAME, KASA_PASSWORD | Kasa cloud account credentials for SHIP 2.0 / KLAP authentication |

Bootstrap:

    kubectl create secret generic proxmox-watchdog-kasa \
      --namespace proxmox-watchdog \
      --from-literal=KASA_USERNAME=<kasa-account-email> \
      --from-literal=KASA_PASSWORD=<kasa-account-password>

Protocol note: The Kasa HS300 uses SHIP 2.0 (KLAP) on port 80. The watchdog connects directly to 192.168.1.205:80, NOT to the legacy API on port 9999. The KLAP handshake requires the Kasa cloud account credentials.
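
When debugging connectivity, it can help to confirm which of the two ports answers at all. The helper below is a hypothetical diagnostic, not part of the watchdog: it checks TCP reachability only and does not attempt the KLAP handshake or any authentication.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, `port_open("192.168.1.205", 80)` run from a node probes the KLAP port, and `port_open("192.168.1.205", 9999)` probes the legacy one.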

hostNetwork: true

The watchdog runs with hostNetwork: true so it can reach the Kasa outlet (192.168.1.205 on VLAN 1) even if the Kubernetes pod overlay network (Flannel/VXLAN) fails. If the overlay is down, pods with normal networking can't reach external IPs, but host-networked pods use the node's own routing table.
This design means the watchdog can recover Proxmox hosts even in partial cluster failure scenarios. Because hostNetwork: true is incompatible with restricted or baseline PSS, the namespace is labeled privileged in both kubernetes/core/pod-security-standards.yaml and the app manifest.

| Alert | Condition | Severity | Notes |
|---|---|---|---|
| ProxmoxHostDown | proxmox_host_status == 0 for 5m | critical | One host unreachable; watchdog may power-cycle |
| ProxmoxMaxPowerCycles | proxmox_power_cycle_attempts >= 5 | critical | Host did not recover; manual intervention required |
| ProxmoxWatchdogDown | up{job="proxmox-watchdog"} == 0 for 5m | warning | Watchdog itself is down; hosts unmonitored |
| KasaOutletHighPower | kasa_outlet_power_watts > 150 for 5m | warning | ThinkCentre M920q baseline ~65W; >150W indicates hardware stress |

Metrics:

- proxmox_host_status: gauge, 0=down/1=up, label host
- proxmox_power_cycle_attempts: counter, label host
- kasa_outlet_power_watts: gauge, label host
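
To show what a scrape of these three metrics could look like, here is a standard-library sketch that renders them in the Prometheus text exposition format. `render_metrics` is a hypothetical helper, and the sample values in the usage note are made up. The real watchdog may well use a client library instead, which could change the exposed names.

```python
from typing import Mapping, Union

Number = Union[int, float]


def render_metrics(
    host_status: Mapping[str, int],
    cycle_attempts: Mapping[str, int],
    outlet_watts: Mapping[str, Number],
) -> str:
    """Render per-host samples as Prometheus text exposition lines."""
    families = [
        ("proxmox_host_status", "gauge", host_status),
        ("proxmox_power_cycle_attempts", "counter", cycle_attempts),
        ("kasa_outlet_power_watts", "gauge", outlet_watts),
    ]
    lines = []
    for name, metric_type, samples in families:
        lines.append(f"# TYPE {name} {metric_type}")
        for host in sorted(samples):
            # One sample per host, identified by the "host" label.
            lines.append(f'{name}{{host="{host}"}} {samples[host]}')
    return "\n".join(lines) + "\n"
```

Calling `render_metrics({"pve1": 1, "pve2": 0}, {"pve2": 2}, {"pve1": 64.5})` yields one `# TYPE` line per metric followed by one labeled sample per host.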

latest tag: The image is deployed with :latest instead of a SHA-pinned tag. Consider updating CI to deploy with sha-<commit> tags for reproducibility.

kubernetes/apps/proxmox-watchdog/proxmox-watchdog.yaml -- Namespace (privileged PSS),
ConfigMap, Deployment (hostNetwork),
Service, ServiceMonitor, PrometheusRule
kubernetes/core/pod-security-standards.yaml -- PSS label for namespace (privileged)