FigJam Diagram: Proxmox Watchdog — Auto Power Recovery (expires 2026-04-13)
Critical safety net for the Proxmox cluster. Monitors pve hosts over ICMP/HTTP, and when a host is unreachable for 3 consecutive checks, power-cycles it via the Kasa HS300 smart outlet at 192.168.1.205. Runs with hostNetwork: true so it can reach the outlet even if the Flannel/VXLAN pod overlay network is down.
| Field | Value |
|---|---|
| Namespace | proxmox-watchdog |
| PSS Level | privileged (required for hostNetwork: true) |
| Image | harbor.k3s.internal.strommen.systems/production/proxmox-watchdog:latest |
| Metrics | :8000/metrics |
Why hostNetwork: true? The Kasa HS300 (192.168.1.205) is on the host LAN. If the Flannel/VXLAN overlay is down, pods cannot reach external IPs. hostNetwork: true bypasses the pod network entirely — the watchdog uses the node's own network stack so it can still reach the outlet during an overlay outage.
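One way to confirm that the node's network stack can still reach the outlet is a plain TCP connect to its KLAP port. The sketch below is an illustration only: the function name `outlet_reachable` and the 3-second timeout are assumptions, not taken from watchdog.py.

```python
import socket


def outlet_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves the port accepts connections,
    not that KLAP authentication would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling `outlet_reachable("192.168.1.205")` from a pod with hostNetwork: true should keep working during an overlay outage, whereas the same call from a pod on the overlay network may not.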
| Outlet | Host | IP | Notes |
|---|---|---|---|
| 0 | pve1 | 192.168.1.105 | - |
| 1 | pve2 | 192.168.1.106 | - |
| 2 | pve3 | 192.168.1.107 | - |
| 3 | (empty) | - | Unused |
| 4 | NAS (DXP4800) | 192.168.30.10 | Do not power-cycle automatically |
| 5 | pve4 | 192.168.1.108 | - |
Kasa HS300 IP: 192.168.1.205
Protocol: KLAP / SHIP 2.0 (port 80, authenticated) — not legacy port 9999
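The outlet assignments could be represented in code roughly as follows. This is a sketch under assumptions: the `Outlet` dataclass, the `auto_cycle` flag (derived here from the Notes column), and `outlet_for_host` are hypothetical names, and the real watchdog.py may store this mapping differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outlet:
    index: int
    host: Optional[str]
    ip: Optional[str]
    auto_cycle: bool  # hypothetical flag derived from the Notes column


OUTLETS = [
    Outlet(0, "pve1", "192.168.1.105", True),
    Outlet(1, "pve2", "192.168.1.106", True),
    Outlet(2, "pve3", "192.168.1.107", True),
    Outlet(3, None, None, False),  # unused outlet
    Outlet(4, "NAS (DXP4800)", "192.168.30.10", False),  # never cycled automatically
    Outlet(5, "pve4", "192.168.1.108", True),
]


def outlet_for_host(host: str) -> Outlet:
    """Return the outlet for a host, refusing outlets excluded from auto power-cycling."""
    for outlet in OUTLETS:
        if outlet.host == host:
            if not outlet.auto_cycle:
                raise PermissionError(f"{host} must not be power-cycled automatically")
            return outlet
    raise KeyError(f"no outlet mapped to {host}")
```

Keeping the NAS in the table but marking it ineligible means a lookup for it fails loudly instead of silently falling through to a power cycle.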
graph LR
WD["proxmox-watchdog\nhostNetwork: true\n:8000 metrics"] -->|"ping + HTTP\nevery 30s"| PVE["Proxmox Hosts\n192.168.1.105-108\npve1/pve2/pve3/pve4"]
WD -->|"KLAP (port 80)\nKasa SHIP 2.0 protocol\non 3rd consecutive failure"| KASA["Kasa HS300\n192.168.1.205\nPower strip outlet per host"]
KASA -->|"power cycle\n(off, wait 10 min, on)"| PVE
PROM["Prometheus"] -->|"/metrics :8000"| WD
style WD fill:#dc2626,color:#fff
style KASA fill:#e67e22,color:#fff
style PVE fill:#374151,color:#fff
flowchart TD
CHECK["ICMP ping all 4 hosts every 30s"]
CHECK --> SUCCESS{ping OK?}
SUCCESS -->|yes| RESET["Reset failure counter"]
SUCCESS -->|no| INC["Increment failure counter"]
INC --> THRESH{counter >= 3?}
THRESH -->|no| CHECK
THRESH -->|yes| LIMIT{power_cycles >= 5?}
LIMIT -->|yes| STOP["STOP — Alert operator"]
LIMIT -->|no| OFF["Kasa outlet OFF"]
OFF --> WAIT["Wait 600s (10 min)"]
WAIT --> ON["Kasa outlet ON"]
ON --> COUNT["Increment power_cycles_total"]
COUNT --> CHECK
style STOP fill:#dc2626,color:#fff
style OFF fill:#f59e0b,color:#000
style ON fill:#10b981,color:#fff
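The decision logic in the flowchart can be sketched as a small per-host state machine. The names `HostState` and `check_once` are assumptions for illustration, and the power-cycle action is injected as a callable so the logic can be exercised without touching the outlet. The flowchart shows no counter reset after a power cycle, so this sketch does not add one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HostState:
    failures: int = 0       # consecutive failed health checks
    power_cycles: int = 0   # power cycles issued so far
    stopped: bool = False   # set once the power-cycle limit is reached


def check_once(
    state: HostState,
    ping_ok: bool,
    power_cycle: Callable[[], None],
    failure_threshold: int = 3,
    max_power_cycles: int = 5,
) -> str:
    """Apply one health-check result and return the action taken."""
    if ping_ok:
        state.failures = 0
        return "ok"
    state.failures += 1
    if state.failures < failure_threshold:
        return "counting"
    if state.power_cycles >= max_power_cycles:
        state.stopped = True  # stop and leave recovery to the operator
        return "stop"
    power_cycle()  # outlet off, wait, outlet on
    state.power_cycles += 1
    return "cycled"
```

A caller would invoke `check_once` once per host on every check interval, passing a function that drives the Kasa outlet for that host.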
| Step | Tool | Trigger | Action |
|---|---|---|---|
| 1 | Network Rescue (Ansible/cron on host) | vmbr0 UP but no IP | Toggle interface |
| 2 | Proxmox Watchdog (K8s pod) | ICMP ping fails x3 | Power cycle via Kasa |
| Parameter | Value |
|---|---|
| Health check interval | 30 seconds |
| Failure threshold | 3 consecutive failures |
| Boot wait time | 600 seconds (10 minutes) |
| Max power cycles | 5 attempts, then stop and alert |
From watchdog-config ConfigMap:
| Key | Value |
|---|---|
| KASA_IP | 192.168.1.205 |
| CHECK_INTERVAL | 30 |
| FAILURE_THRESHOLD | 3 |
| BOOT_WAIT_TIME | 600 |
| MAX_POWER_CYCLES | 5 |
| METRICS_PORT | 8000 |
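Assuming the ConfigMap keys reach the container as environment variables (for example through `envFrom`, which this page does not confirm), the watchdog could load them along these lines. `WatchdogConfig` and `load_config` are hypothetical names, and the defaults simply mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WatchdogConfig:
    kasa_ip: str
    check_interval: int     # seconds between health checks
    failure_threshold: int  # consecutive failures before a power cycle
    boot_wait_time: int     # seconds
    max_power_cycles: int
    metrics_port: int


def load_config(env: Mapping[str, str] = os.environ) -> WatchdogConfig:
    """Build the config from environment variables, falling back to the documented values."""
    return WatchdogConfig(
        kasa_ip=env.get("KASA_IP", "192.168.1.205"),
        check_interval=int(env.get("CHECK_INTERVAL", "30")),
        failure_threshold=int(env.get("FAILURE_THRESHOLD", "3")),
        boot_wait_time=int(env.get("BOOT_WAIT_TIME", "600")),
        max_power_cycles=int(env.get("MAX_POWER_CYCLES", "5")),
        metrics_port=int(env.get("METRICS_PORT", "8000")),
    )
```

Passing the environment as a parameter keeps the loader easy to test: `load_config({"CHECK_INTERVAL": "10"})` overrides one value and leaves the rest at their defaults.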
| Secret | Keys | Purpose |
|---|---|---|
| proxmox-watchdog-kasa | KASA_USERNAME, KASA_PASSWORD | Kasa cloud account credentials for KLAP auth |
kubectl create secret generic proxmox-watchdog-kasa \
-n proxmox-watchdog \
--from-literal=KASA_USERNAME='<kasa-email>' \
--from-literal=KASA_PASSWORD='<kasa-password>'
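If the Secret is mounted into the pod as environment variables (an assumption, since the manifest is not shown here), the watchdog can fail fast at startup when either key is missing. The helper name `load_kasa_credentials` is hypothetical.

```python
import os
from typing import Mapping, Tuple


def load_kasa_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (username, password), raising if either key is missing or empty."""
    missing = [key for key in ("KASA_USERNAME", "KASA_PASSWORD") if not env.get(key)]
    if missing:
        # Report only the key names, never the secret values themselves.
        raise RuntimeError("missing Kasa credentials: " + ", ".join(missing))
    return env["KASA_USERNAME"], env["KASA_PASSWORD"]
```

Failing at startup makes a missing Secret visible as a crash loop right away, instead of surfacing only when the first power cycle is attempted.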
KLAP protocol: The watchdog uses SHIP 2.0 (port 80), not the legacy port 9999. The protocol changed when the outlet's firmware auto-updated.
| Alert | Condition | Severity |
|---|---|---|
| ProxmoxHostDown | host unreachable for 5 minutes | critical |
| ProxmoxMaxPowerCycles | 5 power cycles without recovery | critical |
| ProxmoxWatchdogDown | watchdog itself is down for 5 min | warning |
| KasaOutletHighPower | outlet > 150W for 5 min | warning |
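These alerts rely on the watchdog exporting per-host state on :8000/metrics. The sketch below renders such state in the Prometheus text exposition format using only the standard library. The metric name `proxmox_host_up` and the `host` label are assumptions; `power_cycles_total` is borrowed from the flowchart, and the names the real exporter uses may differ.

```python
from typing import Dict


def render_metrics(host_up: Dict[str, bool], power_cycles: Dict[str, int]) -> str:
    """Render per-host state in the Prometheus text exposition format."""
    lines = [
        "# HELP proxmox_host_up 1 if the last health check succeeded, else 0.",
        "# TYPE proxmox_host_up gauge",
    ]
    for host in sorted(host_up):
        lines.append(f'proxmox_host_up{{host="{host}"}} {int(host_up[host])}')
    lines += [
        "# HELP power_cycles_total Power cycles issued per host.",
        "# TYPE power_cycles_total counter",
    ]
    for host in sorted(power_cycles):
        lines.append(f'power_cycles_total{{host="{host}"}} {power_cycles[host]}')
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

An HTTP handler on the metrics port would return this string with content type `text/plain`. An alert such as ProxmoxHostDown could then be written against the gauge, and ProxmoxMaxPowerCycles against the counter.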
kubernetes/apps/proxmox-watchdog/
  proxmox-watchdog.yaml
  watchdog.py
  Dockerfile