Full observability for the k3s homelab: metrics (Prometheus), visualization (Grafana), log aggregation (Loki), and alerting (AlertManager). Deployed via the kube-prometheus-stack Helm chart in the monitoring namespace.
Deployed via Helm chart: kube-prometheus-stack
Namespace: monitoring
Values file: kubernetes/apps/monitoring/prometheus-helm-values.yaml
Install / upgrade:

```shell
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f kubernetes/apps/monitoring/prometheus-helm-values.yaml
```
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 30Gi NFS PVC (nfs-monitoring StorageClass) |
| Retention | 14 days |
| Scrape interval | 30s |
| URL (internal) | https://prometheus.k3s.internal.strommen.systems |
| Resource limits | 1 CPU / 4GiB RAM |
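The table above corresponds roughly to the following kube-prometheus-stack Helm values (a sketch using the chart's `prometheus.prometheusSpec` schema; only the settings from the table are shown, the real values file contains more):

```yaml
# Sketch of the relevant prometheus-helm-values.yaml settings
prometheus:
  prometheusSpec:
    retention: 14d
    scrapeInterval: 30s
    resources:
      limits:
        cpu: "1"
        memory: 4Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-monitoring
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
```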
| Job | Source | Method |
|---|---|---|
| node-exporter | All 7 k3s nodes (:9100) | DaemonSet ServiceMonitor |
| kube-state-metrics | kube-system (:8080) | ServiceMonitor |
| traefik | kube-system pods | Pod annotation relabel |
| pve-exporter | pve-exporter.monitoring.svc:9221 | Static config |
| proxmox-nodes | 192.168.20.105-108:9100 | Static config |
| arc-controller | arc-runner-system (:8443, HTTPS) | Static config |
| unifi-poller | monitoring ns (:9130) | ServiceMonitor (60s) |
| nas-exporter | media ns (:9355) | ServiceMonitor (60s) |
| seedbox-exporter | media ns (:9354) | ServiceMonitor (60s) |
| anthropic-cost-exporter | monitoring ns (:9091) | ServiceMonitor (300s) |
| openrouter-cost-exporter | monitoring ns (:9092) | ServiceMonitor (300s) |
| aws-cost-exporter | monitoring ns (:9090) | ServiceMonitor (300s) |
| github-exporter | monitoring ns (:9171) | ServiceMonitor (120s) |
| github-org-exporter | monitoring ns (:9172) | ServiceMonitor (300s) |
| application services | All namespaces with ServiceMonitor | ServiceMonitor CRD |
Application services with ServiceMonitors include: cardboard, home-assistant, proxmox-watchdog, github-exporter, pve-exporter, exportarr (Radarr/Sonarr), wireguard-exporter, velero, intel-gpu-exporter, dnd-backend, dnd-discord-bot.
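As a sketch of the pattern, a minimal ServiceMonitor for one of these services could look like the following (the `cardboard` names and labels are illustrative, not copied from the repo):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cardboard              # illustrative; actual manifest names may differ
  namespace: cardboard
  labels:
    release: kube-prometheus-stack  # must match the chart's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: cardboard
  endpoints:
    - port: metrics            # named port on the target Service
      interval: 30s
      path: /metrics
```

By default the chart only discovers ServiceMonitors carrying its release label; if ServiceMonitors in all namespaces are picked up without it, `serviceMonitorSelectorNilUsesHelmValues: false` is likely set in the values file.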
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | emptyDir (dashboards loaded from ConfigMaps via sidecar) |
| URL (internal) | https://grafana.k3s.internal.strommen.systems |
| Auth | Authentik OIDC (role mapped from groups: authentik-admins -> Admin, authentik-writers -> Editor) |
| Default home | home-hub.json (custom cluster summary) |
Bootstrap (create before first helm upgrade):

```shell
kubectl create secret generic grafana-secrets -n monitoring \
  --from-literal=GRAFANA_OIDC_CLIENT_ID=<authentik-client-id> \
  --from-literal=GRAFANA_OIDC_CLIENT_SECRET=<authentik-oidc-client-secret> \
  --from-literal=CLOUDWATCH_ACCESS_KEY=<iam-access-key-id> \
  --from-literal=CLOUDWATCH_SECRET_KEY=<iam-secret-access-key> \
  --from-literal=GF_SECURITY_ADMIN_PASSWORD="$(openssl rand -base64 32)"
```
Note: grafana-oidc-secret is replaced by grafana-secrets. Run helm upgrade after creating the secret. The previous CloudWatch IAM key was committed to git and has since been rotated.
OIDC client secret: Copy from Authentik Admin UI → Applications → Providers → grafana → Client secret. It is auto-generated by Authentik when the OAuth2/OIDC provider is created.
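The group-to-role mapping from the table is typically expressed as a JMESPath `role_attribute_path` in Grafana's `auth.generic_oauth` section. A sketch of the relevant Helm values (the Authentik endpoint URLs are omitted and must be filled in; `$__env{...}` pulls from the `grafana-secrets` env vars):

```yaml
grafana:
  envFromSecret: grafana-secrets
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Authentik
      client_id: $__env{GRAFANA_OIDC_CLIENT_ID}
      client_secret: $__env{GRAFANA_OIDC_CLIENT_SECRET}
      scopes: openid profile email
      # authentik-admins -> Admin, authentik-writers -> Editor, else Viewer
      role_attribute_path: >-
        contains(groups[*], 'authentik-admins') && 'Admin' ||
        contains(groups[*], 'authentik-writers') && 'Editor' || 'Viewer'
```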
Dashboards are provisioned from the following sources:
| Folder | Dashboard | Source |
|---|---|---|
| Infrastructure | etcd | grafana.com/3070 |
| Infrastructure | Loki Logs Explorer | grafana.com/13639 |
| Infrastructure | Longhorn | grafana.com/16888 |
| Infrastructure | Proxmox Hardware | grafana.com/11074 |
| Infrastructure | Proxmox PVE | grafana.com/10347 |
| Infrastructure | Traefik | grafana.com/17346 |
| Kubernetes | CoreDNS | grafana.com/15762 |
| Kubernetes | K8s API Server | grafana.com/15761 |
| Kubernetes | K8s Cluster | grafana.com/7249 |
| Kubernetes | K8s Namespace | grafana.com/15758 |
| Kubernetes | K8s Pods | grafana.com/6417 |
| Kubernetes | Node Exporter | grafana.com/1860 |
| AWS | Billing | grafana.com/139 |
| AWS | EC2 | grafana.com/617 |
| AWS | S3 | grafana.com/575 |
| AWS | ECR | grafana.com/16101 |
| AWS | Route53 | grafana.com/11084 |
| GitHub | ARC runners | grafana.com (custom) |
| Custom | Cluster Home Hub | ConfigMap: grafana-home-dashboard |
| Custom | OpenClaw / Claw Ops | ConfigMap: claw-auto-remediation-dashboard |
| Custom | Cost overview | ConfigMap: grafana-cost-dashboard |
| Custom | Living Room Display | ConfigMap: grafana-living-room-* (x4) |
| Custom | Media Stack | ConfigMap (GPU panels, stream count) |
| Custom | WireGuard peers | ConfigMap (peer stats from wg-exporter) |
| Custom | Authentik | ConfigMap (login events, user stats) |
| Custom | UniFi Network | ConfigMap: grafana-dashboard-unifi-network (auto-discovered via sidecar) |
| Custom | NAS Storage | ConfigMap: nas-exporter grafana-dashboard.yaml (UID: nas-storage) |
| Custom | Seedbox | ConfigMap: seedbox-exporter grafana-dashboard.yaml (UID: seedbox-monitoring) |
| Custom | Proxmox PVE | ConfigMap: grafana-dashboard-pve-cluster (UID: pve-cluster-overview) |
| Custom | GitHub Organization | grafana-dashboard-github.yaml (UID: github-org-overview) -- org-level GitHub metrics (repos, runs, rate limits, pending invites) |
| Custom | Proxmox Watchdog & Power | grafana-dashboard-proxmox-watchdog.yaml (UID: proxmox-watchdog) -- pve1-4 host status, Kasa smart outlet power/energy, power cycle attempts |
| Custom | AWS Cost & Resources | grafana-dashboard-aws.yaml (UID: aws-services-overview) -- AWS billing, EC2, S3, ECR, Route53 |
| Custom | Cluster Errors | grafana-dashboard-cluster-errors.yaml (UID: cluster-errors-health) -- cluster-wide error events and CrashLoopBackOff tracking |
| Custom | Cluster Home Hub | grafana-dashboard-home.yaml (UID: home-hub) -- overview home dashboard |
| Custom | K8s Resources | grafana-dashboard-k8s-resources.yaml (UID: k8s-cluster-resources) -- Kubernetes resource utilization |
| Custom | Proxmox Hardware | grafana-dashboard-proxmox-hardware.yaml (UID: proxmox-hardware-temps) -- Proxmox node hardware metrics (also available via grafana.com/11074) |
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 10Gi NFS PVC (nfs-monitoring StorageClass) |
| Retention | 72 hours |
| Endpoint | http://loki.monitoring.svc.cluster.local:3100 |
Loki also receives entries from two log-aggregation CronJobs; see Log Aggregation for details.
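Those CronJobs write via Loki's HTTP push API. A minimal sketch of building such a push payload (the `cloudtrail` job label and message are illustrative; the commented `curl` targets the endpoint from the table above):

```shell
# Build a Loki push payload: one stream with one [timestamp-ns, line] value
TS="$(date +%s%N)"
PAYLOAD=$(printf '{"streams":[{"stream":{"job":"cloudtrail"},"values":[["%s","example CloudTrail event"]]}]}' "$TS")
echo "$PAYLOAD"
# Deliver it from inside the cluster:
# curl -s -X POST http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```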
Example LogQL queries:

```logql
# All pods in a namespace
{namespace="media"}

# Errors across all namespaces
{namespace=~".+"} |= "error" | logfmt

# CloudTrail security events
{job="cloudtrail", event_type="AccessDenied"}

# Anthropic usage entries
{job="anthropic-usage"}

# k3s-agent-4 GPU node logs
{node="k3s-agent-4"}
```
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 1Gi NFS PVC |
| URL (internal) | https://alertmanager.k3s.internal.strommen.systems |
| Repeat interval (default) | 4h |
| Repeat interval (critical) | 1h |
| Watchdog interval | 24h (heartbeat alert) |
AlertManager config is managed in kubernetes/apps/monitoring/prometheus-helm-values.yaml.
Email routing: Alerts are delivered via the internal email-gateway service (email-gateway.email-gateway.svc.cluster.local:587) using smtp_from: k3s-alerts@strommen.systems -- no direct Gmail SMTP configuration is needed; the email-gateway handles outbound delivery.
Slack webhook: The webhook URL must be set manually in prometheus-helm-values.yaml before deploying (the <SLACK_WEBHOOK_URL> placeholder). Regenerate it at: Slack App → Incoming Webhooks → #k3s-alerts.
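Put together, the routing described above maps roughly onto an alertmanager config like this (a sketch built from the intervals in the table; receiver names and the email recipient are illustrative):

```yaml
route:
  receiver: email
  repeat_interval: 4h                      # default
  routes:
    - matchers: ['severity="critical"']
      receiver: slack
      repeat_interval: 1h                  # critical
    - matchers: ['alertname="Watchdog"']
      receiver: email
      repeat_interval: 24h                 # heartbeat
receivers:
  - name: email
    email_configs:
      - to: <recipient>
        from: k3s-alerts@strommen.systems
        smarthost: email-gateway.email-gateway.svc.cluster.local:587
  - name: slack
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#k3s-alerts'
```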
Bootstrap AlertManager secrets:

```shell
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager -n monitoring \
  --from-literal=alertmanager.yaml=<base64-encoded-config>
```
All custom exporters have ServiceMonitors and PrometheusRule alert groups.
Queries the Anthropic Admin API every hour. Exposes Prometheus metrics for monthly cost and 24h token usage by model.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9091 |
| Scrape interval | 300s (5 min) |
| Poll interval | 3600s (1 hour) |
| Image | python:3.12-slim (inline script from ConfigMap) |
Metrics: anthropic_cost_total_usd, anthropic_cost_daily_usd, anthropic_usage_input_tokens_total, anthropic_usage_output_tokens_total, anthropic_usage_cache_read_tokens_total, anthropic_usage_cache_create_tokens_total
Bootstrap:

```shell
kubectl create secret generic anthropic-admin-api-key -n monitoring \
  --from-literal=ANTHROPIC_ADMIN_API_KEY=<sk-ant-admin-...>
```
Alerts: AnthropicMonthlyCostHigh (MTD > $100, warning), AnthropicCostExporterDown (10m)
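The first of those alerts could be expressed as a PrometheusRule along these lines (a sketch; the `for` durations and annotation text are assumptions, and the real rule in the repo may differ):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: anthropic-cost-alerts
  namespace: monitoring
spec:
  groups:
    - name: anthropic-cost
      rules:
        - alert: AnthropicMonthlyCostHigh
          expr: anthropic_cost_total_usd > 100   # MTD spend threshold
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: Anthropic month-to-date spend is above $100
        - alert: AnthropicCostExporterDown
          expr: up{job="anthropic-cost-exporter"} == 0
          for: 10m
          labels:
            severity: warning
```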
Queries the OpenRouter Management API every hour. Exposes metrics for credit balance, per-model spend, and per-key usage.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9092 |
| Scrape interval | 300s (5 min) |
| Poll interval | 3600s (1 hour) |
| Image | python:3.12-slim (inline script from ConfigMap) |
Metrics: openrouter_cost_total_usd, openrouter_credits_remaining_usd, openrouter_cost_daily_usd, openrouter_requests_daily_total, openrouter_key_usage_usd, openrouter_key_limit_remaining_usd
Bootstrap:

```shell
kubectl create secret generic openrouter-management-key -n monitoring \
  --from-literal=OPENROUTER_MANAGEMENT_KEY=<sk-or-v1-...>
```
Alerts: OpenRouterCreditsLow (< $10 remaining, warning), OpenRouterDailySpendHigh (daily > $20, warning), OpenRouterCostExporterDown (10m)
Queries the AWS Cost Explorer API (ce:GetCostAndUsage) every 6 hours. Exposes billing data as Prometheus gauges for month-to-date totals, per-service breakdown, and 14-day daily history.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9090 |
| Scrape interval | 300s (5 min) |
| Poll interval | 6 hours (internal collection cycle) |
| Image | python:3.12-slim (inline script from ConfigMap) |
| API cost | ~$0.04/day at 6h interval ($0.01/request, 4 polls/day) |
Metrics: aws_cost_total_usd (MTD blended total), aws_cost_by_service_usd{service} (MTD per service), aws_cost_daily_usd{date} (last 14 days), aws_cost_daily_service_usd{date,service} (daily per service), plus aws_cost_last_scrape_timestamp, aws_cost_scrape_duration_seconds, aws_cost_scrape_errors_total
IAM policy required -- ce:GetCostAndUsage only:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ce:GetCostAndUsage"],
      "Resource": "*"
    }
  ]
}
```
Bootstrap:

```shell
kubectl create secret generic aws-cost-exporter-credentials -n monitoring \
  --from-literal=AWS_ACCESS_KEY_ID=<key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret>
```
Alerts: AWSMonthlyCostHigh (MTD > $120, warning), AWSDailyCostSpike (delta > $10 in 24h), AWSCostExporterDown, AWSCostDataStale (> 12h since last scrape)
Exposes Proxmox VE metrics (VM CPU/RAM/disk, node status, storage pools) to Prometheus using the multi-target exporter pattern -- Prometheus passes each PVE host IP as a target query param.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Image | prompve/prometheus-pve-exporter:3.4.5 |
| Port | 9221 |
| Metrics path | /pve (non-standard -- not /metrics) |
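In the multi-target pattern, the scrape config rewrites each PVE host into a ?target= query parameter while pointing the scrape address at the exporter service. A sketch of what the resulting static config looks like (host IPs are assumed to match the Proxmox nodes listed in the scrape-jobs table):

```yaml
- job_name: pve-exporter
  metrics_path: /pve                       # the exporter's non-standard path
  static_configs:
    - targets: [192.168.20.105, 192.168.20.106, 192.168.20.107, 192.168.20.108]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target         # becomes ?target=<pve-host>
    - source_labels: [__param_target]
      target_label: instance               # keep the PVE host as the instance label
    - target_label: __address__
      replacement: pve-exporter.monitoring.svc:9221   # actual scrape endpoint
```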
Bootstrap -- the secret must contain a pve.yml config file:

```shell
kubectl create secret generic pve-exporter-credentials -n monitoring \
  --from-file=pve.yml=pve.yml
```

pve.yml contents:

```yaml
default:
  user: root@pam
  password: <YOUR_PVE_PASSWORD>
  verify_ssl: false
```
Note: The PVE password is the same as TF_VAR_proxmox_password / PVE_PASS used in Terraform. The secret key must be named pve.yml -- the exporter reads it with --config.file=/etc/pve-exporter/pve.yml.
Grafana dashboards: ConfigMap grafana-dashboard-pve-cluster (UID: pve-cluster-overview) + grafana.com/10347 via Helm values.
Alerts: PveNodeDown, PveNodeHighCpu (>85% 10m), PveNodeHighSwap (>50% 10m), PveNodeStorageFull (>85% 5m), PveVmDown (non-template VM stopped for 5m), PveExporterDown
Exposes WireGuard peer stats (handshake age, bytes transferred) from the wg1 interface on k3s-server-1.
Runs with hostNetwork on k3s-server-1 (nodeSelector: kubernetes.io/hostname=k3s-server-1).
Port: 9586
See WireGuard Mesh for peer topology.
See also: UniFi Poller -- full deployment guide, config reference, and dashboard details.
Polls the UDM Pro at 192.168.1.1 every 120 seconds for device, client, site, WAN, and DPI metrics.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Image | ghcr.io/unpoller/unpoller:v2.33.0 |
| Port | 9130 |
| Scrape interval | 60s (ServiceMonitor) |
| Poll interval | 120s (reduced from 30s -- prevents 429 auth failures on UDM Pro) |
Key metrics: unpoller_device_uptime_seconds, unpoller_device_cpu_utilization_ratio, unpoller_device_memory_utilization_ratio, unpoller_device_temperature_celsius, unpoller_device_wan_*, unpoller_client_*
Alerts: UnifiDeviceOffline (critical), UnifiDeviceHighCpu (>85%), UnifiDeviceHighMemory (>90%), UnifiDeviceHighTemp (>75C), UnifiWanLatencyHigh (>100ms), UnifiPollerDown, UnifiControllerUnreachable
Bootstrap:

```shell
kubectl create secret generic unifi-poller-credentials -n monitoring \
  --from-literal=username=<unifi-readonly-user> \
  --from-literal=password=<unifi-readonly-password>
```
Note: Use a read-only local UniFi user for polling. Do not use the primary admin account -- if auth fails repeatedly the UDM Pro rate-limits all login attempts.
Grafana dashboard: auto-loaded via ConfigMap grafana-dashboard-unifi-network (sidecar label grafana_dashboard: "1").
See also: NAS Exporter -- full deployment guide, config reference, and dashboard details.
Custom Python exporter deployed in the media namespace. Monitors the Ugreen DXP4800 NAS (192.168.30.10) via NFS mount and TCP probes -- no NAS-side configuration required.
| Property | Value |
|---|---|
| Namespace | media |
| Image | python:3.12-slim (runs inline script from ConfigMap) |
| Port | 9355 |
| Scrape interval | 60s |
| NFS mount | /volume1/media (ReadOnlyMany, nfsvers=3) |
| UID | 10010 (svc-jellyfin -- read-only access) |
Key metrics: nas_up, nas_nfs_mount_up, nas_volume_total_bytes, nas_volume_used_bytes, nas_volume_free_bytes, nas_volume_usage_ratio, nas_nfs_latency_seconds, nas_media_items_total{media_type}, nas_downloads_pending_items{category}
Alerts:
- NASDown (TCP port 2049 unreachable, critical, 5m)
- NASNFSMountDown (NFS mount inaccessible, critical, 5m)
- NASVolumeCritical (>95% full, critical, 5m)
- NASVolumeWarning (>85% full, warning, 30m)
- NASHighNFSLatency (>2s listdir, warning, 10m)
- NASDownloadsBacklog (>10 items in downloads staging for 6h, warning)
- NASExporterDown (warning, 5m)

See also: Seedbox Exporter -- full deployment guide, config reference, and dashboard details.
Custom Python exporter deployed in the media namespace. Connects to the RapidSeedbox (45.128.27.65) rTorrent instance via the ruTorrent HTTPRPC XML-RPC plugin.
| Property | Value |
|---|---|
| Namespace | media |
| Image | python:3.12-slim (runs inline script from ConfigMap) |
| Port | 9354 |
| Scrape interval | 60s |
| rTorrent endpoint | https://45.128.27.65/rutorrent/plugins/httprpc/action.php |
| SFTP check | 45.128.27.65:2222 (also checked by this exporter) |
Key metrics: seedbox_up, seedbox_sftp_up, seedbox_torrents_total{state}, seedbox_active_torrents, seedbox_download_speed_bytes, seedbox_upload_speed_bytes, seedbox_free_disk_bytes, seedbox_global_ratio
Torrent states: Downloading, Seeding, Stopped, Hashing, Error
Alerts:
- SeedboxDown (rTorrent unreachable, warning, 10m)
- SeedboxSFTPDown (SFTP port closed -- blocks rclone sync, critical, 10m)
- SeedboxDiskLow (<5GB free, critical, 5m)
- SeedboxDiskWarning (<20GB free, warning, 30m)
- SeedboxTorrentErrors (>0 torrents in Error state, warning, 30m)
- SeedboxIdle (0 active torrents for 24h, info)
- SeedboxExporterDown (warning, 5m)

Bootstrap:

```shell
kubectl create secret generic seedbox-exporter-credentials -n media \
  --from-literal=XMLRPC_URL='https://45.128.27.65/rutorrent/plugins/httprpc/action.php' \
  --from-literal=HTTP_USER='<user>' \
  --from-literal=HTTP_PASSWORD='<seedbox-password>' \
  --from-literal=SFTP_HOST='45.128.27.65' \
  --from-literal=SFTP_PORT='2222'
```
Note: The seedbox-sftp secret (used by rclone-sync) is separate from seedbox-exporter-credentials. The exporter uses HTTP basic auth against the ruTorrent reverse proxy.
Two exporters cover GitHub repo and org-level metrics for the zolty-mat org. Both share the same PAT secret (github-exporter-token).
Repo-level metrics for the zolty-mat org: stars, forks, open issues, releases, and GitHub API rate limit remaining.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9171 |
| Scrape interval | 120s (deliberately slow -- GitHub API rate limits) |
| Image | githubexporter/github-exporter:latest |
| Manifest | kubernetes/core/github-exporter.yaml |
Credentials: Secret github-exporter-token, key github_token -- PAT with repo + read:org scopes.
Bootstrap:

```shell
kubectl create secret generic github-exporter-token -n monitoring \
  --from-literal=github_token=<ghp_...>
```
Key metrics: per-repo stars, forks, open issues, releases; github_rate_limit_remaining
Org membership metrics for the zolty-mat org: member counts by role, pending invitations, and team sizes.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9172 |
| Scrape interval | 300s (org membership data changes infrequently) |
| Image | harbor.k3s.internal.strommen.systems/production/github-org-exporter:latest |
| Manifest | kubernetes/core/github-org-exporter.yaml |
Credentials: Reuses github-exporter-token secret (same PAT -- read:org scope is sufficient).
Key metrics: org member counts by role, pending invitation count, team member counts
Alert rules for GitHub runner availability, exporter health, API rate limits, and org hygiene. Manifest: kubernetes/core/github-alerts.yaml.
| Alert | Severity | Condition |
|---|---|---|
| GitHubRunnerDown | warning | Less than 3 ARC runner pods ready in arc-runner-system for 5m |
| GitHubExporterDown | warning | up{job="github-exporter"} == 0 for 5m |
| GitHubAPIRateLimitLow | warning | github_rate_remaining{resource="core"} < 100 for 2m |
| GitHubPendingOrgInvitation | info | github_org_pending_invitations_total > 0 for 5m -- check zolty-mat org invitations |
Alert rules are spread across PrometheusRule CRDs. Key alert groups:
| Group | Alerts | Manifest |
|---|---|---|
| node-alerts | NodeHighCPU (>85% 10m), NodeHighMemory (>90% 10m), NodeDiskFull (>85% 5m) | |
| pod-alerts | PodCrashLooping (>3 restarts/15m), PodNotReady (10m) | |
| longhorn-alerts | LonghornVolumeHealthy (robustness != 1, 5m) | |
| github-runner-alerts | GitHubRunnerDown (<3 ready runners 5m), GitHubExporterDown, GitHubAPIRateLimitLow (<100), GitHubPendingOrgInvitation | kubernetes/core/github-alerts.yaml |
| backup-alerts | EtcdBackupFailed, EtcdBackupStale (>25h), PostgreSQLBackupFailed, PostgreSQLBackupStale (>25h per-ns), LonghornRecurringJobFailed, LonghornSnapshotCount (<7 snaps, info), LonghornBackupTargetUnreachable (critical), BackupS3BucketQuotaHigh (>80GB), BackupCostHigh (>10k API req/mo, info), BackupJobOOMKilled, BackupJobTimeout (>30m), BackupRestorationTested (no restore in 30d, info), VeleroBackupFailed, VeleroBackupSizeHigh (>400GB) | kubernetes/core/backup-alerts.yaml |
| anthropic-cost | AnthropicMonthlyCostHigh (>$100 MTD), AnthropicCostExporterDown | |
| openrouter-cost | OpenRouterCreditsLow (<$10), OpenRouterDailySpendHigh (>$20/day), OpenRouterCostExporterDown | |
| aws-cost | AWSMonthlyCostHigh (>$120 MTD), AWSDailyCostSpike (>$10/24h), AWSCostExporterDown, AWSCostDataStale (>12h) | kubernetes/apps/aws-cost-exporter/aws-cost-exporter.yaml |
| internal.honeypot | InternalServiceProbed (honeypot hit -- fires immediately, critical) | kubernetes/apps/kube-utils/kube-utils.yaml |
| ham.fitness | HamHRVUnbalancedStreak (>=3d), HamWeightSyncStale (>48h), HamPainLevelElevated (>5), HamWeightGoalOffTrack (>190lbs), HamWeightGoalOnTrack (<=185lbs, info) | kubernetes/apps/monitoring/ham-alert-rules.yaml |
| media-stack-alerts | JellyfinDown (critical), JellyseerrDown, PlexDown (critical), GPUTranscodeOverload (>90% 15m), MediaNFSUnavailable (critical), RcloneSyncFailing, JellyfinHighMemory (>3.5GB), RadarrDown, SonarrDown, ProwlarrDown, BazarrDown, DownloadQueueStuck (6h), MediaControllerDown, AutoManagedNearCapacity (>4.5TB), MediaControllerPruneFailed, MediaControllerDiscoveryDry (72h info), MediaControllerAPIErrors | kubernetes/apps/media/alerts.yaml |
| jellyfin-ha | JellyfinReplicaCountLow (warning, <2 replicas ready 2m), JellyfinAllReplicasDown (critical, 0 replicas 1m), JellyfinPostgresDown (critical, 0 postgres replicas 2m), JellyfinRedisDown (warning, cache unavailable 60s), JellyfinTranscodeSessionOrphaned (warning, fewer pods than desired + Redis up 2m) | kubernetes/apps/media/jellyfin-ha-alerts.yaml |
| unifi-device-alerts | UnifiDeviceOffline, UnifiDeviceHighCpu, UnifiDeviceHighMemory, UnifiDeviceHighTemp | |
| unifi-wan-alerts | UnifiWanLatencyHigh (>100ms) | |
| unifi-poller-health | UnifiPollerDown, UnifiControllerUnreachable | |
| nas-storage | NASDown, NASNFSMountDown, NASVolumeCritical, NASVolumeWarning, NASHighNFSLatency, NASDownloadsBacklog, NASExporterDown | |
| seedbox | SeedboxDown, SeedboxSFTPDown, SeedboxDiskLow, SeedboxDiskWarning, SeedboxTorrentErrors, SeedboxIdle, SeedboxExporterDown | |
| pve-alerts | PveNodeDown, PveNodeHighCpu, PveNodeHighSwap, PveNodeStorageFull, PveVmDown, PveExporterDown | |
| kubernetes.tuned | KubePodNotHealthy (15m, critical), KubePersistentvolumeclaimFillingUp (>85% + growing), plus cert expiry and other kube-prometheus-stack threshold overrides | kubernetes/apps/monitoring/alert-rules-tuned.yaml |
| harbor.rules | HarborDown (5m, critical), HarborComponentDown (5m, critical), HarborScanQueueBacklog (>20 tasks 15m, warning), HarborHighErrorRate (>5% 5xx for 10m, warning) | kubernetes/apps/harbor/prometheusrule.yaml |
The monitoring stack uses NFS-backed storage to allow Prometheus to run on any worker node without Longhorn:
| StorageClass | NFS Path | Used By |
|---|---|---|
| nfs-monitoring | /volume1/monitoring (verify in UGOS Pro -> File Service -> NFS) | Prometheus PVC (30Gi), Loki PVC (10Gi), AlertManager PVC (1Gi) |
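A claim against this class follows the usual PVC pattern (a sketch; the chart templates the real claims itself via its storageSpec/persistence values, so this manifest is illustrative only):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data          # illustrative name
  namespace: monitoring
spec:
  storageClassName: nfs-monitoring   # provisions under /volume1/monitoring on the NAS
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 30Gi
```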