Full observability for the k3s homelab: metrics (Prometheus), visualization (Grafana), log aggregation (Loki), and alerting (AlertManager). Deployed via the kube-prometheus-stack Helm chart in the monitoring namespace.
Deployed via Helm chart: kube-prometheus-stack
Namespace: monitoring
Values file: kubernetes/apps/monitoring/prometheus-helm-values.yaml
Install / upgrade:

```shell
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f kubernetes/apps/monitoring/prometheus-helm-values.yaml
```
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 30Gi NFS PVC (nfs-monitoring StorageClass) |
| Retention | 14 days |
| Scrape interval | 30s |
| URL (internal) | https://prometheus.k3s.internal.strommen.systems |
| Resource limits | 1 CPU / 4GiB RAM |
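The table above corresponds roughly to the following kube-prometheus-stack Helm values (a sketch using the chart's `prometheus.prometheusSpec` schema; only the settings from the table are shown, the real values file contains more):

```yaml
# Sketch of the relevant prometheus-helm-values.yaml settings
prometheus:
  prometheusSpec:
    retention: 14d
    scrapeInterval: 30s
    resources:
      limits:
        cpu: "1"
        memory: 4Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-monitoring
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
```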
| Job | Source | Method |
|---|---|---|
| node-exporter | All 7 k3s nodes (:9100) | DaemonSet ServiceMonitor |
| kube-state-metrics | kube-system (:8080) | ServiceMonitor |
| traefik | kube-system pods | Pod annotation relabel |
| pve-exporter | pve-exporter.monitoring.svc:9221 | Static config |
| proxmox-nodes | 192.168.20.105-108:9100 | Static config |
| arc-controller | arc-runner-system (:8443, HTTPS) | Static config |
| unifi-poller | monitoring ns (:9130) | ServiceMonitor (60s) |
| nas-exporter | media ns (:9355) | ServiceMonitor (60s) |
| seedbox-exporter | media ns (:9354) | ServiceMonitor (60s) |
| anthropic-cost-exporter | monitoring ns (:9091) | ServiceMonitor (300s) |
| openrouter-cost-exporter | monitoring ns (:9092) | ServiceMonitor (300s) |
| aws-cost-exporter | monitoring ns (:9090) | ServiceMonitor (300s) |
| github-exporter | monitoring ns (:9171) | ServiceMonitor (120s) |
| github-org-exporter | monitoring ns (:9172) | ServiceMonitor (300s) |
| application services | All namespaces with ServiceMonitor | ServiceMonitor CRD |
Application services with ServiceMonitors include: cardboard, home-assistant, proxmox-watchdog, github-exporter, pve-exporter, exportarr (Radarr/Sonarr), wireguard-exporter, velero, intel-gpu-exporter, dnd-backend, dnd-discord-bot.
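As a sketch of the pattern, a minimal ServiceMonitor for one of these services could look like the following (the `cardboard` names and labels are illustrative, not copied from the repo):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cardboard              # illustrative; actual manifest names may differ
  namespace: cardboard
  labels:
    release: kube-prometheus-stack  # must match the chart's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: cardboard
  endpoints:
    - port: metrics            # named port on the target Service
      interval: 30s
      path: /metrics
```

By default the chart only discovers ServiceMonitors carrying its release label; if ServiceMonitors in all namespaces are picked up without it, `serviceMonitorSelectorNilUsesHelmValues: false` is likely set in the values file.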
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | emptyDir (dashboards loaded from ConfigMaps via sidecar) |
| URL (internal) | https://grafana.k3s.internal.strommen.systems |
| Auth | Authentik OIDC (role mapped from groups: authentik-admins -> Admin, authentik-writers -> Editor) |
| Default home | home-hub.json (custom cluster summary) |
Bootstrap (create before first helm upgrade):

```shell
kubectl create secret generic grafana-secrets -n monitoring \
  --from-literal=GRAFANA_OIDC_CLIENT_ID=<authentik-client-id> \
  --from-literal=GRAFANA_OIDC_CLIENT_SECRET=<authentik-oidc-client-secret> \
  --from-literal=CLOUDWATCH_ACCESS_KEY=<iam-access-key-id> \
  --from-literal=CLOUDWATCH_SECRET_KEY=<iam-secret-access-key> \
  --from-literal=GF_SECURITY_ADMIN_PASSWORD="$(openssl rand -base64 32)"
```
Note: grafana-oidc-secret is replaced by grafana-secrets. Run helm upgrade after creating the secret. The previous CloudWatch IAM key was committed to git and has since been rotated.
OIDC client secret: Copy from Authentik Admin UI → Applications → Providers → grafana → Client secret. It is auto-generated by Authentik when the OAuth2/OIDC provider is created.
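The group-to-role mapping from the table is typically expressed as a JMESPath `role_attribute_path` in Grafana's `auth.generic_oauth` section. A sketch of the relevant Helm values (the Authentik endpoint URLs are omitted and must be filled in; `$__env{...}` pulls from the `grafana-secrets` env vars):

```yaml
grafana:
  envFromSecret: grafana-secrets
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Authentik
      client_id: $__env{GRAFANA_OIDC_CLIENT_ID}
      client_secret: $__env{GRAFANA_OIDC_CLIENT_SECRET}
      scopes: openid profile email
      # authentik-admins -> Admin, authentik-writers -> Editor, else Viewer
      role_attribute_path: >-
        contains(groups[*], 'authentik-admins') && 'Admin' ||
        contains(groups[*], 'authentik-writers') && 'Editor' || 'Viewer'
```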
Dashboards are provisioned from the following sources:
| Folder | Dashboard | Source |
|---|---|---|
| Infrastructure | etcd | grafana.com/3070 |
| Infrastructure | Loki Logs Explorer | grafana.com/13639 |
| Infrastructure | Longhorn | grafana.com/16888 |
| Infrastructure | Proxmox Hardware | grafana.com/11074 |
| Infrastructure | Proxmox PVE | grafana.com/10347 |
| Infrastructure | Traefik | grafana.com/17346 |
| Kubernetes | CoreDNS | grafana.com/15762 |
| Kubernetes | K8s API Server | grafana.com/15761 |
| Kubernetes | K8s Cluster | grafana.com/7249 |
| Kubernetes | K8s Namespace | grafana.com/15758 |
| Kubernetes | K8s Pods | grafana.com/6417 |
| Kubernetes | Node Exporter | grafana.com/1860 |
| AWS | Billing | grafana.com/139 |
| AWS | EC2 | grafana.com/617 |
| AWS | S3 | grafana.com/575 |
| AWS | ECR | grafana.com/16101 |
| AWS | Route53 | grafana.com/11084 |
| GitHub | ARC runners | grafana.com (custom) |
| Custom | Cluster Home Hub | ConfigMap: grafana-home-dashboard |
| Custom | OpenClaw / Claw Ops | ConfigMap: claw-auto-remediation-dashboard |
| Custom | Cost overview | ConfigMap: grafana-cost-dashboard |
| Custom | Living Room Display | ConfigMap: grafana-living-room-* (x4) |
| Custom | Media Stack | ConfigMap (GPU panels, stream count) |
| Custom | WireGuard peers | ConfigMap (peer stats from wg-exporter) |
| Custom | Authentik | ConfigMap (login events, user stats) |
| Custom | UniFi Network | ConfigMap: grafana-dashboard-unifi-network (auto-discovered via sidecar) |
| Custom | NAS Storage | ConfigMap: nas-exporter grafana-dashboard.yaml (UID: nas-storage) |
| Custom | Seedbox | ConfigMap: seedbox-exporter grafana-dashboard.yaml (UID: seedbox-monitoring) |
| Custom | Proxmox PVE | ConfigMap: grafana-dashboard-pve-cluster (UID: pve-cluster-overview) |
| Custom | GitHub Organization | grafana-dashboard-github.yaml (UID: github-org-overview) -- org-level GitHub metrics (repos, runs, rate limits, pending invites) |
| Custom | Proxmox Watchdog & Power | grafana-dashboard-proxmox-watchdog.yaml (UID: proxmox-watchdog) -- pve1-4 host status, Kasa smart outlet power/energy, power cycle attempts |
| Custom | AWS Cost & Resources | grafana-dashboard-aws.yaml (UID: aws-services-overview) -- AWS billing, EC2, S3, ECR, Route53 |
| Custom | Cluster Errors | grafana-dashboard-cluster-errors.yaml (UID: cluster-errors-health) -- cluster-wide error events and CrashLoopBackOff tracking |
| Custom | Cluster Home Hub | grafana-dashboard-home.yaml (UID: home-hub) -- overview home dashboard |
| Custom | K8s Resources | grafana-dashboard-k8s-resources.yaml (UID: k8s-cluster-resources) -- Kubernetes resource utilization |
| Custom | Proxmox Hardware | grafana-dashboard-proxmox-hardware.yaml (UID: proxmox-hardware-temps) -- Proxmox node hardware metrics (also available via grafana.com/11074) |
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 10Gi NFS PVC (nfs-monitoring StorageClass) |
| Retention | 72 hours |
| Endpoint | http://loki.monitoring.svc.cluster.local:3100 |
Loki also receives entries from two log-aggregation CronJobs; see Log Aggregation for details.
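Those CronJobs write via Loki's HTTP push API. A minimal sketch of building such a push payload (the `cloudtrail` job label and message are illustrative; the commented `curl` targets the endpoint from the table above):

```shell
# Build a Loki push payload: one stream with one [timestamp-ns, line] value
TS="$(date +%s%N)"
PAYLOAD=$(printf '{"streams":[{"stream":{"job":"cloudtrail"},"values":[["%s","example CloudTrail event"]]}]}' "$TS")
echo "$PAYLOAD"
# Deliver it from inside the cluster:
# curl -s -X POST http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```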
Example LogQL queries:

```logql
# All pods in a namespace
{namespace="media"}

# Errors across all namespaces
{namespace=~".+"} |= "error" | logfmt

# CloudTrail security events
{job="cloudtrail", event_type="AccessDenied"}

# Anthropic usage entries
{job="anthropic-usage"}

# k3s-agent-4 GPU node logs
{node="k3s-agent-4"}
```
| Property | Value |
|---|---|
| Namespace | monitoring |
| Storage | 1Gi NFS PVC |
| URL (internal) | https://alertmanager.k3s.internal.strommen.systems |
| Repeat interval (default) | 4h |
| Repeat interval (critical) | 1h |
| Watchdog interval | 24h (heartbeat alert) |
AlertManager config is managed in kubernetes/apps/monitoring/prometheus-helm-values.yaml.
Email routing: Alerts are delivered via the internal email-gateway service (email-gateway.email-gateway.svc.cluster.local:587) using smtp_from: k3s-alerts@strommen.systems -- no direct Gmail SMTP configuration is needed; the email-gateway handles outbound delivery.
Slack webhook: The webhook URL must be set manually in prometheus-helm-values.yaml before deploying (the <SLACK_WEBHOOK_URL> placeholder). Regenerate it at: Slack App → Incoming Webhooks → #k3s-alerts.
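Put together, the routing described above maps roughly onto an alertmanager config like this (a sketch built from the intervals in the table; receiver names and the email recipient are illustrative):

```yaml
route:
  receiver: email
  repeat_interval: 4h                      # default
  routes:
    - matchers: ['severity="critical"']
      receiver: slack
      repeat_interval: 1h                  # critical
    - matchers: ['alertname="Watchdog"']
      receiver: email
      repeat_interval: 24h                 # heartbeat
receivers:
  - name: email
    email_configs:
      - to: <recipient>
        from: k3s-alerts@strommen.systems
        smarthost: email-gateway.email-gateway.svc.cluster.local:587
  - name: slack
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#k3s-alerts'
```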
Bootstrap AlertManager secrets:

```shell
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager -n monitoring \
  --from-literal=alertmanager.yaml=<base64-encoded-config>
```
All custom exporters have ServiceMonitors and PrometheusRule alert groups.
Queries the Anthropic Admin API every hour. Exposes Prometheus metrics for monthly cost and 24h token usage by model.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9091 |
| Scrape interval | 300s (5 min) |
| Poll interval | 3600s (1 hour) |
| Image | python:3.12-slim (inline script from ConfigMap) |
Metrics: anthropic_cost_total_usd, anthropic_cost_daily_usd, anthropic_usage_input_tokens_total, anthropic_usage_output_tokens_total, anthropic_usage_cache_read_tokens_total, anthropic_usage_cache_create_tokens_total
Bootstrap:

```shell
kubectl create secret generic anthropic-admin-api-key -n monitoring \
  --from-literal=ANTHROPIC_ADMIN_API_KEY=<sk-ant-admin-...>
```
Alerts: AnthropicMonthlyCostHigh (MTD > $100, warning), AnthropicCostExporterDown (10m)
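The first of those alerts could be expressed as a PrometheusRule along these lines (a sketch; the `for` durations and annotation text are assumptions, and the real rule in the repo may differ):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: anthropic-cost-alerts
  namespace: monitoring
spec:
  groups:
    - name: anthropic-cost
      rules:
        - alert: AnthropicMonthlyCostHigh
          expr: anthropic_cost_total_usd > 100   # MTD spend threshold
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: Anthropic month-to-date spend is above $100
        - alert: AnthropicCostExporterDown
          expr: up{job="anthropic-cost-exporter"} == 0
          for: 10m
          labels:
            severity: warning
```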
Queries the OpenRouter Management API every hour. Exposes metrics for credit balance, per-model spend, and per-key usage.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9092 |
| Scrape interval | 300s (5 min) |
| Poll interval | 3600s (1 hour) |
| Image | python:3.12-slim (inline script from ConfigMap) |
Metrics: openrouter_cost_total_usd, openrouter_credits_remaining_usd, openrouter_cost_daily_usd, openrouter_requests_daily_total, openrouter_key_usage_usd, openrouter_key_limit_remaining_usd
Bootstrap:

```shell
kubectl create secret generic openrouter-management-key -n monitoring \
  --from-literal=OPENROUTER_MANAGEMENT_KEY=<sk-or-v1-...>
```
Alerts: OpenRouterCreditsLow (< $10 remaining, warning), OpenRouterDailySpendHigh (daily > $20, warning), OpenRouterCostExporterDown (10m)
Queries the AWS Cost Explorer API (ce:GetCostAndUsage) every 6 hours. Exposes billing data as Prometheus gauges for month-to-date totals, per-service breakdown, and 14-day daily history.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9090 |
| Scrape interval | 300s (5 min) |
| Poll interval | 6 hours (internal collection cycle) |
| Image | python:3.12-slim (inline script from ConfigMap) |
| API cost | ~$0.04/day at 6h interval ($0.01/request, 4 polls/day) |
Metrics: aws_cost_total_usd (MTD blended total), aws_cost_by_service_usd{service} (MTD per service), aws_cost_daily_usd{date} (last 14 days), aws_cost_daily_service_usd{date,service} (daily per service), plus aws_cost_last_scrape_timestamp, aws_cost_scrape_duration_seconds, aws_cost_scrape_errors_total
IAM policy required -- ce:GetCostAndUsage only:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ce:GetCostAndUsage"],
      "Resource": "*"
    }
  ]
}
```
Bootstrap:

```shell
kubectl create secret generic aws-cost-exporter-credentials -n monitoring \
  --from-literal=AWS_ACCESS_KEY_ID=<key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret>
```
Alerts: AWSMonthlyCostHigh (MTD > $120, warning), AWSDailyCostSpike (delta > $10 in 24h), AWSCostExporterDown, AWSCostDataStale (> 12h since last scrape)
Exposes Proxmox VE metrics (VM CPU/RAM/disk, node status, storage pools) to Prometheus using the multi-target exporter pattern -- Prometheus passes each PVE host IP as a target query param.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Image | prompve/prometheus-pve-exporter:3.4.5 |
| Port | 9221 |
| Metrics path | /pve (non-standard -- not /metrics) |
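In the multi-target pattern, the scrape config rewrites each PVE host into a ?target= query parameter while pointing the scrape address at the exporter service. A sketch of what the resulting static config looks like (host IPs are assumed to match the Proxmox nodes listed in the scrape-jobs table):

```yaml
- job_name: pve-exporter
  metrics_path: /pve                       # the exporter's non-standard path
  static_configs:
    - targets: [192.168.20.105, 192.168.20.106, 192.168.20.107, 192.168.20.108]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target         # becomes ?target=<pve-host>
    - source_labels: [__param_target]
      target_label: instance               # keep the PVE host as the instance label
    - target_label: __address__
      replacement: pve-exporter.monitoring.svc:9221   # actual scrape endpoint
```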
Bootstrap -- the secret must contain a pve.yml config file:

```shell
kubectl create secret generic pve-exporter-credentials -n monitoring \
  --from-file=pve.yml=pve.yml
```

pve.yml contents:

```yaml
default:
  user: root@pam
  password: <YOUR_PVE_PASSWORD>
  verify_ssl: false
```
Note: The PVE password is the same as TF_VAR_proxmox_password / PVE_PASS used in Terraform. The secret key must be named pve.yml -- the exporter reads it with --config.file=/etc/pve-exporter/pve.yml.
Grafana dashboards: ConfigMap grafana-dashboard-pve-cluster (UID: pve-cluster-overview) + grafana.com/10347 via Helm values.
Alerts: PveNodeDown, PveNodeHighCpu (>85% 10m), PveNodeHighSwap (>50% 10m), PveNodeStorageFull (>85% 5m), PveVmDown (non-template VM stopped for 5m), PveExporterDown
Exposes WireGuard peer stats (handshake age, bytes transferred) from the wg1 interface on k3s-server-1.
Runs with hostNetwork on k3s-server-1 (nodeSelector: kubernetes.io/hostname=k3s-server-1).
Port: 9586
See WireGuard Mesh for peer topology.
See also: UniFi Poller -- full deployment guide, config reference, and dashboard details.
Polls the UDM Pro at 192.168.1.1 every 120 seconds for device, client, site, WAN, and DPI metrics.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Image | ghcr.io/unpoller/unpoller:v2.33.0 |
| Port | 9130 |
| Scrape interval | 60s (ServiceMonitor) |
| Poll interval | 120s (reduced from 30s -- prevents 429 auth failures on UDM Pro) |
Key metrics: unpoller_device_uptime_seconds, unpoller_device_cpu_utilization_ratio, unpoller_device_memory_utilization_ratio, unpoller_device_temperature_celsius, unpoller_device_wan_*, unpoller_client_*
Alerts: UnifiDeviceOffline (critical), UnifiDeviceHighCpu (>85%), UnifiDeviceHighMemory (>90%), UnifiDeviceHighTemp (>75C), UnifiWanLatencyHigh (>100ms), UnifiPollerDown, UnifiControllerUnreachable
Bootstrap:

```shell
kubectl create secret generic unifi-poller-credentials -n monitoring \
  --from-literal=username=<unifi-readonly-user> \
  --from-literal=password=<unifi-readonly-password>
```
Note: Use a read-only local UniFi user for polling. Do not use the primary admin account -- if auth fails repeatedly the UDM Pro rate-limits all login attempts.
Grafana dashboard: auto-loaded via ConfigMap grafana-dashboard-unifi-network (sidecar label grafana_dashboard: "1").
See also: NAS Exporter -- full deployment guide, config reference, and dashboard details.
Custom Python exporter deployed in the media namespace. Monitors the Ugreen DXP4800 NAS (192.168.30.10) via NFS mount and TCP probes -- no NAS-side configuration required.
| Property | Value |
|---|---|
| Namespace | media |
| Image | python:3.12-slim (runs inline script from ConfigMap) |
| Port | 9355 |
| Scrape interval | 60s |
| NFS mount | /volume1/media (ReadOnlyMany, nfsvers=3) |
| UID | 10010 (svc-jellyfin -- read-only access) |
Key metrics: nas_up, nas_nfs_mount_up, nas_volume_total_bytes, nas_volume_used_bytes, nas_volume_free_bytes, nas_volume_usage_ratio, nas_nfs_latency_seconds, nas_media_items_total{media_type}, nas_downloads_pending_items{category}
Alerts:
- NASDown (TCP port 2049 unreachable, critical, 5m)
- NASNFSMountDown (NFS mount inaccessible, critical, 5m)
- NASVolumeCritical (>95% full, critical, 5m)
- NASVolumeWarning (>85% full, warning, 30m)
- NASHighNFSLatency (>2s listdir, warning, 10m)
- NASDownloadsBacklog (>10 items in downloads staging for 6h, warning)
- NASExporterDown (warning, 5m)

See also: Seedbox Exporter -- full deployment guide, config reference, and dashboard details.
Custom Python exporter deployed in the media namespace. Connects to the RapidSeedbox (45.128.27.65) rTorrent instance via the ruTorrent HTTPRPC XML-RPC plugin.
| Property | Value |
|---|---|
| Namespace | media |
| Image | python:3.12-slim (runs inline script from ConfigMap) |
| Port | 9354 |
| Scrape interval | 60s |
| rTorrent endpoint | https://45.128.27.65/rutorrent/plugins/httprpc/action.php |
| SFTP check | 45.128.27.65:2222 (also checked by this exporter) |
Key metrics: seedbox_up, seedbox_sftp_up, seedbox_torrents_total{state}, seedbox_active_torrents, seedbox_download_speed_bytes, seedbox_upload_speed_bytes, seedbox_free_disk_bytes, seedbox_global_ratio
Torrent states: Downloading, Seeding, Stopped, Hashing, Error
Alerts:
- SeedboxDown (rTorrent unreachable, warning, 10m)
- SeedboxSFTPDown (SFTP port closed -- blocks rclone sync, critical, 10m)
- SeedboxDiskLow (<5GB free, critical, 5m)
- SeedboxDiskWarning (<20GB free, warning, 30m)
- SeedboxTorrentErrors (>0 torrents in Error state, warning, 30m)
- SeedboxIdle (0 active torrents for 24h, info)
- SeedboxExporterDown (warning, 5m)

Bootstrap:

```shell
kubectl create secret generic seedbox-exporter-credentials -n media \
  --from-literal=XMLRPC_URL='https://45.128.27.65/rutorrent/plugins/httprpc/action.php' \
  --from-literal=HTTP_USER='<user>' \
  --from-literal=HTTP_PASSWORD='<seedbox-password>' \
  --from-literal=SFTP_HOST='45.128.27.65' \
  --from-literal=SFTP_PORT='2222'
```
Note: The seedbox-sftp secret (used by rclone-sync) is separate from seedbox-exporter-credentials. The exporter uses HTTP basic auth against the ruTorrent reverse proxy.
Two exporters cover GitHub repo and org-level metrics for the zolty-mat org. Both share the same PAT secret (github-exporter-token).
Repo-level metrics for the zolty-mat org: stars, forks, open issues, releases, and GitHub API rate limit remaining.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9171 |
| Scrape interval | 120s (deliberately slow -- GitHub API rate limits) |
| Image | githubexporter/github-exporter:latest |
| Manifest | kubernetes/core/github-exporter.yaml |
Credentials: Secret github-exporter-token, key github_token -- PAT with repo + read:org scopes.
Bootstrap:

```shell
kubectl create secret generic github-exporter-token -n monitoring \
  --from-literal=github_token=<ghp_...>
```
Key metrics: per-repo stars, forks, open issues, releases; github_rate_limit_remaining
Org membership metrics for the zolty-mat org: member counts by role, pending invitations, and team sizes.
| Property | Value |
|---|---|
| Namespace | monitoring |
| Port | 9172 |
| Scrape interval | 300s (org membership data changes infrequently) |
| Image | harbor.k3s.internal.strommen.systems/production/github-org-exporter:latest |
| Manifest | kubernetes/core/github-org-exporter.yaml |
Credentials: Reuses github-exporter-token secret (same PAT -- read:org scope is sufficient).
Key metrics: org member counts by role, pending invitation count, team member counts
Alert rules for GitHub runner availability, exporter health, API rate limits, and org hygiene. Manifest: kubernetes/core/github-alerts.yaml.
| Alert | Severity | Condition |
|---|---|---|
| GitHubRunnerDown | warning | Less than 3 ARC runner pods ready in arc-runner-system for 5m |
| GitHubExporterDown | warning | up{job="github-exporter"} == 0 for 5m |
| GitHubAPIRateLimitLow | warning | github_rate_remaining{resource="core"} < 100 for 2m |
| GitHubPendingOrgInvitation | info | github_org_pending_invitations_total > 0 for 5m -- check zolty-mat org invitations |
Alert rules are spread across PrometheusRule CRDs. Key alert groups:
| Group | Alerts | Manifest |
|---|---|---|
| node-alerts | NodeHighCPU (>85% 10m), NodeHighMemory (>90% 10m), NodeDiskFull (>85% 5m) | |
| pod-alerts | PodCrashLooping (>3 restarts/15m), PodNotReady (10m) | |
| longhorn-alerts | LonghornVolumeHealthy (robustness != 1, 5m) | |
| github-runner-alerts | GitHubRunnerDown (<3 ready runners 5m), GitHubExporterDown, GitHubAPIRateLimitLow (<100), GitHubPendingOrgInvitation | kubernetes/core/github-alerts.yaml |
| backup-alerts | EtcdBackupFailed, EtcdBackupStale (>25h), PostgreSQLBackupFailed, PostgreSQLBackupStale (>25h per-ns), LonghornRecurringJobFailed, LonghornSnapshotCount (<7 snaps, info), LonghornBackupTargetUnreachable (critical), BackupS3BucketQuotaHigh (>80GB), BackupCostHigh (>10k API req/mo, info), BackupJobOOMKilled, BackupJobTimeout (>30m), BackupRestorationTested (no restore in 30d, info), VeleroBackupFailed, VeleroBackupSizeHigh (>400GB) | kubernetes/core/backup-alerts.yaml |
| anthropic-cost | AnthropicMonthlyCostHigh (>$100 MTD), AnthropicCostExporterDown | |
| openrouter-cost | OpenRouterCreditsLow (<$10), OpenRouterDailySpendHigh (>$20/day), OpenRouterCostExporterDown | |
| aws-cost | AWSMonthlyCostHigh (>$120 MTD), AWSDailyCostSpike (>$10/24h), AWSCostExporterDown, AWSCostDataStale (>12h) | kubernetes/apps/aws-cost-exporter/aws-cost-exporter.yaml |
| internal.honeypot | InternalServiceProbed (honeypot hit -- fires immediately, critical) | kubernetes/apps/kube-utils/kube-utils.yaml |
| ham.fitness | HamHRVUnbalancedStreak (>=3d), HamWeightSyncStale (>48h), HamPainLevelElevated (>5), HamWeightGoalOffTrack (>190lbs), HamWeightGoalOnTrack (<=185lbs, info) | kubernetes/apps/monitoring/ham-alert-rules.yaml |
| media-stack-alerts | JellyfinDown (critical), JellyseerrDown, PlexDown (critical), GPUTranscodeOverload (>90% 15m), MediaNFSUnavailable (critical), RcloneSyncFailing, JellyfinHighMemory (>3.5GB), RadarrDown, SonarrDown, ProwlarrDown, BazarrDown, DownloadQueueStuck (6h), MediaControllerDown, AutoManagedNearCapacity (>4.5TB), MediaControllerPruneFailed, MediaControllerDiscoveryDry (72h info), MediaControllerAPIErrors | kubernetes/apps/media/alerts.yaml |
| jellyfin-ha | JellyfinReplicaCountLow (warning, <2 replicas ready 2m), JellyfinAllReplicasDown (critical, 0 replicas 1m), JellyfinPostgresDown (critical, 0 postgres replicas 2m), JellyfinRedisDown (warning, cache unavailable 60s), JellyfinTranscodeSessionOrphaned (warning, fewer pods than desired + Redis up 2m) | kubernetes/apps/media/jellyfin-ha-alerts.yaml |
| unifi-device-alerts | UnifiDeviceOffline, UnifiDeviceHighCpu, UnifiDeviceHighMemory, UnifiDeviceHighTemp | |
| unifi-wan-alerts | UnifiWanLatencyHigh (>100ms) | |
| unifi-poller-health | UnifiPollerDown, UnifiControllerUnreachable | |
| nas-storage | NASDown, NASNFSMountDown, NASVolumeCritical, NASVolumeWarning, NASHighNFSLatency, NASDownloadsBacklog, NASExporterDown | |
| seedbox | SeedboxDown, SeedboxSFTPDown, SeedboxDiskLow, SeedboxDiskWarning, SeedboxTorrentErrors, SeedboxIdle, SeedboxExporterDown | |
| pve-alerts | PveNodeDown, PveNodeHighCpu, PveNodeHighSwap, PveNodeStorageFull, PveVmDown, PveExporterDown | |
| kubernetes.tuned | KubePodNotHealthy (15m, critical), KubePersistentvolumeclaimFillingUp (>85% + growing), plus cert expiry and other kube-prometheus-stack threshold overrides | kubernetes/apps/monitoring/alert-rules-tuned.yaml |
| harbor.rules | HarborDown (5m, critical), HarborComponentDown (5m, critical), HarborScanQueueBacklog (>20 tasks 15m, warning), HarborHighErrorRate (>5% 5xx for 10m, warning) | kubernetes/apps/harbor/prometheusrule.yaml |
The monitoring stack uses NFS-backed storage to allow Prometheus to run on any worker node without Longhorn:
| StorageClass | NFS Path | Used By |
|---|---|---|
| nfs-monitoring | /volume1/monitoring (verify in UGOS Pro -> File Service -> NFS) | Prometheus PVC (30Gi), Loki PVC (10Gi), AlertManager PVC (1Gi) |
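A claim against this class follows the usual PVC pattern (a sketch; the chart templates the real claims itself via its storageSpec/persistence values, so this manifest is illustrative only):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data          # illustrative name
  namespace: monitoring
spec:
  storageClassName: nfs-monitoring   # provisions under /volume1/monitoring on the NAS
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 30Gi
```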