Diagram: 4-Tier Backup Strategy — Cluster Backup Architecture (FigJam)
Comprehensive backup strategy and restoration procedures for the homelab cluster.
| Component | Method | Schedule | Retention | Storage |
|---|---|---|---|---|
| etcd (cluster state) | k3s etcd-snapshot + S3 | Daily 2:00 AM UTC | 7 days | S3 + Glacier |
| PostgreSQL databases | pg_dump + S3 | Daily 3:00-4:05 AM UTC | 7 days | S3 + Glacier |
| Longhorn volumes | S3 recurring backup job | Daily 5:00 AM UTC | 7 days | S3 + Glacier |
| Velero (K8s objects) | velero schedule | Daily 2AM / Weekly Sun 3AM / Monthly 1st 4AM UTC | 30 / 360 / 365 days | S3 |
| Git repositories | GitHub | On push | Forever | GitHub |
| Terraform state | S3 backend (versioned) | On apply | Forever | S3 |
Bucket: k3s-homelab-backups-855878721457
Region: us-east-1
IAM User: k3s-backups
Lifecycle Policy:
The etcd backup CronJob runs daily at 2:00 AM UTC in kube-system.
Note: k3s also has a built-in etcd snapshot schedule (via --etcd-snapshot-schedule-cron set to every 6h in Ansible config). The CronJob is the primary backup path that pushes to S3.
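For orientation, the CronJob is roughly shaped like the sketch below. The image, mount paths, and sync command are assumptions; consult the actual manifest in the cluster repo for the authoritative version.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"   # Daily 2:00 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etcd-backup
              image: amazon/aws-cli:latest   # assumption
              envFrom:
                - secretRef:
                    name: etcd-backup-aws-credentials
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  aws s3 sync /snapshots
                  s3://k3s-homelab-backups-855878721457/etcd-snapshots/
              volumeMounts:
                - name: snapshots
                  mountPath: /snapshots
                  readOnly: true
          volumes:
            - name: snapshots
              hostPath:
                path: /var/lib/rancher/k3s/server/db/snapshots
```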
The CronJob requires an etcd-backup-aws-credentials secret in kube-system. Create it once from Terraform outputs:
```bash
cd terraform/environments/aws
kubectl create secret generic etcd-backup-aws-credentials \
  --namespace kube-system \
  --from-literal=AWS_ACCESS_KEY_ID=$(terraform output -raw backup_access_key_id) \
  --from-literal=AWS_SECRET_ACCESS_KEY=$(terraform output -raw backup_secret_access_key)
```
Manual snapshot:
```bash
sudo k3s etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d-%H%M)
sudo k3s etcd-snapshot list
```
Default location: /var/lib/rancher/k3s/server/db/snapshots/
PostgreSQL CronJobs run per-namespace. Each app namespace with a PostgreSQL StatefulSet has a postgres-backup CronJob that runs pg_dump and uploads to S3.
| Database | Namespace | Schedule | S3 Path |
|---|---|---|---|
| cardboard | cardboard | 3:00 AM UTC | postgres-backups/cardboard/ |
| openclaw-memory-db | open-webui | 3:00 AM UTC | zolty-homelab-backups/openclaw-memory/ (separate bucket) |
| ham | ham | 3:15 AM UTC | postgres-backups/ham/ |
| trade-bot | trade-bot | 3:15 AM UTC | postgres-backups/trade-bot/ |
| aja-recipes | aja-recipes | 3:20 AM UTC | postgres-backups/aja-recipes/ |
| dnd | dnd | 3:25 AM UTC | postgres-backups/dnd/ |
| wiki | wiki | 3:30 AM UTC | postgres-backups/wiki/ |
| jellyfin | media | 3:30 AM UTC | postgres-backups/jellyfin/ |
| digital-signage | digital-signage | 3:35 AM UTC | postgres-backups/digital-signage/ |
| openclaw-ops | openclaw-ops | 3:40 AM UTC | postgres-backups/openclaw-ops/ |
| openclaw-personal | openclaw-personal | 3:45 AM UTC | postgres-backups/openclaw-personal/ |
| media-profiler | media-profiler | 3:50 AM UTC | postgres-backups/media-profiler/ |
| media-controller | media | 4:00 AM UTC | postgres-backups/media-controller/ |
| authentik | authentik | 4:00 AM UTC | postgres-backups/authentik/ |
| polymarket (TimescaleDB) | polymarket-lab | 4:05 AM UTC | postgres-backups/polymarket-lab/ |
All 15 databases have automated daily S3 backup CronJobs. See Scheduled Jobs for the full schedule.
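Each of these CronJobs runs a small script along the following lines. This is a sketch, not the actual job: the service name `postgres`, the env var names, and the credential handling are assumptions.

```shell
#!/bin/sh
# Sketch of a per-namespace postgres-backup job.
set -eu
NAMESPACE="${NAMESPACE:-cardboard}"
BUCKET="s3://k3s-homelab-backups-855878721457"
TIMESTAMP="$(date -u +%Y%m%d-%H%M%S)"
DUMP="/tmp/${NAMESPACE}-${TIMESTAMP}.sql.gz"
KEY="${BUCKET}/postgres-backups/${NAMESPACE}/${NAMESPACE}-${TIMESTAMP}.sql.gz"
echo "backing up ${NAMESPACE} -> ${KEY}"
# In-cluster, the job would then run (requires pg_dump and the aws CLI):
#   pg_dump -h postgres -U "$NAMESPACE" "$NAMESPACE" | gzip > "$DUMP"
#   aws s3 cp "$DUMP" "$KEY"
```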
TimescaleDB note: The polymarket-lab backup uses standard pg_dump, which is compatible with TimescaleDB but restores hypertables as regular tables. For full chunk-level recovery, migrate to timescaledb-backup in a future iteration.
Verify with: kubectl get cronjob -A | grep backup
Velero backs up Kubernetes object state (Deployments, ConfigMaps, Secrets, ServiceAccounts, etc.) to S3. It does NOT back up PVC data. Combine with pg_dump and Longhorn backups for full coverage.
```bash
# Inspect schedules and backups
kubectl get schedule -n velero
velero backup get
velero backup describe <backup-name> --details

# Restore a namespace from the latest daily backup
velero restore create --from-schedule k3s-daily-backup --include-namespaces <namespace>
```
Schedules (manifest: kubernetes/apps/velero/backup-schedules.yaml):
| Schedule Name | Cron | Time (UTC) | TTL | Purpose |
|---|---|---|---|---|
| k3s-daily-backup | 0 2 * * * | Daily 2:00 AM | 720h (30 days) | Fast recent-state recovery |
| k3s-weekly-backup | 0 3 * * 0 | Sundays 3:00 AM | 8640h (~360 days) | Longer-term weekly snapshots |
| k3s-monthly-backup | 0 4 1 * * | 1st of month 4:00 AM | 8760h (365 days) | Long-term compliance/rollback |
Excluded namespaces: velero, kube-system, kube-node-lease, kube-public
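For reference, one entry in that manifest looks roughly like the sketch below, with field values taken from the table above (metadata labels omitted; check the real manifest for exact structure):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: k3s-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 720h0m0s
    excludedNamespaces:
      - velero
      - kube-system
      - kube-node-lease
      - kube-public
```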
Longhorn recurring backup job runs daily at 5:00 AM UTC and retains 7 snapshots. Backup target: s3://k3s-homelab-backups-855878721457@us-east-1/longhorn-backups
The kubernetes/core/longhorn-s3-config.yaml manifest contains placeholder credentials. Before applying, replace with actual values from Terraform:
```bash
# Get credentials from Terraform state
cd terraform/environments/aws
aws_key=$(terraform output -raw backup_access_key_id)
aws_secret=$(terraform output -raw backup_secret_access_key)

# Create the secret
kubectl create secret generic longhorn-s3-credentials \
  -n longhorn-system \
  --from-literal=AWS_ACCESS_KEY_ID="$aws_key" \
  --from-literal=AWS_SECRET_ACCESS_KEY="$aws_secret" \
  --from-literal=AWS_ENDPOINTS="https://s3.us-east-1.amazonaws.com" \
  --dry-run=client -o yaml | kubectl apply -f -
```
Backup target URL: s3://k3s-homelab-backups-855878721457@us-east-1/longhorn-backups
All infrastructure configuration lives in Git. Git-tracked items:
Not in Git:
| What | When | Where |
|---|---|---|
| etcd snapshots | Daily 2:00 AM UTC | S3 |
| Velero K8s objects (daily) | Daily 2:00 AM UTC | S3 (30-day TTL) |
| PostgreSQL dumps | Daily 3:00-4:05 AM UTC | S3 |
| Longhorn volumes | Daily 5:00 AM UTC | S3 |
| Velero K8s objects (weekly) | Sundays 3:00 AM UTC | S3 (~360-day TTL) |
| Velero K8s objects (monthly) | 1st of month 4:00 AM UTC | S3 (365-day TTL) |
| Terraform state | On apply | S3 (versioned bucket) |
| Git repository | On push | GitHub |
The cluster maintains 4 backup tiers: etcd snapshots (cluster state), PostgreSQL dumps (application data), Longhorn volume backups (PVC data), and Velero (Kubernetes object state).
All backups stored in S3: s3://k3s-homelab-backups-855878721457/
Quick restore:
```bash
./scripts/restore-postgres.sh <namespace> <timestamp>
```
Manual steps:
```bash
# 1. Find the backup to restore
aws s3 ls s3://k3s-homelab-backups-855878721457/postgres-backups/cardboard/

# 2. Download it
NAMESPACE=cardboard
TIMESTAMP=20260211-154344
aws s3 cp "s3://k3s-homelab-backups-855878721457/postgres-backups/${NAMESPACE}/${NAMESPACE}-${TIMESTAMP}.sql.gz" /tmp/

# 3. Stop the application so nothing writes during the restore
kubectl scale deployment -n cardboard cardboard --replicas=0
kubectl wait --for=delete pod -n cardboard -l app.kubernetes.io/name=cardboard --timeout=60s

# 4. Recreate the database
kubectl exec -n cardboard postgres-0 -- psql -U postgres -c "DROP DATABASE IF EXISTS cardboard;"
kubectl exec -n cardboard postgres-0 -- psql -U postgres -c "CREATE DATABASE cardboard OWNER cardboard;"

# 5. Load the dump
gunzip < /tmp/cardboard-${TIMESTAMP}.sql.gz | kubectl exec -i -n cardboard postgres-0 -- psql -U cardboard -d cardboard

# 6. Verify tables exist, then restart the application
kubectl exec -n cardboard postgres-0 -- psql -U cardboard -d cardboard -c "\dt"
kubectl scale deployment -n cardboard cardboard --replicas=1
```
Timing: Small DBs (<100MB) ~30s | Medium (100MB-1GB) ~2-5min | Large (>1GB) ~10-30min
```bash
kubectl get volumesnapshot -A
```
Create a PVC from a snapshot by applying a PVC manifest whose dataSource points to the VolumeSnapshot.
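A minimal example of such a manifest (the name, namespace, snapshot name, and size are placeholders; storageClassName assumes Longhorn as in the rest of this doc):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data          # placeholder
  namespace: cardboard         # placeholder
spec:
  storageClassName: longhorn
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: <volumesnapshot-name>
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi            # must be >= the snapshot's volume size
```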
```bash
# Restore a namespace from the latest daily backup and wait for completion
velero restore create --from-schedule k3s-daily-backup --include-namespaces cardboard --wait

# Restore from a specific backup
velero restore create --from-backup <backup-name>

# Check restore status
velero restore describe <restore-name>
```
WARNING: etcd restoration resets cluster state to the backup timestamp. All changes after the backup timestamp are lost. Only use for disaster recovery.
```bash
# 1. Pull the snapshot from S3
aws s3 ls s3://k3s-homelab-backups-855878721457/etcd-snapshots/
aws s3 cp s3://k3s-homelab-backups-855878721457/etcd-snapshots/<snapshot-name> /tmp/

# 2. Restore on the first server node
sudo systemctl stop k3s
sudo k3s server --cluster-reset --cluster-reset-restore-path=/tmp/<snapshot-name>
sudo systemctl start k3s

# 3. Verify
sudo k3s kubectl get nodes
sudo k3s kubectl get pods -A

# 4. On each remaining server node, wipe the stale etcd data and rejoin
sudo systemctl stop k3s
sudo rm -rf /var/lib/rancher/k3s/server/db
sudo systemctl start k3s

# 5. Confirm all nodes rejoin
kubectl get nodes
```
Timing: etcd restore ~2-5min | Full cluster rejoin ~5-10min | Total downtime ~10-15min
```bash
kubectl get cronjob -n kube-system etcd-backup
kubectl get cronjob -n cardboard postgres-backup
kubectl get recurringjob -n longhorn-system
kubectl get schedule -n velero
velero backup get | head -10
aws s3 ls s3://k3s-homelab-backups-855878721457/etcd-snapshots/
```
Prometheus monitors via PrometheusRule backup-alerts (kubernetes/core/backup-alerts.yaml):
| Alert | Severity | Condition |
|---|---|---|
| EtcdBackupFailed | critical | etcd CronJob failed |
| EtcdBackupStale | warning | No successful etcd backup in >25h |
| PostgreSQLBackupFailed | critical | Any postgres-backup.* job failed |
| PostgreSQLBackupStale | warning | No successful postgres backup in >25h (per namespace) |
| LonghornRecurringJobFailed | warning | Longhorn recurring job status != success |
| LonghornSnapshotCount | info | Volume has <7 snapshots |
| LonghornBackupTargetUnreachable | critical | Longhorn S3 backup target unreachable |
| BackupS3BucketQuotaHigh | warning | Bucket k3s-homelab-backups-855878721457 exceeds 80GB |
| BackupCostHigh | info | Projected S3 API requests >10,000/month |
| BackupJobOOMKilled | warning | Backup container OOMKilled |
| BackupJobTimeout | warning | Backup job active >30 minutes |
| BackupRestorationTested | info | No restore job completed in last 30 days |
| VeleroBackupFailed | critical | Velero backup failed |
| VeleroBackupSizeHigh | warning | Velero backup exceeds 400GB |
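As an illustration, the etcd staleness rule can be expressed along these lines. The exact expression in backup-alerts.yaml may differ; this sketch assumes kube-state-metrics job metrics are available:

```yaml
- alert: EtcdBackupStale
  expr: >
    time()
    - max(kube_job_status_completion_time{namespace="kube-system", job_name=~"etcd-backup.*"})
    > 25 * 3600
  labels:
    severity: warning
  annotations:
    summary: No successful etcd backup in >25h
```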
Routes to: AlertManager → Slack #k3s-alerts | Email: mstrommen+k3s-alerts@gmail.com
Connect as the database owner, not postgres:
```bash
kubectl exec -n cardboard postgres-0 -- psql -U cardboard -d cardboard
```
```bash
kubectl logs -n longhorn-system deployment/longhorn-manager -f
```
Common causes: insufficient disk space, expired S3 credentials (longhorn-s3-credentials secret), S3 network issues.
Another server has stale etcd data. Reset all servers before restoring:
```bash
sudo systemctl stop k3s
sudo rm -rf /var/lib/rancher/k3s/server/db
```
Get credentials from Terraform:
```bash
cd terraform/environments/aws
terraform output -raw backup_user_access_key
terraform output -raw backup_user_secret_key
```
Then recreate the secret with the fresh values:
```bash
kubectl delete secret -n kube-system etcd-backup-aws-credentials
kubectl create secret generic etcd-backup-aws-credentials \
  --namespace kube-system \
  --from-literal=AWS_ACCESS_KEY_ID=<access-key> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret-key>
```
| Tier | Retention | Approximate Size | Monthly Cost |
|---|---|---|---|
| etcd snapshots | 7 days | ~350MB total | ~$0.01 |
| PostgreSQL dumps | 7 days | ~140KB per DB | negligible |
| Longhorn volumes | 7 days local + 7 days S3 | varies by PVC size | ~$0.20 |
| Velero K8s objects (daily) | 30 days | ~50MB per backup | ~$0.05 |
| Velero K8s objects (weekly) | ~360 days | ~50MB per backup | ~$0.05 |
| Velero K8s objects (monthly) | 365 days | ~50MB per backup | ~$0.05 |
| Total (S3) | — | ~10-15GB | ~$0.50 |
| Metric | Target |
|---|---|
| Backup Success Rate | >99% |
| RPO (Recovery Point Objective) | 24 hours |
| RTO (Recovery Time Objective) | <30 minutes |
| Backup Retention | 7 days (PVC/DB) / 30-365 days (Velero) |
| Backup Alert Latency | <5 min |