fiddy/docs/public-launch-runbook.md
Nico a0514f0823
docs: switch active deployment runbooks from dokploy to ssh compose
2026-02-22 01:51:44 -08:00


# Public Launch Runbook (Self-Hosted + SSH Compose)
## 1) Goals
- Deploy Fiddy publicly without a stack rewrite.
- Keep Postgres self-hosted.
- Enable fast rollback and basic operational visibility.
- Keep security baseline enforceable for direct home-IP exposure.
## 2) Deploy Host (SSH Compose)
1. Prepare Linux deploy host with Docker Engine + Compose plugin.
2. Ensure deploy target directory exists (`/opt/fiddy`).
3. Configure web image source: `git.nicosaya.com/nalalangan/fiddy/web`.
4. Configure scheduler image source: `git.nicosaya.com/nalalangan/fiddy/scheduler`.
5. Deploy by immutable tag (`github.sha`) and keep `main` as convenience tag.
6. Configure health check endpoint: `/api/health/ready`.
7. Keep previous image tags for rollback.
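
Steps 3 to 7 can be sketched as a small deploy helper. This is only a sketch under stated assumptions, not the actual workflow: the `WEB_IMAGE` and `SCHEDULER_IMAGE` variable names, the `DRY_RUN` switch, and the idea that the Compose file in `/opt/fiddy` reads those variables are all hypothetical.

```shell
#!/bin/sh
# Sketch only: variable names and Compose wiring are assumptions,
# not the repository's actual setup.
REGISTRY="git.nicosaya.com/nalalangan/fiddy"

# Build a fully qualified image reference.
# $1 = service (web|scheduler), $2 = immutable tag (commit SHA)
image_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# Print (DRY_RUN=1, the default) or run the deploy commands for one tag.
deploy_tag() {
  tag="$1"
  WEB_IMAGE="$(image_ref web "$tag")"
  SCHEDULER_IMAGE="$(image_ref scheduler "$tag")"
  export WEB_IMAGE SCHEDULER_IMAGE
  for cmd in "docker compose pull" "docker compose up -d"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run in /opt/fiddy: $cmd"
    else
      # $cmd is intentionally unquoted so it splits into command + arguments.
      (cd /opt/fiddy && $cmd)
    fi
  done
}
```

Because every deploy pins a commit SHA, a rollback would be the same call with the previous known-good SHA, for example `DRY_RUN=0 deploy_tag <previous-sha>`.
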
### Required secrets/variables
- `DATABASE_URL`
- `DATABASE_SSL`
- `ALLOWED_DB_NAMES`
- `SESSION_COOKIE_NAME`
- `SESSION_TTL_DAYS`
- `DEBUG_API=0`
- `SCHEDULER_POLL_MS` (scheduler app, optional)
- `SCHEDULER_BATCH_SIZE` (scheduler app, optional)
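
A pre-deploy check for the required variables might look like the sketch below. The function names are hypothetical, and treating `DEBUG_API` as "must be exactly `0`" is an interpretation of the list above rather than documented behavior. The optional scheduler variables are not checked.

```shell
#!/bin/sh
# Sketch: report required variables that are unset or empty.
REQUIRED_VARS="DATABASE_URL DATABASE_SSL ALLOWED_DB_NAMES SESSION_COOKIE_NAME SESSION_TTL_DAYS DEBUG_API"

# Print the name of every required variable that is unset or empty.
missing_vars() {
  for name in $REQUIRED_VARS; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Return 0 only when nothing is missing and DEBUG_API is exactly 0.
check_env() {
  missing="$(missing_vars)"
  if [ -n "$missing" ]; then
    echo "missing: $(echo "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
  if [ "$DEBUG_API" != "0" ]; then
    echo "DEBUG_API must be 0 for public launch" >&2
    return 1
  fi
}
```

Running `check_env || exit 1` at the top of a deploy script would stop the deploy before any container is replaced.
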
## 3) CI/CD (Gitea Actions)
- Use `.gitea/workflows/deploy-ssh-compose.yml`.
- Required secrets:
  - `REGISTRY_USER`
  - `REGISTRY_PASS`
  - `DEPLOY_KEY`
  - `DEPLOY_HOST`
  - `DEPLOY_USER`
  - `DEPLOY_HEALTHCHECK_URL`
- Health gate:
  - workflow calls `scripts/wait-for-health.sh` against `DEPLOY_HEALTHCHECK_URL`
  - default retry window: 5 minutes (30 attempts x 10s)
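
The health gate's retry behavior can be illustrated with a minimal sketch. The real `scripts/wait-for-health.sh` may differ. The `probe` and `wait_for_health` names and the 5-second per-request timeout are assumptions.

```shell
#!/bin/sh
# Sketch of a health gate; not the repository's actual script.

# Single probe: succeeds only if the server answers with a status below 400
# within 5 seconds.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# Retry probe up to $2 times (default 30), sleeping $3 seconds (default 10)
# between failed attempts.
wait_for_health() {
  url="$1"; attempts="${2:-30}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

With the defaults, `wait_for_health "$DEPLOY_HEALTHCHECK_URL"` gives up after roughly five minutes, which matches the retry window above.
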
## 4) Reverse Proxy + Network Hardening
- Use your existing Nginx reverse proxy/vhost.
- Apply the required Fiddy directives using `docker/nginx/fiddy.conf` and `docker/nginx/includes/fiddy-proxy.conf` as templates.
- For Nginx Proxy Manager-specific setup, follow `docs/08_NGINX_PROXY_MANAGER_SETUP.md`.
- NPM note: apply `add_header`/`proxy_set_header` in Custom Location `/` (and specific API locations), not only Proxy Host Advanced.
- Install certificate with Let's Encrypt.
- Route 443 -> app container only.
- Keep Postgres private; never expose 5432 publicly.
- Restrict SSH to allowlist/VPN.
- Add host firewall rules:
  - Allow inbound `80/443`.
  - Deny all other inbound by default.
- Confirm Nginx writes JSON logs:
  - `/var/log/nginx/fiddy-access.log`
  - `/var/log/nginx/fiddy-error.log`
- If your log paths differ, update:
  - `docker/observability/promtail-config.yml`
  - `docker/security/fail2ban/jail.d/fiddy-nginx.conf`
  - `docker/security/crowdsec/acquis.yaml`
- Apply/verify host baseline using scripts:
  - dry-run firewall apply: `SSH_ALLOW_CIDR=<your-cidr> DRY_RUN=1 scripts/harden-host-ufw.sh`
  - real firewall apply: `SSH_ALLOW_CIDR=<your-cidr> DRY_RUN=0 sudo scripts/harden-host-ufw.sh`
  - host status audit: `scripts/check-host-security.sh`
- Auto-ban templates:
  - fail2ban: `docker/security/fail2ban/*`
  - crowdsec (optional): `docker/security/crowdsec/acquis.yaml`
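
The firewall baseline described above roughly corresponds to the UFW rule set sketched here. This is an illustration of the policy, not the contents of `scripts/harden-host-ufw.sh`. The function names are hypothetical, and SSH on port 22 is an assumption.

```shell
#!/bin/sh
# Sketch of a UFW baseline; the real hardening script may differ.

# Print the rule set for a given SSH allowlist CIDR, one ufw command per line.
ufw_rules() {
  ssh_cidr="$1"
  cat <<EOF
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp
ufw allow 443/tcp
ufw allow from $ssh_cidr to any port 22 proto tcp
EOF
}

# Print the rules when DRY_RUN=1 (the default); otherwise run them (needs root).
apply_rules() {
  ufw_rules "$1" | while IFS= read -r rule; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "[dry-run] $rule"
    else
      # Unquoted on purpose so the line splits into command + arguments;
      # stdin is redirected so the command cannot consume the rule list.
      $rule </dev/null
    fi
  done
}
```

Enabling the firewall (`ufw enable`) is deliberately left out of the sketch so that the SSH allow rule can be reviewed first.
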
## 5) Observability
- Bring up monitoring stack:
  - `docker compose -f docker/observability/docker-compose.observability.yml up -d`
- Configure Grafana datasource to Loki (`http://loki:3100`).
- Verify nginx logs are ingested by Promtail (`job="nginx"`).
- Add Uptime Kuma monitors:
  - `/api/health/live`
  - `/api/health/ready`
  - home page (`/`)
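
One way to verify Promtail ingestion from the command line is to ask Loki which `job` label values it has seen. The sketch below assumes Loki's HTTP port is reachable at `http://localhost:3100` from where you run it (inside the Compose network it would be `http://loki:3100`). The function names are hypothetical, and the match is a crude string check rather than real JSON parsing.

```shell
#!/bin/sh
# Sketch: check that Loki has seen a given `job` label value.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"  # assumption: reachable from here

# Fetch the known values of the `job` label from Loki's HTTP API.
fetch_jobs() {
  curl -fsS "$LOKI_URL/loki/api/v1/label/job/values"
}

# Read the API's JSON on stdin; succeed if the job name appears as a quoted string.
has_job() {
  grep -q "\"$1\""
}
```

Usage: `fetch_jobs | has_job nginx && echo "nginx logs are being ingested"`.
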
### 5.1) Deployment Smoke Check
- Run after every deploy and rollback:
  - `scripts/smoke-public-launch.sh https://your-domain`
- The script verifies:
  - `/api/health/live` and `/api/health/ready` return `200`
  - both responses include `X-Request-Id` header
  - both response bodies include `request_id`
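
The three checks listed above can be sketched as follows. This is not the actual `scripts/smoke-public-launch.sh`. The function names are hypothetical, and the body check is a plain substring match on `request_id`.

```shell
#!/bin/sh
# Sketch of the per-endpoint checks; the real smoke script may differ.

# Validate one captured response.
# $1 = HTTP status, $2 = file with response headers, $3 = file with response body
check_response() {
  [ "$1" = "200" ] || { echo "unexpected status: $1" >&2; return 1; }
  grep -qi '^x-request-id:' "$2" || { echo "missing X-Request-Id header" >&2; return 1; }
  grep -q 'request_id' "$3" || { echo "missing request_id in body" >&2; return 1; }
}

# Fetch one endpoint with curl and validate it.
# $1 = base URL (e.g. https://your-domain), $2 = path
smoke_endpoint() {
  base="$1"; path="$2"
  tmp="$(mktemp -d)"
  # -D writes response headers, -o the body, -w prints only the status code.
  status="$(curl -sS -o "$tmp/body" -D "$tmp/headers" -w '%{http_code}' "$base$path")" || status="000"
  rc=0
  check_response "$status" "$tmp/headers" "$tmp/body" || rc=$?
  rm -rf "$tmp"
  return "$rc"
}
```

Usage: `for p in /api/health/live /api/health/ready; do smoke_endpoint https://your-domain "$p" || exit 1; done`.
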
## 6) Backup + Restore
- Daily backup command:
  - `scripts/backup-postgres.sh`
- Periodic base backup (for faster full recovery):
  - `PRIMARY_DATABASE_URL=<replication-url> scripts/basebackup-postgres.sh`
- Retention:
  - default 7 days (`RETENTION_DAYS=7`)
- Restore drill:
  - `scripts/restore-drill-postgres.sh backups/postgres/<file>.dump <target_database_url>`
- Run restore drill on non-prod DB before public launch.
- Record drill outcome:
  - `scripts/log-restore-drill.sh <environment> <backup_file> <restore_target> <status> <rto_minutes> <notes>`
  - log file: `docs/restore-drill-log.csv`
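
To illustrate how a daily dump and the retention window fit together, here is a minimal sketch. It is not the actual `scripts/backup-postgres.sh`. The `BACKUP_DIR` variable, the function names, and the dump file name pattern are assumptions. Only `RETENTION_DAYS`, `DATABASE_URL`, and the `backups/postgres` path come from this runbook.

```shell
#!/bin/sh
# Sketch of dump + retention; the real backup script may differ.
BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"

# Write a custom-format dump named after the current UTC time and print its path.
backup_now() {
  mkdir -p "$BACKUP_DIR"
  out="$BACKUP_DIR/fiddy-$(date -u +%Y%m%dT%H%M%SZ).dump"  # name pattern is an assumption
  pg_dump --format=custom --file="$out" "$DATABASE_URL" && echo "$out"
}

# Delete old dumps. find rounds file age down to whole days, so with
# RETENTION_DAYS=7 this removes files that are at least 8 days old.
prune_old() {
  find "$BACKUP_DIR" -type f -name '*.dump' -mtime +"$RETENTION_DAYS" -exec rm -f {} +
}
```

Pruning only after a successful dump, as in `backup_now && prune_old`, avoids deleting the last good backups when the database is unreachable.
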
## 7) Incident Response Quick Flow
1. Identify failing request and `request_id`.
2. Correlate application logs (Loki) by `request_id`.
3. Check `/api/health/ready` status and DB connectivity.
4. Roll back to previous known-good image tag via SSH Compose if needed.
5. Capture root cause and update this runbook/checklist.
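
Step 2 can be done from a terminal as well as from Grafana. The sketch below builds a LogQL line-filter query for one `request_id` and sends it to Loki's HTTP API. The default selector `{job="nginx"}` is only a placeholder taken from the Observability section. The labels on the application logs, the `LOG_SELECTOR` and `LOKI_URL` variables, and the function names are assumptions.

```shell
#!/bin/sh
# Sketch: look up log lines for one request_id in Loki.
# The default selector is a placeholder; use the labels your app logs carry.
if [ -z "${LOG_SELECTOR:-}" ]; then
  LOG_SELECTOR='{job="nginx"}'
fi

# Print a LogQL query matching log lines that contain the given request_id.
logql_for_request() {
  printf '%s |= "%s"\n' "$LOG_SELECTOR" "$1"
}

# Run the query against Loki's HTTP API using its default time range.
query_loki() {
  curl -fsS -G "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_for_request "$1")"
}
```

Usage: `query_loki <request_id>`. The same query string printed by `logql_for_request` can be pasted into Grafana's Explore view.
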
## 8) Rollback Checklist
1. Select previous healthy image tag for both `web` and `scheduler`.
2. Trigger rollback deploy and wait for completion.
3. Run `scripts/smoke-public-launch.sh https://your-domain`.
4. Verify error-rate drop in Grafana/Loki and confirm no DB migration mismatch.
5. Log the rolled-back version, timestamp, and reason.