
# Public Launch Runbook (Self-Hosted + SSH Compose)

## 1) Goals
- Deploy Fiddy publicly without a stack rewrite.
- Keep Postgres self-hosted.
- Enable fast rollback and basic operational visibility.
- Keep the security baseline enforceable for direct home-IP exposure.

## 2) Deploy Host (SSH Compose)
1. Prepare Linux deploy host with Docker Engine + Compose plugin.
2. Ensure deploy target directory exists (`/opt/fiddy`).
3. Configure web image source: `git.nicosaya.com/nalalangan/fiddy/web`.
4. Configure scheduler image source: `git.nicosaya.com/nalalangan/fiddy/scheduler`.
5. Deploy by immutable tag (`github.sha`) and keep `main` only as a convenience tag.
6. Configure health check endpoint: `/api/health/ready`.
7. Keep previous image tags for rollback.
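
The steps above could be scripted roughly as follows. This is a minimal sketch, not the project's actual deploy script: it assumes the compose file in `/opt/fiddy` interpolates a `FIDDY_TAG` variable into both image references, and both `FIDDY_TAG` and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: deploy both images pinned to one immutable tag.
# Assumption: the compose file in DEPLOY_DIR reads FIDDY_TAG (hypothetical name).

REGISTRY_PATH="git.nicosaya.com/nalalangan/fiddy"
DEPLOY_DIR="${DEPLOY_DIR:-/opt/fiddy}"

# image_ref APP TAG -> fully qualified image reference for "web" or "scheduler".
image_ref() {
  printf '%s/%s:%s\n' "$REGISTRY_PATH" "$1" "$2"
}

# deploy_tag TAG -> pull and start the stack at that tag.
# Runs in a subshell so the caller's working directory is left alone.
deploy_tag() (
  tag="$1"
  [ -n "$tag" ] || { echo "usage: deploy_tag <commit-sha>" >&2; exit 2; }
  cd "$DEPLOY_DIR" || exit 1
  export FIDDY_TAG="$tag"
  echo "deploying $(image_ref web "$tag") and $(image_ref scheduler "$tag")"
  docker compose pull && docker compose up -d
)
```

Calling `deploy_tag <commit-sha>` with an earlier SHA is also the rollback path, which is why previous tags need to stay in the registry.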

### Required secrets/variables
- `DATABASE_URL`
- `DATABASE_SSL`
- `ALLOWED_DB_NAMES`
- `SESSION_COOKIE_NAME`
- `SESSION_TTL_DAYS`
- `DEBUG_API=0`
- `SCHEDULER_POLL_MS` (scheduler app, optional)
- `SCHEDULER_BATCH_SIZE` (scheduler app, optional)

## 3) CI/CD (Gitea Actions)
- Use `.gitea/workflows/deploy-ssh-compose.yml`.
- Required secrets:
  - `REGISTRY_USER`
  - `REGISTRY_PASS`
  - `DEPLOY_KEY`
  - `DEPLOY_HOST`
  - `DEPLOY_USER`
  - `DEPLOY_HEALTHCHECK_URL`
- Health gate:
  - workflow calls `scripts/wait-for-health.sh` against `DEPLOY_HEALTHCHECK_URL`
  - default retry window: 5 minutes (30 attempts x 10s)
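
For reference, a health gate with this retry window can be sketched as below. This is not the contents of `scripts/wait-for-health.sh`; the function names and the 5-second per-request timeout are assumptions.

```shell
#!/bin/sh
# Sketch of a health gate: poll a URL until it returns HTTP 200
# or the attempts run out. Defaults mirror 30 attempts x 10s.

# probe URL -> succeeds only when the endpoint answers with HTTP 200.
probe() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || return 1
  [ "$code" = "200" ]
}

# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # No need to sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

A non-zero exit from `wait_for_health "$DEPLOY_HEALTHCHECK_URL"` would fail the workflow step and leave the previous tag as the rollback target.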

## 4) Reverse Proxy + Network Hardening
- Use your existing Nginx reverse proxy/vhost.
- Apply the required Fiddy directives using `docker/nginx/fiddy.conf` and `docker/nginx/includes/fiddy-proxy.conf` as templates.
- For Nginx Proxy Manager-specific setup, follow `docs/08_NGINX_PROXY_MANAGER_SETUP.md`.
- Nginx Proxy Manager (NPM) note: apply `add_header`/`proxy_set_header` in Custom Location `/` (and in specific API locations), not only in the Proxy Host Advanced configuration.
- Install a TLS certificate from Let's Encrypt.
- Route 443 -> app container only.
- Keep Postgres private; never expose 5432 publicly.
- Restrict SSH to allowlist/VPN.
- Add host firewall rules:
  - Allow inbound `80/443`.
  - Deny all other inbound by default.
- Confirm Nginx writes its logs (access log in JSON format) to:
  - `/var/log/nginx/fiddy-access.log`
  - `/var/log/nginx/fiddy-error.log`
- If your log paths differ, update:
  - `docker/observability/promtail-config.yml`
  - `docker/security/fail2ban/jail.d/fiddy-nginx.conf`
  - `docker/security/crowdsec/acquis.yaml`
- Apply/verify host baseline using scripts:
  - dry-run firewall apply: `SSH_ALLOW_CIDR=<your-cidr> DRY_RUN=1 scripts/harden-host-ufw.sh`
  - real firewall apply: `SSH_ALLOW_CIDR=<your-cidr> DRY_RUN=0 sudo scripts/harden-host-ufw.sh`
  - host status audit: `scripts/check-host-security.sh`
- Auto-ban templates:
  - fail2ban: `docker/security/fail2ban/*`
  - crowdsec (optional): `docker/security/crowdsec/acquis.yaml`
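
The firewall rules listed above correspond to a small set of `ufw` commands. The sketch below shows that mapping under the same `SSH_ALLOW_CIDR`/`DRY_RUN` convention; it is not the contents of `scripts/harden-host-ufw.sh`, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the firewall baseline as ufw commands.
# With DRY_RUN=1 (the default here) commands are printed, not executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

apply_baseline() {
  [ -n "${SSH_ALLOW_CIDR:-}" ] || { echo "SSH_ALLOW_CIDR is required" >&2; return 2; }
  run ufw default deny incoming
  run ufw default allow outgoing
  run ufw allow 80/tcp
  run ufw allow 443/tcp
  # SSH only from the allowlisted range.
  run ufw allow from "$SSH_ALLOW_CIDR" to any port 22 proto tcp
  run ufw --force enable
}
```

Reviewing the dry-run output before a real apply helps avoid locking yourself out of SSH with a wrong CIDR.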

## 5) Observability
- Bring up monitoring stack:
  - `docker compose -f docker/observability/docker-compose.observability.yml up -d`
- Point the Grafana datasource at Loki (`http://loki:3100`).
- Verify nginx logs are ingested by Promtail (`job="nginx"`).
- Add Uptime Kuma monitors:
  - `/api/health/live`
  - `/api/health/ready`
  - home page (`/`)

### 5.1) Deployment Smoke Check
- Run after every deploy and rollback:
  - `scripts/smoke-public-launch.sh https://your-domain`
- The script verifies:
  - `/api/health/live` and `/api/health/ready` return `200`
  - both responses include `X-Request-Id` header
  - both response bodies include `request_id`
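
The three checks can be expressed with `curl` and `grep` as sketched below. This is an illustration of the checks, not the contents of `scripts/smoke-public-launch.sh`; the function names and the 10-second timeout are assumptions.

```shell
#!/bin/sh
# Sketch of the smoke checks for one saved response.

# check_response HEADERS_FILE BODY_FILE
# Succeeds when the status line is 200, an X-Request-Id header is present,
# and the body mentions request_id.
check_response() {
  headers="$1"
  body="$2"
  head -n 1 "$headers" | grep -Eq '^HTTP/[0-9.]+ 200' ||
    { echo "status is not 200" >&2; return 1; }
  grep -iq '^x-request-id:' "$headers" ||
    { echo "missing X-Request-Id header" >&2; return 1; }
  grep -q 'request_id' "$body" ||
    { echo "body lacks request_id" >&2; return 1; }
}

# smoke BASE_URL -> fetch both health endpoints and check each one.
smoke() {
  base="$1"
  tmp=$(mktemp -d) || return 1
  status=0
  for endpoint in /api/health/live /api/health/ready; do
    if curl -sS --max-time 10 -D "$tmp/h" -o "$tmp/b" "$base$endpoint" &&
       check_response "$tmp/h" "$tmp/b"; then
      echo "ok   $endpoint"
    else
      echo "FAIL $endpoint"
      status=1
    fi
  done
  rm -rf "$tmp"
  return "$status"
}
```

Usage would be `smoke https://your-domain`, with a non-zero exit code when either endpoint fails a check.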

## 6) Backup + Restore
- Daily backup command:
  - `scripts/backup-postgres.sh`
- Periodic base backup (for faster full recovery):
  - `PRIMARY_DATABASE_URL=<replication-url> scripts/basebackup-postgres.sh`
- Retention:
  - default 7 days (`RETENTION_DAYS=7`)
- Restore drill:
  - `scripts/restore-drill-postgres.sh backups/postgres/<file>.dump <target_database_url>`
- Run restore drill on non-prod DB before public launch.
- Record drill outcome:
  - `scripts/log-restore-drill.sh <environment> <backup_file> <restore_target> <status> <rto_minutes> <notes>`
  - log file: `docs/restore-drill-log.csv`
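
To make the dump and retention steps concrete, here is a minimal sketch. It is not the contents of `scripts/backup-postgres.sh`: the `fiddy-<timestamp>.dump` file naming, the `BACKUP_DIR` variable, and the function names are assumptions.

```shell
#!/bin/sh
# Sketch of a daily dump plus retention pruning.
# Assumption: dumps use pg_dump's custom format and live in BACKUP_DIR.

BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"

# Write a timestamped custom-format dump of the database at DATABASE_URL.
backup_now() {
  mkdir -p "$BACKUP_DIR" || return 1
  pg_dump --format=custom \
    --file="$BACKUP_DIR/fiddy-$(date -u +%Y%m%dT%H%M%SZ).dump" \
    "$DATABASE_URL"
}

# Delete dumps past the retention window.
# find's -mtime +N matches files whose age in whole days exceeds N.
prune_old() {
  find "$BACKUP_DIR" -type f -name '*.dump' -mtime +"$RETENTION_DAYS" -delete
}
```

A custom-format dump is restored with `pg_restore`, which is what a restore drill would exercise against the non-prod target.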

## 7) Incident Response Quick Flow
1. Identify failing request and `request_id`.
2. Correlate application logs (Loki) by `request_id`.
3. Check `/api/health/ready` status and DB connectivity.
4. Roll back to previous known-good image tag via SSH Compose if needed.
5. Capture root cause and update this runbook/checklist.
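
Step 2 can also be done from a shell through Loki's HTTP API. The sketch below is illustrative: it assumes Loki is reachable at `LOKI_URL` (the `http://loki:3100` default only resolves inside the observability Docker network), that `request_id` values contain only letters, digits, `_`, `.` and `-`, and it defaults to the `job="nginx"` stream selector. Set `LOG_SELECTOR` to whatever labels your application logs carry.

```shell
#!/bin/sh
# Sketch: fetch recent log lines containing one request_id from Loki.

DEFAULT_SELECTOR='{job="nginx"}'

# logql_for_request REQUEST_ID -> LogQL line-filter query.
# Rejects ids with unexpected characters so they cannot break the query.
logql_for_request() {
  case "$1" in
    ''|*[!A-Za-z0-9_.-]*) echo "invalid request_id: $1" >&2; return 2 ;;
  esac
  printf '%s |= "%s"\n' "${LOG_SELECTOR:-$DEFAULT_SELECTOR}" "$1"
}

# query_loki REQUEST_ID -> raw JSON from Loki's query_range endpoint.
# With no start/end given, Loki's default time window applies.
query_loki() {
  query=$(logql_for_request "$1") || return 2
  curl -sS -G "${LOKI_URL:-http://loki:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "limit=100"
}
```

The same LogQL string printed by `logql_for_request` can be pasted into Grafana's Explore view.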

## 8) Rollback Checklist
1. Select previous healthy image tag for both `web` and `scheduler`.
2. Trigger rollback deploy and wait for completion.
3. Run `scripts/smoke-public-launch.sh https://your-domain`.
4. Verify error-rate drop in Grafana/Loki and confirm no DB migration mismatch.
5. Log the rolled-back version, timestamp, and reason.
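
Step 5 is easy to skip under pressure, so a one-line helper can help. The sketch below appends a CSV row; the `rollback-log.csv` path, the column order, and the function names are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Sketch: append one rollback record as a CSV row.
# Assumption: ROLLBACK_LOG and its columns are illustrative only.

ROLLBACK_LOG="${ROLLBACK_LOG:-rollback-log.csv}"

# csv_field VALUE -> VALUE wrapped in quotes, embedded quotes doubled.
csv_field() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# log_rollback FROM_TAG TO_TAG REASON
log_rollback() {
  [ "$#" -eq 3 ] ||
    { echo "usage: log_rollback <from_tag> <to_tag> <reason>" >&2; return 2; }
  # Write the header once, when the file is missing or empty.
  [ -s "$ROLLBACK_LOG" ] ||
    echo 'timestamp_utc,from_tag,to_tag,reason' > "$ROLLBACK_LOG"
  printf '%s,%s,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(csv_field "$1")" "$(csv_field "$2")" "$(csv_field "$3")" >> "$ROLLBACK_LOG"
}
```

For example, `log_rollback <bad-sha> <good-sha> "readiness failing after deploy"` records which tag was replaced, which tag was restored, and why.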