Staging Promotion + Rollback Runbook

Ticket: NQU-865 Phase 3 (W0-f) · defines the Q5 layered-rollback procedure and the P-11 promotion ritual for the new two-tier (staging → prod) topology.

This is the JE-Vectors internal promotion/rollback runbook. It is distinct from:

docs/licensee/ops/rollback-runbook.md — the customer-facing rollback runbook for licensed greenfield installs (NQU-645).
docs/admin/ops/release-validation-runbook.md — the install→upgrade→rollback cycle for cutting a release.

It builds on the 2026-04-03 ADR (docs/decisions/2026-04-03-staging-to-production-transition.md), Decisions 3 (mandatory down-migration + pre-migration snapshot) and 5 (ECS task-def revision revert). Where the ADR used aspirational invapp-prod-* / app.nquir.ai names, this doc uses the real current names.

Naming reality (read this first)

The R-8 rename is deferred (NQU-865, ratified 2026-05-21). Until it lands, the names below are the source of truth:

| Tier | Account | Cluster / service / task def | Public host | Notes |
| --- | --- | --- | --- | --- |
| Prod | `760007728097` | `invapp-dev-cluster` / `invapp-dev-app-service` / `invapp-dev-app` | `app.nquiry.ai` | Named `invapp-dev-` but serves production. ALB: `invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com`. |
| Staging | `961381384763` | `invapp-staging-cluster` / `invapp-staging-app-service` / `invapp-staging-app` | `staging.nquiry.ai` | Behind CloudFront Basic Auth (R-7). Stood up in NQU-865 Wave 2. |
| Dev | localhost | docker-compose (`npm run local:up`) | — | NQU-867. Never points at prod. |

If a command in this doc names invapp-dev-*, that is prod. This is the exact confusion NQU-868's "what runs where" doc exists to prevent — see it for the full topology.

Current state (NQU-881 PR 2 cutover landed 2026-05-23)

Merge to main → staging only (the deploy-staging job in ci.yml). The legacy deploy (prod) job is gated with if: false and never runs.
Prod is updated only via .github/workflows/promote-to-prod.yml (workflow_dispatch). Inputs: sha (the validated SHA from staging soak) + reason (cycle id / freeform).
The promote workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it's already in prod ECR (otherwise builds + pushes), registers the new task def, swaps the service, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure it auto-rolls-back to the previous task-def revision in the same run.
Rollback is rollback-prod.yml (workflow_dispatch). Inputs: optional to-revision (default = immediately previous) + reason. It shares the prod-promote concurrency group with the promote workflow, so the two cannot run concurrently.
Validated end-to-end before cutover by dry-run #3 on 2026-05-23 (run 26342981773): snapshot → image-reuse → task-def register → wait-for-stable → deep-health green → 🟢 NQU-828 post. Two earlier dry-runs surfaced + closed cascade-fix follow-ups (PR 1.5 ECR immutable-tag guard, PR 1.6 RDS tag-value sanitizer); the third was clean.

What this means in practice

A bad merge can no longer reach prod customers without an explicit workflow_dispatch action — the gate is real.
Prod silently drifts behind staging until someone runs the promote workflow. That is the design intent (NQU-866 P-11: promotion is a per-cycle decision, not a per-PR moment).
The environment: production declaration on both workflows activates required-reviewer rules automatically if the repo upgrades from the current free private plan; on the current plan, workflow_dispatch alone is the gate since only repo admins can dispatch.

Part 1 — Promotion ritual (P-11)

Principle (NQU-866 P-11): promotion is a per-cycle pass/fail signal backed by smoke evidence, not a per-PR approval moment. Joe's "go" is sign-off on the cycle's smoke evidence, not a line-by-line review.

Invariant: every promotion takes a pre-promotion snapshot

Before any promote-to-prod, take a manual prod RDS snapshot. This is the nuclear restore for Part 2 and is non-negotiable — it is what makes the data-rollback RTO achievable.

# Prod RDS snapshot (run with prod creds). Name encodes the cycle/ticket.
aws rds create-db-snapshot \
  --db-instance-identifier invapp-dev-postgres \
  --db-snapshot-identifier invapp-dev-rds-$(date +%Y%m%d%H%M%S)-pre-promote-<cycle> \
  --region us-east-1
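The gate checklist requires this snapshot to reach available before promotion proceeds. The following is a minimal sketch of one way to build the identifier safely and block until the snapshot is ready, assuming AWS CLI v2 with prod creds. The function names `snapshot_id` and `take_pre_promote_snapshot` are hypothetical. The sanitizing step is an assumption based on the RDS identifier rules (letters, digits, and hyphens only, no doubled or trailing hyphen).

```shell
# Hypothetical helpers. Only the aws subcommands and resource names come from the runbook.

# Build a snapshot identifier from a timestamp and a free-form cycle label.
# Characters outside a-z0-9 become hyphens; repeated, leading, and trailing hyphens are removed.
snapshot_id() {
  local ts=$1 cycle
  cycle=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9]/-/g; s/--*/-/g; s/^-//; s/-$//')
  [ -n "$cycle" ] || { echo "cycle label is empty after sanitizing" >&2; return 1; }
  printf 'invapp-dev-rds-%s-pre-promote-%s\n' "$ts" "$cycle"
}

# Create the snapshot, wait for it to become available, then print its identifier.
take_pre_promote_snapshot() {
  local id
  id=$(snapshot_id "$(date +%Y%m%d%H%M%S)" "$1") || return 1
  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$id" \
    --region us-east-1 >/dev/null || return 1
  # The waiter polls until the snapshot status is "available" or it gives up.
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$id" --region us-east-1 || return 1
  printf '%s\n' "$id"
}
```

Usage would look like `take_pre_promote_snapshot "NQU-881 cycle 12"`, which prints the identifier to record alongside the promotion decision.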

Promotion gate checklist (per cycle)

Promotion proceeds only when all of these are green; record the result as a single pass/fail with links (NQU-866 P-12 — summary first, drill-down on request):

Staging deploy is healthy: curl -u <basic-auth> "https://staging.nquiry.ai/api/health?deep=true" → {"status":"ok"}.
NQU-855 smoke suite green against staging (Cluster A real-provider + Cluster B pipeline invariants).
Any migration in this cycle applied cleanly to staging RDS and its .down.sql was exercised (see migrations.md).
Pending DB migrations are queued to apply to prod RDS during promote. promote-to-prod.yml does NOT run migrations (it only snapshots + swaps the image), so they are applied manually — see Promote step 4 below. Confirm the .down.sql is ready. Pending as of 2026-05-29: NQU-783 20260529003733_nqu783_remove_ai_provider_setting.sql — drops the now-unread ai_provider row; backward-compatible. Clear this note once applied.
Stripe test-mode webhook on staging handled a test event idempotently (if the cycle touched billing).
Pre-promotion prod RDS snapshot taken (above) and reached available.
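The first checklist item can be scripted so the pass/fail signal does not depend on reading curl output by eye. This is a sketch only: `health_is_ok`, `check_staging_health`, and the `BASIC_AUTH` variable are hypothetical names, and the check is a plain text match on the `{"status":"ok"}` shape quoted above rather than a full JSON parse.

```shell
# Succeeds only if the JSON on stdin contains "status":"ok" (tolerates whitespace around the colon).
health_is_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Hypothetical wrapper. BASIC_AUTH is assumed to hold "user:password" for the
# CloudFront Basic Auth in front of staging.
check_staging_health() {
  # -f makes curl fail on HTTP errors, so an error page yields empty output and a failed check.
  curl -fsS -u "$BASIC_AUTH" 'https://staging.nquiry.ai/api/health?deep=true' | health_is_ok
}
```

For example, `check_staging_health && echo "staging: pass" || echo "staging: FAIL"` gives a one-line result for the cycle record.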

Promote

Promotion pins prod to the exact image SHA that passed staging soak. With NQU-881 PR 1 landed, the promote-to-prod workflow_dispatch is the standard path; the manual CLI fallback below stays for the offline-runbook case.

Standard path — promote-to-prod.yml (NQU-881):

1. Identify the validated SHA (the one staging just smoked green): GitHub → Actions → most recent green smoke-suite run → the commit it ran against. Alternatively, check git log origin/main for the last [cycle-close] commit subject.
2. GitHub → Actions → Promote to Prod → Run workflow. Inputs:
   - sha: full commit SHA from step 1.
   - reason: cycle id + ticket reference for the audit trail.
3. The workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it is already in prod ECR (otherwise builds + pushes it), registers the new task def, swaps the service, waits-for-stable, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure the workflow auto-rolls-back to the previous task-def revision in the same run.
4. Apply pending DB migrations to prod RDS. The workflow does NOT run migrations. Open the SSM tunnel to prod RDS (invapp-dev-postgres) and run npm run db:migrate:run <file> per migrations.md, then confirm the row landed in _migrations. Sequencing: a backward-compatible migration (e.g. NQU-783's ai_provider row drop, which nothing reads) can be applied right after the deploy. For a non-backward-compatible change, take a manual pre-migration prod snapshot and apply the migration before dispatching the workflow. The workflow's own snapshot is taken at the start of its run, so it would already contain a migration applied before dispatch and cannot serve as the pre-migration restore point (ADR Decision 3).
The environment: production declaration activates required-reviewer rules automatically if the repo upgrades to a plan that supports them; on the current free private plan, workflow_dispatch alone is the gate (only repo admins can dispatch).
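The same dispatch can be made from a terminal with the GitHub CLI. This is a sketch, assuming `gh` is installed and authenticated as a repo admin. The input names `sha` and `reason` come from this runbook, while `is_full_sha` and `promote` are hypothetical helpers, and treating `reason` as mandatory is an assumption made here for the audit trail.

```shell
# True only for a full 40-character lowercase hex commit SHA. Short SHAs are rejected.
is_full_sha() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Hypothetical wrapper around the workflow_dispatch for promote-to-prod.yml.
promote() {
  local sha=$1 reason=$2
  is_full_sha "$sha" || { echo "refusing: '$sha' is not a full commit SHA" >&2; return 1; }
  [ -n "$reason" ] || { echo "refusing: reason is required for the audit trail" >&2; return 1; }
  # -f passes each value as a string input to the workflow.
  gh workflow run promote-to-prod.yml -f sha="$sha" -f reason="$reason"
}
```

Validating the SHA locally catches a pasted short SHA or branch name before the workflow starts.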

Manual CLI fallback (offline / GitHub down):

# 1. Identify the validated image SHA (the one staging is running).
aws ecs describe-task-definition --task-definition invapp-staging-app \
  --query 'taskDefinition.containerDefinitions[0].image' --output text   # staging creds

# 2. Apply any cycle migrations to PROD (snapshot already taken above), per migrations.md.

# 3. Point prod's task def at that image SHA and force a new deployment.
aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
  --task-definition invapp-dev-app:<revision-with-validated-sha> --force-new-deployment   # prod creds
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# 4. Prod smoke (hit the ALB directly; CloudFront blocks CI/curl UAs).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
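Step 3 above needs the prod revision whose image carries the validated SHA. One possible way to look it up is sketched below, assuming images are tagged with the commit SHA (as the "SHA-tagged image" wording suggests) and that the app image is the first container in the task def. `image_tag` and `find_revision_for_sha` are hypothetical names. If no revision matches, a new task def would have to be registered first, which this sketch does not do.

```shell
# Print the tag portion of an image URI (the text after the last ':').
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Hypothetical helper: print the newest invapp-dev-app revision (family:revision)
# whose first container image is tagged with the given SHA. Run with prod creds.
find_revision_for_sha() {
  local sha=$1 arn image
  # The JMESPath slice keeps the newest 20 ARNs per page; text output is whitespace-separated.
  for arn in $(aws ecs list-task-definitions --family-prefix invapp-dev-app \
                 --sort DESC --query 'taskDefinitionArns[:20]' --output text); do
    image=$(aws ecs describe-task-definition --task-definition "$arn" \
              --query 'taskDefinition.containerDefinitions[0].image' --output text) || return 1
    if [ "$(image_tag "$image")" = "$sha" ]; then
      printf '%s\n' "${arn##*/}"   # strip the ARN prefix, leaving invapp-dev-app:<N>
      return 0
    fi
  done
  echo "no recent revision is tagged with $sha" >&2
  return 1
}
```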

Record the promotion decision (cycle id, validated SHA, snapshot id, who signed off, pass/fail) wherever the cycle is tracked.

Part 2 — Layered rollback (Q5)

Two independent layers. Roll back the application first (fast, almost always sufficient); only touch data if a migration corrupted state.

RTO targets

| Layer | Mechanism | Target | When |
| --- | --- | --- | --- |
| Application | ECS task-def revision revert | < 15 min (≈3 min typical) | Bad code / config shipped; data intact |
| Data | Down-migration, else pre-promotion snapshot restore | < 1 hr | A migration corrupted or lost data |

Layer 1 — Application rollback (< 15 min)

Fast path, no rebuild — re-point the service at the previous known-good task-def revision (ADR Decision 5, Option A). With NQU-881 PR 1 landed, the rollback-prod workflow_dispatch is the standard path; the manual CLI fallback stays for the offline-runbook case.

Standard path — rollback-prod.yml (NQU-881):

GitHub → Actions → Rollback Prod (App) → Run workflow. Inputs:
- to-revision (optional): invapp-dev-app:<N> to target a specific revision. Leave blank to roll back to the immediately previous revision (the default).
- reason: incident id + freeform.
The workflow asserts AWS, swaps the service back, waits-for-stable, runs the deep-health check, and posts the result to NQU-828. It shares the prod-promote concurrency group with promote-to-prod.yml, so the two cannot run simultaneously.
Rollback auto-fires once per failed promote via the inline auto-rollback step at the end of promote-to-prod.yml. Use the manual rollback-prod workflow when a promote succeeded but the deploy turned out bad after the fact.
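The rollback dispatch can also be issued from a terminal. The sketch below assumes the GitHub CLI is installed and authenticated. The input names `to-revision` and `reason` come from this runbook, and `is_prod_revision` and `rollback_prod` are hypothetical helpers. It mainly guards against typing a staging revision, or a bare number, into an incident-time command.

```shell
# True for a revision reference of the form invapp-dev-app:<N>, with N a positive integer.
is_prod_revision() {
  printf '%s\n' "$1" | grep -Eq '^invapp-dev-app:[1-9][0-9]*$'
}

# Hypothetical wrapper: rollback_prod <reason> [to-revision]
rollback_prod() {
  local reason=$1 to_revision=${2:-}
  if [ -n "$to_revision" ]; then
    is_prod_revision "$to_revision" \
      || { echo "refusing: bad to-revision '$to_revision'" >&2; return 1; }
    gh workflow run rollback-prod.yml -f reason="$reason" -f to-revision="$to_revision"
  else
    # No to-revision given: the workflow defaults to the immediately previous revision.
    gh workflow run rollback-prod.yml -f reason="$reason"
  fi
}
```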

Manual CLI fallback (offline / GitHub down):

# List recent revisions, pick the last known-good (the revision before the bad deploy).
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5   # prod creds

aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
  --task-definition invapp-dev-app:<previous-good-revision> --force-new-deployment
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# Smoke (ALB direct).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
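Picking `<previous-good-revision>` by eye is error-prone under pressure. The sketch below resolves the revision registered immediately before the one the service currently runs. `previous_revision` and `resolve_previous_prod_revision` are hypothetical names. Note the assumption: "registered just before" is not the same as "known good", so confirm the result against the promotion record before using it.

```shell
# Given task-def ARNs newest-first on stdin (one per line), print the ARN listed
# right after the current one. Exits non-zero if there is no older revision.
previous_revision() {
  awk -v current="$1" '
    found { print; printed = 1; exit }
    $0 == current { found = 1 }
    END { exit !printed }
  '
}

# Hypothetical helper: look up what prod runs now, then find the revision before it. Prod creds.
resolve_previous_prod_revision() {
  local current
  current=$(aws ecs describe-services --cluster invapp-dev-cluster \
              --services invapp-dev-app-service \
              --query 'services[0].taskDefinition' --output text) || return 1
  aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC \
      --query 'taskDefinitionArns[]' --output text \
    | tr '\t' '\n' | previous_revision "$current"
}
```

The printed ARN can be passed directly to `--task-definition` in the update-service command above.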

For a clean auditable revert instead of a hot fix, git revert <bad-sha> on main and let CI redeploy (~8–10 min, ADR Decision 5 Option B). Use Layer 1 task-def revert when "the site is down, fix it now"; use git revert when "this feature is buggy, roll it back cleanly."

Staging: the identical procedure works against invapp-staging-cluster / invapp-staging-app-service / invapp-staging-app. Catching a regression in staging soak (where Layer-1 revert is cheap and user-invisible) is the point of the promotion gate.

Layer 2 — Data rollback (< 1 hr)

Only when a migration corrupted/lost data. Order: try the down-migration first; restore from snapshot only if down fails or data is already corrupt.

Down-migration (preferred, surgical). Each forward migration has a paired .down.sql (migrations.md). Apply it manually, then clear its row from _migrations so it can be re-applied later. The runner does not auto-run downs.
Snapshot restore (nuclear). If the down fails or data is corrupt, restore the pre-promotion snapshot taken in Part 1. Restoring RDS creates a new instance — repoint the app (Secrets Manager DB host / Terraform) at the restored endpoint; it is not an in-place rollback, which is why the target is <1 hr, not minutes.

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier invapp-dev-postgres-restore-$(date +%Y%m%d%H%M%S) \
  --db-snapshot-identifier <pre-promote-snapshot-id> \
  --region us-east-1
# Then repoint the app at the restored endpoint and redeploy.
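Because the restore creates a new instance, the repoint step needs the new endpoint hostname, and that only exists once the instance is available. The following is a minimal sketch, assuming AWS CLI v2 with prod creds. `restored_endpoint` is a hypothetical helper. Where the hostname is then written (Secrets Manager or Terraform) is left to the existing process.

```shell
# Hypothetical helper: wait for a restored instance, then print its endpoint hostname.
restored_endpoint() {
  local id=$1
  # Blocks until the instance status is "available" or the waiter gives up.
  aws rds wait db-instance-available \
    --db-instance-identifier "$id" --region us-east-1 || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$id" --region us-east-1 \
    --query 'DBInstances[0].Endpoint.Address' --output text
}
```

Call it with the identifier passed to the restore command, for example `restored_endpoint invapp-dev-postgres-restore-<timestamp>`.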

If the bad deploy included a migration, sequence it: roll back the application (Layer 1) to a revision compatible with the current schema before deciding whether the schema also needs to roll back (ADR Decision 3).

Dry-run (NQU-865 Wave 3 acceptance, step 8)

Before relying on this in anger, exercise it once against staging and record the measured times:

Deploy a deliberately-broken image to staging → confirm the staging smoke gate fails (catches it pre-promotion).
Layer-1 revert staging to the previous revision → measure wall-clock (target < 15 min).
Apply a reversible test migration to staging, run its .down.sql, confirm schema parity (pg_dump --schema-only diff).
Take a staging snapshot and do one restore-to-new-instance to validate the data path end-to-end.
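The schema-parity check in step 3 can be made mechanical. A sketch under stated assumptions: `normalize_dump` and `schema_parity` are hypothetical helpers, and the normalization drops SQL comment lines and the `\restrict` / `\unrestrict` lines that some pg_dump versions emit with a per-run token, since those can differ between otherwise identical dumps.

```shell
# Strip lines that can differ between two dumps of the same schema.
normalize_dump() {
  grep -Ev '^(--|\\(un)?restrict )' "$1"
}

# Exit 0 when two schema-only dumps match after normalization.
# Otherwise print a unified diff and exit non-zero.
schema_parity() {
  local a b rc
  a=$(mktemp) && b=$(mktemp) || return 2
  normalize_dump "$1" >"$a"
  normalize_dump "$2" >"$b"
  diff -u "$a" "$b"
  rc=$?
  rm -f "$a" "$b"
  return "$rc"
}
```

A possible flow is to run `pg_dump --schema-only --no-owner` against staging into `before.sql`, apply the forward migration and then its .down.sql, dump again into `after.sql`, and run `schema_parity before.sql after.sql`.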

Acceptance = both RTOs met in the dry-run.
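To record the measured times consistently, each dry-run step can be wrapped in a small timer that compares wall-clock seconds against the target (900 s for Layer 1, 3600 s for Layer 2). This is a sketch only, and `within_rto` and `timed_against_rto` are hypothetical names.

```shell
# Succeeds when an elapsed time in seconds is strictly under the target in seconds.
within_rto() {
  [ "$1" -lt "$2" ]
}

# Run a command, report its wall-clock duration, and compare it with a target.
# Usage: timed_against_rto <target-seconds> <command> [args...]
timed_against_rto() {
  local target=$1 start end elapsed
  shift
  start=$(date +%s)
  "$@" || return 1
  end=$(date +%s)
  elapsed=$((end - start))
  echo "elapsed: ${elapsed}s (target < ${target}s)"
  within_rto "$elapsed" "$target"
}
```

For a meaningful number, wrap the whole procedure (revert plus wait plus smoke) in one script and time that, for example `timed_against_rto 900 ./layer1-revert-staging.sh`, where the script name is a placeholder.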

Related

docs/decisions/2026-04-03-staging-to-production-transition.md — Decisions 3 (down-migration + snapshot) and 5 (ECS revert); this doc operationalizes them for two tiers.
docs/reference/process/migrations.md — down-migration mechanics + pre-migration snapshot naming.
docs/reference/process/deployment-flow.md — the CI/CD deploy pipeline this promotes/reverts within.
docs/reference/process/environment-strategy.md — Gate 1/2 environment plan.
docs/admin/ops/what-runs-where.md — canonical topology + identity map (NQU-868).
NQU-865 — staging stand-up; NQU-866 — RCA (P-11 promotion gate, P-12 summary-first reporting).
