Staging Promotion + Rollback Runbook
Ticket: NQU-865 Phase 3 (W0-f) · defines the Q5 layered-rollback procedure and the P-11 promotion ritual for the new two-tier (staging → prod) topology.
This is the JE-Vectors internal promotion/rollback runbook. It is distinct from:
- docs/licensee/ops/rollback-runbook.md: the customer-facing rollback runbook for licensed greenfield installs (NQU-645).
- docs/admin/ops/release-validation-runbook.md: the install→upgrade→rollback cycle for cutting a release.
It builds on the 2026-04-03 ADR (docs/decisions/2026-04-03-staging-to-production-transition.md), Decisions 3 (mandatory down-migration + pre-migration snapshot) and 5 (ECS task-def revision revert). Where the ADR used aspirational invapp-prod-* / app.nquir.ai names, this doc uses the real current names.
Naming reality (read this first)
The R-8 rename is deferred (NQU-865, ratified 2026-05-21). Until it lands, the names below are the source of truth:
| Tier | Account | Cluster / service / task def | Public host | Notes |
|---|---|---|---|---|
| Prod | 760007728097 | invapp-dev-cluster / invapp-dev-app-service / invapp-dev-app | app.nquiry.ai | Named invapp-dev-* but serves production. ALB: invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com. |
| Staging | 961381384763 | invapp-staging-cluster / invapp-staging-app-service / invapp-staging-app | staging.nquiry.ai | Behind CloudFront Basic Auth (R-7). Stood up in NQU-865 Wave 2. |
| Dev | localhost | docker-compose (npm run local:up) | — | NQU-867. Never points at prod. |
If a command in this doc names invapp-dev-*, that is prod. This is the exact confusion NQU-868's "what runs where" doc exists to prevent; see it for the full topology.
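Because the prod tier carries dev-style names, a pre-flight check on the active credentials can catch a wrong-account command before it runs. The following is a minimal sketch, not part of the existing tooling: the function names are hypothetical, the account IDs are taken from the table above, and the only AWS call used is `aws sts get-caller-identity`.

```shell
#!/bin/sh
# Sketch: map the account behind the active AWS credentials to a tier.
# Account IDs come from the naming table; function names are illustrative.

tier_for_account() {
  case "$1" in
    760007728097) echo "prod" ;;      # invapp-dev-* names, serves production
    961381384763) echo "staging" ;;   # invapp-staging-*
    *)            echo "unknown" ;;
  esac
}

# Fail unless the active credentials belong to the expected tier.
require_tier() {
  expected="$1"
  account="$(aws sts get-caller-identity --query Account --output text)" || return 1
  actual="$(tier_for_account "$account")"
  if [ "$actual" != "$expected" ]; then
    echo "refusing to continue: credentials are for '$actual' ($account), expected '$expected'" >&2
    return 1
  fi
}
```

One way to use it is to chain it in front of any mutating command, for example `require_tier prod && aws ecs update-service ...`, so a staging shell cannot touch the invapp-dev-* resources by accident.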
Current state (NQU-881 PR 2 cutover landed 2026-05-23)
- Merge to main → staging only (the deploy-staging job in ci.yml). The legacy deploy (prod) job is gated if: false and never runs.
- Prod is updated only via .github/workflows/promote-to-prod.yml (workflow_dispatch). Inputs: sha (the validated SHA from staging soak) + reason (cycle id / freeform).
- The promote workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it's already in prod ECR (otherwise builds + pushes), registers the new task def, swaps the service, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure it auto-rolls-back to the previous task-def revision in the same run.
- Rollback is rollback-prod.yml (workflow_dispatch). Inputs: optional to-revision (default = immediately previous) + reason. Same prod-promote concurrency group as promote so they cannot run concurrently.
- Validated end-to-end before cutover by dry-run #3 on 2026-05-23 (run 26342981773): snapshot → image-reuse → task-def register → wait-for-stable → deep-health green → 🟢 NQU-828 post. Two earlier dry-runs surfaced + closed cascade-fix follow-ups (PR 1.5 ECR immutable-tag guard, PR 1.6 RDS tag-value sanitizer); the third was clean.
What this means in practice
- A bad merge can no longer reach prod customers without an explicit workflow_dispatch action; the gate is real.
- Prod silently drifts behind staging until someone runs the promote workflow. That is the design intent (NQU-866 P-11: promotion is a per-cycle decision, not a per-PR moment).
- The environment: production declaration on both workflows activates required-reviewer rules automatically if the repo upgrades from the current free private plan; on the current plan, workflow_dispatch alone is the gate since only repo admins can dispatch.
Part 1 — Promotion ritual (P-11)
Principle (NQU-866 P-11): promotion is a per-cycle pass/fail signal backed by smoke evidence, not a per-PR approval moment. Joe's "go" is sign-off on the cycle's smoke evidence, not a line-by-line review.
Invariant: every promotion takes a pre-promotion snapshot
Before any promote-to-prod, take a manual prod RDS snapshot. This is the nuclear restore for Part 2 and is non-negotiable — it is what makes the data-rollback RTO achievable.
# Prod RDS snapshot (run with prod creds). Name encodes the cycle/ticket.
aws rds create-db-snapshot \
--db-instance-identifier invapp-dev-postgres \
--db-snapshot-identifier invapp-dev-rds-$(date +%Y%m%d%H%M%S)-pre-promote-<cycle> \
--region us-east-1
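The snapshot only counts once it reaches the available state, and the identifier has to satisfy RDS naming rules (letters, digits, and hyphens; no doubled or trailing hyphen). A rough sketch of both concerns follows; the helper names are hypothetical, while the instance id and name pattern are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: build a valid snapshot identifier, create the snapshot, block until "available".
# Helper names are illustrative; instance id and name pattern come from the runbook command.

# Normalise a free-form cycle id: runs of other characters become one hyphen,
# and leading/trailing hyphens are dropped.
sanitize_cycle() {
  printf '%s' "$1" | sed -e 's/[^A-Za-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# snapshot_id <cycle> [timestamp]
snapshot_id() {
  cycle="$1"
  ts="${2:-$(date +%Y%m%d%H%M%S)}"
  printf 'invapp-dev-rds-%s-pre-promote-%s\n' "$ts" "$(sanitize_cycle "$cycle")"
}

take_pre_promote_snapshot() {
  [ -n "$1" ] || { echo "cycle id required" >&2; return 1; }
  id="$(snapshot_id "$1")"
  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$id" \
    --region us-east-1 >/dev/null || return 1
  # Polls until the snapshot status is "available" (or the waiter times out).
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$id" \
    --region us-east-1 || return 1
  echo "$id"
}
```

Capturing the printed identifier (for example `SNAP="$(take_pre_promote_snapshot "$CYCLE")"`) keeps the exact snapshot id on hand for the promotion record and for a Layer 2 restore.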
Promotion gate checklist (per cycle)
Promotion proceeds only when all of these are green; record the result as a single pass/fail with links (NQU-866 P-12 — summary first, drill-down on request):
- Staging deploy is healthy: curl -u <basic-auth> "https://staging.nquiry.ai/api/health?deep=true" → {"status":"ok"}.
- NQU-855 smoke suite green against staging (Cluster A real-provider + Cluster B pipeline invariants).
- Any migration in this cycle applied cleanly to staging RDS and its .down.sql was exercised (see migrations.md).
- Pending DB migrations are queued to apply to prod RDS during promote. promote-to-prod.yml does NOT run migrations (it only snapshots + swaps the image), so they are applied manually; see Promote step 4 below. Confirm the .down.sql is ready. Pending as of 2026-05-29: NQU-783 20260529003733_nqu783_remove_ai_provider_setting.sql, which drops the now-unread ai_provider row; backward-compatible. Clear this note once applied.
- Stripe test-mode webhook on staging handled a test event idempotently (if the cycle touched billing).
- Pre-promotion prod RDS snapshot taken (above) and reached available.
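The health item in the checklist can be turned into a scriptable pass/fail. This is a minimal sketch under stated assumptions: the response body contains "status":"ok" as shown above, STAGING_BASIC_AUTH is a hypothetical environment variable holding user:pass, and the function names are illustrative.

```shell
#!/bin/sh
# Sketch: treat the deep-health response as a pass/fail gate.
# Assumes the body contains "status":"ok" as in the checklist; names are illustrative.

# Succeeds only if the JSON body reports status "ok".
health_is_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch staging deep health through the Basic Auth front door.
# -f makes curl fail on HTTP errors, so a 401 or 5xx is never read as healthy.
check_staging_health() {
  body="$(curl -fsS -u "$STAGING_BASIC_AUTH" \
    "https://staging.nquiry.ai/api/health?deep=true")" || return 1
  health_is_ok "$body"
}
```

The pattern match is deliberately simple; if the deep-health payload nests per-dependency status fields, a JSON-aware parser would be a stricter choice.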
Promote
Promotion pins prod to the exact image SHA that passed staging soak. With NQU-881 PR 1 landed, the promote-to-prod workflow_dispatch is the standard path; the manual CLI fallback below stays for the offline-runbook case.
Standard path — promote-to-prod.yml (NQU-881):
1. Identify the validated SHA (the one staging just smoked green): GitHub → Actions → most recent green smoke-suite run → the commit it ran against. Or git log origin/main for the last [cycle-close] commit subject.
2. GitHub → Actions → Promote to Prod → Run workflow. Inputs:
   - sha: full commit SHA from step 1.
   - reason: cycle id + ticket reference for the audit trail.
3. The workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it is already in prod ECR (otherwise builds + pushes it), registers the new task def, swaps the service, waits-for-stable, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure the workflow auto-rolls-back to the previous task-def revision in the same run.
4. Apply pending DB migrations to prod RDS; the workflow does NOT run migrations. Open the SSM tunnel to prod RDS (invapp-dev-postgres) and run npm run db:migrate:run <file> per migrations.md, then confirm the row landed in _migrations. Sequencing: a backward-compatible migration (e.g. NQU-783's ai_provider row drop, which nothing reads) can be applied right after the deploy; for a non-backward-compatible change, take a manual pre-migration prod snapshot and apply it before dispatching the workflow (the workflow's own snapshot is taken at the start of its run, so applying after dispatch leaves no pre-migration restore point; see ADR Decision 3).
5. The environment: production declaration activates required-reviewer rules automatically if the repo upgrades to a plan that supports them; on the current free private plan, workflow_dispatch alone is the gate (only repo admins can dispatch).
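The same dispatch can be issued from a terminal. The sketch below assumes the GitHub CLI (gh) is installed and authenticated as a repo admin, which the runbook itself does not state; the function names are hypothetical, and the input names (sha, reason) are the ones listed above.

```shell
#!/bin/sh
# Sketch: dispatch promote-to-prod.yml from a terminal instead of the Actions UI.
# Assumes the GitHub CLI (gh) is installed and authenticated; names are illustrative.

# A full git commit SHA is printed as 40 lowercase hex characters.
is_full_sha() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# promote <sha> <reason>
promote() {
  sha="$1"
  reason="$2"
  is_full_sha "$sha" || { echo "need a full 40-char SHA, got '$sha'" >&2; return 1; }
  [ -n "$reason" ] || { echo "reason is required for the audit trail" >&2; return 1; }
  gh workflow run promote-to-prod.yml -f sha="$sha" -f reason="$reason"
}
```

Rejecting abbreviated SHAs locally avoids dispatching a run that would pin prod to something other than the exact commit that passed staging soak.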
Manual CLI fallback (offline / GitHub down):
# 1. Identify the validated image SHA (the one staging is running).
aws ecs describe-task-definition --task-definition invapp-staging-app \
--query 'taskDefinition.containerDefinitions[0].image' --output text # staging creds
# 2. Apply any cycle migrations to PROD (snapshot already taken above), per migrations.md.
# 3. Point prod's task def at that image SHA and force a new deployment.
aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
--task-definition invapp-dev-app:<revision-with-validated-sha> --force-new-deployment # prod creds
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service
# 4. Prod smoke (hit the ALB directly; CloudFront blocks CI/curl UAs).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
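Step 3 of the fallback presumes a prod task-def revision that already points at the validated SHA. If none exists yet, one way to create it is to copy the current prod task definition and swap only the image tag. This is a sketch under several assumptions: jq is installed, containerDefinitions[0] is the app container (as in step 1), the image is referenced by tag rather than digest, and an image with that tag already exists in prod ECR. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: register a prod task-def revision that pins the validated image tag.
# Assumes jq is installed and that containerDefinitions[0] is the app container.

# Swap the tag on an image URI: repo/name:old -> repo/name:new
retag_image() {
  printf '%s:%s\n' "${1%:*}" "$2"
}

# register_revision_for_sha <sha>: prints the new revision number.
register_revision_for_sha() {
  sha="$1"
  current="$(aws ecs describe-task-definition --task-definition invapp-dev-app \
    --query 'taskDefinition' --output json)" || return 1
  image="$(printf '%s' "$current" | jq -r '.containerDefinitions[0].image')"
  new_image="$(retag_image "$image" "$sha")"
  # Drop the read-only fields that register-task-definition does not accept.
  spec="$(printf '%s' "$current" | jq --arg img "$new_image" '
    .containerDefinitions[0].image = $img
    | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
          .compatibilities, .registeredAt, .registeredBy)')" || return 1
  aws ecs register-task-definition --cli-input-json "$spec" \
    --query 'taskDefinition.revision' --output text
}
```

The printed revision number is what goes into `--task-definition invapp-dev-app:<revision-with-validated-sha>` in step 3.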
Record the promotion decision (cycle id, validated SHA, snapshot id, who signed off, pass/fail) wherever the cycle is tracked.
Part 2 — Layered rollback (Q5)
Two independent layers. Roll back the application first (fast, almost always sufficient); only touch data if a migration corrupted state.
RTO targets
| Layer | Mechanism | Target | When |
|---|---|---|---|
| Application | ECS task-def revision revert | < 15 min (≈3 min typical) | Bad code / config shipped; data intact |
| Data | Down-migration, else pre-promotion snapshot restore | < 1 hr | A migration corrupted or lost data |
Layer 1 — Application rollback (< 15 min)
Fast path, no rebuild — re-point the service at the previous known-good task-def revision (ADR Decision 5, Option A). With NQU-881 PR 1 landed, the rollback-prod workflow_dispatch is the standard path; the manual CLI fallback stays for the offline-runbook case.
Standard path — rollback-prod.yml (NQU-881):
- GitHub → Actions → Rollback Prod (App) → Run workflow. Inputs:
  - to-revision (optional): invapp-dev-app:<N> to target a specific revision. Leave blank to roll back to the immediately previous revision (the default).
  - reason: incident id + freeform.
- Workflow asserts AWS, swaps the service back, waits-for-stable, runs the deep-health check, and posts the result to NQU-828. Same prod-promote concurrency group as promote-to-prod.yml so they cannot run simultaneously.
- Auto-fires once per failed promote (the inline auto-rollback step at the end of promote-to-prod.yml); use the manual rollback-prod workflow when the bad deploy was a successful promote that turned out bad after the fact.
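During an incident, dispatching from a terminal can be quicker than the UI. The sketch below again assumes the GitHub CLI (gh), which is not part of the documented path; the function names are hypothetical and the input names (to-revision, reason) are the ones listed above.

```shell
#!/bin/sh
# Sketch: dispatch rollback-prod.yml via the GitHub CLI (gh is an assumption here).

# to-revision must look like invapp-dev-app:<N>; empty means "previous revision".
valid_to_revision() {
  [ -z "$1" ] && return 0
  printf '%s' "$1" | grep -Eq '^invapp-dev-app:[1-9][0-9]*$'
}

# rollback <reason> [to-revision]
rollback() {
  reason="$1"
  to_revision="${2:-}"
  [ -n "$reason" ] || { echo "reason (incident id) is required" >&2; return 1; }
  valid_to_revision "$to_revision" || { echo "bad to-revision '$to_revision'" >&2; return 1; }
  if [ -n "$to_revision" ]; then
    gh workflow run rollback-prod.yml -f reason="$reason" -f to-revision="$to_revision"
  else
    # Omitting to-revision lets the workflow use its default (immediately previous).
    gh workflow run rollback-prod.yml -f reason="$reason"
  fi
}
```

Checking the revision format up front guards against pasting a staging family name (invapp-staging-app:<N>) into a prod rollback.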
Manual CLI fallback (offline / GitHub down):
# List recent revisions, pick the last known-good (the revision before the bad deploy).
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5 # prod creds
aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
--task-definition invapp-dev-app:<previous-good-revision> --force-new-deployment
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service
# Smoke (ALB direct).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
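Picking the target revision by eye from the list is error-prone under pressure. The following sketch derives the immediately previous revision from whatever the service is running now. It assumes revision N-1 is the known-good one, which matches the workflow's default but should be confirmed against the list output; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the immediately previous revision from the one the service runs now.
# Assumes revision N-1 is the known-good one; confirm against the list output first.

# arn:aws:ecs:...:task-definition/invapp-dev-app:57 -> invapp-dev-app:56
previous_revision() {
  family_rev="${1##*/}"          # invapp-dev-app:57
  family="${family_rev%:*}"      # invapp-dev-app
  rev="${family_rev##*:}"        # 57
  # Refuse non-numeric input and revision 1 (nothing earlier to revert to).
  [ "$rev" -gt 1 ] 2>/dev/null || return 1
  printf '%s:%s\n' "$family" "$((rev - 1))"
}

# Task-def ARN the prod service currently points at (prod creds).
current_task_def() {
  aws ecs describe-services --cluster invapp-dev-cluster \
    --services invapp-dev-app-service \
    --query 'services[0].taskDefinition' --output text
}
```

For example, `previous_revision "$(current_task_def)"` prints the value to pass as `--task-definition` in the update-service command above.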
For a clean auditable revert instead of a hot fix, git revert <bad-sha> on main and let CI redeploy (~8–10 min, ADR Decision 5 Option B). Use Layer 1 task-def revert when "the site is down, fix it now"; use git revert when "this feature is buggy, roll it back cleanly."
Staging: the identical procedure works against invapp-staging-cluster / invapp-staging-app-service / invapp-staging-app. Catching a regression in staging soak (where Layer-1 revert is cheap and user-invisible) is the point of the promotion gate.
Layer 2 — Data rollback (< 1 hr)
Only when a migration corrupted/lost data. Order: try the down-migration first; restore from snapshot only if down fails or data is already corrupt.
- Down-migration (preferred, surgical). Each forward migration has a paired .down.sql (migrations.md). Apply it manually, then clear its row from _migrations so it can be re-applied later. The runner does not auto-run downs.
- Snapshot restore (nuclear). If the down fails or data is corrupt, restore the pre-promotion snapshot taken in Part 1. Restoring RDS creates a new instance, so repoint the app (Secrets Manager DB host / Terraform) at the restored endpoint; it is not an in-place rollback, which is why the target is < 1 hr, not minutes.
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier invapp-dev-postgres-restore-$(date +%Y%m%d%H%M%S) \
--db-snapshot-identifier <pre-promote-snapshot-id> \
--region us-east-1
# Then repoint the app at the restored endpoint and redeploy.
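Repointing needs the restored instance's endpoint, which only exists once the instance is available. A minimal sketch of that wait-then-lookup step follows; the function names are hypothetical and the identifier pattern is the one used in the restore command above.

```shell
#!/bin/sh
# Sketch: wait for the restored instance, then print the endpoint to repoint the app at.
# Function names are illustrative; identifiers follow the restore command in the runbook.

# restore_instance_id [timestamp]: build the identifier once so later steps can reuse it.
restore_instance_id() {
  printf 'invapp-dev-postgres-restore-%s\n' "${1:-$(date +%Y%m%d%H%M%S)}"
}

# restored_endpoint <instance-id>: blocks until available, then prints the hostname.
restored_endpoint() {
  instance="$1"
  aws rds wait db-instance-available \
    --db-instance-identifier "$instance" --region us-east-1 || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$instance" --region us-east-1 \
    --query 'DBInstances[0].Endpoint.Address' --output text
}
```

Generating the identifier once (for example `RESTORE_ID="$(restore_instance_id)"`) and passing it to both the restore command and `restored_endpoint` avoids the mismatch that an inline `$(date ...)` would cause between the two calls.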
If the bad deploy included a migration, sequence it: roll back the application (Layer 1) to a revision compatible with the current schema before deciding whether the schema also needs to roll back (ADR Decision 3).
Dry-run (NQU-865 Wave 3 acceptance, step 8)
Before relying on this in anger, exercise it once against staging and record the measured times:
- Deploy a deliberately-broken image to staging → confirm the staging smoke gate fails (catches it pre-promotion).
- Layer-1 revert staging to the previous revision → measure wall-clock (target < 15 min).
- Apply a reversible test migration to staging, run its .down.sql, confirm schema parity (pg_dump --schema-only diff).
- Take a staging snapshot and do one restore-to-new-instance to validate the data path end-to-end.
Acceptance = both RTOs met in the dry-run.
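Recording measured times is easier if each dry-run step is wrapped in a timer that also reports the verdict against its target. This is a small sketch with hypothetical helper names; the targets are the ones from the RTO table (900 s for Layer 1, 3600 s for Layer 2).

```shell
#!/bin/sh
# Sketch: time a rollback step and compare it with its RTO target.
# Helper names are illustrative; targets come from the RTO table (15 min, 1 hr).

# within_rto <elapsed-seconds> <target-seconds>: strict "less than", as in the table.
within_rto() {
  [ "$1" -lt "$2" ]
}

# timed <target-seconds> <command...>: run the command, print elapsed time and verdict.
timed() {
  target="$1"; shift
  start="$(date +%s)"
  "$@"
  status=$?
  elapsed=$(( $(date +%s) - start ))
  if [ "$status" -eq 0 ] && within_rto "$elapsed" "$target"; then
    echo "PASS ${elapsed}s (target < ${target}s)"
  else
    echo "FAIL ${elapsed}s (target < ${target}s, exit ${status})"
    return 1
  fi
}
```

For instance, `timed 900 aws ecs wait services-stable --cluster invapp-staging-cluster --services invapp-staging-app-service` times the wait portion of a staging Layer-1 revert; wrapping the whole revert in a function and timing that gives the full wall-clock figure.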
Related
- docs/decisions/2026-04-03-staging-to-production-transition.md: Decisions 3 (down-migration + snapshot) and 5 (ECS revert); this doc operationalizes them for two tiers.
- docs/reference/process/migrations.md: down-migration mechanics + pre-migration snapshot naming.
- docs/reference/process/deployment-flow.md: the CI/CD deploy pipeline this promotes/reverts within.
- docs/reference/process/environment-strategy.md: Gate 1/2 environment plan.
- docs/admin/ops/what-runs-where.md: canonical topology + identity map (NQU-868).
- NQU-865 (staging stand-up); NQU-866 (RCA: P-11 promotion gate, P-12 summary-first reporting).