
Staging Promotion + Rollback Runbook

Ticket: NQU-865 Phase 3 (W0-f) · defines the Q5 layered-rollback procedure and the P-11 promotion ritual for the new two-tier (staging → prod) topology.

This is the JE-Vectors internal promotion/rollback runbook. It is distinct from:

  • docs/licensee/ops/rollback-runbook.md — the customer-facing rollback runbook for licensed greenfield installs (NQU-645).
  • docs/admin/ops/release-validation-runbook.md — the install→upgrade→rollback cycle for cutting a release.

It builds on the 2026-04-03 ADR (docs/decisions/2026-04-03-staging-to-production-transition.md), Decisions 3 (mandatory down-migration + pre-migration snapshot) and 5 (ECS task-def revision revert). Where the ADR used aspirational invapp-prod-* / app.nquir.ai names, this doc uses the real current names.


Naming reality (read this first)

The R-8 rename is deferred (NQU-865, ratified 2026-05-21). Until it lands, the names below are the source of truth:

| Tier | Account | Cluster / service / task def | Public host | Notes |
|---|---|---|---|---|
| Prod | 760007728097 | invapp-dev-cluster / invapp-dev-app-service / invapp-dev-app | app.nquiry.ai | Named invapp-dev-* but serves production. ALB: invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com. |
| Staging | 961381384763 | invapp-staging-cluster / invapp-staging-app-service / invapp-staging-app | staging.nquiry.ai | Behind CloudFront Basic Auth (R-7). Stood up in NQU-865 Wave 2. |
| Dev | localhost | docker-compose (npm run local:up) | | NQU-867. Never points at prod. |

If a command in this doc names invapp-dev-*, that is prod. This is the exact confusion NQU-868's "what runs where" doc exists to prevent — see it for the full topology.


Current state (NQU-881 PR 2 cutover landed 2026-05-23)

  • Merge to main → staging only (the deploy-staging job in ci.yml). The legacy deploy (prod) job is gated if: false and never runs.
  • Prod is updated only via .github/workflows/promote-to-prod.yml (workflow_dispatch). Inputs: sha (the validated SHA from staging soak) + reason (cycle id / freeform).
  • The promote workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it's already in prod ECR (otherwise builds + pushes), registers the new task def, swaps the service, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure it auto-rolls-back to the previous task-def revision in the same run.
  • Rollback is rollback-prod.yml (workflow_dispatch). Inputs: optional to-revision (default = immediately previous) + reason. Same prod-promote concurrency group as promote so they cannot run concurrently.
  • Validated end-to-end before cutover by dry-run #3 on 2026-05-23 (run 26342981773): snapshot → image-reuse → task-def register → wait-for-stable → deep-health green → 🟢 NQU-828 post. Two earlier dry-runs surfaced + closed cascade-fix follow-ups (PR 1.5 ECR immutable-tag guard, PR 1.6 RDS tag-value sanitizer); the third was clean.

What this means in practice

  • A bad merge can no longer reach prod customers without an explicit workflow_dispatch action — the gate is real.
  • Prod silently drifts behind staging until someone runs the promote workflow. That is the design intent (NQU-866 P-11: promotion is a per-cycle decision, not a per-PR moment).
  • The environment: production declaration on both workflows activates required-reviewer rules automatically if the repo upgrades from the current free private plan; on the current plan, workflow_dispatch alone is the gate since only repo admins can dispatch.

Part 1 — Promotion ritual (P-11)

Principle (NQU-866 P-11): promotion is a per-cycle pass/fail signal backed by smoke evidence, not a per-PR approval moment. Joe's "go" is sign-off on the cycle's smoke evidence, not a line-by-line review.

Invariant: every promotion takes a pre-promotion snapshot

Before any promote-to-prod, take a manual prod RDS snapshot. This is the nuclear restore for Part 2 and is non-negotiable — it is what makes the data-rollback RTO achievable.

# Prod RDS snapshot (run with prod creds). Name encodes the cycle/ticket.
aws rds create-db-snapshot \
--db-instance-identifier invapp-dev-postgres \
--db-snapshot-identifier invapp-dev-rds-$(date +%Y%m%d%H%M%S)-pre-promote-<cycle> \
--region us-east-1
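
The snapshot step can be wrapped so the identifier is always RDS-safe and the "reached available" gate item is enforced rather than eyeballed. This is a minimal sketch, assuming the cycle label is free text; snapshot_id and take_pre_promote_snapshot are hypothetical helper names, not something in the repo.

```shell
# Build an RDS-safe snapshot identifier from a free-text cycle label.
# RDS identifiers allow letters, digits and hyphens, with no doubled
# or trailing hyphen, so anything else is collapsed to a single "-".
# Usage: snapshot_id <cycle> [timestamp]   (timestamp defaults to now)
snapshot_id() {
  cycle=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9][^a-z0-9]*/-/g; s/^-//; s/-$//')
  ts=${2:-$(date +%Y%m%d%H%M%S)}
  printf 'invapp-dev-rds-%s-pre-promote-%s\n' "$ts" "$cycle"
}

# Take the snapshot and block until it is usable (run with prod creds).
# Prints the snapshot id on success so it can be recorded for the cycle.
take_pre_promote_snapshot() {
  id=$(snapshot_id "$1")
  aws rds create-db-snapshot \
    --db-instance-identifier invapp-dev-postgres \
    --db-snapshot-identifier "$id" \
    --region us-east-1 >/dev/null || return 1
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$id" \
    --region us-east-1 || return 1
  printf '%s\n' "$id"
}
```

Calling take_pre_promote_snapshot "<cycle>" returns only once the snapshot is available, so its output can go straight into the promotion record.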

Promotion gate checklist (per cycle)

Promotion proceeds only when all of these are green; record the result as a single pass/fail with links (NQU-866 P-12 — summary first, drill-down on request):

  • Staging deploy is healthy: curl -u <basic-auth> 'https://staging.nquiry.ai/api/health?deep=true' → {"status":"ok"}.
  • NQU-855 smoke suite green against staging (Cluster A real-provider + Cluster B pipeline invariants).
  • Any migration in this cycle applied cleanly to staging RDS and its .down.sql was exercised (see migrations.md).
  • Pending DB migrations are queued to apply to prod RDS during promote. promote-to-prod.yml does NOT run migrations (it only snapshots + swaps the image), so they are applied manually — see Promote step 4 below. Confirm the .down.sql is ready. Pending as of 2026-05-29: NQU-783 20260529003733_nqu783_remove_ai_provider_setting.sql — drops the now-unread ai_provider row; backward-compatible. Clear this note once applied.
  • Stripe test-mode webhook on staging handled a test event idempotently (if the cycle touched billing).
  • Pre-promotion prod RDS snapshot taken (above) and reached available.
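
The first gate item (staging deep-health) is easy to script so the pass/fail is an exit code instead of a visual check. This is a sketch under assumptions: STAGING_BASIC_AUTH is a hypothetical variable holding user:password, and the check only looks for a "status":"ok" pair anywhere in the body, so it would need tightening if the deep-health payload nests per-dependency statuses.

```shell
# Succeeds only when the JSON body on stdin contains "status":"ok".
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch the staging deep-health endpoint and evaluate it.
# -f makes curl fail on HTTP errors; the URL is quoted because "?"
# is a glob character in some shells.
check_staging() {
  curl -fsS -u "$STAGING_BASIC_AUTH" \
    'https://staging.nquiry.ai/api/health?deep=true' | health_ok
}
```

check_staging && echo "staging health: pass" gives a single line to paste into the cycle summary.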

Promote

Promotion pins prod to the exact image SHA that passed staging soak. With NQU-881 PR 1 landed, the promote-to-prod workflow_dispatch is the standard path; the manual CLI fallback below stays for the offline-runbook case.

Standard path — promote-to-prod.yml (NQU-881):

  1. Identify the validated SHA (the one staging just smoked green): GitHub → Actions → most recent green smoke-suite run → the commit it ran against. Or git log origin/main for the last [cycle-close] commit subject.
  2. GitHub → Actions → Promote to Prod → Run workflow. Inputs:
    • sha: full commit SHA from step 1.
    • reason: cycle id + ticket reference for the audit trail.
  3. The workflow takes the pre-promotion RDS snapshot, reuses the SHA-tagged image if it is already in prod ECR (otherwise builds + pushes it), registers the new task def, swaps the service, waits-for-stable, runs the deep-health check, and posts the result to NQU-828 (CC-attributed). On health failure the workflow auto-rolls-back to the previous task-def revision in the same run.
  4. Apply pending DB migrations to prod RDS — the workflow does NOT run migrations. Open the SSM tunnel to prod RDS (invapp-dev-postgres) and run npm run db:migrate:run <file> per migrations.md, then confirm the row landed in _migrations. Sequencing: a backward-compatible migration (e.g. NQU-783's ai_provider row drop — nothing reads it) can be applied right after the deploy; for a non-backward-compatible change, take a manual pre-migration prod snapshot and apply it before dispatching the workflow (the workflow's own snapshot is taken at the start of its run, so applying after dispatch leaves no pre-migration restore point — ADR Decision 3).
  5. The environment: production declaration activates required-reviewer rules automatically if the repo upgrades to a plan that supports them; on the current free private plan, workflow_dispatch alone is the gate (only repo admins can dispatch).
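
Step 2 can also be dispatched from a terminal with the GitHub CLI. This is a sketch, assuming gh is installed and authenticated as a repo admin and that the workflow is dispatched from main (the --ref value is an assumption); the sha and reason input names come from the workflow description above, while is_full_sha and promote are hypothetical helpers.

```shell
# True only for a full 40-character lowercase hex commit SHA.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Dispatch promote-to-prod.yml with a validated SHA and an audit reason.
# Usage: promote <full-sha> <reason>
promote() {
  sha=$1
  reason=$2
  is_full_sha "$sha" || { echo "need a full 40-char commit SHA" >&2; return 1; }
  [ -n "$reason" ] || { echo "reason is required" >&2; return 1; }
  gh workflow run promote-to-prod.yml --ref main \
    -f sha="$sha" -f reason="$reason"
}
```

Rejecting abbreviated SHAs up front avoids dispatching a run that fails later when it looks up the image tag.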

Manual CLI fallback (offline / GitHub down):

# 1. Identify the validated image SHA (the one staging is running).
aws ecs describe-task-definition --task-definition invapp-staging-app \
--query 'taskDefinition.containerDefinitions[0].image' --output text # staging creds

# 2. Apply any cycle migrations to PROD (snapshot already taken above), per migrations.md.

# 3. Point prod's task def at that image SHA and force a new deployment.
aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
--task-definition invapp-dev-app:<revision-with-validated-sha> --force-new-deployment # prod creds
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# 4. Prod smoke (hit the ALB directly; CloudFront blocks CI/curl UAs).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
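
After the manual swap it is worth confirming that prod is actually running the image tag staging validated. The sketch below compares tags only, since staging and prod pull from different ECR registries; it assumes images are referenced by SHA tag rather than by digest, and image_tag / service_image are hypothetical helper names.

```shell
# Extract the tag from an image reference (everything after the last ":").
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Print the image of the first container in the task def a service
# is currently pointed at. Usage: service_image <cluster> <service>
service_image() {
  td=$(aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[0].taskDefinition' --output text) || return 1
  aws ecs describe-task-definition --task-definition "$td" \
    --query 'taskDefinition.containerDefinitions[0].image' --output text
}

# Example comparison (run each call with the matching account's creds):
#   staging=$(image_tag "$(service_image invapp-staging-cluster invapp-staging-app-service)")
#   prod=$(image_tag "$(service_image invapp-dev-cluster invapp-dev-app-service)")
#   [ "$staging" = "$prod" ] && echo "prod pinned to validated SHA $prod"
```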

Record the promotion decision (cycle id, validated SHA, snapshot id, who signed off, pass/fail) wherever the cycle is tracked.


Part 2 — Layered rollback (Q5)

Two independent layers. Roll back the application first (fast, almost always sufficient); only touch data if a migration corrupted state.

RTO targets

| Layer | Mechanism | Target | When |
|---|---|---|---|
| Application | ECS task-def revision revert | < 15 min (≈3 min typical) | Bad code / config shipped; data intact |
| Data | Down-migration, else pre-promotion snapshot restore | < 1 hr | A migration corrupted or lost data |

Layer 1 — Application rollback (< 15 min)

Fast path, no rebuild — re-point the service at the previous known-good task-def revision (ADR Decision 5, Option A). With NQU-881 PR 1 landed, the rollback-prod workflow_dispatch is the standard path; the manual CLI fallback stays for the offline-runbook case.

Standard path — rollback-prod.yml (NQU-881):

  1. GitHub → Actions → Rollback Prod (App) → Run workflow. Inputs:
    • to-revision (optional): invapp-dev-app:<N> to target a specific revision. Leave blank to roll back to the immediately previous revision (the default).
    • reason: incident id + freeform.
  2. Workflow asserts AWS, swaps the service back, waits-for-stable, runs the deep-health check, and posts the result to NQU-828. Same prod-promote concurrency group as promote-to-prod.yml so they cannot run simultaneously.
  3. The rollback also auto-fires once per failed promote (the inline auto-rollback step at the end of promote-to-prod.yml); dispatch rollback-prod manually when a promote succeeded but the deploy turned out bad after the fact.
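
The same dispatch can be done from a terminal during an incident. This is a sketch, assuming the GitHub CLI is installed and authenticated as a repo admin and that main is the ref to dispatch from; the to-revision and reason input names are taken from the steps above, and valid_revision / rollback are hypothetical helpers.

```shell
# True only for "invapp-dev-app:<N>" where N is one or more digits.
valid_revision() {
  case "$1" in
    invapp-dev-app:*) n=${1#invapp-dev-app:} ;;
    *) return 1 ;;
  esac
  case "$n" in
    ''|*[!0-9]*) return 1 ;;
  esac
}

# Dispatch rollback-prod.yml. Omitting to-revision lets the workflow
# use its default (the immediately previous revision).
# Usage: rollback <reason> [invapp-dev-app:<N>]
rollback() {
  reason=$1
  rev=${2:-}
  [ -n "$reason" ] || { echo "reason is required" >&2; return 1; }
  if [ -n "$rev" ]; then
    valid_revision "$rev" || { echo "bad revision: $rev" >&2; return 1; }
    gh workflow run rollback-prod.yml --ref main \
      -f reason="$reason" -f to-revision="$rev"
  else
    gh workflow run rollback-prod.yml --ref main -f reason="$reason"
  fi
}
```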

Manual CLI fallback (offline / GitHub down):

# List recent revisions, pick the last known-good (the revision before the bad deploy).
aws ecs list-task-definitions --family-prefix invapp-dev-app --sort DESC --max-items 5 # prod creds

aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service \
--task-definition invapp-dev-app:<previous-good-revision> --force-new-deployment
aws ecs wait services-stable --cluster invapp-dev-cluster --services invapp-dev-app-service

# Smoke (ALB direct).
curl -sk "https://invapp-dev-alb-1204346020.us-east-1.elb.amazonaws.com/api/health?deep=true"
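
When picking the target by hand under pressure, deriving it from what the service is running now is less error-prone than reading a list. This is a sketch that assumes the numerically previous revision is the last known-good one, which does not hold if intermediate revisions were registered but never deployed; check the list output before trusting it. previous_revision and current_taskdef are hypothetical helpers.

```shell
# Given a task-def ARN or family:revision, print family:(revision-1).
# Fails if the revision is not a number greater than 1.
previous_revision() {
  fam_rev=${1##*/}       # strip the ARN prefix, e.g. invapp-dev-app:42
  family=${fam_rev%:*}
  rev=${fam_rev##*:}
  [ "$rev" -gt 1 ] 2>/dev/null || return 1
  printf '%s:%s\n' "$family" "$((rev - 1))"
}

# Task-def ARN the prod service currently points at (prod creds).
current_taskdef() {
  aws ecs describe-services \
    --cluster invapp-dev-cluster --services invapp-dev-app-service \
    --query 'services[0].taskDefinition' --output text
}

# Example: target=$(previous_revision "$(current_taskdef)")
```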

For a clean auditable revert instead of a hot fix, git revert <bad-sha> on main and let CI redeploy (~8–10 min, ADR Decision 5 Option B). Use Layer 1 task-def revert when "the site is down, fix it now"; use git revert when "this feature is buggy, roll it back cleanly."

Staging: the identical procedure works against invapp-staging-cluster / invapp-staging-app-service / invapp-staging-app. Catching a regression in staging soak (where Layer-1 revert is cheap and user-invisible) is the point of the promotion gate.

Layer 2 — Data rollback (< 1 hr)

Only when a migration corrupted/lost data. Order: try the down-migration first; restore from snapshot only if down fails or data is already corrupt.

  1. Down-migration (preferred, surgical). Each forward migration has a paired .down.sql (migrations.md). Apply it manually, then clear its row from _migrations so it can be re-applied later. The runner does not auto-run downs.
  2. Snapshot restore (nuclear). If the down fails or data is corrupt, restore the pre-promotion snapshot taken in Part 1. Restoring RDS creates a new instance — repoint the app (Secrets Manager DB host / Terraform) at the restored endpoint; it is not an in-place rollback, which is why the target is <1 hr, not minutes.
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier invapp-dev-postgres-restore-$(date +%Y%m%d%H%M%S) \
--db-snapshot-identifier <pre-promote-snapshot-id> \
--region us-east-1
# Then repoint the app at the restored endpoint and redeploy.
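
The restore call returns long before the new instance is usable, so the repoint step needs a wait plus an endpoint lookup. This is a minimal sketch using the AWS CLI's built-in waiter; restored_endpoint is a hypothetical helper name, and how the endpoint is then written to Secrets Manager or Terraform is left out because it depends on the repo's setup.

```shell
# Block until the restored instance is available, then print its
# endpoint hostname. Usage: restored_endpoint <restored-instance-id>
restored_endpoint() {
  aws rds wait db-instance-available \
    --db-instance-identifier "$1" --region us-east-1 || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$1" --region us-east-1 \
    --query 'DBInstances[0].Endpoint.Address' --output text
}
```

The time spent inside the waiter dominates the Layer 2 RTO, which is worth measuring during the dry-run.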

If the bad deploy included a migration, sequence it: roll back the application (Layer 1) to a revision compatible with the current schema before deciding whether the schema also needs to roll back (ADR Decision 3).
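
The down-migration step (apply the .down.sql, then clear its _migrations row) is safest when both happen in one transaction. The sketch below rests on several assumptions that must be checked against migrations.md before use: that the down file shares the forward file's stem, that DATABASE_URL points at the database through the SSM tunnel, and that _migrations keys rows by file name in a column called name (hypothetical). It also assumes the .down.sql does not issue its own BEGIN/COMMIT.

```shell
# Map a forward migration path to its paired down file.
# Fails for paths that are already down files or are not .sql at all.
down_file_for() {
  case "$1" in
    *.down.sql) return 1 ;;
    *.sql) printf '%s\n' "${1%.sql}.down.sql" ;;
    *) return 1 ;;
  esac
}

# Apply the down file and remove the bookkeeping row atomically.
# Usage: apply_down <path/to/forward-migration.sql>
apply_down() {
  down=$(down_file_for "$1") || return 1
  [ -f "$down" ] || { echo "missing $down" >&2; return 1; }
  name=$(basename "$1")
  # Column "name" is an assumption about the _migrations schema.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction \
    -f "$down" \
    -c "DELETE FROM _migrations WHERE name = '$name';"
}
```

With --single-transaction, a failure in either the down SQL or the delete rolls back both, so the schema and _migrations cannot disagree.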


Dry-run (NQU-865 Wave 3 acceptance, step 8)

Before relying on this in anger, exercise it once against staging and record the measured times:

  1. Deploy a deliberately-broken image to staging → confirm the staging smoke gate fails (catches it pre-promotion).
  2. Layer-1 revert staging to the previous revision → measure wall-clock (target < 15 min).
  3. Apply a reversible test migration to staging, run its .down.sql, confirm schema parity (pg_dump --schema-only diff).
  4. Take a staging snapshot and do one restore-to-new-instance to validate the data path end-to-end.

Acceptance = both RTOs met in the dry-run.
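
Step 2's wall-clock measurement can be captured by the same script that performs the revert, so the recorded time is not a stopwatch estimate. This is a sketch against staging, with hypothetical helper names; 900 seconds encodes the < 15 min Layer 1 target.

```shell
# True when (end - start) is strictly under the limit, all in seconds.
# Usage: within_rto <start_epoch> <end_epoch> <limit_seconds>
within_rto() {
  [ $(( $2 - $1 )) -lt "$3" ]
}

# Revert staging to a given revision, wait for stability, and report
# the elapsed time. Exit status reflects whether the RTO was met.
# Usage: time_layer1_revert invapp-staging-app:<N>   (staging creds)
time_layer1_revert() {
  start=$(date +%s)
  aws ecs update-service \
    --cluster invapp-staging-cluster --service invapp-staging-app-service \
    --task-definition "$1" --force-new-deployment >/dev/null || return 1
  aws ecs wait services-stable \
    --cluster invapp-staging-cluster --services invapp-staging-app-service || return 1
  end=$(date +%s)
  echo "layer-1 revert took $((end - start))s"
  within_rto "$start" "$end" 900
}

# Step 3 schema parity, assuming two dumps taken before the test
# migration and after its down:
#   pg_dump --schema-only "$STAGING_DATABASE_URL" > before.sql
#   pg_dump --schema-only "$STAGING_DATABASE_URL" > after.sql
#   diff -u before.sql after.sql && echo "schema parity: ok"
```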


  • docs/decisions/2026-04-03-staging-to-production-transition.md — Decisions 3 (down-migration + snapshot) and 5 (ECS revert); this doc operationalizes them for two tiers.
  • docs/reference/process/migrations.md — down-migration mechanics + pre-migration snapshot naming.
  • docs/reference/process/deployment-flow.md — the CI/CD deploy pipeline this promotes/reverts within.
  • docs/reference/process/environment-strategy.md — Gate 1/2 environment plan.
  • docs/admin/ops/what-runs-where.md — canonical topology + identity map (NQU-868).
  • NQU-865 — staging stand-up; NQU-866 — RCA (P-11 promotion gate, P-12 summary-first reporting).