Environment Strategy
Last updated: 2026-05-25 (rewrote to current two-tier topology post-NQU-865 Wave 2 + NQU-881 cutover)
Decision owners: Joe, CC, CD
TL;DR. Two-tier topology since 2026-05-22: prod in account 760007728097 (app.nquiry.ai), staging in account 961381384763 (staging.nquiry.ai). Local docker stack is dev (NQU-867). Every merge to main auto-deploys to staging; prod is updated only via promote-to-prod.yml workflow_dispatch per cycle. The pre-2026-05-22 "single env" era is preserved below as Historical State.
Current State (2026-05-22+) — Two-Tier Topology
| Tier | Account | Resource prefix | Public host | Provisioned via | Update path |
|---|---|---|---|---|---|
| Prod | 760007728097 | invapp-dev-* ¹ | app.nquiry.ai | environments/dev/ TF | promote-to-prod.yml workflow_dispatch (per cycle, NQU-881 PR 2 cutover 2026-05-23) |
| Staging | 961381384763 | invapp-staging-* | staging.nquiry.ai | environments/staging/ TF | deploy-staging job on push to main (.github/workflows/ci.yml) |
| Dev | localhost | docker-compose | — | npm run local:up (NQU-867) | Local only — never points at staging or prod |
¹ Prod resource prefix is invapp-dev-* (cluster invapp-dev-cluster, service invapp-dev-app-service, RDS invapp-dev-postgres). The cosmetic rename to invapp-prod-* is deferred per NQU-878 — name_prefix drives Terraform resource names, so a rename forces destroy+recreate of stateful resources (data migration + downtime, not in-place). Isolation (the actual goal) is achieved via the account split, not the rename. See what-runs-where.md for the full identity + resource map.
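Because prod resources carry a dev prefix, anything that infers the tier from a resource name alone is easy to get wrong. The following is a minimal sketch of the table above as a lookup. The account IDs, prefixes, and hosts are copied from the table, while the `Tier` class and both helper functions are hypothetical names used for illustration and are not part of the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    account_id: str
    resource_prefix: str
    public_host: str


# Values copied from the topology table; the structure itself is illustrative.
TIERS = (
    Tier("prod", "760007728097", "invapp-dev-", "app.nquiry.ai"),
    Tier("staging", "961381384763", "invapp-staging-", "staging.nquiry.ai"),
)


def tier_for_account(account_id: str) -> Optional[Tier]:
    """Return the tier that owns an AWS account ID, or None if unknown."""
    for tier in TIERS:
        if tier.account_id == account_id:
            return tier
    return None


def tier_for_resource(resource_name: str) -> Optional[Tier]:
    """Resolve a resource name by prefix. Note that 'invapp-dev-' maps to prod."""
    for tier in TIERS:
        if resource_name.startswith(tier.resource_prefix):
            return tier
    return None
```

With this shape, `tier_for_resource("invapp-dev-postgres")` resolves to the prod tier, which is the naming trap the footnote describes. Keying checks on the account ID avoids depending on the prefix at all.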
Stand-up history
- 2026-05-22: Staging account 961381384763 (formerly customer-greenfield validation environment per NQU-644/645) stood up as staging via NQU-865 Wave 2 (Route53/ACM, ECS cluster + service, RDS, Cognito, Bedrock quota, OIDC). staging.nquiry.ai live behind CloudFront Basic Auth (nquiry/preview2026) per R-7.
- 2026-05-22: NQU-855 smoke suite (4/4 green) verified on staging.
- 2026-05-23: NQU-881 PR 2 cutover landed: legacy deploy job in ci.yml gated with if: false; prod promotion runs only via promote-to-prod.yml workflow_dispatch. Validated end-to-end by dry-run #3 (run 26342981773).
- Open: Wave 4 sunset of cc-staging-bootstrap static IAM key (NQU-871, hard deadline 2026-06-15).
What this changes day-to-day
- Every merge to main ships to staging, not prod. The chat-go protocol that protected prod is no longer the per-PR gate; the per-cycle promote-to-prod dispatch is. See CLAUDE.md "PR Merge Discipline" for the post-NQU-881 autonomy rules.
- Bad code on main cannot reach prod customers without an explicit workflow_dispatch. The promotion gate is real.
- Prod silently drifts behind staging until the promote workflow is run. This is the design intent (NQU-866 P-11: promotion is a per-cycle decision, not a per-PR moment).
- The cycle-end smoke suite (NQU-855) hits staging, and its result drives the cycle's promote/no-promote signal posted to NQU-828.
- Local dev never touches prod. The docker-compose stack (NQU-867) replaced the pre-2026-04 pattern of pointing npm run dev at the production RDS tunnel.
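The routing rule described in the list above can be sketched as a small decision function. This is an illustration only: the real rules live in ci.yml and promote-to-prod.yml, and the function name and its parameters are assumptions made for this example. The event names (`push`, `workflow_dispatch`) and the `refs/heads/main` ref format are standard GitHub Actions values.

```python
from typing import Optional


def deploy_target(event_name: str, workflow_file: str, ref: str) -> Optional[str]:
    """Map a GitHub Actions trigger to the tier it is allowed to deploy to.

    Hypothetical helper: it restates the policy in this doc, not the workflows.
    """
    # Prod is reachable only through a manual dispatch of the promote workflow.
    if event_name == "workflow_dispatch" and workflow_file == "promote-to-prod.yml":
        return "prod"
    # Every push to main (that is, every merge) auto-deploys to staging.
    if event_name == "push" and workflow_file == "ci.yml" and ref == "refs/heads/main":
        return "staging"
    # Anything else (PR builds, feature branches) deploys nowhere.
    return None
```

For example, `deploy_target("push", "ci.yml", "refs/heads/main")` returns `"staging"`, and no `push` event can ever return `"prod"`, which is why prod lags staging until someone runs the promotion.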
Cross-references
- docs/admin/ops/what-runs-where.md: canonical "what's deployed where" identity + resource map (NQU-868).
- docs/admin/ops/staging-promotion-and-rollback.md: promotion ritual (Part 1) + layered rollback (Part 2) runbook.
- docs/reference/process/deployment-flow.md: the CI/CD pipeline within which promote/rollback fire.
- docs/reference/process/development-lifecycle.md §2 Stages 9 / 9a / 9b: spine doc with the lifecycle stages this topology embodies.
Gates revisited
The original Gate 1 / Gate 2 plan in this doc (rename dev→prod first, then add staging) was superseded by the NQU-865 work. The actual execution path was account split, not naming split. The gates are preserved below for historical context, with current status notes.
Gate 1: Rename invapp-dev-* → invapp-prod-* — DEFERRED
Status: Deferred indefinitely per NQU-878 (ratified 2026-05-22).
Why deferred. name_prefix drives Terraform resource names. Renaming invapp-dev-postgres → invapp-prod-postgres (and equivalent RDS/S3 buckets) forces Terraform to destroy + recreate stateful resources — a data migration + downtime, not a rename in place. The original framing of "rename + harden in place; no migration" was internally inconsistent on that point.
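To make the blast radius of a prefix change concrete, here is a minimal sketch of how one prefix fans out into resource names. The three suffixes are taken from the prod footnote in this doc; the function names are hypothetical, and the real derivation lives in the Terraform under environments/dev/, which this sketch does not reproduce.

```python
from typing import Dict, Tuple


def resource_names(name_prefix: str) -> Dict[str, str]:
    """Derive resource names from a prefix (suffixes per the prod footnote)."""
    return {
        "ecs_cluster": f"{name_prefix}-cluster",
        "ecs_service": f"{name_prefix}-app-service",
        "rds_instance": f"{name_prefix}-postgres",
    }


def rename_diff(old_prefix: str, new_prefix: str) -> Dict[str, Tuple[str, str]]:
    """Return every resource whose name changes, as (old_name, new_name)."""
    old = resource_names(old_prefix)
    new = resource_names(new_prefix)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

Calling `rename_diff("invapp-dev", "invapp-prod")` reports a change for every entry, including the RDS instance. Per the paragraph above, that name change is what Terraform would plan as destroy + recreate for the stateful resources.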
What replaced it. Phase 3's actual goal — isolation + hardening (R-3 IAM split, R-4 static-key elimination, R-5 Bedrock quota separation, R-6 per-env CI credentials, R-7 staging environment between localhost and prod) — landed via the staging stand-up (NQU-865 Wave 2). The structural risks the RCA surfaced are addressed by isolation, not by the rename. The cosmetic name (invapp-dev-* continuing to serve prod) is mitigated by:
- what-runs-where.md as load-bearing reference at spec time
- [[cd_spec_topology_citations]] memory discipline (CD specs cite account IDs + resource names, not generic "production")
- Cycle-close naming reality check (NQU-866 P-9 scheduled task)
Re-surface triggers (see NQU-878 for full list): account-level migration for any other reason · 3 consecutive cycle-close naming-confusion drift findings · external audit flags the dev-named prod resources.
Gate 2: Add Staging — DONE (differently than originally planned)
Status: Done 2026-05-22 via NQU-865 Wave 2. Account split, not naming split.
Original plan (this doc, pre-2026-05-22): same-account resource-prefix split (invapp-staging-* alongside invapp-dev-* in account 760007728097). Cost estimate ~$50–80/mo.
Actual execution: separate AWS account (961381384763, JE-Vectors-test, formerly customer-greenfield validation env per NQU-644/645). Reasons the account split was chosen over same-account prefix split:
- Cleaner Bedrock-quota isolation (the [[project_nqu865_phase3_inflight]] context that triggered NQU-865 was CC's A/B scripts hammering the shared account-level Bedrock quota and putting live-user traffic at throttling risk).
- Cleaner IAM isolation (per-account OIDC trust + per-account roles, not per-prefix policy carve-outs).
- Reuses already-provisioned account (NQU-644 sub-account, SSO already wired, TF state backend already ready).
- Forces the conceptual separation: every CC operation in the staging account is account-scoped, not policy-scoped — fewer places for blast-radius creep.
Staging domain: staging.nquiry.ai (per R-7, behind CloudFront Basic Auth nquiry / preview2026).
Cognito: separate user pool per account (no cross-account sharing).
Cost: lower than original estimate — the staging account already exists, and staging compute uses smaller instance sizes (cf. NQU-865 Phase 3 implementation plan).
Historical State (pre-2026-05-22) — Single Environment ("dev")
Preserved for context. Until 2026-05-22, Inqura ran a single AWS environment (account 760007728097) that served production traffic at app.nquiry.ai. Resources were named invapp-dev-* throughout (a historical artifact: provisioning started with "dev" expecting to add staging/prod later). The naming actively misled CC + CD into treating changes as low-stakes when they were prod-facing; that conflation is the NQU-866 RCA's central finding. NQU-865 closed the gap by adding staging as a real account, not by renaming.
The pre-2026-05-22 footprint:
- Terraform environment: environments/dev/
- Resource prefix: invapp-dev (cluster, service, ECR repo, ALB, RDS, etc.)
- State backend: s3://invapp-terraform-state-760007728097, key dev/terraform.tfstate
- Domain: app.nquiry.ai via Route53 + CloudFront + ALB
- ECS: invapp-dev-cluster / invapp-dev-app-service (1 task, 512 CPU / 1024 MB)
- RDS: invapp-dev-postgres, single-AZ, db.t3.micro, deletion protection ON
- Redis: cache.t3.micro, single node
- WAF: Bot control in count (log-only) mode
- CloudTrail: Data events disabled (cost savings)
- CloudFront Basic Auth: Removed at F+F (2026-04-20, NQU-359)
Local development ran via npm run dev against a local PostgreSQL instance or a tunneled RDS connection (this last pattern is what NQU-867 replaced — the local docker stack now means local dev never touches prod).
Historical decision rationale (pre-2026-05-22)
| Factor | Decision (pre-2026-05-22) |
|---|---|
| Team size | 1 developer + AI agents. No need for environment isolation between team members. |
| User base | Pre-launch (until F+F 2026-04-20). Limited blast radius justified the single env. |
| Cost | Every duplicated environment doubles infrastructure spend for no current benefit. |
| Complexity | Managing state across environments adds operational burden with no payoff at this scale. |
| Speed | Single environment means faster iteration. No promotion gates to slow down shipping. |
What changed in May 2026: F+F was live, then NQU-728 (Sonnet 4.6 truncation) took the AI offline for real users. The bug could not have been caught pre-deploy because there was no pre-deploy environment. CC's NQU-623 A/B scripts hammered the shared account-level Bedrock quota and put real-user requests at throttling risk. The hard target date of 2026-06-01 for the first paying customer made the single-env blast radius untenable. NQU-865 (filed 2026-05-21) committed to the two-tier topology; Wave 2 went live 2026-05-22.