Environment Strategy

Last updated: 2026-05-25 (rewrote to current two-tier topology post-NQU-865 Wave 2 + NQU-881 cutover)
Decision owners: Joe, CC, CD

TL;DR. Two-tier topology since 2026-05-22: prod in account 760007728097 (app.nquiry.ai), staging in account 961381384763 (staging.nquiry.ai). Local docker stack is dev (NQU-867). Every merge to main auto-deploys to staging; prod is updated only via promote-to-prod.yml workflow_dispatch per cycle. The pre-2026-05-22 "single env" era is preserved below as Historical State.


Current State (2026-05-22+) — Two-Tier Topology

| Tier | Account | Resource prefix | Public host | Provisioned via | Update path |
| --- | --- | --- | --- | --- | --- |
| Prod | 760007728097 | invapp-dev-* ¹ | app.nquiry.ai | environments/dev/ TF | promote-to-prod.yml workflow_dispatch (per cycle, NQU-881 PR 2 cutover 2026-05-23) |
| Staging | 961381384763 | invapp-staging-* | staging.nquiry.ai | environments/staging/ TF | deploy-staging job on push to main (.github/workflows/ci.yml) |
| Dev | localhost | docker-compose | | npm run local:up (NQU-867) | Local only; never points at staging or prod |

¹ Prod resource prefix is invapp-dev-* (cluster invapp-dev-cluster, service invapp-dev-app-service, RDS invapp-dev-postgres). The cosmetic rename to invapp-prod-* is deferred per NQU-878: name_prefix drives Terraform resource names, so a rename forces destroy+recreate of stateful resources (data migration + downtime, not in-place). Isolation (the actual goal) is achieved via the account split, not the rename. See what-runs-where.md for the full identity + resource map.
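
To make the footnote concrete, the two AWS tiers can be encoded as data so that tooling resolves a tier from the account ID or the resource prefix instead of guessing from the word "dev". This is a minimal sketch and not code from the repository: the `Tier` dataclass, `TOPOLOGY`, `tier_for_account`, and `tier_for_resource` are hypothetical names, and only the account IDs, prefixes, and hosts are taken from the table above. The local dev stack has no AWS account and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    account_id: str
    resource_prefix: str
    public_host: str


# Values copied from the topology table; the structure itself is hypothetical.
TOPOLOGY = (
    Tier("prod", "760007728097", "invapp-dev", "app.nquiry.ai"),
    Tier("staging", "961381384763", "invapp-staging", "staging.nquiry.ai"),
)


def tier_for_account(account_id: str) -> Tier:
    """Resolve a tier from an AWS account ID."""
    for tier in TOPOLOGY:
        if tier.account_id == account_id:
            return tier
    raise KeyError(f"unknown account: {account_id}")


def tier_for_resource(resource_name: str) -> Tier:
    """Resolve a tier from a resource name via its prefix."""
    for tier in TOPOLOGY:
        if resource_name.startswith(tier.resource_prefix + "-"):
            return tier
    raise KeyError(f"unknown resource prefix: {resource_name}")


if __name__ == "__main__":
    # The dev-named database belongs to prod, which is the naming trap.
    print(tier_for_resource("invapp-dev-postgres").name)
```

Looking up `invapp-dev-postgres` returns the prod tier, which is the point of the footnote: the name says dev, the account says prod.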

Stand-up history

  • 2026-05-22 — Staging account 961381384763 (formerly customer-greenfield validation environment per NQU-644/645) stood up as staging via NQU-865 Wave 2 (Route53/ACM, ECS cluster + service, RDS, Cognito, Bedrock quota, OIDC). staging.nquiry.ai live behind CloudFront Basic Auth (nquiry / preview2026) per R-7.
  • 2026-05-22 — NQU-855 smoke suite (4/4 green) verified on staging.
  • 2026-05-23 — NQU-881 PR 2 cutover landed: legacy deploy job in ci.yml gated if: false; prod promotion runs only via promote-to-prod.yml workflow_dispatch. Validated end-to-end by dry-run #3 (run 26342981773).
  • Open — Wave 4 sunset of cc-staging-bootstrap static IAM key (NQU-871, hard deadline 2026-06-15).

What this changes day-to-day

  • Every merge to main ships to staging, not prod. The chat-go protocol that protected prod is no longer the per-PR gate — the per-cycle promote-to-prod dispatch is. See CLAUDE.md "PR Merge Discipline" for the post-NQU-881 autonomy rules.
  • Bad code on main cannot reach prod customers without an explicit workflow_dispatch. The promotion gate is real.
  • Prod silently drifts behind staging until the promote workflow is run. This is the design intent (NQU-866 P-11: promotion is a per-cycle decision, not a per-PR moment).
  • The cycle-end smoke suite (NQU-855) hits staging, and its result drives the cycle's promote/no-promote signal posted to NQU-828.
  • Local dev never touches prod. The docker-compose stack (NQU-867) replaced the pre-2026-04 pattern of pointing npm run dev at the production RDS tunnel.
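
The update paths above can be modelled as a small routing function. This is only an illustrative sketch: the real gating lives in the workflow files, and `deploy_target` and `promote_signal` are hypothetical helpers. The event names `push` and `workflow_dispatch` and the `refs/heads/main` ref format are standard GitHub Actions values. The rule that only an all-green smoke run yields "promote" is an assumption; the document says only that the smoke result drives the signal.

```python
from typing import Optional, Sequence


def deploy_target(event: str, ref: str = "", workflow: str = "") -> Optional[str]:
    """Return the tier a CI trigger updates, or None if it deploys nothing."""
    if event == "push" and ref == "refs/heads/main":
        return "staging"  # deploy-staging job in ci.yml
    if event == "workflow_dispatch" and workflow == "promote-to-prod.yml":
        return "prod"  # explicit per-cycle promotion
    return None  # the legacy deploy job is gated off, so nothing else ships


def promote_signal(smoke_results: Sequence[bool]) -> str:
    """Map cycle-end smoke results to a promote/no-promote signal.

    Assumption: every check must pass, and an empty run never promotes.
    """
    if smoke_results and all(smoke_results):
        return "promote"
    return "no-promote"


if __name__ == "__main__":
    print(deploy_target("push", ref="refs/heads/main"))
    print(deploy_target("workflow_dispatch", workflow="promote-to-prod.yml"))
    print(promote_signal([True, True, True, True]))
```

In this model no `push` event maps to prod, which mirrors the claim that code on main cannot reach prod customers without an explicit dispatch.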

Cross-references


Gates revisited

The original Gate 1 / Gate 2 plan in this doc (rename dev→prod first, then add staging) was superseded by the NQU-865 work. The actual execution path was account split, not naming split. The gates are preserved below for historical context, with current status notes.

Gate 1: Rename invapp-dev-* → invapp-prod-* — DEFERRED

Status: Deferred indefinitely per NQU-878 (ratified 2026-05-22).

Why deferred. name_prefix drives Terraform resource names. Renaming invapp-dev-postgres → invapp-prod-postgres (and the equivalent RDS and S3 resources) forces Terraform to destroy + recreate stateful resources: a data migration plus downtime, not a rename in place. The original framing of "rename + harden in place; no migration" was internally inconsistent on that point.

What replaced it. Phase 3's actual goal — isolation + hardening (R-3 IAM split, R-4 static-key elimination, R-5 Bedrock quota separation, R-6 per-env CI credentials, R-7 staging environment between localhost and prod) — landed via the staging stand-up (NQU-865 Wave 2). The structural risks the RCA surfaced are addressed by isolation, not by the rename. The cosmetic name (invapp-dev-* continuing to serve prod) is mitigated by:

  • what-runs-where.md as load-bearing reference at spec time
  • [[cd_spec_topology_citations]] memory discipline (CD specs cite account IDs + resource names, not generic "production")
  • Cycle-close naming reality check (NQU-866 P-9 scheduled task)

Re-surface triggers (see NQU-878 for full list): account-level migration for any other reason · 3 consecutive cycle-close naming-confusion drift findings · external audit flags the dev-named prod resources.
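
The three triggers listed above could be checked mechanically at cycle close. This is a hedged sketch only: `should_resurface_rename` and its parameters are hypothetical, NQU-878 holds the full trigger list, and "3 consecutive" is interpreted here as the three most recent cycle-close checks, which is an assumption.

```python
from typing import Sequence


def should_resurface_rename(
    account_migration_planned: bool,
    cycle_drift_findings: Sequence[bool],
    audit_flagged_naming: bool,
) -> bool:
    """Decide whether the deferred rename should be re-raised.

    cycle_drift_findings is ordered oldest first; True means that
    cycle-close check found naming-confusion drift.
    """
    recent = list(cycle_drift_findings)[-3:]
    # Assumption: the streak must cover the three most recent cycles.
    three_in_a_row = len(recent) == 3 and all(recent)
    return account_migration_planned or three_in_a_row or audit_flagged_naming


if __name__ == "__main__":
    # Two findings followed by a clean cycle does not trip the trigger.
    print(should_resurface_rename(False, [True, True, False], False))
```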

Gate 2: Add Staging — DONE (differently than originally planned)

Status: Done 2026-05-22 via NQU-865 Wave 2. Account split, not naming split.

Original plan (this doc, pre-2026-05-22): same-account resource-prefix split (invapp-staging-* alongside invapp-dev-* in account 760007728097). Cost estimate ~$50–80/mo.

Actual execution: separate AWS account (961381384763, JE-Vectors-test, formerly customer-greenfield validation env per NQU-644/645). Reasons the account split was chosen over same-account prefix split:

  • Cleaner Bedrock-quota isolation (the [[project_nqu865_phase3_inflight]] context that triggered NQU-865 was CC's A/B scripts hammering the shared account-level Bedrock quota and putting live-user traffic at throttling risk).
  • Cleaner IAM isolation (per-account OIDC trust + per-account roles, not per-prefix policy carve-outs).
  • Reuses already-provisioned account (NQU-644 sub-account, SSO already wired, TF state backend already ready).
  • Forces the conceptual separation: every CC operation in the staging account is account-scoped, not policy-scoped — fewer places for blast-radius creep.
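
One practical consequence of account-scoped isolation is that a script can refuse to run unless its credentials resolve to the staging account. A minimal sketch, assuming the caller's account ID has already been obtained (for example from the `Account` field returned by AWS STS GetCallerIdentity); `require_staging` and `WrongAccountError` are hypothetical names, not part of the repository.

```python
STAGING_ACCOUNT_ID = "961381384763"  # from this document
PROD_ACCOUNT_ID = "760007728097"  # from this document


class WrongAccountError(RuntimeError):
    """Raised when a staging-only operation runs under other credentials."""


def require_staging(caller_account_id: str) -> None:
    """Abort unless the active credentials belong to the staging account."""
    if caller_account_id != STAGING_ACCOUNT_ID:
        where = "prod" if caller_account_id == PROD_ACCOUNT_ID else "an unknown account"
        raise WrongAccountError(
            f"staging-only operation attempted in {where} ({caller_account_id})"
        )


if __name__ == "__main__":
    require_staging(STAGING_ACCOUNT_ID)  # passes silently
    try:
        require_staging(PROD_ACCOUNT_ID)
    except WrongAccountError as err:
        print(err)
```

A check like this compares account IDs rather than resource names, so the dev-named prod resources cannot cause a false pass.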

Staging domain: staging.nquiry.ai (per R-7, behind CloudFront Basic Auth nquiry / preview2026).

Cognito: separate user pool per account (no cross-account sharing).

Cost: lower than original estimate — the staging account already exists, and staging compute uses smaller instance sizes (cf. NQU-865 Phase 3 implementation plan).


Historical State (pre-2026-05-22) — Single Environment ("dev")

Preserved for context. Until 2026-05-22, Inqura ran a single AWS environment (account 760007728097) that served production traffic at app.nquiry.ai. Resources were named invapp-dev-* throughout (a historical artifact — provisioning started with "dev" expecting to add staging/prod later). The naming actively misled CC + CD into treating changes as low-stakes when they were prod-facing; that conflation is the NQU-866 RCA's central finding. NQU-865 closed the gap by adding staging as a real account, not by renaming.

The pre-2026-05-22 footprint:

  • Terraform environment: environments/dev/
  • Resource prefix: invapp-dev (cluster, service, ECR repo, ALB, RDS, etc.)
  • State backend: s3://invapp-terraform-state-760007728097 key dev/terraform.tfstate
  • Domain: app.nquiry.ai via Route53 + CloudFront + ALB
  • ECS: invapp-dev-cluster / invapp-dev-app-service (1 task, 512 CPU / 1024 MB)
  • RDS: invapp-dev-postgres, single-AZ, db.t3.micro, deletion protection ON
  • Redis: cache.t3.micro, single node
  • WAF: Bot control in count (log-only) mode
  • CloudTrail: Data events disabled (cost savings)
  • CloudFront Basic Auth: Removed at F+F (2026-04-20, NQU-359)

Local development ran via npm run dev against a local PostgreSQL instance or a tunneled RDS connection (this last pattern is what NQU-867 replaced — the local docker stack now means local dev never touches prod).

Historical decision rationale (pre-2026-05-22)

| Factor | Decision (pre-2026-05-22) |
| --- | --- |
| Team size | 1 developer + AI agents. No need for environment isolation between team members. |
| User base | Pre-launch (until F+F 2026-04-20). Limited blast radius justified the single env. |
| Cost | Every duplicated environment doubles infrastructure spend for no current benefit. |
| Complexity | Managing state across environments adds operational burden with no payoff at this scale. |
| Speed | Single environment means faster iteration. No promotion gates to slow down shipping. |

What changed in May 2026: F+F was live, then NQU-728 (Sonnet 4.6 truncation) took the AI offline for real users; the bug could not have been caught pre-deploy because there was no pre-deploy environment. CC's NQU-623 A/B scripts hammered the shared account-level Bedrock quota and put real-user requests at throttling risk. The hard target date of 2026-06-01 for the first paying customer made the single-env blast radius untenable. NQU-865 (filed 2026-05-21) committed to the two-tier topology; Wave 2 went live 2026-05-22.