Deployment Flow

Last updated: 2026-05-25 (rewrote overview + per-job annotations for the two-tier topology post-NQU-865 Wave 2 + NQU-881 cutover)


Overview: two-tier deployment

Since 2026-05-22, code flows through two environments before reaching prod users. The on-push pipeline targets staging; prod is updated only by an explicit workflow_dispatch on promote-to-prod.yml.

git push main
  └─ deploy-staging ──► invapp-staging-* (961381384763, staging.nquiry.ai)
                              │
                              ▼
                     cycle-end smoke gate
                     (smoke-suite job; NQU-855; result → NQU-828)
                              │
                              ▼
                     promote-to-prod.yml (workflow_dispatch)
                              │
                              ▼
                     invapp-dev-* (760007728097, app.nquiry.ai)
                              ▲
                              │
                     rollback-prod.yml (workflow_dispatch; after-the-fact revert)

There are now two environments (see docs/reference/process/environment-strategy.md for the current state and docs/admin/ops/what-runs-where.md for the identity and resource map). Every merge to main auto-deploys to staging. Prod deploys go through the promotion gate (promote-to-prod.yml) rather than per PR; see docs/admin/ops/staging-promotion-and-rollback.md for the per-cycle ritual and layered rollback.

The pre-2026-05-22 single-env "every merge is a deploy" model is preserved in environment-strategy.md as Historical State.


CI/CD Pipeline (.github/workflows/ci.yml)

Trigger

  • Push to main: runs the full pipeline, including the staging deploy (deploy-staging); prod is not deployed from this workflow
  • Pull request to main: runs lint, build, security scan, and tests (no deploy)

Jobs

1. lint-and-build (all pushes and PRs)

Runs in parallel with security-scan. Concurrency group cancels superseded runs.

  1. Checkout code
  2. Setup Node.js (version from .nvmrc)
  3. npm ci (falls back to npm install)
  4. npm run lint
  5. npm run type-check
  6. npm run build

2. security-scan (all pushes and PRs)

Runs Semgrep with rulesets: p/typescript, p/security-audit, p/secrets, p/eslint-plugin-security. Fails on findings (--error).

3. test (all pushes and PRs)

Depends on lint-and-build. Runs npm test -- --coverage. Coverage threshold check (70%) is currently warn-only.

4. e2e-tests (main branch push only)

Depends on lint-and-build + test. Spins up a pgvector/pgvector:pg16 service container, bootstraps the CI database (scripts/ci-bootstrap-db.sql), runs migrations (npm run db:migrate), builds the app, and runs Playwright E2E tests against the standalone build. Uses real Cognito credentials from GitHub secrets.

5. deploy (PROD path; gated off post-NQU-881 PR 2, 2026-05-23)

Status: disabled with if: false in ci.yml since the NQU-881 PR 2 cutover landed on 2026-05-23. The job is preserved in place (not deleted) to keep git blame for the build logic until the follow-up PR replaces it with the reusable promote-to-prod path.

Current prod-deploy mechanism: .github/workflows/promote-to-prod.yml workflow_dispatch (inputs: validated sha + reason). See docs/admin/ops/staging-promotion-and-rollback.md Part 1 for the per-cycle ritual and the standard-vs-CLI-fallback paths.
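As a rough illustration, a promotion could be dispatched from a terminal along the lines below. This is a sketch under assumptions: the GitHub CLI (gh) is installed and authenticated, and the workflow inputs are literally named sha and reason as listed in this document. The helper names is_full_sha and promote, and the rule that the SHA must be a full 40-character lowercase hex string, are illustrative and not taken from the workflow.

```shell
# Sketch only: validate inputs locally, then dispatch promote-to-prod.yml.
# is_full_sha and promote are hypothetical helper names.

# is_full_sha <sha>: succeeds only for a 40-character lowercase hex string.
is_full_sha() {
  case "${1:-}" in
    *[!0-9a-f]*) return 1 ;;   # contains a non-hex character
  esac
  sha_arg="${1:-}"
  [ "${#sha_arg}" -eq 40 ]
}

# promote <sha> <reason>
promote() {
  if ! is_full_sha "${1:-}"; then
    echo "refusing to promote: '${1:-}' is not a full 40-character SHA" >&2
    return 1
  fi
  if [ -z "${2:-}" ]; then
    echo "refusing to promote: a reason is required" >&2
    return 1
  fi
  # gh workflow run passes each -f key=value as a workflow_dispatch input.
  gh workflow run promote-to-prod.yml -f "sha=$1" -f "reason=$2"
}
```

A call would then look like promote "$(git rev-parse HEAD)" "cycle green on staging", run from a checkout at the validated commit.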

The promote workflow does the equivalent of steps 1–8 below against prod (account 760007728097, cluster invapp-dev-cluster, service invapp-dev-app-service), wrapped with: pre-promotion RDS snapshot first, image-reuse when the SHA-tagged image is already in prod ECR (otherwise build), deep-health check after wait-for-stable, auto-rollback to the previous task-def revision on health failure, and NQU-828 result post.
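The image-reuse decision described above can be sketched as follows. This is not the workflow's actual code: image_in_ecr and image_action are hypothetical names, and treating every describe-images failure (including auth or network errors) as "image missing" is a simplification a real workflow would probably want to refine.

```shell
# Sketch only: decide between reusing an existing image and building one.
# image_in_ecr and image_action are hypothetical names.

# image_in_ecr <repository> <tag>: succeeds if the tag exists in the repo.
# describe-images exits non-zero when the requested image is not found.
image_in_ecr() {
  aws ecr describe-images \
    --repository-name "$1" \
    --image-ids "imageTag=$2" >/dev/null 2>&1
}

# image_action <repository> <sha>: prints "reuse" or "build".
image_action() {
  if image_in_ecr "$1" "$2"; then
    echo reuse
  else
    echo build
  fi
}
```

For prod this would be called as image_action invapp-dev-app "$SHA" with prod credentials already configured.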

Historical steps (preserved for context — these are now executed by promote-to-prod.yml, not by this deploy job):

  1. Change detection: Compares HEAD~1..HEAD. If only docs/tests/config changed, skips the deploy entirely.

  2. AWS auth: OIDC federation via aws-actions/configure-aws-credentials@v4. Role ARN stored in AWS_ROLE_ARN GitHub secret. No long-lived IAM keys.

  3. ECR login: aws-actions/amazon-ecr-login@v2

  4. Docker build + push: Multi-stage build (see Dockerfile):

    • Stage 1 (deps): npm ci in Alpine Node 24.12
    • Stage 2 (builder): Copy deps, copy source, npm run build. NEXT_PUBLIC_* vars passed as build args (baked into client JS). Server-side secrets are NOT build args.
    • Stage 3 (runner): Alpine Node 24.12, copies standalone output only, runs as non-root nextjs user.
    • Image tagged with git SHA: <ecr-registry>/invapp-dev-app:<sha>

  5. Register task definition: Fetches current ECS task definition, updates container image to new SHA tag, upserts environment variables (quality model config), registers new revision.

  6. Update ECS service: Points service at new task definition revision, forces new deployment. If desired count is 0, sets it to 1.

  7. Wait for stabilization: aws ecs wait services-stable blocks until the new task is healthy.

  8. Smoke test: Curls https://app.nquiry.ai/api/health and checks for HTTP 200.
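The change-detection step (step 1 above) can be pictured with the sketch below. The path patterns are assumptions chosen for illustration; this document does not list the patterns the pipeline actually uses, and needs_deploy is a hypothetical name.

```shell
# Sketch only: decide whether a commit needs a deploy from its changed files.
# The skip patterns below are assumptions, not the patterns used in CI.

# needs_deploy: reads one path per line on stdin; succeeds as soon as a path
# falls outside the docs/tests/config patterns, fails if every path is skipped.
needs_deploy() {
  while IFS= read -r file || [ -n "$file" ]; do
    case "$file" in
      "") ;;                                        # blank line
      docs/*|*.md) ;;                               # documentation
      tests/*|e2e/*|*.test.ts|*.spec.ts) ;;         # tests
      .eslintrc*|.prettierrc*|.editorconfig) ;;     # tooling config
      *) return 0 ;;                                # anything else is deployable
    esac
  done
  return 1
}

# In CI this would be fed by git, for example:
#   git diff --name-only HEAD~1 HEAD | needs_deploy && echo "deploy"
```

Failing closed (deploy unless every file matches a skip pattern) keeps an unrecognised path from silently skipping a deploy.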

6. deploy-staging (STAGING path; the on-push deploy)

Depends on lint-and-build + test + security-scan + e2e-tests. Runs only on push to main (if: github.ref == 'refs/heads/main' && github.event_name == 'push'). Has its own concurrency group (deploy-staging-${{ github.ref }}) with cancel-in-progress: false.

Environment targets (account 961381384763):

  • ECR_REPOSITORY: invapp-staging-app
  • ECS_CLUSTER: invapp-staging-cluster
  • ECS_SERVICE: invapp-staging-app-service

Steps — same shape as the historical deploy job (steps 1–8 above) but targeted at the staging account. Smoke test hits https://staging.nquiry.ai/api/health (basic-auth required: nquiry / preview2026 per R-7).
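A health probe of this kind might look like the following sketch. The function names http_status and check_health are hypothetical, and the credentials are passed in by the caller (for example from a CI secret) rather than written into the script.

```shell
# Sketch only: probe a health endpoint and require HTTP 200.
# http_status and check_health are hypothetical names.

# http_status <url> [user:password]: prints the HTTP status code.
http_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' -u "$2" "$1"
  else
    curl -s -o /dev/null -w '%{http_code}' "$1"
  fi
}

# check_health <url> [user:password]: succeeds only on HTTP 200.
check_health() {
  code=$(http_status "$1" "${2:-}")
  if [ "$code" = "200" ]; then
    echo "healthy: $1"
  else
    echo "unhealthy: $1 returned ${code:-no response}" >&2
    return 1
  fi
}
```

Against staging this would be check_health https://staging.nquiry.ai/api/health "$STAGING_BASIC_AUTH", where STAGING_BASIC_AUTH is an assumed environment variable holding user:password.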

7. smoke-suite (STAGING; cycle-end gate, NQU-855)

Fires on workflow_dispatch or when a [cycle-close] prefix leads the commit subject on main. Runs the NQU-855 Cluster A real-provider smoke + Cluster B pipeline-invariants suite against staging. Posts the 4-row pass/fail result to NQU-828 (CC-attributed).

This is the per-cycle "is staging actually green?" check that gates promote-to-prod.
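The commit-subject trigger amounts to a prefix test. A minimal sketch, assuming the prefix is matched literally and case-sensitively at the very start of the subject (is_cycle_close is a hypothetical name; the real condition lives in the workflow file):

```shell
# Sketch only: test whether a commit subject starts with "[cycle-close]".
# is_cycle_close is a hypothetical name.
is_cycle_close() {
  case "${1:-}" in
    "[cycle-close]"*) return 0 ;;   # quoted so the brackets match literally
    *) return 1 ;;
  esac
}
```

Locally, is_cycle_close "$(git log -1 --format=%s)" reports whether the latest commit would fire the gate under these assumptions.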

Key CI Environment Variables

These are the env values for the historical deploy job (prod) and the current deploy-staging job. Prod is now driven by promote-to-prod.yml, which reads the prod env from its own workflow file.

| Variable | deploy (prod, gated off) | deploy-staging (staging, live) | Purpose |
| --- | --- | --- | --- |
| AWS_ROLE_ARN | GitHub secret (prod OIDC role) | GitHub secret (staging OIDC role) | OIDC role for per-account AWS access |
| ECR_REPOSITORY | invapp-dev-app (760007728097) | invapp-staging-app (961381384763) | ECR repo name |
| ECS_CLUSTER | invapp-dev-cluster (760007728097) | invapp-staging-cluster (961381384763) | ECS cluster name |
| ECS_SERVICE | invapp-dev-app-service (760007728097) | invapp-staging-app-service (961381384763) | ECS service name |
| COGNITO_USER_POOL_ID | GitHub secret | GitHub secret (separate pool per account) | Auth config (baked into client build) |
| COGNITO_CLIENT_ID | GitHub secret | GitHub secret (separate pool per account) | Auth config (baked into client build) |

Other Workflows

  • promote-to-prod.yml: Stage 9b. workflow_dispatch (inputs: sha, reason). Standard prod-deploy path since 2026-05-23 (NQU-881 PR 2). Concurrency group prod-promote (shared with rollback). See docs/admin/ops/staging-promotion-and-rollback.md Part 1.
  • rollback-prod.yml: Layer-1 application rollback (< 15 min RTO). workflow_dispatch (inputs: optional to-revision, reason). Same prod-promote concurrency group. See docs/admin/ops/staging-promotion-and-rollback.md Part 2.
  • eval-check.yml: Evaluation/quality checks (separate from deploy)
  • retention-cron.yml: Scheduled retention/cleanup tasks

Infrastructure Architecture

Route53 (app.nquiry.ai)
-> CloudFront (CDN + WAF + Basic Auth gate)
-> ALB (HTTPS termination, health checks)
-> ECS Fargate (private subnet, 1 task)
-> RDS PostgreSQL 15 (private subnet, encrypted, pgvector)
-> ElastiCache Redis (private subnet, TLS + auth token)
-> S3 (evidence files, signed URLs)
-> Bedrock (Claude Sonnet 4, Haiku 4.5, Titan embeddings, Cohere rerank)
-> Cognito (auth, MFA enabled)

All compute and data resources are in private subnets. Outbound traffic (Bedrock, external APIs) goes through NAT Gateway.


Database Access

Via SSM Port Forwarding (Bastion)

The bastion is a t3.micro EC2 instance in a private subnet with SSM agent. No SSH keys, no inbound security group rules. Access is via AWS Systems Manager only.

Prerequisites:

  • AWS CLI v2
  • Session Manager plugin installed (brew install --cask session-manager-plugin on macOS)
  • IAM permissions for ssm:StartSession

Get the bastion instance ID:

# From terraform output
cd infrastructure/terraform/environments/dev
terraform output bastion_instance_id

# Or find it in AWS console: EC2 -> Instances -> invapp-dev-bastion

Start a port forwarding session to RDS:

aws ssm start-session \
--target <bastion-instance-id> \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{
"host": ["<rds-endpoint>"],
"portNumber": ["5432"],
"localPortNumber": ["5433"]
}'

This forwards localhost:5433 to the RDS instance on port 5432. Keep this terminal open.

Connect with psql:

psql -h localhost -p 5433 -U app_admin -d investigation_app

Get RDS endpoint:

cd infrastructure/terraform/environments/dev
terraform output database_endpoint
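The lookups and the tunnel command above can be combined into a small helper. This is a sketch: ssm_forward_params and open_db_tunnel are hypothetical names, and it assumes the database_endpoint output is a bare hostname with no :port suffix.

```shell
# Sketch only: build the --parameters JSON and open the tunnel in one call.
# ssm_forward_params and open_db_tunnel are hypothetical names.

# ssm_forward_params <rds-host> [remote-port] [local-port]
ssm_forward_params() {
  printf '{"host":["%s"],"portNumber":["%s"],"localPortNumber":["%s"]}\n' \
    "$1" "${2:-5432}" "${3:-5433}"
}

# open_db_tunnel <bastion-instance-id> <rds-host>
open_db_tunnel() {
  aws ssm start-session \
    --target "$1" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "$(ssm_forward_params "$2")"
}
```

From infrastructure/terraform/environments/dev, a call could be open_db_tunnel "$(terraform output -raw bastion_instance_id)" "$(terraform output -raw database_endpoint)".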

Running Migrations

Migrations use the custom runner at scripts/run-migration.ts, NOT the Supabase CLI. The runner connects via pg and tracks applied migrations in a _migrations table.

Locally (against local DB):

npm run db:migrate # Run all pending
npm run db:migrate:run <file> # Run specific file

Against production RDS (via bastion tunnel):

  1. Start the SSM port forwarding session (see above)
  2. Set environment variables pointing to the tunnel:
DB_HOST=localhost DB_PORT=5433 DB_NAME=investigation_app \
DB_USER=app_admin DB_PASSWORD=<password> DB_SSL=true \
npm run db:migrate

Or for a specific migration:

DB_HOST=localhost DB_PORT=5433 DB_NAME=investigation_app \
DB_USER=app_admin DB_PASSWORD=<password> DB_SSL=true \
npm run db:migrate:run db/migrations/20260327000000_example.sql

Creating a new migration:

touch db/migrations/$(date +%Y%m%d%H%M%S)_migration_name.sql
# Edit the file, then run it
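To keep file names consistent, the touch command above could be wrapped in a helper like this sketch. migration_path is a hypothetical name, and the slug rule (lowercase, runs of other characters collapsed to one underscore) is an assumed convention, not one stated by the runner.

```shell
# Sketch only: print a timestamped migration path in the naming scheme above.
# migration_path is a hypothetical helper name.

# migration_path <name>: lowercases the name and replaces any run of
# characters outside [a-z0-9] with a single underscore.
migration_path() {
  slug=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '_')
  printf 'db/migrations/%s_%s.sql\n' "$(date +%Y%m%d%H%M%S)" "$slug"
}
```

Usage would be touch "$(migration_path "add users table")", followed by editing and running the file as described above.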

ECS Exec (Container Shell)

For debugging the running container:

aws ecs execute-command \
--cluster invapp-dev-cluster \
--task <task-id> \
--container invapp-dev-app \
--interactive \
--command "/bin/sh"

enable_execute_command = true is set in the ECS module.
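The <task-id> placeholder can be filled by asking ECS for a running task of the service. The sketch below assumes the service has at least one running task and simply takes the first one returned; first_task_arn and app_shell are hypothetical names.

```shell
# Sketch only: look up a running task and open a shell in it.
# first_task_arn and app_shell are hypothetical names.

# first_task_arn <cluster> <service>: prints the ARN of one running task,
# or "None" when the service has no running tasks.
first_task_arn() {
  aws ecs list-tasks \
    --cluster "$1" \
    --service-name "$2" \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text
}

# app_shell <cluster> <service> <container>
app_shell() {
  task=$(first_task_arn "$1" "$2")
  if [ -z "$task" ] || [ "$task" = "None" ]; then
    echo "no running task found for $2 in $1" >&2
    return 1
  fi
  aws ecs execute-command \
    --cluster "$1" --task "$task" --container "$3" \
    --interactive --command "/bin/sh"
}
```

With the names used in this document, that is app_shell invapp-dev-cluster invapp-dev-app-service invapp-dev-app.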


Secrets Management

Server-side secrets are stored in AWS Secrets Manager and injected into ECS tasks at runtime via the task definition's secrets block. They are NOT baked into the Docker image.

Secrets managed:

  • DB_PASSWORD
  • REDIS_AUTH_TOKEN
  • STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET
  • RESEND_API_KEY
  • NEXT_PUBLIC_SENTRY_DSN
  • CRON_SECRET

To update a secret, modify it in AWS Secrets Manager and force a new ECS deployment. The task definition references the secret rather than storing its value, so the new value is pulled the next time a task starts; tasks that are already running keep the old value.
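The update-then-redeploy sequence might be scripted as in the sketch below. It assumes each secret is stored as a plain string (put-secret-value replaces the whole value), and update_secret_and_redeploy is a hypothetical name. The secret id is whatever name or ARN the secret has in Secrets Manager; this document does not list them.

```shell
# Sketch only: write a new secret value, then restart tasks to pick it up.
# update_secret_and_redeploy is a hypothetical name.

# update_secret_and_redeploy <secret-id> <new-value> <cluster> <service>
update_secret_and_redeploy() {
  aws secretsmanager put-secret-value \
    --secret-id "$1" \
    --secret-string "$2" >/dev/null || return 1
  # Same task definition, fresh tasks: the new value is read at task start.
  aws ecs update-service \
    --cluster "$3" \
    --service "$4" \
    --force-new-deployment >/dev/null || return 1
  # Block until the replacement tasks are healthy.
  aws ecs wait services-stable --cluster "$3" --services "$4"
}
```

Passing the value as an argument leaves it in shell history and the process list, so reading it from a prompt or a file may be preferable in practice.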


Rollback

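Rollback is layered (see docs/admin/ops/staging-promotion-and-rollback.md Part 2):

  1. Automatic: promote-to-prod.yml rolls back to the previous task-def revision when the post-deploy deep-health check fails.
  2. rollback-prod.yml: Layer-1 application rollback via workflow_dispatch (inputs: optional to-revision, reason).
  3. Manual fallback: identify the last known-good git SHA, then update the ECS service to use the task definition revision that used that SHA's image.
  4. Or: revert the commit on main. CI redeploys staging only; prod still needs a promote-to-prod.yml run.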

ECR retains all pushed images (tagged by git SHA), so any previous version can be deployed without rebuilding.
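For the manual path, the task definition revision that used a given SHA's image has to be found first. The sketch below walks the revisions newest-first. It assumes the app container is the first container in each task definition and that images are tagged <repo>:<sha>; revision_for_sha is a hypothetical name, and the task definition family is a parameter because this document does not state it.

```shell
# Sketch only: find the task definition revision that ran a given image SHA.
# revision_for_sha is a hypothetical name.

# revision_for_sha <family> <sha>: prints the matching task definition ARN,
# or fails if no revision references an image tagged with that SHA.
revision_for_sha() {
  for arn in $(aws ecs list-task-definitions \
      --family-prefix "$1" --sort DESC \
      --query 'taskDefinitionArns' --output text); do
    image=$(aws ecs describe-task-definition \
      --task-definition "$arn" \
      --query 'taskDefinition.containerDefinitions[0].image' \
      --output text)
    case "$image" in
      *":$2") echo "$arn"; return 0 ;;
    esac
  done
  return 1
}
```

The result can then be passed to aws ecs update-service --cluster invapp-dev-cluster --service invapp-dev-app-service --task-definition <arn>.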