Migrating to AWS with Automated Disaster Recovery

June 30, 2026

This is the implementation half of a two-part disaster recovery series.

If you are deciding what you should build first, start with Designing Disaster Recovery on AWS. That guide covers requirements, RTO/RPO, cost ranges, pros and cons, and cheaper alternatives such as RDS PostgreSQL with Lambda or Step Functions automation.

This post covers what I actually built, why those choices fit this workload, and where I accepted cost or complexity to get a lower recovery time.


Legend

All names below are generic placeholders for an example workload: a hypothetical event ticketing platform I will call the platform. None of these are real service, repository, environment, account, or hostname names.

Name | Example role
the platform | A hypothetical event ticketing product used to illustrate the architecture
webapp | User-facing web application
partner-dashboard | Dashboard for partners and organizers
admin | Internal administration tooling
marketing | Public marketing site
api-web | Public API and WebSocket request handlers
api-worker | Background job consumers
api-cron | Scheduled task runner
notifications | Email and SMS delivery service
short-links | URL redirect service
Primary | The primary AWS region under normal operation
DR | The disaster recovery AWS region running warm standby

Region names, account identifiers, hostnames, CIDR ranges, and concrete operational thresholds are intentionally omitted.


Why this migration happened

The platform originally ran almost entirely on a single PaaS: Postgres, Redis, compute, and self-hosted observability. That was the right tradeoff early on. It kept the team moving quickly and avoided infrastructure work before the product needed it.

The problem became clear during an outage where every service was healthy internally, but customers could not reach the app. The provider control plane was the single point of failure. The app was up; the path to the app was not.

That incident changed the requirement from "host the app somewhere reliable" to "control the failure domains and prove the recovery path."

The target requirements were:

  • Multi-AZ in the primary region
  • Warm standby compute in a second region
  • Low single-digit-minute RTO for a full regional outage
  • Near-zero manual action during failover
  • After cutover, all changes go through Terraform and CI only
  • Cost discipline, not active-active enterprise architecture

The result is a Terraform-managed AWS stack with ECS Fargate, Aurora PostgreSQL Global Database, independent Redis per region, CloudFront, Global Accelerator, and a Step Functions plus Lambda DR orchestrator.


Final architecture

Traffic routing is split by workload type.

The important design choice is that the DNS provider is not in the failover path. DNS points to AWS ingress. CloudFront and Global Accelerator handle regional routing inside AWS.


Why CloudFront and Global Accelerator both exist

The web frontends and the API have different failure behavior.

Workload | Ingress | Reasoning
SSR frontends | CloudFront origin group | Per-request origin failover and cached static assets at the edge
REST API | Global Accelerator | Fast regional reroute without DNS propagation delay
WebSockets | Global Accelerator | Better behavior for long-lived connections and DNS-caching clients
DNS provider | DNS only | Avoids depending on another proxy control plane during failover

An earlier version used a provider-managed load balancer for multi-origin steering. I removed it because it reintroduced the same class of dependency the migration was supposed to eliminate.

The cost tradeoff is that Global Accelerator adds another fixed monthly charge. I accepted that because API and WebSocket recovery was more important than minimizing ingress cost.
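To make "per-request origin failover" concrete, here is a simplified model of how an origin group decides which origin serves a request. It is a sketch, not CloudFront's implementation: the function name, the `Origin` callable type, and the specific set of failover status codes are assumptions for illustration. The method restriction reflects documented CloudFront behavior, where origin failover applies only to GET, HEAD, and OPTIONS requests.

```python
from typing import Callable, Optional, Tuple

# Hypothetical failover criteria; a real origin group is configured with
# its own list of status codes.
FAILOVER_STATUS_CODES = frozenset({500, 502, 503, 504})
# Origin failover only applies to these viewer request methods.
FAILOVER_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})

# An origin is modeled as a callable: (method, path) -> HTTP status code.
Origin = Callable[[str, str], int]


def fetch_with_origin_group(
    method: str, path: str, primary: Origin, secondary: Origin
) -> Tuple[str, int]:
    """Return (origin_used, status) for a single request."""
    status: Optional[int]
    try:
        status = primary(method, path)
    except ConnectionError:
        status = None  # a failed connection also counts as a failover trigger

    primary_failed = status is None or status in FAILOVER_STATUS_CODES
    if primary_failed and method in FAILOVER_METHODS:
        return "secondary", secondary(method, path)
    if status is None:
        raise ConnectionError("primary unreachable and method not eligible for failover")
    return "primary", status
```

In this model a failing POST is returned to the client as-is rather than retried against the secondary origin, which is worth keeping in mind when deciding which workloads sit behind an origin group.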


Application topology

The platform runs several ECS Fargate services using the same Docker images that previously deployed to the PaaS.

Service | Ingress | DR posture
webapp | CloudFront to ALB | Warm in DR
partner-dashboard | CloudFront to ALB | Warm in DR
admin | CloudFront to ALB | Minimal DR capacity
marketing | CloudFront to ALB | Warm or edge-cached
api-web | Global Accelerator to ALB | Warm in DR
api-worker | Internal queue | Scaled up during failover
api-cron | Internal singleton | Runs only in active region
notifications | Private service discovery | Scaled up during failover
short-links | ALB path | Warm in DR

The API is one Docker image deployed as three ECS services:

ROLE=web     # HTTP handlers and WebSocket, behind ALB
ROLE=worker  # queue consumers
ROLE=cron    # singleton scheduler

This split matters during DR. The user-facing path stays warm, but background workers and cron do not need to burn full standby capacity every day. During failover, the orchestrator scales the right services in the DR region and prevents cron from running in both regions at once.
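One way to express that scaling rule is a small table of desired task counts per service, keyed by whether a region is currently active. This is a minimal sketch: the `ServicePlan` type, the function names, and every count below are hypothetical, since the real thresholds are intentionally omitted from this post.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServicePlan:
    standby: int  # desired task count in a region that is not active
    active: int   # desired task count in the active region


# Hypothetical counts for illustration only.
PLANS: Dict[str, ServicePlan] = {
    "api-web": ServicePlan(standby=1, active=2),     # serving path stays warm
    "api-worker": ServicePlan(standby=0, active=2),  # scaled up during failover
    "api-cron": ServicePlan(standby=0, active=1),    # singleton, active region only
}


def desired_counts(region: str, active_region: str) -> Dict[str, int]:
    """Desired ECS task count per service for `region`."""
    is_active = region == active_region
    return {
        name: (plan.active if is_active else plan.standby)
        for name, plan in PLANS.items()
    }


def cron_is_singleton(regions: List[str], active_region: str) -> bool:
    """True when exactly one api-cron task is desired across all regions."""
    total = sum(desired_counts(r, active_region)["api-cron"] for r in regions)
    return total == 1
```

Deriving both regions' counts from a single `active_region` value means the "cron runs in exactly one region" invariant can be checked before any scaling call is made.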


Infrastructure as Code

Everything lives in a dedicated infrastructure repository managed with Terraform.

Repository layout

infrastructure/
  terraform/              Terraform modules
    bootstrap/            One-time: state bucket + lock table
    modules/              reusable modules
    environments/
      stage/              staging
      prod-primary/       production (primary)
      prod-dr/            production (DR)
  secrets/
    manifest.json         Machine-readable secret key schema
    templates/            Placeholder JSON per environment
  scripts/                operational shell scripts
  .github/workflows/      deploy, rollback, failback-verify

The Terraform state backend uses S3 with DynamoDB locking. State is replicated cross-region so infrastructure can still be managed during a primary-region incident.

The main modules are:

Module | Purpose
vpc | VPC, subnets, NAT, endpoints, routing
ecs-fargate-service | Task definitions, services, autoscaling policies
aurora-postgres | Aurora Global Database primary and secondary clusters
elasticache-redis | Independent Redis nodes in each region
cloudfront-alb-app | CloudFront distribution and origin group failover
global-accelerator-alb | Global Accelerator endpoint groups
dr-orchestrator | Step Functions, Lambda, locks, and failover workflow
github-oidc | CI roles without long-lived AWS keys
secrets-data | Secret ARN lookups without storing values in Terraform state

The key rule is that Terraform references secret ARNs, never secret values. Real values live in Secrets Manager and are injected during deploy.


Database choice: why Aurora Global

The most expensive decision was the database.

Production uses Aurora PostgreSQL Global Database with Serverless v2 autoscaling:

  • primary region has the writer cluster and an in-region reader
  • DR region has a read-only secondary cluster under normal operation
  • DR promotion is handled through Aurora global failover

I chose this, even though it can easily cost a few hundred dollars per month, because its operational behavior matched the recovery requirement.

Need | Why Aurora Global helped
Low RTO | The DR database is already replicated and ready to promote
Low RPO | Replication lag is usually very small for this workload
Cleaner failback | Aurora owns more of the demote and reattach workflow
Terraform stability | Failover does not turn the replica into a permanently detached one-off resource

The cheaper alternative would have been RDS PostgreSQL with a cross-region read replica or snapshot restore, plus Lambda or Step Functions to promote, update app config, scale DR compute, and rebuild replication later.

That alternative is valid. I did not choose it because every failover drill would have produced more manual database recovery and Terraform reconciliation work. For this platform, lower recovery friction was worth the extra monthly database cost.

Stage does not use Aurora Global. Stage uses a single small RDS PostgreSQL instance because staging only needs to validate modules and drills, not carry production-grade regional availability.


Redis choice: independent per region

Each region has its own ElastiCache Redis node. There is no cross-region Redis replication.

Redis use | DR decision
Sessions | Users may need to log in again after regional failover
Cache | Repopulates on demand
WebSocket adapter state | Clients reconnect
Rate limits | Can reset during a regional disaster

This was a cost decision. ElastiCache Global Datastore would have improved Redis RPO, but the product did not need sub-second session replication. Durable state is in Postgres. Losing Redis state during a rare regional event is acceptable.


Networking choices

Each environment gets an isolated VPC with non-overlapping CIDR ranges.
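The non-overlap rule is easy to check mechanically before any VPC is created. The sketch below uses Python's standard `ipaddress` module; the function name and the example ranges are hypothetical, because the real CIDR ranges are intentionally omitted.

```python
from ipaddress import ip_network
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of environment names whose CIDR ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical private ranges, one per environment.
EXAMPLE_CIDRS = {
    "stage": "10.10.0.0/16",
    "prod-primary": "10.20.0.0/16",
    "prod-dr": "10.30.0.0/16",
}
```

A check like this could run in CI against the environment definitions, so an overlapping range is rejected before it blocks peering or cross-region routing later.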

Within each VPC:

  • public subnets hold ALBs and NAT gateways
  • private subnets hold ECS tasks
  • data subnets hold Aurora and ElastiCache

VPC endpoints exist for services like ECR, Secrets Manager, SSM, S3, queues, KMS, and CloudWatch Logs. The reason is not only security. Endpoints also reduce dependency on public internet paths from private workloads.

Each environment uses a single NAT gateway. More NAT gateways would improve AZ-level resilience for outbound internet access, but the added cost was not worth it for this workload.


Secrets and deploy flow

CI authenticates to AWS through OIDC federation. There are no long-lived AWS access keys in CI.

The production deploy flow is:

Secrets are organized by environment and consumer:

Scope | Used by
platform | Shared infrastructure configuration
backend | API and internal services
webapp | User-facing frontend
partner-dashboard | Partner dashboard
admin | Internal admin tooling
observability | Telemetry pipeline

At deploy time, a script converts Secrets Manager JSON into environment files consumed by ECS task definitions. During failover, the same export path rewrites region-specific values such as datastore endpoints, queue URLs, and active-region flags.

That indirection is important. The app does not need a separate secrets schema for DR, but the active runtime config can still point at DR resources when needed.
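The export step described above can be sketched as a pure function that turns a flat JSON secret into `VARIABLE=VALUE` lines and applies region-specific overrides on top. This is an illustration under assumptions, not the actual script: the function name, the key names, and the hostnames in the usage example are hypothetical. Overrides are restricted to keys that already exist, which mirrors the rule that DR does not get its own schema.

```python
import json
from typing import Dict, Optional


def render_env_file(secret_json: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Render a flat JSON secret as VARIABLE=VALUE lines.

    `overrides` replaces region-specific values (for example, datastore
    endpoints) but may not introduce keys that are absent from the secret.
    """
    values = json.loads(secret_json)
    if not isinstance(values, dict):
        raise ValueError("secret must be a flat JSON object")

    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(values))
    if unknown:
        raise ValueError(f"overrides add keys outside the schema: {unknown}")

    merged = {**values, **overrides}
    lines = []
    for key in sorted(merged):
        value = str(merged[key])
        if "\n" in value or "=" in key:
            raise ValueError(f"cannot represent {key!r} in an env file")
        # No quoting: in ECS env files, quotes become part of the value.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

For example, the same secret can be rendered for either region:

```python
secret = '{"ACTIVE_REGION": "primary", "DATABASE_HOST": "db.primary.example.internal"}'
primary_env = render_env_file(secret)
dr_env = render_env_file(
    secret,
    {"ACTIVE_REGION": "dr", "DATABASE_HOST": "db.dr.example.internal"},
)
```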


Disaster recovery orchestrator

Regional failover is automated with Step Functions and Lambda.

The orchestrator focuses on the steps that are dangerous to do manually under pressure:

  • deciding that the primary region is actually unhealthy
  • preventing split brain with a global lock
  • checking replication lag before promotion
  • promoting the DR database
  • switching active-region configuration
  • scaling DR services up
  • making sure primary services stop writing after promotion

Failback is also automated, but it requires operator confirmation. Failover is about speed. Failback is about caution.
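The failover steps listed above can be sketched as a single guarded sequence. The real workflow runs as Step Functions states backed by Lambda; this Python version only illustrates the control flow. Every name here is hypothetical, the side effects are injected as callables so the ordering can be tested offline, and the thresholds are parameters because the real values are intentionally omitted. The step order follows the list above and is an assumption about the actual state machine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverDeps:
    """Injected side effects (hypothetical interface)."""
    failed_health_checks: Callable[[], int]       # consecutive failed probes of primary
    acquire_lock: Callable[[], bool]              # True only for the first caller
    replication_lag_seconds: Callable[[], float]  # current lag on the DR replica
    promote_dr_database: Callable[[], None]
    set_active_region: Callable[[str], None]
    scale_dr_services: Callable[[], None]
    fence_primary_writers: Callable[[], None]


def run_failover(deps: FailoverDeps, min_failed_checks: int, max_lag_seconds: float) -> str:
    """Run the failover steps in order and return a short outcome string."""
    if deps.failed_health_checks() < min_failed_checks:
        return "skipped: primary not confirmed unhealthy"
    if not deps.acquire_lock():
        return "skipped: another failover holds the lock"
    if deps.replication_lag_seconds() > max_lag_seconds:
        # A real workflow would also release the lock or escalate to an operator here.
        return "aborted: replication lag above threshold"
    deps.promote_dr_database()
    deps.set_active_region("dr")
    deps.scale_dr_services()
    deps.fence_primary_writers()
    return "completed"
```

Keeping the three guards ahead of any mutation means a false alarm, a duplicate trigger, or a lagging replica all exit before the database is promoted.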


Observability

Self-hosted observability on the old PaaS was retired. Production now uses a managed telemetry stack.

Signal | Path
Logs | Application logs to managed log store, with CloudWatch fallback
Metrics | Managed CloudWatch integration
Alerts | SNS topic to chat and on-call
Request correlation | Wide events with identifiers across services

One important cutover rule: production DNS does not move until observability is verified. A successful migration that removes your ability to see production is not successful.


Staging vs production

Staging deliberately costs less than production.

Dimension | Stage | Production
Regions | Primary only | Primary plus DR
Database | Single RDS PostgreSQL | Aurora Global Database
Redis | Single node | Independent nodes per region
API topology | Simpler service layout | Split web, worker, and cron roles
Ingress | Public ALB | CloudFront and Global Accelerator
ECS scale | Minimal and toggleable | Primary plus warm DR serving path
DR orchestrator | Drill and validation path | Full failover and failback automation

Stage exists to validate Terraform modules, run drills, and test deploy behavior. It does not need to mirror the production bill.


What I would repeat

The choices I would make again:

  1. Start with the failover workflow. VPCs and ECS services are straightforward compared with active-region switching, database promotion, and split-brain prevention.
  2. Keep DR compute partially warm. The serving path needs to recover quickly. Workers and cron can scale later.
  3. Use Aurora Global only where it pays for itself. It is expensive, but it removed a lot of failover and failback risk for the production database.
  4. Do not replicate Redis by default. Session loss during regional DR was an acceptable product tradeoff.
  5. Make Terraform the source of truth. Console fixes are tempting during migration, but drift is exactly what hurts during disaster recovery.
  6. Run drills before DNS cutover. Stage drills caught secret localization and ECS desired-count issues that would have been painful during a real incident.

Current status

The AWS infrastructure is applied and production is healthy in the primary region. Services respond on non-production test hostnames. The remaining step is live DNS cutover from the old platform to production hostnames after final verification.

Steady-state operations after cutover are automated: deploys, backups, scaling, and regional failover all run through CI or the orchestrator.

For the reusable planning framework behind these choices, read Designing Disaster Recovery on AWS.