Migrating to AWS with Automated Disaster Recovery

This is the implementation half of a two-part disaster recovery series.

If you are still deciding what to build, start with Designing Disaster Recovery on AWS. That guide covers requirements, RTO/RPO, cost ranges, pros and cons, and cheaper alternatives such as RDS PostgreSQL with Lambda or Step Functions automation.

This post covers what I actually built, why those choices fit this workload, and where I accepted cost or complexity to get a lower recovery time.

Legend

All names below are generic placeholders for an example workload: a hypothetical event ticketing platform I will call the platform. None of these are real service, repository, environment, account, or hostname names.

Name	Example role
the platform	A hypothetical event ticketing product used to illustrate the architecture
webapp	User-facing web application
partner-dashboard	Dashboard for partners and organizers
admin	Internal administration tooling
marketing	Public marketing site
api-web	Public API and WebSocket request handlers
api-worker	Background job consumers
api-cron	Scheduled task runner
notifications	Email and SMS delivery service
short-links	URL redirect service
Primary	The primary AWS region under normal operation
DR	The disaster recovery AWS region running warm standby

Region names, account identifiers, hostnames, CIDR ranges, and concrete operational thresholds are intentionally omitted.

Why this migration happened

The platform originally ran almost entirely on a single PaaS: Postgres, Redis, compute, and self-hosted observability. That was the right tradeoff early on. It kept the team moving quickly and avoided infrastructure work before the product needed it.

The problem became clear during an outage where every service was healthy internally, but customers could not reach the app. The provider control plane was the single point of failure. The app was up; the path to the app was not.

That incident changed the requirement from "host the app somewhere reliable" to "control the failure domains and prove the recovery path."

The target requirements were:

Multi-AZ in the primary region
Warm standby compute in a second region
Low single-digit-minute RTO for a full regional outage
Near-zero manual action during failover
After cutover, changes go through Terraform and CI only
Cost discipline, not active-active enterprise architecture

The result is a Terraform-managed AWS stack with ECS Fargate, Aurora PostgreSQL Global Database, independent Redis per region, CloudFront, Global Accelerator, and a Step Functions plus Lambda DR orchestrator.

Final architecture

Traffic routing is split by workload type.

The important design choice is that the DNS provider is not in the failover path. DNS points to AWS ingress. CloudFront and Global Accelerator handle regional routing inside AWS.

Why CloudFront and Global Accelerator both exist

The web frontends and the API have different failure behavior.

Workload	Ingress	Reasoning
SSR frontends	CloudFront origin group	Per-request origin failover and cached static assets at the edge
REST API	Global Accelerator	Fast regional reroute without DNS propagation delay
WebSockets	Global Accelerator	Better behavior for long-lived connections and DNS-caching clients
DNS provider	DNS only	Avoids depending on another proxy control plane during failover

An earlier version used a provider-managed load balancer for multi-origin steering. I removed it because it reintroduced the same class of dependency the migration was supposed to eliminate.

The cost tradeoff is that Global Accelerator adds another fixed monthly charge. I accepted that because API and WebSocket recovery was more important than minimizing ingress cost.
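To make "per-request origin failover" concrete, here is a minimal sketch of the decision an origin group makes for each request. It only mimics the behavior and is not CloudFront's implementation. The `Origin` callable, the `route_request` name, and the exact set of status codes that trigger failover are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed set of status codes that trigger failover to the next origin.
FAILOVER_STATUS_CODES = frozenset({500, 502, 503, 504})

# Hypothetical origin: takes a request path and returns an HTTP status code,
# or raises ConnectionError when the origin cannot be reached.
Origin = Callable[[str], int]


def route_request(path: str, origins: Sequence[Origin]) -> Tuple[int, int]:
    """Try origins in order and return (origin_index, status).

    The decision is made per request, so a healthy primary is used again
    on the very next request without any DNS change.
    """
    last_failure: Optional[Tuple[int, int]] = None
    for index, origin in enumerate(origins):
        try:
            status = origin(path)
        except ConnectionError:
            continue  # unreachable origin: try the next one
        if status not in FAILOVER_STATUS_CODES:
            return index, status
        last_failure = (index, status)  # remember the error response
    if last_failure is None:
        raise ConnectionError("no origin reachable")
    return last_failure


def primary_down(path: str) -> int:
    raise ConnectionError("primary unreachable")


def dr_healthy(path: str) -> int:
    return 200
```

Calling `route_request("/", [primary_down, dr_healthy])` serves the request from the second origin. Global Accelerator solves a different problem: it moves whole connections to a healthy endpoint group, which is why the API and WebSocket traffic use it instead.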

Application topology

The platform runs several ECS Fargate services using the same Docker images that previously deployed to the PaaS.

Service	Ingress	DR posture
webapp	CloudFront to ALB	Warm in DR
partner-dashboard	CloudFront to ALB	Warm in DR
admin	CloudFront to ALB	Minimal DR capacity
marketing	CloudFront to ALB	Warm or edge-cached
api-web	Global Accelerator to ALB	Warm in DR
api-worker	Internal queue	Scaled up during failover
api-cron	Internal singleton	Runs only in active region
notifications	Private service discovery	Scaled up during failover
short-links	ALB path	Warm in DR

The API is one Docker image deployed as three ECS services:

ROLE=web     # HTTP handlers and WebSocket, behind ALB
ROLE=worker  # queue consumers
ROLE=cron    # singleton scheduler

This split matters during DR. The user-facing path stays warm, but background workers and cron do not need to burn full standby capacity every day. During failover, the orchestrator scales the right services in the DR region and prevents cron from running in both regions at once.
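The scaling rule can be sketched as a lookup keyed on which region is active. The task counts below are hypothetical placeholders, not the platform's real capacity settings, and the function names are mine rather than the orchestrator's.

```python
from typing import Dict, Sequence

# Hypothetical task counts; real capacity numbers are intentionally omitted.
STANDBY_COUNTS: Dict[str, int] = {"api-web": 1, "api-worker": 0, "api-cron": 0}
ACTIVE_COUNTS: Dict[str, int] = {"api-web": 2, "api-worker": 2, "api-cron": 1}


def desired_counts(region: str, active_region: str) -> Dict[str, int]:
    """Return desired ECS task counts for `region`, given the active region."""
    source = ACTIVE_COUNTS if region == active_region else STANDBY_COUNTS
    return dict(source)  # copy so callers cannot mutate the shared tables


def cron_instances(active_region: str,
                   regions: Sequence[str] = ("primary", "dr")) -> int:
    """Count cron tasks across all regions; the invariant is exactly one."""
    return sum(desired_counts(r, active_region)["api-cron"] for r in regions)
```

Because every count derives from a single active-region value, flipping that value scales DR up and moves the cron singleton in one step, and `cron_instances` stays at one either way.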

Infrastructure as Code

Everything lives in a dedicated infrastructure repository managed with Terraform.

Repository layout

infrastructure/
  terraform/            Terraform modules
    bootstrap/          One-time: state bucket + lock table
    modules/            Reusable modules
    environments/
      stage/            Staging
      prod-primary/     Production (primary)
      prod-dr/          Production (DR)
  secrets/
    manifest.json       Machine-readable secret key schema
    templates/          Placeholder JSON per environment
  scripts/              Operational shell scripts
  .github/workflows/    Deploy, rollback, failback-verify

The Terraform state backend uses S3 with DynamoDB locking. State is replicated cross-region so infrastructure can still be managed during a primary-region incident.

The main modules are:

Module	Purpose
vpc	VPC, subnets, NAT, endpoints, routing
ecs-fargate-service	Task definitions, services, autoscaling policies
aurora-postgres	Aurora Global Database primary and secondary clusters
elasticache-redis	Independent Redis nodes in each region
cloudfront-alb-app	CloudFront distribution and origin group failover
global-accelerator-alb	Global Accelerator endpoint groups
dr-orchestrator	Step Functions, Lambda, locks, and failover workflow
github-oidc	CI roles without long-lived AWS keys
secrets-data	Secret ARN lookups without storing values in Terraform state

The key rule is that Terraform references secret ARNs, never secret values. Real values live in Secrets Manager and are injected during deploy.
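One way to enforce that rule is a small pre-apply check in CI that rejects any secret input that does not look like a Secrets Manager ARN. This is a sketch of the idea, not the check used in this repository. The `non_arn_inputs` name and the input mapping are hypothetical.

```python
import re
from typing import Dict, List

# General shape of a Secrets Manager ARN:
# arn:<partition>:secretsmanager:<region>:<account-id>:secret:<name>
SECRET_ARN = re.compile(
    r"^arn:aws[a-z-]*:secretsmanager:[a-z0-9-]+:\d{12}:secret:\S+$"
)


def non_arn_inputs(secret_inputs: Dict[str, str]) -> List[str]:
    """Return the names of inputs whose value is not a Secrets Manager ARN.

    An empty result means every secret input is a reference, so no secret
    value can end up in Terraform state through these variables.
    """
    return sorted(
        name for name, value in secret_inputs.items()
        if not SECRET_ARN.match(value)
    )
```

A CI step could fail the plan whenever the returned list is non-empty.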

Database choice: why Aurora Global

The most expensive decision was the database.

Production uses Aurora PostgreSQL Global Database with Serverless v2 autoscaling:

primary region has the writer cluster and an in-region reader
DR region has a read-only secondary cluster under normal operation
DR promotion is handled through Aurora global failover

I chose this, even though it can easily cost a few hundred dollars per month, because its operational behavior matched the recovery requirement.

Need	Why Aurora Global helped
Low RTO	The DR database is already replicated and ready to promote
Low RPO	Replication lag is usually very small for this workload
Cleaner failback	Aurora owns more of the demote and reattach workflow
Terraform stability	Failover does not turn the replica into a permanently detached one-off resource

The cheaper alternative would have been RDS PostgreSQL with a cross-region read replica or snapshot restore, plus Lambda or Step Functions to promote, update app config, scale DR compute, and rebuild replication later.

That alternative is valid. I did not choose it because every failover drill would have produced more manual database recovery and Terraform reconciliation work. For this platform, lower recovery friction was worth the extra monthly database cost.

Stage does not use Aurora Global. Stage uses a single small RDS PostgreSQL instance because staging only needs to validate modules and drills, not carry production-grade regional availability.

Redis choice: independent per region

Each region has its own ElastiCache Redis node. There is no cross-region Redis replication.

Redis use	DR decision
Sessions	Users may need to log in again after regional failover
Cache	Repopulates on demand
WebSocket adapter state	Clients reconnect
Rate limits	Can reset during a regional disaster

This was a cost decision. ElastiCache Global Datastore would have improved Redis RPO, but the product did not need sub-second session replication. Durable state is in Postgres. Losing Redis state during a rare regional event is acceptable.

Networking choices

Each environment gets an isolated VPC with non-overlapping CIDR ranges.

Within each VPC:

public subnets hold ALBs and NAT gateways
private subnets hold ECS tasks
data subnets hold Aurora and ElastiCache

VPC endpoints exist for services like ECR, Secrets Manager, SSM, S3, queues, KMS, and CloudWatch Logs. The reason is not only security. Endpoints also reduce dependency on public internet paths from private workloads.

Each environment uses a single NAT gateway. More NAT gateways would improve AZ-level resilience for outbound internet access, but the added cost was not worth it for this workload.

Secrets and deploy flow

CI authenticates to AWS through OIDC federation. There are no long-lived AWS access keys in CI.

The production deploy flow runs entirely through CI, which exports secrets at deploy time as described below.

Secrets are organized by environment and consumer:

Scope	Used by
platform	Shared infrastructure configuration
backend	API and internal services
webapp	User-facing frontend
partner-dashboard	Partner dashboard
admin	Internal admin tooling
observability	Telemetry pipeline

At deploy time, a script converts Secrets Manager JSON into environment files consumed by ECS task definitions. During failover, the same export path rewrites region-specific values such as datastore endpoints, queue URLs, and active-region flags.

That indirection is important. The app does not need a separate secrets schema for DR, but the active runtime config can still point at DR resources when needed.
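A minimal sketch of that export step, assuming the secret is a flat JSON object and the output is a plain `KEY=VALUE` environment file. The function name, the key names in the usage example, and the override mechanism are illustrative assumptions, not the actual script.

```python
import json
from typing import Dict, Optional


def render_env_file(secret_json: str,
                    overrides: Optional[Dict[str, str]] = None) -> str:
    """Convert a flat JSON secret into KEY=VALUE lines.

    `overrides` carries region-specific values (for example datastore
    endpoints or an active-region flag) that replace the stored ones
    during failover, so the secret schema itself never changes.
    """
    values = json.loads(secret_json)
    if not isinstance(values, dict):
        raise ValueError("secret must be a JSON object")
    merged = {**values, **(overrides or {})}
    lines = []
    for key in sorted(merged):  # sorted for stable, diffable output
        value = str(merged[key])
        if "\n" in value:
            raise ValueError(f"{key} contains a newline")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

For example, `render_env_file('{"ACTIVE_REGION": "primary"}', {"ACTIVE_REGION": "dr"})` returns `ACTIVE_REGION=dr` followed by a newline. The same schema is used in both regions and only the values differ.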

Disaster recovery orchestrator

Regional failover is automated with Step Functions and Lambda.

The orchestrator focuses on the steps that are dangerous to do manually under pressure:

deciding that the primary region is actually unhealthy
preventing split brain with a global lock
checking replication lag before promotion
promoting the DR database
switching active-region configuration
scaling DR services up
making sure primary services stop writing after promotion
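The ordering of those steps can be sketched as a pure planning function. This is an illustration under stated assumptions, not the Step Functions definition: the lock is an in-memory stand-in for a global lock, the thresholds are hypothetical defaults, and escalating to an operator when lag is too high is my assumption about what a cautious workflow would do.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailoverLock:
    """In-memory stand-in for a global lock (a conditional write in a real store)."""
    holder: Optional[str] = None

    def acquire(self, owner: str) -> bool:
        if self.holder is not None:
            return False
        self.holder = owner
        return True

    def release(self, owner: str) -> None:
        if self.holder == owner:
            self.holder = None


@dataclass
class Signals:
    consecutive_failed_probes: int
    replication_lag_seconds: float


def plan_failover(signals: Signals, lock: FailoverLock, *, owner: str,
                  min_failed_probes: int = 3,          # hypothetical threshold
                  max_lag_seconds: float = 5.0) -> List[str]:  # hypothetical
    """Return the ordered failover steps, or a single abort/escalate entry."""
    # 1. Do not act on a single failed health check.
    if signals.consecutive_failed_probes < min_failed_probes:
        return ["abort: primary not confirmed unhealthy"]
    # 2. Only one failover run may proceed (split-brain prevention).
    if not lock.acquire(owner):
        return ["abort: lock held by another run"]
    # 3. Promoting with high lag loses writes, so hand off to a human.
    if signals.replication_lag_seconds > max_lag_seconds:
        lock.release(owner)
        return ["escalate: replication lag too high"]
    # 4. The lock stays held while the promotion steps run.
    return [
        "promote_dr_database",
        "switch_active_region_config",
        "scale_up_dr_services",
        "fence_primary_writers",
    ]
```

Keeping the decision logic as a pure function like this makes it easy to unit test separately from the AWS calls each step eventually performs.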

Failback is also automated, but it requires operator confirmation. Failover is about speed. Failback is about caution.

Observability

Self-hosted observability on the old PaaS was retired. Production now uses a managed telemetry stack.

Signal	Path
Logs	Application logs to managed log store, with CloudWatch fallback
Metrics	Managed CloudWatch integration
Alerts	SNS topic to chat and on-call
Request correlation	Wide events with identifiers across services

One important cutover rule: production DNS does not move until observability is verified. A successful migration that removes your ability to see production is not successful.

Staging vs production

Staging deliberately costs less than production.

Dimension	Stage	Production
Regions	Primary only	Primary plus DR
Database	Single RDS PostgreSQL	Aurora Global Database
Redis	Single node	Independent nodes per region
API topology	Simpler service layout	Split web, worker, and cron roles
Ingress	Public ALB	CloudFront and Global Accelerator
ECS scale	Minimal and toggleable	Primary plus warm DR serving path
DR orchestrator	Drill and validation path	Full failover and failback automation

Stage exists to validate Terraform modules, run drills, and test deploy behavior. It does not need to mirror the production bill.

What I would repeat

The choices I would make again:

Start with the failover workflow. VPCs and ECS services are straightforward compared with active-region switching, database promotion, and split-brain prevention.
Keep DR compute partially warm. The serving path needs to recover quickly. Workers and cron can scale later.
Use Aurora Global only where it pays for itself. It is expensive, but it removed a lot of failover and failback risk for the production database.
Do not replicate Redis by default. Session loss during regional DR was an acceptable product tradeoff.
Make Terraform the source of truth. Console fixes are tempting during migration, but drift is exactly what hurts during disaster recovery.
Run drills before DNS cutover. Stage drills caught secret localization and ECS desired-count issues that would have been painful during a real incident.

Current status

The AWS infrastructure is applied and production is healthy in the primary region. Services respond on non-production test hostnames. The remaining step is live DNS cutover from the old platform to production hostnames after final verification.

Steady-state operations after cutover are automated: deploys, backups, scaling, and regional failover all run through CI or the orchestrator.

For the reusable planning framework behind these choices, read Designing Disaster Recovery on AWS.