This is the implementation half of a two-part disaster recovery series.
If you are still deciding what to build first, start with Designing Disaster Recovery on AWS. That guide covers requirements, RTO/RPO, cost ranges, pros and cons, and cheaper alternatives such as RDS PostgreSQL with Lambda or Step Functions automation.
This post covers what I actually built, why those choices fit this workload, and where I accepted cost or complexity to get a lower recovery time.
Legend
All names below are generic placeholders for an example workload: a hypothetical event ticketing platform I will call the platform. None of these are real service, repository, environment, account, or hostname names.
| Name | Example role |
|---|---|
| the platform | A hypothetical event ticketing product used to illustrate the architecture |
| webapp | User-facing web application |
| partner-dashboard | Dashboard for partners and organizers |
| admin | Internal administration tooling |
| marketing | Public marketing site |
| api-web | Public API and WebSocket request handlers |
| api-worker | Background job consumers |
| api-cron | Scheduled task runner |
| notifications | Email and SMS delivery service |
| short-links | URL redirect service |
| Primary | The primary AWS region under normal operation |
| DR | The disaster recovery AWS region running warm standby |
Region names, account identifiers, hostnames, CIDR ranges, and concrete operational thresholds are intentionally omitted.
Why this migration happened
The platform originally ran almost entirely on a single PaaS: Postgres, Redis, compute, and self-hosted observability. That was the right tradeoff early on. It kept the team moving quickly and avoided infrastructure work before the product needed it.
The problem became clear during an outage where every service was healthy internally, but customers could not reach the app. The provider control plane was the single point of failure. The app was up; the path to the app was not.
That incident changed the requirement from "host the app somewhere reliable" to "control the failure domains and prove the recovery path."
The target requirements were:
- Multi-AZ in the primary region
- Warm standby compute in a second region
- Low single-digit-minute RTO for a full regional outage
- Near-zero manual action during failover
- After cutover, changes made only through Terraform and CI
- Cost discipline, not active-active enterprise architecture
The result is a Terraform-managed AWS stack with ECS Fargate, Aurora PostgreSQL Global Database, independent Redis per region, CloudFront, Global Accelerator, and a Step Functions plus Lambda DR orchestrator.
Final architecture
Traffic routing is split by workload type.
The important design choice is that the DNS provider is not in the failover path. DNS points to AWS ingress. CloudFront and Global Accelerator handle regional routing inside AWS.
Why CloudFront and Global Accelerator both exist
The web frontends and the API have different failure behavior.
| Workload | Ingress | Reasoning |
|---|---|---|
| SSR frontends | CloudFront origin group | Per-request origin failover and cached static assets at the edge |
| REST API | Global Accelerator | Fast regional reroute without DNS propagation delay |
| WebSockets | Global Accelerator | Better behavior for long-lived connections and DNS-caching clients |
| DNS provider | DNS only | Avoids depending on another proxy control plane during failover |
An earlier version used a provider-managed load balancer for multi-origin steering. I removed it because it reintroduced the same class of dependency the migration was supposed to eliminate.
The cost tradeoff is that Global Accelerator adds another fixed monthly charge. I accepted that because API and WebSocket recovery was more important than minimizing ingress cost.
Application topology
The platform runs several ECS Fargate services using the same Docker images that previously deployed to the PaaS.
| Service | Ingress | DR posture |
|---|---|---|
| webapp | CloudFront to ALB | Warm in DR |
| partner-dashboard | CloudFront to ALB | Warm in DR |
| admin | CloudFront to ALB | Minimal DR capacity |
| marketing | CloudFront to ALB | Warm or edge-cached |
| api-web | Global Accelerator to ALB | Warm in DR |
| api-worker | Internal queue | Scaled up during failover |
| api-cron | Internal singleton | Runs only in active region |
| notifications | Private service discovery | Scaled up during failover |
| short-links | ALB path | Warm in DR |
The API is one Docker image deployed as three ECS services:
```
ROLE=web     # HTTP handlers and WebSocket, behind ALB
ROLE=worker  # queue consumers
ROLE=cron    # singleton scheduler
```
This split matters during DR. The user-facing path stays warm, but background workers and cron do not need to burn full standby capacity every day. During failover, the orchestrator scales the right services in the DR region and prevents cron from running in both regions at once.
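As a rough illustration of the role split and the single-region cron rule, here is a minimal Python sketch. The `ACTIVE_REGION` and `AWS_REGION` variable names and both function names are assumptions made for this example; the post only establishes that `ROLE` selects web, worker, or cron, and that cron must not run in both regions.

```python
VALID_ROLES = {"web", "worker", "cron"}


def resolve_role(env):
    """Return the validated ROLE value, failing fast on anything unexpected."""
    role = env.get("ROLE", "")
    if role not in VALID_ROLES:
        raise ValueError(f"ROLE must be one of {sorted(VALID_ROLES)}, got {role!r}")
    return role


def should_run(env):
    """Decide whether this task should do work in the region it started in.

    web and worker always run. cron runs only when the (hypothetical)
    ACTIVE_REGION flag is set and matches the task's own region, so a
    missing flag means cron stays idle rather than running twice.
    """
    role = resolve_role(env)
    if role != "cron":
        return True
    active = env.get("ACTIVE_REGION")
    return active is not None and active == env.get("AWS_REGION")
```

Defaulting to "do not run" when the flag is missing is a deliberate choice in this sketch: a skipped cron tick is usually easier to recover from than the same job firing in two regions.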
Infrastructure as Code
Everything lives in a dedicated infrastructure repository managed with Terraform.
Repository layout
The Terraform state backend uses S3 with DynamoDB locking. State is replicated cross-region so infrastructure can still be managed during a primary-region incident.
The main modules are:
| Module | Purpose |
|---|---|
| vpc | VPC, subnets, NAT, endpoints, routing |
| ecs-fargate-service | Task definitions, services, autoscaling policies |
| aurora-postgres | Aurora Global Database primary and secondary clusters |
| elasticache-redis | Independent Redis nodes in each region |
| cloudfront-alb-app | CloudFront distribution and origin group failover |
| global-accelerator-alb | Global Accelerator endpoint groups |
| dr-orchestrator | Step Functions, Lambda, locks, and failover workflow |
| github-oidc | CI roles without long-lived AWS keys |
| secrets-data | Secret ARN lookups without storing values in Terraform state |
The key rule is that Terraform references secret ARNs, never secret values. Real values live in Secrets Manager and are injected during deploy.
Database choice: why Aurora Global
The most expensive decision was the database.
Production uses Aurora PostgreSQL Global Database with Serverless v2 autoscaling:
- primary region has the writer cluster and an in-region reader
- DR region has a read-only secondary cluster under normal operation
- DR promotion is handled through Aurora global failover
I chose this even though it can easily cost a few hundred dollars per month, because its operational behavior matched the recovery requirement.
| Need | Why Aurora Global helped |
|---|---|
| Low RTO | The DR database is already replicated and ready to promote |
| Low RPO | Replication lag is usually very small for this workload |
| Cleaner failback | Aurora owns more of the demote and reattach workflow |
| Terraform stability | Failover does not turn the replica into a permanently detached one-off resource |
The cheaper alternative would have been RDS PostgreSQL with a cross-region read replica or snapshot restore, plus Lambda or Step Functions to promote, update app config, scale DR compute, and rebuild replication later.
That alternative is valid. I did not choose it because every failover drill would have produced more manual database recovery and Terraform reconciliation work. For this platform, lower recovery friction was worth the extra monthly database cost.
Stage does not use Aurora Global. It runs a single small RDS PostgreSQL instance, because staging only needs to validate modules and drills, not carry production-grade regional availability.
Redis choice: independent per region
Each region has its own ElastiCache Redis node. There is no cross-region Redis replication.
| Redis use | DR decision |
|---|---|
| Sessions | Users may need to log in again after regional failover |
| Cache | Repopulates on demand |
| WebSocket adapter state | Clients reconnect |
| Rate limits | Can reset during a regional disaster |
This was a cost decision. ElastiCache Global Datastore would have improved Redis RPO, but the product did not need sub-second session replication. Durable state is in Postgres. Losing Redis state during a rare regional event is acceptable.
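The "repopulates on demand" behavior is the usual cache-aside pattern. The sketch below is a hedged illustration, not the platform's code: a plain dict stands in for Redis, and `CacheAside` and `load_from_db` are hypothetical names. It shows why an empty cache in the DR region costs extra database reads but no data.

```python
class CacheAside:
    """Read-through helper: serve from cache, fall back to the durable store."""

    def __init__(self, cache, load_from_db):
        self.cache = cache                # dict-like store standing in for Redis
        self.load_from_db = load_from_db  # callable that reads durable state

    def get(self, key):
        value = self.cache.get(key)
        if value is None:
            # Cache miss (for example, a fresh DR node): reload and repopulate.
            value = self.load_from_db(key)
            self.cache[key] = value
        return value
```

After a regional failover the DR cache starts empty, so the first read of each key goes to Postgres and later reads are served from the cache again.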
Networking choices
Each environment gets an isolated VPC with non-overlapping CIDR ranges.
Within each VPC:
- public subnets hold ALBs and the NAT gateway
- private subnets hold ECS tasks
- data subnets hold Aurora and ElastiCache
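The three-tier layout and the non-overlap rule can be checked with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the real CIDR ranges are intentionally omitted from this post, so `10.0.0.0/16` and the `/20` subnet size in the usage note are placeholders, and `plan_tiers` and `find_overlaps` are hypothetical helper names.

```python
import ipaddress
from itertools import islice

TIERS = ("public", "private", "data")


def plan_tiers(vpc_cidr, az_count, new_prefix):
    """Carve a VPC CIDR into one subnet per tier per availability zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = len(TIERS) * az_count
    subnets = list(islice(vpc.subnets(new_prefix=new_prefix), needed))
    if len(subnets) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    return {
        tier: subnets[i * az_count:(i + 1) * az_count]
        for i, tier in enumerate(TIERS)
    }


def find_overlaps(cidrs):
    """Return every pair of CIDR ranges that overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]
```

For example, `plan_tiers("10.0.0.0/16", 2, 20)` assigns the first two `/20` blocks to the public tier, the next two to private, and the next two to data. Running `find_overlaps` across all environment CIDRs in CI is one way to catch an overlap before it reaches a plan.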
VPC endpoints exist for services like ECR, Secrets Manager, SSM, S3, queues, KMS, and CloudWatch Logs. Security is only part of the reason: endpoints also reduce the dependency of private workloads on public internet paths.
Each environment uses a single NAT gateway. More NAT gateways would improve AZ-level resilience for outbound internet access, but the added cost was not worth it for this workload.
Secrets and deploy flow
CI authenticates to AWS through OIDC federation. There are no long-lived AWS access keys in CI.
The production deploy flow is:
Secrets are organized by environment and consumer:
| Scope | Used by |
|---|---|
| platform | Shared infrastructure configuration |
| backend | API and internal services |
| webapp | User-facing frontend |
| partner-dashboard | Partner dashboard |
| admin | Internal admin tooling |
| observability | Telemetry pipeline |
At deploy time, a script converts Secrets Manager JSON into environment files consumed by ECS task definitions. During failover, the same export path rewrites region-specific values such as datastore endpoints, queue URLs, and active-region flags.
That indirection is important. The app does not need a separate secrets schema for DR, but the active runtime config can still point at DR resources when needed.
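A minimal sketch of that export step, assuming the secret is a flat JSON object and the output is a file of `KEY=VALUE` lines. The function name, the key names, and the example hostnames in the usage note are hypothetical; fetching the secret from Secrets Manager is left out so the transformation can be read and tested on its own.

```python
import json


def render_env_file(secret_json, region_overrides=None):
    """Turn a flat JSON secret into KEY=VALUE lines, applying region overrides.

    region_overrides holds values that differ in the active region, such as
    datastore endpoints or an active-region flag. Overrides win over the
    values stored in the secret.
    """
    values = json.loads(secret_json)
    if not isinstance(values, dict):
        raise ValueError("secret must be a JSON object of key/value pairs")
    values.update(region_overrides or {})
    lines = []
    for key in sorted(values):  # sorted for stable, diffable output
        value = str(values[key])
        if "\n" in value:
            raise ValueError(f"{key} contains a newline, which a line-based env file cannot carry")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

In normal operation the call passes no overrides. During failover the same call could pass something like `{"DATABASE_URL": "postgres://dr.example.invalid/app", "ACTIVE_REGION": "dr"}`, which is how one secrets schema can serve both regions.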
Disaster recovery orchestrator
Regional failover is automated with Step Functions and Lambda.
The orchestrator focuses on the steps that are dangerous to do manually under pressure:
- deciding that the primary region is actually unhealthy
- preventing split brain with a global lock
- checking replication lag before promotion
- promoting the DR database
- switching active-region configuration
- scaling DR services up
- making sure primary services stop writing after promotion
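The ordering of those steps can be sketched as a small Python function. This is an illustration of the sequence only, not the actual Step Functions definition: the in-memory `GlobalLock` stands in for whatever conditional-write lock the orchestrator uses, the thresholds are caller-supplied placeholders, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

STEPS = (
    "promote_dr_database",
    "switch_active_region",
    "scale_up_dr_services",
    "fence_primary_writers",
)


class FailoverAborted(Exception):
    """Raised when a precondition for failover is not met."""


@dataclass
class GlobalLock:
    """In-memory stand-in for a lock shared by both regions."""
    holder: Optional[str] = None

    def acquire(self, owner):
        if self.holder is not None and self.holder != owner:
            return False
        self.holder = owner
        return True

    def release(self, owner):
        if self.holder == owner:
            self.holder = None


@dataclass
class Inputs:
    failed_health_checks: int
    required_failures: int       # placeholder threshold
    replication_lag_seconds: float
    max_lag_seconds: float       # placeholder threshold


def run_failover(inputs, lock, actions, owner="dr-orchestrator"):
    """Check preconditions, then run each step in order; return completed steps."""
    if inputs.failed_health_checks < inputs.required_failures:
        raise FailoverAborted("primary region not confirmed unhealthy")
    if not lock.acquire(owner):
        raise FailoverAborted("another run holds the global lock")
    if inputs.replication_lag_seconds > inputs.max_lag_seconds:
        lock.release(owner)  # do not block a later retry
        raise FailoverAborted("replication lag too high to promote safely")
    done = []
    for step in STEPS:
        actions[step]()  # each action is a caller-supplied callable
        done.append(step)
    return done
```

The point of the structure is that nothing irreversible happens until all three checks pass, and the lock is released if the lag check aborts the run so a later attempt is not blocked.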
Failback is also automated, but it requires operator confirmation. Failover is about speed. Failback is about caution.
Observability
Self-hosted observability on the old PaaS was retired. Production now uses a managed telemetry stack.
| Signal | Path |
|---|---|
| Logs | Application logs to managed log store, with CloudWatch fallback |
| Metrics | Managed CloudWatch integration |
| Alerts | SNS topic to chat and on-call |
| Request correlation | Wide events with identifiers across services |
One important cutover rule: production DNS does not move until observability is verified. A successful migration that removes your ability to see production is not successful.
Staging vs production
Staging deliberately costs less than production.
| Dimension | Stage | Production |
|---|---|---|
| Regions | Primary only | Primary plus DR |
| Database | Single RDS PostgreSQL | Aurora Global Database |
| Redis | Single node | Independent nodes per region |
| API topology | Simpler service layout | Split web, worker, and cron roles |
| Ingress | Public ALB | CloudFront and Global Accelerator |
| ECS scale | Minimal and toggleable | Primary plus warm DR serving path |
| DR orchestrator | Drill and validation path | Full failover and failback automation |
Stage exists to validate Terraform modules, run drills, and test deploy behavior. It does not need to mirror the production bill.
What I would repeat
The choices I would make again:
- Start with the failover workflow. VPCs and ECS services are straightforward compared with active-region switching, database promotion, and split-brain prevention.
- Keep DR compute partially warm. The serving path needs to recover quickly. Workers and cron can scale later.
- Use Aurora Global only where it pays for itself. It is expensive, but it removed a lot of failover and failback risk for the production database.
- Do not replicate Redis by default. Session loss during regional DR was an acceptable product tradeoff.
- Make Terraform the source of truth. Console fixes are tempting during migration, but drift is exactly what hurts during disaster recovery.
- Run drills before DNS cutover. Stage drills caught secret localization and ECS desired-count issues that would have been painful during a real incident.
Current status
The AWS infrastructure is applied and production is healthy in the primary region. Services respond on non-production test hostnames. The remaining step is live DNS cutover from the old platform to production hostnames after final verification.
Steady-state operations after cutover are automated: deploys, backups, scaling, and regional failover all run through CI or the orchestrator.
For the reusable planning framework behind these choices, read Designing Disaster Recovery on AWS.