Designing Disaster Recovery on AWS

June 29, 2026

This is the planning guide I wish I had before rebuilding a production platform for disaster recovery on AWS.

The companion case study is Migrating to AWS with Automated Disaster Recovery. That post covers the actual architecture I built. This one is about how to decide what you should build, what to prioritize, what each layer costs, and what cheaper alternatives are reasonable.


Start with the recovery requirement

Most DR designs go wrong because the team starts with services instead of failure modes.

Before picking Aurora, ECS, Global Accelerator, or any other AWS product, write down the business requirements:

| Question | What it decides |
| --- | --- |
| Which business functions must survive? | The systems that deserve DR spend before everything else |
| What does downtime cost per hour? | Whether warm standby is justified or backup-and-restore is enough |
| What failure are you surviving? | Instance failure, AZ failure, database failure, full regional outage, or provider outage |
| What RTO do you need? | How long the app can be unavailable before the incident becomes unacceptable |
| What RPO do you need? | How much committed data can be lost or replayed |
| What can degrade? | Sessions, cache, search, analytics, background jobs, admin tooling, or partner workflows |
| Which obligations apply? | Compliance, audit evidence, encryption, retention, and customer commitments |

This is the business impact analysis, even if you do not call it that. It keeps the architecture honest. A checkout flow, payment ledger, and customer database may deserve low RTO. A marketing CMS, internal report, or analytics pipeline may not.
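
One way to keep the analysis honest is to write it down as data rather than prose. The sketch below is only an illustration: the field names, the `by_priority` helper, and every number in it are placeholder assumptions, not figures from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRequirement:
    """One row of a business impact analysis (all fields are illustrative)."""
    function: str                  # business function, e.g. "checkout"
    downtime_cost_per_hour: float  # rough estimate in dollars
    rto_minutes: int               # longest tolerable outage
    rpo_minutes: int               # largest tolerable window of lost data
    can_degrade: bool = False      # True if a reduced mode is acceptable


def by_priority(reqs):
    """Order functions by tightest RTO first, then by highest downtime cost."""
    return sorted(reqs, key=lambda r: (r.rto_minutes, -r.downtime_cost_per_hour))


# Placeholder values for illustration only.
requirements = [
    RecoveryRequirement("marketing CMS", 50.0, 24 * 60, 24 * 60, can_degrade=True),
    RecoveryRequirement("checkout", 5000.0, 15, 1),
    RecoveryRequirement("analytics pipeline", 100.0, 8 * 60, 60, can_degrade=True),
]

for req in by_priority(requirements):
    print(f"{req.function}: RTO {req.rto_minutes} min, RPO {req.rpo_minutes} min")
```

Sorting by RTO puts the functions that deserve DR spend at the top of the list, which is usually the order in which to design for them.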

For many products, the right answer is not full multi-region automation on day one. A good DR plan usually grows in layers:

  1. Backups that restore. Not backups that merely exist.
  2. Infrastructure as Code. You should be able to recreate the environment without guessing.
  3. Multi-AZ inside one region. This handles the failures you are most likely to see.
  4. A documented regional restore path. Even if it is manual at first.
  5. Warm standby in a second region. Only when the business needs low RTO.
  6. Automated failover and failback. Only when manual recovery is too slow or too error-prone.

Pick a DR tier

AWS has a few common DR patterns. The names are less important than the operating model.

| Pattern | Typical RTO | Typical RPO | Monthly cost posture | When it fits |
| --- | --- | --- | --- | --- |
| Backup and restore | Hours to days | Last backup | Lowest | Internal tools, early-stage products, low traffic apps |
| Pilot light | Tens of minutes to hours | Minutes to hours | Low to medium | Critical data replicated, compute mostly off |
| Warm standby | Minutes | Seconds to minutes | Medium to high | Customer-facing apps that need fast recovery |
| Active-active | Seconds | Near-zero | Highest | Global products that can handle multi-writer complexity |
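
The table can be read as a lookup: pick the cheapest pattern whose achievable RTO and RPO both fit inside your targets. The sketch below assumes numeric floors for each pattern. Those floors are my own rough translation of "hours", "minutes", and "seconds", not AWS-defined limits, so adjust them to what your drills actually show.

```python
# (pattern, assumed best-case RTO in minutes, assumed best-case RPO in minutes),
# ordered from cheapest to most expensive. The numbers are illustrative assumptions.
TIERS = [
    ("backup and restore", 240, 1440),
    ("pilot light", 30, 15),
    ("warm standby", 5, 1),
    ("active-active", 0, 0),
]


def suggest_dr_tier(target_rto_minutes, target_rpo_minutes):
    """Return the cheapest pattern whose assumed floors meet both targets."""
    for name, best_rto, best_rpo in TIERS:
        if best_rto <= target_rto_minutes and best_rpo <= target_rpo_minutes:
            return name
    # Only reachable with negative targets, since the last tier has floors of zero.
    return "active-active"


print(suggest_dr_tier(15, 1))        # a checkout-style target
print(suggest_dr_tier(1440, 1440))   # an internal-tool-style target
```

The function deliberately checks both dimensions together: a relaxed RPO does not help if the RTO still demands a warmer tier.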

The platform in the case study uses a warm standby model:

  • primary region serves all normal traffic
  • DR region keeps the serving path warm at minimal capacity
  • database replication is continuous
  • workers and cron scale up only during failover
  • failback is automated but still requires operator confirmation

That is a good fit when full downtime is expensive, but active-active writes are overkill.


Reference architecture

A practical warm-standby architecture usually has these layers:

The shape is simple. The hard part is deciding how much money and automation each box deserves.


Cost model

The numbers below are intentionally approximate. AWS cost varies by region, traffic, storage, data transfer, log volume, retention, instance size, Aurora ACU usage, NAT traffic, and how much standby capacity you keep warm.

Use this as a planning model, not a quote.

| Layer | Common choice | Rough monthly range | Main cost driver |
| --- | --- | --- | --- |
| Ingress | CloudFront, ALB, Global Accelerator | $20-$100+ | ALBs, accelerator hours, traffic, request volume |
| Compute | ECS Fargate warm standby | $50-$300+ | Always-on tasks, CPU/memory size, number of services |
| Database high end | Aurora Global Database | $200-$400+ before serious traffic | Primary and secondary clusters, ACUs or instances, storage, IO |
| Database cheaper | RDS PostgreSQL with replica or restore automation | $30-$200+ | Instance class, Multi-AZ, storage, backup retention |
| Redis | ElastiCache per region | $30-$150+ | Node class, number of regions, replication choice |
| Networking | NAT gateways and VPC endpoints | $50-$250+ | NAT hourly charges, endpoint count, data processing |
| Automation | Lambda and Step Functions | $0-$20 for most small systems | State transitions, Lambda duration, schedule frequency |
| Observability | Managed logs and metrics | $20-$300+ | Log ingestion, retention, metrics cardinality |

Two parts surprise teams most often:

Aurora Global Database is excellent, but it is not cheap. Even before traffic, a primary cluster plus secondary region can land around the $200-$400+/month range depending on configuration.

Networking can quietly become real money. One NAT gateway per region may feel small, but hourly NAT charges plus data processing plus interface endpoints add up quickly.
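
To see how the layers add up, it helps to sum the ranges for two stacks that differ only in the database choice. The sketch below copies the ranges from the cost table; the dictionary keys and the `monthly_range` helper are names I made up for the example. Every range in the table ends in "+", so the upper figure is a planning marker, not a ceiling.

```python
# (low, high) monthly ranges in dollars, copied from the cost table.
LAYER_RANGES = {
    "ingress": (20, 100),
    "compute": (50, 300),
    "database_aurora_global": (200, 400),
    "database_rds": (30, 200),
    "redis": (30, 150),
    "networking": (50, 250),
    "automation": (0, 20),
    "observability": (20, 300),
}


def monthly_range(layers):
    """Sum the low and high ends of the chosen layers."""
    low = sum(LAYER_RANGES[name][0] for name in layers)
    high = sum(LAYER_RANGES[name][1] for name in layers)
    return low, high


SHARED = ["ingress", "compute", "redis", "networking", "automation", "observability"]

aurora_stack = monthly_range(SHARED + ["database_aurora_global"])
rds_stack = monthly_range(SHARED + ["database_rds"])

print(f"Aurora Global stack: ${aurora_stack[0]}-${aurora_stack[1]}+")  # $370-$1520+
print(f"RDS stack:           ${rds_stack[0]}-${rds_stack[1]}+")        # $200-$1320+
```

At the low end, the database choice alone is the difference between roughly $200 and roughly $370 per month, which is why the database decision gets its own section.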


Database decision guide

The database is the most important DR decision because it defines both data safety and failback complexity.

Option 1: Aurora Global Database

Use Aurora Global Database when you need low RTO/RPO and want AWS to own most of the replication and failback mechanics.

| Good fit | Bad fit |
| --- | --- |
| You need regional failover in minutes | You are trying to keep the full DR stack under a very small budget |
| You want managed promotion and cleaner failback | You can tolerate manual database recovery |
| You have enough production revenue to justify a few hundred dollars monthly for the data layer | Your workload is still early and regional outage recovery can be slower |

The main reason to choose it is not raw replication alone. The reason is operational: a regional failover should not leave you with a detached database that must be manually rebuilt before the next drill.

Option 2: RDS PostgreSQL plus Lambda or Step Functions

If you do not want to spend $200-$400+ on Aurora Global, a cheaper path is RDS PostgreSQL with automation around backups, replicas, DNS or config updates, and app scaling.

A common version looks like this:

This saves money, but it moves complexity into your own automation and runbooks:

  • promotion may detach the replica
  • failback may require rebuilding replication
  • Terraform state may need careful handling after promotion
  • app configuration must switch to the promoted endpoint
  • you need drills to prove the process works

That tradeoff is completely reasonable for smaller systems. Just be honest that you are buying lower monthly cost with higher operational work during recovery.
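
If you own the automation, the minimum it should do is run the steps in a fixed order and stop the moment one fails, so you never point the app at a database that was not promoted. Here is a minimal sketch of that control flow. The step functions are hypothetical stubs with no AWS calls in them; real versions would call the relevant APIs and verify the result before returning.

```python
def run_steps(steps):
    """Run (name, action) pairs in order and stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stubs. A real promote step might use the RDS PromoteReadReplica
# action and return True only once the instance accepts writes.
def promote_replica():
    return True


def switch_app_endpoint():
    return True


def scale_app_services():
    return True


FAILOVER_STEPS = [
    ("promote replica", promote_replica),
    ("switch app config to promoted endpoint", switch_app_endpoint),
    ("scale app services", scale_app_services),
]

completed, failed = run_steps(FAILOVER_STEPS)
print("completed:", completed)
print("failed at:", failed)
```

Returning the failed step name, rather than raising, keeps the partial progress visible, which is what an operator needs when deciding whether to retry or roll back.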

Option 3: Single-region RDS with tested restore

For staging, internal tools, or products where regional outage recovery can take hours, a single Multi-AZ RDS instance with tested snapshot restore may be enough.

The key word is tested. A backup plan without restore drills is not a DR plan.


Replicate the right data

Postgres is usually the center of the DR plan, but it is rarely the only data store.

Inventory data by type before choosing replication tools:

| Data type | Common AWS option | What to watch |
| --- | --- | --- |
| Relational data | Aurora Global, RDS replica, AWS Backup, snapshots | Promotion behavior, consistency, failback, Terraform state |
| Object storage | S3 versioning and cross-region replication | Delete replication, lifecycle rules, KMS keys, replication lag |
| Block volumes | EBS snapshots and lifecycle policies | Restore time and whether the volume is actually still needed |
| File data | EFS replication, DataSync, Storage Gateway | Throughput, file count, and consistency during active writes |
| Logs and audit trails | CloudWatch export, S3 archive, CloudTrail organization trails | Retention, immutability, and searchability during incidents |

Replication is not only about having a copy. You also need to know whether the copy is consistent, encrypted with usable keys in the recovery region, monitored for failure, and restorable within your RTO.
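
Those four questions make a reasonable checklist to run against every replicated data store. The sketch below assumes you record the answers yourself, for example after a restore drill; the `ReplicaStatus` fields and the sample values are hypothetical, and nothing here queries AWS.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """What you know about one replicated copy (fields are illustrative)."""
    name: str
    consistent: bool         # passed an integrity or row-count check
    key_usable_in_dr: bool   # the recovery region can decrypt it
    monitored: bool          # replication failures raise an alarm
    restore_minutes: float   # measured in the most recent restore drill


def readiness_gaps(status, rto_minutes):
    """List the reasons this copy cannot yet be trusted for recovery."""
    gaps = []
    if not status.consistent:
        gaps.append("consistency not verified")
    if not status.key_usable_in_dr:
        gaps.append("encryption key not usable in recovery region")
    if not status.monitored:
        gaps.append("replication not monitored")
    if status.restore_minutes > rto_minutes:
        gaps.append("restore time exceeds RTO")
    return gaps


# Sample values for illustration only.
uploads = ReplicaStatus("uploads bucket", True, False, True, restore_minutes=20)
print(readiness_gaps(uploads, rto_minutes=15))
```

An empty list is the only passing result. Anything else is a concrete follow-up for the next drill.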


Redis and cache strategy

Do not replicate Redis across regions unless the data inside Redis deserves that cost.

Ask what Redis is actually storing:

| Redis data | DR posture |
| --- | --- |
| Cache entries | Let them repopulate |
| Sessions | Often acceptable to force re-login during regional DR |
| Rate limits | Usually acceptable to reset |
| WebSocket adapter state | Reconnect clients after failover |
| Durable business data | It probably should not live only in Redis |

ElastiCache Global Datastore can make sense, but it is often unnecessary for product workloads where persistent state is in Postgres. Independent Redis nodes per region are a strong default for cost-conscious warm standby.


Ingress choices

Ingress is where DR becomes visible to users. DNS-only failover is simple, but it depends on TTLs, resolver behavior, and client caching. That may be fine for many apps. It is not ideal for mobile clients, APIs, or WebSockets.

| Choice | Use it when | Tradeoff |
| --- | --- | --- |
| DNS failover | You can tolerate minutes of propagation uncertainty | Cheapest and simplest, but client behavior varies |
| CloudFront origin groups | You serve web frontends or cacheable assets | Great per-request origin failover, less ideal as the only API ingress |
| Global Accelerator | You need static anycast IPs, fast regional reroute, or WebSocket-friendly ingress | Adds monthly cost and another AWS layer |
| Provider-managed proxy/load balancer | You already trust that provider as part of the control plane | Can reintroduce the dependency you are trying to escape |

For the case study, I used CloudFront for SSR frontends and Global Accelerator for API/WebSocket traffic because they optimize for different failure behavior.
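
To put a number on "minutes of propagation uncertainty" for DNS failover, a rough estimate is the time to detect the failure plus the record TTL. The sketch below is a simplification under my own assumptions: detection takes the health check interval times the failure threshold, and every resolver honors the TTL. Clients that cache longer than the TTL will take longer, which is exactly the uncertainty the table warns about.

```python
def dns_failover_seconds(check_interval_s, failure_threshold, record_ttl_s):
    """Rough worst-case time before well-behaved clients reach the healthy region."""
    detection = check_interval_s * failure_threshold  # consecutive failed checks
    return detection + record_ttl_s                   # plus cached answers expiring


# Example inputs: a 30-second check interval with 3 failures required, 60-second TTL.
print(dns_failover_seconds(30, 3, 60))  # 150
# A faster 10-second interval with the same threshold and TTL.
print(dns_failover_seconds(10, 3, 60))  # 90
```

If the result is already close to your RTO before any database promotion happens, DNS-only failover is probably the wrong ingress choice for that path.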


Compute strategy

Warm standby does not mean every service runs at full capacity in both regions.

Split services by recovery priority:

| Service type | DR capacity |
| --- | --- |
| User-facing web/API path | Keep warm at minimum capacity |
| Workers | Scale from zero or low capacity during failover |
| Cron/schedulers | Run only in the active region |
| Admin tools | Keep off or minimal unless needed during incidents |
| Marketing site | Prefer static or edge-cached if possible |

This is one of the easiest ways to control cost. The serving path recovers quickly, while less urgent compute only starts when the failover is real.
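
In practice this split often ends up as a small table of desired task counts per region role, which the failover automation reads when it scales services. The sketch below shows the idea with hypothetical service names and counts. It only computes the target numbers and does not call ECS.

```python
# Desired task counts per service for each region role.
# Service names and numbers are placeholders for illustration.
DESIRED_COUNTS = {
    "web":    {"standby": 1, "active": 2},
    "api":    {"standby": 1, "active": 4},
    "worker": {"standby": 0, "active": 3},
    "cron":   {"standby": 0, "active": 1},
}


def desired_counts(region_role):
    """Return {service: task_count} for a region acting as 'standby' or 'active'."""
    if region_role not in ("standby", "active"):
        raise ValueError(f"unknown region role: {region_role}")
    return {service: counts[region_role] for service, counts in DESIRED_COUNTS.items()}


print(desired_counts("standby"))  # serving path warm, workers and cron off
print(desired_counts("active"))   # everything scaled for real traffic
```

Keeping the numbers in one place also makes the standby cost easy to reason about: the standby column is what you pay for every month.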


Automation priorities

You do not need to automate everything immediately. Automate the steps that are dangerous, slow, or easy to do incorrectly during stress.

The first automation targets should be:

  1. health checks that distinguish partial failure from regional failure
  2. a failover lock to prevent split brain
  3. database promotion
  4. active-region configuration updates
  5. ECS desired-count changes
  6. queue endpoint and secret localization
  7. failback verification
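
For the first item, the core idea is that one failing component should not trigger a regional failover. A deliberately simple sketch, assuming you already collect a pass/fail result per component in the region (the component names are hypothetical):

```python
def classify_region(component_healthy):
    """Classify a region from per-component health results.

    component_healthy maps a component name to True (healthy) or False.
    Returns "healthy", "partial", or "regional".
    """
    if not component_healthy:
        raise ValueError("no health results to classify")
    failed = [name for name, ok in component_healthy.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(component_healthy):
        return "regional"
    return "partial"


print(classify_region({"api": True, "web": True, "database": True}))     # healthy
print(classify_region({"api": False, "web": True, "database": True}))    # partial
print(classify_region({"api": False, "web": False, "database": False}))  # regional
```

A real check would also look at how long the failure has lasted and probe from more than one vantage point. This version only captures the distinction between "something is broken" and "everything is broken".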

Manual confirmation is still useful for failback. Failover often needs speed. Failback needs caution.
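
The failover lock from the list above can be sketched as a lease: one run holds it, everyone else is refused until it is released or expires. The class below is an in-memory stand-in I wrote for illustration. A real lock has to live in a shared store with an atomic conditional write (a DynamoDB conditional put is one option), because two orchestrators in two regions do not share process memory.

```python
import time


class FailoverLock:
    """In-memory stand-in for a lease held in a shared store (illustrative only)."""

    def __init__(self):
        self._holder = None
        self._expires_at = 0.0

    def acquire(self, owner, ttl_seconds, now=None):
        """Take or renew the lock; refuse if another owner holds an unexpired lease."""
        now = time.time() if now is None else now
        held_by_other = self._holder is not None and self._holder != owner
        if held_by_other and now < self._expires_at:
            return False
        self._holder = owner
        self._expires_at = now + ttl_seconds
        return True

    def release(self, owner):
        """Release the lock only if the caller is the current holder."""
        if self._holder != owner:
            return False
        self._holder = None
        self._expires_at = 0.0
        return True


lock = FailoverLock()
print(lock.acquire("failover-run-1", ttl_seconds=300, now=0))    # True
print(lock.acquire("failover-run-2", ttl_seconds=300, now=10))   # False, lease still held
print(lock.acquire("failover-run-2", ttl_seconds=300, now=301))  # True, lease expired
```

The expiry matters as much as the lock itself: without it, a crashed run would block every later failover attempt.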


Monitoring, audit, and security

DR automation should not run on vibes. It needs signals you trust and permissions you can defend.

At minimum, plan for:

| Concern | AWS building block | Why it matters |
| --- | --- | --- |
| Health and metrics | CloudWatch metrics and alarms | Detect sustained failure without relying on a human refreshing dashboards |
| Notifications | SNS or an on-call integration | Make failover visible even when it is automated |
| API audit trail | CloudTrail | Know who or what changed infrastructure during an incident |
| Configuration drift | AWS Config or IaC drift checks | Catch resources that no longer match the recovery plan |
| Access control | IAM roles with least privilege | Limit what automation and operators can change |
| Encryption | KMS keys in every required region | A replicated backup is useless if the recovery region cannot decrypt it |

The alerting rule of thumb is simple: page on symptoms that affect recovery, not on every noisy metric. Too many DR alerts are as dangerous as too few because the team stops trusting them.
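
One common way to cut noise is to alarm only on sustained failure: require several breaching datapoints in the evaluation window instead of reacting to a single spike. CloudWatch alarms offer this as an "M out of N" setting. The sketch below reimplements the idea in plain Python to show the logic; the threshold and sample error rates are made up.

```python
def sustained_breach(datapoints, threshold, datapoints_to_alarm):
    """True when at least `datapoints_to_alarm` values in the window exceed `threshold`."""
    breaching = sum(1 for value in datapoints if value > threshold)
    return breaching >= datapoints_to_alarm


# Error rate per minute over a five-minute window (sample values).
single_spike = [0.02, 0.40, 0.01, 0.02, 0.01]
real_outage = [0.50, 0.60, 0.10, 0.70, 0.80]

print(sustained_breach(single_spike, threshold=0.25, datapoints_to_alarm=3))  # False
print(sustained_breach(real_outage, threshold=0.25, datapoints_to_alarm=3))   # True
```

The single spike never pages anyone, while the outage does even though one of its datapoints dipped below the threshold.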


Minimum viable DR path

If the full architecture is too expensive right now, build this first:

  1. Put infrastructure in Terraform.
  2. Store secrets outside the repo and outside Terraform state.
  3. Run the app statelessly so compute can be recreated.
  4. Use Multi-AZ RDS in the primary region.
  5. Enable automated backups and snapshot retention.
  6. Restore a backup into a clean environment at least once.
  7. Document exactly how traffic, secrets, queues, and database endpoints change during recovery.
  8. Add a DR region only for the components that are hardest to recreate quickly.
  9. Add automation after the manual drill is repeatable.

This path is not glamorous, but it is how most teams should start. A simple, drilled recovery plan beats an expensive multi-region diagram nobody has tested.


Testing and maintenance cadence

DR is not a document you finish. It is an operating habit.

Use a cadence like this:

| When | What to test |
| --- | --- |
| Every deploy | Basic health checks, migrations, rollback path, secrets export |
| Monthly | Backup restore into an isolated environment |
| Quarterly | Failover drill for the most critical path |
| After major architecture changes | RTO/RPO assumptions, runbooks, automation permissions, dashboards |
| Before contract or compliance reviews | Evidence: test results, diagrams, contacts, retention policy, access review |

Each drill should record:

  1. whether the target RTO and RPO were met
  2. which manual steps still exist
  3. which alarms fired or failed to fire
  4. what data needed repair or replay
  5. what documentation was wrong

The point is not to stage a perfect demo. The point is to find the painful parts while there is no real outage.
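
Recording drills in a consistent shape makes them comparable over time and doubles as compliance evidence. A minimal sketch of such a record, with one field per item in the list above. The class, field names, and sample values are my own illustration, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillRecord:
    """Outcome of one DR drill (structure is illustrative)."""
    label: str
    target_rto_minutes: float
    actual_rto_minutes: float
    target_rpo_minutes: float
    actual_rpo_minutes: float
    manual_steps: List[str] = field(default_factory=list)
    alarms_missed: List[str] = field(default_factory=list)
    data_repairs: List[str] = field(default_factory=list)
    doc_errors: List[str] = field(default_factory=list)

    def met_targets(self):
        """True only if both the RTO and the RPO target were met."""
        return (self.actual_rto_minutes <= self.target_rto_minutes
                and self.actual_rpo_minutes <= self.target_rpo_minutes)

    def follow_ups(self):
        """Everything the drill surfaced that still needs work."""
        return self.manual_steps + self.alarms_missed + self.data_repairs + self.doc_errors


# Sample values for illustration only.
drill = DrillRecord(
    label="quarterly failover drill",
    target_rto_minutes=15, actual_rto_minutes=22,
    target_rpo_minutes=1, actual_rpo_minutes=0.5,
    manual_steps=["update queue endpoint by hand"],
    doc_errors=["runbook lists the old replica name"],
)
print(drill.met_targets())  # False: RTO missed even though RPO was met
print(drill.follow_ups())
```

A drill that misses its target but produces a clear follow-up list is still a useful drill.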


What to focus on during your build

The priority order I would use:

  1. Data safety. Decide RPO first. Everything else is easier to change later.
  2. Ingress behavior. Know how users reach the healthy region.
  3. Configuration switching. Secrets, endpoints, queues, and active-region flags must move together.
  4. Split-brain prevention. Only one writer, one scheduler, and one active region.
  5. Cost boundaries. Choose which services stay warm and which wake up later.
  6. Drills. Recovery that has not been practiced is just a theory.
  7. Maintenance. Revisit the plan when the product, compliance posture, vendors, or AWS services change.

The case study shows one way to apply these principles with ECS Fargate, Aurora Global Database, independent Redis, CloudFront, Global Accelerator, Terraform, and a Lambda/Step Functions orchestrator: Migrating to AWS with Automated Disaster Recovery.

Further reading: Christopher Adamson's Creating a Disaster Recovery Plan Using AWS Services is a useful broad checklist for AWS DR planning. I used it as a cross-check for coverage; the structure, examples, cost framing, and recommendations here are written for this guide.