Designing Disaster Recovery on AWS

This is the planning guide I wish I'd had before rebuilding a production platform for disaster recovery on AWS.

The companion case study is Migrating to AWS with Automated Disaster Recovery. That post covers the actual architecture I built. This one is about how to decide what you should build, what to prioritize, what each layer costs, and what cheaper alternatives are reasonable.

Start with the recovery requirement

Most DR designs go wrong because the team starts with services instead of failure modes.

Before picking Aurora, ECS, Global Accelerator, or any other AWS product, answer these business questions:

Question	What it decides
Which business functions must survive?	The systems that deserve DR spend before everything else
What does downtime cost per hour?	Whether warm standby is justified or backup-and-restore is enough
What failure are you surviving?	Instance failure, AZ failure, database failure, full regional outage, or provider outage
What RTO do you need?	How long the app can be unavailable before the incident becomes unacceptable
What RPO do you need?	How much committed data can be lost or replayed
What can degrade?	Sessions, cache, search, analytics, background jobs, admin tooling, or partner workflows
Which obligations apply?	Compliance, audit evidence, encryption, retention, and customer commitments

This is the business impact analysis, even if you do not call it that. It keeps the architecture honest. A checkout flow, payment ledger, and customer database may deserve low RTO. A marketing CMS, internal report, or analytics pipeline may not.
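The answers to those questions can be kept as structured data, so the priorities are explicit and reviewable. The following is a minimal sketch, not a prescribed format: the class name, field names, dollar figures, and RTO/RPO values are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryRequirement:
    """One row of a business impact analysis (all values are illustrative)."""
    function: str
    downtime_cost_per_hour: float  # USD, rough estimate
    rto: timedelta                 # longest acceptable outage
    rpo: timedelta                 # largest acceptable data-loss window
    can_degrade: bool              # True if a reduced mode is acceptable


requirements = [
    RecoveryRequirement("checkout", 5000.0, timedelta(minutes=15), timedelta(minutes=1), False),
    RecoveryRequirement("payment ledger", 5000.0, timedelta(minutes=15), timedelta(seconds=5), False),
    RecoveryRequirement("marketing CMS", 50.0, timedelta(hours=24), timedelta(hours=24), True),
    RecoveryRequirement("analytics pipeline", 20.0, timedelta(hours=48), timedelta(hours=24), True),
]

# Strictest RTO first, then strictest RPO: these functions get DR spend first.
by_priority = sorted(requirements, key=lambda r: (r.rto, r.rpo))

for r in by_priority:
    print(f"{r.function:20} RTO={r.rto}  RPO={r.rpo}  degrade_ok={r.can_degrade}")
```

Sorting by RTO and then RPO is only one possible ordering. Weighting by downtime cost would be just as defensible.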

For many products, the right answer is not full multi-region automation on day one. A good DR plan usually grows in layers:

Backups that restore. Not backups that merely exist.
Infrastructure as Code. You should be able to recreate the environment without guessing.
Multi-AZ inside one region. This handles the failures you are most likely to see.
A documented regional restore path. Even if it is manual at first.
Warm standby in a second region. Only when the business needs low RTO.
Automated failover and failback. Only when manual recovery is too slow or too error-prone.

Pick a DR tier

AWS has a few common DR patterns. The names are less important than the operating model.

Pattern	Typical RTO	Typical RPO	Monthly cost posture	When it fits
Backup and restore	Hours to days	Last backup	Lowest	Internal tools, early-stage products, low-traffic apps
Pilot light	Tens of minutes to hours	Minutes to hours	Low to medium	Critical data replicated, compute mostly off
Warm standby	Minutes	Seconds to minutes	Medium to high	Customer-facing apps that need fast recovery
Active-active	Seconds	Near-zero	Highest	Global products that can handle multi-writer complexity

The platform in the case study uses a warm standby model:

primary region serves all normal traffic
DR region keeps the serving path warm at minimal capacity
database replication is continuous
workers and cron scale up only during failover
failback is automated but still requires operator confirmation

That is a good fit when full downtime is expensive, but active-active writes are overkill.
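The pattern table can be turned into a first-pass lookup from RTO and RPO targets. This is a sketch only: the function name and the exact cutoffs are assumptions loosely derived from the "typical" ranges in the table, and real decisions also depend on cost and team capacity.

```python
from datetime import timedelta


def suggest_dr_pattern(rto: timedelta, rpo: timedelta) -> str:
    """Map RTO/RPO targets to the cheapest pattern that plausibly meets them.

    The cutoffs are illustrative assumptions, not AWS guidance.
    """
    if rto <= timedelta(minutes=1) and rpo <= timedelta(seconds=5):
        return "active-active"
    if rto <= timedelta(minutes=30) and rpo <= timedelta(minutes=5):
        return "warm standby"
    if rto <= timedelta(hours=4) and rpo <= timedelta(hours=1):
        return "pilot light"
    return "backup and restore"


# A customer-facing API that can be down for 10 minutes and lose 30 seconds of data.
print(suggest_dr_pattern(timedelta(minutes=10), timedelta(seconds=30)))
```

Treat the result as a starting point for discussion. A tight RPO with a relaxed RTO, for example, may be better served by continuous replication with a manual restore path than by any single row of the table.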

Reference architecture

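A practical warm-standby architecture usually has the same layers that appear in the cost model below: ingress, compute, database, Redis, networking, automation, and observability.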

The shape is simple. The hard part is deciding how much money and automation each box deserves.

Cost model

The numbers below are intentionally approximate. AWS cost varies by region, traffic, storage, data transfer, log volume, retention, instance size, Aurora ACU usage, NAT traffic, and how much standby capacity you keep warm.

Use this as a planning model, not a quote.

Layer	Common choice	Rough monthly range	Main cost driver
Ingress	CloudFront, ALB, Global Accelerator	$20-$100+	ALBs, accelerator hours, traffic, request volume
Compute	ECS Fargate warm standby	$50-$300+	Always-on tasks, CPU/memory size, number of services
Database high end	Aurora Global Database	$200-$400+ before serious traffic	Primary and secondary clusters, ACUs or instances, storage, IO
Database cheaper	RDS PostgreSQL with replica or restore automation	$30-$200+	Instance class, Multi-AZ, storage, backup retention
Redis	ElastiCache per region	$30-$150+	Node class, number of regions, replication choice
Networking	NAT gateways and VPC endpoints	$50-$250+	NAT hourly charges, endpoint count, data processing
Automation	Lambda and Step Functions	$0-$20 for most small systems	State transitions, Lambda duration, schedule frequency
Observability	Managed logs and metrics	$20-$300+	Log ingestion, retention, metrics cardinality

Two parts surprise teams most often:

Aurora Global Database is excellent, but it is not cheap. Even before traffic, a primary cluster plus a secondary region can land in the $200-$400+/month range depending on configuration.

Networking can quietly become real money. One NAT gateway per region may feel small, but hourly NAT charges plus data processing plus interface endpoints add up quickly.
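To see how the layers add up, the table's ranges can be summed for each database option. This sketch copies the low and high figures from the table and treats each "+" upper bound as the stated number, so the real ceiling is higher. The dictionary and function names are assumptions.

```python
# (low, high) monthly USD per layer, copied from the cost table.
# Open-ended "+" upper bounds are treated as the stated number.
LAYERS = {
    "ingress": (20, 100),
    "compute": (50, 300),
    "redis": (30, 150),
    "networking": (50, 250),
    "automation": (0, 20),
    "observability": (20, 300),
}

DATABASE = {
    "aurora_global": (200, 400),
    "rds_postgres": (30, 200),
}


def monthly_range(database: str) -> tuple[int, int]:
    """Sum the non-database layers plus the chosen database option."""
    db_low, db_high = DATABASE[database]
    low = sum(lo for lo, _ in LAYERS.values()) + db_low
    high = sum(hi for _, hi in LAYERS.values()) + db_high
    return low, high


for option in DATABASE:
    low, high = monthly_range(option)
    print(f"{option}: ${low}-${high}+ per month")
```

With these inputs the Aurora Global path sums to roughly $370-$1,520+ per month and the RDS PostgreSQL path to roughly $200-$1,320+. Most of the difference sits at the low end, which is where small systems live.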

Database decision guide

The database is the most important DR decision because it defines both data safety and failback complexity.

Option 1: Aurora Global Database

Use Aurora Global Database when you need low RTO/RPO and want AWS to own most of the replication and failback mechanics.

Good fit	Bad fit
You need regional failover in minutes	You are trying to keep the full DR stack under a very small budget
You want managed promotion and cleaner failback	You can tolerate manual database recovery
You have enough production revenue to justify a few hundred dollars monthly for the data layer	Your workload is still early and regional outage recovery can be slower

The main reason to choose it is not raw replication alone. The reason is operational: a regional failover should not leave you with a detached database that must be manually rebuilt before the next drill.

Option 2: RDS PostgreSQL plus Lambda or Step Functions

If you do not want to spend $200-$400+ on Aurora Global, a cheaper path is RDS PostgreSQL with automation around backups, replicas, DNS or config updates, and app scaling.

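A common version keeps a read replica in the DR region and uses Lambda or Step Functions to promote it, update DNS or app configuration, and scale the app.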

This saves money, but it moves complexity into your own automation and runbooks:

promotion may detach the replica
failback may require rebuilding replication
Terraform state may need careful handling after promotion
app configuration must switch to the promoted endpoint
you need drills to prove the process works

That tradeoff is completely reasonable for smaller systems. Just be honest that you are buying lower monthly cost with higher operational work during recovery.
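The promotion and endpoint-switch steps above could be sketched as follows. This is a hedged illustration, not the case study's implementation. It assumes `rds` is a boto3 RDS client for the DR region (for example `boto3.client("rds", region_name=...)`), and `update_app_endpoint` is a hypothetical callback standing in for whatever DNS or configuration change your app needs. The function name, polling interval, and timeout are also assumptions.

```python
import time
from typing import Any, Callable


def promote_replica(
    rds: Any,
    replica_id: str,
    update_app_endpoint: Callable[[str], None],
    poll_seconds: float = 15.0,
    max_polls: int = 80,
) -> str:
    """Promote a read replica, wait until it is standalone, then repoint the app."""
    # After promotion the instance is detached from its source for good:
    # failing back means building a new replica, as the list above notes.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    for _ in range(max_polls):
        response = rds.describe_db_instances(DBInstanceIdentifier=replica_id)
        instance = response["DBInstances"][0]
        # Done only when the instance is available AND no longer lists a
        # replication source. Checking status alone can pass too early.
        detached = not instance.get("ReadReplicaSourceDBInstanceIdentifier")
        if instance["DBInstanceStatus"] == "available" and detached:
            endpoint = instance["Endpoint"]["Address"]
            update_app_endpoint(endpoint)  # hypothetical hook: DNS, SSM, etc.
            return endpoint
        time.sleep(poll_seconds)

    raise TimeoutError(f"{replica_id} was not promoted after {max_polls} polls")
```

Because the client is passed in, the function can be exercised in a drill with a stub object before it is ever pointed at a real database.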

Option 3: Single-region RDS with tested restore

For staging, internal tools, or products where regional outage recovery can take hours, a single Multi-AZ RDS instance with tested snapshot restore may be enough.

The key word is tested. A backup plan without restore drills is not a DR plan.

Replicate the right data

Postgres is usually the center of the DR plan, but it is rarely the only data store.

Inventory data by type before choosing replication tools:

Data type	Common AWS option	What to watch
Relational data	Aurora Global, RDS replica, AWS Backup, snapshots	Promotion behavior, consistency, failback, Terraform state
Object storage	S3 versioning and cross-region replication	Delete replication, lifecycle rules, KMS keys, replication lag
Block volumes	EBS snapshots and lifecycle policies	Restore time and whether the volume is actually still needed
File data	EFS replication, DataSync, Storage Gateway	Throughput, file count, and consistency during active writes
Logs and audit trails	CloudWatch export, S3 archive, CloudTrail organization trails	Retention, immutability, and searchability during incidents

Replication is not only about having a copy. You also need to know whether the copy is consistent, encrypted with usable keys in the recovery region, monitored for failure, and restorable within your RTO.

Redis and cache strategy

Do not replicate Redis across regions unless the data inside Redis deserves that cost.

Ask what Redis is actually storing:

Redis data	DR posture
Cache entries	Let them repopulate
Sessions	Often acceptable to force re-login during regional DR
Rate limits	Usually acceptable to reset
WebSocket adapter state	Reconnect clients after failover
Durable business data	It probably should not live only in Redis

ElastiCache Global Datastore can make sense, but it is often unnecessary for product workloads where persistent state is in Postgres. Independent Redis nodes per region are a strong default for cost-conscious warm standby.

Ingress choices

Ingress is where DR becomes visible to users. DNS-only failover is simple, but it depends on TTLs, resolver behavior, and client caching. That may be fine for many apps. It is a weaker fit for mobile clients, APIs, and WebSockets, where cached lookups and long-lived connections can keep pointing at the failed region.

Choice	Use it when	Tradeoff
DNS failover	You can tolerate minutes of propagation uncertainty	Cheapest and simplest, but client behavior varies
CloudFront origin groups	You serve web frontends or cacheable assets	Per-request origin failover, but only for GET, HEAD, and OPTIONS requests, so it is less suited to being the only API ingress
Global Accelerator	You need static anycast IPs, fast regional reroute, or WebSocket-friendly ingress	Adds monthly cost and another AWS layer
Provider-managed proxy/load balancer	You already trust that provider as part of the control plane	Can reintroduce the dependency you are trying to escape

For the case study, I used CloudFront for SSR frontends and Global Accelerator for API/WebSocket traffic because they optimize for different failure behavior.
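The "minutes of propagation uncertainty" in the DNS failover row can be estimated with simple arithmetic: time to detect the failure plus time for cached records to expire. The sketch below assumes a health check that probes at a fixed interval and trips after a number of consecutive failures. The function name and the example values (30-second interval, threshold of 3, 60-second TTL) are illustrative assumptions.

```python
def estimate_dns_failover_seconds(
    check_interval_s: int,
    failure_threshold: int,
    record_ttl_s: int,
) -> int:
    """Rough time until well-behaved clients resolve the healthy region."""
    # Detection: the health check must fail this many times in a row.
    detection = check_interval_s * failure_threshold
    # Propagation: resolvers that honor the TTL may serve the old answer
    # until the cached record expires.
    return detection + record_ttl_s


print(estimate_dns_failover_seconds(30, 3, 60))  # 150
print(estimate_dns_failover_seconds(10, 3, 60))  # 90
```

This is an estimate for clients that behave well. Resolvers or clients that ignore TTLs, and connections that are already open, can take longer, which is the uncertainty the table refers to.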

Compute strategy

Warm standby does not mean every service runs at full capacity in both regions.

Split services by recovery priority:

Service type	DR capacity
User-facing web/API path	Keep warm at minimum capacity
Workers	Scale from zero or low capacity during failover
Cron/schedulers	Run only in the active region
Admin tools	Keep off or minimal unless needed during incidents
Marketing site	Prefer static or edge-cached if possible

This is one of the easiest ways to control cost. The serving path recovers quickly, while less urgent compute only starts when the failover is real.
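One way to make that split explicit is a per-service capacity plan that the failover automation reads. The sketch below is illustrative: the service names and task counts are assumptions, and the resulting numbers would typically feed whatever sets your ECS service desired counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePlan:
    active: int   # desired task count while the region serves traffic
    standby: int  # desired task count while the region is warm standby


# Hypothetical services and counts, following the table above.
PLANS = {
    "web": ServicePlan(active=4, standby=1),        # serving path stays warm
    "api": ServicePlan(active=4, standby=1),        # serving path stays warm
    "worker": ServicePlan(active=3, standby=0),     # scales up during failover
    "scheduler": ServicePlan(active=1, standby=0),  # runs only in the active region
    "admin": ServicePlan(active=1, standby=0),      # off unless needed
}


def desired_counts(region_is_active: bool) -> dict[str, int]:
    """Return the task count each service should run in a region."""
    return {
        name: (plan.active if region_is_active else plan.standby)
        for name, plan in PLANS.items()
    }


print("standby:", desired_counts(region_is_active=False))
print("active: ", desired_counts(region_is_active=True))
```

Keeping the plan in one place also makes the cost decision visible: in this example the standby region runs 2 tasks instead of 13.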

Automation priorities

You do not need to automate everything immediately. Automate the steps that are dangerous, slow, or easy to do incorrectly during stress.

The first automation targets should be:

health checks that distinguish partial failure from regional failure
a failover lock to prevent split brain
database promotion
active-region configuration updates
ECS desired-count changes
queue endpoint and secret localization
failback verification

Manual confirmation is still useful for failback. Failover often needs speed. Failback needs caution.
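The failover lock in the list above can be illustrated with a small in-memory model. This is only a sketch of the semantics. A real lock would live in a store that supports atomic conditional writes and stays reachable during a regional outage. The class and method names are assumptions.

```python
import time
from typing import Optional


class FailoverLock:
    """In-memory stand-in for a lock backed by conditional writes."""

    def __init__(self) -> None:
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def acquire(self, owner: str, ttl_seconds: float, now: Optional[float] = None) -> bool:
        """Take the lock if it is free, expired, or already held by `owner`."""
        now = time.time() if now is None else now
        held_by_other = (
            self._owner is not None
            and self._owner != owner
            and now < self._expires_at
        )
        if held_by_other:
            return False  # another failover is in progress: do not proceed
        self._owner = owner
        self._expires_at = now + ttl_seconds
        return True

    def release(self, owner: str) -> bool:
        """Release only if `owner` still holds the lock."""
        if self._owner != owner:
            return False
        self._owner = None
        return True


lock = FailoverLock()
if lock.acquire("failover-run-1", ttl_seconds=900):
    print("lock held, safe to promote the database")
```

The expiry matters as much as the lock itself: without it, an orchestrator that crashes mid-failover would block every later attempt.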

Monitoring, audit, and security

DR automation should not run on vibes. It needs signals you trust and permissions you can defend.

At minimum, plan for:

Concern	AWS building block	Why it matters
Health and metrics	CloudWatch metrics and alarms	Detect sustained failure without relying on a human refreshing dashboards
Notifications	SNS or an on-call integration	Make failover visible even when it is automated
API audit trail	CloudTrail	Know who or what changed infrastructure during an incident
Configuration drift	AWS Config or IaC drift checks	Catch resources that no longer match the recovery plan
Access control	IAM roles with least privilege	Limit what automation and operators can change
Encryption	KMS keys in every required region	A replicated backup is useless if the recovery region cannot decrypt it

The alerting rule of thumb is simple: page on symptoms that affect recovery, not on every noisy metric. Too many DR alerts are as dangerous as too few because the team stops trusting them.
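"Sustained failure" can be made concrete as an M-out-of-N rule: trigger only when most recent probes failed, not on a single blip. The sketch below is a minimal illustration of that idea, similar in spirit to how metric alarms evaluate several datapoints. The class name, window size, and threshold are assumptions.

```python
from collections import deque


class SustainedFailureDetector:
    """Signal failure only when `threshold` of the last `window` probes failed."""

    def __init__(self, window: int = 5, threshold: int = 4) -> None:
        self._results: deque = deque(maxlen=window)  # oldest result drops off
        self._threshold = threshold

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True if failure is sustained."""
        self._results.append(healthy)
        failures = sum(1 for ok in self._results if not ok)
        return failures >= self._threshold


detector = SustainedFailureDetector(window=5, threshold=4)
for probe in [False, True, False, False, False]:
    sustained = detector.record(probe)
print("sustained failure:", sustained)
```

A single recovery probe inside the window is tolerated here, which keeps one lucky response from delaying a real failover while still ignoring isolated errors.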

Minimum viable DR path

If the full architecture is too expensive right now, build this first:

Put infrastructure in Terraform.
Store secrets outside the repo and outside Terraform state.
Run the app statelessly so compute can be recreated.
Use Multi-AZ RDS in the primary region.
Enable automated backups and snapshot retention.
Restore a backup into a clean environment at least once.
Document exactly how traffic, secrets, queues, and database endpoints change during recovery.
Add a DR region only for the components that are hardest to recreate quickly.
Add automation after the manual drill is repeatable.

This path is not glamorous, but it is how most teams should start. A simple, drilled recovery plan beats an expensive multi-region diagram nobody has tested.

Testing and maintenance cadence

DR is not a document you finish. It is an operating habit.

Use a cadence like this:

When	What to test
Every deploy	Basic health checks, migrations, rollback path, secrets export
Monthly	Backup restore into an isolated environment
Quarterly	Failover drill for the most critical path
After major architecture changes	RTO/RPO assumptions, runbooks, automation permissions, dashboards
Before contract or compliance reviews	Evidence: test results, diagrams, contacts, retention policy, access review

Each drill should record:

whether the target RTO and RPO were met
which manual steps still exist
which alarms fired or failed to fire
what data needed repair or replay
what documentation was wrong

The point is not to stage a perfect demo. The point is to find the painful parts while there is no real outage.
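The items each drill should record fit naturally into a small structured record, which makes drills comparable over time. The following is a sketch only: the class, field names, and example values are assumptions, not a required format.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class DrillRecord:
    """Outcome of one DR drill; field names are illustrative."""
    name: str
    target_rto: timedelta
    actual_rto: timedelta
    target_rpo: timedelta
    actual_rpo: timedelta
    manual_steps: list[str] = field(default_factory=list)
    alarms_missed: list[str] = field(default_factory=list)
    data_repairs: list[str] = field(default_factory=list)
    doc_errors: list[str] = field(default_factory=list)

    def met_targets(self) -> bool:
        """True only if both the RTO and the RPO target were met."""
        return self.actual_rto <= self.target_rto and self.actual_rpo <= self.target_rpo

    def open_items(self) -> int:
        """Count of findings that need follow-up before the next drill."""
        return (
            len(self.manual_steps)
            + len(self.alarms_missed)
            + len(self.data_repairs)
            + len(self.doc_errors)
        )


drill = DrillRecord(
    name="quarterly API failover",
    target_rto=timedelta(minutes=15),
    actual_rto=timedelta(minutes=22),
    target_rpo=timedelta(minutes=1),
    actual_rpo=timedelta(seconds=10),
    manual_steps=["operator confirmed failback"],
    alarms_missed=["queue depth alarm did not fire"],
    doc_errors=["runbook listed the old database endpoint"],
)
print(drill.met_targets(), drill.open_items())
```

In this example the RPO target was met but the RTO was not, so the drill counts as a miss with three follow-up items, which is exactly the kind of result a drill exists to surface.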

What to focus on during your build

The priority order I would use:

Data safety. Decide RPO first. Everything else is easier to change later.
Ingress behavior. Know how users reach the healthy region.
Configuration switching. Secrets, endpoints, queues, and active-region flags must move together.
Split-brain prevention. Only one writer, one scheduler, and one active region.
Cost boundaries. Choose which services stay warm and which wake up later.
Drills. Recovery that has not been practiced is just a theory.
Maintenance. Revisit the plan when the product, compliance posture, vendors, or AWS services change.

The case study shows one way to apply these principles with ECS Fargate, Aurora Global Database, independent Redis, CloudFront, Global Accelerator, Terraform, and a Lambda/Step Functions orchestrator: Migrating to AWS with Automated Disaster Recovery.

Further reading: Christopher Adamson's Creating a Disaster Recovery Plan Using AWS Services is a useful broad checklist for AWS DR planning. I used it as a cross-check for coverage; the structure, examples, cost framing, and recommendations here are written for this guide.