This is the planning guide I wish I had before rebuilding a production platform for disaster recovery on AWS.
The companion case study is Migrating to AWS with Automated Disaster Recovery. That post covers the actual architecture I built. This one is about how to decide what you should build, what to prioritize, what each layer costs, and what cheaper alternatives are reasonable.
Start with the recovery requirement
Most DR designs go wrong because the team starts with services instead of failure modes.
Before picking Aurora, ECS, Global Accelerator, or any other AWS product, write down the business requirements:
| Question | What it decides |
|---|---|
| Which business functions must survive? | The systems that deserve DR spend before everything else |
| What does downtime cost per hour? | Whether warm standby is justified or backup-and-restore is enough |
| What failure are you surviving? | Instance failure, AZ failure, database failure, full regional outage, or provider outage |
| What RTO do you need? | How long the app can be unavailable before the incident becomes unacceptable |
| What RPO do you need? | How much committed data can be lost or replayed |
| What can degrade? | Sessions, cache, search, analytics, background jobs, admin tooling, or partner workflows |
| Which obligations apply? | Compliance, audit evidence, encryption, retention, and customer commitments |
This is the business impact analysis, even if you do not call it that. It keeps the architecture honest. A checkout flow, payment ledger, and customer database may deserve low RTO. A marketing CMS, internal report, or analytics pipeline may not.
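One rough way to answer the downtime-cost question is to compare the downtime a DR tier avoids against what that tier costs per year. The sketch below is a planning aid under stated assumptions, not a formula from the case study. Every figure in it (hourly cost, outage hours, monthly spend) is a hypothetical placeholder.

```python
def annual_downtime_loss(cost_per_hour: float, outage_hours_per_year: float) -> float:
    """Expected yearly loss from outages at the given hourly downtime cost."""
    return cost_per_hour * outage_hours_per_year


def dr_tier_pays_off(
    cost_per_hour: float,
    outage_hours_without_dr: float,
    outage_hours_with_dr: float,
    dr_monthly_cost: float,
) -> bool:
    """True when the downtime avoided is worth more than the yearly DR spend."""
    avoided_hours = outage_hours_without_dr - outage_hours_with_dr
    avoided_loss = annual_downtime_loss(cost_per_hour, avoided_hours)
    return avoided_loss > dr_monthly_cost * 12


# Hypothetical checkout flow: $2,000/hour, 8 outage hours per year with
# backup-and-restore, 0.5 hours with a warm standby costing $900/month.
print(dr_tier_pays_off(2000, 8, 0.5, 900))  # 15,000 avoided vs 10,800 spent

# Hypothetical internal report that loses $100/hour, same standby cost.
print(dr_tier_pays_off(100, 8, 0.5, 900))   # 750 avoided vs 10,800 spent
```

The first case justifies the spend and the second does not, which is the same split the paragraph above draws between a checkout flow and an internal report.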
For many products, the right answer is not full multi-region automation on day one. A good DR plan usually grows in layers:
- Backups that restore. Not backups that merely exist.
- Infrastructure as Code. You should be able to recreate the environment without guessing.
- Multi-AZ inside one region. This handles the failures you are most likely to see.
- A documented regional restore path. Even if it is manual at first.
- Warm standby in a second region. Only when the business needs low RTO.
- Automated failover and failback. Only when manual recovery is too slow or too error-prone.
Pick a DR tier
AWS has a few common DR patterns. The names are less important than the operating model.
| Pattern | Typical RTO | Typical RPO | Monthly cost posture | When it fits |
|---|---|---|---|---|
| Backup and restore | Hours to days | Last backup | Lowest | Internal tools, early-stage products, low-traffic apps |
| Pilot light | Tens of minutes to hours | Minutes to hours | Low to medium | Critical data replicated, compute mostly off |
| Warm standby | Minutes | Seconds to minutes | Medium to high | Customer-facing apps that need fast recovery |
| Active-active | Seconds | Near-zero | Highest | Global products that can handle multi-writer complexity |
The platform in the case study uses a warm standby model:
- primary region serves all normal traffic
- DR region keeps the serving path warm at minimal capacity
- database replication is continuous
- workers and cron scale up only during failover
- failback is automated but still requires operator confirmation
That is a good fit when full downtime is expensive, but active-active writes are overkill.
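The tier table can be read as a lookup: take the cheapest pattern whose typical RTO and RPO still fit your targets. Here is a minimal sketch of that lookup. The cut-off values are illustrative assumptions loosely based on the table, not AWS-defined limits, so adjust them to your own measurements.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive.
# Each entry: (name, shortest RTO target it is assumed to meet,
#              shortest RPO target it is assumed to meet).
TIERS = [
    ("backup and restore", timedelta(hours=4), timedelta(hours=24)),
    ("pilot light", timedelta(minutes=30), timedelta(minutes=15)),
    ("warm standby", timedelta(minutes=5), timedelta(seconds=30)),
    ("active-active", timedelta(0), timedelta(0)),
]


def cheapest_tier(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest tier whose assumed RTO and RPO floors fit the targets."""
    for name, min_rto, min_rpo in TIERS:
        if rto >= min_rto and rpo >= min_rpo:
            return name
    return TIERS[-1][0]


print(cheapest_tier(timedelta(hours=8), timedelta(hours=24)))
print(cheapest_tier(timedelta(minutes=10), timedelta(minutes=1)))
print(cheapest_tier(timedelta(hours=1), timedelta(minutes=5)))
```

The third call shows why both targets matter: a one-hour RTO alone would allow pilot light, but a five-minute RPO pushes the choice to warm standby under these assumed cut-offs.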
Reference architecture
A practical warm-standby architecture usually has these layers: ingress, compute, database, cache, networking, automation, and observability.
The shape is simple. The hard part is deciding how much money and automation each layer deserves.
Cost model
The numbers below are intentionally approximate. AWS cost varies by region, traffic, storage, data transfer, log volume, retention, instance size, Aurora ACU usage, NAT traffic, and how much standby capacity you keep warm.
Use this as a planning model, not a quote.
| Layer | Common choice | Rough monthly range | Main cost driver |
|---|---|---|---|
| Ingress | CloudFront, ALB, Global Accelerator | $20-$100+ | ALBs, accelerator hours, traffic, request volume |
| Compute | ECS Fargate warm standby | $50-$300+ | Always-on tasks, CPU/memory size, number of services |
| Database (high end) | Aurora Global Database | $200-$400+ before serious traffic | Primary and secondary clusters, ACUs or instances, storage, IO |
| Database (cheaper) | RDS PostgreSQL with replica or restore automation | $30-$200+ | Instance class, Multi-AZ, storage, backup retention |
| Redis | ElastiCache per region | $30-$150+ | Node class, number of regions, replication choice |
| Networking | NAT gateways and VPC endpoints | $50-$250+ | NAT hourly charges, endpoint count, data processing |
| Automation | Lambda and Step Functions | $0-$20 for most small systems | State transitions, Lambda duration, schedule frequency |
| Observability | Managed logs and metrics | $20-$300+ | Log ingestion, retention, metrics cardinality |
Two parts surprise teams most often:
Aurora Global Database is excellent, but it is not cheap. Even before traffic, a primary cluster plus a secondary region can land in the $200-$400+/month range depending on configuration.
Networking can quietly become real money. One NAT gateway per region may feel small, but hourly NAT charges plus data processing plus interface endpoints add up quickly.
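To see how the layers add up, it helps to sum the table for each database option. The sketch below only adds the low and high ends of the ranges listed above. The dictionary keys are my own labels, and because every upper bound carries a "+", the high total is a floor for the expensive case rather than a ceiling.

```python
# (low, high) monthly ranges in USD, copied from the cost table.
LAYERS = {
    "ingress": (20, 100),
    "compute": (50, 300),
    "database_aurora_global": (200, 400),
    "database_rds": (30, 200),
    "redis": (30, 150),
    "networking": (50, 250),
    "automation": (0, 20),
    "observability": (20, 300),
}


def monthly_range(database: str) -> tuple[int, int]:
    """Sum the low and high ends for a stack using the chosen database layer."""
    chosen = [v for k, v in LAYERS.items() if not k.startswith("database_")]
    chosen.append(LAYERS[database])
    return sum(low for low, _ in chosen), sum(high for _, high in chosen)


print(monthly_range("database_aurora_global"))  # (370, 1520)
print(monthly_range("database_rds"))            # (200, 1320)
```

On these numbers the database choice moves the low end of the whole stack from roughly $200 to roughly $370 a month, which is why it is the first cost decision to make.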
Database decision guide
The database is the most important DR decision because it defines both data safety and failback complexity.
Option 1: Aurora Global Database
Use Aurora Global Database when you need low RTO/RPO and want AWS to own most of the replication and failback mechanics.
| Good fit | Bad fit |
|---|---|
| You need regional failover in minutes | You are trying to keep the full DR stack under a very small budget |
| You want managed promotion and cleaner failback | You can tolerate manual database recovery |
| You have enough production revenue to justify a few hundred dollars monthly for the data layer | Your workload is still early and regional outage recovery can be slower |
The main reason to choose it is not raw replication alone. The reason is operational: a regional failover should not leave you with a detached database that must be manually rebuilt before the next drill.
Option 2: RDS PostgreSQL plus Lambda or Step Functions
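If you do not want to spend $200-$400+ a month on Aurora Global Database, a cheaper path is RDS PostgreSQL with automation around backups, replicas, DNS or config updates, and app scaling.
A common version keeps a cross-region read replica, promotes it during failover, switches app configuration to the promoted endpoint, and scales compute in the DR region.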
This saves money, but it moves complexity into your own automation and runbooks:
- promotion may detach the replica
- failback may require rebuilding replication
- Terraform state may need careful handling after promotion
- app configuration must switch to the promoted endpoint
- you need drills to prove the process works
That tradeoff is completely reasonable for smaller systems. Just be honest that you are buying lower monthly cost with higher operational work during recovery.
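The failover half of that RDS path can be sketched in Python with boto3. This is one possible ordering of the steps, not the case study's orchestrator. The function takes already-created boto3 clients for the DR region as arguments; the identifiers, the SSM parameter name, and the assumption that tasks read the database endpoint at startup are all hypothetical. `DryRunClient` is a stand-in I added so the step order can be exercised without touching AWS.

```python
def fail_over_to_replica(rds, ssm, ecs, *, replica_id, endpoint_param,
                         promoted_endpoint, cluster, services):
    """Promote the DR replica, repoint the app, then scale compute."""
    # 1. Promote the cross-region read replica to a standalone writer.
    #    After this it no longer replicates from the old primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Block until the promoted instance reports "available".
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # 3. Publish the promoted endpoint where the app reads its config.
    ssm.put_parameter(Name=endpoint_param, Value=promoted_endpoint,
                      Type="String", Overwrite=True)

    # 4. Scale each ECS service and restart tasks so they pick up the
    #    new endpoint (assumes tasks read the parameter at startup).
    for service, count in services.items():
        ecs.update_service(cluster=cluster, service=service,
                           desiredCount=count, forceNewDeployment=True)


class DryRunClient:
    """Records method names instead of calling AWS."""

    def __init__(self, name, log):
        self._name, self._log = name, log

    def __getattr__(self, method):
        def record(*args, **kwargs):
            self._log.append(f"{self._name}.{method}")
            return self  # lets get_waiter(...).wait(...) chain
        return record


log = []
fail_over_to_replica(
    DryRunClient("rds", log), DryRunClient("ssm", log), DryRunClient("ecs", log),
    replica_id="app-db-replica",
    endpoint_param="/app/db/endpoint",
    promoted_endpoint="app-db-replica.example.internal",
    cluster="app-dr",
    services={"api": 2, "worker": 1},
)
print(log)
```

Failback is deliberately missing. As the list above notes, rebuilding replication and reconciling Terraform state after promotion is where most of the extra operational work lives.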
Option 3: Single-region RDS with tested restore
For staging, internal tools, or products where regional outage recovery can take hours, a single Multi-AZ RDS instance with tested snapshot restore may be enough.
The key word is tested. A backup plan without restore drills is not a DR plan.
Replicate the right data
Postgres is usually the center of the DR plan, but it is rarely the only data store.
Inventory data by type before choosing replication tools:
| Data type | Common AWS option | What to watch |
|---|---|---|
| Relational data | Aurora Global, RDS replica, AWS Backup, snapshots | Promotion behavior, consistency, failback, Terraform state |
| Object storage | S3 versioning and cross-region replication | Delete replication, lifecycle rules, KMS keys, replication lag |
| Block volumes | EBS snapshots and lifecycle policies | Restore time and whether the volume is actually still needed |
| File data | EFS replication, DataSync, Storage Gateway | Throughput, file count, and consistency during active writes |
| Logs and audit trails | CloudWatch export, S3 archive, CloudTrail organization trails | Retention, immutability, and searchability during incidents |
Replication is not only about having a copy. You also need to know whether the copy is consistent, encrypted with usable keys in the recovery region, monitored for failure, and restorable within your RTO.
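Those checks are easy to turn into a small inventory audit. The sketch below is one hypothetical shape for it: the field names and the example store are made up, and it does not try to verify consistency, which usually needs a store-specific check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataStore:
    name: str
    replicated: bool                       # a copy exists in the recovery region
    key_usable_in_dr: bool                 # the recovery region can decrypt it
    monitored: bool                        # replication failure or lag raises an alert
    last_restore_minutes: Optional[float]  # last restore drill; None if never run


def gaps(store: DataStore, rto_minutes: float) -> list[str]:
    """List the reasons this store's copy would not count as recoverable."""
    problems = []
    if not store.replicated:
        problems.append("no copy in recovery region")
    if not store.key_usable_in_dr:
        problems.append("KMS key not usable in recovery region")
    if not store.monitored:
        problems.append("replication is not monitored")
    if store.last_restore_minutes is None:
        problems.append("restore never tested")
    elif store.last_restore_minutes > rto_minutes:
        problems.append("restore slower than RTO")
    return problems


uploads = DataStore("s3-uploads", replicated=True, key_usable_in_dr=False,
                    monitored=True, last_restore_minutes=45)
print(gaps(uploads, rto_minutes=30))
```

An empty list is the goal. Anything else is a line item for the next drill.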
Redis and cache strategy
Do not replicate Redis across regions unless the data inside Redis deserves that cost.
Ask what Redis is actually storing:
| Redis data | DR posture |
|---|---|
| Cache entries | Let them repopulate |
| Sessions | Often acceptable to force re-login during regional DR |
| Rate limits | Usually acceptable to reset |
| WebSocket adapter state | Reconnect clients after failover |
| Durable business data | It probably should not live only in Redis |
ElastiCache Global Datastore can make sense, but it is often unnecessary for product workloads where persistent state is in Postgres. Independent Redis nodes per region are a strong default for cost-conscious warm standby.
Ingress choices
Ingress is where DR becomes visible to users. DNS-only failover is simple, but it depends on TTLs, resolver behavior, and client caching. That may be fine for many apps. It is not ideal for mobile clients, APIs, or WebSockets.
| Choice | Use it when | Tradeoff |
|---|---|---|
| DNS failover | You can tolerate minutes of propagation uncertainty | Cheapest and simplest, but client behavior varies |
| CloudFront origin groups | You serve web frontends or cacheable assets | Great per-request origin failover, less ideal as the only API ingress |
| Global Accelerator | You need static anycast IPs, fast regional reroute, or WebSocket-friendly ingress | Adds monthly cost and another AWS layer |
| Provider-managed proxy/load balancer | You already trust that provider as part of the control plane | Can reintroduce the dependency you are trying to escape |
For the case study, I used CloudFront for SSR frontends and Global Accelerator for API/WebSocket traffic because they optimize for different failure behavior.
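The "minutes of propagation uncertainty" for DNS failover can be estimated with simple arithmetic: detection time plus one record TTL. The sketch below assumes a health check that must fail a fixed number of consecutive times, and a resolver that honors the TTL. The interval, threshold, and TTL values are illustrative, and clients that cache beyond the TTL will take longer.

```python
def dns_failover_window(check_interval_s: int, failure_threshold: int,
                        record_ttl_s: int) -> int:
    """Rough upper bound, in seconds, before a well-behaved client sees the new record.

    Detection takes about interval * threshold. A resolver that cached the
    old answer just before the switch can serve it for one more full TTL.
    """
    return check_interval_s * failure_threshold + record_ttl_s


print(dns_failover_window(30, 3, 60))   # 150
print(dns_failover_window(30, 3, 300))  # 390
```

With a 300-second TTL the window is already past six minutes before any client misbehavior, which is one reason long-lived API and WebSocket clients tend to push teams toward an ingress layer that reroutes without a DNS change.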
Compute strategy
Warm standby does not mean every service runs at full capacity in both regions.
Split services by recovery priority:
| Service type | DR capacity |
|---|---|
| User-facing web/API path | Keep warm at minimum capacity |
| Workers | Scale from zero or low capacity during failover |
| Cron/schedulers | Run only in the active region |
| Admin tools | Keep off or minimal unless needed during incidents |
| Marketing site | Prefer static or edge-cached if possible |
This is one of the easiest ways to control cost. The serving path recovers quickly, while less urgent compute only starts when the failover is real.
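One way to make that split explicit is a single capacity map that the failover automation reads for each region's role. The service names and counts below are hypothetical; only the pattern comes from the table above.

```python
# Desired task counts per service for each regional role.
CAPACITY = {
    "api":       {"active": 4, "standby": 1},  # serving path stays warm
    "web":       {"active": 2, "standby": 1},
    "worker":    {"active": 3, "standby": 0},  # scales up only on failover
    "scheduler": {"active": 1, "standby": 0},  # only the active region runs cron
    "admin":     {"active": 1, "standby": 0},
}


def desired_counts(role: str) -> dict[str, int]:
    """Return the task count each service should run for 'active' or 'standby'."""
    if role not in ("active", "standby"):
        raise ValueError(f"unknown role: {role}")
    return {service: counts[role] for service, counts in CAPACITY.items()}


print(desired_counts("standby"))
print(sum(desired_counts("standby").values()), "standby tasks vs",
      sum(desired_counts("active").values()), "active tasks")
```

Keeping the map in one place means failover is a role swap rather than a list of per-service numbers someone has to remember during an incident.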
Automation priorities
You do not need to automate everything immediately. Automate the steps that are dangerous, slow, or easy to do incorrectly during stress.
The first automation targets should be:
- health checks that distinguish partial failure from regional failure
- a failover lock to prevent split brain
- database promotion
- active-region configuration updates
- ECS desired-count changes
- queue endpoint and secret localization
- failback verification
Manual confirmation is still useful for failback. Failover often needs speed. Failback needs caution.
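The failover lock and the failback confirmation gate can be illustrated together. This is an in-memory sketch of the idea only: a real lock has to live in a shared store that supports conditional writes, and the class and function names here are hypothetical.

```python
import threading


class FailoverLock:
    """Single-owner lock. In-memory here for illustration only."""

    def __init__(self):
        self._owner = None
        self._mutex = threading.Lock()

    def acquire(self, owner: str) -> bool:
        """Take the lock if it is free or already held by the same owner."""
        with self._mutex:
            if self._owner in (None, owner):
                self._owner = owner
                return True
            return False

    def release(self, owner: str) -> bool:
        """Release only if the caller is the current owner."""
        with self._mutex:
            if self._owner == owner:
                self._owner = None
                return True
            return False


def start_failback(lock: FailoverLock, run_id: str, operator_confirmed: bool) -> str:
    """Failback proceeds only with explicit confirmation and an uncontended lock."""
    if not operator_confirmed:
        return "waiting for operator confirmation"
    if not lock.acquire(run_id):
        return "another run holds the lock"
    return "failback started"


lock = FailoverLock()
print(lock.acquire("failover-1"))                   # True
print(start_failback(lock, "failback-1", True))     # another run holds the lock
lock.release("failover-1")
print(start_failback(lock, "failback-1", False))    # waiting for operator confirmation
print(start_failback(lock, "failback-1", True))     # failback started
```

The ordering in `start_failback` is a design choice: checking confirmation before touching the lock means an unconfirmed failback never blocks a real failover.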
Monitoring, audit, and security
DR automation should not run on vibes. It needs signals you trust and permissions you can defend.
At minimum, plan for:
| Concern | AWS building block | Why it matters |
|---|---|---|
| Health and metrics | CloudWatch metrics and alarms | Detect sustained failure without relying on a human refreshing dashboards |
| Notifications | SNS or an on-call integration | Make failover visible even when it is automated |
| API audit trail | CloudTrail | Know who or what changed infrastructure during an incident |
| Configuration drift | AWS Config or IaC drift checks | Catch resources that no longer match the recovery plan |
| Access control | IAM roles with least privilege | Limit what automation and operators can change |
| Encryption | KMS keys in every required region | A replicated backup is useless if the recovery region cannot decrypt it |
The alerting rule of thumb is simple: page on symptoms that affect recovery, not on every noisy metric. Too many DR alerts are as dangerous as too few because the team stops trusting them.
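"Sustained failure" usually means requiring several bad datapoints before acting, so that one failed probe does not trigger a failover. A minimal sketch of an m-of-n rule is below. CloudWatch alarms offer a similar datapoints-to-alarm setting, but this class is a standalone illustration, and the thresholds are assumptions.

```python
from collections import deque


class SustainedFailure:
    """Fires when at least `m` of the last `n` probe results are failures."""

    def __init__(self, m: int, n: int):
        if not 0 < m <= n:
            raise ValueError("need 0 < m <= n")
        self.m = m
        self.window = deque(maxlen=n)  # oldest results fall off automatically

    def observe(self, healthy: bool) -> bool:
        """Record one probe result and report whether the alarm is firing."""
        self.window.append(healthy)
        failures = sum(1 for ok in self.window if not ok)
        return failures >= self.m


alarm = SustainedFailure(m=3, n=5)
probes = [True, False, True, False, False, False]
print([alarm.observe(p) for p in probes])
# [False, False, False, False, True, True]
```

The isolated failure in the second probe never fires the alarm. It takes three failures inside the five-probe window.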
Minimum viable DR path
If the full architecture is too expensive right now, build this first:
- Put infrastructure in Terraform.
- Store secrets outside the repo and outside Terraform state.
- Run the app statelessly so compute can be recreated.
- Use Multi-AZ RDS in the primary region.
- Enable automated backups and snapshot retention.
- Restore a backup into a clean environment at least once.
- Document exactly how traffic, secrets, queues, and database endpoints change during recovery.
- Add a DR region only for the components that are hardest to recreate quickly.
- Add automation after the manual drill is repeatable.
This path is not glamorous, but it is how most teams should start. A simple, drilled recovery plan beats an expensive multi-region diagram nobody has tested.
Testing and maintenance cadence
DR is not a document you finish. It is an operating habit.
Use a cadence like this:
| When | What to test |
|---|---|
| Every deploy | Basic health checks, migrations, rollback path, secrets export |
| Monthly | Backup restore into an isolated environment |
| Quarterly | Failover drill for the most critical path |
| After major architecture changes | RTO/RPO assumptions, runbooks, automation permissions, dashboards |
| Before contract or compliance reviews | Evidence: test results, diagrams, contacts, retention policy, access review |
Each drill should record:
- whether the target RTO and RPO were met
- which manual steps still exist
- which alarms fired or failed to fire
- what data needed repair or replay
- what documentation was wrong
The point is not to stage a perfect demo. The point is to find the painful parts while there is no real outage.
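A drill record does not need a tool. A small structure that captures the five items above is enough to compare drills over time. The field names and example values here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class DrillRecord:
    target_rto: timedelta
    target_rpo: timedelta
    measured_rto: timedelta
    measured_rpo: timedelta
    manual_steps: list[str] = field(default_factory=list)
    alarms_missed: list[str] = field(default_factory=list)
    data_repairs: list[str] = field(default_factory=list)
    doc_errors: list[str] = field(default_factory=list)

    def met_targets(self) -> bool:
        """True only when both measured values are within their targets."""
        return (self.measured_rto <= self.target_rto
                and self.measured_rpo <= self.target_rpo)

    def follow_ups(self) -> list[str]:
        """Everything the drill surfaced that should be fixed before the next one."""
        return (self.manual_steps + self.alarms_missed
                + self.data_repairs + self.doc_errors)


drill = DrillRecord(
    target_rto=timedelta(minutes=15), target_rpo=timedelta(minutes=1),
    measured_rto=timedelta(minutes=22), measured_rpo=timedelta(seconds=20),
    manual_steps=["update queue endpoint by hand"],
    doc_errors=["runbook lists old cluster name"],
)
print(drill.met_targets())   # False: RTO missed by 7 minutes
print(drill.follow_ups())
```

A drill that misses its RTO but produces a short, concrete follow-up list has done its job.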
What to focus on during your build
The priority order I would use:
- Data safety. Decide RPO first. Everything else is easier to change later.
- Ingress behavior. Know how users reach the healthy region.
- Configuration switching. Secrets, endpoints, queues, and active-region flags must move together.
- Split-brain prevention. Only one writer, one scheduler, and one active region.
- Cost boundaries. Choose which services stay warm and which wake up later.
- Drills. Recovery that has not been practiced is just a theory.
- Maintenance. Revisit the plan when the product, compliance posture, vendors, or AWS services change.
The case study shows one way to apply these principles with ECS Fargate, Aurora Global Database, independent Redis, CloudFront, Global Accelerator, Terraform, and a Lambda/Step Functions orchestrator: Migrating to AWS with Automated Disaster Recovery.
Further reading: Christopher Adamson's Creating a Disaster Recovery Plan Using AWS Services is a useful broad checklist for AWS DR planning. I used it as a cross-check for coverage; the structure, examples, cost framing, and recommendations here are written for this guide.