cms - thecloudguy: May 2026

Here's an uncomfortable truth most cloud architects quietly acknowledge: a DR plan that lives in a Word document is not a DR plan. It's a wish list. When a region goes dark at 2 AM, nobody is calmly following a 47-step runbook — they're improvising, making mistakes, and hoping. Oracle's Full Stack Disaster Recovery (FSDR) is built on the premise that hope is not an operations strategy.

What Exactly is OCI FSDR?

Think of FSDR as a conductor for your disaster recovery orchestra. Your databases, compute instances, load balancers, storage volumes, and middleware are all musicians — each knowing their individual part. But without someone coordinating them, you get noise, not music. FSDR is that conductor: it knows the sequence, enforces the timing, and makes sure the database switches roles before the application servers come back up, not after.

More precisely, FSDR is OCI's native, fully managed DR orchestration service. It moves your entire application stack (not just the database, not just the compute) from a primary OCI region to a standby region, in a single orchestrated workflow you can trigger with one click. What runs is a pretested, automated plan that executes the same way every single time.

It went generally available at Oracle CloudWorld 2022 and has been expanding steadily — Singapore West, Riyadh, Chile West, Paris, Milan, Newport — with more regions added through 2024 and 2025.

The DR Lie Most Teams Are Living

Ask any team if they have a DR plan. They'll say yes. Ask them when they last tested it end-to-end. Watch the room go quiet. The dirty secret of enterprise DR is that most plans are theoretical — written for the architecture that existed two years ago, by someone who has since left the company, covering an application that has since grown three new dependencies nobody documented.

The problem isn't intent. It's that traditional DR is genuinely hard to get right, for a few structural reasons:

Tools don't talk to each other. Your database failover tool knows nothing about your compute layer. Your compute snapshots know nothing about your middleware config. You end up with a relay race where nobody's sure who's holding the baton.
Runbooks rot the moment you write them. Your application changes every sprint. Your DR runbook gets updated never. By the time you need it, it's archaeology, not operations.
Recovery at scale is a different problem entirely. Bringing one app back up is stressful but doable. Bringing ten back simultaneously, in the right order, with the right dependencies? That's where teams discover their limits.
No test, no trust. DR drills are expensive, risky-feeling, and easy to deprioritize. So they don't happen. And the first real test is a real disaster.

Learn the Language Before You Touch the Console

FSDR has a vocabulary. Some terms look deceptively familiar — RTO, failover, standby — but carry specific weight inside FSDR that matters when you're configuring real protection groups, not just reading about them. Get these wrong in your head and you'll wire things up wrong in the console.

Here's what they actually mean:

RTO => How long your business can survive the app being down. Not a technical target — a business commitment. Everything else is shaped around this number.

RPO => How much data your business can afford to lose, expressed in time. "One hour RPO" means you're okay losing up to 60 minutes of transactions if the worst happens.

Primary => Your production region. Where real users are hitting real workloads right now.

Standby => Your waiting region. Quiet until it isn't. Everything is pre-positioned here so it can take over without scrambling.

Protection Group => The core FSDR concept. Think of it as a named boundary around everything that belongs to one application — compute, DB, load balancer, storage. If it needs to move together, it lives in the same Protection Group.

DR Plan => The actual playbook — an ordered sequence of plan groups and steps that FSDR executes during a transition. This is what runs when things go wrong (or when you're drilling).

Switchover => The planned version. You choose to move. Primary shuts down gracefully, standby comes up clean. Zero data loss. Use this for maintenance windows and drills.

Failover => The unplanned version. Something broke and you can't wait. Standby starts immediately without waiting for primary to acknowledge. Some data loss may occur depending on your replication lag.

Warm Standby => Resources are already provisioned and running in the standby region. Costs more, recovers faster. Right choice for anything customer-facing.

Cold Standby => Standby region has minimal pre-deployment — resources get provisioned during the DR transition itself. Cheaper to run day-to-day, but your RTO takes the hit. Fine for non-critical internal systems.

Prechecks => FSDR's built-in sanity check. Runs before any DR operation to confirm the standby is actually ready. Your early-warning system for configuration drift. Run these regularly, not just before a disaster.
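To make the RTO, RPO, switchover, and failover definitions concrete, here is a minimal sketch of the arithmetic behind them. The class and function names are hypothetical, and the model is an assumption drawn from the definitions above: a switchover loses nothing, while a failover can lose whatever had not yet replicated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    rto_minutes: float  # longest outage the business accepts
    rpo_minutes: float  # largest window of lost data the business accepts


def worst_case_data_loss(plan_type: str, replication_lag_minutes: float) -> float:
    """Return the data-loss window in minutes for a plan type.

    Assumption: a switchover is graceful and loses nothing; a failover can
    lose up to the current replication lag.
    """
    if plan_type == "switchover":
        return 0.0
    if plan_type == "failover":
        return replication_lag_minutes
    raise ValueError(f"unknown plan type: {plan_type}")


def meets_targets(
    targets: RecoveryTargets,
    plan_type: str,
    replication_lag_minutes: float,
    measured_recovery_minutes: float,
) -> bool:
    """True when both the data-loss window and the recovery time fit the targets."""
    loss = worst_case_data_loss(plan_type, replication_lag_minutes)
    return (
        loss <= targets.rpo_minutes
        and measured_recovery_minutes <= targets.rto_minutes
    )


# Illustrative numbers only: a 60-minute RTO and a 15-minute RPO.
targets = RecoveryTargets(rto_minutes=60, rpo_minutes=15)
failover_ok = meets_targets(targets, "failover", 20, 45)
switchover_ok = meets_targets(targets, "switchover", 20, 45)
```

With a 20-minute replication lag, the failover case misses the 15-minute RPO even though recovery time is within the RTO, which is why replication lag deserves as much attention as recovery speed.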

How FSDR Thinks About Your Infrastructure

Everything revolves around one idea: your application has a primary home and a standby home. FSDR's entire job is to move it from one to the other — cleanly, completely, and in the right order — whenever you tell it to.

The two regions are represented by DR Protection Groups. Think of a Protection Group as a container that says "these resources belong to the same application and need to move together." Every compute instance, every database, every load balancer that's part of your app gets added as a member. Once that's done, FSDR understands your topology and can reason about it automatically.

Primary Region:- Active production workloads. Compute, databases, load balancers, storage all running live traffic.
Standby Region:- Reserved or warm-standby infrastructure. DR plans execute here during failover or switchover.
Protection Groups:- Paired consistency groups (one per region) representing the full application system.
DR Plans:- Automated workflows — Switchover or Failover — executed from the standby protection group.

FSDR can work cross-region (different OCI regions entirely) or intra-region (across Availability Domains within one region). For anything mission-critical, cross-region is the right choice: the whole point is surviving an event big enough to take out an entire region, and for that you need geographic distance.
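The pairing described above can be pictured as a small data model. This is a sketch for illustration only: `ProtectionGroup`, `pair`, and `is_cross_region` are hypothetical names, not FSDR API calls, and the member names are made up. The region identifiers are ordinary OCI region names used as sample values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProtectionGroup:
    name: str
    region: str
    role: str  # "PRIMARY" or "STANDBY"
    members: List[str] = field(default_factory=list)
    # Excluded from repr/compare so the two-way link cannot recurse.
    peer: Optional["ProtectionGroup"] = field(default=None, repr=False, compare=False)


def pair(primary: ProtectionGroup, standby: ProtectionGroup) -> None:
    """Link two groups as peers, insisting on one of each role."""
    if primary.role != "PRIMARY" or standby.role != "STANDBY":
        raise ValueError("pairing needs one PRIMARY group and one STANDBY group")
    primary.peer = standby
    standby.peer = primary


def is_cross_region(group: ProtectionGroup) -> bool:
    """True when the group and its peer live in different regions."""
    if group.peer is None:
        raise ValueError(f"{group.name} has no peer yet")
    return group.region != group.peer.region


primary = ProtectionGroup(
    "shop-primary", "us-ashburn-1", "PRIMARY", ["app-vm-1", "shop-db", "public-lb"]
)
standby = ProtectionGroup(
    "shop-standby", "us-phoenix-1", "STANDBY", ["shop-db-standby", "public-lb-standby"]
)
pair(primary, standby)
```

Here `is_cross_region(primary)` is true because the two groups sit in different regions; an intra-region pair would share the same region value.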

What FSDR Actually Does Well

Auto-generates your DR plan — from scratch

Add your resources to a Protection Group and FSDR introspects the topology, figures out the dependencies, and hands you a working DR plan. You don't write the plan; you review and extend it. For large application stacks, this alone saves days of manual sequencing work.
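How FSDR derives its ordering internally is not described here, so the following is only a sketch of the general idea: turning a dependency map into a safe start-up sequence with a topological sort. The resource names and the dependency map are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical stack: each resource maps to the resources that must be up first.
depends_on: Dict[str, Set[str]] = {
    "database": set(),
    "file-storage": set(),
    "app-servers": {"database", "file-storage"},
    "load-balancer": {"app-servers"},
}


def startup_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every resource starts after its dependencies.

    Raises CycleError when the dependencies loop back on themselves, which
    would make any ordering impossible.
    """
    return list(TopologicalSorter(deps).static_order())


order = startup_order(depends_on)
```

The relative order of `database` and `file-storage` is not fixed because neither depends on the other, but both always come before `app-servers`, and `load-balancer` always comes last.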

Prechecks — so you're not surprised when it actually counts

Before running any DR operation, FSDR runs a full set of validation checks: is the standby database replicating? Are all the right IAM policies in place? Is the standby compute configured correctly? If anything would cause the recovery to fail, you find out now, not at 3 AM when the primary region is on fire.
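The shape of a precheck run can be sketched as a set of named predicates evaluated against a snapshot of the standby's state. Everything here is an assumption for illustration: the state keys, the check names, and `run_prechecks` are hypothetical and do not reflect FSDR's actual checks or output format.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Hypothetical checks mirroring the three questions in the text.
CHECKS: Dict[str, Callable[[State], bool]] = {
    "standby database is replicating": lambda s: s.get("db_replicating") is True,
    "required IAM policies are present": lambda s: set(
        s.get("required_policies", ())
    ) <= set(s.get("present_policies", ())),
    "standby compute is configured": lambda s: s.get("standby_compute_ready") is True,
}


def run_prechecks(state: State, checks: Dict[str, Callable[[State], bool]]) -> List[str]:
    """Return the names of the checks that failed; an empty list means ready."""
    return [name for name, check in checks.items() if not check(state)]


healthy = {
    "db_replicating": True,
    "required_policies": ["manage-dr", "read-buckets"],
    "present_policies": ["manage-dr", "read-buckets", "extra"],
    "standby_compute_ready": True,
}
# Same environment after drift: replication stopped and a policy was removed.
drifted = dict(healthy, db_replicating=False, present_policies=["manage-dr"])

healthy_failures = run_prechecks(healthy, CHECKS)
drifted_failures = run_prechecks(drifted, CHECKS)
```

Returning the list of failures, rather than a single pass/fail flag, keeps the result actionable: each entry names the thing to fix before the plan can be trusted.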

You can bolt in your own logic anywhere

FSDR handles the infrastructure transitions, but your app almost certainly needs additional steps that no generic tool can anticipate: flushing a Redis cache, updating a DNS record, hitting a webhook, sending a Slack alert. You can inject shell scripts or OCI Functions as custom plan steps at any point in the sequence.
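Conceptually, a custom step is just one more entry placed at a chosen point in an ordered plan. The sketch below models that idea in plain Python; `Step`, `insert_after`, and `run` are hypothetical helpers, and the step names are invented. In FSDR itself the custom step would be a script or an OCI Function rather than a Python callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]


def insert_after(plan: List[Step], anchor_name: str, new_step: Step) -> List[Step]:
    """Return a new plan with new_step placed right after the named step."""
    for i, step in enumerate(plan):
        if step.name == anchor_name:
            return plan[: i + 1] + [new_step] + plan[i + 1:]
    raise ValueError(f"no step named {anchor_name!r} in plan")


def run(plan: List[Step]) -> List[str]:
    """Execute steps in order and return the names of those that completed."""
    completed = []
    for step in plan:
        step.action()  # an exception here stops the plan at the failing step
        completed.append(step.name)
    return completed


log: List[str] = []
generated_plan = [
    Step("switch database role", lambda: log.append("db")),
    Step("start app servers", lambda: log.append("app")),
]
# Custom step: flush the cache once the database has switched roles.
custom_plan = insert_after(
    generated_plan,
    "switch database role",
    Step("flush cache", lambda: log.append("cache")),
)
completed = run(custom_plan)
```

`insert_after` returns a new list instead of mutating the generated plan, so the original sequence stays available for comparison when reviewing what was added.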

You can watch it happen, step by step

The OCI Console shows you a live, step-by-step view of every DR execution as it runs. Each plan step shows its status, duration, and any errors — in real time. No more SSH-ing into servers trying to figure out where the script died. You know exactly where you are in the recovery at every moment.

Setting It Up — What You Actually Do

The setup process is more logical than it looks at first glance. Here's the honest sequence:

Map your stack honestly

Before touching the console, sit down and list every resource your application actually needs to function — not what the architecture diagram says, what it actually uses.

Compute instances, databases, load balancers, file storage, any middleware. Miss something here and your DR plan will have a gap you'll only discover at the worst possible moment.

Create Protection Groups in both regions

One Protection Group in your primary region, one in the standby. Assign their roles, point them at an Object Storage bucket for logs, then link them as peers.

This paired relationship is the foundation everything else builds on.

Add members to each group

This is where you tell FSDR which resources belong to this application. Add your compute instances, databases, OKE clusters — whatever the stack needs.

FSDR introspects what you add and uses that to auto-generate the recovery plan. The quality of what you put in directly determines the quality of the plan that comes out.

Generate plans — then actually look at them

Create your Switchover and Failover plans. FSDR will build them for you, but don't just accept them blindly.

Read through every plan group, understand the sequence, and add any custom steps your application actually needs. This is the moment to think, not during a real outage.

Run prechecks. Then run them again next month.

Prechecks aren't a one-time gate — they're a health signal. Run them regularly. They catch configuration drift: IAM policies that were quietly changed, standby databases that stopped replicating, storage buckets that were deleted. The cost of catching these early is near zero. The cost of catching them during a failover is enormous.

When disaster strikes — one click from the standby side

This is intentional design: failover always executes from the standby region, never the primary. That means even if your primary region is completely unreachable, you can still trigger recovery. Go to the standby Protection Group, execute the Failover plan, and watch FSDR do in minutes what a manual recovery team would spend hours on.
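The "standby side only" rule can be expressed as a simple guard. This is a sketch under stated assumptions: `Group` and `execute_failover` are hypothetical names, and the function only models the role check and the promotion. It deliberately never touches the old primary, since that region may be unreachable.

```python
from dataclasses import dataclass


@dataclass
class Group:
    region: str
    role: str  # "PRIMARY" or "STANDBY"


def execute_failover(target: Group) -> Group:
    """Promote the standby group; refuse to run anywhere else.

    Nothing here contacts the old primary, so the call works even when
    that region is completely dark.
    """
    if target.role != "STANDBY":
        raise RuntimeError("failover must be started from the standby group")
    target.role = "PRIMARY"
    return target


standby = Group(region="us-phoenix-1", role="STANDBY")
promoted = execute_failover(standby)
```

Calling `execute_failover` again on the promoted group raises an error, which matches the intent: once a group is primary, there is nothing left to fail over to from that side.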

What Can FSDR Actually Protect?

This is the question that catches people off guard. FSDR isn't a narrow database tool wearing a DR hat — it has genuine breadth across the OCI service catalogue, and Oracle has been expanding it steadily. Here's what's in scope today:

Compute: VM instances, Dedicated VM Hosts, Boot Volumes, Block Volumes — the bread and butter of most workloads, fully covered.

Database: Autonomous Database Serverless, Base Database Service, Exadata Database Service, and MySQL HeatWave. If you're running Oracle databases, there's an FSDR integration for it.

Containers: OCI Kubernetes Engine (OKE) clusters, added in 2024 and a genuine game-changer for teams running cloud-native workloads alongside traditional infrastructure.

Storage: File Storage, Block Volume groups, Object Storage via replication policies. Your data moves with your application, not separately.

Networking: Load Balancers and Network Load Balancers — so traffic routes correctly the moment your standby comes live, without manual DNS surgery.

Integration: Oracle Integration Cloud (OIC) and Oracle GoldenGate — for teams running Oracle middleware and real-time data replication as part of their stack.

How to Not Waste the Tool You Just Set Up

FSDR gives you real power. These are the habits that determine whether you use it well or spend months convincing yourself you're protected when you're not.

Schedule prechecks like you schedule backups
Drill with Switchover, not just tabletop exercises
One app per Protection Group — resist the urge to bundle
Your custom steps are part of the plan — treat them that way
Cross-region over intra-region — unless you have a specific reason

The Bottom Line

DR has always been one of those things organizations know they need and consistently under-invest in — because the cost of doing it well is visible today, and the cost of doing it badly only shows up when everything is already on fire. FSDR doesn't eliminate that tension, but it does shift the calculus. Setting up protection groups, running prechecks, and drilling with real switchovers is no longer a six-month infrastructure project. It's a few days of focused work and an ongoing operational habit.

What FSDR gets right that most DR tools miss is scope. It doesn't protect your database and leave your compute to fend for itself. It doesn't protect your compute and forget about your load balancers. It thinks about your application as a stack, moves it as a stack, and recovers it as a stack.

If you're running production workloads on OCI and your DR strategy is still a runbook in a shared drive — this is the moment to fix that. Not because something bad is about to happen. Because the whole point is that you genuinely don't know when it will.

Start here

Open the OCI Console, go to Migration & Disaster Recovery -> DR Protection Groups, and create your first Protection Group.
The official docs live at docs.oracle.com/en-us/iaas/disaster-recovery — they're well-maintained and worth reading alongside this post.

#OracleCloud #OCI #DisasterRecovery #FSDR #FullStackDisasterRecovery #CloudInfrastructure #BusinessContinuity #CloudArchitecture #OracleCloudInfrastructure #SRE #DevOps #CloudEngineering #InfrastructureAsCode #Terraform #DataProtection #CloudResilience #OracleDatabase #Kubernetes #OKE #ITResilience

OCI Full Stack Disaster Recovery (FSDR): A Practitioner's Complete Guide