@ryancarson: https://x.com/ryancarson/status/2064751272834593135
Summary
A detailed guide on setting up automated agent-driven disaster recovery using Devin AI, covering two backup strategies (PITR and off-site dumps), a playbook for execution, and live destructive testing.
Cached at: 06/10/26, 07:53 PM
How to automate disaster recovery with agents
You’re an early adopter and your agent is already shipping 100% of your features, bug fixes, and refactors. Pat yourself on the back.
It’s time to level up with automated agent-driven disaster recovery.
(Cheat code: Just point your agent at this post and say “Implement this.”)
A database restore is the scariest button in your whole stack. It’s destructive, it’s rare, and the one time you need it, you’ll be panicking at 2 a.m. trying to remember which backup is the good one.
That’s exactly the kind of high-stakes, low-frequency procedure that should be written down as a playbook your agent can execute — and, crucially, one you’ve actually tested under real conditions before you need it.
This post walks through how we set that up: two independent backup strategies (a point-in-time restore and an off-site dump), a single playbook our agent follows, how we trigger it, how we verify it, and how we ran a live, destructive test against production without losing data.
The agent we use is @DevinAI, and the specific mechanism is a Devin playbook — a reusable procedure Devin loads and executes. We’ll be concrete about how we created that playbook, how Devin runs it, and (since people always ask) how a playbook differs from a skill. The concepts generalize to any capable coding agent, but the wiring below is exactly what we run.
Throughout, two pieces of standard vocabulary are worth keeping in mind, because they’re how experienced operators reason about this:
- RPO (Recovery Point Objective): how much data you can afford to lose, measured in time. A daily backup implies an RPO of up to ~24h; continuous PITR gets you to seconds.
- RTO (Recovery Time Objective): how long recovery is allowed to take.
The other touchstone is the classic 3-2-1 rule: at least 3 copies of your data, on 2 different media/systems, with 1 off-site. Everything below is really just a concrete, agent-operated implementation of 3-2-1 with explicit RPO/RTO targets.
Why two backups, not one
The golden rule of backups:
A backup you haven’t restored is just a hope.
And a single backup strategy is a single point of failure. We use two, because they fail in different ways and cover different disasters:
1. Point-in-time restore (PITR) — your fast “undo”
Most modern managed Postgres providers keep a continuous change history and let you roll the database back to any moment within a retention window (ours is 7 days). We use @neondatabase, which integrates easily with @vercel; Supabase, RDS, Cloud SQL, and others have equivalents.
- Best for: “We just ran a bad migration / a bad delete / a buggy deploy 20 minutes ago.” You roll back to the timestamp just before the damage. (Excellent RPO — seconds.)
- Granularity: to the second.
- Speed (RTO): provider-dependent — don’t assume “instant.” On Neon, a restore is a copy-on-write branch operation and is near-instant. On RDS / Cloud SQL, PITR provisions a brand-new instance from base snapshot + WAL replay and can take tens of minutes to hours, after which you cut over. Know your provider’s restore mechanics and time it during a drill so your RTO is a measured number, not a guess.
- The killer feature: it’s reversible. When you restore, the provider preserves the pre-restore state as a separate branch/snapshot. If your restore was a mistake, you can undo the undo.
- The catch: it lives inside the same provider account as your live database. If that account is compromised, deleted, or the provider has a catastrophic failure, your PITR history can go with it.
2. Off-site dump — your “the building burned down” backup
This is a backup written to object storage in a different vendor (we use AWS S3; GCS, Cloudflare R2, Backblaze B2 all work). A cron job runs it on a schedule (daily, in our case).
We use a plain pg_dump, and it’s worth being honest about when that’s the right tool:
- Logical dumps (pg_dump) are great for small-to-mid databases — say up to tens of GB. They’re simple, portable across Postgres versions, and trivial to inspect. But they don’t scale well: dumps and (single-threaded) restores get painfully slow as the DB grows, and a nightly dump gives you a coarse RPO (up to ~24h).
- For larger systems or tighter RPO targets, graduate to physical backups + continuous WAL archiving to object storage — tools like pgBackRest, WAL-G, or Barman. These give you off-site point-in-time recovery (not just a nightly snapshot), parallel/faster restores, and far better RPO. If your DB is big or your RPO target is tight, treat nightly pg_dump as a starter and plan the upgrade.
Regardless of mechanism:
- Best for: the provider account itself is gone, corrupted, or locked. Or you need a backup older than the PITR window. Or compliance wants an immutable, exportable copy.
- Granularity (RPO): whatever your cadence/archiving is (a daily dump = up to ~24h of potential loss; WAL archiving = seconds).
- Speed (RTO): slower than PITR — you download and replay. For a logical dump, restore time grows with DB size.
- The killer feature: it’s off-site and vendor-independent. Totally separate blast radius from your primary DB.
- The catch: a logical dump is coarse-grained and only as fresh as your last run.
The point of having both: PITR is your everyday, fine-grained, fast undo. The off-site dump is your worst-case insurance. In a real incident you’ll often use them together — and that’s exactly the scenario we practiced (more below).
Step 1: Set up the two backups
You need these to exist before you write the playbook. An agent can help you build all of this.
Point-in-time restore:
- Confirm your provider has PITR and check the retention window (e.g., 7 days). Extend it if your budget allows — a longer window means more disasters you can recover from.
- Verify that a restore preserves the prior state (Neon does this as an automatic branch). Reversibility is what makes a live test safe.
Off-site dump:
- A scheduled job (GitHub Actions cron, a Vercel cron, a Lambda — whatever fits) that runs pg_dump, gzips it, and uploads to a bucket in a different vendor. (At scale, swap this for a WAL-archiving tool like pgBackRest/WAL-G writing to the same bucket.)
- A bucket with versioning and a sane lifecycle/retention policy. Consider object-lock / immutability if you want ransomware-resistant, tamper-proof copies.
- A read-only, least-privilege credential scoped to only that backup bucket, that the agent can use to list and download dumps. Don’t hand your agent your root keys.
- Bonus: enable a manual trigger (e.g. workflow_dispatch) so you can produce an on-demand dump in minutes instead of waiting for the nightly run.
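The scheduled dump job described above can be sketched in a few lines. This is a minimal illustration, not the job we actually run: the key layout, the `pg-dumps` prefix, and the function names are all assumptions, and the upload step assumes the third-party boto3 package and AWS credentials in the environment.

```python
import gzip
import shutil
import subprocess
from datetime import datetime, timezone


def dump_key(now: datetime, prefix: str = "pg-dumps") -> str:
    """Build a sortable object key from a timezone-aware timestamp (hypothetical layout)."""
    utc = now.astimezone(timezone.utc)
    return f"{prefix}/{utc:%Y/%m/%d}/db-{utc:%Y%m%dT%H%M%SZ}.sql.gz"


def pg_dump_command(database_url: str) -> list[str]:
    # Plain-format logical dump; --no-owner keeps the restore portable across roles.
    return ["pg_dump", "--format=plain", "--no-owner", f"--dbname={database_url}"]


def run_dump(database_url: str, out_path: str) -> None:
    """Stream pg_dump output through gzip into out_path; raise if pg_dump fails."""
    proc = subprocess.Popen(pg_dump_command(database_url), stdout=subprocess.PIPE)
    with gzip.open(out_path, "wb") as gz:
        shutil.copyfileobj(proc.stdout, gz)
    if proc.wait() != 0:
        raise RuntimeError("pg_dump exited non-zero; do not upload this file")


def upload(out_path: str, bucket: str, key: str) -> None:
    # boto3 is imported lazily so the helpers above work without it installed.
    import boto3

    boto3.client("s3").upload_file(out_path, bucket, key)
```

Checking the pg_dump exit code before uploading matters: a truncated dump that lands in the bucket looks like a backup until the day you try to restore it.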
💡 Tip: know your actual dump timing, not the cron expression. Ours is scheduled for 03:00 UTC but, thanks to CI queue time, actually lands around 04:30 UTC. That detail matters when you’re reasoning about “how much data would we lose.”
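To make that concrete, here is a small sketch of the arithmetic. It assumes you can list the times your dumps actually landed (from bucket object timestamps, for example) and, as a simplification, treats the landing time as the moment the dump's data was captured. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def data_at_risk(incident: datetime, dump_times: list[datetime]) -> timedelta:
    """Time span of writes lost if you restore the newest dump taken before the incident."""
    usable = [t for t in dump_times if t <= incident]
    if not usable:
        raise ValueError("no dump exists before the incident")
    return incident - max(usable)


# Dumps that actually landed at 04:30 UTC, although the cron says 03:00.
landed = [
    datetime(2026, 6, 9, 4, 30, tzinfo=timezone.utc),
    datetime(2026, 6, 10, 4, 30, tzinfo=timezone.utc),
]
incident = datetime(2026, 6, 10, 4, 0, tzinfo=timezone.utc)
loss = data_at_risk(incident, landed)  # timedelta(hours=23, minutes=30)
```

Reasoning from the cron expression alone, you would guess about an hour of loss for this incident; the real number is 23.5 hours, because the 10 June dump had not landed yet.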
Step 2: Write the playbook
In Devin, a playbook is a first-class, reusable procedure you author once and then attach to any session. You create it in the Devin web app (Settings → Playbooks), give it a name and a trigger macro (ours is !database_restore), and write the body as a plain-language, step-by-step runbook. From then on, anyone on the team can start a Devin session, attach that playbook (or type the macro), and Devin loads those instructions and executes them itself — calling the database/provider APIs, running psql, toggling maintenance mode, and reporting back. You’re not writing code that Devin calls; you’re writing the checklist Devin follows.
(If you’re on a different agent that doesn’t have a playbook concept, the same content as a well-structured RESTORE.md in your repo, referenced in the prompt, gets you most of the way.)
The key insight: the playbook is the runbook. You’re writing the checklist a careful human would follow, precisely enough that the agent can execute it without improvising on the dangerous parts.
Ours has two modes:
- VALIDATION mode (default, non-destructive): restore into a throwaway branch, check the data looks right, throw it away. This is what you run on a schedule to keep yourself honest. It touches nothing real.
- DISASTER mode (destructive, requires explicit authorization): the real thing, against the live database.
A good restore playbook spells out, in order:
1. Triage first. Confirm it’s actually a data problem and establish the exact restore timestamp (“restore to just before 09:15 UTC”).
2. Put the app in maintenance mode before touching the database, so application writes and cron jobs stop and you don’t get torn data mid-restore. (See Step 4 — make this instant, and note its real limits: middleware stops front-door writes, not every possible writer.)
3. Choose the path: Damage within the PITR window and provider is healthy → PITR. Provider account compromised, or you need an older/off-site copy → S3 dump.
4. Snapshot the current state first, even though it’s broken, and give it an obvious name with a main-before-restore- prefix. This is your “undo the undo” safety net.
5. Execute the restore (the specific provider API calls or psql commands).
6. Verify (Step 5 below) — while still in maintenance mode.
7. Only if verification passes, lift maintenance mode.
8. Report: what was restored, to when, the safety branch name, total downtime, and before/after row counts.
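The path choice in that list is a simple decision rule, and it helps to write it down unambiguously so the agent does not improvise. A minimal sketch, assuming a 7-day retention window and a boolean health signal you determine during triage; the names here are illustrative and not part of any provider API.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class RestorePath(Enum):
    PITR = "pitr"
    OFFSITE_DUMP = "offsite_dump"


def choose_path(
    restore_to: datetime,
    now: datetime,
    provider_healthy: bool,
    retention: timedelta = timedelta(days=7),
) -> RestorePath:
    """PITR only when the provider is healthy and the target is inside the retention window."""
    if restore_to > now:
        raise ValueError("restore target is in the future")
    within_window = (now - restore_to) <= retention
    if provider_healthy and within_window:
        return RestorePath.PITR
    return RestorePath.OFFSITE_DUMP
```

Anything that fails either condition falls through to the off-site dump, which matches the rule that the dump is the fallback when the provider or its history cannot be trusted.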
Things to bake into the playbook so the agent can’t foot-gun itself:
- Hard gates: “If verification fails, leave maintenance mode ON and stop. Do not lift maintenance on a bad restore.”
- Abort-with-no-outage pre-flight: check credentials and that the dump is downloadable/valid before enabling maintenance. If S3 is unreachable, you find out before you’ve taken the site down.
- Never delete the safety branch as part of the run. Cleanup is a separate, human-approved decision.
- Require explicit authorization for DISASTER mode.
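A Devin playbook is plain-language instructions, not code, but the gates above are easier to reason about when written as control flow. The sketch below is only an illustration of the ordering: every step is a callable you would supply, and the class, function, and status names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreSteps:
    preflight: Callable[[], bool]       # creds valid, dump downloadable
    maintenance_on: Callable[[], None]
    snapshot: Callable[[], str]         # returns the safety branch name
    restore: Callable[[], None]
    verify: Callable[[], bool]
    maintenance_off: Callable[[], None]


def run_disaster_restore(steps: RestoreSteps, authorized: bool) -> dict:
    """Run the destructive path with hard gates; never deletes the safety branch."""
    if not authorized:
        return {"status": "refused", "maintenance": "unchanged"}
    # Pre-flight runs before any outage, so a failure here costs no downtime.
    if not steps.preflight():
        return {"status": "aborted_preflight", "maintenance": "unchanged"}
    steps.maintenance_on()
    safety_branch = steps.snapshot()
    steps.restore()
    # Hard gate: a failed verification leaves maintenance ON and stops.
    if not steps.verify():
        return {"status": "verification_failed", "maintenance": "on",
                "safety_branch": safety_branch}
    steps.maintenance_off()
    return {"status": "restored", "maintenance": "off",
            "safety_branch": safety_branch}
```

Note what is absent: there is no cleanup step and no retry loop. Both are decisions for a human after reading the report.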
Playbook vs. skill — what’s the difference?
The rule of thumb: if you want the agent to decide on its own when to apply some knowledge, make it a skill. If you want a human to deliberately pull a lever, make it a playbook.
A destructive database restore is the textbook case for a playbook, not a skill. You never want an agent to auto-decide it’s time to overwrite production — that’s a lever a human pulls on purpose, with authorization, which is exactly what a manually-attached playbook gives you. (Skills are perfect for the non-destructive habits around it — e.g. a repo skill that says “here’s how to run a scheduled validation restore into a throwaway branch.”)
Step 3: Trigger the agent
There are two ways we trigger it, for two different situations:
A real incident (you, manually): open a Devin session, attach the playbook or type its macro (!database_restore), and tell it what happened: “We had a bad delete around 09:15 UTC, restore production to 09:10.” Devin loads the playbook and walks the steps, pausing where the playbook says to pause.
A supervised drill (Devin spawns a child Devin): for our live test, we had a main Devin session spin up a separate child Devin session dedicated to running the playbook, and watched it work in real time. Devin can launch and monitor child sessions, which makes this clean:
- The child runs the procedure end-to-end on its own machine.
- The parent monitors progress without interfering, and relays milestones to you (“maintenance ON,” “restore done,” “verified,” “maintenance OFF”).
- You get a clean, auditable transcript of exactly what was done.
Step 4: Make maintenance mode instant
This is the unsung hero of a safe restore. You cannot do a clean restore while writers are hitting the database. You need a switch that, in seconds:
- routes all traffic to a maintenance page,
- stops application-driven writes,
- and pauses cron/background jobs.
Be precise about what “maintenance mode” actually freezes. App-layer middleware only stops writes coming through your app’s front door. It does not automatically stop: background workers and queue consumers, inbound webhooks that hit other entrypoints, scheduled jobs already mid-run, or anything connecting straight to the database.
Your maintenance switch has to also gate those paths (we freeze cron and reject writes in the API/server-action layer), and you should accept that a small number of in-flight writes can still land in the instant the flag flips. The only true write freeze is at the database itself — e.g. revoking write privileges, flipping the DB to read-only, or terminating all other connections. For a short restore window, app-layer gating plus paused cron is usually enough; just don’t tell yourself it’s a hard guarantee.
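For the database-level options, one possible shape is below. It is a sketch of stock Postgres statements, not something from our playbook, and it only builds the SQL so you can review it before anything runs. Two assumptions to check against your provider: your role must be allowed to run ALTER DATABASE, and default_transaction_read_only is only a default for new sessions, so a session can still override it. The function names are hypothetical.

```python
def quote_ident(name: str) -> str:
    """Double-quote a Postgres identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def freeze_statements(database: str) -> list[str]:
    """SQL to make new sessions read-only by default, then drop existing sessions.

    Run while connected to `database`; the second statement spares your own connection.
    """
    return [
        f"ALTER DATABASE {quote_ident(database)} SET default_transaction_read_only = on;",
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        "WHERE datname = current_database() AND pid <> pg_backend_pid();",
    ]


def unfreeze_statement(database: str) -> str:
    """SQL to remove the read-only default once the restore is verified."""
    return f"ALTER DATABASE {quote_ident(database)} RESET default_transaction_read_only;"
```

Terminating existing sessions is what makes the setting bite, because connections opened before the change keep their old default.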
The mistake to avoid: gating maintenance behind an environment variable that requires a redeploy to flip. During an incident, waiting 3–5 minutes for a deploy to toggle maintenance is agony, and it widens your data-loss window.
We made it instant and deploy-free using a low-latency edge config store (we use Vercel Edge Config; a Redis key or any fast KV store works) read on every request in middleware:
- A maintenanceMode flag the middleware checks on every request, redirecting everything to /maintenance.
- The flag flips in ~1–3 seconds via an API call — no redeploy.
- Fail open (a deliberate availability trade-off): if the config read errors, default to serving traffic rather than showing maintenance, so a config-store blip can’t black out your whole site. The trade-off is that a config outage during an incident won’t automatically gate writes — if you’d rather guarantee gating, fail closed instead. Pick the failure mode on purpose.
- Bonus: store the maintenance page’s headline/message/ETA in the same config so you can update the copy live (“back by 10:30 ET”) without shipping code.
We gave the agent a tiny CLI for this (maintenance-mode on|off|status) so the playbook step is just one command.
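The fail-open versus fail-closed choice is easy to get wrong by accident, so it is worth seeing in code. The sketch below is framework-agnostic Python rather than our actual Vercel middleware: `read_flag` stands in for whatever reads the maintenanceMode value from your config store, and the function names are hypothetical.

```python
from typing import Callable


def in_maintenance(read_flag: Callable[[], bool], fail_open: bool = True) -> bool:
    """Return True when requests should be gated.

    If the config read raises, fail open (serve traffic) or fail closed (gate),
    depending on the failure mode you chose on purpose.
    """
    try:
        return bool(read_flag())
    except Exception:
        return not fail_open


def route(path: str, read_flag: Callable[[], bool], fail_open: bool = True) -> str:
    """Return the path to serve: the maintenance page while gated, else the original."""
    if path.startswith("/maintenance"):
        return path  # never redirect the maintenance page to itself
    return "/maintenance" if in_maintenance(read_flag, fail_open) else path
```

Making `fail_open` an explicit parameter forces the decision to show up in code review instead of being an unexamined default in an exception handler.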
Step 5: Check that it actually works (without touching production)
Build verification into a routine you run constantly, not just during incidents:
- Scheduled VALIDATION restores. Have the agent restore the latest off-site dump into a throwaway branch on a schedule, run sanity checks, and report. If the dump is corrupt or the restore mechanics broke, you learn it on a Tuesday afternoon — not during a fire.
- Sanity checks that mean something. Row counts for your key tables (users, your core domain tables), confirm all schemas are present, and check the most recent timestamp in a high-write table to confirm freshness.
- Credential checks. Confirm the agent’s backup credentials authenticate as the expected least-privilege identity and can list/read the bucket. (More on why below.)
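The sanity checks above reduce to a short function once the agent has collected the numbers from the restored branch. A minimal sketch, assuming the row counts and latest timestamp were already queried; the table names, thresholds, and function name are placeholders for your own.

```python
from datetime import datetime, timedelta, timezone


def sanity_check(
    expected_tables: set[str],
    row_counts: dict[str, int],
    min_rows: dict[str, int],
    latest_write: datetime,
    now: datetime,
    max_staleness: timedelta,
) -> list[str]:
    """Return a list of problems; an empty list means the restore passed."""
    problems = []
    # Every expected table must exist in the restored copy.
    for table in sorted(expected_tables - set(row_counts)):
        problems.append(f"missing table: {table}")
    # Key tables must not be suspiciously small.
    for table, floor in sorted(min_rows.items()):
        if row_counts.get(table, 0) < floor:
            problems.append(f"{table}: {row_counts.get(table, 0)} rows, expected at least {floor}")
    # The newest row in a high-write table shows how fresh the dump really is.
    if now - latest_write > max_staleness:
        problems.append(f"stale data: newest write is {now - latest_write} old")
    return problems
```

For a nightly dump, a staleness limit a little over 24 hours is a reasonable starting point; tighten it to whatever your measured dump timing supports.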
⚠️ The kind of bug drilling catches. Drills routinely surface problems that look fine on paper: a stale or mis-scoped credential, a backup identity that’s lost read access to the bucket, an expired key, an IAM policy that quietly drifted. These are invisible until someone actually exercises the path — and they tend to bite hardest in a fresh, cold-start emergency session that doesn’t have your laptop’s cached state. Running the drill flushes them out with zero production impact, so you fix the credential/policy centrally and re-run before it matters. The backups you never test are the ones that betray you.
Step 6: Do a real, live, destructive test
First, the standard practice, stated plainly: the normal, safe way to test restores is non-destructively, into a separate branch / clone / staging instance (Step 5). You should be doing that on a schedule, and for most teams that’s sufficient — it proves the dump is good and the restore mechanics work without ever risking production. If you’re not comfortable touching prod, don’t; a restored-clone drill is a perfectly respectable answer.
That said, a clone drill doesn’t exercise the production-specific glue: your maintenance switch, your real DNS/edge routing, your actual credentials in a cold session, and the muscle memory of doing it for real. So — as an advanced, optional, heavily-gated exercise — we also ran the full thing against production once. This section is about how to do that without it being reckless. It is not a substitute for routine clone-based validation; it’s a deliberate, one-time confidence check on top of it.
The trick that makes it safe: freeze writes first, so both restore targets converge on the same moment. If you enable maintenance mode at time T0, (almost) nothing is written after T0. So restoring “to T0” loses essentially nothing — the only data at risk is whatever was in flight the instant the flag flipped (see the in-flight-writes caveat in Step 4). And because PITR is reversible (preserved branch) and the off-site copy is untouched, every step has an undo.
We went further and practiced a realistic two-path chain in a single maintenance window, because in a real incident you might genuinely need both:
1. Enable maintenance mode. Record T0. Capture baseline row counts.
2. Snapshot current production to a safety branch.
3. Restore from the off-site S3 dump (this rolls production back to the dump’s timestamp — the off-site fallback path).
4. Verify the S3 restore landed: schemas intact, counts sane.
5. Recover the lost time via PITR — roll production forward to T0, bringing back everything between the dump and the freeze. (S3 state is preserved in its own branch first.)
6. Verify against the baseline. We checked row counts on key tables, schema presence, and latest timestamps. A caveat worth stating: matching row counts is necessary, not sufficient — equal counts don’t prove equal content. For real confidence, also compare something content-sensitive: checksums/hashes of key tables (e.g. md5(array_agg(…)) over a deterministic ordering), a few spot-checked rows, or pg_dump --schema-only diffs. Counts are a fast first gate; checksums are the proof.
7. Lift maintenance mode. Smoke-test: homepage 200, login redirect works.
8. Report a three-state table: BASELINE → POST-S3 → POST-RECOVERY.
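The three-state report is just the same row counts captured at three moments, lined up side by side. A minimal sketch of a formatter follows; the column layout and the function name are assumptions, and you would feed it the counts the agent captured at each stage.

```python
def three_state_report(
    baseline: dict[str, int],
    post_s3: dict[str, int],
    post_recovery: dict[str, int],
) -> str:
    """Render BASELINE, POST-S3, and POST-RECOVERY counts as one aligned text table."""
    lines = [f"{'table':<16}{'BASELINE':>10}{'POST-S3':>10}{'POST-RECOVERY':>15}  match"]
    for table in sorted(baseline):
        base = baseline[table]
        s3 = post_s3.get(table)          # None if the table is absent in this state
        final = post_recovery.get(table)
        verdict = "yes" if final == base else "MISMATCH"
        lines.append(f"{table:<16}{base:>10}{s3!s:>10}{final!s:>15}  {verdict}")
    return "\n".join(lines)
```

The POST-S3 column is expected to be lower than baseline (it reflects the dump's age); the match column compares only the final state against the baseline.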
What this proves: the off-site dump restores correctly and you can recover from PITR afterward — the exact sequence you’d run if you had to fall back to the off-site copy and then claw back recent data.
Our actual result: the full chain ran in a ~8-minute maintenance window and recovered to baseline counts exactly, fully reversible the whole way via preserved branches. (We verified counts and timestamps; for a production-grade sign-off we’d add table checksums per the note above.)
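To show why counts alone are not proof, here is a small content fingerprint. It is an illustration of the idea rather than what we ran: it hashes rows client-side with SHA-256 instead of md5(array_agg(…)) in SQL, assumes rows arrive as tuples whose first column is a unique key, and for large tables you would compute the hash server-side instead. The function name is hypothetical.

```python
import hashlib


def table_fingerprint(rows: list[tuple], key_index: int = 0) -> str:
    """Hash rows in a deterministic order so equal content gives an equal digest."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key_index]):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


before = [(1, "ada"), (2, "bob")]
after_good = [(2, "bob"), (1, "ada")]   # same content, different fetch order
after_bad = [(1, "ada"), (2, "eve")]    # same row count, different content

same = table_fingerprint(before) == table_fingerprint(after_good)
differs = table_fingerprint(before) != table_fingerprint(after_bad)
```

Both comparison tables have two rows, so a count check passes for each; only the fingerprint separates the good restore from the bad one.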
Live-test safety checklist:
- You have explicit authorization to run destructively against production.
- Maintenance mode is instant and verified working before you start.
- Pre-flight checks (creds + dump integrity) run before any outage, with a clean abort path.
- Every destructive step preserves the prior state in a named branch.
- Hard gate: if any verification fails, maintenance stays ON and the agent stops.
- You run during a low-traffic window.
- You captured baseline counts to compare against.
- Cleanup of safety branches is a separate, deliberate, human-approved step afterward.
A note on credentials and least privilege
Give your agent a credential that can do exactly what the playbook needs and nothing more:
- Backup reads: a read-only key scoped to only the backup bucket. It should be able to list and download dumps — not delete them, not touch other buckets.
- Restore operations: the provider API key needs restore/branch permissions, but you can still keep it off destructive account-level actions.
- Store these as shared secrets at the org/team level so a fresh agent session inherits the correct ones automatically — then verify a clean, cold session actually picks them up. Stale or mis-scoped secrets are one of the most common things a drill exposes, so treat “a brand-new session can authenticate as the expected least-privilege identity” as an explicit test, not an assumption.
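For the backup-read credential on S3, the scoping comes down to two IAM actions on one bucket. The sketch below builds such a policy document as a Python dict; the bucket name is a placeholder, the helper names are hypothetical, and this is not a copy of our real policy.

```python
import json


def backup_reader_policy(bucket: str) -> dict:
    """IAM policy allowing list and download on a single bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBackupBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Sid": "ReadBackupObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def allowed_actions(policy: dict) -> set[str]:
    """Collect every action the policy allows, for use in a drill assertion."""
    return {
        action
        for statement in policy["Statement"]
        if statement["Effect"] == "Allow"
        for action in statement["Action"]
    }


policy_json = json.dumps(backup_reader_policy("example-db-backups"), indent=2)
```

A drill can assert that `allowed_actions` contains exactly the two read actions, which catches a policy that has quietly grown write or delete permissions.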
The takeaway
Disaster recovery is the perfect thing to delegate to an agent, if you do the prep:
- Two backups, different blast radii — point-in-time (fast, fine-grained, reversible, same provider) and off-site dump (coarse, slow, vendor-independent).
- A playbook that encodes the careful-human procedure: triage → maintenance mode → snapshot → restore → verify → (only then) lift maintenance → report — with hard safety gates.
- Instant, deploy-free maintenance mode so freezing writes takes seconds, not a redeploy.
- Constant non-destructive validation, plus at least one real live drill to prove the end-to-end path — and to flush out the broken-credential-shaped surprises before they matter.
The first time you run a production restore should not be the first time you’ve ever run a production restore.
Write the playbook, hand it to your agent, and drill it — so when something breaks for real, recovery is a calm, eight-minute, well-rehearsed procedure instead of a panic.