Disaster recovery
What this layer does
Phase 9 makes the substrate recoverable from any single failure mode. DO Spaces bucket versioning protects against accidental delete or overwrite. Manual snapshots provide named checkpoints. Restore moves versioned bytes to a holding dir without touching live state. Rollback is the explicit, label-confirmed full reversion. A weekly cross-region backup gives a cold copy outside the primary bucket's failure domain.
Files involved
- state/bin/sync/snapshot.sh: manual checkpoint to s3://robert-storage/snappy-os-snapshots/<ts>-<label>/.
- state/bin/sync/restore.sh: fetches versioned bytes for a single skill at an ISO timestamp into a holding dir.
- state/bin/sync/rollback.sh: full restore from a snapshot id; refuses without --apply and label confirmation.
- state/bin/sync/cross-region-backup.sh: weekly tor1 → nyc3 cold backup, baked into v1.
- state/log/snapshots.ndjson: append-only snapshot index (in SYNC_ALLOW).
Bucket versioning
Versioning is enabled once on robert-storage, via the DO panel or s3cmd, with 30-day retention. Every overwrite preserves the prior version, addressable by version-id; every delete leaves a tombstone, with the prior version still retrievable.
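Restoring to a point in time means choosing, for each object, the version that was current at that timestamp. The following is a minimal sketch of that selection, not the logic of restore.sh itself. It assumes the version listing has already been reduced to tab-separated last-modified and version-id columns (for example, from an S3-compatible list-object-versions call). The function name pick_version_at and the sample version ids are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# pick_version_at TARGET_ISO
# Reads "LAST_MODIFIED<TAB>VERSION_ID" rows on stdin and prints the version id
# that was current at TARGET_ISO, i.e. the newest row not later than the target.
# All timestamps must share one UTC format (e.g. 2026-04-15T18:00:00Z) so that
# plain string comparison matches chronological order.
pick_version_at() {
  local target="$1"
  awk -F'\t' -v target="$target" '
    BEGIN { best = ""; id = "" }
    $1 <= target && $1 > best { best = $1; id = $2 }
    END { if (id == "") exit 1; print id }
  '
}

# Hypothetical listing for one object key (version ids are made up).
listing=$'2026-04-14T09:30:00Z\tv1\n2026-04-15T17:45:00Z\tv2\n2026-04-16T08:00:00Z\tv3'

picked="$(printf '%s\n' "$listing" | pick_version_at "2026-04-15T18:00:00Z")"
echo "$picked"   # → v2
```

The function exits non-zero when no version is old enough, so a caller can tell "object did not exist yet" apart from a successful pick.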
Procedures
# Manual checkpoint before a risky change
state/bin/sync/snapshot.sh "pre-quorum-rewrite-2026-04-16"
# Restore a single skill to a point in time (non-destructive)
state/bin/sync/restore.sh snappy-image 2026-04-15T18:00:00Z
# → writes to ~/projects/snappy-os/_restore/snappy-image-<ts>/
# Full rollback (requires explicit apply + matching label)
state/bin/sync/rollback.sh 2026-04-16T12:00:00Z-pre-quorum-rewrite-2026-04-16 \
  --apply --confirm-label="pre-quorum-rewrite-2026-04-16"
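The non-destructive restore step can be sketched as follows: stage the fetched bytes in a fresh holding dir, print a diff against live, and write nothing under the live tree. This is an illustration under stated assumptions, not the contents of restore.sh. The function name stage_restore, the temp-dir layout, and the file name SKILL.md are hypothetical, and the fetch from versioned storage is replaced by a local directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# stage_restore FETCHED_DIR LIVE_DIR HOLDING_DIR
# Copies already-fetched bytes into a new holding dir and prints a diff
# against live. Nothing under LIVE_DIR is written.
stage_restore() {
  local fetched="$1" live="$2" holding="$3"
  if [ -e "$holding" ]; then
    echo "refusing: holding dir already exists: $holding" >&2
    return 1
  fi
  mkdir -p "$holding"
  cp -R "$fetched"/. "$holding"/
  # diff exits 1 when the trees differ, which is expected here; only exit
  # codes above 1 (real errors) count as failure.
  local rc=0
  diff -ru "$live" "$holding" || rc=$?
  [ "$rc" -le 1 ]
}

# Demo with throwaway directories standing in for live state and fetched bytes.
work="$(mktemp -d)"
mkdir -p "$work/live/snappy-image" "$work/fetched"
echo "new" > "$work/live/snappy-image/SKILL.md"
echo "old" > "$work/fetched/SKILL.md"

stage_restore "$work/fetched" "$work/live/snappy-image" \
  "$work/_restore/snappy-image-2026-04-15T18:00:00Z"
```

Refusing to reuse an existing holding dir keeps two restores of the same skill and timestamp from silently mixing their files.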
Worker DR
- The Wrangler config lives in ~/projects/snappy-skills/. If the Worker is lost, redeploy it with wrangler deploy from Robert's machine.
- KV is a 60s cache; if it is lost, it repopulates from DO Spaces on the next request.
- DO Spaces is the only true SPOF. The weekly cross-region backup
to nyc3 gives a cold copy in a separate failure domain.
Operational gotchas
- Restore NEVER overwrites live. The script writes to a holding dir
and prints the diff. Manual review + manual mv is mandatory. This prevents the restore tool from becoming a foot-gun.
- Rollback refuses without --apply and a --confirm-label that
matches the snapshot label exactly. This is intentional friction.
- Bucket versioning has 30-day retention. Snapshots are NOT bucket
versions — they are independent prefixes that survive the version retention window.
- The cross-region backup runs weekly as a Worker scheduled job.
If it stalls, state/lint/sync-freshness.ts flags the missing cross-region-backup row in state/log/snapshots.ndjson.
- _restore/ is gitignored AND in SYNC_DENY: partial restores
must not auto-push. Holding bytes leak otherwise.
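The rollback friction described above can be sketched as a small guard function. This is a hedged illustration, not the code in rollback.sh. It assumes snapshot ids have the form <YYYY-MM-DDTHH:MM:SSZ>-<label>, which may differ from the real id format. The function name rollback_guard, the exit code 2 for unknown arguments, and the message wording are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rollback_guard SNAPSHOT_ID [--apply] [--confirm-label=LABEL]
# Returns 0 only when --apply is present and the confirmed label equals the
# label embedded in the snapshot id.
rollback_guard() {
  local id="$1"; shift
  local apply=0 confirm="" arg
  for arg in "$@"; do
    case "$arg" in
      --apply) apply=1 ;;
      --confirm-label=*) confirm="${arg#--confirm-label=}" ;;
      *) echo "unknown argument: $arg" >&2; return 2 ;;
    esac
  done
  # Strip the leading timestamp and its dash (assumed id format, see above).
  local label="${id#*Z-}"
  if [ "$apply" -ne 1 ]; then
    echo "refusing: pass --apply to roll back" >&2
    return 1
  fi
  if [ -z "$confirm" ] || [ "$confirm" != "$label" ]; then
    echo "refusing: --confirm-label must equal '$label'" >&2
    return 1
  fi
  echo "ok: would roll back to $id"
}

id="2026-04-16T12:00:00Z-verify-test"
rollback_guard "$id" --apply --confirm-label="verify-test"
```

Requiring both flags means a mistyped or copy-pasted id alone never triggers a destructive action; the operator has to restate the label independently.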
How to verify it's working
- state/bin/sync/snapshot.sh "verify-test" produces a row in state/log/snapshots.ndjson and the new prefix is visible at s3://robert-storage/snappy-os-snapshots/.
- state/bin/sync/restore.sh <skill> <ts> writes to ~/projects/snappy-os/_restore/<skill>-<ts>/ and exits 0 without touching live state.
- state/bin/sync/rollback.sh <bad-id> without --apply exits 1 with the refusal message.
- The weekly cross-region row appears in snapshots.ndjson within 7 days of bootstrap.
- DO panel shows versioning enabled on robert-storage with 30-day retention.
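The cross-region freshness check can also be spot-checked by hand. The sketch below assumes each index row is a JSON object with a "ts" field in UTC ISO format and mentions cross-region-backup somewhere in the row. The real schema of snapshots.ndjson is not specified here, so the field names, the sample rows, and the function names last_backup_ts and backup_is_fresh are all hypothetical. The caller supplies the cutoff (for example, the timestamp seven days ago), which avoids platform-specific date arithmetic.

```shell
#!/usr/bin/env bash
set -euo pipefail

# last_backup_ts INDEX_FILE
# Prints the "ts" value of the last row mentioning cross-region-backup.
last_backup_ts() {
  grep -F 'cross-region-backup' "$1" | tail -n 1 \
    | sed -n 's/.*"ts":"\([^"]*\)".*/\1/p'
}

# backup_is_fresh INDEX_FILE CUTOFF_ISO
# Succeeds when the last backup row is at or after the cutoff. Timestamps in
# one fixed UTC format compare correctly as plain strings.
backup_is_fresh() {
  local ts
  ts="$(last_backup_ts "$1")" || return 1
  [ -n "$ts" ] && [[ ! "$ts" < "$2" ]]
}

# Hypothetical index contents (assumed row shape, not the real schema).
index="$(mktemp)"
cat > "$index" <<'EOF'
{"ts":"2026-04-05T03:00:00Z","kind":"cross-region-backup"}
{"ts":"2026-04-10T14:00:00Z","kind":"snapshot","label":"verify-test"}
{"ts":"2026-04-12T03:00:00Z","kind":"cross-region-backup"}
EOF

if backup_is_fresh "$index" "2026-04-09T00:00:00Z"; then
  echo "fresh"
else
  echo "stale"
fi
```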