Automation Without a Recovery Plan Is Dangerous
Automation feels powerful, but without a recovery plan, it's just fast failure. Why manual fallbacks, edge-case documentation, and failure-mode testing are critical for mature DevOps teams.

Automation feels powerful.
It saves time. Reduces human error. Cuts operational cost. Improves speed.
But here's the uncomfortable truth:
Automation without a recovery plan is just fast failure.
And I learned that the hard way.
The Illusion of Safety
In many DevOps teams, automation becomes the goal in itself.
- Auto-scaling
- Auto-deployments
- Auto-backups
- Auto start/stop environments
- Auto-healing workloads
Everything runs without manual intervention. Until it doesn't.
The real risk isn't automation itself. It's assuming automation eliminates failure. It doesn't. It only changes how failure happens — and often, it changes it in ways that are harder to detect and slower to recover from.
When a human runs a deployment and it fails, the human notices immediately. When an automated pipeline fails at 3 AM, the failure sits in a log until someone checks. If nobody checks because "it's automated," the failure compounds.
The Hidden Assumption
Most automation is written with a success path in mind.
For example:
- "Stop RDS at night to save cost."
- "Start it in the morning."
- "Scale up when CPU > 70%."
- "Restart container if it crashes."
- "Run backups every 6 hours."
These are clean, logical instructions. They are also assumptions disguised as guarantees.
But rarely do teams ask:
- What if start fails because of capacity constraints?
- What if scaling hits quota or service limits?
- What if restart loops indefinitely?
- What if backup completes but the data is corrupt?
- What if the automation itself introduces the failure?
Automation handles expected scenarios. Recovery planning handles unexpected ones. And in production, the unexpected is what wakes you up at 2 AM.
When Automation Made Things Worse: Three Real Scenarios
These are patterns I have seen firsthand across multiple teams. They all share a common thread: the automation worked exactly as designed, and that was the problem.
1. The Infinite Restart Loop
An "auto-healing" script was designed to restart a backend service whenever it failed a health check. It worked perfectly for months. Then, a bad configuration was merged — one that caused the service to crash immediately on startup, before it could even register a health check.
The automation did exactly what it was told: restart the service. Every 10 seconds. Endlessly.
Because the service was "restarting," the monitoring system suppressed the "Service Down" alert. The system thought it was healing itself. By the time someone noticed the PagerDuty silence was suspicious, we had:
- 10,000+ restart logs flooding CloudWatch
- AWS API rate limiting triggered across the account
- Other unrelated services unable to make API calls
The automation didn't just fail to fix the problem — it created two new ones. And the rate-limiting cascaded into a partial outage across different services that had nothing to do with the original misconfiguration.
The fix wasn't just adding a restart limit. It was rethinking the entire pattern:
```yaml
# Before: naive auto-restart
restartPolicy: Always

# After: bounded restart with circuit breaker
restartPolicy: OnFailure
# Combined with:
# - Max restart count (backoffLimit in Jobs/CronJobs)
# - Exponential backoff (built into kubelet for Pods)
# - Alert on CrashLoopBackOff state, not just "down"
```
We also added a dead-letter mechanism: if a service fails more than 5 times in 10 minutes, stop trying and alert a human. The automation surrenders control. That was the key insight — good automation knows when to stop.
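To make that concrete, here is a minimal sketch of the dead-letter idea: a restart wrapper that trips a circuit breaker once a 10-minute window fills up. It assumes a systemd-managed service and an SNS topic for paging; the service name, state file, and topic ARN are placeholders rather than the script we actually ran.

```bash
#!/bin/bash
# Sketch: circuit-breaker wrapper around a service restart.
# SERVICE, STATE_FILE, and ALERT_TOPIC are placeholders.
set -euo pipefail

SERVICE="backend-api"                       # hypothetical service name
STATE_FILE="/var/run/${SERVICE}.restarts"   # restart timestamps, one per line
MAX_RESTARTS=5
WINDOW_SECONDS=600                          # 10 minutes
ALERT_TOPIC="arn:aws:sns:eu-west-1:123456789012:ops-alerts"  # placeholder ARN

now=$(date +%s)
touch "$STATE_FILE"

# Count restart attempts inside the rolling window
recent=$(awk -v now="$now" -v win="$WINDOW_SECONDS" 'now - $1 < win' "$STATE_FILE" | wc -l)

if [ "$recent" -ge "$MAX_RESTARTS" ]; then
  # Circuit open: stop restarting and hand control to a human
  aws sns publish --topic-arn "$ALERT_TOPIC" \
    --subject "${SERVICE}: restart circuit breaker tripped" \
    --message "${recent} restarts in the last ${WINDOW_SECONDS}s. Automation has stopped; manual intervention required."
  exit 1
fi

# Record this attempt and restart the service
echo "$now" >> "$STATE_FILE"
systemctl restart "$SERVICE"
```

The important property is the exit path: once the window fills up, the script stops restarting and escalates instead of looping forever.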
2. The Phantom Backup
A team had automated daily backups of their PostgreSQL database using pg_dump in a CronJob. The CronJob ran on schedule. It reported success. The monitoring dashboard was green.
For eight months, nobody tested a restore.
When they finally needed to restore after an accidental data deletion, they discovered the backup files were empty. The pg_dump process had been silently failing due to a password rotation that invalidated the stored credentials. The CronJob's exit code was 0 because the script completed — it just didn't capture the pg_dump error code.
```bash
# The script that "worked" for 8 months
#!/bin/bash
pg_dump -U $DB_USER -h $DB_HOST $DB_NAME > /backup/daily.sql
aws s3 cp /backup/daily.sql s3://backups/$(date +%Y-%m-%d).sql

# What the script should have been
#!/bin/bash
set -euo pipefail

pg_dump -U "$DB_USER" -h "$DB_HOST" "$DB_NAME" > /backup/daily.sql

# Validate the backup has actual content
FILESIZE=$(stat -f%z /backup/daily.sql 2>/dev/null || stat -c%s /backup/daily.sql)
if [ "$FILESIZE" -lt 1024 ]; then
  echo "ERROR: Backup file is suspiciously small (${FILESIZE} bytes)" >&2
  exit 1
fi

aws s3 cp /backup/daily.sql "s3://backups/$(date +%Y-%m-%d).sql"
echo "Backup completed: ${FILESIZE} bytes uploaded"
```
The team had automated the act of backing up. They had not automated validation of the backup. And they had never tested the restore path. Green dashboards are not the same as working backups.
3. The Cost-Saving Cascade
A team automated their non-production environments to stop every night at 11 PM and start at 7 AM. EKS node groups, RDS instances, ElastiCache clusters — everything scaled to zero overnight.
One morning, the RDS instance refused to start. InsufficientDBInstanceCapacity in their specific Availability Zone. The instance was stopped, and AWS does not allow you to modify the instance class or AZ of a stopped instance. You cannot start it without capacity. You cannot change it without starting it. I wrote about this exact catch-22 in detail.
But the cascade didn't stop there. The application was configured with a hard-coded RDS endpoint. When the team restored from a snapshot to a new instance (the only option), the new instance had a different endpoint. Updating the endpoint required a config change, which required a deployment, which required the CI/CD pipeline, which required the EKS cluster, which was still starting up because the node group scaling had hit a spot instance capacity issue in the same AZ.
Three separate automations. Three separate failure assumptions. One cascading outage that took 90 minutes to resolve manually — for a non-production environment.
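One small change that would have shortened that chain is to resolve the database endpoint from a single source of truth at startup, rather than baking it into the application config. A minimal sketch, assuming SSM Parameter Store; the parameter name and entrypoint are placeholders, and a private Route 53 CNAME that you repoint at the new instance achieves the same decoupling.

```bash
#!/bin/bash
# Sketch: resolve the DB endpoint at startup instead of hard-coding it.
# The parameter name and app entrypoint below are placeholders.
set -euo pipefail

DB_HOST=$(aws ssm get-parameter \
  --name "/myapp/staging/db-endpoint" \
  --query 'Parameter.Value' \
  --output text)

export DB_HOST
exec ./start-app.sh   # hypothetical application entrypoint

# After restoring to a new instance, recovery is a single parameter update:
#   aws ssm put-parameter --name "/myapp/staging/db-endpoint" \
#     --value "<new-instance-endpoint>" --type String --overwrite
```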
The Pattern Behind Every Automation Failure
Here's the pattern I've seen dozens of times:
- A team automates an operational task.
- It works for months. Sometimes years.
- Confidence increases. The automation becomes invisible.
- Manual runbooks are deprecated. Tribal knowledge fades.
- An edge case appears — one the automation wasn't designed for.
- No one remembers the manual recovery steps.
Now you don't just have failure. You have failure plus confusion. The automation that reduced operational effort now increases recovery time, because the team has to relearn the system under pressure.
This is what I call automation amnesia: the gradual loss of operational knowledge that happens when automation works well for too long.
Cloud Makes This Worse
Cloud platforms are elastic. But they are not infinite.
| Assumption | Reality |
|---|---|
| Resources are always available | Capacity is AZ-specific and can be exhausted |
| Scaling is instant | Auto-scaling has cooldown periods and provisioning latency |
| APIs are always responsive | Every cloud API has rate limits and throttling |
| Managed services handle everything | Managed services have their own failure modes |
| Costs are predictable | Runaway automation can generate unexpected bills |
Automation written for cloud infrastructure carries an implicit assumption: the cloud will behave as expected. But cloud regions have capacity constraints. APIs enforce rate limits. Managed services have maintenance windows that can collide with your automation schedule.
If your automation depends on something external — a cloud API, a third-party service, a DNS provider — you must plan for when that external dependency fails. Because it will. Not often. But at the worst possible time.
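Planning for that failure can be as simple as refusing to assume the first API call succeeds. A minimal sketch of retrying a cloud call with exponential backoff and then escalating to a human; the RDS instance identifier and limits are placeholders.

```bash
#!/bin/bash
# Sketch: retry a cloud API call with exponential backoff, then escalate.
set -uo pipefail

MAX_ATTEMPTS=5
delay=5

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
  if aws rds start-db-instance --db-instance-identifier staging-db; then
    echo "Succeeded on attempt ${attempt}"
    exit 0
  fi
  echo "Attempt ${attempt} failed; retrying in ${delay}s" >&2
  sleep "$delay"
  delay=$((delay * 2))   # exponential backoff
done

echo "Giving up after ${MAX_ATTEMPTS} attempts; escalating to a human" >&2
exit 1
```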
What Recovery Planning Actually Means
Recovery planning isn't over-engineering. It isn't "writing a 40-page disaster recovery document that nobody reads." It's a set of practical habits that make the difference between a 5-minute fix and a 2-hour scramble.
1. Always Keep a Manual Path
If automation fails, can a human fix it quickly? This means:
- Runbooks exist and are current. Not from 2023.
- Access is possible. Manual recovery requires credentials, permissions, and console access. If your team can only deploy through CI/CD and the pipeline is broken, can they SSH into a server or use the cloud console?
- The steps are tested. A runbook that hasn't been followed in 6 months is a guess, not a plan.
2. Document the Failure Modes, Not Just the Happy Path
Every automation should have a corresponding "What If" document (the certificate row is turned into a concrete check after the table):
| Automation | What If It Fails? | Recovery Action |
|---|---|---|
| RDS stop/start schedule | Instance won't start (capacity) | Restore from snapshot to different AZ |
| Auto-scaling policy | Scaling hits quota limits | Pre-request limit increases; alert on near-limit |
| Automated backups | Backup is corrupt or empty | Weekly restore test; validate file size in script |
| Auto-restart on crash | Service enters restart loop | Circuit breaker after N restarts; alert on CrashLoopBackOff |
| Certificate auto-renewal | Renewal fails silently | Monitor cert expiry; alert at 14 days remaining |
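To show how small these recovery checks can be, here is a rough sketch of the certificate row: a daily check that fails loudly when a certificate is close to expiry. The hostname, threshold, and alert hook are placeholders, and it assumes GNU date.

```bash
#!/bin/bash
# Sketch: fail loudly when a certificate is within 14 days of expiry.
# HOST and the alerting hook are placeholders; date -d requires GNU date.
set -uo pipefail

HOST="api.example.com"
THRESHOLD_DAYS=14

# Pull the certificate's notAfter date from the live endpoint
expiry=$(echo | openssl s_client -servername "$HOST" -connect "${HOST}:443" 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)

expiry_epoch=$(date -d "$expiry" +%s)
days_left=$(( (expiry_epoch - $(date +%s)) / 86400 ))

if [ "$days_left" -lt "$THRESHOLD_DAYS" ]; then
  echo "WARNING: certificate for ${HOST} expires in ${days_left} days" >&2
  # e.g. publish to the ops SNS topic here
  exit 1
fi

echo "Certificate for ${HOST} valid for another ${days_left} days"
```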
3. Test Restore, Not Just Backup
Backups are not a recovery strategy. Tested restores are a recovery strategy. The difference is critical:
- Backup: "We have a file on S3."
- Restore test: "We restored that file to a clean environment last Tuesday, validated data integrity, and it took 12 minutes."
Schedule restore tests monthly. Automate them if possible. Measure the time. Know your actual RTO, not your theoretical one.
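As a rough example of what a scheduled restore test can look like for the pg_dump backups above: restore the latest dump into a scratch database, sanity-check a row count, and record how long it took. The scratch host, credentials, table name, and threshold are placeholders, and authentication is assumed to come from a .pgpass file or PGPASSWORD.

```bash
#!/bin/bash
# Sketch of a monthly restore test. All names below are placeholders.
# Assumes psql/createdb/dropdb authenticate via .pgpass or PGPASSWORD.
set -euo pipefail

SCRATCH_HOST="restore-test.internal"   # placeholder scratch Postgres host
SCRATCH_USER="restore_test"            # placeholder role
BACKUP_KEY="s3://backups/$(date +%Y-%m-%d).sql"
SCRATCH_DB="restore_test_$(date +%s)"
MIN_ROWS=1000                          # placeholder sanity threshold

start=$(date +%s)

aws s3 cp "$BACKUP_KEY" /tmp/restore-test.sql

createdb -h "$SCRATCH_HOST" -U "$SCRATCH_USER" "$SCRATCH_DB"
psql -h "$SCRATCH_HOST" -U "$SCRATCH_USER" -d "$SCRATCH_DB" \
  -v ON_ERROR_STOP=1 -f /tmp/restore-test.sql

# Validate data integrity with a simple row count on a critical table (placeholder table)
ROWS=$(psql -h "$SCRATCH_HOST" -U "$SCRATCH_USER" -d "$SCRATCH_DB" -tAc "SELECT count(*) FROM orders;")
dropdb -h "$SCRATCH_HOST" -U "$SCRATCH_USER" "$SCRATCH_DB"

if [ "$ROWS" -lt "$MIN_ROWS" ]; then
  echo "ERROR: restore produced only ${ROWS} rows in orders" >&2
  exit 1
fi

echo "Restore test passed: ${ROWS} rows, took $(( $(date +%s) - start ))s"
```

Measuring the elapsed time in the test is what turns a theoretical RTO into a measured one.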
4. Avoid Single Points of Assumption
Your automation should never depend on a single:
- Availability Zone — What if that AZ is degraded?
- Instance type — What if that instance family is at capacity? (A fallback sketch follows this list.)
- Region — What if there's a regional service disruption?
- Credential — What if that IAM role or API key is rotated?
- DNS provider — What if your DNS is unreachable?
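As referenced above, here is a rough sketch of what a priority list of fallbacks can look like in practice: a restore script that walks through several instance classes and AZs instead of pinning to one. The snapshot name, instance identifier, classes, and AZs are placeholders, and this assumes the capacity error surfaces on the API call itself.

```bash
#!/bin/bash
# Sketch: restore an RDS snapshot using a priority list of classes and AZs.
# Identifiers below are placeholders.
set -uo pipefail

SNAPSHOT="staging-db-latest"
NEW_INSTANCE="staging-db-restored"

for az in eu-west-1a eu-west-1b eu-west-1c; do
  for class in db.t3.medium db.t3.large db.m5.large; do
    echo "Trying ${class} in ${az}..."
    if aws rds restore-db-instance-from-db-snapshot \
         --db-instance-identifier "$NEW_INSTANCE" \
         --db-snapshot-identifier "$SNAPSHOT" \
         --db-instance-class "$class" \
         --availability-zone "$az"; then
      echo "Restore started with ${class} in ${az}"
      exit 0
    fi
  done
done

echo "No capacity found for any class/AZ combination; escalate to a human" >&2
exit 1
```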
5. Design for Automation Failure, Not Just System Failure
Most teams design for "What if the server goes down?" Few design for "What if the automation that manages the server goes down?"
Add monitoring to your automation:
- Alert if a scheduled Lambda doesn't execute.
- Alert if a CronJob hasn't run in its expected window.
- Alert if an EventBridge rule is disabled.
- Track automation execution duration, not just success/failure.
```yaml
# Example: CloudWatch alarm for Lambda automation that should run daily
Resources:
  AutomationHealthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "rds-start-lambda-not-invoked"
      MetricName: Invocations
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 86400  # 24 hours
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching  # zero invocations produces no datapoints, so missing data must also alarm
      Dimensions:
        - Name: FunctionName
          Value: !Ref RDSStartFunction
      AlarmActions:
        - !Ref OpsNotificationTopic
```
The Maturity Shift
Junior DevOps mindset: "Let's automate this."
Mature DevOps mindset: "What happens when this automation fails?"
That second question changes architecture decisions. It influences:
- Instance selection: Do you hardcode a single instance type, or specify a priority list of fallbacks?
- Backup strategy: Do you run pg_dump and hope, or do you validate, test restores, and monitor?
- Multi-AZ usage: Do you pin to a single AZ for simplicity, or accept the cost of cross-AZ redundancy?
- Rollback plans: Does your deployment pipeline have a one-click rollback, or does it require rebuilding artifacts?
- Cost-saving decisions: Do you stop instances to save money, or do you have a recovery plan for when they won't start?
- Deployment pipelines: Does your CI/CD have a manual override path for when the pipeline itself breaks?
The maturity isn't in the tools you use. It's in the questions you ask before you hit deploy.
A Recovery Readiness Checklist
Use this to evaluate your current automation stack:
- Every automated process has a documented manual fallback
- Backup scripts validate output (file size, row count, checksums)
- Restore has been tested in the last 30 days
- Automation has monitoring on itself (not just on the systems it manages)
- Restart/healing automation has circuit breakers (max retries, backoff)
- Cost-saving automation (stop/start) has a capacity failure runbook
- Scaling automation accounts for quota limits, with limit increases requested in advance
- On-call team can perform essential operations without CI/CD
- Failure modes are documented alongside the automation, not in a separate wiki
If you can check every box honestly, your automation is resilient. If you can't, you know where to start.
The Real Goal
Automation is not about removing humans from operations. It's about removing repetitive work so humans can focus on the work that actually requires judgment: architecture decisions, incident response, capacity planning, and recovery design.
If your system works only when automation works — it isn't resilient. It's fragile and fast. And fragile systems don't fail gracefully. They fail at 2 AM on a Friday, when the person who wrote the automation is on vacation.
Final Thought
Automation reduces operational load. Recovery planning reduces operational panic. You need both.
Because automation without a recovery plan isn't efficiency. It's deferred risk. And deferred risk doesn't disappear — it compounds, quietly, until the one morning it doesn't.
Want to build automation that's truly resilient? Contact us at Coding Protocols. We help DevOps teams design systems that operate smoothly and recover gracefully when things go sideways.


