Three Habits for Resilient Software and Teams
AWS had a bad day. A failure in US-EAST-1 tied to DNS resolution for DynamoDB endpoints disrupted major apps and sites. Services began recovering within hours, though some customers are still dealing with lingering effects while systems stabilize.
You cannot prevent every incident, but you can shrink the impact and, with the right design, keep the core working so there is little to recover. Let's talk about how to apply this in both our code and our professional lives.
1. Know Your Dependencies
Start with the system. List the external chokepoints your core flows touch. For each, define a trigger and the first move.
For example:
- DNS slow → serve cached reads.
- Auth slow → allow read-only.
- Region unhealthy → shift traffic and pause nonessential jobs.
Put the trigger and the step in the runbook. Point the alert to that step. For starting points on thresholds, refer to the SLOs chapter in Google’s SRE Workbook, which helps tie triggers to user impact.
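The trigger-to-first-move table above can live in code as well as in the runbook. Here is a minimal sketch in Python; the trigger names and fallback labels are illustrative placeholders, not a real API:

```python
# Hypothetical mapping from dependency trigger to first move.
# In practice these strings would name real runbook steps.
FALLBACKS = {
    "dns_slow": "serve_cached_reads",
    "auth_slow": "allow_read_only",
    "region_unhealthy": "shift_traffic_and_pause_jobs",
}

def first_move(trigger: str) -> str:
    """Return the runbook step for a known trigger, or escalate."""
    return FALLBACKS.get(trigger, "page_oncall_and_escalate")
```

The point of the default branch is that an unknown failure still has a defined first move: get a human involved rather than waiting in silence.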
Now map the same idea to your week. Identify work dependencies that stall progress, including product sign-off, design assets, vendor data, legal review, and calendar access. Set a timebox and a default move. “If I do not have X by 10 a.m., I do Y.” Share it so no one waits in silence. If you need a lens for what to tackle first, check out Tech Debt Prioritization to aim this work at outcomes.
2. Bend, Don’t Break
Learn how to keep the core running when parts fail. Define a “bare-bones” mode for key flows. Serve cached reads when sources are slow. Queue noncritical writes and reconcile later. Turn off extras that create load. Use a simple status message so users know what still works. For other small, repeatable gains, see How to Win in the Margins.
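As a sketch of what "bare-bones mode" can look like in code, here is one way to wrap a storage backend so reads fall back to a cache and noncritical writes queue for later reconciliation. The backend interface (`get`/`put`) and the `FlakyBackend` demo are assumptions for illustration, not a specific library:

```python
from collections import deque

class BareBonesStore:
    """Degraded-mode wrapper: serve last-known-good reads when the
    backend fails, and queue writes to reconcile after recovery."""

    def __init__(self, backend):
        self.backend = backend   # assumed to expose get(key) / put(key, value)
        self.cache = {}          # last-known-good values
        self.pending = deque()   # writes to replay once the backend is healthy

    def read(self, key):
        try:
            value = self.backend.get(key)
            self.cache[key] = value   # refresh the cache on every good read
            return value
        except Exception:
            return self.cache.get(key)  # stale beats nothing

    def write(self, key, value):
        try:
            self.backend.put(key, value)
        except Exception:
            self.pending.append((key, value))  # park it, reconcile later

    def reconcile(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.backend.put(key, value)

class FlakyBackend:
    """Illustrative in-memory backend with an on/off switch."""
    def __init__(self):
        self.data, self.up = {}, True
    def get(self, key):
        if not self.up:
            raise ConnectionError("backend unavailable")
        return self.data[key]
    def put(self, key, value):
        if not self.up:
            raise ConnectionError("backend unavailable")
        self.data[key] = value
```

A real version would bound the cache and the queue and handle write conflicts during reconciliation, but the shape is the same: core reads keep working, and nothing critical is lost, just deferred.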
In your work, do the same. When a plan hits friction, reduce scope. Ship the part that helps a customer today. Park the rest with a clear note on what comes next. Share status updates early so people don't have to guess.
One thing you can try this week is to pick one customer flow. Write down what systems are critical and how you would keep them running. Decide what to cache, what to queue, and what to turn off. Draft the two-line message you would use during a slowdown.
Where can AI save you time?
My friends at Big Creek Growth put together a quick survey to spot the repetitive work you can hand off to automation.
3. Practice Recovery
Treat recovery as a skill. Keep a short runbook for common failures. Add simple switches (or feature flags) that your team can flip quickly. Define clear thresholds for when to move to lite mode and when to restore.
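One way to make "flip quickly, restore deliberately" concrete is a switch with two thresholds, so the system does not flap between modes. This is a sketch under assumed numbers; the error rates here are illustrative, not recommendations:

```python
# Hypothetical mode switch with hysteresis: degrade fast when errors
# spike, restore only after the error rate is clearly low again.
class ModeSwitch:
    def __init__(self, degrade_at=0.05, restore_at=0.01):
        self.degrade_at = degrade_at   # error rate that triggers lite mode
        self.restore_at = restore_at   # error rate that allows full mode
        self.lite = False

    def observe(self, error_rate: float) -> str:
        if not self.lite and error_rate >= self.degrade_at:
            self.lite = True    # flip quickly
        elif self.lite and error_rate <= self.restore_at:
            self.lite = False   # restore once recovery is unambiguous
        return "lite" if self.lite else "full"
```

The gap between the two thresholds is the point: an error rate of 3% keeps you in lite mode rather than bouncing back and forth at a single cutoff.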
After an incident, reconcile queues, verify data, and note one fix to ship. Treat the drill like a short, well-run meeting. Use the patterns from How to Run Better Meetings to keep it focused and valuable.
In your work, do the same. Run short tabletop drills. Assign roles. Use a two-line status update: “What we know. What to do.” Afterward, debrief without blame and capture the next step.
Resilience is built upon small, deliberate choices made before you need them.
Know your dependencies and the first move when one slips. Keep a usable core by defining a "bare-bones" mode for key flows. Practice recovery so switching modes and restoring service is routine, not a scramble.
The same approach works in your week: set clear defaults, reduce scope when needed, and close the loop after misses. That’s how you keep value moving when the ground shifts.