Lesson 5/5 | OPERATIONS | 7 min read

Resilience: planning for what goes wrong

A business that requires everything to go perfectly is fragile.

Planning for failure — building backups, redundancy, and recovery plans — means your company survives disruptions that destroy competitors.

Deep dive theory

Why this matters

Two restaurants face the same problem: their main supplier cannot deliver this week.

Restaurant A has no backup. They scramble to find alternatives at the last minute, pay premium prices, and still run out of ingredients. Customers leave disappointed.

Restaurant B has relationships with three suppliers. They shift the order to their secondary source within an hour. Customers notice nothing. Business continues.

Same crisis. Completely different outcomes.

The pattern: Resilience is the ability to absorb shocks and continue operating. It comes from anticipating problems before they happen and building systems that can handle them.

The mindset: Instead of hoping nothing goes wrong, assume things will go wrong. Then design accordingly.


1. Single points of failure

A single point of failure (SPOF) is any element whose failure would stop the entire operation.

Common SPOFs in business

  • The only person who knows a critical process
  • The only supplier for a key input
  • The only copy of important data
  • The only access credentials for a system
  • The only client that generates most revenue

Finding SPOFs

For every critical process, ask: "If this one thing fails, what happens?"

If the answer is "everything stops," you have identified a single point of failure.
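Once dependencies are written down, this question can be applied mechanically. The following is a minimal Python sketch, assuming a hypothetical inventory that maps each critical process to the people, suppliers, or systems able to cover it. The process names, the data, and the function name find_spofs are illustrative only.

```python
# Hypothetical inventory: critical process -> who or what can currently cover it.
coverage = {
    "payroll": ["Dana"],
    "flour supply": ["Mill A", "Mill B", "Mill C"],
    "customer database": ["office server"],
    "card payments": ["Provider X", "Provider Y"],
}


def find_spofs(coverage):
    """Return the processes that have one option or none covering them."""
    return sorted(name for name, options in coverage.items() if len(options) <= 1)


for process in find_spofs(coverage):
    print(f"SPOF: {process} depends on {coverage[process] or 'nothing'}")
```

Anything the function returns is a candidate for a backup. A list with two entries is not automatically safe, since both options may share a hidden dependency, so the output is a starting point for discussion rather than a verdict.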

Eliminating SPOFs

Every SPOF needs a backup:

  • Document knowledge so it is not trapped in one head
  • Maintain relationships with multiple suppliers
  • Back up data to multiple locations
  • Share credentials securely, so that more than one trusted person can access each system
  • Diversify revenue sources

2. Redundancy

Redundancy means having backups for critical elements. It costs more upfront but can prevent catastrophic losses.

Where redundancy matters

Focus redundancy on what would hurt most if it failed:

  • Revenue-generating systems
  • Customer-facing operations
  • Data and intellectual property
  • Key relationships

Not everything needs backup. Low-impact, easily replaceable elements do not justify the cost.

Levels of redundancy

  • Hot standby: Backup is running and ready to take over instantly
  • Warm standby: Backup exists but needs some setup time
  • Cold standby: Components are available but require significant time to deploy

Match the level of redundancy to the criticality and acceptable downtime.
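As a rough illustration of that matching, the Python sketch below suggests a standby level from the downtime a system can tolerate. The thresholds (under 1 hour, up to 24 hours) and the example systems are assumptions chosen for illustration; each business would set its own.

```python
def standby_level(acceptable_downtime_hours):
    """Suggest a standby level from the downtime the business can tolerate.

    The thresholds (1 hour, 24 hours) are illustrative assumptions.
    """
    if acceptable_downtime_hours < 1:
        return "hot"
    if acceptable_downtime_hours <= 24:
        return "warm"
    return "cold"


# Hypothetical systems and the downtime each could tolerate, in hours.
systems = {"online checkout": 0.25, "email": 8, "archive storage": 72}
for name, hours in systems.items():
    print(f"{name}: {standby_level(hours)} standby")
```

The point of writing the rule down is that the cost of a hot standby is only paid where the tolerated downtime is short.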


3. Business continuity planning

A business continuity plan (BCP) defines how operations continue during and after a disruption.

Elements of a BCP

Risk identification

What could go wrong?

  • Technology failures
  • Supply chain disruptions
  • Key person unavailability
  • Natural disasters
  • Cyberattacks
  • Economic shocks

Impact assessment

For each risk, evaluate:

  • How likely is it?
  • How severe would the impact be?
  • How long could operations survive?

Focus planning first on the risks that rank high on both likelihood and impact.
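One common way to compare risks is to score likelihood and impact on a small scale and multiply them. The Python sketch below assumes a 1 to 5 scale for both; the scale, the Risk class, and the example risks are illustrative and not part of any formal standard described in this lesson.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); assumed scale
    impact: int      # 1 (minor) to 5 (business-ending); assumed scale

    @property
    def score(self):
        """Simple priority score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


# Hypothetical risks for a small business.
risks = [
    Risk("Main supplier misses a delivery", likelihood=4, impact=3),
    Risk("Office flood", likelihood=1, impact=5),
    Risk("Laptop holding the only copy of data fails", likelihood=3, impact=5),
]

# Highest score first.
ranked = sorted(risks, key=lambda r: r.score, reverse=True)
for r in ranked:
    print(f"{r.score:>2}  {r.name}")
```

A plain product can underrate rare but catastrophic events, and it does not capture the third question (how long operations could survive), so the ranking should inform judgment rather than replace it.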

Response procedures

For priority risks, document:

  • Who is responsible for the response?
  • What steps are taken immediately?
  • How is communication handled?
  • What resources are needed?

Recovery steps

After the immediate response:

  • How do you return to normal operations?
  • What needs to be repaired or replaced?
  • How do you learn from the incident?

4. The premortem

A premortem is a planning exercise that imagines failure before it happens.

How it works

Instead of asking "How will this succeed?" ask "It is six months from now and this failed completely. What went wrong?"

This reframes planning from optimism to realism. People are more willing to name risks when the task is framed as explaining a past failure rather than predicting future problems.

Running a premortem

  1. Describe the scenario: "The project has failed. We lost time and money."
  2. Ask each person: "What happened? Why did it fail?"
  3. Collect all the reasons without judgment
  4. Prioritize the most likely and most damaging
  5. Create preventive actions for priority risks
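Steps 2 to 4 can be supported with a simple tally. The Python sketch below assumes each participant's answers have been written down as short phrases; the names and reasons are hypothetical, and counting mentions is only a rough proxy for likelihood.

```python
from collections import Counter

# Hypothetical answers from three participants to "Why did it fail?"
answers = {
    "Ana": ["key supplier fell through", "launch date too aggressive"],
    "Ben": ["launch date too aggressive", "no one owned testing"],
    "Chloe": ["launch date too aggressive", "key supplier fell through"],
}

# Count how many times each reason was raised across all participants.
mentions = Counter(reason for reasons in answers.values() for reason in reasons)

for reason, count in mentions.most_common():
    print(f"{count} mention(s): {reason}")
```

A reason raised by only one person can still be the most damaging, so the tally helps order the discussion but does not settle it.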

When to use

Run premortems before major initiatives, new projects, or significant changes. They surface risks that optimism blinds you to.


5. Graceful degradation

When failure occurs, systems should degrade gracefully rather than collapse completely.

What graceful degradation looks like

  • A website that shows a "we're experiencing high traffic" message instead of crashing completely
  • A service that continues with reduced features rather than total outage
  • A supply chain that shifts to secondary sources with slightly higher costs rather than stopping

The goal is not to prevent all impact but to limit the severity.

Designing for degradation

For critical systems, define:

  • What is the minimum viable function?
  • What can be temporarily disabled?
  • What manual workarounds exist?

Build these fallbacks into the design rather than improvising during a crisis.
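For a software system, a built-in fallback might look like the Python sketch below. It assumes a hypothetical order page whose minimum viable function is the cart and checkout, with product recommendations as an optional feature that can be switched off. The function names and the simulated failure are illustrative.

```python
def recommendations(customer_id):
    """Stand-in for a non-essential feature; here it always fails."""
    raise TimeoutError("recommendation service unavailable")


def order_page(customer_id, cart):
    """Build an order page, dropping optional parts if they fail."""
    page = {"cart": cart, "checkout_enabled": True}  # minimum viable function
    try:
        page["recommendations"] = recommendations(customer_id)
    except Exception:
        # Temporarily disable the optional feature instead of failing the page.
        page["recommendations"] = []
        page["notice"] = "Some features are temporarily unavailable."
    return page


page = order_page("c-42", ["espresso beans"])
print(page)
```

The customer can still check out even though one feature is down, which is the difference between reduced service and a total outage.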


6. Recovery and learning

After a disruption, the goal is not just to return to normal but to become stronger.

Post-incident review

After any significant disruption, analyze:

  • What happened exactly?
  • What worked in the response?
  • What did not work?
  • What would we do differently?

Document the findings. This creates institutional memory that improves future responses.

Turning failure into improvement

The best organizations use failures as forcing functions for improvement. Each incident reveals weaknesses that might never have been discovered otherwise.

"Never waste a crisis" means using disruption as an opportunity to fix underlying problems.

Building a resilient culture

Resilience is not just systems — it is mindset. Teams that expect challenges and prepare for them respond better than teams that assume smooth sailing.

Normalize discussion of risks. Reward preparation over optimism. Celebrate recovery as much as success.


Think

What would you do in these scenarios?

The Coffee Shop Expansion

You are the manager of a successful local coffee shop. A large international chain is opening a store just across the street. How do you respond to maintain your market position?


Practice

Test yourself and review key terms

Knowledge check

What is the primary indicator of a successful Market Expansion Strategy?

Concepts

Question

Why does one restaurant handle a supplier failure in an hour while the other scrambles all week?

Answer

One has relationships with three suppliers and shifts its order to a secondary source. The other has no backup and pays premium prices at the last minute.

Do

Your action steps for today

Action plan: what to do today

  • Identify one single point of failure in your business. What would happen if that person, supplier, system, or client disappeared tomorrow? Start building a backup this week.
  • Run a mini-premortem. Think of an upcoming project or initiative. Imagine it failed spectacularly. Write down three reasons why it might have failed. Address the biggest one.
  • Check your data backups. Do you know where your critical files are backed up? Could you recover if your laptop were destroyed today? If not, set up a backup system now.

Note

Some examples and details may be simplified to better convey the core idea. Every business is different — adapt these ideas to your specific context and situation.