Best Practices · Intermediate · By Samson Tanimawo, PhD · Published Aug 9, 2026 · 6 min read

Graceful Degradation: How a Site Stays Half-Up

When a dependency goes down, the choice is not always availability or outage. With the right patterns, you can run with reduced functionality long enough for the dependency to recover. Users barely notice.

The availability axis is wrong

"Up or down" is the wrong frame. Most production failures are partial: one dependency is sick, another is fine. A binary "we are down" response loses customers when "we are running with checkout disabled" would have kept them. Treat availability as a spectrum.

The cost of binary thinking. A service with one degraded dependency goes fully down because the team's mental model is "either the system works or it doesn't." Customers who would have happily browsed without checkout instead see a hard error. They give up; they go to a competitor; they don't come back.

The shift to spectrum thinking. The system has named states: "fully working," "checkout disabled," "personalisation disabled," "read-only," and "down." Each degradation level is a deliberate engineering choice, a designed state where some functionality continues even when other functionality fails. Building for the spectrum takes more engineering than building for binary; the payoff is that a single failed dependency costs you one feature instead of the whole site.
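One way to make the spectrum concrete is to name the states in code and derive the current one from dependency health. The sketch below is only an illustration: the dependency names ("read_db", "write_db", "payments", "personalisation") and the priority order are assumptions, not something the levels above prescribe.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical degradation levels, named after the states listed above."""
    FULL = "fully working"
    NO_PERSONALISATION = "personalisation disabled"
    NO_CHECKOUT = "checkout disabled"
    READ_ONLY = "read-only"
    DOWN = "down"


def choose_mode(healthy: dict[str, bool]) -> Mode:
    """Map dependency health to the least-degraded mode that is still safe.

    A dependency missing from the dict is treated as unhealthy.
    """
    if not healthy.get("read_db", False):
        return Mode.DOWN                # nothing can be served without reads
    if not healthy.get("write_db", False):
        return Mode.READ_ONLY           # browsing works, all writes are refused
    if not healthy.get("payments", False):
        return Mode.NO_CHECKOUT         # writes work, but orders cannot be paid
    if not healthy.get("personalisation", False):
        return Mode.NO_PERSONALISATION  # generic content instead of tailored
    return Mode.FULL
```

Keeping the mapping in one function means the whole team can read, in a dozen lines, which failure leads to which state.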

Four patterns

Each one trades off complexity for resilience. Add them in order; do not skip ahead. Trying to add all four simultaneously produces a complex system that nobody understands; adding them one at a time builds the muscle.

The progression. Read-only mode is easiest (a feature flag plus UI message). Cached fallback is medium (you maintain caches). Degraded UI is harder (frontend coordination required). Drop-the-feature is a last resort (admit defeat on a specific feature). Each of the first three catches more failure modes and costs more to maintain; drop-the-feature is cheap to build but gives up the most.

Read-only mode

The write database is unhappy; you switch to read-only. Browsing works; checkout does not. The cheapest pattern: a feature flag plus a UI message. Most stateful services should have this on day one.

The implementation. A "read-only mode" flag at the application level. When set, write endpoints return a friendly 503 with a message ("we're temporarily read-only; please try again in a few minutes"). Read endpoints continue normally. The switch is a single config update.
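A minimal, framework-free sketch of that flag check follows. The `Flags` store, the `handle` function, and the `Response` type are hypothetical stand-ins for whatever flag service and web framework you already use; the `Retry-After` value of five minutes is likewise an assumed hint.

```python
from dataclasses import dataclass, field

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass
class Response:
    status: int
    body: str
    headers: dict[str, str] = field(default_factory=dict)


@dataclass
class Flags:
    # Hypothetical flag store; in practice backed by config or a flag service.
    read_only: bool = False


def handle(method: str, path: str, flags: Flags) -> Response:
    """Refuse writes with a friendly 503 while the read-only flag is set."""
    if flags.read_only and method.upper() in WRITE_METHODS:
        return Response(
            status=503,
            body="We're temporarily read-only; please try again in a few minutes.",
            headers={"Retry-After": "300"},  # assumed hint: retry in five minutes
        )
    # Stand-in for real routing: reads, and writes in normal mode, pass through.
    return Response(status=200, body=f"{method.upper()} {path} ok")
```

Checking the HTTP method in one place, rather than in each write endpoint, keeps the flag from being forgotten when a new endpoint is added.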

The customer experience. A read-only e-commerce site lets customers browse, share links, and save items wherever that doesn't need a database write; they stay "engaged" even though they can't check out. Many will return when full functionality returns; few would have returned if they had hit a hard error during their session.

Cached fallback

The recommendation service is down; you serve last week's recommendations. The personalisation service is degraded; you serve the popular-item list. Users get something instead of nothing. Cost: you maintain the cache. Reward: you keep most of the user experience.

The pattern. For each personalisation/recommendation/dynamic feature, maintain a fallback "good enough" version. The fallback is computed periodically (hourly, daily) and stored. When the real service fails, the fallback serves; the user gets a degraded but functional experience.
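As a rough sketch of the pattern, a small wrapper can try the primary service and fall back to the stored "good enough" version when it fails. The `FALLBACKS` store, the feature key, and the item names are made up for illustration; a real store would be refreshed by the periodic job described above.

```python
from typing import Callable

# Hypothetical fallback store, refreshed by a periodic job (hourly, daily).
FALLBACKS: dict[str, list[str]] = {
    "recommendations": ["popular-item-1", "popular-item-2", "popular-item-3"],
}


def with_fallback(
    feature: str, primary: Callable[[], list[str]]
) -> tuple[list[str], bool]:
    """Return (items, degraded); serve the stored fallback if the primary fails."""
    try:
        return primary(), False
    except Exception:
        # Broad catch is deliberate in this sketch: any primary failure degrades.
        return FALLBACKS.get(feature, []), True
```

Returning the `degraded` flag alongside the items lets the caller log or count the fallback instead of serving it silently.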

The freshness trade-off. Fallback recommendations are stale (last week's data). For some features that's fine (popular items don't change much). For others it's catastrophic (yesterday's stock prices). Match fallback freshness to feature requirements.
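One way to encode that matching is a per-feature staleness budget that is checked before a fallback is served. The feature names and the budgets below are illustrative assumptions only; the right numbers depend on the feature.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed per-feature staleness budgets; the numbers are illustrative only.
MAX_AGE = {
    "popular_items": timedelta(days=7),
    "stock_prices": timedelta(minutes=1),
}


def fallback_usable(
    feature: str, computed_at: datetime, now: Optional[datetime] = None
) -> bool:
    """True if the stored fallback is still fresh enough to show for this feature."""
    now = now or datetime.now(timezone.utc)
    budget = MAX_AGE.get(feature)
    if budget is None:
        return False  # no declared budget: do not serve stale data by default
    return now - computed_at <= budget
```

When the check fails, the caller would move down the spectrum (for example, hide the feature) rather than show data that is too old to be safe.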

Degraded UI

The thumbnail service is slow; you render the page with placeholder images and load the real ones lazily. The autocomplete service is down; the search box still works without suggestions. Patterns at the front-end level cost more in coordination but cover the failures back-end patterns cannot.

The coordination cost. Frontend and backend teams must agree on degradation behaviour. Frontend must handle "thumbnail service returned 503" gracefully (show placeholder). Backend must return 503 quickly rather than time out (so frontend renders without long delays).
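Both halves of that agreement can be sketched in a few lines. This is an assumption-laden illustration: `thumbnail_endpoint` stands in for the backend, `thumbnail_src` for the rendering side, and the 200 ms budget and placeholder path are made-up values.

```python
import concurrent.futures

PLACEHOLDER = "/static/placeholder.png"  # hypothetical asset path

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def thumbnail_endpoint(fetch, budget_s: float = 0.2) -> tuple[int, str]:
    """Backend side: answer within budget_s, returning 503 instead of hanging."""
    future = _pool.submit(fetch)
    try:
        return 200, future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        # The slow call keeps running in its worker thread; we just stop waiting.
        return 503, ""
    except Exception:
        return 503, ""  # the dependency failed outright


def thumbnail_src(status: int, url: str) -> str:
    """Rendering side: any non-200 answer becomes a placeholder image."""
    return url if status == 200 else PLACEHOLDER
```

The time budget is the contract between the teams: the backend promises an answer within it, and the renderer promises to cope with a 503.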

The failures degraded UI catches. Slow dependencies (the frontend renders what it has and fetches the rest asynchronously). Partial outages (autocomplete being down doesn't break the search box). Specific feature regressions (one widget fails, the page still loads). With degraded UI, each of these is barely noticeable to the user; without it, each one can break the whole page.

Drop the feature

Some features are luxuries. Real-time chat. Activity feeds. When the underlying service is down, hide the feature entirely and surface a small status note. Less elegant than the others; sometimes the only honest option.

The "hide" implementation. A feature flag controlled by the dependency's health check. When the dependency is unhealthy, the flag is off; the feature isn't rendered. The page works; the feature is missing; a small "temporarily unavailable" note acknowledges this.
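A minimal sketch of a health-driven flag, assuming a hypothetical chat feature and a caller-supplied health check (both are illustrative names, not part of any particular framework):

```python
from typing import Callable


def feature_enabled(health_check: Callable[[], bool]) -> bool:
    """The flag is on only while the dependency reports healthy."""
    try:
        return bool(health_check())
    except Exception:
        return False  # a health check that raises counts as unhealthy


def render_page(chat_healthy: Callable[[], bool]) -> list[str]:
    """Build the page sections; 'chat' is a hypothetical non-critical feature."""
    sections = ["header", "product_list"]
    if feature_enabled(chat_healthy):
        sections.append("chat_widget")
    else:
        # The page still works; a small note acknowledges what is missing.
        sections.append("note: Live chat is temporarily unavailable.")
    sections.append("footer")
    return sections
```

Treating a failing health check as "unhealthy" matters: the flag should never take the page down along with the feature it guards.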

What "drop the feature" is appropriate for. Non-critical add-ons. Things customers can live without for hours. The discipline is being honest about what's critical and what isn't; "the activity feed is critical" usually isn't true after honest analysis.

Order to add them

Read-only first. Cached fallback for the two or three highest-traffic features second. UI degradation for the visible ones third. Drop-the-feature last; it is a confession, not a strategy. Make the others work first.

The reasoning. Read-only is highest leverage because it covers the broadest case (any write outage). Cached fallback covers the dependency-specific cases. UI degradation covers the timing/partial-outage cases. Drop-the-feature is what you do when none of the others apply.

The team-maturity matching. Read-only is achievable by any team. Cached fallback requires data infrastructure. UI degradation requires frontend-backend coordination. Drop-the-feature is just a feature flag. Match adoption to team capacity; don't try to skip levels.

Common antipatterns

Degradation that's never tested. Read-only mode exists in the code but hasn't been activated in 18 months. The first time it fires, it's broken. Test degradation modes quarterly; they atrophy faster than primary code paths.

Silent degradation. System falls back to cached recommendations; nobody knows. A week later, recommendations are stale; nobody notices because there's no signal. Always alert when a fallback is active; the alert is the signal that "primary is broken" and needs fixing.
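A sketch of the signal, under the assumption that a plain in-process counter and the standard `logging` module stand in for your real metrics and alerting stack:

```python
import logging
from collections import Counter

log = logging.getLogger("degradation")
fallback_served: Counter = Counter()  # stand-in for a real metrics counter


def record_fallback(feature: str, reason: str) -> None:
    """Make every fallback use visible: count it and log it."""
    fallback_served[feature] += 1
    log.warning("fallback active feature=%s reason=%s", feature, reason)


def features_to_alert_on(threshold: int = 1) -> list[str]:
    """Features whose fallback fired at least `threshold` times so far."""
    return sorted(f for f, n in fallback_served.items() if n >= threshold)
```

Calling `record_fallback` from every fallback path gives the low-priority alert something to fire on, so "primary is broken" never goes unnoticed for a week.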

The cached fallback that's also broken. The cache job is meant to run daily but has been failing for two weeks. When the primary fails, the cache is two weeks stale. Monitor cache freshness; alert when the cache is older than its refresh interval allows.

Degradation that confuses the user. A vague banner ("Some features may be unavailable"), or UI elements that go missing with no explanation at all. Always tell the user what's degraded and why; an unexplained degraded experience can feel worse than an honest outage page.

What to do this week

Three moves. (1) For your service, list what would happen if the write database failed. If the answer is "the service is down," you don't have read-only mode yet; that's the first investment. (2) Identify your top two personalisation features that have a "popular items" or "default" version. Implement cached fallback; verify it works in staging by simulating the primary's failure. (3) Add a "fallback active" alert. Whenever any degradation pattern is engaged, the team gets a low-priority notification. Visibility into degraded states is what prevents them from becoming permanent.