Dashboards Stakeholders Actually Open
Most engineering dashboards are bookmarked once and never opened again. The principles that turn a chart wall into something a VP checks every morning.
Audience first
An engineer's dashboard and a VP's dashboard are different artifacts. The engineer wants every signal; the VP wants the verdict. Pick the audience first; let the layout follow.
The mistake teams make. They build one dashboard for both audiences. The result is too dense for the VP (who skips it) and too sparse for the engineer (who supplements with their own queries). Both audiences are underserved.
The discipline. Two dashboards per service: stakeholder view (4 tiles, one-glance) and engineering view (everything else). Link them so the stakeholder can drill in if they want; default to the stakeholder view above the fold.
The four-tile pattern that travels
For any service, four tiles top to bottom: health (single number, traffic light), traffic (qps trend), error rate (with the SLO threshold drawn as a line), latency (p95 with the SLO threshold). Anything else is for engineering view.
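The four-tile pattern can be written down as plain data. A minimal sketch, assuming a tile is a dict; the keys, metric names, and threshold values are illustrative, not any real dashboard tool's schema:

```python
# The stakeholder view, top to bottom. Threshold values are example SLOs.
STAKEHOLDER_TILES = [
    {"title": "Health", "metric": "availability_30d", "render": "traffic_light"},
    {"title": "Traffic", "metric": "requests_per_second", "render": "trend"},
    {"title": "Error rate", "metric": "error_rate", "render": "trend",
     "threshold_line": 0.001},  # SLO threshold drawn as a line on the chart
    {"title": "Latency", "metric": "latency_p95_ms", "render": "trend",
     "threshold_line": 500},    # p95 SLO in milliseconds
]

# Exactly four tiles; anything else belongs on the engineering view.
assert len(STAKEHOLDER_TILES) == 4
```

Keeping the layout as data makes the "exactly four tiles" rule enforceable in review rather than a matter of taste.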
Why these four. They answer the four implicit questions stakeholders have: are we healthy (health), how busy are we (traffic), how often are we breaking (error rate), how fast are we (latency). Different services emphasise different aspects, but all four are universal.
The composition. Health on top because it's the answer to the implicit question "should I care?" If health is green, the stakeholder moves on. If red or yellow, they read the other three for context. The hierarchy of attention drives the order.
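The traffic-light mapping on the health tile can be sketched as a small function. The yellow band (within 0.1 percentage points below the SLO) is an illustrative choice, not a standard:

```python
def health_colour(availability: float, slo: float = 0.999) -> str:
    """Map an availability number to the health tile's traffic-light colour."""
    if availability >= slo:
        return "green"            # at or above SLO: stakeholder moves on
    if availability >= slo - 0.001:
        return "yellow"           # just under SLO: worth reading the other tiles
    return "red"                  # clearly breaching: read everything
```

For example, `health_colour(0.9994)` is green against a 99.9% SLO, while `health_colour(0.9985)` is yellow.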
Three rules
- One number above the fold. The first thing the reader sees should answer the question they came with.
- Threshold lines on every chart. A chart without context is a chart that nobody can interpret.
- Default time range = last 24 hours. Not 1 hour (too noisy), not 7 days (too smoothed).
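The second and third rules are mechanically checkable. A lint-style sketch, again assuming a tile is a plain dict with illustrative keys:

```python
def lint_tile(tile: dict) -> list[str]:
    """Check one chart spec against the threshold-line and time-range rules.
    The dict keys ("render", "threshold_line", "time_range_hours") are
    illustrative, not any real tool's schema."""
    problems = []
    # Threshold lines on every chart (single-number tiles are exempt).
    if tile.get("render") == "trend" and "threshold_line" not in tile:
        problems.append(f'{tile["title"]}: no SLO threshold line')
    # Default time range is the last 24 hours.
    if tile.get("time_range_hours", 24) != 24:
        problems.append(f'{tile["title"]}: default range is not 24h')
    return problems
```

The first rule (one number above the fold) is a property of the layout, not of any single tile, so it stays a review-time check.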
The "one number" rule. Stakeholders read in 5 seconds; if the first thing they see isn't the answer, they leave. The big number (e.g., "99.7% available this month") in a green/yellow/red colour is what they need.
The threshold lines. A latency chart showing 280ms is meaningless without context. Add a horizontal line at the SLO threshold (say 500ms). The chart now tells a story: "we're under threshold; we're healthy." Without the line, the reader can't interpret the number.
The 24-hour default. 1 hour is dominated by noise (a single bad minute spikes the chart). 7 days is too smoothed (a 30-minute outage barely registers). 24 hours hits the sweet spot: meaningful events show up; transient noise averages out.
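The arithmetic behind the 24-hour default is simple: the visual weight of an outage is the fraction of the window it occupies. A sketch, assuming a full outage and a chart that averages uniformly over the window:

```python
def outage_fraction(outage_minutes: float, window_hours: float) -> float:
    """Fraction of the chart window a full outage occupies: how big the dip looks."""
    return outage_minutes / (window_hours * 60)

# A 30-minute full outage:
# on a 1-hour chart it is half the window (noise dominates everything),
# on a 24-hour chart it is about 2% (clearly visible, not overwhelming),
# on a 7-day chart it is about 0.3% (barely registers).
```

For example, `outage_fraction(30, 1)` is `0.5`, `outage_fraction(30, 24)` is roughly `0.021`, and `outage_fraction(30, 24 * 7)` is roughly `0.003`.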
What to leave out
The full ladder of latency percentiles (p50/p99/p999). Per-endpoint breakdowns. The full burn-rate curve. Any chart with more than two lines. None of these belong on the stakeholder dashboard. They belong on the engineering view, one click away.
The discipline of leaving things out. Engineers want to add their pet metric. Each addition reduces the dashboard's communicative power. Resist additions; create separate engineering dashboards for the breakdown.
The "but my service is special" defence. Every team feels their service has unique stakeholder metrics that need to be on the dashboard. Most don't. The four-tile pattern is universal; specific metrics belong on the engineering view.
How dashboards rot
Service deprecated, dashboard not. Threshold tightened, line not updated. Owner left, dashboard inherited by nobody. Quarterly: prune dashboards nobody opened. The unowned ones go first.
The owner discipline. Every dashboard has a named owner. The owner's job: keep it accurate. When they leave, ownership transfers explicitly; without explicit transfer, the dashboard is unowned and rots.
The pruning ritual. Quarterly, list all team dashboards; for each, check views in the last 90 days. If 0 views, propose deletion. The discussion sometimes reveals the dashboard IS used, just rarely; more often it confirms the dashboard should go.
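The pruning check is a one-liner once you have view counts. A sketch; the view data is assumed to come from whatever usage analytics your dashboard tool exposes:

```python
def pruning_candidates(view_counts: dict[str, int]) -> list[str]:
    """Given dashboard name -> views in the last 90 days, return the
    zero-view dashboards to propose for deletion at the quarterly review."""
    return sorted(name for name, views in view_counts.items() if views == 0)
```

For example, `pruning_candidates({"api": 120, "api-v2-real": 0, "billing-old": 0})` returns `["api-v2-real", "billing-old"]`: the proposal list, not an automatic delete.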
Worked example
Stakeholder dashboard for the API service. Top: "API health: 99.94% (last 30 days)" in green. Second tile: requests/second over last 24 hours. Third tile: error rate (with red line at SLO threshold of 0.1%). Fourth tile: p95 latency (with red line at 500ms SLO).
What it doesn't include. p50/p99/p999 latencies (engineering view). Per-endpoint breakdown (engineering view). Per-region breakdown (engineering view). Each is valuable to engineers; none answers the stakeholder's question.
The link to engineering view. Top-right corner: "Engineering view →". One click; the engineer sees everything; the stakeholder isn't burdened.
Common antipatterns
The "comprehensive" dashboard. 30 charts on one page. Stakeholders skim and miss the answer. Resist; pare to 4 tiles.
Dashboards without thresholds. Charts that look interesting but don't tell you whether they're good or bad. Always add the SLO line.
The dashboard nobody opens. Created for a meeting; never updated; eventually deleted. The 90-day-views check catches these.
Multiple dashboards for the same service. Old version, new version, "v2", "real one." Confuses everyone. Maintain one canonical dashboard per audience per service.
What to do this week
Three moves.
- Audit your team's dashboards. List them by views in the last 90 days; the bottom 50% are candidates for deletion.
- For your most important service, build the four-tile stakeholder dashboard if you don't have one.
- Add SLO threshold lines to every chart that lacks them. The visual context is what makes dashboards interpretable.