Reconstructing the Incident Timeline From Telemetry
Most incident timelines are vague and contradictory. This pipeline produces a precise one from alerts, metrics, logs, and traces.
Sources
Alerts: when did the page fire?
Metrics: when did the symptom start?
Logs: when did the error first appear?
Traces: which request showed the slowdown first?
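Each question above comes down to finding the earliest relevant record in one source. A minimal sketch, assuming every source has already been exported as (source, ISO 8601 timestamp, message) tuples; the sample records and the first_seen helper are hypothetical, not part of any particular tool:

```python
from datetime import datetime, timezone

# Hypothetical sample records: (source, ISO 8601 timestamp, message).
records = [
    ("logs", "2024-05-01T10:02:17+00:00", "ERROR upstream timeout"),
    ("metrics", "2024-05-01T10:01:40+00:00", "p99 latency above threshold"),
    ("alerts", "2024-05-01T10:04:05+00:00", "page fired: checkout latency"),
    ("logs", "2024-05-01T10:03:02+00:00", "ERROR upstream timeout"),
    ("traces", "2024-05-01T10:01:12+00:00", "slow span in checkout"),
]


def first_seen(records):
    """Return the earliest (timestamp, message) pair seen for each source."""
    earliest = {}
    for source, ts, message in records:
        # Convert to UTC so records from different sources compare correctly.
        when = datetime.fromisoformat(ts).astimezone(timezone.utc)
        if source not in earliest or when < earliest[source][0]:
            earliest[source] = (when, message)
    return earliest


for source, (when, message) in sorted(first_seen(records).items()):
    print(source, when.isoformat(), message)
```

In this made-up data the trace slowdown comes first and the page fires last, which is the kind of ordering the timeline is meant to make explicit.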
Merge
Normalize every timestamp to one time zone (UTC is the usual choice), sort by timestamp, and annotate each event with its source. The merged stream is the timeline.
Tools: a simple Python script, or the incident management vendor's timeline feature.
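The simple Python script could look roughly like the sketch below. It assumes each source is a list of (source, ISO 8601 timestamp, message) tuples and that timestamps without an offset are already in UTC; the Event type, the helper names, and the sample data are all hypothetical:

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str
    message: str


def parse(source, ts, message):
    """Build an Event with its timestamp normalized to UTC."""
    when = datetime.fromisoformat(ts)
    if when.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC.
        when = when.replace(tzinfo=timezone.utc)
    return Event(when.astimezone(timezone.utc), source, message)


def merge_timeline(*streams):
    """Merge any number of source streams into one time-ordered list."""
    events = [parse(*row) for stream in streams for row in stream]
    return sorted(events, key=lambda e: e.when)


def render(timeline):
    """Format each event as 'HH:MM:SSZ [source] message'."""
    return [f"{e.when:%H:%M:%S}Z [{e.source}] {e.message}" for e in timeline]


# Hypothetical sample data with mixed offsets.
alerts = [("alerts", "2024-05-01T10:04:05+00:00", "page fired")]
logs = [("logs", "2024-05-01T12:02:17+02:00", "first ERROR line")]
metrics = [("metrics", "2024-05-01T10:01:40", "latency starts rising")]

for line in render(merge_timeline(alerts, logs, metrics)):
    print(line)
```

The log entry is stamped 12:02:17 at +02:00, so it sorts as 10:02:17Z, between the metric change and the page. Sorting the raw strings instead would have put it last.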
Use in postmortem
The reconstructed timeline goes into the postmortem, where it removes the 'when did this happen' debate.
Decisions and actions taken get added by humans. Telemetry provides the spine.
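One way to layer human entries onto the telemetry spine is to treat them as one more source and re-sort. A minimal sketch, assuming both lists hold (UTC datetime, source, message) tuples; the annotate helper and all sample entries are hypothetical:

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical telemetry spine, already merged and sorted.
spine = [
    (datetime(2024, 5, 1, 10, 1, 40, tzinfo=UTC), "metrics", "latency starts rising"),
    (datetime(2024, 5, 1, 10, 4, 5, tzinfo=UTC), "alerts", "page fired"),
]

# Hypothetical entries written by people during the postmortem.
human_notes = [
    (datetime(2024, 5, 1, 10, 9, 0, tzinfo=UTC), "human", "decision: roll back the deploy"),
    (datetime(2024, 5, 1, 10, 3, 0, tzinfo=UTC), "human", "on-call notices dashboard anomaly"),
]


def annotate(spine, notes):
    """Interleave human notes with telemetry events, keeping time order."""
    return sorted(spine + notes, key=lambda entry: entry[0])


postmortem_timeline = annotate(spine, human_notes)
for when, source, message in postmortem_timeline:
    print(f"{when:%H:%M:%S}Z [{source}] {message}")
```

Keeping the source tag on every entry lets readers tell machine-recorded times apart from times a person recalled afterwards.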