First Step Function
Workflow.
Initial setup
The first Step Functions state machine starts with the execution-type choice: Standard for long-running orchestrations (per state-transition billing, multi-day jobs, approval flows) or Express for high-throughput event-driven workloads (per-duration billing, sub-five-minute executions). Pick the type at creation; switching later requires recreating the machine.
- Console or CLI definition. ASL (Amazon States Language) in JSON or YAML. CLI for everything reproducible.
- Two execution types. Standard for long-running, per state-transition billed; Express for high-throughput, per-duration billed.
- When Standard. Long-running approvals, multi-day jobs, human-paced orchestration.
- When Express. High-frequency event-driven paths, sub-five-minute executions.
Simple workflow
Task, Choice, Map, Parallel cover most operational use cases. Task invokes a Lambda or service; Choice branches on previous output; Map fans out across an array; Parallel runs concurrent branches and waits for all to complete.
- Task states. Lambda invocation per step. Output flows to the next state via input/output paths.
- Choice state. Conditional branch per decision. Logic based on previous state's output.
- Map state. Fan-out per array. Each element processed in parallel up to a configured concurrency cap.
- Parallel state. Concurrent branches per step. Waits for all branches; aggregates outputs.
Error handling
Error handling in Step Functions is declarative. Catch routes per error type, Retry policies with exponential backoff, Fallback states for graceful degradation, explicit timeouts to catch stuck executions. Build error handling into the state machine rather than the Lambda code.
- Catch blocks per error type. Retry, fall back, or terminate based on the error class. Routes are declarative.
- Retry policies. Exponential backoff with max attempts per task. Built-in, declarative, free.
- Fallback states. Alternate path per task. Graceful degradation when the primary path fails.
- Explicit timeouts per state. Catches stuck executions before they wait forever.
Observability
Three observability surfaces: CloudWatch state-transition logs, X-Ray distributed traces, the per-execution visual graph in the console. Failed-execution alarms per state machine close the loop on operations.
- CloudWatch logs. State-transition log per execution. Trace through state transitions for debugging.
- X-Ray integration. Distributed trace per execution. Spans across Lambda invocations within a state machine.
- Per-execution visual graph. Console UI shows the run as a clickable graph. Inspect inputs and outputs per state.
- Failed-execution alarm. Per-state-machine alarm on failure rate. Operations close the loop.
Operating Step Functions
IaC for definitions, version-controlled changes via PR review, TestState API for state-level testing before promote, named owner per machine. Click-built production state machines become legacy debt the day after the original author leaves.
- IaC for state-machine definitions. Terraform, CDK, or SAM source per machine. Click-build is technical debt.
- Version control plus PR review. Every change reviewed before merge. Standard engineering hygiene.
- TestState API for pre-promote validation. State-level testing without running the whole machine. Faster iteration, real coverage.
- Named owner per machine. Responsible team per state machine. Supports operational reviews.