First Step Functions Workflow

Initial setup

The first decision for a new Step Functions state machine is the workflow type. Standard suits long-running orchestrations such as multi-day jobs and approval flows: executions can run for up to one year and are billed per state transition. Express suits high-throughput, event-driven workloads: executions are capped at five minutes and are billed by number of requests and duration. The type is fixed at creation; switching later means recreating the state machine.
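The choice can be made explicit in code at creation time. The following is a minimal sketch, not a definitive rule: pick_workflow_type is a hypothetical helper that encodes only the rule of thumb above, and the machine name and role ARN are placeholders. The returned dictionary uses the name, roleArn, definition, and type parameters of boto3's create_state_machine call.

```python
import json

# Express executions are capped at five minutes; Standard can run far longer.
EXPRESS_MAX_SECONDS = 5 * 60


def pick_workflow_type(expected_max_seconds: int, needs_approval_step: bool) -> str:
    """Hypothetical rule of thumb: return 'STANDARD' or 'EXPRESS'."""
    if needs_approval_step or expected_max_seconds >= EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"


def create_state_machine_kwargs(name: str, role_arn: str, definition: dict,
                                workflow_type: str) -> dict:
    """Build keyword arguments for boto3's stepfunctions create_state_machine."""
    return {
        "name": name,
        "roleArn": role_arn,
        "definition": json.dumps(definition),  # the API expects a JSON string
        "type": workflow_type,                 # cannot be changed after creation
    }


# A trivial one-state definition, just enough to be a valid state machine.
hello = {"StartAt": "Hello", "States": {"Hello": {"Type": "Pass", "End": True}}}

kwargs = create_state_machine_kwargs(
    name="order-intake",                                          # placeholder
    role_arn="arn:aws:iam::123456789012:role/order-intake-sfn",   # placeholder
    definition=hello,
    workflow_type=pick_workflow_type(expected_max_seconds=20,
                                     needs_approval_step=False),
)
print(kwargs["type"])  # EXPRESS
```

With real credentials, the dictionary could be passed as boto3.client("stepfunctions").create_state_machine(**kwargs). Throughput, idempotency, and cost also matter when choosing a type; the helper deliberately ignores them.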

Simple workflow

Four state types, Task, Choice, Map, and Parallel, cover most operational use cases. Task invokes a Lambda function or another AWS service; Choice branches on its input, which is normally the previous state's output; Map fans out across the items of an array; Parallel runs its branches concurrently and waits for all of them to complete.
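To make the four state types concrete, here is a sketch of an Amazon States Language definition built as a Python dictionary. The order-processing scenario, the state names, and the Lambda ARNs are all assumptions made for illustration. The dangling_targets helper is also hypothetical: it only checks that every top-level transition points at a state that exists.

```python
import json

# Placeholder Lambda ARNs; substitute real functions.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"
VALIDATE_FN, PROCESS_FN = ARN + "validate-order", ARN + "process-item"
NOTIFY_FN, ARCHIVE_FN = ARN + "notify", ARN + "archive"


def task(resource: str) -> dict:
    """A terminal Task state, used inside Map and Parallel sub-workflows."""
    return {"Type": "Task", "Resource": resource, "End": True}


definition = {
    "StartAt": "ValidateOrder",
    "States": {
        # Task: invoke a Lambda function.
        "ValidateOrder": {"Type": "Task", "Resource": VALIDATE_FN, "Next": "IsValid"},
        # Choice: branch on the previous state's output.
        "IsValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "ProcessItems"}
            ],
            "Default": "Rejected",
        },
        # Map: run the same sub-workflow once per element of $.items.
        "ProcessItems": {
            "Type": "Map",
            "ItemsPath": "$.items",
            "ItemProcessor": {
                "StartAt": "ProcessItem",
                "States": {"ProcessItem": task(PROCESS_FN)},
            },
            "ResultPath": "$.processed",
            "Next": "NotifyAndArchive",
        },
        # Parallel: run both branches concurrently, wait for both.
        "NotifyAndArchive": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Notify", "States": {"Notify": task(NOTIFY_FN)}},
                {"StartAt": "Archive", "States": {"Archive": task(ARCHIVE_FN)}},
            ],
            "End": True,
        },
        "Rejected": {"Type": "Fail", "Error": "InvalidOrder",
                     "Cause": "Order failed validation"},
    },
}


def dangling_targets(machine: dict) -> set:
    """Return top-level transition targets that are not defined states."""
    states = machine["States"]
    wanted = {machine["StartAt"]}
    for state in states.values():
        wanted.update(state[key] for key in ("Next", "Default") if key in state)
        wanted.update(rule["Next"] for rule in state.get("Choices", []))
    return wanted - set(states)


assert not dangling_targets(definition)
print(json.dumps(definition, indent=2))
```

The printed JSON is what would be supplied as the state machine definition. The check is a cheap guard against typos in state names, not a substitute for validation by the service.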

Error handling

Error handling in Step Functions is declarative. Catch rules route each error type to a handler state, Retry policies re-run a failed state with exponential backoff, fallback states provide graceful degradation, and explicit timeouts stop stuck executions. Build error handling into the state machine rather than into the Lambda code.
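As an illustration, the sketch below attaches a timeout, a Retry policy, and two Catch rules to a single Task state. The payment scenario, state names, and Lambda ARN are assumptions. The retry_delays helper is hypothetical and only computes the waits implied by IntervalSeconds, BackoffRate, and MaxAttempts, so the backoff schedule can be reviewed before deployment.

```python
import json

CHARGE_FN = "arn:aws:lambda:us-east-1:123456789012:function:charge-card"  # placeholder

definition = {
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": CHARGE_FN,
            "TimeoutSeconds": 30,  # fail with States.Timeout instead of hanging
            "Retry": [
                {
                    # Transient Lambda errors: retry with exponential backoff.
                    "ErrorEquals": ["Lambda.ServiceException",
                                    "Lambda.TooManyRequestsException"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                # Timeouts degrade gracefully to a fallback state.
                {"ErrorEquals": ["States.Timeout"],
                 "ResultPath": "$.error", "Next": "QueueForLater"},
                # Everything else is routed to an explicit failure state.
                {"ErrorEquals": ["States.ALL"],
                 "ResultPath": "$.error", "Next": "PaymentFailed"},
            ],
            "Next": "Done",
        },
        "QueueForLater": {"Type": "Pass", "Result": {"queued": True}, "End": True},
        "PaymentFailed": {"Type": "Fail", "Error": "PaymentFailed",
                          "Cause": "Charge could not be completed"},
        "Done": {"Type": "Succeed"},
    },
}


def retry_delays(retrier: dict) -> list:
    """Seconds waited before each retry attempt, using the ASL defaults."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


print(retry_delays(definition["States"]["ChargeCard"]["Retry"][0]))  # [2.0, 4.0, 8.0]
print(json.dumps(definition["States"]["ChargeCard"]["Catch"], indent=2))
```

Catch rules are evaluated in order, so the States.ALL rule goes last. Because the Lambda function only has to raise, retry counts and fallbacks can change without redeploying function code.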

Observability

Step Functions offers three observability surfaces: execution logs in CloudWatch Logs, X-Ray distributed traces, and the per-execution visual graph in the console. A CloudWatch alarm on failed executions for each state machine closes the loop on operations.
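A sketch of how logging, tracing, and a failure alarm might be wired up follows. All ARNs are placeholders, and the helper names are hypothetical. The first dictionary uses the loggingConfiguration and tracingConfiguration parameters accepted by boto3's create_state_machine and update_state_machine; the second uses parameters of CloudWatch's put_metric_alarm with the ExecutionsFailed metric in the AWS/States namespace. The threshold and period are assumptions to tune per workload.

```python
def observability_kwargs(log_group_arn: str) -> dict:
    """Logging and tracing settings for create/update_state_machine."""
    return {
        "loggingConfiguration": {
            "level": "ALL",                 # or "ERROR", "FATAL", "OFF"
            "includeExecutionData": False,  # keep payloads out of the logs
            "destinations": [
                {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}
            ],
        },
        "tracingConfiguration": {"enabled": True},  # X-Ray traces
    }


def failed_execution_alarm_kwargs(state_machine_arn: str, sns_topic_arn: str) -> dict:
    """Alarm that fires when any execution fails within a five-minute window."""
    machine_name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"{machine_name}-executions-failed",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no executions means no alarm
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, alarm_kwargs: dict) -> None:
    """cloudwatch_client is expected to be boto3.client('cloudwatch')."""
    cloudwatch_client.put_metric_alarm(**alarm_kwargs)


# Placeholder ARNs for illustration only.
MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:order-intake"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"
LOG_GROUP_ARN = "arn:aws:logs:us-east-1:123456789012:log-group:/sfn/order-intake:*"

alarm = failed_execution_alarm_kwargs(MACHINE_ARN, TOPIC_ARN)
print(alarm["AlarmName"])  # order-intake-executions-failed
print(observability_kwargs(LOG_GROUP_ARN)["tracingConfiguration"])
```

Keeping includeExecutionData off avoids writing payloads to the logs; turn it on only where the data is safe to store there.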

Operating Step Functions

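Four practices keep state machines maintainable: define them as infrastructure as code (IaC), version-control changes and review them through pull requests, test individual states with the TestState API before promotion, and assign a named owner to each machine. Click-built production state machines become legacy debt the day after the original author leaves.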