AWS CLI Incident-Response Cheat Sheet
When the console is slow, the dashboard's lying, or the link is buried, these CLI commands answer the question faster.
Setup and identity
Confirm you're running in the right account and region before anything else. Mid-incident is a bad time to learn you're paged into staging.
- aws sts get-caller-identity, account, ARN, user ID right now
- aws configure list, current profile, region, credential source
- export AWS_PROFILE=prod-readonly, switch profile per shell
- export AWS_REGION=us-east-1, set region per shell
- aws --output json --query <path>, combine with jq or --query (JMESPath)
- --no-cli-pager, kill the auto-pager that breaks pipelines
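A minimal pre-flight sketch, assuming a profile named prod-readonly exists in your ~/.aws/config (the profile name is just an example):

export AWS_PROFILE=prod-readonly
export AWS_REGION=us-east-1
aws sts get-caller-identity --no-cli-pager   # confirm account and role before touching anything
aws configure list                           # confirm where the credentials actually come from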
EC2
Find the instance, check its status, restart it, or grab the console log.
- aws ec2 describe-instances --filters "Name=tag:Name,Values=api-*" --query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' --output table, find by tag
- aws ec2 describe-instance-status --instance-ids i-0abc, system + instance health checks
- aws ec2 get-console-output --instance-id i-0abc --output text, last console output (boot diagnostics)
- aws ec2 reboot-instances --instance-ids i-0abc, soft reboot
- aws ec2 stop-instances --instance-ids i-0abc / start-instances, full power cycle
- aws ec2 describe-security-groups --group-ids sg-0abc --query 'SecurityGroups[].IpPermissions', ingress rules
- aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=i-0abc", attached EBS
- aws ssm start-session --target i-0abc, Session Manager shell, no SSH key needed
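A sketch that chains the two lookups above: collect the IDs of running instances matching the api-* tag (the same example pattern as in the list), then check system and instance health for all of them at once:

ids=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=api-*" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
# $ids is deliberately unquoted so multiple IDs expand into separate arguments
aws ec2 describe-instance-status --instance-ids $ids \
  --query 'InstanceStatuses[].[InstanceId,SystemStatus.Status,InstanceStatus.Status]' --output table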
ELB / ALB
Half the "service is down" incidents are actually one unhealthy target group.
- aws elbv2 describe-load-balancers --query 'LoadBalancers[].[LoadBalancerName,State.Code,DNSName]' --output table, load balancers and their state
- aws elbv2 describe-target-groups --query 'TargetGroups[].[TargetGroupName,Protocol,Port]' --output table, target groups
- aws elbv2 describe-target-health --target-group-arn <arn> --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' --output table, which targets are unhealthy and why
- aws elbv2 describe-listeners --load-balancer-arn <arn>, listener rules, certificates
- aws elbv2 describe-rules --listener-arn <arn>, routing rules in priority order
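A sketch for going straight from a target group name to its unhealthy targets; api-tg is a placeholder name:

tg_arn=$(aws elbv2 describe-target-groups --names api-tg \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
# Only show targets that are not healthy, with the reason codes
aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
  --query 'TargetHealthDescriptions[?TargetHealth.State!=`healthy`].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
  --output table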
RDS
Status, connections, recent events. The RDS console is slow during incidents; the CLI is faster. A log-pull example follows the list.
- aws rds describe-db-instances --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceStatus,Engine,EngineVersion]' --output table, instance status
- aws rds describe-db-clusters --query 'DBClusters[].[DBClusterIdentifier,Status,Engine,EngineVersion]', Aurora clusters
- aws rds describe-events --duration 60 --source-type db-instance --source-identifier <id>, last hour of events
- aws rds describe-db-log-files --db-instance-identifier <id>, list error logs
- aws rds download-db-log-file-portion --db-instance-identifier <id> --log-file-name <file> --output text, pull error log
- aws rds reboot-db-instance --db-instance-identifier <id>, reboot (causes failover on Multi-AZ)
- aws rds failover-db-cluster --db-cluster-identifier <id>, manual Aurora failover
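A sketch for pulling the most recently written log file without eyeballing the file list; my-db-instance is a placeholder identifier:

db=my-db-instance
# Sort log files by LastWritten and take the newest one
latest=$(aws rds describe-db-log-files --db-instance-identifier "$db" \
  --query 'sort_by(DescribeDBLogFiles,&LastWritten)[-1].LogFileName' --output text)
aws rds download-db-log-file-portion --db-instance-identifier "$db" \
  --log-file-name "$latest" --output text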
CloudWatch
Metrics, logs, alarms. aws logs tail is criminally underused. An Insights query sketch follows the list.
- aws logs tail /aws/lambda/<fn> --follow, live tail; the killer feature
- aws logs tail /aws/lambda/<fn> --since 30m --filter-pattern ERROR, last 30 min, errors only
- aws logs describe-log-groups --query 'logGroups[].logGroupName', list groups
- aws logs start-query --log-group-name <name> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 100', Insights query
- aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=i-0abc --start-time $(date -u -d '1 hour ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) --period 60 --statistics Average, CPU last hour
- aws cloudwatch describe-alarms --state-value ALARM --query 'MetricAlarms[].[AlarmName,StateReason]' --output table, currently firing alarms
- aws cloudwatch set-alarm-state --alarm-name <n> --state-value OK --state-reason "manually cleared during incident", silence a flapping alarm
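One gap worth knowing: start-query only returns a query ID, and you need logs get-query-results to see the output. A sketch with a placeholder log group (date -d is GNU date, as in the list above):

qid=$(aws logs start-query --log-group-name /aws/lambda/my-fn \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 100' \
  --query queryId --output text)
sleep 5   # give the query a few seconds to run; re-run the fetch if the status is still Running
aws logs get-query-results --query-id "$qid"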
IAM and STS
Permission denied mid-incident is the worst kind of permission denied. Check the role, the policy, the boundary, in that order.
- aws iam get-user, current user
- aws iam list-attached-role-policies --role-name <role>, managed policies
- aws iam list-role-policies --role-name <role>, inline policies
- aws iam simulate-principal-policy --policy-source-arn <arn> --action-names s3:GetObject --resource-arns <arn>, would this call succeed?
- aws sts assume-role --role-arn <arn> --role-session-name break-glass-$(date +%s), assume break-glass role
- aws sts decode-authorization-message --encoded-message <blob>, decode the cryptic "encoded authorization failure message"
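A sketch of the full break-glass sequence: assume the role, then export the temporary credentials so the rest of the shell session uses them. The role ARN is a placeholder:

creds=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/break-glass \
  --role-session-name break-glass-$(date +%s) \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text)
# Text output is tab-separated, so split by field
export AWS_ACCESS_KEY_ID=$(echo "$creds" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$creds" | cut -f3)
aws sts get-caller-identity   # confirm you are now the break-glass role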
Health and Service quotas
Sometimes the answer is "AWS has the incident, not you." Check Health and quotas before you spend an hour on the wrong service.
- aws health describe-events --filter eventStatusCodes=open --query 'events[].[service,statusCode,region,startTime]' --output table, open AWS Health events affecting your account (Business support+)
- aws health describe-affected-entities --filter eventArns=<arn>, what's specifically affected
- aws service-quotas list-service-quotas --service-code ec2 --query 'Quotas[?Value!=null].[QuotaName,Value]' --output table, current limits
- aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 200, emergency quota bump
- Public status: health.aws.amazon.com, the page that lags reality by 20 minutes but is what your CEO will read
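A sketch that chains the two Health calls above: take the first open event and list what it's hitting. The Health API sits behind a global endpoint in us-east-1, so pin the region:

arn=$(aws health describe-events --filter eventStatusCodes=open \
  --query 'events[0].arn' --output text --region us-east-1)
# Prints "None" if there are no open events
aws health describe-affected-entities --filter eventArns="$arn" \
  --query 'entities[].[entityValue,statusCode]' --output table --region us-east-1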