Kafka vs RabbitMQ vs SQS: Message Bus Tradeoffs
The right message bus depends on what you mean by ‘messaging’: streams, queues, and pub-sub each map to different tools.
Kafka: streaming, retention
Kafka is high-throughput streaming with long retention and replay. Ordered partitions, durable storage, consumer-driven offsets; the right call when "I might want to replay this" is part of the requirements.
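- High throughput. A single broker can sustain hundreds of thousands of messages per second, depending on message size and hardware; scales horizontally with partitions.
- Ordered partitions. Ordering is guaranteed within a partition, not across the whole topic; a consumer reads each partition in the order the records were appended.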
- Long retention. Days, weeks, or forever; the broker is the source of truth, not a transient buffer.
- Sweet spot. Event streams, analytics pipelines, log aggregation; anything you might want to replay or reprocess.
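The partition, offset, and replay ideas above can be sketched with a toy in-memory log. This is an illustration only, not a Kafka client: `ToyLog`, `ToyConsumer`, and the CRC32-based partitioner are hypothetical stand-ins chosen to keep the example self-contained.

```python
import zlib
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a partitioned, retained log (not a Kafka client)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def partition_for(self, key: str) -> int:
        # Assumption: CRC32 as a simple deterministic hash. The same key always
        # lands on the same partition, so per-key order is preserved.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> tuple[int, int]:
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition: int, offset: int) -> list[tuple[str, str]]:
        # Reading never deletes records; they stay available for later readers.
        return self.partitions[partition][offset:]


class ToyConsumer:
    """Tracks its own offset per partition, which is what makes replay possible."""

    def __init__(self, log: ToyLog) -> None:
        self.log = log
        self.offsets: dict[int, int] = defaultdict(int)

    def poll(self, partition: int) -> list[tuple[str, str]]:
        records = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(records)
        return records

    def seek(self, partition: int, offset: int) -> None:
        # Rewinding the offset is all a replay needs.
        self.offsets[partition] = offset


log = ToyLog(num_partitions=3)
for i in range(3):
    log.append("order-42", f"event-{i}")

p = log.partition_for("order-42")
consumer = ToyConsumer(log)
first_pass = consumer.poll(p)   # all three events, in append order
caught_up = consumer.poll(p)    # nothing new since the last poll
consumer.seek(p, 0)             # rewind to the start of the partition
replayed = consumer.poll(p)     # the same three events again
```

The point of the sketch is that the consumer, not the broker, owns the read position; a queue that deletes on acknowledgment has nothing left to rewind to.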
RabbitMQ: queues, routing
RabbitMQ is queues with rich routing. Lower throughput than Kafka, more flexible than SQS; the right call when routing rules and selective consumption are first-class needs.
- Queues. Classic queue semantics; the broker pushes messages to subscribed consumers (polling is possible but less common) and tracks acks; the canonical message-queue model.
- Routing rules. Direct, topic, fanout, and headers exchanges; complex topologies expressed declaratively.
- Selective consumption. Each queue is bound to an exchange with specific binding keys; the broker filters on the routing key, which saves consumer-side complexity.
- Sweet spot. Task queues, work distribution, complex routing topologies; anywhere "this message goes to that consumer" is intricate.
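To make the routing bullets concrete, here is a minimal sketch of topic-exchange matching, where `*` stands for exactly one dot-separated word and `#` for zero or more. It is a plain-Python model of the matching rule, not RabbitMQ client code; `topic_matches`, `route`, and the queue names are hypothetical.

```python
from functools import lru_cache


def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if a topic-style binding key matches a routing key."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        if i == len(pattern):
            return j == len(words)
        if pattern[i] == "#":
            # '#' absorbs zero words (advance the pattern)
            # or one more word (advance the routing key).
            return match(i + 1, j) or (j < len(words) and match(i, j + 1))
        if j == len(words):
            return False
        if pattern[i] == "*" or pattern[i] == words[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings: dict[str, list[str]], routing_key: str) -> list[str]:
    """Return the sorted names of queues whose bindings match the routing key."""
    return sorted(
        queue
        for queue, keys in bindings.items()
        if any(topic_matches(key, routing_key) for key in keys)
    )


# Hypothetical queues and binding keys, for illustration only.
bindings = {
    "billing": ["order.*.paid"],
    "audit": ["order.#"],
    "eu_ops": ["*.eu.*"],
}

paid_targets = route(bindings, "order.eu.paid")
refund_targets = route(bindings, "order.eu.refund.failed")
```

Here `paid_targets` contains all three queues, while `refund_targets` contains only `audit`, because the four-word key is too long for the two three-word patterns. The filtering happens before any consumer sees the message, which is the consumer-side complexity the broker saves.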
SQS: managed simplicity
SQS is AWS-managed simplicity: no broker to operate, near-zero ops, standard or FIFO. The right call for AWS-committed teams that want a queue and want to skip the operational tax of running their own.
- AWS-managed. No broker, no patches, no failover engineering; the operational story is "submit and pay."
- Standard or FIFO. Standard for at-least-once delivery and best-effort ordering; FIFO for strict ordering within a message group, at lower throughput.
- Near-zero ops. Auto-scaling, durability, and redundant storage across Availability Zones within a region are all handled by AWS.
- Sweet spot. Simple async tasks where the queue is just a queue; you want to focus on producers and consumers, not brokers.
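At-least-once delivery on a standard queue means a consumer can occasionally see the same message twice, so handlers are usually written to be idempotent. The following is a minimal sketch of that idea in plain Python, with no AWS calls; `IdempotentHandler`, the message IDs, and the in-memory `seen_ids` set are assumptions for illustration, and a real service would likely keep the seen IDs in a durable store.

```python
from typing import Callable


class IdempotentHandler:
    """Wraps a handler so a redelivered message ID is processed only once."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self.handler = handler
        # Assumption: in-memory set; this does not survive a restart.
        self.seen_ids: set[str] = set()

    def handle(self, message_id: str, body: str) -> bool:
        """Process the message; return False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.handler(body)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried on redelivery.
        self.seen_ids.add(message_id)
        return True


processed: list[str] = []
worker = IdempotentHandler(processed.append)

# Hypothetical delivery sequence in which "m-1" arrives twice.
deliveries = [
    ("m-1", "resize image 7"),
    ("m-2", "send receipt 9"),
    ("m-1", "resize image 7"),
]
results = [worker.handle(mid, body) for mid, body in deliveries]
```

After the run, `processed` holds each body once and `results` marks the third delivery as a duplicate.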
The dual-bus pattern
Many teams ship a dual-bus architecture: Kafka for events that other systems consume, SQS for tasks one service produces and another consumes. Each at the right scale; no overlap if you scope by purpose.
- Kafka for events. Domain events that multiple systems consume; the analytics warehouse and the search index both subscribe.
- SQS for tasks. Producer-consumer task queues; one service hands a unit of work to another; queue semantics fit.
- No overlap. Scope each by purpose; the two patterns rarely compete; each plays to its strengths.
- The cost. Two systems to operate; team needs both literacies; the duplication pays back if both purposes exist.
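The split by purpose comes down to two delivery semantics: an event goes to every subscriber, a task goes to exactly one worker. A toy sketch under that assumption follows; `EventBus` and `TaskQueue` are hypothetical in-memory stand-ins for the Kafka side and the SQS side, not real clients.

```python
from collections import deque
from typing import Optional


class EventBus:
    """Stand-in for the event side: every subscriber sees every event."""

    def __init__(self) -> None:
        self.subscribers: dict[str, list[str]] = {}

    def subscribe(self, name: str) -> None:
        self.subscribers[name] = []

    def publish(self, event: str) -> None:
        for inbox in self.subscribers.values():
            inbox.append(event)


class TaskQueue:
    """Stand-in for the task side: each task is handed to one worker."""

    def __init__(self) -> None:
        self.pending: deque[str] = deque()

    def send(self, task: str) -> None:
        self.pending.append(task)

    def receive(self) -> Optional[str]:
        return self.pending.popleft() if self.pending else None


events = EventBus()
events.subscribe("warehouse")
events.subscribe("search_index")
events.publish("order_placed:42")     # a domain event: both systems get a copy

tasks = TaskQueue()
tasks.send("send_receipt:42")         # a unit of work: one consumer takes it
worker_a = tasks.receive()
worker_b = tasks.receive()            # nothing left for a second worker
```

If a message needs the first behavior it belongs on the event bus; if it needs the second it belongs on the task queue, which is why the two rarely compete.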
Antipatterns
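- Kafka for simple async tasks. The operational overhead exceeds the value.
- RabbitMQ for analytics streams. Throughput is lower than Kafka's, and classic queues delete messages once acknowledged, so replay is weak.
- SQS beyond its quotas. Workloads that outgrow the service quotas, most often on FIFO queues, get throttled; the degradation tends to surface mid-quarter, after traffic has grown.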
What to do this week
Three moves. (1) Run a 30-day trial of the candidate against your real workload. (2) Compare total cost of ownership (TCO) and workflow fit, not just feature checklists. (3) Decide and commit; running two candidates in parallel for the same purpose is the most expensive option.