Queues vs Streams: Message Delivery, Ordering & Reliability
Core Distinction
The fundamental split is about message lifecycle.
| Concept | Queue (RabbitMQ, SQS) | Stream (Kafka, Kinesis) |
|---|---|---|
| Message is | A command — do this task | An event — this fact happened |
| After processing | Deleted | Persisted (retention window) |
| Multiple consumers | Compete — each message to one | Independent — each reads at own offset |
| Replay | Not possible | Possible (rewind offset) |
| Ordering | Per-queue, limited | Per-partition, strict |
| Throughput | Thousands/sec | Millions/sec |
Rule of thumb: tasks demand queues; immutable facts demand streams.
Message Delivery Semantics
Three guarantees exist, each with trade-offs:
| Guarantee | Description | Risk | Where used |
|---|---|---|---|
| At-most-once | Fire-and-forget, no retry | Data loss possible | Metrics, logs where loss is acceptable |
| At-least-once | Retry until ACK | Duplicates possible | Default for queues/streams |
| Exactly-once | Idempotent + transactions | Highest latency/complexity | Financial, inventory systems |
Exactly-once in Kafka: requires enable.idempotence=true on producer +
transactional API + idempotent consumers (check processed event IDs).
# Producer-side idempotency (Kafka)
producer_config = {
"enable.idempotence": True,
"acks": "all",
"retries": 5,
"max.in.flight.requests.per.connection": 1,
}
Ordering Guarantees
Kafka: strict ordering within a partition, not across partitions. Partition key determines which partition a message lands in.
# Same document_id → same partition → strict order guaranteed
producer.produce(
topic="document-edits",
key=document_id, # partition key
value=event_payload,
)
SQS Standard: no ordering guarantee; occasional duplicates.
SQS FIFO: strict ordering, exactly-once, but max 3 000 TPS per queue.
RabbitMQ: per-queue order if single consumer; breaks with competing consumers.
Retries & Backoff
Never retry immediately — use exponential backoff with jitter.
import time
import random
import logging
logger = logging.getLogger(__name__)
MAX_RETRIES = 5
BASE_DELAY = 1.0 # seconds
def process_with_retry(message: dict, handler) -> bool:
for attempt in range(MAX_RETRIES):
try:
handler(message)
return True
except TransientError as exc:
delay = BASE_DELAY * (2 ** attempt) + random.uniform(0, 0.5)
logger.warning(
"attempt=%d failed, retrying in %.2fs: %s",
attempt + 1,
delay,
exc,
)
time.sleep(delay)
except PermanentError as exc:
logger.error("permanent failure, sending to DLQ: %s", exc)
send_to_dlq(message, exc)
return False
send_to_dlq(message, "max retries exceeded")
return False
Transient errors (network timeout, DB overload) → retry.
Permanent errors (schema mismatch, validation failure) → skip retries, send to DLQ.
Dead-Letter Queues (DLQ)
A DLQ is a dedicated topic/queue that captures messages that cannot be processed. It isolates failures from the main pipeline, preserving throughput.
What to store in DLQ
import json
import traceback
def send_to_dlq(original: dict, error: Exception, producer) -> None:
dlq_record = {
"original_payload": original,
"error_type": type(error).__name__,
"error_message": str(error),
"stack_trace": traceback.format_exc(),
"failed_at": "2026-03-30T12:00:00Z",
"source_topic": original.get("_topic"),
"source_partition": original.get("_partition"),
"source_offset": original.get("_offset"),
}
logger.error("dlq send | error=%s | payload=%s", error, original)
producer.produce("orders.dlq", value=json.dumps(dlq_record))
DLQ recovery patterns
| Pattern | When to use |
|---|---|
| Replay | Fix the bug, rewind offset, reprocess |
| Parking-lot | Move to retry topic, process later |
| Circuit breaker | Pause all retries during outage |
| Manual review | Schema / business-logic errors |
Tools Comparison
| Kafka | RabbitMQ | SQS | |
|---|---|---|---|
| Model | Stream (log) | Queue (AMQP) | Queue (HTTP) |
| Ordering | Per partition | Per queue | FIFO variant |
| Replay | Yes | No | No |
| DLQ | App-level topic | Built-in (x-dead-letter-exchange) |
Built-in (RedrivePolicy) |
| Routing | Topic + consumer groups | Exchange + bindings | Filter policies |
| Ops overhead | High | Medium | Zero (managed) |
| Best for | Event sourcing, audit log, multi-consumer | Task queues, priority routing | Simple decoupling, AWS ecosystem |
Decision Guide
Need replay / multiple independent consumers?
└─ YES → Kafka / Kinesis (stream)
└─ NO →
Need complex routing / priority?
└─ YES → RabbitMQ
└─ NO →
In AWS and want zero ops?
└─ YES → SQS
└─ NO → RabbitMQ or Redis Streams
Most mature systems combine both: Kafka as the central event bus, individual services fan out to SQS/RabbitMQ for their own task workers.