gRPC: Retry and Hedging Policy
This document defines a policy for retries and hedging in production gRPC systems.
Core rule
Retry and hedging must be enabled only for idempotent operations. For non-idempotent operations, use application idempotency keys before enabling retries.
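As a rough illustration of the idempotency-key idea, the server can remember the result of the first execution for each key and return it for any retried attempt. This is a minimal in-memory sketch: the IdempotencyStore class, the charge_card function, and the "order-123" key are hypothetical names, and a production store would need persistence and expiry.

```python
import threading
from typing import Any, Callable, Dict


class IdempotencyStore:
    """In-memory map from idempotency key to the first result computed for it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, operation: Callable[[], Any]) -> Any:
        # Holding one lock during the operation keeps the sketch simple but
        # serializes all calls; a real store would lock per key.
        with self._lock:
            if key not in self._results:
                self._results[key] = operation()
            return self._results[key]


charges = []


def charge_card() -> str:
    # Hypothetical non-idempotent side effect.
    charges.append("charge")
    return f"receipt-{len(charges)}"


store = IdempotencyStore()
first = store.run_once("order-123", charge_card)
retry = store.run_once("order-123", charge_card)  # retried attempt, same key
```

Here the retried attempt returns the stored receipt and the card is charged once.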
Retry policy (recommended baseline)
- Retry only transient failures: UNAVAILABLE, optionally RESOURCE_EXHAUSTED.
- Use exponential backoff with jitter.
- Set max attempts to prevent retry storms.
- Enforce per-method deadlines.
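The backoff rule above can be sketched as follows. This assumes "full jitter", where the delay is drawn uniformly between zero and an exponentially growing ceiling; gRPC implementations may apply jitter differently, so treat the function names and the jitter style as illustrative rather than as the library's exact behavior. The defaults mirror the baseline values used in this document.

```python
import random


def backoff_ceiling(attempt, initial=0.1, multiplier=2.0, maximum=1.0):
    """Upper bound in seconds for the delay before retry number `attempt` (1-based)."""
    return min(initial * multiplier ** (attempt - 1), maximum)


def backoff_delay(attempt, rng=random, **kwargs):
    """Full jitter: pick uniformly between 0 and the exponential ceiling."""
    return rng.uniform(0.0, backoff_ceiling(attempt, **kwargs))


# The ceiling doubles on each retry until it is capped by `maximum`.
ceilings = [backoff_ceiling(n) for n in (1, 2, 3, 4, 5)]
delays = [backoff_delay(n) for n in (1, 2, 3)]
```

Jitter spreads retries from many clients over time, so a brief outage does not produce synchronized retry waves.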
Example service config:
{
  "methodConfig": [
    {
      "name": [{"service": "user.v1.UserService", "method": "GetUser"}],
      "timeout": "1s",
      "retryPolicy": {
        "maxAttempts": 4,
        "initialBackoff": "0.1s",
        "maxBackoff": "1s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": ["UNAVAILABLE"]
      }
    }
  ]
}
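One way to apply a config like this from a Python client is to pass it as a JSON string in the channel arguments. The sketch below assumes the grpcio package and its grpc.service_config and grpc.enable_retries channel arguments; the channel_options and make_channel helpers are hypothetical names, and the grpc import is deferred so the config-building part runs without grpcio installed.

```python
import json

SERVICE_CONFIG = {
    "methodConfig": [
        {
            "name": [{"service": "user.v1.UserService", "method": "GetUser"}],
            "timeout": "1s",
            "retryPolicy": {
                "maxAttempts": 4,
                "initialBackoff": "0.1s",
                "maxBackoff": "1s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }
    ]
}


def channel_options(config):
    """Channel arguments that carry the service config as a JSON string."""
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(config)),
    ]


def make_channel(target):
    """Create a channel with the retry policy applied (requires grpcio)."""
    import grpc  # imported lazily so the rest of the sketch runs without grpcio

    return grpc.insecure_channel(target, options=channel_options(SERVICE_CONFIG))


options = dict(channel_options(SERVICE_CONFIG))
```

A caller would then use something like make_channel("localhost:50051"), where the address is a placeholder.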
Hedging policy (tail latency control)
Hedging sends several copies of a request, staggered by a short delay, and uses the first successful response; the remaining attempts are cancelled. Use it only for read calls where duplicate execution is safe.
- Keep maxAttempts low (2 or 3).
- Add hedgingDelay (for example 30ms) to avoid immediate fan-out.
- Track backend amplification factor from hedged requests.
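To make the mechanism concrete, here is a small asyncio simulation of hedging. It is not gRPC's implementation: hedged_call and the simulated backend are hypothetical, and the parameter names only mirror maxAttempts and hedgingDelay. The function also reports how many attempts were sent, which is the per-call amplification factor.

```python
import asyncio


async def hedged_call(make_attempt, max_attempts=2, hedging_delay=0.03):
    """Start up to max_attempts staggered attempts; return (result, attempts_sent)."""
    tasks = set()
    sent = 0
    try:
        while True:
            if sent < max_attempts:
                tasks.add(asyncio.create_task(make_attempt(sent)))
                sent += 1
            # Wait one hedging delay if another attempt may still be sent,
            # otherwise wait until an in-flight attempt finishes.
            timeout = hedging_delay if sent < max_attempts else None
            done, _ = await asyncio.wait(
                tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                tasks.discard(task)
                if task.exception() is None:
                    return task.result(), sent
            if not tasks and sent >= max_attempts:
                raise RuntimeError("all hedged attempts failed")
    finally:
        # Cancel attempts that lost the race so they stop consuming backend work.
        for task in tasks:
            task.cancel()


async def demo():
    async def attempt(index):
        # Simulated backend: the first attempt is slow, the hedge is fast.
        await asyncio.sleep(0.2 if index == 0 else 0.01)
        return f"reply from attempt {index}"

    return await hedged_call(attempt, max_attempts=2, hedging_delay=0.03)


result, attempts_sent = asyncio.run(demo())
```

In this run the hedge wins and two attempts reach the backend for one logical call, so the amplification factor is 2. Averaging attempts_sent over real traffic gives the figure worth tracking.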
Safety controls
- Enable retry throttling to avoid overload loops.
- Add circuit breaker before dependency calls.
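- Disable retry for validation errors and business conflicts (for example INVALID_ARGUMENT, FAILED_PRECONDITION, ALREADY_EXISTS); they fail the same way on every attempt.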
- Propagate retry attempt number in metadata for observability (gRPC's built-in retries send it as grpc-previous-rpc-attempts).
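Retry throttling can be pictured as a token bucket: failures drain tokens, successes slowly refill them, and retries are allowed only while the bucket is more than half full. The sketch below follows that scheme with parameters named after the maxTokens and tokenRatio fields of gRPC's retryThrottling config; the RetryThrottle class itself is a hypothetical illustration, not the library's code.

```python
class RetryThrottle:
    """Token bucket that pauses retries when too many calls are failing."""

    def __init__(self, max_tokens=10.0, token_ratio=0.1):
        self.max_tokens = max_tokens
        self.token_ratio = token_ratio
        self.tokens = max_tokens  # start full

    def record_success(self):
        # Each success refills a fraction of a token, up to the maximum.
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def record_failure(self):
        # Each failure costs a whole token, never dropping below zero.
        self.tokens = max(0.0, self.tokens - 1.0)

    def retries_allowed(self):
        # Retries are permitted only while more than half the tokens remain.
        return self.tokens > self.max_tokens / 2


throttle = RetryThrottle()
for _ in range(4):
    throttle.record_failure()
allowed_after_four = throttle.retries_allowed()
throttle.record_failure()
allowed_after_five = throttle.retries_allowed()
throttle.record_success()
allowed_after_recovery = throttle.retries_allowed()
```

Because a failure costs far more than a success restores, a struggling dependency quickly stops receiving retries and recovers that budget only gradually.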
Observability checklist
- Metrics: retries per method, hedge attempts, success-after-retry ratio.
- Logs: final status + attempt count + deadline.
- Traces: one parent span, child span per attempt.
- Alerts: high retry ratio + rising latency = dependency degradation.
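The metrics and alert condition in this checklist could be computed along these lines. This is a sketch only: MethodStats, record_call, should_alert, and the threshold values are hypothetical, and a real system would export these counters to its metrics backend instead of keeping them in process memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MethodStats:
    calls: int = 0            # logical RPCs
    retries: int = 0          # attempts beyond the first
    retried_calls: int = 0    # calls that needed at least one retry
    recovered_calls: int = 0  # retried calls that finally succeeded

    def retry_ratio(self):
        return self.retries / self.calls if self.calls else 0.0

    def success_after_retry_ratio(self):
        if not self.retried_calls:
            return 0.0
        return self.recovered_calls / self.retried_calls


stats = defaultdict(MethodStats)


def record_call(method, attempts, ok):
    """Record one finished call with its total attempt count and final outcome."""
    s = stats[method]
    s.calls += 1
    s.retries += attempts - 1
    if attempts > 1:
        s.retried_calls += 1
        if ok:
            s.recovered_calls += 1


def should_alert(s, p99_latency_s, max_retry_ratio=0.2, max_p99_s=0.5):
    """High retry ratio together with high latency suggests dependency degradation."""
    return s.retry_ratio() > max_retry_ratio and p99_latency_s > max_p99_s


for attempts, ok in [(1, True), (3, True), (4, False), (1, True)]:
    record_call("GetUser", attempts, ok)

get_user = stats["GetUser"]
alert = should_alert(get_user, p99_latency_s=0.8)
```

Requiring both signals keeps the alert quiet during a short burst of retries that does not affect latency.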