gRPC: Retry and Hedging Policy

This document defines a policy for retry and hedging in production gRPC systems.

Core rule

Enable retry and hedging only for idempotent operations. For non-idempotent operations, add application-level idempotency keys, so the server can deduplicate repeated attempts, before enabling retries.

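  • Retry only transient failures: UNAVAILABLE and, optionally, RESOURCE_EXHAUSTED.
  • Use exponential backoff with jitter.
  • Cap maxAttempts to prevent retry storms (gRPC treats values above 5 as 5).
  • Enforce per-method deadlines; the timeout covers the whole call, including every retry attempt and the backoff between attempts.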
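
The backoff rule above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and "full jitter" (a uniform draw between zero and the exponential ceiling) is an assumption, since gRPC clients apply their own jitter internally when a retryPolicy is configured.

```python
import random


def backoff_ceiling(retry_index, initial_backoff, max_backoff, multiplier):
    """Upper bound, in seconds, for the wait before retry number retry_index (0-based)."""
    return min(initial_backoff * multiplier ** retry_index, max_backoff)


def backoff_delays(max_attempts, initial_backoff, max_backoff, multiplier,
                   rng=random.random):
    """Return the jittered wait before each retry.

    max_attempts counts the first try, so there are max_attempts - 1 waits.
    rng is injectable so the jitter can be fixed in tests.
    """
    return [
        rng() * backoff_ceiling(i, initial_backoff, max_backoff, multiplier)
        for i in range(max_attempts - 1)
    ]
```

With 4 attempts, a 0.1 s initial backoff, a 1 s maximum, and a multiplier of 2, the ceilings are 0.1 s, 0.2 s, and 0.4 s, so the waits alone can consume up to 0.7 s of a call's deadline.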

Example service config:

{
  "methodConfig": [
    {
      "name": [{"service": "user.v1.UserService", "method": "GetUser"}],
      "timeout": "1s",
      "retryPolicy": {
        "maxAttempts": 4,
        "initialBackoff": "0.1s",
        "maxBackoff": "1s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": ["UNAVAILABLE"]
      }
    }
  ]
}
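
One way to apply a config like this from a Python client is through channel options. The sketch below assumes the grpcio package and its "grpc.service_config" and "grpc.enable_retries" channel arguments; the helper names and the insecure channel are illustrative choices, not part of the policy.

```python
import json

# The same service config as above, expressed as a Python dict.
SERVICE_CONFIG = {
    "methodConfig": [
        {
            "name": [{"service": "user.v1.UserService", "method": "GetUser"}],
            "timeout": "1s",
            "retryPolicy": {
                "maxAttempts": 4,
                "initialBackoff": "0.1s",
                "maxBackoff": "1s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }
    ]
}


def channel_options(service_config):
    """Build the channel arguments that carry the service config as JSON."""
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
    ]


def open_channel(target, service_config):
    """Open a channel with the config applied (hypothetical helper)."""
    import grpc  # imported lazily so the helpers above work without grpcio

    return grpc.insecure_channel(target, options=channel_options(service_config))
```

Keeping the config as a dict makes it easy to review and unit-test before it reaches a channel.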

Hedging policy (tail latency control)

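Hedging sends additional copies of a request without waiting for the first attempt to fail and uses the first successful response; the remaining attempts are cancelled. Use it only for read calls where duplicate execution is safe. A method can have a retryPolicy or a hedgingPolicy, not both.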

  • Keep maxAttempts low (2 or 3).
  • Add a hedgingDelay (for example 30 ms, written "0.03s" in the service config) to avoid immediate fan-out.
  • Track the backend amplification factor (attempts sent per client call) caused by hedged requests.
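
A hedging entry for the service config might be generated as follows. This is a minimal sketch: the helper names are hypothetical, the 2-to-3 attempt check encodes the guidance above rather than a gRPC limit, and treating UNAVAILABLE as a non-fatal status is an assumption to adjust per service.

```python
def hedging_method_config(service, method, max_attempts=2, hedging_delay_s=0.03):
    """Build one methodConfig entry that uses hedgingPolicy instead of retryPolicy."""
    if not 2 <= max_attempts <= 3:
        raise ValueError("policy: keep hedging maxAttempts at 2 or 3")
    return {
        "name": [{"service": service, "method": method}],
        "hedgingPolicy": {
            "maxAttempts": max_attempts,
            # Service config durations are decimal seconds with an "s" suffix.
            "hedgingDelay": f"{hedging_delay_s:g}s",
            # Assumption: only UNAVAILABLE lets the remaining hedges continue.
            "nonFatalStatusCodes": ["UNAVAILABLE"],
        },
    }


def amplification_factor(attempts_sent, client_calls):
    """Backend attempts per client call; 1.0 means hedging added no extra load."""
    return attempts_sent / client_calls if client_calls else 0.0
```

For example, 130 attempts sent for 100 client calls gives an amplification factor of 1.3, meaning hedging added 30% more backend load.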

Safety controls

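  • Enable retry throttling (retryThrottling in the service config) to avoid overload loops.
  • Add a circuit breaker in front of dependency calls.
  • Disable retries for validation errors and business conflicts (for example INVALID_ARGUMENT, ALREADY_EXISTS, FAILED_PRECONDITION).
  • Propagate the retry attempt number in metadata for observability; gRPC clients send grpc-previous-rpc-attempts on retried attempts.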
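
Retry throttling can be pictured as a token bucket. The sketch below roughly follows the behaviour described in the gRPC retry design for retryThrottling (maxTokens and tokenRatio): each failure costs one token, each success returns tokenRatio tokens, and retries are allowed only while more than half the tokens remain. The class and method names are hypothetical, and real clients implement this internally.

```python
class RetryThrottle:
    """Token bucket that suspends retries when too many calls are failing."""

    def __init__(self, max_tokens, token_ratio):
        self.max_tokens = float(max_tokens)
        self.token_ratio = float(token_ratio)
        self.tokens = float(max_tokens)  # start with a full bucket

    def record_failure(self):
        # A failed call drains one token, never going below zero.
        self.tokens = max(0.0, self.tokens - 1.0)

    def record_success(self):
        # A successful call refills a fraction of a token, capped at max_tokens.
        self.tokens = min(self.max_tokens, self.tokens + self.token_ratio)

    def retries_allowed(self):
        # Retries stop once the bucket is at or below half full.
        return self.tokens > self.max_tokens / 2
```

With max_tokens=10 and token_ratio=0.1, five consecutive failures leave the bucket exactly half full and suspend retries. Because refills are ten times smaller than drains, retries stay enabled only when successes heavily outnumber failures.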

Observability checklist

  • Metrics: retries per method, hedge attempts, success-after-retry ratio.
  • Logs: final status, attempt count, and deadline.
  • Traces: one parent span per call, with a child span per attempt.
  • Alerts: a high retry ratio together with rising latency usually indicates dependency degradation.
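
The metrics in this checklist could be derived from per-call records as sketched below. The CallRecord shape and the exact definitions are assumptions: success-after-retry is taken here as the share of retried calls that finished OK, and retry ratio as extra attempts per client call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    method: str
    attempts: int  # total attempts, including the first
    ok: bool       # True if the final status was OK


def retries_per_method(records):
    """Count extra attempts (beyond the first) for each method."""
    totals = defaultdict(int)
    for r in records:
        totals[r.method] += r.attempts - 1
    return dict(totals)


def success_after_retry_ratio(records):
    """Share of retried calls that ended successfully."""
    retried = [r for r in records if r.attempts > 1]
    if not retried:
        return 0.0
    return sum(1 for r in retried if r.ok) / len(retried)


def retry_ratio(records):
    """Extra attempts per client call; an input for retry-ratio alerts."""
    if not records:
        return 0.0
    return sum(r.attempts - 1 for r in records) / len(records)
```

An alert rule would compare retry_ratio against a threshold chosen per service and combine it with a latency signal, as the checklist suggests.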