Failure Handling in CI/CD
Pipeline Failure Categories
| Category | Example | Response |
|---|---|---|
| Transient failure | Flaky test, network timeout | Retry |
| Real failure | Test caught a bug | Fix code, do not retry |
| Infrastructure failure | Runner out of disk | Re-run on different runner |
| Partial failure | One of 4 shards failed | Investigate before re-run |
The key question before retrying: is this a real failure? Retrying a real failure wastes time and, if a later attempt happens to pass, hides the bug.
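
One way to make that question mechanical is to match the failed step's log against known error signatures. The sketch below is illustrative only: `classify_failure` and both pattern lists are assumptions, not part of any tool, and would need tuning to the errors a given pipeline actually produces.

```bash
# Hypothetical helper: guess the failure category from log text.
# Both pattern lists are assumptions; extend them from your own logs.
classify_failure() {
  local log_text="$1"
  local infra='(No space left on device|runner has received a shutdown signal)'
  local transient='(Connection reset|Connection timed out|Temporary failure in name resolution|503 Service Unavailable)'

  if printf '%s\n' "$log_text" | grep -Eqi "$infra"; then
    echo "infrastructure"
  elif printf '%s\n' "$log_text" | grep -Eqi "$transient"; then
    echo "transient"
  else
    # Unknown failures default to "real": investigate instead of retrying.
    echo "real"
  fi
}

classify_failure "curl: (28) Connection timed out after 10001 ms"   # prints: transient
classify_failure "AssertionError: expected 200, got 500"            # prints: real
```

Defaulting to "real" is deliberate: an unrecognized failure is investigated rather than silently retried.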
Retry Strategies
Automatic Retry for Transient Failures
```yaml
- name: Deploy to staging
  uses: ./actions/deploy
  with:
    environment: staging
  # GitHub Actions has no built-in step retry. continue-on-error: false
  # (the default) only means a failed deploy fails the job.
  continue-on-error: false

# To retry on transient failures (network, timeout), use a retry action
- name: Deploy with retry
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10
    max_attempts: 3
    retry_wait_seconds: 30
    command: ./scripts/deploy.sh staging
```
Retry Only Known-Transient Steps
```yaml
- name: Wait for service to be ready
  run: |
    # Poll the health endpoint: 10 attempts, 30s apart
    for i in {1..10}; do
      if curl -sf "${{ vars.STAGING_URL }}/health"; then
        echo "Service is ready"
        exit 0
      fi
      echo "Attempt $i/10 failed, waiting 30s..."
      sleep 30
    done
    echo "Service failed to become ready"
    exit 1
```
Do not retry deterministic failures: lint failures, type errors, test failures. They fail the same way on every attempt. Do retry: network timeouts, service startup waits, DNS resolution failures.
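
The retry/no-retry split can also be enforced in a script by retrying only on exit codes known to be transient. This is a minimal sketch: `retry_transient` and `RETRYABLE_CODES` are hypothetical names, and the default list assumes the wrapped command is `curl` (exit 6 = could not resolve host, 7 = failed to connect, 28 = operation timed out).

```bash
# Exit codes treated as transient. These are curl's codes for DNS failure,
# connection failure and timeout; adjust for other commands.
RETRYABLE_CODES="6 7 28"

# retry_transient MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_transient() {
  local max_attempts="$1" delay="$2" attempt=1 code
  shift 2
  while :; do
    code=0
    "$@" || code=$?
    [ "$code" -eq 0 ] && return 0
    case " $RETRYABLE_CODES " in
      *" $code "*) ;;  # known-transient: fall through to the retry logic
      *) echo "exit code $code is not retryable, failing now" >&2
         return "$code" ;;
    esac
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still failing after $attempt attempts (exit code $code)" >&2
      return "$code"
    fi
    echo "attempt $attempt/$max_attempts failed (exit code $code), retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example (not executed here):
#   retry_transient 3 30 curl -sf "$STAGING_URL/health"
```

Any exit code outside the list fails on the first attempt, so a real failure is never masked by a retry.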
Pipeline Recovery
Continue-on-Error for Non-Critical Steps
```yaml
- name: Upload coverage report
  continue-on-error: true  # not worth failing the pipeline for this
  run: uv run codecov

- name: Send Slack notification
  continue-on-error: true  # notification failure should not block deploy
  run: ./scripts/notify.sh
```
Always-Run Cleanup
```yaml
jobs:
  test:
    steps:
      - name: Start test database
        run: docker compose up -d postgres

      - name: Run tests
        run: uv run pytest

      - name: Stop test database
        if: always()  # runs even if tests failed
        run: docker compose down
```
`if: always()` makes the cleanup step run whether the previous steps succeeded, failed, or the run was cancelled.
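
The same guarantee is available when the steps run inside a single script, using an `EXIT` trap. The sketch below is an illustration under assumptions: `with_cleanup` is a hypothetical helper, not a standard command.

```bash
# with_cleanup CLEANUP_COMMAND COMMAND [ARGS...]
# The function body is a subshell, so the EXIT trap fires when COMMAND
# finishes, whether it succeeded or failed, and the caller still sees
# COMMAND's exit status.
with_cleanup() (
  cleanup_cmd="$1"
  shift
  trap "$cleanup_cmd" EXIT
  "$@"
)

# Example (not executed here):
#   with_cleanup 'docker compose down' \
#     sh -c 'docker compose up -d postgres && uv run pytest'
```

Because the trap lives in a subshell, it does not leak into the calling script.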
Partial Failure Handling
Matrix Job Failure Control
```yaml
strategy:
  matrix:
    service: [users, orders, products]
  fail-fast: false  # continue other matrix jobs even if one fails
```
With `fail-fast: false`, a test failure in one service does not cancel the other matrix jobs.
All results are collected and every failure is surfaced, not just the first.
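
For checks that run sequentially in one script, the same "collect everything, then fail" behaviour might look like the sketch below. `run_all` is a hypothetical helper, and the test script named in the example is an assumption.

```bash
# run_all COMMAND ITEM...
# Runs COMMAND once per item without stopping at the first failure,
# then lists every failed item and returns non-zero if any failed.
run_all() {
  local cmd="$1" failed="" item
  shift
  for item in "$@"; do
    if ! "$cmd" "$item"; then
      failed="$failed $item"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed"
    return 1
  fi
  echo "all passed"
}

# Example (not executed here; ./scripts/test-service.sh is hypothetical):
#   run_all ./scripts/test-service.sh users orders products
```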
Allowed Failures
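```yaml
strategy:
  matrix:
    python-version: ["3.12", "3.13"]
    include:
      - python-version: "3.13"
        experimental: true
steps:
  - name: Run tests
    # matrix.experimental is unset for 3.12, so compare explicitly to get a boolean
    continue-on-error: ${{ matrix.experimental == true }}
    run: uv run pytest
```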
A failure on Python 3.13 does not block the pipeline, because that entry is marked experimental.
Failure Notification Strategy
```yaml
- name: Notify team on failure
  if: failure() && github.ref == 'refs/heads/main'
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":x: Pipeline failed on `main`",
        "attachments": [{
          "color": "danger",
          "fields": [
            {"title": "Run", "value": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}", "short": false},
            {"title": "Triggered by", "value": "${{ github.actor }}", "short": true},
            {"title": "Commit", "value": "${{ github.sha }}", "short": true}
          ]
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    # v1 needs this to treat the URL as an incoming webhook
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
```
Notify on:
- main branch failures (team must act)
- Deployment failures (on-call must act)

Do not notify on:
- PR branch failures (developer responsible, not team)
- Scheduled job success (noise)
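
These rules are small enough to encode as a single predicate, for example inside a notification script. The sketch below is an assumption-laden illustration: `should_notify` and its `KIND` labels (`ci`, `deploy`, `scheduled`) are hypothetical and must be supplied by the caller.

```bash
# should_notify BRANCH KIND STATUS
# Returns 0 when the team should be alerted, 1 otherwise.
should_notify() {
  local branch="$1" kind="$2" status="$3"
  # Successes are noise, including scheduled job success.
  [ "$status" = "failure" ] || return 1
  # Deployment failures always need on-call attention.
  [ "$kind" = "deploy" ] && return 0
  # Other failures concern the team only on main; PR branches are the
  # developer's responsibility.
  [ "$branch" = "main" ]
}

# Example (not executed here):
#   should_notify "$BRANCH" ci failure && ./scripts/notify.sh
```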
Incident Response After Pipeline Failure
```
Pipeline failure on main
│
├── Is it a flaky test?
│   └── Yes → re-run, open flakiness ticket
│
├── Is it a real test failure?
│   └── Yes → developer who merged fixes immediately
│
├── Is it a deploy failure?
│   └── Yes → rollback, investigate, fix
│
└── Is it infrastructure?
    └── Yes → retry on different runner, alert platform team
```
Main branch must be green. A broken main is a team emergency, not one person's problem.
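
The decision tree can double as a lookup in a runbook script or chat-ops bot. This is only a sketch: `next_action` and its category labels are hypothetical, and deciding which category applies is still a human judgment.

```bash
# next_action CATEGORY
# Prints the response prescribed for a failure on main.
next_action() {
  case "$1" in
    flaky)          echo "re-run and open a flakiness ticket" ;;
    test)           echo "developer who merged fixes immediately" ;;
    deploy)         echo "roll back, then investigate and fix" ;;
    infrastructure) echo "retry on a different runner and alert the platform team" ;;
    *)              echo "unknown category: $1" >&2
                    return 2 ;;
  esac
}

next_action deploy   # prints: roll back, then investigate and fix
```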