Pipeline Observability
What to Observe
Running a pipeline without observability is flying blind. Two dimensions matter: pipeline health and deployment health.
Pipeline health → build times, failure rates, flakiness
Deployment health → latency, error rate, resource usage after deploy
Pipeline Logs
Every step must produce structured, searchable logs.
# GitHub Actions — log grouping for readability
- name: Run tests
run: |
echo "::group::Test output"
uv run pytest -v --tb=short
echo "::endgroup::"
Key logging practices: - Log the exact command run with its version - Log start time and duration for each stage - Surface failures prominently (not buried in thousands of success lines) - Archive logs with artifacts for post-mortem analysis
Pipeline Metrics
Track these per pipeline run and trend over time:
| Metric | What It Tells You |
|---|---|
| Pipeline duration (p50/p95) | Overall speed trend |
| Stage duration breakdown | Which stage is slowing down |
| Build success rate | Pipeline health |
| Test failure rate | Code quality trend |
| Flaky test rate | Test reliability |
| Deployment frequency | Engineering velocity |
Collecting Metrics
- name: Record pipeline metrics
if: always()
run: |
python scripts/record_metrics.py \
--duration=${{ steps.test.outputs.duration }} \
--status=${{ job.status }} \
--branch=${{ github.ref_name }} \
--run-id=${{ github.run_id }}
Push to Datadog, Prometheus Pushgateway, or a simple JSON file in S3.
Deployment Health Monitoring
After every deployment, watch these for at least 10 minutes:
Deploy v1.4.2 to production
↓
Monitor for 10 min:
- Error rate (should stay < 1%)
- p95 latency (should stay within SLO)
- Memory / CPU (should be stable, not climbing)
↓
Pass → deployment complete
Fail → automatic rollback trigger
Automated Rollback on Health Breach
- name: Verify deployment health
run: |
python scripts/check_deployment_health.py \
--service api \
--duration 600 \
--max-error-rate 0.01 \
--max-p95-ms 300
# scripts/check_deployment_health.py
import time
import sys
import requests
import logging
log = logging.getLogger(__name__)
def check_health(
prometheus_url: str,
service: str,
duration_seconds: int,
max_error_rate: float,
) -> bool:
deadline = time.monotonic() + duration_seconds
while time.monotonic() < deadline:
error_rate = query_prometheus(
prometheus_url,
f'rate(http_errors_total{{service="{service}"}}[1m])',
)
if error_rate > max_error_rate:
log.error("Error rate %.2f%% exceeds threshold", error_rate * 100)
return False
time.sleep(30)
return True
Deployment Frequency & DORA Metrics
DORA metrics measure engineering delivery performance:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple/day | Daily | Weekly | Monthly |
| Lead time for changes | < 1 hour | 1 day | 1 week | 1 month |
| Change failure rate | < 5% | < 10% | < 15% | > 15% |
| Recovery time | < 1 hour | < 1 day | < 1 week | > 1 week |
Track these automatically from pipeline metadata:
def calculate_dora_metrics(deployments: list[dict]) -> dict:
deployment_count = len(deployments)
failed = [d for d in deployments if d["status"] == "failed"]
return {
"deployment_frequency_per_day": deployment_count / 30,
"change_failure_rate": len(failed) / max(deployment_count, 1),
"mean_lead_time_hours": sum(
d["lead_time_seconds"] for d in deployments
) / max(deployment_count, 1) / 3600,
}
Notification Strategy
Not every event needs a notification. Noise kills attention.
| Event | Channel | Who |
|---|---|---|
| Build failure on main | Slack #dev-alerts | Team |
| Deployment to production | Slack #deployments | Team |
| Security vulnerability found | Slack #security | Security + leads |
| Nightly regression failure | Slack #qa-alerts | QA |
| SLO breach after deploy | PagerDuty | On-call |
- name: Notify on failure
if: failure()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": ":x: Build failed on `${{ github.ref_name }}`",
"attachments": [{
"text": "Run: ${{ github.run_url }}"
}]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}