
Pipeline Observability

What to Observe

Running a pipeline without observability is flying blind. Two dimensions matter: pipeline health and deployment health.

Pipeline health  → build times, failure rates, flakiness
Deployment health → latency, error rate, resource usage after deploy

Pipeline Logs

Every step must produce structured, searchable logs.

# GitHub Actions — log grouping for readability
- name: Run tests
  run: |
    echo "::group::Test output"
    uv run pytest -v --tb=short
    echo "::endgroup::"

Key logging practices:

- Log the exact command run with its version
- Log start time and duration for each stage
- Surface failures prominently (not buried in thousands of success lines)
- Archive logs with artifacts for post-mortem analysis
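The practices above can be sketched as a small wrapper that a pipeline step calls instead of invoking a tool directly. This is an illustration only: `run_logged` and the JSON field names are hypothetical, not part of any pipeline described here.

```python
import json
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_logged(stage: str, cmd: list[str]) -> dict:
    """Run cmd and emit one structured JSON log line describing the stage."""
    started_at = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    record = {
        "stage": stage,
        "command": cmd,  # the exact command that was run
        "started_at": started_at,
        "duration_s": round(time.monotonic() - t0, 3),
        "exit_code": proc.returncode,
    }
    if proc.returncode != 0:
        # Surface the failure: keep only the tail of stderr so it is easy to find.
        record["stderr_tail"] = proc.stderr.splitlines()[-20:]
    # One JSON object per line keeps the log searchable with standard tools.
    print(json.dumps(record), file=sys.stderr)
    return record
```

Calling `run_logged("tool-version", [sys.executable, "--version"])` before the real stage is one way to record the tool version alongside the command.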


Pipeline Metrics

Track these per pipeline run and trend over time:

| Metric | What It Tells You |
|---|---|
| Pipeline duration (p50/p95) | Overall speed trend |
| Stage duration breakdown | Which stage is slowing down |
| Build success rate | Pipeline health |
| Test failure rate | Code quality trend |
| Flaky test rate | Test reliability |
| Deployment frequency | Engineering velocity |
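As a rough sketch of how several of these metrics could be derived from stored run records, the function below aggregates a list of runs. The record shape (`duration_s`, `status`, `tests`, `flaky`) is an assumption made for illustration; adapt it to whatever your pipeline actually records.

```python
from statistics import quantiles


def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate per-run records into trend metrics.

    Each run is assumed to look like:
    {"duration_s": 312.0, "status": "success", "tests": 400, "flaky": 2}
    where "flaky" counts tests that failed and then passed on retry.
    Needs at least two runs, because quantiles() requires two data points.
    """
    durations = sorted(r["duration_s"] for r in runs)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = quantiles(durations, n=100, method="inclusive")
    total_tests = sum(r["tests"] for r in runs)
    successes = sum(r["status"] == "success" for r in runs)
    return {
        "duration_p50_s": cuts[49],
        "duration_p95_s": cuts[94],
        "build_success_rate": successes / len(runs),
        "flaky_test_rate": sum(r["flaky"] for r in runs) / max(total_tests, 1),
    }
```

Running this over a rolling window (for example, the last 7 or 30 days of runs) gives the trend lines rather than a single snapshot.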

Collecting Metrics

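- name: Record pipeline metrics
  if: always()
  env:
    # Pass contexts through env vars: branch names are user-controlled, so
    # interpolating them directly into the script allows shell injection.
    DURATION: ${{ steps.test.outputs.duration }}
    STATUS: ${{ job.status }}
    BRANCH: ${{ github.ref_name }}
  run: |
    python scripts/record_metrics.py \
      --duration="$DURATION" \
      --status="$STATUS" \
      --branch="$BRANCH" \
      --run-id="${{ github.run_id }}"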

Push to Datadog, Prometheus Pushgateway, or a simple JSON file in S3.
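The step above calls `scripts/record_metrics.py`, which is not shown. A minimal sketch of what it might contain follows, assuming the simplest backend: one JSON line appended to a local file that a later step could upload. The function names, record fields, and output file are hypothetical.

```python
import argparse
import json
import time
from pathlib import Path


def build_record(argv: list[str]) -> dict:
    """Parse the flags passed by the workflow step into one metrics record."""
    parser = argparse.ArgumentParser(description="Record one pipeline run")
    # Kept as a string: the step output is empty if the test step never ran.
    parser.add_argument("--duration", default="")
    parser.add_argument("--status", required=True)
    parser.add_argument("--branch", required=True)
    parser.add_argument("--run-id", required=True)
    args = parser.parse_args(argv)
    return {
        "run_id": args.run_id,
        "branch": args.branch,
        "status": args.status,
        "duration_s": float(args.duration) if args.duration else None,
        "recorded_at": int(time.time()),
    }


def append_record(record: dict, path: Path) -> None:
    """Append one JSON line; a real script would push or upload instead."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

A real script would end with something like `append_record(build_record(sys.argv[1:]), Path("pipeline_metrics.jsonl"))`, swapping the file write for a Datadog, Pushgateway, or S3 client call.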


Deployment Health Monitoring

After every deployment, watch these for at least 10 minutes:

Deploy v1.4.2 to production
    ↓
Monitor for 10 min:
  - Error rate (should stay < 1%)
  - p95 latency (should stay within SLO)
  - Memory / CPU (should be stable, not climbing)
    ↓
Pass → deployment complete
Fail → automatic rollback trigger
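The pass/fail decision in this flow can be sketched as a pure function over the samples collected during the monitoring window. The 1% error-rate limit comes from the flow above and the 300 ms default stands in for the SLO; the 10% growth tolerance used to decide whether memory is "climbing" is an arbitrary assumption, as are the function names.

```python
from statistics import mean


def is_climbing(samples: list[float], tolerance: float = 0.10) -> bool:
    """True if the last third of samples averages more than `tolerance`
    above the first third: a crude signal of steady growth, not noise."""
    third = max(len(samples) // 3, 1)
    baseline = mean(samples[:third])
    recent = mean(samples[-third:])
    return recent > baseline * (1 + tolerance)


def evaluate_window(
    error_rates: list[float],
    p95_ms: list[float],
    memory_mb: list[float],
    max_error_rate: float = 0.01,
    slo_p95_ms: float = 300.0,
) -> list[str]:
    """Return the names of breached checks; an empty list means pass."""
    breaches = []
    if max(error_rates) >= max_error_rate:  # error rate should stay < 1%
        breaches.append("error_rate")
    if max(p95_ms) > slo_p95_ms:  # p95 latency should stay within the SLO
        breaches.append("p95_latency")
    if is_climbing(memory_mb):  # memory should be stable, not climbing
        breaches.append("memory")
    return breaches
```

Returning the list of breached checks rather than a bare boolean lets the rollback trigger report why the deployment was rejected.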

Automated Rollback on Health Breach

- name: Verify deployment health
  run: |
    python scripts/check_deployment_health.py \
      --service api \
      --duration 600 \
      --max-error-rate 0.01 \
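      --max-p95-ms 300

# scripts/check_deployment_health.py
import logging
import time

import requests

log = logging.getLogger(__name__)


def query_prometheus(prometheus_url: str, query: str) -> float:
    """Run an instant query; return the first sample's value, or 0.0 if empty."""
    resp = requests.get(
        f"{prometheus_url}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def check_health(
    prometheus_url: str,
    service: str,
    duration_seconds: int,
    max_error_rate: float,
) -> bool:
    # Compare a ratio (errors / all requests) to the threshold, not the raw
    # errors-per-second that rate() returns on its own. http_requests_total
    # is an assumed metric name; substitute whatever your service exports.
    query = (
        f'sum(rate(http_errors_total{{service="{service}"}}[1m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[1m]))'
    )
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        error_rate = query_prometheus(prometheus_url, query)
        if error_rate > max_error_rate:
            log.error("Error rate %.2f%% exceeds threshold", error_rate * 100)
            return False
        time.sleep(30)  # poll every 30 seconds until the window closes
    return True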
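The workflow step passes `--service`, `--duration`, `--max-error-rate`, and `--max-p95-ms`, but `check_health` as shown has no command-line entry point. A possible sketch of one follows. The `--prometheus-url` flag and its default are hypothetical, and the health check is injected as a callable so the snippet runs on its own. Note that `--max-p95-ms` is parsed but unused here, since `check_health` only examines the error rate.

```python
import argparse
from typing import Callable, Sequence


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Post-deploy health gate")
    parser.add_argument("--service", required=True)
    parser.add_argument("--duration", type=int, default=600)
    parser.add_argument("--max-error-rate", type=float, default=0.01)
    parser.add_argument("--max-p95-ms", type=float, default=300.0)
    # Hypothetical flag: the workflow step does not say where Prometheus lives.
    parser.add_argument("--prometheus-url", default="http://prometheus:9090")
    return parser.parse_args(argv)


def main(argv: Sequence[str], check: Callable[..., bool]) -> int:
    """Return a process exit code: 0 if healthy, 1 on a health breach."""
    args = parse_args(argv)
    healthy = check(
        args.prometheus_url, args.service, args.duration, args.max_error_rate
    )
    return 0 if healthy else 1
```

In the real script this would finish with `sys.exit(main(sys.argv[1:], check_health))`. The non-zero exit code fails the step, which a following step guarded by `if: failure()` can use to trigger the rollback.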

Deployment Frequency & DORA Metrics

DORA metrics measure engineering delivery performance:

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple/day | Daily | Weekly | Monthly |
| Lead time for changes | < 1 hour | 1 day | 1 week | 1 month |
| Change failure rate | < 5% | < 10% | < 15% | > 15% |
| Recovery time | < 1 hour | < 1 day | < 1 week | > 1 week |

Track these automatically from pipeline metadata:

def calculate_dora_metrics(deployments: list[dict], window_days: int = 30) -> dict:
    """Summarize the deployments made during the last window_days days."""
    deployment_count = len(deployments)
    failed = [d for d in deployments if d["status"] == "failed"]
    return {
        "deployment_frequency_per_day": deployment_count / window_days,
        # max(..., 1) avoids division by zero when there were no deployments
        "change_failure_rate": len(failed) / max(deployment_count, 1),
        "mean_lead_time_hours": sum(
            d["lead_time_seconds"] for d in deployments
        ) / max(deployment_count, 1) / 3600,
    }
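The function above covers three of the four metrics in the table; recovery time is missing. One way it could be computed is sketched below, assuming each incident record carries ISO 8601 timestamps under the hypothetical keys `failed_at` and `recovered_at`. The tier boundaries in `recovery_tier` are taken from the Recovery time row of the table.

```python
from datetime import datetime


def mean_recovery_hours(incidents: list[dict]) -> float:
    """Mean time from a failed deploy to restored service, in hours."""
    if not incidents:
        return 0.0
    total_seconds = sum(
        (
            datetime.fromisoformat(i["recovered_at"])
            - datetime.fromisoformat(i["failed_at"])
        ).total_seconds()
        for i in incidents
    )
    return total_seconds / len(incidents) / 3600


def recovery_tier(hours: float) -> str:
    """Map a recovery time to the tiers in the table above."""
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours < 24 * 7:
        return "Medium"
    return "Low"
```

A median may be a better summary than a mean when one long outage would otherwise dominate the figure.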

Notification Strategy

Not every event needs a notification. Noise kills attention.

| Event | Channel | Who |
|---|---|---|
| Build failure on main | Slack #dev-alerts | Team |
| Deployment to production | Slack #deployments | Team |
| Security vulnerability found | Slack #security | Security + leads |
| Nightly regression failure | Slack #qa-alerts | QA |
| SLO breach after deploy | PagerDuty | On-call |

- name: Notify on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":x: Build failed on `${{ github.ref_name }}`",
        "attachments": [{
          "text": "Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
        }]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
    # v1 of the action needs this to treat the URL as a Slack incoming webhook
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
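If notifications are sent from a script rather than from individual workflow steps, the routing table above can live in one place as data. The sketch below is illustrative: the event keys and the `Route` type are hypothetical names. Events that are not listed return `None`, so the default is silence rather than noise.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    channel: str
    audience: str


# Mirrors the routing table above; the event keys are hypothetical identifiers.
ROUTES = {
    "build_failure_main": Route("slack:#dev-alerts", "team"),
    "deploy_production": Route("slack:#deployments", "team"),
    "security_vulnerability": Route("slack:#security", "security + leads"),
    "nightly_regression_failure": Route("slack:#qa-alerts", "qa"),
    "slo_breach_after_deploy": Route("pagerduty", "on-call"),
}


def route(event: str) -> Optional[Route]:
    """Return where to send an event, or None to send nothing."""
    return ROUTES.get(event)
```

Adding a new notification then requires an explicit entry in `ROUTES`, which makes each new source of alerts a reviewed decision.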