
Security & Observability

RBAC (Role-Based Access Control)

RBAC controls who can do what in the cluster. Two scopes: namespace-level (Role) and cluster-level (ClusterRole).

Role + RoleBinding (Namespace-scoped)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Use ClusterRole + ClusterRoleBinding for cluster-wide access (e.g., namespace management, node monitoring).
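The cluster-scoped pair mirrors the namespaced one, minus `metadata.namespace`; a minimal sketch (the `node-monitor` name and resource list are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-monitor          # cluster-scoped: no namespace field
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitor-nodes
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: dev            # the subject SA still lives in a namespace
roleRef:
  kind: ClusterRole
  name: node-monitor
  apiGroup: rbac.authorization.k8s.io
```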

RBAC Best Practices

Practice Why
Least privilege Grant minimum permissions needed
Namespace-scoped first Use Role before ClusterRole
ServiceAccounts per workload Not the default SA
Audit regularly kubectl auth can-i --list --as=system:serviceaccount:dev:ci-bot
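The "ServiceAccounts per workload" row can be sketched as a dedicated SA referenced from the pod spec, so RBAC grants attach to that workload instead of the namespace's `default` SA (names and image are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-frontend          # one SA per workload, never "default"
  namespace: dev
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: dev
spec:
  selector:
    matchLabels: { app: web-frontend }
  template:
    metadata:
      labels: { app: web-frontend }
    spec:
      serviceAccountName: web-frontend   # bind Roles to this SA
      containers:
        - name: app
          image: nginx:1.27
```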

Pod Security

SecurityContext

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]

Pod Security Standards (PSS)

Pod Security Standards replace the deprecated PodSecurityPolicy API (removed in Kubernetes v1.25). Three levels, enforced per namespace via labels:

Level Description
Privileged No restrictions
Baseline Prevents known privilege escalations
Restricted Hardened, best practice (recommended for production)

kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted
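Before tightening `enforce`, you can preview which existing pods would violate the level: a server-side dry run of the label change prints a warning per offending pod (assumes the Pod Security admission controller is active, which it is by default on v1.25+):

```shell
# Preview violations without changing anything
kubectl label --dry-run=server --overwrite namespace production \
  pod-security.kubernetes.io/enforce=restricted
```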

Resource Quotas and Limits

Namespace Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"

LimitRange (Per-Pod Defaults)

Use LimitRange to set default requests/limits for containers that don't specify them. Prevents runaway pods.
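A minimal sketch of such a LimitRange (the values are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```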


Observability Stack

Metrics: Prometheus + Grafana

# Install via Helm (kube-prometheus-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

Provides: Prometheus (metrics collection), Grafana (dashboards), Alertmanager (alerts), Node Exporter, kube-state-metrics.

Component Purpose
Prometheus Scrapes metrics from pods/nodes via /metrics endpoints
Grafana Visualization dashboards, alerting
Alertmanager Routes alerts to Slack, PagerDuty, email
kube-state-metrics Cluster state → Prometheus metrics
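With kube-prometheus-stack installed, Prometheus discovers scrape targets through ServiceMonitor custom resources. A hedged sketch (the app name, port name, and label selector are assumptions about your Service; depending on chart values, Prometheus may also require a `release` label to pick the ServiceMonitor up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app              # must match the target Service's labels
  namespaceSelector:
    matchNames: ["dev"]        # where the target Service lives
  endpoints:
    - port: metrics            # named Service port exposing /metrics
      interval: 30s
```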

Logging: EFK Stack

# Fluent Bit (DaemonSet — log collector)
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit -n logging --create-namespace

# Elasticsearch + Kibana
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch -n logging
helm install kibana elastic/kibana -n logging

Component Role
Fluent Bit Lightweight log collector (DaemonSet on every node)
Elasticsearch Log storage and full-text search
Kibana Log visualization and debugging UI
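Fluent Bit ships logs to Elasticsearch through an `[OUTPUT]` section in its config; a minimal sketch, assuming the `elasticsearch-master` in-cluster service name created by the Elastic chart above:

```ini
[OUTPUT]
    Name            es
    Match           kube.*                  # tag set by the kubernetes filter
    Host            elasticsearch-master    # assumed chart-created service
    Port            9200
    Logstash_Format On                      # daily logstash-YYYY.MM.DD indices
```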

Health Checks

kubectl top requires metrics-server installed in the cluster.

kubectl top nodes                          # node CPU/memory
kubectl top pods -A --sort-by=memory       # pod resource usage
kubectl get events --sort-by='.lastTimestamp' -A  # cluster events

Debugging Checklist

Symptom First Commands
Pod Pending kubectl describe pod <p> → check events, resources, taints
Pod CrashLoopBackOff kubectl logs <p> --previous → check exit code
Pod ImagePullBackOff Check image name, registry auth, imagePullSecrets
Service no endpoints kubectl get endpoints <svc> → labels mismatch?
High latency kubectl top pods, check resource limits, HPA status
OOMKilled Increase limits.memory or optimize app memory usage

Debug Commands

kubectl describe pod <name>               # events, conditions, mounts
kubectl logs <pod> -f --tail=200          # follow logs
kubectl logs <pod> -c <container>         # multi-container pod
kubectl logs <pod> --previous             # previous crash
kubectl exec -it <pod> -- /bin/sh         # shell access
kubectl debug -it <pod> --image=busybox   # ephemeral debug container
kubectl debug node/<node> -it --image=busybox  # node-level debug
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
