Monitoring and Alerting for Microservices

  • Metrics: Quantitative signals (latency, error rate, saturation) for automated alerting.
  • Logs: High-fidelity event data for forensics; centralise with Elasticsearch, Loki, or Splunk.
  • Traces: Follow a request across services to locate bottlenecks (OpenTelemetry, Jaeger, Zipkin).

Metrics Stack (Prometheus + Grafana)

  • Expose /metrics endpoints using client libraries.
  • Collect via Prometheus (scrape configs, Kubernetes service discovery); a scrape-config sketch follows this list.
  • Visualise in Grafana; codify dashboards in Git to track changes.
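
A minimal scrape-configuration sketch, assuming a Kubernetes cluster in which pods opt in via a prometheus.io/scrape: "true" annotation (a common convention, not a requirement); adjust the relabelling to match your cluster:

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod                      # discover pods through the Kubernetes API
  relabel_configs:
  # Keep only pods that opt in with the prometheus.io/scrape annotation.
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Let pods override the metrics path via a prometheus.io/path annotation.
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Carry namespace and pod name through as labels for dashboards and alerts.
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod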

Example alert (Prometheus):

groups:
- name: service-availability
  rules:
  - alert: HighErrorRate
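    # Fire when more than 5% of requests over the last 5 minutes returned a 5xx status.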
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate"
      runbook: "https://runbooks.example.com/high-error-rate"

Alerting Principles

  1. Actionable: Every page must map to a runbook.
  2. Prioritised: Define severity tiers (P0→P2) and escalation paths; a routing sketch follows this list.
  3. Contextual: Include dashboards, logs, and recent deploy links.
  4. Tested: Exercise alerts through game days and synthetic checks.
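
One way to wire severity tiers into an escalation path is Alertmanager routing. The sketch below is illustrative; the receiver names, the grouping labels, the Slack channel, and the credential placeholders are assumptions rather than values from any real setup:

route:
  receiver: slack-default              # fallback so unmatched alerts still land somewhere
  group_by: [alertname, service]
  routes:
  # P0 / critical: page the on-call engineer through PagerDuty.
  - matchers: ['severity = "critical"']
    receiver: pagerduty-oncall
    repeat_interval: 1h
  # P1 / warning: notify the team channel without paging.
  - matchers: ['severity = "warning"']
    receiver: slack-default

receivers:
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: "<PAGERDUTY_ROUTING_KEY>"   # placeholder; store as a secret
- name: slack-default
  slack_configs:
  - api_url: "<SLACK_WEBHOOK_URL>"           # placeholder
    channel: "#alerts"                       # illustrative channel name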

Distributed Tracing

Instrument services with OpenTelemetry and export spans to a backend such as Jaeger, Datadog, or SkyWalking (see the collector sketch after the list) in order to:

  • Visualise request latency per hop.
  • Correlate traces with logs via shared trace IDs.
  • Identify the blast radius of failed calls quickly.
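
A minimal OpenTelemetry Collector pipeline sketch, assuming services emit OTLP to the collector and a Jaeger instance is reachable as jaeger:4317 (an assumed hostname; recent Jaeger releases accept OTLP natively). Swap the exporter endpoint for your vendor's if you use Datadog or SkyWalking:

receivers:
  otlp:
    protocols:
      grpc:                     # services send spans over OTLP/gRPC (default port 4317)
      http:                     # or OTLP/HTTP (default port 4318)

processors:
  batch: {}                     # batch spans before export to cut network overhead

exporters:
  otlp/jaeger:
    endpoint: "jaeger:4317"     # assumed hostname for the Jaeger OTLP endpoint
    tls:
      insecure: true            # illustrative only; enable TLS outside local testing

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]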

Operational Tips

  • Automate incident creation (PagerDuty/Opsgenie) from Alertmanager.
  • Track Service Level Objectives (SLOs) with tools like Grafana SLO or Nobl9, or with plain Prometheus burn-rate alerts (see the sketch after this list).
  • Review observability debt during post-incident retrospectives.
  • Guard against alert fatigue: tune noisy thresholds and schedule silences during maintenance windows.
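
Where SLOs are tracked with Prometheus alone rather than a dedicated tool, a common pattern is a multi-window burn-rate alert against the error budget. The sketch below assumes a 99.9% availability SLO; the objective, the windows, the 14.4x burn-rate factor, the recording-rule names, and the runbook URL are illustrative, following the Google SRE Workbook's multi-window approach:

groups:
- name: slo-burn-rate
  rules:
  # Record the 5xx error ratio over a short and a long window.
  - record: service:http_error_ratio:rate5m
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  - record: service:http_error_ratio:rate1h
    expr: sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
  # Page when the 99.9% SLO's error budget burns 14.4x faster than sustainable,
  # on both windows, so brief spikes do not page anyone.
  - alert: ErrorBudgetBurnFast
    expr: >
      service:http_error_ratio:rate1h > (14.4 * 0.001)
      and
      service:http_error_ratio:rate5m > (14.4 * 0.001)
    labels:
      severity: critical
    annotations:
      summary: "Error budget for the 99.9% availability SLO is burning too fast"
      runbook: "https://runbooks.example.com/error-budget-burn"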

Further Reading