Monitoring and Alerting for Microservices
- Metrics: Quantitative signals (latency, error rate, saturation) for automated alerting.
- Logs: High-fidelity event data for forensics; centralise with Elasticsearch, Loki, or Splunk.
- Traces: Follow a request across services to locate bottlenecks (OpenTelemetry, Jaeger, Zipkin).
Metrics Stack (Prometheus + Grafana)
- Expose /metrics endpoints using client libraries.
- Collect via Prometheus (scrape configs, Kubernetes service discovery); a sample scrape config follows this list.
- Visualise in Grafana; codify dashboards in Git to track changes.
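A minimal scrape configuration using Kubernetes pod discovery might look like the sketch below; the prometheus.io/* annotations are a common opt-in convention, not a Prometheus default:
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Allow pods to override the metrics path via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)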
Example alert (Prometheus):
groups:
  - name: service-availability
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          runbook: "https://runbooks.example.com/high-error-rate"
Alerting Principles
- Actionable: Every page must map to a runbook.
- Prioritised: Define severity tiers (P0→P2) and escalation paths; a routing sketch follows this list.
- Contextual: Include dashboards, logs, and recent deploy links.
- Tested: Exercise alerts through game days and synthetic checks.
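One way to encode those tiers is an Alertmanager route tree; the receiver names below are placeholders and their notification integrations are omitted:
route:
  receiver: ticket-queue                 # default (P2): unmatched alerts become tickets
  group_by: [alertname, service]
  routes:
    - matchers: ['severity="critical"']  # P0: page the on-call engineer
      receiver: oncall-pager
      repeat_interval: 1h
    - matchers: ['severity="warning"']   # P1: notify the team channel
      receiver: team-chat
receivers:
  - name: oncall-pager
  - name: team-chat
  - name: ticket-queue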
Distributed Tracing
Instrument services with OpenTelemetry. Export spans to Jaeger/Datadog/SkyWalking to:
- Visualise request latency per hop.
- Correlate traces with logs (trace IDs).
- Identify the blast radius of failed calls quickly.
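A minimal OpenTelemetry Collector pipeline is sketched below, assuming a Jaeger instance that accepts OTLP over gRPC (the endpoint address is illustrative):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317  # assumed in-cluster Jaeger OTLP endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]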
Operational Tips
- Automate incident creation (PagerDuty/Opsgenie) from Alertmanager; a receiver sketch follows this list.
- Track Service Level Objectives (SLOs) using tools like Grafana SLO or Nobl9, or with burn-rate rules in Prometheus itself (sketched below).
- Review observability debt during post-incident retrospectives.
- Guard against alert fatigue; tune thresholds and silence windows during maintenance.
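For the Alertmanager-to-PagerDuty hand-off, the hypothetical oncall-pager receiver from the routing sketch above could be given a pagerduty_configs entry; the routing key is an Events API v2 integration key you supply:
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: REPLACE_WITH_EVENTS_V2_KEY  # secret value; keep it out of version control
        severity: critical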
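For SLO tracking inside Prometheus itself (rather than a dedicated tool), a burn-rate rule is one common sketch; the group name, metric names, and the 99.9% availability target below are assumptions:
groups:
  - name: checkout-slo
    rules:
      # Recorded 1h error ratio, reusing the request metrics from the alert example above
      - record: service:request_error_ratio:rate1h
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
      - alert: ErrorBudgetBurnFast
        # A 14.4x burn rate exhausts a 30-day error budget in roughly two days
        expr: service:request_error_ratio:rate1h > 14.4 * 0.001
        for: 5m
        labels:
          severity: critical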