Building an Effective SRE Culture
22/9/2018
One-minute read
- Shared Ownership: Reliability is everyone’s job. Embed SRE goals into product roadmaps and definition-of-done checklists.
- Learning over Blame: Post-incident reviews should document contributing factors, action items, and owners, and should never single out individuals as scapegoats.
- Automation First: Reserve human effort for engineering work; automate toil such as deployments, rollbacks, and capacity checks.
Operating Model
- Error Budgets: Define service-level objectives (SLOs), track how quickly each service burns through its error budget, and use the remaining budget to gate releases.
- Incident Response: Maintain on-call runbooks, regular game days, and retrospectives that feed a shared improvement backlog.
- Cross-Functional Collaboration: Pair SREs with product squads to co-design telemetry, capacity planning, and chaos tests.
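
The error-budget idea in the list above can be made concrete with a small calculation. The following is a minimal sketch, not a prescribed implementation: the `SloWindow` shape, the function names, and the 10% release threshold are all illustrative assumptions, and it assumes a request-based SLO in which the budget is the number of failed requests the target permits within a window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one service over one SLO window (hypothetical shape)."""
    slo_target: float       # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < window.slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def release_allowed(window: SloWindow, min_remaining: float = 0.1) -> bool:
    """Gate releases: allow them only while enough budget is left.

    The 10% default threshold is an assumption chosen for illustration.
    """
    return error_budget_remaining(window) >= min_remaining
```

With a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests, so 400 failures leave about 60% of the budget and releases continue, while 950 failures leave about 5% and the gate closes. Teams would tune the threshold, and whether the gate is advisory or enforced, to their own risk tolerance.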
Investment Areas
- Training: Budget time for SRE fundamentals (SLOs, incident command, observability) and platform-specific skills.
- Tooling: Standardise on observability stacks, runbook repositories, and self-service deployment pipelines.
- Executive Support: Align leadership on the trade-offs between feature velocity and reliability; publish quarterly outcomes to maintain sponsorship.
Metrics to Track
- Error-budget consumption by service
- Mean time to detect (MTTD) and mean time to resolve (MTTR)
- Percentage of incidents with clear preventive action items
- Automation coverage for routine operational tasks
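
Several of the metrics above can be derived from a plain list of incident records. The sketch below is one possible way to do it, under stated assumptions: the `Incident` fields and the `incident_metrics` function are hypothetical names, and MTTR is measured here from incident start to resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    """One incident record (hypothetical shape, for illustration only)."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime
    has_preventive_action: bool


def mean_delta(deltas: Sequence[timedelta]) -> timedelta:
    """Average a non-empty sequence of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents: Sequence[Incident]) -> dict:
    """Compute MTTD, MTTR, and the share of incidents with preventive actions."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Assumption: detection and resolution times are both measured from incident start.
    mttd = mean_delta([i.detected_at - i.started_at for i in incidents])
    mttr = mean_delta([i.resolved_at - i.started_at for i in incidents])
    with_actions = sum(1 for i in incidents if i.has_preventive_action)
    return {
        "mttd": mttd,
        "mttr": mttr,
        "pct_with_preventive_actions": 100.0 * with_actions / len(incidents),
    }
```

For example, two incidents detected after 5 and 15 minutes and resolved after 65 and 35 minutes give an MTTD of 10 minutes and an MTTR of 50 minutes. Grouping the records by service before calling `incident_metrics` would yield the per-service view that the error-budget metric also uses.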