Scrum. Agile. Lean. DevOps. Been there, done that?

How about Site Reliability Engineering (SRE)?

Site Reliability Engineering explained

What: The goal of Site Reliability Engineering is to create and maintain highly reliable, highly scalable software.  

As Ben Treynor, who founded Google’s SRE team, famously defined it, SRE is “what happens when a software engineer is tasked with what used to be called operations.” Because it relies on Dev and Ops collaboration, SRE is sometimes seen as a subset of DevOps.

Why: Given their scale and desire for a repeatable, excellent customer experience, Google and leaders like Treynor needed a novel way to allocate time and resources between development and operations.

Thus, SRE was born of necessity. Traditionally, Dev wants to churn out great software quickly, while Ops prefers to slow down to ensure it actually works…safely…and at scale…in the real world. Nobody had time for the traditional Dev vs Ops power struggles over resource allocation, so the question had to be reframed to answer itself.

An end to the Dev-Ops tug-of-war?

How: Site Reliability Engineering’s forcing function is to reduce the allocation of people and time between Dev and Ops to a mathematical formula.

  • “Error Budget” based on SLA (Service Level Agreement):
    Say, for instance, you have launched an application with a 99.7% uptime target. Your error budget is therefore 0.3%, or about 2 hours of downtime in a 30-day month.
  • Prioritizing between current performance vs new releases:
    With this information, your SREs have a built-in incentive to live within their error budget. If their software is highly available, great! They can afford to launch more often. And conversely, when applications are approaching or over their error budget, SRE teams need to invest more in operational reliability instead of fancy new features.
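The error budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names (error_budget_hours, can_release) and the 30-day month are assumptions made for the example, and a real team would pull downtime figures from its own monitoring.

```python
def error_budget_hours(uptime_target_percent: float,
                       period_hours: float = 30 * 24) -> float:
    """Allowed downtime, in hours, for a given uptime target.

    Assumes a 30-day month (720 hours) unless told otherwise.
    """
    return (100.0 - uptime_target_percent) / 100.0 * period_hours


def can_release(uptime_target_percent: float,
                downtime_hours_so_far: float,
                period_hours: float = 30 * 24) -> bool:
    """True while the downtime used so far is still under the budget."""
    budget = error_budget_hours(uptime_target_percent, period_hours)
    return downtime_hours_so_far < budget


if __name__ == "__main__":
    budget = error_budget_hours(99.7)
    print(f"Error budget: {budget:.2f} hours per 30-day month")
    # Half an hour of downtime leaves room for more launches.
    print("Release with 0.5 h used:", can_release(99.7, 0.5))
    # Two and a half hours is over budget, so reliability work comes first.
    print("Release with 2.5 h used:", can_release(99.7, 2.5))
```

For the 99.7% target, the budget works out to roughly 2.16 hours, which matches the "about 2 hours per month" figure in the example.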

Who: SRE teams are of fixed size and include highly talented free agents, knowledgeable in both Dev and Ops, who can be allocated to different projects.

Again, the forcing function here is that every Ops specialist brought onto the team means one fewer Dev can be hired. Highly reliable code with little operational drama allows more resources to flow to development efforts.

How Much: Allocating SRE to Dev and Ops work.

  • A maximum of 50% of SRE time should be spent on “keeping the lights on” with Ops work. Their time is better spent hardening IT systems by error-proofing and driving new efficiencies.
  • Conversely, Dev teams should handle 5% of traditional operations responsibilities like trouble tickets and other support.
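As a rough sketch of how a team might check itself against the 50% ceiling on Ops work, the snippet below computes the share of tracked hours that went to operations. The names (OPS_CAP, ops_share, over_ops_cap) and the idea of tracking hours per week are assumptions for illustration only; the source does not prescribe a particular way to measure this.

```python
OPS_CAP = 0.50  # the 50% ceiling on Ops work described above


def ops_share(ops_hours: float, total_hours: float) -> float:
    """Fraction of tracked time spent on Ops work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def over_ops_cap(ops_hours: float, total_hours: float,
                 cap: float = OPS_CAP) -> bool:
    """True when Ops work exceeds the cap and should be rebalanced."""
    return ops_share(ops_hours, total_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 24 of 40 tracked hours went to tickets and on-call.
    print(f"Ops share: {ops_share(24, 40):.0%}")
    print("Over the cap:", over_ops_cap(24, 40))
```

When the check comes back true, the overflow is a signal to push Ops work back to the Dev team or to invest in automation, in line with the allocation rules above.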

SRE Tools to Support SRE Teams?

DevOps and Site Reliability Engineering are powerful philosophies for releasing better software more frequently. To supplement these philosophies, contact us to learn more about how Orca supports Dev-and-Ops collaboration for central, secure change control and predictable application releases.