Why does your Company need an SRE?

Site Reliability Engineering - SRE

Agenda:

  1. What is Site Reliability Engineering (SRE)?

  2. Why do you need Site Reliability?

  3. What do SREs do?

  4. SRE Organizational Structure

  5. How is SRE different from Software Engineering?

  6. References

What is Site Reliability Engineering (SRE)?

SRE = Software(Operations)

Technically, an SRE is a software engineer who understands and works on the operations of a service.

Aim of SRE = 3P (Protect, Provide, Progress)

The main aim is to protect, provide for, and progress the software and systems behind all the microservices, taking care of their availability, latency, performance, and capacity.

Why do you need Site Reliability?

To understand this, you first need to know that most software companies have traditionally had two teams.

  • Development team: Smart people who get paid for designing, coding, and building new features for applications quickly.

  • Operations team: People who are smart enough to get paid for maintaining the stability of an application, and who tend to be slower about rolling out new features.

The result is a CONFLICT

To mitigate the conflict and speed up collaboration, DevOps was introduced. Within a DevOps team, someone has to continuously monitor and focus on the availability and reliability of applications, along with the scalability of the infrastructure and the performance of your service (API).

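The main purpose of reliability work is to keep errors within an agreed limit, the error budget, rather than to push the error count all the way to 0, which is rarely a realistic target.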

When a system fails, you have to understand the root cause of the issue.
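To make an availability target concrete, the sketch below converts a target percentage into the downtime it permits over a 30-day window. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration; they do not come from any particular SRE tool.

```python
# Hypothetical helper: translate an availability target into allowed downtime.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def allowed_downtime_minutes(availability_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return the minutes of downtime that a target permits over the window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return window_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Each extra "nine" cuts the permitted downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes per 30 days")
```

Each additional "nine" shrinks the allowance tenfold, which is one reason a target of 100% is usually impractical.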

What do SREs do?

  • Solve production problems with software. Simply put:

    SRE = Software (Engineers + Operations)

  • SREs make sure that the services of your enterprise stay efficient, available, and reliable.

  • They maximise the rate at which new services, features, etc. can be delivered to users.

  • They take on on-call responsibilities as well.

In addition, they work with developers to:

  • Onboard new services.

  • Develop software that directly benefits users and improves the reliability, manageability, and efficiency of services.

  • Handle on-call duties to maintain the Service Level Objective (SLO): fix issues whose resolution has not yet been automated, and control changes to protect the customer experience.

SRE teams:

  • share responsibility for a service with the development or product teams.

  • define and measure service-level objectives that specify the level of service reliability and availability expected.

  • are responsible for maintaining error budgets.

  • use automation to reduce manual toil and improve the scalability and reliability of systems.

  • rely on comprehensive monitoring and alerting systems to proactively detect and respond to issues.

  • conduct post-incident reviews to learn from the experience and identify root causes.

  • implement improvements to prevent similar incidents in the future.
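As a rough illustration of how an error budget follows from an SLO, here is a minimal sketch that counts how many failed requests a target permits and how much of that allowance is left. The `ErrorBudget` class, its field names, and the sample numbers are all hypothetical and only meant to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one measurement window (illustrative)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return self.total_requests * (1 - self.slo_target)

    @property
    def remaining(self) -> float:
        # A negative value means the budget has been overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Remaining budget: {budget.remaining:.0f}")
    print(f"Budget exhausted: {budget.exhausted}")
```

One common use of such a number, assuming the team has agreed on it in advance, is to slow down risky releases once the budget is exhausted and to ship more freely while budget remains.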
    

"With great Service (Power) comes great On-Call (responsibility)" - SRE

SRE Organizational Structure

How cool it is to have an SRE organization that is independent of business units! Usually, SRE teams are organized around a single service or a collection of services. Services are jointly owned by Product Development and Site Reliability, and together they decide how much each service depends on Site Reliability.

The Staffing budget comes from Product.

How is SRE different from Software Engineering?

Unlike Software Engineering, SRE is not a frontend, feature-driven workload. SRE concentrates on the following:

  1. Automate the boring stuff that is performed manually

  2. Define SLAs, SLOs, and SLIs

  3. Minimize the impact of a Disaster

  4. Prevent the Recurrence of the Disaster

  5. SRE Portability
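To show how the terms in item 2 relate, the sketch below computes two SLIs (service level indicators, the measurements) from a handful of request records and checks each against an SLO (the target). An SLA is the contractual promise usually built on top of such targets and is not modelled here. The `Request` type, the function names, the 300 ms threshold, and the sample data are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Made-up sample data: one server error and one slow request out of four.
SAMPLE = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(500, 30.0),
    Request(200, 450.0),
]

if __name__ == "__main__":
    availability = availability_sli(SAMPLE)
    latency = latency_sli(SAMPLE, threshold_ms=300)
    print(f"Availability SLI: {availability:.2%}, meets 99% SLO: {meets_slo(availability, 0.99)}")
    print(f"Latency SLI: {latency:.2%}, meets 95% SLO: {meets_slo(latency, 0.95)}")
```

In practice these records would come from a monitoring system rather than a hard-coded list, and the window would cover far more requests.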

References:

  1. Tech World With Nana

  2. Jamie Walkinson's blog

  3. The SRE Book

  4. The SRE Website