Why does your Company need an SRE?
Site Reliability Engineering - SRE
Agenda:
What is Site Reliability Engineering SRE?
Why do you need Reliability?
What do SREs do?
SRE Organizational Structure
How is SRE different from Software Engineering?
Reference:
What is Site Reliability Engineering SRE?
SRE = Software(Operations)
Technically, an SRE is a software engineer who understands and works on the operations of a service.
Aim of SRE = 3P (Protect, Provide, Progress)
The main aim is to protect, provide for, and progress the software and systems behind all the microservices, and to take care of their availability, latency, performance, and capacity.
Why do you need Site Reliability?
To understand this, you first need to know that a software company traditionally has two teams.
Development team: Smart folks who get paid for designing, coding, and building new features for applications fast.
Operations team: Folks who are smart enough to get paid for maintaining the stability of an application, and who are therefore kind of slow about rolling out new features.
The result is a CONFLICT.
To reduce this conflict and improve collaboration, DevOps was introduced. Within a DevOps team, someone has to continuously monitor and focus on the availability and reliability of applications, along with the scalability of the infrastructure and the performance of your service (API).
The main purpose of reliability work is to drive the error count as close to 0 as is practical.
When a system fails, you have to understand the root cause of the issue.
What do SREs do?
They solve production problems with software. Simply put:
SRE = Software (Engineers + Operations)
SREs make sure that all the services of your enterprise are efficient, available, and reliable all the time.
They also maximise the rate at which new services and features can be delivered to users.
They carry on-call responsibilities as well.
In addition, they work with developers:
To onboard new services.
To develop software that directly benefits users and improves the reliability, manageability, and efficiency of services.
To handle on-call duty and maintain the Service Level Objective (SLO): fix issues whose resolution has not yet been automated, and control changes to protect the customer experience.
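Maintaining an SLO is commonly tracked through an error budget: the amount of unreliability the SLO still permits in a given window. The following is a minimal sketch, assuming an availability SLO measured over a rolling time window. The function names and the 99.9% / 30-day figures are illustrative assumptions, not values taken from this article.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows within `window`.

    `slo_target` is a fraction such as 0.999 (meaning 99.9% availability).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # Whatever the SLO does not require to be available is the budget.
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Return the unspent error budget; a negative value means the SLO is breached."""
    return error_budget(slo_target, window) - downtime_so_far


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over 30 days, with 10 minutes already lost.
    window = timedelta(days=30)
    print("budget:   ", error_budget(0.999, window))
    print("remaining:", budget_remaining(0.999, window, timedelta(minutes=10)))
```

With a 99.9% target over 30 days, the budget works out to roughly 43 minutes of downtime. Once it is spent, a team would typically slow down risky changes until the window recovers.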
SRE teams -
> share responsibility for a service with the development or product teams.
> define and measure service-level objectives that specify the expected level of service reliability and availability.
> are responsible for maintaining error budgets.
> use automation to reduce manual toil and improve the scalability and reliability of systems.
> rely on comprehensive monitoring and alerting systems to proactively detect and respond to issues.
> conduct post-incident reviews to learn from the experience and identify root causes.
> implement improvements to prevent similar incidents in the future.
"With great Service (Power) comes great On-Call (Responsibility)" - SRE
SRE Organizational Structure
How cool is it to have an SRE organization that is independent of business units? Usually, SRE teams are organized around a single service or a collection of services. Services are jointly owned by Product Development and Site Reliability. Together they decide how heavily each service depends on Site Reliability.
The staffing budget comes from Product.
How is SRE different from Software Engineering?
Unlike Software Engineering, SRE is not a frontend, feature-driven workload. SRE concentrates on the following:
Automate the boring stuff that is performed manually
Define SLAs, SLOs, and SLIs
Minimize the impact of a disaster
Prevent the recurrence of the disaster
SRE Portability
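The SLI and SLO items in the list above can be made concrete with a little arithmetic. An SLI is a measured ratio, an SLO is the target for that ratio, and an SLA is the promise made to customers about it. Here is a rough sketch, assuming each request is logged with a success flag and a latency. The `Request` type, its field names, and the thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    ok: bool            # True if the request was served successfully
    latency_ms: float   # time taken to serve the request, in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests served within `threshold_ms`."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical sample: nine fast successes and one slow failure.
    sample = [Request(True, 100.0)] * 9 + [Request(False, 900.0)]
    print("availability SLI:", availability_sli(sample))
    print("latency SLI (300 ms):", latency_sli(sample, 300.0))
    print("meets a 99% availability SLO:", meets_slo(availability_sli(sample), 0.99))
```

Keeping the SLI as a plain ratio of good events to total events makes it easy to compare against any target, whether that target is an internal SLO or the looser figure written into an SLA.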
References: