SRE Book Chapter Review - SLOs
June 18, 2019
Summary
This is the first in a series of posts covering important chapters from the books Site Reliability Engineering and The Site Reliability Workbook. For other posts in the series, check this.
What are SLOs?
SLOs, or Service Level Objectives, are targets for service behavior metrics used to evaluate the quality of the service in question. Together with Service Level Indicators (SLIs) and Service Level Agreements (SLAs), they describe the basic properties of the metrics that matter and what values we want those metrics to have.
SLIs - A quantitative measure of some aspect of the level of service provided, e.g. request latency, error rate (expressed as a fraction of all requests received), and throughput. Another SLI important to SREs is availability, expressed as the percentage of time that the service is usable. (A small sketch of computing these appears after the definitions below.)
SLOs - A target value or range of SLI values that is acceptable for a service.
Choosing and publishing SLOs to users sets expectations about how a service will perform. Without published SLOs, users often develop their own beliefs about desired performance. For example, Google's distributed lock service, Chubby, is so highly available and global outages are so infrequent that service owners sometimes add dependencies on Chubby assuming it will never go down. This has led Google SREs to intentionally bring Chubby down in quarters when its measured availability runs well above its SLO, to flush out unwanted dependencies on Chubby.
SLAs - Contracts with users that include consequences of meeting (or missing) the SLOs they contain. If missing an SLO has direct business consequences, it is most probably part of an SLA, and a real SLA violation would probably trigger a court case for breach of contract.
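To make the SLI and SLO definitions concrete, here is a minimal Python sketch. The counter values, the 0.1% error-rate target, and the 99.9% availability target are illustrative assumptions, not figures from the book; it simply computes two common SLIs and checks each against an example SLO.

```python
# Minimal sketch: compute two common SLIs from hypothetical raw counters
# and check them against example SLO targets (all numbers are illustrative).

def error_rate(total_requests: int, failed_requests: int) -> float:
    """SLI: errors expressed as a fraction of all requests received."""
    return failed_requests / total_requests if total_requests else 0.0

def availability(uptime_seconds: float, window_seconds: float) -> float:
    """SLI: percentage of the window during which the service was usable."""
    return 100.0 * uptime_seconds / window_seconds

# Example SLOs: the acceptable range of SLI values.
ERROR_RATE_SLO = 0.001    # at most 0.1% of requests may fail
AVAILABILITY_SLO = 99.9   # at least 99.9% of the window must be up

if __name__ == "__main__":
    err = error_rate(total_requests=1_200_000, failed_requests=950)
    avail = availability(uptime_seconds=2_591_000, window_seconds=2_592_000)  # ~30 days
    print(f"error rate {err:.4%} -> SLO met: {err <= ERROR_RATE_SLO}")
    print(f"availability {avail:.3f}% -> SLO met: {avail >= AVAILABILITY_SLO}")
```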
Types of SLOs
User-facing serving systems, like the Shakespeare search frontends, generally care about availability, latency, and throughput.
Storage systems generally care about availability, latency, and durability.
Big data systems care about throughput and end-to-end latency.
All systems care about correctness: was the right answer returned?
Collecting Indicators
Indicators are generally collected by a monitoring system like Borgmon, Prometheus, or Logstash, or with periodic log analysis (e.g. HTTP 500 responses as a fraction of all requests). Client-side metric collection is preferable (and sometimes required) to ensure the metrics are representative of the end-user experience.
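As a concrete illustration of the log-analysis approach, here is a small Python sketch that computes 5xx responses as a fraction of all requests. The log format (HTTP status code as the second whitespace-separated field) is an assumption for illustration; a real access log would need its own parsing.

```python
# Hedged sketch of periodic log analysis: HTTP 5xx responses as a fraction of
# all requests, assuming each log line looks like "GET 500 /search 123ms".
from pathlib import Path

def server_error_fraction(log_path: str) -> float:
    total = errors = 0
    for line in Path(log_path).read_text().splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        total += 1
        if fields[1].startswith("5"):  # count all 5xx as server errors
            errors += 1
    return errors / total if total else 0.0
```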
It is advisable to build SLI and SLO specifications on a standard set of definitions (a structured template is sketched after this list):
Aggregation intervals: "Averaged over 1 minute"
Aggregation regions: "All the tasks in a cluster"
How frequently measurements are made: "Every 10 seconds"
Which requests are included: "HTTP GETs from black-box monitoring jobs"
How the data is acquired: "Through our monitoring, measured at the server"
Data-access latency: "Time to last byte"
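One way to standardize these definitions is to capture them in a structured template. The sketch below uses a Python dataclass; the field names and the example values are assumptions chosen to mirror the list above, not a format prescribed by the book.

```python
# Sketch of a standardized SLI specification: each dimension from the list
# above becomes an explicit field, so every SLO in the org reads the same way.
from dataclasses import dataclass

@dataclass
class SLISpec:
    name: str
    aggregation_interval: str   # e.g. "averaged over 1 minute"
    aggregation_region: str     # e.g. "all the tasks in a cluster"
    measurement_frequency: str  # e.g. "every 10 seconds"
    included_requests: str      # e.g. "HTTP GETs from black-box monitoring jobs"
    data_source: str            # e.g. "our monitoring, measured at the server"
    latency_definition: str     # e.g. "time to last byte"

latency_sli = SLISpec(
    name="frontend GET latency",
    aggregation_interval="averaged over 1 minute",
    aggregation_region="all the tasks in a cluster",
    measurement_frequency="every 10 seconds",
    included_requests="HTTP GETs from black-box monitoring jobs",
    data_source="our monitoring, measured at the server",
    latency_definition="time to last byte",
)
```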
It is a good idea to have as few SLOs as possible.
Control Measures
SLIs and SLOs are crucial elements in the control loops used to manage systems:
1. Monitor and measure the system's SLIs.
2. Compare the SLIs to the SLOs, and decide whether or not action is needed.
3. If action is needed, figure out what needs to happen in order to meet the target.
4. Take that action.
For example, if step 2 shows that request latency is increasing, and will miss the SLO in a few hours unless something is done, step 3 might include testing the hypothesis that the servers are CPU-bound, and deciding to add more of them to spread the load. Without the SLO, you wouldn’t know whether (or when) to take action.
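The loop can be sketched in a few lines of Python. The measurement and remediation functions below are hypothetical placeholders (the SLI is stubbed with simulated data, and "adding capacity" just prints); only the measure-compare-decide-act structure comes from the text.

```python
# Sketch of the SLI/SLO control loop: measure -> compare -> decide -> act.
import random
import time

LATENCY_SLO_MS = 300.0  # illustrative target, not from the book

def measure_p99_latency_ms() -> float:
    """Step 1: measure the SLI. Stubbed with random data; replace with a real query."""
    return random.uniform(100.0, 400.0)

def add_capacity() -> None:
    """Step 4: one possible action, e.g. adding servers to spread the load."""
    print("adding capacity (placeholder)")

def control_loop(iterations: int = 5, poll_seconds: float = 1.0) -> None:
    for _ in range(iterations):
        latency = measure_p99_latency_ms()   # step 1: monitor and measure the SLI
        if latency > LATENCY_SLO_MS:         # step 2: compare the SLI to the SLO
            # step 3: decide what needs to happen to meet the target,
            # e.g. test the hypothesis that the servers are CPU-bound
            add_capacity()                   # step 4: take that action
        time.sleep(poll_seconds)

if __name__ == "__main__":
    control_loop()
```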
A team wanting to build a photo-sharing website might want to avoid using a service that promises very strong durability and low cost in exchange for slightly lower availability, though the same service might be a perfect fit for an archival records management system.