Measurement is critical

SLIs -> SLOs -> SLAs ⚠

Hello friends, my earlier posts were about observability: observability-why-now and monitoring-vs-observability.

Someone asked me: "OK, I made some changes, but how do I measure whether I am doing well, or at least improving?" It is a good question. Measurement is the first step, because it establishes a baseline before you take any action. Without an understanding of the current situation, it is difficult to determine whether the actions taken are moving you towards the desired outcome.

The metrics you measure should directly affect the engineering team's experience, the customer's experience, or both. The three most common and well-known metrics that organizations need to understand and maintain are:

SLA (Service Level Agreement) - the agreement that you make with your customers

  • In general, your organization may face legal and/or monetary consequences if these SLAs are not met.

  • Your sales/marketing team uses these metrics to attract potential customers to your product.

SLO (Service Level Objective) - the objectives you must hit to meet the agreement (SLA)

  • The SLO is the metric used by engineering teams and product owners.

  • The SLO is set stricter than the SLA, leaving a buffer so that the SLA is safely met whenever all internal SLOs are met.
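To make the buffer between SLO and SLA concrete, here is a minimal sketch that converts an availability target into allowed downtime. The 99.9% SLA, the 99.95% SLO, the 30-day window, and the function name are all assumptions picked for illustration, not recommendations.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


# Hypothetical targets: the external SLA is looser than the internal SLO.
sla_minutes = allowed_downtime_minutes(99.9)
slo_minutes = allowed_downtime_minutes(99.95)

# The gap is the safety margin left before the customer-facing agreement is broken.
buffer_minutes = sla_minutes - slo_minutes

print(f"SLA allows {sla_minutes:.1f} min, SLO allows {slo_minutes:.1f} min, "
      f"buffer is {buffer_minutes:.1f} min")
```

With these assumed numbers, a 30-day window has 43,200 minutes, so the SLA tolerates about 43.2 minutes of downtime while the SLO tolerates about 21.6. A team that stays within its SLO still has roughly 21.6 minutes of margin before the SLA is at risk.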

SLI (Service Level Indicator) - the most granular, clearly defined numerical indicators of how the service is performing, for example its availability.

  • These are metrics measured and aggregated over time. For example: is the 99th percentile of request latency less than 300ms?
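The latency question above can be sketched in a few lines. This is only an illustration: it assumes you already have a list of request latencies in milliseconds for the window you care about, it uses the simple nearest-rank definition of a percentile (monitoring tools may interpolate or use histograms instead), and the function names are made up for this post.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_sli_met(latencies_ms, pct=99, threshold_ms=300):
    """True when the pct-th percentile latency is below the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


# Hypothetical sample: 99 fast requests and one slow outlier.
sample = [100] * 99 + [500]
print(percentile(sample, 99), latency_sli_met(sample))
```

In this sample the single 500ms outlier sits above the 99th percentile, so the indicator is still met. A second slow request out of 100 would push the 99th percentile to 500ms and the check would fail.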

Some interesting thoughts

Should we not build systems that are 100% available?

No, SLOs have both lower and upper bounds.

  • If you build a system that is more reliable than expected, you are spending additional development cycles on something that is not required, which slows down development velocity.

  • If you build a system that is less available than the lower bound of your SLOs, you are preventing your customers from building other services that depend on yours.
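The two bounds can be written down as a simple check. This is a rough sketch under stated assumptions: the 99.9% lower bound, the 99.99% upper bound, and the function name are hypothetical, and real teams usually track this through an error budget rather than a single comparison.

```python
def assess_reliability(measured: float, lower: float, upper: float) -> str:
    """Compare measured availability (in percent) with the SLO's lower and upper bounds."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if measured < lower:
        return "below lower bound: prioritize reliability work"
    if measured > upper:
        return "above upper bound: room to spend effort on features"
    return "within bounds: keep going"


# Hypothetical bounds and measurements, in percent availability.
for measured in (99.5, 99.95, 99.999):
    print(measured, "->", assess_reliability(measured, lower=99.9, upper=99.99))
```

The point of the upper bound is not that extra reliability is harmful in itself, but that it signals effort that could have gone into development velocity instead.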

Measurement is critical, and SLIs -> SLOs -> SLAs give the entire organization and its customers common, well-defined metrics that can be measured and improved continuously.