precision = alerts triggered by real incidents / all alerts triggered
recall = detected significant events / all significant events
Detection time = how long it takes to send a notification under various conditions.
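The two ratios above can be sketched in a few lines of Python. The counts below are hypothetical, and returning 1.0 when the denominator is zero is an assumption made for this sketch, not something the definitions prescribe.

```python
def precision(true_alerts: int, all_alerts: int) -> float:
    """Fraction of fired alerts that corresponded to a real incident."""
    # Assumption: with no alerts at all, treat precision as perfect.
    return true_alerts / all_alerts if all_alerts else 1.0


def recall(detected_events: int, all_events: int) -> float:
    """Fraction of significant events that produced an alert."""
    # Assumption: with no significant events, treat recall as perfect.
    return detected_events / all_events if all_events else 1.0


# Hypothetical month: 20 alerts fired, 15 of them for real incidents;
# 18 significant events happened, 15 of them were detected.
p = precision(15, 20)
r = recall(15, 18)
print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision means noisy paging; low recall means missed incidents. The alerting strategies that follow trade these two against detection time.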
- record: job:slo_errors_per_request:ratio_rate10m
  expr:
    sum(rate(slo_errors[10m])) by (job)
      /
    sum(rate(slo_requests[10m])) by (job)
Rule
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate10m{job="my_job"} >= 0.001
Pros and cons
Rule
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate36h{job="my_job"} > 0.001
Pros and cons
Rule
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate1m{job="my_job"} > 0.001
  for: 1h
Pros and cons
Burn rate is how fast, relative to the SLO, the service consumes its error budget.
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate1h{job="my_job"} > 36 * 0.001
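A minimal sketch of the burn rate arithmetic behind the rule above. It assumes a 99.9% SLO (matching the 0.001 error budget in the rules) and a 30-day SLO period. The period length is an assumption of this sketch, and the function names are illustrative.

```python
import math

SLO = 0.999                # assumed SLO target; error budget is 1 - SLO
PERIOD_HOURS = 30 * 24     # assumed 30-day SLO period


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """Observed error ratio expressed as a multiple of the error budget."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """How long the whole error budget lasts at a constant burn rate."""
    return period_hours / rate


# An error ratio of 3.6% against a 0.1% budget is a burn rate of about 36,
# which would use up a 30-day budget in roughly 20 hours.
rate = burn_rate(0.036)
print(round(rate, 1), round(hours_to_exhaustion(rate), 1))
```

A burn rate of 1 consumes exactly the whole budget over the SLO period; anything above 1 exhausts it early.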
Multiple burn rates: alert on 2% of the error budget consumed in 1 hour, 5% consumed in 6 hours, or 10% consumed in 3 days.
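The multipliers 14.4, 6 and 1 used in the multiple burn rate rules follow from these budget fractions. The sketch below assumes a 30-day SLO period, which is what makes the numbers come out as in the rules; the function name is illustrative.

```python
import math

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


# (fraction of error budget consumed, alerting window in hours)
for fraction, window in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    threshold = burn_rate_threshold(fraction, window)
    print(f"{fraction:.0%} in {window}h -> burn rate {threshold:.1f}")
```

Each threshold is then multiplied by the error budget (0.001 here) to obtain the error ratio that the alert expression compares against.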
Rule
- expr: (
      job:slo_errors_per_request:ratio_rate1h{job="my_job"} > (14.4*0.001)
    or
      job:slo_errors_per_request:ratio_rate6h{job="my_job"} > (6*0.001)
    )
  severity: page
- expr: job:slo_errors_per_request:ratio_rate3d{job="my_job"} > 0.001
  severity: ticket
An enhancement to multiple burn rate alerts is to also check, over a shorter time window, whether the error budget is still being consumed. Google suggests a short window of 1/12 of the long window.
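As a quick check of the 1/12 rule, the sketch below derives the short window for each long window that appears in the rules (1h, 6h, 24h, 3d). The helper name is illustrative.

```python
from datetime import timedelta


def short_window(long_window: timedelta) -> timedelta:
    """Short confirmation window: 1/12 of the long alerting window."""
    return long_window / 12


long_windows = [
    timedelta(hours=1),
    timedelta(hours=6),
    timedelta(hours=24),
    timedelta(days=3),
]

for long in long_windows:
    print(f"long={long} short={short_window(long)}")
```

The results (5m, 30m, 2h, 6h) are the short windows paired with each long window in the rule. Because both windows must exceed the threshold, the alert stops firing soon after the errors stop, which is where the good reset time comes from.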
Rule
expr: (
    job:slo_errors_per_request:ratio_rate1h{job="my_job"} > (14.4*0.001)
  and
    job:slo_errors_per_request:ratio_rate5m{job="my_job"} > (14.4*0.001)
  )
  or
  (
    job:slo_errors_per_request:ratio_rate6h{job="my_job"} > (6*0.001)
  and
    job:slo_errors_per_request:ratio_rate30m{job="my_job"} > (6*0.001)
  )
severity: page

expr: (
    job:slo_errors_per_request:ratio_rate24h{job="my_job"} > (3*0.001)
  and
    job:slo_errors_per_request:ratio_rate2h{job="my_job"} > (3*0.001)
  )
  or
  (
    job:slo_errors_per_request:ratio_rate3d{job="my_job"} > 0.001
  and
    job:slo_errors_per_request:ratio_rate6h{job="my_job"} > 0.001
  )
severity: ticket
Pros and cons
+ Better precision (we do not react to jitter)
+ Still good recall
+ Still good detection time
+ Good reset time
- A lot of parameters to specify