Linkerd — Alerting On Error Rates

This is how to create error rate alerts at the service level (as opposed to the load balancer/ingress level) with Linkerd, Prometheus and Alertmanager.

The plumbing

The rules configmap is connected to the linkerd-prometheus, the linkerd-prometheus is connected to the alertmanager, the alertmanager is connected to the pager duty, the pager duty is connected to my wristwatch…

As the Linkerd Helm chart stands in version 2.8.1, the stable release at the time of writing, to get alerting set up you must bring:

  • a rules configmap
  • an alertmanager

And you will want the following chart values:

prometheusAlertmanagers:
- scheme: http
  static_configs:
  - targets:
    - "byo-alertmanager-svc.and-its-namespace"
prometheusRuleConfigMapMounts:
- name: alerting-rules
  subPath: alerting_rules.yml
  configMap: linkerd-prometheus-rules
- name: recording-rules
  subPath: recording_rules.yml
  configMap: linkerd-prometheus-rules

Rules Configmap

For the rules configmap, as the Linkerd helm chart stands, it needs to pre-exist. Apply a k8s manifest similar to the one below, tweaking the alert expression to suit your needs (more on this below):

apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-prometheus-rules
  namespace: linkerd
data:
  recording_rules.yml: |
    groups:
    - name: linkerd
      rules:
      - record: deployment:error_rate_1m
        expr: |
          sum(rate(response_total{classification="failure", direction="inbound"}[1m])) by (deployment)
          /
          sum(rate(response_total{direction="inbound"}[1m])) by (deployment)
  alerting_rules.yml: |
    groups:
    - name: linkerd
      rules:
      - alert: HighErrorRate
        expr: |
          deployment:error_rate_1m >= 0.01
        annotations:
          message: |
            The error rate for deployment {{ $labels.deployment }} has been high for 5 minutes. The current value is {{ $value | humanizePercentage }}.
          summary: High Error Rate
        for: 5m
        labels:
          severity: page

The above defines one recording rule that computes error rates, and one alerting rule that fires if the error rate of any meshed deployment stays at or above 1% (0.01) for 5 minutes.
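One tweak that may be worth considering: at very low traffic, a single failed request can push the ratio over the threshold. A minimal sketch of a guard, assuming that a floor of 1 inbound request per second is meaningful for your services (the value 1 is an arbitrary placeholder, not a Linkerd recommendation), is to require a minimum request rate before the alert can fire. Annotations are omitted for brevity:

```yaml
- alert: HighErrorRate
  expr: |
    # Error ratio from the recording rule above
    deployment:error_rate_1m >= 0.01
    # Only keep deployments that also see enough traffic (assumed floor: 1 req/s)
    and on (deployment)
    sum(rate(response_total{direction="inbound"}[1m])) by (deployment) > 1
  for: 5m
  labels:
    severity: page
```

The `and on (deployment)` keeps the error-rate series only where a series with the same deployment label exists on the right-hand side, so quiet deployments drop out of the alert.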

It may be desirable to include or exclude some deployments from alerting; this can be achieved with label matchers. More than one quality-of-service tier may also be desired. Below is an example expression combining these two ideas:

expr: |
  deployment:error_rate_1m{deployment!~"janky-deployment"} >= 0.01
  or
  deployment:error_rate_1m{deployment=~"janky-deployment"} >= 0.05

As seen above, all deployments need to stay under the 1% error rate except for janky-deployment, which only needs to stay below 5%. Once again, this is just an example. Play around until you get actionable alerts!
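If the tiers should also notify differently, an alternative to a single expression joined with `or` is one alerting rule per tier. The sketch below assumes a second severity value, `ticket`, that your Alertmanager routes somewhere quieter than `page`; that value and the rule name HighErrorRateJanky are made up for illustration, and annotations are omitted for brevity:

```yaml
groups:
- name: linkerd
  rules:
  # Strict tier: everything except janky-deployment pages at 1%
  - alert: HighErrorRate
    expr: |
      deployment:error_rate_1m{deployment!~"janky-deployment"} >= 0.01
    for: 5m
    labels:
      severity: page
  # Relaxed tier: janky-deployment only raises a (hypothetical) ticket at 5%
  - alert: HighErrorRateJanky
    expr: |
      deployment:error_rate_1m{deployment=~"janky-deployment"} >= 0.05
    for: 5m
    labels:
      severity: ticket
```

Separate rules cost a little duplication, but each tier gets its own name, labels and `for` duration.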

Alertmanager

Which brings us to Alertmanager: you might already have one running, hooked up to another Prometheus, and you can re-use it! If you don’t, have a look at prometheus-community’s standalone alertmanager chart.

To configure notifications and such, please see the Alertmanager documentation.
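As a rough starting point, here is a minimal sketch of an Alertmanager configuration that sends alerts labelled `severity: page` (the label set by the rule above) to PagerDuty. The receiver names are arbitrary, the routing key is a placeholder you would replace with your own integration key, and `matchers` assumes Alertmanager 0.22 or newer (older versions use `match` instead):

```yaml
route:
  # Fallback receiver for anything that matches no child route
  receiver: default
  routes:
  # Alerts carrying severity=page go to PagerDuty
  - matchers:
    - severity="page"
    receiver: pagerduty
receivers:
# A receiver with no integrations simply drops notifications
- name: default
- name: pagerduty
  pagerduty_configs:
  # Placeholder: Events API v2 integration key from your PagerDuty service
  - routing_key: "<your-pagerduty-integration-key>"
```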

High error rate alert firing in the Alertmanager UI.

Conclusion

If you are using Linkerd and would like to know when service error rates spike, please give the instructions above a try.

If you’d like to try some more sophisticated alerting, check out the SRE Workbook’s chapter on Alerting on SLOs, though these may require longer retention times for metrics than what linkerd-prometheus offers out of the box. While projects like Thanos and Cortex come to mind when talking about long-term Prometheus metrics, it is very nice to know that you can just use Prometheus with a large enough disk.
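To give a flavour of that approach, below is a hedged sketch of a multi-window burn-rate alert adapted to the metrics used in this post. It assumes a 99.9% availability SLO over 30 days (so an error budget of 0.001) and uses the 14.4 burn-rate factor the SRE Workbook pairs with 1h and 5m windows. The group name, the rule names and the SLO target are assumptions for illustration, not something Linkerd ships:

```yaml
groups:
- name: linkerd-slo
  rules:
  # Same ratio as deployment:error_rate_1m, over a short and a long window
  - record: deployment:error_rate_5m
    expr: |
      sum(rate(response_total{classification="failure", direction="inbound"}[5m])) by (deployment)
      /
      sum(rate(response_total{direction="inbound"}[5m])) by (deployment)
  - record: deployment:error_rate_1h
    expr: |
      sum(rate(response_total{classification="failure", direction="inbound"}[1h])) by (deployment)
      /
      sum(rate(response_total{direction="inbound"}[1h])) by (deployment)
  # Fire only when both windows burn budget at 14.4x the sustainable rate
  - alert: ErrorBudgetBurn
    expr: |
      deployment:error_rate_1h > (14.4 * 0.001)
      and
      deployment:error_rate_5m > (14.4 * 0.001)
    labels:
      severity: page
```

The long window shows that a meaningful share of the budget is gone, while the short window lets the alert stop firing soon after the problem ends. Slower burn rates need windows of several hours or days, which is where the retention concern above comes in.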

Written by

DevOps @ Transit
