Linkerd — Alerting On Error Rates
This is how to create error rate alerts at the service level (as opposed to the load balancer/ingress level) with Linkerd, Prometheus and Alertmanager.
The plumbing
The rules configmap is connected to the linkerd-prometheus, the linkerd-prometheus is connected to the alertmanager, the alertmanager is connected to the pager duty, the pager duty is connected to my wristwatch…
As the Linkerd Helm chart stands in version 2.8.1, the stable release at the time of writing, in order to get alerting set up you must bring:
- a rules configmap
- an alertmanager
And you will want the following chart values for Linkerd 2.8.x:
# Linkerd 2.8.x
prometheusAlertmanagers:
  - scheme: http
    static_configs:
      - targets:
          - "byo-alertmanager-svc.and-its-namespace"

prometheusRuleConfigMapMounts:
  - name: alerting-rules
    subPath: alerting_rules.yml
    configMap: linkerd-prometheus-rules
  - name: recording-rules
    subPath: recording_rules.yml
    configMap: linkerd-prometheus-rules
Or these for Linkerd 2.9.x, which introduces a nested approach to configuring linkerd-prometheus:
# linkerd 2.9.x
prometheus:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "prometheus-alertmanager.monitoring"
  ruleConfigMapMounts:
    - name: alerting-rules
      subPath: alerting_rules.yml
      configMap: linkerd-prometheus-rules
    - name: recording-rules
      subPath: recording_rules.yml
      configMap: linkerd-prometheus-rules
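If it helps to see what these values do: they get rendered into the embedded linkerd-prometheus configuration, which ends up with an alerting section pointing at your Alertmanager and a rule_files entry for the mounted configmap. A rough sketch of the result is below; the rule file path is an assumption about where the chart mounts things, not something taken from the chart itself.

# Rough sketch of the resulting prometheus.yml fragments;
# the rule_files path is an assumed mount location, shown only for illustration.
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "prometheus-alertmanager.monitoring"
rule_files:
  - /etc/prometheus/*_rules.yml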
Rules Configmap
As the Linkerd Helm chart stands, the rules configmap needs to exist before the chart can mount it. Apply a k8s manifest similar to the one below, tweaking the alert expression to suit your needs (more on this below):
apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-prometheus-rules
  namespace: linkerd
data:
  recording_rules.yml: |
    groups:
    - name: linkerd
      rules:
      - record: deployment:error_rate_1m
        expr: |
          sum(rate(response_total{classification="failure", direction="inbound"}[1m])) by (deployment)
          /
          sum(rate(response_total{direction="inbound"}[1m])) by (deployment)
  alerting_rules.yml: |
    groups:
    - name: linkerd
      rules:
      - alert: HighErrorRate
        expr: |
          deployment:error_rate_1m >= 0.01
        annotations:
          message: |
            The error rate for deployment {{ $labels.deployment }} has been high for 5 minutes. The current value is {{ $value | humanizePercentage }}.
          summary: High Error Rate
        for: 5m
        labels:
          severity: page
The above defines one recording rule used to compute per-deployment error rates, and one alerting rule that fires if the error rate of any meshed deployment stays at or above 1% (0.01) for 5 minutes.
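Once the chart values and the configmap are in place, a quick way to sanity-check the plumbing is to run these queries (one at a time) in the linkerd-prometheus expression browser. The first should return one series per meshed deployment; the second is the equivalent raw query from the recording rule, handy for experimenting before committing to a threshold.

# Returns one series per meshed deployment once the recording rule is loaded
deployment:error_rate_1m

# Equivalent raw query, useful for exploring error rates directly
sum(rate(response_total{classification="failure", direction="inbound"}[1m])) by (deployment)
/
sum(rate(response_total{direction="inbound"}[1m])) by (deployment)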
It may be desirable to include or exclude some deployments from alerting; this can be achieved with label matchers. More than one quality-of-service tier may also be desired. Below is an example expression combining these two ideas:
expr: |
  deployment:error_rate_1m{deployment!~"janky-deployment"} >= 0.01
  or
  deployment:error_rate_1m{deployment=~"janky-deployment"} >= 0.05
As seen above, all deployments need to stay under the 1% error rate except for janky-deployment, which only needs to stay below 5%. Once again, this is just an example; play around until you get actionable alerts!
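Another way to express tiers, sketched below, is to split them into separate alerts so each one carries its own severity label and can be routed differently in Alertmanager. The HighErrorRateBestEffort name and the ticket severity are hypothetical, not anything Linkerd provides.

# Hypothetical two-tier variant of the rules: entries in alerting_rules.yml;
# the second alert name and the "ticket" severity are illustrative.
- alert: HighErrorRate
  expr: |
    deployment:error_rate_1m{deployment!~"janky-deployment"} >= 0.01
  for: 5m
  labels:
    severity: page
- alert: HighErrorRateBestEffort
  expr: |
    deployment:error_rate_1m{deployment=~"janky-deployment"} >= 0.05
  for: 5m
  labels:
    severity: ticket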
Alertmanager
Which brings us to Alertmanager. You might already have one running, hooked up to another Prometheus; if so, you can re-use it! If you don't, have a look at prometheus-community's standalone alertmanager chart.
To configure notification receivers, routing, and so on, please see the Alertmanager documentation.
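For a concrete starting point, a minimal Alertmanager configuration that routes severity: page alerts to PagerDuty might look something like the sketch below; the receiver names and the integration key placeholder are made up for illustration.

# Minimal Alertmanager config sketch; receiver names and the key placeholder are illustrative.
route:
  receiver: default
  group_by: ['alertname', 'deployment']
  routes:
    - match:
        severity: page
      receiver: pagerduty
receivers:
  - name: default
  - name: pagerduty
    pagerduty_configs:
      - service_key: <your-pagerduty-integration-key>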
Conclusion
If you are using Linkerd and would like to know when your services' error rates spike, please give the instructions above a try.
If you'd like to try some more sophisticated alerting, check out the SRE Workbook's chapter on Alerting on SLOs, though those approaches may require longer metric retention than what linkerd-prometheus offers out of the box. While projects like Thanos and Cortex come to mind when talking about long-term Prometheus metrics, it is very nice to know that you can just use Prometheus with a large enough disk.