Linkerd — Alerting On Error Rates

Oct 14, 2020

This is how to create error rate alerts at the service level (as opposed to the load balancer/ingress level) with Linkerd, Prometheus and Alertmanager.

The plumbing

The rules configmap is connected to the linkerd-prometheus, the linkerd-prometheus is connected to the alertmanager, the alertmanager is connected to the pager duty, the pager duty is connected to my wristwatch…

As of version 2.8.1 of the Linkerd Helm chart, the stable release at the time of writing, you must bring two things of your own to get alerting set up:

a rules configmap
an alertmanager

And you will want the following chart values for Linkerd 2.8.x:

# Linkerd 2.8.x
prometheusAlertmanagers:
  - scheme: http
    static_configs:
      - targets:
        - "byo-alertmanager-svc.and-its-namespace"
prometheusRuleConfigMapMounts:
  - name: alerting-rules
    subPath: alerting_rules.yml
    configMap: linkerd-prometheus-rules
  - name: recording-rules
    subPath: recording_rules.yml
    configMap: linkerd-prometheus-rules

Or these for Linkerd 2.9.x, which introduces a nested approach to configuring linkerd-prometheus:

# Linkerd 2.9.x
prometheus:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
          - "prometheus-alertmanager.monitoring"
  ruleConfigMapMounts:
    - name: alerting-rules
      subPath: alerting_rules.yml
      configMap: linkerd-prometheus-rules
    - name: recording-rules
      subPath: recording_rules.yml
      configMap: linkerd-prometheus-rules

Rules Configmap

As the Linkerd Helm chart stands, the rules configmap must already exist. Apply a k8s manifest similar to the one below, tweaking the alert expression to suit your needs (more on this below):

apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-prometheus-rules
  namespace: linkerd
data:
  recording_rules.yml: |
    groups:
      - name: linkerd
        rules:
          - record: deployment:error_rate_1m
            expr: |
              sum(rate(response_total{classification="failure", direction="inbound"}[1m])) by (deployment)
                /
              sum(rate(response_total{direction="inbound"}[1m])) by (deployment)
  alerting_rules.yml: |
    groups:
      - name: linkerd
        rules:
          - alert: HighErrorRate
            expr: |
              deployment:error_rate_1m >= 0.01
            annotations:
              message: |
                The error rate for deployment {{ $labels.deployment }} has been high for 5 minutes. The current value is {{ $value | humanizePercentage }}.
              summary: High Error Rate
            for: 5m
            labels:
              severity: page

The manifest above defines one recording rule that computes the error rate per deployment, and one alerting rule that fires if the error rate of any meshed deployment stays at or above 1% (0.01) for 5 minutes.

It may be desirable to include or exclude some deployments from alerting; this can be achieved with label matchers. More than one quality-of-service tier may also be desired. Below is an example expression combining these two ideas:

expr: |
  deployment:error_rate_1m{deployment!~"janky-deployment"} >= 0.01
    or
  deployment:error_rate_1m{deployment=~"janky-deployment"} >= 0.05

As seen above, all deployments need to stay under the 1% error rate except for janky-deployment, which only needs to stay below 5%. Once again, this is just an example; play around until you get actionable alerts!
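If the two tiers should also carry different labels (and so be routed differently), one alternative is to split the expression into two alerting rules rather than joining them with `or`. The sketch below assumes the `deployment:error_rate_1m` recording rule from the configmap; the second alert name, `HighErrorRateJanky`, and the `ticket` severity are hypothetical and only there for illustration.

```yaml
groups:
  - name: linkerd
    rules:
      # Default tier: every meshed deployment except the known-noisy one.
      - alert: HighErrorRate
        expr: |
          deployment:error_rate_1m{deployment!="janky-deployment"} >= 0.01
        for: 5m
        labels:
          severity: page
      # Relaxed tier: hypothetical alert name and severity for the noisy deployment.
      - alert: HighErrorRateJanky
        expr: |
          deployment:error_rate_1m{deployment="janky-deployment"} >= 0.05
        for: 5m
        labels:
          severity: ticket
```

The trade-off is a little more YAML in exchange for per-tier `for` durations, labels and annotations.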

Alertmanager

Which brings us to Alertmanager. You might already have one running, hooked up to another Prometheus, in which case you can reuse it! If you don’t, have a look at prometheus-community’s standalone alertmanager chart.

To configure notifications and such, please see the Alertmanager documentation.
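As a starting point, a minimal Alertmanager configuration that routes alerts labelled `severity: page` to PagerDuty might look like the sketch below. The receiver names and the routing key are placeholders, not values from this setup, and `matchers` assumes Alertmanager 0.22 or newer (older releases use a `match:` map instead).

```yaml
# alertmanager.yml (sketch; receiver names and the key are placeholders)
route:
  receiver: default              # fallback for anything not matched below
  group_by: ['alertname', 'deployment']
  routes:
    - matchers:
        - severity="page"        # the label set by the HighErrorRate rule
      receiver: pagerduty
receivers:
  - name: default                # no integrations configured: alerts only appear in the UI
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_WITH_YOUR_INTEGRATION_KEY
```

Grouping by `deployment` keeps each affected deployment in its own notification rather than folding them into one.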

High error rate alert firing in the Alertmanager UI.

Conclusion

If you are using Linkerd and would like to know when service error rates spike, please give the instructions above a try.

If you’d like to try some more sophisticated alerting, check out the SRE Workbook’s chapter on Alerting on SLOs, though these approaches may require longer retention times for metrics than what linkerd-prometheus offers out of the box. While projects like Thanos and Cortex come to mind when talking about long-term Prometheus metrics, it is very nice to know that you can just use Prometheus with a large enough disk.
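As a rough illustration of the multi-window, multi-burn-rate idea from that chapter, the sketch below assumes a 99.9% availability SLO (an assumption for the example, not something this setup defines). The recording rules `deployment:error_rate_5m` and `deployment:error_rate_1h` and the alert name `ErrorBudgetFastBurn` are hypothetical; the rules are built the same way as `deployment:error_rate_1m`. The factor 14.4 is the workbook's fast-burn threshold, which corresponds to spending 2% of a 30-day error budget within one hour.

```yaml
groups:
  - name: linkerd-slo
    rules:
      # Hypothetical short-window error ratio, same shape as deployment:error_rate_1m.
      - record: deployment:error_rate_5m
        expr: |
          sum(rate(response_total{classification="failure", direction="inbound"}[5m])) by (deployment)
            /
          sum(rate(response_total{direction="inbound"}[5m])) by (deployment)
      # Hypothetical long-window error ratio.
      - record: deployment:error_rate_1h
        expr: |
          sum(rate(response_total{classification="failure", direction="inbound"}[1h])) by (deployment)
            /
          sum(rate(response_total{direction="inbound"}[1h])) by (deployment)
      # Fire only when both windows burn budget at 14.4x the rate allowed
      # by an assumed 99.9% SLO (error budget 0.001).
      - alert: ErrorBudgetFastBurn
        expr: |
          deployment:error_rate_1h > (14.4 * 0.001)
            and
          deployment:error_rate_5m > (14.4 * 0.001)
        labels:
          severity: page
```

This pair only looks back one hour; the chapter's slower-burn tiers use windows of up to three days, which is where retention starts to matter.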


Written by Naseem Ullah
