Effective Alerting with Prometheus Alertmanager
In today's dynamic application environments, effective monitoring and alerting are non-negotiable for maintaining the reliability of your systems.
While Prometheus offers robust metrics collection and alerting capabilities, managing these alerts at scale requires a dedicated system. That's where Alertmanager comes in.
This guide provides a practical walkthrough of Alertmanager, from its basic configuration to advanced techniques like grouping and inhibition that help reduce alert fatigue.
By reading it to the end, you'll equip yourself with the knowledge to build an alerting system that cuts through the noise and empowers rapid incident response.
Let's get started!
Prerequisites
- Prior knowledge of Prometheus monitoring concepts.
- A recent version of Docker and Docker Compose installed on your system.
What is Prometheus Alertmanager?
The Alertmanager is the central alerting component of the Prometheus ecosystem. It is designed to handle alerts generated by Prometheus servers, which monitor various systems and applications.
These servers are typically configured with alerting rules that define conditions for triggering alerts. When these conditions are met, Prometheus sends alerts to Alertmanager.
Once Alertmanager receives an alert from Prometheus, it processes it based on its configuration. This usually involves grouping, deduplication, and routing to the appropriate receivers, such as email, Slack, or a custom webhook. It also takes care of silencing and inhibiting alerts if needed.
The best way to understand how the Alertmanager works is to see it in action, so let's look at an example of setting it up and using it to configure alerting for your Prometheus metrics.
Getting started with Alertmanager
Before using Alertmanager to handle alerts, you need a monitoring setup with Prometheus.
This section will walk you through setting up Prometheus to monitor a Linux server using Node Exporter, and configuring an alert that notifies you if the server goes down.
To get started, create a new directory anywhere on your filesystem to place the necessary configuration files:
mkdir prometheus-alertmanager
cd prometheus-alertmanager
Then create a docker-compose.yml file within this directory with the following contents:
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.rootfs=/rootfs
      - --path.sysfs=/host/sys
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --web.enable-lifecycle
    expose:
      - 9090
    ports:
      - 9090:9090
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    expose:
      - 9093
    ports:
      - 9093:9093
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
This file defines a monitoring stack consisting of three services: node-exporter, prometheus, and alertmanager.
The node-exporter service collects hardware and operating system metrics from the host machine, while prometheus scrapes these metrics and stores them. It also exposes a web interface for querying and visualizing the collected data. Finally, alertmanager receives alerts sent by prometheus and forwards them to the configured notification channels. The services are all connected through a bridge network called monitoring to enable communication between them.
Next, you must configure Prometheus to scrape the node-exporter service and send alerts to Alertmanager by providing a prometheus.yml configuration file in the current directory:
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: prometheus
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'localhost:9090'

  - job_name: node-exporter
    static_configs:
      - targets:
          - 'node-exporter:9100'

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
This configuration file defines a global scrape interval of 10 seconds. It then specifies that Prometheus should monitor itself and the node-exporter service. The rule_files entry specifies that Prometheus can find alert rules in the /etc/prometheus/alerts.yml file, and the alerting section sends any triggered alerts to an Alertmanager instance running at alertmanager:9093.
Let's go ahead and create an alert rule that is triggered when the up metric for the node-exporter job equals 0. This metric is used within Prometheus to indicate whether the target is reachable (1) or not (0).
Create and configure your alerts.yml file as follows:
groups:
  - name: node_exporter_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter down"
          description: "Node Exporter has been down for more than 1 minute."
Alerts in Prometheus are organized into groups to logically separate and manage them, while the rules property lists the alerting rules within the group.
Here's a breakdown of the rule:
- expr: up{job="node-exporter"} == 0: The alert condition, which can be any PromQL query. It checks whether the up metric with the label job="node-exporter" equals 0, which means the Node Exporter is down (see the sketch after this list for a rule built on a different query).
- for: 1m: Specifies that the alert should only fire if the condition has held for at least one minute, which prevents transient failures from triggering an alert.
- labels: Additional labels attached to the alert, such as its severity.
- annotations: Extra context for the alert. Here, we include a summary and a description to explain it.
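Since the expression can be any PromQL query, you can alert on other conditions in the same way. Here's a minimal sketch of a rule that fires when the root filesystem runs low on space; the metric names come from Node Exporter's filesystem collector, while the rule name, mountpoint filter, and 10% threshold are assumptions to adapt to your environment:

groups:
  - name: node_exporter_alerts
    rules:
      # Hypothetical example rule: fire when less than 10% of the root filesystem is free
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host is running out of disk space"
          description: "Less than 10% of the root filesystem is available."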
Now that you've configured Prometheus to detect and send alerts to Alertmanager, let's set up a receiver that will send you a notification when an alert is triggered.
For demonstration purposes, we'll use Better Stack, which supports receiving alerts from Prometheus and forwarding them to your preferred notification channel (email, Slack, SMS, etc.).
Sign up for a free account, and navigate to the Uptime dashboard.
From the menu on the left, choose Integrations and click on the Importing data tab.
Scroll down to the Infrastructure monitoring section, and find the Prometheus entry, then click Add.
On the next screen, give your integration a suitable name, then click Continue:
Your Prometheus integration will be subsequently created, and you'll find a Prometheus webhook URL on the resulting page. Copy the URL to your clipboard, then scroll down to the bottom of the page and click Save changes:
Using the supplied webhook URL, create a new file called alertmanager.yml inside your prometheus-alertmanager directory with the following contents:
global:
  resolve_timeout: 1m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'betterstack'

receivers:
  - name: 'betterstack'
    webhook_configs:
      - url: '<your_betterstack_webhook_url>'
This alertmanager.yml configuration file defines how Alertmanager should handle alerts. Let's break down the key sections:
- resolve_timeout: The time after which an alert is declared resolved if Alertmanager stops receiving updates for it.
- group_by: Specifies the labels used for grouping alerts. In this case, alerts with the same name will be grouped.
- group_wait: The time to wait before sending the first notification for a new group of alerts.
- group_interval: How long to wait before sending a notification about new alerts added to a group that has already been notified.
- repeat_interval: How long to wait before re-sending a notification for a group that is still firing.
- receiver: Specifies the default receiver to use for alerts, which is betterstack.
The receivers section then configures the betterstack receiver with the webhook URL you copied above.
With your Alertmanager configuration in place, you're now ready to test the setup by bringing up the services with:
docker compose up -d
You should see a similar output to:
. . .
[+] Running 4/4
✔ Network prometheus-alertmanager_monitoring Created 0.2s
✔ Container alertmanager Started 0.7s
✔ Container node-exporter Started 0.6s
✔ Container prometheus Started 0.7s
Open the Prometheus targets interface in your browser at http://localhost:9090/targets. You should see that the Node Exporter target is up and running:
Navigate to http://localhost:9090/alerts to see the alert you created:
Since the Node Exporter service is up, the alert is reported as inactive.
To simulate the server being down, run the command below to stop the node-exporter service:
docker compose stop node-exporter
[+] Stopping 1/1
✔ Container node-exporter Stopped 0.2s
The service will now be reported as "Down" in Prometheus:
If it is not restarted within a minute, the alert status will progressively change from inactive to pending, and then to firing:
You can also inspect the Alertmanager interface at http://localhost:9093 to see the firing alert:
Subsequently, an incident will be created in Better Stack which you can inspect by going to the Incidents menu:
You will also immediately receive an email letting you know that your service has gone down:
You can then take action to resolve the problem. To do this, bring up the node-exporter service once more:
docker compose start node-exporter
[+] Running 1/1
✔ Container node-exporter Started 0.3s
After a minute (as configured), Better Stack will detect that the Node Exporter service is back up again, and the incident will be automatically resolved. You'll get another email notification to this effect:
Now that you've seen how Alertmanager works in practice, let's look at some of its core concepts and possible customizations to help you get the most out of it.
Grouping alerts to reduce noise
Grouping alerts is crucial in environments where many similar alerts could be triggered simultaneously. It allows you to consolidate related alerts into a single notification instead of receiving a barrage of individual notifications.
For example, if you have a microservices architecture where multiple instances of the same service are running, a failure in some instances could trigger individual alerts for each one.
You can configure Alertmanager to group these alerts by service name so that a single notification indicates that the service is experiencing issues, along with a list of affected instances.
In Alertmanager, grouping works by configuring a list of matching labels. Alerts that share the same values for these labels will be combined into a single notification.
You specify these labels in the group_by field of a route; if it is omitted, all alerts handled by that route end up in a single group:
route:
  group_by: ['alertname', 'cluster', 'service']
In the above snippet, alerts with the same alertname, cluster, and service labels will be grouped together. Avoid using high-cardinality labels here, as this could result in a large number of small groups, which defeats the purpose of grouping in the first place.
When Alertmanager receives alerts, it will use your group_by configuration to determine which alerts belong together. It then waits for a period defined by group_wait to allow more related alerts to arrive:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 1m
After the group_wait period, it sends out a single notification representing the alert group.
Customizing alert routing rules
You've already seen the basics of how alerting and notifications work in Alertmanager, but there are a lot more features and customization options available to help you direct alerts to the right people at the right time.
Let's start with the routes configuration. It uses a tree-like structure to match alerts based on various criteria and then direct them to the appropriate receiver:
route:
  receiver: betterstack # root receiver
  routes:
    # All alerts with service=postgresql or service=clickhouse
    # are dispatched to the database Slack channel.
    - receiver: 'database-slack'
      matchers:
        - service=~"postgresql|clickhouse"
    # All alerts with the team=frontend label match this sub-route.
    - receiver: 'frontend-email'
      matchers:
        - team="frontend"
The routing tree starts with a root route, which can have multiple child routes. If an alert doesn't match any child route, it is handled by its parent route, all the way up to the root, so every alert ends up at some receiver.
In the above configuration, the default receiver is betterstack. If an alert doesn't match any of the specific routes defined below it, it will be sent to this receiver.
Then we have two child routes:
- When the service label is set to postgresql or clickhouse, the alert is sent to the database-slack receiver.
- When the team="frontend" label matches, the alert is sent to the frontend-email receiver.
These routes also support custom grouping settings such as:
route:
  routes:
    - receiver: 'betterstack'
      matchers:
        - severity="critical"
      group_wait: 10s # Quicker notification for critical alerts.
      group_interval: 2m
      repeat_interval: 15m
You can use the Routing tree editor to assist you with building routing trees.
The receivers section specifies all the notification channels you're using to receive alerts. Every receiver mentioned in the routes section must be defined here first.
receivers:
  - name: 'betterstack'
    webhook_configs:
      - url: '<webhook_url>'
  - name: 'frontend-email'
    email_configs:
      - to: <example2@gmail.com>
        from: <example@gmail.com>
        smarthost: smtp.gmail.com:587
        auth_username: <example@gmail.com>
        auth_identity: <example@gmail.com>
        auth_password: <app_password>
        send_resolved: true
You can find other supported receiver configurations in the Prometheus documentation.
Silencing alerts
Alert silencing allows you to temporarily mute specific alerts based on predefined criteria. It is particularly useful for avoiding unnecessary notifications during planned maintenance, known issues, or when testing.
A silence in Alertmanager is a set of matchers (conditions) that suppress alerts for a specified period. Alerts that match the silence criteria will not trigger notifications to the configured receivers during the silencing period.
To silence an alert, you can use the Alertmanager web interface. Head over to http://localhost:9093 in your browser, then click on New Silence at the top right of the page:
Fill in the provided form to silence the NodeExporterDown alert. You can use the alertname label to match this alert:
alertname=NodeExporterDown
In general, you should be as precise as possible with your matchers to avoid silencing unrelated alerts, but if you want to silence all alerts, you can use:
alertname=~".+"
You also need to specify how long the silence should last and why the alert is being silenced:
Once you're done, click Create.
You can see what silences are active on the Silences page:
You can test it out by stopping the node-exporter service once again:
docker compose stop node-exporter
[+] Stopping 1/1
✔ Container node-exporter Stopped 0.2
You'll notice that no notifications come through: Alertmanager still receives the alert from Prometheus, but it suppresses the notification for the duration of the silence. This is an excellent way to avoid unnecessary notifications during expected events or known issues.
Just remember to set an appropriate expiry time so your alerts can start coming through again. You can also manually expire a silence before its scheduled end time through the web interface.
Inhibiting alerts
Alert inhibition is a feature that suppresses specific alerts when related higher-priority alerts are already firing.
For example, if you've set up your Prometheus server to warn at 70% memory usage and alert at 90%, you will get two alerts if your application rapidly spikes to 95%. The 70% warning will be suppressed with inhibition since it is redundant when the critical 90% alert is already firing.
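For reference, here's a minimal sketch of what those two Prometheus alerting rules might look like. The memory metrics come from Node Exporter, while the thresholds, durations, and rule names are assumptions chosen to match the inhibition example below:

groups:
  - name: memory_alerts
    rules:
      # Hypothetical warning-level rule: memory usage above 70%
      - alert: MemoryUsageWarning
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 70
        for: 5m
        labels:
          severity: warning
      # Hypothetical critical-level rule: memory usage above 90%
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical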
You can set up alert inhibition rules with the inhibit_rules property as follows:
inhibit_rules:
  - source_matchers:
      - alertname = HighMemoryUsage
    target_matchers:
      - alertname = MemoryUsageWarning
    equal: ['instance', 'job']
The source_matchers property specifies the alerts that will suppress others (the inhibiting alerts), while target_matchers specifies the alerts that will be suppressed. The equal property then defines the labels that must match between the source and target alerts for inhibition to apply.
This essentially means that if an alert with alertname=HighMemoryUsage is firing for a specific instance and job, it will inhibit any MemoryUsageWarning alert for the same instance and job.
To ensure that your inhibition rules work reliably, place the rule that produces the inhibiting alert before the rule it suppresses in your Prometheus alerting configuration:
groups:
  - name: Example group
    rules:
      # The rule that suppresses should come before the rule that is suppressed in each group
      - alert: HighMemoryUsage
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      - alert: MemoryUsageWarning
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"

The corresponding inhibition rule in your Alertmanager configuration then matches on these labels:

inhibit_rules:
  - source_matchers:
      - inhibit="true"
    target_matchers:
      - inhibited="true"
Also, since Prometheus evaluates rule groups concurrently, you cannot rely on rules in one group to inhibit rules in another group. If you need an inhibition rule to work across multiple rule groups, duplicate the rule in each relevant group:
groups:
  - name: Group1
    rules:
      - alert: InhibitingRule
        . . .
      - alert: InhibitedRule
        . . .
  - name: Group2
    rules:
      - alert: InhibitingRule
        . . .
      - alert: SecondInhibitedRule
        . . .
Although the InhibitingRule appears twice here, Alertmanager will recognize the resulting alerts as identical and deduplicate them.
Finally, set up an alert on the prometheus_notifications_dropped_total metric to catch cases where alerts, particularly the inhibiting alerts your inhibition rules depend on, are dropped before they reach Alertmanager.
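Here's a minimal sketch of such a rule; the metric is exposed by Prometheus itself, while the rule name, five-minute window, and severity are assumptions:

groups:
  - name: prometheus_self_monitoring
    rules:
      # Hypothetical rule: fire if Prometheus dropped any alert notifications in the last 5 minutes
      - alert: PrometheusNotificationsDropped
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is dropping alert notifications"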
Customizing notification templates
Alertmanager also allows you to customize and standardize alert notifications with templating.
Instead of relying on the generic defaults, you can write templates that pull data from alerts (labels, annotations, and metadata) and format it for various receivers.
Here's an example that customizes the notification template for an email receiver:
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: |
          Alert Summary:
          {{ range .Alerts }}
          - Alert: {{ .Annotations.summary }}
          - Description: {{ .Annotations.description }}
          - Severity: {{ .Labels.severity }}
          {{ end }}
          View Alert Runbook: https://internal.example.com/wiki/alerts/{{ .GroupLabels.alertname }}
Alertmanager uses the Go templating system and allows you to include all kinds of details. You can also use template functions to format and structure the notification for better readability.
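For example, here's a small sketch that uses a couple of the built-in functions: toUpper to capitalize the status, and len to count the firing alerts in the group. The receiver name and subject wording are illustrative only:

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          # toUpper uppercases the group status; len counts the currently firing alerts
          subject: '[{{ .Status | toUpper }}] {{ .Alerts.Firing | len }} firing: {{ .GroupLabels.alertname }}'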
Instead of specifying the template directly in your configuration file, you can also define it in a separate template file like this:
{{ define "email.myorg.text" }}
Alert Summary:
{{ range .Alerts }}
- Alert: {{ .Annotations.summary }}
- Description: {{ .Annotations.description }}
- Severity: {{ .Labels.severity }}
{{ end }}
View Alert Runbook: https://internal.example.com/wiki/alerts/{{ .GroupLabels.alertname }}
{{ end }}
You can now include the template in your configuration with:
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ template "email.myorg.text" . }}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'
You can find several examples of custom notification templates here.
Managing Alertmanager from the command line
amtool is a command-line utility for interacting with the Alertmanager API. It allows you to manage alerts, silences, and configurations directly from the terminal, making it a powerful tool for testing, debugging, and managing your Alertmanager setup.
You can install it with the Go toolchain using:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
Or you can run it from the prom/alertmanager Docker image like this:
docker run -it --entrypoint=amtool prom/alertmanager:latest
You may need to specify a configuration file at /etc/amtool/config.yml:
alertmanager.url: "http://localhost:9093"
author: John Doe
comment_required: false # for silencing
output: simple
# Set a default receiver
receiver: betterstack
Once installed and configured, you can use it to view all currently firing alerts using:
amtool alert
Alertname Starts At Summary State
NodeExporterDown 2024-11-20 13:36:36 UTC Node Exporter down active
Or filter them with:
amtool alert query severity="critical"
You can also use it to silence alerts matching specific labels:
amtool silence add -d 3h --comment="scheduled downtime" instance="web-server-1"
9f60504f-ffd0-4147-8005-a7dbb45fe6f5
Then you can view active silences and expire them:
amtool silence query
ID Matchers Ends At Created By Comment
b16d0599-03ed-4863-8b8c-77bc2355837b instance="web-server-1" 2024-11-20 16:41:28 UTC Ayo scheduled downtime
amtool silence expire b16d0599-03ed-4863-8b8c-77bc2355837b
Another helpful feature is its ability to test notification templates so that you can ensure the alert notifications are properly formatted.
amtool template render --template.glob='/etc/alertmanager/template/*.tmpl' --template.text='{{ template "email.myorg.text" . }}'
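amtool can also validate your Alertmanager configuration file before you deploy it. The file path below assumes you're running the command from your prometheus-alertmanager directory:

amtool check-config alertmanager.yml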
Feel free to check out the amtool documentation for more usage examples.
Final thoughts
You've now taken the first step towards mastering Prometheus Alertmanager!
If you read through the entire guide, you should now have the foundational knowledge to build a robust alerting system that reduces noise and enhances incident response.
For further exploration, consider diving into the alerting rules and the Alertmanager configuration docs to learn more.
Thanks for reading, and happy monitoring!