Effective Alerting with Prometheus Alertmanager
In today's dynamic application environments, effective monitoring and alerting are non-negotiable for maintaining the reliability of your systems.
While Prometheus offers robust metrics collection and alerting capabilities, managing these alerts at scale requires a dedicated system. That's where Alertmanager comes in.
This guide provides a practical walkthrough of Alertmanager, from its basic configuration to advanced techniques like grouping and inhibition that help reduce alert fatigue.
By reading it to the end, you'll equip yourself with the knowledge to build an alerting system that cuts through the noise and empowers rapid incident response.
Let's get started!
Prerequisites
- Prior knowledge of Prometheus monitoring concepts.
- A recent version of Docker and Docker Compose installed on your system.
What is Prometheus Alertmanager?
The Alertmanager is the central alerting component of the Prometheus ecosystem. It is designed to handle alerts generated by Prometheus servers, which monitor various systems and applications.
These servers are typically configured with alerting rules that define conditions for triggering alerts. When these conditions are met, Prometheus sends alerts to Alertmanager.
Once Alertmanager receives an alert from Prometheus, it processes it based on its configuration. This usually involves grouping, deduplication, and routing to the appropriate receivers, such as email, Slack, or a custom webhook. It also takes care of silencing and inhibiting alerts if needed.
The best way to understand how the Alertmanager works is to see it in action, so let's look at an example of setting it up and using it to configure alerting for your Prometheus metrics.
Getting started with Alertmanager
Before using Alertmanager to handle alerts, you need a monitoring setup with Prometheus.
This section will walk you through setting up Prometheus to monitor a Linux server using Node Exporter, and configuring an alert that notifies you if the server goes down.
To get started, create a new directory anywhere on your filesystem to place the necessary configuration files:
mkdir prometheus-alertmanager
cd prometheus-alertmanager
Then create a docker-compose.yml file within this directory with the following contents:
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.rootfs=/rootfs
      - --path.sysfs=/host/sys
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --web.enable-lifecycle
    expose:
      - 9090
    ports:
      - 9090:9090
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    expose:
      - 9093
    ports:
      - 9093:9093
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
This file defines a monitoring stack consisting of three services: node-exporter, prometheus, and alertmanager.
The node-exporter service collects hardware and operating system metrics from the host machine, while prometheus scrapes these metrics and stores them. It also exposes a web interface for querying and visualizing the collected data. Finally, alertmanager receives alerts sent by prometheus and forwards them to the configured notification channels. The services are all connected through a bridge network called monitoring to enable communication between them.
Next, you must configure Prometheus to scrape the node-exporter service and send alerts to Alertmanager by providing a prometheus.yml configuration file in the current directory:
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: prometheus
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'localhost:9090'

  - job_name: node-exporter
    static_configs:
      - targets:
          - 'node-exporter:9100'

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
This configuration file defines a global scrape interval of 10 seconds. It then specifies that Prometheus should monitor itself and the node-exporter service. The rule_files entry specifies that Prometheus can find alert rules in the /etc/prometheus/alerts.yml file, and the alerting section sends any triggered alerts to an Alertmanager instance running at alertmanager:9093.
Let's go ahead and create an alert rule that is triggered when the up metric for the node-exporter job equals 0. This metric is used within Prometheus to indicate whether the target is reachable (1) or not (0).
Create and configure your alerts.yml file as follows:
groups:
  - name: node_exporter_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter down"
          description: "Node Exporter has been down for more than 1 minute."
Alerts in Prometheus are organized into groups to logically separate and manage them, while the rules property lists the alerting rules within the group.
Here's a breakdown of the rule:
- expr: up{job="node-exporter"} == 0: The alert condition, which can be any PromQL query. It checks whether the up metric with the label job="node-exporter" equals 0, which means the Node Exporter is down (see the sketch after this list for a rule built on a different query).
- for: 1m: Specifies that the alert should only fire if the condition has held for at least one minute, which prevents transient failures from triggering an alert.
- labels: Additional labels attached to the alert, such as its severity.
- annotations: Extra context for the alert. Here, we include a summary and a description to explain it.
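Since the expression can be any PromQL query, you can alert on other conditions in the same way. Here's a minimal sketch of a rule that fires when the root filesystem runs low on space; the metric names come from Node Exporter's filesystem collector, while the rule name, mountpoint filter, and 10% threshold are assumptions to adapt to your environment:

groups:
  - name: node_exporter_alerts
    rules:
      # Hypothetical example rule: fire when less than 10% of the root filesystem is free
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host is running out of disk space"
          description: "Less than 10% of the root filesystem is available."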
Now that you've configured Prometheus to detect and send alerts to Alertmanager, let's set up a receiver that will send you a notification when an alert is triggered.
For demonstration purposes, we'll use Better Stack, which supports receiving alerts from Prometheus and forwarding them to your preferred notification channel (email, Slack, SMS, etc.).
Sign up for a free account, and navigate to the Uptime dashboard.
From the menu on the left, choose Integrations and click on the Importing data tab.
Scroll down to the Infrastructure monitoring section, and find the Prometheus entry, then click Add.
On the next screen, give your integration a suitable name, then click Continue:
Your Prometheus integration will be subsequently created, and you'll find a Prometheus webhook URL on the resulting page. Copy the URL to your clipboard, then scroll down to the bottom of the page and click Save changes:
Using the supplied webhook URL, create a new file called alertmanager.yml inside your prometheus-alertmanager directory with the following contents:
global:
  resolve_timeout: 1m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'betterstack'

receivers:
  - name: 'betterstack'
    webhook_configs:
      - url: '<your_betterstack_webhook_url>'
This alertmanager.yml configuration file defines how Alertmanager should handle alerts. Let's break down the key sections:
- resolve_timeout: The time after which an alert is declared resolved if Alertmanager stops receiving updates for it.
- group_by: Specifies the labels used for grouping alerts. In this case, alerts with the same name will be grouped.
- group_wait: The time to wait before sending the first notification for a new group of alerts.
- group_interval: How long to wait before sending a notification about new alerts added to a group that has already been notified.
- repeat_interval: How long to wait before re-sending a notification for a group that is still firing.
- receiver: Specifies the default receiver to use for alerts, which is betterstack.
The receivers section then configures the betterstack receiver with the webhook URL you copied above.
With your Alertmanager configuration in place, you're now ready to test the setup by bringing up the services with:
docker compose up -d
You should see a similar output to:
. . .
[+] Running 4/4
✔ Network prometheus-alertmanager_monitoring Created 0.2s
✔ Container alertmanager Started 0.7s
✔ Container node-exporter Started 0.6s
✔ Container prometheus Started 0.7s
Open the Prometheus targets interface in your browser at http://localhost:9090/targets. You should see that the Node Exporter target is up and running:
Navigate to http://localhost:9090/alerts to see the alert you created:
Since the Node Exporter service is up, the alert is reported as inactive.
To simulate the server being down, run the command below to stop the node-exporter service:
docker compose stop node-exporter
[+] Stopping 1/1
✔ Container node-exporter Stopped 0.2s
The service will now be reported as "Down" in Prometheus:
If it is not restarted within a minute, the alert status will progressively change from inactive to pending, and then to firing:
You can also inspect the Alertmanager interface at http://localhost:9093 to see the firing alert:
Subsequently, an incident will be created in Better Stack which you can inspect by going to the Incidents menu:
You will also immediately receive an email letting you know that your service has gone down:
You can then take action to resolve the problem. To do this, bring up the node-exporter service once more:
docker compose start node-exporter
[+] Running 1/1
✔ Container node-exporter Started 0.3s
After a minute (as configured), Better Stack will detect that the Node Exporter service is back up again, and the incident will be automatically resolved. You'll get another email notification to this effect:
Now that you've seen how Alertmanager works in practice, let's look at some of its core concepts and possible customizations to help you get the most out of it.
Grouping alerts to reduce noise
Grouping alerts is crucial in environments where many similar alerts could be triggered simultaneously. It allows you to consolidate related alerts into a single notification instead of receiving a barrage of individual notifications.
For example, if you have a microservices architecture where multiple instances of the same service are running, a failure in some instances could trigger individual alerts for each one.
You can configure Alertmanager to group these alerts by service name so that a single notification indicates that the service is experiencing issues, along with a list of affected instances.
In Alertmanager, grouping works by configuring a list of matching labels. Alerts that share the same values for these labels will be combined into a single notification.
You specify these labels in the group_by field of a route; if it is omitted, all alerts handled by that route end up in a single group:
route:
  group_by: ['alertname', 'cluster', 'service']
In the above snippet, alerts with the same alertname, cluster, and service labels will be grouped together. Avoid using high-cardinality labels here, as this could result in a large number of small groups, which defeats the purpose of grouping in the first place.
When Alertmanager receives alerts, it will use your group_by configuration to determine which alerts belong together. It then waits for a period defined by group_wait to allow more related alerts to arrive:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 1m
After the group_wait period, it sends out a single notification representing the alert group.
Customizing alert routing rules
You've already seen the basics of how alerting and notifications work in Alertmanager, but there are a lot more features and customization options available to help you direct alerts to the right people at the right time.
Let's start with the routes configuration. It uses a tree-like structure to match alerts based on various criteria and then direct them to the appropriate receiver:
route:
  receiver: betterstack # root receiver
  routes:
    # All alerts with service=postgresql or service=clickhouse
    # are dispatched to the database Slack channel.
    - receiver: 'database-slack'
      matchers:
        - service=~"postgresql|clickhouse"
    # All alerts with the team=frontend label match this sub-route.
    - receiver: 'frontend-email'
      matchers:
        - team="frontend"
The routing tree starts with a root route, which can have multiple child routes. If an alert doesn't match any child route, it is handled by its parent route, all the way up to the root, so every alert ends up at some receiver.
In the above configuration, the default receiver is betterstack. If an alert doesn't match any of the specific routes defined below it, it will be sent to this receiver.
Then we have two child routes:
- When the service label is set to postgresql or clickhouse, the alert is sent to the database-slack receiver.
- When the team="frontend" label matches, the alert is sent to the frontend-email receiver.
These routes also support custom grouping settings such as:
route:
  routes:
    - receiver: 'betterstack'
      matchers:
        - severity="critical"
      group_wait: 10s # Quicker notification for critical alerts.
      group_interval: 2m
      repeat_interval: 15m
You can use the Routing tree editor to assist you with building routing trees.
The receivers section specifies all the notification channels you're using to receive alerts. Every receiver mentioned in the routes section must be defined here first.
receivers:
  - name: 'betterstack'
    webhook_configs:
      - url: '<webhook_url>'
  - name: 'frontend-email'
    email_configs:
      - to: <example2@gmail.com>
        from: <example@gmail.com>
        smarthost: smtp.gmail.com:587
        auth_username: <example@gmail.com>
        auth_identity: <example@gmail.com>
        auth_password: <app_password>
        send_resolved: true
You can find other supported receiver configurations in the Prometheus documentation.
Silencing alerts
Alert silencing allows you to temporarily mute specific alerts based on predefined criteria. It is particularly useful for avoiding unnecessary notifications during planned maintenance, known issues, or when testing.
A silence in Alertmanager is a set of matchers (conditions) that suppress alerts for a specified period. Alerts that match the silence criteria will not trigger notifications to the configured receivers during the silencing period.
To silence an alert, you can use the Alertmanager web interface. Head over to http://localhost:9093 in your browser, then click on New Silence at the top right of the page:
Fill in the provided form to silence the NodeExporterDown alert. You can use the alertname label to match this alert:
alertname=NodeExporterDown
In general, you should be as precise as possible with your matchers to avoid silencing unrelated alerts, but if you want to silence all alerts, you can use:
alertname=~".+"
You also need to specify how long the silence should last and why the alert is being silenced:
Once you're done, click Create.
You can see what silences are active on the Silences page:
You can test it out by stopping the node-exporter service once again:
docker compose stop node-exporter
[+] Stopping 1/1
✔ Container node-exporter Stopped 0.2
You'll notice that no notifications come through: Alertmanager still receives the alert from Prometheus, but it suppresses the notification for the duration of the silence. This is an excellent way to avoid unnecessary notifications during expected events or known issues.
Just remember to set an appropriate expiry time so your alerts can start coming through again. You can also manually expire a silence before its scheduled end time through the web interface.
Inhibiting alerts
Alert inhibition is a feature that suppresses specific alerts when related higher-priority alerts are already firing.
For example, if you've set up your Prometheus server to warn at 70% memory usage and alert at 90%, you will get two alerts if your application rapidly spikes to 95%. The 70% warning will be suppressed with inhibition since it is redundant when the critical 90% alert is already firing.
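For reference, here's a minimal sketch of what those two Prometheus alerting rules might look like. The memory metrics come from Node Exporter, while the thresholds, durations, and rule names are assumptions chosen to match the inhibition example below:

groups:
  - name: memory_alerts
    rules:
      # Hypothetical warning-level rule: memory usage above 70%
      - alert: MemoryUsageWarning
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 70
        for: 5m
        labels:
          severity: warning
      # Hypothetical critical-level rule: memory usage above 90%
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical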
You can set up alert inhibition rules with the inhibit_rules property as follows:
inhibit_rules:
  - source_matchers:
      - alertname = HighMemoryUsage
    target_matchers:
      - alertname = MemoryUsageWarning
    equal: ['instance', 'job']
The source_matchers property specifies the alerts that will suppress others (the inhibiting alerts), while target_matchers specifies the alerts that will be suppressed. The equal property then defines the labels that must match between the source and target alerts for inhibition to apply.
This essentially means that if an alert with alertname=HighMemoryUsage is firing for a specific instance and job, it will inhibit any MemoryUsageWarning alert for the same instance and job.
To ensure that your inhibition rules work reliably, place the rule that produces the inhibiting alert before the rule it suppresses in your Prometheus alerting configuration:
groups:
  - name: Example group
    rules:
      # The rule that suppresses should come before the rule that is suppressed in each group
      - alert: HighMemoryUsage
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      - alert: MemoryUsageWarning
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"

The corresponding inhibition rule in your Alertmanager configuration then matches on these labels:

inhibit_rules:
  - source_matchers:
      - inhibit="true"
    target_matchers:
      - inhibited="true"
Also, since Prometheus evaluates rule groups concurrently, you cannot rely on rules in one group to inhibit rules in another group. If you need an inhibition rule to work across multiple rule groups, duplicate the rule in each relevant group:
groups:
  - name: Group1
    rules:
      - alert: InhibitingRule
        . . .
      - alert: InhibitedRule
        . . .
  - name: Group2
    rules:
      - alert: InhibitingRule
        . . .
      - alert: SecondInhibitedRule
        . . .
Although the InhibitingRule appears twice here, Alertmanager will recognize the resulting alerts as identical and deduplicate them.
Finally, set up an alert on the prometheus_notifications_dropped_total metric to catch cases where alerts, particularly the inhibiting alerts your inhibition rules depend on, are dropped before they reach Alertmanager.
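Here's a minimal sketch of such a rule; the metric is exposed by Prometheus itself, while the rule name, five-minute window, and severity are assumptions:

groups:
  - name: prometheus_self_monitoring
    rules:
      # Hypothetical rule: fire if Prometheus dropped any alert notifications in the last 5 minutes
      - alert: PrometheusNotificationsDropped
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is dropping alert notifications"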
Customizing notification templates
Alertmanager also allows you to customize and standardize alert notifications with templating.
Instead of relying on the generic defaults, you can write templates that pull data from alerts (labels, annotations, and metadata) and format it for various receivers.
Here's an example that customizes the notification template for an email receiver:
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: |
          Alert Summary:
          {{ range .Alerts }}
          - Alert: {{ .Annotations.summary }}
          - Description: {{ .Annotations.description }}
          - Severity: {{ .Labels.severity }}
          {{ end }}
          View Alert Runbook: https://internal.example.com/wiki/alerts/{{ .GroupLabels.alertname }}
Alertmanager uses the Go templating system and allows you to include all kinds of details. You can also use template functions to format and structure the notification for better readability.
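For example, here's a small sketch that uses a couple of the built-in functions: toUpper to capitalize the status, and len to count the firing alerts in the group. The receiver name and subject wording are illustrative only:

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          # toUpper uppercases the group status; len counts the currently firing alerts
          subject: '[{{ .Status | toUpper }}] {{ .Alerts.Firing | len }} firing: {{ .GroupLabels.alertname }}'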
Instead of specifying the template directly in your configuration file, you can also define it in a separate template file like this:
{{ define "email.myorg.text" }}
Alert Summary:
{{ range .Alerts }}
- Alert: {{ .Annotations.summary }}
- Description: {{ .Annotations.description }}
- Severity: {{ .Labels.severity }}
{{ end }}
View Alert Runbook: https://internal.example.com/wiki/alerts/{{ .GroupLabels.alertname }}
{{ end }}
You can now include the template in your configuration with:
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ template "email.myorg.text" . }}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'
You can find several examples of custom notification templates here.
Managing Alertmanager from the command line
amtool is a command-line utility for interacting with the Alertmanager API. It allows you to manage alerts, silences, and configurations directly from the terminal, making it a powerful tool for testing, debugging, and managing your Alertmanager setup.
You can install it with the Go toolchain using:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
Or you can run it from the prom/alertmanager Docker image like this:
docker run -it --entrypoint=amtool prom/alertmanager:latest
You may need to specify a configuration file at /etc/amtool/config.yml:
alertmanager.url: "http://localhost:9093"
author: John Doe
comment_required: false # for silencing
output: simple
# Set a default receiver
receiver: betterstack
Once installed and configured, you can use it to view all currently firing alerts using:
amtool alert
Alertname Starts At Summary State
NodeExporterDown 2024-11-20 13:36:36 UTC Node Exporter down active
Or filter them with:
amtool alert query severity="critical"
You can also use it to silence alerts matching specific labels:
amtool silence add -d 3h --comment="scheduled downtime" instance="web-server-1"
9f60504f-ffd0-4147-8005-a7dbb45fe6f5
Then you can view active silences and expire them:
amtool silence query
ID Matchers Ends At Created By Comment
b16d0599-03ed-4863-8b8c-77bc2355837b instance="web-server-1" 2024-11-20 16:41:28 UTC Ayo scheduled downtime
amtool silence expire b16d0599-03ed-4863-8b8c-77bc2355837b
Another helpful feature is its ability to test notification templates so that you can ensure the alert notifications are properly formatted.
amtool template render --template.glob='/etc/alertmanager/template/*.tmpl' --template.text='{{ template "email.myorg.text" . }}'
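amtool can also validate your Alertmanager configuration file before you deploy it. The file path below assumes you're running the command from your prometheus-alertmanager directory:

amtool check-config alertmanager.yml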
Feel free to check out the amtool documentation for more usage examples.
Final thoughts
You've now taken the first step towards mastering Prometheus Alertmanager!
If you read through the entire guide, you should now have the foundational knowledge to build a robust alerting system that reduces noise and enhances incident response.
For further exploration, consider diving into the alerting rules and the Alertmanager configuration docs to learn more.
Thanks for reading, and happy monitoring!