Cron job or heartbeat monitoring is an
automated way of checking whether scheduled tasks run correctly. When a cron
job fails, the monitor spots the issue and alerts the right person on the
development team. If your service performs a vital process periodically,
this is the ideal monitoring solution.
In this article, you will learn the following:
What cron job and heartbeat monitoring are and how they work.
An overview of on-call alerting and the incident management process for cron job incidents.
What are the best practices for cron job monitoring.
What are the benefits and drawbacks of using cron job monitoring.
How to set up basic cron job monitoring.
How does cron job monitoring work?
The cron monitoring process works by setting up a remote monitoring service with
a dedicated URL to which the scheduled task sends a GET, HEAD, or POST
request after it has run correctly. This tracking of a system's health by
sending regular requests (heartbeats) is also called heartbeat monitoring. The
terms cron job monitoring and heartbeat monitoring are often used interchangeably.
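As a rough illustration of the job side of this flow, the sketch below wraps a scheduled command so that a heartbeat is sent only when the command exits successfully. The function names `send_heartbeat` and `run_with_heartbeat` and the URL in the usage note are assumptions made for this sketch, not part of any particular monitoring service.

```shell
# Sketch only: function names are hypothetical, not a monitoring service API.

# Send one heartbeat as an HTTP GET request.
# -f treats HTTP error responses as failures, -sS stays quiet but still
# prints errors, --max-time stops the ping from hanging the job.
send_heartbeat() {
  curl -fsS --max-time 10 -o /dev/null "$1"
}

# Run the given command and ping the heartbeat URL only if it succeeded.
# When the command fails, its exit status is returned and no ping is sent.
run_with_heartbeat() {
  local url="$1"
  shift
  "$@" || return $?
  send_heartbeat "$url"
}
```

A call might look like `run_with_heartbeat "https://monitor.example.com/heartbeat/<token>" bash /path/to/job.sh`, where both the URL and the script path are placeholders.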
The heartbeat monitor is set up to expect a heartbeat once every x minutes,
hours, or days. There is also a grace period that ensures that alerting doesn't
start immediately if the job is delayed.
When the monitor receives a heartbeat within the pre-set time window, no action
is taken, and the monitoring continues. However, when no heartbeat arrives
by the expected time, the monitor starts what is called an incident and begins
alerting according to the on-call calendar.
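The check the monitor performs can be sketched as simple arithmetic on timestamps. The function below is an illustrative assumption about how such a check might be expressed, not the implementation of any specific monitoring service; the name `heartbeat_overdue` and its parameters are made up for this example.

```shell
# Sketch only: heartbeat_overdue and its parameters are hypothetical names.
# Arguments (all in seconds, timestamps as Unix epoch):
#   $1 last_seen - time the last heartbeat arrived
#   $2 period    - expected interval between heartbeats
#   $3 grace     - extra time allowed before an incident starts
#   $4 now       - current time
# Exit status 0 means the heartbeat is overdue and an incident should start.
heartbeat_overdue() {
  local last_seen=$1 period=$2 grace=$3 now=$4
  local deadline=$((last_seen + period + grace))
  [ "$now" -gt "$deadline" ]
}
```

With a daily period (86400 seconds) and a five-minute grace period (300 seconds), a heartbeat last seen at time 0 is not overdue at 86700 but is overdue at 86701.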
What is a cron job incident?
A cron job incident is a period of time during which the given monitor doesn’t
receive heartbeats from the monitored service. This situation means that the
monitored service didn’t run correctly, because every correct run sends a heartbeat
to the monitor before finishing, which keeps the monitor from creating an incident.
How to receive cron job incident alerts?
After an incident is spotted by the cron job monitor, it needs to be
communicated to the service admins. This process is called incident alerting or
on-call alerting. In case of an incident, the
person from a team who is currently on-call (has scheduled duty) receives the
alert.
The most common ways of getting alerted by a cron job monitor include
automated phone calls, SMS, Slack, and Microsoft Teams messages. Ways of
alerting depend on factors like the importance of the monitored service, time of
the day, and team preference.
What information do incident alerts include?
The incident alert for cron jobs and heartbeats in general is very basic because
the monitoring provides only simple up/down information. Implementing
logging in the monitored services and forwarding those logs to a
log aggregation tool is a great way of getting
in-depth insights into any potential scheduled job incidents.
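Because the heartbeat itself carries only up/down information, the job can record its own details for later inspection. The sketch below appends timestamped lines to a log file that a log forwarder could pick up; the function name `log_event`, the log path in the usage note, and the line format are all assumptions for illustration.

```shell
# Sketch only: log_event and the line format are hypothetical.
# Append one UTC-timestamped line to the given log file.
#   $1        - path of the log file
#   remaining - words of the message
log_event() {
  local log_file="$1"
  shift
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log_file"
}
```

A job could call `log_event /var/log/backup.log "backup started"` before its work and log the exit code afterwards, giving the team something to read when the heartbeat goes missing.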
Process after receiving an alert? The cron job incident resolution process
After an alert is received, it should be acknowledged immediately. If the alert
is not acknowledged in a specified time frame (usually 3 minutes), the person
next in line on the on-call duty is alerted. This process could continue further
until the whole team is alerted. However, the best practice is to have the
on-call schedule set up in a way that the first team member is always ready to
solve incoming incidents.
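The escalation rule described above can be sketched as integer arithmetic: each time the acknowledgement window passes without a response, the alert moves one position down the on-call list. The function name `escalation_position` and the zero-based indexing are assumptions for this illustration; real tools may escalate differently.

```shell
# Sketch only: escalation_position is a hypothetical helper.
# Arguments:
#   $1 elapsed   - seconds since the incident started without acknowledgement
#   $2 window    - seconds each person has to acknowledge (for example 180)
#   $3 team_size - number of people in the on-call list
# Prints the zero-based position of the last person alerted so far.
escalation_position() {
  local elapsed=$1 window=$2 team_size=$3
  local position=$((elapsed / window))
  # Once everyone has been alerted, stay on the last person in the list.
  if [ "$position" -ge "$team_size" ]; then
    position=$((team_size - 1))
  fi
  echo "$position"
}
```

With a 3-minute window and a team of three, the first person (position 0) is the only one alerted for the first 179 seconds, and the second person (position 1) is alerted at 180 seconds.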
Once the incident is acknowledged, the escalation process is paused and the team
can fully focus on solving it. The time it takes to acknowledge an alert is
called Time to Acknowledge (TTA). Its average across incidents, called
Mean Time to Acknowledge (MTTA), is a widely used incident management metric.
The following steps in the downtime resolution process are individual to
different teams and apps. For larger teams, they can include collaborations
between a few developers or even teams of developers, delegations of incidents
to dedicated team members, and more. There are some best practices that all
teams managing incidents should use. These include incident communication (both
internal and external) and incident post-mortems.
What are the best practices for cron job monitoring?
Human alert tolerance
The heartbeat monitor will create an alert whenever it detects an issue.
However, if the monitor sends an alert (for example, SMS or email) to all team
members about the same incident ten times every day, they will very likely
start ignoring the alerts.
This situation, in which alerts are ignored or not treated with the necessary care, is
called alert fatigue and poses a serious issue. To prevent alert fatigue,
only vital services should be connected to the on-call alerting and notify the
Grace time configuration
Grace time is the short period after a heartbeat was expected
during which no incident is started. This prevents delayed jobs from causing
incidents and also helps to decrease the possibility of alert fatigue. However,
when the grace period is too long, it will also delay alerting in case of
an actual incident, so it needs to be set up carefully.
Synchronise monitor and cron job timezone
In many cases, your server running cron jobs will not be in the same timezone as
the monitoring service. To prevent timezone mismatches and faulty alerting,
both should use the same timezone. The command-line utility timedatectl shows the
server timezone, and monitors typically offer the option to change timezones, so
both can be synced.
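To compare the two settings, the server's timezone first has to be read. The sketch below is one possible way to do that; the helper name `server_timezone` is an assumption, and the fallback to `date` is a choice made here for systems where timedatectl is missing or cannot reach systemd.

```shell
# Sketch only: server_timezone is a hypothetical helper.
# Print the server's timezone: the zone name from timedatectl when it is
# available and working, otherwise the abbreviation reported by date.
server_timezone() {
  timedatectl show --property=Timezone --value 2>/dev/null || date +%Z
}

server_timezone
```

If the printed zone differs from the one configured in the monitor, either side can be changed. On the server, `sudo timedatectl set-timezone UTC` is one option, with UTC used here only as an example.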
Encrypt communication between monitor and cron job
The communication between the service and heartbeat monitor typically uses
HTTP GET or POST methods. The cron job usually includes a unique token
assigned by the monitor in each request. The token is an authorisation measure.
Without an authorisation token, anyone can send a fake heartbeat and your
monitor won't detect an incident. For the token to stay secret, the cron job must use TLS encryption
(HTTPS). Otherwise, anyone able to intercept the traffic can capture your authorisation token.
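One way to guard against accidentally sending the token over plain HTTP is to check the URL scheme before the request is made. The helper below is a minimal sketch; the name `is_https_url` is an assumption for this example.

```shell
# Sketch only: is_https_url is a hypothetical helper.
# Succeeds only when the URL starts with https://, so the token in the
# heartbeat URL is never sent over plain HTTP.
is_https_url() {
  case "$1" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A job could then run `is_https_url "$HEARTBEAT_URL" && curl -fsS "$HEARTBEAT_URL"`, where `HEARTBEAT_URL` is a placeholder variable. curl can also enforce the same restriction itself through its `--proto '=https'` option.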
What are the main benefits and drawbacks of cron job monitoring?
Automated and running continuously: A heartbeat monitoring
tool listens on its dedicated URL continuously,
and once set up it needs little to no maintenance, while still providing the same
Simple to set up and use: Heartbeat monitors for any service can be set up
in minutes while providing the incident information right from the start.
Since it provides simple up/down information it can be applied widely across
different services and use cases.
Limited incident cause reporting: Heartbeat monitoring lacks the
information that could answer why the incident happened, since it only
monitors the final output and not the actual workings of the service. To get a
better idea about the root cause, application performance management (APM) or
a log management service needs to be used.
Custom code dependency: Since the sending of the heartbeat needs to be
custom coded into a given script or app, there is a possibility for error and
misconfiguration. This is why any heartbeat setup needs to be checked
Where does cron job monitoring fit in the synthetic monitoring setup?
Synthetic monitoring also offers monitoring options like checking an
API, DNS or Transaction
How to start cron job monitoring in 5 minutes with Better Uptime?
Better Uptime is an infrastructure monitoring tool
that offers cron job monitoring. Here is how to get notified whenever a service
fails to run correctly. Let’s set it up to get alerted whenever a database
backup fails.
Let’s say that to do the database backup you would run the following script:
$ bash /database/backup/script
Now, you can create a cron job by executing the crontab utility with the -e parameter:
$ crontab -e
The -e option is used to edit the crontab file using your default
text editor. The file will open in that editor. At the end of the
file append the following line of code (make sure to copy your heartbeat URL and
replace it in the code below):
We set up a heartbeat interval of 1 day, so we must schedule the cron job with the
same period. The cron expression for that is 0 0 * * *. The curl
utility sends the heartbeat only if the backup script runs successfully.
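The crontab entry described above can be sketched as follows. This is an assumption about its shape based on the description, not the exact line from the tool's documentation: the helper name `crontab_entry`, the curl flags `-fsS`, and the example URL are all illustrative, and the URL must be replaced with the heartbeat URL copied from the monitor.

```shell
# Sketch only: crontab_entry is a hypothetical helper and the URL passed to it
# is a placeholder for the heartbeat URL copied from the monitor.
# Prints a crontab line that runs the backup daily at midnight and sends the
# heartbeat only if the backup script exits successfully (&&).
crontab_entry() {
  printf '0 0 * * * bash /database/backup/script && curl -fsS "%s"\n' "$1"
}

crontab_entry "https://example.com/your-heartbeat-url"
```

The printed line is what would be appended to the file opened by `crontab -e`. The `&&` operator is what ties the heartbeat to a successful backup: curl runs only when the script before it exits with status 0.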
Once the cron job sends the first heartbeat to the monitor, the monitoring will
start, expecting the next request in 24 hours.