🔭 Want to get alerted when your application goes down?
Head over to Better Uptime start monitoring your jobs in 2 minutes
Health checks are vital to maintaining the performance and reliability of systems and applications. Whether it's a server, a network, or an application, regular health checks can help identify problems before they become serious and ensure that systems are running at peak performance.
They typically involve sending regular requests to a server or application to check that it is responding as expected and that it is functioning properly. The results of these health checks can be used to identify problems with the application and to take appropriate action, such as restarting the application or rolling back to a previous version.
In this article, we will explore the various types of health checks that can be performed, the benefits of regularly performing health checks, and some best practices for implementing health checks in your organization. So if you want to ensure the health and stability of your systems and applications, read on to learn more about health checks.
Head over to Better Uptime start monitoring your jobs in 2 minutes
The simplest health check endpoint that could work is one that returns a 200 OK response. This is the bare minimum check you can implement to confirm that your application is alive and able to handle incoming requests.
app.get('/health', async (req, res) => {
console.log('Health Check Request');
res.status(200).send();
});
However, a positive response doesn't indicate that your application is functioning optimally since no specific health checks are being performed in the endpoint. In the following sections, we will discuss a few ways to measure application health and some best practices to follow when implementing said checks.
Server health can be measured in several ways and at various points in the system. Some health checks can accurately determine if a specific server is malfunctioning, but others can give false positives if you're not careful. Each type of health check has its benefits so you must investigate the most relevant ones for your situation and implement them accordingly to enable quick failure detection and resolution.
The specific health checks you need to perform will depend on the type of application and its operating environment. That said, there are quite a few checks that are applicable in most scenarios, and they are listed below:
These confirm that the application is listening on the expected port and is able
to accept and respond to new TCP connections. Servers that run for extended
periods may eventually break, and these checks are essential to detect such
situations so that it is promptly remedied (usually by restarting). You may
perform liveness checks under a /health/live
endpoint:
app.get('/health/live', async (req, res) => {
res.status(200).send();
});
Readiness checks ensure that the application instance is ready to receive requests. This can usually be combined with the liveness check, but it is useful to distinguish the two to easily detect situations where the application is running but temporarily unable to serve traffic (due to some long-running startup tasks or similar). A readiness check will prevent traffic from reaching a server that is not ready.
app.get('/health/ready', async (req, res) => {
let isReady = false;
// anything you need to verify before confirming your application is ready
// to start handling incoming requests.
if (isReady) {
res.status(200).send();
} else {
res.status(503).send();
}
});
Local server resources that affect an application's ability to function should also be checked to confirm healthy status. This includes memory, disk, and CPU usage, read and write permissions, and more. You need to define your healthy thresholds and then return a status report based on those limits. The methods for retrieving such information will also vary depending on the application environment as some do not allow direct system calls.
Applications often require supporting processes to function, and failures involving such processes can be hard to detect. Therefore, a regular health check should be performed to ensure that missing or malfunctioning support processes can be quickly detected and fixed.
Dependency health checks confirm that an application can interact with core dependencies such as databases, cache layer, or external APIs. Anything that is crucial to the function of your application should be checked to confirm availability and normal function before before your application is considered healthy.
The example below checks the connection to the application's database and cache layer to ensure that they are up and running:
app.get('/health/dbs', async (req, res) => {
try {
await redisClient.ping();
await db.query('SELECT 1');
res.status(200).send();
} catch (err) {
console.error(err);
res.status(503).send();
}
});
To avoid overloading the database and degrading its performance, it's best to
choose a quick query when checking a database connection. In many cases, there
is no need to run a test query if you can establish a successful connection to
the database. If you do need to run a query, use a simple SELECT
query like
SELECT 1
.
An anomaly is anything that differs from the expected norm or baseline. This could include a sudden increase in error rates from an application instance, a surge in in-flight messages that are yet to be resolved, a sudden increase or decrease in response times (relative to other instances of the application), or even servers that are running an older and incompatible version of the application.
Checking for anomalies is useful when your application is deployed to a cluster of servers. Such checks compare the behavior of all the servers in the cluster so that potential anomalies are identified and recovery strategies are applied to the offending server.
Once your health checks are in place, you need to choose a reporting or recovery path (or both) for when a specific check fails. Usually, the failing server must be acted upon by a central system that will decide the appropriate course of action for the particular failure depending on how tolerant the service should be to such failures.
There are several strategies for responding to health check failures. This section will evaluate a few of them.
Oftentimes, the right thing to do for a failing server/application instance is to restart it immediately or replace it with a fresh instance. This is useful when the application fails a liveness check for example. This strategy can usually fix the problem and may not require any further action.
Take care not to restart your application like crazy when something bad happens as that can make a bad situation worse. You can use the exponential backoff algorithm to incrementally increase the time between restarts to reduce the pressure on your dependencies. If the application fails even after multiple restarts, human intervention may be required to fix the issue.
If a health check failure is caused by a problem local to a specific server, temporarily taking it out of service can be an excellent solution to help it recover from the failure before being restored again. Such strategy should be implemented using the Circuit Breaker pattern where traffic is cut to the server for a specified period. Once this period expires, a limited number of requests are allowed to pass through. If they succeed, the server is allowed to resume regular operation. Otherwise, the timeout period is extended. If the server cannot automatically recover after some time, further escalation may be necessary.
Service degradation is a strategy to fix failing health checks where your
application degrades its quality of service instead of shutting down entirely
after failing a health check. For example, if your database INSERT
s become too
slow, you might fall back to read-only mode until the issue with your INSERT
s
are fixed. This might make users temporarily unable to create new resources on
the service, but they'll be able to view existing ones just fine.
This article from the Netflix Team
explains how they made their API more resilient through fallbacks and service
degradation.
Taking a server out of operation for a health check failure only makes sense when the cause of failure is localized to the server alone. If all the servers are failing health checks at the same time (perhaps due to a failing shared dependency), such automation can take down the entire service! In cases where continued operation of the service is deemed more important than the failing checks, you can employ fail-open behavior, which ensures that traffic continues to be routed to all servers as long as they can continue to perform useful work despite the failing checks.
When a health check failure cannot be automatically resolved, human intervention is often required to investigate and address the problem. To ensure that such failures are detected and resolved quickly, you must set up an appropriate service and configure it so that alerts on such failures reach you promptly.
For instance, Better Uptime can deliver prompt alerts through phone calls, SMS, Slack, email, push notifications, and more. It also provides status pages for communicating downtime, and it integrates with several monitoring and calendar tools to facilitate incident management and on-call scheduling, respectively. For further resources on incident management, check the most used incident management tools.
We've already ascertained the importance of regular health checks and some strategies to mitigate failures. However, there are also certain best practices that you should follow when implementing health checks to ensure that they are effective and efficient. This section will explore some of these best practices and discuss why they are useful guidelines to follow.
When performing health checks on application dependencies, it is best to separate the checks for mandatory dependencies for the service from the optional ones. This helps avoid a disproportionate reaction (such as removing servers from service) when a non-critical dependency is failing.
A cascading failure is one of the most common causes of service outages. It occurs when a failing dependency leads to the loss of an entire system by progressively toppling service instances one after the other until there are no healthy instances available.
For example, when servers are taken out of service due to a failing check, it can create a downward spiral as the remaining servers must handle a greater share of the incoming traffic. This increases the likelihood that they will also become overloaded and fail health checks, which can eventually take the entire service down.
Some common strategies to avoid cascading failures include the following:
To ensure that your applications respond promptly to health check requests, you
can perform the checks in the background and update an isHealthy
flag that is
subsequently used in the request handler logic. This pattern will allow server
instances to respond promptly to health checks, but you must take care to also
monitor the health of the background task as failure there could lead to
outdated status reports.
When implementing health checks for dependencies, take care to guard against short and intermittent failures using techniques like retries, fallbacks, timeouts, and the circuit breaker pattern. This can help to prevent false positives where the failure is due to intermittent network issues or similar. Without such guards, such failures can ripple through the service and ultimately cause a system-wide outage.
Health checks can be broadly categorized into two: shallow and deep. Shallow health checks are typically quick and simple tests that are used to verify that a service or dependency is available for use. These checks typically involve sending a simple request to an API and checking for a response, or verifying that an API's endpoint is accessible. They are often performed frequently, such as every minute, in order to ensure that the subject under test is available and functioning properly.
On the other hand, deep health checks are more comprehensive tests that are used to verify the functionality of a dependency and ensure that is providing the correct function to its dependants. Due to their expensive nature and susceptibility to false positives, they are typically run less often compared to shallow checks.
Separating shallow and deep health checks is crucial because it allows developers and sysadmins to monitor the performance and reliability of an API more effectively. For example, performing deep checks on a system that is trying to recover from an outage can end up counterproductive. In such scenarios, you can start with shallow checks and enable deeper checks when the system stabilizes.
Carefully planning the granularity of measurements is essential to ensure that the data collected is useful and cost-effective. For example, if you're targeting 99.9% annual uptime for your API, probing for a readiness check more than once a minute is probably unnecessary. On the other hand, measuring CPU load over the same period may not reveal short-term spikes that contribute to high latency.
To optimize monitoring efforts, it may be helpful to internally sample measurements on the server and then configure an external system to collect and aggregate the data over time or across servers. This can allow for high-resolution monitoring without incurring high costs for collection and analysis.
To ensure that your health check endpoints always provide the most accurate and up-to-date status, you must disable any form of caching on then. When caching is enabled, an outdated status report may be returned. In addition, by serving every request to the health check endpoint in real-time, you can ensure that the endpoint always provides the most current status of your services.
Better Uptime lets you monitor your application health directly from its web interface, and it offers built-in incident management and alerting so you can quickly detect failures.
In its Monitors section, you can add one or more health check endpoints and configure how they should be monitored. For example, you could indicate downtime when the response from the endpoint is outside the 2XX range, or if a keyword is present or absent in its response.
After determining what a failing check looks like for the endpoint, you can configure alerting and get automated notifications when the status of your health checks change. It supports phone calls, SMS, email, and instant push notifications (via the Better Uptime Android/iOS app), but you can also post new incidents to platforms like Slack, or create a custom integration through its webhook feature.
It also supports configuring on-call duties through a calendar interface, such that only the specific on-call person is notified when a health check is failing. You can also configure an incident escalation policy such that if the current on-call person doesn't acknowledge the incident within some minutes, other team members are promptly notified.
Software can fail for various reasons ranging from bugs in the software itself to hardware issues. Multiple layers of checks and monitoring are necessary to catch all types of failures. When failures occur, it is important to detect them quickly and apply the right strategies to correct the issue. It is also necessary to have safeguards in place to prevent automated responses from causing further problems, and to involve humans in uncertain situations.
There's a lot more to be said about setting up and monitoring health checks but we hope the information in this article will get you more comfortable with designing a health checks strategy for your production applications and monitoring them accordingly so that if something unexpected happens, you will know exactly where to look.
Thanks for reading, and happy monitoring!
Are you a developer and love writing and sharing your knowledge with the world? Join our guest writing program and get paid for writing amazing technical guides. We'll get them to the right readers that will appreciate them.
Write for usWrite a script, app or project on top of Better Stack and share it with the world. Make a public repository and share it with us at our email.
community@betterstack.comor submit a pull request and help us build better products for everyone.
See the full list of amazing projects on github