🔭 Want to get alerted when your website goes down?
Go to Better Uptime and start monitoring in 2 minutes.
Uptime monitoring is an automated way of checking whether a service such as a website or an application is available. When a service goes down during an outage (downtime), uptime monitoring spots the issue and alerts the right person on the development team.
The uptime monitoring process works by sending automated HTTP requests at a pre-defined frequency to a specific URL and checking for the desired response. HTTP GET requests are usually used. Other HTTP methods can be used as well, for example when monitoring APIs and other functionality. The frequency of the checks depends on the user's needs, but generally ranges from 30 seconds for business websites to 10 or more minutes for hobby projects.
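The check loop described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular monitoring product is implemented. The names `fetch_status` and `run_checks` are hypothetical, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without real network calls or real waiting.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send one HTTP GET and return the status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses still carry a status code.
        return error.code
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return None


def run_checks(
    url: str,
    interval_seconds: float,
    max_checks: int,
    fetch: Callable[[str], Optional[int]] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[int]]:
    """Check `url` `max_checks` times, waiting `interval_seconds` between checks."""
    results: List[Optional[int]] = []
    for attempt in range(max_checks):
        results.append(fetch(url))
        if attempt < max_checks - 1:
            sleep(interval_seconds)
    return results
```

Calling `run_checks("https://example.com", 30, 3)` would send three GET requests 30 seconds apart and return the three status codes. A real monitor would run indefinitely and from several locations, which this sketch leaves out.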
The desired response from the monitored URL is the 200 OK HTTP response code (other codes might be acceptable in specific cases). An uptime monitor can also be set to look for a desired keyword in the response. This is often used in health checks or to verify that a critical part of the website is displayed correctly. For instance, keywords like signup or subscribe are often checked to ensure that the most valuable user actions are working as they should.
When the correct code or keyword is received from a URL, no further action is taken and the monitoring continues. When a different code is returned (any of the 5xx server errors, for example), the monitor starts what is called a downtime incident and begins alerting according to the on-call calendar.
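One way to picture this behaviour is as a small state machine: a passing check changes nothing, and the first failing check opens an incident. The sketch below is an illustration only. The `Monitor` class and its fields are hypothetical, and closing the incident on the next passing check is an assumption that the text above does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    url: str
    expected_status: int = 200
    incident_open: bool = False
    events: List[str] = field(default_factory=list)

    def record(self, status_code: Optional[int]) -> None:
        """Record one check result and open or close an incident if needed."""
        ok = status_code == self.expected_status
        if not ok and not self.incident_open:
            # First failing check: start a downtime incident (alerting would begin here).
            self.incident_open = True
            self.events.append(f"incident started (got {status_code})")
        elif ok and self.incident_open:
            # Assumption: the incident is closed once the expected code returns.
            self.incident_open = False
            self.events.append("resolved")
        # In every other case nothing changes and monitoring simply continues.
```

Feeding the codes 200, 500, 503, 200 into `record` produces exactly two events: one incident start on the 500 and one resolution on the final 200. The 503 in between does not open a second incident.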
A keyword monitor checks for the presence or absence of the desired keyword in the HTML of the monitored URL. Since the whole HTML code is checked, the keyword monitor can also check for specific parts of the markup, like a particular <div> element or a signup button.
It’s recommended to use keyword monitoring instead of simple response code monitoring as the default for uptime checks. This is because the keyword check catches the situation where a non-error response code is returned but the page content is displayed incorrectly.
Checking a keyword as well adds an extra layer of protection, because it lets you verify any key component of a given URL. Such components can be a call to action, like a subscribe-to-newsletter button, or the title of a blog post.
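A keyword check of this kind can be as simple as a substring test on the returned HTML. The sketch below assumes a plain, case-sensitive match. The function name `keyword_check` and the `must_be_present` flag are hypothetical, and real tools may normalise case or whitespace differently.

```python
def keyword_check(html: str, keyword: str, must_be_present: bool = True) -> bool:
    """Return True when the page passes the keyword check.

    With must_be_present=True the keyword has to appear in the HTML.
    With must_be_present=False the check passes only when it is absent,
    which is useful for error strings that should never be shown.
    """
    found = keyword in html
    return found if must_be_present else not found


# A page that answers 200 OK can still fail the keyword check.
page = '<html><body><div id="signup"><button>Subscribe</button></div></body></html>'
broken_page = "<html><body>Database error</body></html>"

subscribe_ok = keyword_check(page, "Subscribe")
signup_div_ok = keyword_check(page, '<div id="signup">')
broken_page_ok = keyword_check(broken_page, "Subscribe")
```

Because the match runs against the raw markup, the same function covers both visible text such as Subscribe and structural fragments such as a specific `<div>`.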
A downtime incident (or simply downtime) is a period of time during which a given service is unavailable. Any users trying to use the service during the downtime will see the website's error page or an error page generated by their browser. GitHub, for example, serves a custom 500 error page.
A downtime incident can also be a situation where the request sent by the monitor doesn’t receive a response within a given time frame. The request timeout can be anywhere from 2 seconds to 1 minute, depending on the priority of the monitor. Setting the monitor sensitivity correctly is key to avoiding large numbers of false-positive alerts.
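To make the idea of sensitivity concrete, the sketch below treats a check as failed when no response arrives within the timeout, and opens an incident only after a number of consecutive failures. Requiring consecutive failures is an assumption used here to illustrate one common way of reducing false positives. The text above only states that a timeout exists and that sensitivity matters. All names are hypothetical.

```python
from typing import Optional, Sequence


def timed_out(response_seconds: Optional[float], timeout_seconds: float) -> bool:
    """A check times out when no response arrived (None) or it arrived too late."""
    return response_seconds is None or response_seconds > timeout_seconds


def should_open_incident(
    recent_response_seconds: Sequence[Optional[float]],
    timeout_seconds: float,
    required_failures: int = 2,
) -> bool:
    """Open an incident only if the last `required_failures` checks all timed out."""
    if len(recent_response_seconds) < required_failures:
        return False
    latest = recent_response_seconds[-required_failures:]
    return all(timed_out(seconds, timeout_seconds) for seconds in latest)
```

With a 5 second timeout, the history `[0.4, 6.0]` does not open an incident, because only the last check was slow, while `[0.4, 6.0, None]` does. A shorter timeout or `required_failures=1` makes the monitor more sensitive and more likely to alert on brief network blips.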
After an incident is spotted by the uptime monitoring tool, it needs to be communicated to you. This process is called incident alerting or on-call alerting. On-call (or the on-call calendar) is a schedule of duties that defines which team member is responsible for incoming incidents.
The most common ways of getting alerted by an uptime monitor are automated phone calls, SMS, Slack, and Microsoft Teams messages. The alerting method depends on factors like the importance of the monitored service, the time of day, and team preference. For example, push notifications or emails are generally used for less vital monitors.
Downtime incident alerts include information about which monitor went down and when. They also include information about the error that triggered the incident, specifically the received response (see example from Twitter below) and a screenshot of the site. Screenshots can’t be taken everywhere, but in the case of website monitoring they offer great insight into what went wrong and what customers experienced.
Downtime alerts also include calls to action for the on-call person. These usually include the option to acknowledge or view the incident.
After an alert is received, it should be acknowledged immediately. If the alert is not acknowledged within a specified time frame (usually 3 to 5 minutes), the next person in line on on-call duty is alerted. This process can continue until the whole team is alerted. The best practice, however, is to set up the on-call schedule so that the first team member is always ready to solve incoming incidents.
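The escalation rule can be expressed as a small function: given the on-call order and how long an alert has gone unacknowledged, it returns everyone who should have been alerted so far. This is a sketch under simple assumptions, namely a fixed escalation interval and a flat list of people. The function name and the example names are hypothetical.

```python
from typing import List


def alerted_so_far(
    on_call_order: List[str],
    minutes_unacknowledged: float,
    escalation_minutes: float = 5.0,
) -> List[str]:
    """Return the people alerted after `minutes_unacknowledged` without an acknowledgement.

    The first person is alerted immediately. Each further person is added
    once another full escalation interval has passed.
    """
    steps = int(minutes_unacknowledged // escalation_minutes) + 1
    return on_call_order[:steps]
```

For the order `["alice", "bob", "carol"]` with a 5 minute interval, only alice has been alerted after 4.9 minutes, alice and bob after 5 minutes, and the whole team after 10 minutes or more.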
Once the incident is acknowledged, the escalation process is paused and the team can fully focus on solving it. The time it takes to acknowledge an alert is called Time to Acknowledge (TTA). Its average across incidents, called Mean Time to Acknowledge (MTTA), is a widely used incident management metric.
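As a worked example, MTTA is the arithmetic mean of the individual TTA values. The sketch below assumes each incident is stored as a pair of timestamps, the moment the alert was sent and the moment it was acknowledged. The function name and the data layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_acknowledge(
    incidents: List[Tuple[datetime, datetime]],
) -> timedelta:
    """Average the (acknowledged_at - alerted_at) gap over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (acknowledged_at - alerted_at for alerted_at, acknowledged_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Two incidents acknowledged after 2 and 4 minutes give an MTTA of 3 minutes.
start = datetime(2024, 1, 1, 12, 0, 0)
example_mtta = mean_time_to_acknowledge(
    [
        (start, start + timedelta(minutes=2)),
        (start, start + timedelta(minutes=4)),
    ]
)
```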
The next steps in the downtime resolution process differ between teams and apps. For larger teams, they can include collaboration between a few developers or even teams of developers, delegation of incidents to dedicated team members, and more. There are some best practices that all teams managing incidents should follow. These include incident communication (both internal and external) and incident post-mortems.
Uptime monitoring is a fully automated process that can run as often as every 30 seconds, which helps to discover any issues right away. In a best-case scenario, any downtime is fixed quickly, keeping the number of affected users to a minimum.
By running consistently over a long period of time, uptime monitoring gives a unique insight into an app's performance, specifically uptime and latency. This set of historical data allows you to benchmark against competitors or older versions of the same app or product.
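Both metrics fall out of the stored check history. The sketch below assumes each check is recorded as a pair of a success flag and a latency in milliseconds, and it approximates uptime as the share of successful checks. Those choices, along with the function names, are assumptions for illustration.

```python
from statistics import mean
from typing import List, Optional, Tuple

# One stored check: (was_up, latency_ms). Latency is None when no response arrived.
Check = Tuple[bool, Optional[float]]


def uptime_percent(checks: List[Check]) -> float:
    """Share of successful checks, as a percentage."""
    if not checks:
        raise ValueError("no checks recorded")
    successful = sum(1 for was_up, _ in checks if was_up)
    return 100.0 * successful / len(checks)


def average_latency_ms(checks: List[Check]) -> Optional[float]:
    """Mean latency of successful checks, or None if there were none."""
    latencies = [latency for was_up, latency in checks if was_up and latency is not None]
    return mean(latencies) if latencies else None
```

For a history of four checks where three succeeded with latencies of 200, 400, and 300 ms, `uptime_percent` returns 75.0 and `average_latency_ms` returns 300. Running the same functions over two time windows, or over two versions of an app, gives the kind of comparison described above.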
Service level agreements (SLAs) are an essential part of enterprise offerings for many software businesses. Outbidding a competitor with better availability can play a decisive role in the sales process.
Vendors can use uptime monitoring to arm themselves with data showing adherence to their SLAs, while their clients can do the same to claim penalties when the SLAs are not met.
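An availability SLA translates directly into a downtime budget, which makes adherence easy to verify from monitoring data. The sketch below assumes the SLA is expressed as an uptime percentage over a fixed number of days. The function names are hypothetical, and real contracts often define exclusions such as planned maintenance that are not modelled here.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int) -> float:
    """Downtime budget implied by an uptime SLA over the given period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def sla_met(sla_percent: float, period_days: int, measured_downtime_minutes: float) -> bool:
    """True when the measured downtime stays within the SLA budget."""
    return measured_downtime_minutes <= allowed_downtime_minutes(sla_percent, period_days)
```

A 99.9% SLA over 30 days allows about 43.2 minutes of downtime, since 30 days is 43,200 minutes and 0.1% of that is 43.2. Twenty minutes of measured downtime would meet that SLA, while sixty minutes would breach it.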
Integrations like payment processing, site search, recommendation plugins, CDNs, CRMs, or analytics are integral to many modern applications.
Monitoring their functionality is necessary to account for any performance degradations or downtime incidents. It is also essential both for incident communication to your users and for holding your vendors accountable. Although some vendors have public status pages, like status.hubspot.com, it's always better to double-check.
Uptime monitoring is the main, but not the only, part of the synthetic monitoring toolbox. When it comes to website monitoring, uptime checks are ideally accompanied by SSL certificate checks and domain expiration checks to prevent security issues and the loss of valuable business assets, respectively.
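An SSL certificate check can be sketched with Python's standard `ssl` module: read the certificate's `notAfter` field and compute how many days remain. This is a minimal illustration rather than a complete check. The function names are hypothetical, and the warning threshold you would compare the result against is left to the reader.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect over TLS and return the certificate's notAfter field.

    Note: with the default context, a certificate that has already expired
    fails verification during the handshake and raises ssl.SSLError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days between `now` (default: the current UTC time) and the expiry date."""
    now = now or datetime.now(timezone.utc)
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400
```

A monitor could call `days_until_expiry(fetch_not_after("example.com"))` once a day and alert when the result drops below a chosen number of days. Domain expiration needs a different data source, such as WHOIS or RDAP, and is not covered by this sketch.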
For more information, explore the Better Uptime docs.