👉 New to incident management? Check out the What Is Incident Management? Beginner’s Guide
Looking to formalize your incident management process by picking a professional solution? Or just checking alternatives for tools you are already using?
Signing up for all of them takes too much time, so I have analyzed the most well-known tools for you.
I have tested each according to five main criteria (keep reading to see what those criteria were).
Here are the results: the best incident management tools for 2022 (as per our testing):
- Better Uptime (our solution*)
- Opsgenie (15/25 points)
- PagerDuty (14/25 points)
- Splunk On-Call (13/25 points)
- xMatters (9/25 points)
Let’s take a deeper look at the testing criteria and results.
* In fairness, Better Uptime is a part of Better Stack, so we can’t be entirely without bias. That’s why I omitted Better Uptime from the point review and only rated the other tools on the 25-point scale.
My criteria, methodology & results
Based on my experience, I think that the perfect incident management tool has to score high on four main criteria:
- On-call scheduling: It must be easy to schedule and edit on-call duties
- Alerting: It must provide necessary alerting options with a reasonably simple setup
- Incident lifecycle: It must have the capabilities to collaborate on troubleshooting and solving incidents
- Integrations: It must provide straightforward integration of monitoring, customer support, and other vital tools
In a perfect test, I would also include the testing of the alerts for reliability, actionable information (essentially what information can be passed from monitoring tools into the alert), and specific marketed features such as grouping of similar alerts or AI functionality.
I didn’t have the capacity to integrate monitoring into all of the analyzed tools and check all of the alerts separately. Because of that, I mainly focused on testing the initial setup and on checking commonly used features from the on-call engineer’s perspective.
My methodology for this article was the following: I tested these 4 popular incident management tools on all of those criteria and rated each criterion from 1 to 5.
At the end of each review, I also gave my overall feeling about the tool. I tried to focus on which teams each tool suits best and why it is worth considering when picking a new tool.
1. Better Uptime (our solution)
Better Uptime has very intuitive on-call calendar scheduling: you can edit duties with a drag-and-drop editor or pre-fill the whole calendar based on a specific rotation.
The calendar integration options also offer Gmail and Microsoft Outlook.
The main alerting options for Better Uptime include:
- Automated phone call
- Mobile push notifications
Other alerting options like Slack and Microsoft Teams can be set up via an advanced escalation policy. Once integrated, adding them to new alerting setups takes only a few clicks.
The incident dashboard takes a different approach than most other tools: it has a timeline where team members can be tagged with @, like in Slack. Having a single source for the timeline is very useful, especially when writing post-mortems.
Other vital options like escalations are there. Incidents that come in via monitoring (that is, not manually reported) also include screenshots, error codes, and other details.
Compared to others in the list, Better Uptime doesn’t have an integrations directory that can be accessed without signing up. Better Uptime includes the majority of the vital scheduling and monitoring integrations (Datadog, New Relic, Prometheus, etc.); the rest can be set up via webhooks or Zapier.
Customer support integrations currently only include Zendesk and Front, with others being available via email or webhook integration.
Better Uptime covers all the major use cases and doesn’t create any unnecessary obstacles to getting started right away. The main benefit is that incident management/on-call, monitoring, and the status page can all be managed from a single product.
2. Opsgenie (15/25)
On-call scheduling: 5/5
Opsgenie has a well-made scheduling dashboard that has all the capabilities necessary. Its main benefit compared to other tools is the very clear distinction of main rotation, overrides, and most importantly the final schedule all shown in one dashboard view.
The calendar/timeline switcher is also useful.
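To make the distinction between the main rotation, overrides, and the final schedule concrete, here is a minimal Python sketch. It is not Opsgenie's data model or API; the `Override` class, the `final_schedule` function, and the names in the usage example are all hypothetical and only illustrate the general idea that overrides are layered on top of a repeating rotation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Override:
    """A temporary swap that replaces whoever the rotation would assign."""
    start: date  # first day covered (inclusive)
    end: date    # last day covered (inclusive)
    person: str


def final_schedule(rotation, start, days, overrides, shift_days=7):
    """Return {day: person} after applying overrides on top of the rotation."""
    schedule = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        # Main rotation: each person is on call for `shift_days` days in turn.
        person = rotation[(offset // shift_days) % len(rotation)]
        # Overrides win over the rotation; later overrides win over earlier ones.
        for override in overrides:
            if override.start <= day <= override.end:
                person = override.person
        schedule[day] = person
    return schedule


# Hypothetical two-person weekly rotation with a two-day override.
schedule = final_schedule(
    rotation=["alice", "bob"],
    start=date(2022, 1, 3),
    days=14,
    overrides=[Override(date(2022, 1, 5), date(2022, 1, 6), "carol")],
)
print(schedule[date(2022, 1, 5)])   # carol (override applies)
print(schedule[date(2022, 1, 10)])  # bob (second week of the rotation)
```

Seeing all three layers in one view matters because the final schedule is the only one that tells you who actually gets paged on a given day.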
Alerting and escalations setup is the weakest point of Opsgenie. Not only is it quite hidden in the Teams settings menu, but it also doesn’t offer a clear way of integrating with Slack or of picking whether an alert is going to be a phone call or an e-mail. That needs to be configured in the alerting settings.
Incident lifecycle: 3/5
The incident view is easy to navigate and has a good timeline view. Every resolved incident has a link to a brief prepared post-mortem template, which comes in handy.
Inviting and collaborating with other team members is very unintuitive. Also, notes or comments can’t be added directly to an incident.
Opsgenie belongs to the Atlassian family of products, so integrating with Jira, Trello, and other Atlassian products is straightforward and takes only a few clicks.
The integration selection is large and offers all necessary monitoring, ChatOps, and ticketing tools for the majority of teams.
Opsgenie could be a good fit for larger teams with complex workflows that they need to service and for teams that enjoy the Jira-like processes. The signature of Atlassian's project managers and designers is quite visible.
For any users of other Atlassian products, Opsgenie can be a good solution because of its easy integration and the possibility of bundle pricing.
3. PagerDuty (14/25)
On-call scheduling: 2/5
PagerDuty has a fairly complex scheduling page, the main issue being the timezone picker. That can be tricky for distributed teams, as it’s unclear how others in different timezones will be able to interact with it.
Otherwise, the on-call rotations and user-picker are straightforward. The ability to add multiple layers with extra users is also quite easy to grasp.
The alerting capabilities are not bad; there are actually plenty of them. The main issue is how complex they are to set up and change. Especially when onboarding new engineers, this was always an issue for me.
Keeping escalation policies and notification rules in separate dashboards is not the best UX decision.
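Part of why this split is confusing is that the two settings only make sense together: the escalation policy decides who is alerted and when, while notification rules decide how each person is reached. The sketch below is a simplified illustration of that relationship, not PagerDuty's actual data model; the `due_notifications` function, the tuple layout, and the default-to-email fallback are all assumptions made for the example.

```python
def due_notifications(policy, rules, minutes_unacknowledged):
    """Return the (person, channel) pairs that should have fired by now.

    policy: list of (delay_in_minutes, person) escalation steps.
    rules:  dict mapping each person to their preferred channels.
    """
    due = []
    for delay, person in policy:
        # A step only fires once the incident has gone unacknowledged long enough.
        if minutes_unacknowledged >= delay:
            # Assumption for this sketch: people without rules fall back to email.
            for channel in rules.get(person, ["email"]):
                due.append((person, channel))
    return due


# Hypothetical policy: page alice immediately, escalate to bob after 10 minutes.
policy = [(0, "alice"), (10, "bob")]
rules = {"alice": ["push", "phone"], "bob": ["sms"]}

print(due_notifications(policy, rules, 5))
# [('alice', 'push'), ('alice', 'phone')]
print(due_notifications(policy, rules, 10))
# [('alice', 'push'), ('alice', 'phone'), ('bob', 'sms')]
```

Because the outcome depends on both inputs, a tool that shows them side by side makes it much easier to answer "who gets what alert, and when?" while onboarding a new engineer.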
Incident lifecycle: 4/5
PagerDuty’s incident lifecycle has all the necessary features. Adding responders and escalating to other team members is especially simple.
PagerDuty has the largest integrations library by far and offers everything from monitoring to customer support.
PagerDuty is the oldest provider of incident management and has all the capabilities people could want. For large teams that are fine with longer initial setups and onboarding, it can provide great value since it covers almost every possible use case.
For teams that want something that works out of the box and has a more user-friendly feel, I would recommend the alternatives.
4. Splunk On-Call (13/25)
On-call scheduling: 3/5
Not as good as Opsgenie or PagerDuty when it comes to embedding calendars, but overall OK to set up.
When it comes to alert escalations, they don’t allow for much customization, and you need to work with pre-built options.
Even though Splunk On-Call (formerly VictorOps) is officially part of the Splunk ecosystem, the integrations with Splunk’s monitoring are still not one click away and are instead treated as regular integrations.
Incident lifecycle: 3/5
Splunk On-Call’s incident lifecycle is built upon timelines, with incidents in a separate view. This isn’t the best solution, as you need to switch between views to work with an incident (acknowledge, resolve, etc.) and to see what is happening in real time.
Apart from this constant need to switch views, the timelines are detailed and logical.
There is a great selection of integrations, each giving you a webhook link without the need to wade through loads of marketing text.
Setup guides are also available for people just getting started.
Since the integration with Splunk products is far from the level of Better Uptime or Opsgenie, there is no significant benefit in picking it for teams using Splunk for monitoring.
Overall, Splunk On-Call is quite easy to work with and navigate, and there are no overcomplicated setups.
5. xMatters (9/25)
On-call scheduling: 1/5
On-call scheduling in xMatters might be powerful once set up correctly, but any changes are complicated to make. The calendar can’t be edited directly, and specific shifts need to be set up first, which is annoying at best.
xMatters is probably the worst in this list when it comes to setting up alerting. It needs to be done via workflows and is very unintuitive. It’s also quite hard to grasp the connection between schedules, personal notifications, integrations, and workflows.
Incident lifecycle: 3/5
The incident dashboard is fine, and includes a nice overview with a clear incident timeline.
The only issue is the status dropdowns, because to acknowledge and resolve an incident users need to click on the status and select from a dropdown.
The integrations library for xMatters is large and covers all the necessary services.
Some integrations take only a few clicks. However, when testing others, I was redirected to extensive documentation and GitHub tutorials (several pages long even for quite simple integrations). So when setting up less common integrations, this could be an issue.
I wouldn’t really recommend going with xMatters unless there are significant benefits to integrating with existing solutions from Everbridge, its parent company.
Not sure about the selection? Let’s look at open-source alternatives
In case you want complete control over your incident management process and don’t want to be limited by any specific vendor, here are the options:
Even though those tools are free, you won’t get alerting functionality such as phone calls and SMS out of the box like you would with the paid tools. To get those, it’s necessary to integrate with other tools like Twilio to manage the calls for you.
Working with an open-source incident management tool can be beneficial when it comes to customization and price. However, hosting struggles, managing integrations, and the idea of reinventing the wheel are the counter-arguments.
Incident management is always hard, so picking a tool to help you make it a bit easier is a great move.
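To give a rough idea of what that integration work involves, here is a minimal Python sketch that prepares a request to Twilio's REST API asking it to place a voice call and read a message aloud. The `build_call_request` helper, the phone numbers, and the credentials are placeholders invented for this example, and the function deliberately only builds the request rather than sending it. Check Twilio's current documentation before relying on the endpoint or parameters.

```python
import base64
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

TWILIO_API = "https://api.twilio.com/2010-04-01"


def build_call_request(account_sid, auth_token, to_number, from_number, message):
    """Build (but do not send) a request asking Twilio to place a voice call."""
    url = f"{TWILIO_API}/Accounts/{account_sid}/Calls.json"
    # TwiML tells Twilio what to do once the call is answered;
    # <Say> reads the (XML-escaped) text aloud.
    twiml = f"<Response><Say>{escape(message)}</Say></Response>"
    body = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Twiml": twiml}
    ).encode()
    # Twilio uses HTTP Basic auth with the account SID and auth token.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    return request


# Placeholder values; real ones come from your Twilio account.
request = build_call_request(
    "ACxxxxxxxx", "your_auth_token", "+15550100", "+15550101",
    "Incident opened: checkout API is down",
)
# Sending it would be: urllib.request.urlopen(request)
```

Even with a sketch like this in place, you would still need to handle retries, acknowledgements, and escalation yourself, which is the work the paid tools do for you.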
You should evaluate each tool on what features and objectives are the most important to you and your team. It might take time to narrow it down to the most important and impactful objectives, but it’s well worth it.
If you want to learn more, explore these articles:
- Better Uptime vs. PagerDuty vs. Opsgenie
- How to Create a Developer-Friendly On-Call Schedule in 7 steps
- What Is Incident Management? Beginner’s Guide