It’s 4:00 in the morning. Phone rings. The website is down. Here we go again.
On-call is often the most stressful part of the job for developers. But it rarely gets the care it really deserves.
Here are actionable tips for both managers and individual developers that can significantly improve how your team resolves downtime.
What is an on-call schedule?
On-call is the practice of always having a team member on standby, ready to respond to an urgent incident, even if it occurs outside of regular working hours. It’s one of the core processes of incident management and a key to minimizing downtime and ensuring a reliable service.
An on-call schedule is a dedicated calendar that lets teams assign and monitor on-call shifts.
💡 How others do it: On-call scheduling at Google
Google SRE (site reliability engineering) teams usually have a schedule where each engineer is on-call for one week every month. During this week, they are ready to respond to incidents at any time of the day and night.
They spend the rest of the month on engineering (ideally 50% of their time) and on other operational, non-project tasks (around 25% of their time).
Creating a quality on-call schedule is challenging because there is no one-size-fits-all model. But on-call doesn’t have to be difficult, sleep-depriving, or a path to burnout.
Here is how to approach it.
Step 1: Understand team preferences
Everyone is different. Some people like to start working at 11 am and finish late at night. Some start at 7 am and want to spend the afternoons with the kids.
Before pushing pre-built templates used at Google, see what everyone prefers and centralize your findings into a single document. For each team member, you should know the following.
Timezone
Regular working hours preference
On-call preference
Once that’s done, you will have something similar to this table.
| Team member | Timezone | Regular working hours preference | On-call preference |
| --- | --- | --- | --- |
| Katie | UTC+0 (London) | 9:00 - 17:00 | Prefers full-week on-call duty once a month and then the rest off. |
| Brandon | UTC-8 (San Francisco) | 10:00 - 18:00 | Can work Saturdays or Sundays from time to time if compensated for it. |
| Cecelia | UTC-5 (Atlanta) | 7:00 - 15:00 | No preference |
| David | UTC+4 (Dubai) | 9:00 - 17:00 | Can work on Sundays but wants Saturdays off. |
This bit of research gives you a solid starting point for understanding your team’s preferences and capabilities.
Step 2: Pick one of the common schedules
With team preferences in place, you can start drafting the schedule.
If your initial research doesn’t land a schedule out of the gate, you can start by picking one of the battle-tested schedules.
Workweek and weekend
Engineers are on-call during the workweek and weekend, 7 days in a row. The on-call week is intense. However, the rest of the month is on-call free. This practice is used at Google and is also doable for teams of any size. Depending on your team size, you have several options.
Workweek and weekend (monthly): One workweek and weekend every month, then the rest of the month is free. For this, you’ll need at least four team members.
Workweek and weekend (bi-weekly): One workweek and weekend, then one week and weekend off. You can do this with only two people in a team.
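A weekly rotation like the two variants above can be generated with a simple round-robin. The following is a minimal sketch rather than a definitive implementation: the function name `weekly_rotation`, the start date, and the reuse of the example team names are all assumptions made for illustration.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Assign one engineer per 7-day shift, cycling through the list in order."""
    schedule = []
    for i in range(weeks):
        shift_start = start + timedelta(weeks=i)
        shift_end = shift_start + timedelta(days=6)  # inclusive last day of the shift
        schedule.append((shift_start, shift_end, engineers[i % len(engineers)]))
    return schedule

# Hypothetical four-person team; 2024-01-01 is a Monday.
team = ["Katie", "Brandon", "Cecelia", "David"]
for shift_start, shift_end, engineer in weekly_rotation(team, date(2024, 1, 1), 8):
    print(f"{shift_start} to {shift_end}: {engineer}")
```

With four names, each person is on-call one week in four (the monthly variant). Passing only two names produces the bi-weekly variant.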
Follow-the-sun
This model leverages the time zone differences between team members. It allows all on-call engineers to have only business hours duties and avoid night shifts. It’s the most protective of a healthy sleep schedule.
Completely eliminating night shifts requires team members in specific locations across different time zones. In most cases, follow-the-sun eliminates only part of the night shifts and rarely creates a smooth nine-to-five on-call schedule.
You can, of course, implement the follow-the-sun model with only two people, provided they have a reasonable time difference, and eliminate at least some non-standard working hours.
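To see how much of the day a follow-the-sun setup really covers, you can map everyone's working hours onto UTC and look for gaps. This is a rough sketch under stated assumptions: it uses the example team from Step 1, treats UTC offsets as fixed whole hours (ignoring daylight saving time), and the function name `covered_utc_hours` is made up for illustration.

```python
def covered_utc_hours(team):
    """Return the set of UTC hours (0-23) that fall inside someone's local working hours."""
    covered = set()
    for utc_offset, start_hour, end_hour in team.values():
        for local_hour in range(start_hour, end_hour):
            # Convert a local hour to UTC by subtracting the offset, wrapping around midnight.
            covered.add((local_hour - utc_offset) % 24)
    return covered

# (UTC offset, local start hour, local end hour) for the example team.
team = {
    "Katie": (0, 9, 17),
    "Brandon": (-8, 10, 18),
    "Cecelia": (-5, 7, 15),
    "David": (4, 9, 17),
}

gaps = sorted(set(range(24)) - covered_utc_hours(team))
print(gaps)  # [2, 3, 4]
```

For this team, three UTC hours fall outside everyone's working day, so someone still has to cover them as an evening or early-morning shift.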
Step 3: Set up primary and backup schedules
If you have the luxury of a larger team and want to be extra safe, then set up primary and secondary (backup) schedules.
If the primary on-call person doesn’t acknowledge the incident within a given timeframe, it’s escalated to the backup. Backup on-call carries the same responsibilities as the primary one, and it needs to be treated that way.
First, the team needs to understand that being a backup is no different from being a primary — you must be ready to react within minutes.
Secondly, managers must treat it that way and consider being a backup on-call engineer equivalent to regular on-call duties, especially regarding compensation.
There is an option for a third on-call tier, which escalates to the team lead. If everything is set up correctly with the team, escalation to the third tier should be a rare event.
The image above shows how this setup looks in Better Stack.
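The escalation logic described in this step can be sketched in a few lines. This is an illustrative model only, not how Better Stack or any other tool implements it: the `Tier` class, the `tier_to_page` function, and the timeout values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    ack_timeout_min: int  # minutes this tier has to acknowledge before escalation

def tier_to_page(tiers, minutes_unacknowledged):
    """Return the tier that should currently be paged for an unacknowledged incident."""
    deadline = 0
    for tier in tiers:
        deadline += tier.ack_timeout_min
        if minutes_unacknowledged < deadline:
            return tier
    return tiers[-1]  # the last tier stays paged until someone acknowledges

# Hypothetical three-tier policy with made-up timeouts.
policy = [Tier("primary", 5), Tier("backup", 10), Tier("team lead", 15)]

print(tier_to_page(policy, 3).name)   # primary
print(tier_to_page(policy, 7).name)   # backup
print(tier_to_page(policy, 20).name)  # team lead
```

Short acknowledgment windows are what make the backup role as demanding as the primary one: the backup has to be reachable within minutes of the primary missing a page.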
Step 4: Define the on-call process and responsibilities
Write down all the responsibilities of on-call engineers and make it crystal clear what is expected of everyone.
Responsibilities are specific to a given organization, but good questions to answer and write down include:
Defining success: Are there specific metrics like MTTR that determine success?
Working during on-call: Are developers doing development work during on-call time? And if yes, how are the deliverables (development work) checked in the context of incidents?
Working vs. non-working hours responsibilities: Is there a difference between what is expected from an on-call person during working hours and non-working hours (night-time)?
SLAs/SLOs: Are there any contractual obligations that must be achieved?
Vacations: When and how does one apply for a vacation to make on-call planning possible?
Ad-hoc changes: What is the process of changing on-call on the same day (for example, due to sickness)?
Compensation: What is the compensation for on-call employees? What’s the maximum time a single person can be on-call every month?
Step 5: Choose an on-call tool
Managing on-call from a spreadsheet is a thing of the past. There are dedicated tools built just for this. Here are the options.
Option 1: Selecting a SaaS solution to manage it. This is the easiest to set up and manage, since everything is handled by a provider like Better Stack. This way, you can create on-call calendars and manage alerting in one place.
Option 2: Combining Google or Microsoft Calendar with an alerting tool. If you already have a schedule in your calendar, you can pick a combined setup. Better Stack offers a native calendar integration, which lets you manage scheduling there while still getting access to all alerting capabilities (phone calls, SMS, emails, Slack & Teams notifications, and more).
Option 3: Self-hosting an open-source project. This potentially gives you more control, but comes at the cost of more management. The most popular open-source tools are Cabot, Dispatch, Openduty (now archived), and Response.
Step 6: Build a supportive team culture
A supportive culture is often dismissed as obvious, but it’s usually not that obvious.
Creating a supportive culture within a team can significantly improve both employee happiness and incident response effectiveness.
Every once in a while, personal emergencies or important life events come up. Encouraging team members to help each other and switch duties to step in for others makes all the difference. When teams care for each other, the on-call challenge feels much more manageable.
Step 7: Measure, iterate, and improve
As products, organizations, and teams develop, there is always a need to iterate and fine-tune to accommodate changes. Don’t be afraid to revisit old processes and ask your team for feedback frequently. On-call is not a static process.
Measuring on-call performance
The end goal of having an on-call team is usually a target uptime, typically based on a company SLA. This availability table shows the different target levels of uptime from 99.9% up to 99.999% (so-called five nines).
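Each uptime target translates directly into a downtime budget: the allowed downtime is the length of the period multiplied by one minus the uptime fraction. Here is a small sketch of that arithmetic. The function name and the 365-day default period are assumptions, and your SLA may define the measurement period differently.

```python
def allowed_downtime_minutes(uptime_percent, period_days=365):
    """Minutes of downtime permitted over the period for a given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.9% the budget is roughly 8.8 hours per year. At five nines it shrinks to a little over 5 minutes, which leaves almost no room for a slow response to a page.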
Incident metrics like MTTR (mean time to resolve) are also common KPIs for on-call teams. They are usually useful only once there is an established structure within the on-call process. They give a unique insight into how effective specific parts of the incident management process are.
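MTTR is simply the average time between an incident starting and being resolved. The sketch below shows the calculation on made-up incident timestamps. The function name and the data are illustrative assumptions, and teams differ on whether the clock starts at detection or at the first alert.

```python
from datetime import datetime

def mean_time_to_resolve_minutes(incidents):
    """Average minutes between an incident starting and being resolved."""
    durations = [
        (resolved - started).total_seconds() / 60
        for started, resolved in incidents
    ]
    return sum(durations) / len(durations)

# Hypothetical (started, resolved) pairs for one month.
incidents = [
    (datetime(2024, 1, 3, 4, 0), datetime(2024, 1, 3, 4, 30)),      # 30 minutes
    (datetime(2024, 1, 9, 14, 10), datetime(2024, 1, 9, 15, 40)),   # 90 minutes
    (datetime(2024, 1, 20, 23, 50), datetime(2024, 1, 21, 0, 50)),  # 60 minutes
]

print(mean_time_to_resolve_minutes(incidents))  # 60.0
```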
Measuring on-call well-being
Measuring the happiness of on-call engineers is as important as any performance metric. The best way to do this is through regular 1:1s with the team, which yield qualitative feedback.
The quantitative data to collect includes:
Number of false positives: How many alerts were not actionable, and engineers investigated something that actually wasn’t a problem? How can this be prevented?
Number of duplicate alerts: How many alerts were duplicated, and what can we iterate to prevent engineers from being called multiple times for the incident they are already aware of?
Number of low-priority alerts: How many alerts didn’t require immediate reaction from the on-call team, and how many of those were outside business hours?
Number of all alerts: Is the current number of alerts manageable for the number of people on-call?
Here is an example of a split of incident alert types one might receive. Minimizing low-priority, false-positive, and duplicate alerts should be one of the easy wins for quick on-call optimization.
And of course, the fewer alerts an on-call person receives, the lower the chance of something like alert fatigue developing.
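If your alerting tool lets you export alerts with a category label, the split described above takes only a few lines to compute. This is a minimal sketch: the category names and the sample data are assumptions, not a format any particular tool produces.

```python
from collections import Counter

def alert_breakdown(alerts):
    """Return each alert category's share of all alerts, as a percentage."""
    counts = Counter(alerts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Hypothetical labels for ten alerts received during one on-call week.
alerts = (
    ["actionable"] * 4
    + ["false_positive"] * 3
    + ["duplicate"] * 2
    + ["low_priority"] * 1
)

breakdown = alert_breakdown(alerts)
print(breakdown)

# Share of alerts that did not need an immediate human response.
noise = sum(share for category, share in breakdown.items() if category != "actionable")
print(noise)
```

In this made-up week, more than half of the alerts were noise, which is the part worth tuning first.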
If you’re working on a new on-call schedule, you can book a free consultancy call with us.
Article by
Jenda Tovarys
Jenda leads Growth at Better Stack. For the past 5 years, Jenda has been writing about exciting learnings from working with hundreds of developers across the world. When he's not spreading the word about the amazing software built at Better Stack, he enjoys traveling, hiking, reading, and playing tennis.