Creating a better on-call culture in your tech team

Oct 15 - 5 min read

This article has been updated and enriched for February 2022

Tech products play a crucial role in many people’s lives today. Some definitions of “crucial” are looser than others, but banking apps, communication platforms, health systems, and travel management software are just a handful of examples that genuinely prop up modern society.

Add to that the ever-present SLA (service level agreement, legally guaranteeing a certain standard of service) and the pressure to optimize your MTBF (mean time between failures), MTTR (mean time to recovery, repair, respond, or resolve), MTTF (mean time to failure), and MTTA (mean time to acknowledge). Solving your product’s problems quickly and efficiently becomes hugely important for both your customers and your bottom line.

The traditional approach to on-call

To keep these metrics healthy (short times to acknowledge and recover, long times between failures), businesses have lots of options. Today, we’re going to look at the on-call strategy. In smaller businesses, there may not even be a decision to make - it’s just that the lucky victim must keep their mobile on all night or all weekend.

This isn't sustainable - if anyone is expected to work outside their standard work hours, you need to have an on-call rota.

On-call strategies in DevOps-first companies

The interplay between DevOps and on-call strategies is an interesting one. On one hand, some parts of having an on-call culture can be harder on techs in DevOps-focused companies. With the motto “you build it, you run it, you take care of it when it falls over”, there can be a lot of stress on individual devs and teams to be highly available when things go wrong.

On the other hand, with DevOps' insistence on resilience, things should go wrong much less frequently. Also, since devs should be intimately familiar with their own code, troubleshooting tends to be easier and take less time.

You’ll notice that we said, “tends”...

Regardless of size or stage in the DevOps journey, more than an on-call schedule, you need an on-call strategy. Businesses are living, growing things, and no matter how resilient your systems or how competent your devs, things will go wrong. Things will also change, and changes can make old schedules ineffective or redundant with remarkable speed.

The problem with bad on-call strategies

On-call has a bad reputation, mainly because techs get stretched too thin and are asked to be so available, so frequently, that they never get any true downtime. This is especially true in startup culture and countries where work-life balance isn’t protected by law and common consensus - devs on-call burn out, crash, and cash out.

That’s not the only way to do things. With planning, practicality, and a little empathy, you can create an on-call strategy that will support your product and customers, respect your devs, and grow and breathe with your business’ transformation.

Read on to find out how to do on-call better.

How to recognize bad on-call strategies

Luckily, when an on-call strategy stinks, it tends to give off some pretty clear signals. Seen any of these on your team? Tread carefully, because it’s very likely you have a sucky on-call rota!

You have no strategy
Oh. Not great. Skip straight to “How to do on-call better”.
Your schedule is pretty much the same as it was 10 years ago
Unless your company hasn’t grown at all (in which case, you have bigger problems), check whether you are experiencing any of the problems we mention below.
Alerts (or problems) are frequently missed
Another way of looking at this is by asking how many times your users report problems, and how many times your monitoring gets there first. If it’s often your users who are complaining (ask customer support - they’ll be happy to tell you), some alerts are being missed or your alert tolerance is too generous. Either way, you have a problem.
The mere mention of “on-call” makes your devs hostile and touchy
There are several reasons why techs hate on-call. The most basic is that they are being stretched too thin and are close to burning out. One of the most important parts of a good on-call strategy is allowing for adequate rest, and we’ll go into that later.
Every Monday, someone’s annoyed about an on-call fail
Here’s another reason why techs hate on-call: disgruntled people on a Monday are usually the result of problems over the weekend. Often, this is because too many alerts went off, which calls for a long, hard look at what you’re monitoring and where your thresholds are set.
All alerts, big or small, generate chaos
If the sound of the pager sends techs diving for cover, you’ve got a problem. It might be a concrete problem: they don’t feel able to solve the issue, or they’re unsure what to do if they can’t. Or it might be something more basic: they feel they’re always being shouted at for the decisions they take at 3 am. Either way, it’s not sustainable and it’s not the way forward.
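One of the signals above, whether your users or your monitoring spot problems first, is easy to put a number on. Here is a rough sketch in Python, assuming you keep a simple log of who noticed each incident first. The `detected_by` field, the sample data, and the `user_detection_rate` function are all hypothetical; adapt them to whatever your incident tracker actually records.

```python
from collections import Counter

# Hypothetical incident log: who noticed each problem first?
incidents = [
    {"id": 1, "detected_by": "monitoring"},
    {"id": 2, "detected_by": "user"},
    {"id": 3, "detected_by": "monitoring"},
    {"id": 4, "detected_by": "user"},
    {"id": 5, "detected_by": "monitoring"},
]


def user_detection_rate(incidents):
    """Return the fraction of incidents that users reported first."""
    if not incidents:
        return 0.0
    counts = Counter(incident["detected_by"] for incident in incidents)
    return counts["user"] / len(incidents)


rate = user_detection_rate(incidents)
print(f"Users got there first in {rate:.0%} of incidents")
```

If that percentage is high or climbing, it suggests alerts are being missed or thresholds are too lenient.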

How to do on-call better

Make a deliberate choice for a strategy, not a schedule
Our first piece of advice is to make very sure you are 100% aware of the differences between a schedule and a strategy, and conscious of the fact that you need the latter. A schedule is just a timetable of names and dates. It tells people when they’re on, sure, but it doesn’t help them deal with the problems they might face. You need a strategy, which should cover things like escalation policies, approval processes, and yes, a schedule.
Plan for serious problems
If your company is growing, it’s probably just a matter of time before you have a serious issue. When your problem is radically outside the normal remit of niggles, your techs are likely to be completely in the dark unless you ensure they’re not. Create a serious incident policy and go wild imagining some of the worst technological problems that could befall your company. Then write a detailed plan for how to deal with each. You may also think about appointing a serious incident commander - a clear leader in times of crisis is always a huge help.
Give techs the tools they need
Your techs will be happier and less stressed if they have the tools and information they need. Your tech team will be happy to let you know what will make their job easier, even if that’s simply the existence of an “emergency handbook” that lays out a clear path for fixing a critical problem at 3 am, detailing logins for an obscure service, or stating exactly how someone should go about escalating privilege to help solve the out-of-the-ordinary problem that’s landed in their lap.
Don’t make your team too small…
Avoid sharing the on-call rota between just one or two people. That’s a recipe for burnout, with too much stress and not enough downtime (remember that merely being on call is a stressor, even if there’s never actually an alert). Three is really the minimum number you need for on-call, but if you can wrangle more, all the better.
Or too big…
Sounds ridiculous, but if your techs are part of a huge on-call team, it could be doing them more harm than good. Sure, they’ll likely be well rested, but they won’t be getting the crucial experience they need to handle anything the dreaded pager beep throws at them.
Build it for humans
The best way to ensure your on-call strategy is human-appropriate is to make sure that you allow the humans, i.e. your techs, to have a hand in creating it. Your on-call rota needs to be flexible enough to compensate for the ups and downs of life - new babies, illness, doctor’s appointments, etc. Some teams will prefer longer on-call periods, and some will prefer shorter ones. Once all the details have been hashed out, make sure your on-call app can handle the overrides. Here at Cycloid, for example, we have 6 people on a one-week rota, and are pretty flexible about changes.
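To make the idea of a flexible rota a bit more concrete, here is a minimal sketch in Python of a weekly round-robin with per-day overrides. The engineer names, the start date, and the `on_call_for` function are made up for illustration. This is not how any particular on-call app works, just one simple way to model a six-person, one-week rota where swaps are cheap.

```python
from datetime import date

# Hypothetical six-person team, one week each, in rotation order.
engineers = ["ana", "ben", "chloe", "dev", "emma", "farid"]
rota_start = date(2022, 2, 7)  # assumed first Monday of the rota

# Overrides for life's ups and downs: a specific day mapped to a stand-in.
overrides = {date(2022, 2, 9): "chloe"}  # e.g. ana has a doctor's appointment


def on_call_for(day, engineers, rota_start, overrides=None):
    """Return who is on call on `day`, honouring any override for that day."""
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    if day < rota_start:
        raise ValueError("day is before the rota starts")
    week_index = (day - rota_start).days // 7  # whole weeks since the start
    return engineers[week_index % len(engineers)]


print(on_call_for(date(2022, 2, 8), engineers, rota_start, overrides))
print(on_call_for(date(2022, 2, 9), engineers, rota_start, overrides))
```

Keeping overrides separate from the base rotation means a one-off swap never disturbs anyone else’s weeks.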
Compensate everyone properly
How is on-call compensated? Being on call, even if you’re never paged, requires you to be on standby - the inconvenience should be fairly compensated. You should also consider paying for any actual alerts and responses handled out of hours.
Assess and constantly improve
Once you’ve followed our and other people’s advice and have an excellent on-call strategy, it can be tempting to consider it done and dusted, and put it to bed. Don’t! Make sure your incidents are generating metrics, and study those metrics to see if there’s anything you could do to make them better or even avoid them in the first place.
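As a starting point for studying those metrics, here is a small sketch in Python that computes MTTA and MTTR from incident timestamps. The field names (`triggered`, `acknowledged`, `resolved`) and the sample incidents are assumptions, not data from a real system. MTTR is measured here from alert to resolution, which is only one of the several meanings the acronym can have.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with the three timestamps we care about.
incidents = [
    {"triggered": datetime(2022, 2, 1, 3, 0),
     "acknowledged": datetime(2022, 2, 1, 3, 4),
     "resolved": datetime(2022, 2, 1, 3, 40)},
    {"triggered": datetime(2022, 2, 5, 22, 0),
     "acknowledged": datetime(2022, 2, 5, 22, 10),
     "resolved": datetime(2022, 2, 5, 23, 35)},
    {"triggered": datetime(2022, 2, 9, 14, 0),
     "acknowledged": datetime(2022, 2, 9, 14, 1),
     "resolved": datetime(2022, 2, 9, 14, 15)},
]


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed time in minutes between two timestamps per incident."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these numbers month over month is usually more telling than any single value.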

Summing up

Creating a great on-call strategy isn’t hard, but it does take deliberate work and discussion. If you already have one, make it better. If you don’t have one, make one! You’ll keep your devs happy and productive, your customers satisfied and positive, and your company calm and controlled in the case of a true emergency.

Nobody loses when you’ve got your on-call sorted out, so sort it out today!
