Production incident training program

Being on call can be daunting. Anything can happen in production, and we may be all alone and completely out of our comfort zone, dealing with a problem we have no clue about, under the pressure of having the app down and impacting the business.

Introduction

There is no replacement for actual experience dealing with production issues. With each incident we build up our toolbelt and learn to troubleshoot faster. But we should also ramp up new teammates and give them the tools to handle incidents autonomously. My proposal here is to gamify this learning experience.

What do we need to be able to troubleshoot production incidents?

  • We need a good grasp of how the application works
  • We need to know where the configurations are and how to access datastores and servers
  • We need to know how to pull some health metrics and understand their impact

When we find the issue, then we can:

  • Apply a quick mitigation and bring things back online
  • Figure out the root cause and plan to fix it
  • Create a blameless postmortem

All this needs to be trained. My idea here is to set up a game that acts as a certification program: developers pass it before they go on call and handle incidents.

Game

The team will set up a collection of environments that replicate typical production incidents. These could be Docker containers, Vagrant boxes, or even on-demand cloud instances. Let’s call these environments levels. The players will then get a random level and will have to troubleshoot it while filling in an incident report.

After that they will need to create a proper postmortem.

This will be an investment by the company/team. We’re taking a proactive posture here, allowing the team to learn more about incident handling before setting them free in the wild.

Tutorial

Before starting the game we should have a tutorial. One thing that I find very useful is to have new developers configure and bring up the application from scratch. It could be just a docker-compose file, but with everything working. For example:

  • A node running the FE (maybe with a proxy)
  • A node running the backend (or several nodes per service)
  • A node per data store used

This will force developers to learn how all the different parts interact and connect. It will force them to learn how to configure those connections and where the configuration files are.

Sometimes doing this can be tricky. We may not be able to boot the full application, but we may be able to boot just the parts we own. There may also be some teams that don’t have everything set up to properly boot the app with clean datastores. But these difficulties should be seen as a red flag. We should be able to set up a new environment relatively fast.

Levels

Each level should have very simple instructions.

  • The command to boot the environment
  • How to access the app
  • How to access the instances
  • A description of the problem

Then the player needs to troubleshoot and understand what’s happening. We should also have:

  • The code that was used

Because we may change the code to introduce problems. Is this fair? This will add a lot to the trickiness of the problems. But the truth is: there could have been some patch that introduced a bug. So changing code is fair game. Let’s give them access to it.
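As a rough sketch of how the level instructions and the random draw could be modeled, here is a small Python structure. The field names, the `LEVELS` entries, and the `pick_level` helper are all hypothetical; they only mirror the instruction list above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    boot_command: str     # the command to boot the environment
    app_url: str          # how to access the app
    instance_access: str  # how to access the instances
    description: str      # what the player is told about the problem
    code_path: str        # where the (possibly modified) code lives


# Made-up example levels; a real catalog would come from the team.
LEVELS = [
    Level("locked-db", "docker compose up", "http://localhost:8080",
          "docker compose exec db sh", "The app does not respond after a deploy.",
          "levels/locked-db/src"),
    Level("full-disk", "docker compose up", "http://localhost:8080",
          "docker compose exec app sh", "The app is acting weird.",
          "levels/full-disk/src"),
]


def pick_level(levels, rng=random):
    """Return one level at random (raises IndexError on an empty list)."""
    return rng.choice(levels)
```

Passing a seeded `random.Random` as `rng` makes the draw reproducible, which can help when two players should get the same level.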

Next we have several suggestions of replicable problems that we can use for levels.

Database is locked with schema change operation

Some processes automatically run database migrations on deploy. We may have a scenario where those operations are very slow and the app stops responding.

  • Description: we did a deploy, the application restarted but is not responding to requests.

We could simulate this with a migration that locks a table forever. When the application goes up and tries to read from the database, it blocks.

Objectives:

  • Check the application logs and understand if it booted successfully
  • Verify that the application is not consuming CPU or RAM and is blocked on IO
  • Go to the database instance and see the current running transactions
  • Find the culprit and kill it
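The blocking behavior can be reproduced in miniature. The sketch below uses SQLite purely as a stand-in (an assumption; a real level would use the application's own database and that database's tooling for listing and killing transactions). One connection plays the stuck migration and holds an exclusive lock, and a second one plays the application trying to read. The `invoices` table and the `note` column are made up.

```python
import os
import sqlite3
import tempfile


def simulate_locked_read(timeout_seconds: float = 0.2) -> str:
    """Hold a migration transaction open and report what a reader sees."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "level.db")
        # isolation_level=None: transactions are controlled manually.
        migration = sqlite3.connect(db_path, isolation_level=None)
        app = sqlite3.connect(db_path, timeout=timeout_seconds)
        try:
            migration.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY)")
            # The "migration" takes an exclusive lock and never commits.
            migration.execute("BEGIN EXCLUSIVE")
            migration.execute("ALTER TABLE invoices ADD COLUMN note TEXT")
            try:
                app.execute("SELECT COUNT(*) FROM invoices").fetchone()
                return "read succeeded"
            except sqlite3.OperationalError as exc:
                return f"read blocked: {exc}"
        finally:
            # "Killing the culprit": roll back the stuck transaction.
            if migration.in_transaction:
                migration.execute("ROLLBACK")
            app.close()
            migration.close()
```

In this toy version the reader gives up after a short timeout; in the level itself the application would simply hang, which is what the player has to notice.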

Payload too large

This is more common in applications that support the upload of files.

  • Description: we make an AJAX request to an endpoint, and nothing happens. The browser shows the request, but doesn’t show the response.

We can set the maximum payload size on our proxy. For example, this would be client_max_body_size in nginx. This one is very tricky because browsers can’t properly display this error.

Objectives:

  • Understand that the request is not logged at the application
  • Understand that the request is logged at the proxy level with status 413 (Request Entity Too Large)
  • Patch the proxy’s configuration
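To check the proxy side, a player could scan the access log for rejected requests. The following is a minimal sketch that assumes nginx's default "combined" log format; the sample lines, the `/uploads` path, and the helper name are made up for illustration.

```python
import re

# Captures method, path and status from a "combined"-style access log line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def requests_with_status(log_lines, status):
    """Return (method, path) pairs for every line with the given status."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and int(match.group("status")) == status:
            hits.append((match.group("method"), match.group("path")))
    return hits


# Hypothetical log excerpt.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0"',
    '203.0.113.7 - - [10/Oct/2023:13:55:40 +0000] "POST /uploads HTTP/1.1" 413 585 "-" "Mozilla/5.0"',
]
```

Calling `requests_with_status(SAMPLE_LOG, 413)` returns the upload request, which never reached the application.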

Invalid CORS configuration

If we use a frontend on a different domain than the backend, we’ll need to configure CORS.

  • Description: the application isn’t working, the server is denying access for all requests.

We can change the CORS configuration to disallow the frontend’s domain from accessing it.

Objectives:

  • Understand that it’s a CORS problem and it’s the backend that is denying the requests
  • Change the configuration to allow the access
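One way to confirm that it is the backend's CORS configuration is to look at the response headers. This sketch is a deliberately simplified check (it ignores preflight requests and credentialed requests, where a wildcard is not accepted by browsers); the function name and the example origins are hypothetical.

```python
def cors_allows(origin: str, response_headers: dict) -> bool:
    """Simplified check: would a browser accept this response for `origin`?"""
    # Header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    return allowed == "*" or allowed == origin
```

For example, `cors_allows("https://app.example.com", {"Access-Control-Allow-Origin": "https://admin.example.com"})` is false, which matches what the player would see in the level.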

Missing query index

The application may not be down, but can be very slow.

  • Description: going to a specific page is very slow, it takes a lot of time to perform searches.

We can seed a database with lots of records and remove indexes that are used to list that model.

Objectives:

  • Understand that it’s the database query that is very slow
  • Perform an explain and verify that no index is being used
  • Create a patch with the index creation
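The explain step can be practiced on a small scale. This sketch uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for whatever database the application really uses (an assumption), with a made-up `invoices` table and `customer_id` column. It returns the plan before and after the index is created.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


def demo_missing_index():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO invoices (customer_id, total) VALUES (?, ?)",
        [(i % 100, float(i)) for i in range(1000)],
    )
    sql = "SELECT * FROM invoices WHERE customer_id = ?"
    before = query_plan(conn, sql, (42,))  # full table scan expected
    conn.execute("CREATE INDEX idx_invoices_customer_id ON invoices (customer_id)")
    after = query_plan(conn, sql, (42,))   # index lookup expected
    conn.close()
    return before, after
```

The first plan reports a scan of the whole table, and the second one names the new index.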

Select number index

This one is similar to the previous one, but more twisted. The description and the objectives are the same, but in this scenario we’ll have an index that should be used, but is not.

For example, imagine that we query for sequence_number:

  • select * from invoices where sequence_number = 1

And we do have an index on sequence_number. But this will still be slow and the index is not used. How is this possible? This is in fact possible if sequence_number is of varchar type. On MySQL the query works, but the comparison is done numerically: every stored sequence_number is converted to a number before being compared with 1, so the index on the string values can’t be used for the lookup.

So the fix is actually on the application:

  • select * from invoices where sequence_number = '1'

This one is very tricky.
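To see why a string index can’t answer a numeric comparison, here is a toy model in Python. It is not MySQL itself; it only mimics the idea that several different strings are equal to the same number, so the database has to convert every row instead of looking up a single key. The stored values are made up.

```python
# Hypothetical varchar values stored in sequence_number.
STORED = ["1", "01", "1.0", "1e0", "2", "10"]


def string_match(values, needle: str):
    """Exact string comparison: a sorted string index can find this key."""
    return [value for value in values if value == needle]


def numeric_match(values, needle: float):
    """Numeric comparison: every stored value must be converted first."""
    return [value for value in values if float(value) == needle]
```

Here `string_match(STORED, "1")` finds one value, while `numeric_match(STORED, 1)` finds four, scattered across what would be different positions in a string index.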

Order by and pagination

Find some complex and heavy query that is used on the application and add an order by to it. Or go to a search page and open page number 20K.

  • Description: going to the page is very slow.

Objectives:

  • Understand that the order by field should have an index
  • Understand that even with the index, this is a severe bottleneck
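The bottleneck comes from the offset: to serve page 20K the database still has to walk past all the earlier rows. One commonly used alternative, which the text above does not prescribe, is to paginate from the last seen key. The sketch below uses SQLite and a made-up `invoices` table to show that both approaches return the same page.

```python
import sqlite3


def build_db(rows: int = 500):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany(
        "INSERT INTO invoices (id, total) VALUES (?, ?)",
        [(i, float(i)) for i in range(1, rows + 1)],
    )
    return conn


def page_by_offset(conn, page: int, size: int):
    """Classic pagination: skipped rows are still read and discarded."""
    return conn.execute(
        "SELECT id FROM invoices ORDER BY id LIMIT ? OFFSET ?",
        (size, (page - 1) * size),
    ).fetchall()


def page_after(conn, last_seen_id: int, size: int):
    """Keyset pagination: start right after the last row already shown."""
    return conn.execute(
        "SELECT id FROM invoices WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, size),
    ).fetchall()
```

The trade-off is that keyset pagination cannot jump straight to an arbitrary page number; it only supports "next page" style navigation.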

Force restarts

We may have an instance that is restarted, but then the application isn’t automatically restarted.

  • Description: the application stops responding.

We can create a level that boots the application and then forces a restart on the backend’s node.

Objectives:

  • Verify that the instance is up, but the application is not
  • Find out that there was a restart
  • Figure out how to set up the instances so that the application is always up
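A quick way to tell "instance up, application down" apart is to check whether anything is listening on the application's port. This is a minimal sketch; the helper name is hypothetical and the host and port would come from the level's instructions.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If SSH to the instance works but `port_is_open` returns False for the application port, the process itself is the next thing to look at.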

No disk space

Create an instance that has a full disk.

  • Description: the application is acting very weird, sometimes things don’t work.

Objectives:

  • Verify the health of the instance
  • Understand that the disk is full
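Checking the disk can be scripted with the standard library. A minimal sketch, where the 90% threshold is an arbitrary assumption; note that the percentage may differ slightly from what `df` reports, since filesystems can reserve some blocks.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_is_nearly_full(path: str = "/", threshold_percent: float = 90.0) -> bool:
    return disk_used_percent(path) >= threshold_percent
```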

No available RAM

This should be one of the most common issues. Our app consumes too much memory, the instance starts swapping, and it’s a mess.

  • Description: the application is acting very weird, sometimes things don’t work. And it’s very slow.

Objectives:

  • Verify the health of the instance
  • Understand that the memory is full
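On Linux, memory figures can be read from /proc/meminfo. The sketch below parses that format from a string so it can be tried anywhere; the sample numbers are made up, and the `MemAvailable` field assumes a reasonably recent kernel.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo: dict) -> float:
    """Share of total memory that is still available to applications."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


# Hypothetical snapshot of a machine that is short on memory and swap.
SAMPLE_MEMINFO = """MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:     400000 kB
SwapTotal:       2000000 kB
SwapFree:         100000 kB
"""
```

On a real instance the text would come from `open("/proc/meminfo").read()`.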

Something is eating CPU

I’ve seen this happen with a database backup that was being zipped on the application server.

  • Description: the application is very slow.

Objectives:

  • Verify the health of the instance
  • Understand that the CPU is being used a lot by some application
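To find which process is eating the CPU, the output of `ps -eo pid,pcpu,comm` can be sorted. This sketch parses such output from a string; the sample listing is made up (with gzip standing in for the backup being zipped).

```python
def top_cpu_processes(ps_output: str, limit: int = 3):
    """Return (pid, cpu_percent, command) tuples, heaviest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        pid, cpu, command = line.split(None, 2)
        rows.append((float(cpu), int(pid), command.strip()))
    rows.sort(reverse=True)
    return [(pid, cpu, command) for cpu, pid, command in rows[:limit]]


# Hypothetical output of `ps -eo pid,pcpu,comm`.
SAMPLE_PS = """  PID %CPU COMMAND
    1  0.0 systemd
  812 97.5 gzip
  640  3.1 ruby
"""
```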

No directory exists

The application may rely on some directory, for example a tmp/ directory, and may fail if the directory doesn’t exist.

  • Description: uploading a file yields a 500 error.

Objectives:

  • Check the application logs
  • Verify that the error is due to a missing directory
  • Improve the code to properly handle this, by ensuring the directory is created, or using proper system tooling for temporary files.
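Both fixes from the last objective can be sketched in a few lines of Python. The function names and the `.upload` suffix are hypothetical; the application's real upload code will look different.

```python
import os
import tempfile


def save_upload(data: bytes, directory: str, filename: str) -> str:
    """Option 1: make sure the target directory exists before writing."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    path = os.path.join(directory, filename)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


def save_to_temp(data: bytes) -> str:
    """Option 2: let the system pick a temporary location."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".upload") as handle:
        handle.write(data)
        return handle.name
```

With `delete=False` the caller is responsible for removing the temporary file once it has been processed.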

No write access to files

We could have the tmp/ directory, but the application’s running user isn’t allowed to write to it. Or maybe it can’t write to some configuration file.

  • Description: uploading a file yields a 500 error.

Objectives:

  • Check the application logs
  • Verify that the error is due to permissions
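A small helper can summarize what the current user is allowed to do with a path. This is a sketch with a hypothetical function name; it has to run as the same user as the application to be meaningful, and it reports everything as accessible when run as root.

```python
import os
import stat


def access_report(path: str) -> dict:
    """Describe existence, mode and effective access for the current user."""
    if not os.path.exists(path):
        return {"exists": False}
    return {
        "exists": True,
        "mode": stat.filemode(os.stat(path).st_mode),  # e.g. 'drwxr-xr-x'
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }
```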

Mess around with time zones

If we have a page that lists data based on the current date, we can change the time zone of the server and force some bugs.

  • Description: we add users, but they don’t show up on the list page.

Objectives:

  • Verify that the users are actually being created
  • Find out that the problem is due to time zones
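The kind of bug this level produces can be shown with a single timestamp. In this sketch a user is created late in the evening in UTC; a server configured nine hours ahead already considers it the next day, so a "created today" filter gives different results. The timestamp and the offsets are made up.

```python
from datetime import datetime, timedelta, timezone


def local_date(moment: datetime, utc_offset_hours: int):
    """Calendar date of `moment` as seen from a fixed UTC offset."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return moment.astimezone(zone).date()


# Hypothetical creation time of a user record, stored in UTC.
CREATED_AT = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
```

Here `local_date(CREATED_AT, 0)` is January 1st while `local_date(CREATED_AT, 9)` is January 2nd, so the same record falls on different days depending on the server's time zone.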

Summary

Well, these are just some ideas. I’m sure you’ll have many more ideas and scenarios that are specific to the work you’ve been doing. But the main point here is to build this game system and allow developers to learn how to troubleshoot production faster.

Even if we do have some levels that are easier, developers will learn very important things:

  • How to find configuration files and common pitfalls
  • How to quickly check the health of the system and whether something’s up
  • How the infrastructure is set up and how all the parts interact

By forcing them to feel the pains beforehand, they will be better prepared for prime time.