Need help with Markdown language? Check out the Markdown cheat-sheet to create the best post mortems out there.
Post mortems are short summaries of incidents. They typically describe why the incident happened, an estimated cost, and how to prevent similar incidents in the future.
The best teams write and share their post mortems after every significant incident.
To write a post mortem, add a comment to an incident including post mortem, like in the example below:
While writing the post mortem, you can use Markdown to format it, as shown in the example above.
An example of a post mortem written in Markdown:
# Sample Post Mortem -- What happened today
1. Homepage stopped responding at 5:22am ET today and all requests resulted in timeouts
2. After a quick glance at New Relic, it was clear that requests were queueing at Puma
3. Database connections were OK, CPU on application servers was OK
4. I decided to restart all application servers with `heroku restart`
5. After the app rebooted, homepage started returning `Internal Server Error (500)` at 5:26am ET
6. Looking at PGHero, I realized slow queries from previous requests were still being executed by PostgreSQL
7. I killed all existing PostgreSQL connections, and restarted PostgreSQL and application servers again
8. This resulted in multiple failed workers and requests in the `#bugsnag` Slack channel
9. After the app booted at 5:41am ET, it was working correctly
10. Just to be sure, I scaled up the application servers to 5 instances
# Why this happened
1. Our Puma threads/workers count seems misconfigured. We set these values arbitrarily when configuring the deployment, without any stress testing, and it seems we simply hit the limits
2. In addition to the Puma misconfiguration, we didn't have timeouts correctly configured for either application requests or database queries
3. The new reporting functionality deployed last Friday unfortunately triggered many slow requests with long-running queries, which quickly depleted Puma's connection pool and prevented other clients from connecting
# Estimated costs
1. We were down for ~19 minutes
2. Fortunately, we typically don't have many users online at around 5am
3. According to Google Analytics, approximately 1,000 users experienced this incident
# How to prevent this in the future
1. We'll configure 15s database timeouts in Rails `database.yml`
2. We'll implement the application server request timeouts using the `slowpoke` gem
3. We'll increase Puma's threads and workers so that the application servers are running closer to 80% of their memory/CPU limits
4. We'll be implementing Nginx as a reverse proxy, giving us more granular control over request queueing
5. We'll be stress-testing the configuration of application servers with loader.io in the following week
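
As a rough sketch of item 1 above, a 15-second database timeout could look something like the following in `database.yml`. This assumes the PostgreSQL adapter and a `production` environment; the `variables` key is how Rails passes session settings to PostgreSQL on connect, and the rest of the connection settings are omitted here because they are not part of this post mortem.

```yaml
# config/database.yml (illustrative fragment, not the full file)
production:
  adapter: postgresql
  variables:
    # PostgreSQL reads a bare integer as milliseconds: 15000 ms = 15 s.
    # Queries running longer than this are cancelled by the database.
    statement_timeout: 15000
```

A timeout set this way applies to every query on the connection, so any legitimately long-running job would need its own, more generous setting.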