# Managing runbooks

Create structured incident response procedures using Better Stack's **Escalation policies**. Runbooks help your team follow consistent steps during incidents, reducing response time and ensuring nothing is missed.

[info]
#### Need dedicated runbook management?

While some monitoring tools offer separate runbook features, Better Stack integrates runbooks directly into **Escalation policies**, keeping your incident response workflow simple and unified.
[/info]

## What are runbooks?

A runbook is a structured set of predefined steps for handling a specific incident or scenario. Runbooks typically include:

- **Tasks to perform:** step-by-step procedures.
- **People to notify:** automated through [Escalation rules](https://betterstack.com/docs/uptime/escalation-policies/#alerting).
- **Links to resources:** your dashboards, playbooks, and internal documentation.

## Creating runbooks

### Step 1: Create a runbook policy

Create a dedicated **Escalation policy** for your new runbook:

1. Go to [Escalation policies](https://uptime.betterstack.com/team/0/policies ";_blank") → **Create escalation policy**.
2. **Name your runbook**, for example `Runbook: High CPU Usage`.
3. **Remove the default escalation steps** to avoid notifying people in the runbook directly.
4. **Position the runbook step** as the last step of your escalation flow.
5. **Add your runbook instructions:**
   - Use the **Instructions & todo list** step.
   - Format your step-by-step guide using Markdown.
   - Start a line with `- [ ]` to add an interactive task.
6. **Save the escalation policy.**

```markdown
[label Example runbook instructions with TODO list]
## When to Use
Triggered when CPU > 90% for 5+ minutes on a web server.

## Steps
- [ ] **Acknowledge the Alert**
- [ ] **Find the Affected Server**
   - Use logs or metrics dashboard to identify the instance/container
   - Example: `aws ecs list-tasks --cluster web-prod`
- [ ] **SSH or Access Container**
   - `ssh ec2-user@<instance-ip>`
- [ ] **Diagnose the Issue**
   - Run `top` or `htop` to find CPU-heavy process
   - Check application logs
- [ ] **Fix or Mitigate**
   - Restart service if needed
   - Scale up if traffic is legitimate
- [ ] **Verify**
   - CPU drops below 70%
   - No 5xx errors
   - App is responsive
```
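
If you keep runbooks in version control, you may prefer to generate the instructions text from structured data and paste the result into the **Instructions & todo list** step. The following is a minimal sketch of that idea, not a Better Stack feature: the `Task` class and `render_runbook` function are hypothetical helpers, and the output simply follows the layout of the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One interactive task plus optional indented detail lines (hypothetical helper)."""
    title: str
    details: list[str] = field(default_factory=list)


def render_runbook(when_to_use: str, tasks: list[Task]) -> str:
    """Render runbook instructions as Markdown with `- [ ]` task lines."""
    lines = ["## When to Use", when_to_use, "", "## Steps"]
    for task in tasks:
        # Each task becomes an interactive checkbox line.
        lines.append(f"- [ ] **{task.title}**")
        # Details are indented by three spaces, as in the example above.
        lines.extend(f"   - {detail}" for detail in task.details)
    return "\n".join(lines)


runbook = render_runbook(
    "Triggered when CPU > 90% for 5+ minutes on a web server.",
    [
        Task("Acknowledge the Alert"),
        Task("Diagnose the Issue", ["Check application logs"]),
    ],
)
print(runbook)
```

Generating the text this way keeps the checkbox syntax consistent across runbooks, but you still paste the output into the policy by hand.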

![Create a runbook policy](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/b079c06b-e97a-46a1-7fb9-115e15adce00/lg2x =1502x1842)

### Step 2: Reference runbooks in active policies

In the escalation policies that actually notify your team:

1. Go to [Escalation policies](https://uptime.betterstack.com/team/0/policies ";_blank") and edit one of your existing policies.
2. **Add a time-based rule** for the runbook.
3. **Set the schedule** to all days from 00:00 to 00:00, so the rule always applies.
4. **Select your runbook policy** from the dropdown.
5. **Position the runbook step** as the last step of your escalation flow.

[info]
#### Need to use different runbooks based on incident context?

Use a [metadata-based rule](https://betterstack.com/docs/uptime/escalation-policies/#metadata-based-rules) instead of the time-based rule, and redirect to different runbooks based on the incident's metadata values.
[/info]
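
Before configuring metadata-based rules, it can help to write down which metadata value should lead to which runbook. The sketch below is only a planning aid and a mental model, not how Better Stack evaluates rules: the `service` metadata key, the mapping, and the `Runbook: General Triage` fallback are all hypothetical.

```python
# Hypothetical planning table: incident metadata value -> runbook policy name.
RUNBOOK_BY_SERVICE = {
    "database": "Runbook: Database Outage",
    "api": "Runbook: API Rate Limiting",
}

# Hypothetical fallback for incidents without a matching metadata value.
DEFAULT_RUNBOOK = "Runbook: General Triage"


def pick_runbook(metadata: dict[str, str]) -> str:
    """Return the runbook planned for an incident's `service` metadata value."""
    return RUNBOOK_BY_SERVICE.get(metadata.get("service", ""), DEFAULT_RUNBOOK)


print(pick_runbook({"service": "database"}))
print(pick_runbook({"region": "eu-west"}))
```

Each entry in the table corresponds to one metadata-based rule you would create in the policy, with the fallback covering incidents that match none of them.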

![Reference runbooks in active policies](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/bddc3be8-753e-4c44-65da-b82ef434ee00/md2x =1502x2432)

### Step 3: Test the escalation

Click **Report a new incident** in your escalation policy to create a test incident.

You should see your instructions in the incident timeline:

![Test the escalation](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/4b89d90f-d5e7-4f7f-ca71-86aa36749100/md2x =1492x2130)

## Best practices

### Naming convention

Use a consistent prefix for easy identification:

- `Runbook: Database Outage`
- `Runbook: API Rate Limiting`
- `Runbook: SSL Certificate Expiry`
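
If you manage many policies, a short script can flag names that drift from the convention. This is a minimal sketch that assumes you have a list of policy names at hand, for example copied from the Escalation policies page; `is_runbook_name` is a hypothetical helper, and the sample names are illustrative.

```python
RUNBOOK_PREFIX = "Runbook: "


def is_runbook_name(policy_name: str) -> bool:
    """Return True if the name starts with the prefix and has a non-empty title."""
    title = policy_name[len(RUNBOOK_PREFIX):]
    return policy_name.startswith(RUNBOOK_PREFIX) and title.strip() != ""


# Illustrative names; the last two break the convention.
names = ["Runbook: Database Outage", "runbook - api limits", "Runbook: "]
flagged = [name for name in names if not is_runbook_name(name)]
print(flagged)
```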

### Keep instructions actionable

- **Use checkboxes** for step-by-step procedures.
- Include **specific commands and code snippets**.
- **Add links** to relevant dashboards and documentation.
- Specify **expected outcomes** for verification steps.
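
Some of these points can be checked automatically before you paste a draft into a policy. The sketch below is one possible review script, not part of Better Stack: `review_instructions` is a hypothetical helper that looks for `- [ ]` tasks, inline code, and Markdown links. It cannot judge whether expected outcomes are specified, so that point still needs a human review.

```python
import re

# A task line: optional indentation, then "- [ ] ".
CHECKBOX = re.compile(r"^\s*- \[ \] ", re.MULTILINE)
# A Markdown link with an http(s) target.
LINK = re.compile(r"\[[^\]]+\]\(https?://[^)\s]+\)")
# Inline code wrapped in single backticks.
INLINE_CODE = re.compile(r"`[^`\n]+`")


def review_instructions(markdown: str) -> list[str]:
    """Return human-readable warnings for a runbook draft."""
    warnings = []
    if not CHECKBOX.search(markdown):
        warnings.append("no `- [ ]` tasks found")
    if not INLINE_CODE.search(markdown):
        warnings.append("no commands or code snippets found")
    if not LINK.search(markdown):
        warnings.append("no links to dashboards or documentation found")
    return warnings


draft = "## Steps\n- [ ] **Diagnose the Issue**\n   - Run `top`\n"
for warning in review_instructions(draft):
    print(warning)
```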

## Reusing runbooks

The same runbook can be referenced across multiple escalation policies. For example, your **High CPU Usage** runbook might be used in policies for:

- Web server monitoring
- Background job processing
- Database server alerts

This approach keeps runbooks centralized while allowing flexible incident response workflows 🚀
