Modern systems are increasingly intricate, with numerous components interacting in sometimes unpredictable ways. To effectively monitor these systems, you need structured approaches that help you focus on the most important metrics.
The Four Golden Signals represent a service-centric monitoring approach that gives you direct insight into user experience and system health. Developed and documented by Google's SRE team, these signals have become a cornerstone of modern observability practices.
The Four Golden Signals are:
- Latency - How long it takes to serve a request
- Traffic - The amount of demand placed on your system
- Errors - The rate of failed requests
- Saturation - How "full" your service is
Together, these metrics provide a holistic view of service health. Latency and error metrics tell you about the quality of service your users are experiencing. Traffic metrics help you understand the load on your system and provide context for other measurements. Saturation metrics indicate when your service is approaching its limits and may soon experience degradation.
Let's explore each of these signals in depth to understand how they work together to form a comprehensive monitoring strategy.
Understanding the Four Golden Signals
1. Latency
Latency measures how long it takes to process a request. This is typically measured in milliseconds or seconds and provides insight into the performance of your service from a user perspective.
When measuring latency, it's crucial to distinguish between successful and failed requests. Failed requests often have misleadingly low latencies (for example, an immediate error response) or extremely high latencies (like timeouts). By separating these measurements, you get a clearer picture of your service's actual performance.
Similar to duration metrics in the RED methodology, latency should be measured as a distribution rather than an average. This means tracking percentiles such as:
- The median (50th percentile, or p50), which represents the typical user experience
- The 90th percentile (p90), showing how the slower 10% of requests perform
- The 99th percentile (p99), highlighting the experience of your slowest 1% of requests
Here's how you might implement latency tracking with Prometheus and Node.js:
import { Histogram } from 'prom-client';

// Buckets chosen to cover both fast (10ms) and slow (10s) requests
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10]
});

function latencyMiddleware(req, res, next) {
  const start = Date.now();

  // Record the duration once the response has been sent
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    requestLatency.observe(
      {
        method: req.method,
        route: req.route?.path || req.path,
        status_code: res.statusCode
      },
      duration
    );
  });

  next();
}
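With this histogram in place, you can compute the percentiles discussed above at query time. Here's a sketch of the PromQL queries, assuming the metric name from the code above; the status_code matcher in the last query shows how to restrict the calculation to successful requests, following the earlier advice to separate successes from failures:

histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{status_code=~"2.."}[5m])) by (le))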
Monitoring latency helps you set realistic Service Level Objectives (SLOs) and detect performance degradations quickly, especially when they affect only a subset of requests.
2. Traffic
Traffic metrics measure the demand placed on your system. This is typically expressed as a rate of requests per second, though the specific unit varies by service type.
Traffic metrics serve as a foundational element of your monitoring strategy. They provide crucial context for interpreting other metrics and help you detect anomalies like unexpected spikes or drops in usage.
For different systems, traffic might be measured as:
- HTTP requests per second for web services
- Queries per second for databases
- Messages processed per second for queue systems
- Transactions per second for payment systems
When implementing traffic metrics, you'll typically use a counter that increases monotonically over time:
import { Counter } from 'prom-client';

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route']
});

function trafficMiddleware(req, res, next) {
  requestCounter.inc({
    method: req.method,
    route: req.route?.path || req.path
  });
  next();
}
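Because the counter only ever goes up, the request rate is derived at query time. For example, this PromQL query (assuming the metric defined above) returns per-route requests per second, averaged over the last five minutes:

sum(rate(http_requests_total[5m])) by (route)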
Traffic metrics also help with capacity planning. By analyzing historical traffic patterns, you can predict future needs and scale your infrastructure appropriately. Seasonal patterns, day-of-week variations, and growth trends all become visible through traffic metrics.
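As a concrete example, PromQL's offset modifier makes such comparisons straightforward. The following query (again assuming the counter above) compares the current hourly request rate with the same hour one week earlier; a result far from 1 signals a deviation from the usual weekly pattern:

sum(rate(http_requests_total[1h])) / sum(rate(http_requests_total[1h] offset 1w))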
3. Errors
Error metrics track the rate of failed requests. This includes any request that doesn't complete as expected, whether due to explicit errors, timeouts, or incorrect results.
Error metrics help you understand the reliability of your service from a user perspective. They're typically tracked both as an absolute rate (errors per second) and as a percentage of total requests.
When implementing error metrics, it's important to define clearly what constitutes an "error" for your service. For web services, HTTP status codes in the 4xx and 5xx ranges are often considered errors, though you might want to distinguish between client errors (4xx) and server errors (5xx).
Here's how you might implement error tracking with Prometheus:
import { Counter } from 'prom-client';

const errorCounter = new Counter({
  name: 'http_request_errors_total',
  help: 'Total number of HTTP request errors',
  labelNames: ['method', 'route', 'status_code', 'error_type']
});

// Express identifies error handlers by their four-argument signature,
// so `next` must stay in the parameter list even though it's unused
function errorHandler(err, req, res, next) {
  const statusCode = err.status || 500;

  errorCounter.inc({
    method: req.method,
    route: req.route?.path || req.path,
    status_code: statusCode,
    error_type: err.name || 'UnknownError'
  });

  res.status(statusCode).send('Something went wrong');
}
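Note that this handler only sees exceptions that reach Express's error-handling chain. If you also count 4xx and 5xx status codes as errors, as discussed above, a response hook can record them too. Here's a minimal sketch that reuses the same counter:

function errorStatusMiddleware(req, res, next) {
  res.on('finish', () => {
    // Treat any 4xx or 5xx response as an error
    if (res.statusCode >= 400) {
      errorCounter.inc({
        method: req.method,
        route: req.route?.path || req.path,
        status_code: res.statusCode,
        error_type: res.statusCode >= 500 ? 'ServerError' : 'ClientError'
      });
    }
  });
  next();
}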
Error metrics should include enough detail to help with troubleshooting. Adding dimensions like error type, affected endpoint, or error source can provide valuable context for debugging issues.
4. Saturation
Saturation metrics indicate how "full" your service is. They measure the proportion of your system's resources that are currently being used and help predict when the system might start experiencing performance degradation.
Saturation is often the most complex of the four signals because it varies by system type and resource. For a web service, saturation might involve:
- Thread pool utilization
- Database connection pool usage
- Memory consumption
- CPU utilization
- Network bandwidth usage
The key insight with saturation metrics is that most systems start to degrade before they reach 100% utilization. By identifying and monitoring the most constrained resources in your system, you can predict and prevent performance issues before they affect users.
Here's how you might implement saturation monitoring for a connection pool:
import { Gauge } from 'prom-client';

const connectionPoolGauge = new Gauge({
  name: 'db_connection_pool_usage_ratio',
  help: 'Database connection pool usage ratio',
  labelNames: ['pool_name']
});

// Update the gauge regularly; `db.pool` is assumed to expose pool
// statistics, so adjust the property names to match your database driver
function updateConnectionPoolMetrics() {
  const poolSize = db.pool.max;
  const activeConnections = db.pool.used;

  connectionPoolGauge.set(
    { pool_name: 'main' },
    activeConnections / poolSize
  );
}

setInterval(updateConnectionPoolMetrics, 5000);
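Recent versions of prom-client also support a collect() callback that refreshes the gauge on each scrape, avoiding the separate timer. Here's a sketch that replaces the setInterval approach above, under the same assumptions about the pool object:

const connectionPoolGauge = new Gauge({
  name: 'db_connection_pool_usage_ratio',
  help: 'Database connection pool usage ratio',
  labelNames: ['pool_name'],
  collect() {
    // Invoked automatically whenever Prometheus scrapes /metrics
    this.set({ pool_name: 'main' }, db.pool.used / db.pool.max);
  }
});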
For many systems, saturation is best measured by the presence of queuing. When a system starts to queue work instead of processing it immediately, that's a clear sign of saturation. Metrics like queue length or wait time can provide early warnings of impending problems.
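For example, exposing the depth of an in-process job queue takes only a few lines; jobQueue here is a hypothetical array-backed queue in your application:

const queueDepthGauge = new Gauge({
  name: 'job_queue_depth',
  help: 'Number of jobs waiting to be processed',
  collect() {
    // jobQueue is assumed to be your application's pending-work list
    this.set(jobQueue.length);
  }
});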
Implementing the Four Golden Signals with Prometheus
Prometheus has become the de facto standard for metrics collection in cloud-native environments. Its pull-based architecture, powerful query language, and extensive ecosystem make it an excellent choice for implementing the Four Golden Signals.
Setting Up Prometheus
If you're new to Prometheus, start by setting up a basic instance. You can run Prometheus as a Docker container:
docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
Your prometheus.yml configuration file should include targets for your services:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'your-service'
    static_configs:
      - targets: ['localhost:8080']
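Once Prometheus is running, you can confirm it's reaching your service by querying the built-in up metric in the Prometheus UI; a value of 1 means the most recent scrape succeeded:

up{job="your-service"}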
Exposing Metrics Endpoints
In your service, you'll need to expose an endpoint that Prometheus can scrape. For Node.js applications, you can use the prom-client library:
import express from 'express';
import promClient from 'prom-client';

const app = express();
const register = promClient.register;

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
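You can also ask prom-client to collect its default Node.js runtime metrics (CPU, memory, event loop lag, and more), which provide useful baseline saturation signals for free:

promClient.collectDefaultMetrics({ register });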
The Four Golden Signals vs. Other Monitoring Frameworks
The Four Golden Signals aren't the only framework for monitoring. They complement other approaches like RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors):
- RED focuses on service-level metrics, much like the Four Golden Signals, but doesn't explicitly include saturation.
- USE focuses on resource-level metrics, with a strong emphasis on infrastructure health.
The Four Golden Signals sit between these approaches, offering a balanced view that includes both service performance (Latency, Errors) and resource health (Saturation), with Traffic providing the context needed to interpret the other signals.
For comprehensive monitoring, consider using elements from all three frameworks:
- For service health: Track Latency, Traffic, and Errors (from Four Golden Signals or RED).
- For resource health: Track Saturation (from Four Golden Signals) and detailed USE metrics.
- For business impact: Add custom business metrics on top of these technical indicators.
Setting Alerts Based on the Four Golden Signals
The Four Golden Signals provide a natural foundation for alerting. Here are some alerting strategies for each signal:
Latency Alerts
Alert on sustained increases in high percentile latency (e.g., p99), as these often indicate developing problems. Set thresholds based on your Service Level Objectives (SLOs).
Example Prometheus Alert:
- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.route }}"
    description: "P99 latency for {{ $labels.route }} is above 2 seconds"
Traffic Alerts
Alert on significant deviations from expected traffic patterns. This includes both unexpected drops (which might indicate upstream issues) and unexpected spikes (which might indicate attacks or viral content).
Example Prometheus Alert:
- alert: LowTraffic
  expr: sum(rate(http_requests_total[5m])) < 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Unusually low traffic detected"
    description: "Request rate has fallen below 10 rps for 10 minutes"
Error Alerts
Alert on error rates or percentages that exceed your SLOs. Different services may have different tolerance levels for errors.
Example Prometheus Alert:
- alert: HighErrorRate
  expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 5% for 5 minutes"
Saturation Alerts
Alert when resources approach their capacity limits. The specific thresholds depend on your system's characteristics, but a common approach is to alert well before 100% utilization.
Example Prometheus Alert:
- alert: HighConnectionPoolSaturation
  expr: avg(db_connection_pool_usage_ratio) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Connection pool nearing capacity"
    description: "Database connection pool usage is above 80% for 5 minutes"
Final Thoughts
The Four Golden Signals provide a powerful framework for monitoring and observability in modern services. By implementing these metrics, you gain visibility into both the user experience and the system health, enabling you to quickly identify, diagnose, and resolve issues in your services.
While the implementation details may vary depending on your specific technology stack, the principles remain the same: monitor latency, traffic, errors, and saturation to get a comprehensive view of your service's health.
By combining the Four Golden Signals with other monitoring frameworks like RED and USE, you can build a robust observability platform that helps you maintain reliable, performant services and delights your users.