SRE Fundamentals: Metrics, Monitoring, and Alerting
Understanding the intricate workings of your infrastructure and applications is crucial for maintaining stability, optimizing performance, and ensuring the reliability of your services. A robust monitoring system is an indispensable tool in achieving these goals.
Think of a monitoring system as your eyes and ears in the complex landscape of your IT environment. It provides a continuous stream of insights into the health and performance of your systems, empowering you to quickly identify and resolve problems, assess the impact of updates or modifications, and make informed, data-driven decisions.
This guide delves into the core components of a monitoring system:
- Metrics: The fundamental building blocks of monitoring, representing quantifiable measurements of system performance.
- Monitoring: The continuous process of collecting, analyzing, and visualizing metrics to understand system behavior.
- Alerting: The proactive mechanism that notifies you of critical events or deviations from expected behavior.
We'll explore the importance of these concepts, the types of data you should track, and the benefits they offer. Along the way, we'll introduce key terminology and provide a glossary of common terms to help you navigate the world of monitoring.
By the end of this guide, you'll have a solid understanding of how to leverage metrics, monitoring, and alerting to gain valuable insights into your systems, ensure their reliability, and optimize their performance.
What are metrics?
A metric is a single number that represents the state of a system at a specific point in time, offering a snapshot of its performance, behavior, or health. It simplifies complex underlying activities into an interpretable value to enable monitoring, issue diagnosis, and decision-making.
Metrics are typically collected continuously over time, forming time-series data that reveals patterns, exposes anomalies, and highlights long-term performance trends. Rather than being static, a metric is best understood as part of an ongoing stream of data points that are recorded as long as the system is active.
However, a number alone has limited value without context. A metric becomes meaningful when compared to historical data, predefined baselines, or performance targets.
For instance, a CPU usage of 80% might indicate normal operation in one scenario but signal a potential issue in another, depending on the system's expected workload.
In essence, a metric is a dynamic, contextualized, and actionable data point that informs decisions, triggers automated or manual actions, and ultimately drives system improvements.
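To make the role of context concrete, here is a minimal Python sketch that treats a metric as a series of timestamped samples and compares the latest reading against a simple baseline. The `Sample` type, the `deviation_from_baseline` helper, and the sample values are hypothetical and exist only for illustration; real systems usually rely on more robust baselines than a plain mean.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One data point in a time series: when it was taken and its value."""
    timestamp: float  # seconds since the start of the series (illustrative)
    value: float      # e.g. CPU usage in percent


def deviation_from_baseline(series: list[Sample], baseline_points: int) -> float:
    """Return how far the latest sample is from the mean of the samples before it.

    The baseline is the mean of the `baseline_points` samples that precede
    the latest one. This is an assumption made for the sketch.
    """
    if len(series) < baseline_points + 1:
        raise ValueError("not enough samples to compute a baseline")
    history = series[-(baseline_points + 1):-1]
    baseline = mean(s.value for s in history)
    return series[-1].value - baseline


# The same 80% CPU reading on two hypothetical hosts.
batch_host = [Sample(float(t), v) for t, v in enumerate([78.0, 82.0, 79.0, 81.0, 80.0])]
web_host = [Sample(float(t), v) for t, v in enumerate([20.0, 22.0, 18.0, 20.0, 80.0])]
```

With these values, `deviation_from_baseline(batch_host, 4)` is `0.0` while `deviation_from_baseline(web_host, 4)` is `60.0`: the same 80% reading is routine on one host and a sharp departure on the other.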
What metrics should you collect?
As your systems grow and evolve, so too will your monitoring needs. What you track today might be different from what you track tomorrow.
Think of your infrastructure as a building. You have the foundation, the core structure, the various floors with different functions, and finally, the roof. Each layer relies on the ones beneath it, and monitoring each layer provides a different perspective on the building's overall health.
Similarly, IT systems often function in a hierarchy, with more complex layers built upon foundational components. When planning your monitoring strategy, it's crucial to consider the different levels of your infrastructure and the unique metrics each offers.
Let's look at a few of these below:
1. Host-level metrics
These metrics focus on the health and performance of individual machines to help evaluate the operating system and hardware. Common metrics collected at this level include CPU usage, memory usage, disk space, running processes, and network statistics.
A common way to collect such metrics is the Prometheus Node Exporter, which gathers a wide array of host-level metrics from *nix kernels and exposes them in the Prometheus format.
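As a rough illustration of what "exposing host metrics in the Prometheus format" means, the sketch below reads a few host-level values with Python's standard library and renders them as text exposition lines. It is not how Node Exporter is implemented, and the `demo_*` metric names are made up for this example rather than taken from any real exporter.

```python
import os
import shutil


def collect_host_metrics(path: str = "/") -> dict[str, float]:
    """Gather a few host-level readings using only the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {
        "demo_disk_total_bytes": float(usage.total),
        "demo_disk_free_bytes": float(usage.free),
        "demo_cpu_count": float(os.cpu_count() or 0),
    }
    try:
        # 1-minute load average; only available on Unix-like systems.
        metrics["demo_load1"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # skip the load average where the platform cannot provide it
    return metrics


def render_text_format(metrics: dict[str, float]) -> str:
    """Render each gauge as a `# TYPE` comment followed by a `name value` line."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_text_format(collect_host_metrics()), end="")
```

In practice you would run an exporter or agent rather than write this yourself; the point is only that each metric ends up as a named value that a monitoring system can scrape.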
2. Application metrics
While host-level metrics provide a foundational understanding of your servers, application metrics are explicitly created by you to measure the behavior, performance, and health of your software.
The specific metrics you choose will depend on the nature of your application, its dependencies, and its interactions with other components. However, some common categories include:
- Request rates and errors
- Response times and latency
- Resource consumption
- Uptime and downtime
- Queue lengths and processing times
To get metrics from your application, you'll need to instrument it using telemetry libraries or frameworks. These tools provide the means to capture and expose relevant data points from within your application code.
Prometheus and OpenTelemetry are two projects backed by the Cloud Native Computing Foundation that provide client libraries and SDKs for instrumenting your applications.
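To show what instrumentation looks like in principle, here is a hand-rolled Python sketch that records request counts, errors, and latency for a handler. It deliberately avoids any real client library, so the `instrumented` decorator, the in-memory stores, and the `checkout` handler are all hypothetical stand-ins for what a Prometheus or OpenTelemetry library would provide.

```python
import time
from collections import Counter
from functools import wraps

# In-process stores standing in for a real metrics library.
request_counts: Counter = Counter()              # keyed by (handler name, outcome)
latencies_seconds: dict[str, list[float]] = {}   # keyed by handler name


def instrumented(handler):
    """Record call count, outcome, and latency for each call to `handler`."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "success"
        try:
            return handler(*args, **kwargs)
        except Exception:
            outcome = "error"
            raise  # the caller still sees the original exception
        finally:
            elapsed = time.perf_counter() - start
            request_counts[(handler.__name__, outcome)] += 1
            latencies_seconds.setdefault(handler.__name__, []).append(elapsed)
    return wrapper


@instrumented
def checkout(cart_total: float) -> str:
    """Hypothetical request handler used only to exercise the decorator."""
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return "ok"
```

Wrapping handlers this way keeps measurement code out of the business logic, which is the same design choice most instrumentation libraries make with middleware or decorators.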
3. Infrastructure metrics
While host-level metrics provide a granular view of individual machines, and application metrics focus on the performance of your software, infrastructure metrics provide a broader perspective on the health and performance of your overall IT infrastructure.
These metrics encompass various components that support your applications and services, including:
- Network infrastructure
- Containers
- Databases
- Cloud services
- Load balancers
4. External dependency metrics
In addition to tracking metrics within your own systems, it's essential to monitor those related to external dependencies. Many third-party services provide status pages or APIs to report outages and performance issues.
Incorporating these insights into your monitoring framework, along with data on your own interactions with these services, gives you a clearer picture of how external dependencies impact your operations.
Some metrics to track in this area include:
- Service availability
- Error rates
- API latency
- Resource exhaustion such as request quotas
External dependencies are often integral to your system's functionality. Any issues with these services—whether outages, degraded performance, or resource limitations—can cascade into problems for your own infrastructure.
By proactively monitoring these metrics, you can quickly detect and respond to provider issues, minimizing their impact on your operations.
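One way to track your own interactions with a provider is to keep running totals per dependency. The Python sketch below is a minimal example under stated assumptions: the `DependencyStats` class is hypothetical, and it assumes the provider reports a remaining request quota that you pass in after each call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DependencyStats:
    """Running totals for calls made to one external service."""
    quota_limit: int                      # total requests allowed per period
    calls: int = 0
    errors: int = 0
    latencies_ms: list[float] = field(default_factory=list)
    quota_remaining: Optional[int] = None

    def record(self, latency_ms: float, ok: bool,
               quota_remaining: Optional[int] = None) -> None:
        """Record the outcome of one call to the dependency."""
        self.calls += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)
        if quota_remaining is not None:
            self.quota_remaining = quota_remaining

    def error_rate(self) -> float:
        """Fraction of recorded calls that failed (0.0 when nothing was recorded)."""
        return self.errors / self.calls if self.calls else 0.0

    def quota_used(self) -> Optional[float]:
        """Fraction of the quota consumed, or None if the provider never reported it."""
        if self.quota_remaining is None:
            return None
        return 1 - self.quota_remaining / self.quota_limit
```

For example, after `stats = DependencyStats(quota_limit=1000)`, `stats.record(120.0, ok=True, quota_remaining=900)`, and `stats.record(450.0, ok=False, quota_remaining=750)`, the error rate is `0.5` and a quarter of the quota has been used.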
What is monitoring?
Monitoring is often misunderstood. People might think it's about watching security cameras or tracking employee productivity, but in the world of software, it has a much more focused purpose.
Imagine you're a pilot flying a plane. Monitoring is like having a dashboard full of instruments that tell you everything you need to know about the flight: altitude, speed, fuel levels, engine performance, and even weather conditions. Without these instruments, you'd be flying blind.
Similarly, software monitoring provides a comprehensive view of your systems, applications, and infrastructure. It's about collecting and analyzing data to ensure everything is running smoothly, identify potential problems, and make informed decisions.
There are three main aspects to monitoring:
1. Alerting
The heart of monitoring is knowing when something goes wrong or is reaching a critical state. An alert can be purely informational, such as a notification of a system update, or it can draw attention to a spike in errors. It could also signal a major emergency, like a complete server outage, that demands immediate attention.
2. Debugging
Once alerted, you need to understand the "why" behind the problem. In most cases, metrics alone cannot give you this information. This is where a broader focus on observability comes into play, with telemetry signals like logs and traces helping you diagnose the root cause of the issue.
3. Historical analysis
Beyond immediate firefighting, monitoring also helps you understand long-term patterns. Analyzing trends in resource usage, performance, and user behavior allows for better system design and optimization.
Essentially, monitoring provides the data and tools to investigate, diagnose, and ultimately resolve the root cause of incidents. It's not just about collecting metrics and displaying them on a dashboard, but also about debugging problems and learning from long-term trends.
What's involved in monitoring?
Monitoring starts with defining the goals for the systems you're interested in. In most cases, you'll want to track the performance, health, and behavior of your system, but you may have other monitoring goals as well.
The next step is collecting the raw metric data needed to achieve your monitoring goals. This includes collecting metrics from your infrastructure through native mechanisms, exporters, and other agents, and instrumenting your services to emit relevant metrics.
It could also mean collecting logs and deriving metrics from them for systems where native metrics aren't available, or monitoring the logs for significant events.
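Deriving metrics from logs can be as simple as parsing each line and counting what you find. The following Python sketch assumes a made-up log layout (`<timestamp> <LEVEL> <METHOD> <path> <status> <duration>ms`); your logs will almost certainly differ, so treat the pattern and the `metrics_from_logs` helper as illustrative only.

```python
import re
from collections import Counter

# Assumed log layout: "<timestamp> <LEVEL> <METHOD> <path> <status> <duration>ms"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<duration_ms>\d+)ms$"
)


def metrics_from_logs(lines):
    """Count requests per status class and total their durations."""
    by_status_class = Counter()
    total_duration_ms = 0
    skipped = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            skipped += 1  # not a request log line; count it and move on
            continue
        by_status_class[match["status"][0] + "xx"] += 1
        total_duration_ms += int(match["duration_ms"])
    return {
        "requests_by_status_class": dict(by_status_class),
        "total_duration_ms": total_duration_ms,
        "skipped_lines": skipped,
    }


sample_logs = [
    "2024-05-01T12:00:00Z INFO GET /cart 200 35ms",
    "2024-05-01T12:00:01Z ERROR POST /checkout 500 120ms",
    "2024-05-01T12:00:02Z INFO worker started",
]
```

Running `metrics_from_logs(sample_logs)` counts one `2xx` and one `5xx` request, sums 155 ms of request time, and skips the one line that is not a request log.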
Once you have your metrics data, you need to send it to a monitoring system where it can be used to build dashboards and configure alerting. A popular open-source combination is Prometheus and Grafana, where you can send metrics in the Prometheus format or even OTLP.
With your metrics data being ingested, you'll need to build dashboards that turn the raw data into meaningful visualizations. These dashboards provide an at-a-glance overview of your systems' health and performance in a way that lets you quickly spot anomalies in application behavior.
Most monitoring systems provide a set of default dashboards for popular software systems so you can quickly get started, but you'll definitely need to customize them depending on your monitoring goals.
It's not enough to simply collect and visualize data; you need to be proactively notified when critical situations arise. Any monitoring system needs to be accompanied by a robust alerting system to be truly effective.
You'll need to define clear thresholds that trigger alerts based on your metrics. This might involve setting limits for CPU usage, error rates, response times, or any other metric.
For example, an alert could trigger if a server's CPU usage consistently exceeds 90% or if an application's error rate surpasses a predefined threshold.
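The two conditions in that example can be sketched in a few lines of Python. Here "consistently" is interpreted as "every one of the most recent N samples", which is only one possible reading, and the function names and values are hypothetical.

```python
def exceeds_consistently(samples: list[float], threshold: float, min_points: int) -> bool:
    """True if the most recent `min_points` samples are all above `threshold`."""
    if min_points <= 0:
        raise ValueError("min_points must be positive")
    if len(samples) < min_points:
        return False  # not enough data yet to call it consistent
    return all(value > threshold for value in samples[-min_points:])


def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed (0.0 when there was no traffic)."""
    return errors / requests if requests else 0.0


# Hypothetical readings: CPU usage in percent, sampled once per minute.
cpu_samples = [50.0, 95.0, 92.0, 97.0]
cpu_alert = exceeds_consistently(cpu_samples, threshold=90.0, min_points=3)
error_alert = error_rate(errors=5, requests=100) > 0.01
```

Requiring several consecutive samples keeps a single short spike from triggering the CPU alert, while the error-rate check compares a ratio rather than a raw count so that it scales with traffic.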
When an alert is triggered, you'll need to ensure that it reaches the right people through the right channel. Monitoring systems are often complemented with an incident response process to ensure swift reactions to critical situations.
What's involved in alerting?
Alerting is a simple concept: when something important happens, send a notification.
However, effective alerting requires careful planning and implementation to ensure that alerts are meaningful, actionable, and not overwhelming. Here's a breakdown of what's involved:
1. Defining what's important
Not every anomaly or metric change warrants an alert. Identifying what constitutes "important" requires selecting relevant metrics and categorizing their impact based on urgency.
For example, a metric that tracks whether the service is up is clearly much more impactful than one that monitors minor fluctuations in resource usage.
2. Setting up alert triggers
The next step is specifying the conditions under which alerts are generated. This could involve static thresholds, such as memory usage above 80%, or dynamic thresholds, such as the error rate rising 20% above its average.
For more complex scenarios, you can combine multiple conditions using logical operators so that alerts are triggered only when specific combinations of events occur.
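Here is a small Python sketch of both threshold styles and of combining them with a logical AND. The helper names, the 80% memory limit, and the 20% increase come from the examples above or are invented for illustration; the dynamic baseline is assumed to be a plain mean of recent values.

```python
from statistics import mean


def static_breach(value: float, limit: float) -> bool:
    """Static threshold: the value is above a fixed limit."""
    return value > limit


def dynamic_breach(value: float, history: list[float], increase: float = 0.20) -> bool:
    """Dynamic threshold: the value is more than `increase` above the historical mean."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return value > mean(history) * (1 + increase)


def should_alert(memory_pct: float, error_rate: float,
                 error_rate_history: list[float]) -> bool:
    """Fire only when both conditions hold at the same time."""
    return (static_breach(memory_pct, 80.0)
            and dynamic_breach(error_rate, error_rate_history))
```

With a history of `[0.02, 0.02, 0.02]`, `should_alert(85.0, 0.05, history)` is true because both conditions hold, while `should_alert(70.0, 0.05, history)` is false because memory usage is still under the static limit.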
3. Routing alerts effectively
Once an alert is triggered, it needs to reach the right people through the appropriate channels. This involves determining who should be notified for different types of alerts, customizing what information should be included in the alert, and routing them to the correct channel (such as phone calls for urgent issues, or email for less urgent alerts).
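A routing step can be sketched as a lookup from severity to channel plus a message template. In the Python example below, the `Alert` fields, the severity names, the `chat` channel, and the `<team>-oncall` recipient convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # "critical", "warning", or "info" (assumed levels)
    team: str      # team that owns the affected service
    summary: str


# Urgent issues go to a phone call, less urgent ones to quieter channels.
ROUTES = {"critical": "phone", "warning": "chat", "info": "email"}


def route(alert: Alert) -> tuple[str, str, str]:
    """Return (channel, recipient, message) for an alert."""
    channel = ROUTES.get(alert.severity, "email")  # unknown severities fall back to email
    recipient = f"{alert.team}-oncall"
    message = f"[{alert.severity.upper()}] {alert.name}: {alert.summary}"
    return channel, recipient, message
```

For instance, `route(Alert("ServiceDown", "critical", "payments", "checkout API is unreachable"))` selects the `phone` channel and addresses `payments-oncall`.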
4. Preventing alert fatigue
A common challenge with alerting is alert fatigue – when excessive or irrelevant notifications desensitize recipients and lead to important alerts being ignored.
To combat this, add a time threshold to your alerts so that they fire only when a condition persists, which keeps them appropriately sensitive and avoids false positives. You can also implement mechanisms to filter out irrelevant alerts or group similar notifications to reduce clutter.
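Both ideas fit in a short Python sketch: a hold period before an alert fires, and grouping of alerts that share the same labels. The function names, the label keys (`service`, `alertname`), and the dictionary shape of an alert are hypothetical choices for this example.

```python
from collections import defaultdict
from typing import Optional


def is_firing(breach_started_at: Optional[float], now: float, hold_seconds: float) -> bool:
    """An alert fires only after its condition has held for `hold_seconds`."""
    if breach_started_at is None:
        return False  # the condition is not currently breached
    return now - breach_started_at >= hold_seconds


def group_alerts(alerts: list[dict], keys: tuple = ("service", "alertname")) -> dict:
    """Bundle alerts that share the same label values into one group each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(key) for key in keys)].append(alert)
    return dict(groups)


active = [
    {"service": "api", "alertname": "HighErrorRate", "instance": "a"},
    {"service": "api", "alertname": "HighErrorRate", "instance": "b"},
    {"service": "db", "alertname": "DiskFull", "instance": "c"},
]
```

Grouping `active` yields two notifications instead of three, and a breach that started 150 seconds ago does not fire under a 300-second hold.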
5. Escalation procedures
For critical alerts, it's essential to have escalation procedures in place to ensure a timely response, even if the initial recipient is unavailable.
This involves establishing a clear chain of command for escalating alerts to different individuals or teams and determining how long an alert should remain unacknowledged before escalating to the next level.
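An escalation policy can be modeled as an ordered list of steps, each with a delay. The Python sketch below uses invented roles and delays (10 and 30 minutes) purely as an example; the right values depend on your team and the severity of the alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes an alert stays unacknowledged before this step
    notify: str         # who gets notified at this step


# Hypothetical chain of command for a critical alert.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]


def escalation_targets(policy: list[EscalationStep], minutes_unacknowledged: float) -> list[str]:
    """Everyone who should have been notified after this long without an acknowledgement."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [step.notify for step in ordered
            if minutes_unacknowledged >= step.after_minutes]
```

After 12 minutes without an acknowledgement, `escalation_targets(POLICY, 12)` includes both the primary and secondary on-call; once the alert is acknowledged you would stop evaluating the policy altogether.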
The limits of monitoring in modern systems
Traditional monitoring, which is built on the concept of metrics, has served us well for decades. However, the increasing complexity of modern systems is pushing its boundaries.
Modern systems are characterized by:
- Distributed architectures: Applications are now composed of numerous interconnected services, often running across diverse platforms and environments, making it difficult to pinpoint the root cause of problems.
- Dynamic infrastructure: Cloud-native systems utilize elastic infrastructure, with components constantly changing and scaling based on demand, challenging the assumptions of static environments.
- Complex dependencies: Modern applications rely on a web of interconnected services, many of which are external and beyond direct control, making it harder to isolate issues.
These characteristics create a level of complexity that traditional monitoring, with its focus on predefined metrics and dashboards, struggles to handle. This is where observability steps in.
Observability acknowledges the inherent complexity of modern systems and provides a more comprehensive approach to understanding their behavior. It goes beyond basic metrics to incorporate rich data sources like logs, traces, and events, providing a multi-dimensional view of system dynamics.
Monitoring with Better Stack
Better Stack's observability platform offers first-class monitoring features that help you transform your metrics data into actionable insights without breaking the bank.
Beyond basic monitoring features like dashboards and alerting, you can derive metrics directly from logs to enable anomaly monitoring even in scenarios with high cardinality or where direct metrics instrumentation is not feasible.
You'll also get comprehensive incident and on-call management tools that help you detect issues as soon as they occur and route alerts to the right channels, along with AI-based incident silencing that prevents alert fatigue.
To see all this and more in action, sign up for a free account here.
Final thoughts
In the ever-evolving world of IT, monitoring is no longer a luxury, but a necessity. It's the key to unlocking insights, ensuring stability, and optimizing performance. By investing in a robust monitoring system, you empower your team to navigate complexities, make informed decisions, and drive success.
Embrace the power of metrics, monitoring, and alerting, and transform your IT infrastructure from a source of uncertainty to a foundation for innovation and growth.
Thanks for reading!