Due to the high volume of data generated in modern application environments,
storing and analyzing every single log entry is often impractical or costly.
This is where log sampling comes into play.
It involves selectively recording or capturing only a portion of all log entries
generated. This method allows for effective data management, ensuring that the
volume of logs is kept at a manageable level without losing critical insights.
To better understand log sampling, consider a scenario involving an endpoint
that interacts with a third-party provider.
Suppose the service experiences a brief downtime of a few seconds. During this
short period, let's say a thousand requests are sent to the endpoint, each
failing and generating an error log.
In a traditional setup, this would result in a thousand error log entries, which
could be excessive and unnecessary for analysis.
With log sampling implemented, instead of capturing all 1000 error logs, only a
representative subset is recorded.
For example, the system might be configured to capture every tenth identical
log, reducing the total number of stored logs to just 100.
This reduction significantly eases the burden on storage and processing
resources while still providing a meaningful overview of the incident.
Log sampling can also be fine-tuned based on various criteria, such as error
severity, frequency, or even specific time intervals, to ensure that the most
relevant and useful data is retained.
As we've alluded to, log sampling helps reduce the cost of logging by only
capturing a subset of the logs generated by the system. This manifests itself in
various ways, such as the following:
Reduced storage needs: By capturing only a portion of the logs, the
amount of storage required is significantly decreased. This is particularly
beneficial for systems that generate large volumes of log data.
Improved performance: Processing and analyzing a smaller set of logs
leads to faster performance in log analysis tools and monitoring systems, as
they have less data to process.
Cost efficiency: With less data to store and process, the overall cost of
logging decreases. This includes savings on data storage, infrastructure, and
even operational costs related to log management.
Focused analysis: Sampling can help focus on the most relevant or
critical logs. For example, sampling strategies can be designed to capture
more error logs and fewer informational logs, thereby directing attention to
potential issues.
Scalability: As systems scale and the volume of logs increases, log
sampling is a practical approach to ensure that log management remains
sustainable and doesn't overwhelm the system's resources.
Sampling strategies
When it comes to sampling techniques, there are quite a few out there, but this
article will highlight only the most common ones.
1. Random sampling
Random sampling involves selecting log entries from the data stream randomly,
without any consideration for their content, severity, or any other specific
attribute. This approach ensures each log entry has an equal probability of
selection.
Typically, random sampling is implemented by defining a probability factor. For
instance, one might choose to record one log entry for every N entries
generated. This ratio can be adjusted to suit the volume of log data and the
desired granularity of the logs retained.
Consider this example where one out of every five log entries (20%) is recorded:
// Assuming zerolog's BasicSampler, which lets one in every five entries through
log := zerolog.New(os.Stdout).With().Timestamp().Logger().
	Sample(&zerolog.BasicSampler{N: 5})

for i := 1; i <= 10; i++ {
	log.Info().Msgf("an info message: %d", i)
}
In the absence of sampling, a loop like this would produce ten log entries.
However, with sampling, only the first and sixth messages are recorded:
Output
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 1"}
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 6"}
Because the sampler does not inspect the level or content of each record, all
types of log records are treated equally:
. . .
for i := 1; i <= 10; i++ {
	log.Info().Msgf("an info message: %d", i)
	log.Debug().Msgf("a debug message: %d", i)
	log.Warn().Msgf("a warning message: %d", i)
}
. . .
2. Time-based sampling
With time-based sampling, log entries are captured based on specific time
intervals. This method involves selecting up to a maximum number of logs at
regular, predetermined intervals, regardless of each log's content or level.
Consider the following example, which records a maximum of three log entries
per second and disregards all the others.
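One way to achieve this with zerolog is through its BurstSampler, which allows
a fixed number of entries through per period and, when no secondary sampler is
configured, drops everything beyond that. The snippet below is a minimal sketch
of such a setup:

package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

func main() {
	// Allow at most 3 entries per one-second window; with no NextSampler
	// configured, everything beyond the burst is dropped.
	log := zerolog.New(os.Stdout).With().Timestamp().Logger().
		Sample(&zerolog.BurstSampler{
			Burst:  3,
			Period: 1 * time.Second,
		})

	for i := 1; i <= 10; i++ {
		log.Info().Msgf("an info message: %d", i)
		log.Debug().Msgf("a debug message: %d", i)
		log.Warn().Msgf("a warning message: %d", i)
		log.Error().Msgf("an error message: %d", i)
	}
}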
While this method is straightforward, it can drop critical logs once the
per-interval cap is reached. For instance, the error logs are never captured in
the above example because the three allotted entries are used up by the info,
debug, and warning messages.
An alternative approach takes the log level and message into account to ensure
a more representative sample. Here's a variant using the Zap package, where the
sampling core wraps a JSON core that writes to standard output at Info level:
func createLogger() *zap.Logger {
	// JSON core writing to stdout; the Info level is why the debug entries
	// in the loop below never show up in the output.
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(os.Stdout),
		zap.InfoLevel,
	)
	samplingCore := zapcore.NewSamplerWithOptions(
		core,
		time.Second, // sampling interval
		3,           // log the first 3 entries with the same level and message
		0,           // thereafter log zero entries within the interval
	)
	return zap.New(samplingCore)
}
func main() {
	logger := createLogger().Sugar()
	defer logger.Sync()

	for i := 1; i <= 10; i++ {
		logger.Info("an info message")
		logger.Debug("a debug message")
		logger.Warn("a warning message")
		logger.Error("an error message")
	}
}
This approach differs from the previous one in that it takes the log level and
message into account, capturing up to three entries per interval for each
unique level and message. Because the core's minimum level is Info, the debug
messages are dropped entirely, leaving nine entries:
Output
{"level":"info","ts":1703799628.2984571,"msg":"an info message"}
{"level":"warn","ts":1703799628.2984889,"msg":"a warning message"}
{"level":"error","ts":1703799628.2984993,"msg":"an error message"}
{"level":"info","ts":1703799628.2985106,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985137,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985182,"msg":"an error message"}
{"level":"info","ts":1703799628.298522,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985265,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985306,"msg":"an error message"}
3. Hash-based sampling
In distributed systems, you may have multiple interconnected services involved
in processing user requests. Each service in the pipeline processes these
requests and generates corresponding logs, which include a correlation ID,
timestamp, contextual data, and any exceptions encountered.
When sampling logs in such environments, you may want either all or none of the
logs related to a specific request, rather than a random or partial selection.
This is where hash-based sampling comes into play.
It allows for sampling logs based on a unique identifier, which could be a
single field or a combination of fields.
By applying a hash function to this identifier and setting a threshold, you
ensure that either all or no logs related to a specific request are sampled,
providing a comprehensive view of the request's journey through the processing
pipeline, even in a sampled dataset.
This technique is not limited to logging in microservices. It can also ensure
that either all or none of the related log entries for a transaction are
included in the sample, even in single-service systems.
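To make the idea concrete, here's a minimal, library-agnostic sketch in Go. The
sampleByID helper and the request IDs are hypothetical; the point is that
hashing the identifier deterministically means every service applying the same
function and threshold arrives at the same keep-or-drop decision:

package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// sampleByID hashes a correlation/request ID and keeps it only if the hash
// falls below the sampling threshold. The same ID always produces the same
// result, so a request's logs are either all kept or all dropped.
func sampleByID(id string, sampleRate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(id))
	return float64(h.Sum32())/float64(math.MaxUint32+1) < sampleRate
}

func main() {
	// Keep the logs for roughly 20% of requests.
	for _, id := range []string{"req-123", "req-456", "req-789"} {
		fmt.Printf("%s -> keep logs: %v\n", id, sampleByID(id, 0.2))
	}
}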
To implement hash-based sampling at the collection layer, you can use Rsyslog.
See its documentation for more details.
How to sample your logs
Log sampling can be done at the application level through your logging
framework of choice. However, if the framework doesn't
support sampling or you need a more flexible sampling configuration, you can use
a log collection agent to implement sampling.
Vector is one such tool that can sample your logs based on
supplied criteria and at a configurable rate. For example, a rate of 5 means
that 1 out of every 5 logs will be captured while the remainder are dropped.
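A sketch of what this might look like in a Vector configuration, where the
app_logs input is a placeholder for your actual log source:

[transforms.sample_logs]
type = "sample"
inputs = ["app_logs"] # placeholder for your log source
rate = 5              # forward 1 out of every 5 events and drop the rest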
Centralize and control log costs with Better Stack
Once you've implemented sampling in your application, Better Stack provides centralized log management with flexible cost controls—30x cheaper than Datadog with predictable pricing starting at $0.25/GB.
You can mark noisy logs "as spam" to exclude them from billing, apply sampling during query time, and set custom retention periods per source. Better Stack works alongside your application-level sampling from Vector, Zerolog, or Zap.
Live tail lets you monitor sampled logs as they arrive with instant filtering. Even with aggressive sampling, you can spot patterns and debug issues in real-time.
You can also create custom dashboards with SQL or PromQL queries that execute in sub-seconds. Better Stack automatically analyzes sampled logs to identify patterns, error rates, and performance trends.
Final thoughts
In a world where data is king, log sampling helps maintain the kingdom without
the burden of excessive information. It's about working smarter, not harder, and
making data analysis both feasible and meaningful.
As systems continue to grow in scale and complexity, I expect sampling to get
even more emphasis in the industry as a way to prevent costs from spiraling out
of control.