How to Reduce Logging Costs with Log Sampling
Modern application environments generate so much data that storing and analyzing every single log entry is often impractical or costly.
This is where log sampling comes into play.
It involves selectively recording or capturing only a portion of all log entries generated. This method allows for effective data management, ensuring that the volume of logs is kept at a manageable level without losing critical insights.
To better understand log sampling, consider a scenario involving an endpoint that interacts with a third-party provider.
Suppose the service experiences a brief downtime of a few seconds. During this short period, let's say a thousand requests are sent to the endpoint, each failing and generating an error log.
In a traditional setup, this would result in a thousand error log entries, which could be excessive and unnecessary for analysis.
With log sampling implemented, instead of capturing all 1000 error logs, only a representative subset is recorded.
For example, the system might be configured to capture every tenth identical log, reducing the total number of stored logs to just 100.
This reduction significantly eases the burden on storage and processing resources while still providing a meaningful overview of the incident.
Log sampling can also be fine-tuned based on various criteria, such as error severity, frequency, or even specific time intervals, to ensure that the most relevant and useful data is retained.
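The "every tenth identical log" scenario above can be sketched in a few lines of Go. This is only an illustration using the standard library: `nthSampler`, `newNthSampler`, and `countKept` are hypothetical helpers rather than part of any logging framework, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// nthSampler keeps the 1st, (n+1)th, (2n+1)th, ... occurrence of each
// distinct message. Hypothetical helper, not safe for concurrent use.
type nthSampler struct {
	n      int
	counts map[string]int
}

func newNthSampler(n int) *nthSampler {
	return &nthSampler{n: n, counts: make(map[string]int)}
}

// keep reports whether this occurrence of msg should be recorded.
func (s *nthSampler) keep(msg string) bool {
	c := s.counts[msg]
	s.counts[msg] = c + 1
	return c%s.n == 0
}

// countKept feeds total copies of msg through a fresh sampler and
// returns how many of them were kept.
func countKept(n, total int, msg string) int {
	s := newNthSampler(n)
	kept := 0
	for i := 0; i < total; i++ {
		if s.keep(msg) {
			kept++
		}
	}
	return kept
}

func main() {
	// 1000 identical errors during the outage, keeping every tenth one.
	fmt.Println(countKept(10, 1000, "upstream provider unavailable")) // prints 100
}
```

Because the counter is kept per message, an unrelated error that occurs only once during the same outage is still recorded.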
Why is log sampling useful?
As we've alluded to, log sampling helps reduce the cost of logging by only capturing a subset of the logs generated by the system. This manifests itself in various ways, such as the following:
Reduced storage needs: By capturing only a portion of the logs, the amount of storage required is significantly decreased. This is particularly beneficial for systems that generate large volumes of log data.
Improved performance: Processing and analyzing a smaller set of logs leads to faster performance in log analysis tools and monitoring systems, as they have less data to process.
Cost efficiency: With less data to store and process, the overall cost of logging decreases. This includes savings on data storage, infrastructure, and even operational costs related to log management.
Focused analysis: Sampling can help focus on the most relevant or critical logs. For example, sampling strategies can be designed to capture more error logs and fewer informational logs, thereby directing attention to potential issues.
Scalability: As systems scale and the volume of logs increases, log sampling is a practical approach to ensure that log management remains sustainable and doesn't overwhelm the system's resources.
Sampling strategies
There are many sampling techniques, but this article highlights only the most common ones.
1. Random sampling
Random sampling involves selecting log entries from the data stream randomly, without any consideration for their content, severity, or any other specific attribute. This approach ensures each log entry has an equal probability of selection.
Typically, random sampling is implemented by defining a probability factor. For instance, one might choose to record one log entry for every N entries generated. This ratio can be adjusted to suit the volume of log data and the desired granularity of the logs retained.
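Consider this example, which uses zerolog's BasicSampler to record one out of every five log entries (20%). Note that BasicSampler counts entries and keeps every Nth one, so the selection is deterministic rather than drawn from a random number generator: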
package main

import (
	"os"

	"github.com/rs/zerolog"
)

func main() {
	log := zerolog.New(os.Stdout).
		With().
		Timestamp().
		Logger().
		Sample(&zerolog.BasicSampler{N: 5})

	for i := 1; i <= 10; i++ {
		log.Info().Msgf("an info message: %d", i)
	}
}
In the absence of sampling, a loop like this would produce ten log entries. However, with sampling, only the first and sixth messages are recorded:
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 1"}
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 6"}
Because the sampler ignores the level and content of each entry, all types of log records are treated equally:
. . .
for i := 1; i <= 10; i++ {
	log.Info().Msgf("an info message: %d", i)
	log.Debug().Msgf("a debug message: %d", i)
	log.Warn().Msgf("a warning message: %d", i)
}
. . .
The output reflects this unbiased selection:
{"level":"info","time":"2023-12-28T18:11:20+01:00","message":"an info message: 1"}
{"level":"warn","time":"2023-12-28T18:11:20+01:00","message":"a warning message: 2"}
{"level":"debug","time":"2023-12-28T18:11:20+01:00","message":"a debug message: 4"}
{"level":"info","time":"2023-12-28T18:11:20+01:00","message":"an info message: 6"}
{"level":"warn","time":"2023-12-28T18:11:20+01:00","message":"a warning message: 7"}
{"level":"debug","time":"2023-12-28T18:11:20+01:00","message":"a debug message: 9"}
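The zerolog example keeps every fifth entry by counting. Where a genuinely probabilistic selection is preferred, a minimal sketch using only the standard library might look like the following. The names `probSampler`, `newProbSampler`, and `keptOutOf` are hypothetical, and the fixed seed is an assumption made purely to keep runs reproducible.

```go
package main

import (
	"fmt"
	"math/rand"
)

// probSampler keeps each entry independently with probability rate.
// Hypothetical helper for illustration, not safe for concurrent use.
type probSampler struct {
	rate float64
	rng  *rand.Rand
}

func newProbSampler(rate float64, seed int64) *probSampler {
	return &probSampler{rate: rate, rng: rand.New(rand.NewSource(seed))}
}

// keep draws a number in [0, 1) and keeps the entry if it is below rate.
func (s *probSampler) keep() bool {
	return s.rng.Float64() < s.rate
}

// keptOutOf runs total entries through a fresh sampler and returns
// how many were kept.
func keptOutOf(rate float64, total int, seed int64) int {
	s := newProbSampler(rate, seed)
	kept := 0
	for i := 0; i < total; i++ {
		if s.keep() {
			kept++
		}
	}
	return kept
}

func main() {
	// With a rate of 0.2, roughly one in five entries should survive.
	fmt.Printf("kept %d of 10000 entries\n", keptOutOf(0.2, 10000, 1))
}
```

Unlike the counter-based approach, the number of entries kept here only approximates the configured rate, and the approximation improves as the volume grows.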
2. Time-based sampling
With time-based sampling, log entries are captured based on specific time intervals. This method involves selecting up to a maximum number of logs at regular, predetermined intervals, regardless of the log's content or level.
Consider the following example, which records a maximum of three log entries per second, disregarding all others:
package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

func main() {
	log := zerolog.New(os.Stdout).
		With().
		Timestamp().
		Logger().
		Sample(&zerolog.BurstSampler{Period: 1 * time.Second, Burst: 3})

	for i := 1; i <= 10; i++ {
		log.Info().Msgf("an info message: %d", i)
		log.Debug().Msgf("a debug message: %d", i)
		log.Warn().Msgf("a warning message: %d", i)
		log.Error().Msgf("an error message: %d", i)
	}
}
Only the first three entries make it to the output:
{"level":"info","time":"2023-12-28T18:32:30+01:00","message":"an info message: 1"}
{"level":"debug","time":"2023-12-28T18:32:30+01:00","message":"a debug message: 1"}
{"level":"warn","time":"2023-12-28T18:32:30+01:00","message":"a warning message: 1"}
While this method is straightforward, it might miss critical logs that arrive after the quota for the current interval has been used up. For instance, no error logs are captured in the above example.
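To make the mechanics of a per-interval quota concrete, here is a rough standard-library sketch of the idea. It is not zerolog's implementation: `burstSampler` and `keptCount` are hypothetical names, and the current time is passed in explicitly (an assumption made so the behavior is easy to reproduce).

```go
package main

import (
	"fmt"
	"time"
)

// burstSampler keeps at most burst entries per period and drops the rest.
// Hypothetical helper for illustration, not safe for concurrent use.
type burstSampler struct {
	period      time.Duration
	burst       int
	windowStart time.Time
	count       int
}

// keep reports whether an entry arriving at now fits into the current quota.
func (s *burstSampler) keep(now time.Time) bool {
	// Open a new window on the first entry or once the period has elapsed.
	if s.windowStart.IsZero() || now.Sub(s.windowStart) >= s.period {
		s.windowStart = now
		s.count = 0
	}
	s.count++
	return s.count <= s.burst
}

// keptCount replays entries arriving at the given millisecond offsets
// from an arbitrary start time and returns how many are kept.
func keptCount(burst, periodMs int, offsetsMs ...int) int {
	s := &burstSampler{
		period: time.Duration(periodMs) * time.Millisecond,
		burst:  burst,
	}
	start := time.Unix(0, 0)
	kept := 0
	for _, ms := range offsetsMs {
		if s.keep(start.Add(time.Duration(ms) * time.Millisecond)) {
			kept++
		}
	}
	return kept
}

func main() {
	// Five entries within one second: only the first three are kept.
	fmt.Println(keptCount(3, 1000, 0, 100, 200, 300, 400)) // prints 3
	// A sixth entry one second later opens a new window and is kept.
	fmt.Println(keptCount(3, 1000, 0, 100, 200, 300, 400, 1000)) // prints 4
}
```

The sketch also shows the weakness described above: whichever entries arrive first in a window consume the quota, regardless of how important the later ones are.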
An alternative approach considers log level and content to ensure a representative sample. Here's a variant using the Zap module:
package main

import (
	"os"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func createLogger() *zap.Logger {
	stdout := zapcore.AddSync(os.Stdout)
	level := zap.NewAtomicLevelAt(zap.InfoLevel)
	productionCfg := zap.NewProductionEncoderConfig()
	jsonEncoder := zapcore.NewJSONEncoder(productionCfg)
	jsonOutCore := zapcore.NewCore(jsonEncoder, stdout, level)

	samplingCore := zapcore.NewSamplerWithOptions(
		jsonOutCore,
		time.Second, // interval
		3,           // log first 3 entries
		0,           // thereafter log zero entries within the interval
	)

	return zap.New(samplingCore)
}

func main() {
	logger := createLogger().Sugar()
	defer logger.Sync()

	for i := 1; i <= 10; i++ {
		logger.Info("an info message")
		logger.Debug("a debug message")
		logger.Warn("a warning message")
		logger.Error("an error message")
	}
}
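This approach differs from the previous one in that it takes the log level and message content into account, keeping up to three entries per interval for each unique combination of level and message. With four distinct messages, that would allow up to 12 entries, but the debug messages fall below the logger's minimum level (Info) and are discarded, leaving nine: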
{"level":"info","ts":1703799628.2984571,"msg":"an info message"}
{"level":"warn","ts":1703799628.2984889,"msg":"a warning message"}
{"level":"error","ts":1703799628.2984993,"msg":"an error message"}
{"level":"info","ts":1703799628.2985106,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985137,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985182,"msg":"an error message"}
{"level":"info","ts":1703799628.298522,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985265,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985306,"msg":"an error message"}
3. Hash-based sampling
In distributed systems, you may have multiple interconnected services involved in processing user requests. Each service in the pipeline processes these requests and generates corresponding logs, which include a correlation ID, timestamp, contextual data, and any exceptions encountered.
When sampling logs in such environments, you may want either all or no logs related to a specific request instead of a random or partial selection of logs.
This is where hash-based sampling comes into play.
It allows for sampling logs based on a unique identifier, which could be a single field or combination of fields.
By applying a hash function to this identifier and setting a threshold, you ensure that either all or no logs related to a specific request are sampled, providing a comprehensive view of the request's journey through the processing pipeline, even in a sampled dataset.
This technique is not limited to logging in microservices. It can also ensure that either all or none of the related log entries for a transaction are included in the sample, even on single-service systems.
To implement hash-based sampling, you can use Rsyslog. See its documentation for more details.
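The hash-and-threshold idea itself is small enough to sketch at the application level. The example below is an illustration only and is unrelated to how Rsyslog implements it: `keepRequest` is a hypothetical helper, the `req-…` identifiers are made up, and the choice of FNV-1a with a percentage threshold is an assumption.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepRequest reports whether logs carrying requestID belong to the sample.
// percent is the share of requests to keep, from 0 to 100.
// Hypothetical helper for illustration.
func keepRequest(requestID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(requestID)) // writing to a hash.Hash never returns an error
	// Map the hash onto 0..99 and compare it against the threshold.
	return h.Sum32()%100 < percent
}

func main() {
	// Made-up correlation IDs for demonstration purposes.
	ids := []string{"req-1001", "req-1002", "req-1003"}
	for _, id := range ids {
		// Every service that sees the same ID reaches the same decision,
		// so a request's logs are either all kept or all dropped.
		fmt.Printf("%s sampled: %v\n", id, keepRequest(id, 20))
	}
}
```

Since the decision depends only on the identifier and the threshold, services do not need to coordinate with each other as long as they share the same hash function and percentage.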
How to sample your logs
Log sampling can be done at the application level through your logging framework of choice. However, if the framework doesn't support sampling or you need a more flexible sampling configuration, you can use a log collection agent to implement sampling.
Vector is one such tool that can sample your logs based on supplied criteria and at a configurable rate. For example, a rate of 5 means that 1 out of every 5 logs will be captured while the remainder are dropped:
sources:
  myapp_logs:
    type: file
    include:
      - /var/log/myapp/app.log

transforms:
  sample_myapp:
    type: sample
    inputs:
      - myapp_logs
    key_field: msg
    rate: 5

sinks:
  print:
    type: console
    inputs:
      - sample_myapp
    encoding:
      codec: json
Final thoughts
In a world where data is king, log sampling helps maintain the kingdom without the burden of excessive information. It's about working smarter, not harder, and making data analysis both feasible and meaningful.
As systems continue to grow in scale and complexity, I expect sampling to get even more emphasis in the industry as a way to prevent costs from spiraling out of control.
Thanks for reading, and happy logging!