Due to the high volume of data generated in modern application environments,
storing and analyzing every single log entry is often impractical or costly.
This is where log sampling comes into play.
It involves selectively recording or capturing only a portion of all log entries
generated. This method allows for effective data management, ensuring that the
volume of logs is kept at a manageable level without losing critical insights.
To better understand log sampling, consider a scenario involving an endpoint
that interacts with a third-party provider.
Suppose the service experiences a brief downtime of a few seconds. During this
short period, let's say a thousand requests are sent to the endpoint, each
failing and generating an error log.
In a traditional setup, this would result in a thousand error log entries, which
could be excessive and unnecessary for analysis.
With log sampling implemented, instead of capturing all 1000 error logs, only a
representative subset is recorded.
For example, the system might be configured to capture every tenth identical
log, reducing the total number of stored logs to just 100.
This reduction significantly eases the burden on storage and processing
resources while still providing a meaningful overview of the incident.
Log sampling can also be fine-tuned based on various criteria, such as error
severity, frequency, or even specific time intervals, to ensure that the most
relevant and useful data is retained.
As we've alluded to, log sampling helps reduce the cost of logging by only
capturing a subset of the logs generated by the system. This manifests itself in
various ways, such as the following:
Reduced storage needs: By capturing only a portion of the logs, the
amount of storage required is significantly decreased. This is particularly
beneficial for systems that generate large volumes of log data.
Improved performance: Processing and analyzing a smaller set of logs
leads to faster performance in log analysis tools and monitoring systems, as
they have less data to process.
Cost efficiency: With less data to store and process, the overall cost of
logging decreases. This includes savings on data storage, infrastructure, and
even operational costs related to log management.
Focused analysis: Sampling can help focus on the most relevant or
critical logs. For example, sampling strategies can be designed to capture
more error logs and fewer informational logs, thereby directing attention to
potential issues.
Scalability: As systems scale and the volume of logs increases, log
sampling is a practical approach to ensure that log management remains
sustainable and doesn't overwhelm the system's resources.
Sampling strategies
When it comes to sampling techniques, there are quite a few out there, but this
article will highlight only the most common ones.
1. Random sampling
Random sampling involves selecting log entries from the data stream randomly,
without any consideration for their content, severity, or any other specific
attribute. This approach ensures each log entry has an equal probability of
selection.
Typically, random sampling is implemented by defining a probability factor. For
instance, one might choose to record one log entry for every N entries
generated. This ratio can be adjusted to suit the volume of log data and the
desired granularity of the logs retained.
Consider this example where one out of every five log entries (20%) is recorded:
// Assuming zerolog's BasicSampler, which lets one in every five entries through
log := zerolog.New(os.Stdout).With().Timestamp().Logger().
	Sample(&zerolog.BasicSampler{N: 5})

for i := 1; i <= 10; i++ {
	log.Info().Msgf("an info message: %d", i)
}
In the absence of sampling, a loop like this would produce ten log entries.
However, with sampling, only the first and sixth messages are recorded:
Output
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 1"}
{"level":"info","time":"2023-12-28T17:57:24+01:00","message":"an info message: 6"}
Because the sampler does not inspect the level or content of each record, all
types of log records are treated equally:
. . .
for i := 1; i <= 10; i++ {
	log.Info().Msgf("an info message: %d", i)
	log.Debug().Msgf("a debug message: %d", i)
	log.Warn().Msgf("a warning message: %d", i)
}
. . .
2. Time-based sampling
With time-based sampling, log entries are captured based on specific time
intervals. This method involves selecting up to a maximum number of logs at
regular, predetermined intervals, regardless of each log's content or level.
Consider the following example, which records a maximum of three log entries
per second and disregards all the others.
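One way to achieve this with zerolog is through its BurstSampler, which allows
a fixed number of entries through per period and, when no secondary sampler is
configured, drops everything beyond that. The snippet below is a minimal sketch
of such a setup:

package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

func main() {
	// Allow at most 3 entries per one-second window; with no NextSampler
	// configured, everything beyond the burst is dropped.
	log := zerolog.New(os.Stdout).With().Timestamp().Logger().
		Sample(&zerolog.BurstSampler{
			Burst:  3,
			Period: 1 * time.Second,
		})

	for i := 1; i <= 10; i++ {
		log.Info().Msgf("an info message: %d", i)
		log.Debug().Msgf("a debug message: %d", i)
		log.Warn().Msgf("a warning message: %d", i)
		log.Error().Msgf("an error message: %d", i)
	}
}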
While this method is straightforward, it can drop critical logs once the
per-interval cap is reached. For instance, the error logs are never captured in
the above example because the three allotted entries are used up by the info,
debug, and warning messages.
An alternative approach takes the log level and message into account to ensure
a more representative sample. Here's a variant using the Zap package, where the
sampling core wraps a JSON core that writes to standard output at Info level:
func createLogger() *zap.Logger {
	// JSON core writing to stdout; the Info level is why the debug entries
	// in the loop below never show up in the output.
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(os.Stdout),
		zap.InfoLevel,
	)
	samplingCore := zapcore.NewSamplerWithOptions(
		core,
		time.Second, // sampling interval
		3,           // log the first 3 entries with the same level and message
		0,           // thereafter log zero entries within the interval
	)
	return zap.New(samplingCore)
}
func main() {
	logger := createLogger().Sugar()
	defer logger.Sync()

	for i := 1; i <= 10; i++ {
		logger.Info("an info message")
		logger.Debug("a debug message")
		logger.Warn("a warning message")
		logger.Error("an error message")
	}
}
This approach differs from the previous one in that it takes the log level and
message into account, capturing up to three entries per interval for each
unique level and message. Because the core's minimum level is Info, the debug
messages are dropped entirely, leaving nine entries:
Output
{"level":"info","ts":1703799628.2984571,"msg":"an info message"}
{"level":"warn","ts":1703799628.2984889,"msg":"a warning message"}
{"level":"error","ts":1703799628.2984993,"msg":"an error message"}
{"level":"info","ts":1703799628.2985106,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985137,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985182,"msg":"an error message"}
{"level":"info","ts":1703799628.298522,"msg":"an info message"}
{"level":"warn","ts":1703799628.2985265,"msg":"a warning message"}
{"level":"error","ts":1703799628.2985306,"msg":"an error message"}
3. Hash-based sampling
In distributed systems, you may have multiple interconnected services involved
in processing user requests. Each service in the pipeline processes these
requests and generates corresponding logs, which include a correlation ID,
timestamp, contextual data, and any exceptions encountered.
When sampling logs in such environments, you may want either all or none of the
logs related to a specific request, rather than a random or partial selection.
This is where hash-based sampling comes into play.
It allows for sampling logs based on a unique identifier, which could be a
single field or a combination of fields.
By applying a hash function to this identifier and setting a threshold, you
ensure that either all or no logs related to a specific request are sampled,
providing a comprehensive view of the request's journey through the processing
pipeline, even in a sampled dataset.
This technique is not limited to logging in microservices. It can also ensure
that either all or none of the related log entries for a transaction are
included in the sample, even in single-service systems.
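To make the idea concrete, here's a minimal, library-agnostic sketch in Go. The
sampleByID helper and the request IDs are hypothetical; the point is that
hashing the identifier deterministically means every service applying the same
function and threshold arrives at the same keep-or-drop decision:

package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// sampleByID hashes a correlation/request ID and keeps it only if the hash
// falls below the sampling threshold. The same ID always produces the same
// result, so a request's logs are either all kept or all dropped.
func sampleByID(id string, sampleRate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(id))
	return float64(h.Sum32())/float64(math.MaxUint32+1) < sampleRate
}

func main() {
	// Keep the logs for roughly 20% of requests.
	for _, id := range []string{"req-123", "req-456", "req-789"} {
		fmt.Printf("%s -> keep logs: %v\n", id, sampleByID(id, 0.2))
	}
}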
To implement hash-based sampling at the collection layer, you can use Rsyslog.
See its documentation for more details.
How to sample your logs
Log sampling can be done at the application level through your logging
framework of choice. However, if the framework doesn't
support sampling or you need a more flexible sampling configuration, you can use
a log collection agent to implement sampling.
Vector is one such tool that can sample your logs based on
supplied criteria and at a configurable rate. For example, a rate of 5 means
that 1 out of every 5 logs will be captured while the remainder are dropped.
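A sketch of what this might look like in a Vector configuration, where the
app_logs input is a placeholder for your actual log source:

[transforms.sample_logs]
type = "sample"
inputs = ["app_logs"] # placeholder for your log source
rate = 5              # forward 1 out of every 5 events and drop the rest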
Centralize and control log costs with Better Stack
Once you've implemented sampling in your application, Better Stack provides centralized log management with flexible cost controls—30x cheaper than Datadog with predictable pricing starting at $0.25/GB.
You can mark noisy logs "as spam" to exclude them from billing, apply sampling during query time, and set custom retention periods per source. Better Stack works alongside your application-level sampling from Vector, Zerolog, or Zap.
Live tail lets you monitor sampled logs as they arrive with instant filtering. Even with aggressive sampling, you can spot patterns and debug issues in real-time.
You can also create custom dashboards with SQL or PromQL queries that execute in sub-seconds. Better Stack automatically analyzes sampled logs to identify patterns, error rates, and performance trends.
Final thoughts
In a world where data is king, log sampling helps maintain the kingdom without
the burden of excessive information. It's about working smarter, not harder, and
making data analysis both feasible and meaningful.
As systems continue to grow in scale and complexity, I expect sampling to get
even more emphasis in the industry as a way to prevent costs from spiraling out
of control.