Essential OpenTelemetry Best Practices for Robust Observability
In today's complex distributed systems, effective observability is critical for understanding application behavior, troubleshooting issues, and maintaining performance. OpenTelemetry has emerged as the leading open standard for generating and collecting telemetry data, providing a unified approach to instrumentation across different languages and frameworks.
However, implementing OpenTelemetry effectively requires careful planning and consideration. Based on real-world experiences, this article presents six essential best practices that will help you maximize the value of your OpenTelemetry implementation while avoiding common pitfalls.
Let's get started!
1. Start with auto-instrumentation
The foundation of any successful OpenTelemetry implementation begins with a thoughtful instrumentation strategy. How you instrument your applications will directly impact the quality and usefulness of your telemetry data.
Auto-instrumentation provides an excellent starting point for your OpenTelemetry journey. It automatically instruments common libraries and frameworks, giving you immediate visibility into your application's behavior without requiring extensive code changes.
Auto-instrumentation casts a wide net, collecting data from HTTP requests, database queries, external API calls, and other common operations. This broad coverage allows you to quickly identify performance bottlenecks and issues without having to manually instrument every part of your application.
A critical but often overlooked detail is ensuring that OpenTelemetry is initialized before any instrumented libraries are used. This means including the OpenTelemetry SDK and configuration at the very beginning of your application's startup sequence.
For JavaScript applications, this means importing and configuring the OpenTelemetry SDK at the entry point of your application, before importing any other libraries or modules that you want to instrument.
// Initialize OpenTelemetry first - before other library imports
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
// Create and configure the tracer provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
url: 'http://collector:4318/v1/traces'
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();
// Register auto-instrumentations
registerInstrumentations({
instrumentations: [
// Add your instrumentations here
],
});
// Now import your application modules
const express = require('express');
const app = express();
If you don't initialize OpenTelemetry early enough, you risk missing important spans and metrics from the early stages of your application's execution.
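If you prefer not to list individual instrumentation packages, the @opentelemetry/auto-instrumentations-node meta-package can register instrumentations for most commonly used libraries in one call. A minimal sketch, assuming that package is installed alongside the SDK:

const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

// Registers instrumentations for supported libraries (http, express, pg, and others)
registerInstrumentations({
  instrumentations: [getNodeAutoInstrumentations()],
});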
Add targeted manual instrumentation
While auto-instrumentation provides excellent baseline coverage, it can't capture everything. To get the most value from OpenTelemetry, you'll need to add manual instrumentation for business-critical paths and custom code.
Manual instrumentation allows you to:
- Add context to automated spans with business-relevant attributes
- Create spans for code that isn't covered by auto-instrumentation
- Capture custom metrics specific to your application domain
- Add span events for important state transitions or milestones
When adding manual instrumentation, remember to always close your spans. Unclosed spans can lead to incomplete traces and resource leaks. In JavaScript, use async/await or Promises with finally blocks to ensure spans are properly closed even when exceptions occur.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
async function processOrder(orderId, customerId) {
  const tracer = trace.getTracer('my-service');
  // Start a span and make it active so any child spans are parented to it
  return tracer.startActiveSpan('processOrder', async (span) => {
    // Add relevant attributes
    span.setAttribute('order.id', orderId);
    span.setAttribute('customer.id', customerId);
    try {
      // Your business logic here
      await processPayment(orderId);
      // Add events to mark important milestones
      span.addEvent('payment.processed');
      const result = await updateInventory(orderId);
      span.addEvent('inventory.updated');
      return result;
    } catch (error) {
      // Record the error on the span
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      // Always end the span
      span.end();
    }
  });
}
The most effective approach combines auto and manual instrumentation. Start with auto-instrumentation to get broad coverage, then strategically add manual instrumentation to fill gaps and add business context where needed.
2. Optimize attribute and context management
Attributes and context provide the critical metadata that makes your telemetry data meaningful and useful. Effective attribute management ensures that you have the information you need to understand your application's behavior without collecting unnecessary data.
Follow semantic conventions
OpenTelemetry defines semantic conventions that provide a common vocabulary for describing different types of entities. Following these conventions ensures that your data is consistent and interoperable across different services and tools.
Some key semantic conventions include:
- http.method and http.status_code for HTTP requests
- db.system and db.statement for database operations
- messaging.system and messaging.destination for messaging operations
These conventions make it easier to build queries and visualizations that work consistently across your entire system.
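For instance, when manually annotating a span for an outgoing HTTP call, you can set the standard attribute keys directly. A minimal sketch; the operation name and URL are placeholders:

const { trace } = require('@opentelemetry/api');

function recordOutboundCall() {
  const tracer = trace.getTracer('my-service');
  return tracer.startActiveSpan('GET /orders', (span) => {
    // Use semantic convention keys rather than ad-hoc names like 'method' or 'code'
    span.setAttribute('http.method', 'GET');
    span.setAttribute('http.url', 'https://api.example.com/orders');
    span.setAttribute('http.status_code', 200);
    span.end();
  });
}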
Apply descriptive and consistent naming
When naming custom attributes and resources, prioritize descriptive names that clearly communicate their purpose. Avoid using unfamiliar abbreviations or acronyms that might confuse future readers of your data.
Establish a consistent style for:
- Capitalization (e.g., snake_case)
- Formatting (e.g., using prefixes like app. for application-specific attributes)
- Punctuation (e.g., using dots or underscores as separators)
Consistency makes your data easier to query and understand, reducing the cognitive load when analyzing telemetry data.
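As an illustrative sketch, a team that settles on snake_case keys with an app. prefix might record custom attributes like this (the attribute names below are hypothetical, not standard conventions):

const { trace, context } = require('@opentelemetry/api');

// Consistent style: app. prefix, dot-separated namespaces, snake_case keys
function annotateCheckoutSpan(cartSize, paymentMethod) {
  const span = trace.getSpan(context.active());
  if (span) {
    span.setAttribute('app.checkout.cart_size', cartSize);
    span.setAttribute('app.checkout.payment_method', paymentMethod);
  }
}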
Include only relevant attributes
When adding attributes to spans, only include information that's relevant to the operation the span represents. For example, if you're tracing an HTTP request, include attributes like the request method, URL, and response status code, but not unrelated information like system memory usage.
As a general rule, avoid:
- Putting metrics or logs as attributes in spans.
- Including redundant attributes that specify the same information.
- Adding high-cardinality attributes to every span (e.g., user IDs on spans that don't need them).
Remember that each attribute increases the size of your spans, which can impact storage costs and query performance.
Implement effective context propagation
Context propagation is essential for maintaining trace continuity across service boundaries. Unless you have unique requirements, use the W3C Trace Context standard, which is the default propagation format in most OpenTelemetry SDKs.
For HTTP and gRPC communication, auto-instrumentation typically handles context propagation automatically. For custom communication channels, you'll need to manually propagate context.
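For example, with a custom message queue you can inject the active context into the outgoing message and extract it again on the consumer side. The queue and message shapes below are hypothetical; only the propagation calls come from the OpenTelemetry API:

const { context, propagation } = require('@opentelemetry/api');

// Producer side: inject the current trace context into a carrier object
// (here, hypothetical message "headers" for a custom queue)
function publishMessage(queue, payload) {
  const headers = {};
  propagation.inject(context.active(), headers);
  queue.send({ payload, headers });
}

// Consumer side: extract the context and run the handler within it
function handleMessage(message, handler) {
  const extractedContext = propagation.extract(context.active(), message.headers);
  return context.with(extractedContext, () => handler(message.payload));
}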
OpenTelemetry's baggage feature is particularly valuable for propagating contextual information across service boundaries. It allows you to pass key-value pairs from one service to another, ensuring that important context is available throughout a transaction.
For example, if the first service in a chain knows the customer ID, you can add it to baggage to make it available to downstream services:
const { context, propagation } = require('@opentelemetry/api');
function enrichContextWithCustomerId(customerId) {
  // Create baggage containing the customer ID
  const customerBaggage = propagation.createBaggage({
    'customer.id': { value: customerId }
  });
  // Attach the baggage to the active context and run the call within it
  const ctxWithBaggage = propagation.setBaggage(context.active(), customerBaggage);
  return context.with(ctxWithBaggage, () => {
    // Make downstream service call - the baggage will be propagated
    return callDownstreamService();
  });
}
This ensures that all spans in the trace can access the customer ID, even if they don't directly have that information available.
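On the receiving side, a downstream service can read the propagated value back out of the active context, for example when enriching its own spans. A minimal sketch:

const { context, propagation } = require('@opentelemetry/api');

function getCustomerIdFromBaggage() {
  // Read the propagated baggage from the active context, if present
  const activeBaggage = propagation.getBaggage(context.active());
  const entry = activeBaggage && activeBaggage.getEntry('customer.id');
  return entry ? entry.value : undefined;
}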
3. Deploy and configure collectors effectively
The OpenTelemetry Collector serves as a central hub for processing and routing telemetry data. Using collectors effectively can significantly improve the reliability, security, and flexibility of your observability pipeline.
Deploy collectors in all production environments
While it might be tempting to send telemetry data directly from your applications to your observability backend in simple environments, using a collector provides numerous benefits that make it worthwhile even in smaller deployments:
- Buffer data during backend outages
- Centralize and manage authentication credentials
- Preprocess data before sending it to backends
- Route data to multiple destinations
- Reduce the load on your applications by offloading telemetry processing
Choose the right deployment pattern
There are several common deployment patterns for collectors, each with its own advantages:
Agent pattern: Deploy a collector alongside each application instance to minimize network traffic and provide isolation between applications.
Gateway pattern: Deploy centralized collectors that receive data from multiple applications and forward it to backends. This simplifies configuration management but creates potential single points of failure.
Hierarchical pattern: Combine agent and gateway collectors, with agents handling local processing and gateways handling aggregation and routing.
For most production environments, a hierarchical approach provides the best balance of reliability and manageability.
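As a rough sketch of the hierarchical pattern, an agent collector running next to the application might simply batch data and forward it to a gateway collector (the otel-gateway hostname is a placeholder for your gateway's address):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    # Forward everything to the central gateway collector
    endpoint: otel-gateway:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]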
Buffer, batch, and process telemetry
Configuring appropriate buffering and batching in your collectors can significantly improve performance and reliability:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
exporters:
  otlp:
    endpoint: observability-backend:4317
    tls:
      insecure: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
This configuration batches telemetry data into larger chunks before sending it to the backend, reducing network overhead and improving throughput. It also includes a memory limiter to prevent the collector from using too much memory during traffic spikes.
Pre-process sensitive data
One of the most valuable capabilities of collectors is their ability to process telemetry data before it leaves your environment. This is particularly important for handling sensitive data like personally identifiable information (PII) or protected health information (PHI).
You can configure processors to redact, hash, or mask sensitive data before it's sent to your backend:
processors:
  redaction:
    # Keep all attribute keys, but mask attribute values that match the blocked patterns
    allow_all_keys: true
    blocked_values:
      - "(?i)(password|passwd|pwd)\\s*=\\s*[^\\s,;]+"
This configuration uses the redaction processor to mask attribute values that look like embedded credentials (for example, password=... fragments in db.statement), ensuring that sensitive values aren't stored in your observability platform.
4. Implement smart sampling strategies
As your system grows, the volume of telemetry data can become overwhelming. Sampling allows you to reduce the volume of data while still maintaining visibility into your system's behavior.
The ideal sampling strategy balances the need for complete data with the costs of collecting, transmitting, and storing that data. While collecting everything might seem ideal, it can quickly become prohibitively expensive and even impact application performance.
OpenTelemetry supports several sampling approaches:
Head sampling: Decisions are made at the beginning of a trace, before most spans are created. This is simple but can miss important data.
Tail sampling: Decisions are made after all spans in a trace are collected, allowing for more intelligent sampling based on the complete trace. However, this requires buffering all spans until the trace is complete.
Probabilistic sampling: Randomly samples a percentage of traces. Simple to implement but doesn't distinguish between normal and error cases.
Rate limiting sampling: Limits the number of traces per time period. Useful for controlling data volume during traffic spikes.
For most production systems, a combination of approaches works best. For example, you might use probabilistic sampling for normal traffic and ensure that all error traces are kept:
const { ParentBasedSampler, TraceIdRatioBasedSampler, SamplingDecision } = require('@opentelemetry/sdk-trace-base');

// Sample 10% of normal traces
const ratioSampler = new TraceIdRatioBasedSampler(0.1);

// Custom sampler: keep all error traces, otherwise fall back to 10% sampling
const errorAwareSampler = {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    // Always sample if there's an error attribute
    if (attributes && attributes.error === true) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Fall back to probabilistic sampling for normal traces
    return ratioSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  },
  toString() {
    return 'ErrorAwareSampler';
  }
};

// Use parent-based sampling to maintain trace consistency
const sampler = new ParentBasedSampler({
  root: errorAwareSampler
});
When implementing sampling, it's important to ensure that your metrics remain accurate. One approach is to generate metrics before sampling occurs, ensuring that your metrics reflect all operations even if some traces are discarded.
The OpenTelemetry Collector's count connector can help with this by counting spans on a separate, unsampled pipeline before the sampler runs:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
connectors:
  count:
processors:
  probabilistic_sampler:
    sampling_percentage: 10
exporters:
  otlp:
    endpoint: observability-backend:4317
service:
  pipelines:
    traces/count:
      receivers: [otlp]
      exporters: [count]
    traces/sampled:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
    metrics:
      receivers: [count]
      exporters: [otlp]
Here the traces/count pipeline counts every span the Collector receives, while the traces/sampled pipeline applies probabilistic sampling before exporting, so the generated span-count metrics stay accurate even when some traces are dropped.
5. Correlate telemetry data types
One of the most powerful aspects of OpenTelemetry is its ability to generate and correlate different types of telemetry data: traces, metrics, and logs. Effectively correlating these data types provides a comprehensive view of your system's behavior.
To enable correlation, ensure that your logs include trace and span IDs when available. This allows you to navigate from a log entry to the corresponding trace, providing context for log messages.
const winston = require('winston');
const { trace, context } = require('@opentelemetry/api');
// Custom format that injects the active trace context into each log record
const traceContextFormat = winston.format((info) => {
  const span = trace.getSpan(context.active());
  if (span) {
    const spanContext = span.spanContext();
    info.trace_id = spanContext.traceId;
    info.span_id = spanContext.spanId;
  }
  return info;
});
// Configure your logger to include trace context
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    traceContextFormat(),      // add trace/span IDs before serializing
    winston.format.json()      // json() must come last so the IDs are included in the output
  ),
  transports: [
    new winston.transports.Console()
  ]
});
// Now your logs will include trace and span IDs
logger.info('Processing order', { orderId: '12345' });
Similarly, adding service and instance information to both logs and metrics enables correlation across all telemetry types.
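One way to do this is to attach the same resource attributes to every signal by configuring them on the SDK. A minimal sketch for a Node.js service; the service name, namespace, and instance ID values are placeholders:

const { Resource } = require('@opentelemetry/resources');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');

// Attach the same service identity to all telemetry emitted by this process
const resource = new Resource({
  'service.name': 'my-service',
  'service.namespace': 'shop',
  'service.instance.id': process.env.HOSTNAME || 'local-dev'
});

const provider = new NodeTracerProvider({ resource });
provider.register();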
Implement span metrics for RED analysis
Rate, Error, and Duration (RED) metrics provide a high-level overview of service health. OpenTelemetry can automatically generate these metrics from your trace data using the span metrics processor:
processors:
  spanmetrics:
    metrics_exporter: prometheus
    dimensions:
      - name: service.name
      - name: http.method
      - name: http.status_code
    dimensions_cache_size: 1000
    aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
This configuration generates request count, error count, and duration metrics for each combination of service, HTTP method, and status code, allowing you to monitor the overall health of your services without querying individual traces.
Visualize system dependencies
Service graphs provide a visual representation of the relationships between your services. The service graph processor can automatically generate these graphs from your trace data:
processors:
  servicegraph:
    metrics_exporter: prometheus
    store:
      ttl: 2s
      max_items: 1000
This configuration generates metrics that describe the connections between services, which can be visualized in tools like Grafana to create service dependency maps.
6. Performance optimization
OpenTelemetry is designed to be lightweight, but without proper optimization, it can still impact your application's performance. Careful configuration can minimize this impact while still providing valuable telemetry data.
Below are some strategies to explore:
Batch and compress telemetry
Batching multiple telemetry items into a single transmission reduces network overhead and improves throughput. Both the OpenTelemetry SDK and Collector support batching:
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const exporter = new OTLPTraceExporter({
url: 'http://collector:4318/v1/traces'
});
// Configure batch processing for better performance
const batchProcessor = new BatchSpanProcessor(exporter, {
maxQueueSize: 2048, // Maximum queue size
maxExportBatchSize: 512, // Number of spans to send at once
scheduledDelayMillis: 5000, // How frequently to send batches
exportTimeoutMillis: 30000 // How long to wait for the export to complete
});
// Add the processor to your provider
provider.addSpanProcessor(batchProcessor);
Similarly, enabling compression can significantly reduce the amount of data transmitted over the network:
const exporter = new OTLPTraceExporter({
url: 'http://collector:4318/v1/traces',
compression: 'gzip' // Enable compression
});
These optimizations are particularly important for high-throughput applications that generate large volumes of telemetry data.
Monitor instrumentation impact
Even with optimization, it's important to monitor the impact of your instrumentation on application performance. This includes:
- CPU usage by instrumentation components
- Memory usage, particularly for buffered telemetry data
- Network bandwidth consumed by telemetry transmission
- Latency added to operations by instrumentation
If you observe significant performance impact, consider adjusting your sampling rate or reducing the amount of telemetry data collected.
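For example, you can make the sampling ratio configurable so it can be dialed down when overhead becomes a concern. A minimal sketch, using a hypothetical OTEL_SAMPLING_RATIO environment variable:

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

// Read the ratio from an environment variable so it can be tuned without a redeploy
const ratio = parseFloat(process.env.OTEL_SAMPLING_RATIO || '0.1');

const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(ratio)
  })
});
provider.register();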
Implement circuit breakers
To prevent telemetry collection from impacting application availability during outages or overload situations, implement circuit breakers in your telemetry pipeline:
const { Resource } = require('@opentelemetry/resources');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const exporter = new OTLPTraceExporter({
url: 'http://collector:4318/v1/traces',
timeoutMillis: 10000 // 10 second timeout for export operations
});
// Configure the processor with limits
const batchProcessor = new BatchSpanProcessor(exporter, {
maxQueueSize: 2048, // Limit memory usage
maxExportBatchSize: 512, // Reasonable batch size
exportTimeoutMillis: 30000 // Timeout for exports
});
// Create the provider with the processor
const provider = new NodeTracerProvider({
resource: new Resource({ 'service.name': 'my-service' })
});
provider.addSpanProcessor(batchProcessor);
provider.register();
This configuration sets timeouts for export operations and limits the amount of memory used for buffering spans, ensuring that telemetry collection doesn't consume excessive resources even if the backend is unavailable.
Final thoughts
Implementing OpenTelemetry effectively requires careful planning and ongoing refinement. By following these six best practices, you can build a robust observability foundation that provides valuable insights into your system's behavior without excessive complexity or cost.
Remember that observability is a journey, not a destination. Start with the basics, learn from your experiences, and continuously improve your approach based on the evolving needs of your system and teams.
Whether you're just starting with OpenTelemetry or looking to optimize an existing implementation, these best practices will help you maximize the value of your telemetry data and build a more observable, reliable, and maintainable system.