An Introduction to the OpenTelemetry Protocol (OTLP)
Modern software systems are increasingly complex, distributed, and built on microservices architectures, making effective observability essential for maintaining application performance and reliability.
OpenTelemetry addresses this challenge by offering a comprehensive toolkit and standards for collecting telemetry data—logs, metrics, and traces—across an application's entire infrastructure.
At the heart of this initiative is the OpenTelemetry Protocol (OTLP), a standardized format for transmitting telemetry data between different components within the OpenTelemetry ecosystem.
OTLP enables you to capture and send data from instrumented applications, route it through OpenTelemetry Collectors, and forward it to various observability backends for analysis.
In this article, we will explore the key features of OTLP, how it works, and how to implement it in your applications to gain actionable insights from your telemetry data.
Prerequisites
Before proceeding with this article, ensure that you're familiar with the basic OpenTelemetry concepts.
What is the OpenTelemetry Protocol?
OTLP is a data exchange format used to encode, transmit, and deliver telemetry data such as traces, metrics, and logs between the components of the OpenTelemetry ecosystem. These components include instrumented applications, infrastructure, the OTel Collector, and various observability backends.
It is a crucial part of the OpenTelemetry project, designed to ensure that telemetry data, regardless of its source or vendor, can be processed in a consistent manner.
Key features of OTLP
- Standardization: It provides a unified format for traces, metrics, and logs across various programming languages and platforms, promoting interoperability and vendor neutrality.
- Flexibility: OTLP leverages gRPC with Protocol Buffers for efficient, real-time communication, but it also supports `http/protobuf` or `http/json` for environments where gRPC may not be ideal.
- Efficiency: OTLP is designed for efficient and reliable data transmission, supporting high-throughput and low-latency scenarios.
- Extensibility: The Protocol Buffers-based encoding allows for future additions and extensions without breaking backward compatibility.
- Semantic conventions: It defines a set of semantic conventions for common attributes and data types, ensuring consistency and meaning across different sources of telemetry data.
Exploring the OTLP specification
The OpenTelemetry Protocol (OTLP) uses a request-based communication model, where telemetry data is transmitted from a client (sender) to a server (receiver) through individual requests.
Here's how the process typically works:
1. Data collection: An application instrumented with OpenTelemetry collects telemetry data (traces, metrics, and logs) and packages it into an OTLP-compliant request.
2. Data transmission: The request is sent to a server, often an OpenTelemetry Collector, or directly to an observability backend.
3. Acknowledgment: The server processes the data and responds with an acknowledgment of successful receipt. If there's an issue, an error message is returned instead.
OTLP supports two primary transport mechanisms for this request/response interaction:
- gRPC: A high-performance, bidirectional streaming protocol.
- HTTP/1.1: A more traditional transport option for environments where gRPC may not be ideal.
Both transport mechanisms use Protocol Buffers (protobuf) to define the structure of the telemetry data payload. Additionally, servers are required to support Gzip compression for payloads, although uncompressed payloads are also accepted.
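To make the request/response flow concrete, here is a minimal sketch of an OTLP/HTTP request sent with Go's standard net/http package. It assumes a Collector listening on localhost:4318; the span IDs, timestamps, and service name are illustrative placeholders.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    // A single span encoded with OTLP's JSON mapping: camelCase field
    // names, hex-encoded IDs, and nanosecond timestamps as strings.
    payload := []byte(`{
      "resourceSpans": [{
        "resource": {
          "attributes": [{
            "key": "service.name",
            "value": { "stringValue": "checkout" }
          }]
        },
        "scopeSpans": [{
          "spans": [{
            "traceId": "5b8efff798038103d269b633813fc60c",
            "spanId": "eee19b7ec3c1b174",
            "name": "charge-card",
            "kind": 2,
            "startTimeUnixNano": "1717171717000000000",
            "endTimeUnixNano": "1717171717500000000"
          }]
        }]
      }]
    }`)

    // OTLP/HTTP uses fixed, signal-specific paths: /v1/traces,
    // /v1/metrics, and /v1/logs.
    resp, err := http.Post("http://localhost:4318/v1/traces",
        "application/json", bytes.NewReader(payload))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // A 2xx response is the server's acknowledgment of receipt.
    fmt.Println("status:", resp.Status)
}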
Choosing between gRPC and HTTP
When it comes to exporting telemetry data in OpenTelemetry, most SDKs provide options for OTLP transmission over either `grpc` or `http/protobuf`, with some exporters (like JavaScript) also supporting `http/json`.
If you're trying to decide between gRPC and HTTP, consider these points:
- Ensure that the protocol is supported in the medium of its intended usage. For example, `grpc` is not supported for exporting from the browser, so you must use `http/json` or `http/protobuf` there.
- The default protocol used by SDKs differs per language, and it is sometimes easier to stay with the defaults due to support levels. For instance, the Go SDK defaults to gRPC, while the Node.js SDK defaults to `http/protobuf`.
- gRPC often introduces larger dependencies into your codebase.
- gRPC relies on HTTP/2 for transport, which may have varying support across your network infrastructure (think firewalls, proxies, and load balancers).
- gRPC is generally more efficient and supports streaming, which comes in handy when dealing with larger payloads and higher-throughput scenarios.

Note that the default port for OTLP over gRPC is 4317, while OTLP over HTTP uses port 4318; both options are sketched below.
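For illustration, here's what constructing either exporter looks like with the official Go OTLP trace exporter packages, `otlptracegrpc` and `otlptracehttp` (a real application would pick just one):

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

func main() {
    ctx := context.Background()

    // OTLP over gRPC, targeting the Collector's default gRPC port.
    grpcExp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(), // skip TLS for local development
    )
    if err != nil {
        log.Fatal(err)
    }
    defer grpcExp.Shutdown(ctx)

    // OTLP over HTTP with protobuf payloads, on the default HTTP port.
    httpExp, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer httpExp.Shutdown(ctx)
}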
The OTLP data model
At its core, OTLP establishes a well-defined and structured data model for representing telemetry data, facilitating its consistent and efficient handling throughout the observability pipeline. This model encompasses the three major types of telemetry: traces, metrics, and logs.
Traces
In OTLP, traces are defined wholly by their spans, each of which represents a specific operation within the overall transaction. The entire protocol buffer encoding of a trace can be found here, but this is a simplified overview:
// https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto

message TracesData {
  repeated ResourceSpans resource_spans = 1;
}

message ResourceSpans {
  opentelemetry.proto.resource.v1.Resource resource = 1;
  repeated ScopeSpans scope_spans = 2;
}

message ScopeSpans {
  opentelemetry.proto.common.v1.InstrumentationScope scope = 1;
  repeated Span spans = 2;
}

message Span {
  bytes trace_id = 1;
  bytes span_id = 2;
  string trace_state = 3;
  bytes parent_span_id = 4;
  fixed32 flags = 16;
  string name = 5;
  enum SpanKind {}
  SpanKind kind = 6;
  fixed64 start_time_unix_nano = 7;
  fixed64 end_time_unix_nano = 8;
  repeated opentelemetry.proto.common.v1.KeyValue attributes = 9;
  uint32 dropped_attributes_count = 10;
  message Event {}
  repeated Event events = 11;
  uint32 dropped_events_count = 12;
  message Link {}
  repeated Link links = 13;
  uint32 dropped_links_count = 14;
  Status status = 15;
}

message Status {}

enum SpanFlags {}
This protocol buffer structure organizes telemetry data hierarchically like this:
TracesData -> ResourceSpans -> ScopeSpans -> Span
Here's an explanation of some of the major components of this structure:
- `TracesData`: A collection of `ResourceSpans`, representing telemetry data associated with specific resources.
- `ResourceSpans`: Contains information about the `Resource` itself and multiple `ScopeSpans`, which group spans based on their instrumentation scope.
- `ScopeSpans`: Groups multiple spans that share the same `InstrumentationScope` (the library or component responsible for generating the span).
- `Span`: The core building block of a trace, representing a single operation or activity. It:
  - Includes identifiers like `trace_id` and `span_id` for linking spans within a trace.
  - Captures timing information with `start_time_unix_nano` and `end_time_unix_nano`.
  - Contains `attributes` (key/value pairs) providing additional context.
  - Can include events (`Event` objects) representing specific occurrences during the span's lifetime.
  - Can have links (`Link` objects) to other spans, potentially in different traces.
  - Carries a `Status` object indicating the success or failure of the operation.
- `Status`: Represents the outcome of a span's execution through a code (`UNSET`, `OK`, or `ERROR`) and an optional message providing details.
- `SpanFlags`: Bit flags providing additional information about a span's context and behavior.
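The mapping from SDK calls to these fields is fairly direct. Here's a small Go sketch showing where each piece comes from (the tracer name and attributes are illustrative):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func chargeCard(ctx context.Context) {
    // The tracer name becomes the InstrumentationScope in ScopeSpans.
    tracer := otel.Tracer("payment-service")

    // Starting a span sets name and start_time_unix_nano; the SDK
    // generates trace_id and span_id (and parent_span_id if ctx
    // already carries an active span).
    ctx, span := tracer.Start(ctx, "charge-card")
    defer span.End() // sets end_time_unix_nano
    _ = ctx          // would be passed downstream so child spans nest

    // attributes: key/value pairs providing additional context.
    span.SetAttributes(attribute.String("payment.method", "card"))

    // events: timestamped occurrences during the span's lifetime.
    span.AddEvent("card-validated")

    // status: the outcome of the operation (UNSET, OK, or ERROR).
    span.SetStatus(codes.Ok, "")
}

func main() {
    // With no TracerProvider configured this uses a no-op tracer;
    // wiring a real provider is shown later in the article.
    chargeCard(context.Background())
}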
Metrics
Metrics data is also organized hierarchically, associating measurements with resources and instrumentation scopes. Each metric has a specific data type, and the data points within it carry the actual values along with timestamps and additional context.
Here's a simplified view of the full metrics definition:
message MetricsData {
  repeated ResourceMetrics resource_metrics = 1;
}

message ResourceMetrics {
  reserved 1000;
  opentelemetry.proto.resource.v1.Resource resource = 1;
  repeated ScopeMetrics scope_metrics = 2;
  string schema_url = 3;
}

message ScopeMetrics {
  opentelemetry.proto.common.v1.InstrumentationScope scope = 1;
  repeated Metric metrics = 2;
  string schema_url = 3;
}

message Metric {
  reserved 4, 6, 8;
  string name = 1;
  string description = 2;
  string unit = 3;
  oneof data {
    Gauge gauge = 5;
    Sum sum = 7;
    Histogram histogram = 9;
    ExponentialHistogram exponential_histogram = 10;
    Summary summary = 11;
  }
  repeated opentelemetry.proto.common.v1.KeyValue metadata = 12;
}

message Gauge {}
message Sum {}
message Histogram {}
message ExponentialHistogram {}
message Summary {}

enum AggregationTemporality {}
enum DataPointFlags {}

message NumberDataPoint {}
message HistogramDataPoint {}
message ExponentialHistogramDataPoint {}
message SummaryDataPoint {}
message Exemplar {}
The most important fields are described as follows:
- `MetricsData`: A collection of `ResourceMetrics`, which ties metrics to specific resources.
- `ResourceMetrics`: Contains resource details and multiple `ScopeMetrics` for grouping metrics based on instrumentation scope (the library or framework used).
- `ScopeMetrics`: Groups metrics sharing the same instrumentation scope and contains multiple `Metric` objects.
- `Metric`: The fundamental unit, representing a specific measurement. It has:
  - A `name`, `description`, and `unit` for identification and context,
  - One of five data types (explained below),
  - Optional `metadata` (key-value pairs) for additional information.
Types of metrics
- Gauge: Represents a value at a specific point in time (e.g., current CPU usage or the number of items in a queue). It is a number that can go up or down.
- Sum: Analogous to the `Counter` metric type in Prometheus, this represents a running total that increases over time (such as the number of requests or errors).
- Histogram: Represents a distribution of measurements by sampling observations and counting them in configurable buckets.
- Exponential histogram: A special type of histogram that uses exponentially sized buckets to efficiently represent distributions whose values span a wide range.
- Summary: Summaries are included in OpenTelemetry for legacy support only; OpenTelemetry APIs and SDKs do not produce them.
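Here's a sketch of producing the first three types with the Go metric API (instrument names and values are illustrative; exponential histograms use the same histogram instrument and are selected through the SDK's aggregation configuration):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func record(ctx context.Context) error {
    meter := otel.Meter("checkout-service")

    // Sum: a monotonically increasing counter.
    requests, err := meter.Int64Counter("http.requests",
        metric.WithDescription("Total HTTP requests served"),
        metric.WithUnit("{request}"),
    )
    if err != nil {
        return err
    }
    requests.Add(ctx, 1)

    // Histogram: observations are bucketed into a distribution.
    latency, err := meter.Float64Histogram("http.latency",
        metric.WithUnit("ms"),
    )
    if err != nil {
        return err
    }
    latency.Record(ctx, 12.7)

    // Gauge: a point-in-time value, reported through a callback that
    // runs on every collection cycle.
    _, err = meter.Int64ObservableGauge("queue.length",
        metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
            o.Observe(42) // e.g., current queue depth
            return nil
        }),
    )
    return err
}

func main() {
    // With no MeterProvider configured these calls are no-ops; a real
    // provider would be wired up much like the trace examples.
    _ = record(context.Background())
}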
Logs
OTLP's data model for logs offers a standardized way to represent log data from various sources including application logs, machine-generated events, system logs, and more.
This model allows for unambiguous mapping from existing log formats, ensuring compatibility and ease of integration. It also enables reverse mapping back to specific log formats, provided those formats support equivalent features.
You can check out the protocol buffer representation or read the full design document, but here's a concise summary of the model:
message LogsData {
  repeated ResourceLogs resource_logs = 1;
}

message ResourceLogs {
  reserved 1000;
  opentelemetry.proto.resource.v1.Resource resource = 1;
  repeated ScopeLogs scope_logs = 2;
  string schema_url = 3;
}

message ScopeLogs {
  opentelemetry.proto.common.v1.InstrumentationScope scope = 1;
  repeated LogRecord log_records = 2;
  string schema_url = 3;
}

enum SeverityNumber {}
enum LogRecordFlags {}

message LogRecord {
  reserved 4;
  fixed64 time_unix_nano = 1;
  fixed64 observed_time_unix_nano = 11;
  SeverityNumber severity_number = 2;
  string severity_text = 3;
  opentelemetry.proto.common.v1.AnyValue body = 5;
  repeated opentelemetry.proto.common.v1.KeyValue attributes = 6;
  uint32 dropped_attributes_count = 7;
  fixed32 flags = 8;
  bytes trace_id = 9;
  bytes span_id = 10;
}
Besides `LogsData`, `ResourceLogs`, and `ScopeLogs`, which organize log data hierarchically just like their counterparts in the trace and metrics data models, the main thing to pay attention to is the `LogRecord` object, the fundamental unit representing a single log entry.
It consists of the following attributes:
- Timestamps: `time_unix_nano` (when the log was created at the source) and `observed_time_unix_nano` (when the log was ingested by the Collector).
- Severity: The `severity_number` is a numeric representation of the log severity, ranging from `TRACE` (least severe) to `FATAL` (most severe), while `severity_text` is the human-readable description of the log severity.
- Trace context fields: The `trace_id` and `span_id` fields allow for optionally linking the log to a specific trace and span for correlation.
- Body: The `body` is the actual log message content. It uses the `AnyValue` type to accommodate various data types.
- Attributes: Key-value pairs providing additional context about the event.
Implementing OTLP instrumentation
Instrumenting your applications to generate OTLP data, be it logs, metrics, or traces, is straightforward: choose the appropriate SDK for the language you're working in, then configure the OpenTelemetry Collector to receive, process, and export the data.
1. Instrumenting your application
Begin by selecting the appropriate OpenTelemetry SDK for your programming language, initializing it, and instrumenting your code. A list of supported languages and their respective SDKs can be found on the OpenTelemetry website.
Next, ensure that your application is configured to export telemetry data to an OTLP endpoint. While some observability backends offer direct OTLP endpoints via gRPC or HTTP, utilizing the OpenTelemetry Collector as an intermediary is recommended for its flexibility and advanced processing capabilities.
You might also need to adjust environment variables to point to the correct OTLP endpoint URL. By default, OTel SDKs use `http://localhost:4317` (gRPC) or `http://localhost:4318` (HTTP), but these can be customized via environment variables such as `OTEL_EXPORTER_OTLP_ENDPOINT` to match your setup.
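Putting it together, here's a minimal sketch of wiring an OTLP trace exporter into the Go SDK (assuming the HTTP exporter; it honors the standard `OTEL_EXPORTER_OTLP_*` environment variables, so the endpoint can be changed without touching code):

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Defaults to localhost:4318 unless overridden via options or
    // OTEL_EXPORTER_OTLP_ENDPOINT.
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithInsecure(), // plain HTTP for local development
    )
    if err != nil {
        log.Fatal(err)
    }

    // Batch spans before export to reduce request overhead.
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    defer tp.Shutdown(ctx) // flush buffered spans on exit
    otel.SetTracerProvider(tp)

    // From here, otel.Tracer(...) produces spans that are exported to
    // the configured OTLP endpoint.
}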
2. Configuring the OpenTelemetry Collector
Once your application is producing OTLP-formatted telemetry data, configure the OpenTelemetry Collector to receive, process, and export it to your chosen observability backend. For example, here's a configuration snippet demonstrating how to receive OTLP trace data over HTTP and export it to Jaeger, which accepts OTLP natively:
receivers:
  otlp:
    protocols:
      http:
        endpoint: localhost:4318

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
The Collector's powerful processing capabilities allow you to transform OTLP data before exporting. You can filter, enrich, or even anonymize data to comply with privacy regulations or optimize storage.
3. Converting existing instrumentation to OTLP
If your application already uses other instrumentation libraries or formats (e.g., Prometheus for metrics), you can still leverage OTLP and the OpenTelemetry ecosystem. The OpenTelemetry Collector supports a wide range of receivers that can ingest data in various formats and convert it to OTLP.
For example, to convert Prometheus metrics to OTLP you can use the following Collector configuration:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'example'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:9090']

processors:
  batch:

exporters:
  otlp:
    endpoint: <OTLP-endpoint>

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
This configuration receives Prometheus metrics, processes them in batches, and exports them as OTLP data. Such setups are invaluable for legacy applications where you can't modify the original instrumentation.
4. Sending OTLP data to a backend
After configuring your application and Collector, the final step is to ensure OTLP data reaches your observability backend. Depending on the backend, you'll configure the Collector to export telemetry data in the required format.
For example, Better Stack supports ingesting OTLP data directly, so you only need to configure the Collector's exporter to match the provided endpoint. This way, the Collector can forward OTLP data efficiently, helping you monitor and visualize application performance in real time.
Final thoughts
By following these steps, you can effectively implement OTLP instrumentation, harnessing its benefits for improved observability and gaining deeper insights into your system's behavior and performance.
Thanks for reading!