A Complete Introductory Guide to OpenTelemetry
OpenTelemetry is an open-source observability framework that addresses the limitations of existing telemetry agents by providing a unified, vendor-neutral approach for collecting and exporting telemetry data.
At its core, OpenTelemetry facilitates the instrumentation of applications in a vendor-agnostic manner, allowing for the subsequent analysis of collected data through backend tools like Prometheus, Jaeger, Zipkin, and others, according to your preference.
The broad scope and open-source nature of OpenTelemetry can sometimes create confusion and skepticism. This article seeks to clarify OpenTelemetry's role by explaining its key features, practical benefits, and how it can significantly improve your observability strategy.
Let's get started!
What is telemetry data?
In simple terms, telemetry data is the information gathered from various deployments of software applications, systems, and devices for monitoring and analysis purposes.
It is an essential aspect of understanding how systems operate, identifying performance issues, troubleshooting problems, and making informed decisions about optimization and resource allocation.
In the era of cloud-native and microservice architectures, where applications are complex and distributed, collecting and analyzing telemetry data has become even more critical for achieving full observability into your systems.
The most common types of telemetry data include:
Metrics: Numerical measurements that quantify system health or performance, such as CPU usage, memory consumption, request latency, and error rates.
Traces: Record the path taken by a request as it moves through a distributed system, highlighting dependencies and bottlenecks.
Logs: Record significant actions, errors, and other relevant information that help with understanding system behavior and troubleshooting issues.
Events: Structured records containing contextual information about what it took to complete a unit of work in an application (commonly a single HTTP transaction).
Profiles: Provide insights into resource usage (CPU, memory, etc.) within the context of code execution.
What is OpenTelemetry?
OpenTelemetry (often abbreviated as OTel) is an open-source observability framework and toolkit for generating, collecting, and exporting telemetry data from your software applications and infrastructure.
It emerged in 2019 through a merger of OpenTracing and OpenCensus. Each project had unique strengths but also limitations that hindered broader adoption. By combining their best features under the Cloud Native Computing Foundation (CNCF), OpenTelemetry provides a unified, standardized framework for collecting all kinds of observability signals, addressing the shortcomings of its predecessors.
At the time of writing, the CNCF development statistics show that OpenTelemetry is the second most active CNCF project, surpassed only by Kubernetes.
What problem is OpenTelemetry aimed at solving?
OpenTelemetry aims to address the fragmentation and complexity of how telemetry data is collected and processed in distributed systems. It seeks to replace the myriad of proprietary agents and formats with a unified, vendor-neutral, and open-source standard for instrumenting applications, collecting signals (traces, metrics, and logs), and exporting them to various analysis backends.
Note that OpenTelemetry focuses solely on collecting and delivering telemetry data, leaving the generation of actionable insights to dedicated analysis tools and platforms.
Components of OpenTelemetry
The OpenTelemetry framework comprises several components that work together to capture and process telemetry data. They are outlined below:
1. API & SDK specification
The OpenTelemetry specification defines the standards, requirements, and expectations for implementing OpenTelemetry across different programming languages, frameworks, and environments. It is divided into three major sections:
API specification: This defines the data types and programming interfaces for creating and manipulating telemetry data in different languages to ensure consistency in how such data is generated and handled across various systems.
SDK specification: This defines the behavior and requirements for the language-specific implementation of the OpenTelemetry API. SDKs handle tasks like sampling, context propagation, processing, and exporting telemetry data. They also enable automatic instrumentation through integrations and agents, which reduces the need for manual coding to capture metrics and traces.
Data specification: This defines the OpenTelemetry Protocol (OTLP), a vendor-agnostic protocol for transmitting telemetry data between different components of the OpenTelemetry ecosystem. It specifies the supported telemetry signals' data formats, semantic conventions, and communication mechanisms to ensure consistency, making analyzing and correlating data from different sources easier.
2. Semantic conventions
OpenTelemetry semantic conventions are standardized guidelines that define the naming and structure of attributes used to describe telemetry data. These conventions provide meaning to data when producing, collecting, and consuming it.
Some key aspects of OTel semantic conventions include:
Attribute naming: It provides a set of well-defined names for span attributes, metrics, and other fields that represent common concepts, operations, and properties in different domains. For example, http.response.status_code represents the HTTP status code of a response, db.system denotes the database system being used, and exception.type indicates the type of exception thrown.
Telemetry schemas: It defines the structure of attributes, their data types, and allowed values. This ensures telemetry data generated by different components can be seamlessly combined and correlated. It also allows these schemas to evolve over time through versioning.
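To make the attribute naming idea concrete, here is a minimal sketch that maps a plain request summary onto semantic-convention names. The helper httpAttributes is hypothetical (it is not part of any OpenTelemetry package); only the attribute keys are taken from the conventions.

```javascript
// Hypothetical helper: describes an outgoing HTTP call using
// semantic-convention attribute names instead of ad hoc keys.
function httpAttributes({ method, url, statusCode }) {
  const parsed = new URL(url);
  return {
    'http.request.method': method.toUpperCase(),
    'url.scheme': parsed.protocol.replace(':', ''),
    'url.path': parsed.pathname,
    'server.address': parsed.hostname,
    'http.response.status_code': statusCode,
  };
}

const attrs = httpAttributes({
  method: 'get',
  url: 'https://icanhazdadjoke.com/',
  statusCode: 200,
});
console.log(attrs);
```

Because every service uses the same keys, a backend can group or filter on http.response.status_code without knowing which library produced the span.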
3. Collector
The OpenTelemetry Collector acts as an intermediary between your instrumented applications and the backend systems where you analyze and visualize your data. It is designed to be a standalone binary process (that can be run as a proxy or sidecar) that receives telemetry data in various formats, processes and filters it, and then sends it to one or more configured observability backends.
The Collector is composed of the following key components that come together to form an observability pipeline:
Receivers for ingesting telemetry data in different formats (OTLP, Prometheus, Jaeger, etc.) and from various sources.
Processors for processing the ingested telemetry data by filtering, aggregating, enriching or transforming it.
Exporters for delivering the processed data to one or more observability backends in whatever format you desire.
Connectors for bridging different pipelines within the Collector, enabling seamless data flow and transformation between them. They act as both an exporter for one pipeline and a receiver for another.
Extensions for providing additional functionality like health checks, profiling, and configuration management. These don't require direct access to telemetry data.
It is also separated into two main GitHub repositories:
opentelemetry-collector: This is the core project, which focuses on the fundamental processing logic of the Collector, specifically the handling and manipulation of OTLP data.
opentelemetry-collector-contrib: This project acts as a comprehensive repository of various integrations, including receivers for collecting telemetry data from different sources and exporters for sending data to diverse backends.
Due to the vast number of integrations, you are advised to create custom opentelemetry-collector-contrib builds that include only the specific components you need. This can be done through the OpenTelemetry Collector Builder tool.
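The receiver, processor, and exporter stages described in this section can be pictured with a toy model. This is only a sketch of the data flow: the real Collector is a separate binary that you configure rather than program, and every name below (createPipeline, the sample spans, the environment field) is hypothetical.

```javascript
// Toy model of a Collector pipeline: receivers -> processors -> exporters.
function createPipeline({ receivers, processors, exporters }) {
  return function run() {
    // 1. Receivers ingest telemetry from their sources.
    let batch = receivers.flatMap((receive) => receive());
    // 2. Processors run in order; each one returns a new batch.
    for (const applyProcessor of processors) {
      batch = applyProcessor(batch);
    }
    // 3. Every exporter is handed the same processed batch.
    for (const exportBatch of exporters) {
      exportBatch(batch);
    }
    return batch;
  };
}

const delivered = [];
const run = createPipeline({
  receivers: [
    () => [
      { name: 'GET /', durationMs: 544 },
      { name: 'GET /health', durationMs: 1 },
    ],
  ],
  processors: [
    // Filter: drop noisy health-check spans.
    (spans) => spans.filter((span) => span.name !== 'GET /health'),
    // Enrich: attach a deployment attribute to every span.
    (spans) => spans.map((span) => ({ ...span, environment: 'staging' })),
  ],
  exporters: [(spans) => delivered.push(...spans)],
});

run();
console.log(delivered);
```

The ordering matters: filtering before enriching means the enrichment step never spends work on data that is about to be dropped.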
4. Protocol (OTLP)
The OpenTelemetry Protocol (OTLP) is a vendor-neutral and open-source specification for how telemetry data are encoded, transported, and delivered between different components within the OpenTelemetry ecosystem.
It enables seamless communication between various parts of your observability stack, regardless of the specific tools or platforms you're using. This flexibility prevents vendor lock-in and allows you to choose the tools that best suit your needs.
Note that OpenTelemetry also supports ingesting data in other protocols (such as Zipkin, Prometheus, Jaeger, etc.) with the appropriate receiver, and you can convert data from one format to another to simplify integration with different backends.
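To give a feel for what OTLP carries, the sketch below builds the JSON form of a trace export request for a single span. The field names follow my understanding of the OTLP/JSON encoding (resourceSpans, scopeSpans, hex-encoded IDs, nanosecond timestamps as strings); treat it as an illustration and check the protocol specification before relying on it. The helper names are hypothetical.

```javascript
// Encode a JavaScript value as an OTLP AnyValue (subset only).
function toAnyValue(value) {
  if (typeof value === 'string') return { stringValue: value };
  if (typeof value === 'boolean') return { boolValue: value };
  // 64-bit integers are written as strings in the JSON encoding.
  if (Number.isInteger(value)) return { intValue: String(value) };
  if (typeof value === 'number') return { doubleValue: value };
  throw new TypeError(`Unsupported attribute value: ${value}`);
}

function toKeyValues(attributes) {
  return Object.entries(attributes).map(([key, value]) => ({
    key,
    value: toAnyValue(value),
  }));
}

// Spans are grouped by the resource (service) and the scope (tracer)
// that produced them.
function toOtlpTraceRequest({ serviceName, scopeName, span }) {
  return {
    resourceSpans: [
      {
        resource: { attributes: toKeyValues({ 'service.name': serviceName }) },
        scopeSpans: [
          {
            scope: { name: scopeName },
            spans: [
              {
                traceId: span.traceId,
                spanId: span.spanId,
                parentSpanId: span.parentSpanId,
                name: span.name,
                kind: 1, // SPAN_KIND_INTERNAL in the OTLP enum
                startTimeUnixNano: span.startTimeUnixNano,
                endTimeUnixNano: span.endTimeUnixNano,
                attributes: toKeyValues(span.attributes),
              },
            ],
          },
        ],
      },
    ],
  };
}

const request = toOtlpTraceRequest({
  serviceName: 'fastify-app',
  scopeName: 'fastify-app',
  span: {
    traceId: '94acf0a34595230b72acbd473ca78617',
    spanId: '88ec3f8d2304a32e',
    parentSpanId: '91888aca54a65286',
    name: 'calculate-fibonacci-number',
    startTimeUnixNano: '1715076859034000000',
    endTimeUnixNano: '1715076859034028720',
    attributes: { 'fibonacci.input': 10, 'fibonacci.result': 55 },
  },
});
console.log(JSON.stringify(request, null, 2));
```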
5. Open Agent Management Protocol (OpAMP)
The Open Agent Management Protocol is an emerging open standard designed to manage large fleets of data collection agents at scale. It was donated to the OpenTelemetry (OTel) project by Splunk in 2022 and is currently under active development within the OTel community.
OpAMP defines a network protocol for remote management of agents, including instances of the OpenTelemetry Collector, as well as vendor-specific agents that implement the OpAMP spec. This allows a centralized server (OpAMP control plane) to provide a "single pane of glass" view that monitors, configures, and updates a large fleet of agents across a distributed environment.
6. Transformation Language (OTTL)
The OpenTelemetry Transformation Language is a powerful, flexible language designed to transform telemetry data efficiently within the OpenTelemetry Collector. It provides a vendor-neutral way to filter, transform, and modify data before it is exported to various analysis backends.
It is still under active development as part of the opentelemetry-collector-contrib project, but it holds great potential for simplifying and standardizing the processing of telemetry data in observability pipelines.
7. Demo application
A microservice-based shopping site showcasing the capabilities of various OpenTelemetry features and language SDKs. It provides a practical example of how OTel can be used to instrument and observe a distributed system in real-world scenarios.
What programming languages are supported?
OpenTelemetry supports a wide range of programming languages, making it a truly universal observability framework. Here's a list of the officially supported language APIs and SDKs:
- Java
- JavaScript
- Python
- Go
- .NET
- C++
- Ruby
- PHP
- Erlang/Elixir
- Rust
- Swift
There are also community-supported SDKs and instrumentation libraries for other languages, which can be found in the registry.
Note that supported SDKs' maturity and feature set can vary across languages. While the core API is standardized, some language-specific implementations might have differences in features or stability levels.
You can find the official list of supported languages and their documentation on the OpenTelemetry website.
OpenTelemetry signals and stability explained
Signals refer to the different types of telemetry data that the OpenTelemetry framework is designed to collect, process, and export. We're currently dealing with three primary types of signals: distributed traces, metrics, and logs, with continuous profiling in early development as a fourth.
Each component, including the individual signal types, language-specific SDKs, and Collector integrations, is handled by a different group within the OpenTelemetry project, leading to a truly collaborative effort to develop and maintain the framework.
In OpenTelemetry, "stability" refers to a specific stage in the maturity lifecycle of a component or signal. It could mean stability in the specification, semantic conventions, protocol representation, language-specific SDKs, and the collector.
A component or signal deemed "stable" has a well-defined API, schema, and behavior unlikely to undergo significant changes in future releases. This stability allows you to reliably build upon and integrate these components in your production applications without concern for disruptive changes.
It's important to understand that stability in one area does not mean "stable for everything". Always consult the official documentation to verify the latest stability status of any components you plan to utilize in your projects.
Now, let's look at each major signal supported by OpenTelemetry and their stability status:
1. Traces — stable
Distributed tracing within OpenTelemetry reached general availability in September 2021. This means the Tracing API, SDK, and Protocol specifications are considered stable and suitable for production use.
At the time of writing, the OTel tracing implementation for all officially supported languages is stable except for Rust, which is currently in beta.
Language | Traces |
---|---|
C++ | Stable |
C#/.NET | Stable |
Erlang/Elixir | Stable |
Go | Stable |
Java | Stable |
JavaScript | Stable |
PHP | Stable |
Python | Stable |
Ruby | Stable |
Rust | Beta |
Swift | Stable |
2. Metrics — stable
OpenTelemetry metrics achieved general availability in 2021, signifying that its API, SDK, and Protocol specifications are production-ready for various programming languages. That said, development for full SDK stability is still ongoing across the board.
Language | Metrics |
---|---|
C++ | Stable |
C#/.NET | Stable |
Erlang/Elixir | Experimental |
Go | Stable |
Java | Stable |
JavaScript | Stable |
PHP | Stable |
Python | Stable |
Ruby | In development |
Rust | Alpha |
Swift | Experimental |
3. Logs — stable
The general availability announcement of OpenTelemetry logs at KubeCon North America 2023 marked a significant step towards wider adoption. It enables the OpenTelemetry Collector and APIs/SDKs to seamlessly capture, process, and export logs along with metrics and traces, making it an attractive solution for the many organizations that start their observability journey with logs.
Language | Logs |
---|---|
C++ | Stable |
C#/.NET | Stable |
Erlang/Elixir | Experimental |
Go | Alpha |
Java | Stable |
JavaScript | Experimental |
PHP | Stable |
Python | Experimental |
Ruby | In development |
Rust | Alpha |
Swift | In development |
Creating a plan to adopt OpenTelemetry
Before embracing OpenTelemetry for your project, a thorough assessment of your current technology stack is necessary. Start by identifying the programming languages and frameworks powering your frontend and backend services. This will guide your selection of compatible client libraries and instrumentation agents.
Next, pinpoint the specific telemetry data (logs, metrics, or traces) you need to collect and their origins. Whether they're generated within your application or sourced from external systems like Kafka, Docker, or PostgreSQL, understanding this will direct your choice of receivers for the OpenTelemetry Collector.
If your existing code already generates telemetry data, determine whether it utilizes OpenCensus, OpenTracing, or another framework. OpenTelemetry is backwards compatible with both OpenCensus and OpenTracing, which should eliminate the need for major initial code modifications.
However, to leverage the full potential of OpenTelemetry, a gradual migration is recommended. If you rely on vendor-specific instrumentation, anticipate the need for re-instrumentation using OpenTelemetry.
Finally, determine the destination of your telemetry data. Are you using open-source tools like Jaeger or Prometheus, a proprietary vendor solution, or even a Kafka cluster for further processing? This decision will dictate the exporters you'll need within the OpenTelemetry Collector.
By mapping out your technology stack and identifying the relevant OpenTelemetry components, you'll be well-prepared to evaluate their stability and readiness for your project's specific needs.
Instrumenting an application with OpenTelemetry
Instrumentation with OpenTelemetry involves adding code manually or using auto-instrumentation agents to generate telemetry data for each operation performed in an application.
We'll focus on instrumenting a service for generating traces, where each service operation emits one or more spans. Spans contain data on the service, operation, timing, context (trace/span IDs), and optional attributes.
The OpenTelemetry SDK propagates the span's context across services to establish causal relationships, ensuring spans can be reconstructed into meaningful traces for analysis in backend tools.
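As an illustration of what "propagating context" means on the wire, here is a sketch of the W3C Trace Context traceparent header, which OpenTelemetry SDKs use by default for HTTP propagation. The two functions are hypothetical stand-ins for what a propagator does, and they only handle the version 00 layout.

```javascript
// traceparent layout: version-traceId-parentSpanId-flags
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// The calling service serializes its current span context into a header.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// The receiving service parses the header and continues the same trace.
function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version === 'ff') return null; // reserved, always invalid
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const header = formatTraceparent({
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  spanId: 'fee8dce0965687ac',
  sampled: true,
});
const parsed = parseTraceparent(header);
console.log(header);
console.log(parsed);
```

A span created downstream reuses the parsed traceId and records the parsed spanId as its parent, which is what lets a backend stitch spans from different services into one trace.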
Let's see these concepts in action by instrumenting a basic Node.js application with OpenTelemetry and sending the generated data to Jaeger for analysis.
1. Begin with automatic instrumentation
Automatic instrumentation can help jumpstart your observability journey by capturing data from many popular libraries and frameworks without requiring any code changes. This means you can start collecting traces within minutes instead of doing everything manually.
Let's say you have a simple Fastify app like this:
import Fastify from 'fastify';

const fastify = Fastify({
  logger: false,
});

fastify.get('/', async function (request, reply) {
  const response = await fetch('https://icanhazdadjoke.com/', {
    headers: {
      Accept: 'application/json',
    },
  });
  const data = await response.json();
  reply.send({ data });
});

const PORT = parseInt(process.env.PORT || '8080');

fastify.listen({ port: PORT }, function (err, address) {
  if (err) {
    console.error(err);
    process.exit(1);
  }
  console.log(`Listening for requests on ${address}`);
});
You can instrument it with OpenTelemetry through its Node.js SDK and auto-instrumentations package so that it automatically creates spans for each incoming request.
Install the required packages first:
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/sdk-trace-node
Then set up the instrumentation in a different file:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Finally, you must register the instrumentation before your application code like this:
import './instrumentation.js';
import Fastify from 'fastify';
. . .
Start the application and send a few requests to the / endpoint. You'll start seeing spans that track the lifetime of each request in the console:
. . .
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: 'fee8dce0965687ac',
  traceState: undefined,
  name: 'GET',
  id: 'dc00b1e8bb4e11c2',
  kind: 2,
  timestamp: 1715071474394000,
  duration: 524759.881,
  attributes: {
    'http.request.method': 'GET',
    'http.request.method_original': 'GET',
    'url.full': 'https://icanhazdadjoke.com/',
    'url.path': '/',
    'url.query': '',
    'url.scheme': 'https',
    'server.address': 'icanhazdadjoke.com',
    'server.port': 443,
    'user_agent.original': 'node',
    'network.peer.address': '2606:4700:3033::6815:420f',
    'network.peer.port': 443,
    'http.response.status_code': 200
  },
  status: { code: 0 },
  events: [],
  links: []
}
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: 'fee8dce0965687ac',
  kind: 1,
  timestamp: 1715071474382000,
  duration: 544035.726,
  attributes: {
    'http.url': 'http://localhost:8080/',
    'http.host': 'localhost:8080',
    'net.host.name': 'localhost',
    'http.method': 'GET',
    'http.scheme': 'http',
    'http.target': '/',
    'http.user_agent': 'curl/8.6.0',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp',
    'net.host.ip': '::1',
    'net.host.port': 8080,
    'net.peer.ip': '::1',
    'net.peer.port': 59792,
    'http.status_code': 200,
    'http.status_text': 'OK'
  },
  status: { code: 0 },
  events: [],
  links: []
}
The trace for each request contains two spans: one for the request to the server and the other for the GET request to icanhazdadjoke.com's API. Armed with such data, you can immediately pinpoint slowdowns within your services or their dependencies.
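The parent/child relationship between those two spans is encoded entirely in the id and parentId fields. The sketch below rebuilds the tree from the flat console output, roughly as a tracing backend might; buildTree and render are hypothetical helpers, and the kind labels assume the JavaScript API's numbering (1 for server, 2 for client).

```javascript
// The two spans printed above, reduced to the fields needed here.
const spans = [
  { id: 'dc00b1e8bb4e11c2', parentId: 'fee8dce0965687ac', kind: 2, duration: 524759.881 },
  { id: 'fee8dce0965687ac', parentId: undefined, kind: 1, duration: 544035.726 },
];

// Link every span to its parent; spans without a known parent are roots.
function buildTree(flatSpans) {
  const byId = new Map(flatSpans.map((s) => [s.id, { ...s, children: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentId && byId.get(node.parentId);
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

// Render one indented line per span, with the duration in milliseconds.
function render(node, depth = 0) {
  const label = node.kind === 1 ? 'server' : node.kind === 2 ? 'client' : 'internal';
  const line = `${'  '.repeat(depth)}${label} ${node.id} ${(node.duration / 1000).toFixed(1)} ms`;
  return [line, ...node.children.flatMap((child) => render(child, depth + 1))];
}

const lines = buildTree(spans).flatMap((root) => render(root));
console.log(lines.join('\n'));
// server fee8dce0965687ac 544.0 ms
//   client dc00b1e8bb4e11c2 524.8 ms
```

Read this way, the output makes the bottleneck obvious: almost all of the server span's duration is spent waiting on the outgoing client call.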
Currently, automatic instrumentation is available for Java, .NET, Python, JavaScript, and PHP. Compiled languages like Go and Rust lack direct support, but automatic trace injection can still be achieved using external tools like eBPF or service mesh technologies.
Let's look at how to add custom instrumentation next for even deeper insights.
2. Manually instrument your code
Automatic instrumentation gives you a solid foundation, but to truly understand the inner workings of your system, you'll need custom instrumentation. This lets you monitor the specific business logic that makes your application unique.
To get started, you need to identify the unit of work you'd like to track. This could be function executions, cache interactions, background tasks, or other internal steps within a service.
Assume you have the following route in your application that calculates the specified Fibonacci number:
. . .
function fibonacci(n) {
  if (n <= 1) return n;
  return fibonacci(n - 1) + fibonacci(n - 2);
}

fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);
  const result = fibonacci(n);
  reply.send({ result });
});
. . .
You can create a span for each Fibonacci computation like this:
import './instrumentation.js';
import Fastify from 'fastify';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('fastify-app', '0.1.0');
. . .
fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);
  const span = tracer.startSpan('calculate-fibonacci-number', {
    attributes: {
      'fibonacci.input': n,
    },
  });
  const result = fibonacci(n);
  span.setAttribute('fibonacci.result', result);
  span.end();
  reply.send({ result });
});
Custom instrumentation starts with obtaining the tracer and creating a span for the work you'd like to track. You can attach key/value pairs to the span to provide more details about the operation that it's tracking. Once the operation is done, the span is finalized with span.end().
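One thing the route above does not cover is what happens when the tracked work throws: span.end() would never run. A possible way to guard against that is a small wrapper like the one sketched below. withSpan is a hypothetical helper, not part of @opentelemetry/api; it only relies on the startSpan, recordException, setStatus, and end methods that the API's tracer and span objects provide. A stand-in tracer is included so the sketch runs without the SDK.

```javascript
// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

// Hypothetical helper: runs fn inside a span and always ends the span,
// recording the exception if fn throws.
function withSpan(tracer, name, attributes, fn) {
  const span = tracer.startSpan(name, { attributes });
  try {
    return fn(span);
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SPAN_STATUS_ERROR, message: err.message });
    throw err;
  } finally {
    span.end();
  }
}

// Stand-in tracer for demonstration only. In the application you would
// pass the tracer returned by trace.getTracer() instead.
function createRecordingTracer(log) {
  return {
    startSpan(name, options = {}) {
      const record = { name, attributes: { ...options.attributes }, ended: false };
      log.push(record);
      return {
        setAttribute(key, value) { record.attributes[key] = value; return this; },
        recordException(err) { record.exception = err.message; },
        setStatus(status) { record.status = status; return this; },
        end() { record.ended = true; },
      };
    },
  };
}

const log = [];
const tracer = createRecordingTracer(log);
const result = withSpan(tracer, 'calculate-fibonacci-number', { 'fibonacci.input': 10 }, (span) => {
  const value = 55; // stand-in for fibonacci(10)
  span.setAttribute('fibonacci.result', value);
  return value;
});
console.log(result, log[0]);
```

This version handles synchronous work only; an asynchronous variant would need to await fn before ending the span.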
Such instrumentation will now capture spans detailing how long each Fibonacci calculation takes, its input, and the result:
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '94acf0a34595230b72acbd473ca78617',
  parentId: '91888aca54a65286',
  traceState: undefined,
  name: 'calculate-fibonacci-number',
  id: '88ec3f8d2304a32e',
  kind: 0,
  timestamp: 1715076859034000,
  duration: 28.72,
  attributes: { 'fibonacci.input': 10, 'fibonacci.result': 55 },
  status: { code: 0 },
  events: [],
  links: []
}
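When reading this output, it helps to know the units. As far as I can tell, the console exporter prints both timestamp and duration in microseconds, so the calculation above took about 0.03 ms. The hypothetical helper below converts the two fields into a more readable form under that assumption.

```javascript
// Assumes timestamp and duration are expressed in microseconds,
// as printed by the console span exporter.
function describeSpanTiming({ timestamp, duration }) {
  return {
    startedAt: new Date(timestamp / 1000).toISOString(),
    durationMs: duration / 1000,
  };
}

const timing = describeSpanTiming({ timestamp: 1715076859034000, duration: 28.72 });
console.log(timing.startedAt); // 2024-05-07T10:14:19.034Z
console.log(timing.durationMs);
```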
Up next, we'll explore how to visualize this collected data to troubleshoot issues and optimize your application!
3. Export trace data to a backend system
Now that you've generated all this helpful data, it's time to send it to a backend system for visualization and analysis. OpenTelemetry offers two main export methods:
- The aforementioned OpenTelemetry Collector, which offers flexibility for data processing and routing to various backends.
- A direct export from your application to one or more backends of your choice.
We'll use the second approach to export traces to Jaeger for simplicity. You can use the following command to launch Jaeger in your local environment:
docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.57
Unable to find image 'jaegertracing/all-in-one:1.57' locally
1.57: Pulling from jaegertracing/all-in-one
a88dc8b54e91: Already exists
1aad216be65d: Pull complete
4b87021fa57f: Pull complete
1c6e9aedbcb3: Pull complete
7e4eba3a7c50: Pull complete
Digest: sha256:8f165334f418ca53691ce358c19b4244226ed35c5d18408c5acf305af2065fb9
Status: Downloaded newer image for jaegertracing/all-in-one:1.57
. . .
Visit http://localhost:16686 to access the Jaeger UI.
OpenTelemetry includes exporter libraries for Node.js that allow you to push recorded spans directly to a consumer. In this case, you will push the generated spans to your local Jaeger instance.
Start by installing the OTLP trace exporter for Node.js with:
npm install --save @opentelemetry/exporter-trace-otlp-proto
Next, modify your instrumentation.js file as follows to configure the exporter:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
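The exporter above is constructed without a URL, so it falls back to defaults. The sketch below models how the OpenTelemetry specification describes endpoint selection for OTLP over HTTP, which is why the spans reach the Jaeger container through the published 4318 port. resolveTracesEndpoint is a hypothetical, simplified model and not the exporter's actual code; the otel-collector host name in the example is also made up.

```javascript
// Simplified model of OTLP/HTTP trace endpoint resolution.
function resolveTracesEndpoint(env) {
  // A signal-specific endpoint is used exactly as given.
  if (env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) {
    return env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT;
  }
  // A generic endpoint is treated as a base URL; the signal path is appended.
  if (env.OTEL_EXPORTER_OTLP_ENDPOINT) {
    return env.OTEL_EXPORTER_OTLP_ENDPOINT.replace(/\/+$/, '') + '/v1/traces';
  }
  // Otherwise the default local OTLP/HTTP port is used.
  return 'http://localhost:4318/v1/traces';
}

console.log(resolveTracesEndpoint({}));
console.log(resolveTracesEndpoint({ OTEL_EXPORTER_OTLP_ENDPOINT: 'http://otel-collector:4318/' }));
```

In practice this means you can later point the same application at a Collector by setting an environment variable, without touching instrumentation.js.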
Restart your application, and specify the OTEL_SERVICE_NAME environment variable so that you can identify your application traces in Jaeger:
OTEL_SERVICE_NAME=fastify-app node app.js
Be sure to send requests to your routes to generate some traces:
curl http://localhost:8080/
curl http://localhost:8080/fibonacci/40
Refresh the Jaeger UI, and click the Service dropdown to select your application.
Click Find traces to view the most recently collected traces for your service.
If you click on an individual trace, you are presented with a breakdown of the spans contained in the trace.
The trace generated for a request to /fibonacci/40 clearly shows that nearly all the time spent generating the response went to calculating the specified Fibonacci number.
In a distributed scenario where all the downstream services are also instrumented for tracing and push spans to the same Jaeger instance, you'll see the entire request journey mapped out in Jaeger!
This demonstrates the general process of instrumenting an application with OpenTelemetry and seeing the timeline of how requests flow through your system, revealing the interactions between components.
Best practices for OpenTelemetry instrumentation
To ensure that OpenTelemetry's utility is maximized in your project, follow the guidelines below:
1. Avoid over-instrumentation
While auto-instrumentation offers convenience, exercise caution to avoid excessive, irrelevant data that hampers troubleshooting. Selectively enable auto-instrumentation only for necessary libraries and be measured when instrumenting your code.
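With the Node.js auto-instrumentations package used earlier, individual instrumentations can be switched off through a configuration object keyed by package name. The sketch below builds such an object; disableInstrumentations is a hypothetical helper, and the two packages named are only examples of instrumentations that tend to be noisy.

```javascript
// Hypothetical helper: builds a config map that turns off the named
// instrumentations and leaves everything else at its default.
function disableInstrumentations(packageNames) {
  return Object.fromEntries(
    packageNames.map((name) => [name, { enabled: false }])
  );
}

const instrumentationConfig = disableInstrumentations([
  '@opentelemetry/instrumentation-fs',
  '@opentelemetry/instrumentation-dns',
]);
console.log(instrumentationConfig);

// In instrumentation.js the object would be passed along like this:
//   instrumentations: [getNodeAutoInstrumentations(instrumentationConfig)],
```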
2. Instrument as you code
Embrace observability-driven development (ODD) by incorporating instrumentation while writing code. This ensures targeted instrumentation and prevents technical debt associated with retrofitting observability later.
3. Own your instrumentation
Application teams should take ownership of instrumenting their code. Their intimate knowledge of the codebase ensures optimal instrumentation and effective troubleshooting.
4. Deploy an OpenTelemetry Collector
Utilize at least one Collector instance to centralize data collection and processing from various sources instead of sending telemetry data directly from your application. This streamlines data management, enables seamless backend switching, and simplifies future observability adjustments through YAML configuration updates.
Challenges of OpenTelemetry
Despite its immense potential and growing popularity, OpenTelemetry presents several challenges that you need to consider before adopting it in your organization:
1. Maturity and stability
While the tracing component is fairly mature, many aspects of logs and metrics support are still evolving. This can lead to inconsistencies, breaking changes, and a steeper learning curve for new adopters.
2. Complexity
OpenTelemetry is a complex project with a wide range of features and components. The learning curve can be steep, particularly if you're new to observability or distributed tracing concepts. Properly configuring and managing the Collector can also be challenging, requiring a deep understanding of its configuration options.
3. Instrumentation overhead
While automatic instrumentation simplifies the process, it can sometimes introduce performance overhead, especially in high-traffic environments. Fine-tuning and optimizing instrumentation may be necessary to minimize the impact on application performance.
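Sampling is one common way to reduce that overhead: only a fraction of traces is recorded and exported. The sketch below shows the idea behind ratio-based sampling, where the decision is derived from the trace ID so that every service reaches the same verdict for a given trace. shouldSample is a hypothetical illustration and does not reproduce the exact algorithm used by any SDK.

```javascript
// Decide whether to keep a trace, based only on its ID and a ratio in [0, 1].
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as a number in 0 .. 2^32 - 1.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

const traceIds = [
  '1bb070e6ff071ce5ae311695861ad5ae',
  '94acf0a34595230b72acbd473ca78617',
];
for (const traceId of traceIds) {
  console.log(traceId, shouldSample(traceId, 0.5));
}
```

The SDK specification also describes OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables for configuring a sampler without code changes; check your SDK's documentation for the values it supports.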
4. Varying component quality
The evolving nature and varying quality of OpenTelemetry libraries and documentation pose a challenge, especially as new versions are released frequently. The current inconsistency in maturity across components can lead to varying user experiences depending on your specific needs and goals.
5. Documentation gaps
Documentation and best practices are still evolving, and there might be a lack of clear guidance for certain use cases or specific technologies. This can lead to trial and error, and slower adoption.
Final thoughts
I hope this article has helped you understand where OpenTelemetry fits in your observability strategy and how it provides a standardized vendor-neutral way to collect telemetry signals from your application and infrastructure.
For even more information on OpenTelemetry, consider visiting their website, and digging deeper into the official documentation.
Thanks for reading!