A Beginner's Guide to Distributed Tracing
Tracing is the practice of capturing and recording details about the execution flow of a software application, primarily for troubleshooting purposes. It evolved from the need to understand and debug increasingly complex systems, which outgrew developers' ability to track logic and data flow manually.
Today's internet applications are often implemented as cloud-native distributed systems, built from a network of components that may be created in different languages by different teams and deployed to multiple locations across the globe.
This approach to application development introduces significant challenges as user requests frequently cross process, machine, and network boundaries to be completed. Understanding how the application functions in this context requires you to monitor interactions between all these interconnected components.
These challenges have led to the rise of distributed tracing, which is simply the process of tracking the progression of a single request as it navigates through the various components and services that make up an application.
The collected telemetry data (called a trace) provides a visual map of system relationships, enabling you to pinpoint errors and performance bottlenecks rapidly.
This article will guide you through distributed tracing basics: how it works, its core concepts, benefits and challenges, and instrumenting your application for a successful implementation.
Let's get started!
Why is distributed tracing needed?
The last 15 years have witnessed a revolution in application design driven by cloud computing, containerization, and microservices. This shift offers greater scalability, efficiency, and rapid feature delivery, but it also introduces new challenges and complexity. Understanding this complexity is where distributed tracing becomes essential.
Consider a scenario where a user browses an e-commerce website, searches for products, adds items to their shopping cart, and proceeds to checkout and payment. The potential services involved and their interactions could include:
The product catalog service that handles queries for products, coordinating searches across various microservices for categories, availability, shipping details, and pricing.
The recommendation engine, which provides personalized product suggestions based on user history, influencing search results and listings.
The payment gateway responsible for processing payments, which may interact with external banking services or credit card networks.
The order management system, which processes user orders after payment, coordinates inventory updates, and initiates shipping logistics.
In such a system, a single request may pass through several microservices and even third-party APIs outside the application. A failure or slowdown in any of these components can ruin the user experience.
Pinpointing the root cause of issues in such systems is often challenging due to frequent service updates, a lack of deep expertise across all services when teams are siloed, and the potential for shared resources to create ripple effects across seemingly unrelated requests.
The scenario described above gave rise to distributed tracing, which tells the story of exactly what happened to any request you're investigating. It helps you answer questions like:
- What services did the request pass through?
- What did each service do, and how long did it take?
- Where are the bottlenecks in the pipeline?
- If there was an error, where did it originate?
- Did the request behave abnormally?
- What was the critical path?
- Who should be paged to investigate and fix the issue?
Google's influential Dapper paper (2010) by Ben Sigelman et al. brought distributed tracing as a technique for understanding complex systems into the mainstream.
This sparked a wave of open-source projects, including prominent examples like Twitter's Zipkin (2012) and Uber's Jaeger (2017). From these and other solutions, the core concepts and benefits of distributed tracing have been solidified, despite differences in how individual tools are implemented.
In the next section, let's go into the specifics of how distributed tracing works.
How distributed tracing works
The primary purpose of distributed tracing is to enable you to see the relationships among various services. This is achieved by collecting data the moment a request is initiated and assigning a unique identifier, called a trace ID, to link together all the events and data related to that specific request.
As a request journeys through various services, each one creates a span to represent a unit of work done by the service. Spans record timing information, the service name, and potentially additional data related to the work being performed.
Within each trace, the first span is the root span, with subsequent spans nested within it, each with its own unique ID. Spans inherit the trace ID, and critically, services pass this context when communicating with each other. This links all the actions taken for a request across the entire system.
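In practice, this context is commonly propagated between services in HTTP headers, such as the W3C Trace Context traceparent header. As a rough illustration (plain JavaScript, not tied to any tracing library), here is how a downstream service could read the trace ID and parent span ID out of such a header; the function name is mine, not part of any standard API:

```javascript
// Parse a W3C Trace Context "traceparent" header into its parts.
// Format: version-traceId-parentSpanId-flags (all lowercase hex).
function parseTraceparent(header) {
  const [version, traceId, parentId, flags] = header.split('-');
  if (!traceId || traceId.length !== 32 || !parentId || parentId.length !== 16) {
    return null; // malformed header
  }
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream service keeps the trace ID and uses the parent span ID
// to link its own spans into the same trace.
const ctx = parseTraceparent('00-1bb070e6ff071ce5ae311695861ad5ae-fee8dce0965687ac-01');
console.log(ctx.traceId); // '1bb070e6ff071ce5ae311695861ad5ae' — flows through every service
console.log(ctx.sampled); // true
```

In real deployments, instrumentation libraries inject and extract this header for you on every outgoing and incoming request, so all services agree on the trace they belong to.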
Once the request is completed, all trace data is sent to a tracing backend or monitoring platform. This might involve services directly sending data or having an agent collect it.
The collected traces are often visualized as flame charts, revealing how different services interact within a single request. Since spans are timed, you get a clear timeline of the request flow, which helps guide your troubleshooting efforts.
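To see why the parent/child links matter, consider a toy example (plain JavaScript, with hypothetical span data and durations) that reconstructs the tree a flame chart is drawn from:

```javascript
// Hypothetical, simplified spans as a tracing backend might receive them.
// Durations are arbitrary illustrative numbers (milliseconds).
const spans = [
  { id: 'a1', parentId: undefined, name: 'GET /checkout', duration: 300 },
  { id: 'b2', parentId: 'a1', name: 'payment-gateway', duration: 220 },
  { id: 'c3', parentId: 'a1', name: 'order-db-query', duration: 40 },
];

// Group spans under their parents: this tree is what a flame chart renders.
function buildTree(spans) {
  const byId = new Map(spans.map((s) => [s.id, { ...s, children: [] }]));
  let root = null;
  for (const node of byId.values()) {
    const parent = byId.get(node.parentId);
    if (parent) parent.children.push(node);
    else root = node; // no parent within the trace: this is the root span
  }
  return root;
}

const root = buildTree(spans);
console.log(root.name, '->', root.children.map((c) => c.name));

// The widest bar beneath the root points at the bottleneck.
const slowest = root.children.reduce((a, b) => (a.duration > b.duration ? a : b));
console.log('bottleneck:', slowest.name); // prints "bottleneck: payment-gateway"
```

A tracing backend does the same grouping (by trace ID, then by parent span ID) at much larger scale before drawing the timeline for you.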
How distributed tracing impacts observability
Distributed tracing sits alongside logs, events, metrics, and profiling as the building blocks of observability, but they serve different purposes and complement each other.
Tracing provides context by linking related operations across services, which is invaluable for understanding the behavior of distributed systems where issues may span multiple components.
Logs and events provide depth by offering detailed, local information about the work performed within each component. They can provide the "why" behind the behavior observed in traces, such as detailed error messages or system states leading up to an issue.
Metrics report aggregated statistics on service health like error rates, request latency, and resource utilization, while continuous profiling helps you understand resource usage down to the line number in your program.
A brief introduction to OpenTelemetry
In the past, getting telemetry data out of your applications was a vendor-specific process, and every monitoring tool had its way of doing things. This meant switching tools required a rewrite of your instrumentation code, leading to friction and duplicated effort.
To combat this, the observability community developed open-source projects like OpenTracing (from the Cloud Native Computing Foundation (CNCF)) and OpenCensus (from Google). These competing projects aimed to create a vendor-neutral way to collect application-level telemetry data. In 2019, they merged to form OpenTelemetry (OTel) under the CNCF.
OpenTelemetry offers a unified solution for capturing traces, metrics, logs, and other telemetry data. It's designed to be agnostic about where you send this data for analysis, eliminating the lock-in of past solutions while providing a robust, standardized approach to instrumentation. With OTel, you can instrument your code once and easily switch observability backends as often as you like.
The following section will show you how to jumpstart your distributed tracing journey with the flexibility and power of OTel. You can use the lessons learned to quickly instrument your applications and export the generated data to any number of observability solutions.
Instrumenting an application with OpenTelemetry
OpenTelemetry supports instrumenting programs written in many languages, such as JavaScript, Go, C++, Rust, Ruby, Python, Java, .NET, and others. In this section, I'll demonstrate the core instrumentation workflow through a Node.js application. However, note that the concepts discussed here apply to any supported language.
1. Begin with automatic instrumentation
Automatic instrumentation can help jumpstart your tracing journey by capturing data from many popular libraries and frameworks without requiring any code changes. This means you can start collecting traces within minutes instead of doing everything manually.
Let's say you have a simple Fastify app like this:
import Fastify from 'fastify';

const fastify = Fastify({
  logger: false,
});

fastify.get('/', async function (request, reply) {
  const response = await fetch('https://icanhazdadjoke.com/', {
    headers: {
      Accept: 'application/json',
    },
  });

  const data = await response.json();
  reply.send({ data });
});

const PORT = parseInt(process.env.PORT || '8080');

fastify.listen({ port: PORT }, function (err, address) {
  if (err) {
    console.error(err);
    process.exit(1);
  }
  console.log(`Listening for requests on ${address}`);
});
You can instrument it with OpenTelemetry through its Node.js SDK and auto-instrumentations package so that it automatically creates spans for each incoming request.
Install the required packages first:
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/sdk-trace-node
Then set up the instrumentation in a separate file named instrumentation.js:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Finally, register the instrumentation before the rest of your application code by importing it first:
import './instrumentation.js';
import Fastify from 'fastify';
. . .
Start the application and send a few requests to the / endpoint. You'll start seeing spans that track the lifetime of each request in the console:
. . .
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: 'fee8dce0965687ac',
  traceState: undefined,
  name: 'GET',
  id: 'dc00b1e8bb4e11c2',
  kind: 2,
  timestamp: 1715071474394000,
  duration: 524759.881,
  attributes: {
    'http.request.method': 'GET',
    'http.request.method_original': 'GET',
    'url.full': 'https://icanhazdadjoke.com/',
    'url.path': '/',
    'url.query': '',
    'url.scheme': 'https',
    'server.address': 'icanhazdadjoke.com',
    'server.port': 443,
    'user_agent.original': 'node',
    'network.peer.address': '2606:4700:3033::6815:420f',
    'network.peer.port': 443,
    'http.response.status_code': 200
  },
  status: { code: 0 },
  events: [],
  links: []
}
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: 'fee8dce0965687ac',
  kind: 1,
  timestamp: 1715071474382000,
  duration: 544035.726,
  attributes: {
    'http.url': 'http://localhost:8080/',
    'http.host': 'localhost:8080',
    'net.host.name': 'localhost',
    'http.method': 'GET',
    'http.scheme': 'http',
    'http.target': '/',
    'http.user_agent': 'curl/8.6.0',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp',
    'net.host.ip': '::1',
    'net.host.port': 8080,
    'net.peer.ip': '::1',
    'net.peer.port': 59792,
    'http.status_code': 200,
    'http.status_text': 'OK'
  },
  status: { code: 0 },
  events: [],
  links: []
}
The trace for each request contains two spans: one for the inbound request to the server, and another for the outgoing GET request to icanhazdadjoke.com's API. Armed with such data, you can immediately pinpoint slowdowns within your services or their dependencies.
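You can read this directly off the sample output: span durations are in microseconds, so a quick back-of-the-envelope calculation (plain JavaScript, using the numbers from the two spans above) shows where the time went:

```javascript
// Durations (microseconds) copied from the two spans printed above.
const serverSpan = 544035.726; // inbound GET / handled by Fastify
const clientSpan = 524759.881; // outgoing GET to icanhazdadjoke.com

// Share of the server span spent waiting on the external API call.
const share = (clientSpan / serverSpan) * 100;
console.log(`${share.toFixed(1)}% of the request was spent in the external call`);
```

Roughly 96% of the server's time here was spent waiting on the third-party API — exactly the kind of dependency bottleneck that is hard to see from logs alone.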
Let's look at how to add custom instrumentation next for even deeper insights.
2. Instrument your code
Automatic instrumentation gives you a solid foundation, but to truly understand the inner workings of your system, you'll need custom instrumentation. This lets you monitor the specific business logic that makes your application unique.
To get started, you need to identify the unit of work you'd like to track. This could be function executions, cache interactions, background tasks, or other internal steps within a service.
Assuming you have the following route in your application that calculates the specified Fibonacci number:
. . .

function fibonacci(n) {
  if (n <= 1) return n;
  return fibonacci(n - 1) + fibonacci(n - 2);
}

fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);
  const result = fibonacci(n);
  reply.send({ result });
});

. . .
You can create a span for each Fibonacci computation like this:
import './instrumentation.js';
import Fastify from 'fastify';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('fastify-app', '0.1.0');

. . .

fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);

  const span = tracer.startSpan('calculate-fibonacci-number', {
    attributes: {
      'fibonacci.input': n,
    },
  });

  const result = fibonacci(n);
  span.setAttribute('fibonacci.result', result);
  span.end();

  reply.send({ result });
});
Custom instrumentation starts with obtaining a tracer and creating a span for the work you'd like to track. You can attach key/value pairs to the span to provide more details about the operation it's tracking. Once the operation is done, the span is finalized with span.end().
Such instrumentation will now capture spans detailing how long each Fibonacci calculation takes, its input, and the result:
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '94acf0a34595230b72acbd473ca78617',
  parentId: '91888aca54a65286',
  traceState: undefined,
  name: 'calculate-fibonacci-number',
  id: '88ec3f8d2304a32e',
  kind: 0,
  timestamp: 1715076859034000,
  duration: 28.72,
  attributes: { 'fibonacci.input': 10, 'fibonacci.result': 55 },
  status: { code: 0 },
  events: [],
  links: []
}
Up next, we'll explore how to visualize this collected data to troubleshoot issues and optimize your application!
3. Export traces for analysis
Now that you've generated all this helpful data, it's time to send it to a backend system for visualization and analysis. OpenTelemetry offers two main export methods:
- OpenTelemetry Collector: This acts as a proxy between your application and the backend, offering flexibility for processing and routing data to various destinations.
- Direct export: Send data straight from your application to the backend of your choice.
In this section, we'll use the second approach to export traces to Jaeger. You can use the following command to launch Jaeger in your local environment:
docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.57
Unable to find image 'jaegertracing/all-in-one:1.57' locally
1.57: Pulling from jaegertracing/all-in-one
a88dc8b54e91: Already exists
1aad216be65d: Pull complete
4b87021fa57f: Pull complete
1c6e9aedbcb3: Pull complete
7e4eba3a7c50: Pull complete
Digest: sha256:8f165334f418ca53691ce358c19b4244226ed35c5d18408c5acf305af2065fb9
Status: Downloaded newer image for jaegertracing/all-in-one:1.57
. . .
Visit http://localhost:16686 in your browser to access the Jaeger UI.
OpenTelemetry includes exporter libraries for Node.js that allow you to push recorded spans directly to a consumer. In this case, you will push the generated spans to your local Jaeger instance.
Start by installing the OTLP trace exporter package for Node.js with:
npm install --save @opentelemetry/exporter-trace-otlp-proto
Next, modify your instrumentation.js file as follows to configure the exporter:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Restart your application, specifying the OTEL_SERVICE_NAME environment variable so that you can identify your application's traces in Jaeger:
OTEL_SERVICE_NAME=fastify-app node app.js
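The SDK also honors other standard OpenTelemetry environment variables, so you can retarget the exporter without touching code. A sketch (the collector hostname below is just an example value, not a real endpoint):

```shell
# Name the service as it should appear in Jaeger.
export OTEL_SERVICE_NAME=fastify-app

# Point the OTLP exporter somewhere other than the default
# (http://localhost:4318 for OTLP over HTTP); example value only.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.example.com:4318

# Then start the app as before:
# node app.js
```

Keeping endpoint configuration in the environment rather than in instrumentation.js makes it easy to send traces to different backends per deployment environment.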
Send a few requests to your routes to generate some traces:
curl http://localhost:8080/
curl http://localhost:8080/fibonacci/30
Refresh the Jaeger UI, then click the Service dropdown to select your application.
Click Find traces to view the most recently collected traces for your service.
Clicking an individual trace presents a breakdown of the spans it contains.
The trace generated for a request to /fibonacci/30 clearly shows that nearly all the time spent producing the response went into calculating the specified Fibonacci number.
In a distributed scenario where all the downstream services were also instrumented for tracing and pushing spans to the same Jaeger instance, you'll see the entire request journey mapped out in Jaeger!
This demonstrates the general process of instrumenting an application with OpenTelemetry and seeing the timeline of how requests flow through your system, revealing the interactions between components.
Final thoughts
I hope this article has helped you understand how distributed tracing can help you decipher the interactions between the different services that make up your application, and how to get started with the vendor-neutral OpenTelemetry project for tracing instrumentation.
For more information on tracing and OpenTelemetry, consider visiting the official OpenTelemetry website and digging deeper into its documentation.
Thanks for reading, and happy tracing!