
A Beginner's Guide to Distributed Tracing

Ayooluwa Isaiah
Updated on May 13, 2024

Tracing is the practice of capturing and recording details about the execution flow of a software application, primarily for troubleshooting purposes. It evolved from the need to understand and debug increasingly complex systems that outgrew developers' ability to track logic and data flow manually.

Today's internet applications are often implemented as cloud-native distributed systems, built from a network of components that may be created in different languages by different teams and deployed to multiple locations across the globe.

This approach to application development introduces significant challenges as user requests frequently cross process, machine, and network boundaries to be completed. Understanding how the application functions in this context requires you to monitor interactions between all these interconnected components.

These challenges have led to the rise of distributed tracing, which is simply the process of tracking the progression of a single request as it navigates through the various components and services that make up an application.

The collected telemetry data (called a trace) provides a visual map of system relationships, enabling you to pinpoint errors and performance bottlenecks rapidly.

This article will guide you through distributed tracing basics: how it works, its core concepts, benefits and challenges, and instrumenting your application for a successful implementation.

Let's get started!

Why is distributed tracing needed?

The last 15 years have witnessed a revolution in application design driven by cloud computing, containerization, and microservices. This shift offers greater scalability, efficiency, and rapid feature delivery, but it also introduces new challenges and complexity. Understanding this complexity is where distributed tracing becomes essential.

Consider a scenario where a user browses an e-commerce website, searches for products, adds items to their shopping cart, and proceeds to checkout and payment. The potential services involved and their interactions could include:

  1. The product catalog service that handles queries for products, coordinating searches across various microservices for categories, availability, shipping details, and pricing.

  2. The recommendation engine, which provides personalized product suggestions based on user history, influencing search results and listings.

  3. The payment gateway responsible for processing payments, which may interact with external banking services or credit card networks.

  4. The order management system, which processes user orders after payment, coordinates inventory updates, and initiates shipping logistics.

In such a system, a single request may pass through several microservices and even third-party APIs outside the application. A failure or slowdown in any of these components can ruin the user experience.

Pinpointing the root cause of issues in such systems is often challenging due to frequent service updates, a lack of deep expertise across all services when teams are siloed, and the potential for shared resources to create ripple effects across seemingly unrelated requests.

The scenario described above gave rise to distributed tracing, which tells the story of exactly what happened to any request you're investigating. It helps you answer questions like:

  • What services did the request pass through?
  • What did each service do, and how long did it take?
  • Where are the bottlenecks in the pipeline?
  • If there was an error, where did it originate?
  • Did the request behave abnormally?
  • What was the critical path?
  • Who should be paged to investigate and fix the issue?

Google's influential Dapper paper (2010) by Ben Sigelman et al. brought distributed tracing as a technique for understanding complex systems into the mainstream.

This sparked a wave of open-source projects, including prominent examples like Twitter's Zipkin (2012) and Uber's Jaeger (2017). From these and other solutions, the core concepts and benefits of distributed tracing have been solidified, despite differences in how individual tools are implemented.

In the next section, let's go into the specifics of how distributed tracing works.

How distributed tracing works

The primary purpose of distributed tracing is to enable you to see the relationships among various services. This is achieved by collecting data the moment a request is initiated and assigning a unique identifier called a trace ID to link together all the events and data related to that specific request.

As a request journeys through various services, each one creates a span to represent a unit of work done by the service. Spans record timing information, the service name, and potentially additional data related to the work being performed.

Within each trace, the first span is the root span, with subsequent spans nested within it, each with its own unique ID. Spans inherit the trace ID, and critically, services pass this context when communicating with each other. This links all the actions taken for a request across the entire system.
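To make context propagation concrete, here's a minimal sketch of how a trace ID and parent span ID can travel between services via the W3C Trace Context `traceparent` header. The helper names are hypothetical and for illustration only; in practice, OpenTelemetry's propagators handle this for you:

```javascript
// A toy sketch of W3C Trace Context propagation (hypothetical helpers;
// real applications should rely on OpenTelemetry's built-in propagators).
import { randomBytes } from 'node:crypto';

// Start a brand-new trace: a 16-byte trace ID plus an 8-byte span ID
// for the root span.
function startTraceContext() {
  return {
    traceId: randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
  };
}

// Serialize the context as a `traceparent` header (version-traceid-spanid-flags)
// so a downstream service can link its spans to the same trace.
function toTraceparent({ traceId, spanId }) {
  return `00-${traceId}-${spanId}-01`;
}

// The receiving service keeps the trace ID and records the caller's span ID
// as the parent of its own newly created span.
function fromTraceparent(header) {
  const [, traceId, parentSpanId] = header.split('-');
  return { traceId, parentSpanId, spanId: randomBytes(8).toString('hex') };
}

const ctx = startTraceContext();
const downstream = fromTraceparent(toTraceparent(ctx));
console.log(downstream.traceId === ctx.traceId); // true: same trace on both sides
```

Because every service forwards this header on outgoing calls, the tracing backend can later reassemble all spans that share a trace ID into a single request timeline.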

Structure of a trace

Once the request is completed, all trace data is sent to a tracing backend or monitoring platform. This might involve services directly sending data or having an agent collect it.

The collected traces are often visualized in flame charts, revealing how different services interact within a single request. Since spans are timed, you will see a clear timeline of the request flow, which can help guide your troubleshooting efforts accordingly.
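To illustrate how a backend turns flat span data into that nested timeline, here's a toy reconstruction (not an OTel API; span names and timings are invented) that links each span's parentId to its parent and renders an indented, flame-chart-like view:

```javascript
// Toy reconstruction of a trace tree from flat spans (illustrative only;
// real tracing backends do this, plus clock alignment, for you).
const spans = [
  { id: 'a1', parentId: undefined, name: 'GET /checkout', start: 0, end: 120 },
  { id: 'b2', parentId: 'a1', name: 'payment-gateway', start: 10, end: 90 },
  { id: 'c3', parentId: 'b2', name: 'card-network call', start: 20, end: 80 },
];

// Index spans by ID, then attach each span to its parent's children list.
function buildTree(spans) {
  const byId = new Map(spans.map((s) => [s.id, { ...s, children: [] }]));
  let root;
  for (const node of byId.values()) {
    if (node.parentId === undefined) root = node;
    else byId.get(node.parentId).children.push(node);
  }
  return root;
}

// Render an indented view with per-span durations, one line per span.
function render(node, depth = 0) {
  const lines = [`${'  '.repeat(depth)}${node.name} (${node.end - node.start}ms)`];
  for (const child of node.children) lines.push(...render(child, depth + 1));
  return lines;
}

console.log(render(buildTree(spans)).join('\n'));
// GET /checkout (120ms)
//   payment-gateway (80ms)
//     card-network call (60ms)
```

Reading such a view top to bottom shows exactly where time was spent: here the outer request took 120ms, of which 80ms went to the payment gateway, which in turn spent 60ms waiting on the card network.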

How distributed tracing impacts observability

Distributed tracing sits alongside logs, events, metrics, and profiling as the building blocks of observability, but they serve different purposes and complement each other.

Tracing provides context by linking related operations across services, which is invaluable for understanding the behavior of distributed systems where issues may span multiple components.

Logs and events provide depth by offering detailed, local information about the work performed within each component. They can provide the "why" behind the behavior observed in traces, such as detailed error messages or system states leading up to an issue.

Metrics report aggregated statistics on service health like error rates, request latency, and resource utilization, while continuous profiling helps you understand resource usage down to the line number in your program.

A brief introduction to OpenTelemetry

In the past, getting telemetry data out of your applications was a vendor-specific process, and every monitoring tool had its own way of doing things. This meant that switching tools required rewriting your instrumentation code, leading to friction and duplicated effort.

To combat this, the observability community developed open-source projects like OpenTracing (from the Cloud Native Computing Foundation (CNCF)) and OpenCensus (from Google). These competing projects aimed to create a vendor-neutral way to collect application-level telemetry data. In 2019, they merged to form OpenTelemetry (OTel) under the CNCF.

OpenTelemetry offers a unified solution for capturing traces, metrics, logs, and other telemetry data. It's designed to be agnostic about where you send this data for analysis, eliminating the lock-in of past solutions while providing a robust, standardized approach to instrumentation. With OTel, you can instrument your code once and easily switch observability backends as often as you like.

The following section will show you how to jumpstart your distributed tracing journey with the flexibility and power of OTel. You can use the lessons learned to quickly instrument your applications and export the generated data to any number of observability solutions.

Instrumenting an application with OpenTelemetry

OpenTelemetry supports instrumenting programs written in many languages, such as JavaScript, Go, C++, Rust, Ruby, Python, Java, .NET, and others. In this section, I'll demonstrate the core instrumentation workflow through a Node.js application. However, note that the concepts discussed here apply to any supported language.

1. Begin with automatic instrumentation

Automatic instrumentation can help jumpstart your tracing journey by capturing data from many popular libraries and frameworks without requiring any code changes. This means you can start collecting traces within minutes instead of doing everything manually.

Let's say you have a simple Fastify app like this:

app.js
import Fastify from 'fastify';

const fastify = Fastify({
  logger: false,
});

fastify.get('/', async function (request, reply) {
  const response = await fetch('https://icanhazdadjoke.com/', {
    headers: {
      Accept: 'application/json',
    },
  });
  const data = await response.json();
  reply.send({ data });
});

const PORT = parseInt(process.env.PORT || '8080');

fastify.listen({ port: PORT }, function (err, address) {
  if (err) {
    console.error(err);
    process.exit(1);
  }

  console.log(`Listening for requests on ${address}`);
});

You can instrument it with OpenTelemetry through its Node.js SDK and auto-instrumentations package so that it automatically creates spans for each incoming request.

Install the required packages first:

 
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/sdk-trace-node

Then set up the instrumentation in a different file:

instrumentation.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Finally, you must register the instrumentation before your application code like this:

app.js
import './instrumentation.js';
import Fastify from 'fastify';

. . .

Start the application and send a few requests to the / endpoint. You'll start seeing spans that track the lifetime of each request in the console:

Output
. . .
{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: 'fee8dce0965687ac',
  traceState: undefined,
  name: 'GET',
  id: 'dc00b1e8bb4e11c2',
  kind: 2,
  timestamp: 1715071474394000,
  duration: 524759.881,
  attributes: {
    'http.request.method': 'GET',
    'http.request.method_original': 'GET',
    'url.full': 'https://icanhazdadjoke.com/',
    'url.path': '/',
    'url.query': '',
    'url.scheme': 'https',
    'server.address': 'icanhazdadjoke.com',
    'server.port': 443,
    'user_agent.original': 'node',
    'network.peer.address': '2606:4700:3033::6815:420f',
    'network.peer.port': 443,
    'http.response.status_code': 200
  },
  status: { code: 0 },
  events: [],
  links: []
}

{
  resource: {
    attributes: {
      . . .
    }
  },
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: 'fee8dce0965687ac',
  kind: 1,
  timestamp: 1715071474382000,
  duration: 544035.726,
  attributes: {
    'http.url': 'http://localhost:8080/',
    'http.host': 'localhost:8080',
    'net.host.name': 'localhost',
    'http.method': 'GET',
    'http.scheme': 'http',
    'http.target': '/',
    'http.user_agent': 'curl/8.6.0',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp',
    'net.host.ip': '::1',
    'net.host.port': 8080,
    'net.peer.ip': '::1',
    'net.peer.port': 59792,
    'http.status_code': 200,
    'http.status_text': 'OK'
  },
  status: { code: 0 },
  events: [],
  links: []
}

The trace for each request contains two spans: one for the request to the server and the other for the GET request to icanhazdadjoke.com's API. Armed with such data, you'll be able to immediately pinpoint slowdowns within your services or their dependencies.
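You can see this linkage directly in the output above: both spans share the same traceId, and the outgoing fetch span's parentId equals the server span's id (the server span is the root, since its parentId is undefined). Reduced to its essentials:

```javascript
// The parent/child linkage between the two spans printed above,
// using the exact IDs from the example output.
const serverSpan = {
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  id: 'fee8dce0965687ac',
  parentId: undefined, // no parent: this is the root span
};

const fetchSpan = {
  traceId: '1bb070e6ff071ce5ae311695861ad5ae',
  id: 'dc00b1e8bb4e11c2',
  parentId: 'fee8dce0965687ac', // points back at the server span
};

// Both spans belong to the same trace...
console.log(fetchSpan.traceId === serverSpan.traceId); // true

// ...and the fetch span is a child of the server (root) span.
console.log(fetchSpan.parentId === serverSpan.id); // true
```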

Let's look at how to add custom instrumentation next for even deeper insights.

2. Instrument your code

Automatic instrumentation gives you a solid foundation, but to truly understand the inner workings of your system, you'll need custom instrumentation. This lets you monitor the specific business logic that makes your application unique.

To get started, identify the unit of work you'd like to track. This could be function executions, cache interactions, background tasks, or other internal steps within a service.

Assuming you have the following route in your application that calculates the specified Fibonacci number:

app.js
. . .

function fibonacci(n) {
  if (n <= 1) return n;
  return fibonacci(n - 1) + fibonacci(n - 2);
}

fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);

  const result = fibonacci(n);

  reply.send({ result });
});

. . .

You can create a span for each Fibonacci computation like this:

app.js
import './instrumentation.js';
import Fastify from 'fastify';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('fastify-app', '0.1.0');

. . .

fastify.get('/fibonacci/:n', (request, reply) => {
  const n = parseInt(request.params.n, 10);

  const span = tracer.startSpan('calculate-fibonacci-number', {
    attributes: {
      'fibonacci.input': n,
    },
  });

  const result = fibonacci(n);

  span.setAttribute('fibonacci.result', result);
  span.end();

  reply.send({ result });
});

Custom instrumentation starts with obtaining the tracer and creating a span for the work you'd like to track. You can attach key/value pairs to the span to provide more details about the operation that it's tracking. Once the operation is done, the span is finalized with span.end().

Such instrumentation will now capture spans detailing how long each Fibonacci calculation takes, its input, and the result:

Output
{
  resource: {
    attributes: {
     . . .
    }
  },
  traceId: '94acf0a34595230b72acbd473ca78617',
  parentId: '91888aca54a65286',
  traceState: undefined,
  name: 'calculate-fibonacci-number',
  id: '88ec3f8d2304a32e',
  kind: 0,
  timestamp: 1715076859034000,
  duration: 28.72,
  attributes: { 'fibonacci.input': 10, 'fibonacci.result': 55 },
  status: { code: 0 },
  events: [],
  links: []
}

Up next, we'll explore how to visualize this collected data to troubleshoot issues and optimize your application!

3. Export traces for analysis

Now that you've generated all this helpful data, it's time to send it to a backend system for visualization and analysis. OpenTelemetry offers two main export methods:

  1. OpenTelemetry collector: This acts as a proxy, offering flexibility for data processing and routing to various backends.
  2. Direct export: Send data straight from your application to the backend of your choice.

In this section, we'll use the second approach to export traces to Jaeger. You can use the following command to launch Jaeger in your local environment:

 
docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.57
Output
Unable to find image 'jaegertracing/all-in-one:1.57' locally
1.57: Pulling from jaegertracing/all-in-one
a88dc8b54e91: Already exists
1aad216be65d: Pull complete
4b87021fa57f: Pull complete
1c6e9aedbcb3: Pull complete
7e4eba3a7c50: Pull complete
Digest: sha256:8f165334f418ca53691ce358c19b4244226ed35c5d18408c5acf305af2065fb9
Status: Downloaded newer image for jaegertracing/all-in-one:1.57
. . .

Visit http://localhost:16686 to access the Jaeger UI. You should see:

Jaeger user interface

OpenTelemetry includes exporter libraries for Node.js that allow you to push recorded spans directly to a consumer. In this case, you will push the generated spans to your local Jaeger instance.

Start by installing the OTLP trace exporter package for Node.js with:

 
npm install --save @opentelemetry/exporter-trace-otlp-proto

Next, modify your instrumentation.js file as follows to configure the exporter:

instrumentation.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Restart your application, and specify the OTEL_SERVICE_NAME environment variable so that you can identify your application traces in Jaeger:

 
OTEL_SERVICE_NAME=fastify-app node app.js

Then send requests to your routes to generate some traces:

 
curl http://localhost:8080/
 
curl http://localhost:8080/fibonacci/30

Refresh the Jaeger UI, and click the Service dropdown to select your application:

Choosing the application in Jaeger

Click Find traces to view the most recently collected traces for your service:

Finding traces in Jaeger

If you click on an individual trace you are presented with a breakdown of the spans contained in the trace:

Trace Spans in Jaeger

This trace was generated for a request to /fibonacci/40, and it clearly shows that nearly all of the time spent generating the response went into calculating the specified Fibonacci number.

In a distributed scenario where all the downstream services were also instrumented for tracing and pushing spans to the same Jaeger instance, you'll see the entire request journey mapped out in Jaeger!

This demonstrates the general process of instrumenting an application with OpenTelemetry and seeing the timeline of how requests flow through your system, revealing the interactions between components.

Final thoughts

I hope this article has helped you understand how distributed tracing can help you decipher the interactions between the different services that make up your application, and how to get started with the vendor-neutral OpenTelemetry project for tracing instrumentation.

For even more information on tracing and OpenTelemetry, consider visiting their official website, and digging deeper into the documentation.

Thanks for reading, and happy tracing!

Article by
Ayooluwa Isaiah
Ayo is the Head of Content at Better Stack. His passion is simplifying and communicating complex technical ideas effectively. His work was featured on several esteemed publications including LWN.net, Digital Ocean, and CSS-Tricks. When he’s not writing or coding, he loves to travel, bike, and play tennis.
Licensed under CC-BY-NC-SA

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
