
A Beginner's Guide to Distributed Tracing

Ayooluwa Isaiah
Updated on July 1, 2024

Tracing is the practice of capturing and recording details about the execution flow of a software application, primarily for troubleshooting purposes. It evolved from the need to understand and debug increasingly complex systems, which have outgrown developers' ability to track logic and data flow manually.

Today's internet applications are often implemented as cloud-native distributed systems, built from a network of components written in different languages by different teams and deployed to multiple locations across the globe.

This approach to application development introduces significant challenges as user requests frequently cross process, machine, and network boundaries to be completed. Understanding how the application functions in this context requires you to monitor interactions between all these interconnected components.

These challenges have led to the rise of distributed tracing, which is simply the process of tracking the progression of a single request as it navigates through the various components and services that make up an application.

The collected telemetry data (called a trace) provides a visual map of system relationships, enabling you to pinpoint errors and performance bottlenecks rapidly.

This article will guide you through distributed tracing basics: how it works, its core concepts, benefits and challenges, and instrumenting your application for a successful implementation.

Let's get started!

Why is distributed tracing needed?

The last 15 years have witnessed a revolution in application design driven by cloud computing, containerization, and microservices. This shift offers greater scalability, efficiency, and rapid feature delivery, but it also introduces new challenges and complexity. Understanding this complexity is where distributed tracing becomes essential.

Consider a scenario where a user browses an e-commerce website, searches for products, adds items to their shopping cart, and proceeds to checkout and payment. The potential services involved and their interactions could include:

  1. The product catalog service that handles queries for products, coordinating searches across various microservices for categories, availability, shipping details, and pricing.

  2. The recommendation engine, which provides personalized product suggestions based on user history, influencing search results and listings.

  3. The payment gateway responsible for processing payments, which may interact with external banking services or credit card networks.

  4. The order management system, which processes user orders after payment, coordinates inventory updates, and initiates shipping logistics.

In such a system, a single request may pass through several microservices and even third-party APIs outside the application. A failure or slowdown in any of these components can ruin the user experience.

Pinpointing the root cause of issues in such systems is often challenging due to frequent service updates, a lack of deep expertise across all services when teams are siloed, and the potential for shared resources to create ripple effects across seemingly unrelated requests.

This scenario is what gave rise to distributed tracing, which tells the story of exactly what happened to any request you're investigating. It helps you answer questions like:

  • What services did the request pass through?
  • What did each service do, and how long did it take?
  • Where are the bottlenecks in the pipeline?
  • If there was an error, where did it originate?
  • Did the request behave abnormally?
  • What was the critical path?
  • Who should be paged to investigate and fix the issue?

Google's influential Dapper paper (2010) by Ben Sigelman et al. brought distributed tracing into the mainstream as a technique for understanding complex systems.

This sparked a wave of open-source projects, including prominent examples like Twitter's Zipkin (2012) and Uber's Jaeger (2017). From these and other solutions, the core concepts and benefits of distributed tracing have been solidified, despite differences in how individual tools are implemented.

In the next section, we'll go into the specifics of how distributed tracing works.

How distributed tracing works

A trace documents the complete process of an operation you're monitoring, such as an HTTP request or a background job. If this operation involves multiple services working together, the resulting record is called a distributed trace, as it encompasses data collected from all participating services.

The primary purpose of distributed tracing is to let you see the relationships among various services. This is achieved by collecting data the moment a request is initiated and assigning it a unique identifier, called a trace ID, that links together all the events and data related to that specific request.

As a request journeys through various services, each one creates one or more spans to represent the units of work done by the service. Spans record timing information, the service name, and potentially contextual attributes related to the work being performed.

Structure of a trace

Within each trace, there is a parent-child hierarchy of spans. The initial span is the root, with subsequent spans nested within it, each with its own unique ID. Each span inherits the trace ID from its parent, and critically, services pass this context along when communicating with each other. This links all the actions taken for a request across the entire distributed system.

Once the request is completed, all trace data is sent to a tracing backend or monitoring platform. This might involve services directly sending data or having an agent collect it (usually the latter).

The collected traces are often visualized as flame charts, revealing how different services interact within a single request. Since spans are timed, you get a clear timeline of the request flow, which helps guide your troubleshooting efforts.

Distributed tracing components

Distributed tracing involves several components that work together to capture and analyze the flow of requests across a distributed system. The key components and their relationships are explained below:

1. Trace

A trace represents the end-to-end journey of a request as it passes through various services and components in a system. It consists of a collection of spans that share the same trace ID.

2. Trace ID

The trace ID is a unique identifier assigned to each distinct trace. It links together all the spans that constitute a single trace, regardless of the processes or services involved. It is generated at the entry point of the request in a way that ensures global uniqueness with a high degree of certainty.

3. Span

A span is the fundamental unit of work in a trace. Each span has a unique ID, a parent ID (except the root span), a start and end time, a name, and additional metadata that describe the operation.

 
{
  "name": "/api/products",
  "context": {
    "trace_id": "5a10e8d6b5c423e1f2a81c90d7f35f4a",
    "span_id": "9a2f0e3c1d4b5a6f"
  },
  "parent_id": "",
  "start_time": "2023-12-04 10:15:32.123456789 +0000 UTC",
  "end_time": "2023-12-04 10:15:32.154321012 +0000 UTC",
  "status_code": "STATUS_CODE_OK",
  "status_message": "",
  "attributes": {
    "net.transport": "IP.TCP",
    "net.peer.ip": "192.168.1.100",
    "net.peer.port": "54321",
    "net.host.ip": "10.0.0.5",
    "net.host.port": "8080",
    "http.method": "POST",
    "http.target": "/api/products",
    "http.server_name": "product-service",
    "http.route": "/api/products",
    "http.user_agent": "Mozilla/5.0",
    "http.scheme": "http",
    "http.host": "10.0.0.5:8080",
    "http.flavor": "1.1"
  },
  "events": [
    {
      "name": "",
      "message": "Product created successfully",
      "timestamp": "2023-12-04 10:15:32.154320998 +0000 UTC"
    }
  ]
}

4. Span attributes

Attributes are key-value pairs that provide additional context about a specific span within a trace. They allow you to capture details about the operation that are not captured by the span's name or other standard fields.

For example, if a span tracks the process of playing a song in a music streaming service, attributes could include the song ID, artist name, album title, user ID, and device type.
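To make this concrete, here is a minimal sketch of how such attributes might be set with the OpenTelemetry Python API; the span name and attribute keys are illustrative, not an established convention:

from opentelemetry import trace

# Acquire a tracer from the globally configured tracer provider
tracer = trace.get_tracer("music-streaming-service")

# Record hypothetical attributes describing a "play song" operation
with tracer.start_as_current_span("play-song") as span:
    span.set_attribute("song.id", "SONG-12345")
    span.set_attribute("song.artist", "Example Artist")
    span.set_attribute("user.id", "user-6789")
    span.set_attribute("device.type", "mobile")
    # ... stream the song here ...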

5. Span events

Think of span events as annotations marking specific occurrences during a span's lifetime. A span event represents something significant that's too fleeting to get its own span, yet too distinct to be merged into the parent span. For example, when tracking a database query, you can use span events to mark when the connection to the database is established, when the query is sent, when the first row of results is received, and when the entire result set is fetched.
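As a rough sketch of that idea, the snippet below uses the OpenTelemetry Python API to annotate a database query span with events; the span and event names are hypothetical:

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("fetch-orders") as span:
    span.add_event("db.connection.established")
    span.add_event("db.query.sent", {"db.statement": "SELECT * FROM orders"})
    # ... execute the query and stream results ...
    span.add_event("db.first_row.received")
    span.add_event("db.result_set.fetched", {"db.rows_returned": 42})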

6. Trace context


The trace context is a set of identifiers that are propagated across service boundaries to connect spans together and form a complete trace. It includes the trace ID, parent span ID, and other state information such as whether the parent span was sampled or not.

7. Context propagation

This is the fundamental mechanism that underpins distributed tracing. It ensures the trace context is passed along with requests as they move between services so that the generated spans can be correlated regardless of their origin. For HTTP requests, context propagation is typically accomplished through HTTP headers.
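The sketch below illustrates the idea with OpenTelemetry's Python propagation API, which by default writes and reads the W3C trace context headers on a dictionary of HTTP headers; the service names and URL are hypothetical:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

# Client side: inject the current trace context into the outgoing request headers
with tracer.start_as_current_span("call-payment-service"):
    headers = {}
    inject(headers)  # adds the traceparent (and tracestate, if present) headers
    requests.post("http://payment-service/api/charge", headers=headers)

# Server side: extract the incoming context so spans created here become
# children of the caller's span
def handle_charge(request_headers):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("process-charge", context=ctx):
        ...  # process the payment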

8. Trace sampling

Sampling allows you to regulate the volume of spans sent to your observability backend. This helps manage ingest costs associated with exporting and storing every span, filter out irrelevant or low-priority traces, and optimize system performance by reducing the overhead of capturing and transmitting excessive data.
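For example, with the OpenTelemetry Python SDK, a head-based sampling policy that keeps roughly 10% of traces might be configured like this (a sketch; the ratio is arbitrary):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample about 10% of root spans; child spans follow their parent's decision
# so traces are never recorded partially.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))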

9. Instrumentation

Instrumentation refers to the process of adding code to your application to generate trace (or other telemetry) data.
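As a minimal sketch, manual instrumentation with the OpenTelemetry Python SDK typically means configuring a tracer provider once at startup and then creating spans wherever work happens; the service and span names below are assumptions:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time setup at application startup
provider = TracerProvider(resource=Resource.create({"service.name": "product-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Anywhere in the application code
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("create-product"):
    ...  # the work being traced

Many languages also offer automatic instrumentation for popular frameworks and libraries, so you often don't have to create every span by hand.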

10. Backend analysis tool


The backend is where trace data is stored and analyzed. Popular tracing backends include Jaeger, Zipkin, Grafana Tempo, AWS X-Ray, and more. They provide interfaces for querying, analyzing, and visualizing trace data to help you understand the flow of requests and identify performance bottlenecks.

How distributed tracing impacts observability

Distributed tracing sits alongside logs, events, metrics, and profiling as the building blocks of observability, but they serve different purposes and complement each other.

Tracing provides context by linking related operations across services, which is invaluable for understanding the behavior of distributed systems where issues may span multiple components.

Logs and events provide depth by offering detailed, local information about the work performed within each component. They can provide the "why" behind the behavior observed in traces, such as detailed error messages or system states leading up to an issue.

Metrics report aggregated statistics on service health like error rates, request latency, and resource utilization, while continuous profiling helps you understand resource usage down to the line number in your program.

The distributed tracing ecosystem

Deploying a successful tracing system goes beyond simply instrumenting your application and setting up a tracing backend. This section outlines some key aspects of the distributed tracing ecosystem you should be aware of to ensure full compatibility and interoperability with various tools now and well into the future.

OpenTelemetry


In the past, instrumenting your code and getting telemetry data out of your applications was a largely vendor-specific process, and every monitoring tool had its way of doing things. This meant switching tools required a rewrite of your instrumentation code, leading to friction and duplicated effort.

To combat this, the observability community developed open-source projects like OpenTracing (from the Cloud Native Computing Foundation (CNCF)) and OpenCensus (from Google). These competing projects aimed to create a vendor-neutral way to generate and collect telemetry data. In 2019, they merged to form OpenTelemetry (OTel) under the CNCF.

OpenTelemetry offers a unified solution for generating and capturing traces, metrics, logs, and other telemetry data. It's designed to be agnostic about where you send this data for analysis, eliminating the lock-in of past solutions while providing a robust, standardized approach to instrumentation. With OTel, you can instrument your code once and switch observability backends at will.
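In practice, this means the instrumentation shown earlier stays the same and only the exporter changes. A rough sketch, assuming the opentelemetry-exporter-otlp package is installed and an OpenTelemetry Collector (or compatible backend) is listening on the default gRPC port:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Swap the console exporter for an OTLP exporter; the endpoint is an assumption
# and depends on where your collector or backend is running.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)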

Learn more: A Complete Introductory Guide to OpenTelemetry

Trace analysis tools

Trace analysis tools are designed to help developers and system administrators understand the behavior of complex software systems. They do this by analyzing captured traces, which are used to:

  • Provide graphical representations of how requests move through different components of a system, making it easier to identify bottlenecks or errors.

  • Pinpoint slow or inefficient parts of your code or infrastructure.

  • Trace errors back to their root cause by examining the sequence of events leading up to the error.

  • Monitor the overall health of a system, identify trends, and predict potential issues.

There are many different trace analysis tools available, both open source and commercial. We'll go over how to choose the best tool for your needs in a different article.

W3C Trace Context

In 2017, a coalition of vendors, cloud providers, and open source projects established the W3C Distributed Tracing Working Group, which focuses on creating standards for tracing tool interoperability.

The problem was that existing distributed tracing implementations lacked a standardized way to propagate context information across different vendors and platforms.

This lack of standardization led to an inability to correlate traces across different vendors, difficulty propagating traces between vendors, potential loss of vendor-specific metadata, and a lack of support from cloud and service providers.

Their primary initiative, the Trace Context specification, provides a vendor-neutral solution for propagating tracing information (known as trace context) across these distributed systems. It does this through two key HTTP headers, illustrated after the list below:

  • traceparent: This header carries the essential information for identifying and correlating requests within a trace. It includes:

    • Trace ID: A unique identifier for the entire trace, spanning all services involved.
    • Parent ID: The ID of the direct parent operation in the trace, establishing the relationship between operations.
    • Flags: Indicate sampling decisions and other trace-related options.
  • tracestate: This header allows vendors to include their own custom tracing data alongside the standardized traceparent information. This enables interoperability while still supporting vendor-specific features.
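For instance, a traceparent header carrying the trace and span IDs from the span example earlier would look something like this, where the leading 00 is the specification version and the trailing 01 flag marks the trace as sampled:

traceparent: 00-5a10e8d6b5c423e1f2a81c90d7f35f4a-9a2f0e3c1d4b5a6f-01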

W3C Baggage

The W3C Baggage specification standardizes the representation and propagation of application-defined properties for a distributed request or workflow execution. It is separate from the Trace Context specification and is implemented as its own HTTP header, which looks like this:

 
baggage: userId=alice,serverNode=DF%2028,isProduction=false

Its purpose is to enable the propagation of system-specific contextual data across services in a distributed system without requiring any modifications to the services themselves. Each service can access existing contextual information or add new ones to be shared with subsequent services in the workflow.
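As an illustrative sketch, OpenTelemetry exposes baggage through a small Python API; the entry names below mirror the header example above and are purely hypothetical:

from opentelemetry import baggage, context

# Attach baggage entries to the current context (e.g. in an upstream service)
ctx = baggage.set_baggage("userId", "alice")
ctx = baggage.set_baggage("isProduction", "false", context=ctx)
token = context.attach(ctx)

# Any code running under this context (including downstream services, once the
# context has been propagated) can read the entries back
user_id = baggage.get_baggage("userId")  # -> "alice"

context.detach(token)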

The term "baggage" was coined by Professor Rodrigo Fonseca from Brown University. It draws from the concept of luggage or baggage that travelers carry with them. Just as luggage contains items necessary for a journey, the baggage header carries contextual information necessary for a request's journey through a distributed system.

Challenges of distributed tracing

Distributed tracing, while incredibly powerful for understanding complex systems, presents several challenges:

1. Clock skew

Clock skew refers to a discrepancy between the system clocks of different nodes in a distributed system. For instance, if your application server's system time isn't synchronized with that of your database server, the spans they create might appear out of order in the final trace. To mitigate this, use the Network Time Protocol (NTP) or your cloud provider's clock synchronization services.

2. Instrumentation overhead

Capturing and transmitting trace data can introduce noticeable latency and resource overhead, especially in high-traffic environments. Careful configuration and optimization are necessary to minimize this impact.

3. Learning curve

Understanding and implementing distributed tracing concepts can be complex, requiring a good grasp of concepts like context propagation, instrumentation, and sampling strategies. You'll need to invest a non-trivial amount of time to learn and master these concepts.

4. Data volume and sampling

A major hurdle in distributed tracing is managing the sheer volume of data. With potentially thousands of services generating a deluge of trace data every second, questions arise: How to efficiently capture and store it? What data to retain and for how long? How to scale data collection in line with the ever-increasing request volume?

Sampling techniques help with managing data volume, but they introduce the challenge of ensuring representative sampling without losing critical information.

Final thoughts

I hope this article has helped you understand how distributed tracing can help you decipher the interactions between the different services that make up your application, and how to navigate the tracing ecosystem effectively.

Thanks for reading, and happy tracing!

Article by
Ayooluwa Isaiah
Ayo is the Head of Content at Better Stack. His passion is simplifying and communicating complex technical ideas effectively. His work has been featured in several esteemed publications, including LWN.net, Digital Ocean, and CSS-Tricks. When he's not writing or coding, he loves to travel, bike, and play tennis.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
