Distributed tracing is a technique that empowers you to track requests as they
navigate complex distributed systems. It reveals the path of requests,
processing times, service connections, and potential failure points.
Jaeger is a leading open-source distributed tracing tool that excels in
collecting, storing, and visualizing traces across microservices, enabling
comprehensive system monitoring and troubleshooting. Its integration with
Kubernetes and support for popular storage solutions like Elasticsearch and
Cassandra make it ideal for large-scale, dynamic environments.
In this article, we will explore Jaeger's architecture and provide a
step-by-step guide on how to implement tracing with Jaeger, from initial setup
to visualizing and analyzing trace data for improved system performance and
reliability.
Let's get started!
Prerequisites
Before proceeding with this article, ensure that you meet the following
requirements:
Basic familiarity with Docker and Docker Compose.
A recent version of Docker
installed on your local machine.
What is Jaeger?
Jaeger is a distributed tracing backend that emerged at Uber in 2015 after being
inspired by
Google's Dapper
and Twitter's OpenZipkin. It later joined the Cloud Native
Computing Foundation (CNCF) as an incubating project in 2017, before being
promoted to graduated status in 2019.
Its core function is to monitor and troubleshoot distributed applications by
providing a clear view of request flows and trace data throughout the system.
This allows for the identification of performance bottlenecks, the root cause of
errors, and a deeper understanding of service dependencies.
While Jaeger itself doesn't provide tracing instrumentation, it initially
maintained OpenTracing-compatible tracers. However,
these have been deprecated
in favor of OpenTelemetry-based instrumentation, which
is the recommended method for generating trace data in the OpenTelemetry
Protocol (OTLP) format.
Essentially, OpenTelemetry handles the collection of telemetry data (including
logs, metrics, and traces), while Jaeger focuses on the storage, visualization,
and analysis of trace data specifically. Jaeger does not support logs or metrics.
Although OpenTelemetry is the preferred approach, Jaeger maintains compatibility
with Zipkin instrumentation for those who have already invested in it.
Instrument tracing without code changes
While Jaeger provides powerful distributed tracing capabilities, Better Stack Tracing can instrument your Kubernetes or Docker workloads without code changes. Start ingesting traces in about 5 minutes.
Predictable pricing and up to 30x cheaper than Datadog. Start free in minutes.
How does Jaeger work?
Jaeger's operation revolves around collecting and analyzing trace data generated
by distributed systems. This assumes that your application is already
instrumented to produce such data using OpenTelemetry SDKs or compatible
libraries.
With that in mind, here's a brief overview of how it all works:
Collection: Instrumented applications generate spans representing units
of work within a trace. These spans are then sent to a local OpenTelemetry
collector, or directly to the Jaeger collector via OTLP exporters, which
aggregates and batches them.
Processing: Jaeger's processing pipeline receives the trace data, validates
it, applies sampling to manage volume (if configured), and enriches it with
additional details.
Storage: The processed spans are subsequently stored in a scalable storage
backend such as Cassandra or Elasticsearch.
Ingester (optional): If a message queue like Apache Kafka sits between the
collector and storage as a buffer, the Jaeger ingester reads from Kafka and
writes to the storage backend.
Querying: The Jaeger query service enables data querying by providing an
API to retrieve trace data from the configured storage.
Visualization: Jaeger provides a web interface for querying your trace
data and analyzing the results to identify performance bottlenecks and
troubleshoot issues.
Now that you understand the internal mechanisms of Jaeger, let's move on to the
practical steps of setting it up in your local environment.
Setting up Jaeger locally
As we explored in the previous section, Jaeger consists of several components
working together to manage trace data. For convenient local testing, we'll
utilize the all-in-one
Docker image which bundles the Jaeger UI, collector, query, and agent components
with an in-memory storage.
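A typical invocation looks like the following; the published ports cover the
UI, OTLP, and Zipkin endpoints, and the exact set you need may vary with your
Jaeger version:
docker run \
  --rm \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest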
This command launches a self-contained Jaeger instance, accessible through the
UI at http://localhost:16686. It's capable of accepting trace data from
various sources, including OpenTelemetry and Zipkin.
You should see the following messages in the terminal (truncated for brevity),
confirming that the various services are listening on their respective ports:
When you open http://localhost:16686 in your browser, you should also observe
the Jaeger user interface. Since it's backed by an in-memory database, there's
not much to see here:
With Jaeger successfully running, let's now introduce the demo project we'll use
to generate trace data for visualization.
Setting up the demo application
Jaeger provides a demo application called
HotROD,
which simulates a ride-sharing service with multiple microservices, all
instrumented with OpenTelemetry for trace generation.
A standalone Docker image is also provided for this application, so you can run
the command below in a separate terminal to get set up:
docker run \
--rm \
--name hotrod \
--link jaeger \
--env OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318 \
-p 8080-8083:8080-8083 \
jaegertracing/example-hotrod:latest \
all
As you can see from the logs, there are four services: frontend, customer,
route, and driver. You can access the main user interface by opening
http://localhost:8080 in your browser:
The interface displays four customers, and clicking on any of them simulates
requesting a ride, showing the assigned driver and estimated arrival time.
You'll also see some other bits of information on the screen:
The web client ID (8258 in my case) on the top left is a random session ID
assigned each time you refresh the page.
The driver's license plate (T757873C) that is responding to the request.
The request ID (req: 8258-1) is an amalgamation of the client ID and a
sequence number.
The latency, which shows how long the backend took to respond (737ms).
The interface also provides links to view traces generated by the current client
or the current request in Jaeger. Clicking the open trace link will open the
corresponding trace in the Jaeger UI:
Next, we'll delve deeper into how to interpret and utilize this trace data in
Jaeger for understanding and optimizing your applications.
Examining the HotROD architecture
We've learned that the HotROD application comprises four microservices, but to
understand the request flow and interactions, we can leverage Jaeger's ability
to automatically generate architecture diagrams.
The previous ride-sharing request provided Jaeger with enough data to create a
visual representation. Go to the System Architecture page in the Jaeger UI
to see it in action:
While the Force Directed Graph is useful for large architectures, for
HotROD's smaller scale, switch to the DAG (Directed Acyclic Graph) tab:
Interpreting the diagram
This diagram reveals the components involved in fulfilling the ride request. We
see the frontend initiating calls to three downstream services, two of which
further interact with MySQL and Redis.
The graph also indicates the frequency of calls to each service; for example,
route was called 10 times, and redis-manual was called 14 times. A single
frontend request triggered 27 service interactions.
While this diagram offers a high-level overview, it doesn't detail the request
flow or individual service response times. To gain these insights, we'll examine
the actual traces in the next section.
Viewing trace data in Jaeger
To view the request trace data in Jaeger, you can click the open trace link
in the HotROD interface like you did earlier. Another way is to open the Jaeger
search page at http://localhost:16686/search where you'll see all the names of
the services in the Services dropdown.
Select the root service (frontend in this case) and click the Find Traces
button:
You may see more than one trace depending on how long the application has been
running, but the one we're interested in here is the frontend: /dispatch
entry, which should be the first one on the list:
This entry summarizes the trace, displaying the total spans, any errors, the
involved services, and the duration of the backend operation (730.02ms), which
might differ slightly from the frontend UI's report due to network latency.
Clicking the entry opens the timeline view, revealing details about the trace.
The left side shows the hierarchy of calls, while the right side presents a
Gantt chart visualizing each span's timing.
Red circles with exclamation marks signify error-generating calls. Clicking
these spans reveals details like the error tag and captured logs explaining the
error ("redis timeout" in this case):
Understanding the timeline view
The trace timeline visualizes the chronological sequence of events within the
ride-sharing request. Each colored bar on the Gantt chart signifies a span (a
unit of work performed by a service), showing its duration and timing relative
to other spans in the trace.
In this specific request:
The frontend service receives a /dispatch request, initiating the
process.
The frontend then makes a GET request to the /customer endpoint of the
customer service.
The customer service executes a SELECT SQL query on MySQL, returning the
results to the frontend.
Upon receiving the results, the frontend makes an RPC call to the driver
service (driver.DriverService/FindNearest), which then makes multiple calls
to Redis (some failing).
After the driver service completes, the frontend executes a sequence of
GET requests to the /route endpoint of the route service.
Finally, the results are sent back to the external caller and displayed in
the HotROD UI.
This demonstrates the power of tracing in understanding request flow within
distributed systems. In the following section, we'll explore the specific
contents of a span in Jaeger for deeper insights.
Digging deeper into spans
Each span within a trace holds a wealth of information, so let's expand the root
span (frontend /dispatch) in the trace timeline to explore its contents. The
header displays the operation name, start time, duration, and the service
responsible for the span. As the root span, it encompasses the entire trace
duration.
The span consists of three sections:
1. Tags
These are key-value pairs attached to individual spans for capturing
context-specific information like HTTP status codes, database query parameters,
or custom attributes.
OpenTelemetry Semantic Conventions provide
standardized attribute names for common scenarios, ensuring consistency and
portability across systems.
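For context, here is a hedged sketch of how a service might attach such tags
using the OpenTelemetry Go API; the span name and attribute values are modeled
loosely on what HotROD displays, not copied from its code:
package customer

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// getCustomer starts a span and attaches context-specific tags to it.
// The attribute keys follow common semantic conventions; the values are
// purely illustrative.
func getCustomer(ctx context.Context, customerID string) {
	tracer := otel.Tracer("customer")
	_, span := tracer.Start(ctx, "SQL SELECT")
	defer span.End()

	span.SetAttributes(
		attribute.String("db.system", "mysql"),
		attribute.String("db.statement", "SELECT * FROM customers WHERE customer_id = ?"),
		attribute.String("customer.id", customerID),
	)
}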
2. Process
Process tags are associated with the entire service and added to all its spans.
These include details like hostname, version, environment, and other static
attributes.
3. Logs
Logs captured during a span's execution offer insights into the service's
behavior in a specific operation. These logs can also be viewed through
Docker (docker logs hotrod), but Jaeger
provides a more focused view within the trace context.
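In OpenTelemetry terms, these span logs are span events. Below is a minimal,
assumed sketch of how a service could record one; the event name and attribute
mirror the kind of entries HotROD emits but are not its actual code:
package customer

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// logLockWait adds a timestamped event to the given span; Jaeger renders
// events like this in the span's Logs section.
func logLockWait(span trace.Span, blockers int) {
	span.AddEvent("Waiting for lock",
		trace.WithAttributes(attribute.Int("blockers", blockers)),
	)
}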
Understanding service behavior through span logs
Span logs can reveal application actions without requiring you to delve into the
code. Here's a summary of the frontend service operation according to the
logs:
It requests customer data (with customer_id=123) from the customer service.
Upon receiving the customer's location (115,277), it forwards it to the
drivers service to find nearby drivers.
Once drivers are found, the route service is called 10 times (matching the
number of drivers) to calculate the shortest routes between driver and
customer locations.
The driver with the shortest route is dispatched, and the estimated arrival
time (eta) is recorded.
By combining Gantt charts, span tags, and span logs, Jaeger offers a holistic
view of distributed workflows, allowing easy navigation between high-level
overviews and detailed analysis of individual operations.
Diagnosing and fixing performance issues
One of the primary advantages of distributed tracing lies in its capability to
diagnose and pinpoint performance bottlenecks within complex systems. In this
section, we'll discuss how to interpret Jaeger's trace timeline and span details
to identify the sources of latency in the HotROD application.
Analyzing the trace timeline
Let's return to the high-level overview of the trace timeline in Jaeger:
From the timeline, we can make the following key observations:
The initial customer service call is critical as it blocks further
progress. It consumes nearly 40% of the total time, indicating a prime target
for optimization.
The driver service initially issues a FindDriverIDs request to the
redis-manual backend, which retrieves the set of driver IDs closest to the
user's location, as seen in the logs.
After obtaining the driver IDs, the system queries the redis-manual service
to fetch each driver's data. However, these queries are executed
sequentially, forming a distinct "staircase" pattern in the trace timeline.
This sequential execution appears unnecessary and inefficient, as these
queries could be performed concurrently to significantly reduce the overall
latency.
The route service calls demonstrate a concurrent execution pattern, where
multiple requests are handled simultaneously, but not fully in parallel. The
maximum number of concurrent requests is limited to three, indicating the use
of a fixed-size thread pool or similar mechanism to manage the execution.
This suggests an opportunity for optimization, as increasing the pool size or
exploring alternative concurrency models could potentially improve overall
performance.
Even a cursory examination of the timeline reveals potential areas for
improvement. Some optimizations are straightforward, like parallelizing Redis
and route queries, while others, like improving the customer service query,
require further investigation.
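To make the parallelization idea concrete, here is a rough sketch (not
HotROD's actual code) of how the sequential per-driver Redis lookups could be
fanned out across goroutines; the Driver type and fetch function are
hypothetical stand-ins:
package driver

import (
	"context"
	"sync"
)

// Driver is a simplified stand-in for HotROD's driver record.
type Driver struct {
	ID       string
	Location string
}

// fetchFunc is assumed to wrap the per-driver Redis lookup.
type fetchFunc func(ctx context.Context, id string) (Driver, error)

// getDriversConcurrently issues the per-driver lookups in parallel instead of
// one after another, removing the "staircase" pattern from the trace.
func getDriversConcurrently(ctx context.Context, ids []string, fetch fetchFunc) []Driver {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		drivers []Driver
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			d, err := fetch(ctx, id)
			if err != nil {
				return // a real implementation would record or retry the error
			}
			mu.Lock()
			drivers = append(drivers, d)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return drivers
}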
Simulating a high traffic scenario
To further understand how the service handles increased load, let's simulate a
scenario with numerous concurrent requests. You can either rapidly click
customer buttons in the HotROD UI or use the following command:
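The sketch below hits the frontend's /dispatch endpoint directly; the customer
ID and Baggage values are assumptions modeled on what the UI sends, so adjust
them to your setup:
for i in $(seq 1 50); do
  curl -s "http://localhost:8080/dispatch?customer=123" \
    -H "Baggage: session=1234,request=1234-$i" > /dev/null &
done
wait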
This command generates 50 concurrent requests with custom baggage headers
(session and request ID) for tracking. If using the browser, you'll find similar
details in the request headers:
Upon returning to Jaeger's Search page and refreshing, you'll likely see a surge
in traces. Sort them by Longest First to identify the most time-consuming
requests:
You might observe that the longest trace now takes significantly longer than the
initial single request (e.g., 3.15 seconds vs. 730ms in my case), indicating
significant performance degradation under higher load.
Identifying the bottleneck
Examining the longest trace reveals that the MySQL query now dominates the
overall duration, consuming approximately 86% of the total time. This query is
the primary bottleneck hindering the service's scalability.
Expanding the SQL SELECT span reveals logs that explain the delay:
The "Waiting for lock behind 8 transactions" event shows that the current
request has to wait for eight other requests to complete their queries first
before it is able to continue. It also provides the identities of the eight
requests through the blockers property (as supplied by the request
portion of the Baggage header)
Approximately 2.4 seconds later, the other requests are finally done and it's
able to acquire the lock. It also mentions that there are two other requests
waiting behind it and provides their identities as well.
These logs suggest the application is configured to allow only one database
connection at a time, causing significant contention and delays when multiple
requests are made concurrently.
Examining the source code and fixing the bottleneck
The only way to verify whether our inference is correct is to examine the
source code.
You can do this by cloning the
Jaeger repository to your machine:
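If you don't have the repository locally yet, clone it first:
git clone https://github.com/jaegertracing/jaeger.git
cd jaeger
The demo lives in the examples/hotrod directory, and the lookup we're
interested in should be in its customer service (services/customer within that
directory; the exact file layout may differ between Jaeger versions). The
lookup at the heart of the bottleneck looks like this: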
if customer, ok := d.customers[customerID]; ok {
return customer, nil
}
return nil, errors.New("invalid customer ID")
}
This lookup is intentionally slowed down: the surrounding code in this file
locks the database connection and introduces an artificial delay before
returning a result, creating the bottleneck we observed. However, the
application also includes configuration flags to disable these bottlenecks for
demonstration purposes.
If you have Go installed on your machine, you can change into the
examples/hotrod directory, and build the program with:
cd examples/hotrod
go build
A new hotrod binary will now be available in your current working directory.
To discover the available flags, run the help subcommand:
./hotrod help
Output
HotR.O.D. - A tracing demo application.
. . .
Flags:
-b, --basepath string Basepath for frontend service (default "/")
-c, --customer-service-port int Port for customer service (default 8081)
-d, --driver-service-port int Port for driver service (default 8082)
-D, --fix-db-query-delay duration Average latency of MySQL DB query (default 300ms)
-M, --fix-disable-db-conn-mutex Disables the mutex guarding db connection
-W, --fix-route-worker-pool-size int Default worker pool size (default 3)
. . .
The three fix-* flags highlighted above are provided for fixing the performance
issues we observed. You can use --fix-db-query-delay/-D to reduce the default
fixed latency of 300ms for SQL queries, and --fix-disable-db-conn-mutex/-M to
disable the blocking behavior.
Let's go ahead and apply these flags to the HotROD application to see their
effect. Make sure to stop the existing Docker container with the command below
before launching the new instance:
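docker stop hotrod
Since the hotrod binary now runs directly on your host while Jaeger stays in
Docker, point its OTLP exporter at the local Jaeger endpoint when starting it.
The exact flag values below are an assumption; any query delay well below the
300ms default illustrates the effect:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 ./hotrod all -D 50ms -M
Once it's running, repeat the high traffic simulation from the previous
section.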
After the requests complete, refresh Jaeger's search page and filter by Last 5
Minutes to see the latest traces, sorting by Longest First:
The highest reported latency is now 1.88 seconds, which is a significant
improvement over the previous result, although latency still increases with
each additional request.
Clicking on a few traces shows that the SQL SELECT span now takes around
50-100ms to complete, showing that it is no longer the bottleneck in this
operation.
You'll also observe that the requests to the route service still exhibit the
staircase pattern, with a significant gap between each request, which keeps the
latency relatively high and causes it to grow with each additional request.
This suggests the bottleneck lies in how the route service is called, not
within the service itself. Examining the services/frontend/best_eta.go file
confirms this:
services/frontend/best_eta.go
. . .
// getRoutes calls Route service for each (customer, driver) pair
func (eta *bestETA) getRoutes(ctx context.Context, customer *customer.Customer, drivers []driver.Driver) []routeResult {
results := make([]routeResult, 0, len(drivers))
wg := sync.WaitGroup{}
routesLock := sync.Mutex{}
for _, dd := range drivers {
wg.Add(1)
driver := dd // capture loop var
// Use worker pool to (potentially) execute requests in parallel
eta.pool.Execute(func() {
route, err := eta.route.FindRoute(ctx, driver.Location, customer.Location)
routesLock.Lock()
results = append(results, routeResult{
driver: driver.DriverID,
route: route,
err: err,
})
routesLock.Unlock()
wg.Done()
})
}
wg.Wait()
return results
}
The getRoutes() function receives the customer information and the list of
drivers that were retrieved earlier from the drivers service. It then calls the
route service asynchronously for each driver through a pool of goroutines
(eta.pool.Execute()). The pool size is configured through
config.RouteWorkerPoolSize, which defaults to 3 as you can see below:
var (
. . .
// RouteWorkerPoolSize is the size of the worker pool used to query `route` service.
// Can be overwritten from command line.
RouteWorkerPoolSize = 3
. . .
)
This explains why we saw at most three parallel requests to the route
service when we analyzed the single-request trace earlier. As the number of
requests increases, the pool becomes saturated, and each additional request can
only execute once a worker becomes free. This is why you're now seeing one
request at a time, with gaps between requests while all workers are busy.
The fix for this is increasing the number of workers in the pool, which can be
done through the -W/--fix-route-worker-pool-size option. Since goroutines are
quite cheap (only a few kilobytes of allocation each), you can set this to a
relatively high number like 10,000, depending on the maximum number of requests
you'd like to be active at a given time.
Go ahead and restart the HotROD server by pressing Ctrl-C and typing:
./hotrod all -D 50ms -M -W 10000
The logs report that the worker pool size was indeed updated:
Output
fix: overriding MySQL query delay {"old": "300ms", "new": "50ms"}
fix: disabling db connection mutex
fix: overriding route worker pool size {"old": 3, "new": 10000}
Starting all services
When you run the high traffic simulation once again and observe the new
traces, you will see that the load-generation command finishes much more
quickly than before:
Opening the latest traces now reports the highest latency to be around 380ms,
and you can see that all requests to the route service are now performed in
parallel:
With these simple steps, you've been able to significantly improve the
performance of the service just by reading the traces, understanding where the
bottlenecks are, and fixing the issues.
Simplify tracing with automated instrumentation
We've seen how Jaeger provides powerful visualization and analysis of trace data. However, the journey from development to production tracing often involves significant instrumentation work: adding OpenTelemetry SDKs to your codebase, configuring exporters, managing sampling strategies, and maintaining this instrumentation as your services evolve.
Better Stack Tracing eliminates this instrumentation burden through eBPF-based automatic instrumentation:
Connect your Kubernetes or Docker cluster and start collecting traces immediately with zero code changes. Databases like PostgreSQL, MySQL, Redis, and MongoDB are automatically recognized and instrumented.
Investigate slow requests visually through the "bubble up" feature with drag and drop, similar to the detective work we performed with HotROD, but automated.
Get AI-powered assistance when troubleshooting production issues. The AI SRE analyzes your service map and logs to suggest root cause hypotheses while you remain in control.
Enjoy predictable pricing up to 30x cheaper than alternatives like Datadog with complete data ingestion instead of sampling.
Try Better Stack to experience how automated instrumentation and intelligent analysis can accelerate your distributed tracing workflow.
Final thoughts
In this guide, we explored Jaeger's capabilities as a powerful distributed tracing tool, enabling us to monitor and troubleshoot complex microservices-based applications.
By applying these techniques to the HotROD demo application, we demonstrated how Jaeger can be effectively utilized to optimize system performance and ensure seamless user experiences in modern distributed architectures. We identified bottlenecks through timeline analysis, traced request flows across multiple services, and applied targeted fixes that dramatically improved performance under load.
As you continue your journey with distributed tracing, remember that understanding the flow of requests and data is paramount to building reliable and efficient systems. Jaeger provides a solid foundation for trace visualization and analysis, especially when you need fine-grained control over your tracing infrastructure.
If you're looking to accelerate your implementation, Better Stack Tracing offers an automated approach that eliminates the instrumentation overhead we discussed earlier.
With eBPF-based collection and OpenTelemetry compatibility, you can start gathering traces from your Kubernetes or Docker workloads in minutes while maintaining the flexibility to use Jaeger or other OpenTelemetry-compatible backends as your needs evolve.