Distributed Tracing at Scale
Tracking request lifecycles across microservices with OpenTelemetry
In distributed systems, understanding what happens to a single request as it traverses multiple services is notoriously difficult. A user complaint of “the checkout is slow” could originate anywhere: a slow database query in the inventory service, network latency between containers, an overloaded authentication gateway, or a third-party payment processor. Traditional monitoring tools (logs, metrics, alerts) show you the forest, but not the individual trees.
Distributed tracing solves this by providing end-to-end visibility of request lifecycles.
The Core Concept: Traces and Spans
A trace represents the complete journey of a request through your system. Each unit of work within that trace is a span—a named, timed operation with metadata.
```
Trace: "POST /checkout"
├── Span: api_gateway.validate_request        (0ms - 5ms)
├── Span: auth_service.verify_token           (5ms - 45ms)
│   └── Span: redis.check_token_cache         (10ms - 12ms)
├── Span: inventory_service.reserve_items     (45ms - 230ms)
│   ├── Span: postgres.query_inventory        (55ms - 180ms)
│   └── Span: kafka.publish_reservation_event (185ms - 225ms)
├── Span: payment_service.charge              (230ms - 890ms)
│   └── Span: external.stripe_api             (240ms - 885ms)
└── Span: order_service.create_order          (890ms - 920ms)
```
The trace above shows exactly where the 920ms request spent its time: mostly in the external payment API call, with a secondary bottleneck in the database query. This is the “tree” you couldn’t see before.
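One useful derived quantity when reading such a tree is self time: a span's duration minus the time spent in its direct children. It separates "this service is slow" from "this service is waiting on something slower". A minimal stdlib sketch over the (start, end) pairs above:

```rust
/// Self time = a span's own duration minus time spent in its direct children.
/// Spans are represented here as (start_ms, end_ms) pairs.
fn self_time(span: (u64, u64), children: &[(u64, u64)]) -> u64 {
    let total = span.1 - span.0;
    let in_children: u64 = children.iter().map(|&(s, e)| e - s).sum();
    total - in_children
}

fn main() {
    // inventory_service.reserve_items (45-230ms) with its two children:
    // postgres.query_inventory (55-180ms) and kafka.publish_reservation_event (185-225ms)
    let own = self_time((45, 230), &[(55, 180), (185, 225)]);
    println!("self time: {}ms", own); // 20ms of work outside the child spans
}
```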
Propagation: Keeping Context Across Services
For tracing to work, a TraceContext must travel with each request. This typically includes:
- TraceId: A unique identifier for the entire request
- SpanId: Identifier for the current operation
- Parent SpanId: The span that spawned this one
The magic happens in propagation: each service extracts the context from incoming requests (HTTP headers, gRPC metadata, Kafka headers, etc.), creates a new span, and injects the updated context into outgoing requests.
Propagation in Rust with OpenTelemetry[^1]
OpenTelemetry provides SDKs for most languages and frameworks. It uses the W3C TraceContext[^2] format for propagation, ensuring compatibility across vendors and services.
```rust
use opentelemetry::{
    global,
    trace::{SpanKind, TraceContextExt, Tracer},
    Context, KeyValue,
};
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{propagation::TraceContextPropagator, runtime, trace as sdktrace, Resource};
use reqwest::Client;
use std::collections::HashMap;

pub struct TracingClient {
    client: Client,
    tracer: sdktrace::Tracer,
}

impl TracingClient {
    pub fn new() -> Self {
        // Register the W3C TraceContext propagator globally so inject/extract
        // use the `traceparent` header format.
        global::set_text_map_propagator(TraceContextPropagator::new());

        // Builder names follow the 0.x opentelemetry-otlp crate; exact APIs
        // vary between releases.
        let exporter = opentelemetry_otlp::new_exporter()
            .tonic()
            .with_endpoint("http://collector:4317");

        let tracer = opentelemetry_otlp::new_pipeline()
            .tracing()
            .with_exporter(exporter)
            .with_trace_config(sdktrace::config().with_resource(Resource::new(vec![
                KeyValue::new("service.name", "my-service"),
            ])))
            .install_batch(runtime::Tokio)
            .expect("failed to initialize OTLP pipeline");

        Self {
            client: Client::new(),
            tracer,
        }
    }

    pub async fn get(&self, url: &str) -> Result<String, reqwest::Error> {
        let span = self
            .tracer
            .span_builder("http_get")
            .with_kind(SpanKind::Client)
            .start(&self.tracer);
        let cx = Context::current_with_span(span);

        // W3C TraceContext propagation: serialize the span context into
        // HTTP headers (`traceparent`, `tracestate`).
        let mut headers: HashMap<String, String> = HashMap::new();
        global::get_text_map_propagator(|propagator| {
            propagator.inject_context(&cx, &mut headers);
        });

        let mut request = self.client.get(url);
        for (key, value) in headers {
            request = request.header(key, value);
        }

        let response = request.send().await?;
        cx.span().end();
        Ok(response.text().await?)
    }
}
```
The W3C TraceContext standard[^2] specifies how the TraceId, SpanId, and trace flags are encoded in HTTP headers (typically `traceparent`), with analogous carriers for gRPC metadata and messaging systems.
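To make the encoding concrete, here is a minimal stdlib sketch of building and parsing a version-00 `traceparent` header (illustrative only; the OpenTelemetry propagator handles this for you):

```rust
/// W3C `traceparent` layout: version "00", a 16-byte trace-id, an 8-byte
/// parent span-id, and a flags byte, all lowercase hex, joined by dashes.
fn build_traceparent(trace_id: u128, span_id: u64, sampled: bool) -> String {
    format!("00-{:032x}-{:016x}-{:02x}", trace_id, span_id, if sampled { 1 } else { 0 })
}

fn parse_traceparent(header: &str) -> Option<(u128, u64, bool)> {
    let mut parts = header.split('-');
    if parts.next()? != "00" {
        return None; // only version 00 handled in this sketch
    }
    let trace_id = u128::from_str_radix(parts.next()?, 16).ok()?;
    let span_id = u64::from_str_radix(parts.next()?, 16).ok()?;
    let flags = u8::from_str_radix(parts.next()?, 16).ok()?;
    Some((trace_id, span_id, flags & 1 == 1))
}

fn main() {
    let header = build_traceparent(0x4bf92f3577b34da6a3ce929d0e0e4736, 0x00f067aa0ba902b7, true);
    println!("{}", header);
    // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    assert_eq!(
        parse_traceparent(&header),
        Some((0x4bf92f3577b34da6a3ce929d0e0e4736, 0x00f067aa0ba902b7, true))
    );
}
```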
The TraceContext travels from service to service, linking every span back to the original request. This is the foundation that makes distributed traces possible.
Beyond TraceId and SpanId, attaching additional metadata enriches traces with context useful for debugging:
```rust
use opentelemetry::KeyValue;

// `hostname()` stands in for whatever host-lookup helper you use.
span.set_attribute(KeyValue::new("service.version", env!("CARGO_PKG_VERSION")));
span.set_attribute(KeyValue::new("service.commit", env!("VERCEL_GIT_COMMIT_SHA")));
span.set_attribute(KeyValue::new("deployment.environment", "production"));
span.set_attribute(KeyValue::new("host.name", hostname()));
```
This metadata becomes invaluable when investigating issues: a sudden spike in errors correlated with a recent deployment, or latency patterns specific to a particular software version. Some teams also propagate request priority, tenant ID for multi-tenant systems, or feature flags to correlate behavior with specific system configurations.
Service Mesh Integration
Service meshes like Istio[^3] and Linkerd[^4] can inject tracing headers automatically via sidecar proxies (Envoy[^5]). This means your application code doesn't need to handle propagation; the mesh handles it transparently. However, you still benefit from adding custom span attributes for richer debugging context.
Kubernetes and Deployment Events
Correlating traces with Kubernetes events adds another dimension to observability:
```rust
use opentelemetry::KeyValue;

// Pod metadata is injected at runtime via the Kubernetes Downward API, so
// read it with std::env::var rather than the compile-time env! macro.
span.set_attribute(KeyValue::new("k8s.namespace", std::env::var("POD_NAMESPACE").unwrap_or_default()));
span.set_attribute(KeyValue::new("k8s.pod", std::env::var("POD_NAME").unwrap_or_default()));
span.set_attribute(KeyValue::new("k8s.node", std::env::var("NODE_NAME").unwrap_or_default()));
```
When a deployment rolls out, pods are replaced. Cross-referencing trace data with pod lifecycle events helps answer: was the slow request served by the old pod or the new one? Did a rolling restart introduce latency during the transition?
For example, if you see a spike in database connection errors starting at 14:32, and a Kubernetes event shows a pod restart at 14:32, you’ve found the cause. Traces let you drill into which specific requests were affected during that window.
Async and Actor Systems
Actor systems present an interesting challenge: messages don’t carry HTTP headers. Context propagation in this model requires explicit handling.
```rust
use opentelemetry::{
    global,
    trace::{FutureExt, SpanKind, TraceContextExt, Tracer},
    Context, KeyValue,
};
use tokio::spawn;

// `OrderMessage`, `validate_inventory`, and `process_payment` are application
// types assumed to exist elsewhere.
async fn process_message(msg: OrderMessage, parent_cx: Context) {
    let tracer = global::tracer("my-actor-system");
    // Start this span as a child of the context that arrived with the message.
    let span = tracer
        .span_builder("actor.process_order")
        .with_kind(SpanKind::Internal)
        .start_with_context(&tracer, &parent_cx);
    let cx = parent_cx.with_span(span);

    match validate_inventory(msg.items).await {
        Ok(_) => {
            cx.span().set_attribute(KeyValue::new("order.valid", true));
            // Hand the trace context to the spawned task explicitly.
            spawn(process_payment(msg).with_context(cx.clone()));
        }
        Err(e) => {
            cx.span().set_attribute(KeyValue::new("order.valid", false));
            cx.span().record_error(&e);
        }
    }
    cx.span().end();
}
```
When spawning new tasks or sending actor messages, you must explicitly propagate the trace context. This ensures that even asynchronous, message-driven code produces coherent traces.
Sampling: Tracing at Production Scale
At scale, tracing every request is impractical. A system handling 100,000 requests per minute with 50 spans per request generates 5 million spans per minute. Instead, production systems rely on sampling strategies:
Head-Based Sampling
Decisions are made at the trace origin (the first service):
```rust
use opentelemetry_sdk::trace::Sampler;

// Respect the caller's sampling decision when a parent span exists;
// otherwise sample 1% of new traces at the root.
let sampler = Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(0.01)));
```
Head-based sampling is simple but may miss slow or errored requests that should be investigated.
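Ratio samplers are deterministic in the trace id, so every service in the call chain reaches the same decision without coordination. A stdlib-only sketch of the idea (not the SDK's exact algorithm):

```rust
/// Deterministic ratio sampling: derive the keep/drop decision from the
/// trace id itself rather than a fresh random draw, so the decision is
/// repeatable anywhere the id travels.
fn sample(trace_id: u128, ratio: f64) -> bool {
    let low63 = (trace_id as u64) >> 1; // low 63 bits of the trace id
    (low63 as f64) < ratio * (1u64 << 63) as f64
}

fn main() {
    let id = 0x4bf92f3577b34da6a3ce929d0e0e4736u128;
    assert!(sample(id, 1.0));  // ratio 1.0 keeps everything
    assert!(!sample(id, 0.0)); // ratio 0.0 keeps nothing
    assert_eq!(sample(id, 0.01), sample(id, 0.01)); // repeatable per trace
}
```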
Tail-Based Sampling
Decisions are made after the trace completes. In practice, tail-based sampling is typically configured in a collector layer[^6] rather than in application code:

```
// Conceptual pseudocode - actual implementation varies by collector
let sampler = TailBasedSampler {
    policy: CompositeSampler::new()
        .add_rule(ErrorSampler, 1.0)                            // 100% of error traces
        .add_rule(WarningSampler, 1.0)                          // 100% of traces with warnings
        .add_rule(SlowTraceSampler { threshold_ms: 1000 }, 1.0) // 100% of slow traces
        .add_rule(RandomSampler, 0.001),                        // 0.1% of everything else
};
```
This ensures you always capture problematic traces while keeping storage costs manageable.
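With the OpenTelemetry Collector, policies like these map onto the contrib `tail_sampling` processor. A sketch of such a configuration, assuming the collector-contrib distribution (check field names against your collector version):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # buffer spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 0.1}
```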
The key insight is that error and warning conditions warrant higher sampling rates than successful requests. When something goes wrong, you need full visibility to understand what happened. Aggressive filtering on normal traffic makes sense; the same filter on error paths leaves you blind. A trace that generated a 500 error might look identical to a successful one until you drill into the database call that timed out or the cache that returned stale data.
Correlating Traces with Logs and Profiles
Traces become powerful when combined with other observability signals.
Trace-Log Correlation
Link logs to their parent span by including the TraceId:
```rust
// One common Rust approach: bridge the `tracing` crate to OpenTelemetry so
// events logged inside an instrumented span carry that span's TraceId.
// (`tracer` here is the SDK tracer built earlier.)
use tracing_subscriber::prelude::*;

tracing_subscriber::registry()
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .init();

tracing::info!("Inventory reserved for order {}", order_id);
// This log now appears in the trace timeline
```
Now, when investigating a trace, you can see both the structured events (spans) and the unstructured details (logs) together.
Continuous Profiling[^7]
Profiling shows where time is spent at the function level:
```
Profile: cpu (30s window)
samples     %    function
   1234  45.2    crypto::aes_encrypt
    567  20.8    serialization::serde
    234   8.6    database::execute_query
    ...
```
When a span shows slow encryption, you can drill into the profile to see exactly which code paths contribute to the slowdown. This closes the loop between “something is slow” and “here’s why.”
Correlating with Deployment Events
Traces become even more powerful when correlated with deployment metadata:
```rust
use opentelemetry::KeyValue;

span.set_attribute(KeyValue::new("deployment.version", env!("CARGO_PKG_VERSION")));
span.set_attribute(KeyValue::new("deployment.git_sha", env!("VERCEL_GIT_COMMIT_SHA")));
// Attribute values must be scalars, so record the timestamp as Unix seconds.
let deployed_at = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)
    .unwrap()
    .as_secs() as i64;
span.set_attribute(KeyValue::new("deployment.timestamp", deployed_at));
```
By enriching spans with version and deployment information, you can answer questions like:
- Did a recent deployment cause this regression? Correlate a spike in errors with deployment timestamps.
- Is this behavior specific to a version? Compare P99 latency across `v1.2.3` versus `v1.2.4`.
- What changed between A-B deployments? Traces from the canary show different patterns than the baseline.
In A-B or canary deployments, traffic is split between versions. Tagging traces with their deployment variant lets you compare performance characteristics directly:
```rust
span.set_attribute(KeyValue::new("deployment.variant", "canary")); // or "baseline"
```
This enables data-driven rollout decisions: if the canary’s error rate is higher or latency is worse, you have the evidence to roll back.
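As a sketch of such a rollout gate (hypothetical logic; real systems would also test statistical significance and latency percentiles):

```rust
/// Roll back when the canary's error rate exceeds the baseline's by more
/// than an agreed margin, computed from trace counts per deployment variant.
fn should_rollback(
    canary_err: u64, canary_total: u64,
    base_err: u64, base_total: u64,
    margin: f64,
) -> bool {
    let canary_rate = canary_err as f64 / canary_total as f64;
    let base_rate = base_err as f64 / base_total as f64;
    canary_rate > base_rate + margin
}

fn main() {
    // 5% canary errors vs 1% baseline with a 1-point margin: roll back.
    assert!(should_rollback(50, 1000, 10, 1000, 0.01));
    // Canary within the margin of baseline: keep rolling out.
    assert!(!should_rollback(12, 1000, 10, 1000, 0.01));
}
```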
For incremental updates—rolling restarts, config changes, feature flag toggles—the same principle applies. Correlate the timing of these events with trace anomalies to build a complete causal chain: a config change at 14:32 caused the spike in slow database queries at 14:33.
Identifying Bottlenecks and Scaling Issues
With traces collected at scale, patterns emerge that point directly to bottlenecks.
P99 Latency by Service
```
auth_service:       15ms
inventory_service:  85ms  ← Database bottleneck
payment_service:   650ms  ← External dependency
order_service:      12ms
```
P99 (99th percentile) latency tells you how slow the worst requests are for each service. In this example, inventory_service at 85ms stands out as the slow component in an otherwise fast flow. The trace shows this time is spent in a database query—most calls are fast, but a few are slow enough to drag P99 up.
payment_service at 650ms is expected: external payment gateways are inherently slower. If this suddenly jumps to 2 seconds, you’d investigate the payment provider’s status.
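For reference, P99 here is just the 99th percentile of observed span durations. A minimal nearest-rank sketch (real backends use streaming estimators such as HDR histograms rather than sorting):

```rust
/// Nearest-rank percentile over raw durations (sorts in place).
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    samples.sort_unstable();
    let rank = ((p / 100.0) * (samples.len() - 1) as f64).round() as usize;
    samples[rank]
}

fn main() {
    // Durations of 1ms..100ms, one sample each.
    let mut durations: Vec<u64> = (1..=100).collect();
    println!("P99 = {}ms", percentile(&mut durations, 99.0)); // P99 = 99ms
}
```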
Span Duration Distributions
```
inventory_service.reserve_items
├── 0-100ms:   95%  (healthy)
├── 100-500ms:  4%  (GC pauses, connection pool exhaustion)
└── 500ms+:     1%  (database locks, network partitions)
```
Percentile distributions reveal the shape of latency, not just the tail. The 4% in the 100-500ms bucket suggests occasional slowdowns—perhaps garbage collection pauses or connection pool contention. The 1% above 500ms are outliers: database locks, network reconnection, or cache misses.
If that 1% suddenly becomes 10%, something broke: a lock that was held too long, a network partition, or a downstream service that became unavailable.
Queue Depth Correlation
When span duration correlates with message queue depth, you’ve found a scaling issue: downstream services can’t keep up.
Request latency vs. Kafka consumer lag:

```
Time  | Latency | Consumer Lag
10:00 |   45ms  |  100 msgs
10:05 |   52ms  |  500 msgs
10:10 |  120ms  | 2000 msgs  ← Lag growing
10:15 |  380ms  | 8000 msgs
```
As the queue backs up, each message waits longer before processing. The span for “process_message” stretches from 45ms to 380ms—not because processing became slower, but because messages wait in the queue. This tells you to scale the consumer, not optimize the code.
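This diagnosis can be made quantitative by correlating the two series. A stdlib sketch using Pearson correlation on the latency and lag columns above:

```rust
/// Pearson correlation between two equal-length series, e.g. span latency and
/// consumer lag sampled at the same timestamps. A coefficient near 1.0 says
/// latency is tracking queue depth: a scaling problem, not slow code.
fn pearson(xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mx = xs.iter().sum::<f64>() / n;
    let my = ys.iter().sum::<f64>() / n;
    let cov: f64 = xs.iter().zip(ys).map(|(x, y)| (x - mx) * (y - my)).sum();
    let vx: f64 = xs.iter().map(|x| (x - mx).powi(2)).sum();
    let vy: f64 = ys.iter().map(|y| (y - my).powi(2)).sum();
    cov / (vx.sqrt() * vy.sqrt())
}

fn main() {
    // The latency and lag columns from the table above.
    let latency = [45.0, 52.0, 120.0, 380.0];
    let lag = [100.0, 500.0, 2000.0, 8000.0];
    let r = pearson(&latency, &lag);
    assert!(r > 0.99); // latency rises almost in lockstep with lag
}
```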
Backpressure Patterns
When a downstream service slows down, the pressure backs up through the call chain. Traces reveal this pattern clearly: a slow database query causes connection pool exhaustion, which causes HTTP clients to wait for connections, which causes incoming requests to queue up.
Trace showing backpressure:

```
api_server.handle_request            (queued: 500ms)
└── order_service.create_order       (waiting for connection: 450ms)
    └── database.query               (actual work: 50ms)
```
The actual database query took 50ms, but the client experienced 500ms latency. The difference is backpressure—time spent waiting for resources. Without tracing, you might incorrectly assume the database is the problem; with tracing, you see the real bottleneck is connection pool size.
Putting It All Together
Effective observability requires complementary signals working together:
| Signal | Answers |
|---|---|
| Metrics | Is the system healthy? How many errors? |
| Traces | Which request failed? Where did it spend time? |
| Profiles | Why is this function slow? What’s using CPU? |
| Logs | What happened during this operation? |
Traces sit at the center, providing the context to correlate the other signals. A trace with a slow span connects to the profile showing CPU time in serialization; the same trace connects to logs showing the specific database query that timed out.
Conclusion
Distributed tracing transforms debugging from guesswork into precision. By propagating context across service boundaries and combining traces with logs and profiles, you gain a complete picture of request lifecycles in production.
Whether you’re running HTTP microservices, actor-based systems, or batch jobs, instrumentation is an investment that pays dividends every time a user reports an issue you can now answer in minutes instead of days.
[^1]: [OpenTelemetry](https://opentelemetry.io/): vendor-neutral instrumentation SDK and collector
[^2]: [W3C TraceContext](https://www.w3.org/TR/trace-context/): the standard for distributed tracing context propagation
[^3]: [Istio](https://istio.io/): service mesh with automatic tracing propagation
[^4]: [Linkerd](https://linkerd.io/): lightweight service mesh
[^5]: [Envoy Proxy](https://www.envoyproxy.io/): sidecar proxy used by service meshes
[^6]: [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
[^7]: [Grafana Pyroscope](https://grafana.com/docs/pyroscope/latest/): continuous CPU/memory profiling linked to traces