ADR 0008: OpenTelemetry Instrumentation Strategy
- Status: Accepted
- Date: 2026-04-13
Context and Problem Statement
NAAS has correlation IDs in logs but no distributed tracing. A request flows through API → Redis queue → worker → netmiko SSH, and there is no way to visualize this lifecycle in a trace viewer or correlate API latency with device connection time.
How should we add distributed tracing without impacting users who don't need it?
Decision Drivers
- Zero overhead when tracing is disabled (most deployments won't use it initially)
- Trace context must survive the Redis/RQ queue boundary (API and worker are separate processes)
- Must not add hard dependencies — OTel packages are large and not needed by everyone
- Must be testable without running an OTLP collector
Considered Options
- Option 1: OpenTelemetry as an optional dependency, gated by OTEL_ENABLED
- Option 2: OpenTelemetry as a required dependency, always initialized
- Option 3: Custom tracing with structured log correlation only
Decision Outcome
Chosen option: Option 1 — optional dependency gated by env var, because it adds zero overhead for users who don't need tracing, follows the OTel SDK's own recommendation for optional instrumentation, and keeps the install size small for the common case.
Consequences
- Good: Zero runtime cost when disabled — no OTel imports, no tracer initialization
- Good: pip install naas[otel] opts in explicitly
- Good: Trace context propagates through RQ via W3C traceparent in job metadata
- Bad: Two code paths (enabled/disabled) require testing both
- Bad: Flask auto-instrumentation import at module level requires pragma: no cover
Pros and Cons of the Options
Option 1: Optional dependency, gated by OTEL_ENABLED
- Good: Zero overhead when disabled — all functions are no-ops that never import OTel
- Good: Standard W3C trace context propagation
- Good: Users opt in with pip install naas[otel] and OTEL_ENABLED=true
- Bad: Conditional imports add complexity to the bootstrap module
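The gating described for Option 1 could look roughly like the sketch below. It is an illustration, not the actual NAAS bootstrap module: the names otel_enabled, init_tracing, and span are hypothetical, the check assumes only the literal value "true" (case-insensitive) enables tracing, and provider/exporter configuration is left out entirely.

```python
import os
from contextlib import contextmanager

_tracer = None  # stays None for the whole process while tracing is disabled


def otel_enabled(environ=None):
    """Return True when OTEL_ENABLED is set to "true" (assumed convention)."""
    environ = os.environ if environ is None else environ
    return environ.get("OTEL_ENABLED", "").strip().lower() == "true"


def init_tracing(environ=None):
    """Initialise tracing if enabled; return whether it is active."""
    global _tracer
    if not otel_enabled(environ):
        return False  # disabled path: the OTel import below is never reached
    try:
        # Imported lazily so the package is only required after opting in.
        from opentelemetry import trace
    except ImportError as exc:
        raise RuntimeError(
            "OTEL_ENABLED is set but OpenTelemetry is not installed; "
            "run: pip install naas[otel]"
        ) from exc
    _tracer = trace.get_tracer("naas")
    return True


@contextmanager
def span(name):
    """Open a span when tracing is active; otherwise yield None and do nothing."""
    if _tracer is None:
        yield None
        return
    with _tracer.start_as_current_span(name) as active:
        yield active
```

Call sites can then write `with span("naas.netmiko.connect"):` unconditionally. The conditional import stays confined to one function, which is also where the extra bootstrap complexity noted above ends up.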
Option 2: Required dependency, always initialized
- Good: Simpler code — no conditional paths
- Bad: Adds ~15MB of dependencies for all users
- Bad: OTel SDK initialization has measurable startup cost even with no-op exporter
- Bad: Forces dependency on gRPC/protobuf for OTLP exporter
Option 3: Custom tracing with structured logs
- Good: No new dependencies
- Good: Works with existing log infrastructure
- Bad: No standard trace format — can't use Jaeger, Zipkin, or Grafana Tempo
- Bad: No parent-child span relationships across the queue boundary
- Bad: Reinvents what OTel already solves
Implementation Details
Trace context propagation through RQ
The API injects a W3C traceparent string into the RQ job's meta dict at enqueue time. The worker extracts it and creates a child span linked to the API span. This gives a single trace ID across the full request lifecycle.
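To make the mechanics concrete, here is a standard-library-only sketch of the round trip through the job's meta dict. It hand-rolls the W3C traceparent format (version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) purely for illustration; an implementation built on OTel would more likely call the propagation helpers in opentelemetry.propagate instead. The function names, the plain dict standing in for job meta, and the dict standing in for a span are all assumptions.

```python
import re
import secrets

# Version 00 traceparent: 00-<trace-id>-<parent-id>-<trace-flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    """Generate a random 16-hex-character span ID."""
    return secrets.token_hex(8)


def inject(meta, trace_id, span_id, sampled=True):
    """API side: store the current span's context in the job meta dict."""
    flags = "01" if sampled else "00"
    meta["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(meta):
    """Worker side: return (trace_id, parent_span_id, sampled), or None if absent/invalid."""
    match = _TRACEPARENT.match(meta.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid in the W3C format
    return trace_id, parent_id, bool(int(flags, 16) & 1)


# API process: capture the HTTP span's context at enqueue time.
api_trace_id = secrets.token_hex(16)
api_span_id = new_span_id()
job_meta = {}
inject(job_meta, api_trace_id, api_span_id)

# Worker process: restore the context and start a child span under it.
trace_id, parent_id, sampled = extract(job_meta)
worker_span = {
    "name": "naas.worker.execute",
    "trace_id": trace_id,     # same trace ID as the API span
    "parent_id": parent_id,   # points back at the API span
    "span_id": new_span_id(),
}
```

Because only a short string crosses the queue, the API and worker need not share any tracing state beyond what is in the job metadata.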
Span hierarchy
Flask HTTP span (auto-instrumented)
└── naas.worker.execute (worker picks up job)
    ├── naas.netmiko.connect (SSH connection)
    └── naas.netmiko.send_command (per command)
        or naas.netmiko.send_config (config set)
Testing approach
Unit tests use InMemorySpanExporter from opentelemetry-test-utils to capture spans in-memory and assert on names, attributes, parent-child relationships, and exception recording. Both enabled and disabled paths are tested.