ADR 0008: OpenTelemetry Instrumentation Strategy

  • Status: Accepted
  • Date: 2026-04-13

Context and Problem Statement

NAAS has correlation IDs in logs but no distributed tracing. A request flows through API → Redis queue → worker → netmiko SSH, and there is no way to visualize this lifecycle in a trace viewer or correlate API latency with device connection time.

How should we add distributed tracing without impacting users who don't need it?

Decision Drivers

  • Zero overhead when tracing is disabled (most deployments won't use it initially)
  • Trace context must survive the Redis/RQ queue boundary (API and worker are separate processes)
  • Must not add hard dependencies — OTel packages are large and not needed by everyone
  • Must be testable without running an OTLP collector

Considered Options

  • Option 1: OpenTelemetry as an optional dependency, gated by OTEL_ENABLED
  • Option 2: OpenTelemetry as a required dependency, always initialized
  • Option 3: Custom tracing with structured log correlation only

Decision Outcome

Chosen option: Option 1 — optional dependency gated by env var, because it adds zero overhead for users who don't need tracing, follows the OTel SDK's own recommendation for optional instrumentation, and keeps the install size small for the common case.

Consequences

  • Good: Zero runtime cost when disabled — no OTel imports, no tracer initialization
  • Good: pip install naas[otel] opts in explicitly
  • Good: Trace context propagates through RQ via W3C traceparent in job metadata
  • Bad: Two code paths (enabled/disabled) require testing both
  • Bad: The module-level import of the Flask auto-instrumentation has to be excluded from coverage with pragma: no cover

Pros and Cons of the Options

Option 1: Optional dependency, gated by OTEL_ENABLED

  • Good: Zero overhead when disabled: the tracing helpers become no-ops and OTel is never imported
  • Good: Standard W3C trace context propagation
  • Good: Users opt in with pip install naas[otel] and OTEL_ENABLED=true
  • Bad: Conditional imports add complexity to the bootstrap module
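
The gating described for Option 1 could look roughly like the sketch below. It is not the NAAS bootstrap module: the names otel_enabled, init_tracing, and span, the module-level flag, the case-insensitive parsing of OTEL_ENABLED, and the fallback on ImportError are all assumptions made for illustration, and exporter configuration is left out. What it tries to show is that every OTel import sits inside the enabled branch, so the disabled path never touches the optional packages.

```python
import os
from contextlib import contextmanager

_tracing_active = False  # hypothetical module-level flag, set by init_tracing()


def otel_enabled() -> bool:
    """True only for OTEL_ENABLED=true (case-insensitive match is an assumption)."""
    return os.environ.get("OTEL_ENABLED", "").strip().lower() == "true"


def init_tracing(service_name: str = "naas") -> bool:
    """Install a tracer provider if tracing is enabled; report whether it is active."""
    global _tracing_active
    _tracing_active = False
    if not otel_enabled():
        return False  # disabled path: no OTel module is imported
    try:
        # Lazy imports: only reached when the user opted in.
        from opentelemetry import trace
        from opentelemetry.sdk.resources import Resource
        from opentelemetry.sdk.trace import TracerProvider
    except ImportError:
        # Assumption: stay disabled instead of failing when the extra is missing.
        return False
    resource = Resource.create({"service.name": service_name})
    trace.set_tracer_provider(TracerProvider(resource=resource))
    _tracing_active = True
    return True


@contextmanager
def span(name: str):
    """Yield a real span when tracing is active, otherwise yield None."""
    if not _tracing_active:
        yield None
        return
    from opentelemetry import trace

    with trace.get_tracer("naas").start_as_current_span(name) as current:
        yield current
```

Call sites can then wrap work in `with span("naas.netmiko.connect"):` without caring which path is live. The cost is the one listed above: the enabled and disabled branches both need test coverage.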

Option 2: Required dependency, always initialized

  • Good: Simpler code — no conditional paths
  • Bad: Adds ~15 MB of dependencies for all users
  • Bad: OTel SDK initialization has measurable startup cost even with no-op exporter
  • Bad: Forces dependency on gRPC/protobuf for OTLP exporter

Option 3: Custom tracing with structured logs

  • Good: No new dependencies
  • Good: Works with existing log infrastructure
  • Bad: No standard trace format — can't use Jaeger, Zipkin, or Grafana Tempo
  • Bad: No parent-child span relationships across the queue boundary
  • Bad: Reinvents what OTel already solves

Implementation Details

Trace context propagation through RQ

The API injects a W3C traceparent string into the RQ job's meta dict at enqueue time. The worker extracts it and starts its own span as a child of the API span, so the full request lifecycle shares a single trace ID.
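
To make the hand-off concrete, here is a standard-library-only sketch of what travels in the meta dict. The helper names (make_traceparent, inject_into_meta, extract_from_meta) are hypothetical and a plain dict stands in for the RQ job's meta. The ADR does not say which calls NAAS uses; in practice the OTel API's opentelemetry.propagate.inject and extract perform this step against a dict carrier. The example IDs are the ones used in the W3C Trace Context specification.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def inject_into_meta(meta: dict, trace_id: str, span_id: str) -> None:
    """API side: store the current span's identity in the job's meta dict."""
    meta["traceparent"] = make_traceparent(trace_id, span_id)


def extract_from_meta(meta: dict):
    """Worker side: return (trace_id, parent_span_id, sampled), or None if absent/invalid."""
    match = _TRACEPARENT_RE.match(meta.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Simulated round trip across the queue boundary.
api_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
api_span_id = "00f067aa0ba902b7"

job_meta = {}  # stand-in for the RQ job's meta dict
inject_into_meta(job_meta, api_trace_id, api_span_id)

# The worker reuses the trace ID and records the API span as its parent.
worker_context = extract_from_meta(job_meta)
```

Because the worker reuses the extracted trace ID and treats the extracted span ID as its parent, a trace viewer can draw the API request and the worker job as one tree.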

Span hierarchy

Flask HTTP span (auto-instrumented)
  └── naas.worker.execute (worker picks up job)
        ├── naas.netmiko.connect (SSH connection)
        └── naas.netmiko.send_command (per command)
            or naas.netmiko.send_config (config set)

Testing approach

Unit tests use InMemorySpanExporter (part of opentelemetry-sdk, in the module opentelemetry.sdk.trace.export.in_memory_span_exporter) to capture spans in memory and assert on names, attributes, parent-child relationships, and exception recording. Both the enabled and disabled paths are tested.
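
A test along these lines might look like the sketch below. It is illustrative rather than taken from the NAAS test suite: collect_spans and the simulated TimeoutError are made up, and the span names are borrowed from the hierarchy above. It uses a local TracerProvider so no global state is touched and no collector is needed, and it returns None when the OTel SDK is not installed, which mirrors the optional-dependency design.

```python
def collect_spans():
    """Run a fake job under an in-memory exporter; return finished spans by name.

    Returns None when the OpenTelemetry SDK is not installed.
    """
    try:
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import SimpleSpanProcessor
        from opentelemetry.sdk.trace.export.in_memory_span_exporter import (
            InMemorySpanExporter,
        )
    except ImportError:
        return None

    exporter = InMemorySpanExporter()
    provider = TracerProvider()  # local provider, never registered globally
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("naas.tests")

    with tracer.start_as_current_span("naas.worker.execute"):
        with tracer.start_as_current_span("naas.netmiko.connect"):
            pass
        try:
            # The span context manager records the exception, then re-raises it.
            with tracer.start_as_current_span("naas.netmiko.send_command"):
                raise TimeoutError("device did not respond")
        except TimeoutError:
            pass

    return {s.name: s for s in exporter.get_finished_spans()}


spans = collect_spans()
if spans is not None:
    root = spans["naas.worker.execute"]
    connect = spans["naas.netmiko.connect"]
    failed = spans["naas.netmiko.send_command"]
    assert root.parent is None                               # top of this trace
    assert connect.parent.span_id == root.context.span_id    # parent-child link
    assert failed.context.trace_id == root.context.trace_id  # same trace
    assert [e.name for e in failed.events] == ["exception"]  # exception recorded
```

SimpleSpanProcessor exports each span synchronously as it ends, so the assertions can run immediately without flushing.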