Observability

NAAS provides structured logging, request tracing, and Prometheus metrics for monitoring and troubleshooting.

Structured JSON Logging

All log output is JSON. Each line includes:

Field	Description
`timestamp`	ISO 8601 timestamp
`level`	Log level (`INFO`, `DEBUG`, `WARNING`, `ERROR`)
`logger`	Logger name (e.g. `NAAS`)
`message`	Log message

Example log line:

{"timestamp": "2026-02-26T17:00:00.000Z", "level": "INFO", "logger": "NAAS", "message": "abc123: admin is issuing 2 command(s) to 192.168.1.1:22"}

This format is directly ingestible by ELK, Splunk, CloudWatch Logs, Datadog, and similar tools.

Log Level

Set via LOG_LEVEL environment variable (default: INFO). Use DEBUG for verbose output including per-command device interaction.

Docker ComposeKubernetes (Helm)

LOG_LEVEL=DEBUG docker compose up -d

helm upgrade naas charts/naas --set config.LOG_LEVEL=DEBUG

Correlation ID Tracing

Every API request is assigned a UUID correlation ID (request_id). This ID:

Is returned as X-Request-ID in the response header
Is used as the RQ job ID
Appears as the first field in all worker log lines for that job

This enables end-to-end tracing of a single request across API and worker logs.

Supplying your own ID

Pass X-Request-ID in the request to use your own correlation ID (must be a valid UUID v4):

curl -k -X POST https://localhost:8443/v2/send-command \
  -u "admin:password" \
  -H "Content-Type: application/json" \
  -H "X-Request-ID: 550e8400-e29b-41d4-a716-446655440000" \
  -d '{"host": "192.168.1.1", "platform": "cisco_ios", "commands": ["show version"]}'

If omitted, NAAS generates one automatically.

Tracing a request through logs

Docker ComposeKubernetes

docker compose logs worker | grep "550e8400-e29b-41d4-a716-446655440000"

kubectl -n naas logs deploy/naas-worker | grep "550e8400-e29b-41d4-a716-446655440000"

Health Check

GET /healthcheck performs a live Redis ping and reports component status. This is the permanent operational endpoint used by k8s probes, Docker HEALTHCHECK, and similar infrastructure tooling — it is not subject to API versioning. The versioned form /v2/healthcheck is also available for clients that prefer an explicit version. See ADR 0012.

{
  "status": "healthy",
  "version": "1.1.0",
  "uptime_seconds": 3600,
  "components": {
    "redis": { "status": "healthy" },
    "queue": { "status": "healthy", "depth": 4 }
  }
}

status is "healthy" when all components are up, "degraded" when Redis is unreachable. Use this endpoint for load balancer health checks and uptime monitoring.

Prometheus Metrics

NAAS exposes Prometheus-compatible metrics at the /metrics endpoint for monitoring performance and health.

Accessing Metrics

curl -k https://localhost:8443/metrics

Note: The /metrics endpoint does not require authentication.

Available Metrics

Request Metrics

naas_http_requests_total{method, endpoint, status} - Total HTTP requests by method, endpoint, and status code
naas_http_request_duration_seconds{method, endpoint} - Request latency histogram

Queue Metrics

naas_queue_depth - Current number of jobs in the Redis queue
naas_queue_jobs_total{status} - Total jobs by status (queued, started, finished, failed)

Worker Metrics

naas_workers_active - Number of active RQ workers
naas_workers_busy - Number of workers currently processing jobs

Job Metrics

naas_jobs_duration_seconds{platform} - Job execution time histogram by platform
naas_jobs_total{platform, status} - Total jobs by platform and status

Grafana Dashboard

Example Grafana queries:

Request rate:

rate(naas_http_requests_total[5m])

P95 latency:

histogram_quantile(0.95, rate(naas_http_request_duration_seconds_bucket[5m]))

Queue depth over time:

naas_queue_depth

Worker utilization:

naas_workers_busy / naas_workers_active

Integration with Monitoring Systems

Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: 'naas'
    static_configs:
      - targets: ['naas-api:8443']
    scheme: https
    tls_config:
      insecure_skip_verify: true

CloudWatch

Use CloudWatch Agent with Prometheus support to scrape /metrics and send to CloudWatch.

Datadog

Use Datadog Agent OpenMetrics integration to collect metrics.

Audit Events

NAAS emits structured audit events for job lifecycle tracking and security monitoring.

Event Types

job.submitted - Job submitted to queue
job.started - Worker began processing job
job.completed - Job finished (success or failure)
device.failure - Device connection or authentication failure

Event Format

Events are logged as JSON with these fields:

{
  "timestamp": "2026-03-04T19:00:00.000Z",
  "level": "INFO",
  "logger": "naas.audit",
  "event_type": "job.completed",
  "request_id": "abc-123",
  "username": "admin",
  "device_ip": "192.168.1.1",
  "platform": "cisco_ios",
  "status": "finished",
  "duration_ms": 1234
}

Consuming Audit Events

Filter logs by logger: "naas.audit" to extract audit events:

CloudWatch Logs Insights:

fields @timestamp, event_type, username, device_ip, status
| filter logger = "naas.audit"
| sort @timestamp desc

Splunk:

index=naas logger="naas.audit" | table _time event_type username device_ip status

ELK:

{
  "query": {
    "term": { "logger": "naas.audit" }
  }
}

Use Cases

Security auditing: Track who accessed which devices
Compliance: Maintain audit trail of network changes
Troubleshooting: Correlate failures with device/user patterns
Capacity planning: Analyze job duration and volume trends

Job Reaper

The job reaper is a background thread that runs in each worker process. It detects jobs that are stuck in the started state because their worker died (OOM kill, node failure, SIGKILL) and moves them to the failed state.

Why It Matters

Without the reaper, a dead worker leaves its in-flight jobs stuck in StartedJobRegistry until RQ's job timeout expires (up to the full JOB_TIMEOUT). During this window:

The job appears to be running but will never complete
The dedup key blocks re-submission of the same job
Clients polling for results see status: started indefinitely

The reaper detects this within WORKER_STALE_THRESHOLD seconds (default 120s) and moves the job to failed, clearing the dedup key so the job can be re-submitted.

Distributed Lock

All workers run the reaper thread, but only one executes per cycle. The reaper acquires a Redis lock (naas:reaper:lock) before scanning. If another reaper holds the lock, the current one skips that cycle. The lock TTL equals JOB_REAPER_INTERVAL, so it self-heals if the lock holder dies.

Audit Event

When a job is reaped, a job.orphaned audit event is emitted:

{
  "event": "job.orphaned",
  "job_id": "abc-123",
  "worker_name": "naas_worker_1"
}

Configuration

Variable	Default	Description
`JOB_REAPER_ENABLED`	`true`	Enable orphaned job detection
`JOB_REAPER_INTERVAL`	`60`	Seconds between reaper scans
`WORKER_STALE_THRESHOLD`	`120`	Seconds since last heartbeat before worker considered dead

OpenTelemetry Distributed Tracing

NAAS supports OpenTelemetry (OTel) distributed tracing with trace context propagation through the RQ job queue. When enabled, a single trace spans the full request lifecycle: API → queue → worker → device.

Enabling

Set OTEL_ENABLED=true and point the OTLP exporter at your collector:

Variable	Default	Description
`OTEL_ENABLED`	`false`	Enable OpenTelemetry tracing
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4317`	OTLP gRPC collector endpoint
`OTEL_SERVICE_NAME`	`naas`	Service name in traces (set automatically)

Both the API and worker processes must have OTEL_ENABLED=true. Install the otel extra:

pip install naas[otel]

How It Works

API receives request — a span is created for the Flask route
Job enqueued — the current trace context (traceparent) is injected into the RQ job metadata
Worker picks up job — the traceparent is extracted from job metadata, and a child span is created linked to the original trace
Device operation — Netmiko SSH operations are wrapped in spans with device attributes

This means a single trace ID connects the HTTP request, queue wait time, and device operation — visible in Jaeger, Grafana Tempo, or any OTLP-compatible backend.

Deployment Example