v1.4 Release Notes

Overview

v1.4 is a major feature release focused on operational reliability, observability, and developer experience. It adds job deduplication, webhooks, a dead letter queue, context-aware routing, queue backpressure, and significant CI improvements.


Migration Guide

No breaking changes

v1.4 is fully backward-compatible with v1.3. All new fields are optional with sensible defaults.

Deprecations

  • ip field: deprecated in favor of host, which accepts IPv4 addresses, IPv6 addresses, and hostnames. The ip field still works but will be removed in v2.0.
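
For clients that build request payloads in code, the rename can be applied in one place. The following is a minimal sketch, not part of NAAS; the helper name migrate_payload is hypothetical, and preferring an existing host value over ip when both are present is an assumption.

```python
def migrate_payload(payload: dict) -> dict:
    """Return a copy of a request payload that uses 'host' instead of 'ip'."""
    migrated = dict(payload)  # leave the caller's dict untouched
    if "ip" in migrated:
        legacy = migrated.pop("ip")
        # Assumption: if both fields are present, the explicit 'host' wins.
        migrated.setdefault("host", legacy)
    return migrated


old = {"ip": "192.168.1.1", "platform": "cisco_ios", "commands": ["show version"]}
new = migrate_payload(old)
```

Here new carries host set to 192.168.1.1 and no ip key, while old is left unchanged.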

What's New

Hostname support (host field)

The host field now accepts IPv4 addresses, IPv6 addresses, and RFC 1123 hostnames:

{
  "host": "router1.example.com",
  "platform": "cisco_ios",
  "commands": ["show version"]
}

The ip field is deprecated — migrate to host at your convenience.
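
To pre-validate a host value on the client side, something along these lines may help. It is a sketch using only the Python standard library and is not the validation NAAS itself performs; the function name classify_host and the exact hostname rules (labels of 1 to 63 characters, total length up to 253, optional trailing dot) are assumptions based on a common reading of RFC 1123.

```python
import ipaddress
import re

# One hostname label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def classify_host(value: str) -> str:
    """Return 'ipv4', 'ipv6', or 'hostname'; raise ValueError otherwise."""
    try:
        return f"ipv{ipaddress.ip_address(value).version}"
    except ValueError:
        pass  # not an IP literal, fall through to the hostname check
    name = value[:-1] if value.endswith(".") else value  # allow a trailing dot
    if not name or len(name) > 253:
        raise ValueError(f"invalid host: {value!r}")
    if all(_LABEL.fullmatch(label) for label in name.split(".")):
        return "hostname"
    raise ValueError(f"invalid host: {value!r}")
```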

Context-aware job routing

Route jobs to specific worker pools using the context field. Useful for multi-VRF, multi-segment, or geographically distributed environments:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "context": "oob-dc1"
}

Workers declare their context via WORKER_CONTEXTS=oob-dc1. Jobs are routed to the matching worker pool. Use GET /v1/contexts to list available contexts.

Full guide: docs/contexts.md
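
The matching rule described above can be pictured roughly as follows. This is an illustrative sketch only: the function names are hypothetical, and falling back to a context named default when a job omits the field is an assumption drawn from the documented default value of NAAS_CONTEXTS.

```python
def parse_contexts(raw: str) -> list[str]:
    """Split a comma-separated value such as WORKER_CONTEXTS into names."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def job_context(job: dict, valid_contexts: list[str]) -> str:
    """Pick the routing context for a job, rejecting unknown names."""
    context = job.get("context", "default")  # assumption: implicit fallback
    if context not in valid_contexts:
        raise ValueError(f"unknown context: {context!r}")
    return context


def worker_accepts(worker_contexts: list[str], context: str) -> bool:
    """A worker only takes jobs whose context it declared."""
    return context in worker_contexts


valid = parse_contexts("default, oob-dc1")
job = {"host": "192.168.1.1", "commands": ["show version"], "context": "oob-dc1"}
```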

Job deduplication

Duplicate in-flight jobs (same host + platform + commands + user) return the existing job_id instead of enqueuing a new job:

{
  "job_id": "abc-123",
  "deduplicated": true
}

Enabled by default. Disable with JOB_DEDUP_ENABLED=false.
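
One way to think about the dedup identity is as a stable hash over the four fields. The sketch below is an assumption about how such a key could be derived, not the key format NAAS uses; the dedup: prefix, the choice of SHA-256, and treating command order as significant are all illustrative.

```python
import hashlib
import json


def dedup_key(host: str, platform: str, commands: list[str], user: str) -> str:
    """Derive a stable key from the fields that define a duplicate job."""
    payload = json.dumps(
        {"host": host, "platform": platform, "commands": list(commands), "user": user},
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # compact, whitespace-free encoding
    )
    return "dedup:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = dedup_key("192.168.1.1", "cisco_ios", ["show version"], "alice")
b = dedup_key("192.168.1.1", "cisco_ios", ["show version"], "alice")
c = dedup_key("192.168.1.1", "cisco_ios", ["show version"], "bob")
```

Here a and b are identical, while c differs because the user differs.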

Idempotency keys

Client-controlled deduplication for safe retries on network failure:

curl -H "X-Idempotency-Key: my-unique-key" -X POST .../v1/send_command ...

Repeat requests with the same key within 24 hours return the original job_id with idempotent: true.
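
The behavior can be modeled with a small in-memory store. This is a sketch of the semantics only, under the assumption that the 24-hour window starts at the first request; the class name and the injectable clock are hypothetical, and the real service may store keys quite differently.

```python
import time


class IdempotencyStore:
    """Remember the job_id first returned for each key until the TTL expires."""

    def __init__(self, ttl_seconds: int = 86400, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}   # key -> (job_id, expires_at)

    def submit(self, key: str, new_job_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            # Repeat within the window: hand back the original job.
            return {"job_id": entry[0], "idempotent": True}
        self._entries[key] = (new_job_id, now + self._ttl)
        return {"job_id": new_job_id}
```

A retrying client would reuse one key per logical request, for example a UUID generated once before the first attempt.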

Webhooks

Receive a notification when a job completes instead of polling:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "webhook_url": "https://my-app.example.com/naas-callback"
}

NAAS POSTs {"job_id": "...", "status": "finished", "enqueued_at": "...", "completed_at": "..."} to your URL. Results and credentials are never included. HTTPS only.
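
On the receiving side, a handler might parse the body like this. The sketch assumes the timestamps are ISO 8601 strings in the same style as the enqueue response; the function name parse_callback and the derived elapsed_seconds field are hypothetical. Because results are not included, the handler would still fetch them separately using job_id.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("job_id", "status", "enqueued_at", "completed_at")


def parse_callback(body: bytes) -> dict:
    """Validate a webhook body and compute how long the job took."""
    event = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumption: both timestamps are ISO 8601 with a UTC offset.
    enqueued = datetime.fromisoformat(event["enqueued_at"])
    completed = datetime.fromisoformat(event["completed_at"])
    return {
        "job_id": event["job_id"],
        "status": event["status"],
        "elapsed_seconds": (completed - enqueued).total_seconds(),
    }


sample = json.dumps({
    "job_id": "abc-123",
    "status": "finished",
    "enqueued_at": "2026-03-20T19:00:00+00:00",
    "completed_at": "2026-03-20T19:00:12+00:00",
}).encode("utf-8")
parsed = parse_callback(sample)
```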

Dead letter queue

Inspect and replay failed jobs:

# List failed jobs
GET /v1/jobs/failed

# Replay a failed job with your current credentials
POST /v1/jobs/{job_id}/replay

Credentials are never exposed in the response. Error messages have credential values redacted. Cap registry size with FAILED_JOB_MAX_RETAIN (default 500).
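
A client could build the two calls as follows. This sketch only constructs the requests and does not send them; the base URL is a placeholder, and authentication is left out because the scheme is not described here.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://naas.example.com"  # hypothetical deployment address


def list_failed_request(base_url: str = BASE_URL) -> Request:
    """Build the request that lists jobs in the dead letter queue."""
    return Request(f"{base_url}/v1/jobs/failed", method="GET")


def replay_request(job_id: str, base_url: str = BASE_URL) -> Request:
    """Build the request that replays one failed job."""
    # Percent-encode the id so it cannot alter the URL path.
    return Request(f"{base_url}/v1/jobs/{quote(job_id, safe='')}/replay", method="POST")


req = replay_request("abc-123")
```

Passing either object to urllib.request.urlopen, with the caller's credentials attached, would perform the call.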

Job tags

Attach metadata to jobs for filtering and auditing:

{
  "host": "192.168.1.1",
  "commands": ["show version"],
  "tags": {"team": "network-ops", "env": "prod", "ticket": "CHG-12345"}
}

Filter jobs by tag: GET /v1/jobs?tag=team:network-ops
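
The key:value filter syntax can be interpreted as in the sketch below. The function names are hypothetical, and splitting on the first colon (so that values may themselves contain colons) is an assumption.

```python
def parse_tag_filter(raw: str) -> tuple[str, str]:
    """Split 'key:value' on the first colon."""
    key, sep, value = raw.partition(":")
    if not sep or not key or not value:
        raise ValueError(f"expected key:value, got {raw!r}")
    return key, value


def matches(job_tags: dict, raw_filter: str) -> bool:
    """True when the job carries the tag with exactly that value."""
    key, value = parse_tag_filter(raw_filter)
    return job_tags.get(key) == value


tags = {"team": "network-ops", "env": "prod", "ticket": "CHG-12345"}
```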

Queue backpressure

Prevent queue overload by setting MAX_QUEUE_DEPTH. When the queue is full, new requests receive 503 Service Unavailable:

MAX_QUEUE_DEPTH=1000
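
The admission check amounts to a comparison against the configured limit. The sketch below assumes that "full" means the current depth has reached the limit and that an unset limit means unlimited; the function name is hypothetical.

```python
from typing import Optional


def queue_is_full(depth: int, max_depth: Optional[int]) -> bool:
    """Decide whether a new job should be rejected with 503."""
    if max_depth is None:  # MAX_QUEUE_DEPTH unset: no limit
        return False
    # Assumption: the limit is inclusive, so depth == max_depth is already full.
    return depth >= max_depth
```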

Enqueue response metadata

All enqueue responses now include:

{
  "job_id": "abc-123",
  "queue_position": 3,
  "enqueued_at": "2026-03-20T19:00:00+00:00",
  "timeout": 60
}
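
Clients that want typed access to these fields could wrap the response as shown in this sketch. The class name EnqueueReceipt is hypothetical, and the unit of timeout is not interpreted here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EnqueueReceipt:
    job_id: str
    queue_position: int
    enqueued_at: datetime
    timeout: int

    @classmethod
    def from_json(cls, data: dict) -> "EnqueueReceipt":
        """Build a receipt from a decoded enqueue response."""
        return cls(
            job_id=data["job_id"],
            queue_position=int(data["queue_position"]),
            # fromisoformat understands the '+00:00' offset used in the example.
            enqueued_at=datetime.fromisoformat(data["enqueued_at"]),
            timeout=int(data["timeout"]),
        )


receipt = EnqueueReceipt.from_json({
    "job_id": "abc-123",
    "queue_position": 3,
    "enqueued_at": "2026-03-20T19:00:00+00:00",
    "timeout": 60,
})
```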

TTP structured output

/v1/send_command_structured now supports TTP templates in addition to TextFSM:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "ttp_template": "hostname {{ hostname }}"
}

ttp_template and textfsm_template are mutually exclusive.
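
The exclusivity rule can be checked before a request is sent. In this sketch the function name is hypothetical, and returning None when neither template is supplied is a placeholder, since the server's behavior in that case is not described here.

```python
from typing import Optional


def requested_parser(payload: dict) -> Optional[str]:
    """Return 'ttp' or 'textfsm' based on which template field is set."""
    has_ttp = bool(payload.get("ttp_template"))
    has_textfsm = bool(payload.get("textfsm_template"))
    if has_ttp and has_textfsm:
        raise ValueError("ttp_template and textfsm_template are mutually exclusive")
    if has_ttp:
        return "ttp"
    if has_textfsm:
        return "textfsm"
    return None  # neither supplied; server-side default is not covered here
```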

Connection timeout control

Control the TCP connection timeout with conn_timeout (default: 10 seconds):

{
  "host": "192.168.1.1",
  "commands": ["show version"],
  "conn_timeout": 5.0
}

Useful for fast failure detection on unreachable hosts or tuning for high-latency links.
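
The effect of a connection timeout can be seen with a plain socket. This is a standard-library sketch of the general idea, not how NAAS connects to devices; the function name can_connect and the default port 22 are assumptions.

```python
import socket


def can_connect(host: str, port: int = 22, conn_timeout: float = 10.0) -> bool:
    """Try to open a TCP connection, giving up after conn_timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=conn_timeout):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable hosts
        return False
```

A lower value makes an unreachable host fail sooner, while a higher value tolerates slow links at the cost of longer waits.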


Reliability improvements

Redis graceful degradation

Redis errors now return 503 Service Unavailable with Retry-After: 10 instead of unhandled 500 errors.
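
Clients can use the header to decide how long to wait before retrying. The sketch below is one possible approach; the function name is hypothetical, and falling back to 10 seconds when the header is missing or not a plain number is an assumption.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str],
                default: float = 10.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no retry is needed."""
    if status != 503:
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))
    except ValueError:
        return default  # e.g. an HTTP-date form, which this sketch does not parse
```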

Job reaper

A background thread in each worker detects jobs orphaned by dead workers (OOM kills, node failures) and moves them to the failed registry. It clears the jobs' dedup keys so they can be re-submitted, and it takes a distributed Redis lock so that only one reaper runs per cycle.

Configure with JOB_REAPER_ENABLED, JOB_REAPER_INTERVAL, WORKER_STALE_THRESHOLD.
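
The detection step can be sketched with plain data structures. This is an illustration of the idea rather than the reaper's actual code: it assumes workers are judged by the age of their last heartbeat, the function names are hypothetical, and the Redis lock and registry updates are left out.

```python
def stale_workers(last_seen: dict, now: float, threshold: float = 120.0) -> list:
    """Workers whose last heartbeat is older than WORKER_STALE_THRESHOLD."""
    return sorted(name for name, seen in last_seen.items() if now - seen > threshold)


def orphaned_jobs(running: dict, dead_workers: list) -> list:
    """Jobs still marked as running on a worker that is considered dead."""
    dead = set(dead_workers)
    return sorted(job for job, worker in running.items() if worker in dead)


heartbeats = {"worker-a": 1000.0, "worker-b": 870.0}   # name -> last heartbeat time
running = {"job-1": "worker-a", "job-2": "worker-b"}   # job_id -> worker name
dead = stale_workers(heartbeats, now=1000.0)
orphans = orphaned_jobs(running, dead)
```

With the sample data, worker-b has been silent for 130 seconds, so job-2 is the only orphan.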

Connection pool exclusion list

Exclude specific devices or platforms from connection pooling:

CONNECTION_POOL_EXCLUDE=192.168.1.1,cisco_ios_old
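
The value mixes device addresses and platform names in one list, so a lookup has to check both. The following sketch shows one reading of that rule; the function names are hypothetical and exact string matching is an assumption.

```python
def parse_exclusions(raw: str) -> frozenset:
    """Split CONNECTION_POOL_EXCLUDE into a set of hosts and platforms."""
    return frozenset(item.strip() for item in raw.split(",") if item.strip())


def pooling_allowed(host: str, platform: str, exclusions: frozenset) -> bool:
    """Pooling is skipped if either the host or its platform is listed."""
    return host not in exclusions and platform not in exclusions


excluded = parse_exclusions("192.168.1.1,cisco_ios_old")
```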

Observability

  • naas_failed_jobs_total Prometheus gauge — tracks dead letter queue depth
  • failed_jobs count in /healthcheck response
  • job.orphaned audit event when reaper moves a job to failed registry

Developer experience

  • ADR process: docs/adr/ with MADR format for architectural decisions
  • Postman/OpenAPI artifacts — OpenAPI spec published as release artifact
  • Python 3.12/3.13/3.14 support — all tested in CI matrix
  • Docker default bumped to Python 3.14

CI improvements

  • Docker BuildKit GHA layer caching for integration tests (~30-40s savings per run)
  • Integration tests reduced from 4 parallel Python-version jobs to 1 (75% compute reduction — container always runs 3.14)
  • GUNICORN_WORKERS=2 in CI environments for faster startup
  • docker compose up --wait for reliable service readiness

New configuration variables

Variable                 Default      Description
NAAS_CONTEXTS            default      Comma-separated list of valid routing contexts
WORKER_CONTEXTS          default      Contexts this worker handles
JOB_DEDUP_ENABLED        true         Enable server-side job deduplication
IDEMPOTENCY_TTL          86400        Idempotency key TTL in seconds
MAX_QUEUE_DEPTH          (unlimited)  Maximum queue depth before 503
WEBHOOK_ALLOW_HTTP       false        Allow HTTP webhook URLs (testing only)
FAILED_JOB_MAX_RETAIN    500          Maximum failed jobs in dead letter queue
JOB_REAPER_ENABLED       true         Enable orphaned job detection
JOB_REAPER_INTERVAL      60           Seconds between reaper scans
WORKER_STALE_THRESHOLD   120          Seconds before worker considered dead
GUNICORN_WORKERS         8            Number of gunicorn worker processes
CONNECTION_POOL_EXCLUDE  (none)       Comma-separated IPs/platforms to exclude from pooling
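
For tooling that needs to mirror these defaults, for example a deployment check, the variables could be read as in this sketch. It covers only a subset of the table, the function names are hypothetical, and the accepted boolean spellings beyond true and false are an assumption.

```python
import os
from typing import Mapping


def _flag(value: str) -> bool:
    # Assumption: only true/false are documented; other spellings are a guess.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read selected v1.4 variables, applying the documented defaults."""
    depth = env.get("MAX_QUEUE_DEPTH", "").strip()
    return {
        "NAAS_CONTEXTS": env.get("NAAS_CONTEXTS", "default").split(","),
        "JOB_DEDUP_ENABLED": _flag(env.get("JOB_DEDUP_ENABLED", "true")),
        "IDEMPOTENCY_TTL": int(env.get("IDEMPOTENCY_TTL", "86400")),
        "MAX_QUEUE_DEPTH": int(depth) if depth else None,  # None means unlimited
        "FAILED_JOB_MAX_RETAIN": int(env.get("FAILED_JOB_MAX_RETAIN", "500")),
        "JOB_REAPER_INTERVAL": int(env.get("JOB_REAPER_INTERVAL", "60")),
        "WORKER_STALE_THRESHOLD": int(env.get("WORKER_STALE_THRESHOLD", "120")),
        "GUNICORN_WORKERS": int(env.get("GUNICORN_WORKERS", "8")),
    }
```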

Upgrade Steps

  1. Pull the latest image or update your deployment
  2. Update ip field to host in your requests (optional — ip still works)
  3. Configure NAAS_CONTEXTS and WORKER_CONTEXTS if using multi-segment routing
  4. Review MAX_QUEUE_DEPTH for your workload
  5. Configure webhook endpoints if you want push notifications instead of polling
  6. Review Prometheus dashboards — new naas_failed_jobs_total gauge available