v1.4 Release Notes

Overview

v1.4 is a major feature release focused on operational reliability, observability, and developer experience. It adds job deduplication, webhooks, a dead letter queue, context-aware routing, queue backpressure, and significant CI improvements.

Migration Guide

No breaking changes

v1.4 is fully backward-compatible with v1.3. All new fields are optional with sensible defaults.

Deprecations

ip field: deprecated in favor of host, which accepts IPv4 addresses, IPv6 addresses, and hostnames. The ip field still works but will be removed in v2.0.

What's New

Hostname support (`host` field)

The host field accepts IPv4 addresses, IPv6 addresses, and RFC 1123 hostnames:

{
  "host": "router1.example.com",
  "platform": "cisco_ios",
  "commands": ["show version"]
}

The ip field is deprecated and will be removed in v2.0. It still works in v1.4, so you can migrate to host at your convenience before then.
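A client-side check along these lines can help during migration. This is a minimal sketch, not NAAS's own validation: `classify_host` and `migrate_payload` are hypothetical helper names, and the hostname rule is a plain reading of RFC 1123 (labels of 1 to 63 letters, digits, or hyphens, with no leading or trailing hyphen, and at most 253 characters overall).

```python
import ipaddress
import re

# RFC 1123 label: 1-63 letters, digits, or hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def classify_host(value: str) -> str:
    """Return 'ipv4', 'ipv6', or 'hostname'; raise ValueError otherwise."""
    try:
        return "ipv4" if ipaddress.ip_address(value).version == 4 else "ipv6"
    except ValueError:
        pass  # not an IP literal, so fall through to the hostname check
    name = value[:-1] if value.endswith(".") else value  # tolerate a trailing dot
    if 0 < len(name) <= 253 and all(_LABEL.match(label) for label in name.split(".")):
        return "hostname"
    raise ValueError(f"not a valid host: {value!r}")


def migrate_payload(payload: dict) -> dict:
    """Copy a request body, renaming the deprecated 'ip' key to 'host'."""
    migrated = dict(payload)
    if "ip" in migrated and "host" not in migrated:
        migrated["host"] = migrated.pop("ip")
    return migrated
```

For example, `migrate_payload({"ip": "10.0.0.1", "commands": ["show version"]})` yields a body that uses `host` instead, leaving the original dict untouched.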

Context-aware job routing

Route jobs to specific worker pools using the context field. Useful for multi-VRF, multi-segment, or geographically distributed environments:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "context": "oob-dc1"
}

Workers declare the contexts they serve via WORKER_CONTEXTS (for example, WORKER_CONTEXTS=oob-dc1). Each job is routed to the worker pool whose context matches the job's context field. Use GET /v1/contexts to list available contexts.
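The matching rule can be pictured roughly as follows. This sketch rests on two assumptions that the notes do not confirm: that WORKER_CONTEXTS takes a comma-separated list like NAAS_CONTEXTS does, and that a job without a context field falls into the `default` context. The function names are illustrative only.

```python
import os


def parse_contexts(raw: str) -> set[str]:
    """Split a comma-separated context list, ignoring blanks and whitespace."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def worker_accepts(job: dict, worker_contexts: set[str]) -> bool:
    """True if the job's context (assumed 'default' when absent) is served."""
    return job.get("context", "default") in worker_contexts


# WORKER_CONTEXTS defaults to "default", per the configuration table.
contexts = parse_contexts(os.environ.get("WORKER_CONTEXTS", "default"))
```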

Full guide: docs/contexts.md

Job deduplication

Duplicate in-flight jobs (same host + platform + commands + user) return the existing job_id instead of enqueuing a new job:

{
  "job_id": "abc-123",
  "deduplicated": true
}

Enabled by default. Disable with JOB_DEDUP_ENABLED=false.
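One way to picture the duplicate check is a stable hash over the four fields. The sketch below is an assumption about the general technique, not the actual NAAS key format: `dedup_key`, `enqueue`, and the in-memory `in_flight` dict are hypothetical, and a real deployment would keep the keys in a shared store and clear them when a job leaves the in-flight state.

```python
import hashlib
import json


def dedup_key(host: str, platform: str, commands: list[str], user: str) -> str:
    """Hash the four fields that define a duplicate into a stable key."""
    # Command order is preserved: the same commands in a different order
    # count as a different job in this sketch.
    canonical = json.dumps([host, platform, commands, user], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


in_flight: dict[str, str] = {}  # dedup key -> job_id (stand-in for a shared store)


def enqueue(job_id: str, host: str, platform: str, commands: list[str], user: str) -> dict:
    """Return the existing job for a duplicate, otherwise register job_id."""
    key = dedup_key(host, platform, commands, user)
    if key in in_flight:
        return {"job_id": in_flight[key], "deduplicated": True}
    in_flight[key] = job_id
    return {"job_id": job_id}
```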

Idempotency keys

Client-controlled deduplication for safe retries on network failure:

curl -H "X-Idempotency-Key: my-unique-key" -X POST .../v1/send_command ...

Repeat requests with the same key within 24 hours return the original job_id with idempotent: true.
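The 24-hour window behaves like a key store with per-key expiry. The following is a sketch of that idea only: `IdempotencyStore` is a hypothetical in-memory stand-in, and the injectable clock exists so the expiry can be exercised without waiting.

```python
import time

IDEMPOTENCY_TTL = 86400  # seconds; matches the documented 24-hour window


class IdempotencyStore:
    """In-memory stand-in for a key store with per-key expiry."""

    def __init__(self, ttl: float = IDEMPOTENCY_TTL, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}  # key -> (job_id, expires_at)

    def resolve(self, key: str, new_job_id: str) -> dict:
        """Return the original job for a live key, else register new_job_id."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return {"job_id": entry[0], "idempotent": True}
        self._entries[key] = (new_job_id, now + self._ttl)
        return {"job_id": new_job_id}
```

A client that retries after a network failure should resend the same key, so the second attempt resolves to the first job rather than creating a new one.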

Webhooks

Receive a notification when a job completes instead of polling:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "webhook_url": "https://my-app.example.com/naas-callback"
}

NAAS POSTs {"job_id": "...", "status": "finished", "enqueued_at": "...", "completed_at": "..."} to your URL. Results and credentials are never included. Webhook URLs must use HTTPS; set WEBHOOK_ALLOW_HTTP=true to allow plain HTTP for testing only.
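On the receiving side, a handler might validate the body before acting on it. This is a minimal sketch that assumes the callback carries exactly the four fields listed above; `parse_callback` and `CALLBACK_FIELDS` are hypothetical names, and wiring the function into a web framework is left out.

```python
import json

CALLBACK_FIELDS = ("job_id", "status", "enqueued_at", "completed_at")


def parse_callback(body: bytes) -> dict:
    """Decode a callback body and keep only the four documented fields."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("callback body must be a JSON object")
    missing = [name for name in CALLBACK_FIELDS if name not in data]
    if missing:
        raise ValueError(f"callback is missing fields: {missing}")
    return {name: data[name] for name in CALLBACK_FIELDS}
```

Because results are never included in the callback, the receiver still has to fetch them through the regular API once it sees the job_id.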

Dead letter queue

Inspect and replay failed jobs:

# List failed jobs
GET /v1/jobs/failed

# Replay a failed job with your current credentials
POST /v1/jobs/{job_id}/replay

Credentials are never exposed in the response. Error messages have credential values redacted. Cap registry size with FAILED_JOB_MAX_RETAIN (default 500).
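The two safeguards described here, redaction and a retention cap, can be sketched together. The code below is illustrative rather than the NAAS implementation: `redact` and `FailedRegistry` are hypothetical, and dropping the oldest entry first once the cap is reached is an assumption about the eviction order.

```python
from collections import deque


def redact(message: str, secrets: list[str], mask: str = "***") -> str:
    """Replace every occurrence of each non-empty secret with a mask."""
    # Longest first, so a secret that contains another is masked whole.
    for secret in sorted(filter(None, secrets), key=len, reverse=True):
        message = message.replace(secret, mask)
    return message


class FailedRegistry:
    """Bounded list of failed jobs with redacted error messages."""

    def __init__(self, max_retain: int = 500):
        # A deque with maxlen silently discards the oldest entry when full.
        self._jobs = deque(maxlen=max_retain)

    def add(self, job_id: str, error: str, secrets: list[str]) -> None:
        self._jobs.append({"job_id": job_id, "error": redact(error, secrets)})

    def entries(self) -> list[dict]:
        return list(self._jobs)
```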

Job tags

Attach metadata to jobs for filtering and auditing:

{
  "host": "192.168.1.1",
  "commands": ["show version"],
  "tags": {"team": "network-ops", "env": "prod", "ticket": "CHG-12345"}
}

Filter jobs by tag: GET /v1/jobs?tag=team:network-ops
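The `key:value` filter syntax can be read as an exact match on a single tag. The sketch below assumes that reading, and assumes the split happens on the first colon so that values may themselves contain colons; `parse_tag_filter` and `matches` are hypothetical helpers.

```python
def parse_tag_filter(expr: str) -> tuple[str, str]:
    """Split 'key:value' on the first colon; both parts must be non-empty."""
    key, sep, value = expr.partition(":")
    if not sep or not key or not value:
        raise ValueError(f"expected 'key:value', got {expr!r}")
    return key, value


def matches(job: dict, expr: str) -> bool:
    """True if the job carries the tag named in expr with exactly that value."""
    key, value = parse_tag_filter(expr)
    return job.get("tags", {}).get(key) == value
```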

Queue backpressure

Prevent queue overload by capping the queue with MAX_QUEUE_DEPTH (unlimited by default). Once the queue is full, enqueue requests return 503 Service Unavailable:

MAX_QUEUE_DEPTH=1000
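The admission decision reduces to a single comparison. This sketch assumes the queue counts as full once its depth reaches MAX_QUEUE_DEPTH, and it models the unlimited default as `None`; `queue_has_room` is a hypothetical name.

```python
from typing import Optional


def queue_has_room(current_depth: int, max_queue_depth: Optional[int]) -> bool:
    """True if another job may be enqueued; None means no limit is set."""
    return max_queue_depth is None or current_depth < max_queue_depth
```

With MAX_QUEUE_DEPTH=1000, a queue holding 999 jobs would accept one more under this reading, and the request after that would receive the 503.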

Enqueue response metadata

All enqueue responses now include:

{
  "job_id": "abc-123",
  "queue_position": 3,
  "enqueued_at": "2026-03-20T19:00:00+00:00",
  "timeout": 60
}

TTP structured output

/v1/send_command_structured now supports TTP templates in addition to TextFSM:

{
  "host": "192.168.1.1",
  "platform": "cisco_ios",
  "commands": ["show version"],
  "ttp_template": "hostname {{ hostname }}"
}

ttp_template and textfsm_template are mutually exclusive.
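A client can enforce the exclusivity rule before sending a request. The helper below is a sketch with a hypothetical name, and returning `None` when neither template is present is a choice made for this example rather than documented behavior.

```python
from typing import Optional, Tuple


def select_template(payload: dict) -> Optional[Tuple[str, str]]:
    """Return (engine, template), or None when no template is supplied."""
    ttp = payload.get("ttp_template")
    textfsm = payload.get("textfsm_template")
    if ttp is not None and textfsm is not None:
        raise ValueError("ttp_template and textfsm_template are mutually exclusive")
    if ttp is not None:
        return "ttp", ttp
    if textfsm is not None:
        return "textfsm", textfsm
    return None
```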

Connection timeout control

Control TCP connection timeout with conn_timeout (default 10s):

{
  "host": "192.168.1.1",
  "commands": ["show version"],
  "conn_timeout": 5.0
}

Useful for fast failure detection on unreachable hosts or tuning for high-latency links.
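To get a feel for what a TCP connection timeout controls, the standard library exposes the same knob. This sketch only illustrates the concept and says nothing about how NAAS connects to devices; `tcp_reachable` is a hypothetical helper, and the default port of 22 is an assumption.

```python
import socket


def tcp_reachable(host: str, port: int = 22, conn_timeout: float = 10.0) -> bool:
    """Try a TCP connect and give up after conn_timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=conn_timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False
```

A lower value makes unreachable hosts fail sooner, while a higher one leaves more room for slow links to complete the handshake.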

Reliability improvements

Redis graceful degradation

Redis errors now return 503 Service Unavailable with Retry-After: 10 instead of unhandled 500 errors.
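Clients can use the header to pace their retries. The sketch below handles only the delta-seconds form of Retry-After and falls back to a fixed delay otherwise; `retry_delay` is a hypothetical helper, and the 10-second fallback simply mirrors the value quoted above.

```python
from typing import Optional


def retry_delay(status: int, headers: dict, fallback: float = 10.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 503."""
    if status != 503:
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    raw = {name.lower(): value for name, value in headers.items()}.get("retry-after")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date, which this sketch does not parse.
        return fallback
```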

Job reaper

A background thread in each worker detects orphaned jobs left behind by dead workers (OOM kills, node failures) and moves them to the failed registry. It also clears their dedup keys so the jobs can be resubmitted. A distributed Redis lock ensures that only one reaper runs per cycle.

Configure with JOB_REAPER_ENABLED, JOB_REAPER_INTERVAL, WORKER_STALE_THRESHOLD.
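The detection step can be thought of as comparing each worker's last heartbeat against WORKER_STALE_THRESHOLD. The following sketch assumes heartbeat timestamps are available per worker and leaves out the Redis lock and the move to the failed registry; `find_orphans` and its data shapes are hypothetical.

```python
def find_orphans(
    jobs: dict[str, str],
    heartbeats: dict[str, float],
    now: float,
    stale_after: float = 120.0,
) -> list[str]:
    """Return IDs of jobs whose worker is unknown or has gone quiet.

    jobs maps job_id -> worker_id; heartbeats maps worker_id -> last-seen time.
    """
    orphans = []
    for job_id, worker_id in jobs.items():
        last_seen = heartbeats.get(worker_id)
        if last_seen is None or now - last_seen > stale_after:
            orphans.append(job_id)
    return sorted(orphans)
```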

Connection pool exclusion list

Exclude specific devices or platforms from connection pooling:

CONNECTION_POOL_EXCLUDE=192.168.1.1,cisco_ios_old
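Since the list mixes device addresses and platform names, a lookup has to check both. This sketch assumes exact string matching against either value, which the notes do not spell out; `parse_exclusions` and `pooling_allowed` are hypothetical names.

```python
def parse_exclusions(raw: str) -> frozenset[str]:
    """Split the comma-separated value, ignoring blanks and whitespace."""
    return frozenset(item.strip() for item in raw.split(",") if item.strip())


def pooling_allowed(host: str, platform: str, excluded: frozenset[str]) -> bool:
    """False when either the host or the platform appears in the list."""
    return host not in excluded and platform not in excluded
```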

Observability

naas_failed_jobs_total Prometheus gauge — tracks dead letter queue depth
failed_jobs count in /healthcheck response
job.orphaned audit event when reaper moves a job to failed registry

Developer experience

ADR process — docs/adr/ with MADR format for architectural decisions
Postman/OpenAPI artifacts — OpenAPI spec published as release artifact
Python 3.12/3.13/3.14 support — all tested in CI matrix
Docker default bumped to Python 3.14

CI improvements

Docker BuildKit GHA layer caching for integration tests (~30-40s savings per run)
Integration tests reduced from 4 parallel Python-version jobs to 1 (75% compute reduction — container always runs 3.14)
GUNICORN_WORKERS=2 in CI environments for faster startup
docker compose up --wait for reliable service readiness

New configuration variables

| Variable | Default | Description |
| --- | --- | --- |
| `NAAS_CONTEXTS` | `default` | Comma-separated list of valid routing contexts |
| `WORKER_CONTEXTS` | `default` | Contexts this worker handles |
| `JOB_DEDUP_ENABLED` | `true` | Enable server-side job deduplication |
| `IDEMPOTENCY_TTL` | `86400` | Idempotency key TTL in seconds |
| `MAX_QUEUE_DEPTH` | (unlimited) | Maximum queue depth before 503 |
| `WEBHOOK_ALLOW_HTTP` | `false` | Allow HTTP webhook URLs (testing only) |
| `FAILED_JOB_MAX_RETAIN` | `500` | Maximum failed jobs in dead letter queue |
| `JOB_REAPER_ENABLED` | `true` | Enable orphaned job detection |
| `JOB_REAPER_INTERVAL` | `60` | Seconds between reaper scans |
| `WORKER_STALE_THRESHOLD` | `120` | Seconds before worker considered dead |
| `GUNICORN_WORKERS` | `8` | Number of gunicorn worker processes |
| `CONNECTION_POOL_EXCLUDE` | (none) | Comma-separated IPs/platforms to exclude from pooling |

Upgrade Steps

Pull the latest image or update your deployment
Update ip field to host in your requests (optional — ip still works)
Configure NAAS_CONTEXTS and WORKER_CONTEXTS if using multi-segment routing
Review MAX_QUEUE_DEPTH for your workload
Configure webhook endpoints if you want push notifications instead of polling
Review Prometheus dashboards — new naas_failed_jobs_total gauge available