Configuring `_replicator` for IoT Edge Nodes

Your fleet of cellular or LPWAN edge nodes keeps dropping replication: jobs flap between running and crashing, checkpoints stall, and telemetry stops reaching the central cluster during backhaul outages. This page is the field guide for making a _replicator document survive the specific realities of constrained edge hardware — intermittent links, tight memory, and eMMC storage that can’t absorb metadata bloat. It applies the field contract from the _replicator document schema to the edge-node case: which parameters to set, how to size batches and connections for a flaky radio, and how to wrap the job in a Python supervisor that recovers without saturating the uplink. For the broader framework these decisions live in, see _replicator Configuration & Sync Pipeline Management.

Immediate Triage & Prerequisites

Before touching retry logic, confirm what state the scheduler actually holds. On the edge node (or against it), gather these signals:

Read the job’s real state. Query _scheduler/docs/_replicator/<doc_id> — not the raw document — because the scheduler endpoint reflects the live worker view including error_count and the last error object. A job marked crashing is retrying with exponential backoff; failed means the scheduler gave up and will not retry on its own.
Grep the CouchDB log for couch_replicator. Transient network partitions on cellular or LPWAN backhauls surface here as connection resets and timeouts. Persistent forbidden / doc_validation lines instead point at a validate_doc_update function on the target rejecting the payload — a schema problem, not a network one.
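Read the failure reason. _scheduler/docs carries the specific cause in its info field, and for a failed job the document itself also records it in _replication_state_reason: a bad source/target URL, a TLS failure, a missing design document for a filter, or an auth rejection.
Compare checkpoint against source update_seq. Fetch the source database’s update_seq and the job’s checkpointed_source_seq from _scheduler/jobs. In CouchDB 3.x both values are opaque strings whose numeric prefix is only an approximate position, so compare the prefixes and treat the result as an estimate. A gap of many thousands of sequences signals a checkpoint reset or a persistent write failure rather than ordinary lag.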
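
The checkpoint comparison above can be sketched in a few lines of Python. This is an illustrative helper, not part of CouchDB: the function names, the state labels it returns, and the 5000-sequence threshold are all assumptions chosen for the example, and the numeric prefix of a clustered sequence is only an approximation.

```python
from typing import Optional, Union

Seq = Union[int, str, None]


def seq_number(seq: Seq) -> int:
    """Return the numeric prefix of a CouchDB sequence value (0 if absent)."""
    if seq is None:
        return 0
    if isinstance(seq, int):
        return seq
    prefix = str(seq).split("-", 1)[0]
    return int(prefix) if prefix.isdigit() else 0


def checkpoint_gap(source_update_seq: Seq, checkpointed_source_seq: Seq) -> int:
    """Approximate number of sequences the checkpoint trails the source by."""
    gap = seq_number(source_update_seq) - seq_number(checkpointed_source_seq)
    return max(gap, 0)


def triage(state: Optional[str], gap: int, reset_threshold: int = 5000) -> str:
    """Map scheduler state plus gap to a coarse label (labels are hypothetical)."""
    if state == "failed":
        return "failed"              # scheduler gave up; will not retry on its own
    if gap >= reset_threshold:
        return "checkpoint-suspect"  # possible checkpoint reset or write failure
    if state == "crashing":
        return "retrying"            # scheduler is backing off and retrying
    return "lagging-normally"
```

For instance, triage("running", checkpoint_gap("12000-abc", "300-def")) returns "checkpoint-suspect" under the assumed threshold, which should be tuned to the node's normal write rate.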

Prerequisites for the automation in this guide: CouchDB 3.x on both ends, Python 3.9+, and the aiohttp library on the provisioning host. Auth failures appear in the script in this guide as aiohttp.ClientResponseError with status 401 or 403, raised by raise_for_status(), so log and alert on them explicitly rather than swallowing them. Scope the replication user to named roles provisioned in the target’s _security members list and grant _admin only where the workflow genuinely needs it; the trust model here is governed by security boundaries in replication.

Step-by-Step Implementation

Declare the mandatory fields. CouchDB itself requires only source and target; for edge jobs, also set create_target and an explicit continuous flag so nothing is left to defaults. Verify locally before deploy: python -c "import json; d=json.load(open('job.json')); assert all(k in d for k in ('_id','source','target','create_target','continuous'))".
Pick exactly one scoping mechanism. Narrow the telemetry stream with a top-level selector (a Mango query), a doc_ids array, or filter: "ddoc/filtername" — never more than one, and note there is no _selector field name. An over-broad selector simply replicates too few or too many documents; it does not raise 401/403. Confirm the scheduler accepted your scoping with curl -s $URL/_scheduler/docs/_replicator/rep_edge_01 | jq .state and check it is not failed.
Cap connections and batch size for the radio. Set http_connections: 1 (the server default is 20) and hold worker_batch_size at 500 or lower so a low-power gateway never exhausts memory or thrashes the link; 500 is also the stock default, so go smaller if memory is still tight. Keep use_checkpoints: true (the default) so incremental progress survives an abrupt disconnect. Verify the running job’s batch behaviour against _scheduler/jobs: curl -s $URL/_scheduler/jobs | jq '.jobs[] | select(.doc_id=="rep_edge_01") | .info'.
Absorb transient packet loss. Set connection_timeout: 30000 and retries_per_request: 5 to ride out brief resets without exhausting the worker pool. Both match the stock [replicator] defaults in CouchDB 3.x; writing them into the document keeps the job independent of later server-side config changes. Remember retries_per_request only caps retries inside a single HTTP request; it is not your job-level retry budget, which belongs in the supervisor.
Choose the lifetime deliberately. Use continuous: true only when the node holds a persistent link; otherwise default to one-shot sync triggered by a local scheduler or asyncio loop. The full trade matrix is in continuous vs one-way sync. Assert the deployed lifetime matches intent: the job should read continuous: true in _scheduler/docs for a persistent node.
Deploy idempotently under a deterministic _id. Derive _id from the hardware serial so a node reboot re-PUTs the identical document instead of spawning a duplicate. A concurrent write surfaces as 409 and is handled by re-reading _rev — the same pattern documented in handling 409 conflicts in replication jobs.
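
The field checks and the deterministic _id from the steps above can be combined into a small pre-deploy validator. This is a sketch under stated assumptions: job_id and validate_job are hypothetical helper names, the rep_edge_ prefix follows the examples in this guide, and the character sanitisation of the serial is an illustrative choice rather than a CouchDB requirement.

```python
import re
from typing import List

# The three scoping mechanisms named in this guide; at most one may be set.
SCOPING_FIELDS = ("selector", "doc_ids", "filter")


def job_id(serial: str) -> str:
    """Derive a deterministic _id from a hardware serial."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", serial.strip())
    if not cleaned:
        raise ValueError("empty hardware serial")
    return f"rep_edge_{cleaned}"


def validate_job(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    for key in ("_id", "source", "target"):
        if not doc.get(key):
            problems.append(f"missing {key}")
    if not isinstance(doc.get("continuous"), bool):
        problems.append("continuous must be an explicit boolean")
    scoping = [k for k in SCOPING_FIELDS if k in doc]
    if len(scoping) > 1:
        problems.append("multiple scoping fields: " + ", ".join(scoping))
    if "_selector" in doc:
        problems.append("_selector is not a field name; use selector")
    return problems
```

Running validate_job on the document before the PUT catches a doubled-up scoping field locally, before the scheduler ever sees the job.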

Complete Working Example

The script below deploys an edge-tuned _replicator document and wraps monitoring in an asyncio supervisor that applies job-level exponential backoff, so polling slows down while the link is unavailable and speeds back up once the job is healthy. It catches aiohttp.ClientError and asyncio.TimeoutError, routes failures to a webhook, and re-reads state from _scheduler/docs. It runs as-is and needs only aiohttp beyond the standard library.

#!/usr/bin/env python3
"""Deploy and supervise an edge-tuned CouchDB _replicator job."""
import os
import json
import asyncio
import logging

import aiohttp

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(message)s")

COUCH = os.getenv("COUCHDB_URL", "http://localhost:5984")
AUTH = aiohttp.BasicAuth(os.getenv("COUCHDB_USER", "admin"),
                         os.getenv("COUCHDB_PASS", "password"))
WEBHOOK = os.getenv("ALERT_WEBHOOK")  # optional failure sink


def edge_job(node_serial: str) -> dict:
    """Build an edge-tuned _replicator document keyed by hardware serial."""
    return {
        "_id": f"rep_edge_{node_serial}",           # deterministic -> idempotent
        "source": os.getenv("SYNC_SOURCE",
                            f"http://edge-{node_serial}:5984/iot_telemetry"),
        "target": os.getenv("SYNC_TARGET",
                            "http://core-cluster:5984/iot_telemetry"),
        "create_target": True,
        "continuous": True,
        "selector": {"type": "temperature"},         # one scoping field only
        "connection_timeout": 30000,                  # ride out cellular resets
        "retries_per_request": 5,                     # per-HTTP-request cap
        "http_connections": 1,                        # cap memory on the gateway
        "worker_batch_size": 500,                     # small batches survive drops
        "use_checkpoints": True,                      # resume after disconnect
        # _admin shown for brevity; prefer roles provisioned in the target's _security
        "user_ctx": {"name": "replicator_svc", "roles": ["_admin"]},
    }


async def deploy(session: aiohttp.ClientSession, doc: dict) -> None:
    """Idempotently PUT the job, carrying _rev forward on 409."""
    url = f"{COUCH}/_replicator/{doc['_id']}"
    async with session.get(url) as r:
        if r.status == 200:
            doc = {**doc, "_rev": (await r.json())["_rev"]}
    async with session.put(url, json=doc) as r:
        if r.status == 409:  # lost a race with another provisioner
            async with session.get(url) as g:
                doc = {**doc, "_rev": (await g.json())["_rev"]}
            async with session.put(url, json=doc) as r2:
                r2.raise_for_status()
        else:
            r.raise_for_status()
    logging.info("deployed %s", doc["_id"])


async def state(session: aiohttp.ClientSession, doc_id: str) -> dict:
    """Return the scheduler's live view of the job."""
    url = f"{COUCH}/_scheduler/docs/_replicator/{doc_id}"
    async with session.get(url) as r:
        r.raise_for_status()
        return await r.json()


async def notify(session: aiohttp.ClientSession, payload: dict) -> None:
    """Best-effort push of a failure snapshot to a webhook."""
    if not WEBHOOK:
        return
    try:
        async with session.post(WEBHOOK, json=payload) as r:
            await r.read()
    except aiohttp.ClientError as exc:
        logging.warning("webhook delivery failed: %s", exc)


async def supervise(node_serial: str, max_backoff: float = 300.0) -> None:
    """Deploy, then poll state with job-level exponential backoff."""
    doc = edge_job(node_serial)
    backoff = 5.0
    async with aiohttp.ClientSession(auth=AUTH) as session:
        await deploy(session, doc)
        while True:
            try:
                snap = await asyncio.wait_for(state(session, doc["_id"]),
                                              timeout=15)
                st = snap.get("state")
                logging.info("%s -> %s (errors=%s)",
                             doc["_id"], st, snap.get("error_count", 0))
                if st in ("crashing", "failed"):
                    await notify(session, {"doc_id": doc["_id"], "snapshot": snap})
                    backoff = min(backoff * 2, max_backoff)  # widen on trouble
                else:
                    backoff = 5.0                              # healthy -> reset
                await asyncio.sleep(backoff)
            except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                logging.warning("poll failed, backing off: %s", exc)
                await notify(session, {"doc_id": doc["_id"], "error": str(exc)})
                await asyncio.sleep(backoff)
                backoff = min(backoff * 2, max_backoff)


if __name__ == "__main__":
    serial = os.getenv("NODE_SERIAL", "node01")
    try:
        asyncio.run(supervise(serial))
    except KeyboardInterrupt:
        print(json.dumps({"stopped": serial}))

Gotchas & Edge Cases

socket_options is an Erlang-term string, not JSON. To keep TCP connections alive across idle windows, set "socket_options": "[{keepalive, true}]" as a literal string. Passing a JSON array is silently ignored and the keepalive never takes effect.
CouchDB exposes no replication_lag or replication_throughput metric. Derive backlog yourself from the source_seq-versus-checkpointed_source_seq gap and changes_pending in _scheduler/jobs. Alerting on a non-existent metric produces a monitor that never fires.
retries_per_request is not your retry budget. It caps retries within one HTTP request inside the worker. Job-level recovery — the thing that matters during a multi-hour outage — must live in your supervisor, as it does above.
eMMC metadata bloat kills long-lived checkpoints. On devices with limited flash, unbounded checkpoint history and stale _replicator documents accumulate. Prune completed one-shot jobs and old checkpoints on a schedule, or the node runs out of storage mid-sync.
There are no built-in _reader/_writer roles. Scope user_ctx.roles to roles you actually provisioned in the target’s _security members. Assuming CouchDB ships reader/writer roles leaves the job either over-privileged or unable to write.
Replication never merges divergent leaves. A narrowed edge job keeps write divergence contained, but reconciling the leaves that revision tree mechanics stack up is the job of the conflict detection strategies layer, not of any _replicator field.
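
To make the pruning advice concrete, the following sketch picks out completed one-shot jobs that are old enough to delete. It assumes the input is the docs array returned by _scheduler/docs, with state, doc_id, and a last_updated timestamp in the form 2024-01-01T00:00:00Z; the prunable name and the one-day minimum age are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-01-01T00:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def prunable(docs: Iterable[dict], now: datetime,
             min_age: timedelta = timedelta(days=1)) -> List[str]:
    """Return doc_ids of completed jobs last updated at least min_age ago."""
    selected = []
    for entry in docs:
        if entry.get("state") != "completed":
            continue                      # never touch live or failed jobs
        updated = entry.get("last_updated")
        if not updated:
            continue                      # no timestamp, so leave it alone
        if now - parse_ts(updated) >= min_age:
            selected.append(entry["doc_id"])
    return selected
```

Each returned doc_id still has to be removed with a DELETE on /_replicator/<doc_id> carrying the current rev; keeping the selection as a pure function makes it easy to dry-run before anything is deleted.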

Verification & Observability

Confirm the fix took hold from three angles. First, the job should hold running (or cycle cleanly through completed for one-shot nodes) in _scheduler/docs with a low, non-growing error_count:

curl -s $COUCHDB_URL/_scheduler/docs/_replicator/rep_edge_node01 \
  | jq '{state, error_count, info: .info.error}'

Second, the checkpoint gap should shrink toward zero as the backhaul recovers — poll _scheduler/jobs and watch checkpointed_source_seq climb toward the source update_seq:

curl -s $COUCHDB_URL/_scheduler/jobs \
  | jq '.jobs[] | select(.doc_id=="rep_edge_node01")
        | {source_seq: .info.source_seq,
           checkpointed_source_seq: .info.checkpointed_source_seq,
           changes_pending: .info.changes_pending}'

Third, for a fleet, run a lightweight Prometheus exporter that polls _scheduler/docs and emits a gauge per state so crashing and failed counts are visible in real time — the same polling surface used by monitoring replication checkpoints via API. Route state transitions to your alerting stack through async monitoring & webhooks and tune recovery cadence with error handling & retry logic so a flaky node self-heals instead of being discovered by hand.
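
A minimal sketch of the per-state gauge described above, written without any exporter library so it stays dependency-free. It renders the Prometheus text exposition format from the docs array of a _scheduler/docs response; the metric name couchdb_replication_jobs and the state_gauges function are assumptions for this example, not names CouchDB or Prometheus define.

```python
from collections import Counter
from typing import Iterable

# Scheduler states reported by _scheduler/docs.
STATES = ("initializing", "running", "pending", "crashing",
          "error", "failed", "completed")


def state_gauges(docs: Iterable[dict],
                 metric: str = "couchdb_replication_jobs") -> str:
    """Render one gauge sample per scheduler state in Prometheus text format."""
    counts = Counter(entry.get("state") or "unknown" for entry in docs)
    lines = [
        f"# HELP {metric} Replication jobs by scheduler state.",
        f"# TYPE {metric} gauge",
    ]
    # Emit every known state (zero included) so alert rules always have a series,
    # then any unexpected states in sorted order.
    extras = sorted(set(counts) - set(STATES))
    for state in list(STATES) + extras:
        lines.append(f'{metric}{{state="{state}"}} {counts.get(state, 0)}')
    return "\n".join(lines) + "\n"
```

Serving this string from any HTTP handler gives Prometheus a scrape target; emitting explicit zeros for crashing and failed means a rule such as "failed greater than 0" evaluates cleanly even on a healthy fleet.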

FAQ

Should edge nodes use continuous or one-shot replication?

Use continuous: true only when the node holds a persistent link. For intermittently connected cellular or LPWAN nodes, default to one-shot jobs triggered during known availability windows by a local scheduler or an asyncio loop — this avoids a live _changes listener burning the uplink and battery while the radio is down. The full trade matrix is in the continuous vs one-way sync page.

Why does my edge job flap between running and crashing?

Almost always intermittent connectivity to the target. Pin connection_timeout at 30000 and retries_per_request at 5, hold worker_batch_size at 500 or lower, and set http_connections: 1. crashing is a transient state the scheduler retries with exponential backoff, so a small amount of flapping on a mobile link is normal; only a rising error_count or a transition to failed needs intervention.
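
The "rising error_count or failed" rule can be expressed as a small check over recent polls. This is a sketch, assuming samples are (state, error_count) pairs collected in chronological order from _scheduler/docs; the needs_intervention name and the three-sample window used to define "rising" are hypothetical choices.

```python
from typing import List, Optional, Tuple

Sample = Tuple[Optional[str], int]  # (state, error_count) from one poll


def needs_intervention(samples: List[Sample], window: int = 3) -> bool:
    """True if the job has failed or error_count rose on every recent poll."""
    if not samples:
        return False
    if samples[-1][0] == "failed":
        return True                       # scheduler will not retry on its own
    recent = [count for _, count in samples[-window:]]
    if len(recent) < window:
        return False                      # not enough history to call a trend
    # "Rising" here means strictly increasing across the whole window.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A job that alternates between running and crashing resets the trend and returns False, which matches the guidance that occasional flapping on a mobile link is normal.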

How do I keep checkpoints from filling up limited edge storage?

Keep use_checkpoints: true so progress survives disconnects, but prune completed one-shot _replicator documents and stale checkpoint history on a schedule. On devices with small eMMC, unbounded metadata is a real failure mode — a periodic cleanup job that deletes finished replication documents prevents the node running out of flash mid-sync.
