Automating Continuous Sync with Python Scripts for Edge and IoT Deployments

You deployed a continuous: true replication document to a fleet of edge gateways, walked away, and hours later discovered that half the devices stopped converging — no exception, no alert, just a growing backlog on the source _changes feed and a scheduler quietly reporting crashing behind a flapping cellular link. That silent stall is the failure this page fixes: a Python supervisor that provisions a long-lived continuous job, confirms the scheduler actually picked it up, watches checkpoints move forward, and self-heals a crash-loop instead of waiting for a human to notice. It is the automation layer on top of the continuous vs one-way sync decision — this page assumes you have already decided a persistent listener is right for the edge and now need it to stay alive unattended.

Immediate Triage / Prerequisites

Before writing a line of supervisor code, confirm the job you think is continuous actually is. The single most common cause of an edge sync “stalling” is that continuous was never set and defaulted to false, so the job ran one-shot to the source’s update sequence and exited exactly as designed. Read the document back and assert the flag:

```
curl -s http://admin:password@localhost:5984/_replicator/rep_edge_state_sync \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); assert d.get("continuous") is True, "job is NOT continuous"; print("continuous OK")'
```

Then check the scheduler’s live view, not the document’s cached _replication_state. The _scheduler/docs endpoint reflects the real worker state including recent errors:

```
curl -s http://admin:password@localhost:5984/_scheduler/docs/_replicator/rep_edge_state_sync \
  | python3 -m json.tool
```

If state is crashing, grep the CouchDB log for the correlated failure signature before anything else — grep -E 'couch_replicator.*(error|econnrefused|timeout)' /var/log/couchdb/couchdb.log. Prerequisites for the automation itself are modest: Python 3.9+ and requests (pip install requests). Every job you supervise must already be shaped by the _replicator document schema — the supervisor provisions and watches documents, it does not invent new fields.
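The two triage checks above can be folded into one pure function so a rollout script can gate on them. This is a minimal sketch: `triage` is a hypothetical helper, not part of CouchDB or requests, and it assumes you have already fetched the JSON body of the _replicator document and of its _scheduler/docs entry, with the `state` and `error_count` fields used elsewhere on this page.

```python
from typing import Optional


def triage(rep_doc: dict, sched_doc: Optional[dict]) -> str:
    """Classify a job from its _replicator doc and its _scheduler/docs entry."""
    # A missing or false flag yields a one-shot job that exits by design.
    if rep_doc.get("continuous") is not True:
        return "not-continuous"
    # No scheduler entry (or no state) means the document was not picked up yet.
    if not sched_doc or sched_doc.get("state") is None:
        return "not-scheduled"
    state = sched_doc["state"]
    if state == "running":
        return "healthy"
    if state == "crashing":
        return f"crashing (error_count={sched_doc.get('error_count', 0)})"
    return f"needs-attention ({state})"


if __name__ == "__main__":
    # Hand-written sample bodies, not real server output.
    print(triage({"continuous": True}, {"state": "crashing", "error_count": 3}))
```

Keeping the function free of HTTP calls means the same logic can be unit-tested offline and reused by whatever fetches the two documents.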

Step-by-Step Implementation

Each step below ends with a check you can run so a rollout gates on real state instead of “the PUT returned 201”.

1. Assert the source feed is reachable and the flag is set. Build the document in Python and validate that continuous is True and that the source resolves before deploying; a bad URL is the difference between running and an endless crashing retry.

```
# doc and auth come from your deployment config
assert doc["continuous"] is True
# HEAD on a database URL returns 200 only if the database exists and is readable;
# a 401 or 404 here would otherwise surface later as a crashing job
assert requests.head(doc["source"], auth=auth, timeout=10).status_code == 200
```

2. PUT the document idempotently under a deterministic _id. Carry the current _rev forward so re-running the supervisor after a pod restart succeeds instead of failing with a 409.

```
existing = session.get(url)
if existing.status_code == 200:
    doc["_rev"] = existing.json()["_rev"]
# CouchDB answers a document PUT with 201 Created, or 202 Accepted on a cluster
assert session.put(url, json=doc).status_code in (201, 202)
```

3. Block until the scheduler reports running. A continuous job never reaches completed; running is its healthy steady state. Poll _scheduler/docs and fail the deploy if it lands in failed.

```
state = session.get(sched_url).json()["state"]
assert state == "running", f"expected running, got {state}"
```

4. Sample the scheduler's docs_written counter to prove the checkpoint is advancing. A job can sit in running while making zero progress if the source feed is empty or the checkpoint document is conflicted. Compare docs_written across two samples separated by your convergence SLA.

```
before = session.get(sched_url).json()["info"]["docs_written"]
time.sleep(30)
after = session.get(sched_url).json()["info"]["docs_written"]
# after >= before always; a persistently flat value under load is the alarm
```

5. Detect a crash-loop and self-heal. When error_count on _scheduler/docs climbs past a threshold, the transient-retry backoff is not recovering. Tear the job down (delete the document) and redeploy it to reset the scheduler’s backoff state, then escalate if it recurs.

```
if snapshot["error_count"] > MAX_ERRORS:
    teardown(doc_id)  # delete the document to reset the scheduler's backoff
    deploy(doc)       # re-arm the listener
```

The trade-off you are encoding in step 5 is exactly the one covered by error handling & retry logic: the scheduler already retries with exponential backoff, so your supervisor should only intervene when that native backoff has demonstrably failed, never on the first crashing blip.
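One way to encode "only when the native backoff has demonstrably failed" is to require the threshold to be exceeded on several consecutive samples rather than on one. The sketch below is an illustration under that assumption: `CrashLoopGate` and its `sustained_samples` parameter are hypothetical names, and the class consumes the `state` and `error_count` values read from _scheduler/docs without making any HTTP calls itself.

```python
from collections import deque


class CrashLoopGate:
    """Trip only when error_count stays above a threshold for N samples in a row."""

    def __init__(self, max_errors: int = 5, sustained_samples: int = 3):
        self.max_errors = max_errors
        # Sliding window of booleans: was this sample over the threshold?
        self.window = deque(maxlen=sustained_samples)

    def observe(self, state: str, error_count: int) -> bool:
        """Record one scheduler sample; return True when intervention is warranted."""
        if state == "failed":
            return True  # terminal state: the scheduler will not retry on its own
        self.window.append(error_count > self.max_errors)
        # A full window of over-threshold samples is required, so one blip never trips.
        return len(self.window) == self.window.maxlen and all(self.window)

    def reset(self) -> None:
        """Call after a redeploy so the fresh job starts with a clean window."""
        self.window.clear()
```

A supervisor loop could call `gate.observe(snap["state"], snap["error_count"])` once per sample and call `gate.reset()` after each redeploy. With the defaults, a job has to stay above five errors for three samples in a row before it is torn down.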

Complete Working Example

This script is self-contained and runnable. It provisions a continuous job, waits for running, then supervises it on an interval — logging checkpoint deltas and redeploying on a sustained crash-loop. Configure it entirely through environment variables so the same image runs across a fleet.

```
#!/usr/bin/env python3
"""Provision and supervise a continuous CouchDB _replicator job for edge/IoT."""
import os
import time
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)

MAX_ERRORS = int(os.getenv("MAX_ERRORS", "5"))      # crash-loop threshold
SAMPLE_INTERVAL = float(os.getenv("SAMPLE_INTERVAL", "30"))
# Per-request timeout in seconds, so a dead link can never hang the supervisor.
HTTP_TIMEOUT = float(os.getenv("HTTP_TIMEOUT", "15"))


class ContinuousSyncSupervisor:
    def __init__(self, base_url: str, auth: tuple):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.auth = auth
        # Absorb transient transport errors; 409 is handled explicitly below.
        retry = Retry(total=4, backoff_factor=1,
                      status_forcelist=[429, 500, 502, 503, 504])
        self.session.mount("http://", HTTPAdapter(max_retries=retry))
        self.session.mount("https://", HTTPAdapter(max_retries=retry))

    def deploy(self, doc: dict) -> str:
        """Idempotently PUT a continuous _replicator document."""
        assert doc.get("continuous") is True, "supervisor requires continuous:true"
        doc_id = doc["_id"]
        url = f"{self.base_url}/_replicator/{doc_id}"
        existing = self.session.get(url, timeout=HTTP_TIMEOUT)
        if existing.status_code == 200:
            # Copy rather than mutate, so the caller's dict never carries a stale _rev.
            doc = {**doc, "_rev": existing.json()["_rev"]}
        resp = self.session.put(url, json=doc, timeout=HTTP_TIMEOUT)
        if resp.status_code == 409:  # lost a race with a concurrent deployer
            latest = self.session.get(url, timeout=HTTP_TIMEOUT).json()
            doc = {**doc, "_rev": latest["_rev"]}
            resp = self.session.put(url, json=doc, timeout=HTTP_TIMEOUT)
        resp.raise_for_status()
        logging.info("deployed continuous job %s", doc_id)
        return doc_id

    def snapshot(self, doc_id: str) -> dict:
        """The scheduler's live view: state, errors, and checkpoint stats."""
        url = f"{self.base_url}/_scheduler/docs/_replicator/{doc_id}"
        body = self.session.get(url, timeout=HTTP_TIMEOUT).json()
        info = body.get("info") or {}  # info is null until the job has stats
        return {
            "state": body.get("state"),
            "error_count": body.get("error_count", 0),
            "docs_written": info.get("docs_written", 0),
        }

    def wait_until_running(self, doc_id: str, timeout: float = 60.0) -> None:
        # Monotonic clock: immune to NTP steps, which are common on edge devices.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            snap = self.snapshot(doc_id)
            logging.info("%s -> %s", doc_id, snap["state"])
            if snap["state"] == "running":
                return
            if snap["state"] == "failed":
                raise RuntimeError(f"{doc_id} entered 'failed'")
            time.sleep(2)
        raise TimeoutError(f"{doc_id} never reached 'running'")

    def teardown(self, doc_id: str) -> None:
        """Stop a job the supported way: delete its document."""
        url = f"{self.base_url}/_replicator/{doc_id}"
        rev = self.session.get(url, timeout=HTTP_TIMEOUT).json().get("_rev")
        if rev:  # no _rev means the document is already gone
            self.session.delete(url, params={"rev": rev},
                                timeout=HTTP_TIMEOUT).raise_for_status()
            logging.info("tore down %s", doc_id)

    def supervise(self, doc: dict) -> None:
        """Deploy, then watch forever; redeploy on a sustained crash-loop."""
        doc_id = self.deploy(doc)
        self.wait_until_running(doc_id)
        last_written = 0
        while True:
            time.sleep(SAMPLE_INTERVAL)
            snap = self.snapshot(doc_id)
            delta = snap["docs_written"] - last_written
            last_written = snap["docs_written"]
            logging.info("%s state=%s errors=%d +%d docs",
                         doc_id, snap["state"], snap["error_count"], delta)
            if snap["error_count"] > MAX_ERRORS:
                logging.warning("%s crash-loop (errors=%d); redeploying",
                                doc_id, snap["error_count"])
                self.teardown(doc_id)
                self.deploy(doc)
                self.wait_until_running(doc_id)
                last_written = 0  # the fresh job starts its counters over


if __name__ == "__main__":
    supervisor = ContinuousSyncSupervisor(
        base_url=os.getenv("COUCHDB_URL", "http://localhost:5984"),
        auth=(os.getenv("COUCHDB_USER", "admin"),
              os.getenv("COUCHDB_PASS", "password")),
    )
    job = {
        "_id": os.getenv("JOB_ID", "rep_edge_state_sync"),
        "source": os.getenv("SYNC_SOURCE", "http://core-cluster:5984/device_state"),
        "target": os.getenv("SYNC_TARGET", "http://edge-node-01:5984/device_state"),
        "continuous": True,
        "create_target": True,
        # Cap sockets on constrained cellular modems; see gotchas below.
        "http_connections": int(os.getenv("HTTP_CONNECTIONS", "10")),
        "connection_timeout": int(os.getenv("CONNECTION_TIMEOUT", "30000")),
        "user_ctx": {"name": "sync_service", "roles": ["_admin"]},
    }
    supervisor.supervise(job)
```

Run exactly one supervisor process per job, that is, a single replica per job _id: two supervisors writing the same _id will race on the _rev and thrash the scheduler. In a container, this script is the entrypoint; a liveness probe that hits your metrics sidecar (see Verification & Observability) restarts it if the loop dies.
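Where no metrics sidecar is available, one possible stand-in for the liveness signal is a heartbeat file that the supervise loop touches on every iteration and that the probe checks for freshness. This is only a sketch of that idea: `beat`, `is_alive`, and the file path are hypothetical names, and nothing here comes from CouchDB or from the script above.

```python
import os
import tempfile
import time
from typing import Optional


def beat(path: str) -> None:
    """Touch the heartbeat file; call once per supervise-loop iteration."""
    with open(path, "a"):
        os.utime(path, None)  # set access and modification times to now


def is_alive(path: str, max_age: float, now: Optional[float] = None) -> bool:
    """True when the heartbeat was refreshed within max_age seconds."""
    now = time.time() if now is None else now
    try:
        return (now - os.path.getmtime(path)) <= max_age
    except OSError:
        return False  # no heartbeat file yet: treat the loop as not alive


if __name__ == "__main__":
    hb = os.path.join(tempfile.mkdtemp(), "supervisor.heartbeat")
    beat(hb)
    print(is_alive(hb, max_age=90))
```

A reasonable `max_age` is a small multiple of SAMPLE_INTERVAL, so that one slow scheduler response does not cause a restart.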

Gotchas & Edge Cases

continuous defaults to false. Omitting it does not error — it silently produces a one-shot job that completes and exits, which reads downstream as “sync stopped.” Always author continuous: true explicitly and assert it, as the supervisor does.
File-descriptor exhaustion on flaky links. The default http_connections of 20 is tuned for data centers; on a cellular modem it opens more sockets than the interface can service, and reconnect storms during handoff exhaust FDs. Lower it to 8–10 on constrained edges.
A conflicted checkpoint document freezes progress. CouchDB stores replication checkpoints as documents on both ends; if one develops a conflict the job stays running but docs_written flatlines. That is why step 4 samples the delta rather than trusting the running state alone.
crashing is transient, failed is terminal. A single crashing transition is normal on a lossy link — the scheduler will retry with backoff. Only a rising error_count or a failed state warrants intervention; redeploying on every blip amplifies the storm you are trying to calm.
Continuous sync moves revisions, it never merges them. A supervised continuous job will faithfully replicate divergent edits into stacked leaves on the target’s revision tree; reconciling them is a separate concern covered under conflict detection strategies, and if you resolve with a timestamp rule, unsynchronised device clocks will silently corrupt a last-write-wins merge.
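The clock-skew hazard in the last gotcha is easy to reproduce with two made-up revisions. The sketch below is illustrative only: the `updated_at` field, the device names, and the `lww_winner` helper are assumptions, not part of CouchDB, which does not pick winners by timestamp.

```python
def lww_winner(revisions: list) -> dict:
    """Pick the revision with the largest device-reported timestamp."""
    return max(revisions, key=lambda r: r["updated_at"])


# True order of events: gateway A writes first, gateway B writes 60 s later.
# Gateway B's clock runs 300 s slow, so its later edit carries an earlier stamp.
a = {"device": "gw-a", "value": "valve=open", "updated_at": 1_700_000_000}
b = {"device": "gw-b", "value": "valve=closed", "updated_at": 1_700_000_000 + 60 - 300}

winner = lww_winner([a, b])
# The older edit from gw-a wins; the genuinely newer write from gw-b is discarded.
assert winner["device"] == "gw-a"
```

Nothing errors and nothing is logged, which is why a timestamp rule needs synchronised clocks or a different tie-breaker before it can be trusted on a fleet.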

Verification & Observability

To confirm the automation is actually keeping the edge converged, watch the checkpointed sequence advance on _scheduler/jobs, which exposes the exact source sequence each worker has committed:

```
curl -s http://admin:password@localhost:5984/_scheduler/jobs \
  | python3 -c 'import sys,json; [print(j["id"], j.get("info",{}).get("checkpointed_source_seq")) for j in json.load(sys.stdin)["jobs"]]'
```

Run it twice a minute apart: a moving checkpointed_source_seq under write load is proof of a live, converging listener; a frozen one on an active source is your earliest stall signal — earlier than the error_count climb. The supervisor’s per-interval +N docs log line is the same measurement in your logs, and shipping error_count plus the checkpoint delta to your metrics stack lets you alert before a device drifts. For a webhook-driven, push-based version of this same signal, wire the job into monitoring replication checkpoints via API rather than polling from every gateway.
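The "run it twice and compare" check can be expressed as a small pure function over two samples of a job's `info` block. This is a sketch under one stated assumption: that `info` carries a `changes_pending` counter next to `checkpointed_source_seq`, as recent CouchDB 3.x releases report. `is_stalled` is a hypothetical name, and sequence values are compared only for equality because they are opaque strings.

```python
def is_stalled(before: dict, after: dict) -> bool:
    """True when work is pending but the checkpoint did not move between samples."""
    same_checkpoint = (before.get("checkpointed_source_seq")
                       == after.get("checkpointed_source_seq"))
    # A missing or null counter is treated as "no backlog" to avoid false alarms.
    backlog = (after.get("changes_pending") or 0) > 0
    return same_checkpoint and backlog


if __name__ == "__main__":
    # Hand-written sample values, not real server output.
    first = {"checkpointed_source_seq": "120-abc", "changes_pending": 40}
    second = {"checkpointed_source_seq": "120-abc", "changes_pending": 55}
    print(is_stalled(first, second))
```

Requiring a non-zero backlog keeps an idle source from being reported as a stall, which matters on edges that only write in bursts.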

FAQ

Why does my continuous job show `running` but never replicate anything?

A running state only means the worker is attached to the source _changes feed, not that data is flowing. Sample docs_written (or checkpointed_source_seq) twice and compare: a flat value under active writes usually means a conflicted checkpoint document or a filter that matches nothing. A delete-and-redeploy restarts the worker and makes it re-read its checkpoints from both ends, which is the first thing to try on a stuck job.

How do I stop a supervised continuous job cleanly?

Delete its _replicator document — that is the only supported mechanism, and the scheduler tears the worker down after committing in-flight writes. The supervisor's teardown() does exactly this. The cancel: true flag belongs to the one-off POST /_replicate API, not to scheduler-managed documents.

Should the supervisor restart the job on the first `crashing` state?

No. CouchDB already retries transient failures with exponential backoff, so a single crashing transition on a lossy edge link is expected and self-heals. Restart only when error_count crosses a threshold; otherwise your redeploys reset the backoff timer and worsen a connection storm.
