Continuous vs One-Way Sync in CouchDB `_replicator`

Edge and IoT fleets, mobile backends, and Python sync workers all face the same first decision when they wire two CouchDB databases together: should the replication job hold an open _changes listener forever, or run once and exit? Choosing wrong is expensive in production — an unmanaged continuous job on a flaky cellular link can exhaust file descriptors and trigger connection storms, while a one-shot sweep scheduled too coarsely lets device state drift for minutes before it converges. This page treats the continuous flag as the operational lever it really is: it shows the exact _replicator document for each mode, how to watch a job’s lifecycle from _scheduler/docs, a production-grade Python controller that provisions and supervises either mode, and a trade-off matrix for picking one per topology edge. It is the mode-selection layer of _replicator Configuration & Sync Pipeline Management, and it assumes you already deploy jobs with the _replicator document schema; everything below is about when the job runs, not what fields it carries.

Configuration Schema & Required Parameters

Both modes are the same document in the _replicator database with a single flag flipped. A one-way (one-shot) job transfers every change up to the source’s update sequence as of start time, then transitions to completed and the worker exits. A continuous job attaches a persistent listener to the source _changes feed and propagates mutations until you delete the document. Deleting the _replicator document — not a cancel flag — is the supported way to stop either job.

One-way (one-shot) sync — deterministic batch upload

{
  "_id": "rep_one_way_telemetry",
  "source": "https://edge-device.local:5984/telemetry_db",
  "target": "https://central-backend.example:5984/telemetry_aggregate",
  "create_target": true,
  "continuous": false,
  "http_connections": 10,
  "connection_timeout": 30000,
  "retries_per_request": 3,
  "user_ctx": {
    "name": "sync_service",
    "roles": ["_admin"]
  }
}

Setting continuous: false (the default) makes the replicator exit after it reaches the current source sequence. This is ideal for cron-driven batch uploads where deterministic completion is required and background resource consumption must be minimized — the worker holds no sockets between runs, which matters on constrained edge nodes.

Continuous sync — live state convergence

{
  "_id": "rep_continuous_state_sync",
  "source": "https://central-backend.example:5984/device_state",
  "target": "https://edge-device.local:5984/device_state",
  "continuous": true,
  "create_target": true,
  "heartbeat": 30000,
  "filter": "sync_filters/by_region",
  "query_params": {"region": "eu-west-1"},
  "http_connections": 5,
  "retries_per_request": 10,
  "user_ctx": {
    "name": "sync_service",
    "roles": ["_admin"]
  }
}

The heartbeat parameter makes the source emit a periodic newline on the _changes stream — an application-layer keepalive, distinct from TCP keep-alive, which is configured separately via socket_options. It stops intermediate proxies and NAT gateways from tearing down an idle connection. Pairing continuous mode with a server-side filter (named in ddocname/filtername form, so sync_filters/by_region targets the by_region function inside _design/sync_filters) means only the relevant partition streams down, drastically reducing payload on metered links. Scoping this filter well is also a conflict-avoidance tactic; the concurrency patterns it suppresses are catalogued in conflict generation models.
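
Sizing the heartbeat against the network path is simple arithmetic, so a small helper can make the rule explicit. The sketch below is only an illustration: the function name pick_heartbeat_ms and the safety_factor default are assumptions rather than anything CouchDB provides, and the idle timeouts are values you would have to measure or look up for your own proxies and NAT gateways.

```python
def pick_heartbeat_ms(idle_timeouts_ms: list[int], safety_factor: float = 0.5) -> int:
    """Return a heartbeat interval safely below the shortest idle timeout.

    idle_timeouts_ms: idle timeout (ms) of every proxy or NAT hop on the path.
    safety_factor:    fraction of the shortest timeout to use (assumed default).
    """
    if not idle_timeouts_ms:
        raise ValueError("need at least one idle timeout to size the heartbeat")
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be strictly between 0 and 1")
    shortest = min(idle_timeouts_ms)
    # Stay well under the shortest timeout so one delayed newline is tolerated.
    return max(1, int(shortest * safety_factor))


# Hypothetical path: load balancer 60 s, carrier NAT 45 s, corporate proxy 120 s.
print(pick_heartbeat_ms([60_000, 45_000, 120_000]))  # 22500
```

The shortest hop governs the result, so adding a hop with a longer timeout leaves the heartbeat unchanged.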

Parameter	Type	Default	Effect on the sync mode
`continuous`	boolean	`false`	The mode switch. `false` = run once to current seq and exit; `true` = hold an open `_changes` listener until the doc is deleted.
`source` / `target`	string or object	— (required)	Database URLs, or objects carrying `url`, `headers`, and `auth`. Both must be reachable from the replicator node.
`create_target`	boolean	`false`	Creates the target DB if absent. Safe for one-shot provisioning; leave `false` in steady-state so a typo can’t spawn a stray database.
`heartbeat`	integer (ms)	`10000`	Continuous only. Interval between keepalive newlines on the feed; must be shorter than the shortest idle timeout on the path.
`filter`	string	none	`ddocname/filtername` server-side filter. Mutually exclusive with `doc_ids` and `selector` — set exactly one.
`selector`	object	none	Mango selector that narrows the replicated stream declaratively; a lighter alternative to a JS `filter`.
`query_params`	object	none	Values passed to the named `filter` function; ignored unless `filter` is set.
`http_connections`	integer	`20`	Max concurrent connections to the remote. Lower it for continuous jobs on constrained links to cap descriptor use.
`connection_timeout`	integer (ms)	`30000`	Per-request timeout before a retry. Raise it on high-latency satellite/cellular paths to avoid thrashing.
`retries_per_request`	integer	`5`	Retries before the worker crashes and the scheduler backs off. Set higher for continuous jobs on unstable links.
`since_seq`	string	source start	Resume point. Set it to replay a one-shot job from a known checkpoint instead of the beginning.
`use_checkpoints`	boolean	`true`	Whether the job records checkpoints so an interrupted continuous job resumes rather than restarts. Leave on in production.
`user_ctx`	object	none	Roles the job runs under; writing to `_replicator` needs `_admin` or an equivalent database role.
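
Several rows of the table are constraints that can be checked before a document is ever written. The following is a minimal sketch of such a pre-flight check; validate_job is a hypothetical helper, not part of CouchDB or any client library, and it only encodes the rules stated in the table above.

```python
import re

# The three scoping options the table describes as mutually exclusive.
SCOPING_FIELDS = ("filter", "doc_ids", "selector")
# ddocname/filtername: exactly one slash, both parts non-empty.
FILTER_NAME = re.compile(r"^[^/]+/[^/]+$")


def validate_job(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for field in ("source", "target"):
        if not doc.get(field):
            problems.append(f"missing required field: {field}")
    used = [name for name in SCOPING_FIELDS if name in doc]
    if len(used) > 1:
        problems.append(f"set only one of filter/doc_ids/selector, found: {used}")
    flt = doc.get("filter")
    if flt is not None and not FILTER_NAME.match(flt):
        problems.append("filter must use the ddocname/filtername form")
    if "query_params" in doc and "filter" not in doc:
        problems.append("query_params is ignored unless filter is set")
    return problems


job = {
    "source": "https://central-backend.example:5984/device_state",
    "target": "https://edge-device.local:5984/device_state",
    "continuous": True,
    "filter": "_design/sync_filters/by_region",  # wrong form on purpose
}
for problem in validate_job(job):
    print(problem)
```

Running a check like this in the deployment pipeline catches the filter-path mistake before the scheduler ever sees the document.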

Streaming Detection / Monitoring Setup

The lifecycle you care about differs by mode: a one-shot job’s success signal is the transition to completed, whereas a continuous job’s success signal is that it stays in running with an advancing checkpoint. Read both from GET /_scheduler/docs/_replicator/{id} rather than the legacy _replication_state field on the document, because the scheduler endpoint reports the state, error_count, and an info object that carries either the error details or the live transfer statistics, including the checkpointed sequence. The minimal poller below distinguishes the two lifecycles and yields state transitions:

import time

import httpx


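def watch_job(couch_url: str, doc_id: str, poll_seconds: int = 5, stall_seconds: int = 120):
    """Yield (state, row) transitions for a _replicator job via _scheduler/docs.

    One-shot jobs terminate at state == "completed"; continuous jobs should
    settle on "running" and are healthy only while their checkpoint advances.
    stall_seconds is an illustrative default: keep it above the replicator's
    checkpoint interval so a normal gap between checkpoints is not flagged.
    """
    url = f"{couch_url.rstrip('/')}/_scheduler/docs/_replicator/{doc_id}"
    last_state = None
    last_seq = None
    last_progress = time.monotonic()
    with httpx.Client(timeout=30) as client:
        while True:
            row = client.get(url).json()
            state = row.get("state")
            seq = (row.get("info") or {}).get("checkpointed_source_seq")
            now = time.monotonic()
            if state != last_state:
                yield state, row
                last_state = state
                last_progress = now  # a state change counts as progress
            if seq != last_seq:
                last_seq = seq
                last_progress = now  # checkpoint moved
            elif state == "running" and now - last_progress >= stall_seconds:
                yield "stalled", row  # checkpoint not advancing -> investigate
                last_progress = now  # re-arm so the alert fires once per window
            if state in ("completed", "failed"):
                return  # terminal for a one-shot job
            time.sleep(poll_seconds)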


if __name__ == "__main__":
    for state, row in watch_job("http://admin:pass@localhost:5984", "rep_one_way_telemetry"):
        # "info" can be null in the scheduler response, so guard before reading it.
        print(f"state={state} error={(row.get('info') or {}).get('error')}")

Emit a metric on every transition (state, timestamp, checkpointed seq). For continuous jobs the single most useful alert is “checkpoint stalled while the source keeps taking writes”: the connection can look alive at the TCP layer while the feed has silently stopped delivering rows. Size the stall window above the replicator’s checkpoint interval (30 seconds by default), because the checkpoint only moves when a new one is written and an idle source legitimately leaves it unchanged. For deeper checkpoint inspection and webhook fan-out, wire this poller into Async Monitoring & Webhooks so operators react to crashing/failed transitions without polling every worker by hand.
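
As one possible shape for that per-transition metric, the sketch below flattens a _scheduler/docs row into a small event record. The function name transition_event and the output field names are assumptions chosen for illustration; only state, error_count, doc_id, and info.checkpointed_source_seq are read from the scheduler row.

```python
import json
import time
from typing import Optional


def transition_event(row: dict, now: Optional[float] = None) -> dict:
    """Flatten a _scheduler/docs row into a metric-friendly event record."""
    info = row.get("info") or {}  # "info" may be null in the response
    return {
        "doc_id": row.get("doc_id"),
        "state": row.get("state"),
        "error_count": row.get("error_count", 0),
        "checkpointed_source_seq": info.get("checkpointed_source_seq"),
        "timestamp": time.time() if now is None else now,
    }


# Hand-written sample row for illustration, not captured from a real server.
sample = {
    "doc_id": "rep_continuous_state_sync",
    "state": "running",
    "error_count": 0,
    "info": {"checkpointed_source_seq": "42-abc"},
}
print(json.dumps(transition_event(sample, now=0.0)))
```

Passing now explicitly keeps the function deterministic in tests; in a poller you would leave it unset and ship the record to whatever metrics sink you already use.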

Core Implementation

A production controller has to do more than PUT a document: it must upsert idempotently (a redeployed edge node should not error on an existing job), pick the mode from configuration, wait for the correct terminal or steady state per mode, and tear the job down cleanly. The class below provisions either mode, supervises it with bounded retries and exponential backoff, and logs structured events. For continuous jobs it returns once the job reaches running; for one-shot jobs it blocks until completed.

import logging
import os
import time
from dataclasses import dataclass

import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("replication-controller")


@dataclass
class SyncSpec:
    doc_id: str
    source: str
    target: str
    continuous: bool = False
    filter: str | None = None
    heartbeat: int = 30000
    retries_per_request: int = 5


class ReplicationController:
    """Provision and supervise a CouchDB _replicator job in either sync mode."""

    def __init__(self, couch_url: str, max_retries: int = 5):
        self.couch_url = couch_url.rstrip("/")
        self.max_retries = max_retries
        self.client = httpx.Client(timeout=30)

    def _doc_url(self, doc_id: str) -> str:
        return f"{self.couch_url}/_replicator/{doc_id}"

    def upsert(self, spec: SyncSpec) -> None:
        """Idempotently create or update the replication document.

        Re-reads the current _rev so a redeploy against an existing job
        updates in place instead of failing with a 409 Conflict.
        """
        body = {
            "source": spec.source,
            "target": spec.target,
            "continuous": spec.continuous,
            "retries_per_request": spec.retries_per_request,
            "user_ctx": {"name": "sync_service", "roles": ["_admin"]},
        }
        if spec.continuous:
            body["heartbeat"] = spec.heartbeat  # keepalive only matters for a live feed
        if spec.filter:
            body["filter"] = spec.filter  # ddocname/filtername form

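        # Bounded retry: another writer can change _rev between our GET and PUT,
        # so re-read and try again on 409 up to max_retries times.
        for attempt in range(1, max(1, self.max_retries) + 1):
            existing = self.client.get(self._doc_url(spec.doc_id))
            if existing.status_code == 200:
                body["_rev"] = existing.json()["_rev"]  # update in place
            else:
                body.pop("_rev", None)  # document absent: create it
            resp = self.client.put(self._doc_url(spec.doc_id), json=body)
            if resp.status_code != 409:
                break
            log.warning("409 on upsert of %s (attempt %d), re-reading _rev", spec.doc_id, attempt)
        resp.raise_for_status()
        log.info("upserted job %s (continuous=%s)", spec.doc_id, spec.continuous)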

    def wait_ready(self, spec: SyncSpec, timeout: int = 300) -> str:
        """Block until the job reaches its healthy state for the chosen mode.

        Continuous -> returns on 'running'; one-shot -> returns on 'completed'.
        Raises on 'failed' or timeout so a supervisor can restart the pod.
        """
        target_state = "running" if spec.continuous else "completed"
        url = f"{self.couch_url}/_scheduler/docs/_replicator/{spec.doc_id}"
        deadline = time.monotonic() + timeout
        backoff = 2
        while time.monotonic() < deadline:
            row = self.client.get(url).json()
            state = row.get("state")
            if state == target_state:
                log.info("job %s reached %s", spec.doc_id, state)
                return state
            if state == "failed":
                raise RuntimeError(f"{spec.doc_id} failed: {row.get('info')}")
            time.sleep(backoff)
            backoff = min(backoff * 2, 30)  # exponential backoff, capped
        raise TimeoutError(f"{spec.doc_id} did not reach {target_state} in {timeout}s")

    def teardown(self, doc_id: str) -> None:
        """Stop a job by deleting its _replicator document (the supported path)."""
        existing = self.client.get(self._doc_url(doc_id))
        if existing.status_code == 200:
            rev = existing.json()["_rev"]
            resp = self.client.delete(self._doc_url(doc_id), params={"rev": rev})
            resp.raise_for_status()  # surface a failed delete instead of logging success
            log.info("torn down job %s", doc_id)

    def close(self) -> None:
        self.client.close()


def _main() -> None:
    couch = os.environ.get("COUCH_URL", "http://admin:pass@localhost:5984")
    controller = ReplicationController(couch)
    spec = SyncSpec(
        doc_id="rep_continuous_state_sync",
        source=f"{couch}/device_state",
        target="https://edge-device.local:5984/device_state",
        continuous=True,
        filter="sync_filters/by_region",
    )
    try:
        controller.upsert(spec)
        controller.wait_ready(spec)
    finally:
        controller.close()


if __name__ == "__main__":
    _main()

Two non-obvious points: the upsert re-reads _rev before writing so a fleet rollout that re-applies the same job is idempotent instead of raising 409, and wait_ready selects its success state from the mode — waiting for completed on a continuous job would block forever, and waiting for running on a one-shot job would return before the transfer finished. When you need to drive this asynchronously alongside many jobs, the asyncio variant is worked through in Automating Continuous Sync with Python Scripts.

Strategy Variants & Trade-offs

There are more than two options in practice, because the continuous flag combines with scheduling and filtering to give four distinct operating patterns. Bind one per topology edge rather than defaulting the whole fleet to continuous:

One-shot one-way sync transfers to the current source sequence and exits. Lowest resource footprint, deterministic completion, zero idle sockets — ideal for provisioning, batch telemetry upload, and nightly consolidation. Its cost is latency: nothing propagates between runs.

Cron-driven one-shot sweep wraps the one-shot job in an external scheduler (cron, a Kubernetes CronJob, or the controller above on a timer). You trade convergence granularity for total control over when the link is used, which is what metered or solar-powered edge nodes need. Tune the interval against your tolerated staleness.

Continuous (full) holds an open listener and converges within seconds. Best for live dashboards and central-to-edge config push over stable links. Its failure mode is resource pressure: unmanaged continuous jobs on unstable links crash-loop, and each open feed consumes a connection.

Filtered continuous is continuous mode narrowed by a server-side filter or Mango selector so only a partition streams. It keeps near-real-time convergence while cutting bytes and conflict surface on metered IoT links — usually the right default for mobile and edge fleets.

Strategy	Convergence latency	Resource footprint	Best fit
One-shot one-way	High (per run)	Lowest — no idle sockets	Provisioning, batch upload, consolidation
Cron-driven one-shot sweep	Bounded by interval	Low, bursty	Metered / power-constrained edge nodes
Continuous (full)	Seconds	Highest — one open feed per job	Stable links, live config push, dashboards
Filtered continuous	Seconds, partition-scoped	Medium — bounded by filter	Mobile/IoT fleets on metered links
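
The matrix above can be read as a small decision function. The sketch below is one possible encoding, not a rule from CouchDB: the Edge fields and the order of the checks are assumptions, and a real fleet may weigh the criteria differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical description of one topology edge."""
    stable_link: bool
    metered: bool
    power_constrained: bool
    needs_live_updates: bool


def choose_strategy(edge: Edge) -> str:
    """Map an edge to one of the four operating patterns in the matrix."""
    if not edge.needs_live_updates:
        # No freshness requirement: schedule sweeps only where link use must be controlled.
        if edge.metered or edge.power_constrained:
            return "cron-driven one-shot sweep"
        return "one-shot one-way"
    if edge.power_constrained or not edge.stable_link:
        # A held-open feed would crash-loop or drain power; sweep on a timer instead.
        return "cron-driven one-shot sweep"
    if edge.metered:
        return "filtered continuous"
    return "continuous (full)"


dashboard = Edge(stable_link=True, metered=False, power_constrained=False, needs_live_updates=True)
sensor = Edge(stable_link=False, metered=True, power_constrained=True, needs_live_updates=True)
print(choose_strategy(dashboard))  # continuous (full)
print(choose_strategy(sensor))     # cron-driven one-shot sweep
```

Keeping the choice in one function makes it easy to review per edge and to change when a link's characteristics change.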

The mode you pick here is independent of the merge strategy that reconciles any conflicts the sync surfaces; that decision belongs to algorithm selection for merge, and the physical shape of the links you are syncing across is covered by sync topology models.

Deployment & Orchestration

Run the controller as a small stateless service and observe the single-replica-per-partition rule: two controllers supervising the same continuous job double the checkpoint-write pressure and race each other’s retries, so scale by sharding topology edges across replicas, never by cloning workers onto one edge. Configure everything through the environment so one image serves every node:

# Container environment (one replica per topology edge)
COUCH_URL=https://central-db.cluster:5984
SYNC_MODE=continuous          # or: oneshot
SYNC_FILTER=sync_filters/by_region
SYNC_HEARTBEAT_MS=30000
WAIT_TIMEOUT_S=300
HEALTHCHECK_PORT=8080
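
A controller entry point could read those variables roughly as follows. This is a sketch under assumptions: ControllerConfig and load_config are hypothetical names, and the fallback defaults simply mirror the example values above rather than any documented standard.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ControllerConfig:
    couch_url: str
    continuous: bool
    sync_filter: Optional[str]
    heartbeat_ms: int
    wait_timeout_s: int
    healthcheck_port: int


def load_config(env: Mapping[str, str] = os.environ) -> ControllerConfig:
    """Build the controller configuration from environment-style variables."""
    mode = env.get("SYNC_MODE", "oneshot").strip().lower()
    if mode not in ("continuous", "oneshot"):
        raise ValueError(f"SYNC_MODE must be 'continuous' or 'oneshot', got {mode!r}")
    return ControllerConfig(
        couch_url=env["COUCH_URL"].rstrip("/"),  # required: fail fast if absent
        continuous=(mode == "continuous"),
        sync_filter=env.get("SYNC_FILTER") or None,  # empty string means no filter
        heartbeat_ms=int(env.get("SYNC_HEARTBEAT_MS", "30000")),
        wait_timeout_s=int(env.get("WAIT_TIMEOUT_S", "300")),
        healthcheck_port=int(env.get("HEALTHCHECK_PORT", "8080")),
    )


# Passing a plain dict keeps the example independent of the real environment.
cfg = load_config({"COUCH_URL": "https://central-db.cluster:5984", "SYNC_MODE": "continuous"})
print(cfg.continuous, cfg.heartbeat_ms)  # True 30000
```

Validating SYNC_MODE at startup means a typo fails the pod immediately instead of silently provisioning a one-shot job where a continuous one was intended.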

Expose a /healthz endpoint whose meaning depends on the mode: for a continuous job it confirms the scheduler still reports running and the checkpointed sequence advanced within your stall window (keep that window longer than the replicator’s checkpoint interval, and expect no movement while the source is idle); for a cron-driven one-shot deployment it confirms the last sweep reached completed within the expected window. Pin the checkpoint (since_seq / checkpointed_source_seq) in durable storage so a restarted pod resumes rather than replays a full transfer. For one-shot deployments, let the orchestrator own the schedule and exit code so a failed sweep surfaces as a failed job, not a silently skipped one.
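
The mode-dependent health rule can be kept as a pure function that the HTTP handler calls. The sketch below assumes the caller already tracks how long ago progress was last seen (a checkpoint advance for continuous jobs, a completed sweep for one-shot jobs); the function name healthz and its parameters are illustrative, not an existing API.

```python
def healthz(continuous: bool, state: str, seconds_since_progress: float,
            max_quiet_seconds: float) -> tuple[bool, str]:
    """Return (healthy, reason) for a job according to its sync mode.

    seconds_since_progress: time since the checkpoint last advanced
        (continuous) or since the last sweep completed (one-shot).
    max_quiet_seconds: tolerated gap; an assumption you tune per edge.
    """
    expected = "running" if continuous else "completed"
    if state != expected:
        return False, f"state is {state!r}, expected {expected!r}"
    if seconds_since_progress > max_quiet_seconds:
        return False, (f"no progress for {seconds_since_progress:.0f}s "
                       f"(limit {max_quiet_seconds:.0f}s)")
    return True, "ok"


print(healthz(True, "running", 15, 120))       # (True, 'ok')
print(healthz(False, "completed", 7200, 3600))
```

A handler would typically map the boolean to HTTP 200 or 503 and put the reason string in the response body so operators can see why a probe failed.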

Troubleshooting & Common Errors

Symptom / error	Likely cause	Remediation
Continuous job crash-loops (stuck in `crashing`, `error_count` climbing)	`retries_per_request` too low for the link’s jitter	Raise `retries_per_request` and `connection_timeout`; confirm the scheduler backoff in `_scheduler/docs`
Feed goes silent but TCP stays open	No/oversized `heartbeat`; proxy dropped the idle stream	Set `heartbeat` shorter than the shortest idle timeout on the path; alert on a stalled checkpoint
One-shot job never reaches `completed`	Waiting on `running`, or huge backlog still transferring	Poll for `completed`, not `running`; check `changes_pending` in `_scheduler/docs`
`409 Conflict` on redeploy	Re-`PUT` of an existing job without its `_rev`	Read the current doc, attach `_rev`, then update in place (see `upsert`)
`filter` job replicates nothing	`filter` written as a `_design/.../_filter` path	Use the `ddocname/filtername` form; validate with `GET /_active_tasks`
File descriptors exhausted on edge node	Too many concurrent continuous jobs / high `http_connections`	Consolidate with filtered continuous; lower `http_connections` per job
Checkpoint drift / full replay after restart	`use_checkpoints:false` or checkpoint not persisted	Keep `use_checkpoints:true` and store the last `seq` durably

When a job flaps between crashing and running, treat it as a retry-tuning problem before assuming a broken link: the mechanics of backoff, retries_per_request, and specific status codes live in error handling & retry logic, and the specific 409 handling around conflicting writes is detailed in handling 409 conflicts in replication jobs. Track three signals per job regardless of mode: state-transition rate (flap detector), checkpoint advance rate (progress), and open-connection count (resource headroom).
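
The first of those signals, the state-transition rate, can be computed with a sliding window. The class below is a minimal sketch: FlapDetector, its window length, and its threshold are all assumptions chosen for illustration, and timestamps are passed in so the logic stays testable.

```python
from collections import deque


class FlapDetector:
    """Flag a job whose state changes too often within a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, max_transitions: int = 4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._last_state = None
        self._transitions = deque()  # timestamps of observed state changes

    def observe(self, state: str, now: float) -> bool:
        """Record one polled state; return True when the job is flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(now)
        self._last_state = state
        # Drop transitions that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._transitions and self._transitions[0] < cutoff:
            self._transitions.popleft()
        return len(self._transitions) > self.max_transitions


detector = FlapDetector(window_seconds=300, max_transitions=4)
states = ["running", "crashing", "running", "crashing", "running", "crashing"]
flags = [detector.observe(s, now=i * 10.0) for i, s in enumerate(states)]
print(flags[-1])  # True: five transitions inside the window
```

Feeding it from the same poll loop that reads _scheduler/docs keeps the flap signal in step with the checkpoint and connection metrics.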

FAQ

What is the real difference between continuous and one-way replication?

One-way (one-shot) replication reads the source’s update sequence at start time, transfers every change up to that point, then transitions to completed and the worker exits. Continuous replication attaches a persistent _changes listener and keeps propagating new mutations until you delete the _replicator document. “One-way” describes direction (source → target) and is orthogonal to continuous; a continuous job is also one-directional unless you deploy a second job for the reverse edge.

How do I stop a continuous replication job?

Delete its document from the _replicator database (DELETE /_replicator/{id}?rev={rev}). That is the supported mechanism — the scheduler tears down the worker after committing in-flight writes. The cancel: true flag belongs to the one-off POST /_replicate API, not to _replicator documents, and has no effect there.

Does the heartbeat parameter keep the TCP connection alive?

No — heartbeat is an application-layer keepalive that makes the source emit a newline on the _changes stream at the given interval so proxies and NAT gateways don’t drop an idle feed. TCP keep-alive is a separate transport-layer setting configured through socket_options. Set heartbeat shorter than the shortest idle timeout anywhere on the network path.

Can I convert a running one-shot job into a continuous one?

A one-shot job that has reached completed is finished and will not restart on its own, but you can promote it. Update the document to continuous: true (re-reading its _rev first, as the controller’s upsert does) so the scheduler picks up the change and enqueues a continuous job, or delete and re-create the document. Either way the checkpoint is preserved through use_checkpoints, so it resumes from the last sequence rather than replaying.

Which mode should I default to for a mobile or IoT fleet?

Filtered continuous is usually the right default: a server-side filter or Mango selector scopes the feed to the partition a device cares about, keeping convergence within seconds while cutting bytes and conflict surface on metered links. Fall back to a cron-driven one-shot sweep for nodes that are power-constrained or only intermittently reachable, where controlling when the link is used matters more than sub-minute freshness.
