`_replicator` Configuration & Sync Pipeline Management in CouchDB

CouchDB’s _replicator database is the declarative control plane for distributed data synchronization. Where the _replicate HTTP endpoint starts a transient job that does not survive a restart, _replicator is a document-driven state machine that the CouchDB cluster continuously schedules, supervises, and recovers. For teams building edge telemetry collectors, mobile backend sync layers, or Python data pipelines, this shifts replication from a fire-and-forget operational task to a managed, observable service that survives node restarts, cluster membership changes, and prolonged network partitions. This guide covers the full lifecycle of a production sync pipeline built on _replicator: how CouchDB implements the underlying scheduler, how to detect and observe replication state at fleet scale, how to deploy replication jobs deterministically, how to externalize routing into configuration, and how to harden the whole system against the failure modes that show up only in the field (clock skew, checkpoint drift, connection-pool exhaustion, and replication storms). The audience is senior distributed-systems engineers who need replication to behave predictably across intermittently connected, geographically dispersed clusters, within the bounds of CouchDB’s eventually-consistent model.

The _replicator database is the control plane: documents declare intent, the scheduler supervises workers, and the scheduler endpoints feed a Python loop that reconciles cluster state back to the desired set.

Core Concept & CouchDB Mechanics

A replication job in modern CouchDB (2.x/3.x) is not a live HTTP request — it is a JSON document you PUT into the _replicator system database. The CouchDB cluster runs a distributed scheduler (the successor to the older single-node replicator) that watches that database, validates each document, computes a stable _replication_id from the job parameters, and instantiates a supervised worker. Each worker consumes the source _changes feed, calls _revs_diff on the target to discover which revisions are missing, transfers only those revisions through _bulk_docs, and periodically writes checkpoints into _local/ documents on both source and target so an interrupted job resumes from the last committed sequence rather than replaying from zero.

One worker cycle: _changes → _revs_diff → fetch missing revs → _bulk_docs → _local checkpoints on both ends.

The document is authored by you; a companion set of _replication_* fields is written back by the CouchDB cluster and must never be set by hand. The core input fields — source, target, create_target, continuous, doc_ids, filter, and selector — define the data-flow boundaries and worker instantiation parameters, while user_ctx binds the job to a security context whose roles must satisfy the target database’s ACLs. Misconfiguring these — omitting an authentication context, setting mutually exclusive selectors (doc_ids, filter, and selector cannot coexist), or pointing at a non-existent design document — produces immediate worker failures or, worse, silent checkpoint corruption. The complete field-by-field contract, including which fields are read-only and how the scheduler interprets each one, is documented in the _replicator Document Schema; treat that schema as the canonical validation target before any document reaches the CouchDB cluster.
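The field rules above can be checked before a document ever reaches the cluster. The following is a minimal sketch, not the schema itself: the function name validate_replication_doc and its message strings are hypothetical, and it covers only the three rules the paragraph names (required endpoints, mutually exclusive scoping fields, and server-owned _replication_* fields).

```python
"""Pre-flight checks for a _replicator document (illustrative, not exhaustive)."""

# Fields the scheduler writes back; a client-authored document should not carry them.
SERVER_OWNED_PREFIX = "_replication_"
# At most one of these scoping mechanisms may appear in a single document.
EXCLUSIVE_SCOPES = ("doc_ids", "filter", "selector")


def validate_replication_doc(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the doc passed."""
    problems = []
    for required in ("source", "target"):
        if not doc.get(required):
            problems.append(f"missing required field: {required}")
    scopes = [name for name in EXCLUSIVE_SCOPES if name in doc]
    if len(scopes) > 1:
        problems.append(f"mutually exclusive fields set together: {', '.join(scopes)}")
    server_owned = sorted(k for k in doc if k.startswith(SERVER_OWNED_PREFIX))
    if server_owned:
        problems.append(f"server-owned fields present: {', '.join(server_owned)}")
    return problems
```

Running a check like this in CI or inside the reconciler keeps obviously invalid documents out of the scheduler; it does not replace validation against the full schema.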

Because MVCC drives the whole model, replication never overwrites in place. When concurrent writes reach the same document ID on disconnected nodes, CouchDB preserves every divergent revision in the document’s revision tree and later designates a deterministic winner (highest generation, then lexicographically highest revision hash) purely as a read-time convenience — the losing branches are retained until an application explicitly resolves them. The way those branches form and how the winner is chosen is governed by revision tree mechanics, and the pipeline you build on _replicator is responsible for surfacing those conflicts rather than assuming the scheduler resolved them. Replication moves revisions; it does not merge them.
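To make the winner rule concrete, here is a small sketch that ranks leaf revisions the way the paragraph describes: highest generation first, then the lexicographically highest hash. It also ranks live leaves above deleted ones, which is an assumption about CouchDB's behavior worth verifying against your version. The function names and the (rev, deleted) tuple shape are hypothetical; the code models the ordering only and does not call CouchDB.

```python
def rank(leaf):
    """Sort key for one leaf revision given as (rev, deleted), e.g. ('3-a1b2', False)."""
    rev, deleted = leaf
    generation, _, rev_hash = rev.partition("-")
    # Assumption: live leaves outrank deleted ones; then generation; then hash.
    return (not deleted, int(generation), rev_hash)


def resolve(leaves):
    """Return (winning_rev, losing_revs) for a list of leaf revisions."""
    ordered = sorted(leaves, key=rank, reverse=True)
    winner = ordered[0][0]
    losers = [rev for rev, _ in ordered[1:]]
    return winner, losers
```

The losing revisions are exactly what a pipeline should surface: they remain in the revision tree until application code resolves them.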

Scheduler states and the observable surface

Each job exposes a lifecycle through the two scheduler endpoints and, for terminal outcomes, through the _replication_state field on its document. States transition through initializing, running, pending (queued until the scheduler gives the job a slot), crashing (a transient error that will be retried with exponential backoff), completed (a one-shot job that finished), and failed (permanent; the scheduler gave up). The _scheduler/docs endpoint can additionally report error for a document that could not yet be turned into a job, which is also retried. The _scheduler/docs endpoint reports state from the perspective of each _replicator document, while _scheduler/jobs reports the live worker view including the source/target, the checkpointed sequence, and per-worker error history. These two endpoints, rather than the generic _active_tasks listing (which shows only jobs running at that moment), are the authoritative observability surface for a document-driven pipeline.
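One way to act on these states is a lookup table that maps each one to an operational response. The sketch below is illustrative: the Action names and the mapping are assumptions about a reasonable policy, not something CouchDB defines, and any state missing from the table is surfaced as UNKNOWN rather than guessed at.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"          # healthy or finished; nothing to do
    WATCH = "watch"        # not running yet; alert only if it lingers
    RETRYING = "retrying"  # the scheduler is backing off and will retry
    PAGE = "page"          # terminal; needs a human
    UNKNOWN = "unknown"    # a state this table does not know about


# Hypothetical policy keyed by the state string the scheduler reports.
STATE_ACTIONS = {
    "initializing": Action.WATCH,
    "pending": Action.WATCH,
    "running": Action.NONE,
    "completed": Action.NONE,
    "crashing": Action.RETRYING,
    "failed": Action.PAGE,
}


def action_for(state):
    """Map a scheduler state string to the response this policy prescribes."""
    return STATE_ACTIONS.get(state, Action.UNKNOWN)
```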

Production Detection / Configuration Pipeline

At fleet scale you cannot eyeball job health in Fauxton. The pipeline needs a supervisory loop that polls _scheduler/docs, classifies each job, and emits metrics your alerting stack can act on. The loop below reads the scheduler view, tallies jobs by state, flags every job that is in an unhealthy state, and prints Prometheus-style gauges. It is deliberately dependency-light (only requests), so it drops into any sidecar container or cron job.

#!/usr/bin/env python3
"""Poll CouchDB _scheduler/docs and emit replication health metrics."""
import sys
import time
import logging
from collections import Counter

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("repl-monitor")

COUCH = "http://127.0.0.1:5984"
AUTH = ("replicator_svc", "change-me")   # inject from secrets in production
CRASH_ALERT_SECONDS = 300                 # intended paging threshold; not yet enforced by this loop


def scheduler_docs(session):
    """Return every replication job the scheduler currently tracks."""
    docs, skip, limit = [], 0, 100
    while True:
        r = session.get(f"{COUCH}/_scheduler/docs",
                        params={"limit": limit, "skip": skip}, timeout=15)
        r.raise_for_status()
        page = r.json().get("docs", [])
        docs.extend(page)
        if len(page) < limit:
            return docs
        skip += limit


def classify(docs):
    """Bucket jobs by state and collect ones in crashing, error, or failed."""
    states = Counter(d.get("state", "unknown") for d in docs)
    # "error" marks a document that could not yet be turned into a job.
    unhealthy = [d for d in docs
                 if d.get("state") in ("crashing", "error", "failed")]
    return states, unhealthy


def emit(states, unhealthy):
    for state, count in sorted(states.items()):
        # One gauge per state is trivially graphable and alertable.
        print(f'couchdb_replication_jobs{{state="{state}"}} {count}')
    print(f"couchdb_replication_jobs_unhealthy {len(unhealthy)}")
    for d in unhealthy:
        info = d.get("info") or {}
        log.warning("unhealthy job doc_id=%s state=%s error=%s",
                    d.get("doc_id"), d.get("state"), info.get("error"))


def poll_forever(interval=30):
    with requests.Session() as s:
        s.auth = AUTH
        while True:
            try:
                states, unhealthy = classify(scheduler_docs(s))
                emit(states, unhealthy)
            except requests.RequestException as exc:
                log.error("scheduler poll failed: %s", exc)
            time.sleep(interval)


if __name__ == "__main__":
    try:
        poll_forever()
    except KeyboardInterrupt:
        sys.exit(0)

Polling _scheduler/docs on a fixed interval is enough for coarse dashboards. For lower-latency reaction you can also subscribe to the _replicator database’s own _changes feed in continuous mode, with one caveat: since the scheduling replicator arrived in CouchDB 2.1, the scheduler writes _replication_state back onto the document only for terminal outcomes (completed and failed). A _changes listener therefore sees document creation, edits, deletion, and terminal transitions in near real time, while transient states such as pending and crashing stay visible only through the scheduler endpoints. Pushing those transitions to alerting or auto-remediation is the domain of Async Monitoring & Webhooks, which covers the webhook fan-out, deduplication, and back-pressure concerns that a naive listener misses. The metrics worth emitting from either path are the same: job count per state, per-job checkpointed sequence lag, doc_write_failures from the replication statistics, and time-in-crashing. Together those four signals catch most pipeline regressions before users notice stale data.
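Time-in-crashing can be approximated from the last_updated field that _scheduler/docs returns for each job. The sketch below assumes that field is an ISO 8601 UTC timestamp such as 2017-04-29T05:01:37Z and that it marks when the job last changed state; whether it also moves on every retry is worth confirming on your version. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def seconds_in_state(doc, now=None):
    """Seconds since the job's last_updated stamp, or None if it is missing."""
    stamp = doc.get("last_updated")
    if not stamp:
        return None
    # Assumed format: "YYYY-MM-DDTHH:MM:SSZ", always UTC.
    updated = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - updated).total_seconds()


def crashing_too_long(docs, threshold_seconds=300, now=None):
    """Return the crashing jobs whose last update is older than the threshold."""
    stuck = []
    for doc in docs:
        if doc.get("state") != "crashing":
            continue
        age = seconds_in_state(doc, now)
        if age is not None and age > threshold_seconds:
            stuck.append(doc)
    return stuck
```

Feeding the output of a scheduler poll through crashing_too_long gives a paging condition that ignores brief, self-healing crashes.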

Deterministic Resolution Architecture

The defining discipline of a _replicator-driven pipeline is that replication documents are infrastructure code: declared, versioned, and reconciled, never hand-poked. The reconciler’s job is idempotency: given a desired set of replication jobs, converge the CouchDB cluster to exactly that set regardless of how many times it runs or what state a previous run left behind. The key is a deterministic _id derived from the job’s identity (not CouchDB’s computed _replication_id), so a redeploy after a pod restart or a cluster membership change updates the existing document instead of creating a duplicate job document.

#!/usr/bin/env python3
"""Idempotently reconcile a desired set of _replicator jobs into a cluster."""
import sys
import json
import hashlib
import logging
from dataclasses import dataclass, field, asdict

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("repl-reconciler")

COUCH = "http://127.0.0.1:5984"
AUTH = ("replicator_svc", "change-me")


@dataclass
class ReplicationJob:
    """A declarative description of one replication job."""
    name: str                              # stable human identity, e.g. 'edge01->central'
    source: str
    target: str
    continuous: bool = True
    create_target: bool = False
    selector: dict | None = None           # server-side filter; mutually exclusive with filter/doc_ids
    user_ctx: dict = field(default_factory=lambda: {"roles": ["_replicator"]})

    def doc_id(self) -> str:
        # Deterministic, human-readable _id keeps redeploys idempotent.
        digest = hashlib.sha1(self.name.encode()).hexdigest()[:8]
        return f"rep_{self.name.replace('/', '_')}_{digest}"

    def body(self) -> dict:
        doc = {k: v for k, v in asdict(self).items()
               if k != "name" and v is not None}
        return doc


class Reconciler:
    """Converge the _replicator database to a desired job set."""

    def __init__(self, base=COUCH, auth=AUTH):
        self.base = base
        self.s = requests.Session()
        self.s.auth = auth

    def _current_rev(self, doc_id):
        r = self.s.head(f"{self.base}/_replicator/{doc_id}", timeout=10)
        if r.status_code == 404:
            return None
        r.raise_for_status()
        return r.headers["ETag"].strip('"')      # existing _rev for an update

    def upsert(self, job: ReplicationJob):
        doc_id = job.doc_id()
        body = job.body()
        rev = self._current_rev(doc_id)
        if rev:
            body["_rev"] = rev                    # update in place, no duplicate worker
        r = self.s.put(f"{self.base}/_replicator/{doc_id}",
                       data=json.dumps(body),
                       headers={"Content-Type": "application/json"}, timeout=15)
        if r.status_code == 409:
            # Lost a race with a concurrent reconciler; caller should retry.
            log.warning("conflict upserting %s, retrying next pass", doc_id)
            return False
        r.raise_for_status()
        log.info("reconciled %s (%s)", doc_id, "updated" if rev else "created")
        return True

    def reconcile(self, desired: list[ReplicationJob]):
        for job in desired:
            self.upsert(job)


def main():
    desired = [
        ReplicationJob(
            name="edge01->central",
            source="http://replicator_svc:pw@edge01.local:5984/telemetry",
            target="http://replicator_svc:pw@central.prod:5984/telemetry",
            selector={"type": {"$eq": "reading"}},
        ),
    ]
    Reconciler().reconcile(desired)


if __name__ == "__main__":
    try:
        main()
    except requests.RequestException as exc:
        log.error("reconcile failed: %s", exc)
        sys.exit(1)

Two design choices carry the whole pattern. First, the HEAD probe fetches the current _rev so the PUT updates the existing document. Without it CouchDB rejects the write with a 409, and the tempting workaround of minting a fresh _id is exactly how a second document for the same source and target ends up in the database. Second, the reconciler treats a 409 as a benign race to be retried on the next pass, not a fatal error, because in a multi-writer control plane two reconcilers may briefly contend. Whether a given job should even be continuous is itself a design decision with real cost trade-offs; the mechanics and break-even analysis of persistent listeners versus scheduled one-shot runs are covered in Continuous vs One-Way Sync, and topology-wide choices about where workers live are governed by the broader sync topology models.
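If waiting for the next full pass is too slow, the 409 race can be absorbed with a short bounded retry around the upsert call. This is a sketch under assumptions: upsert_with_retry is a hypothetical helper, it expects a callable that returns True on success and False on conflict (the contract Reconciler.upsert follows above), and the delay values are placeholders.

```python
import time


def upsert_with_retry(upsert, job, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call upsert(job) until it returns True or the attempts run out.

    A False result is treated as a lost race; the wait doubles after each
    failed attempt. Returns True on success, False if every attempt conflicted.
    """
    for attempt in range(attempts):
        if upsert(job):
            return True
        if attempt < attempts - 1:          # no point sleeping after the last try
            sleep(base_delay * (2 ** attempt))
    return False
```

Because each call to upsert re-reads the current _rev, a retry picks up whatever the competing writer committed. The sleep parameter is injectable so the helper can be tested without real delays.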

Declarative Automation & Rule Routing

Hard-coding a ReplicationJob list works for a handful of jobs; it collapses when a fleet of hundreds of edge nodes each needs slightly different routing. The scalable move is to externalize the job set into declarative configuration and let a small routing layer expand it into _replicator documents. A YAML manifest describes intent — which database classes replicate where, in which direction, with which filter — and the routing engine matches each source database against a rule and materializes the concrete document.

# replication-routes.yaml — intent, not implementation
defaults:
  continuous: true
  user_ctx: { roles: ["_replicator"] }

routes:
  - match: "telemetry_*"          # every per-device telemetry db
    direction: push               # edge -> central only
    target_cluster: "central.prod:5984"
    selector: { type: { $eq: "reading" } }

  - match: "config"               # shared config db
    direction: pull               # central -> edge only
    source_cluster: "central.prod:5984"

  - match: "orders"               # mobile backend, bidirectional
    direction: both
    peer_cluster: "central.prod:5984"

The routing layer walks the local databases, applies the first matching rule, and emits one (or, for both, two) _replicator documents through the same idempotent reconciler shown above. Because the rules live in configuration, an operator can retarget an entire class of databases, tighten a selector, or flip a direction without touching Python or redeploying the sync worker — and can roll the change back just as fast. The decision tree the router walks — match pattern, resolve direction, apply defaults, decide continuous vs one-shot, emit document(s) — is worth rendering explicitly.

The router matches each database against the manifest, resolves direction and defaults, then emits _replicator documents through the same idempotent reconciler — so a manifest edit is a reviewable diff against cluster state.
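The routing step might look like the sketch below. It takes the manifest as an already-parsed dictionary (for example the result of a YAML loader) and a list of local database names, and returns plain document bodies. Several things here are assumptions rather than part of the manifest format: the function name expand_routes, the http scheme, the local base URL, and the convention that a database keeps the same name on both clusters.

```python
from fnmatch import fnmatchcase


def expand_routes(db_names, routes, defaults, local="http://127.0.0.1:5984"):
    """Expand routing rules into _replicator document bodies (first match wins)."""
    docs = []
    for db in db_names:
        route = next((r for r in routes if fnmatchcase(db, r["match"])), None)
        if route is None:
            continue                      # no rule: this database is not replicated
        remote_host = (route.get("target_cluster")
                       or route.get("source_cluster")
                       or route.get("peer_cluster"))
        local_url = f"{local}/{db}"
        remote_url = f"http://{remote_host}/{db}"   # assumed scheme and db name
        pairs = {
            "push": [(local_url, remote_url)],
            "pull": [(remote_url, local_url)],
            "both": [(local_url, remote_url), (remote_url, local_url)],
        }[route["direction"]]
        for source, target in pairs:
            doc = {**defaults, "source": source, "target": target}
            if "selector" in route:
                doc["selector"] = route["selector"]
            docs.append(doc)
    return docs
```

Each returned body can then be wrapped in a job object and handed to the reconciler; credentials are deliberately left out of the URLs and would be injected separately.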

This separation also enables safe experimentation: because every emitted document carries a deterministic _id, a change to the manifest is a diff against the current cluster state, so you can preview exactly which jobs will be created, updated, or (by absence) pruned before applying. That property — configuration as the single source of truth, cluster state as a reconciled projection of it — is what turns replication from a pile of imperative API calls into a reviewable, revertible deployment artifact.
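The preview described above reduces to a set difference over document ids plus a field comparison. A minimal sketch, assuming both sides are dictionaries keyed by _id, and using the hypothetical helper names desired_fields and plan:

```python
def desired_fields(doc):
    """Drop _id, _rev and the scheduler's _replication_* fields before comparing."""
    return {k: v for k, v in doc.items() if not k.startswith("_")}


def plan(desired, current):
    """Compare desired and current {_id: body} maps and return the pending changes."""
    create = sorted(desired.keys() - current.keys())
    prune = sorted(current.keys() - desired.keys())
    update = sorted(
        doc_id for doc_id in desired.keys() & current.keys()
        if desired_fields(desired[doc_id]) != desired_fields(current[doc_id])
    )
    return {"create": create, "update": update, "prune": prune}
```

Printing the plan in a pull-request check gives reviewers the exact list of jobs a manifest edit would touch before anything is written to the cluster.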

Failure Modes & Escalation Paths

Production replication fails in a small number of characteristic ways, and each has a distinct detection signal and fallback order. Treating them generically is what turns a recoverable blip into a fleet-wide outage.

Connection-pool exhaustion under high churn. A worker that needs more sockets than its http_connections setting allows will stall behind its own back-pressure. The signal is rising pending-state counts alongside flat throughput. The fallback order is: widen http_connections on jobs that genuinely need it, then cap total concurrent jobs via the scheduler’s max_jobs, then shard high-volume databases so no single worker dominates the pool.
Checkpoint drift and rewind. If source and target disagree on the last committed sequence — usually after a target restore or a compaction that outran a checkpoint — the worker rewinds and re-transfers a large window, spiking I/O. The signal is a sudden drop in the checkpointed sequence reported by _scheduler/jobs. The fallback is to let the rewind complete on a throttled job rather than killing and recreating it, which would discard the checkpoint entirely.
Clock skew across nodes. Any pipeline that leans on timestamps for tie-breaking is only as trustworthy as the fleet’s NTP discipline; skew silently corrupts last-write-wins ordering and can manifest as conflicts that “resolve” to stale data. Detection is out-of-band (monitor NTP offset per node), and the fallback is to prefer CouchDB’s generation-and-hash winner over wall-clock comparison whenever offsets exceed your tolerance.
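Schema and authentication faults. A malformed document, such as a wrongly typed field or a duplicate of an existing job, goes straight to failed, which is permanent and never retried. A missing filter design document, a revoked replicator role, or a selector referencing a renamed field may instead leave the job in error or crashing, or running while matching nothing, and the scheduler keeps retrying even though no retry can succeed until the configuration is fixed. Either way these are configuration bugs, and the escalation is to alert a human, not to wait on a loop that cannot succeed. Distinguishing this permanent class from transient network errors, and encoding the right backoff and give-up policy for each, is the entire subject of Error Handling & Retry Logic.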
Replication storms. A bulk overwrite, an accidental fan-out topology, or a reconciler bug can flood the scheduler with jobs and the network with revision transfers. The signal is a simultaneous jump in job count and _bulk_docs volume across many workers. The fallback order is to trip a circuit breaker in the reconciler, freeze new job creation, and drain the backlog before re-enabling.

The unifying rule is that retries must distinguish transient from permanent. A network timeout, a 503, or a target that is briefly unreachable warrants exponential backoff. A 401, a 404 on a design document, or a validation rejection does not — retrying it only burns connection budget while the underlying misconfiguration persists.
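The transient-versus-permanent rule can be written down as a status classifier plus a backoff schedule. The sketch below is one possible policy, not CouchDB's internal one: the set of retryable status codes, the full-jitter strategy, and the names is_transient and backoff_delay are all assumptions chosen for illustration.

```python
import random

# Assumed retryable statuses: timeouts, throttling, and server-side failures.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def is_transient(status):
    """True if a failure is worth retrying.

    status is the HTTP status code, or None when no response arrived at all
    (timeout, connection refused). Anything else, including 401 and 404, is
    treated as permanent and should be escalated instead of retried.
    """
    return status is None or status in TRANSIENT_STATUSES


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter exponential backoff: a random delay up to min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * 2 ** attempt)
```

The jitter spreads retries from many edge nodes over time, which matters when a whole fleet loses the same upstream link at once.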

Each failure mode has its own detection signal, transient-versus-permanent classification, first fallback, and escalation trigger — the escalation policy in one glance.

Operational Hardening

Hardening a _replicator pipeline is mostly about bounding blast radius and shortening detection time. Four practices carry most of the weight.

Bound concurrency at the scheduler. Set max_jobs and max_churn in the [replicator] config so a runaway reconciler cannot instantiate unbounded workers; the scheduler will fair-share the allowed slots across ready jobs and hold the rest in pending. Tune connection_timeout, http_connections, and retries_per_request to the actual network profile — wide pools and long timeouts on fat central links, tight pools and short timeouts on lossy cellular edges — rather than accepting one global default for heterogeneous nodes.
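These settings can live in the ini files or be applied at runtime through the node-level configuration API (PUT /_node/{node}/_config/{section}/{key}, which takes the new value as a JSON string). The sketch below only builds the requests so they can be reviewed before being sent; config_requests is a hypothetical helper, the base URL is a placeholder, and the values shown in the usage note are illustrative rather than recommendations.

```python
import json


def config_requests(settings, base="http://127.0.0.1:5984", node="_local"):
    """Build (url, body) pairs for setting [replicator] options on one node.

    Each body is the option's value encoded as a JSON string, which is the
    form the configuration endpoint expects.
    """
    return [
        (f"{base}/_node/{node}/_config/replicator/{key}", json.dumps(str(value)))
        for key, value in settings.items()
    ]
```

A caller would loop over config_requests({"max_jobs": 200, "max_churn": 20}) and issue session.put(url, data=body) for each pair. Configuration applied this way is per node, so the same calls need to be repeated against every node in the cluster.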

Cache the observability reads. A supervisory loop that fetches _scheduler/jobs for every job on every tick becomes its own load source at fleet scale. Cache the scheduler view under a short TTL and invalidate it early on _changes events from _replicator, so that between expiries you pay for a full scan only when a document actually changed. This keeps the monitor’s footprint flat as the fleet grows.
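A cache of this kind needs very little machinery. The sketch below is an illustrative single-value TTL cache, not a library API: TTLCache and its loader argument are hypothetical, and it is not thread-safe as written.

```python
import time


class TTLCache:
    """Cache one loader result for ttl seconds; invalidate() forces a reload."""

    def __init__(self, loader, ttl=30.0, clock=time.monotonic):
        self._loader = loader          # callable that performs the expensive read
        self._ttl = ttl
        self._clock = clock            # injectable for tests
        self._value = None
        self._expires_at = None        # None means nothing is cached

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        """Drop the cached value, e.g. when a _changes event arrives."""
        self._expires_at = None
```

In a monitor, the loader would be the function that scans the scheduler endpoint, and the _changes listener would call invalidate() whenever a _replicator document changes.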

Trip circuit breakers before the CouchDB cluster does. Instrument conflict rate, resolution-failure ratio, unhealthy-job count, and checkpoint lag as first-class metrics. When any crosses a threshold, the reconciler should stop creating new jobs and the alerting path should page — a breaker that fires at the pipeline layer contains a storm that would otherwise cascade through the scheduler. Conflicts that the pipeline surfaces but cannot safely auto-merge belong in a holding area rather than a retry loop; route them to manual review sync queues so a domain expert adjudicates instead of the pipeline guessing.
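A pipeline-level breaker can be as simple as a set of thresholds and a latch. In the sketch below the class name Breaker, the metric names, and the limits are all hypothetical; the one design choice it illustrates is that the breaker stays open until someone resets it, so a brief dip in the metrics does not silently re-enable job creation mid-incident.

```python
class Breaker:
    """Latching circuit breaker over named metrics with upper limits."""

    def __init__(self, limits):
        self.limits = dict(limits)     # metric name -> maximum tolerated value
        self.reasons = []              # metrics that tripped the breaker

    @property
    def is_open(self):
        return bool(self.reasons)

    def observe(self, metrics):
        """Record one metrics sample; return True if the breaker is open."""
        if not self.is_open:           # once open, stay open until reset()
            self.reasons = sorted(
                name for name, limit in self.limits.items()
                if metrics.get(name, 0) > limit
            )
        return self.is_open

    def reset(self):
        """Close the breaker after a human has confirmed recovery."""
        self.reasons = []
```

The reconciler would check is_open before creating any new job, and the alerting path would include reasons in the page so the responder knows which metric tripped.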

Keep an emergency resync runbook. When a checkpoint is unrecoverable or a target was restored from a stale backup, the controlled recovery is: delete the affected _replicator document to tear down the worker cleanly after in-flight writes commit, verify the target’s true high sequence, recreate the job so it re-establishes a fresh checkpoint, and let it re-scan from a known since_seq in a throttled window. Deleting the document is the supported way to stop a _replicator job; cancel: true belongs to the transient _replicate endpoint. Rehearse this procedure before you need it; the middle of an incident is the wrong time to discover that a hand-set _rev corrupted your reconciler’s idempotency.

For the underlying wire protocol — how _revs_diff, _bulk_docs, and checkpointing interact, and the exact HTTP streaming semantics workers rely on — the authoritative reference is the Apache CouchDB Replication Protocol documentation.

Conclusion

Treating _replicator as a declarative control plane — documents as versioned infrastructure code, the CouchDB cluster state as a reconciled projection of a configuration manifest, and the scheduler endpoints as a first-class observability surface — is what turns CouchDB replication from an operational chore into a predictable service. The pipeline’s reliability rests on four disciplines working together: strict schema compliance so the scheduler instantiates workers deterministically, idempotent reconciliation so redeploys never spawn duplicate jobs, network-aware tuning so workers stay efficient across heterogeneous links, and retry logic that distinguishes transient faults from permanent misconfiguration. Get those right and eventual consistency stops being a liability to apologize for and becomes an observable, partition-tolerant synchronization primitive that scales from a constrained sensor to a globally distributed mobile fleet — with replication management codified into the deployment lifecycle rather than improvised during incidents.
