The `_replicator` Document Schema: Stateful Sync Control for Distributed Systems

When a replication job must survive node restarts, leader elections, and hours of edge-node disconnection, a fire-and-forget POST /_replicate call is the wrong tool: it creates a transient job that is not persisted and is lost when the node restarts. The _replicator database solves exactly that. Each job is a persistent JSON document that the CouchDB cluster scheduler continuously evaluates, instantiates, supervises, and recovers. This page is the canonical field-by-field contract for that document: every authored field, every read-only field CouchDB writes back, their types and defaults, and the deployment patterns that make replication deterministic across pod restarts and fleet rollouts. It is the schema reference the rest of the _replicator Configuration & Sync Pipeline Management framework builds on: the continuous vs one-way sync decision, async monitoring & webhooks, and error handling & retry logic all operate on documents shaped by the contract below.

Configuration Schema & Required Parameters

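Every replication job in CouchDB is represented by a single document in the _replicator database. The schema below aligns with CouchDB 3.x production standards and reflects the runtime expectations of the Erlang-based replication scheduler. Fields above the blank line are authored by you; the _replication_* fields below it are written back by the CouchDB cluster and must never be set by hand. The write-back values shown are illustrative: since CouchDB 2.1 the scheduler updates the document only for the terminal states completed and failed, and live states such as running are reported by _scheduler/docs.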

{
  "_id": "rep_edge_node_01_to_central",
  "_rev": "1-abc123def456",
  "source": "https://edge-node-01.local:5984/iot_telemetry",
  "target": "https://central-cluster.prod:5984/iot_telemetry",
  "create_target": false,
  "continuous": true,
  "filter": "app/by_device_type",
  "query_params": {"type": "temperature"},
  "user_ctx": {
    "name": "replicator_svc",
    "roles": ["_admin"]
  },
  "owner": "pipeline_automation",

  "_replication_id": "a1b2c3d4e5f6...",
  "_replication_state": "running",
  "_replication_state_time": "2026-05-29T12:00:00Z",
  "_replication_stats": {
    "doc_write_failures": 0,
    "docs_read": 1420,
    "docs_written": 1418,
    "missing_revisions_found": 2,
    "revisions_checked": 1420
  }
}

The single hard structural rule: doc_ids, filter, and selector are mutually exclusive — the scheduler rejects a document that sets more than one. Everything else is a matter of defaults and effect:

Field	Type	Default	Effect
`_id`	string	auto	Job identity. Use a deterministic value (derived from a node serial or route name) so re-applying the document is idempotent across restarts and leader elections.
`source`	string / object	required	Origin database: a fully qualified URL (expected since CouchDB 2.0), or an object carrying `url`, `headers`, and `auth` for cross-cluster authentication.
`target`	string / object	required	Destination database, same accepted forms as `source`.
`create_target`	boolean	`false`	When `true`, the scheduler creates the target database if absent — useful for lazily provisioned partitions during fleet scaling.
`continuous`	boolean	`false`	`true` attaches a persistent `_changes` listener; `false` runs one-shot to the source’s current update sequence, then stops. Governs the continuous vs one-way sync trade-off directly at the document level.
`doc_ids`	array	unset	Replicate only these document IDs. Mutually exclusive with `filter`/`selector`.
`filter`	string	unset	`ddoc/filtername` JavaScript filter function. Flexible but runs per-document on the source.
`selector`	object	unset	Mango selector evaluated server-side. Cheaper than a JS `filter`, still mutually exclusive with it.
`query_params`	object	unset	Parameters passed to a `filter` function; only meaningful alongside `filter`.
`user_ctx`	object	unset	Security context (`name`, `roles`) the job runs under. Roles must satisfy the target’s ACLs; `_admin` is required to write to protected databases. This is part of the broader security boundaries in replication model.
`owner`	string	auth user	Name of the user who owns the job. CouchDB fills it in from the authenticated request that created the document, and only server admins can set it to a different user.
`http_connections`	int	`20`	Max connections per replication. Lower it on constrained edges to cap memory.
`worker_processes`	int	`4`	Parallel worker count per job.
`worker_batch_size`	int	`500`	Revisions per batch. Smaller batches reduce peak memory on low-power gateways.
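
The object form of `source` and `target` described in the table can be sketched as follows. This is a minimal illustration, not a definitive implementation: `endpoint` and `build_job` are hypothetical helper names, the credentials are placeholders, and the `auth.basic` shape is an assumption that holds for recent CouchDB 3.x releases (check your version's documentation before relying on it).

```python
from typing import Any, Dict, Optional


def endpoint(url: str, username: Optional[str] = None,
             password: Optional[str] = None,
             headers: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Build the object form of a source/target endpoint (hypothetical helper)."""
    obj: Dict[str, Any] = {"url": url}
    if headers:
        obj["headers"] = dict(headers)
    if username is not None and password is not None:
        # Assumed shape: {"auth": {"basic": {"username": ..., "password": ...}}}
        obj["auth"] = {"basic": {"username": username, "password": password}}
    return obj


def build_job(doc_id: str, source: Any, target: Any,
              **options: Any) -> Dict[str, Any]:
    """Assemble an authored _replicator document from its required fields."""
    job: Dict[str, Any] = {"_id": doc_id, "source": source, "target": target}
    job.update(options)  # e.g. continuous=True, create_target=False
    return job
```

Keeping credentials inside the endpoint object, rather than embedded in the URL string, makes it easier to inject them from the environment at deploy time.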

Fields CouchDB writes back — never author these

_replication_id: a computed hash of the job parameters that the CouchDB cluster uses internally to deduplicate jobs. Observe it via the document or _scheduler/jobs; do not set it. Idempotency comes from your job’s _id, not from this field.
_replication_state: the job’s state as recorded on the document. Since CouchDB 2.1 the scheduler writes only the terminal states here: completed (a one-shot job that finished) and failed (permanent; the document could not be turned into a job, for example because of a malformed endpoint or invalid parameters). The live lifecycle is reported by _scheduler/docs instead: initializing, running, pending (queued or throttled by fair-share limits), and crashing (transient error, retried with exponential backoff), plus the same terminal states.
_replication_state_time: the RFC 3339 timestamp of the last state transition.
_replication_stats: counters (docs_read, docs_written, doc_write_failures, missing_revisions_found, revisions_checked) for pipeline observability.
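
When a document is fetched, edited, and written back, the write-back fields above come along with it. A small sketch of stripping them before re-authoring, assuming every such field shares the `_replication_` prefix; `authored_fields` is a hypothetical helper name:

```python
from typing import Any, Dict


def authored_fields(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a fetched document without CouchDB's write-back fields.

    _id and _rev are kept so the result can be PUT back as a new revision.
    """
    return {k: v for k, v in doc.items()
            if not k.startswith("_replication_")}
```

Filtering on the prefix rather than on a fixed list also drops any write-back field not named in this page.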

To stop a job, delete its _replicator document — that is the supported mechanism. The cancel: true flag belongs to the one-off POST /_replicate API, not to _replicator documents. On deletion the scheduler tears down the worker after committing in-flight writes.

Streaming Detection / Monitoring Setup

A deployed document is only the start; you need to observe what the scheduler did with it. The authoritative surfaces are _scheduler/docs (state from each document’s perspective) and _scheduler/jobs (the live worker view, including the checkpointed sequence and recent event history), rather than the coarser _active_tasks listing. The snippet below reads _scheduler/docs for a single document and reports its current state, so a deploy step can block until the job reaches running before declaring success.

import requests

def replicator_state(base_url: str, doc_id: str, auth: tuple,
                     db: str = "_replicator") -> dict:
    """Return the scheduler's view of one _replicator document.

    Reads _scheduler/docs rather than the document itself: the scheduler
    endpoint reflects the true worker state, including error history, even
    before the _replication_state field is flushed back onto the doc.
    """
    url = f"{base_url}/_scheduler/docs/{db}/{doc_id}"
    resp = requests.get(url, auth=auth, timeout=15)
    resp.raise_for_status()
    body = resp.json()
    return {
        "state": body.get("state"),          # running | pending | crashing | ...
        "info": body.get("info"),            # stats / last error object
        "error_count": body.get("error_count", 0),
        "last_updated": body.get("last_updated"),
    }

if __name__ == "__main__":
    snapshot = replicator_state(
        "http://localhost:5984", "rep_edge_node_01_to_central",
        ("admin", "password"),
    )
    print(f"{snapshot['state']} (errors: {snapshot['error_count']})")

For deep checkpoint inspection — the exact source sequence each worker has committed — poll _scheduler/jobs instead, as detailed in monitoring replication checkpoints via API.
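
As a rough sketch of that checkpoint lookup, the function below pulls one job's progress out of an already-parsed `_scheduler/jobs` response. It assumes the response is an object with a `jobs` array whose entries carry `doc_id` and an `info` object containing `checkpointed_source_seq` and `changes_pending`; `checkpoint_for` is a hypothetical name, and the exact `info` keys should be confirmed against your CouchDB version.

```python
from typing import Any, Dict, Optional


def checkpoint_for(jobs_payload: Dict[str, Any],
                   doc_id: str) -> Optional[Dict[str, Any]]:
    """Extract checkpoint progress for one document from a _scheduler/jobs body.

    Returns None when no live job exists for doc_id (for example, a one-shot
    job that already completed).
    """
    for job in jobs_payload.get("jobs", []):
        if job.get("doc_id") != doc_id:
            continue
        info = job.get("info") or {}  # info may be null before the job runs
        return {
            "checkpointed_source_seq": info.get("checkpointed_source_seq"),
            "changes_pending": info.get("changes_pending"),
        }
    return None
```

The payload would typically come from `requests.get(f"{base_url}/_scheduler/jobs", auth=auth, timeout=15).json()`, using the same `base_url` and `auth` as the snippet above.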

Core Implementation

Treating _replicator documents as infrastructure-as-code means the code that produces them must validate the schema locally, PUT idempotently, and confirm the scheduler accepted the job — never blind-write and hope. The class below builds a document, enforces the mutually-exclusive scoping rule before it ever reaches the CouchDB cluster, deploys it under a deterministic _id, and polls until it reaches a terminal-enough state. It carries transport-level retries and structured logging so it drops straight into a provisioning pipeline or a fleet-rollout job.

#!/usr/bin/env python3
"""Deploy and verify CouchDB _replicator documents as infrastructure-as-code."""
import os
import time
import logging
import requests
from typing import Optional, Dict, Any
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)

# Fields CouchDB writes back — reject them if a caller tries to author them.
_READ_ONLY = {
    "_replication_id", "_replication_state",
    "_replication_state_time", "_replication_stats",
}
# At most one of these may be set on a single document.
_SCOPING = ("doc_ids", "filter", "selector")


class ReplicatorDeployer:
    def __init__(self, base_url: str, username: str, password: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.auth = (username, password)
        # Absorb transient 5xx/429 at the transport layer; 409 is handled
        # explicitly below because it means "document already exists".
        retry = Retry(total=5, backoff_factor=1,
                      status_forcelist=[429, 500, 502, 503, 504])
        self.session.mount("http://", HTTPAdapter(max_retries=retry))
        self.session.mount("https://", HTTPAdapter(max_retries=retry))

    def validate(self, doc: Dict[str, Any]) -> None:
        """Fail fast on schema violations before the cluster ever sees them."""
        for field in ("_id", "source", "target"):
            if not doc.get(field):
                raise ValueError(f"_replicator document requires '{field}'")
        authored_readonly = _READ_ONLY & doc.keys()
        if authored_readonly:
            raise ValueError(f"do not author read-only fields: {authored_readonly}")
        set_scoping = [k for k in _SCOPING if k in doc]
        if len(set_scoping) > 1:
            raise ValueError(
                f"doc_ids/filter/selector are mutually exclusive; got {set_scoping}"
            )

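    def deploy(self, doc: Dict[str, Any]) -> str:
        """Idempotently PUT a _replicator document under its deterministic _id.

        Re-applying an unchanged job is a no-op; a changed body is written as a
        new revision by carrying the current _rev forward.
        """
        self.validate(doc)
        doc_id = doc["_id"]
        url = f"{self.base_url}/_replicator/{doc_id}"

        # Fields CouchDB manages; 'owner' is compared only if the caller set it.
        managed = {"_rev"} if "owner" in doc else {"_rev", "owner"}

        def authored(body: Dict[str, Any]) -> Dict[str, Any]:
            return {k: v for k, v in body.items()
                    if k not in managed and not k.startswith("_replication_")}

        existing = self.session.get(url)
        if existing.status_code == 200:
            current = existing.json()
            if authored(current) == authored(doc):
                # Nothing changed: skip the PUT so no new revision is created.
                logging.info("%s unchanged; skipping PUT", doc_id)
                return doc_id
            doc = {**doc, "_rev": current["_rev"]}

        resp = self.session.put(url, json=doc)
        if resp.status_code == 409:
            # Lost a race with a concurrent deployer; re-read and retry once.
            existing = self.session.get(url)
            existing.raise_for_status()
            doc = {**doc, "_rev": existing.json()["_rev"]}
            resp = self.session.put(url, json=doc)
        resp.raise_for_status()
        logging.info("Deployed _replicator document %s", doc_id)
        return doc_id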

    def wait_until_running(self, doc_id: str, timeout: float = 60.0,
                           interval: float = 2.0) -> str:
        """Block until the scheduler reports a non-transient state.

        Returns the terminal-ish state so a rollout can gate on it. 'failed'
        raises so a bad job aborts the deploy instead of silently draining.
        """
        deadline = time.time() + timeout
        url = f"{self.base_url}/_scheduler/docs/_replicator/{doc_id}"
        while time.time() < deadline:
            resp = self.session.get(url)
            if resp.status_code == 200:
                state = resp.json().get("state")
                logging.info("%s -> %s", doc_id, state)
                if state in ("running", "completed"):
                    return state
                if state == "failed":
                    raise RuntimeError(f"{doc_id} entered 'failed' state")
            time.sleep(interval)
        raise TimeoutError(f"{doc_id} did not reach running within {timeout}s")

    def teardown(self, doc_id: str) -> None:
        """Stop a job the supported way: delete its _replicator document."""
        url = f"{self.base_url}/_replicator/{doc_id}"
        rev = self.session.get(url).json().get("_rev")
        if rev:
            self.session.delete(url, params={"rev": rev}).raise_for_status()
            logging.info("Tore down %s", doc_id)


if __name__ == "__main__":
    deployer = ReplicatorDeployer(
        base_url=os.getenv("COUCHDB_URL", "http://localhost:5984"),
        username=os.getenv("COUCHDB_USER", "admin"),
        password=os.getenv("COUCHDB_PASS", "password"),
    )
    job = {
        "_id": "rep_edge_node_01_to_central",
        "source": os.getenv("SYNC_SOURCE", "http://edge-node-01:5984/iot_telemetry"),
        "target": os.getenv("SYNC_TARGET", "http://core-cluster:5984/iot_telemetry"),
        "continuous": True,
        "create_target": True,
        "user_ctx": {"name": "replicator_svc", "roles": ["_admin"]},
    }
    deployer.deploy(job)
    print("state:", deployer.wait_until_running(job["_id"]))

The validate method is the load-bearing part: catching a mutually-exclusive scoping violation or an accidentally authored _replication_state locally turns a silent scheduler rejection into an obvious deploy-time error. Keeping deploy, wait_until_running, and teardown on one client is what lets a fleet-rollout tool provision, gate, and roll back a job through one interface.

Strategy Variants & Trade-offs

The most consequential schema decision on any real deployment is how you scope the document — how much of the source database a single job carries. The three mutually-exclusive scoping fields plus the whole-database default give four named strategies, each with a distinct cost profile. Choosing well is also how you prevent conflict storms in high-churn IoT fleets: narrowing a job to a device namespace keeps divergent writes — and the revision tree mechanics that stack them — contained rather than fanned across the whole dataset.

Whole-database (no scoping field): Replicate everything. Simplest and lowest per-document overhead; correct when the target genuinely needs the full dataset.
doc_ids allowlist: Replicate an explicit set of document IDs. Zero server-side evaluation cost, but the list is static in the document and must be redeployed to change.
filter (JavaScript design-document function): Arbitrary predicate logic with query_params. Maximally expressive, but the function runs per-document on the source and is the slowest option at volume.
selector (Mango): A declarative JSON predicate evaluated server-side. Cheaper than a JS filter because it avoids the JavaScript query server, at the cost of Mango’s more limited expressiveness. It is applied to each change as it streams and does not use Mango indexes.

Strategy	Server-side cost	Expressiveness	Changes without redeploy	Best-fit scenario
Whole-database	Lowest	N/A	N/A	Full-dataset mirrors, backups
`doc_ids`	Lowest	Exact IDs only	No	Small, fixed document sets
`filter` (JS)	Highest	Arbitrary logic	Filter code only (it lives in the source design document); `query_params` need a redeploy	Complex, computed predicates
`selector` (Mango)	Moderate	Declarative predicates	No	Namespace / device-type scoping
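
The four strategies differ only in which scoping field, if any, the document carries. A minimal sketch of producing each variant from a shared base document; `scope_job` and the `"whole"` strategy label are hypothetical names introduced for illustration:

```python
from typing import Any, Dict, Optional

_SCOPING_FIELDS = ("doc_ids", "filter", "selector")


def scope_job(base: Dict[str, Any], strategy: str, value: Any = None,
              query_params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of base carrying exactly one scoping strategy."""
    if strategy != "whole" and strategy not in _SCOPING_FIELDS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if query_params is not None and strategy != "filter":
        raise ValueError("query_params is only meaningful alongside filter")
    # Drop any scoping already on the base so the result never sets two.
    job = {k: v for k, v in base.items()
           if k not in _SCOPING_FIELDS and k != "query_params"}
    if strategy == "whole":
        return job
    if value is None:
        raise ValueError(f"strategy {strategy!r} needs a value")
    job[strategy] = value
    if query_params is not None:
        job["query_params"] = dict(query_params)
    return job
```

For example, `scope_job(base, "filter", "app/by_device_type", {"type": "temperature"})` reproduces the scoping used in the schema example at the top of this page.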

A second axis is job lifetime: continuous: true holds a live _changes listener for near-real-time convergence and higher baseline network cost, while continuous: false runs one-shot and stops — the full trade matrix lives in continuous vs one-way sync. Note what none of these variants do: replication moves revisions, it never merges them. Reconciling the divergent leaves that scoping helps contain is the job of the conflict detection strategies layer, not of the _replicator document.

Deployment & Orchestration

Deploying _replicator at scale is about determinism and single-ownership, not throughput tuning:

Version-control every document. Store the JSON (or the code that generates it) in your config repo, validated against this schema in CI so a malformed job fails the pipeline, not the CouchDB cluster.
Derive _id deterministically. Base it on a hardware serial or cloud-assigned node identifier so re-applying during a fleet rollout is idempotent — a pod restart re-PUTs the identical document and the scheduler recognises it by its computed _replication_id rather than spawning a duplicate.
One owner per job. Do not let two provisioning agents write the same _id concurrently; a lost race surfaces as 409 and should re-read the _rev and retry (as the deployer above does).
Inject secrets via environment, never the image. Pass COUCHDB_URL, credentials, and SYNC_SOURCE/SYNC_TARGET at runtime. Put remote credentials in the source/target object’s auth or headers, use user_ctx only for the role context the job runs under, and keep TLS validation aligned with your network posture.
Expose a health signal. Gate rollout success on wait_until_running, and route ongoing state transitions and _replication_stats to your metrics stack so crashing/failed jobs trigger async monitoring & webhooks rather than being discovered by hand in Fauxton.
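
One way to derive a deterministic `_id`, sketched under the assumption that a node serial and a route name are available at provisioning time. `job_doc_id` and the `rep_<node>_to_<route>` naming scheme are illustrative choices modelled on the example `_id` used on this page, not a CouchDB requirement:

```python
import re


def job_doc_id(node_serial: str, route: str) -> str:
    """Derive a stable _replicator _id from a node identifier and a route name."""
    def slug(text: str) -> str:
        # Lowercase and collapse anything outside [a-z0-9] into one underscore.
        return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

    node, dest = slug(node_serial), slug(route)
    if not node or not dest:
        raise ValueError("node_serial and route must contain letters or digits")
    return f"rep_{node}_to_{dest}"
```

Because the same inputs always yield the same `_id`, a re-run after a pod restart targets the existing document instead of creating a second one.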

For resource-constrained gateways, http_connections, worker_processes, and worker_batch_size are set per document (with cluster-wide defaults in the [replicator] configuration section); pairing smaller batches with a continuous: false toggle during high-latency windows prevents memory exhaustion. The full low-power recipe — CPU throttling, batch-size optimisation, and offline queue strategy — is in Configuring _replicator for IoT Edge Nodes.
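
A sketch of applying those per-document overrides for a constrained gateway. The specific values (4 connections, 1 worker, batches of 100) are illustrative assumptions, not recommendations from this page, and `constrained_profile` is a hypothetical helper name:

```python
from typing import Any, Dict


def constrained_profile(doc: Dict[str, Any], http_connections: int = 4,
                        worker_processes: int = 1,
                        worker_batch_size: int = 100) -> Dict[str, Any]:
    """Return a copy of doc with reduced per-job resource settings."""
    overrides = {
        "http_connections": http_connections,
        "worker_processes": worker_processes,
        "worker_batch_size": worker_batch_size,
    }
    for name, value in overrides.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    # Copy so the caller's base document is left untouched.
    return {**doc, **overrides}
```

The right numbers depend on the gateway's memory and link quality, so treat the defaults here as a starting point for measurement.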

Troubleshooting & Common Errors

Symptom / code	Likely cause	Remediation
`403 Forbidden` on deploy	`user_ctx.roles` cannot satisfy the target ACL	Grant the service user the required role, or use `_admin`; verify `owner` if `require_valid_user` is on
`400 Bad Request` — “`doc_ids`/`filter`/`selector` incompatible”	More than one scoping field set	Set at most one; validate locally before the `PUT`
Job stuck in `crashing`	Bad `source`/`target` URL, TLS failure, or missing design doc for `filter`	Inspect `_scheduler/docs` `info.error`; fix connectivity/cert or the `ddoc/filter` name
`409 Conflict` on `PUT`	Document already exists (concurrent deployer or re-run)	Re-read the current `_rev` and re-`PUT`; see handling 409 conflicts in replication jobs
Job never starts, no error	`_replication_*` field authored by hand, so the scheduler rejected the doc	Strip all read-only fields; only author the input fields
`_replication_state` flapping `running`↔`crashing`	Intermittent network to a remote target	Tune retry/backoff via error handling & retry logic; lower `worker_batch_size` on flaky links
Duplicate jobs after redeploy	Non-deterministic `_id` per rollout	Derive `_id` from a stable node identifier so re-application is idempotent
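
The table above can be turned into a first-pass triage rule over a `_scheduler/docs` snapshot such as the one returned by `replicator_state` earlier on this page. This is a sketch only: the action labels and the `max_errors` threshold of 5 are hypothetical and should be tuned to your alerting policy.

```python
from typing import Any, Dict


def triage(snapshot: Dict[str, Any], max_errors: int = 5) -> str:
    """Map a scheduler snapshot to a coarse action label."""
    state = snapshot.get("state")
    errors = snapshot.get("error_count") or 0
    if state == "failed":
        # Permanent: the document itself needs fixing and redeploying.
        return "fix_document"
    if state == "crashing":
        # Transient errors are retried; escalate only once they pile up.
        return "alert" if errors >= max_errors else "wait"
    if state in ("running", "completed"):
        return "ok"
    # initializing, pending, or an unrecognised state: check again later.
    return "wait"
```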

FAQ

How do I cancel a running replication job?

Delete its _replicator document. That is the only supported mechanism for scheduler-managed jobs — the worker is torn down after in-flight writes commit. The cancel: true flag applies to the one-off POST /_replicate API, not to _replicator documents.

What's the difference between `_id` and `_replication_id`?

You author _id to make deploys idempotent; the CouchDB cluster computes _replication_id as a hash of the job parameters and uses it internally to deduplicate work. Never set _replication_id by hand — observe it via the document or _scheduler/jobs.

Can I set `doc_ids`, `filter`, and `selector` together to combine them?

No. They are mutually exclusive and the scheduler rejects a document that sets more than one. To combine predicates, express the whole condition inside a single selector or a single JavaScript filter function.
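
As an illustration of folding two conditions into one selector, the sketch below combines a type match with an ID allowlist using Mango's `$and` and `$in` operators. The field names `type` and `device_id` are assumptions about the document shape, and `combined_selector` is a hypothetical helper:

```python
from typing import Any, Dict, List


def combined_selector(doc_type: str, device_ids: List[str]) -> Dict[str, Any]:
    """Build one Mango selector expressing 'this type AND one of these devices'."""
    if not device_ids:
        raise ValueError("device_ids must not be empty")
    return {
        "$and": [
            {"type": doc_type},
            {"device_id": {"$in": list(device_ids)}},
        ]
    }
```

The result goes into the document's `selector` field on its own, with `doc_ids` and `filter` left unset.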

Why did my job silently never start?

The most common cause is authoring a read-only field (_replication_state, _replication_id, _replication_stats, or _replication_state_time). The scheduler rejects such documents. Author only the input fields and let the CouchDB cluster write the rest back.

Does the `_replicator` document resolve conflicts for me?

No. Replication moves revisions between nodes and stacks divergent leaves onto each document's revision tree; it never merges them. Reconciliation is an application concern handled by the conflict-detection and auto-merge layer, not by any field in the schema.

