Sync Topology Models for Distributed CouchDB Replication

A sync topology is the wiring diagram of your replication graph: which databases push to which, in what direction, and through which intermediaries. It is the single design decision that most strongly predicts your conflict rate, your WAN bill, and how gracefully a fleet recovers after a partition heals. Edge/IoT developers, mobile backend engineers, and Python sync builders reach this page when a naive “everything replicates to central” wiring has started to buckle — replication storms after a network blip, conflict counts that scale with device count, or a coordinator that becomes a single choke point. This page treats topology as a configurable, observable artifact expressed through _replicator documents rather than tribal knowledge, and it sits under CouchDB Replication Architecture & Revision Fundamentals, which establishes how revisions and the _changes feed drive every sync stream described below.

Every topology is ultimately a set of directed replication edges, and CouchDB gives you exactly one primitive to express an edge: a document in the _replicator database naming a source and a target, whose order fixes the direction. Bidirectional sync between two nodes is simply two such documents. The four archetypes compared under Strategy Variants & Trade-offs (centralized star, peer mesh, hybrid arbitration, and relay tree) differ only in how those edges are arranged.

CouchDB conflict handling is topology-agnostic: every node applies the same MVCC rules (see MVCC in CouchDB replication) regardless of how streams are wired, so the topology does not change how a conflict is resolved. What it changes is how often conflicts are generated, how far a divergent revision has to travel before it is reconciled, and how many redundant copies of a write cross the network. Choosing and tuning that wiring is the whole job.

Configuration Schema & Required Parameters

A topology is provisioned as a collection of _replicator documents, one per directed edge, deployed with the standard _replicator document schema. The document below defines one edge of a centralized topology — an edge node pushing continuously to a central cluster. To make the same pair bidirectional, you deploy a mirror document with source and target swapped.

{
  "_id": "topo_edge01_to_central",
  "source": {
    "url": "https://edge-node-01.local:6984/fleet_state",
    "auth": { "basic": { "username": "rep_edge_01", "password": "${EDGE_REP_PASSWORD}" } }
  },
  "target": {
    "url": "https://central-cluster.internal:6984/fleet_state",
    "auth": { "basic": { "username": "rep_central_writer", "password": "${CENTRAL_REP_PASSWORD}" } }
  },
  "continuous": true,
  "create_target": false,
  "selector": { "device_group": "west-1" },
  "checkpoint_interval": 30000,
  "worker_processes": 2,
  "http_connections": 10,
  "connection_timeout": 30000
}
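The mirror document mentioned above can be derived mechanically from the forward edge. The following is a minimal sketch, not part of any CouchDB client library: `mirror_edge` and the `mirror_id` argument are hypothetical names, and the mirror's `_id` is passed in explicitly because the naming scheme is a local convention.

```python
import copy
from typing import Any, Dict


def mirror_edge(doc: Dict[str, Any], mirror_id: str) -> Dict[str, Any]:
    """Return the reverse-direction twin of a _replicator document.

    Source and target (including their per-endpoint auth blocks) are swapped;
    every other field is copied unchanged.
    """
    mirrored = copy.deepcopy(doc)  # never mutate the forward edge
    mirrored["_id"] = mirror_id
    mirrored["source"], mirrored["target"] = mirrored["target"], mirrored["source"]
    # The twin is a separate document, so it must not inherit the forward _rev.
    mirrored.pop("_rev", None)
    return mirrored
```

Swapping the endpoints also swaps the roles of the credentials: the account that only wrote to the central cluster now has to read from it, and the edge account has to accept writes, so check both permissions before deploying the mirror.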

The parameters that actually shape a topology’s behavior — as opposed to authenticating it — are the ones that govern edge direction, scope, and back-pressure:

Parameter	Type	Default	Effect on the topology
`source` / `target`	string \| object	— (required)	Defines the direction of one replication edge; the arrangement of all edges is the topology.
`continuous`	bool	`false`	`true` keeps a persistent `_changes` listener so the edge is live; `false` makes it a one-shot sweep you re-trigger on a schedule.
`selector`	object	none	Mango filter that scopes an edge to a device group or namespace, so a mesh does not flood every peer with every document.
`filter` / `doc_ids`	string \| array	none	Alternative scoping mechanisms; mutually exclusive with `selector` and with each other, so set at most one of the three per edge.
`checkpoint_interval`	int (ms)	`30000`	How often the edge persists its `_local` checkpoint; lower values shrink re-sync after a drop at the cost of more writes.
`worker_processes`	int	`4`	Parallel workers per edge; raise on high-fan-in central targets, lower on constrained edge devices.
`http_connections`	int	`20`	Connection-pool size for the edge; the dominant memory knob on low-power gateways in a dense mesh.
`create_target`	bool	`false`	Keep `false` so a mis-wired edge fails fast instead of silently minting an empty database.

The critical discipline is that _id values encode the edge, not just the job — a deterministic scheme like topo_<source>_to_<target> makes the whole topology idempotent, so re-applying your manifest after a pod restart or a leader election converges to the same graph instead of duplicating edges. Scope every edge with a selector (or filter): an unscoped mesh replicates every document to every peer, which is the fastest way to turn a healthy topology into a conflict generation model you did not intend.
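A pre-deploy lint can enforce the scoping rule before a manifest ever reaches the cluster. The sketch below is illustrative only: `scoping_problems` is a hypothetical helper, and it assumes each edge is a plain dict shaped like the _replicator document shown earlier.

```python
from typing import Any, Dict, List

# The three scoping mechanisms a replication document can carry.
SCOPING_KEYS = ("selector", "filter", "doc_ids")


def scoping_problems(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with how one edge is scoped."""
    # An empty selector ({}) matches every document, so treat it as absent.
    present = [key for key in SCOPING_KEYS if doc.get(key)]
    problems: List[str] = []
    if not present:
        problems.append("edge is unscoped: it would replicate every document")
    if len(present) > 1:
        problems.append("mutually exclusive scoping keys set: " + ", ".join(present))
    return problems
```

Running this over every document in the manifest and failing the build on a non-empty result keeps an unscoped edge from ever being provisioned.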

Streaming Detection / Monitoring Setup

A topology is only as healthy as its weakest edge, so the first thing to observe is the live state of every replication document feeding it. Poll _scheduler/docs: it reports one entry per replication document with its state, source, target, and the latest error information, and you can derive a per-topology health signal from the aggregate. (The sibling _scheduler/jobs endpoint lists only the jobs the scheduler currently holds and exposes status through each job's history events rather than a state field.) The minimal listener below yields any edge that is not in a healthy running/pending state:

import time

import httpx


def watch_topology_edges(cluster_url: str, auth: tuple[str, str], interval_s: int = 15):
    """Yield unhealthy replication edges observed on the CouchDB scheduler.

    Polls /_scheduler/docs and emits (doc_id, state, source, target, info)
    for any edge that has left the running/pending healthy band, so a
    supervisor can alert or re-provision the missing edge.
    """
    # "initializing" is transient; add "completed" if the topology has one-shot edges.
    healthy = {"running", "pending"}
    with httpx.Client(auth=auth, timeout=30.0) as client:
        while True:
            resp = client.get(f"{cluster_url}/_scheduler/docs", params={"limit": 500})
            resp.raise_for_status()
            for doc in resp.json().get("docs", []):
                state = doc.get("state", "unknown")
                if state not in healthy:
                    yield (
                        doc.get("doc_id"),  # the _replicator document id, i.e. the edge
                        state,
                        doc.get("source"),
                        doc.get("target"),
                        doc.get("info"),
                    )
            time.sleep(interval_s)


if __name__ == "__main__":
    for job_id, state, src, tgt, info in watch_topology_edges(
        "http://localhost:5984", ("admin", "password")
    ):
        print(f"unhealthy edge {job_id}: {state} {src} -> {tgt} :: {info}")

Emit a metric here keyed by the edge’s source and target so a dashboard can show the topology as a graph with each edge coloured by health. A single crashing edge in a centralized topology degrades one device; the same edge in a relay chain silently strands every device downstream of it, so alert thresholds should account for an edge’s position, not just its state. For continuous edges the same scheduler view distinguishes a genuinely idle stream from a stalled one — a decision covered in depth by continuous vs one-way sync.
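One way to make alert thresholds position-aware is to count the nodes a failed edge would cut off from the root. This is a rough sketch under stated assumptions: edges are `(source, target)` pairs pointing upstream toward a single root, and `stranded_by_failure` and the node labels are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

EdgePair = Tuple[str, str]  # (source, target), pointing upstream


def _nodes_reaching(edges: Iterable[EdgePair], root: str) -> Set[str]:
    """All nodes with at least one upstream path to root (root included)."""
    feeders: Dict[str, List[str]] = defaultdict(list)
    for source, target in edges:
        feeders[target].append(source)
    seen = {root}
    queue = deque([root])
    while queue:  # breadth-first walk against the edge direction
        node = queue.popleft()
        for child in feeders.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def stranded_by_failure(edges: Iterable[EdgePair], root: str, failed: EdgePair) -> Set[str]:
    """Nodes that lose every upstream path to root when `failed` goes down."""
    edge_list = list(edges)
    before = _nodes_reaching(edge_list, root)
    after = _nodes_reaching([e for e in edge_list if e != failed], root)
    return before - after
```

The size of the returned set is a usable severity weight: a relay edge whose failure strands an entire site can page immediately, while a leaf edge that strands only itself can wait for a longer threshold.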

Core Implementation

The following manager turns a declarative topology description into a set of _replicator documents and reconciles the live cluster against it. It expands high-level edges (including bidirectional pairs) into concrete replication documents, uses deterministic _ids so re-application is idempotent, retries transient failures with backoff, and logs every action in a structured form suitable for CI/CD. Annotations mark the non-obvious lines.

import logging
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("topology-manager")


@dataclass
class Edge:
    """One directed replication stream in a topology."""

    source: str
    target: str
    selector: Optional[Dict[str, Any]] = None
    continuous: bool = True
    bidirectional: bool = False
    extra: Dict[str, Any] = field(default_factory=dict)


class TopologyManager:
    """Reconciles a CouchDB replication topology from a declarative edge list."""

    def __init__(self, cluster_url: str, auth: tuple[str, str], max_retries: int = 4):
        self.cluster_url = cluster_url.rstrip("/")
        self.session = requests.Session()
        self.session.auth = auth
        self.session.headers.update({"Content-Type": "application/json"})
        self.max_retries = max_retries

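    def _edge_id(self, source: str, target: str) -> str:
        """Deterministic id so re-applying the topology is idempotent.

        The key includes the host as well as the database name. Using only the
        last path segment would give every edge the same id when all nodes share
        a database name (fleet_state here), so the edges would overwrite each other.
        """

        def key(url: str) -> str:
            # Drop the scheme and any embedded credentials, then make the rest id-safe.
            hostpath = url.split("://", 1)[-1].split("@", 1)[-1].rstrip("/")
            return hostpath.replace("/", "_").replace(":", "_")

        return f"topo_{key(source)}__to__{key(target)}"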

    def _put_with_retry(self, doc_id: str, body: Dict[str, Any]) -> None:
        """PUT a replication doc, tolerating a 409 by updating the live _rev."""
        for attempt in range(1, self.max_retries + 1):
            existing = self.session.get(f"{self.cluster_url}/_replicator/{doc_id}")
            if existing.status_code == 200:
                body["_rev"] = existing.json()["_rev"]  # update in place, do not duplicate the edge
            resp = self.session.put(f"{self.cluster_url}/_replicator/{doc_id}", json=body)
            if resp.status_code in (201, 202):
                log.info("edge applied id=%s", doc_id)
                return
            if resp.status_code == 409:  # lost the write race; re-read _rev and retry
                log.warning("edge conflict id=%s attempt=%d, retrying", doc_id, attempt)
                time.sleep(min(2 ** attempt, 15))
                continue
            resp.raise_for_status()
        raise RuntimeError(f"failed to apply edge {doc_id} after {self.max_retries} attempts")

    def _edge_to_doc(self, source: str, target: str, edge: Edge) -> Dict[str, Any]:
        doc: Dict[str, Any] = {
            "source": source,
            "target": target,
            "continuous": edge.continuous,
            "create_target": False,  # fail fast on a mis-wired edge
        }
        if edge.selector:
            doc["selector"] = edge.selector  # scope the edge so a mesh does not flood peers
        doc.update(edge.extra)
        return doc

    def apply(self, edges: List[Edge]) -> None:
        """Provision every edge; bidirectional edges expand into two documents."""
        for edge in edges:
            pairs = [(edge.source, edge.target)]
            if edge.bidirectional:
                pairs.append((edge.target, edge.source))
            for source, target in pairs:
                doc_id = self._edge_id(source, target)
                self._put_with_retry(doc_id, self._edge_to_doc(source, target, edge))

    def health(self) -> Dict[str, int]:
        """Aggregate scheduler document states into a topology-level summary."""
        # _scheduler/docs carries a per-document "state" field; _scheduler/jobs does not.
        resp = self.session.get(f"{self.cluster_url}/_scheduler/docs", params={"limit": 1000})
        resp.raise_for_status()
        summary: Dict[str, int] = {}
        for doc in resp.json().get("docs", []):
            state = doc.get("state", "unknown")
            summary[state] = summary.get(state, 0) + 1
        return summary


if __name__ == "__main__":
    manager = TopologyManager("http://localhost:5984", ("admin", "password"))

    # A hybrid topology: local peer edge plus each peer arbitrating through central.
    central = "https://central-cluster.internal:6984/fleet_state"
    peer_a = "https://edge-node-01.local:6984/fleet_state"
    peer_b = "https://edge-node-02.local:6984/fleet_state"

    topology = [
        Edge(peer_a, peer_b, selector={"device_group": "west-1"}, bidirectional=True),
        Edge(peer_a, central, selector={"device_group": "west-1"}, bidirectional=True),
        Edge(peer_b, central, selector={"device_group": "west-1"}, bidirectional=True),
    ]

    manager.apply(topology)
    log.info("topology health: %s", manager.health())

Running the manager twice against the same cluster produces the same graph, not a doubled one, because each edge resolves to a deterministic _id and the _put_with_retry path updates the live _rev in place. That idempotence is what lets you store a topology as version-controlled infrastructure and re-apply it on every deploy. When an edge can no longer be resolved cleanly — a permanent auth failure, a target that refuses writes — the failure surfaces through the scheduler and belongs in error handling & retry logic rather than an unbounded retry loop here.

Strategy Variants & Trade-offs

Four topology archetypes cover almost every production fleet. Each has a distinct conflict profile, a distinct failure blast radius, and a distinct cost curve, so the right choice is dictated by your connectivity model and audit requirements rather than by preference.

Centralized (star) — Every edge node replicates only to and from one central cluster. Conflict arbitration, authentication, and audit logging all converge on a single authority, which makes governance simple. The cost is a WAN round trip on every sync and a coordinator whose outage stalls the entire fleet.
Peer mesh — Nodes replicate directly to one another with no coordinator, enabling true offline-first operation and local-latency sync. It removes the central choke point but multiplies the number of edges (and therefore conflict-generation surface) quadratically, and it demands explicit firewall traversal. Provisioning it is covered step by step in setting up peer-to-peer sync topologies.
Hybrid arbitration — Local peers mesh for low-latency operation during connectivity gaps while each also carries an edge to a central arbiter for global consistency and audit. It captures most of the mesh’s resilience with a single authoritative reconciliation point, at the price of the most edges to provision and monitor.
Relay / tree — Edges cascade through intermediate aggregators (a gateway per site, sites rolling up to a region). It minimizes WAN fan-in and suits deeply hierarchical fleets, but a failed intermediate strands its entire subtree, so relay nodes need the tightest health monitoring.

Topology	Conflict surface	Failure blast radius	WAN cost	Best fit
Centralized (star)	Low — one merge point	Total — central is a single point of failure	High — every sync is a round trip	Always-connected fleets with strict audit needs
Peer mesh	High — grows with peer count	Local — one node only	Low on LAN, high to bootstrap	Offline-first, low-latency local collaboration
Hybrid arbitration	Medium — meshed but reconciled	Low — degrades to local mesh	Medium — central carries deltas	Intermittently connected edge fleets
Relay / tree	Medium — merged per level	Subtree — everything downstream	Low fan-in at the core	Deep site → region → cloud hierarchies

The comparison that matters is conflict surface against failure blast radius: centralized minimizes conflicts but maximizes the blast radius of an outage, while peer mesh inverts both. Hybrid arbitration is the pragmatic middle for edge fleets precisely because it keeps the low blast radius of a mesh while restoring a single authority for the deterministic winner selection defined by revision tree mechanics.

Deployment & Orchestration

Provision the topology from a single manifest and let one controller own it. The controller — the TopologyManager above wrapped in a small service — should run as exactly one replica per replication partition; running two controllers against the same partition doubles the _replicator write pressure and races them into 409 churn without wiring any additional edge. Drive the wiring through the environment so one image serves every site:

# Topology controller environment (one replica per partition)
COUCH_CLUSTER_URL=https://central-cluster.internal:6984
TOPOLOGY_MODEL=hybrid            # star | mesh | hybrid | relay
DEVICE_GROUP=west-1              # becomes the per-edge selector scope
EDGE_CONTINUOUS=true
CHECKPOINT_INTERVAL_MS=30000
HEALTHCHECK_PORT=8080

Expose a /healthz endpoint that reports the aggregated scheduler summary and fails when any edge the manifest expects is missing or crashing — a controller that reports green while an edge is silently absent is worse than no health check. Persist the manifest and the last-applied revision in durable storage so a restart reconciles against the intended graph rather than re-deriving it from whatever happens to be live. For deeply constrained gateways, cap http_connections and worker_processes per edge before adding replicas; memory in a dense mesh is dominated by connection pools, not by the controller. Align every edge’s credential scope with the network partition it crosses, following security boundaries in replication, so a compromised edge device cannot replicate outside its device group.
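The decision behind such a /healthz endpoint reduces to comparing the edge ids the manifest expects with what the scheduler reports. The following is a sketch of that comparison only, with no web framework attached; `healthz_status` is a hypothetical name, and it assumes scheduler entries carry `doc_id` and `state` fields, the shape `_scheduler/docs` returns.

```python
from typing import Any, Dict, Iterable, List, Tuple

HEALTHY_STATES = {"running", "pending"}


def healthz_status(
    expected_ids: Iterable[str], scheduler_docs: List[Dict[str, Any]]
) -> Tuple[int, Dict[str, List[str]]]:
    """Return (http_status, detail) for the controller's health endpoint."""
    expected = sorted(set(expected_ids))
    states = {doc.get("doc_id"): doc.get("state", "unknown") for doc in scheduler_docs}
    # An expected edge the scheduler has never heard of is a failure, not a no-op.
    missing = [edge_id for edge_id in expected if edge_id not in states]
    unhealthy = [
        edge_id
        for edge_id in expected
        if edge_id in states and states[edge_id] not in HEALTHY_STATES
    ]
    status = 200 if not missing and not unhealthy else 503
    return status, {"missing": missing, "unhealthy": unhealthy}
```

Checking against the expected ids rather than against whatever the scheduler lists is what catches the silently absent edge: an empty scheduler response yields a 503, not a green check.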

Troubleshooting & Common Errors

Symptom / error	Likely cause	Remediation
Conflict rate scales with device count	Unscoped mesh edges replicate every document to every peer	Add a `selector` per edge; shard by `device_group` so peers exchange only shared documents
Duplicate replication jobs after redeploy	Non-deterministic `_replicator` `_id`s minting a new edge each apply	Derive `_id` from source/target; update the live `_rev` in place instead of `POST`ing
Whole fleet stalls when the central cluster drops	Centralized topology with the coordinator as a single point of failure	Move to hybrid arbitration or cluster the coordinator; alert on central `_scheduler` health
Replication storm on reconnect	Many edges resume simultaneously after a partition heals	Stagger edge restarts with jitter; lower `worker_processes`; raise `checkpoint_interval` transiently
Subtree goes dark, upstream looks healthy	Failed relay intermediate strands everything downstream	Monitor relay nodes as first-class; add a fallback direct-to-central edge for critical devices
`409 Conflict` (`Document update conflict.`) after re-apply	Two controllers racing the same partition	Enforce one controller replica per partition; serialize manifest application
Edge stuck in `crashing`	Auth failure or unreachable target on that edge	Inspect the edge's `info` field in `_scheduler/docs`; fix credentials/routing; route permanent failures to retry logic
Checkpoint drift / full re-sync on restart	`checkpoint_interval` too high or `_local` docs lost	Lower the interval; ensure `_local` checkpoints survive restarts on both endpoints

When an edge produces conflicts faster than any resolver drains them, the fix is rarely more resolver throughput — it is a topology change that shrinks the conflict surface (scoping edges, collapsing a mesh toward arbitration). Documents that no automated strategy can reconcile should still escalate to manual review sync queues rather than being overwritten, and the choice of resolver for the rest is covered by algorithm selection for merge.

FAQ

Does the topology change how CouchDB picks the winning revision?

No. Winner selection is deterministic and identical on every node regardless of how streams are wired: among the leaf revisions, non-deleted ones take precedence, then the highest generation count, then the lexicographically highest revision hash. Topology changes how often conflicts are generated and how far a divergent revision travels before it is reconciled, not the tiebreak itself. That is why you can restructure a topology to lower conflict volume without touching any resolution code.
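The ordering can be sketched in a few lines of Python to show why every node reaches the same answer without coordination. This is an illustration of the ordering, not CouchDB's implementation: `Leaf` and `pick_winner` are hypothetical names, and revisions are assumed to be `<generation>-<hash>` strings. The sketch prefers non-deleted leaves first, which CouchDB applies before the generation comparison.

```python
from typing import Iterable, NamedTuple, Tuple


class Leaf(NamedTuple):
    """One leaf of a document's revision tree."""

    rev: str  # "<generation>-<hash>", e.g. "3-9f2c..."
    deleted: bool = False


def pick_winner(leaves: Iterable[Leaf]) -> str:
    """Return the winning revision id among conflicting leaf revisions."""

    def sort_key(leaf: Leaf) -> Tuple[bool, int, str]:
        generation, _, rev_hash = leaf.rev.partition("-")
        # Live leaves beat tombstones, then higher generation, then higher hash.
        return (not leaf.deleted, int(generation), rev_hash)

    return max(leaves, key=sort_key).rev
```

Because the key depends only on the leaves themselves, the result is independent of the order in which revisions arrived, which is exactly why rewiring the topology cannot change the winner.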

How many _replicator documents does a peer mesh of N nodes need?

A fully connected bidirectional mesh needs N × (N − 1) directed edges — two documents per node pair — which grows quadratically. This is the main reason full meshes are reserved for small clusters of peers; larger fleets scope edges with selectors or adopt hybrid arbitration so most traffic flows through a coordinator instead of every possible pair.
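Enumerating the edges makes the growth concrete. The helpers below are a small sketch with hypothetical names (`full_mesh_edges`, `star_edges`); each returned pair corresponds to one _replicator document.

```python
from itertools import permutations
from typing import List, Tuple

EdgePair = Tuple[str, str]  # (source, target): one _replicator document


def full_mesh_edges(nodes: List[str]) -> List[EdgePair]:
    """Every ordered pair of distinct nodes: N * (N - 1) directed edges."""
    return list(permutations(nodes, 2))


def star_edges(central: str, edge_nodes: List[str]) -> List[EdgePair]:
    """Bidirectional star: two documents per edge node, 2 * N in total."""
    pairs: List[EdgePair] = []
    for node in edge_nodes:
        pairs.append((node, central))
        pairs.append((central, node))
    return pairs
```

For ten nodes the full mesh needs 90 documents where a bidirectional star around one coordinator needs 20, and the gap widens with every node added.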

Can I change topology on a live fleet without downtime?

Yes, if edges are provisioned idempotently. Because deleting a _replicator document tears down only that edge and adding one starts only that edge, you can migrate a star to a hybrid by adding peer edges first, verifying they reach the running state in _scheduler/docs, then optionally pruning the now-redundant central edges. Deterministic _ids make the migration a diff between two manifests rather than a rebuild.
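The manifest diff itself can be very small. A minimal sketch, assuming each manifest is a dict keyed by the deterministic edge `_id` with the replication document body as the value; `diff_manifests` is a hypothetical helper.

```python
from typing import Any, Dict, List


def _body(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Ignore CouchDB-managed fields (_id, _rev, ...) when comparing edges."""
    return {key: value for key, value in doc.items() if not key.startswith("_")}


def diff_manifests(
    current: Dict[str, Dict[str, Any]], desired: Dict[str, Dict[str, Any]]
) -> Dict[str, List[str]]:
    """Return the edge ids to add, update, and remove to reach `desired`."""
    shared = current.keys() & desired.keys()
    return {
        "add": sorted(desired.keys() - current.keys()),
        "update": sorted(k for k in shared if _body(current[k]) != _body(desired[k])),
        "remove": sorted(current.keys() - desired.keys()),
    }
```

Applying the `add` list first, waiting for those edges to report running, and only then applying `remove` preserves the add-before-prune ordering described above.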

Why does my conflict count explode right after a partition heals?

While the partition held, each side accumulated independent revisions; on reconnect every edge resumes at once and merges those divergent trees simultaneously. The spike is expected — it is the split-brain generation model surfacing — but you can flatten it by staggering edge restarts with jitter and scoping edges so only genuinely shared documents replicate. Sustained storms after the initial spike indicate an unscoped mesh, not a healing artifact.
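Staggering with jitter can be as simple as assigning each edge a random start delay inside a fixed window. The sketch below only computes the delays; `jittered_start_delays` and the `window_s` and `seed` parameters are hypothetical, and how the controller acts on each delay (for example, by deferring when it re-creates the edge document) is left to the deployment.

```python
import random
from typing import Dict, Iterable, Optional


def jittered_start_delays(
    edge_ids: Iterable[str], window_s: float, seed: Optional[int] = None
) -> Dict[str, float]:
    """Map each edge id to a uniform random delay in [0, window_s] seconds."""
    rng = random.Random(seed)  # pass a seed only for reproducible tests
    return {edge_id: rng.uniform(0.0, window_s) for edge_id in edge_ids}
```

A window of a minute or two is a reasonable starting point for experimentation; the right value depends on how many edges resume at once and how much merge load the target can absorb.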

Should edge nodes replicate continuously or on a schedule?

It depends on the topology’s tolerance for staleness versus its connection budget. Continuous edges keep a persistent _changes listener and minimize convergence time but hold connections open — costly in a dense mesh of battery-powered devices. One-shot edges triggered on a schedule or on connectivity events conserve resources at the cost of longer convergence windows. The trade-off is analyzed in detail under continuous vs one-way sync.
