Error Handling & Retry Logic for Distributed CouchDB Replication

Distributed synchronization across edge nodes, mobile clients, and intermittent cellular backhauls demands deterministic failure recovery: a job that dies during a network partition must resume without operator intervention, and one that fails permanently must surface loudly rather than retry forever. CouchDB’s scheduling replicator gives you baseline fault tolerance, but production pipelines need explicit failure classification, bounded backoff, circuit breaking, and dead-letter routing layered on top. This page is the recovery layer of _replicator Configuration & Sync Pipeline Management: it shows how to tune retry-related fields on the replication document, subscribe to job-state transitions, implement a production retry orchestrator in Python, and choose a backoff strategy that neither drifts silently nor triggers a retry storm. Conflicts that surface after a job recovers are a related concern handled by algorithm selection for merge; here we focus on keeping the replication job itself alive and honest.

Configuration Schema & Required Parameters

Resilient recovery starts on the _replicator document itself, where a small set of per-request fields bound how hard a worker retries before it gives up on a request. There is no document-level retry or retry_interval field; job-level backoff between crashes is automatic and handled by the scheduler. Deploy the job with the standard _replicator document schema; a hardened continuous edge-to-cloud configuration looks like this:

{
  "_id": "rep_edge_to_cloud_001",
  "source": "https://edge-node-01.local:5984/sensor_telemetry",
  "target": "https://cloud-cluster.example.com:5984/telemetry_aggregate",
  "continuous": true,
  "create_target": false,
  "connection_timeout": 15000,
  "http_connections": 10,
  "retries_per_request": 5,
  "user_ctx": {
    "name": "sync_service",
    "roles": ["_admin", "sync_admin"]
  },
  "owner": "sync-pipeline-v2"
}

The three fields that govern retry pressure are retries_per_request, connection_timeout, and http_connections; the scheduler supplies the between-crash backoff for free. The table below fixes their types, defaults, and their concrete effect on failure recovery.

Parameter	Type	Default	Effect on failure recovery
`_id`	string	— (required)	Stable job identity; a deterministic id makes redeploys idempotent so a restart resumes the same job rather than spawning a duplicate.
`source` / `target`	string or object	— (required)	Database URLs (or objects carrying `url`, `headers`, `auth`). Both must be reachable from the replicator node or every request fails fast.
`continuous`	boolean	`false`	When `true`, the job holds an open `_changes` listener and is rescheduled after a crash; one-shot jobs (`false`) transition straight to `completed` or `failed`.
`retries_per_request`	integer	`5`	Caps HTTP-request retries before the worker abandons a single request, preventing a tight loop against a persistently failing target.
`connection_timeout`	integer (ms)	`30000`	Socket deadline per request; lower it for cellular/satellite uplinks so a dead peer is detected quickly instead of blocking a worker.
`http_connections`	integer	`20`	Size of the connection pool to the remote; too high amplifies a retry storm against a recovering endpoint.
`create_target`	boolean	`false`	Leave `false` in production so a typo in the target URL cannot silently create a stray database and mask the real failure.
`user_ctx`	object	none	Roles the job runs under; writing checkpoints and dead-letter documents needs `_admin` or an equivalent database role.

After deploying, confirm the job is actually scheduled and not stuck in crashing with GET /_scheduler/jobs or GET /_scheduler/docs. For authoritative field semantics, consult the Apache CouchDB replicator documentation. Whether the job runs continuously or as a scheduled sweep is a separate decision covered in continuous vs one-way sync; the orchestrator below is agnostic to that choice.
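As a minimal sketch of the schema above, the helper below assembles a continuous replication document and rejects obviously unsafe retry settings before anything is written to _replicator. The function name build_replication_doc and the guard rails it enforces are assumptions made for illustration; only the field names come from the document schema.

```python
def build_replication_doc(doc_id: str, source: str, target: str, *,
                          retries_per_request: int = 5,
                          connection_timeout_ms: int = 15000,
                          http_connections: int = 10) -> dict:
    """Return a continuous _replicator document with bounded retry pressure.

    Hypothetical helper: the limits checked here are illustrative guard
    rails chosen for this sketch, not values enforced by CouchDB.
    """
    if not doc_id or doc_id.startswith("_"):
        raise ValueError("doc_id must be non-empty and must not start with '_'")
    if retries_per_request < 1:
        raise ValueError("retries_per_request must be at least 1")
    if connection_timeout_ms <= 0:
        raise ValueError("connection_timeout is in milliseconds and must be positive")
    if http_connections < 1:
        raise ValueError("http_connections must be at least 1")
    return {
        "_id": doc_id,
        "source": source,
        "target": target,
        "continuous": True,
        "create_target": False,  # never auto-create a target from a mistyped URL
        "connection_timeout": connection_timeout_ms,
        "http_connections": http_connections,
        "retries_per_request": retries_per_request,
    }
```

The returned dictionary can be sent with PUT to the _replicator database under its _id. Because the id is deterministic, repeating the PUT without a _rev is answered with 409 Conflict instead of creating a second job.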

Streaming Detection / Monitoring Setup

You cannot retry intelligently without observing job state. The scheduler exposes health through two endpoints. GET /_scheduler/docs reports, for every replication document, the current state (running, pending, crashing, failed, among others) and, for error states, an error reason under info. GET /_scheduler/jobs reports the scheduler's view of each active job: the last few state transitions in history and counters under info such as changes_pending and doc_write_failures. Poll on a fixed interval and classify each job so the orchestrator knows whether to wait, retry, or escalate. The minimal listener below reads /_scheduler/docs and yields only jobs that need attention:

import httpx


def poll_unhealthy_jobs(couch_url: str):
    """Yield (doc_id, state, reason) for replications that are not cleanly running.

    Reads GET /_scheduler/docs and surfaces documents whose current state is
    crashing/pending/failed, attaching the reported error (when present) so
    the caller can classify the failure as transient or permanent.
    """
    resp = httpx.get(f"{couch_url.rstrip('/')}/_scheduler/docs", timeout=15)
    resp.raise_for_status()
    for doc in resp.json().get("docs", []):
        state = doc.get("state")
        if state in ("crashing", "pending", "failed"):
            info = doc.get("info")  # may be null, an object, or a plain string
            reason = info.get("error", "") if isinstance(info, dict) else (info or "")
            yield doc["doc_id"], state, str(reason)


if __name__ == "__main__":
    for job_id, state, reason in poll_unhealthy_jobs("http://localhost:5984"):
        print(f"{job_id}: {state} -> {reason!r}")

Emit a metric per poll — job state, doc_write_failures, and the source-checkpoint sequence — so a job whose checkpoint stops advancing while its state still reads running is caught as a silent stall rather than an outright crash. For a deeper checkpoint-integrity workflow, see monitoring replication checkpoints via API, part of the broader async monitoring & webhooks toolkit.
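The silent-stall check described above can be sketched as a small stateful detector fed one sample per poll. The class name StallDetector and the max_flat_polls knob are assumptions for illustration; the sketch compares checkpoints only for equality because CouchDB sequence values are opaque strings, and it assumes the caller extracts checkpointed_source_seq and changes_pending from the job's info object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flag a job whose checkpoint stops advancing while it still reads running.

    Hypothetical helper: `max_flat_polls` is an assumed tuning knob, not a
    CouchDB setting.
    """
    max_flat_polls: int = 3
    last_seq: Optional[str] = None
    flat_polls: int = 0

    def observe(self, state: str, checkpointed_seq: str, changes_pending: int) -> bool:
        """Record one poll sample; return True once the job looks stalled."""
        advanced = checkpointed_seq != self.last_seq
        self.last_seq = checkpointed_seq
        # An idle job (nothing pending) or a non-running job is not a silent stall.
        if advanced or state != "running" or changes_pending == 0:
            self.flat_polls = 0
            return False
        self.flat_polls += 1
        return self.flat_polls >= self.max_flat_polls
```

Keep one detector per job id. Requiring several flat polls in a row avoids alerting on a job that is simply between checkpoints.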

Core Implementation

Native scheduler backoff reschedules a crashed Erlang job, but it cannot encode your domain’s notion of “permanent” failure, cannot open a circuit breaker after repeated crashes, and cannot route an unrecoverable job to a dead-letter database. An external orchestrator does. The class below classifies failures from the reason the scheduler reports, applies jittered exponential backoff between its own remediation attempts, trips a per-job circuit breaker after a threshold of consecutive failures, and dead-letters jobs it cannot recover, with structured logging throughout.

The delay before remediation attempt $n$ (zero-indexed) combines an exponential term with uniform jitter, where $b$ is the base delay in seconds:

$$ \text{delay}(n) = b \cdot 2^{n} + U(0, 1) $$

The exponential factor widens the gap between retries to relieve a recovering endpoint, while the additive jitter $U(0,1)$ desynchronizes a fleet of edge clients so they do not reconnect in lockstep — the classic “thundering herd.” The implementation realizes that formula directly:

import asyncio
import logging
import os
import random
from dataclasses import dataclass, field

import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("retry-orchestrator")

# HTTP statuses and error substrings we treat as transient (worth retrying).
TRANSIENT_HINTS = ("timeout", "econnrefused", "connection", "503", "502", "500")


def classify(reason: str) -> str:
    """Classify a scheduler-reported failure reason as 'transient' or 'permanent'."""
    text = (reason or "").lower()
    # Check permanent markers first: a reason such as "unauthorized ... connection"
    # must not be rescued by a loose transient substring match.
    if "unauthorized" in text or "not_found" in text or "forbidden" in text:
        return "permanent"  # bad credentials or missing DB never self-heal
    if any(hint in text for hint in TRANSIENT_HINTS):
        return "transient"
    return "transient"  # default to retryable; a permanent fault will exhaust retries


@dataclass
class Breaker:
    """A per-job circuit breaker: open it after `threshold` consecutive failures."""
    threshold: int = 5
    failures: int = 0
    open: bool = field(default=False)

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open = True

    def record_success(self) -> None:
        self.failures = 0
        self.open = False


class RetryOrchestrator:
    """Recover crashing CouchDB replication jobs with bounded, jittered backoff."""

    def __init__(self, couch_url: str, base_delay: float = 2.0, max_retries: int = 6):
        self.couch_url = couch_url.rstrip("/")
        self.base_delay = base_delay
        self.max_retries = max_retries
        self.breakers: dict[str, Breaker] = {}

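    async def job_state(self, client: httpx.AsyncClient, job_id: str) -> tuple[str, str]:
        # job_id is the _replicator document _id; /_scheduler/docs reports its state.
        resp = await client.get(f"{self.couch_url}/_scheduler/docs/_replicator/{job_id}")
        if resp.status_code == 404:
            return "failed", "not_found: replication document is unknown to the scheduler"
        resp.raise_for_status()
        doc = resp.json()
        info = doc.get("info")  # may be null, an object, or a plain string
        reason = info.get("error", "") if isinstance(info, dict) else (info or "")
        return doc.get("state", "unknown"), str(reason)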

    async def supervise(self, client: httpx.AsyncClient, job_id: str) -> str:
        breaker = self.breakers.setdefault(job_id, Breaker())
        for attempt in range(self.max_retries):
            state, reason = await self.job_state(client, job_id)

            if state in ("running", "completed"):
                breaker.record_success()
                log.info("job %s healthy (state=%s)", job_id, state)
                return "recovered"

            kind = classify(reason)
            if kind == "permanent":
                log.error("job %s permanent failure: %s", job_id, reason)
                await self._dead_letter(client, job_id, reason)
                return "dead_lettered"

            breaker.record_failure()
            if breaker.open:
                log.error("circuit open for %s after %d failures", job_id, breaker.failures)
                await self._dead_letter(client, job_id, reason)
                return "dead_lettered"

            # Jittered exponential backoff before re-checking / re-triggering.
            delay = self.base_delay * (2 ** attempt) + random.uniform(0, 1)
            log.warning("job %s %s (%s); backing off %.1fs", job_id, state, reason, delay)
            await asyncio.sleep(delay)

        log.error("job %s exhausted %d retries", job_id, self.max_retries)
        await self._dead_letter(client, job_id, "retry budget exhausted")
        return "dead_lettered"

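    async def _dead_letter(self, client: httpx.AsyncClient, job_id: str, reason: str) -> None:
        """Persist an unrecoverable job to a dlq_replication database for review.

        CouchDB database names must start with a lowercase letter, so the
        dead-letter database cannot use a leading underscore. It must already exist.
        """
        doc = {"job_id": job_id, "reason": reason, "status": "unrecoverable"}
        resp = await client.post(f"{self.couch_url}/dlq_replication", json=doc)
        resp.raise_for_status()  # a silently dropped dead-letter would hide the failure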


async def _main() -> None:
    couch_url = os.environ.get("COUCH_URL", "http://localhost:5984")
    job_id = os.environ.get("JOB_ID", "rep_edge_to_cloud_001")
    orch = RetryOrchestrator(couch_url)
    async with httpx.AsyncClient(timeout=30) as client:
        outcome = await orch.supervise(client, job_id)
        log.info("supervision outcome for %s: %s", job_id, outcome)


if __name__ == "__main__":
    asyncio.run(_main())

Two non-obvious design choices matter most. classify defaults unknown reasons to transient so a novel error is retried rather than dropped, and supervise pairs that default with a bounded retry budget so a genuinely permanent fault still exhausts its attempts and dead-letters: you fail safe in both directions. The Breaker is per-job, not global: one flapping edge node must not trip the circuit for every other healthy replication in the fleet.
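To make the retry budget concrete, the sketch below evaluates the delay formula for the orchestrator's defaults and counts how many backoff sleeps a job that never recovers actually gets. The helper names backoff_bounds and sleeps_before_breaker are hypothetical, and the count assumes the job's Breaker starts with zero recorded failures.

```python
def backoff_bounds(base_delay: float, attempts: int, jitter_max: float = 1.0):
    """Return [(low, high)] delay bounds in seconds for attempts 0..attempts-1."""
    return [(base_delay * 2 ** n, base_delay * 2 ** n + jitter_max)
            for n in range(attempts)]


def sleeps_before_breaker(max_retries: int, threshold: int) -> int:
    """Count backoff sleeps supervise() performs for a job that never recovers.

    The breaker opens on the `threshold`-th consecutive failed check, before
    that attempt sleeps; otherwise every one of `max_retries` attempts sleeps.
    """
    return min(max_retries, threshold - 1)


bounds = backoff_bounds(base_delay=2.0, attempts=6)
n = sleeps_before_breaker(max_retries=6, threshold=5)
worst_case = sum(high for _, high in bounds[:n])
print(bounds)         # [(2.0, 3.0), (4.0, 5.0), (8.0, 9.0), (16.0, 17.0), (32.0, 33.0), (64.0, 65.0)]
print(n, worst_case)  # 4 34.0
```

With the defaults shown (max_retries of 6 and a breaker threshold of 5), the breaker opens on the fifth consecutive failed check, so a persistently failing job is dead-lettered after four sleeps totalling at most 34 seconds and the "retry budget exhausted" path is never reached. Raise the threshold above max_retries if the budget should be the binding limit.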

Strategy Variants & Trade-offs

There is no single correct backoff curve; the right one depends on how bursty your failures are and how many clients recover simultaneously. Four strategies cover almost every replication workload:

Fixed-interval retry waits a constant delay between attempts. It is trivial to reason about and fine for a single job against a stable target, but a fleet retrying on the same fixed interval reconnects in lockstep and hammers a recovering endpoint.

Exponential backoff doubles the delay each attempt, quickly relieving a struggling target. It is the scheduler’s own between-crash behaviour. Without jitter, however, synchronized clients still align on the same doubling schedule.

Exponential backoff with jitter — the orchestrator above — adds a random offset so retries spread across a widening window and desynchronize across clients. This is the default recommendation for any multi-client edge or mobile fleet.

Circuit breaker with dead-letter stops retrying entirely once consecutive failures cross a threshold, routing the job to a dead-letter database for human inspection. It bounds blast radius when a dependency is hard-down, at the cost of requiring an operator to clear the breaker.

Strategy	Recovery speed	Storm resistance	Complexity	Best fit
Fixed interval	Slow, predictable	Low	Low	Single job, stable target
Exponential backoff	Fast relief	Medium	Low	Bursty transient faults
Exponential + jitter	Fast relief	High	Medium	Multi-client edge/mobile fleets
Circuit breaker + dead-letter	Bounded, capped	Highest	Medium-High	Hard-down dependencies, audited pipelines
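The storm-resistance column can be illustrated with a small simulation: a fleet of clients that all fail at the same instant, each computing its delay for the same attempt. This is a simplified sketch under that assumption; the function names are hypothetical, and delays are rounded to 10 ms so that "distinct reconnect times" is a meaningful count.

```python
import random


def fixed(attempt: int, base: float = 2.0) -> float:
    """Constant delay regardless of attempt number."""
    return base


def exponential(attempt: int, base: float = 2.0) -> float:
    """Delay doubles with each attempt."""
    return base * 2 ** attempt


def exponential_jitter(attempt: int, base: float = 2.0, rng=random) -> float:
    """Doubling delay plus uniform jitter in [0, 1) seconds."""
    return base * 2 ** attempt + rng.uniform(0, 1)


def distinct_retry_times(strategy, clients: int, attempt: int, **kwargs) -> int:
    """Count distinct delays (rounded to 10 ms) across a fleet on one attempt."""
    return len({round(strategy(attempt, **kwargs), 2) for _ in range(clients)})


rng = random.Random(42)  # seeded so the sketch is repeatable
print(distinct_retry_times(fixed, clients=50, attempt=3))        # 1
print(distinct_retry_times(exponential, clients=50, attempt=3))  # 1
print(distinct_retry_times(exponential_jitter, clients=50, attempt=3, rng=rng))
```

Fixed and plain exponential backoff put all fifty clients on a single reconnect instant, while the jittered variant spreads them over many instants inside a one-second window.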

When a job trips the breaker or exhausts its budget, do not keep retrying blindly. Route the underlying documents through fallback resolution chains and, if still unresolved, into manual review sync queues so an operator inspects the failure with a full audit trail.

Deployment & Orchestration

Run the orchestrator as a small stateless service, one replica per replication partition. Two replicas supervising the same job double the remediation pressure and can race to write conflicting dead-letter documents, so scale by sharding job ids across workers, not by cloning workers onto one job. Configure everything through the environment so one image serves every edge:

# Container environment (one replica per partition)
COUCH_URL=https://cloud-cluster.example.com:5984
RETRY_BASE_DELAY=2.0
RETRY_MAX_RETRIES=6
BREAKER_THRESHOLD=5
HEALTHCHECK_PORT=8080
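One way to consume these variables is a typed loader that fails fast on invalid values at startup. The OrchestratorConfig dataclass and load_config function are hypothetical names introduced for this sketch; the variable names and fallback values mirror the environment block and the orchestrator defaults above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OrchestratorConfig:
    couch_url: str
    base_delay: float
    max_retries: int
    breaker_threshold: int
    healthcheck_port: int


def load_config(env=os.environ) -> OrchestratorConfig:
    """Parse the container environment, failing fast on invalid values."""
    cfg = OrchestratorConfig(
        couch_url=env.get("COUCH_URL", "http://localhost:5984").rstrip("/"),
        base_delay=float(env.get("RETRY_BASE_DELAY", "2.0")),
        max_retries=int(env.get("RETRY_MAX_RETRIES", "6")),
        breaker_threshold=int(env.get("BREAKER_THRESHOLD", "5")),
        healthcheck_port=int(env.get("HEALTHCHECK_PORT", "8080")),
    )
    if cfg.base_delay <= 0 or cfg.max_retries < 1 or cfg.breaker_threshold < 1:
        raise ValueError("retry settings must be positive")
    return cfg
```

The loaded base_delay and max_retries map directly onto the RetryOrchestrator constructor. As listed, supervise creates each Breaker with its default threshold, so honoring BREAKER_THRESHOLD would additionally require constructing Breaker(threshold=cfg.breaker_threshold) there.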

Expose a /healthz endpoint that confirms the worker can reach GET /_scheduler/jobs and that its own supervision loop has run within the last interval; a loop stuck longer than one poll cycle signals a hung worker, not a healthy idle. Persist supervision state in durable storage, at minimum the set of job ids already dead-lettered, so a pod restart resumes supervision rather than re-alerting on those jobs, and let the container platform restart the pod on a failed health check.
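The sharding rule from the start of this section, one worker per set of job ids, can be sketched with a stable hash so that every replica agrees on ownership without coordination. The function names and the choice of SHA-256 are assumptions for illustration; any hash that is stable across processes would serve.

```python
import hashlib


def shard_for(job_id: str, shard_count: int) -> int:
    """Map a job id to a shard index that is stable across processes and restarts."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def owned_jobs(job_ids, shard_index: int, shard_count: int):
    """Return only the job ids this worker is responsible for supervising."""
    return [j for j in job_ids if shard_for(j, shard_count) == shard_index]
```

Python's built-in hash() is salted per process for strings, so it would assign the same job to different workers after a restart; a cryptographic digest avoids that. Note that changing shard_count reassigns most jobs, so resize the worker pool deliberately.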

Troubleshooting & Common Errors

Symptom / signal	Likely cause	Remediation
Job stuck in `crashing`, `history` shows `5xx`	Target overloaded or down	Let native backoff run; if it persists past your breaker threshold, dead-letter and page the target owner
`state: failed` immediately after deploy	Bad `user_ctx`, missing target, or auth error	Classify as permanent; fix credentials/URL and redeploy — retries will never recover it
`doc_write_failures` climbing while `running`	Individual `409 Conflict` writes	Expected for conflicts; resolve via handling 409 conflicts in replication jobs rather than restarting the job
Checkpoint drift (source seq frozen, state `running`)	Silent stall / stuck listener	Alert on a non-advancing `checkpointed_source_seq`; restart the job by re-writing its `_replicator` document
Retry storm against a recovering target	Fixed-interval or jitterless backoff	Switch to exponential backoff with jitter; lower `http_connections`
Connection timeouts on cellular uplinks	`connection_timeout` too high	Lower it (e.g. 15000 ms) so dead peers free workers quickly
Endless retries never escalate	No circuit breaker / unbounded budget	Bound `max_retries` and enforce a `Breaker` so permanent faults dead-letter

Track three operational signals throughout: recovery latency (time from crash detected to state running), dead-letter rate (share of supervised jobs escalated), and checkpoint lag (the gap between the source's current sequence and the job's last checkpointed_source_seq). Conflicts surfaced by a recovered job feed back into conflict generation models, closing the loop between transport recovery and data reconciliation.
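Two of these signals can be computed directly from what the orchestrator already produces. The sketch below assumes the caller collects the outcome strings returned by supervise and records a (crash_detected, running_again) timestamp pair in seconds for each recovery; both function names are hypothetical. Checkpoint lag is left out because sequence values are opaque strings.

```python
from statistics import median
from typing import Iterable, Optional, Tuple


def dead_letter_rate(outcomes: Iterable[str]) -> float:
    """Share of supervised jobs that ended as 'dead_lettered' (0.0 for no jobs)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(o == "dead_lettered" for o in outcomes) / len(outcomes)


def median_recovery_latency(episodes: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Median seconds from crash detection to the job reading running again."""
    latencies = [recovered - crashed for crashed, recovered in episodes]
    return median(latencies) if latencies else None


print(dead_letter_rate(["recovered", "dead_lettered", "recovered", "recovered"]))  # 0.25
print(median_recovery_latency([(0, 10), (5, 35), (2, 22)]))                        # 20
```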

FAQ

Does CouchDB retry a crashed replication job automatically?

Yes. The scheduling replicator marks a failing continuous job crashing and reschedules it with exponential backoff between crashes. What it does not do is classify failures as transient versus permanent, apply request-level jitter, open a circuit breaker, or dead-letter an unrecoverable job — those are the responsibilities of the external orchestrator layered on top.

Is there a `retry_interval` field on the `_replicator` document?

No. There is no document-level retry or retry_interval field. Per-request retries are bounded by retries_per_request, and the delay between job crashes is automatic exponential backoff managed by the scheduler. Any additional jittered or capped backoff must live in your own orchestration code.

Why add jitter to exponential backoff?

Plain exponential backoff still lets a fleet of edge clients align on the same doubling schedule, so they all reconnect at once and re-overload a recovering target — the thundering-herd effect. Adding a random offset, U(0,1) seconds, desynchronizes the clients so their retries spread across the window instead of colliding.

Should I restart a job when `doc_write_failures` is climbing?

Usually not. Individual document 409 Conflict writes increment doc_write_failures but rarely halt the job — the replicator records the failure and continues. A climbing counter points to a conflict-resolution problem, not a transport failure; resolve it with the conflict workflow rather than by bouncing the replication.

When should a job stop retrying and dead-letter instead?

When the failure is permanent (bad credentials, a missing target database, an authorization error) or when consecutive transient failures cross the circuit-breaker threshold. At that point continued retries only waste connections and mask the fault, so route the job to a dead-letter database and alert an operator to intervene.
