Fallback Resolution Chains for CouchDB Sync Pipelines

When a sensor fleet reconnects after an hours-long partition and floods your central node with divergent revisions, a single merge strategy is never enough: last-write-wins silently drops field edits, semantic merges choke on schema drift, and a naive resolver stalls the whole _replicator stream on the first document it cannot reconcile. A fallback resolution chain solves this by wiring several resolvers into a deterministic escalation path — deterministic field merge, then algorithmic heuristics, then business rules, then human review — so that every conflicting document reaches some terminal state and replication keeps flowing. This page is the escalation layer of the broader Conflict Detection & Automated Resolution Strategies framework; it assumes you already emit _conflicts from the changes feed and focuses on how to compose, configure, and operate the chain that consumes them.

The escalation path: every conflict takes the cheapest tier that converges, and anything unresolved reaches a non-destructive terminal state.

The chain is a sequential, idempotent pipeline. Each tier is stateless where possible, instrumented with explicit success and failure counters, and bounded in latency. Transitions between tiers are governed only by conflict metadata (_conflicts, _revisions, _deleted_conflicts) and replication checkpoint state — never by ad-hoc exceptions. Because CouchDB itself never discards losing branches (it selects a winner for reads via revision tree mechanics but leaves every leaf in place), the chain is fully responsible for reconciling and tombstoning divergent revisions.

Configuration Schema & Required Parameters

CouchDB’s _replicator database does not execute custom resolution logic — replication always preserves every conflicting revision, and your external pipeline consumes the changes feed to act on them. The chain therefore has two configuration surfaces: the _replicator document that narrows the stream, and the pipeline’s own tier configuration. Deploy the replication job using the standard _replicator document schema:

{
  "_id": "edge-sync-fallback-chain",
  "source": "https://edge-node-01.local:5984/sensor_data",
  "target": "https://central-sync.internal:5984/sensor_data",
  "continuous": true,
  "create_target": true,
  "filter": "sync_filters/sensor_docs",
  "user_ctx": {
    "name": "sync_service",
    "roles": ["_admin"]
  },
  "http_connections": 10,
  "connection_timeout": 30000,
  "retries_per_request": 5,
  "worker_processes": 4,
  "worker_batch_size": 50
}

The filter value uses the ddocname/filtername form (the sensor_docs function in _design/sync_filters), not a _design/.../_filter path. A replication filter only narrows which documents replicate — it cannot test for conflicts, because _conflicts is not available inside a filter function. Conflict detection happens in the consumer, which reads the changes feed with conflicts=true.

Parameter	Type	Default	Effect on the chain
`continuous`	bool	`false`	Must be `true`; a one-shot job would drain and exit, leaving late-arriving conflicts unprocessed.
`filter`	string	none	Scopes the stream to documents the chain owns, keeping unrelated churn out of the tiers.
`worker_batch_size`	int	`500`	Lower it (e.g. `50`) so a conflict storm produces small, back-pressurable batches rather than one giant flush.
`retries_per_request`	int	`5` (`10` before CouchDB 2.1.1)	Governs transport-level retries inside replication; independent of your per-tier resolution retries.
`http_connections`	int	`20`	Cap connection fan-out to the source so partition recovery does not exhaust file descriptors.
`user_ctx.roles`	array	`[]`	Needs write access on the target to commit merged winners and tombstone losing leaves.

The pipeline’s own configuration — tier ordering, confidence thresholds, DLQ target, checkpoint store — should live in environment variables or a mounted config document, never hard-coded, so the same image can run against edge, staging, and central topologies. The exact ordering you choose is discussed under Strategy Variants & Trade-offs.
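As a rough illustration of keeping tier configuration out of the code, the sketch below reads it from environment variables into an immutable object. CHAIN_CONCURRENCY and CHECKPOINT_URL appear elsewhere on this page; CHAIN_TIERS, CHAIN_CONFIDENCE, and DLQ_DB, along with the default values, are hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ChainConfig:
    tiers: Tuple[str, ...]
    confidence_threshold: float
    dlq_db: str
    checkpoint_url: str
    concurrency: int


def load_chain_config(env: Mapping[str, str] = os.environ) -> ChainConfig:
    """Build the tier configuration from environment variables.

    CHAIN_TIERS, CHAIN_CONFIDENCE and DLQ_DB are hypothetical names.
    Empty values fall back to the defaults, so ENV VAR="" behaves like unset.
    """
    raw_tiers = env.get("CHAIN_TIERS") or "field_merge,heuristic,business_rules"
    tiers = tuple(t.strip() for t in raw_tiers.split(",") if t.strip())
    if not tiers:
        raise ValueError("CHAIN_TIERS must name at least one tier")

    threshold = float(env.get("CHAIN_CONFIDENCE") or "0.8")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("CHAIN_CONFIDENCE must be within [0, 1]")

    return ChainConfig(
        tiers=tiers,
        confidence_threshold=threshold,
        dlq_db=env.get("DLQ_DB") or "conflict_review",
        checkpoint_url=env.get("CHECKPOINT_URL") or "",
        concurrency=int(env.get("CHAIN_CONCURRENCY") or "8"),
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests, and failing fast on a malformed value is usually preferable to discovering it on the first conflict.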

Streaming Detection & Monitoring Setup

Before wiring the full chain, confirm you can actually see conflicts. The minimal detector below subscribes to the continuous changes feed with conflicts=true — the flag that makes each document carry its computed _conflicts array — and prints any document with divergent leaves. If nothing prints, the problem is upstream in conflict generation models, not in your chain.

import asyncio
import json

import httpx

CHANGES = (
    "https://central-sync.internal:5984/sensor_data/_changes"
    "?feed=continuous&since=now&include_docs=true&conflicts=true"
    "&filter=sync_filters/sensor_docs"
)


async def watch_conflicts() -> None:
    """Print every document that arrives carrying a non-empty _conflicts array."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", CHANGES) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.strip():
                    continue  # heartbeat keep-alive newline
                doc = json.loads(line).get("doc") or {}
                if doc.get("_conflicts"):
                    print(doc["_id"], "->", doc["_conflicts"])


if __name__ == "__main__":
    asyncio.run(watch_conflicts())

Note timeout=None on the streaming client: a continuous feed is a long-lived response, and a read timeout would sever it on the first quiet interval. In production you emit a conflict_detected_total counter here rather than printing, and pair it with replication lag from the async monitoring & webhooks endpoints so you can correlate conflict spikes with partition-recovery windows.
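One possible shape for the counter mentioned above, kept dependency-free for illustration: a function that parses a single feed line and increments in-process counters. The metrics object and the conflict_leaves_total name are assumptions; a real deployment would more likely export these through whatever metrics client it already uses.

```python
import json
from collections import Counter
from typing import List, Optional, Tuple

# In-process stand-in for a metrics client (an assumption for this sketch).
metrics: Counter = Counter()


def record_change(line: str) -> Optional[Tuple[str, List[str]]]:
    """Parse one line of the continuous feed and count conflicted documents.

    Returns (doc_id, conflicting_revs) for a conflicted document, else None.
    """
    if not line.strip():
        return None  # heartbeat keep-alive newline
    change = json.loads(line)
    doc = change.get("doc") or {}
    leaves = doc.get("_conflicts") or []
    if not leaves:
        return None
    metrics["conflict_detected_total"] += 1
    metrics["conflict_leaves_total"] += len(leaves)  # hypothetical extra counter
    return doc["_id"], leaves
```

Counting leaves as well as documents can help distinguish many lightly conflicted documents from a few documents with deep divergence.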

Core Implementation

The production consumer wraps the same feed but routes each conflicted document through the escalation tiers, stopping at the first tier that converges. Every tier is a coroutine returning True on resolution; the orchestrator records which tier resolved the document and never re-runs an earlier tier. Structured logging, bounded concurrency, and a durable checkpoint make the pipeline safe to crash and restart.

import asyncio
import json
import logging
import os
from typing import Any, Dict, List, Optional

import httpx

logging.basicConfig(
    level=logging.INFO,
    format='{"ts":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}',
)
logger = logging.getLogger("couchdb_fallback_chain")


class FallbackPipeline:
    """Consume a conflicts=true changes feed and escalate each conflicted
    document through an ordered chain of resolvers until one converges.

    Tiers, in order:
      1. deterministic field-level merge   (cheap, total, no ambiguity)
      2. heuristic algorithmic resolution  (vector-clock / semantic diff)
      3. business-rule fallback            (compliance / priority overrides)
      4. dead-letter queue                 (human review; never drops data)
    """

    def __init__(self, db_url: str, changes_url: str, checkpoint) -> None:
        self.db_url = db_url.rstrip("/")
        self.changes_url = (
            f"{changes_url}?feed=continuous&include_docs=true"
            f"&conflicts=true&filter=sync_filters/sensor_docs"
        )
        self.checkpoint = checkpoint  # durable since-sequence store
        self.client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, read=None),  # read=None: feed is long-lived
            limits=httpx.Limits(max_connections=20),
        )
        self._sem = asyncio.Semaphore(int(os.getenv("CHAIN_CONCURRENCY", "8")))

    async def consume_changes(self) -> None:
        since = self.checkpoint.load() or "now"
        logger.info(f"starting changes consumer since={since}")
        url = f"{self.changes_url}&since={since}"
        async with self.client.stream("GET", url) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.strip():
                    continue
                try:
                    change = json.loads(line)
                except json.JSONDecodeError:
                    continue
                doc = change.get("doc")
                if doc and doc.get("_conflicts"):
                    async with self._sem:  # bound in-flight resolutions
                        await self.process_conflict(doc)
                if seq := change.get("seq"):
                    self.checkpoint.save(seq)  # advance only after handling

    async def process_conflict(self, doc: Dict[str, Any]) -> None:
        doc_id = doc["_id"]

        # Idempotency guard: re-fetch the current state. If the document is
        # gone or no longer lists conflicting leaves, an earlier run already
        # resolved it -> skip. Otherwise resolve against the fresh copy, not
        # the possibly stale one that arrived on the stream.
        fresh = await self._current(doc_id)
        if fresh is None or not fresh.get("_conflicts"):
            return
        doc = fresh
        conflicts: List[str] = fresh["_conflicts"]

        tiers = (
            ("field_merge", self._attempt_field_merge),
            ("heuristic", self._attempt_heuristic),
            ("business_rules", self._apply_business_rules),
        )
        for name, resolver in tiers:
            try:
                if await resolver(doc, conflicts):
                    logger.info(f"resolved {doc_id} tier={name}")
                    return
            except Exception:  # a tier must never abort the whole stream
                logger.exception(f"tier {name} raised on {doc_id}; escalating")

        await self._route_to_dlq(doc)
        logger.warning(f"escalated {doc_id} tier=dlq conflicts={len(conflicts)}")

    async def _current(self, doc_id: str) -> Optional[Dict[str, Any]]:
        r = await self.client.get(f"{self.db_url}/{doc_id}?conflicts=true")
        return r.json() if r.status_code == 200 else None

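    async def _commit(self, doc_id: str, winner: Dict, losers: List[str]) -> bool:
        # Resolution = write the merged winner AND tombstone every losing leaf.
        # The winner must carry the current winning _rev to be accepted.
        batch = {"docs": [winner, *(
            {"_id": doc_id, "_rev": rev, "_deleted": True} for rev in losers
        )]}
        r = await self.client.post(f"{self.db_url}/_bulk_docs", json=batch)
        if r.status_code not in (200, 201):
            return False
        # _bulk_docs reports per-document failures in the response body, not
        # in the HTTP status, so every result row has to be inspected.
        return all("error" not in row for row in r.json())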

    async def _attempt_field_merge(self, doc: Dict, conflicts: List[str]) -> bool:
        # Tier 1: deterministic last-write-wins on a whitelisted field set.
        return False  # implement per your schema

    async def _attempt_heuristic(self, doc: Dict, conflicts: List[str]) -> bool:
        # Tier 2: semantic diff / application-level vector clocks.
        return False

    async def _apply_business_rules(self, doc: Dict, conflicts: List[str]) -> bool:
        # Tier 3: compliance flags, retention policy, priority overrides.
        return False

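    async def _route_to_dlq(self, doc: Dict) -> None:
        # Tier 4: persist untouched to the review database; do NOT tombstone.
        r = await self.client.put(
            f"{self.db_url.rsplit('/', 1)[0]}/conflict_review/{doc['_id']}",
            json={"doc": doc, "status": "pending_review"},
        )
        # 409 means an earlier run already queued this document (replay-safe).
        # Any other failure is raised so the checkpoint does not advance past
        # a document that was never persisted for review.
        if r.status_code != 409:
            r.raise_for_status()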

    async def close(self) -> None:
        await self.client.aclose()


class NullCheckpoint:
    """Reference checkpoint; swap for SQLite/Redis in production."""

    def __init__(self) -> None:
        self._seq: Optional[str] = None

    def load(self) -> Optional[str]:
        return self._seq

    def save(self, seq: str) -> None:
        self._seq = seq


async def main() -> None:
    base = os.getenv("COUCH_DB", "https://central-sync.internal:5984/sensor_data")
    pipeline = FallbackPipeline(
        db_url=base,
        changes_url=f"{base}/_changes",
        checkpoint=NullCheckpoint(),
    )
    try:
        await pipeline.consume_changes()
    except asyncio.CancelledError:
        logger.info("shutdown requested")
    finally:
        await pipeline.close()


if __name__ == "__main__":
    asyncio.run(main())

Two constructs carry most of the correctness weight. The _sem semaphore caps how many resolutions may be in flight, so a reconnect burst cannot spawn thousands of concurrent merges; as written, consume_changes awaits each resolution inline, so the cap only comes into play once you dispatch process_conflict as background tasks. The per-tier try/except guarantees that a bug in one resolver escalates a single document rather than tearing down the feed. The actual merge bodies stay deliberately thin here: Tier 2 delegates to whatever you configure via algorithm selection for merge, and Tier 3 typically calls out to auto-merge rule engines for compliance-driven overrides. Anything that survives all three tiers lands in a review database rather than being force-merged, which is the contract that makes the chain safe to run unattended.
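To make the Tier 1 stub more concrete, here is one way a deterministic last-write-wins merge over a whitelisted field set might look. It is a sketch under several assumptions: the bodies of the losing leaves have already been fetched (for example with GET on the document and ?rev=), every leaf carries an updated_at field with a sortable timestamp, and the function name and clock_field parameter are hypothetical. It declines, returning None, whenever the leaves disagree outside the whitelist, so the chain escalates instead of guessing.

```python
from typing import Any, Dict, List, Optional

Doc = Dict[str, Any]


def merge_whitelisted(
    winner: Doc,
    losers: List[Doc],
    whitelist: List[str],
    clock_field: str = "updated_at",  # hypothetical per-document timestamp
) -> Optional[Doc]:
    """Last-write-wins on whitelisted fields; None means 'escalate'."""
    leaves = [winner, *losers]
    # Decline unless every leaf carries the clock field.
    if any(clock_field not in leaf for leaf in leaves):
        return None

    ignored = set(whitelist) | {clock_field, "_rev", "_conflicts"}

    def rest(leaf: Doc) -> Doc:
        return {k: v for k, v in leaf.items() if k not in ignored}

    # Decline if the leaves disagree anywhere outside the whitelist.
    if any(rest(leaf) != rest(winner) for leaf in losers):
        return None

    # Newest leaf wins; _rev breaks timestamp ties so the outcome does not
    # depend on the order in which leaves were fetched.
    newest = max(leaves, key=lambda leaf: (leaf[clock_field], leaf.get("_rev", "")))

    merged = dict(winner)  # keeps the winner's _id and _rev for the update
    merged.pop("_conflicts", None)
    for field in whitelist:
        if field in newest:
            merged[field] = newest[field]
        else:
            merged.pop(field, None)
    merged[clock_field] = newest[clock_field]
    return merged
```

The result keeps the winning leaf's _rev, so it can be handed to _commit together with the losing revisions. Whether device timestamps are trustworthy enough for this tier is a deployment question the sketch does not answer.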

Strategy Variants & Trade-offs

The order and gating of tiers is a design decision, not a default. Four compositions cover almost every deployment:

Strict Sequential Chain — run tiers in fixed order, stop at first success. Simplest to reason about and to audit; every document takes the cheapest resolver that works. This is the implementation above.
Confidence-Gated Chain — each tier returns a confidence score, and the orchestrator only accepts a resolution above a threshold, otherwise escalating. Prevents an over-eager field merge from silently winning when the edits genuinely conflict.
Parallel-Probe Chain — evaluate tiers 1–3 concurrently, then pick the highest-confidence non-null result. Lowest latency under load, but wastes compute and complicates idempotency because multiple resolvers may propose writes.
Short-Circuit-to-Manual — for regulated document classes, skip heuristics entirely and route straight from a failed deterministic merge to review. Maximum auditability, minimum automation.

Strategy	Consistency guarantee	Latency	Complexity	Best for
Strict Sequential	Deterministic, order-dependent	Low–medium	Low	General edge/IoT telemetry
Confidence-Gated	Strong (rejects low-confidence merges)	Medium	Medium	Mixed structured configs
Parallel-Probe	Weak without dedup	Lowest	High	High-throughput, loss-tolerant streams
Short-Circuit-to-Manual	Strongest (human final)	High (queue wait)	Low	Compliance / financial records

Whichever composition you pick, the terminal tier must be non-destructive: unresolved documents flow to manual review sync queues with their full revision set intact, so an operator can reconstruct intent. A chain that force-merges its last tier is not a fallback chain — it is silent data loss with extra steps.
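The Confidence-Gated variant can be sketched as a small orchestrator. The proposal shape (merged document plus a confidence in [0, 1]), the run_gated_chain name, and the threshold parameter are assumptions for illustration; the tiers are expected to propose a merge without writing it, leaving the commit to the caller.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional, Sequence, Tuple

Doc = Dict[str, Any]
# A gated resolver proposes (merged_doc, confidence in [0, 1]) or None.
Proposal = Optional[Tuple[Doc, float]]
Resolver = Callable[[Doc, List[str]], Awaitable[Proposal]]


async def run_gated_chain(
    doc: Doc,
    conflicts: List[str],
    tiers: Sequence[Tuple[str, Resolver]],
    threshold: float,
) -> Tuple[str, Optional[Doc]]:
    """Return (tier_name, merged_doc), or ("dlq", None) if nothing qualifies."""
    for name, resolver in tiers:
        try:
            proposal = await resolver(doc, conflicts)
        except Exception:
            continue  # a failing tier escalates; it does not abort the chain
        if proposal is None:
            continue
        merged, confidence = proposal
        if confidence >= threshold:
            return name, merged
        # Below the gate: discard the proposal and try the next tier.
    return "dlq", None
```

Because resolvers only propose, a rejected low-confidence merge leaves no trace in the database, which keeps the terminal tier non-destructive.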

Deployment & Orchestration

Package the consumer as a single-purpose container. The one hard rule: run exactly one active replica per source partition. Two consumers on the same changes feed will both detect the same conflict, both attempt to tombstone the same losing leaves, and the second _bulk_docs write will collide on _rev — turning a resolved conflict into a 409 retry storm. Scale horizontally by partitioning the source (per database, per device namespace), not by adding replicas to one feed. Use a leader lease (a lock document, or your orchestrator’s StatefulSet with replicas: 1 per shard) to enforce this.

FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir "httpx[http2]==0.27.*"
COPY pipeline.py .
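# Config strictly via env so one image runs every topology.
# COUCH_DB and CHECKPOINT_URL are injected at runtime; declaring them empty
# here would shadow the default that pipeline.py passes to os.getenv.
ENV CHAIN_CONCURRENCY="8"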
HEALTHCHECK --interval=30s --timeout=5s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"
CMD ["python", "pipeline.py"]

Expose a /healthz endpoint that reports two facts the orchestrator actually needs: whether the changes stream is connected, and how stale the checkpoint is. A liveness probe that only checks “process is running” will happily keep a wedged consumer whose feed silently died. Wire the same signals into your alerting so a stalled chain pages someone before the DLQ backs up. Env-var configuration (COUCH_DB, CHAIN_CONCURRENCY, CHECKPOINT_URL) keeps credentials and topology out of the image and lets you promote the exact bytes you tested.
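The decision behind such an endpoint might look like the sketch below: a pure function that turns the two signals into an HTTP status and a JSON body, which any small HTTP handler can then serve. The function name, the 300-second staleness limit, and the JSON field names are assumptions.

```python
import json
import time
from typing import Optional, Tuple


def health_status(
    stream_connected: bool,
    last_checkpoint_ts: Optional[float],
    max_staleness_s: float = 300.0,  # hypothetical threshold
    now: Optional[float] = None,
) -> Tuple[int, str]:
    """Return (http_status, json_body) for a /healthz response."""
    now = time.time() if now is None else now
    age = None if last_checkpoint_ts is None else max(0.0, now - last_checkpoint_ts)
    healthy = stream_connected and age is not None and age <= max_staleness_s
    body = json.dumps(
        {
            "stream_connected": stream_connected,
            "checkpoint_age_s": age,
            "healthy": healthy,
        }
    )
    return (200 if healthy else 503), body
```

One caveat with this sketch: a genuinely quiet feed never advances the checkpoint, so it may be worth refreshing the timestamp on heartbeat newlines as well, otherwise an idle but connected consumer would be reported as stale.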

Troubleshooting & Common Errors

Symptom	Likely cause	Remediation
`conflict` error rows in the `_bulk_docs` response (or `409 Conflict` on a single-document write)	Stale `_rev`; the winner advanced between fetch and write, or two replicas race the feed	Re-fetch with `conflicts=true`, re-run the tier, and enforce one replica per partition; see handling 409 conflicts in replication jobs
Same document resolved repeatedly (reprocessing loop)	Checkpoint not persisted before ack, or losing leaves never tombstoned	Save `since` after handling each change; confirm `_commit` deletes every rev in `_conflicts`
Conflicts never detected	`conflicts=true` missing from the feed URL, or filter excludes the docs	Add `conflicts=true`; verify the `ddocname/filtername` filter matches the namespace
Checkpoint drift after restart	In-memory checkpoint lost on crash	Back the checkpoint with SQLite/Redis; see the broader error handling & retry logic patterns
DLQ depth climbing steadily	A resolver silently returning `False` (schema drift, failing rule service)	Alert on `dlq_depth`; inspect per-tier success counters to find the stalled tier
`doc_update_conflict` in logs	Attempting to delete a leaf whose `_rev` is already gone	Treat as a no-op; the losing leaf was tombstoned by a prior run (idempotent by design)

Because CouchDB change delivery is at-least-once, every remediation above assumes idempotent resolution: re-applying a completed merge must be a no-op, not a second write. Fetch the current _rev before acting, skip if the state has advanced, and persist the checkpoint atomically so a crash mid-batch replays cleanly instead of double-resolving.
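For the durable checkpoint, a minimal SQLite-backed store with the same load/save interface as NullCheckpoint could look like this. The class name, table layout, and the feed key are assumptions; note also that sqlite3 calls block the event loop briefly, which a busier pipeline might want to move to a thread.

```python
import sqlite3
from typing import Optional


class SQLiteCheckpoint:
    """Durable since-sequence store; same interface as NullCheckpoint."""

    def __init__(self, path: str, feed: str = "sensor_data") -> None:
        self.feed = feed  # one row per consumed feed (hypothetical key)
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoint "
            "(feed TEXT PRIMARY KEY, seq TEXT NOT NULL)"
        )
        self.conn.commit()

    def load(self) -> Optional[str]:
        row = self.conn.execute(
            "SELECT seq FROM checkpoint WHERE feed = ?", (self.feed,)
        ).fetchone()
        return row[0] if row else None

    def save(self, seq: str) -> None:
        # The connection context manager commits on success and rolls back
        # on error, so a crash never leaves a half-written sequence.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoint (feed, seq) VALUES (?, ?)",
                (self.feed, str(seq)),
            )

    def close(self) -> None:
        self.conn.close()
```

Passing SQLiteCheckpoint("/data/checkpoint.db") in place of NullCheckpoint() would let a restarted consumer resume from the last handled sequence; the path shown is only an example and needs to live on a persistent volume.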

FAQ

Does CouchDB run my fallback chain, or do I run it externally?

Externally. The _replicator database only copies revisions; it never executes resolution logic and never deletes losing branches. Your chain is a separate process that consumes the changes feed with conflicts=true, decides a winner, and writes it back. CouchDB’s role ends at faithfully preserving every conflicting leaf for you to reconcile.

What does "resolving" a conflict actually require?

Two writes, ideally in one _bulk_docs batch: write the merged winning revision, and delete (tombstone) every losing leaf _rev from the document’s _conflicts array. Writing the winner alone leaves the document conflicted — the losing leaves remain in the revision tree and will resurface on the next read with conflicts=true.
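As a sketch of what that batch looks like, the helpers below build the _bulk_docs payload and pick out failed rows from the response. The function names are hypothetical; the payload shape ({"docs": [...]}, with "_deleted": true on each losing revision) and the per-row "error" key follow CouchDB's documented _bulk_docs format.

```python
from typing import Any, Dict, List


def build_resolution_batch(winner: Dict[str, Any], losing_revs: List[str]) -> Dict[str, Any]:
    """Payload for POST /{db}/_bulk_docs: merged winner plus one tombstone per loser."""
    if "_rev" not in winner:
        raise ValueError("winner must carry the current winning _rev")
    doc_id = winner["_id"]
    # _conflicts is read-time metadata, so leave it out of the written body.
    body = {k: v for k, v in winner.items() if k != "_conflicts"}
    tombstones = [
        {"_id": doc_id, "_rev": rev, "_deleted": True}
        for rev in losing_revs
        if rev != winner["_rev"]  # never tombstone the branch being kept
    ]
    return {"docs": [body, *tombstones]}


def failed_rows(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rows of a _bulk_docs response that report an error."""
    return [row for row in results if "error" in row]
```

An empty result from failed_rows is what "resolved" means in practice: the HTTP status of the batch alone does not say whether each individual write was accepted.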

How many pipeline replicas can I run against one feed?

Exactly one active replica per source partition. Multiple consumers on the same changes feed detect the same conflict and race to tombstone the same leaves, producing 409 collisions. To scale, partition the source (per database or per device namespace) and run one consumer per partition, guarded by a leader lease.
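A lock-document lease can be approximated as below. The sketch covers only the decision of what to write; the document id, field names, and 30-second TTL are hypothetical, and it assumes replicas have reasonably synchronized clocks.

```python
from typing import Any, Dict, Optional

LEASE_ID = "chain-lease"  # hypothetical lock-document id


def next_lease(
    current: Optional[Dict[str, Any]],
    holder: str,
    now: float,
    ttl_s: float = 30.0,
) -> Optional[Dict[str, Any]]:
    """Return the lock document to PUT, or None while another holder's lease is live."""
    if current is not None:
        held_by_other = current.get("holder") != holder
        still_live = float(current.get("expires_at", 0.0)) > now
        if held_by_other and still_live:
            return None
    lease: Dict[str, Any] = {"_id": LEASE_ID, "holder": holder, "expires_at": now + ttl_s}
    if current is not None and "_rev" in current:
        # Carrying the observed _rev makes the write conditional: CouchDB
        # rejects it if another replica updated the lock in the meantime.
        lease["_rev"] = current["_rev"]
    return lease
```

The replica PUTs the returned document and treats a 409 response as losing the race; only the replica whose write succeeds consumes the feed, and it renews the lease well before expires_at.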

Why not just use last-write-wins everywhere and skip the chain?

Last-write-wins is a fine Tier 1 for stateless telemetry, but it silently discards concurrent field edits and depends on trustworthy clocks. A chain lets you apply cheap deterministic merges where they are safe, escalate to semantic or rule-based resolution where they are not, and preserve anything ambiguous for human review — instead of losing data uniformly.

How do I keep the DLQ from becoming a graveyard?

Alert on DLQ depth and growth rate, not just absolute size, and track per-tier success counters so a rising DLQ points you at the specific resolver that started failing (usually schema drift or a down rule service). Route reviewed documents back through the chain after an operator annotates intent, so the queue is a workbench, not a landfill.
