Manual Review Sync Queues for CouchDB Replication

When an edge, IoT, or mobile-backend deployment hits a document conflict that no deterministic rule can safely reconcile — conflicting sensor telemetry with overlapping timestamps and missing provenance, or a financial record where either branch could be authoritative — forcing a lossy last-write-wins pick silently discards a real write. A manual review sync queue is the controlled holding layer that catches exactly these cases: it clones the conflicted document into a dedicated CouchDB database, locks it to one operator, and surfaces it through a deterministic API so a human resolves it with a full audit trail while continuous replication keeps flowing. This page is the escalation endpoint of the Conflict Detection & Automated Resolution Strategies pipeline: documents arrive here only after auto-merge rule engines and algorithm selection for merge have declined to guess, usually as the terminal hop of fallback resolution chains. The queue exists so that “we don’t know” is a first-class, non-destructive outcome rather than a coin flip.

The queue makes "we don't know" a durable, non-destructive state: only the operator's _bulk_docs write-back to the source clears the conflict.

Configuration Schema & Required Parameters

Two configuration artifacts stand up a manual review queue: the _replicator job that keeps every divergent leaf reaching your pipeline, and the queue database itself with an index that lets operators pull work by priority. Replication never discards losing branches on its own — it copies every conflicting leaf — so there is no “preserve conflicts” flag to set; you simply request conflicts=true on the downstream _changes feed. Deploy the sync job using the standard _replicator document schema:

{
  "_id": "edge_to_core_replication",
  "source": "https://edge-node-01.local:5984/device_data",
  "target": "https://core-cluster.internal:5984/device_data",
  "continuous": true,
  "create_target": false,
  "selector": {
    "sync_enabled": true
  },
  "user_ctx": {
    "name": "replicator_svc",
    "roles": ["_admin"]
  },
  "heartbeat": 30000,
  "retries_per_request": 5,
  "connection_timeout": 15000
}

Parameter	Type	Default	Effect on the queue pipeline
`_id`	string	— (required)	Names the replication job in `_replicator`; use one stable id per topology edge.
`source` / `target`	string or object	— (required)	Database URLs (or objects with embedded auth). Both must be reachable from the replicator node.
`continuous`	boolean	`false`	When `true`, conflicts surface within seconds so the router can enqueue them promptly instead of once per sweep.
`create_target`	boolean	`false`	Leave `false` in production so a typo cannot spawn a stray database.
`selector`	object	none	Mango selector narrowing the replicated stream. Use `selector` on its own; do not set `filter` to `_selector` in a replicator document — that value is only meaningful as a `_changes`-feed filter. Mutually exclusive with `filter` and `doc_ids`.
`heartbeat`	int (ms)	`10000`	Keeps long-poll connections alive across NAT tables that aggressively prune idle TCP sessions in constrained IoT networks.
`connection_timeout`	int (ms)	`30000`	Caps how long the replicator waits on a stalled socket before retrying.
`retries_per_request`	int	`5`	HTTP-layer retries for transient drops; the scheduler additionally retries crashed jobs with exponential backoff.

The queue database is an ordinary CouchDB database (sync_review_queue) plus one design document that indexes pending work by deadline so operators dequeue the most urgent item first:

{
  "_id": "_design/review",
  "views": {
    "pending_by_deadline": {
      "map": "function (doc) { if (doc.review_status === 'pending') { emit(doc.resolution_deadline, doc.original_id); } }"
    }
  }
}
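As a minimal sketch of how an operator tool might pull the most urgent item from this view: the helpers below only describe the request and unpack the response, so they can be paired with any HTTP client. The function names are illustrative assumptions, not part of CouchDB.

```python
from typing import Any, Dict, Optional


def next_pending_request(queue_url: str) -> Dict[str, Any]:
    """Describe the GET that fetches the single most urgent pending entry."""
    base = queue_url.rstrip("/")
    return {
        "url": f"{base}/_design/review/_view/pending_by_deadline",
        # View rows sort ascending by key (resolution_deadline), so limit=1
        # returns the entry with the earliest deadline.
        "params": {"limit": 1, "include_docs": "true"},
    }


def first_entry(view_response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the queued document from a view response, or None if empty."""
    rows = view_response.get("rows", [])
    return rows[0]["doc"] if rows else None
```

With the requests library, for example, `requests.get(**next_pending_request(queue_url)).json()` would produce the dictionary that `first_entry` expects.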

Each queued entry carries a small, fixed envelope around the conflicted payload: review_status (pending → locked → resolved), a locked_by operator identifier, a resolution_deadline timestamp for escalation, and the complete conflicting_revs array so the resolver can tombstone every loser once a winner is chosen. Because the queue holds copies, the conflict itself lives on in the source document’s revision tree until the operator writes back — see revision tree mechanics for why an unresolved conflict never disappears on its own.

Streaming Detection / Monitoring Setup

Routing decisions are made where the pipeline consumes the _changes feed, before any downstream step prunes leaves. Request the feed with conflicts=true and include_docs=true so each row carries the computed _conflicts array without a second round trip. The minimal listener below yields only documents that actually have competing branches — the exact set eligible for review:

import json

import httpx


def stream_conflicts(db_url: str, since: str = "now"):
    """Yield (doc_id, doc, conflicts[]) for each conflicted document.

    Uses the continuous _changes feed with conflicts=true so the computed
    _conflicts array rides along, and include_docs so the payload is present
    for the severity check without an extra GET.
    """
    params = {
        "feed": "continuous",
        "since": since,
        "include_docs": "true",
        "conflicts": "true",
        "heartbeat": "15000",
    }
    with httpx.stream("GET", f"{db_url}/_changes", params=params, timeout=None) as r:
        for line in r.iter_lines():
            if not line.strip():
                continue  # heartbeat keep-alive
            row = json.loads(line)  # each non-empty line is one JSON change row
            doc = row.get("doc")
            if not doc or doc.get("_deleted"):
                continue
            conflicts = doc.get("_conflicts", [])
            if conflicts:
                yield doc["_id"], doc, conflicts


if __name__ == "__main__":
    for doc_id, _doc, revs in stream_conflicts("http://localhost:5984/device_data"):
        print(f"conflicted: {doc_id} ({len(revs)} competing revs)")

Emit a counter for conflicts seen and a separate counter for documents actually enqueued so you can watch the escalation ratio over time. A sudden spike usually means an upstream partition storm rather than a queue problem — cross-reference conflict generation models before you assume the router is misbehaving.
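The two counters can be sketched as a small in-process tally. This is an illustration only: the class and attribute names are assumptions, and a real deployment would more likely export the same two numbers through its existing metrics library.

```python
from dataclasses import dataclass


@dataclass
class EscalationCounters:
    """Tracks conflicts seen versus conflicts actually enqueued for review."""

    conflicts_seen: int = 0
    enqueued: int = 0

    def record(self, was_enqueued: bool) -> None:
        # Every conflicted document counts as seen; only some are enqueued.
        self.conflicts_seen += 1
        if was_enqueued:
            self.enqueued += 1

    @property
    def escalation_ratio(self) -> float:
        # Guard against division by zero before any conflict has arrived.
        if self.conflicts_seen == 0:
            return 0.0
        return self.enqueued / self.conflicts_seen
```

Calling `record(...)` once per conflicted row from the listener keeps the ratio current without touching the routing logic.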

Core Implementation

The production router consumes the _changes feed, evaluates each conflict’s severity, and pushes only the genuinely ambiguous documents into sync_review_queue. It uses a persistent HTTP session with exponential backoff, a durable since checkpoint, and queue IDs derived from a hash of the source document id, so re-delivery from the feed is idempotent: a document seen twice maps to the same queue key and updates that entry instead of creating a duplicate. This assumes conflicts have already failed deterministic resolution upstream; when algorithm selection for merge cannot guarantee integrity, the document lands here.

#!/usr/bin/env python3
"""
CouchDB Manual Review Sync Queue Router
Routes unresolved conflicts to a dedicated queue DB for operator review.
"""
import os
import time
import logging
import hashlib
from typing import List, Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)


class ConflictRouter:
    def __init__(self, source_db_url: str, queue_db_url: str, auth: Optional[tuple] = None):
        self.source_url = source_db_url.rstrip("/")
        self.queue_url = queue_db_url.rstrip("/")
        self.auth = auth
        self.session = self._init_session()
        self.since_token = "0"
        self.last_checkpoint_file = ".sync_router_since"

    def _init_session(self) -> requests.Session:
        session = requests.Session()
        if self.auth:
            session.auth = self.auth
        retry_strategy = Retry(
            total=5,
            backoff_factor=1.5,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def _load_checkpoint(self) -> str:
        if os.path.exists(self.last_checkpoint_file):
            with open(self.last_checkpoint_file, "r") as f:
                return f.read().strip()
        return "0"

    def _save_checkpoint(self, token: str):
        with open(self.last_checkpoint_file, "w") as f:
            f.write(token)

    def _evaluate_conflict_severity(self, doc: dict, conflicts: List[str]) -> bool:
        """Return True if the document requires manual review.

        Heuristic: three or more competing revisions, a missing 'provenance'
        field, or a conflict window wider than 24h all signal that automated
        merge would likely discard a real write.
        """
        if len(conflicts) >= 3:
            return True
        if "provenance" not in doc or not doc["provenance"]:
            return True
        if "timestamp" in doc and "last_updated" in doc:
            t1 = doc.get("timestamp", 0)
            t2 = doc.get("last_updated", 0)
            if abs(t1 - t2) > 86_400_000:  # 24h in ms
                return True
        return False

    def _push_to_queue(self, doc_id: str, doc: dict, conflicts: List[str]):
        # Deterministic id: the same source doc always maps to the same queue
        # entry, so a re-delivered change is a harmless overwrite, not a dup.
        queue_id = f"review_{hashlib.sha256(doc_id.encode()).hexdigest()[:12]}"
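        now_ms = int(time.time() * 1000)
        # SLA window comes from REVIEW_SLA_DAYS (see the container environment),
        # defaulting to 7 days when the variable is unset.
        sla_days = int(os.getenv("REVIEW_SLA_DAYS", "7"))
        queue_doc = {
            "_id": queue_id,
            "original_id": doc_id,
            "conflicting_revs": conflicts,
            "payload": doc,
            "review_status": "pending",
            "locked_by": None,
            "created_at": now_ms,
            "resolution_deadline": now_ms + (sla_days * 86_400_000),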
        }
        # Preserve any existing _rev so re-enqueue updates rather than 409s.
        head = self.session.head(f"{self.queue_url}/{queue_id}")
        if head.status_code == 200 and "ETag" in head.headers:
            queue_doc["_rev"] = head.headers["ETag"].strip('"')
        resp = self.session.put(f"{self.queue_url}/{queue_id}", json=queue_doc)
        resp.raise_for_status()
        logger.info("Routed %s to review queue: %s", doc_id, queue_id)

    def run(self):
        self.since_token = self._load_checkpoint()
        logger.info("Starting conflict router from since=%s", self.since_token)
        changes_url = f"{self.source_url}/_changes"

        while True:
            try:
                params = {
                    "feed": "longpoll",
                    "since": self.since_token,  # advance each poll
                    "include_docs": "true",
                    "conflicts": "true",
                    "timeout": 30000,
                    "heartbeat": 15000,
                }
                resp = self.session.get(changes_url, params=params, timeout=35)
                resp.raise_for_status()
                data = resp.json()

                for change in data.get("results", []):
                    doc = change.get("doc")
                    if not doc or doc.get("_deleted"):
                        continue
                    conflicts = doc.get("_conflicts", [])
                    if conflicts and self._evaluate_conflict_severity(doc, conflicts):
                        self._push_to_queue(doc["_id"], doc, conflicts)

                # Advance and persist the cursor only after the whole batch is
                # handled, so a failure mid-batch replays it instead of skipping it.
                self.since_token = str(data.get("last_seq", self.since_token))
                self._save_checkpoint(self.since_token)

            except requests.exceptions.RequestException as e:
                logger.error("Changes feed interrupted: %s. Retrying in 5s...", e)
                time.sleep(5)
            except Exception as e:  # noqa: BLE001 — keep the daemon alive
                logger.error("Unexpected routing error: %s", e)
                time.sleep(10)


if __name__ == "__main__":
    SOURCE = os.getenv("COUCHDB_SOURCE_URL", "https://edge-node-01.local:5984/device_data")
    QUEUE = os.getenv("COUCHDB_QUEUE_URL", "https://core-cluster.internal:5984/sync_review_queue")
    # Fail fast when credentials are missing rather than falling back to defaults.
    USER = os.environ["COUCHDB_USER"]
    PASS = os.environ["COUCHDB_PASS"]

    router = ConflictRouter(SOURCE, QUEUE, auth=(USER, PASS))
    router.run()

The write-back step, which an operator triggers after choosing a revision, must commit the winner and a _deleted tombstone for every rev in conflicting_revs in one _bulk_docs batch against the source database. Writing only a new winner adds another leaf and leaves the document conflicted; the conflict clears only when each losing rev is tombstoned. _bulk_docs is not transactional and reports conflicts per document in the response body rather than as a single HTTP 409, so inspect every row. Wrap that write in the same backoff discipline described in error handling & retry logic, because a conflict error here means the source moved under you and the batch must be rebuilt against the fresh _rev.
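A minimal sketch of that batch, assuming the operator's tool already knows the source document's current winning _rev and the chosen body. The helper names are hypothetical; only the payload shape (`{"docs": [...]}` with `_deleted` tombstones) follows the _bulk_docs API.

```python
from typing import Any, Dict, List


def build_writeback_batch(
    doc_id: str,
    current_rev: str,
    winner_body: Dict[str, Any],
    conflicting_revs: List[str],
) -> Dict[str, List[Dict[str, Any]]]:
    """Build one _bulk_docs payload: the winner plus a tombstone per loser."""
    # Drop computed fields that must not be written back to the source.
    winner = {k: v for k, v in winner_body.items() if k not in ("_conflicts", "_revisions")}
    winner["_id"] = doc_id
    winner["_rev"] = current_rev  # update the current winning leaf
    tombstones = [
        {"_id": doc_id, "_rev": rev, "_deleted": True}
        for rev in conflicting_revs
        if rev != current_rev
    ]
    return {"docs": [winner] + tombstones}


def conflicted_rows(bulk_response: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the rows of a _bulk_docs response that were rejected as conflicts."""
    return [row for row in bulk_response if row.get("error") == "conflict"]
```

If `conflicted_rows` returns anything, the batch should be rebuilt from a fresh read of the source document rather than retried as-is.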

Strategy Variants & Trade-offs

“Route to a human” is not one policy — it is a spectrum of enqueue rules, and picking the wrong one either floods operators or lets bad merges through. Bind the policy to the document class rather than applying one global rule:

Policy	How it decides to enqueue	Latency to resolution	Operator load	Best fit
Severity threshold	Enqueue only when a heuristic (rev count, missing provenance, wide window) trips	Low for most docs; only hard cases wait	Low — small, high-value queue	Telemetry with occasional ambiguous merges
Deadline-driven escalation	Auto-attempt merge first; enqueue only if unresolved before an SLA timer expires	Medium — bounded by the SLA window	Bursty near deadlines	Config/profile docs where a delayed correct answer beats a fast wrong one
Always-escalate (dead-letter)	Every conflict for a namespace goes to a human; no automated merge	Highest — gated on human throughput	High — full stream lands in queue	Regulated/financial records where lossy merges are unacceptable
Sampled review	Auto-resolve, but divert a random percentage to humans for audit	Zero for the stream; async for samples	Tunable via sample rate	Validating a new merge algorithm before trusting it fully

Severity threshold is the default for most edge workloads: it keeps the queue small enough that operators can actually clear it, at the cost of a heuristic you must tune as data drifts. Deadline-driven escalation trades resolution latency for a higher automated-merge rate — most conflicts clear before the timer fires. Always-escalate is the only defensible policy where a discarded write is a compliance event, and it depends directly on operator staffing; pair it with the access controls in security boundaries in replication so only authorized operators can commit a winner. Sampled review is an audit harness, not a resolution strategy — use it to build confidence in an automated path before you widen it.
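Binding a policy to a document class can be sketched as a lookup table of predicates. Everything named here is an assumption for illustration: the `doc_class` field, the class names, and the simplified severity rule. Deadline-driven escalation is omitted because it needs a timer outside the enqueue decision.

```python
import random
from typing import Callable, Dict, List, Optional

Policy = Callable[[dict, List[str]], bool]


def severity_threshold(doc: dict, conflicts: List[str]) -> bool:
    # Simplified heuristic: many competing revs or missing provenance.
    return len(conflicts) >= 3 or not doc.get("provenance")


def always_escalate(doc: dict, conflicts: List[str]) -> bool:
    return True


def sampled(rate: float, rng: Optional[random.Random] = None) -> Policy:
    """Divert roughly `rate` of conflicts to humans for audit."""
    source = rng or random.Random()

    def policy(doc: dict, conflicts: List[str]) -> bool:
        return source.random() < rate

    return policy


# Hypothetical document classes mapped to enqueue policies.
POLICY_BY_CLASS: Dict[str, Policy] = {
    "telemetry": severity_threshold,
    "ledger": always_escalate,
    "experimental": sampled(0.05),
}


def should_enqueue(doc: dict, conflicts: List[str]) -> bool:
    # Unknown classes fall back to the severity threshold.
    policy = POLICY_BY_CLASS.get(doc.get("doc_class", ""), severity_threshold)
    return policy(doc, conflicts)
```

Keeping each policy a plain function makes it easy to swap one class to a stricter rule without touching the router loop.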

Deployment & Orchestration

Run the router as a small service whose only state is its checkpoint, one replica per replication partition. Two replicas on the same _changes stream double the write pressure on the queue and race each other, so scale by sharding namespaces across routers, not by cloning workers onto one feed. Configure everything through the environment so a single image serves every edge:

# Container environment (one replica per partition)
COUCHDB_SOURCE_URL=https://edge-node-01.local:5984/device_data
COUCHDB_QUEUE_URL=https://core-cluster.internal:5984/sync_review_queue
COUCHDB_USER=replicator_svc
COUCHDB_PASS=__from_secret_store__
REVIEW_SLA_DAYS=7
HEALTHCHECK_PORT=8080

Mount the .sync_router_since checkpoint on durable storage (a volume or a small state document) so a restart resumes from the last seq instead of replaying the whole history and re-enqueuing conflicts that are already under review. Expose a /healthz endpoint that confirms two things: the worker can still reach both databases, and it has heard from the _changes feed (a batch of results or a heartbeat) within the last heartbeat interval. On a quiet database the seq itself does not move, so measure feed liveness rather than cursor movement; a listener silent for longer than one interval is stalled, not idle. Let the orchestrator restart the pod on a failed check. For push-style alerting when the queue depth crosses a threshold, wire the queue’s _changes feed into async monitoring webhooks rather than polling from a dashboard.
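The health decision itself can be sketched independently of any web framework. The class, function, and threshold below are assumptions for illustration; the listener would call `touch()` whenever it makes progress on the feed, and the /healthz handler would return the computed status.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class FeedLiveness:
    """Remembers when the listener last made progress on the _changes feed."""

    max_silence_s: float = 45.0
    last_activity: float = field(default_factory=time.monotonic)

    def touch(self, now: Optional[float] = None) -> None:
        self.last_activity = time.monotonic() if now is None else now

    def is_live(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        return (current - self.last_activity) <= self.max_silence_s


def healthz(
    feed: FeedLiveness, source_ok: bool, queue_ok: bool, now: Optional[float] = None
) -> Tuple[int, Dict[str, bool]]:
    """Return an HTTP status and per-check detail for the health endpoint."""
    checks = {
        "source_reachable": source_ok,
        "queue_reachable": queue_ok,
        "feed_live": feed.is_live(now),
    }
    return (200 if all(checks.values()) else 503), checks
```

Returning the per-check detail alongside the status makes a failed probe easier to diagnose from the orchestrator's logs.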

Troubleshooting & Common Errors

Symptom / error	Likely cause	Remediation
`409 Conflict` (or a per-row `conflict` error in the `_bulk_docs` response) on write-back to source	Source `_rev` changed between operator read and commit	Re-read the source doc, rebuild the `_bulk_docs` batch against the fresh `_rev`, retry with backoff
`409 Conflict` when enqueuing	Re-delivered change PUT without the current queue `_rev`	Use the deterministic id + `HEAD`/ETag pattern shown above so re-enqueue updates instead of colliding
Conflict never clears after “resolution”	Winner written but losing revs not tombstoned	Include `{"_id": id, "_rev": rev, "_deleted": true}` for every entry in `conflicting_revs` in the same batch
Two operators editing one entry	Missing optimistic lock on the queue doc	Require the current `_rev` on the claim `PUT`; a stale `_rev` returns `409` and the claim fails safely
Checkpoint drift / replaying old changes	`.sync_router_since` not persisted across restarts	Store the last `seq` on durable storage and load it on boot
Queue grows unbounded	No retention/archival job	Archive `resolved` entries past the retention window; compact the queue DB in a low-traffic window
Escalation ratio climbing	Upstream partition storm, not a router bug	Compare against conflict-generation metrics; the heuristic is doing its job

FAQ

Does routing a document to the queue resolve the conflict in the source database?

No. The queue holds a copy for triage; the source document stays conflicted until an operator writes a winning revision and tombstones every losing rev back to the source. The queue’s job is to make “needs a human” durable and visible, not to change the source on its own.

How do I stop two operators from resolving the same document at once?

Rely on CouchDB’s optimistic concurrency. When an operator claims an entry, issue a conditional PUT carrying the current _rev. If another operator claimed it first, the _rev is stale and CouchDB returns 409, so the second claim fails cleanly and the UI can re-fetch the current locked_by.
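A claim along these lines might look like the sketch below. It assumes `session` behaves like a `requests.Session` (a `put` method returning an object with `status_code` and `raise_for_status`), and the function names are hypothetical.

```python
from typing import Any, Dict


def build_claim(entry: Dict[str, Any], operator: str) -> Dict[str, Any]:
    """Copy a pending queue entry and mark it locked by one operator."""
    if entry.get("review_status") != "pending":
        raise ValueError("only pending entries can be claimed")
    claimed = dict(entry)  # keeps the _rev that was read, which makes the PUT conditional
    claimed["review_status"] = "locked"
    claimed["locked_by"] = operator
    return claimed


def try_claim(session: Any, queue_url: str, entry: Dict[str, Any], operator: str) -> bool:
    """Attempt the claim; return False if another operator got there first."""
    claimed = build_claim(entry, operator)
    resp = session.put(f"{queue_url.rstrip('/')}/{entry['_id']}", json=claimed)
    if resp.status_code == 409:
        return False  # stale _rev: someone else already changed the entry
    resp.raise_for_status()
    return True
```

On a `False` result the UI would re-fetch the entry and show its current `locked_by` instead of retrying the claim.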

Why hash the original document id into the queue key instead of using a random id?

The _changes feed can re-deliver the same change after a restart or network blip. A key derived from the source document id means a re-delivered conflict maps to the exact same queue entry, so the second write is an idempotent overwrite rather than a duplicate review item cluttering the operator’s list.

What happens if the resolution_deadline passes with no operator action?

Nothing automatic to the source — the point of this queue is to never make a lossy write unattended. Instead, an escalation job should page an on-call reviewer, raise the entry’s priority in the pending_by_deadline view, or emit a webhook. The document stays conflicted and safe until a human acts.
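One way such an escalation job could find overdue entries is to query pending_by_deadline with an endkey of the current time. The sketch below only builds the query parameters and turns view rows into notices; the function names and the notice fields are assumptions, and delivery (paging, webhook) is left to the caller.

```python
import json
from typing import Any, Dict, List


def overdue_query_params(now_ms: int) -> Dict[str, str]:
    # endkey is inclusive by default, so the view returns every pending entry
    # whose resolution_deadline is at or before now_ms.
    return {"endkey": json.dumps(now_ms)}


def escalation_notices(rows: List[Dict[str, Any]], now_ms: int) -> List[Dict[str, Any]]:
    """Turn pending_by_deadline rows into notices for entries past deadline."""
    notices = []
    for row in rows:
        deadline = row["key"]
        if deadline > now_ms:
            continue  # defensive: ignore rows that are not overdue yet
        notices.append(
            {
                "queue_id": row["id"],
                "original_id": row["value"],
                "hours_overdue": round((now_ms - deadline) / 3_600_000, 1),
            }
        )
    return notices
```

The job never writes to the source database; it only reports, which keeps the unattended path non-destructive.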

Can the router and an automated merge engine run against the same stream?

Yes, and they usually should. The merge engine attempts deterministic resolution first; only what it declines flows to this router. In effect the router is the last hop of the fallback chain — it catches exactly the conflicts that automated logic refused to guess on.
