Async Monitoring & Webhooks for CouchDB `_replicator` Pipelines

When a replication job silently stalls at 3 a.m. and an edge fleet stops shipping telemetry, cron-based polling tells you far too late. This page shows how to turn CouchDB’s _replicator state transitions into a low-latency, push-driven event stream — consuming the system database’s _changes feed asynchronously, correlating each state change with checkpoint progress, and dispatching structured webhooks to downstream routers, message queues, and alerting systems. It is the observability layer for the broader _replicator Configuration & Sync Pipeline Management framework, and it is written for mobile backend engineers and Python sync-pipeline builders who need replication to behave like a first-class, monitored service rather than a background process nobody watches until it breaks.

CouchDB does not natively emit HTTP webhooks for _replicator lifecycle events. Instead it exposes the _changes feed on the _replicator system database as a reliable, append-only event bus. By consuming that feed with feed=continuous and include_docs=true, you can route scheduler states — initializing, running, pending, crashing, completed, and failed — to your own infrastructure and replace fragile timed polling with a deterministic, event-driven state machine.

Configuration Schema & Required Parameters

Async monitoring starts with a well-formed replication document. The fields you set on the _replicator document determine how observable the job is: heartbeat keeps idle feeds alive behind load balancers, and retries_per_request bounds how hard a struggling job tries before the scheduler reschedules it. The document below targets CouchDB 3.x and carries the fields most relevant to monitoring; the complete field contract lives in the _replicator Document Schema.

{
  "_id": "rep_edge_to_cloud_01",
  "source": "https://edge-node.local:5984/sensor_data",
  "target": "https://cloud-couchdb.internal:5984/aggregate_telemetry",
  "continuous": true,
  "create_target": false,
  "user_ctx": {
    "name": "replicator_svc",
    "roles": ["_admin"]
  },
  "heartbeat": 30000,
  "connection_timeout": 60000,
  "retries_per_request": 5,
  "socket_options": "[{keepalive, true}, {nodelay, false}]"
}

Replication options are top-level fields — there is no params wrapper. Never author _replication_state, _replication_stats, or _replication_id by hand; the scheduler writes them back and overwrites anything you set. Two footguns matter here: socket_options is an Erlang-term string (e.g. "[{keepalive, true}]"), not a JSON array, and the connection-level timeout field is connection_timeout, not timeout.

| Parameter | Type | Default | Effect on monitoring |
| --- | --- | --- | --- |
| `continuous` | boolean | `false` | `true` keeps a persistent worker and emits incremental state deltas; `false` (one-shot) fires a single `completed`/`failed` transition and exits. See Continuous vs One-Way Sync. |
| `heartbeat` | integer (ms) | `10000` | Interval at which the source emits keep-alive newlines on the `_changes` feed, preventing idle-connection drops through proxies and NAT. |
| `connection_timeout` | integer (ms) | `30000` | Per-connection socket timeout; too low causes spurious `crashing` transitions on slow links, inflating webhook noise. |
| `retries_per_request` | integer | `5` | HTTP-request retries before the worker surfaces an error to the scheduler, which then applies its own exponential backoff to `crashing` jobs. |
| `socket_options` | Erlang-term string | `[{keepalive, true}, {nodelay, false}]` | TCP tuning (keepalive, nodelay); `keepalive` is essential for long-lived feeds behind stateful firewalls. |
| `user_ctx.roles` | array | none | Security context the worker runs under; a role mismatch surfaces as an immediate `failed` state rather than a silent stall. |
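The authoring rules above (top-level options, scheduler-owned fields, `socket_options` as a string, `connection_timeout` rather than `timeout`) can be checked before a document is ever written to `_replicator`. The function below is a minimal sketch of such a pre-flight check; `lint_replication_doc` and `SCHEDULER_OWNED` are hypothetical names introduced here, not part of CouchDB.

```python
from typing import Any, Dict, List

# Fields the scheduler writes back itself; authoring them by hand is an error.
SCHEDULER_OWNED = {"_replication_state", "_replication_stats", "_replication_id"}


def lint_replication_doc(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a _replicator document."""
    problems: List[str] = []
    if "params" in doc:
        problems.append("options must be top-level fields, not under 'params'")
    for field in sorted(SCHEDULER_OWNED.intersection(doc)):
        problems.append(f"{field} is written by the scheduler; remove it")
    if "timeout" in doc:
        problems.append("use 'connection_timeout', not 'timeout'")
    opts = doc.get("socket_options")
    if opts is not None and not isinstance(opts, str):
        problems.append("socket_options must be an Erlang-term string")
    for required in ("source", "target"):
        if required not in doc:
            problems.append(f"missing required field '{required}'")
    return problems
```

An empty list means the document passes these particular checks; it does not guarantee that CouchDB will accept the document.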

Streaming Detection / Monitoring Setup

The minimal subscription is a single long-lived GET against the _replicator database’s _changes feed. The key parameters are feed=continuous (stream line-delimited JSON forever), include_docs=true (so each change carries the _replication_state), heartbeat (keep-alive newlines), and since=now (start at the tail, ignoring history). The snippet below prints every state transition and is enough to confirm the feed is reachable before you wire in dispatch:

import asyncio
import json
import aiohttp

FEED_URL = "https://couchdb.internal:5984/_replicator/_changes"


async def tail_replicator():
    """Print each _replicator state transition as it streams from the feed."""
    params = {"feed": "continuous", "include_docs": "true",
              "heartbeat": 30000, "since": "now"}
    timeout = aiohttp.ClientTimeout(total=None)  # never time out a live feed
    async with aiohttp.ClientSession() as session:
        async with session.get(FEED_URL, params=params, timeout=timeout) as resp:
            resp.raise_for_status()
            async for line in resp.content:
                if not line.strip():
                    continue  # heartbeat newline
                doc = json.loads(line).get("doc", {})
                state = doc.get("_replication_state")
                if state:
                    print(doc.get("_id"), "->", state)


if __name__ == "__main__":
    asyncio.run(tail_replicator())

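For richer scheduler context (_replication_state_reason, error text, and per-job counters), poll GET /_scheduler/docs or GET /_scheduler/jobs alongside the feed. The feed tells you that a job changed state; _scheduler/docs tells you why. Note that since CouchDB 2.1 the scheduler writes only the terminal completed and failed states back to the replication document, so transient states such as running, pending, and crashing are visible only through the _scheduler endpoints.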
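To make the "why" concrete, the sketch below flattens a `/_scheduler/docs` response body into one record per replication document. It assumes the response shape as I understand it for CouchDB 3.x (a `docs` array whose entries carry `doc_id`, `state`, `error_count`, and `info`); `summarize_scheduler_docs` is a hypothetical helper, and the handling of `info` is an assumption worth verifying against your server version.

```python
from typing import Any, Dict, List


def summarize_scheduler_docs(body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a /_scheduler/docs response into one record per replication doc."""
    records: List[Dict[str, Any]] = []
    for entry in body.get("docs", []):
        info = entry.get("info")
        # Assumption: for failing jobs `info` describes the error, either as
        # a plain string or as an object with an "error" key.
        reason = info.get("error") if isinstance(info, dict) else info
        records.append({
            "replication_id": entry.get("doc_id"),
            "state": entry.get("state"),
            "error_count": entry.get("error_count", 0),
            "reason": reason,
        })
    return records
```

The function is pure, so it can be fed the parsed JSON from whichever HTTP client the poller uses.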

Core Implementation

The production consumer below streams the feed with non-blocking I/O, extracts each replication state transition, and dispatches a structured payload to a downstream webhook with bounded exponential backoff. It uses structured logging so every dispatch is traceable, treats delivery as at-least-once, and never buffers the whole feed in memory. Reconnection is handled by an outer supervisor loop so a dropped TCP connection does not kill the consumer — this pairs naturally with the retry patterns in Error Handling & Retry Logic.

import asyncio
import json
import logging
from typing import Any, Dict, Optional

import aiohttp

logging.basicConfig(
    level=logging.INFO,
    format='{"ts":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}',
)
logger = logging.getLogger("replicator_monitor")

FEED_URL = "https://couchdb.internal:5984/_replicator/_changes"
WEBHOOK_URL = "https://pipeline-router.internal/api/v1/replication-events"

# States after which a job emits no further transitions.
TERMINAL_STATES = {"completed", "failed"}
# Only these transitions are worth paging a human or a downstream system about.
ALERT_STATES = {"crashing", "failed"}


async def dispatch_webhook(session: aiohttp.ClientSession,
                           payload: Dict[str, Any]) -> bool:
    """POST one replication event to the webhook with exponential backoff.

    Returns True on a 2xx, False after exhausting retries. Delivery is
    at-least-once: the caller must make the endpoint idempotent on
    (replication_id, seq).
    """
    max_attempts = 4
    for attempt in range(max_attempts):
        try:
            timeout = aiohttp.ClientTimeout(total=5)
            async with session.post(WEBHOOK_URL, json=payload,
                                     timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
                logger.warning("webhook %s on attempt %d for %s",
                               resp.status, attempt + 1,
                               payload["replication_id"])
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # A total-timeout expiry raises asyncio.TimeoutError, which is
            # not an aiohttp.ClientError, so catch both.
            logger.error("webhook delivery error: %s", exc)
        if attempt < max_attempts - 1:
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    return False


def build_payload(change: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Turn a raw _changes line into a structured event, or None to skip."""
    doc = change.get("doc") or {}
    state = doc.get("_replication_state")
    doc_id = doc.get("_id")
    if not state or not doc_id or doc_id.startswith("_design/"):
        return None
    return {
        "replication_id": doc_id,
        "state": state,
        "reason": doc.get("_replication_state_reason"),
        "seq": change.get("seq"),
        "source": doc.get("source"),
        "target": doc.get("target"),
        "alert": state in ALERT_STATES,
    }


async def consume_once(session: aiohttp.ClientSession, since: str) -> str:
    """Stream the feed until it drops; return the last seq seen for resume."""
    params = {"feed": "continuous", "include_docs": "true",
              "heartbeat": 30000, "since": since}
    timeout = aiohttp.ClientTimeout(total=None)
    try:
        async with session.get(FEED_URL, params=params,
                               timeout=timeout) as resp:
            resp.raise_for_status()
            async for line in resp.content:
                if not line.strip():
                    continue  # heartbeat keep-alive
                try:
                    change = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a malformed or partial line
                since = change.get("seq", since)
                payload = build_payload(change)
                if payload is None:
                    continue
                if not await dispatch_webhook(session, payload):
                    logger.critical("dropped event for %s after retries",
                                    payload["replication_id"])
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        # Handle the drop here so the caller still receives the latest seq;
        # letting it propagate would make the supervisor resume from a
        # stale `since`.
        logger.warning("feed dropped (%s); will resume from %s", exc, since)
    return since


async def run() -> None:
    """Supervisor loop: reconnect on drop, resuming from the last seq."""
    since = "now"
    connector = aiohttp.TCPConnector(keepalive_timeout=45, limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        while True:
            try:
                since = await consume_once(session, since)
            except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                logger.warning("feed dropped (%s); reconnecting", exc)
            await asyncio.sleep(1)  # brief pause before reconnect


if __name__ == "__main__":
    asyncio.run(run())

Two non-obvious lines carry the reliability weight. The since = change.get("seq", since) assignment inside consume_once tracks the most recent sequence seen, so that a reconnect can resume from that point rather than restarting at now and missing everything in between. The cursor lives only in memory, so a process restart still begins at now unless you persist it. And skipping documents whose _id starts with _design/ prevents the design documents used for filtered replication from generating phantom events.
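Because the resume cursor is held only in a local variable, it does not survive a process restart. One way to close that gap is a small file-backed store, sketched below under the assumption that a local writable path is available; `SeqStore` and its file layout are hypothetical, and a database row or key-value entry would serve equally well.

```python
import json
import os
import tempfile
from pathlib import Path


class SeqStore:
    """Persist the last processed _changes sequence to a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self, default: str = "now") -> str:
        """Return the stored sequence, or `default` if none is usable."""
        try:
            return json.loads(self.path.read_text())["since"]
        except (FileNotFoundError, KeyError, TypeError, json.JSONDecodeError):
            return default

    def save(self, seq: str) -> None:
        """Write atomically: temp file in the same directory, then rename."""
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent))
        with os.fdopen(fd, "w") as fh:
            json.dump({"since": seq}, fh)
        os.replace(tmp, self.path)
```

A supervisor could call `load()` once at startup in place of the literal "now" and `save()` after each successfully dispatched event; saving only after dispatch keeps delivery at-least-once.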

Strategy Variants & Trade-offs

There is no single correct way to surface replication state — the right delivery model depends on how much latency you can tolerate and how much delivery machinery you want to operate. Three patterns cover the practical space.

Continuous _changes streaming (direct webhook). The consumer above: a single long-lived feed connection dispatching webhooks inline. Lowest latency, minimal moving parts, but the consumer is a single point of failure and must be supervised and leader-guarded.
Scheduler polling (_scheduler/docs). A timed loop that reads GET /_scheduler/docs every N seconds and diffs states. Simpler to reason about and inherently idempotent (you emit only on observed transitions), but adds polling-interval latency and load, and can miss short-lived transitions that begin and end between polls.
Message-queue fan-out. The feed consumer publishes raw events to a broker (Kafka, NATS, SQS) and separate workers handle delivery, retry, and alerting. Highest reliability and horizontal scale, decoupled from the endpoint’s availability, at the cost of an extra system to run and end-to-end ordering guarantees you must design for.

| Strategy | Latency | Delivery reliability | Operational complexity | Best fit |
| --- | --- | --- | --- | --- |
| Continuous streaming + direct webhook | Lowest (sub-second) | At-least-once, endpoint-coupled | Low | Small fleets, one router endpoint |
| Scheduler polling (`_scheduler/docs`) | Medium (poll interval) | Exactly-on-transition, self-healing | Low | Batch dashboards, coarse alerting |
| Message-queue fan-out | Low + broker hop | Durable, replayable | High | Large fleets, many consumers |

Whatever the transport, event ordering is per-document, not global: two jobs’ transitions can interleave arbitrarily, so downstream logic must key on replication_id and never assume a global sequence.
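The scheduler-polling variant reduces to diffing two snapshots keyed by replication ID. The sketch below shows that diff in isolation; `diff_states` is a hypothetical helper, and each snapshot is assumed to be a plain mapping of replication ID to state built from successive polls.

```python
from typing import Dict, List, Tuple


def diff_states(previous: Dict[str, str],
                current: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (replication_id, old_state, new_state) for each observed change."""
    transitions: List[Tuple[str, str, str]] = []
    for rep_id, new_state in sorted(current.items()):
        # Jobs absent from the previous snapshot are reported as "unknown".
        old_state = previous.get(rep_id, "unknown")
        if old_state != new_state:
            transitions.append((rep_id, old_state, new_state))
    return transitions
```

Because only the two endpoints of each interval are compared, a job that goes from running to crashing and back to running between polls produces no transition at all, which is the blind spot noted for this strategy.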

Checkpoint Correlation

State transitions alone do not tell you how far behind a job is. To measure sync lag you must correlate each running job with its _local/ checkpoint document, which stores a session_id, the last-processed sequence (source_last_seq), and a history array, but no revision trees or conflict metadata. A job that reports running while its source_last_seq stays frozen across polls, even though the source keeps receiving writes, is stalled although its state never changed; that is exactly the silent-degradation case webhooks miss unless you enrich them. The full extraction and diffing procedure is covered in Monitoring Replication Checkpoints via API; wire its lag metric into the build_payload step so every event for a running job carries a lag_seconds field your alerting can threshold on.
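The frozen-checkpoint condition can be tracked with very little state. The class below is a sketch, not the procedure from the linked page: `StallDetector` is a hypothetical name, the caller is assumed to supply `source_last_seq` from the checkpoint document and a timestamp in seconds, and the 300-second default is an arbitrary placeholder.

```python
from typing import Dict, Optional, Tuple


class StallDetector:
    """Flag running jobs whose checkpoint sequence has stopped advancing."""

    def __init__(self, stall_after_seconds: float = 300.0) -> None:
        self.stall_after = stall_after_seconds
        # replication_id -> (last seen sequence, time it was first seen)
        self._last: Dict[str, Tuple[str, float]] = {}

    def observe(self, rep_id: str, state: str, source_last_seq: object,
                now: float) -> Optional[float]:
        """Return seconds frozen once past the threshold, else None."""
        if state != "running":
            self._last.pop(rep_id, None)  # only running jobs can stall
            return None
        seq = str(source_last_seq)
        prev = self._last.get(rep_id)
        if prev is None or prev[0] != seq:
            self._last[rep_id] = (seq, now)  # progress made; restart the clock
            return None
        frozen_for = now - prev[1]
        return frozen_for if frozen_for >= self.stall_after else None
```

A frozen sequence is only a stall if the source is still taking writes, so in practice the result should be combined with a comparison against the source database's current update sequence before alerting.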

Deployment & Orchestration

The consumer is a long-lived stateful process, and the single hard rule is exactly one active consumer per source feed. Two consumers on the same _replicator feed will both observe the same transitions and both fire webhooks, doubling every alert; if they also act on conflicts they will race to tombstone the same leaves and collide with 409s. Enforce this with a leader lease (a short-TTL lock document, a Consul/etcd session, or a Kubernetes Lease) so exactly one replica is active and a standby takes over on failure.

Configuration should come from the environment, never hard-coded:

# Dockerfile — single-replica monitor
FROM python:3.12-slim
RUN pip install --no-cache-dir aiohttp
COPY monitor.py /app/monitor.py
ENV COUCH_URL="https://couchdb.internal:5984" \
    WEBHOOK_URL="https://pipeline-router.internal/api/v1/replication-events" \
    HEARTBEAT_MS="30000"
# Assumes monitor.py serves its health endpoint on 127.0.0.1:8081.
HEALTHCHECK --interval=30s --timeout=5s CMD python -c "import socket,sys; \
  s=socket.create_connection(('127.0.0.1', 8081), 3); sys.exit(0)"
CMD ["python", "/app/monitor.py"]
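On the Python side, the variables set in the image can be read once at startup instead of being hard-coded as module constants. The following is a minimal sketch; `MonitorConfig` and `load_config` are hypothetical names, and the variable names are the ones used in the Dockerfile above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitorConfig:
    feed_url: str
    webhook_url: str
    heartbeat_ms: int


def load_config(env: Mapping[str, str] = os.environ) -> MonitorConfig:
    """Build the monitor's settings from environment variables.

    COUCH_URL and WEBHOOK_URL are required (a KeyError means the deployment
    is misconfigured); HEARTBEAT_MS falls back to 30000.
    """
    couch_url = env["COUCH_URL"].rstrip("/")
    return MonitorConfig(
        feed_url=f"{couch_url}/_replicator/_changes",
        webhook_url=env["WEBHOOK_URL"],
        heartbeat_ms=int(env.get("HEARTBEAT_MS", "30000")),
    )
```

Passing the environment as a parameter keeps the loader easy to exercise in tests with a plain dictionary.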

Run it as a Deployment with replicas: 1 (leader-guarded) or a StatefulSet when you shard by source database — one consumer per shard, each owning a disjoint set of _replicator documents. Expose a lightweight /healthz endpoint that returns 200 only while the feed connection is live and the last event (or heartbeat) arrived within a threshold; a liveness probe that trips on a dead feed forces a restart and a leader failover instead of leaving a zombie consumer holding the lease.
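The freshness rule behind such a health endpoint is small enough to sketch on its own. `FeedHealth` below is a hypothetical helper: the consumer would call `touch()` for every feed line, heartbeats included, and whatever HTTP handler serves `/healthz` would return `status_code()`. The 90-second default is an assumed value, chosen only to sit well above a 30-second heartbeat.

```python
import time
from typing import Callable, Optional


class FeedHealth:
    """Report healthy only while the feed has shown recent activity."""

    def __init__(self, max_silence_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_silence = max_silence_seconds
        self._clock = clock
        self._last_activity: Optional[float] = None

    def touch(self) -> None:
        """Record that an event or heartbeat line just arrived."""
        self._last_activity = self._clock()

    def status_code(self) -> int:
        """Return 200 while the feed is fresh, 503 otherwise."""
        if self._last_activity is None:
            return 503  # nothing received yet
        silent_for = self._clock() - self._last_activity
        return 200 if silent_for <= self.max_silence else 503
```

Using a monotonic clock avoids false failures when the wall clock is adjusted.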

Troubleshooting & Common Errors

| Symptom / code | Likely cause | Remediation |
| --- | --- | --- |
| `401 Unauthorized` on the feed | `replicator_svc` credentials missing or lacking `_admin` on `_replicator` | Provide session/basic auth on the feed request; grant the role in the security context. |
| Feed disconnects every ~30–60s | Idle-connection reaping by a proxy or load balancer | Set `heartbeat` (e.g. 30000) and `socket_options` `keepalive`; raise the LB idle timeout above the heartbeat. |
| Duplicate webhooks for one transition | More than one active consumer, or a reconnect replaying from an old `since` | Enforce one leader per feed; persist and resume from the last `seq`; make the endpoint idempotent on `(replication_id, seq)`. |
| Job shows `crashing` repeatedly | Transient target errors, `connection_timeout` too low, or auth expiry | Read `_replication_state_reason` from `_scheduler/docs`; raise `connection_timeout`; refresh credentials. |
| `running` state but `source_last_seq` frozen | Stalled worker: write conflict, target validation reject, or revision-traversal loop | Correlate the checkpoint; inspect target `_active_tasks`; escalate unresolvable docs to manual review sync queues. |
| Webhook endpoint returns `409` | Non-idempotent handler treating a retried delivery as new | De-duplicate on `(replication_id, seq)`; return `2xx` for already-processed events. |
| Feed emits `completed` immediately | Job is one-shot (`continuous: false`) | Expected for one-way syncs; treat as a single lifecycle event, not a stall. |
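Several of the remediations above depend on the receiving endpoint being idempotent on `(replication_id, seq)`. The sketch below shows the core of that check on the receiver side; `EventDeduplicator` is a hypothetical name, and the in-memory set is an assumption made for brevity (a real receiver would want a durable store with expiry).

```python
from typing import Any, Dict, Set, Tuple


class EventDeduplicator:
    """Accept each (replication_id, seq) pair once; acknowledge repeats."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def handle(self, payload: Dict[str, Any]) -> Tuple[int, bool]:
        """Return (http_status, is_new) for an incoming event payload."""
        key = (payload["replication_id"], str(payload["seq"]))
        if key in self._seen:
            # Already processed: answer 2xx so the sender stops retrying.
            return 200, False
        self._seen.add(key)
        return 200, True
```

Only when `is_new` is true should the handler perform side effects such as paging or enqueueing.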

FAQ

Does CouchDB send webhooks itself, or do I have to build the consumer?

CouchDB has no built-in webhook mechanism for _replicator events. It exposes the _changes feed on the _replicator system database as an append-only event stream, and you build a consumer (like the one above) that translates state transitions into outbound HTTP calls, queue messages, or alerts. The database’s job ends at faithfully reporting state; delivery is yours.

How many consumers can I run against one `_replicator` feed?

Exactly one active consumer per feed. Multiple consumers on the same feed observe identical transitions and each fire webhooks, doubling alerts, and if they also act on conflicts they race to tombstone the same revisions and collide with 409s. Scale by sharding the source (per database or device namespace) with one leader-guarded consumer per shard.

Why do I still get "false alarm" `crashing` events on a healthy job?

crashing is a transient state: the scheduler enters it on any recoverable error (a slow target, a dropped socket, a brief auth blip) and then retries with exponential backoff. Only treat repeated or failed transitions as actionable — read _replication_state_reason from _scheduler/docs to distinguish a one-off blip from a persistent fault before you page anyone.
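One way to act on "repeated or failed" is to debounce crashing observations per job. The class below is a sketch under assumed parameters: `CrashDebouncer` is a hypothetical name, the threshold of three is arbitrary, and the states are expected to come from whichever source reports them (for example `_scheduler/docs`).

```python
from collections import defaultdict
from typing import DefaultDict


class CrashDebouncer:
    """Alert on `failed` at once, on `crashing` only after a sustained streak."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._streak: DefaultDict[str, int] = defaultdict(int)

    def should_alert(self, rep_id: str, state: str) -> bool:
        if state == "failed":
            self._streak.pop(rep_id, None)
            return True
        if state == "crashing":
            self._streak[rep_id] += 1
            # Fire once, exactly when the streak reaches the threshold.
            return self._streak[rep_id] == self.threshold
        self._streak.pop(rep_id, None)  # any healthy state resets the streak
        return False
```

Firing only when the streak first reaches the threshold keeps a long outage from producing a page on every subsequent observation.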

My consumer reconnected and I think I lost events during the gap — how do I avoid that?

Persist the last seq you processed and pass it as since on reconnect instead of now. now starts at the current tail and silently discards everything that happened while you were disconnected; resuming from the stored sequence replays the gap. CouchDB’s feed is at-least-once from a given sequence, so make your webhook endpoint idempotent on (replication_id, seq).

Can webhook automation resolve conflicts for me?

Only indirectly. The replicator copies every conflicting revision, and CouchDB deterministically picks a winner for default reads by preferring non-deleted revisions, then the highest generation, then the highest revision hash (never a timestamp), leaving the conflict for your application to resolve. A webhook consumer can detect crashing/failed states, read documents with ?conflicts=true, and trigger a resolver or route the document onward, but the merge logic itself is a matter of algorithm selection for your merge code, not something CouchDB does for you.
