Monitoring CouchDB Replication Checkpoints via the API

Your continuous replication job still reports running, yet the target database has not advanced in twenty minutes and an edge fleet is quietly falling behind. The checkpoint — CouchDB’s authoritative record of how far a replication has safely progressed — has stalled, and nothing in the default logs shouts about it. This page shows how to read that checkpoint directly from the HTTP API, correlate it with live _active_tasks telemetry, and turn a silent stall into a measurable, alertable signal before it becomes data loss. It is written for edge/IoT developers, mobile backend engineers, and Python sync-pipeline builders, and it is the checkpoint-level companion to the Async Monitoring & Webhooks layer; if you have not yet wired the _replicator _changes feed into an event stream, start there and use this page to add checkpoint-progress detection on top.

Immediate Triage / Prerequisites

Before instrumenting anything, confirm the symptom is a genuine checkpoint stall and not a job that has already crashed. A crashed job is visible in the scheduler; a stalled-but-running job is not, which is exactly why checkpoints must be read directly. Start with the scheduler and the live task list:

```
# scheduler view: is the job even alive, and in what state?
curl -s http://localhost:5984/_scheduler/jobs | \
  python3 -c "import sys,json;[print(j['id'],(j.get('info') or {}).get('checkpointed_source_seq'),(j.get('info') or {}).get('changes_pending')) for j in json.load(sys.stdin)['jobs']]"

# live task telemetry for replication jobs only
curl -s http://localhost:5984/_active_tasks | \
  python3 -c "import sys,json;[print(t.get('doc_id'),t.get('source_seq'),t.get('checkpointed_source_seq')) for t in json.load(sys.stdin) if t.get('type')=='replication']"
```

If the job is present and running but checkpointed_source_seq is identical across two polls a minute apart while changes_pending stays above zero, you have a stalled checkpoint rather than an idle-but-healthy one. Contrast that with a job whose changes_pending is 0: that job is simply caught up, and a static checkpoint is correct.

Prerequisites for the steps below: Python 3.8+ with the requests library (pip install requests), admin credentials for the CouchDB cluster (the _local/ and _scheduler endpoints require them), and network reach to both source and target.

One rule to internalize first: the checkpoint on the target is the source of truth for durability. _active_tasks shows in-flight progress, but only the persisted _local/ document tells you what has actually been committed and would survive a worker restart.
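The two-poll comparison described above can be sketched as a small pure function. This is a minimal illustration, not part of CouchDB: the function name `classify_polls` and the verdict strings are assumptions, and each argument is assumed to be the `info` dictionary for one job as printed by the scheduler command above.

```python
def seq_prefix(seq):
    """Leading integer of a sequence such as '18472-g1AAA', or None if absent."""
    if seq is None:
        return None
    return int(str(seq).split("-", 1)[0])


def classify_polls(prev, curr):
    """Compare two polls of one job, taken roughly a minute apart.

    prev and curr are dicts carrying 'checkpointed_source_seq' and
    'changes_pending'. The verdict names are hypothetical labels.
    """
    prev_pending = prev.get("changes_pending")
    curr_pending = curr.get("changes_pending")
    if curr_pending == 0:
        return "caught_up"        # a static checkpoint is correct here
    before = seq_prefix(prev.get("checkpointed_source_seq"))
    after = seq_prefix(curr.get("checkpointed_source_seq"))
    if None in (prev_pending, curr_pending, before, after):
        return "inconclusive"     # not enough data to call a stall
    if after != before:
        return "progressing"
    # Checkpoint unchanged: a stall only if work was pending at both polls.
    return "stalled" if prev_pending > 0 and curr_pending > 0 else "inconclusive"
```

Only the integer prefix is compared, so two polls whose opaque suffixes differ but whose prefixes match still count as an unchanged checkpoint.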

Step-by-Step Implementation

Follow these steps to read the checkpoint, decode its sequence state, and quantify drift. Each step includes a command or assertion so you can verify state before moving on.

Locate the replication ID. The checkpoint document ID is _local/<replication_id>, and the CouchDB cluster computes replication_id as a hash of the job parameters — you do not author it. Read it from the scheduler rather than guessing:
```
curl -s http://localhost:5984/_scheduler/docs/_replicator/rep_edge_to_cloud_01 | \
  python3 -c "import sys,json;print(json.load(sys.stdin)['id'])"
# 6f3c...+continuous
```
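Verify the value is non-empty. An absent replication ID means the job never started; check the _replicator document against its schema for read-only fields that were authored by hand before going further. The +continuous suffix is an extension of the ID, not part of the checkpoint document name, so the next step uses only the portion before the first +.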
Read the persisted checkpoint. Issue an authenticated GET for the _local/ document on the target database. It carries session_id, source_last_seq, and a history array of prior sync windows:
```
curl -s "http://localhost:5984/aggregate_telemetry/_local/6f3c..." | \
  python3 -c "import sys,json;d=json.load(sys.stdin);print('source_last_seq',d['source_last_seq']);print('sessions',len(d['history']))"
```
Verify source_last_seq is present. The numeric prefix before the first - is the committed source sequence; the rest is an opaque, node-specific packed value you must not compare across nodes.
Read the source’s current position. Compare the checkpoint against how far the source has actually moved. The database’s update_seq is the head of its _changes feed:
```
curl -s "http://localhost:5984/sensor_data?_=1" | \
  python3 -c "import sys,json;print('update_seq',json.load(sys.stdin)['update_seq'])"
```
The gap between the source update_seq and the checkpoint’s source_last_seq is your backlog in sequence units. A large, growing gap is the headline drift metric.
Cross-check the live worker. Read _active_tasks for the same job and compare source_seq (what the worker has read) with checkpointed_source_seq (what it has durably committed):
```
src = int("18509-g1AAA".split("-", 1)[0])          # source_seq prefix
ckpt = int("18472-g1AAA".split("-", 1)[0])          # checkpointed_source_seq prefix
assert src - ckpt >= 0                               # worker is ahead of, or level with, the checkpoint
```
A persistent nonzero source_seq − checkpointed_source_seq means the worker is reading but not committing — usually target-side write backpressure or doc_write_failures, not conflicts. Conflicts never block checkpoint advancement; the replicator stacks the divergent revision and moves on.
Classify the stall direction. If checkpointed_source_seq is frozen while docs_read keeps climbing, the worker is stuck flushing writes to the target. If both freeze together, the source read side has stalled (network partition, auth expiry, or a filter function error). Route the escalation accordingly using the error handling & retry logic playbook.
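The stall-direction rule in the last step can be expressed as a comparison of two _active_tasks samples for the same job. The sketch below is illustrative only: `stall_direction` and its return labels are hypothetical names, and it assumes the job has already been confirmed as stalled (changes_pending above zero), since a caught-up job also shows both counters frozen.

```python
def seq_prefix(seq):
    """Leading integer of a sequence string such as '18509-g1AAA'."""
    return int(str(seq).split("-", 1)[0])


def stall_direction(prev_task, curr_task):
    """Classify one job from two _active_tasks entries taken some time apart.

    Each argument is a dict with 'checkpointed_source_seq' and 'docs_read'.
    """
    ckpt_moved = (
        seq_prefix(curr_task["checkpointed_source_seq"])
        > seq_prefix(prev_task["checkpointed_source_seq"])
    )
    reads_moved = curr_task["docs_read"] > prev_task["docs_read"]
    if ckpt_moved:
        return "advancing"
    if reads_moved:
        # Reading continues but nothing is being committed downstream.
        return "target_write_stall"
    # Neither reads nor checkpoints moved: the source read side is stuck.
    return "source_read_stall"
```

A "target_write_stall" result points at write backpressure or doc_write_failures on the target; a "source_read_stall" result points at the network, credentials, or a filter function on the source.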

Complete Working Example

The script below is self-contained and runnable. It reads the persisted _local/ checkpoint, the source update_seq, and the live _active_tasks entry for one job, computes both drift metrics, and returns a structured verdict you can forward to a webhook or alerting sink. It parses only the integer sequence prefix — never the opaque packed tail — and treats a caught-up job (changes_pending == 0) as healthy even when the checkpoint is static.

```python
import sys
import time

import requests


def seq_prefix(seq) -> int:
    """Return the integer sequence prefix from a CouchDB update_seq string.

    Sequences look like '18509-g1AAAAB...'; only the leading integer is
    portable. The opaque tail is node-specific and must never be compared.
    """
    if seq is None:
        return 0
    return int(str(seq).split("-", 1)[0])


def checkpoint_health(base_url, source_db, target_db, replication_id,
                      job_doc_id, session, stall_seconds=300, drift_budget=10000):
    """Compute checkpoint drift for one replication job.

    Reads the persisted _local checkpoint on the target, the source's
    current update_seq, and the live _active_tasks entry, then returns a
    verdict dict. A job with changes_pending == 0 is healthy even if the
    checkpoint is static (it has simply caught up). stall_seconds is
    accepted but not used by this single-shot check.
    """
    # The checkpoint document is keyed by the base ID only, so drop any
    # "+continuous"-style extension before building the _local/ path.
    base_id = replication_id.split("+", 1)[0]

    # Persisted, durable checkpoint on the target -> source of truth.
    ckpt = session.get(
        f"{base_url}/{target_db}/_local/{base_id}", timeout=30
    )
    ckpt.raise_for_status()
    committed = seq_prefix(ckpt.json().get("source_last_seq"))

    # Head of the source _changes feed -> how far there is to go.
    src = session.get(f"{base_url}/{source_db}", timeout=30)
    src.raise_for_status()
    source_head = seq_prefix(src.json()["update_seq"])

    # Live worker telemetry for this job, if a worker is currently running.
    tasks_resp = session.get(f"{base_url}/_active_tasks", timeout=30)
    tasks_resp.raise_for_status()
    task = next(
        (t for t in tasks_resp.json()
         if t.get("type") == "replication" and t.get("doc_id") == job_doc_id),
        None,
    )
    # None means "unknown": no live worker, or the worker has not reported it.
    changes_pending = task.get("changes_pending") if task else None
    read_ahead = (
        seq_prefix(task["source_seq"]) - seq_prefix(task["checkpointed_source_seq"])
        if task else 0
    )

    backlog = source_head - committed
    # Without a changes_pending figure, fall back to the persisted backlog so a
    # missing worker is never reported as caught up while the target is behind.
    caught_up = (changes_pending == 0) if changes_pending is not None else backlog <= 0
    stalled = (not caught_up) and backlog > drift_budget

    return {
        "committed_seq": committed,
        "source_head": source_head,
        "backlog": backlog,          # sequence units the target still owes
        "read_ahead": read_ahead,    # read but not yet checkpointed
        "changes_pending": changes_pending,
        "verdict": "stalled" if stalled else ("caught_up" if caught_up else "draining"),
        "checked_at": time.time(),
    }


if __name__ == "__main__":
    s = requests.Session()
    s.auth = ("admin", "password")  # _local/ and _active_tasks need admin
    report = checkpoint_health(
        base_url="http://localhost:5984",
        source_db="sensor_data",
        target_db="aggregate_telemetry",
        replication_id="6f3c...+continuous",
        job_doc_id="rep_edge_to_cloud_01",
        session=s,
    )
    print(report)
    sys.exit(0 if report["verdict"] != "stalled" else 1)
```

Run it on a healthy pipeline and the verdict reads draining or caught_up; run it against a wedged job and it returns stalled with a nonzero exit code, ready to drive an alert or a controlled reset.

Gotchas & Edge Cases

Never compare full sequence strings across nodes. The opaque suffix after the integer prefix (18509-g1AAAAB...) is a packed, node-specific token. Two nodes at “the same” logical position can carry different suffixes; only the leading integer is portable for lag math.
A static checkpoint on a caught-up job is correct, not broken. When changes_pending is 0 the worker has nothing to commit, so checkpointed_source_seq legitimately stops moving. Gate every stall alarm on changes_pending > 0 or you will page yourself over idle-but-healthy jobs.
Read the checkpoint from the target, not the source. CouchDB writes a _local/ checkpoint on both ends, but the target’s copy is the durability anchor for a source → target job. Reading only the source can show progress that never actually committed downstream.
missing_revisions_found is not an error. It counts revisions the target lacks during normal catch-up. A persistent docs_read − docs_written gap points at doc_write_failures (validation rejection, auth, backpressure) — a different failure class covered by handling 409 conflicts in replication jobs.
A controlled reset means a new _id, not an edited one. To force a fresh checkpoint you delete the _replicator document, delete the stale _local/ checkpoint on both ends, and recreate the job with a new _id. Do not skip the checkpoint cleanup: the replication_id is a hash of the job parameters, so a recreated job with the same source, target, and options finds and resumes from any _local/ checkpoint left behind.
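To separate a transient in-flight difference from the persistent docs_read − docs_written gap described above, one option is to require the gap across several consecutive samples. The following is a sketch under that assumption; `persistent_write_gap`, `write_failures_rising`, and the `min_samples` default are hypothetical, and each sample is assumed to be one job's _active_tasks entry. missing_revisions_found is deliberately ignored.

```python
def write_gap(task):
    """docs_read minus docs_written for one _active_tasks entry."""
    return task.get("docs_read", 0) - task.get("docs_written", 0)


def persistent_write_gap(samples, min_samples=3):
    """True when each of the last min_samples samples has docs_read > docs_written."""
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False          # too little history to call it persistent
    return all(write_gap(s) > 0 for s in recent)


def write_failures_rising(samples):
    """True when doc_write_failures grew between the first and last sample."""
    if len(samples) < 2:
        return False
    first = samples[0].get("doc_write_failures", 0)
    last = samples[-1].get("doc_write_failures", 0)
    return last > first
```

A gap that appears in one sample and closes in the next is normal batching; a gap that survives every sample, especially with rising doc_write_failures, is the failure class worth escalating.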

Verification & Observability

Confirm the checkpoint is healthy at two levels: the persisted document and the live scheduler. For durability, re-read the _local/ checkpoint and assert source_last_seq advanced between polls. For liveness, watch the scheduler drain:

```
# checkpoint is moving: run twice, a minute apart, and diff the prefixes
curl -s "http://localhost:5984/aggregate_telemetry/_local/6f3c..." | \
  python3 -c "import sys,json;print(json.load(sys.stdin)['source_last_seq'].split('-',1)[0])"

# scheduler backlog is trending toward zero, not flat
curl -s http://localhost:5984/_scheduler/jobs | \
  python3 -c "import sys,json;[print(j['id'],(j.get('info') or {}).get('changes_pending')) for j in json.load(sys.stdin)['jobs']]"
```

A healthy result is a source_last_seq prefix that increases between polls and a changes_pending value trending toward zero. Emit both the target backlog (source update_seq − source_last_seq) and the read-ahead (source_seq − checkpointed_source_seq) as gauges on a fixed interval; a backlog that climbs while the job stays running is the earliest machine-readable sign of drift. Feed a threshold breach into the same webhook fan-out described in Async Monitoring & Webhooks so an operator is paged before downstream consumers see stale data. For jobs that legitimately stop and restart, correlate the checkpoint against the continuous-versus-one-shot behaviour in continuous vs one-way sync.
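The two gauges and the "backlog that climbs" condition can be computed from values the earlier commands already print. This is a minimal sketch rather than a prescribed metric pipeline: the function names and the three-point window are assumptions, and the sequence strings in the usage note are made-up examples.

```python
def seq_prefix(seq):
    """Leading integer of a sequence string such as '18509-g1AAA'."""
    return int(str(seq).split("-", 1)[0])


def gauges(source_update_seq, source_last_seq, source_seq, checkpointed_source_seq):
    """Return the target backlog and the worker read-ahead, in sequence units."""
    return {
        "backlog": seq_prefix(source_update_seq) - seq_prefix(source_last_seq),
        "read_ahead": seq_prefix(source_seq) - seq_prefix(checkpointed_source_seq),
    }


def backlog_climbing(backlogs, min_points=3):
    """True when the last min_points backlog readings are strictly increasing."""
    recent = backlogs[-min_points:]
    if len(recent) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For example, `gauges("18600-g1AAA", "18472-g1AAA", "18509-g1AAA", "18472-g1AAA")` yields a backlog of 128 and a read-ahead of 37. Requiring several strictly increasing readings before alerting keeps a single burst of source writes from paging anyone.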

FAQ

Why does checkpointed_source_seq lag behind source_seq even on a healthy job?

That gap is expected and harmless in moderation. source_seq is how far the worker has read from the source; checkpointed_source_seq is how far it has durably persisted a checkpoint. CouchDB checkpoints periodically (by batch and time), so the worker is normally a little ahead of its last commit. It only signals trouble when the gap grows and stays large while changes_pending is above zero — that means reads are happening but writes are not committing, usually target-side backpressure or doc_write_failures.

Do I read the _local/ checkpoint from the source or the target database?

Read it from the target for a source → target job. CouchDB persists a _local/<replication_id> checkpoint on both ends, but the target’s copy is the durability anchor — it records what has actually landed downstream and would survive a worker restart. The source-side copy can show a position that never committed to the target, so trusting it alone masks exactly the stalls you are trying to catch.

My checkpoint is frozen but the job is not crashing — how do I recover it?

First confirm it is genuinely stalled: changes_pending > 0 with a static checkpointed_source_seq across polls. If so, perform a controlled reset: delete the _replicator document to stop the worker, delete the _local/ checkpoint on both source and target, then recreate the job with a new _id. With the old checkpoint documents gone, the job starts from a fresh checkpoint even though the same job parameters hash to the same replication_id. Watch the initial _changes progress to confirm sequences advance monotonically before you close the incident.
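Because the reset is destructive, it can help to build the list of requests first and review it before sending anything. The sketch below only assembles that plan and performs no HTTP calls. `reset_plan` is a hypothetical helper, it assumes source and target live behind the same base URL as in the examples on this page, and each DELETE would additionally need the document's current revision supplied when it is actually issued.

```python
from urllib.parse import quote


def reset_plan(base_url, source_db, target_db, old_doc_id, replication_id, new_doc_id):
    """Return the ordered (method, url) pairs for a controlled reset.

    Nothing is sent; the caller reviews the plan and issues the requests.
    """
    # Checkpoint documents are keyed by the base ID, without "+continuous".
    base_id = quote(replication_id.split("+", 1)[0], safe="")
    old_doc = quote(old_doc_id, safe="")
    new_doc = quote(new_doc_id, safe="")
    return [
        ("DELETE", f"{base_url}/_replicator/{old_doc}"),          # stop the worker
        ("DELETE", f"{base_url}/{source_db}/_local/{base_id}"),   # source checkpoint
        ("DELETE", f"{base_url}/{target_db}/_local/{base_id}"),   # target checkpoint
        ("PUT", f"{base_url}/_replicator/{new_doc}"),             # recreate the job
    ]
```

Keeping the order explicit matters: removing the checkpoints while the old job is still running would let the worker write them back.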

Part of: Async Monitoring & Webhooks
