Base Backup Integration for PITR: Anchoring Binary Log Archives to a Verifiable Recovery Coordinate

Point-in-time recovery in production MySQL rarely fails because binary logs are missing. It fails silently when the base backup drifts out of alignment with the archived transaction stream — the snapshot’s captured position and the first archived segment do not overlap, so mysqlbinlog replay either skips valid transactions or halts on a gap. A base backup integrated for PITR is not a scheduled dump; it is a deterministic anchor that establishes a verifiable recovery coordinate. This page solves the coordinate problem specifically: how to capture a gap-free GTID position at snapshot time, emit a signed manifest that interlocks with the Automated Binlog Archiving to Object Storage pipeline, and prove — before an incident — that the snapshot plus its downstream logs replay end to end. Naive approaches fail because they treat the backup and the log archive as independent artifacts with independent lifecycles; the moment their coordinate spaces disagree, recovery is impossible regardless of how healthy either side looks in isolation.

Core Concept & Prerequisites

The integration rests on one invariant: the base backup’s captured gtid_executed must be a contiguous prefix of the archived binary log stream that follows it. If the snapshot ends at GTID interval ...:1-1050 and the first archived segment begins at ...:1052, transaction 1051 exists nowhere and the recovery coordinate is broken. Enforcing this invariant depends on a correctly configured server and a gap-free GTID Tracking & Enforcement pipeline upstream.
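
The invariant can be checked mechanically. The following is a minimal sketch, assuming a single source UUID and untagged GTIDs; the helper names `last_gno` and `missing_between` are illustrative and not part of any MySQL tooling.

```python
def last_gno(gtid_set: str) -> int:
    """Highest transaction number in a one-UUID set such as 'uuid:1-1050'."""
    _, _, intervals = gtid_set.strip().partition(":")
    last_interval = intervals.split(":")[-1]   # final interval of the set
    return int(last_interval.split("-")[-1])   # upper bound, or the lone number


def missing_between(snapshot_gtids: str, first_segment_gno: int) -> list[int]:
    """Transaction numbers present in neither the snapshot nor the archive."""
    return list(range(last_gno(snapshot_gtids) + 1, first_segment_gno))


# Hypothetical UUID used only for illustration.
uuid = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
print(missing_between(f"{uuid}:1-1050", 1052))  # [1051]
print(missing_between(f"{uuid}:1-1050", 1051))  # []
```

An empty result means the first archived segment joins the snapshot with no hole; any other result lists the transactions that exist nowhere.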

Prerequisites before any snapshot runs:

MySQL 8.0.22+ (tested through 8.4). GTID-based coordinate capture replaces legacy File/Position tracking, which cannot survive a primary failover without ambiguity. On MySQL 8.4 several defaults changed — noted inline below.
gtid_mode = ON and enforce_gtid_consistency = ON. Without both, the server may execute GTID-unsafe statements that poison the chain (see the ERROR 1785/1786/1787 family in Error Handling & Failure Modes).
binlog_format = ROW with binlog_row_image = FULL. Statement-based logs replay non-deterministically against a restored snapshot; the trade-offs are covered in ROW vs STATEMENT vs MIXED Formats.
A physical or logical backup engine: Percona XtraBackup or MySQL Enterprise Backup for physical snapshots, or mysqldump --single-transaction --set-gtid-purged=ON for logical ones.
Python 3.10+ with mysql-connector-python (pooled coordinate capture) and tenacity (retry orchestration). Downstream transport and at-rest protection are handled by the Compression & Encryption Workflows layer, so this module stops at manifest registration.

The retention window on the primary must also outlast the archiving lag by a wide margin, or a segment can be purged before the manifest confirms it — a boundary set by Binlog Retention Boundaries.
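
As a rough illustration of "a wide margin", the check below compares the retention window with the observed archiving lag. It is a sketch only: the function name and the default factor of 4 are assumptions (the factor mirrors the recommendation in the Configuration Reference), and the lag is presumed to be measured elsewhere.

```python
def retention_margin_ok(
    binlog_expire_logs_seconds: int,
    archiving_lag_seconds: float,
    safety_factor: float = 4.0,   # assumed margin; tune to your pipeline
) -> bool:
    """True when retention outlasts the archiving lag by the safety factor."""
    return binlog_expire_logs_seconds >= safety_factor * archiving_lag_seconds


# The 8.0 default retention (30 days) against a 15-minute archiving lag.
print(retention_margin_ok(2_592_000, 900))   # True
# A one-hour retention against a 20-minute lag leaves too little headroom.
print(retention_margin_ok(3_600, 1_200))     # False
```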

Production-Grade Python Implementation

The module below runs the coordinate-capture and manifest-registration stages. It uses a pooled connection for MySQL, tenacity for bounded retries on transient capture failures, typed dataclasses for the manifest contract, structured logging, and a match statement to classify outcomes. It captures the GTID coordinate immediately after backup finalization, normalizes it, and refuses to emit a manifest if a gap is detected.

# Requires: mysql-connector-python, tenacity ; MySQL 8.0.22+
from __future__ import annotations

import hashlib
import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

import mysql.connector
from mysql.connector import pooling, Error as MySQLError
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

logging.basicConfig(
    level=logging.INFO,
    format='{"ts":"%(asctime)s","level":"%(levelname)s","stage":"%(name)s","msg":"%(message)s"}',
)
log = logging.getLogger("base-backup-anchor")


@dataclass(frozen=True, slots=True)
class BackupManifest:
    server_uuid: str
    gtid_executed: str          # contiguous, normalized GTID set at snapshot boundary
    gtid_purged: str            # what the restore target must SET GLOBAL gtid_purged to
    binlog_file: str
    binlog_position: int
    checksum_sha256: str
    timestamp_utc: str
    tenant_id: str


_POOL = pooling.MySQLConnectionPool(
    pool_name="anchor_pool",
    pool_size=4,
    host="127.0.0.1",
    user="binlog_backup",          # least-privilege role, see security-access-frameworks
    database="mysql",
    autocommit=True,
)


class GTIDGapError(RuntimeError):
    """Raised when the captured GTID set is non-contiguous and unsafe to anchor."""


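def _is_contiguous(gtid_set: str) -> bool:
    """True when each UUID (or UUID:tag, as MySQL 8.4 accepts) has exactly one interval.

    'uuid:1-1050' is contiguous; 'uuid:1-1050:1052' is not.
    """
    for uuid_block in gtid_set.split(","):
        _, _, rest = uuid_block.strip().partition(":")
        run = 0                           # consecutive intervals seen for one UUID/tag
        for part in rest.split(":"):
            run = run + 1 if part[:1].isdigit() else 0   # a tag name resets the run
            if run > 1:                   # second disjoint interval: a hole exists
                return False
    return True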


@retry(
    retry=retry_if_exception_type((MySQLError, GTIDGapError)),
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=0.5, max=8.0),
    reraise=True,
)
def capture_coordinate() -> tuple[str, str, str, str, int]:
    """Capture the recovery coordinate right after backup finalization.

    On a detected gap, flush the binary log and retry so the next closed
    segment starts on a clean interval boundary.
    """
    conn = _POOL.get_connection()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT @@global.server_uuid, @@global.gtid_executed, @@global.gtid_purged"
        )
        server_uuid, gtid_executed, gtid_purged = cur.fetchone()
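        # SHOW BINARY LOG STATUS exists from MySQL 8.2 onward; 8.0 and 8.1 only
        # accept SHOW MASTER STATUS (removed in 8.4), so fall back on a syntax error.
        try:
            cur.execute("SHOW BINARY LOG STATUS")
        except mysql.connector.ProgrammingError:
            cur.execute("SHOW MASTER STATUS")
        row = cur.fetchone()
        cur.close()
        if row is None:  # an empty result means binary logging is disabled
            raise RuntimeError("binary logging is disabled; nothing to anchor")
        binlog_file, binlog_pos = row[0], int(row[1])

        # Multi-UUID GTID sets are returned with embedded newlines; strip them.
        normalized = gtid_executed.replace("\n", "")
        purged = gtid_purged.replace("\n", "")
        if not _is_contiguous(normalized):
            log.warning("non-contiguous gtid_executed; flushing logs and retrying")
            flush = conn.cursor()
            flush.execute("FLUSH BINARY LOGS")
            flush.close()
            raise GTIDGapError(normalized)
        return server_uuid, normalized, purged, binlog_file, binlog_pos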
    finally:
        conn.close()


def sha256_of(path: Path, chunk: int = 1 << 16) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def build_manifest(artifact: Path, tenant_id: str) -> BackupManifest:
    server_uuid, gtid_executed, gtid_purged, binlog_file, binlog_pos = capture_coordinate()
    manifest = BackupManifest(
        server_uuid=server_uuid,
        gtid_executed=gtid_executed,
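        # The restore target must treat every GTID inside the snapshot as already
        # applied, so this is the snapshot's gtid_executed, not the source's gtid_purged.
        gtid_purged=gtid_executed,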
        binlog_file=binlog_file,
        binlog_position=binlog_pos,
        checksum_sha256=sha256_of(artifact),
        timestamp_utc=datetime.now(timezone.utc).isoformat(timespec="seconds"),
        tenant_id=tenant_id,
    )
    manifest_path = artifact.with_suffix(artifact.suffix + ".manifest.json")
    manifest_path.write_text(json.dumps(asdict(manifest), indent=2))
    log.info("manifest written: %s", manifest_path.name)
    return manifest


def register(artifact: Path, tenant_id: str) -> None:
    try:
        manifest = build_manifest(artifact, tenant_id)
    except Exception as exc:  # classify terminal vs transient for the caller/alerting
        match exc:
            case GTIDGapError():
                log.error("terminal: gtid gap persisted after retries; quarantining artifact")
            case MySQLError():
                log.error("terminal: MySQL unreachable after retries: %s", exc)
            case _:
                log.error("terminal: unexpected failure: %s", exc)
        raise
    log.info("anchor registered for gtid_executed=%s", manifest.gtid_executed)


if __name__ == "__main__":
    register(Path("/mnt/backup/base_20260704.xbstream"), tenant_id="prod-cluster-alpha")

The capture_coordinate retry loop is the heart of the integration. When gtid_executed shows a hole, the function issues FLUSH BINARY LOGS to rotate to a fresh segment, then re-reads the coordinate on the next attempt. The flush does not fill the hole by itself; it closes the current segment so that, once the in-flight transactions have committed, the next archived segment begins where the manifest ends. Terminal failures propagate with a classified log line, letting the async worker route the artifact to a quarantine prefix rather than confirming a broken anchor. The manifest path is derived from the artifact path, so a re-run after a crash overwrites the same file instead of creating a second one. The coordinate and timestamp are captured afresh on each run, so the contents are not guaranteed to be byte-identical.
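
If a crash can land in the middle of the manifest write, a temp-file-plus-rename pattern keeps readers from ever seeing a half-written file. This is a sketch rather than part of the module above; `write_manifest_atomically` is a hypothetical helper, and it assumes the temp file and the target share a filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_manifest_atomically(manifest_path: Path, payload: dict) -> None:
    """Write JSON to a temp file beside the target, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())           # make the bytes durable before the rename
        os.replace(tmp_name, manifest_path)  # atomic within one filesystem on POSIX
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # never leave a stray temp file
        raise
```

A re-run then replaces the previous manifest in a single step, so the path always holds either the old complete file or the new complete file.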

Configuration Reference

These server variables govern whether a captured coordinate is trustworthy. Recommended values assume a GTID-anchored PITR pipeline; verify defaults against your exact minor version.

| Variable | Type | Default (8.0) | Recommended | PITR impact |
| --- | --- | --- | --- | --- |
| `gtid_mode` | enum | `OFF` | `ON` | Enables the GTID coordinate space that the manifest anchors to; `OFF` forces fragile file/position tracking. |
| `enforce_gtid_consistency` | enum | `OFF` | `ON` | Blocks GTID-unsafe statements (ERROR 1785/1786/1787) that would corrupt the anchored chain. |
| `binlog_format` | enum | `ROW` | `ROW` | ROW gives deterministic replay against the restored snapshot; STATEMENT drifts on non-deterministic functions. |
| `binlog_row_image` | enum | `FULL` | `FULL` | FULL images make replayed row events self-contained during recovery. |
| `sync_binlog` | integer | `1` | `1` | `1` fsyncs each commit so a crash cannot leave the last archived transaction unrecorded. |
| `innodb_flush_log_at_trx_commit` | integer | `1` | `1` | Guarantees the redo/binlog boundary is durable at the captured coordinate. |
| `max_binlog_size` | integer (bytes) | `1073741824` | `256M–1G` | Smaller segments tighten the granularity between a snapshot and its first archived log. |
| `binlog_expire_logs_seconds` | integer (seconds) | `2592000` | `≥ 4× archiving lag` | Too short and a segment is purged before the manifest confirms it, breaking the chain. |
| `binlog_transaction_dependency_tracking` | enum | `COMMIT_ORDER` | `WRITESET` | WRITESET yields deterministic ordering under concurrency. Removed in MySQL 8.4, where WRITESET is the built-in behavior. |
| `binlog_rows_query_log_events` | boolean | `OFF` | `ON` | Records the originating SQL, aiding forensic timestamp-to-GTID mapping during targeting. |
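
A pre-snapshot check can compare the live server against these recommendations. The sketch below is illustrative: `RECOMMENDED` and `config_drift` are hypothetical names, only the exact-match variables are covered, and the `live` mapping is assumed to come from something like SHOW GLOBAL VARIABLES, which returns values as strings.

```python
from __future__ import annotations

# Exact-match recommendations taken from the table above.
RECOMMENDED = {
    "gtid_mode": "ON",
    "enforce_gtid_consistency": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
    "sync_binlog": "1",
    "innodb_flush_log_at_trx_commit": "1",
}


def config_drift(live: dict[str, str]) -> dict[str, tuple[str | None, str]]:
    """Return {variable: (live_value, recommended)} for every mismatch."""
    drift: dict[str, tuple[str | None, str]] = {}
    for name, want in RECOMMENDED.items():
        got = live.get(name)               # None when the variable was not reported
        if got is None or got.upper() != want:
            drift[name] = (got, want)
    return drift


live = {"gtid_mode": "ON", "enforce_gtid_consistency": "ON",
        "binlog_format": "MIXED", "binlog_row_image": "full", "sync_binlog": "1"}
print(config_drift(live))
```

Refusing to start a snapshot while the returned mapping is non-empty keeps an untrustworthy coordinate from being anchored in the first place.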

Validation & Verification Gates

A manifest that has never been replayed is a hypothesis, not a recovery asset. Four gates promote it to trusted:

1. Checksum reconciliation. After the AWS S3 & GCS Sync Pipelines upload completes, download the object’s stored checksum metadata and compare it against checksum_sha256 in the manifest. A mismatch is non-retryable: quarantine, do not overwrite.
2. GTID overlap diff. Compute the difference between the manifest’s gtid_executed and the GTID interval opening the first archived segment that follows it. The correct result is a contiguous join with no hole. Any gap fails the gate and pages immediately.
3. Dry-run replay. Resolve the object set for a recent target timestamp and decode it with mysqlbinlog, passing the manifest’s binlog_position as --start-position for the first segment and discarding the output instead of piping it to a server. This confirms that every segment decodes and that the interval chain is unbroken. Timestamp resolution itself is handled by Timestamp Targeting Strategies.
4. Full restore drill. Provision an ephemeral instance, restore the snapshot, SET GLOBAL gtid_purged to the GTID set the manifest records for the snapshot boundary, replay archived logs to a randomized recent point, then run schema validation and row-count sampling before teardown. This is the only gate that measures real RTO.

Automate gate 4 on a schedule and publish a restore.success / restore.failure metric. A chain that passes checksums but has never been replayed end to end should not be trusted, however green the upload dashboards look.
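
The checksum gate is simple enough to sketch in a few lines. The names `GateResult` and `reconcile_checksum` are assumptions, and the stored digest is presumed to have been fetched from object metadata already; how that metadata is read depends on the storage provider and is not shown.

```python
from __future__ import annotations

import hmac
from enum import Enum


class GateResult(Enum):
    PASS = "pass"
    QUARANTINE = "quarantine"   # non-retryable: never overwrite the object


def reconcile_checksum(manifest_sha256: str, stored_sha256: str | None) -> GateResult:
    """Compare the manifest digest with the digest stored alongside the object."""
    if not stored_sha256:       # missing metadata is treated as a failed gate
        return GateResult.QUARANTINE
    same = hmac.compare_digest(
        manifest_sha256.strip().lower(), stored_sha256.strip().lower()
    )
    return GateResult.PASS if same else GateResult.QUARANTINE
```

Normalizing case and whitespace avoids false mismatches when a storage layer reports the same hex digest in a different form.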

Error Handling & Failure Modes

Coordinate integrity has a small set of characteristic failure signatures, each mapping to a specific root cause and recovery procedure.

ERROR 1785 / 1786 / 1787 (GTID-unsafe statement family). ER_GTID_UNSAFE_NON_TRANSACTIONAL_TABLE, ER_GTID_UNSAFE_CREATE_SELECT, and ER_GTID_UNSAFE_CREATE_DROP_TEMPORARY_TABLE_IN_TRANSACTION surface when enforce_gtid_consistency is ON and an application issues a statement that cannot be assigned a single GTID. Root cause: schema/DML patterns incompatible with GTID replication. Recovery: fix the offending statement; never disable enforcement to “make it work,” which would let an unassignable transaction slip into the anchored chain.
ERROR 1840 (ER_CANT_SET_GTID_PURGED_WHEN_GTID_EXECUTED_IS_NOT_EMPTY). Raised during a restore drill when you SET GLOBAL gtid_purged on an instance whose gtid_executed is already populated. Root cause: restoring onto a non-fresh server. Recovery: run RESET BINARY LOGS AND GTIDS (RESET MASTER on servers older than 8.2), or provision a clean instance, before setting gtid_purged from the manifest.
ERROR 1236 (ER_MASTER_FATAL_ERROR_READING_BINLOG). During replay, “Cannot replicate because the source purged required binary logs” means the archived stream is missing the segment the coordinate points to. Root cause: retention expired a segment before the manifest confirmed it, or a genuine gap. Recovery: fail the drill, restore from an earlier snapshot whose chain is intact, and widen binlog_expire_logs_seconds.
ERROR 1062 (ER_DUP_ENTRY) on replay. A duplicate-key collision while applying archived logs signals overlap: transactions in the log were already present in the snapshot. Root cause: the coordinate was captured before the backup truly quiesced, so gtid_executed understates the snapshot. Recovery: recapture the coordinate strictly after finalization. The retry loop in the module above guards against holes in the GTID set, not against early capture, so the caller must guarantee that register() runs only once the backup engine has reported completion.

Transient capture failures (pool exhaustion, brief connection loss) are absorbed by tenacity with exponential backoff and jitter; broader queue-level backoff for downstream stages belongs to Error Handling & Retry Logic.
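
The signatures above lend themselves to a lookup table that a drill runner can consult. This is a sketch under assumptions: `RECOVERY` and `recovery_for` are hypothetical names, and the error number is presumed to come from the `errno` attribute that `mysql.connector.Error` exposes.

```python
# Error number -> recovery action, condensed from the failure modes above.
RECOVERY = {
    1785: "fix the GTID-unsafe statement; keep enforce_gtid_consistency ON",
    1786: "fix the GTID-unsafe statement; keep enforce_gtid_consistency ON",
    1787: "fix the GTID-unsafe statement; keep enforce_gtid_consistency ON",
    1840: "reset GTID state or use a clean instance, then set gtid_purged",
    1236: "fail the drill; restore an earlier intact snapshot; widen binlog_expire_logs_seconds",
    1062: "recapture the coordinate strictly after backup finalization",
}


def recovery_for(errno: int) -> str:
    """Return the documented recovery action, or a holding action if unknown."""
    return RECOVERY.get(errno, "unclassified: hold the artifact and page on-call")


print(recovery_for(1840))
print(recovery_for(2013))   # not in the table, so the holding action is returned
```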

Observability & Alerting

Instrument the anchor stage so a drifting coordinate is caught before recovery is ever needed. Before a physical snapshot, query performance_schema for long-running DDL that would extend the lock window and blur the coordinate:

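-- MySQL 8.0.22+
-- metadata_locks has no timer column, so the duration is taken from the
-- lock owner's current statement (TIMER_WAIT is in picoseconds).
SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS,
       es.TIMER_WAIT / 1e12 AS statement_seconds
FROM performance_schema.metadata_locks AS ml
JOIN performance_schema.events_statements_current AS es
  ON es.THREAD_ID = ml.OWNER_THREAD_ID
WHERE ml.LOCK_STATUS = 'GRANTED'
ORDER BY statement_seconds DESC
LIMIT 10;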

Emit these structured log fields from every anchor run so they can be aggregated: server_uuid, gtid_executed, gtid_gap_detected (boolean), capture_retries, snapshot_seconds, manifest_checksum, and tenant_id. Alert thresholds worth wiring into your monitoring stack:

Archiving lag — time between snapshot finalization and the manifest’s first following segment being confirmed. Page when it approaches a configurable fraction of binlog_expire_logs_seconds.
gtid_gap_detected = true on any run — an immediate warning; recurring gaps indicate an upstream consistency problem.
capture_retries exceeding one — the coordinate is not stabilizing cleanly; investigate backup timing.
restore.failure from the drill — the highest-severity signal, because it means the anchor is not recoverable at all.

Correlate anchor telemetry with replication lag and the archiving pipeline’s binlog_queue_depth so a single dashboard answers the only question that matters: can we recover to any point since the last verified snapshot?
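
The thresholds above could be evaluated per run roughly as follows. Everything here is illustrative: `AnchorRun`, `evaluate_alerts`, the severity labels, and the default `lag_fraction` of 0.25 are assumptions standing in for whatever your monitoring stack defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorRun:
    gtid_gap_detected: bool
    capture_retries: int
    archiving_lag_seconds: float
    binlog_expire_logs_seconds: int
    restore_failed: bool = False


def evaluate_alerts(run: AnchorRun, lag_fraction: float = 0.25) -> list[tuple[str, str]]:
    """Return (severity, signal) pairs, most severe first."""
    alerts: list[tuple[str, str]] = []
    if run.restore_failed:
        alerts.append(("critical", "restore.failure"))
    if run.archiving_lag_seconds >= lag_fraction * run.binlog_expire_logs_seconds:
        alerts.append(("page", "archiving_lag"))
    if run.gtid_gap_detected:
        alerts.append(("warning", "gtid_gap_detected"))
    if run.capture_retries > 1:
        alerts.append(("warning", "capture_retries"))
    return alerts


print(evaluate_alerts(AnchorRun(False, 0, 60.0, 2_592_000)))  # []
```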

Frequently Asked Questions

Why capture the GTID coordinate after the backup instead of before?

Because the manifest must describe the exact state the restored snapshot will contain. If you capture gtid_executed before the backup quiesces, the snapshot ends up containing transactions the manifest doesn’t list, and replaying the archived logs re-applies them — producing ERROR 1062 duplicate-key collisions. Capturing strictly after finalization, then flushing on any detected gap, keeps the coordinate an exact prefix of the log stream.

What do I set gtid_purged to when restoring the snapshot?

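Set it to the GTID set the snapshot actually contains, which is the gtid_executed captured at the snapshot boundary and recorded in the manifest. Do not copy the source server’s own gtid_purged: that variable only lists GTIDs already purged from the source’s binary logs and is usually a subset of what the snapshot holds. Apply the value on a fresh instance whose own gtid_executed is empty; doing this on a non-empty server raises ERROR 1840. The setting tells the restored instance “these GTIDs are already inside the snapshot,” so replay begins exactly at the next transaction with no gap and no overlap.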

How small should max_binlog_size be for tight PITR granularity?

Small enough that the window between a snapshot and its first following segment is short, but not so small that segment churn overwhelms the transport pipeline. 256M–1G is a practical range for most OLTP workloads. Granularity for the actual recovery target comes from mysqlbinlog stop conditions, not from segment size, so treat segment size as a transport-and-alignment knob rather than a precision one.
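
To make the point about stop conditions concrete, the sketch below only assembles a mysqlbinlog command line; it does not run it. `replay_argv` is a hypothetical helper and the file names are placeholders, while `--start-position` and `--stop-datetime` are standard mysqlbinlog options.

```python
def replay_argv(segments: list[str], start_position: int, stop_datetime: str) -> list[str]:
    """Build a mysqlbinlog argv that stops at a target time.

    --start-position applies to the first segment listed; --stop-datetime
    ends decoding at the first event at or after the given local time.
    """
    if not segments:
        raise ValueError("at least one binary log segment is required")
    return [
        "mysqlbinlog",
        f"--start-position={start_position}",
        f"--stop-datetime={stop_datetime}",
        *segments,
    ]


argv = replay_argv(["binlog.000042", "binlog.000043"], 157, "2026-07-04 09:30:00")
print(argv)
```

The recovery target is set entirely by the stop condition, so the same two segments can serve any target time that falls inside them.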

Does the WRITESET removal in MySQL 8.4 affect existing manifests?

No. binlog_transaction_dependency_tracking controlled the dependency information the source writes to the binary log, which replicas use to schedule transactions in parallel; it never affected GTID assignment. On 8.4 the WRITESET behavior is built in and the variable is gone. Manifests captured on 8.0 with WRITESET remain valid because the recorded GTID sets and checksums are unaffected. Remove the variable from your 8.4 my.cnf, since an unknown variable in the option file prevents the server from starting.

Related

AWS S3 & GCS Sync Pipelines — durable multipart transport and checksum verification for both the snapshot and its logs.
Compression & Encryption Workflows — zstd tuning and envelope encryption applied downstream of the manifest.
Rotation Scheduling & Cron Automation — lifecycle transitions that keep legacy artifacts until the new anchor is validated.
Timestamp Targeting Strategies — resolving a human timestamp to the precise GTID range to replay after the snapshot.
GTID Tracking & Enforcement — the gap-free coordinate space the anchor depends on.
