GTID Tracking & Enforcement for Binary Log Archiving and PITR Automation

Point-in-Time Recovery (PITR) automation fails silently when binary log continuity is assumed rather than verified. Legacy file-and-position tracking introduces severe ambiguity during topology shifts, partial log purges, or cross-instance restores: the same mysql-bin.000123:4 coordinate means different things on different servers, and nothing in the coordinate itself proves the intervening transactions were ever captured. Global Transaction Identifiers (GTIDs) eliminate that ambiguity by giving every commit an immutable, source-anchored identity that survives log rotation, crash recovery, and instance migration. This guide implements a deterministic pipeline that extracts, validates, and enforces GTID continuity across archived binary logs, so that every recovery request is either precisely satisfied or explicitly rejected before a single event is replayed — replacing heuristic timestamp guessing with rigorous transaction-set arithmetic. The building blocks here assume the event model and lifecycle laid out in MySQL Binary Log Architecture & GTID Fundamentals.

Core Concept & Prerequisites

A GTID is a pair, source_uuid:transaction_id, where source_uuid identifies the originating server and transaction_id is a monotonic counter. MySQL exposes two authoritative sets that together define what is recoverable:

gtid_executed — every transaction the server has committed or applied. Persisted in the mysql.gtid_executed table and mirrored in the global variable.
gtid_purged — the subset of gtid_executed whose binary logs have already been discarded locally. Anything inside gtid_purged can only be recovered from an archive, never from the live server.

The recoverable window is therefore not “everything the server knows about” but the union of what the archive physically holds and what the live server still retains (gtid_executed minus gtid_purged). Any transaction in gtid_purged that the archive does not hold is gone for good. Getting this arithmetic wrong is the single most common cause of a PITR that runs to completion and still loses data.
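
As a minimal sketch of that arithmetic, the snippet below models each set as a plain Python set of transaction numbers for a single source UUID. The variable names and the numbers are illustrative assumptions, not values from a real server.

```python
# Illustrative only: transaction numbers for one source UUID, modelled as
# plain Python sets rather than real GTID set strings.
executed = set(range(1, 101))  # gtid_executed: 1-100
purged = set(range(1, 61))     # gtid_purged:   1-60 (binlogs deleted locally)
archived = set(range(1, 41))   # what the archive actually holds: 1-40

# Still on the live server: committed and not yet purged.
local = executed - purged

# Recoverable: anything held by the archive or still on disk locally.
recoverable = archived | local

# Lost for good: purged locally and never archived.
lost = purged - archived


def summarize(txns: set[int]) -> str:
    """Render a contiguous set of transaction numbers as 'lo-hi'."""
    return f"{min(txns)}-{max(txns)}" if txns else "(empty)"


print("still local:      ", summarize(local))
print("unrecoverable gap:", summarize(lost))
```

In this example the server reports 100 executed transactions, yet 41-60 can never be replayed: they were purged before the archiver copied them.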

Version and environment constraints:

MySQL 8.0 or later with gtid_mode=ON and enforce_gtid_consistency=ON. Both default to OFF in 8.0, so set them explicitly on any archiving-capable topology. SHOW BINARY LOG STATUS is available from MySQL 8.2; on 8.0 use SHOW MASTER STATUS, which is deprecated from 8.2 and removed in 8.4.
Python 3.10+ for the automation layer (structural pattern matching and dataclass(slots=True) require it; the walrus operator is also used below).
mysql-connector-python 8.0+ for connection pooling, and tenacity for retry orchestration — both already standard in this codebase.

Because format determines replay fidelity, this pipeline presumes row-based logging; the trade-offs are covered in ROW vs STATEMENT vs MIXED formats. GTIDs abstract the on-disk format, but they do not encode it — a statement-based event carrying a non-deterministic function will replay differently even though its GTID is contiguous, so format enforcement and GTID enforcement are complementary gates, not substitutes.

The extraction layer must behave as a stateless control-plane component. It must never rely on SHOW BINARY LOG STATUS alone for archival mapping, because that command reflects volatile runtime memory. Instead it queries the global variables, normalizes the output, and diffs it against a version-controlled manifest of archived files. This decoupling is what lets a replica rebuild or a primary promotion happen without corrupting the recovery timeline.
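
The manifest diff can be sketched as follows. ManifestEntry and find_gaps are hypothetical names, and the manifest layout (one record per archived file with the first and last transaction number it holds) is an assumption; the source does not specify a manifest format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest record: one archived binlog and its GTID interval."""
    file: str
    source_uuid: str
    first: int
    last: int


def find_gaps(
    entries: list[ManifestEntry], source_uuid: str, upto: int
) -> list[tuple[int, int]]:
    """Return intervals in 1..upto that no archived file covers for source_uuid."""
    gaps: list[tuple[int, int]] = []
    expected = 1  # next transaction number we still need a file for
    mine = sorted(
        (e for e in entries if e.source_uuid == source_uuid),
        key=lambda e: e.first,
    )
    for entry in mine:
        if entry.first > expected:
            gaps.append((expected, min(entry.first - 1, upto)))
        expected = max(expected, entry.last + 1)
        if expected > upto:
            break
    if expected <= upto:
        gaps.append((expected, upto))
    return gaps


manifest = [
    ManifestEntry("mysql-bin.000001", "uuid-a", 1, 400),
    ManifestEntry("mysql-bin.000002", "uuid-a", 401, 800),
    ManifestEntry("mysql-bin.000004", "uuid-a", 1201, 1500),
]
print(find_gaps(manifest, "uuid-a", 1500))
```

Here the missing mysql-bin.000003 surfaces as the uncovered interval 801-1200, regardless of what the live server currently reports.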

Production-Grade Python Implementation

The module below extracts and normalizes GTID sets with connection pooling, exponential-backoff retries, structured logging, and frozen dataclasses. It targets Python 3.10+ and mysql-connector-python 8.0+.

import logging
import re
from dataclasses import dataclass

import mysql.connector
from mysql.connector import Error, pooling
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("gtid_tracker")


@dataclass(frozen=True, slots=True)
class GTIDRange:
    """One contiguous interval of transactions from a single source UUID."""
    source_uuid: str
    start: int
    end: int

    def contains(self, other: "GTIDRange") -> bool:
        return (
            self.source_uuid == other.source_uuid
            and self.start <= other.start
            and self.end >= other.end
        )


class GTIDExtractor:
    def __init__(self, pool: pooling.MySQLConnectionPool) -> None:
        self.pool = pool

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((Error, ConnectionError)),
    )
    def _get_connection(self) -> pooling.PooledMySQLConnection:
        return self.pool.get_connection()

    def fetch_global_sets(self) -> tuple[str, str]:
        """Return (gtid_executed, gtid_purged) from MySQL 8.0+."""
        conn = self._get_connection()
        try:
            with conn.cursor(dictionary=True) as cur:
                # Both are global-only variables, readable from any session.
                cur.execute(
                    "SELECT @@GLOBAL.gtid_executed AS executed, "
                    "@@GLOBAL.gtid_purged  AS purged"
                )
                if (row := cur.fetchone()) is None:
                    raise RuntimeError("Failed to read GTID global variables.")
                return row.get("executed", ""), row.get("purged", "")
        finally:
            conn.close()

    @staticmethod
    def normalize_gtid_set(gtid_string: str) -> list[GTIDRange]:
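        """Parse a raw GTID set string into structured, contiguous ranges.

        Handles single-transaction GTIDs (uuid:N), ranges (uuid:N-M),
        multiple intervals per UUID (uuid:1-5:8-9), and the newlines the
        server inserts after each comma. Tagged GTIDs are not handled.
        """
        if not gtid_string:
            return []
        uuid_re = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
        interval_re = re.compile(r"(\d+)(?:-(\d+))?")
        ranges: list[GTIDRange] = []
        # One comma-separated member per source: uuid:interval[:interval...]
        for member in filter(None, (m.strip() for m in gtid_string.split(","))):
            uuid, *intervals = member.split(":")
            if not uuid_re.fullmatch(uuid) or not intervals:
                raise ValueError(f"Malformed GTID set member: {member!r}")
            for interval in intervals:
                if (m := interval_re.fullmatch(interval)) is None:
                    raise ValueError(f"Malformed GTID interval: {interval!r}")
                lo = int(m.group(1))
                hi = int(m.group(2)) if m.group(2) else lo
                # UUIDs compare case-insensitively; store them lowercased.
                ranges.append(GTIDRange(uuid.lower(), lo, hi))
        return ranges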

    def compute_recoverable_set(self, executed: str, purged: str) -> list[GTIDRange]:
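        """Ranges that are committed AND still physically available locally.

        In production, delegate exact set arithmetic to the server via
        GTID_SUBTRACT(executed, purged) rather than re-implementing interval
        math client-side; this method computes the same difference for
        offline diffing, trimming partially purged ranges to what survives.
        """
        purged_ranges = self.normalize_gtid_set(purged)
        recoverable: list[GTIDRange] = []
        for r in self.normalize_gtid_set(executed):
            cursor = r.start  # first transaction not yet known to be purged
            same_source = sorted(
                (p for p in purged_ranges if p.source_uuid == r.source_uuid),
                key=lambda p: p.start,
            )
            for p in same_source:
                if p.end < cursor or p.start > r.end:
                    continue  # no overlap with the remainder of r
                if p.start > cursor:
                    recoverable.append(GTIDRange(r.source_uuid, cursor, p.start - 1))
                cursor = p.end + 1
            if cursor <= r.end:
                recoverable.append(GTIDRange(r.source_uuid, cursor, r.end))
        return recoverable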

Delegate the authoritative arithmetic to MySQL itself whenever a live connection is available — SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged) yields the recoverable set without any risk of a client-side interval bug, and GTID_SUBSET(a, b) answers “is every transaction in a already contained in b?” for gap checks. The Python normalization above exists for the offline case: diffing a captured set against an archive manifest when the source server is gone.
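
A minimal sketch of that delegation, assuming a DB-API cursor that returns plain tuples (not dictionary=True) and uses the %s parameter style, as mysql-connector-python does. The helper names missing_from and is_subset are hypothetical.

```python
def missing_from(cursor, target: str, available: str) -> str:
    """Return the GTIDs in target that available lacks ('' means no gap)."""
    cursor.execute("SELECT GTID_SUBTRACT(%s, %s)", (target, available))
    (missing,) = cursor.fetchone()
    # The server may break long sets across lines; flatten for logging.
    return (missing or "").replace("\n", "")


def is_subset(cursor, candidate: str, container: str) -> bool:
    """True when every GTID in candidate is already present in container."""
    cursor.execute("SELECT GTID_SUBSET(%s, %s)", (candidate, container))
    (flag,) = cursor.fetchone()
    return bool(flag)
```

Passing the sets as bound parameters keeps a malformed or hostile set string out of the SQL text; the server then rejects it as a malformed GTID set instead of executing it.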

Configuration Reference

These are the server variables that govern whether GTID tracking is trustworthy for recovery. Set them explicitly in my.cnf; relying on defaults across a fleet is how drift creeps in.

Variable	Type	Default (8.0)	Recommended	PITR impact
`gtid_mode`	enum	`OFF`	`ON`	Off means no GTIDs at all — every recovery falls back to fragile file/position math.
`enforce_gtid_consistency`	enum	`OFF`	`ON`	Rejects statements that cannot be assigned a single safe GTID, preventing non-deterministic replay.
`binlog_format`	enum	`ROW`	`ROW`	Non-`ROW` formats can replay differently than they committed, silently diverging the recovered copy.
`binlog_row_image`	enum	`FULL`	`MINIMAL`	`MINIMAL` shrinks archives (PK + changed columns only) with no loss of replay accuracy.
`sync_binlog`	int	`1`	`1`	`0` can leave a commit durable in InnoDB but missing from the binary log — a permanent recovery hole.
`binlog_expire_logs_seconds`	int	`2592000`	Tuned to archive lag	Server-side purge clock; must never outpace the archiver or `gtid_purged` overtakes the manifest.
`binlog_gtid_simple_recovery`	bool	`ON`	`ON`	Keeps `gtid_purged`/`gtid_executed` reconstruction cheap at startup by scanning only the newest and oldest logs.
`binlog_transaction_dependency_tracking`	enum	`COMMIT_ORDER`	`WRITESET`	Enables parallel apply by row hash; removed in MySQL 8.4, where WRITESET is the built-in behaviour.

One of these interacts dangerously with the archiver. binlog_expire_logs_seconds expires logs by age alone, at server startup and whenever the binary log is flushed, and it can advance gtid_purged past the oldest archived GTID if the archiver is behind; the recoverable window then quietly shrinks below what your manifest claims. That coupling is the subject of binlog retention boundaries, which formalizes the safe-purge intersection this pipeline must respect.
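
One way to make that coupling measurable is to compute how long the oldest un-archived binary log has before it becomes eligible for expiry. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes eligibility is judged from the file's last-modified time plus binlog_expire_logs_seconds.

```python
def purge_headroom_seconds(
    oldest_unarchived_mtime: float,
    expire_logs_seconds: int,
    now: float,
) -> float:
    """Seconds until the oldest un-archived binlog may be expired.

    A zero or negative result means the file is already eligible and the
    next flush or restart could purge it before it is archived.
    """
    return (oldest_unarchived_mtime + expire_logs_seconds) - now


# Example: file last written at t=1000, one-hour expiry, clock now at t=2000.
print(purge_headroom_seconds(1000.0, 3600, 2000.0))
```

All three inputs are epoch seconds (or any consistent clock), so the caller can feed it os.stat(...).st_mtime and time.time() directly.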

Validation & Verification Gates

Before any recovery proceeds, the pipeline runs a fixed sequence of gates. A failure at any gate halts execution and emits a machine-readable verdict — recovery never proceeds “optimistically”.

Checksum verification. Confirm every archived segment inside the requested window matches its recorded SHA-256, and that mysqlbinlog --verify-binlog-checksum accepts the file’s internal event checksums. A checksum mismatch means silent corruption; treat it as a hard stop.
GTID set diffing. Compute GTID_SUBTRACT(target_set, recoverable_set) server-side. A non-empty result names exactly the transactions the archive cannot supply — these are the gaps.
Purge-overlap check. If gtid_purged intersects the requested target and those GTIDs are absent from the archive manifest, the recoverable window has already been severed. Halt.
Dry-run replay. Stream mysqlbinlog --include-gtids=<target> into a throwaway staging schema and parse the output for syntax errors, unsupported DDL, and mid-stream GTID discontinuities before touching production.
Manifest reconciliation. Confirm the normalized executed set maps 1:1 onto physical archive locations, so no “phantom” range is trusted that has no file behind it.
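
The first half of the checksum gate, comparing an archived segment against its recorded SHA-256, might look like the sketch below. The function name and the idea that the expected digest comes from the manifest as a hex string are assumptions.

```python
import hashlib


def sha256_matches(path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its SHA-256 to the recorded one."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large binlogs never sit fully in memory.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

A False result should halt the gate sequence outright; the internal event checksums are still verified separately by mysqlbinlog --verify-binlog-checksum.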

The gate emits a structured JSON verdict that downstream orchestration consumes verbatim:

{
  "status": "PASS",
  "target_gtid": "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-1050",
  "verified_ranges": ["3E11FA47-71CA-11E1-9E33-C80AA9429562:1-1050"],
  "missing_ranges": [],
  "dry_run_safe": true
}
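
A verdict of this shape could be assembled as in the sketch below. The function name is hypothetical, and both the "FAIL" status value and the rule that any missing range or unsafe dry run fails the gate are assumptions; the source only shows a passing example.

```python
import json


def build_verdict(
    target_gtid: str,
    verified_ranges: list[str],
    missing_ranges: list[str],
    dry_run_safe: bool,
) -> str:
    """Serialize gate results using the field names shown in the example."""
    # Assumed rule: pass only with no gaps and a clean dry run.
    status = "PASS" if not missing_ranges and dry_run_safe else "FAIL"
    return json.dumps(
        {
            "status": status,
            "target_gtid": target_gtid,
            "verified_ranges": verified_ranges,
            "missing_ranges": missing_ranges,
            "dry_run_safe": dry_run_safe,
        },
        indent=2,
    )


print(build_verdict(
    "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-1050",
    ["3E11FA47-71CA-11E1-9E33-C80AA9429562:1-1050"],
    [],
    True,
))
```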

The dry-run replay itself is deliberately idempotent. It applies archived logs to a staging host bounded by the verified GTID range, generating no binary log of its own:

import logging
import subprocess

logger = logging.getLogger("pitr_executor")


def execute_pitr_replay(target_gtid: str, archive_path: str) -> bool:
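    """Decode archived binlogs bounded by target_gtid as a dry run.

    This only confirms that mysqlbinlog can decode and checksum-verify the
    file; nothing is applied until stdout is piped to the mysql client.
    mysqlbinlog exits after processing the file (no --stop-never), and
    --disable-log-bin prevents recursive binlog generation on the target.
    """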
    cmd = [
        "mysqlbinlog",
        "--include-gtids", target_gtid,
        "--disable-log-bin",
        "--verify-binlog-checksum",
        archive_path,
    ]
    logger.info("Executing PITR replay: %s", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        logger.error("mysqlbinlog failed: %s", result.stderr)
        return False
    # In production, pipe stdout to the mysql client on the staging host.
    return True

Idempotency at the apply stage is enforced by recording applied ranges in a dedicated pitr_audit table. If a request targets an already-applied range, the pipeline short-circuits to success without re-executing transactions — a re-run of the same recovery is a no-op, not a double-apply.
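
The short-circuit can be illustrated with the sketch below. It uses an in-memory SQLite database purely so the example runs without a MySQL server; the pitr_audit columns are hypothetical, and the exact-string match stands in for the set-containment check that a real pipeline would delegate to the server.

```python
import sqlite3
from typing import Callable


def apply_once(
    db: sqlite3.Connection,
    target_gtid: str,
    apply: Callable[[str], None],
) -> str:
    """Run apply(target_gtid) unless that range is already in pitr_audit."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pitr_audit ("
        "target_gtid TEXT PRIMARY KEY, "
        "applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    seen = db.execute(
        "SELECT 1 FROM pitr_audit WHERE target_gtid = ?", (target_gtid,)
    ).fetchone()
    if seen is not None:
        return "skipped"  # already applied: succeed without replaying
    apply(target_gtid)
    # Record the range only after the apply step returned without raising.
    db.execute("INSERT INTO pitr_audit (target_gtid) VALUES (?)", (target_gtid,))
    db.commit()
    return "applied"


db = sqlite3.connect(":memory:")
replayed: list[str] = []
print(apply_once(db, "uuid-a:1-1050", replayed.append))
print(apply_once(db, "uuid-a:1-1050", replayed.append))
```

The second call returns without invoking the apply callback, which is the behaviour an orchestrator relies on when it retries a recovery.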

Error Handling & Failure Modes

Enforcement surfaces as specific MySQL error numbers, most of them raised by enforce_gtid_consistency=ON at execution time. Mapping each to a root cause and a concrete remediation is what turns a cryptic failure into an automated decision. The classifier below uses structural pattern matching to route each error number:

from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class Remediation:
    cause: str
    action: str
    fatal: bool


def classify_gtid_error(errno: int) -> Remediation:
    match errno:
        case 1786:  # ER_GTID_UNSAFE_CREATE_SELECT
            return Remediation(
                cause="CREATE TABLE ... SELECT cannot get one atomic GTID.",
                action="Split into CREATE TABLE then INSERT ... SELECT.",
                fatal=False,
            )
        case 1787:  # ER_GTID_UNSAFE_..._TEMPORARY_TABLE_IN_TRANSACTION
            return Remediation(
                cause="CREATE/DROP TEMPORARY TABLE inside a transaction.",
                action="Move temp-table DDL outside the transaction / use autocommit.",
                fatal=False,
            )
        case 1785:  # ER_GTID_UNSAFE_NON_TRANSACTIONAL_TABLE
            return Remediation(
                cause="Mixed transactional + non-transactional writes in one statement.",
                action="Isolate MyISAM/CSV writes into their own autocommitted statement.",
                fatal=False,
            )
        case 1840:  # ER_CANT_SET_GTID_PURGED_WHEN_GTID_EXECUTED_IS_NOT_EMPTY
            return Remediation(
                cause="Tried to seed gtid_purged on a non-empty gtid_executed.",
                action="RESET MASTER (RESET BINARY LOGS AND GTIDS on 8.4+), then set gtid_purged.",
                fatal=True,
            )
        case 1772:  # ER_MALFORMED_GTID_SET_SPECIFICATION
            return Remediation(
                cause="Malformed GTID set string passed to a set function.",
                action="Re-normalize the set; check for truncated UUIDs or stray commas.",
                fatal=True,
            )
        case _:
            return Remediation(
                cause=f"Unmapped GTID-related error {errno}.",
                action="Escalate: capture SHOW REPLICA STATUS and halt the pipeline.",
                fatal=True,
            )

The 1786/1787/1785 family is working as intended — those statements would replay non-deterministically, so the fix is to rewrite the offending SQL, never to relax enforce_gtid_consistency. The 1840 and 1772 cases are pipeline-level configuration or data faults and must abort the run. A frequent, subtler failure is a gtid_purged mismatch on a restored instance: the target’s captured gtid_purged disagrees with the GTIDs about to be replayed, so the first event is either rejected or treated as already-applied and skipped. Re-derive gtid_purged on the recovery target from the backup’s captured coordinates before replay. When enforcement blocks writes across multiple write-accepting nodes, the reconciliation procedure is detailed in enforcing GTID consistency in multi-primary clusters.

Observability & Alerting

You cannot enforce continuity you cannot see. Instrument three quantities continuously: the live executed/purged frontier, the archiver’s lag behind that frontier, and the outcome of every validation gate.

-- Any GTID-enabled server: current GTID frontier and the purge low-water mark.
SELECT
    @@GLOBAL.gtid_executed AS executed_set,
    @@GLOBAL.gtid_purged   AS purged_set;

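-- MySQL 8.0+: time spent in binlog-related execution stages. Most stage
-- instruments are disabled by default, so enable them in
-- performance_schema.setup_instruments first. Rising totals hint that the
-- archiver may soon fall behind the purge clock.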
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_stages_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'stage/sql/%binlog%';

-- MySQL 8.0+: physical I/O on the binary-log files themselves.
SELECT FILE_NAME, COUNT_READ, COUNT_WRITE, SUM_NUMBER_OF_BYTES_WRITE
FROM performance_schema.file_summary_by_instance
WHERE EVENT_NAME = 'wait/io/file/sql/binlog'  -- matches any log_bin basename
ORDER BY SUM_NUMBER_OF_BYTES_WRITE DESC;

Emit each gate result as a structured log record with stable field names — event, target_gtid, missing_count, oldest_archived_gtid, purge_headroom_seconds, verdict — so alerting rules query fields rather than scraping message text. Recommended thresholds:
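
A minimal sketch of such a record, using only the standard library. The logger name, the helper names, and the sample event value are assumptions; the field names are the ones listed above.

```python
import json
import logging

logger = logging.getLogger("gtid_gates")


def gate_record(
    *,
    event: str,
    target_gtid: str,
    missing_count: int,
    oldest_archived_gtid: str,
    purge_headroom_seconds: float,
    verdict: str,
) -> str:
    """Serialize one gate outcome as a single JSON line with stable keys."""
    return json.dumps(
        {
            "event": event,
            "target_gtid": target_gtid,
            "missing_count": missing_count,
            "oldest_archived_gtid": oldest_archived_gtid,
            "purge_headroom_seconds": purge_headroom_seconds,
            "verdict": verdict,
        },
        sort_keys=True,
    )


def log_gate(**fields) -> None:
    """Log at INFO on PASS and WARNING otherwise, so levels are filterable."""
    level = logging.INFO if fields["verdict"] == "PASS" else logging.WARNING
    logger.log(level, gate_record(**fields))
```

Keyword-only arguments make a missing or misspelled field fail at the call site rather than producing a record that alerting rules silently fail to match.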

Archiving lag — alert when the newest committed GTID leads the newest archived GTID by more than one full binary-log file, and page when purge_headroom_seconds (time until binlog_expire_logs_seconds would purge the oldest un-archived transaction) drops below one archiver cycle. This is the alarm that fires before silent data loss, not after.
Gate failures — any verdict != PASS is an immediate warning; a purge-overlap failure is a page, because the recoverable window has already been cut.
Enforcement errors — a rising rate of ERROR 1786/1787 signals an application deploying GTID-unsafe DDL and should route to the owning team, not the DBA on call.
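
The first two thresholds can be sketched as a small routing function. The function name, the "page"/"warn"/"ok" labels, and the archiver_cycle_seconds parameter are illustrative assumptions.

```python
def alert_level(
    verdict: str,
    purge_overlap: bool,
    purge_headroom_seconds: float,
    archiver_cycle_seconds: float,
) -> str:
    """Map gate output and purge headroom to an alert severity."""
    # The window is already cut, or could be cut before the next archive run.
    if purge_overlap or purge_headroom_seconds < archiver_cycle_seconds:
        return "page"
    # Any other non-passing verdict is an immediate warning.
    if verdict != "PASS":
        return "warn"
    return "ok"


print(alert_level("PASS", False, 120.0, 300.0))
```

The example pages even though the gate passed, because 120 seconds of headroom is less than one 300-second archiver cycle.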

Access for the automation account stays least-privilege throughout: REPLICATION CLIENT for status, SELECT on mysql.gtid_executed, and PROCESS for thread inspection under load — never SUPER or ALL PRIVILEGES. The full privilege model, at-rest and in-transit encryption, and audit hooks live in security & access frameworks. When a gate hard-stops and no clean recovery path remains, degrade deterministically per fallback routing strategies rather than forcing an unsafe replay.

Frequently Asked Questions

Why does gtid_purged advance past my archive even though archiving is running?

binlog_expire_logs_seconds expires logs by age alone, with no knowledge of your archiver’s progress; the check runs at server startup and whenever the binary log is flushed. If the archiver falls behind (a slow object-storage endpoint, a stalled queue), the server can delete a binary log whose transactions were never copied, and gtid_purged moves forward to cover them. Once that happens the transactions exist in no archive and cannot be recovered. The defence is to keep the expiry window comfortably longer than the worst-case archive lag, gate every manual PURGE BINARY LOGS on verified archival, and alert on purge_headroom_seconds before a log becomes eligible for expiry; the safe-window computation is in binlog retention boundaries.

Should I compute GTID gaps in Python or let MySQL do it?

Let MySQL do it whenever a live connection exists. GTID_SUBTRACT(target, recoverable) and GTID_SUBSET(a, b) implement interval arithmetic that is easy to get subtly wrong client-side (multi-interval UUIDs like uuid:1-5:8-9, off-by-one boundaries). Reserve the Python normalizer for the offline case — diffing a captured set against an archive manifest when the source server is unavailable — and even then treat it as advisory, re-validating against the server before any apply.

Does a contiguous GTID set guarantee a deterministic replay?

No. A gap-free GTID set proves no transaction is missing; it says nothing about whether each transaction replays identically. A statement-format event containing NOW(), UUID(), or a user-variable dependency can produce a different result on replay despite a perfectly contiguous GTID. That is why this pipeline enforces binlog_format=ROW alongside GTID continuity — the two gates cover different failure modes. See ROW vs STATEMENT vs MIXED formats.

How do I make a recovery re-run safe to execute twice?

Bound the replay by an explicit GTID range (--include-gtids) and record every applied range in a pitr_audit table. Before applying, check whether the target range is already a GTID_SUBSET of what the target has executed; if so, short-circuit to success. Because GTIDs make already-applied transactions no-ops on a GTID-enabled target, a correctly bounded re-run replays nothing rather than double-applying — the property that lets orchestration retry a failed recovery without fear.

Related

MySQL Binary Log Architecture & GTID Fundamentals — the event model, gtid_executed/gtid_purged lifecycle, and durability settings this pipeline builds on.
ROW vs STATEMENT vs MIXED formats — why row-based logging is the complementary gate to GTID continuity.
Binlog retention boundaries — computing safe purge windows so gtid_purged never overtakes the archive.
Enforcing GTID consistency in multi-primary clusters — merging and validating GTID sets across concurrent write nodes.
Security & access frameworks — least-privilege grants, encryption, and audit hooks for the archiving account.
