[PATCH] Fix PITR pause bypass when initial XLOG_RUNNING_XACTS has subxid overflow
Hi folks,
We observed a case where our backup tooling was periodically failing
for a specific workload: nested subtransactions overflowing the subxid
cache. We don't have visibility into the specific customer workload (i.e.
whether it uses SAVEPOINT or EXCEPTION handling), but the TAP test
reproduces the problem.
The problem detail and proposed fix are described below. Happy to discuss
further.
Problem: When the first XLOG_RUNNING_XACTS record seen during recovery has
subxid_overflow=true, the standby enters STANDBY_SNAPSHOT_PENDING and
hot standby never activates (LocalHotStandbyActive stays false).
This caused recovery_target_action = 'pause' to be silently bypassed:
recoveryPausesHere() returns immediately when hot standby is not yet
active, so the pause is skipped and the server promotes instead.
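To make the bypass concrete, here is a toy model in Python, not the actual C code. The function name and the string values are made up for illustration; the only behavior taken from the description above is that a requested pause is skipped when hot standby is not active.

```python
def resolve_target_action(requested_action: str, hot_standby_active: bool) -> str:
    """Toy model of what effectively happens once the recovery target is reached.

    Hypothetical helper for illustration only; it does not exist in PostgreSQL.
    """
    if requested_action == "pause" and not hot_standby_active:
        # Nobody can connect to resume recovery, so the pause returns
        # immediately and recovery ends as if promotion had been requested.
        return "promote"
    return requested_action


# Snapshot still PENDING: hot standby inactive, so the pause is silently lost.
bypassed = resolve_target_action("pause", hot_standby_active=False)
# Snapshot READY: hot standby active, so the pause is honored.
honored = resolve_target_action("pause", hot_standby_active=True)
```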
Fix: in PerformWalRecovery(), when the recovery target is reached and
the snapshot is still PENDING, force a transition to STANDBY_SNAPSHOT_READY
and call CheckRecoveryConsistency() to activate hot standby before the
target action switch is evaluated.
As I understand it, this is safe because subtransaction
commits write to CLOG but produce no WAL entry, so standbys
always see overflowed subxids as INPROGRESS rather than SUB_COMMITTED.
INPROGRESS subxids are invisible without any SubTrans
lookup, so the missing SubTrans entries that STANDBY_SNAPSHOT_PENDING
guards against cannot cause incorrect visibility results.
Add a TAP test (052_pitr_subxid_overflow.pl) that exercises the scenario:
the overflow transaction is kept open during the base backup's forced
checkpoint so that the very first XLOG_RUNNING_XACTS the standby replays
has subxid_overflow=true. A named restore point is then created while
the overflow transaction is still open. Without the fix the standby
promotes silently at the target; with the fix it pauses and accepts
hot-standby queries.
Note: subtransaction XIDs are only assigned when the subtransaction writes,
so gen_subxids() must perform an INSERT at each recursion level to force
the PGPROC subxid cache to overflow.
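As a rough illustration of why the INSERT is needed, the following Python toy model tracks a backend's subxid cache. The class and method names are hypothetical; the cache size of 64 matches PGPROC_MAX_CACHED_SUBXIDS, and everything else is simplified.

```python
PGPROC_MAX_CACHED_SUBXIDS = 64  # size of the per-backend subxid cache


class ToyBackend:
    """Simplified stand-in for one backend's PGPROC subxid bookkeeping."""

    def __init__(self, first_xid: int = 1000) -> None:
        self.next_xid = first_xid
        self.cached_subxids: list[int] = []
        self.overflowed = False

    def run_subtransaction(self, writes: bool):
        """Run one nested subtransaction; an XID is assigned only if it writes."""
        if not writes:
            return None  # read-only subtransaction: no XID, cache untouched
        xid = self.next_xid
        self.next_xid += 1
        if len(self.cached_subxids) < PGPROC_MAX_CACHED_SUBXIDS:
            self.cached_subxids.append(xid)
        else:
            self.overflowed = True  # the 65th writing subtransaction overflows
        return xid


read_only = ToyBackend()
for _ in range(200):
    read_only.run_subtransaction(writes=False)

writer = ToyBackend()
for _ in range(PGPROC_MAX_CACHED_SUBXIDS + 1):
    writer.run_subtransaction(writes=True)
```

In this model, 200 read-only levels never overflow the cache, while 65 writing levels do, which is the state the test needs at the time of the checkpoint.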
I would consider this for backporting to supported releases.
Attachments:
0001-Fix-PITR-pause-bypass-when-initial-XLOG_RUNNING_XACT.patch (application/octet-stream, +170 -1)
On Thu, Feb 26, 2026 at 12:57:24PM +0000, Matt Blewitt wrote:
> Problem: When the first XLOG_RUNNING_XACTS record seen during recovery has
> subxid_overflow=true, the standby enters STANDBY_SNAPSHOT_PENDING and
> hot standby never activates (LocalHotStandbyActive stays false).
Yes, this is a historical behavior that has existed for as long as hot
standby has. We cannot connect yet because we don't have a stable state
that live connections could rely on.
> This caused recovery_target_action = 'pause' to be silently bypassed:
> recoveryPausesHere() returns immediately when hot standby is not yet
> active, so the pause is skipped and the server promotes instead.
>
> Fix: in PerformWalRecovery(), when the recovery target is reached and
> the snapshot is still PENDING, force a transition to STANDBY_SNAPSHOT_READY
> and call CheckRecoveryConsistency() to activate hot standby before the
> target action switch is evaluated.
>
> As I understand it, this is safe because subtransaction
> commits write to CLOG but produce no WAL entry, so standbys
> always see overflowed subxids as INPROGRESS rather than SUB_COMMITTED.
This is an interesting argument. To be honest, while it is true that
subtransaction commits do not cause WAL records and flushes (as far as
I recall), I am not completely sure yet if it is always OK to rely on
that and open the server for connections earlier than we logically
can. PENDING has the rather old historical assumption that we should
never open connections yet, because we don't have a standby state
initialized yet. That makes the introduction of such shortcuts very
tricky to think about. The TAP test helps in showing what you are
looking for, thanks for that.
> I would consider this for backporting to supported releases.
Not sure that I would agree with this position. This is also a
slight change of behavior regarding the end of recovery due to the
interaction with the recovery target being reached. It is not an area of
the code we should underestimate.
--
Michael
Hi Michael,
Re-reviewing the patch, I think it is not OK to rely on subtransaction
commits producing no WAL records or flushes.
I'll update the test cases. Based on the current cases, I think the flow is
something like:
1. lastOverflowedXid is set → standby snapshots marked suboverflowed
2. XidInMVCCSnapshot(S65 (subtrans 65), snapshot) → calls SubTransGetTopmostTransaction(S65)
3. SubTrans entry for S65 is zeroed → walk stops immediately, returns S65 itself
4. S65 not in snapshot->subxip (never entered KnownAssignedXids) → returns false
5. Wrong answer: S65 IS in-progress but snapshot says it isn't
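The steps above can be mimicked with a small Python toy model. This is a sketch under simplifying assumptions, not the real code: the function names only echo the C ones, the XID values are invented (1000 as the top-level XID, 1065 standing in for S65), SubTrans is a plain dict where a missing key plays the role of a zeroed entry, and the xmin/xmax checks are left out.

```python
INVALID_XID = 0  # stands in for a zeroed SubTrans entry (no parent recorded)


def subtrans_get_topmost(xid: int, subtrans: dict[int, int]) -> int:
    """Walk parent links until an entry with no recorded parent is found."""
    while True:
        parent = subtrans.get(xid, INVALID_XID)
        if parent == INVALID_XID:
            return xid  # walk stops; xid is treated as its own topmost
        xid = parent


def xid_in_snapshot(xid: int, known_xids: set[int], suboverflowed: bool,
                    subtrans: dict[int, int]) -> bool:
    """Toy version of the standby branch of the snapshot membership check."""
    if suboverflowed:
        xid = subtrans_get_topmost(xid, subtrans)
    return xid in known_xids


TOP_XID = 1000
S65 = 1065
known_xids = {TOP_XID}  # overflowed subxids never entered the known set

complete_subtrans = {sub: TOP_XID for sub in range(1001, 1066)}
zeroed_subtrans: dict[int, int] = {}

with_parents = xid_in_snapshot(S65, known_xids, True, complete_subtrans)
without_parents = xid_in_snapshot(S65, known_xids, True, zeroed_subtrans)
```

With parent links present, S65 resolves to the top-level XID and is reported as in progress. With the entries zeroed, the same call reports it as not in progress, which is the wrong answer from step 5.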
Given that, PENDING is correct and the proposed shortcut is unsafe. It took
me a while to work through the logic.
Given this, I think the change really needs to be done in our backup
tooling to handle this scenario. Rather than being precise about the PITR
target, we need to allow for these overflowed subxids and wait until we can
assert there are no visibility issues, treating the target as a lower bound
rather than an exact point in time.
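One way to picture the "lower bound" idea is the Python sketch below. It is purely hypothetical and not how our tooling is implemented: it assumes the tooling can see the XLOG_RUNNING_XACTS records replayed so far as (lsn, subxid_overflow) pairs, with LSNs as plain integers, and it only considers a non-overflowed record as the way out of PENDING.

```python
from typing import Iterable, Optional, Tuple


def effective_pause_lsn(target_lsn: int,
                        running_xacts: Iterable[Tuple[int, bool]]) -> Optional[int]:
    """Return the earliest LSN at or after the target where a pause can hold.

    running_xacts holds (lsn, subxid_overflow) pairs in WAL order.
    Hypothetical helper: the input shape and the rule are assumptions.
    """
    for lsn, subxid_overflow in running_xacts:
        if not subxid_overflow:
            # Snapshot can become usable here; never stop before the target.
            return max(target_lsn, lsn)
    return None  # no usable snapshot seen yet; keep replaying


# Usable snapshot arrives before the target: the target stays exact.
exact = effective_pause_lsn(500, [(100, True), (300, False)])
# Usable snapshot arrives only after the target: the target becomes a lower bound.
pushed_out = effective_pause_lsn(500, [(100, True), (400, True), (700, False)])
```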
Thanks for taking a look!
Matt
On Fri, Mar 6, 2026 at 5:05 AM Michael Paquier <michael@paquier.xyz> wrote: