[PATCH] Fix PITR pause bypass when initial XLOG_RUNNING_XACTS has subxid overflow
Hi folks,
We observed a case where our backup tooling was periodically failing
for a specific workload: nested subtransactions overflowing the subxid
cache. We don't have visibility into the specific customer workload (i.e.
whether it uses SAVEPOINT or EXCEPTION handling), but the attached TAP
test reproduces the problem. The problem and the proposed fix are
described below. Happy to discuss further.
Problem: When the first XLOG_RUNNING_XACTS record seen during recovery has
subxid_overflow=true, the standby enters STANDBY_SNAPSHOT_PENDING and
hot standby never activates (LocalHotStandbyActive stays false).
This caused recovery_target_action = 'pause' to be silently bypassed:
recoveryPausesHere() returns immediately when hot standby is not yet
active, so the pause is skipped and the server promotes instead.
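To make the bypass concrete, here is a minimal C sketch of the control flow described above. It is a toy model and not PostgreSQL source: the `Toy*` types and `toy_*` functions are hypothetical names invented for illustration, and only the early return on inactive hot standby is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the configured action and the observed result. */
typedef enum { TOY_ACTION_PAUSE, TOY_ACTION_PROMOTE, TOY_ACTION_SHUTDOWN } ToyTargetAction;
typedef enum { TOY_OUTCOME_PAUSED, TOY_OUTCOME_PROMOTED, TOY_OUTCOME_SHUTDOWN } ToyOutcome;

/* Models recoveryPausesHere(): reports whether a pause actually happened. */
static bool
toy_recovery_pauses_here(bool hot_standby_active)
{
    /* Without hot standby nobody could connect, so the pause is skipped. */
    if (!hot_standby_active)
        return false;
    return true;
}

/* Models the target-action switch evaluated once the target is reached. */
static ToyOutcome
toy_target_reached(ToyTargetAction action, bool hot_standby_active)
{
    switch (action)
    {
        case TOY_ACTION_SHUTDOWN:
            return TOY_OUTCOME_SHUTDOWN;
        case TOY_ACTION_PAUSE:
            if (toy_recovery_pauses_here(hot_standby_active))
                return TOY_OUTCOME_PAUSED;
            /* Pause was skipped, so recovery simply ends as if promoting. */
            return TOY_OUTCOME_PROMOTED;
        case TOY_ACTION_PROMOTE:
            return TOY_OUTCOME_PROMOTED;
    }
    assert(false);              /* all enum values are handled above */
    return TOY_OUTCOME_PROMOTED;
}
```

Calling `toy_target_reached(TOY_ACTION_PAUSE, false)` yields `TOY_OUTCOME_PROMOTED`, which is the silent bypass; with `true` it yields `TOY_OUTCOME_PAUSED`. The sketch stops at the moment the pause is entered and does not model what happens after a resume.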
Fix: in PerformWalRecovery(), when the recovery target is reached and
the snapshot is still PENDING, force a transition to STANDBY_SNAPSHOT_READY
and call CheckRecoveryConsistency() to activate hot standby before the
target action switch is evaluated.
As I understand it, this is safe because subtransaction
commits write to CLOG but produce no WAL entry, so standbys
always see overflowed subxids as INPROGRESS rather than SUB_COMMITTED.
INPROGRESS subxids are invisible without any SubTrans
lookup, so the missing SubTrans entries that STANDBY_SNAPSHOT_PENDING
guards against cannot cause incorrect visibility results.
Add a TAP test (052_pitr_subxid_overflow.pl) that exercises the scenario:
the overflow transaction is kept open during the base backup's forced
checkpoint so that the very first XLOG_RUNNING_XACTS the standby replays
has subxid_overflow=true. A named restore point is then created while
the overflow transaction is still open. Without the fix the standby
promotes silently at the target; with the fix it pauses and accepts
hot-standby queries.
Note: subtransaction XIDs are only assigned when the subtransaction writes,
so gen_subxids() must perform an INSERT at each recursion level to force
the PGPROC subxid cache to overflow.
I would consider this for backporting to supported releases.
Attachments:
0001-Fix-PITR-pause-bypass-when-initial-XLOG_RUNNING_XACT.patch (application/octet-stream, +170/-1)
On Thu, Feb 26, 2026 at 12:57:24PM +0000, Matt Blewitt wrote:
Problem: When the first XLOG_RUNNING_XACTS record seen during recovery has
subxid_overflow=true, the standby enters STANDBY_SNAPSHOT_PENDING and
hot standby never activates (LocalHotStandbyActive stays false).
Yes, this is a historical limitation that has existed for as long as
hot standby has. We cannot connect yet because we don't have a stable
state that live connections could rely on.
This caused recovery_target_action = 'pause' to be silently bypassed:
recoveryPausesHere() returns immediately when hot standby is not yet
active, so the pause is skipped and the server promotes instead.
Fix: in PerformWalRecovery(), when the recovery target is reached and
the snapshot is still PENDING, force a transition to STANDBY_SNAPSHOT_READY
and call CheckRecoveryConsistency() to activate hot standby before the
target action switch is evaluated.
As I understand it, this is safe because subtransaction
commits write to CLOG but produce no WAL entry, so standbys
always see overflowed subxids as INPROGRESS rather than SUB_COMMITTED.
This is an interesting argument. To be honest, while it is true that
subtransaction commits do not cause WAL records and flushes (as far as
I recall), I am not completely sure yet if it is always OK to rely on
that and open the server for connections earlier than we logically
can. PENDING has the rather old historical assumption that we should
never open connections yet, because we don't have a standby state
initialized yet. That makes the introduction of such shortcuts very
tricky to think about. The TAP test helps in showing what you are
looking for, thanks for that.
I would consider this for backporting to supported releases.
Not sure that I would agree with this position. This is also a
slight change of behavior at the end of recovery, due to the
interaction with reaching the recovery target. It is not an area of
the code we should underestimate.
--
Michael
Hi Michael,
Re-reviewing the patch, I think it is not OK to rely on the absence of WAL
records and flushes for subtransaction commits. I'll update the test cases.
Based on the current cases, I think the flow is something like:
1. lastOverflowedXid is set → standby snapshots marked suboverflowed
2. XidInMVCCSnapshot(S65 (subtrans 65), snapshot) → calls SubTransGetTopmostTransaction(S65)
3. SubTrans entry for S65 is zeroed → walk stops immediately, returns S65 itself
4. S65 not in snapshot->subxip (never entered KnownAssignedXids) → returns false
5. Wrong answer: S65 IS in-progress but snapshot says it isn't
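The five steps above can be traced with a small C model. This is only a sketch: `ToySnapshot`, `ToySubTrans` and the `toy_*` functions are hypothetical simplifications, the xmin/xmax range checks are left out, and the parent table is a plain array in which 0 stands for a zeroed SubTrans entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0       /* a zeroed SubTrans entry: no parent recorded */

/* Hypothetical recovery snapshot: every known running xid sits in subxip. */
typedef struct ToySnapshot
{
    const ToyXid *subxip;
    size_t        subxcnt;
    bool          suboverflowed;
} ToySnapshot;

/* Hypothetical SubTrans: parent[xid] is the parent xid, or 0 if unknown. */
typedef struct ToySubTrans
{
    const ToyXid *parent;
    size_t        nentries;
} ToySubTrans;

/* Walk parent links until an entry with no recorded parent is reached. */
static ToyXid
toy_subtrans_topmost(const ToySubTrans *st, ToyXid xid)
{
    while (xid < st->nentries && st->parent[xid] != TOY_INVALID_XID)
        xid = st->parent[xid];
    return xid;
}

/* Simplified XidInMVCCSnapshot() for a snapshot taken during recovery. */
static bool
toy_xid_in_snapshot(const ToySnapshot *snap, const ToySubTrans *st, ToyXid xid)
{
    assert(snap != NULL && st != NULL);
    if (snap->suboverflowed)
        xid = toy_subtrans_topmost(st, xid);    /* steps 2 and 3 */
    for (size_t i = 0; i < snap->subxcnt; i++)
        if (snap->subxip[i] == xid)
            return true;
    return false;                               /* step 4 */
}
```

With top-level xid 100 in `subxip` and an all-zero parent table, looking up subxid 165 returns false even though its transaction is still running (step 5). Once `parent[165] = 100` is recorded, the same lookup returns true.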
So PENDING is correct and the patch is unsafe; it took me a while to work
through the logic.
Given this, I think the change really needs to be made in our backup
tooling to handle this scenario: rather than being precise about the PITR
target, we need to allow for these overflowed subxids and wait until we can
assert there are no visibility issues, treating the target as a lower bound
rather than an exact point in time.
Thanks for taking a look!
Matt
On Fri, Mar 6, 2026 at 5:05 AM Michael Paquier <michael@paquier.xyz> wrote:
Hi Hackers,
This is a follow-up to Matt Blewitt's report and patch from February [1], which identified the following bug: when the first XLOG_RUNNING_XACTS record a standby replays has subxid_overflow set, standbyState gets stuck at STANDBY_SNAPSHOT_PENDING, and hot standby is never activated. As a consequence, recovery_target_action = 'pause' is silently ignored: recoveryPausesHere() returns immediately because !LocalHotStandbyActive, the PAUSE case falls through, and the server promotes instead of pausing.
We'd like to propose an alternative fix for the same problem and describe why we believe serving read-only queries in this state is safe, and why we deliberately do not advance standbyState to STANDBY_SNAPSHOT_READY as the earlier patch did.
Patches
=======
0001 - Behavior-preserving refactor: pull the connection-enabling block
out of CheckRecoveryConsistency() into a small helper.
0002 - The fix: call EnableHotStandbyConnections() from the
RECOVERY_TARGET_ACTION_PAUSE path, just before recoveryPausesHere(), and add a TAP test.
Why we believe enabling reads is correct
=======
The reason a standby normally refuses queries from an overflowed snapshot is the risk of an incorrect visibility decision for a subtransaction whose top-level transaction is still running on the primary.
When the initial RUNNING_XACTS is overflowed, KnownAssignedXids may be missing some subxids. For such a recovery snapshot, XidInMVCCSnapshot() first maps the xid to its topmost parent via SubTransGetTopmostTransaction() and then looks it up in the in-progress set. If that mapping is not available in pg_subtrans and the xid is not present in KnownAssignedXids, the xid is treated as "not in the snapshot", and the final committed/aborted decision is delegated to TransactionIdDidCommit() in HeapTupleSatisfiesMVCC().
During active WAL replay, this is the dangerous case: a subxid S of a still-running top transaction T may have its row present on disk; if S cannot be resolved as in-progress and T's commit record is later replayed, CLOG flips T to committed, and a query could suddenly see a row that was not committed as of its snapshot's xmax. That is exactly why connections are withheld until a non-overflowed snapshot (STANDBY_SNAPSHOT_READY) gives complete knowledge.
At an end-of-recovery pause, this hazard disappears because replay is frozen. The only ways out of recoveryPausesHere(true) are promotion and shutdown. A pg_wal_replay_resume() at the end of recovery falls through to promotion rather than resuming replay.
Therefore, no commit record for any in-progress transaction will ever be replayed, so CLOG cannot transition T (or its subxids) to committed after this point. TransactionIdDidCommit() for such an xid stays false forever. So, the MVCC visibility fallback keeps the row invisible.
In short, the set of transactions a query can observe as committed is now stable, and is exactly the set that committed before replay stopped. The single condition that makes overflowed-snapshot reads unsafe during live replay (a transaction that was in progress as of a snapshot later being observed as committed) cannot arise once replay halts. So the pending snapshot, while still overflowed, yields correct and stable visibility.
Why we keep standbyState at STANDBY_SNAPSHOT_PENDING
=======
The earlier patch forced standbyState to STANDBY_SNAPSHOT_READY at the pause point and re-ran CheckRecoveryConsistency(). We chose not to do that.
STANDBY_SNAPSHOT_READY means we have full knowledge of the transactions that were running on the primary and that snapshots are complete and need not be treated as overflowed. That is not true here since the snapshot is still overflowed (visibility stays correct for the frozen-replay reason above, not because the snapshot has become complete). Forcing the state to READY would assert something false about the recovery state.
Original report and first patch by Matt Blewitt. Thanks also for the analysis in that thread.
Thoughts welcome.
[1]: /messages/by-id/CACy-Nv24ZORVN9_S_yHF5Nsip45HKCBtKVNC3XdKgz+1wvGvEQ@mail.gmail.com
Best Regards
Jan Nidzwetzki
On behalf of PlanetScale
Attachments:
0001-Refactor-extract-EnableHotStandbyConnections-helper.patch (application/octet-stream, +33/-14)
0002-Honor-recovery_target_action-pause-on-inconsistent-s.patch (application/octet-stream, +167/-1)
Hello
This is safe because replay is frozen at this
point: the only ways out of the pause are promotion and shutdown, so no
transaction's commit status can change afterwards, and any transaction a
query finds committed in CLOG necessarily committed before that query's
snapshot.
But if I look at the documentation, after shutdown it allows a restart
with a later recovery target:
The intended use of the pause setting is to allow queries to be executed
against the database to check if this recovery target is the most desirable
point for recovery. The paused state can be resumed by using pg_wal_replay_resume()
(see Table 9.81), which then causes recovery to end. If this recovery target is
not the desired stopping point, then shut down the server, change the recovery
target settings to a later target and restart to continue recovery.
"so no transaction's commit status can change after this point" is
true within the lifetime of the paused instance, but what if I shut down
and restart the server with a later recovery target?
Even a read-only query can mark a tuple with HEAP_XMIN_INVALID if
HeapTupleSatisfiesMVCC decides that a transaction aborted or crashed.
And then in bufmgr.c:MarkSharedBufferDirtyHint, we can see the
following conditions that prevent this change from being flushed with
an early return:
    if (XLogHintBitIsNeeded() && (lockstate & BM_PERMANENT))
    {
        /*
         * If we must not write WAL, due to a relfilelocator-specific
         * condition or being in recovery, don't dirty the page. We can
         * set the hint, just not dirty the page as a result so the hint
         * is lost when we evict the page or shutdown.
         *
         * See src/backend/storage/page/README for longer discussion.
         */
        if (RecoveryInProgress() ||
            RelFileLocatorSkippingWAL(BufTagGetRelFileLocator(&bufHdr->tag)))
            return;
        ...
Where
#define XLogHintBitIsNeeded() (wal_log_hints || DataChecksumsNeedWrite())
So if we turn off both wal_log_hints and data checksums, that return
disappears, and we can cause data corruption with just a select in a
paused state with the patch.
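The condition quoted above can be reduced to a small decision function. This is a sketch, not the bufmgr.c code: `ToyHintEnv` and the `toy_*` functions are hypothetical, and everything in the real function after the quoted early return is collapsed into "the page gets dirtied".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattening of the settings and state the excerpt consults. */
typedef struct ToyHintEnv
{
    bool wal_log_hints;
    bool data_checksums;
    bool recovery_in_progress;
    bool rel_skipping_wal;
    bool buffer_permanent;
} ToyHintEnv;

/* Mirrors the XLogHintBitIsNeeded() macro quoted above. */
static bool
toy_xlog_hint_bit_is_needed(const ToyHintEnv *e)
{
    return e->wal_log_hints || e->data_checksums;
}

/* Returns true if setting a hint bit would also mark the page dirty. */
static bool
toy_hint_dirties_page(const ToyHintEnv *e)
{
    assert(e != NULL);
    if (toy_xlog_hint_bit_is_needed(e) && e->buffer_permanent)
    {
        /* The early return: hint is set in memory but the page stays clean. */
        if (e->recovery_in_progress || e->rel_skipping_wal)
            return false;
    }
    return true;
}
```

In recovery with data checksums enabled the function returns false, so the hint is lost on eviction. In recovery with both wal_log_hints and checksums off it returns true, meaning a hint bit set by a plain SELECT can reach disk.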
See the attached tap test that showcases the problem.
Attachments:
Hello Zsolt,
Thank you very much for pointing out the problem and the TAP test to reproduce it. I missed that PostgreSQL can change data in recovery mode when the database is not using checksums and the server is running without 'wal_log_hints'. Rather than trying to make that path safe, I think the conservative fix is to log a message and shut down when an incomplete snapshot is present at the end of recovery with 'recovery_target_action = pause'.
The attached patch does that: when hot standby is not active at the recovery target (e.g., due to an incomplete snapshot), PostgreSQL will log a message and shut down instead of promoting silently. It mirrors how 'pause' is already downgraded to 'shutdown' when hot_standby is off. This lets the user choose a different recovery target or action. The patch also updates the documentation to clarify the behavior and adds a TAP test to verify the change.
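The downgrade described above amounts to one extra check before the target action is carried out. The following C sketch shows that rule only; `ToyTargetAction` and `toy_effective_action()` are hypothetical names, and the actual patch is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_ACTION_PAUSE, TOY_ACTION_PROMOTE, TOY_ACTION_SHUTDOWN } ToyTargetAction;

/*
 * Pick the action actually taken at the recovery target. A pause needs
 * hot standby to be active; otherwise it is downgraded to a shutdown
 * (where a message would also be logged) instead of promoting silently.
 */
static ToyTargetAction
toy_effective_action(ToyTargetAction configured, bool hot_standby_active)
{
    assert(configured == TOY_ACTION_PAUSE ||
           configured == TOY_ACTION_PROMOTE ||
           configured == TOY_ACTION_SHUTDOWN);
    if (configured == TOY_ACTION_PAUSE && !hot_standby_active)
        return TOY_ACTION_SHUTDOWN;
    return configured;
}
```

`toy_effective_action(TOY_ACTION_PAUSE, false)` returns `TOY_ACTION_SHUTDOWN`, so the cluster is left untouched and the user can restart with a different target or action. The other combinations pass through unchanged.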
Best regards
Jan