[PATCH] Fix WAIT FOR LSN standby_write/standby_flush for archive recovery cases

Started by SATYANARAYANA NARLAPURAM, 2 days ago, 3 messages, hackers
#1SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com

Hi Alexander, Hackers,

GetCurrentLSNForWaitType() for WAIT_LSN_TYPE_STANDBY_WRITE and
WAIT_LSN_TYPE_STANDBY_FLUSH previously relied on the WAL receiver's
tracked write/flush positions (GetWalRcvWriteRecPtr/GetWalRcvFlushRecPtr).
There are two scenarios where WAIT FOR LSN queries can stall even though
replay is making progress. I describe them separately to make the setups
clear, but the underlying problem is the same:

(1). When the standby is disconnected from the primary and switches to
reading WAL from the archive, it stays in that mode until no more archived
WAL is available to replay, and only then switches to streaming mode. Until
then, WAIT FOR LSN calls get stuck on the standby even though replay has
advanced beyond the stale WAL receiver position. Switching the XLog source
from archive to streaming is tracked separately in [1].

(2). In archive recovery, no WAL receiver process exists, so these
functions return InvalidXLogRecPtr (0/0). WAIT FOR LSN with standby_flush or
standby_write mode would always time out, even for WAL that has already been
fully replayed.

Fix by falling back to the replay LSN (GetXLogReplayRecPtr) when the WAL
receiver position is invalid or behind replay. This is correct because any
WAL that has been replayed has necessarily already been written and flushed
to disk. A TAP test that reproduces the problem is attached.
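
To make the fallback rule concrete, here is a minimal standalone sketch of the
comparison. It is not the code from the patch: the helper name
effective_standby_lsn is hypothetical, and XLogRecPtr and InvalidXLogRecPtr are
redefined locally as stand-ins so the snippet compiles outside the PostgreSQL
tree. In the server, the two inputs would be the values returned by
GetWalRcvWriteRecPtr/GetWalRcvFlushRecPtr and GetXLogReplayRecPtr.

```c
#include <stdint.h>

/* Local stand-ins for the server's definitions (assumption for this sketch). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: pick the LSN that a standby_write/standby_flush
 * wait should be compared against.
 *
 * walrcv_lsn is the WAL receiver's write or flush position; it is
 * InvalidXLogRecPtr when no WAL receiver exists (scenario 2) and may be
 * stale while WAL is being read from the archive (scenario 1).
 * replay_lsn is the current replay position.
 */
XLogRecPtr
effective_standby_lsn(XLogRecPtr walrcv_lsn, XLogRecPtr replay_lsn)
{
    /* Replayed WAL is already on disk, so replay is a safe lower bound. */
    if (walrcv_lsn == InvalidXLogRecPtr || walrcv_lsn < replay_lsn)
        return replay_lsn;

    /* The receiver is ahead of replay: its position is authoritative. */
    return walrcv_lsn;
}
```

With this rule, a waiter whose target LSN has already been replayed is released
in both scenarios, while a streaming standby whose receiver is ahead of replay
behaves as before.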

[1]: /messages/by-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com

Thanks,
Satya

Attachments:

0001-Fix-WAIT-FOR-LSN-standby_write-standby_flush-for-arc.patch (application/octet-stream, +27 -4)
0001-Add-TAP-test-for-WAIT-FOR-LSN-during-archive-recover.patch (application/octet-stream, +169 -1)
#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#1)
Re: [PATCH] Fix WAIT FOR LSN standby_write/standby_flush for archive recovery cases

Hi, Satyanarayana!

On Wed, Apr 15, 2026 at 9:44 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

GetCurrentLSNForWaitType() for WAIT_LSN_TYPE_STANDBY_WRITE and
WAIT_LSN_TYPE_STANDBY_FLUSH previously relied on the WAL receiver's
tracked write/flush positions (GetWalRcvWriteRecPtr/GetWalRcvFlushRecPtr).
There are two scenarios where WAIT FOR LSN queries can be stalled though
replay is making progress. Breaking it down to two to give clarity on setups but
the underlying problem is the same.

There are two scenarios here:

(1). When the standby is disconnected from the primary and switched to WAL
archive mode, it continues to be in that mode until no more WAL is available to replay
and then switch to streaming mode. Until then WAIT FOR LSN calls get stuck on the
standby though replay catches up beyond the stale WAL receiver position. Switching
XLog source from archive to streaming is separately tracked in [1].

(2). In the case of Archive recovery, no WAL receiver process exists, so these
functions return InvalidXLogRecPtr (0/0). WAIT FOR LSN with standby_flush or
standby_write modes would always time out, even for WAL that has been
fully replayed.

Fix by falling back to the replay LSN (GetXLogReplayRecPtr) when the WAL
receiver position is invalid or behind replay. This is correct because any
WAL that has been replayed has necessarily already been written and flushed
to disk. Attached the repro test case.

Please check: a similar patch has already been posted here.

/messages/by-id/CABPTF7Ukk8iJF7TpnK2mFOaboNJgWL1csfXu4e3J4GT0o7x0GQ@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

#3SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Alexander Korotkov (#2)
Re: [PATCH] Fix WAIT FOR LSN standby_write/standby_flush for archive recovery cases

Hi,

On Thu, Apr 16, 2026 at 12:31 AM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi, Satyanarayana!

On Wed, Apr 15, 2026 at 9:44 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

GetCurrentLSNForWaitType() for WAIT_LSN_TYPE_STANDBY_WRITE and
WAIT_LSN_TYPE_STANDBY_FLUSH previously relied on the WAL receiver's
tracked write/flush positions (GetWalRcvWriteRecPtr/GetWalRcvFlushRecPtr).
There are two scenarios where WAIT FOR LSN queries can be stalled though
replay is making progress. Breaking it down to two to give clarity on setups but
the underlying problem is the same.

There are two scenarios here:

(1). When the standby is disconnected from the primary and switched to WAL
archive mode, it continues to be in that mode until no more WAL is available to replay
and then switch to streaming mode. Until then WAIT FOR LSN calls get stuck on the
standby though replay catches up beyond the stale WAL receiver position. Switching
XLog source from archive to streaming is separately tracked in [1].

(2). In the case of Archive recovery, no WAL receiver process exists, so these
functions return InvalidXLogRecPtr (0/0). WAIT FOR LSN with standby_flush or
standby_write modes would always time out, even for WAL that has been
fully replayed.

Fix by falling back to the replay LSN (GetXLogReplayRecPtr) when the WAL
receiver position is invalid or behind replay. This is correct because any
WAL that has been replayed has necessarily already been written and flushed
to disk. Attached the repro test case.

Please, check, similar patch is already posted here.

/messages/by-id/CABPTF7Ukk8iJF7TpnK2mFOaboNJgWL1csfXu4e3J4GT0o7x0GQ@mail.gmail.com

Thanks, I will review it.