Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
Hi hackers,
During a performance run [1], I observed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
suggests replacing the current check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by the walsender:
1) Problem and Background
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/
Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when waits are long. I built a
POC patch that swaps polling for CVs, but a single global CV (or even
separate “flush” and “replay” CVs) isn’t ideal:
• The wake-up routines don’t know which LSN each waiter cares about,
so they would need to broadcast on every flush/replay.
• Caching the minimum outstanding target LSN could reduce spurious
wake-ups but won’t eliminate them when multiple backends wait for
different LSNs simultaneously.
• The walsender accepts some broadcast overhead via two CVs for
different waiters. A more precise approach would require a request
queue that maps waiters to target LSNs and issues targeted
wake-ups—adding complexity.
2) Proposal
I came across the thread “Implement waiting for WAL LSN replay:
reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
thread provides a well-established infrastructure for waiting on WAL
replay in backends. With modest adjustments, it could be generalized.
Main changes in patch “v1 Improve read_local_xlog_page_guts by replacing
polling with latch-based waiting”:
• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
“WAIT FOR” work.
• Replace the busy-wait in read_local_xlog_page_guts() with
WaitForLSNFlush and WaitForLSNReplay.
• Add wake-up calls in XLogFlush and XLogBackgroundFlush.
Edge Case: Timeline Switch During Wait
/*
* Check which timeline to get the record from.
*
* We have to do it each time through the loop because if we're in
* recovery as a cascading standby, the current timeline might've
* become historical. We can't rely on RecoveryInProgress() because in
* a standby configuration like
*
* A => B => C
*
* if we're a logical decoding session on C, and B gets promoted, our
* timeline will change while we remain in recovery.
*
* We can't just keep reading from the old timeline as the last WAL
* archive in the timeline will get renamed to .partial by
* StartupXLOG().
*/
read_local_xlog_page_guts() re-evaluates the active timeline on each
loop iteration because, on a cascading standby, the current timeline
can become historical. Once that happens, there’s no need to keep
waiting for that timeline. A timeline switch could therefore render an
in-progress wait unnecessary.
One option is to add a wake-up at the point where the timeline switch
occurs, so waiting processes exit promptly. The current approach
chooses not to do this, given that most waits are short and timeline
changes in cascading standby are rare. Supporting timeline-switch
wake-ups would also require additional handling in both
WaitForLSNFlush and WaitForLSNReplay, increasing complexity.
Comments and suggestions are welcome.
[1]: /messages/by-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2]: /messages/by-id/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Best,
Xuneng
Attachments:
v8-0001-Implement-WAIT-FOR-command.patch (+1457 -16)
v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+126 -9)
Hi,
I attached the wrong patch
(v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch); the
correct one is attached here.
On Wed, Aug 27, 2025 at 11:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Attachments:
v2-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+126 -9)
Hi,
Some changes in v3:
1) Updated the header comment of xlogwait.c to reflect its extension to
flush waiting and its internal use for both flush and replay waiting.
2) Updated the comment above logical_read_xlog_page, which described the
pre-patch behavior of read_local_xlog_page.
Best,
Xuneng
Attachments:
v8-0001-Implement-WAIT-FOR-command.patch (+1457 -16)
v3-0000-cover-letter.patch (+0 -3)
v3-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+142 -24)
Hi,
On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
In an off-list discussion, Alexander pointed out potential issues with
the current single-heap design for replay and flush when promotion
occurs concurrently with WAIT FOR. The following is a simple example
illustrating the problem:
During promotion, there's a window where we can have mixed waiter
types in the same heap:
T1: Process A calls read_local_xlog_page_guts on standby
T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
T3: Promotion begins
T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
T5: SharedRecoveryState = RECOVERY_STATE_DONE
T6: Process B calls read_local_xlog_page_guts
T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter
The problem is that replay LSNs and flush LSNs represent different
positions in the WAL stream. Having both types in the same heap can
lead to:
- Incorrect wakeup logic (comparing incomparable LSNs)
- Processes waiting forever
- Wrong waiters being woken up
To avoid this problem, patch v4 uses two separate heaps for flush and
replay, as Alexander suggested earlier. It also introduces a separate
minimum-LSN tracking field for flush waiters.
Best,
Xuneng
Attachments:
v4-0000-cover-letter.patch
v11-0001-Implement-WAIT-FOR-command.patch (+1435 -15)
v4-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+342 -116)
Hi,
On Sun, Sep 28, 2025 at 9:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
v5-0002 separates the waitlsn_cmp() comparator into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.
Best,
Xuneng
Attachments:
v5-0000-cover-letter.patch
v11-0001-Implement-WAIT-FOR-command.patch (+1435 -15)
v5-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+347 -118)
On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote:
v5-0002 separates the waitlsn_cmp() comparator function into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.
The primary goal you want to achieve here is a replacement of the
wait/sleep logic of read_local_xlog_page_guts() with a condition
variable, and a new facility to make the callback more responsive
than polling. That's a fine idea in itself. However, I would suggest
implementing something that does not depend entirely on WAIT FOR,
which is how your patch is presented. Instead of having your patch
depend on an entirely different feature, it seems to me that you
should try to extract from that feature the basics you are looking
for, and make them shared between the WAIT FOR patch and what you are
trying to achieve here. You should not need something as complex as
what the other feature needs for a page read callback in the backend.
In the end, I suspect that you will reuse a small portion of it (or
perhaps nothing at all, actually, but I did not look at its full
scope). You should try to present your patch in a reviewable state,
so that others are able to read and understand it. WAIT FOR is much
more complex than what you want to do here because it covers larger
areas of the code base and needs to worry about more cases. So, you
should implement things so that the basic pieces you want to build on
top of are simpler, not more complicated. Simpler means easier to
review and easier to catch problems: designed around how you want to
fix your problem, not around how a completely different feature deals
with its own problems.
--
Michael
Hi Michael,
Thanks for your review.
On Fri, Oct 3, 2025 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote:
The core infrastructure shared by this patch and the WAIT FOR command
patch is primarily in xlogwait.c, which provides the mechanism for
waiting until a given LSN is reached. The other parts of the WAIT FOR
patch (the SQL command implementation, documentation, and tests) are
not relevant to the current patch. What we need is only the
infrastructure in xlogwait.c, on which we can implement the
optimization for read_local_xlog_page_guts.
Regarding complexity: the initial optimization idea was to introduce
condition-variable based waiting, as Heikki suggested in his comment:
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/
I reviewed the relevant code in WalSndWakeup and WalSndWait. While
these mechanisms reduce polling overhead, they don't prevent false
wakeups. Addressing that would likely require a request queue that
maps waiters to their target LSNs and issues targeted wakeups, a much
more complex design. Given that read_local_xlog_page_guts is not as
performance-sensitive as its walsender equivalent, this added
complexity may not be justified. So the initial version of the
optimization followed the WalSndWakeup/WalSndWait pattern.
After this, I came across the WAIT FOR patch in the mailing list and
noticed that the infrastructure in xlogwait.c aligns well with our
needs. Based on that, I built the current patch using this shared
infra.
If we prioritise simplicity and can tolerate occasional false wakeups,
then waiting in read_local_xlog_page_guts can be implemented in a much
simpler way than in the current version. At the same time, the WAIT
FOR command seems to be a valuable feature in its own right, and both
patches can naturally share the same infrastructure. We could also
extract the infrastructure first, implement the current patch on it,
and let WAIT FOR use it as well. Personally, I don't have a strong
preference between the two approaches.
Best,
Xuneng
Hi,
On Fri, Oct 3, 2025 at 9:50 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Another potential user of this infrastructure could be
WalSndWaitForWal(XLogRecPtr loc); I'm planning to hack a version for
it as well.
Best,
Xuneng
Hi,
On Sat, Oct 4, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended
infra in xlogwait, so it can be reused in different threads.
Best,
Xuneng
Attachments:
v6-0001-Replace-inefficient-polling-loops-in-read_local_x.patch (+781 -16)
On Sat, Oct 04, 2025 at 03:21:07PM +0800, Xuneng Zhou wrote:
v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended
infra in xlogwait, so it can be reused in different threads.
Yes, I think that you should split your patch where you think that it
can make review easier, because your change touches a very sensitive
area of the code base:
- First patch to introduce what you consider the basic
infrastructure required for your patch, which can be shared between
multiple pieces. I doubt that you really need everything that's in
waitlsn.c to achieve what you want here.
- Second patch to introduce your actual feature, making the callback
more responsive.
- Then, potentially a third patch, adding the pieces of
infrastructure to waitlsn.c that you did not need in the first patch
but that are still required for the waitlsn.c thread. The waitlsn
patch could then optionally be rebased to use patches 1 and 3.
I'd even try to consider the problem from the angle of looking for
independent pieces that could be extracted from the first patch and
split as other patches, to ease the review even further. There is a
limit to this idea because you need a push/pull/reporting facility for
a flush LSN and a replay LSN depending on whether you are on a
primary, on a standby, or even in another case where you are dealing
with a promoted standby and decide to loop back *inside* the callback
(which I suspect may not always be the right thing to do depending on
the new TLI selected), so there is a limit to what can be treated as
an independent piece. At least the bits about pairingheap_initialize()
may be worth considering.
+ if (waitLSNState &&
+     (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+      pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+     WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
This code pattern looks like a copy-paste of what's done in
synchronous replication. Has some consolidation between syncrep.c and
this kind of facility ever been considered? In terms of queues, waits
and wakeups, the requirements are pretty similar, yet your patch has
zero changes related to syncrep.c or syncrep.h.
As far as I can see based on your patch, you are repeating some of the
mistakes of the wait LSN patch, where I've complained about
WaitForLSNReplay() and the duplication it had. One thing you have
decided to pull for example is duplicated calls to
GetXLogReplayRecPtr().
--
Michael
Hi Michael,
Thanks again for your insightful review.
On Mon, Oct 6, 2025 at 10:43 AM Michael Paquier <michael@paquier.xyz> wrote:
+1 for further split to smooth the review process. The timeout in the
WAIT FOR patch is not needed for the polling problem in the current thread.
I'll remove other unused mechanisms as well.
Yeah, just looping back *inside* the callback could be problematic if
the waited-for LSNs don't exist on the current timeline. I'll add a
check for it.
The new patch set will look like this per your suggestion:
Patch 0: pairingheap infrastructure (independent)
src/backend/lib/pairingheap.c | +14 -4
src/include/lib/pairingheap.h | +3
Adds pairingheap_initialize() for shared memory usage.
Patch 1: Minimal LSN waiting infrastructure
src/backend/access/transam/xlogwait.c | (simplified, no timeout…)
src/include/access/xlogwait.h | +80
src/backend/storage/ipc/ipci.c | +3
src/include/storage/lwlocklist.h | +1
src/backend/utils/activity/wait_event... | +3
src/backend/access/transam/xact.c | +6
src/backend/storage/lmgr/proc.c | +6
Provides WaitForLSNReplay() and WaitForLSNFlush() for internal WAL consumers.
Patch 2: Replace polling in read_local_xlog_page_guts
src/backend/access/transam/xlogutils.c | +40 -5
src/backend/access/transam/xlog.c | +10
src/backend/access/transam/xlogrecovery.c | +6
src/backend/replication/walsender.c | -4
Uses Patch 1 infrastructure to eliminate busy-waiting.
Patch 3: Extend LSN waiting infrastructure with what WAIT FOR needs
Patch 4: WAIT FOR command, based on Patches 1 and 3
SQL interface, full error handling.
+    if (waitLSNState &&
+        (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+         pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+        WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);

This code pattern looks like a copy-paste of what's done in
synchronous replication. Has some consolidation between syncrep.c and
this kind of facility ever been considered? In terms of queues, waits
and wakeups, the requirements are pretty similar, still your patch has
zero changes related to syncrep.c or syncrep.h.
I wasn't aware of this before; they do share some basic requirements here.
I'll explore the possibility of consolidating them.
As far as I can see based on your patch, you are repeating some of the
mistakes of the wait LSN patch, where I've complained about
WaitForLSNReplay() and the duplication it had. One thing you have
decided to pull for example is duplicated calls to
GetXLogReplayRecPtr().
Will refactor this.
Best,
Xuneng
Hi,
The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.
Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.
By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.
At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each work well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.
Best,
Xuneng
Attachments:
v7-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (+647 -6)
v7-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (+19 -3)
v7-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+87 -13)
Hi,
On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each works well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.
v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.
Best,
Xuneng
Attachments:
v8-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (+617 -6)
v8-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (+19 -3)
v8-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+87 -13)
Hi,
On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each works well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.
Made minor changes to #include of xlogwait.h in patch2 to calm CF-bots down.
Best,
Xuneng
Attachments:
v9-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (+87 -13)
v9-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (+19 -3)
v9-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (+625 -6)
Hi,
On Wed, Oct 15, 2025 at 4:43 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each works well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.

Made minor changes to #include of xlogwait.h in patch2 to calm CF-bots down.
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.
Best,
Xuneng
Attachments:
v10-0001-Improve-read_local_xlog_page_guts-by-replacing-p.patch (+59 -12)
On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.
That's indeed much simpler. I'll check later what you have here.
--
Michael
Hi Michael,
On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

That's indeed much simpler. I'll check later what you have here.
--
Michael
Thanks again for your earlier suggestion on splitting the patches to
make the review process smoother.
Although this version is simpler in terms of the amount of code, the
review effort still feels non-trivial. During my own self-review, a
few points stood out as areas that merit careful consideration:
1) Reliance on the new wait-for-LSN infrastructure
The stability and correctness of this patch now depend heavily on the
newly added wait-for-LSN infrastructure, which has not yet been
battle-tested. This puts the patch in a bit of a dilemma: we want the
infrastructure to be as reliable as possible, but it could be hard to
fully validate its robustness without using it in real scenarios, even
after careful review.
2) Wake-up behavior
Are the waiting processes waking up at the correct points and under
the right conditions? Ensuring proper wake-ups is essential for both
correctness and performance.
3) Edge cases
Are edge cases—such as a promotion occurring while a process is
waiting in standby—handled correctly and without introducing races or
inconsistent states?
--
Best,
Xuneng
Hi,
On Wed, Nov 19, 2025 at 11:44 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Michael,
On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:
Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

That's indeed much simpler. I'll check later what you have here.
--
Michael

Thanks again for your earlier suggestion on splitting the patches to
make the review process smoother.

Although this version is simpler in terms of the amount of code, the
review effort still feels non-trivial. During my own self-review, a
few points stood out as areas that merit careful consideration:

1) Reliance on the new wait-for-LSN infrastructure
The stability and correctness of this patch now depend heavily on the
newly added wait-for-LSN infrastructure, which has not yet been
battle-tested. This puts the patch in a bit of a dilemma: we want the
infrastructure to be as reliable as possible, but it could be hard to
fully validate its robustness without using it in real scenarios, even
after careful review.
Here are some of my incomplete interpretations of the behaviors:
2) Wake-up behavior
Are the waiting processes waking up at the correct points and under
the right conditions? Ensuring proper wake-ups is essential for both
correctness and performance.
Primary (Flush Wait):
The patch adds WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush)
in XLogFlush and XLogBackgroundFlush right after
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests(true, !RecoveryInProgress());
where walsenders get notified.
Standby (Replay Wait):
-- The "End of Recovery" Wake-up
Location: xlog.c (inside StartupXLOG, around line 6266)
/*
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
This call happens immediately after the server transitions from
"Recovery" to "Production" mode (RECOVERY_STATE_DONE).
-- The "Continuous Replay" Wake-up
Location: xlogrecovery.c (inside the main redo loop, around line 1850)
/*
* If we replayed an LSN that someone was waiting for then walk
* over the shared memory array and set latches to notify the
* waiters.
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
It handles the continuous stream of updates during normal standby operation.
3) Edge cases
Are edge cases—such as a promotion occurring while a process is
waiting in standby—handled correctly and without introducing races or
inconsistent states?
Consider the following sequence which I traced through the logic:
1. Pre-Promotion: A backend (e.g., a logical decoding session) calls
read_local_xlog_page_guts for a future LSN. RecoveryInProgress()
returns true, so it enters WaitForLSN(WAIT_LSN_TYPE_REPLAY, ...).
2. The Event: pg_promote() is issued. The Startup process finishes
recovery and broadcasts a wake-up to all waiters.
3. Detection: WaitForLSN returns WAIT_LSN_RESULT_NOT_IN_RECOVERY. The
code explicitly handles this case:
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
/* Promoted while waiting... loop back */
break;
4. The Transition: The loop restarts.
-- RecoveryInProgress() is checked again and now returns false.
-- The logic automatically switches branches to
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...).
5. This transition relaxes the constraint from "wait for replay"
(required for consistency on standby) to "wait for flush" (required
for durability on primary).
6. Timeline Divergence:
XLogReadDetermineTimeline is called at the top of the loop.
-- Scenario A: Waiting for Historical Data (Pre-Promotion)
If we were waiting for LSN 0/5000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will see that 0/5000
belongs to TLI 1 (now historical).
Result: state->currTLI (1) != currTLI (2).
Action: The loop breaks immediately (via the else block), skipping
the wait. Since the data is historical, it is immutable and assumed to
be on disk.
-- Scenario B: Waiting for Future Data (Post-Promotion)
If we were waiting for LSN 0/7000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will identify that 0/7000
belongs to the new TLI 2.
Result: state->currTLI (2) == currTLI (2).
Action: The loop continues, and we enter
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...) to wait for the new primary to
generate this data.
-- Scenario C: Waiting exactly at the Switch Point
If we were waiting for the exact LSN where the timeline switched.
Action: XLogReadDetermineTimeline handles the boundary calculation
(tliSwitchPoint), ensuring we read from the correct segment file
(e.g., switching from 00000001... to 00000002...).
--
Best,
Xuneng