Incorrect logic in XLogNeedsFlush()

melanieplageman@gmail.com

8 months ago

Hi,

If you call XLogNeedsFlush() immediately after calling XLogFlush() in
FlushBuffer(), it can return true.

With this diff:

diff --git a/src/backend/storage/buffer/bufmgr.c
b/src/backend/storage/buffer/bufmgr.c
index 350cc0402aa..91c3fe99d6e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4342,7 +4342,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object,
     * skip the flush if the buffer isn't permanent.
     */
    if (buf_state & BM_PERMANENT)
+   {
        XLogFlush(recptr);
+       if(XLogNeedsFlush(recptr))
+           elog(ERROR, "should be flushed and isn't");
+   }

recovery/015_promotion_pages errors out.

During an end-of-recovery checkpoint, XLogNeedsFlush() should compare
the provided LSN to the flush pointer and not the min recovery point.

This only errors out during an end-of-recovery checkpoint after a
crash when the ControlFile->minRecoveryPoint has not been correctly
updated. In this case, LocalMinRecoveryPoint is initialized to an
incorrect value potentially causing XLogNeedsFlush() to incorrectly
return true.

This trivial diff makes the test pass:

@@ -3115,7 +3125,7 @@ XLogNeedsFlush(XLogRecPtr record)
     * instead. So "needs flush" is taken to mean whether minRecoveryPoint
     * would need to be updated.
     */
-   if (RecoveryInProgress())
+   if (RecoveryInProgress() && !XLogInsertAllowed())
    {

However, since all of this code is new to me, there are some remaining
things I don't understand:

Why is it okay for other processes than the startup process to
initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint
during crash recovery?

All of the comments and the code seem to indicate that this is allowed.

Couldn't it just as likely be invalid for them (e.g. checkpointer
here) as it could for the startup process? Like if we crashed before
updating it?

Besides this question, I find the use of the global variable
InRecovery as a proxy for RecoveryInProgress() + "I'm the startup
process" to be confusing. It seems like UpdateMinRecoveryPoint() and
XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make
it explicit.

- Melanie

dilipbalaut@gmail.com

8 months ago

In reply to: Melanie Plageman (#1)

Re: Incorrect logic in XLogNeedsFlush()

On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Hi,

If you call XLogNeedsFlush() immediately after calling XLogFlush() in
FlushBuffer(), it can return true.

With this diff:
diff --git a/src/backend/storage/buffer/bufmgr.c
b/src/backend/storage/buffer/bufmgr.c
index 350cc0402aa..91c3fe99d6e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4342,7 +4342,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object,
* skip the flush if the buffer isn't permanent.
*/
if (buf_state & BM_PERMANENT)
+   {
XLogFlush(recptr);
+       if(XLogNeedsFlush(recptr))
+           elog(ERROR, "should be flushed and isn't");
+   }
recovery/015_promotion_pages errors out.

During an end-of-recovery checkpoint, XLogNeedsFlush() should compare
the provided LSN to the flush pointer and not the min recovery point.

That right, "end-of-recovery checkpoint" actually performs a flush
instead of updating the minRecoveryPoint so we should compare it with
flush pointer.

This only errors out during an end-of-recovery checkpoint after a
crash when the ControlFile->minRecoveryPoint has not been correctly
updated. In this case, LocalMinRecoveryPoint is initialized to an
incorrect value potentially causing XLogNeedsFlush() to incorrectly
return true.

This trivial diff makes the test pass:
@@ -3115,7 +3125,7 @@ XLogNeedsFlush(XLogRecPtr record)
* instead. So "needs flush" is taken to mean whether minRecoveryPoint
* would need to be updated.
*/
-   if (RecoveryInProgress())
+   if (RecoveryInProgress() && !XLogInsertAllowed())
{
However, since all of this code is new to me, there are some remaining
things I don't understand:

Why is it okay for other processes than the startup process to
initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint
during crash recovery?

During crash recovery we don't yet know the minRecoveryPoint until we
replay all the WALs and once it is done it will be updated in the
ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord().
After this point it will be valid to use by any process including the
StartupProcess. So I don't think until ControlFile->minRecoveryPoint
is initialized it is safe to use for any process. In fact, before any
connections are allowed this has to be set either by
CreateEndOfRecoveryRecord() if recovering a primary from crash or by
ReachedEndOfBackup() if you are starting a standby from backup.

Besides this question, I find the use of the global variable
InRecovery as a proxy for RecoveryInProgress() + "I'm the startup
process" to be confusing. It seems like UpdateMinRecoveryPoint() and
XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make
it explicit.

IIUC InRecovery is slightly different than "RecoveryInProgress() +
"I'm the startup", in StartupXlog, we first set InRecovery to false
and then update the minRecoveryPoint in CreateEndOfRecoveryRecord()
and after that we clear RecoveryInProgress (i.e. set
XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;).

--
Regards,
Dilip Kumar
Google

melanieplageman@gmail.com

8 months ago

In reply to: Dilip Kumar (#2)

Re: Incorrect logic in XLogNeedsFlush()

Thanks for taking time to reply!

On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Why is it okay for other processes than the startup process to
initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint
during crash recovery?

During crash recovery we don't yet know the minRecoveryPoint until we
replay all the WALs and once it is done it will be updated in the
ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord().
After this point it will be valid to use by any process including the
StartupProcess. So I don't think until ControlFile->minRecoveryPoint
is initialized it is safe to use for any process. In fact, before any
connections are allowed this has to be set either by
CreateEndOfRecoveryRecord() if recovering a primary from crash or by
ReachedEndOfBackup() if you are starting a standby from backup.

So, looking at the code, it seems like most places we examine
ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery".
Is this something we want to do in XLogNeedsFlush() to avoid reading
from ControlFile->minRecoveryPoint()?

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

/* Quick exit if already known to be updated or cannot be updated */
if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint)
return false;

Besides this question, I find the use of the global variable
InRecovery as a proxy for RecoveryInProgress() + "I'm the startup
process" to be confusing. It seems like UpdateMinRecoveryPoint() and
XLogNeedsFlush() could just test MyBackendType == B_STARTUP and make
it explicit.

IIUC InRecovery is slightly different than "RecoveryInProgress() +
"I'm the startup", in StartupXlog, we first set InRecovery to false
and then update the minRecoveryPoint in CreateEndOfRecoveryRecord()
and after that we clear RecoveryInProgress (i.e. set
XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;).

Hmm. So, RecoveryInProgress() can be true in the startup process and
InRecovery is false. I see.

- Melanie

pgsql@j-davis.com

8 months ago

In reply to: Melanie Plageman (#3)

Re: Incorrect logic in XLogNeedsFlush()

On Tue, 2025-09-09 at 16:29 -0400, Melanie Plageman wrote:

On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Why is it okay for other processes than the startup process to
initialize LocalMinRecoveryPoint from ControlFile-

minRecoveryPoint

during crash recovery?

During crash recovery we don't yet know the minRecoveryPoint until
we
replay all the WALs and once it is done it will be updated in the
ControlFile at PerformRecoveryXLogAction()-

CreateEndOfRecoveryRecord().

...

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

I am confused about whether we are discussing "incorrect" or "invalid".

Regards,
Jeff Davis

Show quoted text

dilipbalaut@gmail.com

8 months ago

In reply to: Melanie Plageman (#3)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 1:59 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Thanks for taking time to reply!

On Wed, Sep 3, 2025 at 11:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Aug 30, 2025 at 4:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Why is it okay for other processes than the startup process to
initialize LocalMinRecoveryPoint from ControlFile->minRecoveryPoint
during crash recovery?

During crash recovery we don't yet know the minRecoveryPoint until we
replay all the WALs and once it is done it will be updated in the
ControlFile at PerformRecoveryXLogAction()->CreateEndOfRecoveryRecord().
After this point it will be valid to use by any process including the
StartupProcess. So I don't think until ControlFile->minRecoveryPoint
is initialized it is safe to use for any process. In fact, before any
connections are allowed this has to be set either by
CreateEndOfRecoveryRecord() if recovering a primary from crash or by
ReachedEndOfBackup() if you are starting a standby from backup.

So, looking at the code, it seems like most places we examine
ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery".
Is this something we want to do in XLogNeedsFlush() to avoid reading
from ControlFile->minRecoveryPoint()?

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

/* Quick exit if already known to be updated or cannot be updated */
if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint)
return false;

That's not quite right. Before the end-of-recovery checkpoint, the
InRecovery flag is already set to false. This means that even if
LocalMinRecoveryPoint is invalid, it won't matter, and
updateMinRecoveryPoint will not be set to false. Since
LocalMinRecoveryPoint is 0, the condition if (record <=
LocalMinRecoveryPoint) will also fail, causing the process to continue
and read from the ControlFile.

--
Regards,
Dilip Kumar
Google

michael@paquier.xyz

8 months ago

In reply to: Dilip Kumar (#5)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 12:48:21PM +0530, Dilip Kumar wrote:

So, looking at the code, it seems like most places we examine
ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery".
Is this something we want to do in XLogNeedsFlush() to avoid reading
from ControlFile->minRecoveryPoint()?

How would that help when the checkpointer does XLogNeedsFlush() calls,
which is where you are reporting the disturbance is during the
end-of-recovery checkpoint? InArchiveRecovery is only true in the
startup process, set when we are doing archive recovery. This can be
switched from the beginning of recovery or later, once all the local
WAL has been replayed.

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

/* Quick exit if already known to be updated or cannot be updated */
if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint)
return false;

That's not quite right. Before the end-of-recovery checkpoint, the
InRecovery flag is already set to false. This means that even if
LocalMinRecoveryPoint is invalid, it won't matter, and
updateMinRecoveryPoint will not be set to false. Since
LocalMinRecoveryPoint is 0, the condition if (record <=
LocalMinRecoveryPoint) will also fail, causing the process to continue
and read from the ControlFile.

Similarly, InRecovery can only be true in the startup process.

In the case of 015_promotion_pages, I can see that when the check you
are adding fails, the LSN is already flushed by XLogFLush(). So,
while I don't see any active bug here, I tend to agree with your
position that XLogNeedsFlush() could be improved in the context of the
end-of-recovery checkpoint or just when a process is allowed WAL
inserts while we are still in recovery.

You have a good point here, especially knowing that for
CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the
checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed().
So, relying on RecoveryInProgress() in this context leads to a wrong
result with XLogNeedsFlush(): WAL inserts are allowed, the minimum
recovery point is not relevant for the evaluation.

Or are you worried about the case of GetVictimBuffer() where a second
call of XLogNeedsFlush() lives?
--
Michael

melanieplageman@gmail.com

8 months ago

In reply to: Dilip Kumar (#5)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 3:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 10, 2025 at 1:59 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

/* Quick exit if already known to be updated or cannot be updated */
if (record <= LocalMinRecoveryPoint || !updateMinRecoveryPoint)
return false;

That's not quite right. Before the end-of-recovery checkpoint, the
InRecovery flag is already set to false. This means that even if
LocalMinRecoveryPoint is invalid, it won't matter, and
updateMinRecoveryPoint will not be set to false. Since
LocalMinRecoveryPoint is 0, the condition if (record <=
LocalMinRecoveryPoint) will also fail, causing the process to continue
and read from the ControlFile.

Ah, right, I got turned around. My original investigation showed me
that the checkpointer incorrectly read from the ControlFile when I
added the XLogNeedsFlush() precisely because InRecovery is false
outside of the startup process.

What I want is for it to be safe and accurate to call XLogNeedsFlush()
in any backend (one that might flush WAL, that is).

- Melanie

melanieplageman@gmail.com

8 months ago

In reply to: Jeff Davis (#4)

Re: Incorrect logic in XLogNeedsFlush()

On Tue, Sep 9, 2025 at 9:50 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2025-09-09 at 16:29 -0400, Melanie Plageman wrote:

Though, it seems like LocalMinRecoveryPoint must be getting
incorrectly set elsewhere, otherwise this would have guarded us from
examining the control file:

I am confused about whether we are discussing "incorrect" or "invalid".

I meant incorrect as in when it was not supposed to be -- not InvalidXLogRecPtr.

But, I've realized I misspoke -- because InRecovery is only true in
the startup process, so

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)

is false for checkpointer.

- Melanie

[1]: /messages/by-id/CAAKRu_Yurj0sabXZB5EVqO3fKgwN1vELe1H4e0s1zXV3MKbZjQ@mail.gmail.com

melanieplageman@gmail.com

8 months ago

In reply to: Michael Paquier (#6)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 3:58 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 10, 2025 at 12:48:21PM +0530, Dilip Kumar wrote:

So, looking at the code, it seems like most places we examine
ControlFile->minRecoveryPoint, we condition it on "InArchiveRecovery".
Is this something we want to do in XLogNeedsFlush() to avoid reading
from ControlFile->minRecoveryPoint()?

How would that help when the checkpointer does XLogNeedsFlush() calls,
which is where you are reporting the disturbance is during the
end-of-recovery checkpoint? InArchiveRecovery is only true in the
startup process, set when we are doing archive recovery. This can be
switched from the beginning of recovery or later, once all the local
WAL has been replayed.

Ah right, I made the same mistake with the InRecovery global variable
initially too. This is my first time reading this code. I thought that
there needed to be something guarding processes other than the startup
process from reading the minimum recovery point from the ControlFile
during crash recovery until it is valid. But perhaps not -- more on
that below.

In the case of 015_promotion_pages, I can see that when the check you
are adding fails, the LSN is already flushed by XLogFLush(). So,
while I don't see any active bug here, I tend to agree with your
position that XLogNeedsFlush() could be improved in the context of the
end-of-recovery checkpoint or just when a process is allowed WAL
inserts while we are still in recovery.

You have a good point here, especially knowing that for
CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the
checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed().
So, relying on RecoveryInProgress() in this context leads to a wrong
result with XLogNeedsFlush(): WAL inserts are allowed, the minimum
recovery point is not relevant for the evaluation.

So, would you consider the defining characteristic of whether or not
we should use the flush pointer instead of min recovery point in
XLogNeedsFlush() to be whether or not WAL inserts are allowed?

e.g.

@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
     * instead. So "needs flush" is taken to mean whether minRecoveryPoint
     * would need to be updated.
     */
-   if (RecoveryInProgress())
+   if (RecoveryInProgress() && !XLogInsertAllowed())

If I look at the difference between XLogFlush() and XLogNeedsFlush(),
it checks if XLogInsertAllowed() and, if not, does not try to flush
WAL -- which makes sense. There shouldn't be WAL to flush if we aren't
allowed to insert any.

In XLogFlush() -> UpdateMinRecoveryPoint() (called after checking if
WAL inserts are allowed), it has similar logic to XLogNeedsFlush():

if (!updateMinRecoveryPoint || (!force && lsn <= LocalMinRecoveryPoint))
return;
if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

The order of these two if statements is reversed in XLogNeedsFlush()

So, what I don't understand is: why it would be okay for processes
other than the startup process to read the minimum recovery point from
the control file during crash recovery before it is valid.

It seems in crash recovery, a backend other than the startup process
will fail this test:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)

and then potentially read from the control file

LocalMinRecoveryPoint = ControlFile->minRecoveryPoint;
LocalMinRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

There is a comment right after that says

/*
* Check minRecoveryPoint for any other process than the startup
* process doing crash recovery, which should not update the control
* file value if crash recovery is still running.
*/

Which seems to say that it is okay for other processes than the
startup process to read the minimum recovery point from the Control
File during crash recovery. But I don't quite understand why.

Or are you worried about the case of GetVictimBuffer() where a second
call of XLogNeedsFlush() lives?

I want it to be fixed because of another patch I am writing to do
batched writes with smgrwritev() [1]/messages/by-id/CAAKRu_Yurj0sabXZB5EVqO3fKgwN1vELe1H4e0s1zXV3MKbZjQ@mail.gmail.com. In this case, I've split up the
code in FlushBuffer() and flush the WAL to the required LSN in a
different place in code than where I call smgrwritev(). So, I want to
add an assert-only verification that checks that none of the buffer's
need WAL flushed. This would be invoked by any process calling
FlushBuffer(), so it would have to work for all types of backends.

- Melanie

#10

/messages/by-id/b4ad535a72fc02ea43076cf525e4dbaa72b00d5b.camel@j-davis.com

pgsql@j-davis.com

8 months ago

In reply to: Melanie Plageman (#9)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, 2025-09-10 at 11:12 -0400, Melanie Plageman wrote:

So, would you consider the defining characteristic of whether or not
we should use the flush pointer instead of min recovery point in
XLogNeedsFlush() to be whether or not WAL inserts are allowed?

That was my question here:

It seems like XLogFlush() and XLogNeedsFlush() should use the same
test, otherwise you could always get some confusing inconsistency.
Right?

Regards,
Jeff Davis

#11

michael@paquier.xyz

8 months ago

In reply to: Melanie Plageman (#9)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 11:12:37AM -0400, Melanie Plageman wrote:

On Wed, Sep 10, 2025 at 3:58 AM Michael Paquier <michael@paquier.xyz> wrote:

You have a good point here, especially knowing that for
CHECKPOINT_END_OF_RECOVERY case in CreateCheckPoint(), called by the
checkpointer, we allow WAL inserts with LocalSetXLogInsertAllowed().
So, relying on RecoveryInProgress() in this context leads to a wrong
result with XLogNeedsFlush(): WAL inserts are allowed, the minimum
recovery point is not relevant for the evaluation.

So, would you consider the defining characteristic of whether or not
we should use the flush pointer instead of min recovery point in
XLogNeedsFlush() to be whether or not WAL inserts are allowed?

Yeah, using the flush pointer makes sense if WAL inserts are allowed,
even if that's just temporary. We're telling the process where the
inserts are allowed that we are not officially out of recovery yet,
but close to it. If your suggestion is to update XLogNeedsFlush() so
as it uses !XLogInsertAllowed() rather than RecoveryInProgress(), that
makes sense to me. It looks like we've never really had a case up to
now where XLogNeedsFlush() would matter for the processes where
LocalSetXLogInsertAllowed() has been called to temporarily allow WAL
inserts.

LocalSetXLogInsertAllowed() is not a one-way trip as far as I recall,
with the checkpointer being a special case.

In XLogFlush() -> UpdateMinRecoveryPoint() (called after checking if
WAL inserts are allowed), it has similar logic to XLogNeedsFlush():

if (!updateMinRecoveryPoint || (!force && lsn <= LocalMinRecoveryPoint))
return;
if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;

The order of these two if statements is reversed in XLogNeedsFlush()

You mean in XLogNeedsFlush()->RecoveryInProgress() for the first set
of checks, and the top of UpdateMinRecoveryPoint() for the second set
of similar checks. Alexander Kukushkin may have some test cases
fancier than the in-core tests to stress this code path, still +1
for more consistency.

So, what I don't understand is: why it would be okay for processes
other than the startup process to read the minimum recovery point from
the control file during crash recovery before it is valid.

InvalidXLogRecPtr stored in ControlFile->minRecoveryPoint can be used
as a way to check if we are running crash recovery. I don't see why
we could not reverse the order of the checks to make both code paths
more consistent. Cannot think of a reason on top of my mind.

It seems in crash recovery, a backend other than the startup process
will fail this test:

if (XLogRecPtrIsInvalid(LocalMinRecoveryPoint) && InRecovery)

Yes. InRecovery can only ever be true in the startup process.
Anything else will fail this test.

FYI, looking a bit around with the history of XLogNeedsFlush(), the
last changes were a few years ago, with c186ba135eca and 8d68ee6f31ca
(cough):
/messages/by-id/20180831.145206.05203037.horiguchi.kyotaro@lab.ntt.co.jp

The only process calling UpdateMinRecoveryPoint() during crash
recovery should be the startup process, and I don't recall that we've
changed anything in this area lately to change this rule.

There is a comment right after that says

/*
* Check minRecoveryPoint for any other process than the startup
* process doing crash recovery, which should not update the control
* file value if crash recovery is still running.
*/

Which seems to say that it is okay for other processes than the
startup process to read the minimum recovery point from the Control
File during crash recovery. But I don't quite understand why.

minRecoveryPoint as an invalid LSN can allow another process to know
if we are in crash recovery. These days one could just use
GetRecoveryState() if they'd wish to know the state of recovery we are
in based on how much the startup process has replayed. I suspect that
this code could be simplified to rely more on GetRecoveryState(),
instead. This API has been introduced later than the recent changes
of XLogNeedsFlush(), as in 4e87c4836ab9 vs c186ba135eca. You are
right that there are some simplications that can be done here. That
looks like an independent list of patches, at least three:
1) The order of the sanity checks.
2) Potential simplifications in XLogNeedsFlush() with
GetRecoveryState().
3) Switch XLogNeedsFlush() to use !XLogInsertAllowed() rather than
RecoveryInProgress() that discard the local-wal-insert case, with the
addition of an extra assertion. bufmgr.c would be one location, I am
wondering if we should not plant that elsewhere, like in XLogFlush()
itself. There should be no risk of an infinite recursion, if my
reading of the code is right this morning.

Anyway, yes, it really comes down to the point that we've never
bothered using XLogNeedsFlush() to support your case. 3) would be
enough for your case, the other two seem like nice improvements to
make this part of the recovery code leaner.

Or are you worried about the case of GetVictimBuffer() where a second
call of XLogNeedsFlush() lives?

I want it to be fixed because of another patch I am writing to do
batched writes with smgrwritev() [1]. In this case, I've split up the
code in FlushBuffer() and flush the WAL to the required LSN in a
different place in code than where I call smgrwritev(). So, I want to
add an assert-only verification that checks that none of the buffer's
need WAL flushed. This would be invoked by any process calling
FlushBuffer(), so it would have to work for all types of backends.

Adding an assertion based on needs-flush sounds like a good idea in
the context of what you are doing in the other patch, yes. Recovery
test 015 would fail as well in your case, right?
--
Michael

#12

michael@paquier.xyz

8 months ago

In reply to: Jeff Davis (#10)

Re: Incorrect logic in XLogNeedsFlush()

On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote:

It seems like XLogFlush() and XLogNeedsFlush() should use the same
test, otherwise you could always get some confusing inconsistency.
Right?

Even if the checks are duplicated (dependency could be documented as
well), it would make sense to me to plant a check of XLogNeedsFlush()
inside XLogFlush(), I think. I have not tried if some parts of the
tests blow up when trying to do that even after switching
XLogNeedsFlush() to check if WAL inserts are allowed rather than if we
are in recovery.
--
Michael

#13

dilipbalaut@gmail.com

7 months ago

In reply to: Michael Paquier (#12)

Re: Incorrect logic in XLogNeedsFlush()

On Thu, Sep 11, 2025 at 4:51 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote:

It seems like XLogFlush() and XLogNeedsFlush() should use the same
test, otherwise you could always get some confusing inconsistency.
Right?

Even if the checks are duplicated (dependency could be documented as
well), it would make sense to me to plant a check of XLogNeedsFlush()
inside XLogFlush(), I think. I have not tried if some parts of the
tests blow up when trying to do that even after switching
XLogNeedsFlush() to check if WAL inserts are allowed rather than if we
are in recovery.

+1, it really makes XLogFlush() to directly check using
XLogNeedsFlush() after adding the "WAL inserts are allowed" check in
XLogNeedsFlush(), this is the best way to avoid any inconsistencies in
future as well.

--
Regards,
Dilip Kumar
Google

#14

dilipbalaut@gmail.com

7 months ago

In reply to: Dilip Kumar (#13)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Sep 11, 2025 at 4:51 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 10, 2025 at 09:58:08AM -0700, Jeff Davis wrote:

It seems like XLogFlush() and XLogNeedsFlush() should use the same
test, otherwise you could always get some confusing inconsistency.
Right?

Even if the checks are duplicated (dependency could be documented as
well), it would make sense to me to plant a check of XLogNeedsFlush()
inside XLogFlush(), I think. I have not tried if some parts of the
tests blow up when trying to do that even after switching
XLogNeedsFlush() to check if WAL inserts are allowed rather than if we
are in recovery.

+1, it really makes XLogFlush() to directly check using
XLogNeedsFlush() after adding the "WAL inserts are allowed" check in
XLogNeedsFlush(), this is the best way to avoid any inconsistencies in
future as well.

I tried with the attached patch, at least check-world reports no issue.

--
Regards,
Dilip Kumar
Google

#15

michael@paquier.xyz

7 months ago

In reply to: Dilip Kumar (#14)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, Sep 12, 2025 at 08:45:36AM +0530, Dilip Kumar wrote:

On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+1, it really makes XLogFlush() to directly check using
XLogNeedsFlush() after adding the "WAL inserts are allowed" check in
XLogNeedsFlush(), this is the best way to avoid any inconsistencies in
future as well.

I tried with the attached patch, at least check-world reports no issue.

@@ -2797,7 +2797,7 @@ XLogFlush(XLogRecPtr record)
}

     /* Quick exit if already known flushed */
-    if (record <= LogwrtResult.Flush)
+    if (!XLogNeedsFlush(record))
         return;

Hmm, no. You are making more expensive a check that is written to be
cheap. I was more thinking about an assertion at the bottom of
XLogFlush() once a flush is completed. Input and ideas from others
are of course welcome on the matter.
--
Michael

#16

dilipbalaut@gmail.com

7 months ago

In reply to: Michael Paquier (#15)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, Sep 12, 2025 at 9:47 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 12, 2025 at 08:45:36AM +0530, Dilip Kumar wrote:

On Fri, Sep 12, 2025 at 8:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+1, it really makes XLogFlush() to directly check using
XLogNeedsFlush() after adding the "WAL inserts are allowed" check in
XLogNeedsFlush(), this is the best way to avoid any inconsistencies in
future as well.

I tried with the attached patch, at least check-world reports no issue.

@@ -2797,7 +2797,7 @@ XLogFlush(XLogRecPtr record)
}
/* Quick exit if already known flushed */
-    if (record <= LogwrtResult.Flush)
+    if (!XLogNeedsFlush(record))
return;
Hmm, no. You are making more expensive a check that is written to be
cheap. I was more thinking about an assertion at the bottom of
XLogFlush() once a flush is completed. Input and ideas from others
are of course welcome on the matter.

Yeah, asserting it at the end makes sense, as we can ensure that
XLogFlush() and XLogNeedsFlush() agree on the same thing without
adding additional overhead. Here is the revised one.

--
Regards,
Dilip Kumar
Google

#17

michael@paquier.xyz

7 months ago

In reply to: Dilip Kumar (#16)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote:

Yeah, asserting it at the end makes sense, as we can ensure that
XLogFlush() and XLogNeedsFlush() agree on the same thing without
adding additional overhead. Here is the revised one.

Melanie and Jeff, what do you think?
--
Michael

#18

pgsql@j-davis.com

7 months ago

In reply to: Michael Paquier (#17)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, 2025-09-12 at 14:04 +0900, Michael Paquier wrote:

On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote:

Yeah, asserting it at the end makes sense, as we can ensure that
XLogFlush() and XLogNeedsFlush() agree on the same thing without
adding additional overhead. Here is the revised one.

Melanie and Jeff, what do you think?

IIUC: during the end-of-recovery checkpoint, a hypothetical call to
XLogNeedsFlush() would have incorrectly used the LocalMinRecoveryPoint;
and with the change, it would correctly use the flush pointer.

That wasn't a live bug because XLogNeedsFlush() was never actually
called during the end-of-recovery checkpoint. CreateCheckPoint()
called XLogFlush() directly, which correctly checked the flush pointer.
But, hypothetically, if that code had logic based around
XLogNeedsFlush() before calling XLogFlush(), that would have been a
bug.

So this fix has no behavior change, but it would make the code more
resilient against changes to CreateCheckPoint(), more consistent, and
less confusing. Is that right?

Assuming I'm right so far, then I agree with changing the test in
XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does.

The comment in the patch is describing what's happening, but lost the
"why". Perhaps something like:

"During recovery, we don't flush WAL but update minRecoveryPoint
instead. So "needs flush" is taken to mean whether minRecoveryPoint
would need to be updated. The end-of-recovery checkpoint still must
flush WAL like normal, though, so check !XLogInsertAllowed(). This
check must be consistent with XLogFlush()."

The commit message is vague about whether there's a live bug or not; I
suggest clarifying that the inconsistency between the two functions
could have been a bug but wasn't. Also, I generally use past tense for
the stuff that's being fixed and present tense for what the patch does.

One loose end: there was some discussion about the times when
LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely
understood -- is that no longer a concern?

Regards,
Jeff Davis

#19

dilipbalaut@gmail.com

7 months ago

In reply to: Jeff Davis (#18)

Re: Incorrect logic in XLogNeedsFlush()

On Fri, Sep 12, 2025 at 10:02 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2025-09-12 at 14:04 +0900, Michael Paquier wrote:

On Fri, Sep 12, 2025 at 10:21:27AM +0530, Dilip Kumar wrote:

Yeah, asserting it at the end makes sense, as we can ensure that
XLogFlush() and XLogNeedsFlush() agree on the same thing without
adding additional overhead. Here is the revised one.

Melanie and Jeff, what do you think?

IIUC: during the end-of-recovery checkpoint, a hypothetical call to
XLogNeedsFlush() would have incorrectly used the LocalMinRecoveryPoint;
and with the change, it would correctly use the flush pointer.

That wasn't a live bug because XLogNeedsFlush() was never actually
called during the end-of-recovery checkpoint. CreateCheckPoint()
called XLogFlush() directly, which correctly checked the flush pointer.
But, hypothetically, if that code had logic based around
XLogNeedsFlush() before calling XLogFlush(), that would have been a
bug.

So this fix has no behavior change, but it would make the code more
resilient against changes to CreateCheckPoint(), more consistent, and
less confusing. Is that right?

That's right.

Assuming I'm right so far, then I agree with changing the test in
XLogNeedsFlush() to !XLogInsertAllowed(), as the patch does.

The comment in the patch is describing what's happening, but lost the
"why". Perhaps something like:

"During recovery, we don't flush WAL but update minRecoveryPoint
instead. So "needs flush" is taken to mean whether minRecoveryPoint
would need to be updated. The end-of-recovery checkpoint still must
flush WAL like normal, though, so check !XLogInsertAllowed(). This
check must be consistent with XLogFlush()."

Updated as per suggestion

The commit message is vague about whether there's a live bug or not; I
suggest clarifying that the inconsistency between the two functions
could have been a bug but wasn't. Also, I generally use past tense for
the stuff that's being fixed and present tense for what the patch does.

I tried to improve it in v2

One loose end: there was some discussion about the times when
LocalMinRecoveryPoint is valid/correct. I'm not sure I entirely
understood -- is that no longer a concern?

IMHO during crash recovery LocalMinRecoveryPoint and
ControlFile->minRecoveryPoint can not initialized until we replay all
WAL so those should never get accessed. However if XLogNeedsFlush()
were called during the end of the recovery checkpoint it was accessing
this which was wrong although it was not a live bug and this patch is
making that more resilient. OTOH if we are creating a restartpoint
during standby replay or archive recovery then we do access
LocalMinRecoveryPoint and ControlFile->minRecoveryPoint because by
that time we have already done the minimum recovery require for
reaching to the consistent state and we would have initialized
ControlFile->minRecoveryPoint.

IMHO during crash recovery, the variables LocalMinRecoveryPoint and
ControlFile->minRecoveryPoint remain uninitialized until all required
WAL has been replayed. Consequently, they must not be accessed during
this phase. Previously, calling XLogNeedsFlush() during the "end of
recovery checkpoint" incorrectly accessed these uninitialized values.
While this was not a live bug, this patch hardens the code against
this type of race condition.

OTOH, when creating a RestartPoint during standby replay or doing
archive recovery, access to both LocalMinRecoveryPoint and
ControlFile->minRecoveryPoint is correct. By this stage, it has
completed the minimum recovery required to achieve a consistent state,
and ControlFile->minRecoveryPoint would have been properly
initialized.

--
Regards,
Dilip Kumar
Google

#20

pgsql@j-davis.com

7 months ago

In reply to: Dilip Kumar (#19)

Re: Incorrect logic in XLogNeedsFlush()

On Sun, 2025-09-14 at 13:39 +0530, Dilip Kumar wrote:

I tried to improve it in v2

Thank you. Looks good to me.

IMHO during crash recovery LocalMinRecoveryPoint and
ControlFile->minRecoveryPoint can not initialized until we replay all
WAL so those should never get accessed. However if XLogNeedsFlush()
were called during the end of the recovery checkpoint it was
accessing
this which was wrong although it was not a live bug and this patch is
making that more resilient.

OK, so this will no longer be a concern after the patch (and not a live
bug before, anyway).

Regards,
Jeff Davis

#21

dilipbalaut@gmail.com

7 months ago

In reply to: Jeff Davis (#20)

#22