False "pg_serial": apparent wraparound” in logs

Started by Sami Imseihover 2 years ago15 messageshackers
Jump to latest
#1Sami Imseih
samimseih@gmail.com

Hi,

I Recently encountered a situation on the field in which the message
“could not truncate directory "pg_serial": apparent wraparound”
was logged even through there was no danger of wraparound. This
was on a brand new cluster and only took a few minutes to see
the message in the logs.

Reading on some history of this error message, it appears that there
was work done to improve SLRU truncation and associated wraparound
log messages [1]/messages/by-id/20190202083822.GC32531@gust.leadboat.com. The attached repro on master still shows that this message
can be logged incorrectly.

The repro runs updates with 90 threads in serializable mode and kicks
off a “long running” select on the same table in serializable mode.

As soon as the long running select commits, the next checkpoint fails
to truncate the SLRU and logs the error message.

Besides the confusing log message, there may also also be risk with
pg_serial getting unnecessarily bloated and depleting the disk space.

Is this a bug?

[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com

Regards,

Sami Imseih
Amazon Web Services (AWS)

Attachments:

repro.txttext/plain; name=repro.txtDownload
#2Sami Imseih
samimseih@gmail.com
In reply to: Sami Imseih (#1)
Re: False "pg_serial": apparent wraparound” in logs

Hi,

I dug a bit into this and what looks to be happening is the comparison
of the page containing the latest cutoff xid could falsely be reported
as in the future of the last page number because the latest
page number of the Serial slru is only set when the page is
initialized [1]/messages/by-id/20190202083822.GC32531@gust.leadboat.com.

So under the correct conditions, such as in the repro, the serializable
XID has moved past the last page number, therefore to the next checkpoint
which triggers a CheckPointPredicate, it will appear that the slru
has wrapped around.

It seems what may be needed here is to advance the
latest_page_number during SerialSetActiveSerXmin and if
we are using the SLRU. See below:

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..6946ed21b4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -992,6 +992,9 @@ SerialSetActiveSerXmin(TransactionId xid)

serialControl->tailXid = xid;

+       if (serialControl->headPage > 0)
+               SerialSlruCtl->shared->latest_page_number = SerialPage(xid);
+
        LWLockRelease(SerialSLRULock);
}

[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com

Regards,

Sami

From: "Imseih (AWS), Sami" <simseih@amazon.com>
Date: Tuesday, August 22, 2023 at 7:56 PM
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: False "pg_serial": apparent wraparound” in logs

Hi,

I Recently encountered a situation on the field in which the message
“could not truncate directory "pg_serial": apparent wraparound”
was logged even through there was no danger of wraparound. This
was on a brand new cluster and only took a few minutes to see
the message in the logs.

Reading on some history of this error message, it appears that there
was work done to improve SLRU truncation and associated wraparound
log messages [1]/messages/by-id/20190202083822.GC32531@gust.leadboat.com. The attached repro on master still shows that this message
can be logged incorrectly.

The repro runs updates with 90 threads in serializable mode and kicks
off a “long running” select on the same table in serializable mode.

As soon as the long running select commits, the next checkpoint fails
to truncate the SLRU and logs the error message.

Besides the confusing log message, there may also also be risk with
pg_serial getting unnecessarily bloated and depleting the disk space.

Is this a bug?

[1]: /messages/by-id/20190202083822.GC32531@gust.leadboat.com

Regards,

Sami Imseih
Amazon Web Services (AWS)

#3Sami Imseih
samimseih@gmail.com
In reply to: Sami Imseih (#2)
Re: False "pg_serial": apparent wraparound” in logs

Attached a patch with a new CF entry: https://commitfest.postgresql.org/44/4516/

Regards,

Sami Imseih
Amazon Web Services (AWS)

Attachments:

0001-v1-Fix-false-report-of-wraparound-in-pg_serial.patchapplication/octet-stream; name=0001-v1-Fix-false-report-of-wraparound-in-pg_serial.patchDownload+7-1
#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Sami Imseih (#3)
Re: False "pg_serial": apparent wraparound” in logs

On 25/08/2023 07:29, Imseih (AWS), Sami wrote:

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..7e7be3b885 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -992,6 +992,13 @@ SerialSetActiveSerXmin(TransactionId xid)

serialControl->tailXid = xid;

+	/*
+	 * If the SLRU is being used, set the latest page number to
+	 * the current tail xid.
+	 */
+	if (serialControl->headPage > 0)
+		SerialSlruCtl->shared->latest_page_number = SerialPage(serialControl->tailXid);
+
LWLockRelease(SerialSLRULock);
}

I don't really understand what exactly the problem is, or how this fixes
it. But this doesn't feel right:

Firstly, isn't headPage == 0 also a valid value? We initialize headPage
to -1 when it's not in use.

Secondly, shouldn't we set it to the page corresponding to headXid
rather than tailXid.

Thirdly, I don't think this code should have any business setting
latest_page_number directly. latest_page_number is set in
SimpleLruZeroPage(). Are we missing a call to SimpleLruZeroPage() somewhere?

--
Heikki Linnakangas
Neon (https://neon.tech)

#5Sami Imseih
samimseih@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: False "pg_serial": apparent wraparound” in logs

I don't really understand what exactly the problem is, or how this fixes
it. But this doesn't feel right:

As the repro show, false reports of "pg_serial": apparent wraparound”
messages are possible. For a very busy system which checkpoints frequently
and heavy usage of serializable isolation, this will flood the error logs, and
falsely cause alarm to the user. It also prevents the SLRU from being
truncated.

In my repro, I end up seeing, even though the SLRU does not wraparound.
" LOG: could not truncate directory "pg_serial": apparent wraparound"

Firstly, isn't headPage == 0 also a valid value? We initialize headPage
to -1 when it's not in use.

Yes. You are correct. This is wrong.

Secondly, shouldn't we set it to the page corresponding to headXid
rather than tailXid.

Thirdly, I don't think this code should have any business setting
latest_page_number directly. latest_page_number is set in
SimpleLruZeroPage().

Correct, after checking again, I do realize the patch is wrong.

Are we missing a call to SimpleLruZeroPage() somewhere?

That is a good point.

The initial idea was to advance the latest_page_number
during SerialSetActiveSerXmin, but the initial approach is
obviously wrong.

When SerialSetActiveSerXmin is called for a new active
serializable xmin, and at that point we don't need to keep any
any earlier transactions, should SimpleLruZeroPage be called
to ensure there is a target page for the xid?

I tried something like below, which fixes my repro, by calling
SimpleLruZeroPage at the end of SerialSetActiveSerXmin.

@@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
 static void
 SerialSetActiveSerXmin(TransactionId xid)
 {
+       int targetPage = SerialPage(xid);
+
        LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);

/*
@@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)

serialControl->tailXid = xid;

+       if (serialControl->headPage != targetPage)
+               SimpleLruZeroPage(SerialSlruCtl, targetPage);
+
        LWLockRelease(SerialSLRULock);
 }

Regards,

Sami

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Sami Imseih (#5)
Re: False "pg_serial": apparent wraparound” in logs

On 30/09/2023 02:16, Imseih (AWS), Sami wrote:

The initial idea was to advance the latest_page_number
during SerialSetActiveSerXmin, but the initial approach is
obviously wrong.

That approach at high level could work, a

When SerialSetActiveSerXmin is called for a new active
serializable xmin, and at that point we don't need to keep any
any earlier transactions, should SimpleLruZeroPage be called
to ensure there is a target page for the xid?

I tried something like below, which fixes my repro, by calling
SimpleLruZeroPage at the end of SerialSetActiveSerXmin.

@@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
+       int targetPage = SerialPage(xid);
+
LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);

/*
@@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)

serialControl->tailXid = xid;

+       if (serialControl->headPage != targetPage)
+               SimpleLruZeroPage(SerialSlruCtl, targetPage);
+
LWLockRelease(SerialSLRULock);
}

No, that's very wrong too. You are zeroing the page containing the
oldest XID that's still needed. That page still contains important
information. It might work if you zero the previous page, but I think
you need to do a little more than that. (I wish we had tests that would
catch that.)

The crux of the problem is that 'tailXid' can advance past 'headXid'. I
was bit surprised by that, but I think it's by design. I wish it was
called out explicitly in a comment though. The code mostly handles that
fine, except that it confuses the "apparent wraparound" check.

'tailXid' is the oldest XID that we might still need to look up in the
SLRU, based on the transactions that are still active, and 'headXid' is
the newest XID that has been written out to the SLRU. But we only write
an XID out to the SLRU and advance headXid if the shared memory data
structure fills up. So it's possible that as old transactions age out,
we advance 'tailXid' past 'headXid'.

SerialAdd() tolerates tailXid > headXid. It will zero out all the pages
between the old headXid and tailXid, even though no lookups can occur on
those pages. That's unnecessary but harmless.

I think the smallest fix here would be to change CheckPointPredicate()
so that if tailPage > headPage, pass headPage to SimpleLruTruncate()
instead of tailPage. Or perhaps it should go into the "The SLRU is no
longer needed" codepath in that case. If tailPage > headPage, the SLRU
isn't needed at the moment.

In addition to that, we could change SerialAdd() to not zero out the
pages between old headXid and tailXid unnecessarily, but that's more of
an optimization than bug fix.

--
Heikki Linnakangas
Neon (https://neon.tech)

#7Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#6)
Re: False "pg_serial": apparent wraparound” in logs

On Sun, Oct 01, 2023 at 09:43:21PM +0300, Heikki Linnakangas wrote:

I think the smallest fix here would be to change CheckPointPredicate() so
that if tailPage > headPage, pass headPage to SimpleLruTruncate() instead of
tailPage. Or perhaps it should go into the "The SLRU is no longer needed"
codepath in that case. If tailPage > headPage, the SLRU isn't needed at the
moment.

Good idea. Indeed that should be good and simple enough for the
back-branches, at quick glance.
--
Michael

#8Sami Imseih
samimseih@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: False "pg_serial": apparent wraparound” in logs

I think the smallest fix here would be to change CheckPointPredicate()
so that if tailPage > headPage, pass headPage to SimpleLruTruncate()
instead of tailPage. Or perhaps it should go into the "The SLRU is no
longer needed" codepath in that case. If tailPage > headPage, the SLRU
isn't needed at the moment.

I spent sometime studying this and it appears to be a good approach.

Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to
SimpleLruTruncate is already being done when the tailXid is not a valid XID.
I added an additional condition to make sure that the tailPage proceeds the headPage
as well.

Attached is v2 of the patch.

In addition to that, we could change SerialAdd() to not zero out the
pages between old headXid and tailXid unnecessarily, but that's more of
an optimization than bug fix.

Yes, I did notice that in my debugging, but will not address this in the current patch.

Regards,

Sami

Attachments:

0001-v2-Fix-false-apparent-wraparound-detected-in-pg_serial.patchapplication/octet-stream; name=0001-v2-Fix-false-apparent-wraparound-detected-in-pg_serial.patchDownload+12-5
#9Sami Imseih
samimseih@gmail.com
In reply to: Sami Imseih (#8)
Re: False "pg_serial": apparent wraparound” in logs

Correct a typo in my last message:

Instead of:
“ I added an additional condition to make sure that the tailPage proceeds the headPage
as well. “

It should be:
“ I added an additional condition to make sure that the tailPage precedes the headPage
as well. ”

Regards,

Sami

Show quoted text

On Oct 5, 2023, at 6:29 PM, Imseih (AWS), Sami <simseih@amazon.com> wrote:

I added an additional condition to make sure that the tailPage proceeds the headPage
as well.

#10Michael Paquier
michael@paquier.xyz
In reply to: Sami Imseih (#8)
Re: False "pg_serial": apparent wraparound” in logs

On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote:

I spent sometime studying this and it appears to be a good approach.

Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to
SimpleLruTruncate is already being done when the tailXid is not a valid XID.
I added an additional condition to make sure that the tailPage proceeds the headPage
as well.

Attached is v2 of the patch.

Thanks for the updated patch. I have begun looking at what you have
here.
--
Michael

#11Michael Paquier
michael@paquier.xyz
In reply to: Sami Imseih (#8)
Re: False "pg_serial": apparent wraparound” in logs

On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote:

I spent sometime studying this and it appears to be a good approach.

Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to
SimpleLruTruncate is already being done when the tailXid is not a valid XID.
I added an additional condition to make sure that the tailPage proceeds the headPage
as well.

I have been studying the whole area, and these threads from 2011 have
come to me, with two separate attempts:
/messages/by-id/4DF105ED.1050800@enterprisedb.com
/messages/by-id/4D8F54E6020000250003BD16@gw.wicourts.gov

Back then, we were pretty much OK with the amount of space that could
be wasted even in this case. Actually, how much space are we talking
about here when a failed truncation happens? As this is basically
harmless, still leads to a confusing message, do we really need a
backpatch here?

Anyway, it looks like you're right, we don't really need the SLRU once
the tail is ahead of the tail because the SLRU has wrapped around due
to the effect of transactions aging out, so making the truncation a
bit smarter should be OK.

+    /*
+     * Check if the tailXid is valid and that the tailPage is not ahead of
+     * the headPage, otherwise the SLRU is no longer needed.
+     */

Hmm. This doesn't seem enough. Shouldn't we explain at least in
which scenarios the tail can get ahead of the head (aka at least
with long running transactions that make the SLRU wrap-around)?
Except if I am missing something, there is no explanation of that in
predicate.c.
--
Michael

#12Sami Imseih
samimseih@gmail.com
In reply to: Michael Paquier (#11)
Re: False "pg_serial": apparent wraparound” in logs

Sorry for the delay in response.

Back then, we were pretty much OK with the amount of space that could
be wasted even in this case. Actually, how much space are we talking
about here when a failed truncation happens?

It is a transient waste in space as it will eventually clean up.

As this is basically
harmless, still leads to a confusing message,

Correct, and especially because the message has
"wraparound" in the text.

do we really need a backpatch here?

No, I don't think a backpatch is necessary.

Anyway, it looks like you're right, we don't really need the SLRU once
the tail is ahead of the tail because the SLRU has wrapped around due
to the effect of transactions aging out, so making the truncation a
bit smarter should be OK.

I assume you meant " the tail is ahead of the head".

SummarizeOldestCommittedSxact advances the headXid, but if we
checkpoint before this is called, then the tail could be ahead. The tail is
advanced by SerialSetActiveSerXmin whenever there is a new serializable
transaction.

Hmm. This doesn't seem enough. Shouldn't we explain at least in
which scenarios the tail can get ahead of the head (aka at least
with long running transactions that make the SLRU wrap-around)?
Except if I am missing something, there is no explanation of that in
predicate.c.

After looking at this a bit more, I don't think the previous rev is correct.
We should not fall through to the " The SLRU is no longer needed." Which
also sets the headPage to invalid. We should only truncate up to the
head page.

Please see attached v3.

Regards,

Sami

Attachments:

0001-v3-Fix-false-apparent-wraparound-detected-in-pg_serial.patchapplication/octet-stream; name=0001-v3-Fix-false-apparent-wraparound-detected-in-pg_serial.patchDownload+17-5
#13Michael Paquier
michael@paquier.xyz
In reply to: Sami Imseih (#12)
Re: False "pg_serial": apparent wraparound” in logs

On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote:

Anyway, it looks like you're right, we don't really need the SLRU once
the tail is ahead of the tail because the SLRU has wrapped around due
to the effect of transactions aging out, so making the truncation a
bit smarter should be OK.

I assume you meant " the tail is ahead of the head".

Damn fingers on a keyboard who don't know how to type.

Hmm. This doesn't seem enough. Shouldn't we explain at least in
which scenarios the tail can get ahead of the head (aka at least
with long running transactions that make the SLRU wrap-around)?
Except if I am missing something, there is no explanation of that in
predicate.c.

After looking at this a bit more, I don't think the previous rev is correct.
We should not fall through to the " The SLRU is no longer needed." Which
also sets the headPage to invalid. We should only truncate up to the
head page.

Seems correct to me. Or this would count as if the SLRU is not in
use, but it's being used.
--
Michael

#14Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#13)
Re: False "pg_serial": apparent wraparound” in logs

On Mon, Oct 16, 2023 at 04:58:31PM +0900, Michael Paquier wrote:

On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote:

After looking at this a bit more, I don't think the previous rev is correct.
We should not fall through to the " The SLRU is no longer needed." Which
also sets the headPage to invalid. We should only truncate up to the
head page.

Seems correct to me. Or this would count as if the SLRU is not in
use, but it's being used.

So, I've spent more time on that and applied the simplification today,
doing as you have suggested to use the head page rather than the tail
page when the tail XID is ahead of the head XID, but without disabling
the whole. I've simplified a bit the code and the comments, though,
while on it (some renames and a slight refactoring of tailPage, for
example).
--
Michael

#15Sami Imseih
samimseih@gmail.com
In reply to: Michael Paquier (#14)
Re: False "pg_serial": apparent wraparound” in logs

So, I've spent more time on that and applied the simplification today,
doing as you have suggested to use the head page rather than the tail
page when the tail XID is ahead of the head XID, but without disabling
the whole. I've simplified a bit the code and the comments, though,
while on it (some renames and a slight refactoring of tailPage, for
example).
--
Michael

Thank you!

Regards,

Sami