Is this a problem in GenericXLogFinish()?

Started by Jeff Davis over 2 years ago · 64 messages
#1 Jeff Davis
pgsql@j-davis.com

src/backend/transam/README says:

...
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This 
must happen before the WAL record is inserted; see notes in 
SyncOneBuffer().)
...

But GenericXLogFinish() does this:

...
/* Insert xlog record */
lsn = XLogInsert(RM_GENERIC_ID, 0);

/* Set LSN and mark buffers dirty */
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
{
    PageData   *pageData = &state->pages[i];

    if (BufferIsInvalid(pageData->buffer))
        continue;
    PageSetLSN(BufferGetPage(pageData->buffer), lsn);
    MarkBufferDirty(pageData->buffer);
}
END_CRIT_SECTION();

Am I missing something or is that a problem?

Regards,
Jeff Davis

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Davis (#1)
Re: Is this a problem in GenericXLogFinish()?

On 22/09/2023 23:52, Jeff Davis wrote:

src/backend/transam/README says:

...
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This
must happen before the WAL record is inserted; see notes in
SyncOneBuffer().)
...

But GenericXLogFinish() does this:

...
/* Insert xlog record */
lsn = XLogInsert(RM_GENERIC_ID, 0);

/* Set LSN and mark buffers dirty */
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
{
    PageData   *pageData = &state->pages[i];

    if (BufferIsInvalid(pageData->buffer))
        continue;
    PageSetLSN(BufferGetPage(pageData->buffer), lsn);
    MarkBufferDirty(pageData->buffer);
}
END_CRIT_SECTION();

Am I missing something or is that a problem?

Yes, that's a problem.

I wish we had an assertion for that. XLogInsert() could assert that the
page is already marked dirty, for example.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#2)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote:

Yes, that's a problem.

Patch attached. I rearranged the code a bit to follow the expected
pattern of: write, mark dirty, WAL, set LSN. I think computing the
deltas could also be moved earlier, outside of the critical section,
but I'm not sure that would be useful.

Do you have a suggestion for any kind of test addition, or should we
just review carefully?

I wish we had an assertion for that. XLogInsert() could assert that the
page is already marked dirty, for example.

Unfortunately that's not always the case, e.g. log_newpage_range().

Regards,
Jeff Davis

Attachments:

v1-0001-Fix-GenericXLogFinish.patch (text/x-patch; charset=UTF-8)
From c54749a7939721f0a838115abce09967216cbc0f Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 26 Sep 2023 12:10:11 -0700
Subject: [PATCH v1] Fix GenericXLogFinish().

Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi
---
 src/backend/access/transam/generic_xlog.c | 66 ++++++++++++-----------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6c68191ca6..2b7e054ebe 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -343,10 +343,12 @@ GenericXLogFinish(GenericXLogState *state)
 	if (state->isLogged)
 	{
 		/* Logged relation: make xlog record in critical section. */
-		XLogBeginInsert();
-
 		START_CRIT_SECTION();
 
+		/*
+		 * Compute deltas if necessary, write changes to buffers, and mark
+		 * them dirty.
+		 */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -359,41 +361,42 @@ GenericXLogFinish(GenericXLogState *state)
 			page = BufferGetPage(pageData->buffer);
 			pageHeader = (PageHeader) pageData->image;
 
+			/* delta not needed for a full page image */
+			if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
+				computeDelta(pageData, page, (Page) pageData->image);
+
+			/*
+			 * Apply the image, being careful to zero the "hole" between
+			 * pd_lower and pd_upper in order to avoid divergence between
+			 * actual page state and what replay would produce.
+			 */
+			memcpy(page, pageData->image, pageHeader->pd_lower);
+			memset(page + pageHeader->pd_lower, 0,
+				   pageHeader->pd_upper - pageHeader->pd_lower);
+			memcpy(page + pageHeader->pd_upper,
+				   pageData->image + pageHeader->pd_upper,
+				   BLCKSZ - pageHeader->pd_upper);
+
+			MarkBufferDirty(pageData->buffer);
+		}
+
+		XLogBeginInsert();
+
+		/* Register buffers */
+		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
+		{
+			PageData   *pageData = &state->pages[i];
+
+			if (BufferIsInvalid(pageData->buffer))
+				continue;
+
 			if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
 			{
-				/*
-				 * A full-page image does not require us to supply any xlog
-				 * data.  Just apply the image, being careful to zero the
-				 * "hole" between pd_lower and pd_upper in order to avoid
-				 * divergence between actual page state and what replay would
-				 * produce.
-				 */
-				memcpy(page, pageData->image, pageHeader->pd_lower);
-				memset(page + pageHeader->pd_lower, 0,
-					   pageHeader->pd_upper - pageHeader->pd_lower);
-				memcpy(page + pageHeader->pd_upper,
-					   pageData->image + pageHeader->pd_upper,
-					   BLCKSZ - pageHeader->pd_upper);
-
 				XLogRegisterBuffer(i, pageData->buffer,
 								   REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
 			}
 			else
 			{
-				/*
-				 * In normal mode, calculate delta and write it as xlog data
-				 * associated with this page.
-				 */
-				computeDelta(pageData, page, (Page) pageData->image);
-
-				/* Apply the image, with zeroed "hole" as above */
-				memcpy(page, pageData->image, pageHeader->pd_lower);
-				memset(page + pageHeader->pd_lower, 0,
-					   pageHeader->pd_upper - pageHeader->pd_lower);
-				memcpy(page + pageHeader->pd_upper,
-					   pageData->image + pageHeader->pd_upper,
-					   BLCKSZ - pageHeader->pd_upper);
-
 				XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
 				XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
 			}
@@ -402,7 +405,7 @@ GenericXLogFinish(GenericXLogState *state)
 		/* Insert xlog record */
 		lsn = XLogInsert(RM_GENERIC_ID, 0);
 
-		/* Set LSN and mark buffers dirty */
+		/* Set LSN */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -410,7 +413,6 @@ GenericXLogFinish(GenericXLogState *state)
 			if (BufferIsInvalid(pageData->buffer))
 				continue;
 			PageSetLSN(BufferGetPage(pageData->buffer), lsn);
-			MarkBufferDirty(pageData->buffer);
 		}
 		END_CRIT_SECTION();
 	}
-- 
2.34.1

#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Davis (#3)
Re: Is this a problem in GenericXLogFinish()?

On 26/09/2023 22:32, Jeff Davis wrote:

On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote:

Yes, that's a problem.

Patch attached. I rearranged the code a bit to follow the expected
pattern of: write, mark dirty, WAL, set LSN. I think computing the
deltas could also be moved earlier, outside of the critical section,
but I'm not sure that would be useful.

Looks correct. You now loop through all the block IDs three times,
however. I wonder if that is measurably slower, but even if it's not,
was there a reason you wanted to move the XLogRegisterBuffer() calls to
a separate loop?

Do you have a suggestion for any kind of test addition, or should we
just review carefully?

I wish we had an assertion for that. XLogInsert() could assert that
the page is already marked dirty, for example.

Unfortunately that's not always the case, e.g. log_newpage_range().

Hmm, I'm sure there are exceptions but log_newpage_range() actually
seems to be doing the right thing; it calls MarkBufferDirty() before
XLogInsert(). It only calls it after XLogRegisterBuffer() though, and I
concur that XLogRegisterBuffer() would be the logical place for the
assertion. We could tighten this up, require that you call
MarkBufferDirty() before XLogRegisterBuffer(), and fix all the callers.

--
Heikki Linnakangas
Neon (https://neon.tech)

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#4)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Wed, 2023-09-27 at 00:14 +0300, Heikki Linnakangas wrote:

Looks correct. You now loop through all the block IDs three times,
however. I wonder if that is measurably slower, but even if it's not,
was there a reason you wanted to move the XLogRegisterBuffer() calls to
a separate loop?

I did so to correspond more closely to what's outlined in the README
and in other places in the code, where marking the buffers dirty
happens before XLogBeginInsert(). It didn't occur to me that one extra
loop would matter, but I can combine them again.

It would be a bit more concise to do the XLogBeginInsert() first (like
before) and then register the buffers in the same loop that does the
writes and marks the buffers dirty. Updated patch attached.

Hmm, I'm sure there are exceptions but log_newpage_range() actually
seems to be doing the right thing; it calls MarkBufferDirty() before
XLogInsert(). It only calls it after XLogRegisterBuffer() though, and I
concur that XLogRegisterBuffer() would be the logical place for the
assertion. We could tighten this up, require that you call
MarkBufferDirty() before XLogRegisterBuffer(), and fix all the callers.

That site is pretty trivial to fix, but there are also a couple places
in hash.c and hashovfl.c that are registering a clean page and it's not
clear to me exactly what's going on.

Regards,
Jeff Davis

Attachments:

v2-0001-Fix-GenericXLogFinish.patch (text/x-patch; charset=UTF-8)
From 9548bb49865db3a04bbdda4c63df611142e22964 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 26 Sep 2023 12:10:11 -0700
Subject: [PATCH v2] Fix GenericXLogFinish().

Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi
---
 src/backend/access/transam/generic_xlog.c | 53 ++++++++++-------------
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 6c68191ca6..2d5ff4aa7c 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -347,6 +347,10 @@ GenericXLogFinish(GenericXLogState *state)
 
 		START_CRIT_SECTION();
 
+		/*
+		 * Compute deltas if necessary, write changes to buffers, mark
+		 * buffers dirty, and register changes.
+		 */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -359,41 +363,31 @@ GenericXLogFinish(GenericXLogState *state)
 			page = BufferGetPage(pageData->buffer);
 			pageHeader = (PageHeader) pageData->image;
 
+			/* delta not needed for a full page image */
+			if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
+				computeDelta(pageData, page, (Page) pageData->image);
+
+			/*
+			 * Apply the image, being careful to zero the "hole" between
+			 * pd_lower and pd_upper in order to avoid divergence between
+			 * actual page state and what replay would produce.
+			 */
+			memcpy(page, pageData->image, pageHeader->pd_lower);
+			memset(page + pageHeader->pd_lower, 0,
+				   pageHeader->pd_upper - pageHeader->pd_lower);
+			memcpy(page + pageHeader->pd_upper,
+				   pageData->image + pageHeader->pd_upper,
+				   BLCKSZ - pageHeader->pd_upper);
+
+			MarkBufferDirty(pageData->buffer);
+
 			if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
 			{
-				/*
-				 * A full-page image does not require us to supply any xlog
-				 * data.  Just apply the image, being careful to zero the
-				 * "hole" between pd_lower and pd_upper in order to avoid
-				 * divergence between actual page state and what replay would
-				 * produce.
-				 */
-				memcpy(page, pageData->image, pageHeader->pd_lower);
-				memset(page + pageHeader->pd_lower, 0,
-					   pageHeader->pd_upper - pageHeader->pd_lower);
-				memcpy(page + pageHeader->pd_upper,
-					   pageData->image + pageHeader->pd_upper,
-					   BLCKSZ - pageHeader->pd_upper);
-
 				XLogRegisterBuffer(i, pageData->buffer,
 								   REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
 			}
 			else
 			{
-				/*
-				 * In normal mode, calculate delta and write it as xlog data
-				 * associated with this page.
-				 */
-				computeDelta(pageData, page, (Page) pageData->image);
-
-				/* Apply the image, with zeroed "hole" as above */
-				memcpy(page, pageData->image, pageHeader->pd_lower);
-				memset(page + pageHeader->pd_lower, 0,
-					   pageHeader->pd_upper - pageHeader->pd_lower);
-				memcpy(page + pageHeader->pd_upper,
-					   pageData->image + pageHeader->pd_upper,
-					   BLCKSZ - pageHeader->pd_upper);
-
 				XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
 				XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
 			}
@@ -402,7 +396,7 @@ GenericXLogFinish(GenericXLogState *state)
 		/* Insert xlog record */
 		lsn = XLogInsert(RM_GENERIC_ID, 0);
 
-		/* Set LSN and mark buffers dirty */
+		/* Set LSN */
 		for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
 		{
 			PageData   *pageData = &state->pages[i];
@@ -410,7 +404,6 @@ GenericXLogFinish(GenericXLogState *state)
 			if (BufferIsInvalid(pageData->buffer))
 				continue;
 			PageSetLSN(BufferGetPage(pageData->buffer), lsn);
-			MarkBufferDirty(pageData->buffer);
 		}
 		END_CRIT_SECTION();
 	}
-- 
2.34.1

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#5)
Re: Is this a problem in GenericXLogFinish()?

On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis <pgsql@j-davis.com> wrote:

That site is pretty trivial to fix, but there are also a couple places
in hash.c and hashovfl.c that are registering a clean page and it's not
clear to me exactly what's going on.

Huh, I wonder if that's just a bug. Do you know where exactly?

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#6)
Re: Is this a problem in GenericXLogFinish()?

On Wed, 2023-09-27 at 10:36 -0400, Robert Haas wrote:

On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis <pgsql@j-davis.com> wrote:

That site is pretty trivial to fix, but there are also a couple places
in hash.c and hashovfl.c that are registering a clean page and it's not
clear to me exactly what's going on.

Huh, I wonder if that's just a bug. Do you know where exactly?

hashovfl.c:665 and hash.c:831

In both cases it's registering the bucket_buf, and has a comment like:

/*
* bucket buffer needs to be registered to ensure that we can
* acquire a cleanup lock on it during replay.
*/

And various redo routines have comments like:

/*
 * Ensure we have a cleanup lock on primary bucket page before we start
 * with the actual replay operation. This is to ensure that neither a
 * scan can start nor a scan can be already-in-progress during the
 * replay of this operation. If we allow scans during this operation,
 * then they can miss some records or show the same record multiple
 * times.
 */

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?

Regards,
Jeff Davis

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#7)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis <pgsql@j-davis.com> wrote:

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?

I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kaplla and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#8)
Re: Is this a problem in GenericXLogFinish()?

On 27/09/2023 18:47, Robert Haas wrote:

On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis <pgsql@j-davis.com> wrote:

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modiying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?

I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kapila and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(

That seems OK. Maybe there would be a better way to do that, but there's
nothing wrong with that approach per se.

We could define a new REGBUF_NO_CHANGE flag to signal that the buffer is
registered just for locking purposes, and not actually modified by the
WAL record.

--
Heikki Linnakangas
Neon (https://neon.tech)

#10 Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#9)
Re: Is this a problem in GenericXLogFinish()?

On Wed, 2023-09-27 at 18:57 +0300, Heikki Linnakangas wrote:

We could define a new REGBUF_NO_CHANGE flag to signal that the buffer
is registered just for locking purposes, and not actually modified by
the WAL record.

Out of curiosity I also added a trial assert (not intending to actually
add this):

Assert(!(flags & REGBUF_NO_CHANGE) || !BufferIsExclusiveLocked());

I expected that to fail, but it didn't -- perhaps that buffer is only
locked during replay or something? I don't think we want that specific
Assert; I was just experimenting with some cross-checks we could do to
verify that REGBUF_NO_CHANGE is used properly.

Also, I ran into some problems with GIN that might require a bit more
refactoring in ginPlaceToPage(). Perhaps we could consider
REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
use it temporarily until we have a chance to analyze/refactor each of
the callers that need it.

Regards,
Jeff Davis

#11 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#10)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

I committed and backported 0001 (the GenericXLogFinish() fix but not
the Assert).

Strangely I didn't see the -committers email come through yet. If
anyone notices anything wrong, please let me know before the final v11
release.

On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote:

Also, I ran into some problems with GIN that might require a bit more
refactoring in ginPlaceToPage(). Perhaps we could consider
REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
use it temporarily until we have a chance to analyze/refactor each of
the callers that need it.

For the Assert, I have attached a new patch for v17.

I changed the name of the flag to REGBUF_CLEAN_OK, because for some of
the callsites it was not clear to me whether the page is always
unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a
way to bypass the Assert for callsites where we either know that we
actually want to register an unchanged page, or for callsites where we
don't know if the page is changed or not.

If we commit this, ideally someone would look into the places where
REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
refactor so that it would benefit from the Assert. But refactoring
those places is outside of the scope of this patch.

It sounds like we have no intention to change the hash index code, so
are we OK if this flag just lasts forever? Do you still think it offers
a useful check?

Regards,
Jeff Davis

Attachments:

v3-0002-Assert-that-buffers-are-marked-dirty-before-XLogR.patch (text/x-patch; charset=UTF-8)
From 859c8f4284de1030ab070630fa06af3c171264d9 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 28 Sep 2023 11:19:06 -0700
Subject: [PATCH v3] Assert that buffers are marked dirty before
 XLogRegisterBuffer().

A page must be marked dirty before writing WAL, as documented in
transam/README.

Enforce this rule in XLogRegisterBuffer(), though it's not actually a
bug if (a) the page really is clean; or (b) the buffer is marked dirty
after XLogRegisterBuffer() and before XLogInsert().

Update log_newpage_range(), where it's trivial to change the order so
that we can check in XLogRegisterBuffer(). Other callers are more
complex, and updating them is outside the scope of this patch (and
perhaps not desirable), so introduce a flag REGBUF_CLEAN_OK to bypass
the Assert.

Note that this commit may have missed some callsites which can, at
least in some cases, register a clean buffer. If such a callsite is
found, it can be updated to use REGBUF_CLEAN_OK assuming it doesn't
represent an actual bug.

Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi
---
 src/backend/access/gin/ginbtree.c       |  3 ++-
 src/backend/access/gin/ginfast.c        |  2 +-
 src/backend/access/hash/hash.c          | 10 ++++++---
 src/backend/access/hash/hashovfl.c      |  9 +++++---
 src/backend/access/transam/xloginsert.c |  3 ++-
 src/backend/storage/buffer/bufmgr.c     | 30 +++++++++++++++++++++++++
 src/include/access/xloginsert.h         |  2 ++
 src/include/storage/bufmgr.h            |  1 +
 8 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 06e97a7fbb..c98176af77 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -389,7 +389,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
 		{
 			XLogBeginInsert();
-			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
+			XLogRegisterBuffer(0, stack->buffer,
+							   REGBUF_STANDARD | REGBUF_CLEAN_OK);
 			if (BufferIsValid(childbuf))
 				XLogRegisterBuffer(1, childbuf, REGBUF_STANDARD);
 		}
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index eb6c554831..e46e11df07 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,7 +399,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 		Assert((ptr - collectordata) <= collector->sumsize);
 		if (needWal)
 		{
-			XLogRegisterBuffer(1, buffer, REGBUF_STANDARD);
+			XLogRegisterBuffer(1, buffer, REGBUF_STANDARD | REGBUF_CLEAN_OK);
 			XLogRegisterBufData(1, collectordata, collector->sumsize);
 		}
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606..7a62100c89 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -824,11 +824,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
 
 				/*
-				 * bucket buffer needs to be registered to ensure that we can
-				 * acquire a cleanup lock on it during replay.
+				 * bucket buffer was not changed, but still needs to be
+				 * registered to ensure that we can acquire a cleanup lock on
+				 * it during replay.
 				 */
 				if (!xlrec.is_primary_bucket_page)
-					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+				{
+					uint8 flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_CLEAN_OK;
+					XLogRegisterBuffer(0, bucket_buf, flags);
+				}
 
 				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
 				XLogRegisterBufData(1, (char *) deletable,
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 39bb2cb9f6..57688e26b5 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -658,11 +658,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
 
 		/*
-		 * bucket buffer needs to be registered to ensure that we can acquire
-		 * a cleanup lock on it during replay.
+		 * bucket buffer was not changed, but still needs to be registered to
+		 * ensure that we can acquire a cleanup lock on it during replay.
 		 */
 		if (!xlrec.is_prim_bucket_same_wrt)
-			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+		{
+			uint8 flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_CLEAN_OK;
+			XLogRegisterBuffer(0, bucketbuf, flags);
+		}
 
 		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 		if (xlrec.ntups > 0)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 588626424e..80df735897 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -246,6 +246,7 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
 
 	/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
 	Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
+	Assert((flags & REGBUF_CLEAN_OK) || BufferIsDirty(buffer));
 	Assert(begininsert_called);
 
 	if (block_id >= max_registered_block_id)
@@ -1313,8 +1314,8 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 		START_CRIT_SECTION();
 		for (i = 0; i < nbufs; i++)
 		{
-			XLogRegisterBuffer(i, bufpack[i], flags);
 			MarkBufferDirty(bufpack[i]);
+			XLogRegisterBuffer(i, bufpack[i], flags);
 		}
 
 		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8b96759b84..23b80322a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2098,6 +2098,36 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	return first_block;
 }
 
+/*
+ * BufferIsDirty
+ *
+ *		Checks if buffer is already dirty.
+ *
+ * Buffer must be pinned and exclusive-locked.  (If caller does not hold
+ * exclusive lock, then the result may be stale before it's returned.)
+ */
+bool
+BufferIsDirty(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	if (BufferIsLocal(buffer))
+	{
+		int bufid = -buffer - 1;
+		bufHdr = GetLocalBufferDescriptor(bufid);
+	}
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+	}
+
+	Assert(BufferIsPinned(buffer));
+	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE));
+
+	return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
+}
+
 /*
  * MarkBufferDirty
  *
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 31785dc578..3469890cf1 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -37,6 +37,8 @@
 									 * will be skipped) */
 #define REGBUF_KEEP_DATA	0x10	/* include data even if a full-page image
 									 * is taken */
+#define REGBUF_CLEAN_OK		0x20	/* intentionally register buffer that may
+									 * be clean */
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d89021f918..8103113f09 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -179,6 +179,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
+extern bool BufferIsDirty(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern void CheckBufferIsPinnedOnce(Buffer buffer);
-- 
2.34.1

#12 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Davis (#11)
Re: Is this a problem in GenericXLogFinish()?

On 10/10/2023 22:57, Jeff Davis wrote:

On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote:

Also, I ran into some problems with GIN that might require a bit more
refactoring in ginPlaceToPage(). Perhaps we could consider
REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
use it temporarily until we have a chance to analyze/refactor each of
the callers that need it.

For the Assert, I have attached a new patch for v17.

Thanks for working on this!

+/*
+ * BufferIsDirty
+ *
+ *		Checks if buffer is already dirty.
+ *
+ * Buffer must be pinned and exclusive-locked.  (If caller does not hold
+ * exclusive lock, then the result may be stale before it's returned.)
+ */
+bool
+BufferIsDirty(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	if (BufferIsLocal(buffer))
+	{
+		int bufid = -buffer - 1;
+		bufHdr = GetLocalBufferDescriptor(bufid);
+	}
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+	}
+
+	Assert(BufferIsPinned(buffer));
+	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE));
+
+	return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
+}

The comment suggests that you don't need to hold an exclusive lock when
you call this, but there's an assertion that you do.

Is it a new requirement that you must hold an exclusive lock on the
buffer when you call XLogRegisterBuffer()? I think that's reasonable.

I thought that might be a problem when you WAL-log a page when you
set hint bits on it and checksums are enabled. Hint bits can be set
holding just a share lock. But it uses XLogSaveBufferForHint(), which
calls XLogRegisterBlock() instead of XLogRegisterBuffer().

There is a comment above XLogSaveBufferForHint() that implies that
XLogRegisterBuffer() requires you to hold an exclusive-lock but I don't
see that spelled out explicitly in XLogRegisterBuffer() itself. Maybe
add an assertion for that in XLogRegisterBuffer() to make it more explicit.

--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -246,6 +246,7 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)

/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
+ Assert((flags & REGBUF_CLEAN_OK) || BufferIsDirty(buffer));
Assert(begininsert_called);

if (block_id >= max_registered_block_id)

I'd suggest a comment here to explain what's wrong if someone hits this
assertion. Something like "The buffer must be marked as dirty before
registering, unless REGBUF_CLEAN_OK is set to indicate that the buffer
is not being modified".

I changed the name of the flag to REGBUF_CLEAN_OK, because for some of
the callsites it was not clear to me whether the page is always
unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a
way to bypass the Assert for callsites where we either know that we
actually want to register an unchanged page, or for callsites where we
don't know if the page is changed or not.

Ok. A flag to explicitly mark that the page is not modified would
actually be nice for pg_rewind. It could ignore such page references.
Not that it matters much in practice, since those records are so rare.
And for that, we'd need to include the flag in the WAL record too. So I
think this is fine.

If we commit this, ideally someone would look into the places where
REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
refactor so that it would benefit from the Assert. But refactoring
those places is outside of the scope of this patch.

About that: you added the flag to a couple of XLogRegisterBuffer() calls
in GIN, because they call MarkBufferDirty() only after
XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest
doing that now.

It sounds like we have no intention to change the hash index code, so
are we OK if this flag just lasts forever? Do you still think it offers
a useful check?

Yeah, I think this is a useful assertion. It might catch some bugs in
extensions too.

--
Heikki Linnakangas
Neon (https://neon.tech)

#13Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Oct 11, 2023 at 7:53 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+ * Buffer must be pinned and exclusive-locked.  (If caller does not hold
+ * exclusive lock, then the result may be stale before it's returned.)

The comment suggests that you don't need to hold an exclusive lock when
you call this, but there's an assertion that you do.

I don't think the comment suggests that. It would if you only read the
sentence in parentheses. But if you read both of them it seems clear
enough. I guess the parenthetical sentence could say "If the caller
did not hold an exclusive lock, then the result might become stale
even before it was returned," basically putting the whole thing in the
subjunctive.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#12)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Wed, 2023-10-11 at 14:53 +0300, Heikki Linnakangas wrote:

The comment suggests that you don't need to hold an exclusive lock
when
you call this, but there's an assertion that you do.

Reworded.

Is it a new requirement that you must hold an exclusive lock on the
buffer when you call XLogRegisterBuffer()? I think that's reasonable.

Agreed.

I'd suggest a comment here to explain what's wrong if someone hits
this
assertion. Something like "The buffer must be marked as dirty before
registering, unless REGBUF_CLEAN_OK is set to indicate that the
buffer
is not being modified".

Done, with different wording.

If we commit this, ideally someone would look into the places where
REGBUF_CLEAN_OK is used and make sure that it's not a bug, and
perhaps
refactor so that it would benefit from the Assert. But refactoring
those places is outside of the scope of this patch.

I don't think it makes sense to register a buffer that is perhaps clean
and perhaps not. After registering a buffer and writing WAL, you need
to PageSetLSN(), and that should only be done after MarkBufferDirty(),
right?

So if you need to condition PageSetLSN() on whether MarkBufferDirty()
has happened, it would be trivial to use the same condition for
XLogRegisterBuffer(). Therefore I changed the name back to
REGBUF_NO_CHANGE, as you suggested originally.

The hash indexing code knows it's not modifying the bucket buffer,
doesn't mark it dirty, and doesn't set the LSN. I presume that's safe.

However, there are quite a few places where
XLogRegisterBuffer()+WAL+PageSetLSN() all happen, but it's not
immediately obvious that all code paths to get there also
MarkBufferDirty(). For instanace:

lazy_scan_heap() -- conditionally marks heapbuf as dirty
visibilitymap_set() -- conditionally calls log_heap_visible
log_heap_visible()
XLogRegisterBuffer(1, heap_buffer, flags)

if those conditions don't match up exactly, it seems we could get into
a situation where we update the LSN of a clean page, which seems bad.

There are a few other places in the hash code and spgist code where I
have similar concerns. Not many, though, I looked at all of the call
sites (at least briefly) and most of them are fine.

About that: you added the flag to a couple of XLogRegisterBuffer()
calls
in GIN, because they call MarkBufferDirty() only after
XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest
doing that now.

Done.

It sounds like we have no intention to change the hash index code,
so
are we OK if this flag just lasts forever? Do you still think it
offers
a useful check?

Yeah, I think this is a useful assertion. It might catch some bugs in
extensions too.

I was asking more about the REGBUF_NO_CHANGE flag. It feels very
specific to the hash indexes, and I'm not sure we want to encourage
extensions to use it.

Though maybe the locking protocol used by hash indexes is more
generally applicable, and other indexing strategies might want to use
it, too?

Another option might be to just change the hash indexing code to follow
the correct protocol, locking and calling MarkBufferDirty() in those 3
call sites. Marking the buffer dirty is easy, but making sure that it's
locked might require some refactoring. Robert, would following the
right protocol here affect performance?

Regards,
Jeff Davis

Attachments:

v4-0002-Assert-that-buffers-are-marked-dirty-before-XLogR.patchtext/x-patch; charset=UTF-8; name=v4-0002-Assert-that-buffers-are-marked-dirty-before-XLogR.patchDownload
From f0c2d489dd8f2268803f4d2cc8642e29cb0d3576 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 28 Sep 2023 11:19:06 -0700
Subject: [PATCH v4] Assert that buffers are marked dirty before
 XLogRegisterBuffer().

Enforce the rule from transam/README in XLogRegisterBuffer(), and
update callers to follow the rule.

Hash indexes sometimes register clean pages as a part of the locking
protocol, so provide a REGBUF_NO_CHANGE flag to support that use.

Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi
---
 src/backend/access/gin/ginbtree.c       | 14 +++---
 src/backend/access/gin/gindatapage.c    |  6 +++
 src/backend/access/gin/ginentrypage.c   |  3 ++
 src/backend/access/gin/ginfast.c        |  5 ++-
 src/backend/access/hash/hash.c          | 11 +++--
 src/backend/access/hash/hashovfl.c      | 21 ++++++---
 src/backend/access/transam/xloginsert.c | 16 ++++++-
 src/backend/storage/buffer/bufmgr.c     | 59 +++++++++++++++++++++++++
 src/include/access/xloginsert.h         |  1 +
 src/include/storage/bufmgr.h            |  2 +
 10 files changed, 118 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 06e97a7fbb..fc694b40f1 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -387,24 +387,22 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		START_CRIT_SECTION();
 
 		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
-		{
 			XLogBeginInsert();
-			XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD);
-			if (BufferIsValid(childbuf))
-				XLogRegisterBuffer(1, childbuf, REGBUF_STANDARD);
-		}
 
-		/* Perform the page update, and register any extra WAL data */
+		/*
+		 * Perform the page update, dirty and register stack->buffer, and
+		 * register any extra WAL data.
+		 */
 		btree->execPlaceToPage(btree, stack->buffer, stack,
 							   insertdata, updateblkno, ptp_workspace);
 
-		MarkBufferDirty(stack->buffer);
-
 		/* An insert to an internal page finishes the split of the child. */
 		if (BufferIsValid(childbuf))
 		{
 			GinPageGetOpaque(childpage)->flags &= ~GIN_INCOMPLETE_SPLIT;
 			MarkBufferDirty(childbuf);
+			if (RelationNeedsWAL(btree->index) && !btree->isBuild)
+				XLogRegisterBuffer(1, childbuf, REGBUF_STANDARD);
 		}
 
 		if (RelationNeedsWAL(btree->index) && !btree->isBuild)
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 344e1c5e6b..2494273085 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -721,9 +721,12 @@ dataExecPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 	/* Apply changes to page */
 	dataPlaceToPageLeafRecompress(buf, leaf);
 
+	MarkBufferDirty(buf);
+
 	/* If needed, register WAL data built by computeLeafRecompressWALData */
 	if (RelationNeedsWAL(btree->index) && !btree->isBuild)
 	{
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterBufData(0, leaf->walinfo, leaf->walinfolen);
 	}
 }
@@ -1155,6 +1158,8 @@ dataExecPlaceToPageInternal(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 	pitem = (PostingItem *) insertdata;
 	GinDataPageAddPostingItem(page, pitem, off);
 
+	MarkBufferDirty(buf);
+
 	if (RelationNeedsWAL(btree->index) && !btree->isBuild)
 	{
 		/*
@@ -1167,6 +1172,7 @@ dataExecPlaceToPageInternal(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		data.offset = off;
 		data.newitem = *pitem;
 
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterBufData(0, (char *) &data,
 							sizeof(ginxlogInsertDataInternal));
 	}
diff --git a/src/backend/access/gin/ginentrypage.c b/src/backend/access/gin/ginentrypage.c
index 5a8c0eb98d..379496735f 100644
--- a/src/backend/access/gin/ginentrypage.c
+++ b/src/backend/access/gin/ginentrypage.c
@@ -571,6 +571,8 @@ entryExecPlaceToPage(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		elog(ERROR, "failed to add item to index page in \"%s\"",
 			 RelationGetRelationName(btree->index));
 
+	MarkBufferDirty(buf);
+
 	if (RelationNeedsWAL(btree->index) && !btree->isBuild)
 	{
 		/*
@@ -583,6 +585,7 @@ entryExecPlaceToPage(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		data.isDelete = insertData->isDelete;
 		data.offset = off;
 
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterBufData(0, (char *) &data,
 							offsetof(ginxlogInsertEntry, tuple));
 		XLogRegisterBufData(0, (char *) insertData->entry,
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index eb6c554831..c8fe7c78a7 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -397,6 +397,9 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 		}
 
 		Assert((ptr - collectordata) <= collector->sumsize);
+
+		MarkBufferDirty(buffer);
+
 		if (needWal)
 		{
 			XLogRegisterBuffer(1, buffer, REGBUF_STANDARD);
@@ -404,8 +407,6 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 		}
 
 		metadata->tailFreeSize = PageGetExactFreeSpace(page);
-
-		MarkBufferDirty(buffer);
 	}
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606..7a025f33cf 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -824,11 +824,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
 
 				/*
-				 * bucket buffer needs to be registered to ensure that we can
-				 * acquire a cleanup lock on it during replay.
+				 * bucket buffer was not changed, but still needs to be
+				 * registered to ensure that we can acquire a cleanup lock on
+				 * it during replay.
 				 */
 				if (!xlrec.is_primary_bucket_page)
-					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+				{
+					uint8		flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
+
+					XLogRegisterBuffer(0, bucket_buf, flags);
+				}
 
 				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
 				XLogRegisterBufData(1, (char *) deletable,
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 39bb2cb9f6..9d1ff20b92 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -658,11 +658,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
 
 		/*
-		 * bucket buffer needs to be registered to ensure that we can acquire
-		 * a cleanup lock on it during replay.
+		 * bucket buffer was not changed, but still needs to be registered to
+		 * ensure that we can acquire a cleanup lock on it during replay.
 		 */
 		if (!xlrec.is_prim_bucket_same_wrt)
-			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+		{
+			uint8		flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
+
+			XLogRegisterBuffer(0, bucketbuf, flags);
+		}
 
 		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 		if (xlrec.ntups > 0)
@@ -960,11 +964,16 @@ readpage:
 						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
 
 						/*
-						 * bucket buffer needs to be registered to ensure that
-						 * we can acquire a cleanup lock on it during replay.
+						 * bucket buffer was not changed, but still needs to
+						 * be registered to ensure that we can acquire a
+						 * cleanup lock on it during replay.
 						 */
 						if (!xlrec.is_prim_bucket_same_wrt)
-							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+						{
+							int			flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
+
+							XLogRegisterBuffer(0, bucket_buf, flags);
+						}
 
 						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 						XLogRegisterBufData(1, (char *) itup_offsets,
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 588626424e..e4aaa551a0 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -248,6 +248,20 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
 	Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
 	Assert(begininsert_called);
 
+	/*
+	 * Ordinarily, buffer should be exclusive-locked and marked dirty before
+	 * we get here, otherwise we could end up violating one of the rules in
+	 * access/transam/README.
+	 *
+	 * Some callers intentionally register a clean page and never update that
+	 * page's LSN; in that case they can pass the flag REGBUF_NO_CHANGE to
+	 * bypass these checks.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	if (!(flags & REGBUF_NO_CHANGE))
+		Assert(BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer));
+#endif
+
 	if (block_id >= max_registered_block_id)
 	{
 		if (block_id >= max_registered_buffers)
@@ -1313,8 +1327,8 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 		START_CRIT_SECTION();
 		for (i = 0; i < nbufs; i++)
 		{
-			XLogRegisterBuffer(i, bufpack[i], flags);
 			MarkBufferDirty(bufpack[i]);
+			XLogRegisterBuffer(i, bufpack[i], flags);
 		}
 
 		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8b96759b84..21a29f9081 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2098,6 +2098,65 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	return first_block;
 }
 
+/*
+ * BufferIsExclusiveLocked
+ *
+ *      Checks if buffer is exclusive-locked.
+ *
+ * Buffer must be pinned.
+ */
+bool
+BufferIsExclusiveLocked(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	if (BufferIsLocal(buffer))
+	{
+		int			bufid = -buffer - 1;
+
+		bufHdr = GetLocalBufferDescriptor(bufid);
+	}
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+	}
+
+	Assert(BufferIsPinned(buffer));
+	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE);
+}
+
+/*
+ * BufferIsDirty
+ *
+ *		Checks if buffer is already dirty.
+ *
+ * Buffer must be pinned and exclusive-locked.  (Without an exclusive lock,
+ * the result may be stale before it's returned.)
+ */
+bool
+BufferIsDirty(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	if (BufferIsLocal(buffer))
+	{
+		int			bufid = -buffer - 1;
+
+		bufHdr = GetLocalBufferDescriptor(bufid);
+	}
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+	}
+
+	Assert(BufferIsPinned(buffer));
+	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
+								LW_EXCLUSIVE));
+
+	return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
+}
+
 /*
  * MarkBufferDirty
  *
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 31785dc578..cace867497 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -37,6 +37,7 @@
 									 * will be skipped) */
 #define REGBUF_KEEP_DATA	0x10	/* include data even if a full-page image
 									 * is taken */
+#define REGBUF_NO_CHANGE	0x20	/* intentionally register clean buffer */
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d89021f918..9241f86861 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -179,6 +179,8 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										bool permanent);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
+extern bool BufferIsExclusiveLocked(Buffer buffer);
+extern bool BufferIsDirty(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern void CheckBufferIsPinnedOnce(Buffer buffer);
-- 
2.34.1

#15Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#14)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Oct 16, 2023 at 7:31 PM Jeff Davis <pgsql@j-davis.com> wrote:

Another option might be to just change the hash indexing code to follow
the correct protocol, locking and calling MarkBufferDirty() in those 3
call sites. Marking the buffer dirty is easy, but making sure that it's
locked might require some refactoring. Robert, would following the
right protocol here affect performance?

Sorry, I'm not sure I understand the question. Are you asking whether
dirtying buffers unnecessarily might be slower than not doing that?

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#15)
Re: Is this a problem in GenericXLogFinish()?

On Tue, 2023-10-17 at 08:41 -0400, Robert Haas wrote:

Sorry, I'm not sure I understand the question. Are you asking whether
dirtying buffers unnecessarily might be slower than not doing that?

I meant: are those cleanup operations frequent enough that dirtying
those buffers in that case would matter?

Regards,
Jeff Davis

#17Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#16)
Re: Is this a problem in GenericXLogFinish()?

On Tue, Oct 17, 2023 at 12:38 PM Jeff Davis <pgsql@j-davis.com> wrote:

I meant: are those cleanup operations frequent enough that dirtying
those buffers in that case would matter?

Honestly, I'm not sure. Probably not? I mean, hashbucketcleanup()
seems to only be called during vacuum or a bucket split, and I don't
think you can have super-frequent calls to _hash_freeovflpage()
either. For what it's worth, though, I think it would be better to
just make these cases exceptions to your Assert, as you did in the
patch, rather than changing them to dirty the buffer. There doesn't
seem to be enough upside to making the assert unconditional to justify
changing stuff that might have a real-world performance cost ... even
if we don't think it would amount to much.

--
Robert Haas
EDB: http://www.enterprisedb.com

#18Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#17)
Re: Is this a problem in GenericXLogFinish()?

On Thu, 2023-10-19 at 16:12 -0400, Robert Haas wrote:

For what it's worth, though, I think it would be better to
just make these cases exceptions to your Assert

OK, I'll probably commit something like v4 then.

I still have a question though: if a buffer is exclusive-locked,
unmodified and clean, and then the caller registers it and later does
PageSetLSN (just as if it were dirty), is that a definite bug?

There are a couple callsites where the control flow is complex enough
that it's hard to be sure the buffer is always marked dirty before
being registered (like in log_heap_visible(), as I mentioned upthread).
But those callsites are all doing PageSetLSN, unlike the hash index
case.

Regards,
Jeff Davis

#19Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#18)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Oct 20, 2023 at 12:28 PM Jeff Davis <pgsql@j-davis.com> wrote:

I still have a question though: if a buffer is exclusive-locked,
unmodified and clean, and then the caller registers it and later does
PageSetLSN (just as if it were dirty), is that a definite bug?

There are a couple callsites where the control flow is complex enough
that it's hard to be sure the buffer is always marked dirty before
being registered (like in log_heap_visible(), as I mentioned upthread).
But those callsites are all doing PageSetLSN, unlike the hash index
case.

I think that would be a bug. I think it would be OK to just change the
page LSN and nothing else, but that requires calling MarkBufferDirty()
at some point. If you only call PageSetLSN() and never
MarkBufferDirty(), that sounds messed up: either the former should be
skipped, or the latter should be added. We shouldn't modify a shared
buffer without marking it dirty.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#19)
Re: Is this a problem in GenericXLogFinish()?

On Mon, 2023-10-23 at 10:14 -0400, Robert Haas wrote:

I think that would be a bug. I think it would be OK to just change
the
page LSN and nothing else, but that requires calling
MarkBufferDirty()
at some point. If you only call PageSetLSN() and never
MarkBufferDirty(), that sounds messed up: either the former should be
skipped, or the latter should be added. We shouldn't modify a shared
buffer without marking it dirty.

In that case, I think REGBUF_NO_CHANGE is the right name for the flag.

Committed.

Regards,
Jeff Davis

#21Alexander Lakhin
exclusion@gmail.com
In reply to: Jeff Davis (#20)
Re: Is this a problem in GenericXLogFinish()?

Hello hackers,

24.10.2023 03:45, Jeff Davis wrote:

Committed.

I've encountered a case with a hash index, when an assert added by the
commit fails:

CREATE TABLE t (i INT);
CREATE INDEX hi ON t USING HASH (i);
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);

BEGIN;
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);
ROLLBACK;

CHECKPOINT;

VACUUM t;

Leads to:
Core was generated by `postgres: law regression [local] VACUUM                                       '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/211944' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139924569388864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f42b99ed476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f42b99d37f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055f128e83f1b in ExceptionalCondition (conditionName=0x55f128f33520 "BufferIsExclusiveLocked(buffer) &&
BufferIsDirty(buffer)", fileName=0x55f128f333a8 "xloginsert.c", lineNumber=262) at assert.c:66
#6  0x000055f1287edce3 in XLogRegisterBuffer (block_id=1 '\001', buffer=1808, flags=8 '\b') at xloginsert.c:262
#7  0x000055f128742833 in _hash_freeovflpage (rel=0x7f42adb95c88, bucketbuf=1808, ovflbuf=1825, wbuf=1808,
itups=0x7ffc3f18c010, itup_offsets=0x7ffc3f18bce0, tups_size=0x7ffc3f18ccd0, nitups=0, bstrategy=0x55f12a562820)
    at hashovfl.c:671
#8  0x000055f128743567 in _hash_squeezebucket (rel=0x7f42adb95c88, bucket=6, bucket_blkno=7, bucket_buf=1808,
bstrategy=0x55f12a562820) at hashovfl.c:1064
#9  0x000055f12873ca2a in hashbucketcleanup (rel=0x7f42adb95c88, cur_bucket=6, bucket_buf=1808, bucket_blkno=7,
bstrategy=0x55f12a562820, maxbucket=7, highmask=15, lowmask=7, tuples_removed=0x7ffc3f18fb28,
    num_index_tuples=0x7ffc3f18fb30, split_cleanup=false, callback=0x55f1289ba1de <vac_tid_reaped>,
callback_state=0x55f12a566778) at hash.c:921
#10 0x000055f12873bfcf in hashbulkdelete (info=0x7ffc3f18fc30, stats=0x0, callback=0x55f1289ba1de <vac_tid_reaped>,
callback_state=0x55f12a566778) at hash.c:549
#11 0x000055f128776fbb in index_bulk_delete (info=0x7ffc3f18fc30, istat=0x0, callback=0x55f1289ba1de <vac_tid_reaped>,
callback_state=0x55f12a566778) at indexam.c:709
#12 0x000055f1289ba03d in vac_bulkdel_one_index (ivinfo=0x7ffc3f18fc30, istat=0x0, dead_items=0x55f12a566778) at
vacuum.c:2480
#13 0x000055f128771286 in lazy_vacuum_one_index (indrel=0x7f42adb95c88, istat=0x0, reltuples=-1, vacrel=0x55f12a4b9c30)
at vacuumlazy.c:2768
#14 0x000055f1287704a3 in lazy_vacuum_all_indexes (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2338
#15 0x000055f128770275 in lazy_vacuum (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2256
...

It looks like the buffer is not dirty in the problematic call.

Best regards,
Alexander

#22Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Lakhin (#21)
Re: Is this a problem in GenericXLogFinish()?

On Thu, 2023-10-26 at 07:00 +0300, Alexander Lakhin wrote:

It looks like the buffer is not dirty in the problematic call.

Thank you for the report! I was able to reproduce and observe that the
buffer is not marked dirty.

The call (hashovfl.c:671):

XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD)

is followed unconditionally by:

PageSetLSN(BufferGetPage(wbuf), recptr)

so if the Assert were not there, it would be setting the LSN on a page
that's not marked dirty. Therefore I think this is a bug, or at least
an interesting/unexpected case.

Perhaps the registration and the PageSetLSN don't need to happen when
nitups==0?

Regards,
Jeff Davis

#23Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#22)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Oct 25, 2023 at 10:06:23PM -0700, Jeff Davis wrote:

Thank you for the report! I was able to reproduce and observe that the
buffer is not marked dirty.

The call (hashovfl.c:671):

XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD)

is followed unconditionally by:

PageSetLSN(BufferGetPage(wbuf), recptr)

so if the Assert were not there, it would be setting the LSN on a page
that's not marked dirty. Therefore I think this is a bug, or at least
an interesting/unexpected case.

Perhaps the registration and the PageSetLSN don't need to happen when
nitups==0?

Hmm. Looking at hash_xlog_squeeze_page(), it looks like wbuf is
expected to always be registered. So, you're right here: it should be
OK to be less aggressive with setting the page LSN on wbuf if ntups is
0, but there's more to it? For example, it is possible that
bucketbuf, prevbuf and wbuf refer to the same buffer, causing
is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true.
Particularly, if nextbuf and wbuf are the same buffers, we finish by
registering twice the same buffer with two different IDs to perform
the tuple insertion and the opaque area updates in two steps. And
that's.. Err.. not really necessary? I mean, as far as I read this
code you could just reuse the buffer registered at ID 1 and update its
opaque area if is_prim_bucket_same_wrt is true?
--
Michael

#24Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#23)
Re: Is this a problem in GenericXLogFinish()?

On Thu, Oct 26, 2023 at 3:31 AM Michael Paquier <michael@paquier.xyz> wrote:

Hmm. Looking at hash_xlog_squeeze_page(), it looks like wbuf is
expected to always be registered. So, you're right here: it should be
OK to be less aggressive with setting the page LSN on wbuf if ntups is
0, but there's more to it? For example, it is possible that
bucketbuf, prevbuf and wbuf refer to the same buffer, causing
is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true.
Particularly, if nextbuf and wbuf are the same buffers, we finish by
registering twice the same buffer with two different IDs to perform
the tuple insertion and the opaque area updates in two steps. And
that's.. Err.. not really necessary? I mean, as far as I read this
code you could just reuse the buffer registered at ID 1 and update its
opaque area if is_prim_bucket_same_wrt is true?

Read the comments for _hash_squeezebucket. Particularly, note this part:

* Try to squeeze the tuples onto pages occurring earlier in the
* bucket chain in an attempt to free overflow pages. When we start
* the "squeezing", the page from which we start taking tuples (the
* "read" page) is the last bucket in the bucket chain and the page
* onto which we start squeezing tuples (the "write" page) is the
* first page in the bucket chain. The read page works backward and
* the write page works forward; the procedure terminates when the
* read page and write page are the same page.

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
the same (your first scenario) but the second scenario you mention
(nextbuf == wbuf) should be impossible.

It seems to me that maybe we shouldn't even be registering wbuf or
doing anything at all to it if there are no tuples that need moving.
That would also require some adjustment of the redo routine.

--
Robert Haas
EDB: http://www.enterprisedb.com

#25Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#24)
Re: Is this a problem in GenericXLogFinish()?

On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
the same (your first scenario) but the second scenario you mention
(nextbuf == wbuf) should be impossible.

Okay..

It seems to me that maybe we shouldn't even be registering wbuf or
doing anything at all to it if there are no tuples that need moving.
That would also require some adjustment of the redo routine.

Hmm. So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if the primary bucket and
the write bucket are the same? I'd like to think that what you say is OK,
still I am not completely sure after reading the lock assumptions in
the hash README or 6d46f4783efe. A simpler thing would be to mark
buffer 1 with REGBUF_NO_CHANGE when the primary and write buffers are
the same if we expect the lock to always be taken, I guess..

I've noticed that the replay paths for XLOG_HASH_MOVE_PAGE_CONTENTS
and XLOG_HASH_SQUEEZE_PAGE are similar with their page handlings (some
copy-pastes?). A MOVE record should never have zero tuples, still the
replay path assumes that this can be possible, so it could be
simplified.
--
Michael

#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#25)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Oct 27, 2023 at 5:45 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
the same (your first scenario) but the second scenario you mention
(nextbuf == wbuf) should be impossible.

Okay..

It seems to me that maybe we shouldn't even be registering wbuf or
doing anything at all to it if there are no tuples that need moving.
That would also require some adjustment of the redo routine.

Hmm. So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if primary bucket and the
write bucket are the same?

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

--
With Regards,
Amit Kapila.

#27Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#26)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
primary and write buffers are the same and there are no tuples to
move. Say with something like the attached?
--
Michael

Attachments:

fix-hash-replay.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..9928068179 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -647,6 +647,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
 		int			i;
+		int			wbuf_flags;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -668,7 +669,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		/*
+		 * If the primary bucket buffer and the write buffer are the same, a
+		 * cleanup lock is still required on the write buffer even if there
+		 * are no tuples to move, so it needs to be registered in this case.
+		 */
+		wbuf_flags = REGBUF_STANDARD;
+		if (xlrec.is_prim_bucket_same_wrt && xlrec.ntups == 0)
+			wbuf_flags |= REGBUF_NO_CHANGE;
+		XLogRegisterBuffer(1, wbuf, wbuf_flags);
 		if (xlrec.ntups > 0)
 		{
 			XLogRegisterBufData(1, (char *) itup_offsets,
#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#27)
Re: Is this a problem in GenericXLogFinish()?

On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
primary and write buffers are the same and there are no tuples to
move. Say with something like the attached?

What if the primary and write buffer are not the same but ntups is
zero? Won't we hit the assertion again in that case?

--
With Regards,
Amit Kapila.

#29Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#26)
Re: Is this a problem in GenericXLogFinish()?

On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if primary bucket and the
write bucket are the same?

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Amit, thanks for weighing in, but I'm not convinced. I thought we only
ever used a cleanup lock on the main bucket page to guard against
concurrent scans. Here you seem to be saying that we need a cleanup
lock on some page that may be an overflow page somewhere in the middle
of the chain, and that doesn't seem right to me.

So ... are you sure? If yes, can you provide any more detailed justification?

--
Robert Haas
EDB: http://www.enterprisedb.com

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#29)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Oct 30, 2023 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if primary bucket and the
write bucket are the same?

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Amit, thanks for weighing in, but I'm not convinced. I thought we only
ever used a cleanup lock on the main bucket page to guard against
concurrent scans.

My understanding is the same. It is possible that my wording is not
clear. Let me try to clarify again, Michael asked: "do we need the
cleanup lock on the write buffer even if there are no tuples, and even
if primary bucket and the write bucket are the same?" My reading of
his question was do we need a cleanup lock even if the primary bucket
and write bucket are the same which means the write bucket will be the
first page in the chain and we need a cleanup lock on it. I think the
second condition (no tuples) seems irrelevant here as whether that is
true or false we need a cleanup lock.

Here you seem to be saying that we need a cleanup

lock on some page that may be an overflow page somewhere in the middle
of the chain, and that doesn't seem right to me.

Sorry, I don't intend to say this.

--
With Regards,
Amit Kapila.

#31Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#30)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Oct 30, 2023 at 11:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

My understanding is the same. It is possible that my wording is not
clear. Let me try to clarify again, Michael asked: "do we need the
cleanup lock on the write buffer even if there are no tuples, and even
if primary bucket and the write bucket are the same?" My reading of
his question was do we need a cleanup lock even if the primary bucket
and write bucket are the same which means the write bucket will be the
first page in the chain and we need a cleanup lock on it. I think the
second condition (no tuples) seems irrelevant here as whether that is
true or false we need a cleanup lock.

Ah, OK, I see now. I missed the fact that, per what Michael wrote, you
were assuming the primary buffer and the write buffer were the same.
In that case I agree, but the whole formulation seems backwards. I
think the clearer way to say this is: we need a cleanup lock on the
primary bucket page no matter what; and we need to write tuples into
the write buffer if there are any tuples to write but not if there
aren't. If the two buffers are the same then the requirements are
added together.
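
That formulation amounts to a small decision table for how the write buffer
gets registered. A minimal standalone C sketch of it (the helper name and
flag values are placeholders for illustration, not PostgreSQL's actual
definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag values for illustration, not PostgreSQL's definitions. */
#define REGBUF_STANDARD   0x01
#define REGBUF_NO_CHANGE  0x02

/*
 * Model of the registration decision for the write buffer (wbuf):
 * returns the flags it should be registered with, or -1 if it does not
 * need to be registered at all.
 */
static int
wbuf_register_flags(int ntups, bool is_prim_bucket_same_wrt,
					bool is_prev_bucket_same_wrt)
{
	if (ntups > 0)
		return REGBUF_STANDARD;		/* tuples are written into wbuf */
	if (is_prev_bucket_same_wrt)
		return REGBUF_STANDARD;		/* wbuf's nextblkno is updated */
	if (is_prim_bucket_same_wrt)
		return REGBUF_STANDARD | REGBUF_NO_CHANGE;	/* cleanup lock only */
	return -1;						/* nothing to log for wbuf */
}
```

In words: when the write buffer doubles as the primary bucket page, the
cleanup-lock requirement keeps it registered even when there is nothing to
write into it.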

--
Robert Haas
EDB: http://www.enterprisedb.com

#32Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#28)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:

On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
primary and write buffers are the same and there are no tuples to
move. Say with something like the attached?

What if the primary and write buffer are not the same but ntups is
zero? Won't we hit the assertion again in that case?

The code suggests that it should be able to handle such a case, so this
would mean setting REGBUF_NO_CHANGE only when we have no tuples to
move:
-        XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+        /*
+         * A cleanup lock is still required on the write buffer even
+         * if there are no tuples to move, so it needs to be registered
+         * in this case.
+         */
+        wbuf_flags = REGBUF_STANDARD;
+        if (xlrec.ntups == 0)
+            wbuf_flags |= REGBUF_NO_CHANGE;
+        XLogRegisterBuffer(1, wbuf, wbuf_flags);

Anyway, there is still a hole here: the regression tests have zero
coverage for the case where there are no tuples to move if
!is_prim_bucket_same_wrt. There are only two queries that stress the
squeeze path with no tuples, both use is_prim_bucket_same_wrt:
INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
VACUUM hash_split_heap;

Perhaps this should have more coverage to make sure that all these
scenarios are covered at replay (likely with 027_stream_regress.pl)?
The case posted by Alexander at [1] falls under the same category
(is_prim_bucket_same_wrt with no tuples).

[1]: /messages/by-id/f045c8f7-ee24-ead6-3679-c04a43d21351@gmail.com
--
Michael

#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#32)
1 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Nov 1, 2023 at 12:54 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:

On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:

Yes, we need it to exclude any concurrent in-progress scans that could
return incorrect tuples during bucket squeeze operation.

Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
primary and write buffers are the same and there are no tuples to
move. Say with something like the attached?

What if the primary and write buffer are not the same but ntups is
zero? Won't we hit the assertion again in that case?

The code suggests that it should be able to handle such a case,

It should be possible when we encounter some page in the middle of the
chain that has all dead items, and the write buffer points to some page
after the primary bucket page and before the read buffer page.

so this
would mean setting REGBUF_NO_CHANGE only when we have no tuples to
move:
-        XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+        /*
+         * A cleanup lock is still required on the write buffer even
+         * if there are no tuples to move, so it needs to be registered
+         * in this case.
+         */
+        wbuf_flags = REGBUF_STANDARD;
+        if (xlrec.ntups == 0)
+            wbuf_flags |= REGBUF_NO_CHANGE;
+        XLogRegisterBuffer(1, wbuf, wbuf_flags);

Why register wbuf at all if there are no tuples to add and it is not
the same as bucketbuf? Also, I think this won't be correct if prevbuf
and wrtbuf are the same and we also have no tuples to add to wbuf. I
have attached a naive and crude way to achieve it. This needs more
work, both in terms of finding a better way to change the code and
ensuring this won't break any existing case. I have just run the
existing tests. Such a fix certainly requires more testing.

--
With Regards,
Amit Kapila.

Attachments:

fix_hash_squeeze_wal_1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e8e06c62a9..2364d3af3f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -655,7 +655,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		 */
 		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
 
-		action = XLogReadBufferForRedo(record, 1, &writebuf);
+		if ((xldata->ntups == 0 && xldata->is_prev_bucket_same_wrt) ||
+			xldata->ntups > 0)
+			action = XLogReadBufferForRedo(record, 1, &writebuf);
+		else
+			action = BLK_NOTFOUND;
 	}
 
 	/* replay the record for adding entries in overflow buffer */
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..2fc7760b17 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -668,9 +668,25 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
-		if (xlrec.ntups > 0)
+		/*
+		 * A write buffer needs to be registered even if no tuples are added to
+		 * it to ensure that we can acquire a cleanup lock on it if it is the
+		 * same as primary bucket buffer or update the nextblkno if it is same
+		 * as the previous bucket buffer.
+		 */
+		if (xlrec.ntups == 0 &&
+			(xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt))
+		{
+			bool		wbuf_flags;
+
+			wbuf_flags = REGBUF_STANDARD;
+			if (!xlrec.is_prev_bucket_same_wrt)
+				wbuf_flags |= REGBUF_NO_CHANGE;
+			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+		}
+		else if (xlrec.ntups > 0)
 		{
+			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
 			for (i = 0; i < nitups; i++)
#34Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#33)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Amit, Michael,

Thanks for making the patch!

Why register wbuf at all if there are no tuples to add and it is not
the same as bucketbuf? Also, I think this won't be correct if prevbuf
and wrtbuf are the same and we also have no tuples to add to wbuf. I
have attached a naive and crude way to achieve it. This needs more
work, both in terms of finding a better way to change the code and
ensuring this won't break any existing case. I have just run the
existing tests. Such a fix certainly requires more testing.

I'm verifying the idea (currently it seems OK), but at least there is a coding error -
wbuf_flags should be uint8 here. PSA the fixed patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

fix_hash_squeeze_wal_2.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e8e06c62a9..2364d3af3f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -655,7 +655,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		 */
 		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
 
-		action = XLogReadBufferForRedo(record, 1, &writebuf);
+		if ((xldata->ntups == 0 && xldata->is_prev_bucket_same_wrt) ||
+			xldata->ntups > 0)
+			action = XLogReadBufferForRedo(record, 1, &writebuf);
+		else
+			action = BLK_NOTFOUND;
 	}
 
 	/* replay the record for adding entries in overflow buffer */
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..2a240f01d7 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -668,9 +668,25 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
-		if (xlrec.ntups > 0)
+		/*
+		 * A write buffer needs to be registered even if no tuples are added to
+		 * it to ensure that we can acquire a cleanup lock on it if it is the
+		 * same as primary bucket buffer or update the nextblkno if it is same
+		 * as the previous bucket buffer.
+		 */
+		if (xlrec.ntups == 0 &&
+			(xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt))
+		{
+			uint8		wbuf_flags;
+
+			wbuf_flags = REGBUF_STANDARD;
+			if (!xlrec.is_prev_bucket_same_wrt)
+				wbuf_flags |= REGBUF_NO_CHANGE;
+			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+		}
+		else if (xlrec.ntups > 0)
 		{
+			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
 			for (i = 0; i < nitups; i++)
#35Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#34)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear hackers,

Dear Amit, Michael,

Thanks for making the patch!

Why register wbuf at all if there are no tuples to add and it is not
the same as bucketbuf? Also, I think this won't be correct if prevbuf
and wrtbuf are the same and we also have no tuples to add to wbuf. I
have attached a naive and crude way to achieve it. This needs more
work, both in terms of finding a better way to change the code and
ensuring this won't break any existing case. I have just run the
existing tests. Such a fix certainly requires more testing.

I'm verifying the idea (currently it seems OK), but at least there is a coding error -
wbuf_flags should be uint8 here. PSA the fixed patch.

Here is a new patch which is a bit refactored. It does the following:

* If-conditions in _hash_freeovflpage() were swapped.
* Based on the above, an Assert(xlrec.ntups == 0) was added.
* A condition in hash_xlog_squeeze_page() was adjusted to follow the change as well.
* Comments were adjusted.

Next we should add some test code. I will continue considering it, but please
post anything if you have an idea.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

fix_hash_squeeze_wal_3.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e8e06c62a9..87a18126e3 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -655,10 +655,16 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		 */
 		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
 
-		action = XLogReadBufferForRedo(record, 1, &writebuf);
+		if (xldata->ntups > 0 || xldata->is_prev_bucket_same_wrt)
+			action = XLogReadBufferForRedo(record, 1, &writebuf);
+		else
+			action = BLK_NOTFOUND;
 	}
 
-	/* replay the record for adding entries in overflow buffer */
+	/*
+	 * replay the record for adding entries in the overflow buffer and update
+	 * the nextblkno, if required
+	 */
 	if (action == BLK_NEEDS_REDO)
 	{
 		Page		writepage;
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..1e6289562a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -646,7 +646,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
-		int			i;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -668,14 +667,31 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 		if (xlrec.ntups > 0)
 		{
+			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
-			for (i = 0; i < nitups; i++)
+			for (int i = 0; i < nitups; i++)
 				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
 		}
+		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+		{
+			uint8		wbuf_flags;
+
+			Assert(xlrec.ntups == 0);
+
+			/*
+			 * A write buffer needs to be registered even if no tuples are added to
+			 * it to ensure that we can acquire a cleanup lock on it if it is the
+			 * same as primary bucket buffer or update the nextblkno if it is same
+			 * as the previous bucket buffer.
+			 */
+			wbuf_flags = REGBUF_STANDARD;
+			if (!xlrec.is_prev_bucket_same_wrt)
+				wbuf_flags |= REGBUF_NO_CHANGE;
+			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+		}
 
 		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
 
#36Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#35)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear hackers,

Next we should add some test code. I will continue considering it, but please
post anything if you have an idea.

And I did, PSA the patch. This patch adds two parts to hash_index.sql.

In the first part, the primary bucket page is filled with live tuples and some
overflow pages with dead ones. This leads to the removal of overflow pages
without moving tuples. HEAD would crash if this were executed, as already
reported on -hackers.

The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
is_prev_bucket_same_wrt == true), which has never been exercised before.

Also, I measured the execution time before/after patching. Below is the median
of 9 measurements.

BEFORE -> AFTER
647 ms -> 710 ms

This means that the execution time increased by about 10% ((710 - 647) / 647
is roughly 9.7%), which should not be a significant impact.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

fix_hash_squeeze_wal_4.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e8e06c62a9..40debf4028 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -655,7 +655,10 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		 */
 		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
 
-		action = XLogReadBufferForRedo(record, 1, &writebuf);
+		if (xldata->ntups > 0 || xldata->is_prev_bucket_same_wrt)
+			action = XLogReadBufferForRedo(record, 1, &writebuf);
+		else
+			action = BLK_NOTFOUND;
 	}
 
 	/* replay the record for adding entries in overflow buffer */
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..40647a7cb3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -646,7 +646,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
-		int			i;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -668,14 +667,31 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 		if (xlrec.ntups > 0)
 		{
+			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
-			for (i = 0; i < nitups; i++)
+			for (int i = 0; i < nitups; i++)
 				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
 		}
+		/*
+		 * A write buffer needs to be registered even if no tuples are added to
+		 * it to ensure that we can acquire a cleanup lock on it if it is the
+		 * same as primary bucket buffer or update the nextblkno if it is same
+		 * as the previous bucket buffer.
+		 */
+		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+		{
+			uint8		wbuf_flags;
+
+			Assert(xlrec.ntups == 0);
+
+			wbuf_flags = REGBUF_STANDARD;
+			if (!xlrec.is_prev_bucket_same_wrt)
+				wbuf_flags |= REGBUF_NO_CHANGE;
+			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+		}
 
 		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
 
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index a2036a1597..bff84c16f9 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -271,6 +271,31 @@ ALTER INDEX hash_split_index SET (fillfactor = 10);
 REINDEX INDEX hash_split_index;
 -- Clean up.
 DROP TABLE hash_split_heap;
+-- Testcases for removing overflow pages.
+CREATE TABLE hash_cleanup_heap(keycol INT);
+CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);
+-- Insert many tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+ABORT;
+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+TRUNCATE hash_cleanup_heap;
+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ABORT;
+-- And insert some tuples again. Only intermediate buckets are dead.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+-- Clean up.
+DROP TABLE hash_cleanup_heap;
 -- Index on temp table.
 CREATE TEMP TABLE hash_temp_heap (x int, y int);
 INSERT INTO hash_temp_heap VALUES (1,1);
diff --git a/src/test/regress/sql/hash_index.sql b/src/test/regress/sql/hash_index.sql
index 527024f710..165dc18f0f 100644
--- a/src/test/regress/sql/hash_index.sql
+++ b/src/test/regress/sql/hash_index.sql
@@ -247,6 +247,37 @@ REINDEX INDEX hash_split_index;
 -- Clean up.
 DROP TABLE hash_split_heap;
 
+-- Testcases for removing overflow pages.
+CREATE TABLE hash_cleanup_heap(keycol INT);
+CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);
+
+-- Insert many tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+ABORT;
+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+
+TRUNCATE hash_cleanup_heap;
+
+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ABORT;
+-- And insert some tuples again. Only intermediate buckets are dead.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+
+-- Clean up.
+DROP TABLE hash_cleanup_heap;
+
 -- Index on temp table.
 CREATE TEMP TABLE hash_temp_heap (x int, y int);
 INSERT INTO hash_temp_heap VALUES (1,1);
#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#36)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Nov 10, 2023 at 9:17 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Next we should add some test code. I will continue considering it, but please
post anything if you have an idea.

And I did, PSA the patch. This patch adds two parts to hash_index.sql.

In the first part, the primary bucket page is filled with live tuples and some
overflow pages with dead ones. This leads to the removal of overflow pages
without moving tuples. HEAD would crash if this were executed, as already
reported on -hackers.

+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;

Why do we need CHECKPOINT in the above test?

The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
is_prev_bucket_same_wrt == true), which has never been exercised before.

+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;

What do you mean by the second part of the sentence (the primary
bucket page will not satisfy)?

Few other minor comments:
=======================
1. Can we use ROLLBACK instead of ABORT in the tests?

2.
- for (i = 0; i < nitups; i++)
+ for (int i = 0; i < nitups; i++)

I don't think there is a need to make this unrelated change.

3.
+ /*
+ * A write buffer needs to be registered even if no tuples are added to
+ * it to ensure that we can acquire a cleanup lock on it if it is the
+ * same as primary bucket buffer or update the nextblkno if it is same
+ * as the previous bucket buffer.
+ */
+ else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+ {
+ uint8 wbuf_flags;
+
+ Assert(xlrec.ntups == 0);

Can we move this comment inside else if, just before Assert?

--
With Regards,
Amit Kapila.

#38Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#37)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Amit,

Thanks for reviewing! PSA new version patch.

Next we should add some test code. I will continue considering it, but please
post anything if you have an idea.

And I did, PSA the patch. This patch adds two parts to hash_index.sql.

In the first part, the primary bucket page is filled with live tuples and some
overflow pages with dead ones. This leads to the removal of overflow pages
without moving tuples. HEAD would crash if this were executed, as already
reported on -hackers.

+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;

Why do we need CHECKPOINT in the above test?

The CHECKPOINT command is needed to write the hash pages to disk. IIUC, if the
command is omitted, none of the tuples are recorded to the hash index *file*,
so the squeeze would not occur.

You can test this by 1) reverting the changes for hashovfl.c and then 2) removing
the CHECKPOINT command. The server crash would not occur.

The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
is_prev_bucket_same_wrt == true), which has never been exercised before.

+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;

What do you mean by the second part of the sentence (the primary
bucket page will not satisfy)?

I meant to say that the primary bucket page can still hold more tuples. Maybe I
should say "will not be full". Reworded.

Few other minor comments:
=======================
1. Can we use ROLLBACK instead of ABORT in the tests?

Changed. IIRC they have the same meaning, but it seems that most of the sql
files use "ROLLBACK".

2.
- for (i = 0; i < nitups; i++)
+ for (int i = 0; i < nitups; i++)

I don't think there is a need to make this unrelated change.

Reverted. I thought that the variable i would be used only when nitups > 0 so
its scope could be narrowed, but that change was unrelated here.

3.
+ /*
+ * A write buffer needs to be registered even if no tuples are added to
+ * it to ensure that we can acquire a cleanup lock on it if it is the
+ * same as primary bucket buffer or update the nextblkno if it is same
+ * as the previous bucket buffer.
+ */
+ else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+ {
+ uint8 wbuf_flags;
+
+ Assert(xlrec.ntups == 0);

Can we move this comment inside else if, just before Assert?

Moved.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

fix_hash_squeeze_wal_5.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e8e06c62a9..40debf4028 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -655,7 +655,10 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		 */
 		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
 
-		action = XLogReadBufferForRedo(record, 1, &writebuf);
+		if (xldata->ntups > 0 || xldata->is_prev_bucket_same_wrt)
+			action = XLogReadBufferForRedo(record, 1, &writebuf);
+		else
+			action = BLK_NOTFOUND;
 	}
 
 	/* replay the record for adding entries in overflow buffer */
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..b869313d3b 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -668,14 +668,31 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 		if (xlrec.ntups > 0)
 		{
+			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
 			for (i = 0; i < nitups; i++)
 				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
 		}
+		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+		{
+			uint8		wbuf_flags;
+
+			/*
+			 * A write buffer needs to be registered even if no tuples are added to
+			 * it to ensure that we can acquire a cleanup lock on it if it is the
+			 * same as primary bucket buffer or update the nextblkno if it is same
+			 * as the previous bucket buffer.
+			 */
+			Assert(xlrec.ntups == 0);
+
+			wbuf_flags = REGBUF_STANDARD;
+			if (!xlrec.is_prev_bucket_same_wrt)
+				wbuf_flags |= REGBUF_NO_CHANGE;
+			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+		}
 
 		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
 
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index a2036a1597..ecb78eecee 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -271,6 +271,31 @@ ALTER INDEX hash_split_index SET (fillfactor = 10);
 REINDEX INDEX hash_split_index;
 -- Clean up.
 DROP TABLE hash_split_heap;
+-- Testcases for removing overflow pages.
+CREATE TABLE hash_cleanup_heap(keycol INT);
+CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);
+-- Insert many tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+ROLLBACK;
+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+TRUNCATE hash_cleanup_heap;
+-- Insert few tuples, the primary bucket page will not be full.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ROLLBACK;
+-- And insert some tuples again. Only intermediate buckets are dead.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+-- Clean up.
+DROP TABLE hash_cleanup_heap;
 -- Index on temp table.
 CREATE TEMP TABLE hash_temp_heap (x int, y int);
 INSERT INTO hash_temp_heap VALUES (1,1);
diff --git a/src/test/regress/sql/hash_index.sql b/src/test/regress/sql/hash_index.sql
index 527024f710..28ba413def 100644
--- a/src/test/regress/sql/hash_index.sql
+++ b/src/test/regress/sql/hash_index.sql
@@ -247,6 +247,37 @@ REINDEX INDEX hash_split_index;
 -- Clean up.
 DROP TABLE hash_split_heap;
 
+-- Testcases for removing overflow pages.
+CREATE TABLE hash_cleanup_heap(keycol INT);
+CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);
+
+-- Insert many tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+ROLLBACK;
+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+
+TRUNCATE hash_cleanup_heap;
+
+-- Insert few tuples, the primary bucket page will not be full.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ROLLBACK;
+-- And insert some tuples again. Only intermediate buckets are dead.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+
+-- Clean up.
+DROP TABLE hash_cleanup_heap;
+
 -- Index on temp table.
 CREATE TEMP TABLE hash_temp_heap (x int, y int);
 INSERT INTO hash_temp_heap VALUES (1,1);
#39Robert Haas
robertmhaas@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#38)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:

+           /*
+            * A write buffer needs to be registered even if no tuples are
+            * added to it to ensure that we can acquire a cleanup lock on it
+            * if it is the same as primary bucket buffer or update the
+            * nextblkno if it is same as the previous bucket buffer.
+            */

But surely if the buffer is the same as one of those others, then it's
registered anyway, and if it isn't, then it doesn't need to be.

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#39)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Nov 13, 2023 at 10:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:

+           /*
+            * A write buffer needs to be registered even if no tuples are
+            * added to it to ensure that we can acquire a cleanup lock on it
+            * if it is the same as primary bucket buffer or update the
+            * nextblkno if it is same as the previous bucket buffer.
+            */

But surely if the buffer is the same as one of those others, then it's
registered anyway,

I don't think for others it's registered. For example, consider the
case when prevpage and the writepage are the same (aka
xlrec.is_prev_bucket_same_wrt is true), it won't be registered in
another code path (see comment [1]).

and if it isn't, then it doesn't need to be.

In the previous example, we need it to update the nextblockno during replay.

I am not sure if I understand the scenario you are worried about, so
if my response doesn't address your concern, can you please explain a
bit more about the scenario you have in mind?

[1]:
/*
* If prevpage and the writepage (block in which we are moving tuples
* from overflow) are same, then no need to separately register
* prevpage. During replay, we can directly update the nextblock in
* writepage.
*/

--
With Regards,
Amit Kapila.

#41Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#39)
Re: Is this a problem in GenericXLogFinish()?

Hello,

13.11.2023 20:21, Robert Haas wrote:

On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:

I've discovered that that patch introduced a code path leading to an
uninitialized memory access.
With the following addition to hash_index.sql:
  -- Fill overflow pages by "dead" tuples.
  BEGIN;
  INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
  ROLLBACK;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;

make check -C src/test/recovery/ PROVE_TESTS="t/027*"
when executed under Valgrind, triggers:
==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
==00:00:02:30.285 97744==
(in 027_stream_regress_standby_1.log)

That is, when line
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
is reached, writebuf stays uninitialized.

Best regards,
Alexander

#42Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#41)
RE: Is this a problem in GenericXLogFinish()?

Dear Alexander,

I've discovered that that patch introduced a code path leading to an
uninitialized memory access.
With the following addition to hash_index.sql:
-- Fill overflow pages by "dead" tuples.
BEGIN;
INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
ROLLBACK;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;

make check -C src/test/recovery/ PROVE_TESTS="t/027*"
when executed under Valgrind, triggers:
==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
==00:00:02:30.285 97744==
(in 027_stream_regress_standby_1.log)

That is, when line
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
is reached, writebuf stays uninitialized.

Good catch, thank you for reporting! I will investigate more about it and post my analysis.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#43Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#42)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Alexander,

Good catch, thank you for reporting! I will investigate more about it and post my
analysis.

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

In the reported case, neither is_prim_bucket_same_wrt nor is_prev_bucket_same_wrt
is set in the WAL record, and ntups is also zero. This means that the wbuf is not
written in the WAL record on the primary side (see [1]).
Also, there are no reasons to read (and lock) other buffer page because we do
not modify it. Based on them, I think that just initializing as InvalidBuffer is sufficient.

I want to add a test case for it as well. I've tested on my env and found that the
proposed tuple counts seem sufficient.
(We must fill the primary bucket, so the initial 500 is needed. Also, we must add
many dead pages to trigger the squeeze operation. The final batch seems OK at a smaller value.)

I compared the execution time before/after patching, and they are not so different
(1077 ms -> 1125 ms).

How do you think?

[1]:
```
else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
{
uint8 wbuf_flags;

/*
* A write buffer needs to be registered even if no tuples are
* added to it to ensure that we can acquire a cleanup lock on it
* if it is the same as primary bucket buffer or update the
* nextblkno if it is same as the previous bucket buffer.
*/
Assert(xlrec.ntups == 0);

wbuf_flags = REGBUF_STANDARD;
if (!xlrec.is_prev_bucket_same_wrt)
wbuf_flags |= REGBUF_NO_CHANGE;
XLogRegisterBuffer(1, wbuf, wbuf_flags);
}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

initialize_writebuf.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 40debf4028..f1e233a817 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -632,7 +632,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
 	Buffer		bucketbuf = InvalidBuffer;
-	Buffer		writebuf;
+	Buffer		writebuf = InvalidBuffer;
 	Buffer		ovflbuf;
 	Buffer		prevbuf = InvalidBuffer;
 	Buffer		mapbuf;
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index 0df348b5dd..416d851644 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -298,6 +298,19 @@ ROLLBACK;
 INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
 CHECKPOINT;
 VACUUM hash_cleanup_heap;
+TRUNCATE hash_cleanup_heap;
+-- Insert tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ROLLBACK;
+-- And insert some tuples again. During squeeze operation, these will be moved
+-- to another overflow pages. Also, other overflow pages filled by dead tuples
+-- would be free'd.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
 -- Clean up.
 DROP TABLE hash_cleanup_heap;
 -- Index on temp table.
diff --git a/src/test/regress/sql/hash_index.sql b/src/test/regress/sql/hash_index.sql
index 943bd0ecf1..94861c296c 100644
--- a/src/test/regress/sql/hash_index.sql
+++ b/src/test/regress/sql/hash_index.sql
@@ -284,6 +284,22 @@ INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
 CHECKPOINT;
 VACUUM hash_cleanup_heap;
 
+TRUNCATE hash_cleanup_heap;
+
+-- Insert tuples to both the primary bucket page and overflow pages.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
+-- Fill overflow pages by "dead" tuples.
+BEGIN;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1500) as i;
+ROLLBACK;
+-- And insert some tuples again. During squeeze operation, these will be moved
+-- to another overflow pages. Also, other overflow pages filled by dead tuples
+-- would be free'd.
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
+
+CHECKPOINT;
+VACUUM hash_cleanup_heap;
+
 -- Clean up.
 DROP TABLE hash_cleanup_heap;
 
#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#43)
Re: Is this a problem in GenericXLogFinish()?

On Thu, Nov 30, 2023 at 12:58 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Good catch, thank you for reporting! I will investigate more about it and post my
analysis.

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

In the reported case, neither is_prim_bucket_same_wrt nor is_prev_bucket_same_wrt
is set in the WAL record, and ntups is also zero. This means that the wbuf is not
written in the WAL record on primary side (see [1]).
Also, there are no reasons to read (and lock) other buffer page because we do
not modify it. Based on them, I think that just initializing as InvalidBuffer is sufficient.

Agreed.

I want to add a test case for it as well. I've tested on my env and found that the
proposed tuple counts seem sufficient.
(We must fill the primary bucket, so the initial 500 is needed. Also, we must add
many dead pages to trigger the squeeze operation. The final batch seems OK at a smaller value.)

I compared the execution time before/after patching, and they are not so different
(1077 ms -> 1125 ms).

In my environment, it increased from 375ms to 385ms (median of five
runs). I think it is acceptable especially if it increases code
coverage. Can you once check that?

--
With Regards,
Amit Kapila.

#45Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#43)
Re: Is this a problem in GenericXLogFinish()?

Hello Kuroda-san,

30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

I agree with the change. It aligns hash_xlog_squeeze_page() with
hash_xlog_move_page_contents() in regard to initialization of writebuf.
Interestingly enough, the discrepancy has existed since these functions
appeared with c11453ce0, and I can't see what justified adding the
initialization inside hash_xlog_move_page_contents() only.
So I think that if this doesn't affect performance, having aligned coding
(writebuf initialized in both functions) is better.

I want to add a test case for it as well. I've tested on my env and found that the
proposed tuple counts seem sufficient.
(We must fill the primary bucket, so the initial 500 is needed. Also, we must add
many dead pages to trigger the squeeze operation. The final batch seems OK at a smaller value.)

I compared the execution time before/after patching, and they are not so different
(1077 ms -> 1125 ms).

How do you think?

I can confirm that the test case proposed demonstrates what is expected
and the duration increase is tolerable, as to me.

Best regards,
Alexander

#46Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#44)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Amit,

I think it is acceptable especially if it increases code
coverage. Can you once check that?

PSA the screenshot. I did "PROVE_TESTS="t/027*" make check" in src/test/recovery and
generated a coverage report.
Line 661 was not hit on HEAD, but the screenshot shows that it is now executed.

[1]: https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

coverage.png (image/png)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#45)
Re: Is this a problem in GenericXLogFinish()?

On Thu, Nov 30, 2023 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

I agree with the change.

Pushed!

--
With Regards,
Amit Kapila.

#48Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#47)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Dec 01, 2023 at 03:27:33PM +0530, Amit Kapila wrote:

Pushed!

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?
--
Michael

#49Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#48)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Feb 2, 2024 at 12:40 PM Michael Paquier <michael@paquier.xyz> wrote:

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?

Right, I see the problem. I'll look into it further.

--
With Regards,
Amit Kapila.

#50Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Michael Paquier (#48)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Michael, Amit,

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?

Thanks for pointing that out; I agree with you. PSA a patch for diagnosing the
issue.

This patch avoids the inconsistency due to the LSN setting and outputs a debug
LOG when we meet such a case. I executed hash_index.sql and confirmed the log was
output [1]. This means that the current test already has a workload which meets the below
conditions:

- the overflow page has no tuples (xlrec.ntups is 0),
- to-be-written page - wbuf - is not the primary (xlrec.is_prim_bucket_same_wrt
is false), and
- to-be-written buffer is not next to the overflow page
(xlrec.is_prev_bucket_same_wrt is false)

So, I think my patch (after removing the elog(...) part) can fix the issue. Thoughts?

[1]:
```
LOG: XXX: is_wbuf_registered: false
CONTEXT: while vacuuming index "hash_cleanup_index" of relation "public.hash_cleanup_heap"
STATEMENT: VACUUM hash_cleanup_heap;
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

avoid_registration.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b512d0bcb6..7a4d28e8a6 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -647,6 +647,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
 		int			i;
+		bool		wbuf_registered = false;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -671,6 +672,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		if (xlrec.ntups > 0)
 		{
 			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+
+			/* Track the registration status for later use */
+			wbuf_registered = true;
+
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
 			for (i = 0; i < nitups; i++)
@@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			if (!xlrec.is_prev_bucket_same_wrt)
 				wbuf_flags |= REGBUF_NO_CHANGE;
 			XLogRegisterBuffer(1, wbuf, wbuf_flags);
+
+			/* Track the registration status for later use */
+			wbuf_registered = true;
 		}
 
 		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
@@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
 
-		PageSetLSN(BufferGetPage(wbuf), recptr);
+		/* Set LSN to wbuf page buffer only when it is being registered */
+		if (wbuf_registered)
+			PageSetLSN(BufferGetPage(wbuf), recptr);
+		else
+			elog(LOG, "XXX: is_wbuf_registered: %s", wbuf_registered ? "true" : "false");
+
 		PageSetLSN(BufferGetPage(ovflbuf), recptr);
 
 		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#50)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Feb 5, 2024 at 10:00 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?

Thanks for pointing that out; I agree with you. PSA a patch for diagnosing the
issue.

@@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf,
Buffer ovflbuf,
  if (!xlrec.is_prev_bucket_same_wrt)
  wbuf_flags |= REGBUF_NO_CHANGE;
  XLogRegisterBuffer(1, wbuf, wbuf_flags);
+
+ /* Track the registration status for later use */
+ wbuf_registered = true;
  }

XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
@@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer
bucketbuf, Buffer ovflbuf,

recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

- PageSetLSN(BufferGetPage(wbuf), recptr);
+ /* Set LSN to wbuf page buffer only when it is being registered */
+ if (wbuf_registered)
+ PageSetLSN(BufferGetPage(wbuf), recptr);

Why set LSN when the page is not modified (say when we use the flag
REGBUF_NO_CHANGE)? I think you need to use a flag mod_wbuf and set it
in appropriate places during register buffer calls.

--
With Regards,
Amit Kapila.

#52Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#51)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Amit,

@@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf,
Buffer ovflbuf,
if (!xlrec.is_prev_bucket_same_wrt)
wbuf_flags |= REGBUF_NO_CHANGE;
XLogRegisterBuffer(1, wbuf, wbuf_flags);
+
+ /* Track the registration status for later use */
+ wbuf_registered = true;
}

XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
@@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer
bucketbuf, Buffer ovflbuf,

recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

- PageSetLSN(BufferGetPage(wbuf), recptr);
+ /* Set LSN to wbuf page buffer only when it is being registered */
+ if (wbuf_registered)
+ PageSetLSN(BufferGetPage(wbuf), recptr);

Why set LSN when the page is not modified (say when we use the flag
REGBUF_NO_CHANGE)? I think you need to use a flag mod_wbuf and set it
in appropriate places during register buffer calls.

You are right. Based on the previous discussions, PageSetLSN() must be called
after MarkBufferDirty(). REGBUF_NO_CHANGE was introduced to allow skipping
these requirements. Definitely, no_change buffers must not be PageSetLSN()'d.
Other pages, e.g., metabuf, already follow this rule.

I updated the patch based on the requirement.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

v2_avoid_registration.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b512d0bcb6..d9c3472123 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -647,6 +647,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
 		int			i;
+		bool		mod_wbuf = false;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -671,6 +672,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		if (xlrec.ntups > 0)
 		{
 			XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+
+			/* Track the registration status for later use */
+			mod_wbuf = true;
+
 			XLogRegisterBufData(1, (char *) itup_offsets,
 								nitups * sizeof(OffsetNumber));
 			for (i = 0; i < nitups; i++)
@@ -691,6 +696,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			wbuf_flags = REGBUF_STANDARD;
 			if (!xlrec.is_prev_bucket_same_wrt)
 				wbuf_flags |= REGBUF_NO_CHANGE;
+			else
+			{
+				/* Track the registration status for later use */
+				mod_wbuf = true;
+			}
 			XLogRegisterBuffer(1, wbuf, wbuf_flags);
 		}
 
@@ -719,7 +729,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
 
-		PageSetLSN(BufferGetPage(wbuf), recptr);
+		/* Set LSN to wbuf page buffer only when it is being modified */
+		if (mod_wbuf)
+			PageSetLSN(BufferGetPage(wbuf), recptr);
+
 		PageSetLSN(BufferGetPage(ovflbuf), recptr);
 
 		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
#53Michael Paquier
michael@paquier.xyz
In reply to: Hayato Kuroda (Fujitsu) (#52)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Feb 05, 2024 at 07:57:09AM +0000, Hayato Kuroda (Fujitsu) wrote:

You are right. Based on the previous discussions, PageSetLSN() must be called
after MarkBufferDirty(). REGBUF_NO_CHANGE was introduced to allow skipping
these requirements. Definitely, no_change buffers must not be PageSetLSN()'d.
Other pages, e.g., metabuf, already follow this rule.

At quick glance, this v2 seems kind of right to me: you are setting
the page LSN only when the page is registered in the record and
actually dirtied.
--
Michael

#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#53)
Re: Is this a problem in GenericXLogFinish()?

On Mon, Feb 5, 2024 at 1:33 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Feb 05, 2024 at 07:57:09AM +0000, Hayato Kuroda (Fujitsu) wrote:

You are right. Based on the previous discussions, PageSetLSN() must be called
after MarkBufferDirty(). REGBUF_NO_CHANGE was introduced to allow skipping
these requirements. Definitely, no_change buffers must not be PageSetLSN()'d.
Other pages, e.g., metabuf, already follow this rule.

At quick glance, this v2 seems kind of right to me: you are setting
the page LSN only when the page is registered in the record and
actually dirtied.

Thanks for the report and looking into it. Pushed!

--
With Regards,
Amit Kapila.

#55Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#54)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:

Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!
--
Michael

#56Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#55)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:

On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:

Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!

I have been pinged about this thread and I should have looked a bit
closer, but wait a minute here.

There is still some divergence between the code path of
_hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
squeezing a page: we do not set the LSN of the write buffer if
(xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
!xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
but at replay we call PageSetLSN() on the write buffer and mark it
dirty in this case. Isn't that incorrect? It seems to me that we
should be able to make the conditional depending on the state of the
xl_hash_squeeze_page record, no?
--
Michael

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#56)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Apr 5, 2024 at 9:56 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:

On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:

Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!

I have been pinged about this thread and I should have looked a bit
closer, but wait a minute here.

There is still some divergence between the code path of
_hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
squeezing a page: we do not set the LSN of the write buffer if
(xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
!xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
but at replay we call PageSetLSN() on the write buffer and mark it
dirty in this case. Isn't that incorrect?

Agreed. We will try to reproduce this.

It seems to me that we
should be able to make the conditional depending on the state of the
xl_hash_squeeze_page record, no?

I think we can have a flag like mod_buf and set it in both
conditions: if (xldata->ntups > 0) and if
(xldata->is_prev_bucket_same_wrt). If the flag is set, then we can set
the LSN and mark the buffer dirty.
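A minimal sketch of that flag-based approach (again with hypothetical stand-ins for the record fields, not the actual hash_xlog.c code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the xl_hash_squeeze_page fields used here. */
struct squeeze_replay
{
	int		ntups;
	bool	is_prev_bucket_same_wrt;
};

/*
 * Mirror of the proposed replay logic: the write buffer is dirtied only
 * when tuples are moved into it, or when its next-block link is rewritten.
 */
bool
replay_modifies_wbuf(const struct squeeze_replay *rec)
{
	bool	mod_wbuf = false;

	if (rec->ntups > 0)
		mod_wbuf = true;	/* tuples inserted into the write page */
	if (rec->is_prev_bucket_same_wrt)
		mod_wbuf = true;	/* hasho_nextblkno updated */
	return mod_wbuf;
}
```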

--
With Regards,
Amit Kapila.

#58Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Michael Paquier (#56)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Michael,

There is still some divergence between the code path of
_hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
squeezing a page: we do not set the LSN of the write buffer if
(xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
!xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
but at replay we call PageSetLSN() on the write buffer and mark it
dirty in this case. Isn't that incorrect? It seems to me that we
should be able to make the conditional depending on the state of the
xl_hash_squeeze_page record, no?

Thanks for pointing this out. Yes, we fixed this behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion in [1].

The part below contains my analysis and how I reproduced the issue.
I am posting it to clarify the issue for others.

=====

## Pointed issue

Assume the case where
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.
This means that there are several overflow pages and the tail dead page is being removed.
In this case, the primary page does not have to be modified.

During normal operation, the removal is done in _hash_freeovflpage().
If the above condition is met, mod_wbuf is not set to true, so PageSetLSN() is skipped.

During recovery, the squeeze and removal are done in hash_xlog_squeeze_page().
In that function, PageSetLSN() is called unconditionally.
Michael pointed out that PageSetLSN() should be avoided in this case as well.

## Analysis

IIUC this is the same issue as the one pointed out in [2].
PageSetLSN() should only be called when the page is really modified.
In the case under discussion, wbuf is not modified (we are just removing the tail entry), so there is no need
to assign an LSN to it. However, we missed updating the redo case accordingly.

## How to reproduce

I could reproduce the case below steps.
1. Added the debug log like [3].
2. Constructed a physical replication.
3. Ran hash_index.sql
4. Found the added debug log.

[1]: https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
            writeopaque->hasho_nextblkno = xldata->nextblkno;
        }
+        if (xldata->ntups == 0 &&
+            xldata->is_prim_bucket_same_wrt &&
+            !xldata->is_prev_bucket_same_wrt)
+            elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

add_modbuf_flag.diff (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..6e9bd70c8a 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_buf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,6 +696,8 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 				ninserted++;
 			}
+
+			mod_buf = true;
 		}
 
 		/*
@@ -711,10 +714,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_buf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_buf)
+		{
+			PageSetLSN(writepage, lsn);
+			MarkBufferDirty(writebuf);
+		}
 	}
 
 	/* replay the record for initializing overflow buffer */
#59Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#56)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Apr 05, 2024 at 01:26:29PM +0900, Michael Paquier wrote:

On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:

On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:

Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!

I have been pinged about this thread and I should have looked a bit
closer, but wait a minute here.

I have forgotten to mention that Zubeyr Eryilmaz, a colleague, should
be credited here.
--
Michael

#60Michael Paquier
michael@paquier.xyz
In reply to: Hayato Kuroda (Fujitsu) (#58)
Re: Is this a problem in GenericXLogFinish()?

On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:

Thanks for pointing this out. Yes, we fixed this behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion in [1].

+ bool mod_buf = false;

Perhaps you could name that mod_wbuf, to be consistent with the WAL
insert path.

I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths. Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)? I mean just
before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

I've been thinking about ways to make such cases detectable in an
automated fashion. The best choice would be
verifyBackupPageConsistency(), just after RestoreBlockImage() on the
copy of the block image stored in WAL before applying the page masking
which would mask the LSN. A problem, unfortunately, is that we would
need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
flag so we would be able to track if the block rebuilt from the record
has the *same* LSN as the copy used for the consistency check. So
this edge consistency case would come at a cost, I am afraid, and the
8 bits of bimg_info are precious :/
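To illustrate what is being weighed here: the existing bimg_info bits from xlogrecord.h, plus the extra BKPIMAGE_CLEAN bit, would look roughly like this (BKPIMAGE_CLEAN and its 0x20 value are the hypothetical addition discussed above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits as defined in xlogrecord.h, plus the hypothetical new one. */
#define BKPIMAGE_HAS_HOLE		0x01
#define BKPIMAGE_APPLY			0x02
#define BKPIMAGE_COMPRESS_PGLZ	0x04
#define BKPIMAGE_COMPRESS_LZ4	0x08
#define BKPIMAGE_COMPRESS_ZSTD	0x10
#define BKPIMAGE_CLEAN			0x20	/* hypothetical: clean buffer */

/* Insert-side sketch: fold the REGBUF_NO_CHANGE knowledge into bimg_info. */
uint8_t
build_bimg_info(bool needs_backup, bool no_change)
{
	uint8_t		info = 0;

	if (needs_backup)
		info |= BKPIMAGE_APPLY;
	if (no_change)
		info |= BKPIMAGE_CLEAN;
	return info;
}
```

With five bits already assigned, only three of the eight remain free, which is the cost concern above.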
--
Michael

#61Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Michael Paquier (#60)
1 attachment(s)
RE: Is this a problem in GenericXLogFinish()?

Dear Michael,

On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:

Thanks for pointing this out. Yes, we fixed this behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion in [1].

+ bool mod_buf = false;

Perhaps you could name that mod_wbuf, to be consistent with the WAL
insert path.

Right, fixed.

I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths. Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)? I mean just
before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

Indeed, I added an Assert() in the else part. Was it the same as your expectation?

I've been thinking about ways to make such cases detectable in an
automated fashion. The best choice would be
verifyBackupPageConsistency(), just after RestoreBlockImage() on the
copy of the block image stored in WAL before applying the page masking
which would mask the LSN. A problem, unfortunately, is that we would
need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
flag so we would be able to track if the block rebuilt from the record
has the *same* LSN as the copy used for the consistency check. So
this edge consistency case would come at a cost, I am afraid, and the
8 bits of bimg_info are precious :/

I could not conclude that the change has more benefit than cost, so I did
nothing about it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

v2_add_modbuf_flag.diff (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..a9188a5d2a 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,7 +696,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 				ninserted++;
 			}
+
+			mod_wbuf = true;
 		}
+		else
+			Assert(xldata->is_prim_bucket_same_wrt || xldata->is_prev_bucket_same_wrt);
 
 		/*
 		 * number of tuples inserted must be same as requested in REDO record.
@@ -711,10 +716,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writepage, lsn);
+			MarkBufferDirty(writebuf);
+		}
 	}
 
 	/* replay the record for initializing overflow buffer */
#62Michael Paquier
michael@paquier.xyz
In reply to: Hayato Kuroda (Fujitsu) (#61)
2 attachment(s)
Re: Is this a problem in GenericXLogFinish()?

On Tue, Apr 09, 2024 at 10:25:36AM +0000, Hayato Kuroda (Fujitsu) wrote:

On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths. Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)? I mean just
before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

Indeed, I added an Assert() in the else part. Was it the same as your expectation?

Yep, close enough. Thanks to the insert path we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
which matches the replay path where the assertion is added.

I've been thinking about ways to make such cases detectable in an
automated fashion. The best choice would be
verifyBackupPageConsistency(), just after RestoreBlockImage() on the
copy of the block image stored in WAL before applying the page masking
which would mask the LSN. A problem, unfortunately, is that we would
need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
flag so we would be able to track if the block rebuilt from the record
has the *same* LSN as the copy used for the consistency check. So
this edge consistency case would come at a cost, I am afraid, and the
8 bits of bimg_info are precious :/

I could not conclude that the change has more benefit than cost, so I did
nothing about it.

That does not prevent writing an implementation that can be used for
some test validation in the context of this thread. Using this idea, I
have done the attached 0002. This is not something to merge into the
tree, of course, just a toy to:
- Validate the fix for the problem we know about.
- More broadly, check whether there are other problems in the areas
covered by the main regression test suite.

And from what I can see, applying 0002 without 0001 causes the
following test to fail, as the standby chokes on an inconsistent LSN
with a FATAL because the LSN of the page image coming from the primary
and the LSN saved in the replayed page don't match. Here is a command
to reproduce the problem:
cd src/test/recovery/ && \
PG_TEST_EXTRA=wal_consistency_checking \
PROVE_TESTS=t/027_stream_regress.pl make check

And then in the logs you'd see stuff like:
2024-04-10 16:52:21.844 JST [2437] FATAL: inconsistent page LSN
replayed 0/A7E5CD18 primary 0/A7E56348
2024-04-10 16:52:21.844 JST [2437] CONTEXT: WAL redo at 0/A7E59F68
for Hash/SQUEEZE_PAGE: prevblkno 28, nextblkno 4294967295, ntups 0,
is_primary T; blkref #1: rel 1663/16384/25434, blk 7 FPW; blkref #2:
rel 1663/16384/25434, blk 29 FPW; blkref #3: rel 1663/16384/25434, blk
28 FPW; blkref #5: rel 1663/16384/25434, blk 9 FPW; blkref #6: rel
1663/16384/25434, blk 0 FPW
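The cross-check that fires here, simplified from the 0002 toy patch (with a plain integer standing in for the real XLogRecPtr typedef):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the real typedef */

/*
 * Simplified form of the extra check in verifyBackupPageConsistency():
 * when the block was registered clean, the LSN of the replayed page must
 * match the LSN of the copy coming from the primary.
 */
bool
clean_image_lsn_consistent(bool clean_image,
						   XLogRecPtr primary_lsn,
						   XLogRecPtr replay_lsn)
{
	if (!clean_image)
		return true;			/* nothing extra to verify */
	return primary_lsn == replay_lsn;
}
```

In the FATAL above, the replayed LSN 0/A7E5CD18 differs from the primary's 0/A7E56348 on a clean-registered block, so the check fails.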

I don't see other areas with a similar problem, to the extent of the
core regression tests, that is. That's my way of saying that your
patch looks good to me and that I'm planning to apply it to fix the
issue.

This shows me a more interesting issue unrelated to this thread:
027_stream_regress.pl would be stuck if the standby finds an
inconsistent page under wal_consistency_checking. This needs to be
fixed before I'm able to create a buildfarm animal with this setup.
I'll spawn a new thread about that tomorrow.
--
Michael

Attachments:

v3-0001-Fix-inconsistency-with-replay-of-hash-squeeze-rec.patch (text/x-diff; charset=us-ascii)
From 4c6597f2ce57439696aecdbda08b374857ab715d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 10 Apr 2024 16:47:29 +0900
Subject: [PATCH v3 1/2] Fix inconsistency with replay of hash squeeze record
 for clean buffers

Blah.
---
 src/backend/access/hash/hash_xlog.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..d7ba74579b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,6 +696,17 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 				ninserted++;
 			}
+
+			mod_wbuf = true;
+		}
+		else
+		{
+			/*
+			 * See _hash_freeovflpage() which has a similar assertion when
+			 * there are no tuples.
+			 */
+			Assert(xldata->is_prim_bucket_same_wrt ||
+				   xldata->is_prev_bucket_same_wrt);
 		}
 
 		/*
@@ -711,10 +723,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writepage, lsn);
+			MarkBufferDirty(writebuf);
+		}
 	}
 
 	/* replay the record for initializing overflow buffer */
-- 
2.43.0

v3-0002-Implement-consistency-check-with-clean-buffers-fo.patch (text/x-diff; charset=us-ascii)
From 22d5d6bbc093a80688ff4e9d4b5865c8fde37d9c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 10 Apr 2024 16:48:21 +0900
Subject: [PATCH v3 2/2] Implement consistency check with clean buffers for
 wal_consistency_checking

This extends WAL records to track if a buffer registered in a record was
"clean", aka it is registered without being modified.  This is done by
adding a new flag called BKPIMAGE_CLEAN, recorded when a block uses
REGBUF_NO_CHANGE.

This uses a bit in a record for nothing, so while it is useful for
sanity checks, this should never *ever* be applied as-is.
---
 src/include/access/xlogreader.h           |  3 +++
 src/include/access/xlogrecord.h           |  2 ++
 src/backend/access/rmgrdesc/xlogdesc.c    | 11 +++++++++--
 src/backend/access/transam/xloginsert.c   |  4 ++++
 src/backend/access/transam/xlogreader.c   |  1 +
 src/backend/access/transam/xlogrecovery.c | 15 +++++++++++++++
 6 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2e9e5f43eb..6e42346c31 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -135,6 +135,7 @@ typedef struct
 	/* Information on full-page image, if any */
 	bool		has_image;		/* has image, even for consistency checking */
 	bool		apply_image;	/* has image that should be restored */
+	bool		clean_image;	/* image cleanly registered  */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -424,6 +425,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].apply_image)
+#define XLogRecBlockImageClean(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].clean_image)
 #define XLogRecHasBlockData(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].has_data)
 
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index b9e5c59fae..c38de8a96e 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -157,6 +157,8 @@ typedef struct XLogRecordBlockImageHeader
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_APPLY			0x02	/* page image should be restored
 										 * during replay */
+#define BKPIMAGE_CLEAN			0x20	/* clean buffer registered */
+
 /* compression methods supported */
 #define BKPIMAGE_COMPRESS_PGLZ	0x04
 #define BKPIMAGE_COMPRESS_LZ4	0x08
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e455400716..f045dfe8a0 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -272,8 +272,10 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 						method = "unknown";
 
 					appendStringInfo(buf,
-									 " (FPW%s); hole: offset: %u, length: %u, "
+									 " (FPW%s%s); hole: offset: %u, length: %u, "
 									 "compression saved: %u, method: %s",
+									 XLogRecBlockImageClean(record, block_id) ?
+									 "" : " clean",
 									 XLogRecBlockImageApply(record, block_id) ?
 									 "" : " for WAL verification",
 									 XLogRecGetBlock(record, block_id)->hole_offset,
@@ -286,7 +288,9 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 				else
 				{
 					appendStringInfo(buf,
-									 " (FPW%s); hole: offset: %u, length: %u",
+									 " (FPW%s%s); hole: offset: %u, length: %u",
+									 XLogRecBlockImageClean(record, block_id) ?
+									 "" : " clean",
 									 XLogRecBlockImageApply(record, block_id) ?
 									 "" : " for WAL verification",
 									 XLogRecGetBlock(record, block_id)->hole_offset,
@@ -329,6 +333,9 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 					appendStringInfoString(buf, " FPW");
 				else
 					appendStringInfoString(buf, " FPW for WAL verification");
+
+				if (XLogRecBlockImageClean(record, block_id))
+					appendStringInfoString(buf, " clean");
 			}
 		}
 	}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 9047601534..227c83c477 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -720,6 +720,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 			if (needs_backup)
 				bimg.bimg_info |= BKPIMAGE_APPLY;
 
+			/* If buffer is clean, register this information */
+			if (regbuf->flags & REGBUF_NO_CHANGE)
+				bimg.bimg_info |= BKPIMAGE_CLEAN;
+
 			if (is_compressed)
 			{
 				/* The current compression is stored in the WAL record */
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..6737cbe9e3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1791,6 +1791,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
 
 				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+				blk->clean_image = ((blk->bimg_info & BKPIMAGE_CLEAN) != 0);
 
 				if (BKPIMAGE_COMPRESSED(blk->bimg_info))
 				{
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b2fe2d04cc..98ecb85c0b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2551,6 +2551,21 @@ verifyBackupPageConsistency(XLogReaderState *record)
 					(errcode(ERRCODE_INTERNAL_ERROR),
 					 errmsg_internal("%s", record->errormsg_buf)));
 
+		/*
+		 * If the replayed block was registered while clean, check that its
+		 * LSN matches with the copy coming from the primary.
+		 */
+		if (XLogRecBlockImageClean(record, block_id))
+		{
+			XLogRecPtr	primary_lsn = PageGetLSN((Page) primary_image_masked);
+			XLogRecPtr	replay_lsn = PageGetLSN((Page) replay_image_masked);
+
+			if (primary_lsn != replay_lsn)
+				elog(FATAL, "inconsistent page LSN replayed %X/%X primary %X/%X",
+					 LSN_FORMAT_ARGS(replay_lsn),
+					 LSN_FORMAT_ARGS(primary_lsn));
+		}
+
 		/*
 		 * If masking function is defined, mask both the primary and replay
 		 * images
-- 
2.43.0

#63Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#62)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Apr 10, 2024 at 1:27 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 09, 2024 at 10:25:36AM +0000, Hayato Kuroda (Fujitsu) wrote:

On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths. Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)? I mean just
before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

Indeed, I added an Assert() in the else part. Was it the same as your expectation?

Yep, close enough. Thanks to the insert path we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
which matches the replay path where the assertion is added.

It is fine to have an assertion for this path.

+ else
+ {
+ /*
+ * See _hash_freeovflpage() which has a similar assertion when
+ * there are no tuples.
+ */
+ Assert(xldata->is_prim_bucket_same_wrt ||
+    xldata->is_prev_bucket_same_wrt);

I can understand this comment as I am aware of this code, but I am not
sure it would be equally easy for people looking at this code for the
first time. One may try to find the equivalent assertion in
_hash_freeovflpage(). An alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.

Otherwise, the patch looks good to me.

--
With Regards,
Amit Kapila.

#64Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#63)
Re: Is this a problem in GenericXLogFinish()?

On Wed, Apr 10, 2024 at 03:28:22PM +0530, Amit Kapila wrote:

I can understand this comment as I am aware of this code, but I am not
sure it would be equally easy for people looking at this code for the
first time. One may try to find the equivalent assertion in
_hash_freeovflpage(). An alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.

Yes, I can see your point about why that's confusing. Your suggestion
is much better, so after a second look I've used your version of the
comment and applied the patch on HEAD.

I am wondering if we have other problems like that with dirty buffers
at replay. Perhaps I should put my nose more onto the replay paths
and extend these automated checks with wal_consistency_checking.
--
Michael