Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello hackers,
I tried to investigate a recent buildfarm test failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-06-04%2003%3A27%3A47
29/295 postgresql:recovery / recovery/026_overwrite_contrecord ERROR 39.55s exit status 32
log/026_overwrite_contrecord_standby.log
TRAP: failed Assert("ItemIdIsNormal(lpp)"), File: "../pgsql/src/backend/access/heap/heapam.c", Line: 1002, PID: 3740958
postgres: standby: bf postgres [local] startup(ExceptionalCondition+0x81)[0x56c60bf9]
postgres: standby: bf postgres [local] startup(+0xf776e)[0x5667276e]
postgres: standby: bf postgres [local] startup(heap_getnextslot+0x40)[0x56672ee1]
postgres: standby: bf postgres [local] startup(+0x11c218)[0x56697218]
postgres: standby: bf postgres [local] startup(systable_getnext+0xfa)[0x56697c1a]
postgres: standby: bf postgres [local] startup(+0x6d29c7)[0x56c4d9c7]
postgres: standby: bf postgres [local] startup(+0x6d372c)[0x56c4e72c]
postgres: standby: bf postgres [local] startup(+0x6d8288)[0x56c53288]
postgres: standby: bf postgres [local] startup(RelationCacheInitializePhase3+0x149)[0x56c52d71]
(It's not the only failure of that ilk in the buildfarm.)
and managed to reproduce the failure by running many
026_overwrite_contrecord tests in parallel (with fsync=on).
Analyzing the core dump added some info:
...
#3 0x0000000000bb43cc in ExceptionalCondition (conditionName=0xc45c77 "ItemIdIsNormal(lpp)",
fileName=0xc45aa8 "heapam.c", lineNumber=1002) at assert.c:66
#4 0x00000000004f7f13 in heapgettup_pagemode (scan=0x19f5660, dir=ForwardScanDirection, nkeys=2, key=0x19f61d0)
at heapam.c:1002
#5 0x00000000004f86d1 in heap_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0)
at heapam.c:1307
#6 0x000000000051d028 in table_scan_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0)
at ../../../../src/include/access/tableam.h:1081
#7 0x000000000051da80 in systable_getnext (sysscan=0x19f5470) at genam.c:530
#8 0x0000000000ba0937 in RelationBuildTupleDesc (relation=0x7fa004feea88) at relcache.c:572
#9 0x0000000000ba17b9 in RelationBuildDesc (targetRelId=2679, insertIt=true) at relcache.c:1184
#10 0x0000000000ba6520 in load_critical_index (indexoid=2679, heapoid=2610) at relcache.c:4353
#11 0x0000000000ba607d in RelationCacheInitializePhase3 () at relcache.c:4132
#12 0x0000000000bcb704 in InitPostgres (in_dbname=0x196ca30 "postgres", dboid=5, username=0x19a91b8 "law", useroid=0,
flags=1, out_dbname=0x0) at postinit.c:1193
...
(gdb) frame 4
(gdb) p lpp->lp_flags
$2 = 1
(gdb) p ItemIdIsNormal(lpp)
$12 = 1
So it looks like the Assert had failed when lpp->lp_flags had some other
contents...
I added the following debugging code:
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -995,10 +995,14 @@ continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
ItemId lpp;
+ ItemIdData iid;
OffsetNumber lineoff;
lineoff = scan->rs_vistuples[lineindex];
lpp = PageGetItemId(page, lineoff);
+ iid = *((ItemIdData *)lpp);
+
+ Assert(ItemIdIsNormal(&iid));
Assert(ItemIdIsNormal(lpp));
and got:
...
#2 0x000055b68dc6998c in ExceptionalCondition (conditionName=0x55b68dcfe5f7 "ItemIdIsNormal(&iid)",
fileName=0x55b68dcfe428 "heapam.c", lineNumber=1010) at assert.c:66
#3 0x000055b68d588a78 in heapgettup_pagemode (scan=0x55b68f0905e0, dir=ForwardScanDirection, nkeys=2,
key=0x55b68f091150) at heapam.c:1010
#4 0x000055b68d58930e in heap_getnextslot (sscan=0x55b68f0905e0, direction=ForwardScanDirection, slot=0x55b68f090d20)
at heapam.c:1322
...
(gdb) frame 3
#3 0x000055b68d588a78 in heapgettup_pagemode (...) at heapam.c:1010
1010 Assert(ItemIdIsNormal(&iid));
(gdb) info locals
lpp = 0x7f615c34b0ec
iid = {lp_off = 0, lp_flags = 0, lp_len = 0}
lineoff = 54
tuple = 0x55b68f090638
page = 0x7f615c34b000 ""
(gdb) p *lpp
$1 = {lp_off = 3160, lp_flags = 1, lp_len = 136}
It seemingly confirms that the underlying memory was changed while being
processed in heapgettup_pagemode().
I've tried to add checks for the page buffer content as below:
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -953,11 +953,15 @@ heapgettup_pagemode(HeapScanDesc scan,
Page page;
int lineindex;
int linesleft;
+char page_copy[BLCKSZ];
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
page = BufferGetPage(scan->rs_cbuf);
+memcpy(page_copy, page, BLCKSZ);
+for (int i = 0; i < 100; i++)
+Assert(memcmp(page_copy, page, BLCKSZ) == 0);
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -986,6 +990,10 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
page = BufferGetPage(scan->rs_cbuf);
+memcpy(page_copy, page, BLCKSZ);
+for (int i = 0; i < 100; i++)
+Assert(memcmp(page_copy, page, BLCKSZ) == 0);
+
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
and got the assertion failures even during `make check`:
...
#5 0x00005577f29e0bc4 in ExceptionalCondition (
conditionName=conditionName@entry=0x5577f2a4a5d0 "memcmp(page_copy, page, BLCKSZ) == 0",
fileName=fileName@entry=0x5577f2a4aa38 "heapam.c", lineNumber=lineNumber@entry=966) at assert.c:66
#6 0x00005577f24faa68 in heapgettup_pagemode (scan=scan@entry=0x5577f46574e8, dir=ForwardScanDirection, nkeys=0,
key=0x0) at heapam.c:966
...
(gdb) frame 6
#6 0x00005577f24faa68 in heapgettup_pagemode (...) at heapam.c:966
966 Assert(memcmp(page_copy, page, BLCKSZ) == 0);
(gdb) p i
$1 = 25
Am I missing something or the the page buffer indeed lacks locking there?
Best regards,
Alexander
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Am I missing something or the the page buffer indeed lacks locking there?
I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.
Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hello Robert,
06.06.2024 19:36, Robert Haas wrote:
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Am I missing something or the the page buffer indeed lacks locking there?
I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.
Yes, my last experiment with memcmp for the whole buffer was wrong,
given the comment above heapgettup_pagemode(). I think the correct
check would be:
ItemId lpp;
OffsetNumber lineoff;
+ItemIdData iid;
lineoff = scan->rs_vistuples[lineindex];
lpp = PageGetItemId(page, lineoff);
+iid = *((ItemIdData *)lpp);
+for (int i = 0; i < 1000; i++)
+Assert(memcmp(&iid, lpp, sizeof(iid)) == 0);
It significantly alleviates reproducing of the test failure for me.
Will try to bisect this anomaly tomorrow.
Best regards,
Alexander
On Thu, Jun 06, 2024 at 12:36:32PM -0400, Robert Haas wrote:
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Am I missing something or the the page buffer indeed lacks locking there?
I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.
$SUBJECT looks more like a duplicate of
postgr.es/m/flat/20240512171658.7e.nmisch@google.com (Hot standby queries see
transient all-zeros pages).
Hello Noah,
06.06.2024 22:07, Noah Misch wrote:
I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.$SUBJECT looks more like a duplicate of
postgr.es/m/flat/20240512171658.7e.nmisch@google.com (Hot standby queries see
transient all-zeros pages).
Thank you for the reference! Yes, it looks very similar. Though I can't
say the sleep you proposed helps the failure reproduction (I've tried
026_overwrite_contrecord.pl and saw no more frequent failures or so).
My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
Author: Thomas Munro <tmunro@postgresql.org>
Date: Wed Apr 3 00:03:08 2024 +1300
Provide vectored variant of ReadBuffer().
Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08
Best regards,
Alexander
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commitAuthor: Thomas Munro <tmunro@postgresql.org>
Date: Wed Apr 3 00:03:08 2024 +1300Provide vectored variant of ReadBuffer().
Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08
Looking...
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commitAuthor: Thomas Munro <tmunro@postgresql.org>
Date: Wed Apr 3 00:03:08 2024 +1300Provide vectored variant of ReadBuffer().
Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08Looking...
What Noah described[1]/messages/by-id/20240512171658.7e.nmisch@google.com is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page. Oops. The attached
seems to cure his repro for me. Does it also cure your test? I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation. I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.
Attachments:
0001-Fix-RBM_ZERO_AND_LOCK.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-RBM_ZERO_AND_LOCK.patchDownload
From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH] Fix RBM_ZERO_AND_LOCK.
Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool. It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.
Reported-by: Noah Misch <noah@leadboat.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/20240512171658.7e.nmisch@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
src/backend/storage/buffer/bufmgr.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..238fc0e3547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
* return.
*/
static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
{
BufferDesc *bufHdr;
uint32 buf_state;
@@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode)
LockBufferForCleanup(buffer);
}
- memset(BufferGetPage(buffer), 0, BLCKSZ);
+ if (zero)
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
if (BufferIsLocal(buffer))
{
@@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
forkNum, blockNum, strategy, &found);
- ZeroBuffer(buffer, mode);
+ ZeroBuffer(buffer, mode, !found);
return buffer;
}
--
2.45.1
On 2024-Jun-07, Thomas Munro wrote:
static void -ZeroBuffer(Buffer buffer, ReadBufferMode mode) +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
This change makes the API very strange. Should the function be called
ZeroAndLockBuffer() instead? Then the addition of a "bool zero"
argument makes a lot more sense.
In passing, I noticed that WaitReadBuffers has zero comments, which
seems an insufficient number of them.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Hello Thomas,
07.06.2024 09:06, Thomas Munro wrote:
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commitAuthor: Thomas Munro <tmunro@postgresql.org>
Date: Wed Apr 3 00:03:08 2024 +1300Provide vectored variant of ReadBuffer().
Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08Looking...
What Noah described[1] is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page. Oops. The attached
seems to cure his repro for me. Does it also cure your test? I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation. I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.
Thank you for looking and for the fix!
Using the same testing procedure (applying patch for checking lpp,
multiplying 026_overwrite_contrecord.pl tests and running 30 tests in
parallel, with fsync=on) which I used for bisecting, I got failures on
iterations 8, 19, 4 without the fix, but with the fix applied, 125
iterations passed. I think The Cure is sound.
Best regards,
Alexander
On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
static void -ZeroBuffer(Buffer buffer, ReadBufferMode mode) +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)This change makes the API very strange. Should the function be called
ZeroAndLockBuffer() instead? Then the addition of a "bool zero"
argument makes a lot more sense.
I agree that's better, but it still looks a bit weird. You have to
realize that 'bool zero' means 'is already zeroed' here -- or at
least, I guess that's the intention. But then I wonder why you'd call
a function called ZeroAndLockBuffer if all you need to do is
LockBuffer.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Jun 8, 2024 at 12:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
static void -ZeroBuffer(Buffer buffer, ReadBufferMode mode) +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)This change makes the API very strange. Should the function be called
ZeroAndLockBuffer() instead? Then the addition of a "bool zero"
argument makes a lot more sense.I agree that's better, but it still looks a bit weird. You have to
realize that 'bool zero' means 'is already zeroed' here -- or at
least, I guess that's the intention. But then I wonder why you'd call
a function called ZeroAndLockBuffer if all you need to do is
LockBuffer.
The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact
that it doesn't always zero despite shouting ZERO is probably what
temporarily confused me). But coming up with a better name is hard
and I certainly don't propose to change it now. I think it's
reasonable for this internal helper function to have that matching
name as Alvaro suggested, with a good comment about that.
Even though that quick-demonstration change fixed the two reported
repros, I think it is still probably racy (or if it isn't, it relies
on higher level interlocking that I don't want to rely on). This case
really should be using the standard StartBufferIO/TerminateBufferIO
infrastructure as it was before. I had moved that around to deal with
multi-block I/O, but dropped the ball on the zero case... sorry.
Here's a version like that. The "zero" argument (yeah that was not a
good name) is now inverted and called "already_valid", but it's only a
sort of optimisation for the case where we already know for sure that
it's valid. If it isn't, we do the standard
BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any
concurrent read or zero operation.
Attachments:
v2-0001-Fix-RBM_ZERO_AND_LOCK.patchapplication/octet-stream; name=v2-0001-Fix-RBM_ZERO_AND_LOCK.patchDownload
From 144fc2a434e4c912df6adc1d4a8245497fd44261 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH v2] Fix RBM_ZERO_AND_LOCK.
Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool. It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet. This is now done with the
standard BM_IO_IN_PROGRESS infrastructure to test for BM_VALID and wake
concurrent waiters without races (as it was in earlier releases).
Reported-by: Noah Misch <noah@leadboat.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20240512171658.7e.nmisch@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
src/backend/storage/buffer/bufmgr.c | 54 +++++++++++++++++++----------
1 file changed, 35 insertions(+), 19 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f9..9b1ef3fde5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1010,43 +1010,59 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
}
/*
- * Zero a buffer and lock it, as part of the implementation of
+ * Lock and optionally zero a buffer, as part of the implementation of
* RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK. The buffer must be already
- * pinned. It does not have to be valid, but it is valid and locked on
- * return.
+ * pinned. If the buffer is not already valid, it is zeroed and made valid.
*/
static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
{
BufferDesc *bufHdr;
- uint32 buf_state;
+ bool need_to_zero;
Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
- if (BufferIsLocal(buffer))
+ if (already_valid)
+ {
+ /*
+ * If the caller already knew the buffer was valid, we can skip some
+ * header interaction. The caller just wants to lock the buffer.
+ */
+ need_to_zero = false;
+ }
+ else if (BufferIsLocal(buffer))
+ {
+ /* Simple case for non-shared buffers. */
bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ need_to_zero = (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+ }
else
{
+ /* Take BM_IO_IN_PROGRESS, or discover that BM_VALID has been set concurrently. */
bufHdr = GetBufferDescriptor(buffer - 1);
+ need_to_zero = StartBufferIO(bufHdr, true, false);
+ }
+
+ if (need_to_zero)
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ /* Acquire the requested lock before setting BM_VALID. */
+ if (!BufferIsLocal(buffer))
+ {
if (mode == RBM_ZERO_AND_LOCK)
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
else
LockBufferForCleanup(buffer);
}
- memset(BufferGetPage(buffer), 0, BLCKSZ);
-
- if (BufferIsLocal(buffer))
+ /* Set BM_VALID, and wake anyone waiting for BM_IO_IN_PROGRESS to be cleared. */
+ if (need_to_zero)
{
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- buf_state = LockBufHdr(bufHdr);
- buf_state |= BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
+ if (BufferIsLocal(buffer))
+ pg_atomic_write_u32(&bufHdr->state,
+ pg_atomic_read_u32(&bufHdr->state) | BM_VALID);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
}
}
@@ -1185,7 +1201,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
forkNum, blockNum, strategy, &found);
- ZeroBuffer(buffer, mode);
+ ZeroAndLockBuffer(buffer, mode, found);
return buffer;
}
--
2.39.3 (Apple Git-146)
New version. Same code as v2, but comments improved to explain the
reasoning, with reference to README's buffer access rules. I'm
planning to push this soon if there are no objections.
Attachments:
v3-0001-Fix-RBM_ZERO_AND_LOCK.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Fix-RBM_ZERO_AND_LOCK.patchDownload
From 1fa26f407622cd69d82f3b4ea68fddf2c357cf06 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH v3] Fix RBM_ZERO_AND_LOCK.
Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool. It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet. This is done with the standard
BM_IO_IN_PROGRESS infrastructure to test for BM_VALID and wake
concurrent waiters without races (as it was in earlier releases).
Reported-by: Noah Misch <noah@leadboat.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> (earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier version)
Discussion: https://postgr.es/m/20240512171658.7e.nmisch@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
src/backend/storage/buffer/bufmgr.c | 64 ++++++++++++++++++++---------
1 file changed, 45 insertions(+), 19 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a8e3377263f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1010,43 +1010,69 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
}
/*
- * Zero a buffer and lock it, as part of the implementation of
+ * Lock and optionally zero a buffer, as part of the implementation of
* RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK. The buffer must be already
- * pinned. It does not have to be valid, but it is valid and locked on
- * return.
+ * pinned. If the buffer is not already valid, it is zeroed and made valid.
*/
static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
{
BufferDesc *bufHdr;
- uint32 buf_state;
+ bool need_to_zero;
Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
- if (BufferIsLocal(buffer))
+ if (already_valid)
+ {
+ /*
+ * If the caller already knew the buffer was valid, we can skip some
+ * header interaction. The caller just wants to lock the buffer.
+ */
+ need_to_zero = false;
+ }
+ else if (BufferIsLocal(buffer))
+ {
+ /* Simple case for non-shared buffers. */
bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ need_to_zero = (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+ }
else
{
+ /*
+ * Take BM_IO_IN_PROGRESS, or discover that BM_VALID has been set
+ * concurrently. Even though we aren't doing I/O, that protocol
+ * ensures that we don't zero a page that someone else has pinned. An
+ * exclusive content lock wouldn't be enough, because readers are
+ * allowed to drop the content lock after determining that a tuple is
+ * visible (see buffer access rules in README).
+ */
bufHdr = GetBufferDescriptor(buffer - 1);
+ need_to_zero = StartBufferIO(bufHdr, true, false);
+ }
+
+ if (need_to_zero)
+ memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+ /* Acquire the requested lock before setting BM_VALID. */
+ if (!BufferIsLocal(buffer))
+ {
if (mode == RBM_ZERO_AND_LOCK)
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
else
LockBufferForCleanup(buffer);
}
- memset(BufferGetPage(buffer), 0, BLCKSZ);
-
- if (BufferIsLocal(buffer))
- {
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
+ /*
+ * Set BM_VALID, and wake anyone waiting for BM_IO_IN_PROGRESS to be
+ * cleared.
+ */
+ if (need_to_zero)
{
- buf_state = LockBufHdr(bufHdr);
- buf_state |= BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
+ if (BufferIsLocal(buffer))
+ pg_atomic_write_u32(&bufHdr->state,
+ pg_atomic_read_u32(&bufHdr->state) | BM_VALID);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID, true);
}
}
@@ -1185,7 +1211,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
forkNum, blockNum, strategy, &found);
- ZeroBuffer(buffer, mode);
+ ZeroAndLockBuffer(buffer, mode, found);
return buffer;
}
--
2.45.1
On Fri, Jun 7, 2024 at 8:05 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
In passing, I noticed that WaitReadBuffers has zero comments, which
seems an insufficient number of them.
Ack. Here is a patch for that. I guess I hadn't put a comment there
because it's hard to write anything without sort of duplicating what
is already said by the StartReadBuffers() comment and doubling down on
descriptions of future plans... but, in for a penny, in for a pound as
they say.
Attachments:
0001-Add-comment-for-WaitReadBuffers.patchapplication/x-patch; name=0001-Add-comment-for-WaitReadBuffers.patchDownload
From db5e56825e4b4c595885373c413011c50fedd3e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 9 Jun 2024 09:51:02 +1200
Subject: [PATCH] Add comment for WaitReadBuffers().
Per complaint from Alvaro while reviewing something else.
Reported-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/202406070805.icofqromovvn%40alvherre.pgsql
---
src/backend/storage/buffer/bufmgr.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a8e3377263f..409210a6ed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1371,6 +1371,17 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
}
+/*
+ * The second step of a StartReadBuffers(), WaitReadBuffers() sequence. It is
+ * only necessary to call this if StartReadBuffers() returned true, indicating
+ * that I/O was necessary.
+ *
+ * Currently, this function performs synchronous I/O system calls. The earlier
+ * StartReadBuffers() call might have issued advice to give the OS a head
+ * start so that it completes quickly. In future work, a true asynchronous I/O
+ * could be started by StartReadBuffers(), and this function would wait for it
+ * to complete.
+ */
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
--
2.45.1
I adjusted the code a bit more to look like the 16 coding including
restoring some very useful comments that had been lost, and pushed.
Thanks very much to Alexander and Noah for (independently) chasing
this down and reporting, testing etc, and Alvaro and Robert for review
comments.
(Passing thought: it's a bit weird that we need to zero pages at all
before restoring FPIs or initialising them. Perhaps there is some way
to defer marking the buffer valid until after the caller gets a chance
to initialise? Or something like that...)