Could not read block at end of the relation
Hello,
I'm sorry, as this will be a very poor bug report. On PG16, I'm experiencing
random errors which share the same characteristics:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but haven't been able to confirm it is necessary), a vacuum is running
on the table at the same time the error is triggered
Then, several backends get the same error at once "ERROR: could not read
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different
block numbers. The relation is always a table (regular or toast). The blocks
are past the end of the relation, and the different backends are all trying to
read a different block. The offending queries are either an INSERT, an UPDATE or
a COPY.
I've seen that several bugs have been fixed in 16.1 and 16.2 regarding the new
relation extension infrastructure, involving partitioned tables in one case
and temp tables in the other, so I suspect some other corner cases may remain
uncovered in there.
I suspected the FSM could be corrupted in some way, but taking a look at it
just after the errors were triggered, the offending (non-existing) blocks
are just not present in the FSM either.
I'm desperately trying to reproduce the issue in a test environment, without
any luck so far... I suspected a race condition with VACUUM trying to reclaim
the space at the end of the relation, but running a custom build trying to
reproduce that (by always trying to truncate the relation during VACUUM
regardless of the amount of possibly-freeable-space) hasn't led me anywhere.
It seems that for some reason, a backend is extending the relation for the
other waiting ones, and the newly allocated blocks don't end up being pinned
in shared_buffers. They could then be evicted, and the waiting backend is now
trying to read from a block which has to be read from disk but has never been
marked dirty and never persisted. I don't have anything to back that
hypothesis though...
Once again I'm sorry that this report is too vague, I'll update if I manage to
reproduce the issue or gather some more information, but in the meantime has
anybody witnessed something similar ? And more importantly, do you have any
pointers on how to investigate to try to trigger the issue manually ?
Best regards,
--
Ronan Dunklau
On Tuesday, February 27, 2024 at 11:34:14 CET, Ronan Dunklau wrote:
I suspected the FSM could be corrupted in some way but taking a look at it
just after the errors have been triggered, the offending (non
existing)blocks are just not present in the FSM either.
I think I may have missed something on my first look. On other affected
clusters, the FSM is definitely corrupted. So it looks like we have an FSM
corruption bug on our hands.
The conditions under which this bug occurs make it hard to reproduce, but it's
definitely frequent enough that we witnessed it on a dozen PostgreSQL clusters.
In our case, we need to repair the FSM. The instructions on the wiki do work,
but maybe we should add something like the attached patch (modeled after the
same feature in pg_visibility) to make it possible to repair the FSM
corruption online. What do you think about it ?
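For illustration, usage could look something like the following; the function
name comes from the attached patch, while the extension it is exposed through
and the exact signature are assumptions on my part, modeled on pg_visibility's
pg_truncate_visibility_map():

-- Hypothetical sketch: drop the corrupted FSM fork online, then let VACUUM
-- rebuild it from the actual free space.
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT pg_truncate_freespacemap('mytable'::regclass);
VACUUM mytable;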
The investigation of the corruption is still ongoing.
Best regards,
--
Ronan Dunklau
Attachments:
0001-Provide-a-pg_truncate_freespacemap-function.patch (text/x-patch, +58/-1)
On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
In our case, we need to repair the FSM. The instructions on the wiki do work,
but maybe we should add something like the attached patch (modeled after the
same feature in pg_visibility) to make it possible to repair the FSM
corruption online. What do you think about it ?
I vaguely recall that something like that has been suggested around
917dc7d2393c in the past.
The investigation of the corruption is still ongoing.
Thanks!
--
Michael
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but haven't been able to confirm it is necessary), a vacuum is running
on the table at the same time the error is triggered

Then, several backends get the same error at once "ERROR: could not read
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different
What are some of the specific block numbers reported?
has anybody witnessed something similar ?
/messages/by-id/flat/CA+hUKGK+5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D+Zg@mail.gmail.com
reminded me of this. Did you upgrade your OS recently?
On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
I think I may have missed something on my first look. On other affected
clusters, the FSM is definitely corrupted. So it looks like we have an FSM
corruption bug on our hands.
What corruption signs did you observe in the FSM? Since FSM is intentionally
not WAL-logged, corruption is normal, but corruption causing errors is not
normal. That said, if any crash leaves a state that the freespace/README
"self-correcting measures" don't detect, errors may happen. Did the clusters
crash recently?
The conditions under which this bug occurs make it hard to reproduce, but it's
definitely frequent enough that we witnessed it on a dozen PostgreSQL clusters.
You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
problem stops happening. That would corroborate the VACUUM theory.
Can you use backtrace_functions to get a stack trace?
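For example, something along these lines (just a sketch; mdread is where that
error is raised):

-- Log a backtrace whenever an error is raised inside mdread().
ALTER SYSTEM SET backtrace_functions = 'mdread';
SELECT pg_reload_conf();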
In our case, we need to repair the FSM. The instructions on the wiki do work,
but maybe we should add something like the attached patch (modeled after the
same feature in pg_visibility) to make it possible to repair the FSM
corruption online. What do you think about it ?
That's reasonable in concept.
On Monday, March 4, 2024 at 00:47:15 CET, Noah Misch wrote:
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but haven't been able to confirm it is necessary), a vacuum is
running on the table at the same time the error is triggered

Then, several backends get the same error at once "ERROR: could not read
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different

What are some of the specific block numbers reported?
Nothing specific, except they are past the end of the table, by an offset
starting at one. I've seen it happen quite a lot on regular tables restored by
pg_restore, but also on bigger tables, regular or TOAST. We've seen this on a
dozen different clusters, but nothing reproducible for now.
has anybody witnessed something similar ?
/messages/by-id/flat/CA+hUKGK+5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com
reminded me of this. Did you upgrade your OS
recently?
No recent upgrades I'm aware of. Also, it only affects PG16 clusters, not
earlier versions. But it looks like the same kind of errors.
What corruption signs did you observe in the FSM? Since FSM is
intentionally not WAL-logged, corruption is normal, but corruption causing
errors is not normal. That said, if any crash leaves a state that the
freespace/README "self-correcting measures" don't detect, errors may
happen. Did the clusters crash recently?
Well, we still have full-page images for the FSM.
I'll take as an example the cluster I'm currently investigating.
The corruption I'm seeing is that I have an additional entry in the FSM for a
block that does not exist in the relation: IIUC, slot 4129 corresponds to
block 34, which is the last one I have in the relation. However, I have an
entry (slot 4130) for block number 35, recorded with MaxFSMRequestSize:
fsm_page_contents
-------------------
0: 255 +
1: 255 +
3: 255 +
...
4129: 114 +
4130: 255 +
This then causes attempts to insert into that block to fail.
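For reference, a check along the following lines is one way to spot such an
entry; the table name is illustrative, the block numbers are the ones from this
example, and the queries assume the pg_freespacemap and pageinspect extensions:

-- Number of blocks actually present in the main fork:
SELECT pg_relation_size('mytable') / current_setting('block_size')::int;
-- What the FSM records for the first block past the end (block 35 here):
SELECT pg_freespace('mytable', 35);
-- Raw contents of the FSM leaf page holding that slot (FSM fork block 2):
SELECT fsm_page_contents(get_raw_page('mytable', 'fsm', 2));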
Looking at when the corruption was WAL-logged, this particular case is quite
easy to trace. We have a few MULTI_INSERT+INIT records initially loading the table
(probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE and VISIBLE xlog
records for each block of the relation's main fork, followed by a lonely FPI for the
leaf page of its FSM:
rmgr: Heap2 len (rec/tot): 5698/ 5698, tx: 1637, lsn:
1/810763B8, prev 1/81076388, desc: MULTI_INSERT+INIT ntuples: 124, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 0
rmgr: XLOG len (rec/tot): 49/ 8241, tx: 1637, lsn:
1/81077A00, prev 1/810763B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW
rmgr: Heap2 len (rec/tot): 5762/ 5762, tx: 1637, lsn:
1/81079A50, prev 1/81077A00, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 1
rmgr: Heap2 len (rec/tot): 5746/ 5746, tx: 1637, lsn:
1/8107B0F0, prev 1/81079A50, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 2
rmgr: Heap2 len (rec/tot): 5778/ 5778, tx: 1637, lsn:
1/8107C780, prev 1/8107B0F0, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 3
(and so on, up until blk 34 is WAL-logged)
At some point FSM pages 0 and 1 get an FPI, but the leaf page (blk 2) does not
during that transaction.
Later on, we have this for every main fork block:
rmgr: XLOG len (rec/tot): 49/ 8201, tx: 0, lsn:
2/0E1ECEC8, prev 2/0E1ECE10, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 blk 0 FPW
rmgr: Heap2 len (rec/tot): 325/ 325, tx: 0, lsn:
2/0E1EEEF0, prev 2/0E1ECEC8, desc: FREEZE_PAGE snapshotConflictHorizon: 1637,
nplans: 2, plans: [{ xmax: 0, infomask: 2816, infomask2: 5, ntuples: 86,
offsets: [1, 2, 3, 5, 7, 10, 14, 16, 21, 22, 28, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 45, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60,
64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 85, 86,
87, 90, 92, 94, 95, 96, 98, 100, 102, 103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124] }, { xmax: 0,
infomask: 2817, infomask2: 5, ntuples: 38, offsets: [4, 6, 8, 9, 11, 12, 13,
15, 17, 18, 19, 20, 23, 24, 25, 26, 27, 29, 43, 44, 47, 48, 59, 61, 62, 63,
66, 70, 80, 84, 88, 89, 91, 93, 97, 99, 101, 116] }], blkref #0: rel
1663/17343/17959 blk 0
rmgr: Heap2 len (rec/tot): 64/ 8256, tx: 0, lsn:
2/0E1EF038, prev 2/0E1EEEF0, desc: VISIBLE snapshotConflictHorizon: 0, flags:
0x03, blkref #0: rel 1663/17343/17959 fork vm blk 0 FPW, blkref #1: rel
1663/17343/17959 blk 0
And then the single FSM FPI, which contains the invalid record:
rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn:
2/0E28F640, prev 2/0E28F600, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW
I've dumped the associated FPIs and taken a look at them. The first FPI for the leaf
block (#2 in the FSM) happens at the beginning of the multi-insert
transaction, and only contains one leaf node for block 0 (slot 4095).
The later FPI (during what I think is a VACUUM) persists the corruption in the
WAL.
So presumably something happened during the COPY operation which caused it to
add an invalid entry for the non-existing block 35 in the FSM.
There are no traces of relation truncation happening in the WAL.
This case only shows a single invalid entry in the FSM, but I've noticed as
many as 62 blocks present in the FSM while they do not exist on disk, all
tagged with MaxFSMRequestSize, so I suppose something is wrong with the bulk
extension mechanism.
The conditions under which this bug occurs make it hard to reproduce, but it's
definitely frequent enough that we witnessed it on a dozen PostgreSQL clusters.

You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
problem stops happening. That would corroborate the VACUUM theory.

Can you use backtrace_functions to get a stack trace?
I got a stack trace from the error after the corruption had occurred (I even got
to attach gdb to it, as I was able to restore a corrupted cluster), but I don't
think it's of any use (this is from a different occurrence than the one I wrote
about in detail above, so the block numbers don't match, but it's the exact same
thing).
#0 errstart (elevel=21, domain=0x0) at utils/error/elog.c:348
#1 0x000055c81ea83a0a in errstart_cold (domain=0x0, elevel=21) at utils/
error/elog.c:333
#2 mdread (reln=0x55c8210bbe00, forknum=MAIN_FORKNUM, blocknum=1203,
buffer=0x7f51b4c32000) at storage/smgr/md.c:796
#3 0x000055c81ee396cf in smgrread (buffer=0x7f51b4c32000, blocknum=1203,
forknum=MAIN_FORKNUM, reln=0x55c8210bbe00) at storage/smgr/smgr.c:565
#4 ReadBuffer_common (smgr=0x55c8210bbe00,
relpersistence=relpersistence@entry=112 'p',
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1203,
mode=mode@entry=RBM_NORMAL, strategy=<optimized out>, hit=0x7fff6e22f607)
at storage/buffer/bufmgr.c:1124
#5 0x000055c81ee39d0b in ReadBufferExtended (reln=0x7f51b294f0d0,
forkNum=MAIN_FORKNUM, blockNum=1203, mode=RBM_NORMAL, strategy=<optimized
out>)
at access/brin/../../../../src/include/utils/rel.h:576
#6 0x000055c81eb1525c in ReadBufferBI (mode=RBM_NORMAL, bistate=0x0,
targetBlock=1203, relation=<optimized out>) at access/heap/hio.c:96
#7 RelationGetBufferForTuple (relation=<optimized out>, len=<optimized out>,
otherBuffer=0, options=<optimized out>, bistate=0x0, vmbuffer=<optimized out>,
vmbuffer_other=<optimized out>, num_pages=<optimized out>) at access/heap/
hio.c:620
#8 0x000055c81eafcace in heap_insert (relation=0x7f51b294f0d0,
tup=0x55c821042580, cid=<optimized out>, options=0, bistate=0x0) at access/
heap/heapam.c:1862
#9 0x000055c81eb08bab in heapam_tuple_insert (relation=0x7f51b294f0d0,
slot=0x55c821042478, cid=0, options=0, bistate=0x0) at access/heap/
heapam_handler.c:252
#10 0x000055c81ecce2f1 in table_tuple_insert (bistate=0x0, options=0,
cid=<optimized out>, slot=0x55c821042478, rel=<optimized out>)
at executor/../../../src/include/access/tableam.h:1400
#11 ExecInsert (context=context@entry=0x7fff6e22f9b0,
resultRelInfo=resultRelInfo@entry=0x55c8210410c0,
slot=slot@entry=0x55c821042478, canSetTag=<optimized out>,
inserted_tuple=inserted_tuple@entry=0x0,
insert_destrel=insert_destrel@entry=0x0) at executor/nodeModifyTable.c:1133
#12 0x000055c81eccf1eb in ExecModifyTable (pstate=<optimized out>) at
executor/nodeModifyTable.c:3810
Just a regular insertion failing to read the block it's trying to insert into.
In our case, we need to repair the FSM. The instructions on the wiki do
work, but maybe we should add something like the attached patch (modeled
after the same feature in pg_visibility) to make it possible to repair
the FSM corruption online. What do you think about it ?

That's reasonable in concept.
Ok, I'll start a separate thread on -hackers for that.
Thank you !
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
On Monday, March 4, 2024 at 00:47:15 CET, Noah Misch wrote:
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but haven't been able to confirm it is necessary), a vacuum is
running on the table at the same time the error is triggered
Looking at when the corruption was WAL-logged, this particular case is quite
easy to trace. We have a few MULTI_INSERT+INIT records initially loading the table
(probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE and VISIBLE xlog
records for each block of the relation's main fork, followed by a lonely FPI for the
leaf page of its FSM:
You're using data_checksums, right? Thanks for the wal dump excerpts; I agree
with this summary thereof.
There are no traces of relation truncation happening in the WAL.
That is notable.
This case only shows a single invalid entry in the FSM, but I've noticed as
many as 62 blocks present in the FSM while they do not exist on disk, all
tagged with MaxFSMRequestSize, so I suppose something is wrong with the bulk
extension mechanism.
Is this happening after an OS crash, a replica promote, or a PITR restore? If
so, I think I see the problem. We have an undocumented rule that FSM shall
not contain references to pages past the end of the relation. To facilitate
that, relation truncation WAL-logs FSM truncate. However, there's no similar
protection for relation extension, which is not WAL-logged. We break the rule
whenever we write FSM for block X before some WAL record initializes block X.
data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT
records for FSM changes. A replica promote or PITR ending just after the FSM
FPI_FOR_HINT would yield this broken state. While v16 RelationAddBlocks()
made this easier to hit, I suspect it's reproducible in all supported
branches. For example, lazy_scan_new_or_empty() and multiple index AMs break
the rule via RecordPageWithFreeSpace() on a PageIsNew() page.
I think the fix is one of:
- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
returning a now-too-large block number.
- Enforce a new "main-fork WAL before FSM" rule for logged rels. For example,
in each PageIsNew() case, either don't update FSM or WAL-log an init (like
lazy_scan_new_or_empty() does when PageIsEmpty()).
On Monday, March 4, 2024 at 20:03:12 CET, Noah Misch wrote:
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
On Monday, March 4, 2024 at 00:47:15 CET, Noah Misch wrote:
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
You're using data_checksums, right? Thanks for the wal dump excerpts; I
agree with this summary thereof.
Yes, data checksums so wal_log_hints is implied.
There are no traces of relation truncation happening in the WAL.
That is notable.
This case only shows a single invalid entry in the FSM, but I've noticed as
many as 62 blocks present in the FSM while they do not exist on disk, all
tagged with MaxFSMRequestSize, so I suppose something is wrong with the bulk
extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore?
I need to double-check all occurrences, but I wouldn't be surprised if they all
went through a replica promotion.
If so, I think I see the problem. We have an undocumented rule that FSM
shall not contain references to pages past the end of the relation. To
facilitate that, relation truncation WAL-logs FSM truncate. However,
there's no similar protection for relation extension, which is not
WAL-logged. We break the rule whenever we write FSM for block X before
some WAL record initializes block X. data_checksums makes the trouble
easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A
replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
this broken state. While v16 RelationAddBlocks() made this easier to hit,
I suspect it's reproducible in all supported branches. For example,
lazy_scan_new_or_empty() and multiple index AMs break the rule via
RecordPageWithFreeSpace() on a PageIsNew() page.
Very interesting. I understand the reasoning, and am now able to craft a very
crude test case, which I will refine into something more actionable:
- start a session 1, which will just vacuum our table continuously
- start a session 2, which will issue checkpoints continuously
- in a session 3, COPY enough data so that we preallocate too many pages. In
my example, I'm copying 33150 single-integer rows, which fit on a bit more
than 162 pages. The idea behind that is to make sure we don't fall on a round
number of pages, and we leave several iterations of the extend mechanism
running to issue a full-page write of the FSM.
So to trigger the bug, it seems to me one needs to run in parallel:
- COPY
- VACUUM
- CHECKPOINT
Which would explain why a surprisingly large number of occurrences I've seen
seemed to involve a pg_restore invocation.
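For the record, a rough sketch of that workload (the table name, file path and
driving loops are illustrative; data_checksums must be enabled, and per the
analysis above the bad FSM state only causes errors once the cluster goes
through recovery, e.g. a replica promotion, shortly after the FSM FPI):

-- Setup, once:
CREATE TABLE fsm_repro (i int);

-- Session 1, repeatedly:
VACUUM fsm_repro;

-- Session 2, repeatedly:
CHECKPOINT;

-- Session 3: bulk-load enough rows that the relation is extended in several
-- steps and a full-page write of the FSM leaf page gets logged
-- (33150 single-integer rows, a bit more than 162 pages, as described above):
COPY fsm_repro FROM '/tmp/fsm_repro_33150_rows.csv' (FORMAT csv);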
I think the fix is one of:
- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
returning a now-too-large block number.
Performance-wise, that seems to be the better answer, but won't this kind of
built-in resilience encourage subtle bugs to creep in?
- Enforce a new "main-fork WAL before FSM" rule for logged rels. For
example, in each PageIsNew() case, either don't update FSM or WAL-log an
init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
The FSM is not updated by the caller: in the bulk-insert case, we
intentionally don't add the new pages to the FSM. So having VACUUM WAL-log the page in
lazy_scan_new_or_empty in the PageIsNew case as you propose seems a very good idea.
Performance-wise, that would keep the WAL logging out of the inserting backends, with
VACUUM performing the preventive work for itself or others, and it should not be that
many pages that it gives VACUUM too much work.
I can try my hand at such a patch if it looks like a good idea.
Thank you very much for your insights.
Best regards,
--
Ronan Dunklau
On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
On Monday, March 4, 2024 at 20:03:12 CET, Noah Misch wrote:
Is this happening after an OS crash, a replica promote, or a PITR restore?
I need to double-check all occurrences, but I wouldn't be surprised if they all
went through a replica promotion.

If so, I think I see the problem. We have an undocumented rule that FSM
shall not contain references to pages past the end of the relation. To
facilitate that, relation truncation WAL-logs FSM truncate. However,
there's no similar protection for relation extension, which is not
WAL-logged. We break the rule whenever we write FSM for block X before
some WAL record initializes block X. data_checksums makes the trouble
easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A
replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
this broken state. While v16 RelationAddBlocks() made this easier to hit,
I suspect it's reproducible in all supported branches. For example,
lazy_scan_new_or_empty() and multiple index AMs break the rule via
RecordPageWithFreeSpace() on a PageIsNew() page.

Very interesting. I understand the reasoning, and am now able to craft a very
crude test case, which I will refine into something more actionable:

- start a session 1, which will just vacuum our table continuously
- start a session 2, which will issue checkpoints continuously
- in a session 3, COPY enough data so that we preallocate too many pages. In
my example, I'm copying 33150 single-integer rows, which fit on a bit more
than 162 pages. The idea behind that is to make sure we don't fall on a round
number of pages, and we leave several iterations of the extend mechanism
running to issue a full-page write of the FSM.

So to trigger the bug, it seems to me one needs to run in parallel:
- COPY
- VACUUM
- CHECKPOINT

Which would explain why a surprisingly large number of occurrences I've seen
seemed to involve a pg_restore invocation.
Does that procedure alone reproduce the ERROR, without any replica promote,
postmaster restart, etc.? That, I do not have reason to expect.
I think the fix is one of:
- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
returning a now-too-large block number.

Performance-wise, that seems to be the better answer
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.
but won't this kind of
built-in resilience encourage subtle bugs creeping in ?
Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs. I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule. We'll have
plenty of need for resilience. Hence, I'd lean toward this approach.
- Enforce a new "main-fork WAL before FSM" rule for logged rels. For
example, in each PageIsNew() case, either don't update FSM or WAL-log an
init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we
intentionally don't add them to the FSM. So having VACUUM WAL-log the page in
lazy_scan_new_or_empty in the New case as you propose seems a very good idea.

Performance wise that would keep the WAL logging outside of the backend
performing the preventive work for itself or others, and it should not be that
many pages that it gives VACUUM too much work.
A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes. Still, this is a good
backup if the first option measurably slows INSERT throughput.
I can try my hand at such a patch if it looks like a good idea.
Please do.
Thank you very much for your insights.
Thanks for the detailed report that made it possible.
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
Does that procedure alone reproduce the ERROR, without any replica promote,
postmaster restart, etc.? That, I do not have reason to expect.
Oh no, sorry for not being clear: it needs to go through recovery, either
through a backup restoration, a replica or anything else. This is why I've
been struggling to reproduce the issue; it didn't occur to me that the FSM was
only sane on the primary, as the extensions were not WAL-logged.
I think the fix is one of:
- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
returning a now-too-large block number.
Performance wise, that seems to be the better answer
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.
Makes sense.
but won't this kind of
built-in resilience encourage subtle bugs creeping in ?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs. I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule. We'll
have plenty of need for resilience. Hence, I'd lean toward this approach.
- Enforce a new "main-fork WAL before FSM" rule for logged rels. For
example, in each PageIsNew() case, either don't update FSM or WAL-log an
init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we
intentionally don't add them to the FSM. So having VACUUM WAL-log the page
in lazy_scan_new_or_empty in the New case as you propose seems a very
good idea.

Performance wise that would keep the WAL logging outside of the backend
performing the preventive work for itself or others, and it should not be
that many pages that it gives VACUUM too much work.

A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes. Still, this is a good
backup if the first option measurably slows INSERT throughput.
Would it actually ? At least for the currently discussed VACUUM case, we will
fall back to a generic FPI for the new page, which should work on any AM. But
the requirement to WAL log the page before recording it in the FSM should be
publicized, and the common infrastructure should adhere to it. I'd have to
look into the other cases you mentioned where that can happen.
I can try my hand at such a patch if it looks like a good idea.
Please do.
Thank you very much for your insights.
Thanks for the detailed report that made it possible.
On Tue, Mar 05, 2024 at 12:33:13AM +0100, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
- Enforce a new "main-fork WAL before FSM" rule for logged rels. For
example, in each PageIsNew() case, either don't update FSM or WAL-log an
init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we
intentionally don't add them to the FSM. So having VACUUM WAL-log the page
in lazy_scan_new_or_empty in the New case as you propose seems a very
good idea.

Performance wise that would keep the WAL logging outside of the backend
performing the preventive work for itself or others, and it should not be
that many pages that it gives VACUUM too much work.

A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes. Still, this is a good
backup if the first option measurably slows INSERT throughput.

Would it actually ? At least for the currently discussed VACUUM case, we will
fall back to a generic FPI for the new page, which should work on any AM. But
the requirement to WAL log the page before recording it in the FSM should be
publicized, and the common infrastructure should adhere to it.
I agree each AM can use generic FPI code. I'm looking at code like this as
needing change under this candidate design:
static void
spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
{
...
if (PageIsNew(page) || PageIsEmpty(page))
{
RecordFreeIndexPage(index, blkno);
The change there would be fairly simple, but non-core access methods may
require similar simple changes. That's fine if this approach has other
reasons to win, but it is a drawback. A grep of pgxn code shows "rum" as the
only module containing a RecordFreeIndexPage() call. No module contains a
RecordPageWithFreeSpace() call. So this drawback is all but negligible.
On Mon, Mar 04, 2024 at 04:13:46PM -0800, Noah Misch wrote:
I agree each AM can use generic FPI code. I'm looking at code like this as
needing change under this candidate design:

static void
spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
{
...
if (PageIsNew(page) || PageIsEmpty(page))
{
RecordFreeIndexPage(index, blkno);

The change there would be fairly simple, but non-core access methods may
require similar simple changes. That's fine if this approach has other
reasons to win, but it is a drawback. A grep of pgxn code shows "rum" as the
only module containing a RecordFreeIndexPage() call. No module contains a
RecordPageWithFreeSpace() call. So this drawback is all but negligible.
This design does not sound that bad to me, FWIW, if what you are
discussing impacts data inserts enough to be noticeable in the default
cases. There are not that many out-of-core index AMs out there, and
"rum" is as far as I know not supported by any of the major cloud
vendors. So contacting the authors of such AMs and raising awareness
would be enough. So I would worry much more about performance. My
2c.
(You should be able to implement a cheap test using two injection
points, from what I can read, if you need coordination between three
actions.)
--
Michael
On Tuesday, March 5, 2024 at 01:56:12 CET, Michael Paquier wrote:
(You should be able to implement a cheap test using two injection
points, from what I can read, if you need coordination between three
actions.)
I'm sorry, but I don't understand this sentence at all. Would you mind
elaborating?
Thanks !
--
Michael
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.
I ended up implementing this in the attached patch. The idea is that we detect
if the FSM returns a page past the end of the relation, and ignore it.
In that case we will fallback through the extension mechanism.
For the corrupted-FSM case it is not great performance-wise, as we will extend
the relation in small steps every time we find a non-existing block in the FSM,
until the actual relation size matches what is recorded in the FSM. But since
those cases seldom happen, I figured it was better to keep the code really simple
for a bugfix.
I wanted to test the impact in terms of performance, and I thought about the
worst possible case for this.
Then, run a pgbench doing insertions into the table. With the attached patch,
the worst case I could come up with is that for every insertion we:
- remember which page we last inserted into
- notice we don't have enough space
- ask the FSM for a block
- now have to compare that to the actual relation size
So I came up with the following initialization steps:
- create a table with vacuum_truncate = off, with a tuple size big enough that
it's impossible to fit two tuples on the same page
- insert lots of tuples in it until it reaches a decent size
- delete them all
- vacuum
- all of this fitting in shared_buffers
As in:
CREATE TABLE test_perf (c1 char(5000));
ALTER TABLE test_perf ALTER c1 SET STORAGE PLAIN;
ALTER TABLE test_perf SET (VACUUM_TRUNCATE = off);
INSERT INTO test_perf (c1) SELECT 'c' FROM generate_series(1, 1000000);
DELETE FROM test_perf;
VACUUM test_perf;
Then I ran pgbench with a single client, with a script only inserting the same
value over and over again, for 1000000 transactions (initial table size).
I noticed no difference running with or without the patch, but maybe someone
else can try to run that or find another adversarial case ?
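For reference, the custom script boils down to a single statement; the file
name, and an invocation along the lines of pgbench -n -c 1 -t 1000000 -f
insert_only.sql, are my shorthand rather than the exact command used:

-- insert_only.sql: one INSERT per transaction; with PLAIN storage every row
-- fills a page, so each insertion has to go through the FSM lookup path
-- touched by the patch.
INSERT INTO test_perf (c1) VALUES ('c');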
Best regards,
--
Ronan Dunklau
Attachments:
0001-Detect-invalid-FSM-when-finding-a-block.patch (text/x-patch, +19/-2)
On 3/6/24 10:31, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

I ended up implementing this in the attached patch. The idea is that we detect
if the FSM returns a page past the end of the relation, and ignore it.
In that case we will fallback through the extension mechanism.
@@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
+ nblocks = RelationGetNumberOfBlocks(relation);
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
I'd move the fetching of nblocks to after getting the targetBlock. This
avoids extending the relation in the unlikely event of a FSM/relation
extension in-between those two calls.
Patrick
On Wednesday, March 6, 2024 at 11:59:43 CET, Patrick Stählin wrote:
On 3/6/24 10:31, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

I ended up implementing this in the attached patch. The idea is that we
detect if the FSM returns a page past the end of the relation, and ignore
it. In that case we will fallback through the extension mechanism.

@@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
+ nblocks = RelationGetNumberOfBlocks(relation);
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
I'd move the fetching of nblocks to after getting the targetBlock. This
avoids extending the relation in the unlikely event of a FSM/relation
extension in-between those two calls.
Good catch. Please find attached v2 of the patch.
--
Ronan Dunklau
Attachments:
0002-Detect-invalid-FSM-when-finding-a-block.patch (text/x-patch, +19/-2)
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

I ended up implementing this in the attached patch. The idea is that we detect
if the FSM returns a page past the end of the relation, and ignore it.
In that case we will fallback through the extension mechanism.

For the corrupted-FSM case it is not great performance wise, as we will extend
the relation in small steps every time we find a non existing block in the FSM,
until the actual relation size matches what is recorded in the FSM. But since
those seldom happen, I figured it was better to keep the code really simple for
a bugfix.
That could be fine. Before your message, I assumed we would treat this like
the case of a full page. That is, we'd call RecordAndGetPageWithFreeSpace(),
which would remove the FSM entry and possibly return another. That could be
problematic if another backend is extending the relation; we don't want to
erase the extender's new FSM entry. I'm parking this topic for the moment.
I wanted to test the impact in terms of performance, and I thought about the
worst possible case for this.
[test details]
I noticed no difference running with or without the patch, but maybe someone
else can try to run that or find another adversarial case ?
The patch currently adds an lseek() whenever the FSM finds a block. I see
relevant-looking posts from mailing list searches with subsets of these terms:
RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should reduce
this cost add. Some ideas:
- No need for a system call if the FSM-returned value is low enough with
respect to any prior RelationGetNumberOfBlocks() call.
- We'd be happy and efficient with a ReadBufferMode such that
ReadBufferExtended() returns NULL after a 0-byte read, with all other errors
handled normally. That sounds like the specification of RBM_NORMAL_NO_LOG.
Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM
could satisfy both sets of needs.
On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
Whenever the FSM points to a block past the end of the relation, detect
it and fallback to the relation extension mechanism.

This may arise when a FPI for the FSM has been triggered, and we end up
WAL-logging a page of the FSM pointing to newly extended blocks in the
relation which have never been WAL-logged.
---
src/backend/access/heap/hio.c | 20 +++++++++++++++++++-
Each GetPageWithFreeSpace() caller would need this, not just heap. Please
evaluate whether this logic belongs inside freespace.c vs. in each caller.
freespace.c comments and/or freespace/README may need updates. Please
evaluate header comments of freespace.c functions whose callers you are
changing, and evaluate comments about truncation.
On Wednesday, March 6, 2024 at 19:42:28 CET, you wrote:
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

I ended up implementing this in the attached patch. The idea is that we
detect if the FSM returns a page past the end of the relation, and ignore
it. In that case we will fallback through the extension mechanism.

For the corrupted-FSM case it is not great performance wise, as we will
extend the relation in small steps every time we find a non existing
block in the FSM, until the actual relation size matches what is recorded
in the FSM. But since those seldom happen, I figured it was better to
keep the code really simple for a bugfix.

That could be fine. Before your message, I assumed we would treat this like
the case of a full page. That is, we'd call
RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
possibly return another. That could be problematic if another backend is
extending the relation; we don't want to erase the extender's new FSM
entry. I'm parking this topic for the moment.
I would argue this is fine, as corruptions don't happen often, and FSM is not
an exact thing anyway. If we record a block as full, it will just be unused
until the next vacuum adds it back to the FSM with a correct value.
I wanted to test the impact in terms of performance, and I thought about
the worst possible case for this.

[test details]
I noticed no difference running with or without the patch, but maybe
someone else can try to run that or find another adversarial case ?

The patch currently adds an lseek() whenever the FSM finds a block. I see
relevant-looking posts from mailing list searches with subsets of these
terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should
reduce this cost add. Some ideas:

- No need for a system call if the FSM-returned value is low enough with
respect to any prior RelationGetNumberOfBlocks() call.
I'm not sure what you mean by "low enough". What I chose to implement instead
is to rely on the cached value even when not in recovery for that particular
case. The reasoning being, as above, that if someone extended the relation
since the last cached value, worst case scenario is that we do not insert into
that old block. Whenever we detect a case like this, we just erase the FSM
entry to mark the block as full, so that it's not picked up again next time.
- We'd be happy and efficient with a ReadBufferMode such that
ReadBufferExtended() returns NULL after a 0-byte read, with all other
errors handled normally. That sounds like the specification of
RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
but perhaps one RBM could satisfy both sets of needs.
I'm not sure about this one; it seems to have much farther-reaching
consequences. This would mean we don't have any cooperation from the FSM, and
would need to implement error recovery scenarios in each caller. But I admit I
haven't looked too much into it, and focused instead in seeing if the current
approach can get us to an acceptable state.
On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
Whenever the FSM points to a block past the end of the relation, detect
it and fallback to the relation extension mechanism.

This may arise when a FPI for the FSM has been triggered, and we end up
WAL-logging a page of the FSM pointing to newly extended blocks in the
relation which have never been WAL-logged.
---src/backend/access/heap/hio.c | 20 +++++++++++++++++++-
Each GetPageWithFreeSpace() caller would need this, not just heap. Please
evaluate whether this logic belongs inside freespace.c vs. in each caller.
freespace.c comments and/or freespace/README may need updates. Please
evaluate header comments of freespace.c functions whose callers you are
changing, and evaluate comments about truncation.
I implemented the new behaviour in fsm_search and
RecordAndGetPageWithFreeSpace.
Regards,
--
Ronan
Attachments:
v2-0001-Detect-FSM-corruption-and-repair-it.patch (text/x-patch, +81/-9)
On Thu, Mar 07, 2024 at 12:21:24PM +0100, Ronan Dunklau wrote:
On Wednesday, March 6, 2024 at 19:42:28 CET, you wrote:
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
On Tuesday, March 5, 2024 at 00:05:03 CET, Noah Misch wrote:
I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

I ended up implementing this in the attached patch. The idea is that we
detect if the FSM returns a page past the end of the relation, and ignore
it. In that case we will fallback through the extension mechanism.

For the corrupted-FSM case it is not great performance wise, as we will
extend the relation in small steps every time we find a non existing
block in the FSM, until the actual relation size matches what is recorded
in the FSM. But since those seldom happen, I figured it was better to
keep the code really simple for a bugfix.

That could be fine. Before your message, I assumed we would treat this like
the case of a full page. That is, we'd call
RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
possibly return another. That could be problematic if another backend is
extending the relation; we don't want to erase the extender's new FSM
entry. I'm parking this topic for the moment.

I would argue this is fine, as corruptions don't happen often, and FSM is not
an exact thing anyway. If we record a block as full, it will just be unused
until the next vacuum adds it back to the FSM with a correct value.
If this happened only under FSM corruption, it would be fine. I was thinking
about normal extension, with an event sequence like this:
- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM. In each of those backends,
fsm_does_block_exist() returns false, because the cached relation size is
stale. Those pages remain unused until VACUUM re-records them in the FSM.
PostgreSQL added the multiple-block extension logic (commit 719c84c) from
hard-won experience bottlenecking there. If fixing $SUBJECT will lose 1% of
that change's benefit, that's probably okay. If it loses 20%, we should work
to do better. While we could rerun the 2016 benchmark to see how the patch
regresses it, there might be a simple fix. fsm_does_block_exist() could skip
the cache when blknumber>=cached, like this:
return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
blknumber < RelationGetNumberOfBlocks(rel));
How do you see it? That still leaves us with an avoidable lseek in B2, B3,
... B11, but that feels tolerable. I don't know how to do better without
switching to the ReadBufferMode approach.
The patch currently adds an lseek() whenever the FSM finds a block. I see
relevant-looking posts from mailing list searches with subsets of these
terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should
reduce this cost add. Some ideas:

- No need for a system call if the FSM-returned value is low enough with
respect to any prior RelationGetNumberOfBlocks() call.
I'm not sure what you mean by "low enough". What I chose to implement instead
I just meant "less than". Specifically, if "fsm_returned_value < MAX(all
RelationGetNumberOfBlocks() calls done on this relation, since last postmaster
start)" then the FSM-returned value won't have the problem seen in $SUBJECT.
- We'd be happy and efficient with a ReadBufferMode such that
ReadBufferExtended() returns NULL after a 0-byte read, with all other
errors handled normally. That sounds like the specification of
RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
but perhaps one RBM could satisfy both sets of needs.

I'm not sure about this one, this seems to have far more reaching
consequences. This would mean we don't have any cooperation from the FSM, and
would need to implement error recovery scenarios in each caller. But I admit I
haven't looked too much into it, and focused instead in seeing if the current
approach can get us to an acceptable state.
I, too, am not sure. I agree that RBM approach wouldn't let you isolate the
changes like the last patch does. Each GetFreeIndexPage() caller would need
code for the return-NULL case. Avoiding that is nice. (If we ever do the RBM
change in the future, the fsm_readbuf() and vm_readbuf() !extend cases should
also use it.)
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c

+	if (addr.level == FSM_BOTTOM_LEVEL) {
+		BlockNumber blkno = fsm_get_heap_blk(addr, slot);
+		Page		page;
+		/*
+		 * If the buffer is past the end of the relation,
+		 * update the page and restarts from the root
+		 */
+		if (fsm_does_block_exist(rel, blkno))
+		{
+			ReleaseBuffer(buf);
+			return blkno;
+		}
+		page = BufferGetPage(buf);
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		fsm_set_avail(page, slot, 0);
+		MarkBufferDirtyHint(buf, false);
I wondered if we should zero all slots on the same FSM page that correspond to
main fork blocks past the end of the main fork. The usual "advancenext"
behavior is poor for this case, since every later slot is invalid in the same way.
My current thinking: no, it's not worth the code bulk to do better. A single
extension caps at 64 blocks, so the worst case is roughly locking the buffer
64 times and restarting from FSM root 64 times. For something this
infrequent, that's not bad.
Thanks,
nm
On Tuesday, March 12, 2024 at 23:08:07 CET, Noah Misch wrote:
I would argue this is fine, as corruptions don't happen often, and FSM is
not an exact thing anyway. If we record a block as full, it will just be
unused until the next vacuum adds it back to the FSM with a correct
value.

If this happened only under FSM corruption, it would be fine. I was
thinking about normal extension, with an event sequence like this:

- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM. In each of those
backends, fsm_does_block_exist() returns false, because the cached relation
size is stale. Those pages remain unused until VACUUM re-records them in
the FSM.

PostgreSQL added the multiple-block extension logic (commit 719c84c) from
hard-won experience bottlenecking there. If fixing $SUBJECT will lose 1% of
that change's benefit, that's probably okay. If it loses 20%, we should
work to do better. While we could rerun the 2016 benchmark to see how the
patch regresses it, there might be a simple fix. fsm_does_block_exist()
could skip the cache when blknumber>=cached, like this:

return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
        blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
       blknumber < RelationGetNumberOfBlocks(rel));

How do you see it? That still leaves us with an avoidable lseek in B2, B3,
... B11, but that feels tolerable. I don't know how to do better without
switching to the ReadBufferMode approach.
Hi Noah and thanks for your feedback.
That makes sense. I'm a bit worried about triggering additional lseeks when we
in fact have other free pages in the map, that would pass the test. I'm not
sure how relevant it is given the way we search the FSM with fp_next_slot
though...
To address that, I've given a bit of thought to enabling / disabling the
auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish the
cases where we know we have a somewhat up-to-date value from the cases
where we don't (as in, for heap, try without repair, then get an up-to-date
value to try the last block, and if we need to look at the FSM again then ask
for it to be repaired), but it brings way too much complexity and would need
careful thought for each AM.
So please find attached a patch with the change you propose.
Do you have a link to the benchmark you mention, though, so I can evaluate the
patch against it?
The patch currently adds an lseek() whenever the FSM finds a block. I see
relevant-looking posts from mailing list searches with subsets of these
terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should
reduce this cost add. Some ideas:

- No need for a system call if the FSM-returned value is low enough with
respect to any prior RelationGetNumberOfBlocks() call.
I'm not sure what you mean by "low enough". What I chose to implement instead

I just meant "less than". Specifically, if "fsm_returned_value < MAX(all
RelationGetNumberOfBlocks() calls done on this relation, since last
postmaster start)" then the FSM-returned value won't have the problem seen
in $SUBJECT.

- We'd be happy and efficient with a ReadBufferMode such that
ReadBufferExtended() returns NULL after a 0-byte read, with all other
errors handled normally. That sounds like the specification of
RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
but perhaps one RBM could satisfy both sets of needs.

I'm not sure about this one, this seems to have far more reaching
consequences. This would mean we don't have any cooperation from the FSM,
and would need to implement error recovery scenarios in each caller. But
I admit I haven't looked too much into it, and focused instead in seeing
if the current approach can get us to an acceptable state.

I, too, am not sure. I agree that RBM approach wouldn't let you isolate the
changes like the last patch does. Each GetFreeIndexPage() caller would
need code for the return-NULL case. Avoiding that is nice. (If we ever do
the RBM change in the future, the fsm_readbuf() and vm_readbuf() !extend
cases should also use it.)

--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c

+	if (addr.level == FSM_BOTTOM_LEVEL) {
+		BlockNumber blkno = fsm_get_heap_blk(addr, slot);
+		Page		page;
+		/*
+		 * If the buffer is past the end of the relation,
+		 * update the page and restarts from the root
+		 */
+		if (fsm_does_block_exist(rel, blkno))
+		{
+			ReleaseBuffer(buf);
+			return blkno;
+		}
+		page = BufferGetPage(buf);
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		fsm_set_avail(page, slot, 0);
+		MarkBufferDirtyHint(buf, false);

I wondered if we should zero all slots on the same FSM page that correspond
to main fork blocks past the end of the main fork. The usual "advancenext"
behavior is poor for this case, since every later slot is invalid in the same
way. My current thinking: no, it's not worth the code bulk to do better. A
single extension caps at 64 blocks, so the worst case is roughly locking
the buffer 64 times and restarting from FSM root 64 times. For something
this infrequent, that's not bad.
That was my reasoning as well. Paying a small performance penalty in the
unlikely case we hit a corruption and get done with it seems correct.
Thanks,
nm
Attachments:
v3-0001-Detect-FSM-corruption-and-repair-it.patch (text/x-patch, +76/-9)
On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
On Tuesday, March 12, 2024 at 23:08:07 CET, Noah Misch wrote:
I would argue this is fine, as corruptions don't happen often, and FSM is
not an exact thing anyway. If we record a block as full, it will just be
unused until the next vacuum adds it back to the FSM with a correct
value.

If this happened only under FSM corruption, it would be fine. I was
thinking about normal extension, with an event sequence like this:

- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM. In each of those
backends, fsm_does_block_exist() returns false, because the cached relation
size is stale. Those pages remain unused until VACUUM re-records them in
the FSM.

PostgreSQL added the multiple-block extension logic (commit 719c84c) from
hard-won experience bottlenecking there. If fixing $SUBJECT will lose 1% of
that change's benefit, that's probably okay. If it loses 20%, we should
work to do better. While we could rerun the 2016 benchmark to see how the
patch regresses it, there might be a simple fix. fsm_does_block_exist()
could skip the cache when blknumber>=cached, like this:

return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
        blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
       blknumber < RelationGetNumberOfBlocks(rel));

How do you see it? That still leaves us with an avoidable lseek in B2, B3,
... B11, but that feels tolerable. I don't know how to do better without
switching to the ReadBufferMode approach.
That makes sense. I'm a bit worried about triggering additional lseeks when we
in fact have other free pages in the map, that would pass the test. I'm not
sure how relevant it is given the way we search the FSM with fp_next_slot
though...
That's a reasonable thing to worry about. We could do wrong by trying too
hard to use an FSM slot, and we could do wrong by not trying hard enough.
To address that, I've given a bit of thought about enabling / disabling the
auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish the
cases where we know we have a somewhat up-to-date value compared to the case
where we don't (as in, for heap, try without repair, then get an uptodate
value to try the last block, and if we need to look at the FSM again then ask
for it to be repaired) but it brings way too much complexity and would need
careful thought for each AM.

So please find attached a patch with the change you propose.
Do you have a link to the benchmark you mention though to evaluate the patch
against it ?
Here is one from the thread that created commit 719c84c:
/messages/by-id/CA+Tgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g@mail.gmail.com
There may be other benchmarks earlier in that thread.