pg15b3: recovery fails with wal prefetch enabled

Started by Justin Pryzby over 3 years ago, 25 messages
#1 Justin Pryzby
pryzby@telsasoft.com

An internal VM crashed last night due to OOM.

When I tried to start postgres, it failed like:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

--
Justin

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#1)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a
vm fork page (which recovery prefetch should ignore completely). Did
you happen to get a copy before the successful recovery? After the
successful recovery, what LSN does that page have, and can you find
the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
-F vm? Have you turned FPW off (perhaps this is on ZFS?)?

#3 Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#2)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote:

On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a
vm fork page (which recovery prefetch should ignore completely). Did
you happen to get a copy before the successful recovery? After the
successful recovery, what LSN does that page have, and can you find
the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
-F vm? Have you turned FPW off (perhaps this is on ZFS?)?

Yes, I have a copy that reproduces the issue:

#1 0x00000000009a0df4 in errfinish (filename=<optimized out>, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588
#2 0x000000000055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668
#3 0x0000000000813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=<optimized out>) at bufmgr.c:2889
#4 0x0000000000817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffffffcdf0) at bufmgr.c:2576
#5 0x00000000008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164
#6 0x00000000008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743
#7 0x00000000005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855
#8 0x000000000055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534
#9 0x00000000007aceaa in CheckpointerMain () at checkpointer.c:455
#10 0x00000000007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153
#11 0x00000000007b0bd8 in StartChildProcess (type=<optimized out>) at postmaster.c:5430
#12 0x00000000007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463
#13 0x00000000004986a6 in main (argc=7, argv=0xf139e0) at main.c:202

It's not on ZFS, and FPW has never been turned off.

I should add that this instance has been pg_upgraded since v10.

BTW, base/16881 is the postgres DB (which has 43GB of logfiles imported from
CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings,
pg_stat_bgwriter, pg_stat_database, pg_stat_wal).

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+------------
1201/1CDD1F98 | -6200 | 1 | 44 | 424 | 8192 | 8192 | 4 | 3681043287
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
1201/1CAF84F0 | -6010 | 0 | 24 | 8192 | 8192 | 8192 | 4 | 0

I found this in waldump (note that you had a typo - it's 16881).

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF5FB0, prev 1201/1CAF4E08, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 5
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7320, prev 1201/1CAF5FF0, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 6
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7398, prev 1201/1CAF7360, desc: VISIBLE cutoff xid 3679751919 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 11
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7410, prev 1201/1CAF73D8, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 17
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7488, prev 1201/1CAF7450, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 19
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7500, prev 1201/1CAF74C8, desc: VISIBLE cutoff xid 3645406844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 23
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7578, prev 1201/1CAF7540, desc: VISIBLE cutoff xid 3669978567 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 24
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF75F0, prev 1201/1CAF75B8, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 25
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7668, prev 1201/1CAF7630, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 26
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF76E0, prev 1201/1CAF76A8, desc: VISIBLE cutoff xid 3671911724 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 27
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7758, prev 1201/1CAF7720, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 34
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF77D0, prev 1201/1CAF7798, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 35
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7EF8, prev 1201/1CAF7810, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 37
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7F70, prev 1201/1CAF7F38, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 38
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7FE8, prev 1201/1CAF7FB0, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 39
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8078, prev 1201/1CAF8040, desc: VISIBLE cutoff xid 3678237783 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 41
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF80F0, prev 1201/1CAF80B8, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 42
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8168, prev 1201/1CAF8130, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 43
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF81E0, prev 1201/1CAF81A8, desc: VISIBLE cutoff xid 3667994218 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 44
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8258, prev 1201/1CAF8220, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 45
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF82D0, prev 1201/1CAF8298, desc: VISIBLE cutoff xid 3673830395 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 48
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8348, prev 1201/1CAF8310, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 50
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF83C0, prev 1201/1CAF8388, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 51
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8438, prev 1201/1CAF8400, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 52
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

I could send our WAL to you if that's desirable ..

--
Justin

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#3)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 12:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Yes, I have a copy that reproduces the issue:

That's good news.

So the last record touching that page was:

rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53

I think the expected LSN for that page is past the end of that record,
so 0x1CAF84B0 + 59 = 0x1caf84eb which rounds up to 0x1CAF84F0, and
indeed we see that in the restored page when recovery succeeds.
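
The same arithmetic as a tiny standalone C sketch (an assumption here: 8-byte
MAXALIGN-style rounding of the record's end, which is how the start of the
following record is aligned):

#include <stdint.h>
#include <stdio.h>

/* Round up to the next 8-byte boundary, standing in for MAXALIGN(). */
#define ALIGN8(x) (((x) + 7) & ~(uint64_t) 7)

int
main(void)
{
	uint64_t	rec_lsn = 0x12011CAF84B0;	/* LSN of the last VISIBLE record */
	uint64_t	rec_len = 59;				/* rec/tot length from pg_waldump */
	uint64_t	page_lsn = ALIGN8(rec_lsn + rec_len);

	/* prints "expected page LSN: 1201/1CAF84F0" */
	printf("expected page LSN: %X/%08X\n",
		   (unsigned) (page_lsn >> 32), (unsigned) page_lsn);
	return 0;
}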

Next question: why do we think the WAL finishes at 1201/1CADB730 while
running that checkpoint? Looking...

#5 Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#2)
Re: pg15b3: recovery fails with wal prefetch enabled

Some more details, in case they're important:

First: the server has wal_compression=zstd (I wonder if something
doesn't allow/accommodate compressed FPI?)

I only thought to mention that after compiling pg15 locally and forgetting
to use --with-zstd.

I compiled it to enable your debug logging, which wrote these during
recovery:

< 2022-08-31 21:17:01.807 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 21:17:01.903 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 21:17:02.029 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small

Also, pg_waldump seems to fail early with -w:
[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

Also, the VM has crashed with OOM before, while running pg15, with no issue in
recovery. I haven't been able to track down the cause..

The VM is running: kernel-3.10.0-1160.66.1.el7.x86_64

pgsql is an ext4 FS (no tablespaces), which is a qemu block device
exposed like:

<driver name='qemu' type='raw' cache='none' io='native'/>
<target dev='vdg' bus='virtio'/>

It's nowhere near full:

/dev/vdc 96G 51G 46G 53% /var/lib/pgsql

#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#2)
Re: pg15b3: recovery fails with wal prefetch enabled

At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

Just for information, there was a recently fixed bug in the
overwrite-aborted-contrecord feature, which causes this kind of
failure (xlog flush request exceeds the insertion bleeding edge). If it
is that, it was fixed by 6672d79139 two days ago.

/messages/by-id/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
/messages/by-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3@amazon.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Kyotaro Horiguchi (#6)
1 attachment(s)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

Just for information, there was a recently fixed bug in the
overwrite-aborted-contrecord feature, which causes this kind of
failure (xlog flush request exceeds the insertion bleeding edge). If it
is that, it was fixed by 6672d79139 two days ago.

Hmm. Justin, when you built from source, which commit were you at?
If it's REL_15_BETA3, any chance you could cherry pick that change and
check what happens? And without that, could you show what this logs
for good and bad recovery settings?

Attachments:

debug.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 248a40e8fa..1e435e81d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5308,6 +5308,7 @@ StartupXLOG(void)
 	 */
 	endOfRecoveryInfo = FinishWalRecovery();
 	EndOfLog = endOfRecoveryInfo->endOfLog;
+elog(LOG, "XXX point 1: EndOfLog = %lx", EndOfLog);
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
@@ -5446,7 +5447,9 @@ StartupXLOG(void)
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
+elog(LOG, "XXX clobbering EndOfLog");
 	}
+elog(LOG, "XXX point 2: EndOfLog = %lx", EndOfLog);
 
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
@@ -5456,6 +5459,7 @@ StartupXLOG(void)
 	Insert = &XLogCtl->Insert;
 	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+elog(LOG, "XXX point 3: Insert->CurrBytePos = %lx", Insert->CurrBytePos);
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -5554,6 +5558,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	UpdateFullPageWrites();
 
+elog(LOG, "XXX point 4: Insert->CurrBytePos = %lx", Insert->CurrBytePos);
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
 	 */
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 141af04388..8c6753a8a8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1434,6 +1434,8 @@ FinishWalRecovery(void)
 	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
 	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
+elog(LOG, "point 0: lastRec = %lx", lastRec);
+elog(LOG, "point 0: endOfLog = %lx", endOfLog);
 
 	/*
 	 * Remember the TLI in the filename of the XLOG segment containing the
#8 Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#7)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:

On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much. The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

Just for information, there was a recently fixed bug in the
overwrite-aborted-contrecord feature, which causes this kind of
failure (xlog flush request exceeds the insertion bleeding edge). If it
is that, it was fixed by 6672d79139 two days ago.

Hmm. Justin, when you built from source, which commit were you at?
If it's REL_15_BETA3,

No - it's:
commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

If it's REL_15_BETA3, any chance you could cherry pick that change and
check what happens? And without that, could you show what this logs
for good and bad recovery settings?

I wasn't sure what you meant by "without that", so here's a bunch of logs to
sift through:

At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:31:38.690 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:31:40.204 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:31:40.307 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:31:40.493 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:31:40.721 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed: 2.03 s
< 2022-08-31 23:31:41.452 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:31:41.698 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.698 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.699 CDT >FATAL: checkpoint request failed
< 2022-08-31 23:31:41.699 CDT >HINT: Consult recent messages in the server log for details.
< 2022-08-31 23:31:41.704 CDT >LOG: startup process (PID 25046) exited with exit code 1
< 2022-08-31 23:31:41.704 CDT >LOG: terminating any other active server processes
< 2022-08-31 23:31:41.705 CDT >LOG: shutting down due to startup process failure
< 2022-08-31 23:31:41.731 CDT >LOG: database system is shut down

With your patch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:34:22.897 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:34:23.146 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:34:23.147 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:34:23.268 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:34:23.323 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed: 0.42 s
< 2022-08-31 23:34:23.323 CDT >LOG: point 0: lastRec = 12011cadb300
< 2022-08-31 23:34:23.323 CDT >LOG: point 0: endOfLog = 12011cadb730
< 2022-08-31 23:34:23.323 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.386 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.387 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.565 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.606 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:34:23.767 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.767 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.768 CDT >FATAL: checkpoint request failed

And without prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:37:08.792 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:37:09.269 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:37:09.269 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.35 s, system: 0.11 s, elapsed: 0.47 s
< 2022-08-31 23:37:09.269 CDT >LOG: point 0: lastRec = 12011cd90e48
< 2022-08-31 23:37:09.269 CDT >LOG: point 0: endOfLog = 12011cd91010
< 2022-08-31 23:37:09.269 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.420 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.552 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:37:12.987 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.403 s, sync=2.841 s, total=3.566 s; sync files=102, longest=2.808 s, average=0.028 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:37:13.077 CDT >LOG: database system is ready to accept connections

If I revert 6672d79139 (and roll back to the unrecovered state):

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 # -c recovery_prefetch=no
...
< 2022-08-31 23:42:40.592 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:42:42.168 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:42:42.238 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:42:42.405 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:42:42.602 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.25 s, elapsed: 2.01 s
< 2022-08-31 23:42:42.602 CDT >LOG: point 0: lastRec = 12011cadb300
< 2022-08-31 23:42:42.602 CDT >LOG: point 0: endOfLog = 12011cadb730
< 2022-08-31 23:42:42.602 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.194 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.266 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:42:43.425 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT >FATAL: checkpoint request failed
< 2022-08-31 23:42:43.425 CDT >HINT: Consult recent messages in the server log for details.
< 2022-08-31 23:42:43.431 CDT >LOG: startup process (PID 2415) exited with exit code 1
< 2022-08-31 23:42:43.431 CDT >LOG: terminating any other active server processes
< 2022-08-31 23:42:43.432 CDT >LOG: shutting down due to startup process failure
< 2022-08-31 23:42:43.451 CDT >LOG: database system is shut down

If I revert 6672d79139 and disable prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:43:25.711 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:43:26.178 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:43:26.178 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.33 s, system: 0.11 s, elapsed: 0.46 s
< 2022-08-31 23:43:26.178 CDT >LOG: point 0: lastRec = 12011cd90e48
< 2022-08-31 23:43:26.178 CDT >LOG: point 0: endOfLog = 12011cd91010
< 2022-08-31 23:43:26.178 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.433 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.490 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:43:29.519 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.380 s, sync=2.492 s, total=3.086 s; sync files=102, longest=2.438 s, average=0.025 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:43:29.567 CDT >LOG: database system is ready to accept connections

#9 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Justin Pryzby (#8)
Re: pg15b3: recovery fails with wal prefetch enabled

At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in

On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:

On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Just for information, there was a recently fixed bug in the
overwrite-aborted-contrecord feature, which causes this kind of
failure (xlog flush request exceeds the insertion bleeding edge). If it
is that, it was fixed by 6672d79139 two days ago.

Hmm. Justin, when you built from source, which commit were you at?
If it's REL_15_BETA3,

No - it's:
commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

It's newer than eb29fa3889 (6672d79139 on master), so it already
includes that fix.

If it's REL_15_BETA3, any chance you could cherry pick that change and
check what happens? And without that, could you show what this logs
for good and bad recovery settings?

I wasn't sure what you meant by "without that", so here's a bunch of logs to
sift through:

There's no need for cherry-picking, then.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Kyotaro Horiguchi (#9)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 5:18 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in

On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:

Hmm. Justin, when you built from source, which commit were you at?
If it's REL_15_BETA3,

No - it's:
commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

It's newer than eb29fa3889 (6672d79139 on master), so it already
includes that fix.

Yeah.

I wasn't sure what you meant by "without that", so here's a bunch of logs to
sift through:

So it *looks* like it finished early (and without the expected
error?). But it also looks like it replayed that record, according to
the page LSN. So which is it? Could you recompile with WAL_DEBUG
defined in pg_config_manual.h, and run recovery with wal_debug = on,
and see if it replays 1CAF84B0?

#11 Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#10)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 01, 2022 at 05:35:23PM +1200, Thomas Munro wrote:

So it *looks* like it finished early (and without the expected
error?). But it also looks like it replayed that record, according to
the page LSN. So which is it? Could you recompile with WAL_DEBUG
defined in pg_config_manual.h, and run recovery with wal_debug = on,
and see if it replays 1CAF84B0?

This is with 6672d79139 un-reverted.

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on 2>&1 |grep 1CAF84B0 || echo not found
not found

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on -c recovery_prefetch=no 2>&1 |grep 1CAF84B0 || echo not found
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF8478; LSN 1201/1CAF84B0: prev 1201/1CAF8438; xid 0; len 2; blkref #0: rel 1663/16881/2840, blk 53 - Heap2/VACUUM: nunused 4
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84B0; LSN 1201/1CAF84F0: prev 1201/1CAF8478; xid 0; len 5; blkref #0: rel 1663/16881/2840, fork 2, blk 0; blkref #1: rel 1663/16881/2840, blk 53 - Heap2/VISIBLE: cutoff xid 3678741092 flags 0x01
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84F0; LSN 1201/1CAF8AC0: prev 1201/1CAF84B0; xid 0; len 2; blkref #0: rel 1663/16881/1259, blk 1 FPW, compression method: zstd - Heap/INPLACE: off 33

(Note that "compression method: zstd" is a local change to
xlog_block_info() which I just extracted from my original patch for
wal_compression, after forgetting to compile --with-zstd. I'll mail
about that at a later time...).

--
Justin

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#11)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

compression method: zstd

Ahh, problem repro'd here with WAL compression. More soon.

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: pg15b3: recovery fails with wal prefetch enabled

On Thu, Sep 1, 2022 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Ahh, problem repro'd here with WAL compression. More soon.

I followed some false trails for a while there, but I finally figured
out what's happening here after Justin kindly shared some files with
me. The active ingredient here is a setting of
maintenance_io_concurrency=0, which runs into a dumb accounting problem
of the fencepost variety and incorrectly concludes it's reached the
end early. Setting it to 3 or higher allows his system to complete
recovery. I'm working on a fix ASAP.

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
1 attachment(s)
Re: pg15b3: recovery fails with wal prefetch enabled

On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:

... The active ingredient here is a setting of
maintenance_io_concurrency=0, which runs into a dumb accounting problem
of the fencepost variety and incorrectly concludes it's reached the
end early. Setting it to 3 or higher allows his system to complete
recovery. I'm working on a fix ASAP.

The short version is that when tracking the number of IOs in progress,
I had two steps in the wrong order in the algorithm for figuring out
whether IO is saturated. Internally, the effect of
maintenance_io_concurrency is clamped to 2 or more, and that mostly
hides the bug until you try to replay a particular sequence like
Justin's with such a low setting. Without that clamp, and if you set
it to 1, then several of our recovery tests fail.
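
To show the ordering problem in isolation, here's a toy sketch (not the real
lrq_* code; just an in-flight counter with a capacity standing in for
maintenance_io_concurrency). If we test for saturation before retiring the IO
that the just-replayed record completed, a capacity of 1 looks permanently
full and no further record can be queued:

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyQueue
{
	int			inflight;		/* prefetch IOs started but not yet retired */
	int			capacity;		/* stands in for maintenance_io_concurrency */
} ToyQueue;

/* The caller has replayed past some LSN, so 'done' earlier IOs are finished. */
static void
toy_complete(ToyQueue *q, int done)
{
	q->inflight -= done;
}

/* Try to start one more prefetch; refuse if the queue looks saturated. */
static bool
toy_start(ToyQueue *q)
{
	if (q->inflight >= q->capacity)
		return false;
	q->inflight++;
	return true;
}

int
main(void)
{
	ToyQueue	q = {1, 1};		/* one IO in flight, capacity 1 */

	/* Wrong order: check saturation before retiring the completed IO. */
	printf("start before completing: %s\n", toy_start(&q) ? "ok" : "stalled");

	/* Fixed order: retire completions first, then start the next IO. */
	toy_complete(&q, 1);
	printf("complete, then start:    %s\n", toy_start(&q) ? "ok" : "stalled");
	return 0;
}

With the wrong order the first call reports "stalled"; retiring completions
first lets the next prefetch start.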

That clamp was a bad idea. What I think we really want is for
maintenance_io_concurrency=0 to disable recovery prefetching exactly
as if you'd set recovery_prefetch=off, and any other setting including
1 to work without clamping.

Here's the patch I'm currently testing. It also fixes a related
dangling reference problem with very small maintenance_io_concurrency.

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl. I
think that failure case should report an error, no? I find it strange
that we end recovery in silence. That was a problem for the new
coding in this patch, because it is confused by XLREAD_FAIL without
queuing an error, and then retries, which clobbers the aborted recptr
state. I'm still looking into that.

Attachments:

0001-Fix-recovery_prefetch-with-low-maintenance_io_concur.patch (text/x-patch)
From fabc519d39adcb65996031059dbf42c8e3623764 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 5 Sep 2022 12:07:24 +1200
Subject: [PATCH] Fix recovery_prefetch with low maintenance_io_concurrency.

The LSN-based logic for knowing when IOs complete ran operations in the
wrong order.  We should process completed IOs first before trying to
start more, so that it is always possible to decode one more record when
the decoded record queue is empty, even if maintenance_io_concurrency is
set so low that a single earlier WAL record might have saturated the IO
queue.

That thinko was hidden because the effect of maintenance_io_concurrency
was arbitrarily clamped to be at least 2.  Fix the ordering, and also
remove that clamp.  We need a special case for 0, which is now treated
the same as recovery_prefetch=off, but otherwise the number is used
directly.  This allows for testing with 1, which would have made the
problem obvious in simple test scenarios.

Back-patch to 15.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
---
 src/backend/access/transam/xlogprefetcher.c | 58 ++++++++++++++-------
 src/backend/access/transam/xlogreader.c     |  4 ++
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 9aa56411d5..96038448cc 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -72,7 +72,9 @@
 int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
 
 #ifdef USE_PREFETCH
-#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#define RecoveryPrefetchEnabled() \
+		(recovery_prefetch != RECOVERY_PREFETCH_OFF && \
+		 maintenance_io_concurrency > 0)
 #else
 #define RecoveryPrefetchEnabled() false
 #endif
@@ -1000,7 +1002,8 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 
 		if (RecoveryPrefetchEnabled())
 		{
-			max_inflight = Max(maintenance_io_concurrency, 2);
+			Assert(maintenance_io_concurrency > 0);
+			max_inflight = maintenance_io_concurrency;
 			max_distance = max_inflight * XLOGPREFETCHER_DISTANCE_MULTIPLIER;
 		}
 		else
@@ -1018,14 +1021,39 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/*
-	 * Release last returned record, if there is one.  We need to do this so
-	 * that we can check for empty decode queue accurately.
+	 * Release last returned record, if there is one, and perform tasks that
+	 * we can do now that the caller has replayed it.
 	 */
-	XLogReleasePreviousRecord(prefetcher->reader);
+	if (likely(record = prefetcher->reader->record))
+	{
+		XLogRecPtr	replayed_up_to = record->next_lsn;
+
+		XLogReleasePreviousRecord(prefetcher->reader);
+
+		/*
+		 * Can we drop any filters yet?  If we were waiting for a relation to
+		 * be created or extended, it is now OK to access blocks in the covered
+		 * range.
+		 */
+		XLogPrefetcherCompleteFilters(prefetcher, replayed_up_to);
+
+		/*
+		 * All IO initiated by earlier WAL is now completed.  This might
+		 * trigger further prefetching.
+		 */
+		lrq_complete_lsn(prefetcher->streaming_read, replayed_up_to);
+	}
 
-	/* If there's nothing queued yet, then start prefetching. */
+	/*
+	 * If there's nothing queued yet, then start prefetching to cause at least
+	 * one record to be queued.
+	 */
 	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+	{
+		Assert(lrq_inflight(prefetcher->streaming_read) == 0);
+		Assert(lrq_completed(prefetcher->streaming_read) == 0);
 		lrq_prefetch(prefetcher->streaming_read);
+	}
 
 	/* Read the next record. */
 	record = XLogNextRecord(prefetcher->reader, errmsg);
@@ -1039,12 +1067,13 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	Assert(record == prefetcher->reader->record);
 
 	/*
-	 * Can we drop any prefetch filters yet, given the record we're about to
-	 * return?  This assumes that any records with earlier LSNs have been
-	 * replayed, so if we were waiting for a relation to be created or
-	 * extended, it is now OK to access blocks in the covered range.
+	 * If maintenance_io_concurrency is set very low, we might have started
+	 * prefetching some but not all of the blocks referenced in the record
+	 * we're about to return.  Forget about the rest of the blocks in this
+	 * record by dropping the prefetcher's reference to it.
 	 */
-	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+	if (record == prefetcher->record)
+		prefetcher->record = NULL;
 
 	/*
 	 * See if it's time to compute some statistics, because enough WAL has
@@ -1053,13 +1082,6 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
 		XLogPrefetcherComputeStats(prefetcher);
 
-	/*
-	 * The caller is about to replay this record, so we can now report that
-	 * all IO initiated because of early WAL must be finished. This may
-	 * trigger more readahead.
-	 */
-	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
-
 	Assert(record == prefetcher->reader->record);
 
 	return &record->header;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..9adf7ab143 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,10 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+		report_invalid_record(state,
+							  "aborted record pointer at %X/%X: missing contrecord at %X/%X",
+							  LSN_FORMAT_ARGS(state->abortedRecPtr),
+							  LSN_FORMAT_ARGS(state->missingContrecPtr));
 	}
 
 	if (decoded && decoded->oversized)
-- 
2.35.1

#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
1 attachment(s)
Re: pg15b3: recovery fails with wal prefetch enabled

On Mon, Sep 5, 2022 at 1:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl. I
think that failure case should report an error, no? I find it strange
that we end recovery in silence. That was a problem for the new
coding in this patch, because it is confused by XLREAD_FAIL without
queuing an error, and then retries, which clobbers the aborted recptr
state. I'm still looking into that.

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued. I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

I'll have to check whether a doc change is necessary somewhere to
advertise that maintenance_io_concurrency=0 turns off prefetching, but
IIRC that's kinda already implied.

I've tested quite a lot of scenarios including make check-world with
maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
relevant GUCs on a standby running large pgbench to check expected
effect on pg_stat_recovery_prefetch view and generate system calls.

Attachments:

v2-0001-Fix-recovery_prefetch-with-low-maintenance_io_con.patch (text/x-patch)
From 956d0613ecad68ba694b8a492895aa02002dbed5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 5 Sep 2022 12:07:24 +1200
Subject: [PATCH v2] Fix recovery_prefetch with low maintenance_io_concurrency.

The LSN-based logic for knowing when IOs complete ran operations in the
wrong order.  We should process completed IOs first before trying to
start more, so that it is always possible to decode one more record when
the decoded record queue is empty, even if maintenance_io_concurrency is
set so low that a single earlier WAL record might have saturated the IO
queue.

That thinko was hidden because the effect of maintenance_io_concurrency
was arbitrarily clamped to be at least 2.  Fix the ordering, and also
remove that clamp.  We need a special case for 0, which is now treated
the same as recovery_prefetch=off, but otherwise the number is used
directly.  This allows for testing with 1, which would have made the
problem obvious in simple test scenarios.

Back-patch to 15.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
---
 src/backend/access/transam/xlogprefetcher.c | 58 ++++++++++++++-------
 src/backend/access/transam/xlogreader.c     | 12 +++++
 2 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 9aa56411d5..33491d7325 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -72,7 +72,9 @@
 int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
 
 #ifdef USE_PREFETCH
-#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#define RecoveryPrefetchEnabled() \
+		(recovery_prefetch != RECOVERY_PREFETCH_OFF && \
+		 maintenance_io_concurrency > 0)
 #else
 #define RecoveryPrefetchEnabled() false
 #endif
@@ -1000,7 +1002,8 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 
 		if (RecoveryPrefetchEnabled())
 		{
-			max_inflight = Max(maintenance_io_concurrency, 2);
+			Assert(maintenance_io_concurrency > 0);
+			max_inflight = maintenance_io_concurrency;
 			max_distance = max_inflight * XLOGPREFETCHER_DISTANCE_MULTIPLIER;
 		}
 		else
@@ -1018,14 +1021,39 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/*
-	 * Release last returned record, if there is one.  We need to do this so
-	 * that we can check for empty decode queue accurately.
+	 * Release last returned record, if there is one, and perform tasks that
+	 * we can do now that the caller has replayed it.
 	 */
-	XLogReleasePreviousRecord(prefetcher->reader);
+	if (likely(record = prefetcher->reader->record))
+	{
+		XLogRecPtr	replayed_up_to = record->next_lsn;
+
+		XLogReleasePreviousRecord(prefetcher->reader);
+
+		/*
+		 * Can we drop any filters yet?  If we were waiting for a relation to
+		 * be created or extended, it is now OK to access blocks in the
+		 * covered range.
+		 */
+		XLogPrefetcherCompleteFilters(prefetcher, replayed_up_to);
+
+		/*
+		 * All IO initiated by earlier WAL is now completed.  This might
+		 * trigger further prefetching.
+		 */
+		lrq_complete_lsn(prefetcher->streaming_read, replayed_up_to);
+	}
 
-	/* If there's nothing queued yet, then start prefetching. */
+	/*
+	 * If there's nothing queued yet, then start prefetching to cause at least
+	 * one record to be queued.
+	 */
 	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+	{
+		Assert(lrq_inflight(prefetcher->streaming_read) == 0);
+		Assert(lrq_completed(prefetcher->streaming_read) == 0);
 		lrq_prefetch(prefetcher->streaming_read);
+	}
 
 	/* Read the next record. */
 	record = XLogNextRecord(prefetcher->reader, errmsg);
@@ -1039,12 +1067,13 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	Assert(record == prefetcher->reader->record);
 
 	/*
-	 * Can we drop any prefetch filters yet, given the record we're about to
-	 * return?  This assumes that any records with earlier LSNs have been
-	 * replayed, so if we were waiting for a relation to be created or
-	 * extended, it is now OK to access blocks in the covered range.
+	 * If maintenance_io_concurrency is set very low, we might have started
+	 * prefetching some but not all of the blocks referenced in the record
+	 * we're about to return.  Forget about the rest of the blocks in this
+	 * record by dropping the prefetcher's reference to it.
 	 */
-	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+	if (record == prefetcher->record)
+		prefetcher->record = NULL;
 
 	/*
 	 * See if it's time to compute some statistics, because enough WAL has
@@ -1053,13 +1082,6 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
 		XLogPrefetcherComputeStats(prefetcher);
 
-	/*
-	 * The caller is about to replay this record, so we can now report that
-	 * all IO initiated because of early WAL must be finished. This may
-	 * trigger more readahead.
-	 */
-	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
-
 	Assert(record == prefetcher->reader->record);
 
 	return &record->header;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..1fefaf2826 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,18 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If we got here without reporting an error, report one now so that
+		 * XLogPrefetcherReadRecord() doesn't bring us back a second time and
+		 * clobber the above state.  Otherwise, the existing error takes
+		 * precedence.
+		 */
+		if (!state->errormsg_deferred)
+			report_invalid_record(state,
+								  "aborted record pointer at %X/%X: missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(state->abortedRecPtr),
+								  LSN_FORMAT_ARGS(state->missingContrecPtr));
 	}
 
 	if (decoded && decoded->oversized)
-- 
2.35.1

#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#14)
Re: pg15b3: recovery fails with wal prefetch enabled

At Mon, 5 Sep 2022 13:28:12 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl. I
think that failure case should report an error, no? I find it strange
that we end recovery in silence. That was a problem for the new
coding in this patch, because it is confused by XLREAD_FAIL without
queuing an error, and then retries, which clobbers the aborted recptr
state. I'm still looking into that.

+1 for showing any message for the failure, but I think we shouldn't
hide an existing message if any. And the error messages around are
just telling that "<some error happened> at RecPtr". So I think
"missing contrecord at RecPtr" is sufficient here.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..bfe332c014 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,11 @@ err:
                 */
                state->abortedRecPtr = RecPtr;
                state->missingContrecPtr = targetPagePtr;
+
+               /* Put a generic error message if no particular cause is recorded. */
+               if (!state->errormsg_buf[0])
+                       report_invalid_record(state, "missing contrecord at %X/%X",
+                                                                 LSN_FORMAT_ARGS(RecPtr));
        }

if (decoded && decoded->oversized)

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#15)
Re: pg15b3: recovery fails with wal prefetch enabled

(the previous mail crossed with yours..)

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> +1 for showing any message for the failure, but I think we shouldn't
me> hide an existing message if any.

At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued. I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

It looks fine in this regard. I still think that the message looks
somewhat internal.

me> And the error messages around are
me> just telling that "<some error happened> at RecPtr". So I think
me> "missing contrecord at RecPtr" is sufficient here.

I'll have to check whether a doc change is necessary somewhere to
advertise that maintenance_io_concurrency=0 turns off prefetching, but
IIRC that's kinda already implied.

I've tested quite a lot of scenarios including make check-world with
maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
relevant GUCs on a standby running large pgbench to check expected
effect on pg_stat_recovery_prefetch view and generate system calls.

+ if (likely(record = prefetcher->reader->record))

Isn't this a bit confusing?

+	if (likely(record = prefetcher->reader->record))
+	{
+		XLogRecPtr	replayed_up_to = record->next_lsn;
+
+		XLogReleasePreviousRecord(prefetcher->reader);
+

The likely condition is the prerequisite for
XLogReleasePreviousRecord. But it is a little hard to read the
condition as "in case no previous record exists". Since there is one
in most cases, can't we call XLogReleasePreviousRecord() unconditionally
and have the function return true when the previous record exists and
false if not?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18 Thomas Munro
thomas.munro@gmail.com
In reply to: Kyotaro Horiguchi (#17)
1 attachment(s)
Re: pg15b3: recovery fails with wal prefetch enabled

On Mon, Sep 5, 2022 at 5:34 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> +1 for showing any message for the failure, but I think we shouldn't
me> hide an existing message if any.

At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued. I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

It looks fine in this regard. I still think that the message looks
somewhat internal.

Thanks for looking!

me> And the error messages around are
me> just telling that "<some error happened> at RecPtr". So I think
me> "missing contrecord at RecPtr" is sufficient here.

Ok, I've updated it like that.

I'll have to check whether a doc change is necessary somewhere to
advertise that maintenance_io_concurrency=0 turns off prefetching, but
IIRC that's kinda already implied.

I've tested quite a lot of scenarios including make check-world with
maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
relevant GUCs on a standby running large pgbench to check expected
effect on pg_stat_recovery_prefetch view and generate system calls.

+ if (likely(record = prefetcher->reader->record))

Isn't this a bit confusing?

+       if (likely(record = prefetcher->reader->record))
+       {
+               XLogRecPtr      replayed_up_to = record->next_lsn;
+
+               XLogReleasePreviousRecord(prefetcher->reader);
+

The likely condition is the prerequisite for
XLogReleasePreviousRecord. But it is a little hard to read the
condition as "in case no previous record exists". Since there is one
in most cases, couldn't we call XLogReleasePreviousRecord()
unconditionally, and have the function return true when the previous
record exists and false if not?

We also need the LSN that is past that record.
XLogReleasePreviousRecord() could return it (or we could use
reader->EndRecPtr I suppose). Thoughts on this version?

Attachments:

v3-0001-Fix-recovery_prefetch-with-low-maintenance_io_con.patch (text/x-patch, charset=US-ASCII)
From efa095529b3615f30b710a4b42bd0904b187af53 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 5 Sep 2022 12:07:24 +1200
Subject: [PATCH v3] Fix recovery_prefetch with low maintenance_io_concurrency.

The LSN-based logic for knowing when IOs complete ran operations in the
wrong order.  We should process completed IOs first before trying to
start more, so that it is always possible to decode one more record when
the decoded record queue is empty, even if maintenance_io_concurrency is
set so low that a single earlier WAL record might have saturated the IO
queue.

That thinko was hidden because the effect of maintenance_io_concurrency
was arbitrarily clamped to be at least 2.  Fix the ordering, and also
remove that clamp.  We need a special case for 0, which is now treated
the same as recovery_prefetch=off, but otherwise the number is used
directly.  This allows for testing with 1, which would have made the
problem obvious in simple test scenarios.

Also add an explicit error message for missing contrecords.  It's needed
for correctness with the resulting slightly more aggressive prefetching,
to avoid losing track of that state, but it's also a bit strange that we
didn't have an error message already.

Back-patch to 15.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
---
 src/backend/access/transam/xlogprefetcher.c | 54 ++++++++++++++-------
 src/backend/access/transam/xlogreader.c     | 22 +++++++--
 src/include/access/xlogreader.h             |  2 +-
 3 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 9aa56411d5..91bfba30ec 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -72,7 +72,9 @@
 int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
 
 #ifdef USE_PREFETCH
-#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#define RecoveryPrefetchEnabled() \
+		(recovery_prefetch != RECOVERY_PREFETCH_OFF && \
+		 maintenance_io_concurrency > 0)
 #else
 #define RecoveryPrefetchEnabled() false
 #endif
@@ -985,6 +987,7 @@ XLogRecord *
 XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 {
 	DecodedXLogRecord *record;
+	XLogRecPtr	replayed_up_to;
 
 	/*
 	 * See if it's time to reset the prefetching machinery, because a relevant
@@ -1000,7 +1003,8 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 
 		if (RecoveryPrefetchEnabled())
 		{
-			max_inflight = Max(maintenance_io_concurrency, 2);
+			Assert(maintenance_io_concurrency > 0);
+			max_inflight = maintenance_io_concurrency;
 			max_distance = max_inflight * XLOGPREFETCHER_DISTANCE_MULTIPLIER;
 		}
 		else
@@ -1018,14 +1022,34 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/*
-	 * Release last returned record, if there is one.  We need to do this so
-	 * that we can check for empty decode queue accurately.
+	 * Release last returned record, if there is one, and perform tasks that
+	 * we can now do because the caller has replayed it.
 	 */
-	XLogReleasePreviousRecord(prefetcher->reader);
+	replayed_up_to = XLogReleasePreviousRecord(prefetcher->reader);
 
-	/* If there's nothing queued yet, then start prefetching. */
+	/*
+	 * Can we drop any filters yet?  If we were waiting for a relation to be
+	 * created or extended, it is now OK to access blocks in the covered
+	 * range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replayed_up_to);
+
+	/*
+	 * All IO initiated by earlier WAL is now completed.  This might trigger
+	 * further prefetching.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, replayed_up_to);
+
+	/*
+	 * If there's nothing queued yet, then start prefetching to cause at least
+	 * one record to be queued.
+	 */
 	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+	{
+		Assert(lrq_inflight(prefetcher->streaming_read) == 0);
+		Assert(lrq_completed(prefetcher->streaming_read) == 0);
 		lrq_prefetch(prefetcher->streaming_read);
+	}
 
 	/* Read the next record. */
 	record = XLogNextRecord(prefetcher->reader, errmsg);
@@ -1039,12 +1063,13 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	Assert(record == prefetcher->reader->record);
 
 	/*
-	 * Can we drop any prefetch filters yet, given the record we're about to
-	 * return?  This assumes that any records with earlier LSNs have been
-	 * replayed, so if we were waiting for a relation to be created or
-	 * extended, it is now OK to access blocks in the covered range.
+	 * If maintenance_io_concurrency is set very low, we might have started
+	 * prefetching some but not all of the blocks referenced in the record
+	 * we're about to return.  Forget about the rest of the blocks in this
+	 * record by dropping the prefetcher's reference to it.
 	 */
-	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+	if (record == prefetcher->record)
+		prefetcher->record = NULL;
 
 	/*
 	 * See if it's time to compute some statistics, because enough WAL has
@@ -1053,13 +1078,6 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
 		XLogPrefetcherComputeStats(prefetcher);
 
-	/*
-	 * The caller is about to replay this record, so we can now report that
-	 * all IO initiated because of early WAL must be finished. This may
-	 * trigger more readahead.
-	 */
-	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
-
 	Assert(record == prefetcher->reader->record);
 
 	return &record->header;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..c1c2d9ca90 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -276,21 +276,24 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 /*
  * See if we can release the last record that was returned by
- * XLogNextRecord(), if any, to free up space.
+ * XLogNextRecord(), if any, to free up space.  Returns the LSN
+ * past the end of the previous record.
  */
-void
+XLogRecPtr
 XLogReleasePreviousRecord(XLogReaderState *state)
 {
 	DecodedXLogRecord *record;
+	XLogRecPtr		next_lsn;
 
 	if (!state->record)
-		return;
+		return InvalidXLogRecPtr;
 
 	/*
 	 * Remove it from the decoded record queue.  It must be the oldest item
 	 * decoded, decode_queue_head.
 	 */
 	record = state->record;
+	next_lsn = record->next_lsn;
 	Assert(record == state->decode_queue_head);
 	state->record = NULL;
 	state->decode_queue_head = record->next;
@@ -336,6 +339,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 			state->decode_buffer_tail = state->decode_buffer;
 		}
 	}
+
+	return next_lsn;
 }
 
 /*
@@ -907,6 +912,17 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If we got here without reporting an error, report one now so that
+		 * XLogPrefetcherReadRecord() doesn't bring us back a second time and
+		 * clobber the above state.  Otherwise, the existing error takes
+		 * precedence.
+		 */
+		if (!state->errormsg_buf[0])
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	if (decoded && decoded->oversized)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 97b6f00114..6afec33d41 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -363,7 +363,7 @@ extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
 /* Release the previously returned record, if necessary. */
-extern void XLogReleasePreviousRecord(XLogReaderState *state);
+extern XLogRecPtr XLogReleasePreviousRecord(XLogReaderState *state);
 
 /* Try to read ahead, if there is data and space. */
 extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
-- 
2.30.2

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#18)
Re: pg15b3: recovery fails with wal prefetch enabled

On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued. I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end. So
the procedural/RMT question is whether it's still possible to close
this item in beta4.

#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#18)
Re: pg15b3: recovery fails with wal prefetch enabled

At Mon, 5 Sep 2022 21:08:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

We also need the LSN that is past that record.
XLogReleasePreviousRecord() could return it (or we could use
reader->EndRecPtr I suppose). Thoughts on this version?

(Catching the gap...)

It is easier to read. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#21Jonathan S. Katz
jkatz@postgresql.org
In reply to: Thomas Munro (#19)
Re: pg15b3: recovery fails with wal prefetch enabled

On 9/5/22 7:18 PM, Thomas Munro wrote:

On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued. I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end. So
the procedural/RMT question is whether it's still possible to close
this item in beta4.

Presumably because Tom stamped it, the release is wrapped, so it
wouldn't make Beta 4, but I defer to him to see if it can be included
with the tag.

That said, if it doesn't make it for Beta 4, it would be in the next
release (which is hopefully RC1).

Thanks,

Jonathan

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#21)
Re: pg15b3: recovery fails with wal prefetch enabled

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 9/5/22 7:18 PM, Thomas Munro wrote:

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end. So
the procedural/RMT question is whether it's still possible to close
this item in beta4.

Presumably because Tom stamped it, the release is wrapped, so it
wouldn't make Beta 4, but I defer to him to see if it can be included
with the tag.

I already made the tarballs available to packagers, so adding this
would involve a re-wrap and great confusion. In any case, I'm not
a fan of pushing fixes within a day or two of the wrap deadline,
let alone after it; you get inadequate buildfarm coverage when you
cut corners that way. I think this one missed the boat.

regards, tom lane

#23Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#22)
Re: pg15b3: recovery fails with wal prefetch enabled

On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 9/5/22 7:18 PM, Thomas Munro wrote:

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end. So
the procedural/RMT question is whether it's still possible to close
this item in beta4.

Presumably because Tom stamped it, the release is wrapped, so it
wouldn't make Beta 4, but I defer to him to see if it can be included
with the tag.

I already made the tarballs available to packagers, so adding this
would involve a re-wrap and great confusion. In any case, I'm not
a fan of pushing fixes within a day or two of the wrap deadline,
let alone after it; you get inadequate buildfarm coverage when you
cut corners that way. I think this one missed the boat.

Got it. Yeah I knew it was going to be a close thing with a problem
diagnosed on Thursday/Friday before a Monday wrap, even before I
managed to confuse myself about dates and times. Thanks both.

#24Jonathan S. Katz
jkatz@postgresql.org
In reply to: Thomas Munro (#23)
Re: pg15b3: recovery fails with wal prefetch enabled

On 9/5/22 10:03 PM, Thomas Munro wrote:

On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 9/5/22 7:18 PM, Thomas Munro wrote:

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end. So
the procedural/RMT question is whether it's still possible to close
this item in beta4.

Presumably because Tom stamped it, the release is wrapped, so it
wouldn't make Beta 4, but I defer to him to see if it can be included
with the tag.

I already made the tarballs available to packagers, so adding this
would involve a re-wrap and great confusion. In any case, I'm not
a fan of pushing fixes within a day or two of the wrap deadline,
let alone after it; you get inadequate buildfarm coverage when you
cut corners that way. I think this one missed the boat.

Got it. Yeah I knew it was going to be a close thing with a problem
diagnosed on Thursday/Friday before a Monday wrap, even before I
managed to confuse myself about dates and times. Thanks both.

To close this loop, I added a section for "fixed before RC1" to Open
Items since this is presumably the next release. We can include it there
once committed.

Jonathan

#25Thomas Munro
thomas.munro@gmail.com
In reply to: Jonathan S. Katz (#24)
Re: pg15b3: recovery fails with wal prefetch enabled

On Wed, Sep 7, 2022 at 1:56 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

To close this loop, I added a section for "fixed before RC1" to Open
Items since this is presumably the next release. We can include it there
once committed.

Done yesterday.

To tie up a couple of loose ends from this thread:

On Thu, Sep 1, 2022 at 2:48 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Also, pg_waldump seems to fail early with -w:
[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

That looks OK to me. With or without -w, we get as far as
1201/1CD91010 and then hit zeroes.

On Thu, Sep 1, 2022 at 5:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:

So it *looks* like it finished early (and without the expected
error?). But it also looks like it replayed that record, according to
the page LSN. So which is it?

The reason 1201/1CAF84B0 appeared on a page despite not having been
replayed (due to the bug) is just that vismap pages don't follow the
usual logging rules, and can be read in by heap records that don't
mention the vm page (and therefore no FPW for it). So we can end up
reading a page from disk with a future LSN on it.
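
(In case anyone wants to poke at such a page after the fact: here is a sketch
using the pageinspect extension, where 'some_table' is only a placeholder for
whatever relation base/16881/2840 maps to. It shows the vm fork page's LSN for
comparison with the server's flushed WAL position.)

-- sketch only; substitute the real relation name
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lsn FROM page_header(get_raw_page('some_table', 'vm', 0));
SELECT pg_current_wal_flush_lsn();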