BUG #17928: Standby fails to decode WAL on termination of primary
The following bug has been logged on the website:
Bug reference: 17928
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 15.2
Operating system: Ubuntu 22.04
Description:
When a primary server is stopped, a standby might fail to replay WAL with
a memory allocation error, for example:
2023-05-11 08:42:38.254 MSK [1706654] LOG: recovery restart point at
7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL: Last completed transaction
was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG: restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL: could not receive data from
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL: invalid memory alloc request
size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG: startup process (PID 1706656)
exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG: terminating any other active
server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG: shutting down due to startup
process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG: database system is shut down
The call stack of the error:
...
#6 0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at
mcxt.c:1143
#7 0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08,
xl_tot_len=2021161080,
allow_oversized=true) at xlogreader.c:524
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
#9 0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock
(pgsr_private=94350447959632, lsn=0x55cfacf4bc58)
at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at
xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08,
lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord
(prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458)
at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50,
emode=15, fetching_ckpt=false,
replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at
auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at
postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at
postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202
(gdb) frame 8
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
689 decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev =
0x7878787878787878, xl_info = 0x0,
xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0
hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00 00 40 70 30 00 00 00 00 .........@p0....
00704010 db 00 00 00 00 00 00 00 78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00 33 02 00 00 1b 0d 00 00 ........3.......
hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 33 02 00 00 54 2a 00 00 e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00 00 40 70 b1 00 00 00 00 .........@p.....
00704010 23 02 00 00 00 00 00 00 80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00 05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00 02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78 78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78 01 00 00 00 00 00 00 00 xxxxxxxx........
(In this case, rows with 500 'x' were inserted when the primary was
stopped.)
`git bisect` for this behavior blames 3f1ce9734 (where
XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).
A reproducer for the anomaly to follow.
11.05.2023 11:00, PG Bug reporting form wrote:
The following bug has been logged on the website:
Bug reference: 17928
...
The TAP test that demonstrates the issue is attached. To catch the failure
faster, I place it in multiple directories src/test/recoveryX/t, add
minimal Makefiles, and run (on tmpfs):
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j7 --linebuffer --tag make
-s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 || break; done
iteration 1
recovery1 +++ tap check in src/test/recovery1 +++
recovery2 +++ tap check in src/test/recovery2 +++
recovery3 +++ tap check in src/test/recovery3 +++
recovery4 +++ tap check in src/test/recovery4 +++
recovery5 +++ tap check in src/test/recovery5 +++
recovery6 +++ tap check in src/test/recovery6 +++
recovery7 +++ tap check in src/test/recovery7 +++
...
recovery5 # Restarting primary instance (49)
recovery3 # Restarting primary instance (49)
recovery7 # Restarting primary instance (49)
recovery2 Bailout called. Further testing stopped: pg_ctl stop failed
recovery2 FAILED--Further testing stopped: pg_ctl stop failed
recovery2 make: *** [Makefile:6: check] Error 255
parallel: This job failed:
make -s check -C src/test/recovery2
tail src/test/recovery2/tmp_check/log/099_restart_with_stanby_standby.log
2023-05-11 20:19:22.247 MSK [2046385] DETAIL: End of WAL reached on timeline 1 at 3/64BDFF8.
2023-05-11 20:19:22.247 MSK [2046385] FATAL: could not send end-of-streaming message to primary: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
2023-05-11 20:19:22.248 MSK [2037114] LOG: terminating any other active server processes
2023-05-11 20:19:22.248 MSK [2037114] LOG: shutting down due to startup process failure
2023-05-11 20:19:22.249 MSK [2037114] LOG: database system is shut down
Best regards,
Alexander
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
Thanks Alexander. Looking into this. I think it is probably
something like: recycled standby pages are not zeroed (something we
already needed to do something about[1]), and when we read a recycled
garbage size (like your "xxxx") at the end of a page at an offset
where we don't have a full record header on one page, we skip the
ValidXLogRecordHeader() call (and always did), but the check in
allocate_recordbuf() which previously handled that "gracefully" (well,
it would try to allocate up to 1GB bogusly, but it wouldn't try to
allocate more than that and ereport) is a bit too late. I probably
need to add an earlier not-too-big validation. Thinking.
[1]: /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
...
I probably need to add an earlier not-too-big validation. Thinking.
I agree about an earlier not-too-big validation. Like the attached? I
haven't tested it with Alexander's recipe or pondered it thoroughly.
[1] /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
Regarding [1], is it still worth zeroing recycled pages on standbys and/or
reading the whole header before allocating xl_tot_len? (Are there benefits
other than avoiding a 1G backend allocation or 4G frontend allocation, or is
that benefit worth the cycles?)
Attachments:
xl_tot_len-validate-v1.patch (text/plain; +38, -59)
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
...
I agree about an earlier not-too-big validation. Like the attached? I
haven't tested it with Alexander's recipe or pondered it thoroughly.
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.
[1] /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
Regarding [1], is it still worth zeroing recycled pages on standbys and/or
reading the whole header before allocating xl_tot_len? (Are there benefits
other than avoiding a 1G backend allocation or 4G frontend allocation, or is
that benefit worth the cycles?)
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
regards.
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
write_finalize_tot_len.diff (text/x-patch; +9, -0)
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
On my system, testing unpatched REL_15_2, the test reached its first failure
at minute 53 and its second failure at minute 189. Perhaps a faster,
deterministic test would be feasible:
- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from
pg_current_wal_insert_lsn() such that new message crosses a segment boundary
and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.
...
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages.
The removed check would be repetitive in all cases known to me, but I wouldn't
mind keeping the check instead.
However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.
...
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 {
 	Assert(tli != 0);
 
+	if (dying)
+	{
+		int			buflen = sizeof(uint32);
+		char		buf[sizeof(uint32)];
+
+		memset(buf, 0, buflen);
+		XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
+	}
+
 	if (LogstreamResult.Flush < LogstreamResult.Write)
 	{
 		WalRcvData *walrcv = WalRcv;
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
At Sun, 16 Jul 2023 17:49:05 -0700, Noah Misch <noah@leadboat.com> wrote in
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages.
The removed check would be repetitive in all cases known to me, but I wouldn't
mind keeping the check instead.
Ah. I got it. No need to do that again.
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Agreed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
On my system, testing unpatched REL_15_2, the test reached its first failure
at minute 53 and its second failure at minute 189. Perhaps a faster,
deterministic test would be feasible:
It took less than five minutes for me to reproduce the failure using
the test case sent by Alexander on HEAD:
2023-08-10 15:26:37.401 JST [11239] FATAL: invalid memory alloc request size 2021163525
2023-08-10 15:26:37.405 JST [11235] LOG: startup process (PID 11239) exited with exit code 1
I don't see it after two hours of tests with the patch.
- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from
pg_current_wal_insert_lsn() such that new message crosses a segment boundary
and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjusting the length of the
logical message depending on the 8k WAL page, no?
...
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
FWIW, I came back to this thread while tweaking the error reporting of
xlogreader.c for the sake of this thread and this proposal is an
improvement to be able to make a distinction between an OOM and an
incorrect record:
/messages/by-id/ZMh/WV+CuknqePQQ@paquier.xyz
Anyway, agreed that it's an improvement to remove this check out of
allocate_recordbuf(). Noah, are you planning to work more on that?
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Tweaking the WAL receiver to do more zeroing has been discussed in the
past, but as far as I recall we were not completely sure whether it's
completely free, either. It seems to me that this should have a
discussion on its own, in a new thread. Not sure if we should worry
about archiving, actually, even if the segments should be padded with
zeros beforehand.
+ gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;
Perhaps this could use more parentheses around the checked expression,
for readability.
--
Michael
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
...
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjusting the length of the
logical message depending on the 8k WAL page, no?
Actually, for this one, I think that I have a simpler idea to make it
deterministic. Once we have inserted a record at the page limit on
the primary, we can:
- Stop the standby
- Stop the primary
- Rewrite by ourselves a few bytes in the last segment on the standby
to emulate a recycled segment portion, based on the end LSN of the
logical message record, retrieved either with pg_walinspect or
pg_waldump.
- Start the standby, which would replay up to the previous record at
the page limit.
- The standby should just be in a state where it waits for the missing
records from the primary and keeps trying to stream, but it should
not fail at startup.
--
Michael
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
...
I don't see it after two hours of tests with the patch.
That is good.
...
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjust the length of the
logical message depending on the 8k WAL page, no?
Could be. I expect there would be challenges translating that outline into a
real test, but I don't know if that will be a major one. The test doesn't
need to be 100% deterministic. If it fails 25% of the time and is not the
slowest test in the recovery suite, I'd find that good enough.
...
Anyway, agreed that it's an improvement to remove this check out of
allocate_recordbuf(). Noah, are you planning to work more on that?
I can push xl_tot_len-validate-v1.patch, particularly given the testing result
you reported today. I'm content for my part to stop there.
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Tweaking the WAL receiver to do more zeroing has been discussed in the
past, but as far as I recall we were not completely sure whether it's
completely free, either. It seems to me that this should have a
discussion on its own, in a new thread. Not sure if we should worry
about archiving, actually, even if the segments should be padded with
zeros beforehand.
Okay.
+ gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;
Okay.
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
...
Could be. I expect there would be challenges translating that outline into a
real test, but I don't know if that will be a major one. The test doesn't
need to be 100% deterministic. If it fails 25% of the time and is not the
slowest test in the recovery suite, I'd find that good enough.
FWIW, I'm having a pretty hard time to get something close enough to a
page border in a reliable way. Perhaps using a larger series of
records and selecting only one would be more reliable... Need to test
that a bit more.
...
I can push xl_tot_len-validate-v1.patch, particularly given the testing result
you reported today. I'm content for my part to stop there.
Okay, fine by me. That's going to help with what I am doing in the
other thread as I'd need to make a better difference between the OOM
and the invalid cases for the allocation path.
You are planning for a backpatch to take care of the inconsistency,
right? The report has been on 15~ where the prefetching was
introduced. I'd be OK to just do that and not mess up with the stable
branches more than necessary (aka ~14) if nobody complains, especially
as REL_11_STABLE is planned to be EOL'd in the next minor cycle.
--
Michael
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
...
FWIW, I'm having a pretty hard time to get something close enough to a
page border in a reliable way. Perhaps using a larger series of
records and select only one would be more reliable.. Need to test
that a bit more.
Interesting. So pg_logical_emit_message(false, 'X', repeat('X', n)) doesn't
get close enough, but s/n/n+1/ spills to the next page? If so, I did not
anticipate that.
...
You are planning for a backpatch to take care of the inconsistency,
right? The report has been on 15~ where the prefetching was
introduced. I'd be OK to just do that and not mess up with the stable
branches more than necessary (aka ~14) if nobody complains, especially
as REL_11_STABLE is planned to be EOL'd in the next minor cycle.
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
...
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Thanks for working on this.
I wonder if we need a more explicit way to construct pages with the
right bits to reach interesting test cases and get full enough
coverage... (Cf throwing SQL at the WAL to see if it sticks.)
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
Thanks for working on this.
I wonder if we need a more explicit way to construct pages with the
right bits to reach interesting test cases and get full enough
coverage... (Cf throwing SQL at the WAL to see if it sticks.)
You mean SQL functions that write an arbitrary set of bytes at a given
LSN, to trigger some expected behavior on a follow-up crash recovery?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
That was my gut reaction too -- risk/reward seems not great for the
last v11 release. (I've been burned a couple of times now by putting
can't-fix-it-anymore bugs into the final release of a branch, so maybe
I'm just being overly paranoid. But it's something to worry about.)
regards, tom lane
On Fri, Aug 11, 2023 at 08:32:00PM -0400, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
That was my gut reaction too -- risk/reward seems not great for the
last v11 release. (I've been burned a couple of times now by putting
can't-fix-it-anymore bugs into the final release of a branch, so maybe
I'm just being overly paranoid. But it's something to worry about.)
Okay, v12+ would be fine with me. Alas, a closer look at v1 found two
defects, one with severity far greater than the bug v1 aimed to fix:
===== 1. For oversized records, it changed startup FATAL to silent data loss
This affects only supported branches (v15-). In v16+, commit 8fcb32d
introduced and checked XLogRecordMaxSize. But in v15 with patch v1, redo of
pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 1000))
just ends at the oversized record:
2523581 2023-08-12 17:17:39.512 GMT LOG: redo starts at 0/41F155B8
2523581 2023-08-12 17:17:39.512 GMT LOG: record length 1073740879 at 0/41F155F0 too long
2523581 2023-08-12 17:17:39.512 GMT LOG: redo done at 0/41F155B8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
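In code form, the proposed ordering might look roughly like the sketch below. This is a self-contained illustration with made-up stand-in types and constants (FakeRecordHeader, MAX_RECORD_SIZE, the crc_ok flag), not the real xlogreader structures; the real code would compute the CRC over the payload read in chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real xlogreader types. */
#define RM_MAX_ID 64
#define MAX_RECORD_SIZE (1024u * 1024 * 1024)	/* cf. XLogRecordMaxSize */

typedef struct
{
	uint32_t	xl_tot_len;		/* total record length */
	uint8_t		xl_rmid;		/* resource manager */
	uint64_t	xl_prev;		/* LSN of previous record */
} FakeRecordHeader;

typedef enum
{
	DECODE_OK,					/* safe to allocate and decode */
	DECODE_END_OF_WAL,			/* recycled garbage: stop replay quietly */
	DECODE_FATAL_OVERSIZED		/* validated record, but too large: FATAL */
} DecodeResult;

/*
 * Proposed order: cheap header checks first, then the CRC, and only a
 * record that passed both may turn "too long" into a hard error.
 */
static DecodeResult
decode_next(const FakeRecordHeader *hdr, uint64_t expected_prev, bool crc_ok)
{
	/* Bogus xl_rmid or xl_prev => treat as end of WAL. */
	if (hdr->xl_rmid > RM_MAX_ID || hdr->xl_prev != expected_prev)
		return DECODE_END_OF_WAL;

	/* CRC mismatch likewise means end of WAL, not an error. */
	if (!crc_ok)
		return DECODE_END_OF_WAL;

	/* Only now is an oversized length known-genuine, hence FATAL. */
	if (hdr->xl_tot_len > MAX_RECORD_SIZE)
		return DECODE_FATAL_OVERSIZED;

	return DECODE_OK;
}
```

The point of the ordering is that no allocation sized from xl_tot_len happens until the length is known to come from a genuine record rather than recycled bytes.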
As an alternative, zeroing recycled pages could remove "most of" the spurious
errors without adding silent data loss. I considered another alternative of
back-patching XLogRecordMaxSize and proceeding with a patch like v1, but that
wouldn't help with records in existing WAL archives. Of course, we could also
choose to leave $SUBJECT unfixed in v15-.
===== 2. DecodeXLogRecordRequiredSpace() is the wrong ceiling
v1 treats DecodeXLogRecordRequiredSpace(L) (=L+2445 here) as the max needed,
but allocate_recordbuf(state, L) allocates up to L+8192. If the uninitialized
length happened to contain an integer between about 1<<30-8192 and 1<<30-2445,
one still got e.g. "FATAL: invalid memory alloc request size 1073741824".
Since v16+ has XLogRecordMaxSize, testing XLogRecordMaxSize avoids the problem
and should suffice:
===
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -640,7 +640,6 @@ restart:
(uint32) SizeOfXLogRecord, total_len);
goto err;
}
-#ifndef FRONTEND
/*
* We may be looking at a random uint32 read from a recycled segment. For
@@ -651,13 +650,12 @@ restart:
* the allocation may succeed but record checks are going to fail so this
* would be short-lived.
*/
- if (!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))
+ if (total_len > XLogRecordMaxSize)
{
report_invalid_record(state, "record length %u at %X/%X too long",
total_len, LSN_FORMAT_ARGS(RecPtr));
goto err;
}
-#endif
/*
* If the whole record header is on this page, validate it immediately.
===
allocate_recordbuf()'s addend is not critical; it's just to "avoid useless
small increases". v15- should avoid the problem like this:
===
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 969bcc3..1bac303 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -201,4 +201,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+#ifndef FRONTEND
+ if (!AllocSizeIsValid(newSize))
+ newSize = reclength;
+#endif
if (state->readRecordBuf)
===
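Numerically, the window works out like this. The sketch below mirrors the rounding in allocate_recordbuf() with the proposed clamp applied, using assumed build constants (XLOG_BLCKSZ = BLCKSZ = 8192, MaxAllocSize = 1 GB - 1); it is an illustration of the arithmetic, not the real function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values mirroring common builds. */
#define XLOG_BLCKSZ 8192u
#define BLCKSZ		8192u
#define MAX_ALLOC_SIZE 0x3fffffffu	/* cf. MaxAllocSize: 1 GB - 1 */

/*
 * Size allocate_recordbuf() would request for a record of length
 * reclength, following the rounding in xlogreader.c, with the proposed
 * fallback applied when rounding overshoots the palloc limit.
 */
static uint32_t
recordbuf_size(uint32_t reclength)
{
	uint32_t	newSize = reclength;
	uint32_t	minSize = 5 * (BLCKSZ > XLOG_BLCKSZ ? BLCKSZ : XLOG_BLCKSZ);

	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);	/* round up to block */
	if (newSize < minSize)
		newSize = minSize;

	/*
	 * Proposed fix: if the convenience rounding pushed us past the palloc
	 * limit, fall back to the exact length, so a record just under the
	 * limit still gets a valid allocation request.
	 */
	if (newSize > MAX_ALLOC_SIZE)
		newSize = reclength;
	return newSize;
}
```

For reclength = 2^30 - 4000 (the pg_logical_emit_message case above), rounding lands exactly on 2^30 = 1073741824, one byte past MaxAllocSize, so without the clamp palloc would reject it even though reclength itself is allocatable.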
In v15, one can reach this case with pg_logical_emit_message(false, '_',
repeat('x', 1024 * 1024 * 1024 - 4000)). On v16+, one might reach this with
unlucky trailing garbage, but XLogRecordMaxSize prevents reaching it through
pg_logical_emit_message().
Thanks,
nm
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
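The cross-check being described could be sketched like this: when a record spans onto the next page, that page's header records how many bytes of the record remain (xlp_rem_len), so a total_len read from recycled garbage will not agree with it. The types and names below are simplified stand-ins, not the real XLogPageHeaderData.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified continuation-page header; the real XLogPageHeaderData also
 * carries xlp_pageaddr, xlp_info, xlp_magic, etc.
 */
typedef struct
{
	uint32_t	xlp_rem_len;	/* bytes of the spanning record remaining */
} FakeContPageHeader;

/*
 * Before trusting total_len enough to allocate for it, require that the
 * continuation page agrees on how much of the record remains.  "consumed"
 * is how many bytes of the record fit on the page(s) already read.
 */
static bool
rem_len_matches(uint32_t total_len, uint32_t consumed,
				const FakeContPageHeader *cont)
{
	if (consumed >= total_len)
		return false;			/* shouldn't be a continuation at all */
	return cont->xlp_rem_len == total_len - consumed;
}
```

If this check (together with ValidXLogRecordHeader()) runs before the allocation, the length used to size the buffer has effectively been confirmed by two independently written pages.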
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
Hmm, I think that is right. A coincidental match of the 32-bit CRC is more
probable than having all those fields appear valid by coincidence, especially
xlp_pageaddr.
On Sat, Aug 12, 2023 at 08:30:34PM -0700, Noah Misch wrote:
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
+1 for relying on xlp_rem_len for that.
Hmm, I think that is right. A coincidental match of the 32-bit CRC is more
probable than having all those fields appear valid by coincidence, especially
xlp_pageaddr.
Hmm. [.. thinks ..]
It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well? It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc(). That should remove the need of having
gotheader from v1.
--
Michael
On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well? It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc(). That should remove the need of having
gotheader from v1.
Yeah. Trying out an idea for how to untangle that...