BUG #17928: Standby fails to decode WAL on termination of primary
The following bug has been logged on the website:
Bug reference: 17928
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 15.2
Operating system: Ubuntu 22.04
Description:
When a primary server is stopped, a standby might fail to replay WAL with
a memory allocation error, for example:
2023-05-11 08:42:38.254 MSK [1706654] LOG: recovery restart point at
7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL: Last completed transaction
was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG: restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL: could not receive data from
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL: invalid memory alloc request
size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG: startup process (PID 1706656)
exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG: terminating any other active
server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG: shutting down due to startup
process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG: database system is shut down
The call stack of the error:
...
#6 0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at
mcxt.c:1143
#7 0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08,
xl_tot_len=2021161080,
allow_oversized=true) at xlogreader.c:524
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
#9 0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock
(pgsr_private=94350447959632, lsn=0x55cfacf4bc58)
at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at
xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08,
lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord
(prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458)
at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50,
emode=15, fetching_ckpt=false,
replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at
auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at
postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at
postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202
(gdb) frame 8
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
689 decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev =
0x7878787878787878, xl_info = 0x0,
xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0
hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00 00 40 70 30 00 00 00 00 .........@p0....
00704010 db 00 00 00 00 00 00 00 78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00 33 02 00 00 1b 0d 00 00 ........3.......
hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 33 02 00 00 54 2a 00 00 e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00 00 40 70 b1 00 00 00 00 .........@p.....
00704010 23 02 00 00 00 00 00 00 80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00 05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00 02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78 78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78 01 00 00 00 00 00 00 00 xxxxxxxx........
(In this case, rows with 500 'x' were inserted when the primary was
stopped.)
`git bisect` for this behavior blames 3f1ce9734 (where
XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).
A reproducer for the anomaly to follow.
11.05.2023 11:00, PG Bug reporting form wrote:
The following bug has been logged on the website:
Bug reference: 17928
...
The TAP test that demonstrates the issue is attached. To catch the failure
faster, I place it in multiple directories src/test/recoveryX/t, add
minimal Makefiles, and run (on tmpfs):
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j7 --linebuffer --tag make
-s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 || break; done
iteration 1
recovery1 +++ tap check in src/test/recovery1 +++
recovery2 +++ tap check in src/test/recovery2 +++
recovery3 +++ tap check in src/test/recovery3 +++
recovery4 +++ tap check in src/test/recovery4 +++
recovery5 +++ tap check in src/test/recovery5 +++
recovery6 +++ tap check in src/test/recovery6 +++
recovery7 +++ tap check in src/test/recovery7 +++
...
recovery5 # Restarting primary instance (49)
recovery3 # Restarting primary instance (49)
recovery7 # Restarting primary instance (49)
recovery2 Bailout called. Further testing stopped: pg_ctl stop failed
recovery2 FAILED--Further testing stopped: pg_ctl stop failed
recovery2 make: *** [Makefile:6: check] Error 255
parallel: This job failed:
make -s check -C src/test/recovery2
tail src/test/recovery2/tmp_check/log/099_restart_with_stanby_standby.log
2023-05-11 20:19:22.247 MSK [2046385] DETAIL: End of WAL reached on timeline 1 at 3/64BDFF8.
2023-05-11 20:19:22.247 MSK [2046385] FATAL: could not send end-of-streaming message to primary: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
2023-05-11 20:19:22.248 MSK [2037114] LOG: terminating any other active server processes
2023-05-11 20:19:22.248 MSK [2037114] LOG: shutting down due to startup process failure
2023-05-11 20:19:22.249 MSK [2037114] LOG: database system is shut down
Best regards,
Alexander
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
Thanks Alexander. Looking into this. I think it is probably
something like: recycled standby pages are not zeroed (something we
already needed to do something about[1]), and when we read a recycled
garbage size (like your "xxxx") at the end of a page at an offset
where we don't have a full record header on one page, we skip the
ValidXLogRecordHeader() call (and always did), but the check in
allocate_recordbuf() which previously handled that "gracefully" (well,
it would try to allocate up to 1GB bogusly, but it wouldn't try to
allocate more than that and ereport) is a bit too late. I probably
need to add an earlier not-too-big validation. Thinking.
[1]: /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
...
I probably need to add an earlier not-too-big validation. Thinking.
I agree about an earlier not-too-big validation. Like the attached? I
haven't tested it with Alexander's recipe or pondered it thoroughly.
[1] /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
Regarding [1], is it still worth zeroing recycled pages on standbys and/or
reading the whole header before allocating xl_tot_len? (Are there benefits
other than avoiding a 1G backend allocation or 4G frontend allocation, or is
that benefit worth the cycles?)
Attachments:
xl_tot_len-validate-v1.patch (text/plain; +38, -59)
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
...
I agree about an earlier not-too-big validation. Like the attached? I
haven't tested it with Alexander's recipe or pondered it thoroughly.
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.
[1] /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
Regarding [1], is it still worth zeroing recycled pages on standbys and/or
reading the whole header before allocating xl_tot_len? (Are there benefits
other than avoiding a 1G backend allocation or 4G frontend allocation, or is
that benefit worth the cycles?)
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
regards.
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
write_finalize_tot_len.diff (text/x-patch; +9, -0)
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
On my system, testing unpatched REL_15_2, the test reached its first failure
at minute 53 and its second failure at minute 189. Perhaps a faster,
deterministic test would be feasible:
- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from
pg_current_wal_insert_lsn() such that new message crosses a segment boundary
and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.
...
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages.
The removed check would be repetitive in all cases known to me, but I wouldn't
mind keeping the check instead.
However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.
...
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 {
 	Assert(tli != 0);
 
+	if (dying)
+	{
+		int			buflen = sizeof(uint32);
+		char		buf[sizeof(uint32)];
+
+		memset(buf, 0, buflen);
+		XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
+	}
+
 	if (LogstreamResult.Flush < LogstreamResult.Write)
 	{
 		WalRcvData *walrcv = WalRcv;
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
At Sun, 16 Jul 2023 17:49:05 -0700, Noah Misch <noah@leadboat.com> wrote in
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages.
The removed check would be repetitive in all cases known to me, but I wouldn't
mind keeping the check instead.
Ah. I got it. No need to do that again.
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Agreed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
On my system, testing unpatched REL_15_2, the test reached its first failure
at minute 53 and its second failure at minute 189. Perhaps a faster,
deterministic test would be feasible:
It took less than five minutes for me to reproduce the failure using
the test case sent by Alexander on HEAD:
2023-08-10 15:26:37.401 JST [11239] FATAL: invalid memory alloc request size 2021163525
2023-08-10 15:26:37.405 JST [11235] LOG: startup process (PID 11239) exited with exit code 1
I don't see it after two hours of tests with the patch.
- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from
pg_current_wal_insert_lsn() such that new message crosses a segment boundary
and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjusting the length of the
logical message depending on the 8k WAL page, no?
...
I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.
FWIW, I came back to this thread while tweaking the error reporting of
xlogreader.c for the sake of this thread and this proposal is an
improvement to be able to make a distinction between an OOM and an
incorrect record:
/messages/by-id/ZMh/WV+CuknqePQQ@paquier.xyz
Anyway, agreed that it's an improvement to remove this check out of
allocate_recordbuf(). Noah, are you planning to work more on that?
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Tweaking the WAL receiver to do more zeroing has been discussed in the
past, but as far as I recall we were not completely sure whether it's
completely free, either. It seems to me that this should have a
discussion on its own, in a new thread. Not sure if we should worry
about archiving, actually, even if the segments should be padded with
zeros beforehand.
+ gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;
Perhaps this could use more parentheses around the checked expression,
for readability.
--
Michael
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
...
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjusting the length of the
logical message depending on the 8k WAL page, no?
Actually, for this one, I think that I have a simpler idea to make it
deterministic. Once we have inserted a record at the page limit on
the primary, we can:
- Stop the standby
- Stop the primary
- Rewrite by ourselves a few bytes in the last segment on the standby
to emulate a recycled segment portion, based on the end LSN of the
logical message record, retrieved either with pg_walinspect or
pg_waldump.
- Start the standby, which would replay up to the previous record at
the page limit.
- The standby should just be in a state where it waits for the missing
records from the primary and keeps trying to stream, but it should
not fail at startup.
--
Michael
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
...
I don't see it after two hours of tests with the patch.
That is good.
...
Good idea to pollute the data with recycled segments. Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough. The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjust the length of the
logical message depending on the 8k WAL page, no?
Could be. I expect there would be challenges translating that outline into a
real test, but I don't know if that will be a major one. The test doesn't
need to be 100% deterministic. If it fails 25% of the time and is not the
slowest test in the recovery suite, I'd find that good enough.
...
Anyway, agreed that it's an improvement to remove this check out of
allocate_recordbuf(). Noah, are you planning to work more on that?
I can push xl_tot_len-validate-v1.patch, particularly given the testing result
you reported today. I'm content for my part to stop there.
...
That's inexpensive in cycles, and it should make the failures in question
quite rare. If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.
Tweaking the WAL receiver to do more zeroing has been discussed in the
past, but as far as I recall we were not completely sure whether it's
completely free, either. It seems to me that this should have a
discussion on its own, in a new thread. Not sure if we should worry
about archiving, actually, even if the segments should be padded with
zeros beforehand.
Okay.
+ gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;
Okay.
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
...
Could be. I expect there would be challenges translating that outline into a
real test, but I don't know if that will be a major one. The test doesn't
need to be 100% deterministic. If it fails 25% of the time and is not the
slowest test in the recovery suite, I'd find that good enough.
FWIW, I'm having a pretty hard time to get something close enough to a
page border in a reliable way. Perhaps using a larger series of
records and selecting only one would be more reliable... Need to test
that a bit more.
...
I can push xl_tot_len-validate-v1.patch, particularly given the testing result
you reported today. I'm content for my part to stop there.
Okay, fine by me. That's going to help with what I am doing in the
other thread as I'd need to make a better difference between the OOM
and the invalid cases for the allocation path.
You are planning for a backpatch to take care of the inconsistency,
right? The report has been on 15~ where the prefetching was
introduced. I'd be OK to just do that and not mess up with the stable
branches more than necessary (aka ~14) if nobody complains, especially
as REL_11_STABLE is planned to be EOL'd in the next minor cycle.
--
Michael
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
...
FWIW, I'm having a pretty hard time to get something close enough to a
page border in a reliable way. Perhaps using a larger series of
records and select only one would be more reliable.. Need to test
that a bit more.
Interesting. So pg_logical_emit_message(false, 'X', repeat('X', n)) doesn't
get close enough, but s/n/n+1/ spills to the next page? If so, I did not
anticipate that.
...
You are planning for a backpatch to take care of the inconsistency,
right? The report has been on 15~ where the prefetching was
introduced. I'd be OK to just do that and not mess up with the stable
branches more than necessary (aka ~14) if nobody complains, especially
as REL_11_STABLE is planned to be EOL'd in the next minor cycle.
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
...
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Thanks for working on this.
I wonder if we need a more explicit way to construct pages with the
right bits to reach interesting test cases and get full enough
coverage... (Cf throwing SQL at the WAL to see if it sticks.)
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
Thanks for working on this.
I wonder if we need a more explicit way to construct pages with the
right bits to reach interesting test cases and get full enough
coverage... (Cf throwing SQL at the WAL to see if it sticks.)
You mean SQL functions that write an arbitrary set of bytes at a given
LSN, to trigger some expected behavior on a follow-up crash recovery?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
That was my gut reaction too -- risk/reward seems not great for the
last v11 release. (I've been burned a couple of times now by putting
can't-fix-it-anymore bugs into the final release of a branch, so maybe
I'm just being overly paranoid. But it's something to worry about.)
regards, tom lane
On Fri, Aug 11, 2023 at 08:32:00PM -0400, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more. At least it does not seem
worth taking any risk compared to the potential gain.
That was my gut reaction too -- risk/reward seems not great for the
last v11 release. (I've been burned a couple of times now by putting
can't-fix-it-anymore bugs into the final release of a branch, so maybe
I'm just being overly paranoid. But it's something to worry about.)
Okay, v12+ would be fine with me. Alas, a closer look at v1 found two
defects, one with severity far greater than the bug v1 aimed to fix:
===== 1. For oversized records, it changed startup FATAL to silent data loss
This affects only supported branches (v15-). In v16+, commit 8fcb32d
introduced and checked XLogRecordMaxSize. But in v15 with patch v1, redo of
pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 1000))
just ends at the oversized record:
2523581 2023-08-12 17:17:39.512 GMT LOG: redo starts at 0/41F155B8
2523581 2023-08-12 17:17:39.512 GMT LOG: record length 1073740879 at 0/41F155F0 too long
2523581 2023-08-12 17:17:39.512 GMT LOG: redo done at 0/41F155B8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
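In code form, the proposed ordering might look roughly like the sketch below. This is a self-contained illustration with made-up stand-in types and constants (FakeRecordHeader, MAX_RECORD_SIZE, the crc_ok flag), not the real xlogreader structures; the real code would compute the CRC over the payload read in chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real xlogreader types. */
#define RM_MAX_ID 64
#define MAX_RECORD_SIZE (1024u * 1024 * 1024)	/* cf. XLogRecordMaxSize */

typedef struct
{
	uint32_t	xl_tot_len;		/* total record length */
	uint8_t		xl_rmid;		/* resource manager */
	uint64_t	xl_prev;		/* LSN of previous record */
} FakeRecordHeader;

typedef enum
{
	DECODE_OK,					/* safe to allocate and decode */
	DECODE_END_OF_WAL,			/* recycled garbage: stop replay quietly */
	DECODE_FATAL_OVERSIZED		/* validated record, but too large: FATAL */
} DecodeResult;

/*
 * Proposed order: cheap header checks first, then the CRC, and only a
 * record that passed both may turn "too long" into a hard error.
 */
static DecodeResult
decode_next(const FakeRecordHeader *hdr, uint64_t expected_prev, bool crc_ok)
{
	/* Bogus xl_rmid or xl_prev => treat as end of WAL. */
	if (hdr->xl_rmid > RM_MAX_ID || hdr->xl_prev != expected_prev)
		return DECODE_END_OF_WAL;

	/* CRC mismatch likewise means end of WAL, not an error. */
	if (!crc_ok)
		return DECODE_END_OF_WAL;

	/* Only now is an oversized length known-genuine, hence FATAL. */
	if (hdr->xl_tot_len > MAX_RECORD_SIZE)
		return DECODE_FATAL_OVERSIZED;

	return DECODE_OK;
}
```

The point of the ordering is that no allocation sized from xl_tot_len happens until the length is known to come from a genuine record rather than recycled bytes.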
As an alternative, zeroing recycled pages could remove "most of" the spurious
errors without adding silent data loss. I considered another alternative of
back-patching XLogRecordMaxSize and proceeding with a patch like v1, but that
wouldn't help with records in existing WAL archives. Of course, we could also
choose to leave $SUBJECT unfixed in v15-.
===== 2. DecodeXLogRecordRequiredSpace() is the wrong ceiling
v1 treats DecodeXLogRecordRequiredSpace(L) (=L+2445 here) as the max needed,
but allocate_recordbuf(state, L) allocates up to L+8192. If the uninitialized
length happened to contain an integer between about 1<<30-8192 and 1<<30-2445,
one still got e.g. "FATAL: invalid memory alloc request size 1073741824".
Since v16+ has XLogRecordMaxSize, testing XLogRecordMaxSize avoids the problem
and should suffice:
===
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -640,7 +640,6 @@ restart:
(uint32) SizeOfXLogRecord, total_len);
goto err;
}
-#ifndef FRONTEND
/*
* We may be looking at a random uint32 read from a recycled segment. For
@@ -651,13 +650,12 @@ restart:
* the allocation may succeed but record checks are going to fail so this
* would be short-lived.
*/
- if (!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))
+ if (total_len > XLogRecordMaxSize)
{
report_invalid_record(state, "record length %u at %X/%X too long",
total_len, LSN_FORMAT_ARGS(RecPtr));
goto err;
}
-#endif
/*
* If the whole record header is on this page, validate it immediately.
===
allocate_recordbuf()'s addend is not critical; it's just to "avoid useless
small increases". v15- should avoid the problem like this:
===
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 969bcc3..1bac303 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -201,4 +201,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+#ifndef FRONTEND
+ if (!AllocSizeIsValid(newSize))
+ newSize = reclength;
+#endif
if (state->readRecordBuf)
===
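Numerically, the window works out like this. The sketch below mirrors the rounding in allocate_recordbuf() with the proposed clamp applied, using assumed build constants (XLOG_BLCKSZ = BLCKSZ = 8192, MaxAllocSize = 1 GB - 1); it is an illustration of the arithmetic, not the real function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values mirroring common builds. */
#define XLOG_BLCKSZ 8192u
#define BLCKSZ		8192u
#define MAX_ALLOC_SIZE 0x3fffffffu	/* cf. MaxAllocSize: 1 GB - 1 */

/*
 * Size allocate_recordbuf() would request for a record of length
 * reclength, following the rounding in xlogreader.c, with the proposed
 * fallback applied when rounding overshoots the palloc limit.
 */
static uint32_t
recordbuf_size(uint32_t reclength)
{
	uint32_t	newSize = reclength;
	uint32_t	minSize = 5 * (BLCKSZ > XLOG_BLCKSZ ? BLCKSZ : XLOG_BLCKSZ);

	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);	/* round up to block */
	if (newSize < minSize)
		newSize = minSize;

	/*
	 * Proposed fix: if the convenience rounding pushed us past the palloc
	 * limit, fall back to the exact length, so a record just under the
	 * limit still gets a valid allocation request.
	 */
	if (newSize > MAX_ALLOC_SIZE)
		newSize = reclength;
	return newSize;
}
```

For reclength = 2^30 - 4000 (the pg_logical_emit_message case above), rounding lands exactly on 2^30 = 1073741824, one byte past MaxAllocSize, so without the clamp palloc would reject it even though reclength itself is allocatable.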
In v15, one can reach this case with pg_logical_emit_message(false, '_',
repeat('x', 1024 * 1024 * 1024 - 4000)). On v16+, one might reach this with
unlucky trailing garbage, but XLogRecordMaxSize prevents reaching it through
pg_logical_emit_message().
Thanks,
nm
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
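The cross-check being described could be sketched like this: when a record spans onto the next page, that page's header records how many bytes of the record remain (xlp_rem_len), so a total_len read from recycled garbage will not agree with it. The types and names below are simplified stand-ins, not the real XLogPageHeaderData.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified continuation-page header; the real XLogPageHeaderData also
 * carries xlp_pageaddr, xlp_info, xlp_magic, etc.
 */
typedef struct
{
	uint32_t	xlp_rem_len;	/* bytes of the spanning record remaining */
} FakeContPageHeader;

/*
 * Before trusting total_len enough to allocate for it, require that the
 * continuation page agrees on how much of the record remains.  "consumed"
 * is how many bytes of the record fit on the page(s) already read.
 */
static bool
rem_len_matches(uint32_t total_len, uint32_t consumed,
				const FakeContPageHeader *cont)
{
	if (consumed >= total_len)
		return false;			/* shouldn't be a continuation at all */
	return cont->xlp_rem_len == total_len - consumed;
}
```

If this check (together with ValidXLogRecordHeader()) runs before the allocation, the length used to size the buffer has effectively been confirmed by two independently written pages.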
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
Hmm, I think that is right. A coincidental match of the 32-bit CRC is more
probable than having all those fields appear valid by coincidence, especially
xlp_pageaddr.
On Sat, Aug 12, 2023 at 08:30:34PM -0700, Noah Misch wrote:
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug. So far, I'm seeing one way
to clearly fix $SUBJECT without that harm. When a record header spans a page
boundary, read the next page to reassemble the header. If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC. If CRC is
invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized
record. A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations. Thoughts? For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
That sounds quite strong. But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode? If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?
+1 for relying on xlp_rem_len for that.
Hmm, I think that is right. A coincidental match of the 32-bit CRC is more
probable than having all those fields appear valid by coincidence, especially
xlp_pageaddr.
Hmm. [.. thinks ..]
It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well? It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc(). That should remove the need of having
gotheader from v1.
--
Michael
On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well? It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc(). That should remove the need of having
gotheader from v1.
Yeah. Trying out an idea for how to untangle that...