Non-replayable WAL records through overflows and >MaxAllocSize lengths

Started by Matthias van de Meent, about 4 years ago. 44 messages. pgsql-hackers.
#1 Matthias van de Meent
boekewurm+postgres@gmail.com

Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize, but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that PostgreSQL cannot replay.
Similarly, it is possible to repeatedly call XLogRegisterData() so as
to overflow rec->xl_tot_len, resulting in out-of-bounds reads and
writes while processing record data.

PFA a patch that attempts to fix both of these issues in the insertion
API, by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B, and why that limit exists.

Kind regards,

Matthias van de Meent

Attachments:

v1-0001-Add-protections-in-xlog-record-APIs-against-large.patch (application/x-patch, +26, -2)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Matthias van de Meent (#1)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On 11/03/2022 17:42, Matthias van de Meent wrote:

Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize, but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that PostgreSQL cannot replay.

Oops, that would be nasty.

Similarly, it is possible to repeatedly call XLogRegisterData() so as
to overflow rec->xl_tot_len, resulting in out-of-bounds reads and
writes while processing record data.

And that too.

Have you been able to create a test case for that? The largest record I
can think of is a commit record with a huge number of subtransactions,
dropped relations, and shared inval messages. I'm not sure if you can
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

PFA a patch that attempts to fix both of these issues in the insertion
API, by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B, and why that limit exists.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

rdata->data = data;

Could check for just AllocSizeIsValid(mainrdata_len), if we're only
worried about the total size of the data exceeding the limit, and assume
that each individual piece of data is smaller.

We also don't check for negative 'len'. I think that's fine, the caller
bears some responsibility for passing valid arguments too. But maybe
uint32 or size_t would be more appropriate here.

I wonder if these checks hurt performance. These are very cheap, but
then again, this codepath is very hot. It's probably fine, but it still
worries me a little. Maybe some of these could be Asserts.

@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)

if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

rdata->data = data;

Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble,
that's the real limit. And if you check for that here, you don't need to
check it in XLogRecordAssemble.

@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;

I don't think the change to uint64 is necessary, if all the data blocks
are limited to 64 kB and the number of blocks is limited too.

@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,

if (needs_data)
{
+			/* protect against overflow */
+			if (unlikely(regbuf->rdata_len > UINT16_MAX))
+				elog(ERROR, "too much WAL data for registered buffer");
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+	/*
+	 * Ensure that xlogreader.c can read the record; and check that we don't
+	 * accidentally overflow the size of the record.
+	 * */
+	if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+		elog(ERROR, "too much registered data for WAL record");
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does

It's enough to check AllocSizeIsValid(total_len), no need to also check
against UINT32_MAX.

- Heikki

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Have you been able to create a test case for that? The largest record I
can think of is a commit record with a huge number of subtransactions,
dropped relations, and shared inval messages. I'm not sure if you can
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

I believe that wal_level=logical can generate very large update and
delete records, especially with REPLICA IDENTITY FULL.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Hi,

On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:

Have you been able to create a test case for that? The largest record I can
think of is a commit record with a huge number of subtransactions, dropped
relations, and shared inval messages. I'm not sure if you can overflow a
uint32 with that, but exceeding MaxAllocSize seems possible.

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long

I wonder if these checks hurt performance. These are very cheap, but then
again, this codepath is very hot. It's probably fine, but it still worries
me a little. Maybe some of these could be Asserts.

I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.

FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.

So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?
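The collapsed check being suggested can be sketched stand-alone as below; the names mirror xloginsert.c, but this is illustrative, not the committed code (the real path raises elog(ERROR, "too much WAL data") under a single unlikely() instead of returning a flag):

```c
#include <stdint.h>

/* One combined condition, so both failure modes funnel into a single
 * error path: too many registered chunks, or a total that would
 * overflow a uint32 record length. */
static int
register_data_would_fail(int num_rdatas, int max_rdatas,
						 uint32_t mainrdata_len, uint32_t len)
{
	return num_rdatas >= max_rdatas ||
		(uint64_t) mainrdata_len + (uint64_t) len > UINT32_MAX;
}
```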

@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+			/* protect against overflow */
+			if (unlikely(regbuf->rdata_len > UINT16_MAX))
+				elog(ERROR, "too much WAL data for registered buffer");
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.

FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.

ISTM that this could just be an assert?

Greetings,

Andres Freund

#5 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#4)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Thank you all for the feedback. Please find attached v2 of the
patchset, which contains updated comments and applies the suggested
changes.

On Sat, 12 Mar 2022 at 02:03, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:

Have you been able to create a test case for that? The largest record I can
think of is a commit record with a huge number of subtransactions, dropped
relations, and shared inval messages. I'm not sure if you can overflow a
uint32 with that, but exceeding MaxAllocSize seems possible.

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long

Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.

I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.

I wonder if these checks hurt performance. These are very cheap, but then
again, this codepath is very hot. It's probably fine, but it still worries
me a little. Maybe some of these could be Asserts.

I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.

FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.

So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?

Updated.

@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+                   /* protect against overflow */
+                   if (unlikely(regbuf->rdata_len > UINT16_MAX))
+                           elog(ERROR, "too much WAL data for registered buffer");
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.

FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.

ISTM that this could just be an assert?

This specific location has been replaced with an Assert, while
XLogRegisterBufData always does the unlikely()-ed bounds check.

Kind regards,

Matthias

Attachments:

v2-0001-Add-protections-in-xlog-record-APIs-against-large.patch (application/octet-stream, +35, -7)
#6 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#5)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Hi

A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bits of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aesthetically be
pleasing.
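That bit-stealing idea can be sketched as follows, with illustrative names rather than the actual PostgreSQL record layout, and assuming the stored value is always a multiple of 4 so its two low bits are free:

```c
#include <assert.h>
#include <stdint.h>

/* Two record-level flags packed into the low bits of an aligned
 * length-like field, leaving all 8 bits of xl_info to the rmgr. */
#define SKETCH_SPECIAL_REL_UPDATE	0x1u
#define SKETCH_CHECK_CONSISTENCY	0x2u
#define SKETCH_FLAG_MASK			0x3u

static inline uint32_t
sketch_pack(uint32_t aligned_len, uint32_t flags)
{
	/* only valid when the low bits of the length are known to be zero */
	assert((aligned_len & SKETCH_FLAG_MASK) == 0);
	return aligned_len | (flags & SKETCH_FLAG_MASK);
}

static inline uint32_t
sketch_len(uint32_t packed)
{
	return packed & ~SKETCH_FLAG_MASK;
}

static inline uint32_t
sketch_flags(uint32_t packed)
{
	return packed & SKETCH_FLAG_MASK;
}
```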

On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:

I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.

We definitely shouldn't require this much RAM for the tests.

It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.

--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
-	Assert(begininsert_called);
+	Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));

Shouldn't we just make the length argument unsigned?

-	if (num_rdatas >= max_rdatas)
+	/*
+	 * Check against max_rdatas; and ensure we don't fill a record with
+	 * more data than can be replayed
+	 */
+	if (unlikely(num_rdatas >= max_rdatas ||
+				 !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.

rdata->data = data;
@@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;

-	Assert(begininsert_called);
+	Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);

/* find the registered buffer struct */
regbuf = &registered_buffers[block_id];
@@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);

-	if (num_rdatas >= max_rdatas)
+	/*
+	 * Check against max_rdatas; and ensure we don't register more data per
+	 * buffer than can be handled by the physical record format.
+	 */
+	if (unlikely(num_rdatas >= max_rdatas ||
+				 regbuf->rdata_len + len > UINT16_MAX))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

Given the repeated check it might be worth just putting it in a static inline
function used from the relevant places (which'd generate less code because the
same line number would be used for all the checks).

Greetings,

Andres Freund

#7 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#6)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:

Hi

A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bits of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aesthetically be
pleasing.

That would be interesting; though out of scope for this bug I'm trying to fix.

On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:

I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.

We definitely shouldn't require this much RAM for the tests.

It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.

--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
-     Assert(begininsert_called);
+     Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));

Shouldn't we just make the length argument unsigned?

I've applied that in the attached revision; but I'd like to note that
this makes the fix less straightforward to backpatch; as the changes
to the public function signatures shouldn't be applied in older
versions.

-     if (num_rdatas >= max_rdatas)
+     /*
+      * Check against max_rdatas; and ensure we don't fill a record with
+      * more data than can be replayed
+      */
+     if (unlikely(num_rdatas >= max_rdatas ||
+                              !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.

Agreed, updated.

rdata->data = data;
@@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;

-     Assert(begininsert_called);
+     Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);

/* find the registered buffer struct */
regbuf = &registered_buffers[block_id];
@@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);

-     if (num_rdatas >= max_rdatas)
+     /*
+      * Check against max_rdatas; and ensure we don't register more data per
+      * buffer than can be handled by the physical record format.
+      */
+     if (unlikely(num_rdatas >= max_rdatas ||
+                              regbuf->rdata_len + len > UINT16_MAX))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];

Given the repeated check it might be worth to just put it in a static inline
used from the relevant places (which'd generate less code because the same
line number would be used for all the checks).

The check itself is slightly different in those 3 places; but the
error message is shared. Do you mean to extract the elog() into a
static inline function (as attached), or did I misunderstand?

-Matthias

Attachments:

v3-0001-Add-protections-in-xlog-record-APIs-against-large.patch (application/x-patch, +52, -13)
#8 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#7)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Apart from registering this in CF 2022-07 last Friday, I've also just
added this issue to the Open Items list for PG15 under "Older bugs
affecting stable branches"; as a precaution to not lose track of this
issue in the buzz of the upcoming feature freeze.

-Matthias

#9 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#7)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The changes that were requested by Heikki and Andres have been merged
into patch v3, and I think it would be nice to fix this security issue
in the upcoming minor release(s).

-Matthias

#10 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#9)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:

Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The next minor release is three weeks away, so now would be a good
time to get that addressed. Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael

#11 David Zhang
david.zhang@highgo.ca
In reply to: Michael Paquier (#10)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Hi,

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long

Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.

A little confused here: does this patch v3 intend to solve this problem,
"record length 2145386550 at 0/3000060 too long"?

I set up a simple Primary and Standby streaming replication environment,
and used the above query to run the test before and after patch v3.
The error still exists, but with a different message.

Before patch v3, the error is showing below,

2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at
0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver
process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at
0/3000060 too long

After patch v3, the error displays differently

2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at
0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from
WAL stream: ERROR:  requested WAL segment 000000010000000000000045 has
already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at
0/3000060 too long

And once the error happens, then the Standby can't continue the replication.

Is there a particular reason to say "more datas" at line 52 in patch v3?

+ * more datas than are being accounted for by the XLog infrastructure.

On 2022-04-18 10:19 p.m., Michael Paquier wrote:

On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:

Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The next minor release is three weeks away, so now would be a good
time to get that addressed. Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael

Thank you,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

#12 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: David Zhang (#11)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Sat, 11 Jun 2022 at 01:32, David Zhang <david.zhang@highgo.ca> wrote:

Hi,

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long

Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.

A little confused here: does this patch v3 intend to solve this problem, "record length 2145386550 at 0/3000060 too long"?

No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.

I set up a simple Primary and Standby streaming replication environment, and used the above query to run the test before and after patch v3. The error still exists, but with a different message.

Before patch v3, the error is showing below,

2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL: terminating walreceiver process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long

After patch v3, the error displays differently

2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000045 has already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long

And once the error happens, then the Standby can't continue the replication.

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we won't emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

Is there a particular reason to say "more datas" at line 52 in patch v3?

+ * more datas than are being accounted for by the XLog infrastructure.

Yes. This error is thrown when you try to register a 34th block, or an
Nth rdata where the caller previously only reserved n - 1 data slots.
Hence 'datas', matching the num_rdatas and max_rdatas variables.

Thanks for looking at the patch.

- Matthias

#13 David Zhang
david.zhang@highgo.ca
In reply to: Matthias van de Meent (#12)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

A little confused here: does this patch v3 intend to solve this problem, "record length 2145386550 at 0/3000060 too long"?

No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.

I set up a simple Primary and Standby streaming replication environment, and used the above query to run the test before and after patch v3. The error still exists, but with a different message.

Before patch v3, the error is showing below,

2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL: terminating walreceiver process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long

After patch v3, the error displays differently

2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000045 has already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long

And once the error happens, then the Standby can't continue the replication.

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we won't emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

Thanks a lot for the clarification. My testing environment is pretty
simple: initdb for the Primary, run a basebackup and set the connection
string for the Standby, then run the "pg_logical_emit_message" query and
tail the log on the Standby side.

Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

#14 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#12)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we won't emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

As long as you cannot generate such WAL records, that should be fine, as
WAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD. The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.

+   if (unlikely(num_rdatas >= max_rdatas) ||
+       unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+       XLogErrorDataLimitExceeded();
[...]
+inline void
+XLogErrorDataLimitExceeded()
+{
+   elog(ERROR, "too much WAL data");
+}
The three checks are different, OK..  Note that static is missing.
+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?
--
Michael
#15 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#14)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we won't emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

As long as you cannot generate such WAL records that should be fine as
wAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD. The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.

The signature change is mostly ornamental, see attached v4.backpatch.
The main reason for changing the signature is to make sure nobody can
provide a negative value, but it's not important to the patch.

+   if (unlikely(num_rdatas >= max_rdatas) ||
+       unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+       XLogErrorDataLimitExceeded();
[...]
+inline void
+XLogErrorDataLimitExceeded()
+{
+   elog(ERROR, "too much WAL data");
+}
The three checks are different, OK..

They each check slightly different things, but with the same error. In
RegisterData, it checks that the data can still be allocated and does
not overflow the register, in RegisterBlock it checks that the total
length of data registered to the block does not exceed the max value
of XLogRecordBlockHeader->data_length. I've updated the comments above
the checks so that this distinction is more clear.

Note that static is missing.

Fixed in attached v4.patch

+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?

I was doing so previously, but there were some good arguments against that:

- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so reducing the number of checks reduces this patch's overhead.

- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reasons for the error, but I see no specific
reason why the specific reason would need to be reported to the
connection.

Kind regards,

Matthias van de Meent

Attachments:

v4-0001-Add-protections-in-xlog-record-APIs-against-large.patch (application/octet-stream, +58, -13)
v4-0001-Add-protections-in-xlog-record-APIs-against-large.backpatch (application/octet-stream, +54, -9)
#16 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#15)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Mon, Jun 20, 2022 at 11:01:51AM +0200, Matthias van de Meent wrote:

On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:

+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?

I was doing so previously, but there were some good arguments against that:

- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so reducing the number of checks reduces the overhead of this patch.

Some macro-benchmarking could be in order here, and this would most
likely become noticeable when assembling a bunch of little records?

- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reason for the error, but I see no reason why
that specific cause would need to be reported to the connection.

Okay.

+   /*
+    * Ensure that xlogreader.c can read the record by ensuring that the
+    * data section of the WAL record can be allocated.
+    */
+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();

By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.
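For illustration, that estimate adds per-block bookkeeping and alignment padding on top of xl_tot_len, which is why it is more pessimistic than a plain AllocSizeIsValid(total_len) check. A simplified standalone sketch, with all constants hard-coded as assumptions in place of the PostgreSQL headers:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified sketch of the decoder's worst-case space estimate.  All
 * constants below are assumptions standing in for the PostgreSQL
 * headers: XLR_MAX_BLOCK_ID + 1 possible blocks, 8-byte MAXALIGN, and
 * placeholder sizes for the decoded-record bookkeeping structs.
 */
#define MAX_BLOCKS        33   /* XLR_MAX_BLOCK_ID + 1, assumed */
#define MAXIMUM_ALIGNOF   8    /* assumed */
#define DECODED_HEADER_SZ 64   /* placeholder for the decoded record header */
#define DECODED_BLOCK_SZ  96   /* placeholder for per-block bookkeeping */

static size_t
decode_required_space(size_t xl_tot_len)
{
    size_t size = 0;

    size += DECODED_HEADER_SZ;                  /* fixed decoded-record part */
    size += DECODED_BLOCK_SZ * MAX_BLOCKS;      /* worst-case block array */
    size += xl_tot_len;                         /* the raw record data itself */
    size += MAXIMUM_ALIGNOF - 1;                /* padding before main data */
    size += (MAXIMUM_ALIGNOF - 1) * MAX_BLOCKS; /* padding before each block */

    return size;
}
```

With this shape, a record whose xl_tot_len is only slightly below MaxAllocSize can still require an allocation above it.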
--
Michael

#17 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#16)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Tue, 21 Jun 2022 at 03:45, Michael Paquier <michael@paquier.xyz> wrote:

+   /*
+    * Ensure that xlogreader.c can read the record by ensuring that the
+    * data section of the WAL record can be allocated.
+    */
+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();

By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.

I see, thanks for notifying me about that.

PFA a correction for that issue. It does copy over the value for
MaxAllocSize from memutils.h into xlogreader.h, because we need that
value in FRONTEND builds too, and memutils.h can't be included in
FRONTEND builds. One file suffixed with .backpatch that doesn't
include the function signature changes, but it is not optimized for
any stable branch[15].

-Matthias

PS. I'm not amused by the double copy we do in the xlogreader, as I
had expected we'd just read the record and point into that single
xl_rec_len-sized buffer. Apparently that's not how it works...

[15]: it should apply to stable branches all the way back to
REL_15_STABLE and still work as expected. Any older than that I
haven't tested, but probably only require some updates for
XLogRecMaxLength in xlogreader.h.

Attachments:

v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch (+105/-15)
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch.backpatch (+101/-11)
#18 David Zhang
david.zhang@highgo.ca
In reply to: Matthias van de Meent (#17)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Hi,

I tried to apply this v5 patch to the current master branch, but it complains:
"git apply --check
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"

then I checked it out before the commit
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.

1) both make check and make installcheck passed.

2) and I can also see this v5 patch prevents the error that happened previously,

"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
repeat(repeat(' ', 1024), 1024*1023) as l(long);
ERROR:  too much WAL data"

3) without this v5 patch, the same test causes the standby to crash as
below, and the standby is not able to boot up after this crash.

"2022-07-08 12:28:16.425 PDT [2363] FATAL:  invalid memory alloc request
size 2145388995
2022-07-08 12:28:16.426 PDT [2360] LOG:  startup process (PID 2363)
exited with exit code 1
2022-07-08 12:28:16.426 PDT [2360] LOG:  terminating any other active
server processes
2022-07-08 12:28:16.427 PDT [2360] LOG:  shutting down due to startup
process failure
2022-07-08 12:28:16.428 PDT [2360] LOG:  database system is shut down"

Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

#19 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: David Zhang (#18)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Fri, 8 Jul 2022 at 21:35, David Zhang <david.zhang@highgo.ca> wrote:

Hi,

I tried to apply this v5 patch to the current master branch, but it complains:
"git apply --check
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"

then I checked it out before the commit
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.

The attached rebased patchset should work with master @ 2cd2569c and
REL_15_STABLE @ 53df1e28. I've also added a patch that works for PG14
and earlier, which should be correct for all versions that include
commit 2c03216d (that is, all versions back to 9.5).

1) both make check and make installcheck passed.

2) and I can also see this v5 patch prevents the error that happened previously,

"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
repeat(repeat(' ', 1024), 1024*1023) as l(long);
ERROR: too much WAL data"

3) without this v5 patch, the same test causes the standby to crash as
below, and the standby is not able to boot up after this crash.

Thanks for reviewing.

Kind regards,

Matthias van de Meent

Attachments:

v6-0001-Add-protections-in-xlog-record-APIs-against-large.patch (+105/-15)
v6-0001-Add-protections-in-xlog-record-APIs-against-large.v15-patch (+101/-11)
v6-0001-Add-protections-in-xlog-record-APIs-against-large.v14-patch (+96/-11)
#20 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#19)
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:

Thanks for reviewing.

I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no? And the origin here is when the record is assembled. At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH. So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().

@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
                   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
 {
     XLogRecData *rdt;
-    uint32      total_len = 0;
+    uint64      total_len = 0;

This has no need to change.

My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:

+   /*
+    * Ensure that xlogreader.c can read the record.
+    */
+   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
+       elog(ERROR, "too much WAL data");

This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer, depending on whether the read is blocking or not.

Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed past its 16-bit limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()). The second part to block the
creation of the assembled record is simpler, now
DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
though we could inline it in the worst case?
--
Michael

Attachments:

v7-0001-Add-protections-in-xlog-record-APIs-against-large_michael.patch (+27/-8)
#21 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#6)
#22 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#21)
#23 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#20)
#24 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#20)
#25 Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#23)
#26 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#25)
#27 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#26)
#28 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#27)
#29 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#28)
#30 Ian Lawrence Barwick
barwick@gmail.com
In reply to: Michael Paquier (#29)
#31 Michael Paquier
michael@paquier.xyz
In reply to: Ian Lawrence Barwick (#30)
#32 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
#33 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#26)
#34 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#32)
#35 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#34)
#36 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#35)
#37 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#36)
#38 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#37)
#39 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#38)
#40 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#39)
#41 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#40)
#42 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#41)
#43 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michael Paquier (#42)
#44 Michael Paquier
michael@paquier.xyz
In reply to: Matthias van de Meent (#43)