Non-replayable WAL records through overflows and >MaxAllocSize lengths
Hi,
Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize; but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.
Similarly; it is possible to repeatedly call XlogRegisterData() so as
to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
writes while processing record data;
PFA a patch that attempts to fix both of these issues in the insertion
API; by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B; and why that limit exists.
Kind regards,
Matthias van de Meent
Attachments:
v1-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/x-patch; name=v1-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From a72121b8fccbbf8c289e71a2c76801a1004f5353 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v1] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
src/backend/access/transam/xloginsert.c | 23 ++++++++++++++++++++++-
src/include/access/xlogrecord.h | 4 ++++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
+
+ /* protect against overflow */
+ if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+ elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
+
+ /* protect against overflow */
+ if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+ elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /* protect against overflow */
+ if (unlikely(regbuf->rdata_len > UINT16_MAX))
+ elog(ERROR, "too much WAL data for registered buffer");
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record; and check that we don't
+ * accidentally overflow the size of the record.
+ * */
+ if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+ elog(ERROR, "too much registered data for WAL record");
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index c1b1137aa7..950e1f22b0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
On 11/03/2022 17:42, Matthias van de Meent wrote:
Hi,
Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize; but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.
Oops, that would be nasty.
Similarly; it is possible to repeatedly call XlogRegisterData() so as
to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
writes while processing record data;
And that too.
Have you been able to create a test case for that? The largest record I
can think of is a commit record with a huge number of subtransactions,
dropped relations, and shared inval messages. I'm not sure if you can
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.
PFA a patch that attempts to fix both of these issues in the insertion API; by checking against overflows and other incorrectly large values in the relevant functions in xloginsert.c. In this patch, I've also added a comment to the XLogRecord spec to document that xl_tot_len should not be larger than 1GB - 1B; and why that limit exists. diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index c260310c4c..ae654177de 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)if (num_rdatas >= max_rdatas) elog(ERROR, "too much WAL data"); + + /* protect against overflow */ + if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX)) + elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];rdata->data = data;
Could check for just AllocSizeValid(mainrdata_len), if we're only
worried about the total size of the data to exceed the limit, and assume
that each individual piece of data is smaller.
We also don't check for negative 'len'. I think that's fine, the caller
bears some responsibility for passing valid arguments too. But maybe
uint32 or size_t would be more appropriate here.
I wonder if these checks hurt performance. These are very cheap, but
then again, this codepath is very hot. It's probably fine, but it still
worries me a little. Maybe some of these could be Asserts.
@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
if (num_rdatas >= max_rdatas) elog(ERROR, "too much WAL data"); + + /* protect against overflow */ + if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX)) + elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];rdata->data = data;
Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble,
that's the real limit. And if you check for that here, you don't need to
check it in XLogRecordAssemble.
@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
I don't think the change to uint64 is necessary. If all the data blocks
are limited to 64 kB, and the number of blocks is limited, and the
number of blocks is limited too.
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data) { + /* protect against overflow */ + if (unlikely(regbuf->rdata_len > UINT16_MAX)) + elog(ERROR, "too much WAL data for registered buffer"); + /* * Link the caller-supplied rdata chain for this buffer to the * overall list. @@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next) COMP_CRC32C(rdata_crc, rdt->data, rdt->len);+ /* + * Ensure that xlogreader.c can read the record; and check that we don't + * accidentally overflow the size of the record. + * */ + if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX)) + elog(ERROR, "too much registered data for WAL record"); + /* * Fill in the fields in the record header. Prev-link is filled in later, * once we know where in the WAL the record will be inserted. The CRC does
It's enough to check AllocSizeIsValid(total_len), no need to also check
against UINT32_MAX.
- Heikki
On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Have you been able to create a test case for that? The largest record I
can think of is a commit record with a huge number of subtransactions,
dropped relations, and shared inval messages. I'm not sure if you can
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.
I believe that wal_level=logical can generate very large update and
delete records, especially with REPLICA IDENTITY FULL.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
Have you been able to create a test case for that? The largest record I can
think of is a commit record with a huge number of subtransactions, dropped
relations, and shared inval messages. I'm not sure if you can overflow a
uint32 with that, but exceeding MaxAllocSize seems possible.
MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);
on a standby:
2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long
I wonder if these checks hurt performance. These are very cheap, but then
again, this codepath is very hot. It's probably fine, but it still worries
me a little. Maybe some of these could be Asserts.
I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.
FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.
So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, if (needs_data) { + /* protect against overflow */ + if (unlikely(regbuf->rdata_len > UINT16_MAX)) + elog(ERROR, "too much WAL data for registered buffer"); + /* * Link the caller-supplied rdata chain for this buffer to the * overall list.
FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.
ISTM that this could just be an assert?
Greetings,
Andres Freund
Thank you all for the feedback. Please find attached v2 of the
patchset, which contains updated comments and applies the suggested
changes.
On Sat, 12 Mar 2022 at 02:03, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
Have you been able to create a test case for that? The largest record I can
think of is a commit record with a huge number of subtransactions, dropped
relations, and shared inval messages. I'm not sure if you can overflow a
uint32 with that, but exceeding MaxAllocSize seems possible.MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);on a standby:
2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long
Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.
I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.
I wonder if these checks hurt performance. These are very cheap, but then
again, this codepath is very hot. It's probably fine, but it still worries
me a little. Maybe some of these could be Asserts.I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?
Updated.
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, if (needs_data) { + /* protect against overflow */ + if (unlikely(regbuf->rdata_len > UINT16_MAX)) + elog(ERROR, "too much WAL data for registered buffer"); + /* * Link the caller-supplied rdata chain for this buffer to the * overall list.FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.ISTM that this could just be an assert?
This specific location has been replaced with an Assert, while
XLogRegisterBufData always does the unlikely()-ed bounds check.
Kind regards,
Matthias
Attachments:
v2-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/octet-stream; name=v2-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From 2da7f0258e458bf37df79ca3efcf2bd37b27e667 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v2] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
src/backend/access/transam/xloginsert.c | 37 +++++++++++++++++++++----
src/include/access/xlogrecord.h | 4 +++
2 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..71a649e216 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));
- if (num_rdatas >= max_rdatas)
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed
+ */
+ if (unlikely(num_rdatas >= max_rdatas ||
+ !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical record format.
+ */
+ if (unlikely(num_rdatas >= max_rdatas ||
+ regbuf->rdata_len + len > UINT16_MAX))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -505,7 +517,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -734,12 +746,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -836,6 +854,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record can be allocated.
+ */
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ elog(ERROR, "too much registered data for WAL record");
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index c1b1137aa7..950e1f22b0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
Hi
A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aestethically be
pleasing.
On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.
We definitely shouldn't require this much RAM for the tests.
It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.
--- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len) { XLogRecData *rdata;- Assert(begininsert_called); + Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));
Shouldn't we just make the length argument unsigned?
- if (num_rdatas >= max_rdatas) + /* + * Check against max_rdatas; and ensure we don't fill a record with + * more data than can be replayed + */ + if (unlikely(num_rdatas >= max_rdatas || + !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len))) elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];
Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.
rdata->data = data;
@@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;- Assert(begininsert_called); + Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);- if (num_rdatas >= max_rdatas) + /* + * Check against max_rdatas; and ensure we don't register more data per + * buffer than can be handled by the physical record format. + */ + if (unlikely(num_rdatas >= max_rdatas || + regbuf->rdata_len + len > UINT16_MAX)) elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];
Given the repeated check it might be worth to just put it in a static inline
used from the relevant places (which'd generate less code because the same
line number would be used for all the checks).
Greetings,
Andres Freund
On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
Hi
A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aestethically be
pleasing.
That would be interesting; though out of scope for this bug I'm trying to fix.
On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.We definitely shouldn't require this much RAM for the tests.
It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.--- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len) { XLogRecData *rdata;- Assert(begininsert_called); + Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));Shouldn't we just make the length argument unsigned?
I've applied that in the attached revision; but I'd like to note that
this makes the fix less straightforward to backpatch; as the changes
to the public function signatures shouldn't be applied in older
versions.
- if (num_rdatas >= max_rdatas) + /* + * Check against max_rdatas; and ensure we don't fill a record with + * more data than can be replayed + */ + if (unlikely(num_rdatas >= max_rdatas || + !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len))) elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.
Agreed, updated.
rdata->data = data;
@@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;- Assert(begininsert_called); + Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);- if (num_rdatas >= max_rdatas) + /* + * Check against max_rdatas; and ensure we don't register more data per + * buffer than can be handled by the physical record format. + */ + if (unlikely(num_rdatas >= max_rdatas || + regbuf->rdata_len + len > UINT16_MAX)) elog(ERROR, "too much WAL data"); + rdata = &rdatas[num_rdatas++];Given the repeated check it might be worth to just put it in a static inline
used from the relevant places (which'd generate less code because the same
line number would be used for all the checks).
The check itself is slightly different in those 3 places; but the
error message is shared. Do you mean to extract the elog() into a
static inline function (as attached), or did I misunderstand?
-Matthias
Attachments:
v3-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/x-patch; name=v3-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From 9e4b70ca096469616fd1320a02cbde129a4943b6 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v3] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
src/backend/access/transam/xloginsert.c | 56 ++++++++++++++++++++-----
src/include/access/xloginsert.h | 4 +-
src/include/access/xlogrecord.h | 4 ++
3 files changed, 52 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..e87642ccae 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -127,6 +127,17 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
/*
* Begin constructing a WAL record. This must be called before the
@@ -334,14 +345,20 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && AllocSizeIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -372,12 +389,12 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -385,8 +402,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical record format.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -505,7 +528,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -734,12 +757,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -836,6 +865,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record can be allocated.
+ */
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 5fc340c434..5ca0b0c2f0 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileNode *rnode,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index c1b1137aa7..950e1f22b0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
Apart from registering this in CF 2022-07 last Friday, I've also just
added this issue to the Open Items list for PG15 under "Older bugs
affecting stable branches"; as a precaution to not lose track of this
issue in the buzz of the upcoming feature freeze.
-Matthias
Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?
The changes that were requested by Heikki and Andres have been merged
into patch v3, and I think it would be nice to fix this security issue
in the upcoming minor release(s).
-Matthias
On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:
Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?
The next minor release is three weeks away, so now would be a good
time to get that addressed. Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael
Hi,
MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM
repeat(repeat(' ', 1024), 1024*1023) as l(long);
on a standby:
2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record
length 2145386550 at 0/3000060 too long
Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.
A little confused here, does this patch V3 intend to solve this problem
"record length 2145386550 at 0/3000060 too long"?
I set up a simple Primary and Standby stream replication environment,
and use the above query to run the test for before and after patch v3.
The error message still exist, but with different message.
Before patch v3, the error is showing below,
2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at
0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL: terminating walreceiver
process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG: record length 2145386550 at
0/3000060 too long
After patch v3, the error displays differently
2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at
0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 000000010000000000000045 has
already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG: record length 2145386550 at
0/3000060 too long
And once the error happens, then the Standby can't continue the replication.
Is a particular reason to say "more datas" at line 52 in patch v3?
+ * more datas than are being accounted for by the XLog infrastructure.
On 2022-04-18 10:19 p.m., Michael Paquier wrote:
On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:
Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?The next minor release is three weeks away, so now would be a good
time to get that addressed. Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael
Thank you,
--
David
Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
On Sat, 11 Jun 2022 at 01:32, David Zhang <david.zhang@highgo.ca> wrote:
Hi,
MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);on a standby:
2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG: record length 2145386550 at 0/3000060 too long
Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.A little confused here, does this patch V3 intend to solve this problem "record length 2145386550 at 0/3000060 too long"?
No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.
I set up a simple Primary and Standby stream replication environment, and use the above query to run the test for before and after patch v3. The error message still exist, but with different message.
Before patch v3, the error is showing below,
2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL: terminating walreceiver process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG: record length 2145386550 at 0/3000060 too longAfter patch v3, the error displays differently
2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000045 has already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG: record length 2145386550 at 0/3000060 too longAnd once the error happens, then the Standby can't continue the replication.
Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.
Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.
Is a particular reason to say "more datas" at line 52 in patch v3?
+ * more datas than are being accounted for by the XLog infrastructure.
Yes. This error is thrown when you try to register a 34th block, or an
Nth rdata where the caller previously only reserved n - 1 data slots.
As such 'datas', for the num_rdatas and max_rdatas variables.
Thanks for looking at the patch.
- Matthias
A little confused here, does this patch V3 intend to solve this problem "record length 2145386550 at 0/3000060 too long"?
No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.I set up a simple Primary and Standby stream replication environment, and use the above query to run the test for before and after patch v3. The error message still exist, but with different message.
Before patch v3, the error is showing below,
2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL: terminating walreceiver process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG: record length 2145386550 at 0/3000060 too longAfter patch v3, the error displays differently
2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at 0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000045 has already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG: record length 2145386550 at 0/3000060 too longAnd once the error happens, then the Standby can't continue the replication.
Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.
Thanks a lot for the clarification. My testing environment is pretty
simple, initdb for Primary, run basebackup and set the connection string
for Standby, then run the "pg_logical_emit_message" query and tail the
log on standby side.
Best regards,
--
David
Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.
As long as you cannot generate such WAL records that should be fine as
wAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD. The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
[...]
+inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
The three checks are different, OK.. Note that static is missing.
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?
--
Michael
On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.As long as you cannot generate such WAL records that should be fine as
wAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD. The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.
The signature change is mostly ornamental, see attached v4.backpatch.
The main reason for changing the signature is to make sure nobody can
provide a negative value, but it's not important to the patch.
+ if (unlikely(num_rdatas >= max_rdatas) || + unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len))) + XLogErrorDataLimitExceeded(); [...] +inline void +XLogErrorDataLimitExceeded() +{ + elog(ERROR, "too much WAL data"); +} The three checks are different, OK..
They each check slightly different things, but with the same error. In
RegisterData, it checks that the data can still be allocated and does
not overflow the register, in RegisterBlock it checks that the total
length of data registered to the block does not exceed the max value
of XLogRecordBlockHeader->data_length. I've updated the comments above
the checks so that this distinction is more clear.
Note that static is missing.
Fixed in attached v4.patch
+ if (unlikely(!AllocSizeIsValid(total_len))) + XLogErrorDataLimitExceeded(); Rather than a single check at the end of XLogRecordAssemble(), you'd better look after that each time total_len is added up?
I was doing so previously, but there were some good arguments against that:
- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so removing check counts reduces overhead of this patch.
- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reasons for the error, but I see no specific
reason why the specific reason would need to be reported to the
connection.
Kind regards,
Matthias van de Meent
Attachments:
v4-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/octet-stream; name=v4-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From f4cc40351f56fdb6eaa4531c85677e80ea6feaf5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v4] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 62 +++++++++++++++++++++----
src/include/access/xloginsert.h | 4 +-
src/include/access/xlogrecord.h | 4 ++
3 files changed, 58 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ce9be2cc7..a3387a9265 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -348,14 +359,24 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && AllocSizeIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk, so
+ * ensure AllocSizeIsValid() for that size of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -386,12 +407,12 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +420,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +548,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +785,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +893,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record can be allocated.
+ */
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 5fc340c434..5ca0b0c2f0 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileNode *rnode,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 052ac6817a..3575363e84 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
v4-0001-Add-protections-in-xlog-record-APIs-against-large.backpatchapplication/octet-stream; name=v4-0001-Add-protections-in-xlog-record-APIs-against-large.backpatchDownload
From e980e58d9cf8470a6ed12b0d8ac979c68d5cc17c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v4] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 58 +++++++++++++++++++++----
src/include/access/xlogrecord.h | 4 ++
2 files changed, 54 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ce9be2cc7..0308531096 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -352,10 +363,20 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && AllocSizeIsValid(len) && len >= 0);
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk, so
+ * ensure AllocSizeIsValid() for that size of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -391,7 +412,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX && len >= 0);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +420,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +548,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +785,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +893,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record can be allocated.
+ */
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 052ac6817a..3575363e84 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
On Mon, Jun 20, 2022 at 11:01:51AM +0200, Matthias van de Meent wrote:
On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:
+ if (unlikely(!AllocSizeIsValid(total_len))) + XLogErrorDataLimitExceeded(); Rather than a single check at the end of XLogRecordAssemble(), you'd better look after that each time total_len is added up?I was doing so previously, but there were some good arguments against that:
- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so removing check counts reduces overhead of this patch.
Some macro-benchmarking could be in place here, and this would most
likely become noticeable when assembling a bunch of little records?
- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reasons for the error, but I see no specific
reason why the specific reason would need to be reported to the
connection.
Okay.
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record can be allocated.
+ */
+ if (unlikely(!AllocSizeIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.
--
Michael
On Tue, 21 Jun 2022 at 03:45, Michael Paquier <michael@paquier.xyz> wrote:
+ /* + * Ensure that xlogreader.c can read the record by ensuring that the + * data section of the WAL record can be allocated. + */ + if (unlikely(!AllocSizeIsValid(total_len))) + XLogErrorDataLimitExceeded();By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.
I see, thanks for notifying me about that.
PFA a correction for that issue. It does copy over the value for
MaxAllocSize from memutils.h into xlogreader.h, because we need that
value in FRONTEND builds too, and memutils.h can't be included in
FRONTEND builds. One file suffixed with .backpatch that doesn't
include the function signature changes, but it is not optimized for
any stable branch[15]it should apply to stable branches all the way back to REL_15_STABLE and still work as expected. Any older than that I haven't tested, but probably only require some updates for XLogRecMaxLength in xlogreader.h..
-Matthias
PS. I'm not amused by the double copy we do in the xlogreader, as I
had expected we'd just read the record and point into that single
xl_rec_len-sized buffer. Apparently that's not how it works...
[15]: it should apply to stable branches all the way back to REL_15_STABLE and still work as expected. Any older than that I haven't tested, but probably only require some updates for XLogRecMaxLength in xlogreader.h.
REL_15_STABLE and still work as expected. Any older than that I
haven't tested, but probably only require some updates for
XLogRecMaxLength in xlogreader.h.
Attachments:
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/octet-stream; name=v5-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From a9647e8ef3f3e8036e051a186818888dd8f65a80 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v5] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 63 +++++++++++++++++++++----
src/backend/access/transam/xlogreader.c | 12 ++++-
src/include/access/xloginsert.h | 4 +-
src/include/access/xlogreader.h | 36 +++++++++++++-
src/include/access/xlogrecord.h | 4 ++
5 files changed, 105 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ce9be2cc7..f8d7bf9671 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -348,14 +359,25 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && XLogRecLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecLengthIsValid() for that size
+ * of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!XLogRecLengthIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -386,12 +408,12 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +421,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +786,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +894,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record (plus infra overhead) can be allocated.
+ */
+ if (unlikely(!XLogRecLengthIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23cb8..63a4ff0a0d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -669,7 +669,17 @@ restart:
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
- (uint32) SizeOfXLogRecord, total_len);
+ (uint32) SizeOfXLogRecord,
+ total_len);
+ goto err;
+ }
+ if (!XLogRecLengthIsValid(total_len))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X: wanted %u, got %u",
+ LSN_FORMAT_ARGS(RecPtr),
+ (uint32) XLogRecMaxLength,
+ total_len);
goto err;
}
gotheader = false;
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 5fc340c434..5ca0b0c2f0 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileNode *rnode,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e73ea4a840..b2939b13c5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,6 +36,7 @@
#ifndef FRONTEND
#include "access/transam.h"
+#include "utils/memutils.h"
#endif
#include "access/xlogrecord.h"
@@ -425,7 +426,40 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-#endif
+#else
+#define MaxAllocSize (Size) 0x3FFFFFFF
+#endif /* ifndef FRONTEND */
+
+/*
+ * XLogRecMaxSize - the maximum WAL record size that the xlogreader
+ * infrastructure can handle.
+ *
+ * Defined by MaxAllocSize and the various overheads used in the
+ * DecodedXLogRecord infrastructure.
+ *
+ * see DecodeXLogRecordRequiredSpace and allocate_recordbuf for places where
+ * allocations of full records + overhead are done.
+ */
+#define XLogRecMaxLength \
+( \
+ Min( \
+ (MaxAllocSize - ( \
+ offsetof(DecodedXLogRecord, blocks[0]) + \
+ (sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1)) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) \
+ )), \
+ MaxAllocSize - (MaxAllocSize % XLOG_BLCKSZ) \
+ ) \
+)
+
+/*
+ * XLogRecLengthIsValid
+ *
+ * Check that this length does not exceed the maximum xlog record length
+ */
+#define XLogRecLengthIsValid(xlr_size) ((xlr_size) <= XLogRecMaxLength)
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 052ac6817a..3575363e84 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch.backpatchapplication/octet-stream; name=v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch.backpatchDownload
From 321cdb85893b57e959f8633d9ffd866f8be2b3fd Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v5] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 59 +++++++++++++++++++++----
src/backend/access/transam/xlogreader.c | 12 ++++-
src/include/access/xlogreader.h | 36 ++++++++++++++-
src/include/access/xlogrecord.h | 4 ++
4 files changed, 101 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ce9be2cc7..a8ac036cbd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -352,10 +363,21 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && XLogRecLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecLengthIsValid() for that size
+ * of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!XLogRecLengthIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -391,7 +413,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +421,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +786,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +894,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record (plus infra overhead) can be allocated.
+ */
+ if (unlikely(!XLogRecLengthIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23cb8..63a4ff0a0d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -669,7 +669,17 @@ restart:
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
- (uint32) SizeOfXLogRecord, total_len);
+ (uint32) SizeOfXLogRecord,
+ total_len);
+ goto err;
+ }
+ if (!XLogRecLengthIsValid(total_len))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X: wanted %u, got %u",
+ LSN_FORMAT_ARGS(RecPtr),
+ (uint32) XLogRecMaxLength,
+ total_len);
goto err;
}
gotheader = false;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e73ea4a840..b2939b13c5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,6 +36,7 @@
#ifndef FRONTEND
#include "access/transam.h"
+#include "utils/memutils.h"
#endif
#include "access/xlogrecord.h"
@@ -425,7 +426,40 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-#endif
+#else
+#define MaxAllocSize (Size) 0x3FFFFFFF
+#endif /* ifndef FRONTEND */
+
+/*
+ * XLogRecMaxSize - the maximum WAL record size that the xlogreader
+ * infrastructure can handle.
+ *
+ * Defined by MaxAllocSize and the various overheads used in the
+ * DecodedXLogRecord infrastructure.
+ *
+ * see DecodeXLogRecordRequiredSpace and allocate_recordbuf for places where
+ * allocations of full records + overhead are done.
+ */
+#define XLogRecMaxLength \
+( \
+ Min( \
+ (MaxAllocSize - ( \
+ offsetof(DecodedXLogRecord, blocks[0]) + \
+ (sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1)) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) \
+ )), \
+ MaxAllocSize - (MaxAllocSize % XLOG_BLCKSZ) \
+ ) \
+)
+
+/*
+ * XLogRecLengthIsValid
+ *
+ * Check that this length does not exceed the maximum xlog record length
+ */
+#define XLogRecLengthIsValid(xlr_size) ((xlr_size) <= XLogRecMaxLength)
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 052ac6817a..3575363e84 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
Hi,
I tried to apply this patch v5 to current master branch but it complains,
"git apply --check
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"
then I checked it out before the commit
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.
1) both make check and make installcheck passed.
2) and I can also see this patch v5 prevents the error happens previously,
"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
repeat(repeat(' ', 1024), 1024*1023) as l(long);
ERROR: too much WAL data"
3) without this v5 patch, the same test will cause the standby crash
like below, and the standby not be able to boot up after this crash.
"2022-07-08 12:28:16.425 PDT [2363] FATAL: invalid memory alloc request
size 2145388995
2022-07-08 12:28:16.426 PDT [2360] LOG: startup process (PID 2363)
exited with exit code 1
2022-07-08 12:28:16.426 PDT [2360] LOG: terminating any other active
server processes
2022-07-08 12:28:16.427 PDT [2360] LOG: shutting down due to startup
process failure
2022-07-08 12:28:16.428 PDT [2360] LOG: database system is shut down"
Best regards,
--
David
Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
On Fri, 8 Jul 2022 at 21:35, David Zhang <david.zhang@highgo.ca> wrote:
Hi,
I tried to apply this patch v5 to current master branch but it complains,
"git apply --check
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"then I checked it out before the commit
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.
The attached rebased patchset should work with master @ 2cd2569c and
REL_15_STABLE @ 53df1e28. I've also added a patch that works for PG14
and earlier, which should be correct for all versions that include
commit 2c03216d (that is, all versions back to 9.5).
1) both make check and make installcheck passed.
2) and I can also see this patch v5 prevents the error happens previously,
"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
repeat(repeat(' ', 1024), 1024*1023) as l(long);
ERROR: too much WAL data"3) without this v5 patch, the same test will cause the standby crash
like below, and the standby not be able to boot up after this crash.
Thanks for reviewing.
Kind regards,
Matthias van de Meent
Attachments:
v6-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/octet-stream; name=v6-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From af666e94f54b6bb81f1a6a86c39c4cfb2db9e799 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v6] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 63 +++++++++++++++++++++----
src/backend/access/transam/xlogreader.c | 12 ++++-
src/include/access/xloginsert.h | 4 +-
src/include/access/xlogreader.h | 36 +++++++++++++-
src/include/access/xlogrecord.h | 4 ++
5 files changed, 105 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3c29fa909..814d83b362 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -348,14 +359,25 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && XLogRecLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecLengthIsValid() for that size
+ * of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!XLogRecLengthIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -386,12 +408,12 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +421,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +786,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +894,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record (plus infra overhead) can be allocated.
+ */
+ if (unlikely(!XLogRecLengthIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3dc4b7797..deb1dff3d0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -669,7 +669,17 @@ restart:
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
- (uint32) SizeOfXLogRecord, total_len);
+ (uint32) SizeOfXLogRecord,
+ total_len);
+ goto err;
+ }
+ if (!XLogRecLengthIsValid(total_len))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X: wanted %u, got %u",
+ LSN_FORMAT_ARGS(RecPtr),
+ (uint32) XLogRecMaxLength,
+ total_len);
goto err;
}
gotheader = false;
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index c04f77b173..aed4643d1c 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 5395f155aa..8f7ce4e5fe 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,6 +36,7 @@
#ifndef FRONTEND
#include "access/transam.h"
+#include "utils/memutils.h"
#endif
#include "access/xlogrecord.h"
@@ -425,7 +426,40 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-#endif
+#else
+#define MaxAllocSize (Size) 0x3FFFFFFF
+#endif /* ifndef FRONTEND */
+
+/*
+ * XLogRecMaxSize - the maximum WAL record size that the xlogreader
+ * infrastructure can handle.
+ *
+ * Defined by MaxAllocSize and the various overheads used in the
+ * DecodedXLogRecord infrastructure.
+ *
+ * see DecodeXLogRecordRequiredSpace and allocate_recordbuf for places where
+ * allocations of full records + overhead are done.
+ */
+#define XLogRecMaxLength \
+( \
+ Min( \
+ (MaxAllocSize - ( \
+ offsetof(DecodedXLogRecord, blocks[0]) + \
+ (sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1)) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) \
+ )), \
+ MaxAllocSize - (MaxAllocSize % XLOG_BLCKSZ) \
+ ) \
+)
+
+/*
+ * XLogRecLengthIsValid
+ *
+ * Check that this length does not exceed the maximum xlog record length
+ */
+#define XLogRecLengthIsValid(xlr_size) ((xlr_size) <= XLogRecMaxLength)
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 835151ec92..2a282dac0c 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
v6-0001-Add-protections-in-xlog-record-APIs-against-large.v15-patchapplication/octet-stream; name=v6-0001-Add-protections-in-xlog-record-APIs-against-large.v15-patchDownload
From fa9a18dfa478f88b66e3f831057ecd9e986147b3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v6] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 59 +++++++++++++++++++++----
src/backend/access/transam/xlogreader.c | 12 ++++-
src/include/access/xlogreader.h | 36 ++++++++++++++-
src/include/access/xlogrecord.h | 4 ++
4 files changed, 101 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ce9be2cc7..a8ac036cbd 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -352,10 +363,21 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && XLogRecLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecLengthIsValid() for that size
+ * of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!XLogRecLengthIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -391,7 +413,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +421,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -756,12 +786,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +894,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record (plus infra overhead) can be allocated.
+ */
+ if (unlikely(!XLogRecLengthIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23cb8..63a4ff0a0d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -669,7 +669,17 @@ restart:
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
- (uint32) SizeOfXLogRecord, total_len);
+ (uint32) SizeOfXLogRecord,
+ total_len);
+ goto err;
+ }
+ if (!XLogRecLengthIsValid(total_len))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X: wanted %u, got %u",
+ LSN_FORMAT_ARGS(RecPtr),
+ (uint32) XLogRecMaxLength,
+ total_len);
goto err;
}
gotheader = false;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e73ea4a840..b2939b13c5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,6 +36,7 @@
#ifndef FRONTEND
#include "access/transam.h"
+#include "utils/memutils.h"
#endif
#include "access/xlogrecord.h"
@@ -425,7 +426,40 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-#endif
+#else
+#define MaxAllocSize (Size) 0x3FFFFFFF
+#endif /* ifndef FRONTEND */
+
+/*
+ * XLogRecMaxSize - the maximum WAL record size that the xlogreader
+ * infrastructure can handle.
+ *
+ * Defined by MaxAllocSize and the various overheads used in the
+ * DecodedXLogRecord infrastructure.
+ *
+ * see DecodeXLogRecordRequiredSpace and allocate_recordbuf for places where
+ * allocations of full records + overhead are done.
+ */
+#define XLogRecMaxLength \
+( \
+ Min( \
+ (MaxAllocSize - ( \
+ offsetof(DecodedXLogRecord, blocks[0]) + \
+ (sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1)) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) + \
+ (MAXIMUM_ALIGNOF - 1) \
+ )), \
+ MaxAllocSize - (MaxAllocSize % XLOG_BLCKSZ) \
+ ) \
+)
+
+/*
+ * XLogRecLengthIsValid
+ *
+ * Check that this length does not exceed the maximum xlog record length
+ */
+#define XLogRecLengthIsValid(xlr_size) ((xlr_size) <= XLogRecMaxLength)
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 052ac6817a..3575363e84 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
v6-0001-Add-protections-in-xlog-record-APIs-against-large.v14-patchapplication/octet-stream; name=v6-0001-Add-protections-in-xlog-record-APIs-against-large.v14-patchDownload
From 21a933ab3d2c0d9cc2b7c06329df28145f5930df Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v6] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/backend/access/transam/xloginsert.c | 59 +++++++++++++++++++++----
src/backend/access/transam/xlogreader.c | 12 ++++-
src/include/access/xlogreader.h | 31 ++++++++++++-
src/include/access/xlogrecord.h | 4 ++
4 files changed, 96 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b153fad594..f73785911a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -114,7 +114,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -331,10 +342,21 @@ XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && XLogRecLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecLengthIsValid() for that size
+ * of record.
+ *
+ * We also protect against overflows by first casting to uint64 - with only
+ * uint32-operations this could still be an issue.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(!XLogRecLengthIsValid((uint64) mainrdata_len + (uint64) len)))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -370,7 +392,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -378,8 +400,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -495,7 +525,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -703,12 +733,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -805,6 +841,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record by ensuring that the
+ * data section of the WAL record (plus infra overhead) can be allocated.
+ */
+ if (unlikely(!XLogRecLengthIsValid(total_len)))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index d797d9d508..1e39b3033e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -402,7 +402,17 @@ restart:
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
LSN_FORMAT_ARGS(RecPtr),
- (uint32) SizeOfXLogRecord, total_len);
+ (uint32) SizeOfXLogRecord,
+ total_len);
+ goto err;
+ }
+ if (!XLogRecLengthIsValid(total_len))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X: wanted %u, got %u",
+ LSN_FORMAT_ARGS(RecPtr),
+ (uint32) XLogRecMaxLength,
+ total_len);
goto err;
}
gotheader = false;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 10458c23ed..1cbc70a04e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,6 +36,7 @@
#ifndef FRONTEND
#include "access/transam.h"
+#include "utils/memutils.h"
#endif
#include "access/xlogrecord.h"
@@ -329,7 +330,35 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-#endif
+#else
+#define MaxAllocSize (Size) 0x3FFFFFFF
+#endif /* ifndef FRONTEND */
+
+/*
+ * XLogRecMaxSize - the maximum WAL record size that the xlogreader
+ * infrastructure can handle.
+ *
+ * Defined by MaxAllocSize and the various overheads used in the
+ * DecodedXLogRecord infrastructure.
+ *
+ * Places that allocate memory based on xlogrecords' data lengths:
+ * - allocate_recordbuf
+ * Allocates xl_tot_len, rounded up to the next multiple of
+ * XLOG_BLCKSZ.
+ * - DecodeXLogRecord
+ * Allocates a buffer for xlog main data only, rounded up
+ * to the next multiple of BLCKSZ/2.
+ */
+#define XLogRecMaxLength \
+ (MaxAllocSize - (MaxAllocSize % XLOG_BLCKSZ))
+
+
+/*
+ * XLogRecLengthIsValid
+ *
+ * Check that this length does not exceed the maximum xlog record length
+ */
+#define XLogRecLengthIsValid(xlr_size) ((xlr_size) <= XLogRecMaxLength)
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index f68cb18912..2f60a3ab39 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
*/
typedef struct XLogRecord
{
--
2.30.2
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
Thanks for reviewing.
I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no? And the origin here is when the record is assembled. At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH. So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().
@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
This has no need to change.
My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:
+ /*
+ * Ensure that xlogreader.c can read the record.
+ */
+ if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
+ elog(ERROR, "too much WAL data");
This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.
Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()). The second part to block the
creation of the assembled record is simpler, now
DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
though we could inline it in the worst case?
--
Michael
Attachments:
v7-0001-Add-protections-in-xlog-record-APIs-against-large_michael.patchtext/x-diff; charset=us-asciiDownload
From 2031f15fe90c94deb21d4e41b717ada9a8bdb520 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v7] Add protections in xlog record APIs against large numbers
and overflows.
Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
src/include/access/xloginsert.h | 4 ++--
src/backend/access/transam/xloginsert.c | 30 ++++++++++++++++++++-----
2 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index c04f77b173..aed4643d1c 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3c29fa909..3788429b70 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -348,13 +348,13 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
Assert(begininsert_called);
- if (num_rdatas >= max_rdatas)
+ if (unlikely(num_rdatas >= max_rdatas))
elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
@@ -386,7 +386,7 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
@@ -399,8 +399,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
+ /*
+ * Check against max_rdatas and ensure we do not register more data per
+ * buffer than can be handled by the physical data format; i.e. that
+ * regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (unlikely(num_rdatas >= max_rdatas) ||
+ unlikely(regbuf->rdata_len + len > UINT16_MAX))
elog(ERROR, "too much WAL data");
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -756,12 +764,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +872,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that xlogreader.c can read the record.
+ */
+ if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
+ elog(ERROR, "too much WAL data");
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
--
2.36.1
On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aestethically be
pleasing.
I just remembered your comment while going through the xlog code and
thought this about the same issue: We still have 2 bytes of padding in
XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
space for rmgr-specific flags, as opposed to stealing bits from
xl_info?
Kind regards,
Matthias van de Meent
Hi,
On 2022-07-15 11:25:54 +0200, Matthias van de Meent wrote:
On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aestethically be
pleasing.I just remembered your comment while going through the xlog code and
thought this about the same issue: We still have 2 bytes of padding in
XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
space for rmgr-specific flags, as opposed to stealing bits from
xl_info?
Sounds like a good idea to me. I'm not sure who is stealing bits from what
right now, but it clearly seems worthwhile to separate "flags" from "record
type within rmgr".
I think we should split it at least into three things:
1) generic per-record flags for xlog machinery (ie. XLR_SPECIAL_REL_UPDATE, XLR_CHECK_CONSISTENCY)
2) rmgr record type identifier (e.g. XLOG_HEAP_*)
2) rmgr specific flags (e.g. XLOG_HEAP_INIT_PAGE)
Greetings,
Andres Freund
On 13/07/2022 08:54, Michael Paquier wrote:
I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no? And the origin here is when the record is assembled. At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH. So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included){
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
This has no need to change.My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:+ /* + * Ensure that xlogreader.c can read the record. + */ + if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))) + elog(ERROR, "too much WAL data");This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.
The way this is written, it would change whenever we add/remove fields
in DecodedBkpBlock, for example. That's fragile; if you added a field in
a back-branch, you could accidentally make the new minor version unable
to read maximum-sized WAL records generated with an older version. I'd
like the maximum to be more explicit.
How large exactly is the maximum size that this gives? I'd prefer to set
the limit conservatively to 1020 MB, for example, with a compile-time
static assertion that
AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).
Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).
+1. I'm not excited about adding the "unlikely()" hints, though. We have
a pg_attribute_cold hint in ereport(), that should be enough.
- Heikki
On Wed, 13 Jul 2022 at 07:54, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
Thanks for reviewing.
I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no? And the origin here is when the record is assembled. At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH.
I would prefer it if we would fix the "cannot catch up to primary
because of oversized WAL record" issue in backbranches too. Rather
than failing to recover after failure or breaking replication streams,
I'd rather be unable to write the singular offending WAL record and
break up to one transaction.
So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included){
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
This has no need to change.My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:+ /* + * Ensure that xlogreader.c can read the record. + */ + if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))) + elog(ERROR, "too much WAL data");
Huh, yeah, I hadn't thought of that, but that's much simpler indeed.
This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.
Yes, I see your point.
Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).
The second part to block the
creation of the assembled record is simpler, now
DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
though we could inline it in the worst case?
I think that would be better for performance, yes.
DecodeXLogRecordRequiredSpace will already be optimized to just a
single addition by any of `-O[123]`, so keeping this indirection is
quite expensive (relative to the operation being performed).
As for your patch patch:
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;Assert(begininsert_called);
- if (num_rdatas >= max_rdatas) + if (unlikely(num_rdatas >= max_rdatas)) elog(ERROR, "too much WAL data"); rdata = &rdatas[num_rdatas++];
XLogRegisterData is designed to be called multiple times for each
record, and this allows the user of the API to overflow the internal
mainrdata_len field if we don't check that the field does not exceed
the maximum record length (or overflow the 32-bit field). As such, I'd
still want a len-check in that function.
I'll send an updated patch by tomorrow.
- Matthias
On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
The way this is written, it would change whenever we add/remove fields in
DecodedBkpBlock, for example. That's fragile; if you added a field in a
back-branch, you could accidentally make the new minor version unable to
read maximum-sized WAL records generated with an older version. I'd like the
maximum to be more explicit.
That's a good point.
How large exactly is the maximum size that this gives? I'd prefer to set the
limit conservatively to 1020 MB, for example, with a compile-time static
assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).
Something like that would work, I guess.
Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).+1. I'm not excited about adding the "unlikely()" hints, though. We have a
pg_attribute_cold hint in ereport(), that should be enough.
Okay, that makes sense. FWIW, I have been wondering about the
addition of the extra condition in XLogRegisterBufData() and I did not
see a difference on HEAD in terms of execution time or profile, with a
micro-benchmark doing a couple of million calls in a row as of the
following, roughly:
// Can be anything, really..
rel = relation_open(RelationRelationId, AccessShareLock);
buffer = ReadBuffer(rel, 0);
for (i = 0 ; i < WAL_MAX_CALLS ; i++)
{
XLogBeginInsert();
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, buf, 10);
XLogResetInsertion();
}
ReleaseBuffer(buffer);
relation_close(rel, AccessShareLock);
--
Michael
On Tue, 26 Jul 2022 at 09:20, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).+1. I'm not excited about adding the "unlikely()" hints, though. We have a
pg_attribute_cold hint in ereport(), that should be enough.Okay, that makes sense. FWIW, I have been wondering about the
addition of the extra condition in XLogRegisterBufData() and I did not
see a difference on HEAD in terms of execution time or profile, with a
micro-benchmark doing a couple of million calls in a row as of the
following, roughly:
[...]
Thanks for testing.
How large exactly is the maximum size that this gives? I'd prefer to set the
limit conservatively to 1020 MB, for example, with a compile-time static
assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).Something like that would work, I guess.
I've gone over the patch and reviews again, and updated those places
that received comments:
- updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
macros (now in xlogrecord.h), with a max length of the somewhat
arbitrary 1020MiB.
This leaves room for approx. 4MiB of per-record allocation overhead
before you'd hit MaxAllocSize, and also detaches the dependency on
memutils.h.
- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
- Dropped any changes in xlogreader.h/c
Kind regards,
Matthias van de Meent
Attachments:
v8-0001-Add-protections-in-xlog-record-APIs-against-large.patchapplication/octet-stream; name=v8-0001-Add-protections-in-xlog-record-APIs-against-large.patchDownload
From c2e24db75eb6d1f1e647173081964971a8454c9b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v8] Add protections in xlog record APIs against large numbers
and overflows.
Before this, it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
Emitting invalid records was also possible through pg_logical_emit_message(),
which allowed you to emit arbitrary amounts of data up to 2GB, much higher
than the replay limit of approximately 1GB.
This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read any
max-sized XLogRecords, such that the wal-generating backend will fail when
it attempts to insert unreadable records, instead of that insertion
succeeding but breaking any replication streams.
---
src/backend/access/transam/xloginsert.c | 77 ++++++++++++++++++++++---
src/include/access/xloginsert.h | 4 +-
src/include/access/xlogrecord.h | 13 +++++
3 files changed, 83 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3c29fa909..4b343c1c12 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,7 +141,18 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
+
/*
* Begin constructing a WAL record. This must be called before the
* XLogRegister* functions and XLogInsert().
@@ -348,14 +359,29 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
* XLogRecGetData().
*/
void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && XLogRecordLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecordLengthIsValid() for that
+ * size of record.
+ *
+ * Additionally, check that we don't accidentally overflow the
+ * intermediate sum value on 32-bit systems by ensuring that the
+ * sum of the two inputs is no less than one of the inputs.
+ */
+ if (num_rdatas >= max_rdatas ||
+#if SIZEOF_SIZE_T == 4
+ mainrdata_len + len < len ||
+#endif
+ !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -386,12 +412,12 @@ XLogRegisterData(char *data, int len)
* limited)
*/
void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -399,8 +425,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ /*
+ * Check against max_rdatas; and ensure we don't register more data per
+ * buffer than can be handled by the physical data format;
+ * i.e. that regbuf->rdata_len does not grow beyond what
+ * XLogRecordBlockHeader->data_length can hold.
+ */
+ if (num_rdatas >= max_rdatas ||
+ regbuf->rdata_len + len > UINT16_MAX)
+ XLogErrorDataLimitExceeded();
+
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -519,6 +553,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
+ /*
+ * Overflow of total_len is not normally possible, considering that
+ * this value will be at most 33 RegBufDatas (at UINT16_MAX each)
+ * plus one MaxXLogRecordSize, which together are still an order of
+ * magnitude smaller than UINT32_MAX.
+ */
uint32 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
@@ -527,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecord *rechdr;
char *scratch = hdr_scratch;
+ /* ensure that any assembled record can be decoded */
+ Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));
+
/*
* Note: this function can be called multiple times for the same record.
* All the modifications we do to the rdata chains below must handle that.
@@ -756,12 +799,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
if (needs_data)
{
+ /*
+ * When copying to XLogRecordBlockHeader, the length is narrowed
+ * to an uint16. We double-check that that is still correct.
+ */
+ Assert(regbuf->rdata_len <= UINT16_MAX);
+
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
- bkpb.data_length = regbuf->rdata_len;
+ bkpb.data_length = (uint16) regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure we will
+ * not emit records larger than those sizes we advertise we support.
+ */
+ if (!XLogRecordLengthIsValid(total_len))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index c04f77b173..aed4643d1c 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
ForkNumber forknum, BlockNumber blknum, char *page,
uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
extern void XLogResetInsertion(void);
extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 835151ec92..065bc63a79 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -54,6 +54,19 @@ typedef struct XLogRecord
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
+ * 4MB -1b of allocation overhead in anything that allocates xlog record
+ * data, which should be enough for effectively all purposes.
+ */
+#define MaxXLogRecordSize (1020 * 1024 * 1024)
+
+#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)
+
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
--
2.30.2
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
Why? The record has not been inserted yet. I would tend to keep only
the check at the bottom of XLogRecordAssemble(), for simplicity, and
call it a day.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
I am not really convinced that this one is worth doing.
+#define MaxXLogRecordSize (1020 * 1024 * 1024)
+
+#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)
These are used only in xloginsert.c, so we could keep them isolated.
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.
For now, I have extracted from the patch the two API changes and the
checks for the block information for uint16, and applied this part.
That's one less thing to worry about.
--
Michael
On Wed, 27 Jul 2022 at 11:09, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.Why? The record has not been inserted yet. I would tend to keep only
the check at the bottom of XLogRecordAssemble(), for simplicity, and
call it a day.
Because the sum value main_rdatalen can easily overflow in both the
current and the previous APIs, which then corrupts the WAL - one of
the two issues that I mentioned when I started the thread.
We don't re-summarize the lengths of all XLogRecData segments for the
main record data when assembling a record to keep the performance of
RecordAssemble (probably to limit the complexity when many data
segments are registered), and because I didn't want to add more
changes than necessary this check will need to be done in the place
where the overflow may occur, which is in XLogRegisterData.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)I am not really convinced that this one is worth doing.
I'm not married to that change, but I also don't see why this can't be
updated while this code is already being touched.
+#define MaxXLogRecordSize (1020 * 1024 * 1024) + +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)These are used only in xloginsert.c, so we could keep them isolated.
They might be only used in xloginsert right now, but that's not the
point. This is now advertised as part of the record API spec: A record
larger than 1020MB is explicitly not supported. If it was kept
internal to xloginsert, that would be implicit and other people might
start hitting issues similar to those we're hitting right now -
records that are too large to read. Although PostgreSQL is usually the
only one generating WAL, we do support physical replication from
arbitrary PG-compatible WAL streams, which means that any compatible
WAL source could be the origin of our changes - and those need to be
aware of the assumptions we make about the WAL format.
I'm fine with also updating xlogreader.c to check this while reading
records to clarify the limits there as well, if so desired.
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.
Will be included in the next update..
For now, I have extracted from the patch the two API changes and the
checks for the block information for uint16, and applied this part.
That's one less thing to worry about.
Thanks.
Kind regards,
Matthias van de Meent
Hi Matthias,
On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:
My apologies for the time it took me to come back to this thread.
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.Will be included in the next update..
v8 fails to apply. Could you send a rebased version?
As far as I recall the problems with the block image sizes are solved,
but we still have a bit more to do in terms of the overall record
size. Perhaps there are some parts of the patch you'd like to
revisit?
For now, I have switched the back as waiting on author, and moved it
to the next CF.
--
Michael
2022年10月5日(水) 16:46 Michael Paquier <michael@paquier.xyz>:
Hi Matthias,
On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:
My apologies for the time it took me to come back to this thread.
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.Will be included in the next update..
v8 fails to apply. Could you send a rebased version?
As far as I recall the problems with the block image sizes are solved,
but we still have a bit more to do in terms of the overall record
size. Perhaps there are some parts of the patch you'd like to
revisit?For now, I have switched the back as waiting on author, and moved it
to the next CF.
Hi Matthias
CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.
Thanks
Ian Barwick
On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.
No replies after 4 weeks, so I have marked this entry as returned
with feedback. I am still wondering what would be the best thing to
do here..
--
Michael
Hi,
On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.No replies after 4 weeks, so I have marked this entry as returned
with feedback. I am still wondering what would be the best thing to
do here..
IMO this a bugfix, I don't think we can just close the entry, even if Matthias
doesn't have time / energy to push it forward.
I think the big issue with the patch as it stands is that it will typically
cause PANICs on failure, because the record-too-large ERROR be a in a critical
section. That's still better than generating a record that can't be replayed,
but it's not good.
There's not all that many places with potentially huge records. I wonder if we
ought to modify at least the most prominent ones to prepare the record before
the critical section. I think the by far most prominent real-world case is
RecordTransactionCommit(). I think we could rename XactLogCommitRecord() to
XactBuildCommitRecord() build the commit record, then have the caller do
START_CRIT_SECTION(), set DELAY_CHKPT_START, and only then do the
XLogInsert().
That'd even have the nice side-effect of reducing the window in which
DELAY_CHKPT_START is set a bit.
Greetings,
Andres Freund
Hi,
On 2022-07-26 18:58:02 +0200, Matthias van de Meent wrote:
- updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
macros (now in xlogrecord.h), with a max length of the somewhat
arbitrary 1020MiB.
This leaves room for approx. 4MiB of per-record allocation overhead
before you'd hit MaxAllocSize, and also detaches the dependency on
memutils.h.- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
I don't think it should be a static inline. It should to be a *non* inlined
function, so we don't include the code for the elog in the callers.
+/* + * Error due to exceeding the maximum size of a WAL record, or registering + * more datas than are being accounted for by the XLog infrastructure. + */ +static inline void +XLogErrorDataLimitExceeded() +{ + elog(ERROR, "too much WAL data"); +}
I think this should be pg_noinline, as mentioned above.
/* * Begin constructing a WAL record. This must be called before the * XLogRegister* functions and XLogInsert(). @@ -348,14 +359,29 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum, * XLogRecGetData(). */ void -XLogRegisterData(char *data, int len) +XLogRegisterData(char *data, uint32 len) { XLogRecData *rdata;- Assert(begininsert_called); + Assert(begininsert_called && XLogRecordLengthIsValid(len)); + + /* + * Check against max_rdatas; and ensure we don't fill a record with + * more data than can be replayed. Records are allocated in one chunk + * with some overhead, so ensure XLogRecordLengthIsValid() for that + * size of record. + * + * Additionally, check that we don't accidentally overflow the + * intermediate sum value on 32-bit systems by ensuring that the + * sum of the two inputs is no less than one of the inputs. + */ + if (num_rdatas >= max_rdatas || +#if SIZEOF_SIZE_T == 4 + mainrdata_len + len < len || +#endif + !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len)) + XLogErrorDataLimitExceeded();
This is quite a complicated check, and the SIZEOF_SIZE_T == 4 bit is fairly
ugly.
I think we should make mainrdata_len a uint64, then we don't have to worry
about it overflowing on 32bit systems. And TBH, we don't care about some minor
inefficiency on 32bit systems.
@@ -399,8 +425,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);- if (num_rdatas >= max_rdatas) - elog(ERROR, "too much WAL data"); + /* + * Check against max_rdatas; and ensure we don't register more data per + * buffer than can be handled by the physical data format; + * i.e. that regbuf->rdata_len does not grow beyond what + * XLogRecordBlockHeader->data_length can hold. + */ + if (num_rdatas >= max_rdatas || + regbuf->rdata_len + len > UINT16_MAX) + XLogErrorDataLimitExceeded(); + rdata = &rdatas[num_rdatas++];rdata->data = data;
This partially has been applied in ffd1b6bb6f8, I think we should consider
adding XLogErrorDataLimitExceeded() separately too.
rdt_datas_last->next = regbuf->rdata_head;
@@ -858,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);+ /* + * Ensure that the XLogRecord is not too large. + * + * XLogReader machinery is only able to handle records up to a certain + * size (ignoring machine resource limitations), so make sure we will + * not emit records larger than those sizes we advertise we support. + */ + if (!XLogRecordLengthIsValid(total_len)) + XLogErrorDataLimitExceeded(); + /* * Fill in the fields in the record header. Prev-link is filled in later, * once we know where in the WAL the record will be inserted. The CRC does
I think this needs to mention that it'll typically cause a PANIC.
Greetings,
Andres Freund
On 2022-Dec-02, Andres Freund wrote:
Hi,
On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.No replies after 4 weeks, so I have marked this entry as returned
with feedback. I am still wondering what would be the best thing to
do here..IMO this a bugfix, I don't think we can just close the entry, even if Matthias
doesn't have time / energy to push it forward.
I have created one in the January commitfest,
https://commitfest.postgresql.org/41/
and rebased the patch on current master. (I have not reviewed this.)
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"
Attachments:
v9-0001-Add-protections-in-xlog-record-APIs-against-large.patchtext/x-diff; charset=us-asciiDownload
From 385ccf1b7d68923009c73db493dbe74be34e305b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 19 Dec 2022 12:26:52 +0100
Subject: [PATCH v9] Add protections in xlog record APIs against large numbers
and overflows.
Before this, it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
Emitting invalid records was also possible through pg_logical_emit_message(),
which allowed you to emit arbitrary amounts of data up to 2GB, much higher
than the replay limit of approximately 1GB.
This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read any
max-sized XLogRecords, such that the wal-generating backend will fail when
it attempts to insert unreadable records, instead of that insertion
succeeding but breaking any replication streams.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
---
src/backend/access/transam/xloginsert.c | 59 ++++++++++++++++++++++---
src/include/access/xlogrecord.h | 13 ++++++
2 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f811523448..c25431f0e9 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,6 +141,17 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
bool *topxid_included);
static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+ elog(ERROR, "too much WAL data");
+}
/*
* Begin constructing a WAL record. This must be called before the
@@ -352,10 +363,25 @@ XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && XLogRecordLengthIsValid(len));
+
+ /*
+ * Check against max_rdatas; and ensure we don't fill a record with
+ * more data than can be replayed. Records are allocated in one chunk
+ * with some overhead, so ensure XLogRecordLengthIsValid() for that
+ * size of record.
+ *
+ * Additionally, check that we don't accidentally overflow the
+ * intermediate sum value on 32-bit systems by ensuring that the
+ * sum of the two inputs is no less than one of the inputs.
+ */
+ if (num_rdatas >= max_rdatas ||
+#if SIZEOF_SIZE_T == 4
+ mainrdata_len + len < len ||
+#endif
+ !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
+ XLogErrorDataLimitExceeded();
- if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -391,7 +417,7 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
registered_buffer *regbuf;
XLogRecData *rdata;
- Assert(begininsert_called);
+ Assert(begininsert_called && len <= UINT16_MAX);
/* find the registered buffer struct */
regbuf = ®istered_buffers[block_id];
@@ -400,14 +426,14 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
block_id);
/*
- * Check against max_rdatas and ensure we do not register more data per
+ * Check against max_rdatas; and ensure we don't register more data per
* buffer than can be handled by the physical data format; i.e. that
* regbuf->rdata_len does not grow beyond what
* XLogRecordBlockHeader->data_length can hold.
*/
if (num_rdatas >= max_rdatas ||
regbuf->rdata_len + len > UINT16_MAX)
- elog(ERROR, "too much WAL data");
+ XLogErrorDataLimitExceeded();
rdata = &rdatas[num_rdatas++];
@@ -527,6 +553,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
+ /*
+ * Overflow of total_len is not normally possible, considering that
+ * this value will be at most 33 RegBufDatas (at UINT16_MAX each)
+ * plus one MaxXLogRecordSize, which together are still an order of
+ * magnitude smaller than UINT32_MAX.
+ */
uint32 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
@@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecord *rechdr;
char *scratch = hdr_scratch;
+ /* ensure that any assembled record can be decoded */
+ Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));
+
/*
* Note: this function can be called multiple times for the same record.
* All the modifications we do to the rdata chains below must handle that.
@@ -766,7 +801,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
{
/*
* When copying to XLogRecordBlockHeader, the length is narrowed
- * to an uint16. Double-check that it is still correct.
+ * to an uint16. We double-check that that is still correct.
*/
Assert(regbuf->rdata_len <= UINT16_MAX);
@@ -872,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure we will
+ * not emit records larger than those sizes we advertise we support.
+ */
+ if (!XLogRecordLengthIsValid(total_len))
+ XLogErrorDataLimitExceeded();
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 835151ec92..065bc63a79 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -54,6 +54,19 @@ typedef struct XLogRecord
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
+ * 4MB -1b of allocation overhead in anything that allocates xlog record
+ * data, which should be enough for effectively all purposes.
+ */
+#define MaxXLogRecordSize (1020 * 1024 * 1024)
+
+#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)
+
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
--
2.30.2
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
I have created one in the January commitfest,
https://commitfest.postgresql.org/41/
and rebased the patch on current master. (I have not reviewed this.)
I have spent some time on that, and here are some comments with an
updated version of the patch attached.
The checks in XLogRegisterData() seemed overcomplicated to me. In
this context, I think that we should just care about making sure that
mainrdata_len does not overflow depending on the length given by the
caller, which is where pg_add_u32_overflow() becomes handy.
XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
we already check for overflow a couple of lines down. This is not
necessary, it seems.
@@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecord *rechdr;
char *scratch = hdr_scratch;
+ /* ensure that any assembled record can be decoded */
+ Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));
A hardcoded check like that has no need to be in a code path triggered
each time a WAL record is assembled. One place where this could be is
InitXLogInsert(). It still means that it is called one time for each
backend, but seeing it where the initialization of xloginsert.c feels
natural, at least. A postmaster location would be enough, as well.
XLogRecordMaxSize just needs to be checked once IMO, around the end of
XLogRecordAssemble() once we know the total size of the record that
will be fed to a XLogReader. One thing that we should be more careful
of is to make sure that total_len does not overflow its uint32 value
while assembling the record, as well.
I have removed XLogErrorDataLimitExceeded(), replacing it with more
context about the errors happening. Perhaps this has no need to be
that much verbose, but it can be really useful for developers.
Some comments had no need to be updated, and there were some typos.
I am on board with the idea of a XLogRecordMaxSize that's bounded at
1020MB, leaving 4MB as room for the extra data needed by a
XLogReader.
At the end, I think that this is quite interesting long-term. For
example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
adding buffer data or main data separately.
Thoughts about this version?
--
Michael
Attachments:
v10-0001-Add-protections-in-xlog-record-APIs-against-over.patchtext/x-diff; charset=us-asciiDownload
From 16b40324a5bc5bbe6e06cbfb86251c907f400151 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 28 Mar 2023 20:34:31 +0900
Subject: [PATCH v10] Add protections in xlog record APIs against overflows
Before this, it was possible for an extension to create malicious WAL
records that were too large to replay; or that would overflow the
xl_tot_len field, causing potential corruption in WAL record IO ops.
Emitting invalid records was also possible through
pg_logical_emit_message(), which allowed you to emit arbitrary amounts
of data up to 2GB, much higher than the replay limit of approximately
1GB.
This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read
any max-sized XLogRecords, such that the wal-generating backend will
fail when it attempts to insert unreadable records, instead of that
insertion succeeding but breaking any replication streams.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
---
src/include/access/xlogrecord.h | 11 ++++
src/backend/access/transam/xloginsert.c | 73 +++++++++++++++++++++----
2 files changed, 74 insertions(+), 10 deletions(-)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0d576f7883..0a7b0bb6fc 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -54,6 +54,17 @@ typedef struct XLogRecord
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4M of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize (1020 * 1024 * 1024)
+
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 008612e032..86fa6a5276 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -33,6 +33,7 @@
#include "access/xloginsert.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -351,11 +352,13 @@ void
XLogRegisterData(char *data, uint32 len)
{
XLogRecData *rdata;
+ uint32 result = 0;
Assert(begininsert_called);
if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ elog(ERROR, "too much WAL data in main chunk: num_rdatas %d, max_rdatas %d",
+ num_rdatas, max_rdatas);
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -369,7 +372,11 @@ XLogRegisterData(char *data, uint32 len)
mainrdata_last->next = rdata;
mainrdata_last = rdata;
- mainrdata_len += len;
+ if (pg_add_u32_overflow(mainrdata_len, len, &result))
+ elog(ERROR, "too much WAL data in main chunk: mainrdata_len %u, len %u",
+ mainrdata_len, len);
+
+ mainrdata_len = result;
}
/*
@@ -390,6 +397,7 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
+ uint32 result = 0;
Assert(begininsert_called);
@@ -405,9 +413,9 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
* regbuf->rdata_len does not grow beyond what
* XLogRecordBlockHeader->data_length can hold.
*/
- if (num_rdatas >= max_rdatas ||
- regbuf->rdata_len + len > UINT16_MAX)
- elog(ERROR, "too much WAL data");
+ if (num_rdatas >= max_rdatas)
+ elog(ERROR, "too much WAL data specific to block %u: num_rdatas %u, max_rdatas %u",
+ block_id, num_rdatas, max_rdatas);
rdata = &rdatas[num_rdatas++];
@@ -416,7 +424,12 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
regbuf->rdata_tail->next = rdata;
regbuf->rdata_tail = rdata;
- regbuf->rdata_len += len;
+
+ if (pg_add_u32_overflow(regbuf->rdata_len, len, &result) ||
+ regbuf->rdata_len + len > UINT16_MAX)
+ elog(ERROR, "too much WAL data: rdata_len %u, len %u",
+ regbuf->rdata_len, len);
+ regbuf->rdata_len = result;
}
/*
@@ -528,6 +541,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
{
XLogRecData *rdt;
uint32 total_len = 0;
+ uint32 result = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -759,7 +773,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
}
}
- total_len += bimg.length;
+ if (pg_add_u32_overflow(total_len, (uint32) bimg.length, &result))
+ elog(ERROR, "too much WAL data: block_id %u, block length %u, total_len %u",
+ block_id, bimg.length, total_len);
+ total_len = result;
}
if (needs_data)
@@ -776,7 +793,11 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
bkpb.data_length = (uint16) regbuf->rdata_len;
- total_len += regbuf->rdata_len;
+
+ if (pg_add_u32_overflow(total_len, regbuf->rdata_len, &result))
+ elog(ERROR, "too much WAL data: total_len %u, rdata_len %u",
+ total_len, regbuf->rdata_len);
+ total_len = result;
rdt_datas_last->next = regbuf->rdata_head;
rdt_datas_last = regbuf->rdata_tail;
@@ -852,12 +873,19 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
}
rdt_datas_last->next = mainrdata_head;
rdt_datas_last = mainrdata_last;
- total_len += mainrdata_len;
+
+ if (pg_add_u32_overflow(total_len, mainrdata_len, &result))
+ elog(ERROR, "too much WAL data: total_len %u, mainrdata_len %u",
+ total_len, mainrdata_len);
+ total_len = result;
}
rdt_datas_last->next = NULL;
hdr_rdt.len = (scratch - hdr_scratch);
- total_len += hdr_rdt.len;
+ if (pg_add_u32_overflow(total_len, hdr_rdt.len, &result))
+ elog(ERROR, "too much WAL data: total_len %u, hdr_rdt.len %u",
+ total_len, hdr_rdt.len);
+ total_len = result;
/*
* Calculate CRC of the data
@@ -872,6 +900,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure we will
+ * not emit records larger than those sizes we advertise we support.
+ * This cap is based on DecodeXLogRecordRequiredSpace().
+ */
+ if (total_len >= XLogRecordMaxSize)
+ elog(ERROR, "too much WAL data: total_len %u, max %u",
+ total_len, XLogRecordMaxSize);
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
@@ -1297,6 +1337,19 @@ log_newpage_range(Relation rel, ForkNumber forknum,
void
InitXLogInsert(void)
{
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Check that any records assembled can be decoded. This is capped
+ * based on what XLogReader would require at its maximum bound.
+ * This code path is called once per backend, more than enough for
+ * this check.
+ */
+ size_t max_required = DecodeXLogRecordRequiredSpace(XLogRecordMaxSize);
+
+ Assert(AllocSizeIsValid(max_required));
+#endif
+
/* Initialize the working areas */
if (xloginsert_cxt == NULL)
{
--
2.40.0
On Tue, 28 Mar 2023 at 13:42, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
I have created one in the January commitfest,
https://commitfest.postgresql.org/41/
and rebased the patch on current master. (I have not reviewed this.)I have spent some time on that, and here are some comments with an
updated version of the patch attached.The checks in XLogRegisterData() seemed overcomplicated to me. In
this context, I think that we should just care about making sure that
mainrdata_len does not overflow depending on the length given by the
caller, which is where pg_add_u32_overflow() becomes handy.XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
we already check for overflow a couple of lines down. This is not
necessary, it seems.@@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecord *rechdr;
char *scratch = hdr_scratch;+ /* ensure that any assembled record can be decoded */ + Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));A hardcoded check like that has no need to be in a code path triggered
each time a WAL record is assembled. One place where this could be is
InitXLogInsert(). It still means that it is called one time for each
backend, but seeing it where the initialization of xloginsert.c feels
natural, at least. A postmaster location would be enough, as well.XLogRecordMaxSize just needs to be checked once IMO, around the end of
XLogRecordAssemble() once we know the total size of the record that
will be fed to a XLogReader. One thing that we should be more careful
of is to make sure that total_len does not overflow its uint32 value
while assembling the record, as well.I have removed XLogErrorDataLimitExceeded(), replacing it with more
context about the errors happening. Perhaps this has no need to be
that much verbose, but it can be really useful for developers.Some comments had no need to be updated, and there were some typos.
I am on board with the idea of a XLogRecordMaxSize that's bounded at
1020MB, leaving 4MB as room for the extra data needed by a
XLogReader.At the end, I think that this is quite interesting long-term. For
example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
adding buffer data or main data separately.Thoughts about this version?
I thought that the plan was to use int64 to skip checking for most
overflows and to do a single check at the end in XLogRecordAssemble,
so that the checking has minimal overhead in the performance-critical
log record assembling path and reduced load on the branch predictor.
One more issue that Andres was suggesting we'd fix was to allow XLog
assembly separate from the actual XLog insertion:
Currently you can't pre-assemble a record outside a critical section
if the record must be inserted in a critical section, which makes e.g.
commit records problematic due to the potentially oversized data
resulting in ERRORs during record assembly. This would crash postgres
because commit xlog insertion happens in a critical section. Having a
pre-assembled record would greatly improve the ergonomics in that path
and reduce the length of the critical path.
I think it was something along the lines of the attached; 0001
contains separated Commit/Abort record construction and insertion like
Andres suggested, 0002 does the size checks with updated error
messages.
Kind regards,
Matthias van de Meent
Attachments:
v11-0001-Create-a-path-for-separate-xlog-record-construct.patchapplication/octet-stream; name=v11-0001-Create-a-path-for-separate-xlog-record-construct.patchDownload
From 180b6076ecb00cdc2c9b9a7ca1bdd1c7e21f7ee2 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 30 Mar 2023 20:02:16 +0200
Subject: [PATCH v11 1/2] Create a path for separate xlog record construction
and insertion
The current record insertion routines are likely to be called inside a
transaction, which means they shouldn't throw errors unless there is
something catastrophically wrong, as it'd cause the server to reboot.
In some cases that is the right thing to do; however, not all WAL records
are created equal, and some cases failing to build the expected WAL
record could easily be resolved by rolling back the transaction.
This prepares the code for XLog record size validation, where the
XLogRecordAssemble() code may throw errors if WAL record size
constraints are violated.
---
src/backend/access/transam/multixact.c | 5 +
src/backend/access/transam/twophase.c | 58 +++--
src/backend/access/transam/xact.c | 298 +++++++++++++-----------
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xloginsert.c | 122 ++++++++++
src/backend/commands/sequence.c | 18 ++
src/include/access/xact.h | 64 +++--
src/include/access/xlog.h | 1 +
src/include/access/xloginsert.h | 3 +
src/include/access/xlogrecord.h | 11 +
10 files changed, 408 insertions(+), 174 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index fe6698d5ff..78d6563df8 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -832,6 +832,11 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
* find a more compact representation of this Xlog record -- perhaps all
* the status flags in one XLogRecData, then all the xids in another one?
* Not clear that it's worth the trouble though.
+ *
+ * XXX: large values for nmembers may mean the data would not fit in the
+ * max xlog record size, in which case this would panic. However, that is
+ * quite unlikely as such record would log more than 250M members - much
+ * more than we can currently process efficiently.
*/
XLogBeginInsert();
XLogRegisterData((char *) (&xlrec), SizeOfMultiXactCreate);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 068e59bec0..f98d79aa29 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1181,18 +1181,24 @@ EndPrepare(GlobalTransaction gxact)
*/
XLogEnsureRecordSpace(0, records.num_chunks);
- START_CRIT_SECTION();
-
- Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
- MyProc->delayChkptFlags |= DELAY_CHKPT_START;
-
+ /*
+ * Start preparing the record for insertion. Do so outside the critical
+ * section to make sure we won't throw errors due to oversized registered
+ * data.
+ */
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+ XLogPrepareInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ START_CRIT_SECTION();
+
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ MyProc->delayChkptFlags |= DELAY_CHKPT_START;
- gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+ gxact->prepare_end_lsn = XLogInsertPrepared();
if (replorigin)
{
@@ -2294,6 +2300,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
bool replorigin;
+ xl_xact_commit_fields recdata;
/*
* Are we using the replication origins feature? Or, in other words, are
@@ -2302,6 +2309,22 @@ RecordTransactionCommitPrepared(TransactionId xid,
replorigin = (replorigin_session_origin != InvalidRepOriginId &&
replorigin_session_origin != DoNotReplicateId);
+ /*
+ * We can't assume that the data we're trying to fit in the WAL record is
+ * going to fit in our XLog record format, due to the arbitrary data we're
+ * registering. Because xlog record creation will throw an ERROR, we should
+ * create the record outside the critical section: we don't want to crash
+ * the backend because of oversized xlog records - we "just" want to error
+ * out and revert the transaction.
+ */
+ XactPrepareCommitRecord(committs,
+ nchildren, children, nrels, rels,
+ nstats, stats,
+ ninvalmsgs, invalmsgs,
+ initfileinval,
+ MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
+ xid, gid, &recdata);
+
START_CRIT_SECTION();
/* See notes in RecordTransactionCommit */
@@ -2313,13 +2336,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
* potentially having AccessExclusiveLocks since we don't know whether or
* not they do.
*/
- recptr = XactLogCommitRecord(committs,
- nchildren, children, nrels, rels,
- nstats, stats,
- ninvalmsgs, invalmsgs,
- initfileinval,
- MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid, gid);
+ recptr = XactLogCommitRecord();
if (replorigin)
@@ -2388,6 +2405,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
{
XLogRecPtr recptr;
bool replorigin;
+ xl_xact_abort_fields recdata;
/*
* Are we using the replication origins feature? Or, in other words, are
@@ -2404,6 +2422,13 @@ RecordTransactionAbortPrepared(TransactionId xid,
elog(PANIC, "cannot abort transaction %u, it was already committed",
xid);
+ XactPrepareAbortRecord(GetCurrentTimestamp(),
+ nchildren, children,
+ nrels, rels,
+ nstats, stats,
+ MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
+ xid, gid, &recdata);
+
START_CRIT_SECTION();
/*
@@ -2411,12 +2436,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
* potentially having AccessExclusiveLocks since we don't know whether or
* not they do.
*/
- recptr = XactLogAbortRecord(GetCurrentTimestamp(),
- nchildren, children,
- nrels, rels,
- nstats, stats,
- MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid, gid);
+ recptr = XactLogAbortRecord();
if (replorigin)
/* Move LSNs forward for this replication origin */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 01b1e0fb8c..79c0448889 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -247,6 +247,10 @@ static TransactionStateData TopTransactionStateData = {
/*
* unreportedXids holds XIDs of all subtransactions that have not yet been
* reported in an XLOG_XACT_ASSIGNMENT record.
+ *
+ * Increasing the size of the unreportedXids cache should be limited to
+ * a reasonable degree, as these xids may need to be fit into a WAL-record,
+ * which have a limited size.
*/
static int nUnreportedXids;
static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -1365,6 +1369,7 @@ RecordTransactionCommit(void)
else
{
bool replorigin;
+ xl_xact_commit_fields recdata;
/*
* Are we using the replication origins feature? Or, in other words,
@@ -1376,7 +1381,14 @@ RecordTransactionCommit(void)
/*
* Begin commit critical section and insert the commit XLOG record.
*/
-
+ XactPrepareCommitRecord(GetCurrentTransactionStopTimestamp(),
+ nchildren, children, nrels, rels,
+ ndroppedstats, droppedstats,
+ nmsgs, invalMessages,
+ RelcacheInitFileInval,
+ MyXactFlags,
+ InvalidTransactionId, NULL /* plain commit */,
+ &recdata);
/*
* Mark ourselves as within our "commit critical section". This
* forces any concurrent checkpoint to wait until we've updated
@@ -1398,13 +1410,7 @@ RecordTransactionCommit(void)
START_CRIT_SECTION();
MyProc->delayChkptFlags |= DELAY_CHKPT_START;
- XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
- nchildren, children, nrels, rels,
- ndroppedstats, droppedstats,
- nmsgs, invalMessages,
- RelcacheInitFileInval,
- MyXactFlags,
- InvalidTransactionId, NULL /* plain commit */ );
+ XactLogCommitRecord();
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1712,6 +1718,7 @@ RecordTransactionAbort(bool isSubXact)
TransactionId *children;
TimestampTz xact_time;
bool replorigin;
+ xl_xact_abort_fields recdata;
/*
* If we haven't been assigned an XID, nobody will care whether we aborted
@@ -1754,10 +1761,6 @@ RecordTransactionAbort(bool isSubXact)
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(false, &droppedstats);
- /* XXX do we really need a critical section here? */
- START_CRIT_SECTION();
-
- /* Write the ABORT record */
if (isSubXact)
xact_time = GetCurrentTimestamp();
else
@@ -1765,12 +1768,18 @@ RecordTransactionAbort(bool isSubXact)
xact_time = GetCurrentTransactionStopTimestamp();
}
- XactLogAbortRecord(xact_time,
- nchildren, children,
- nrels, rels,
- ndroppedstats, droppedstats,
- MyXactFlags, InvalidTransactionId,
- NULL);
+ XactPrepareAbortRecord(xact_time,
+ nchildren, children,
+ nrels, rels,
+ ndroppedstats, droppedstats,
+ MyXactFlags, InvalidTransactionId,
+ NULL, &recdata);
+
+ /* XXX do we really need a critical section here? */
+ START_CRIT_SECTION();
+
+ /* Write the ABORT record */
+ XactLogAbortRecord();
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -5628,37 +5637,34 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
- * Log the commit record for a plain or twophase transaction commit.
+ * Prepare the commit record for a plain or twophase transaction commit.
*
* A 2pc commit will be emitted when twophase_xid is valid, a plain one
* otherwise.
+ *
+ * This is separated from writing the record to WAL due to the arbitrary
+ * size of the record potentially resulting in an ERROR whilst constructing
+ * the record, which (when called from critical sections) would panic.
+ * This separation allows us to prepare the insert outside critical sections,
+ * preventing potential crashes.
*/
-XLogRecPtr
-XactLogCommitRecord(TimestampTz commit_time,
- int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
- int ndroppedstats, xl_xact_stats_item *droppedstats,
- int nmsgs, SharedInvalidationMessage *msgs,
- bool relcacheInval,
- int xactflags, TransactionId twophase_xid,
- const char *twophase_gid)
+void
+XactPrepareCommitRecord(TimestampTz commit_time,
+ int nsubxacts, TransactionId *subxacts,
+ int nrels, RelFileLocator *rels,
+ int ndroppedstats, xl_xact_stats_item *droppedstats,
+ int nmsgs, SharedInvalidationMessage *msgs,
+ bool relcacheInval,
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid,
+ xl_xact_commit_fields *recdata)
{
- xl_xact_commit xlrec;
- xl_xact_xinfo xl_xinfo;
- xl_xact_dbinfo xl_dbinfo;
- xl_xact_subxacts xl_subxacts;
- xl_xact_relfilelocators xl_relfilelocators;
- xl_xact_stats_items xl_dropped_stats;
- xl_xact_invals xl_invals;
- xl_xact_twophase xl_twophase;
- xl_xact_origin xl_origin;
uint8 info;
- Assert(CritSectionCount > 0);
+ Assert(CritSectionCount == 0);
- xl_xinfo.xinfo = 0;
+ recdata->xl_xinfo.xinfo = 0;
/* decide between a plain and 2pc commit */
if (!TransactionIdIsValid(twophase_xid))
@@ -5668,21 +5674,21 @@ XactLogCommitRecord(TimestampTz commit_time,
/* First figure out and collect all the information needed */
- xlrec.xact_time = commit_time;
+ recdata->xlrec.xact_time = commit_time;
if (relcacheInval)
- xl_xinfo.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE;
+ recdata->xl_xinfo.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE;
if (forceSyncCommit)
- xl_xinfo.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT;
+ recdata->xl_xinfo.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT;
if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK))
- xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
/*
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
- xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
+ recdata->xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
* Relcache invalidations requires information about the current database
@@ -5690,145 +5696,154 @@ XactLogCommitRecord(TimestampTz commit_time,
*/
if (nmsgs > 0 || XLogLogicalInfoActive())
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
- xl_dbinfo.dbId = MyDatabaseId;
- xl_dbinfo.tsId = MyDatabaseTableSpace;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ recdata->xl_dbinfo.dbId = MyDatabaseId;
+ recdata->xl_dbinfo.tsId = MyDatabaseTableSpace;
}
if (nsubxacts > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_SUBXACTS;
- xl_subxacts.nsubxacts = nsubxacts;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_SUBXACTS;
+ recdata->xl_subxacts.nsubxacts = nsubxacts;
}
if (nrels > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
- xl_relfilelocators.nrels = nrels;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
+ recdata->xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
}
if (ndroppedstats > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_DROPPED_STATS;
- xl_dropped_stats.nitems = ndroppedstats;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_DROPPED_STATS;
+ recdata->xl_dropped_stats.nitems = ndroppedstats;
}
if (nmsgs > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_INVALS;
- xl_invals.nmsgs = nmsgs;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_INVALS;
+ recdata->xl_invals.nmsgs = nmsgs;
}
if (TransactionIdIsValid(twophase_xid))
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
- xl_twophase.xid = twophase_xid;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
+ recdata->xl_twophase.xid = twophase_xid;
Assert(twophase_gid != NULL);
if (XLogLogicalInfoActive())
- xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
}
/* dump transaction origin information */
if (replorigin_session_origin != InvalidRepOriginId)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
- xl_origin.origin_lsn = replorigin_session_origin_lsn;
- xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
+ recdata->xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ recdata->xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
- if (xl_xinfo.xinfo != 0)
+ if (recdata->xl_xinfo.xinfo != 0)
info |= XLOG_XACT_HAS_INFO;
/* Then include all the collected data into the commit record. */
XLogBeginInsert();
- XLogRegisterData((char *) (&xlrec), sizeof(xl_xact_commit));
+ XLogRegisterData((char *) (&recdata->xlrec), sizeof(xl_xact_commit));
- if (xl_xinfo.xinfo != 0)
- XLogRegisterData((char *) (&xl_xinfo.xinfo), sizeof(xl_xinfo.xinfo));
+ if (recdata->xl_xinfo.xinfo != 0)
+ XLogRegisterData((char *) (&recdata->xl_xinfo.xinfo), sizeof(recdata->xl_xinfo.xinfo));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
- XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&recdata->xl_dbinfo), sizeof(recdata->xl_dbinfo));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
- XLogRegisterData((char *) (&xl_subxacts),
+ XLogRegisterData((char *) (&recdata->xl_subxacts),
MinSizeOfXactSubxacts);
XLogRegisterData((char *) subxacts,
nsubxacts * sizeof(TransactionId));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILELOCATORS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILELOCATORS)
{
- XLogRegisterData((char *) (&xl_relfilelocators),
+ XLogRegisterData((char *) (&recdata->xl_relfilelocators),
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
{
- XLogRegisterData((char *) (&xl_dropped_stats),
+ XLogRegisterData((char *) (&recdata->xl_dropped_stats),
MinSizeOfXactStatsItems);
XLogRegisterData((char *) droppedstats,
ndroppedstats * sizeof(xl_xact_stats_item));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_INVALS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_INVALS)
{
- XLogRegisterData((char *) (&xl_invals), MinSizeOfXactInvals);
+ XLogRegisterData((char *) (&recdata->xl_invals), MinSizeOfXactInvals);
XLogRegisterData((char *) msgs,
nmsgs * sizeof(SharedInvalidationMessage));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
{
- XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) (&recdata->xl_twophase), sizeof(xl_xact_twophase));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
XLogRegisterData(unconstify(char *, twophase_gid), strlen(twophase_gid) + 1);
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
- XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&recdata->xl_origin), sizeof(xl_xact_origin));
/* we allow filtering by xacts */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ XLogPrepareInsert(RM_XACT_ID, info);
}
/*
- * Log the commit record for a plain or twophase transaction abort.
+ * Log the prepared XLogCommit record.
+ * See XactPrepareCommitRecord for more info.
+ */
+XLogRecPtr
+XactLogCommitRecord(void)
+{
+ Assert(CritSectionCount > 0);
+ return XLogInsertPrepared();
+}
+
+/*
+ * Prepare the commit record for a plain or twophase transaction abort.
*
* A 2pc abort will be emitted when twophase_xid is valid, a plain one
* otherwise.
+ *
+ * This is separated from writing the record to WAL due to the arbitrary
+ * size of the record potentially resulting in an ERROR whilst constructing
+ * the record, which (when called from critical sections) would panic.
+ * This separation allows us to prepare the insert outside critical sections,
+ * preventing potential crashes.
*/
-XLogRecPtr
-XactLogAbortRecord(TimestampTz abort_time,
- int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
- int ndroppedstats, xl_xact_stats_item *droppedstats,
- int xactflags, TransactionId twophase_xid,
- const char *twophase_gid)
+void
+XactPrepareAbortRecord(TimestampTz abort_time,
+ int nsubxacts, TransactionId *subxacts,
+ int nrels, RelFileLocator *rels,
+ int ndroppedstats, xl_xact_stats_item *droppedstats,
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid,
+ xl_xact_abort_fields *recdata)
{
- xl_xact_abort xlrec;
- xl_xact_xinfo xl_xinfo;
- xl_xact_subxacts xl_subxacts;
- xl_xact_relfilelocators xl_relfilelocators;
- xl_xact_stats_items xl_dropped_stats;
- xl_xact_twophase xl_twophase;
- xl_xact_dbinfo xl_dbinfo;
- xl_xact_origin xl_origin;
-
uint8 info;
- Assert(CritSectionCount > 0);
+ Assert(CritSectionCount == 0);
- xl_xinfo.xinfo = 0;
+ recdata->xl_xinfo.xinfo = 0;
/* decide between a plain and 2pc abort */
if (!TransactionIdIsValid(twophase_xid))
@@ -5839,45 +5854,45 @@ XactLogAbortRecord(TimestampTz abort_time,
/* First figure out and collect all the information needed */
- xlrec.xact_time = abort_time;
+ recdata->xlrec.xact_time = abort_time;
if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK))
- xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
if (nsubxacts > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_SUBXACTS;
- xl_subxacts.nsubxacts = nsubxacts;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_SUBXACTS;
+ recdata->xl_subxacts.nsubxacts = nsubxacts;
}
if (nrels > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
- xl_relfilelocators.nrels = nrels;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
+ recdata->xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
}
if (ndroppedstats > 0)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_DROPPED_STATS;
- xl_dropped_stats.nitems = ndroppedstats;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_DROPPED_STATS;
+ recdata->xl_dropped_stats.nitems = ndroppedstats;
}
if (TransactionIdIsValid(twophase_xid))
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
- xl_twophase.xid = twophase_xid;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
+ recdata->xl_twophase.xid = twophase_xid;
Assert(twophase_gid != NULL);
if (XLogLogicalInfoActive())
- xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
}
if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
- xl_dbinfo.dbId = MyDatabaseId;
- xl_dbinfo.tsId = MyDatabaseTableSpace;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ recdata->xl_dbinfo.dbId = MyDatabaseId;
+ recdata->xl_dbinfo.tsId = MyDatabaseTableSpace;
}
/*
@@ -5886,65 +5901,76 @@ XactLogAbortRecord(TimestampTz abort_time,
*/
if (replorigin_session_origin != InvalidRepOriginId)
{
- xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+ recdata->xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
- xl_origin.origin_lsn = replorigin_session_origin_lsn;
- xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
+ recdata->xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ recdata->xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
- if (xl_xinfo.xinfo != 0)
+ if (recdata->xl_xinfo.xinfo != 0)
info |= XLOG_XACT_HAS_INFO;
/* Then include all the collected data into the abort record. */
XLogBeginInsert();
- XLogRegisterData((char *) (&xlrec), MinSizeOfXactAbort);
+ XLogRegisterData((char *) (&recdata->xlrec), MinSizeOfXactAbort);
- if (xl_xinfo.xinfo != 0)
- XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (recdata->xl_xinfo.xinfo != 0)
+ XLogRegisterData((char *) (&recdata->xl_xinfo), sizeof(recdata->xl_xinfo));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
- XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&recdata->xl_dbinfo), sizeof(recdata->xl_dbinfo));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
- XLogRegisterData((char *) (&xl_subxacts),
+ XLogRegisterData((char *) (&recdata->xl_subxacts),
MinSizeOfXactSubxacts);
XLogRegisterData((char *) subxacts,
nsubxacts * sizeof(TransactionId));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILELOCATORS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILELOCATORS)
{
- XLogRegisterData((char *) (&xl_relfilelocators),
+ XLogRegisterData((char *) (&recdata->xl_relfilelocators),
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
{
- XLogRegisterData((char *) (&xl_dropped_stats),
+ XLogRegisterData((char *) (&recdata->xl_dropped_stats),
MinSizeOfXactStatsItems);
XLogRegisterData((char *) droppedstats,
ndroppedstats * sizeof(xl_xact_stats_item));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
{
- XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) (&recdata->xl_twophase), sizeof(xl_xact_twophase));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
XLogRegisterData(unconstify(char *, twophase_gid), strlen(twophase_gid) + 1);
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
- XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+ if (recdata->xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&recdata->xl_origin), sizeof(xl_xact_origin));
/* Include the replication origin */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ XLogPrepareInsert(RM_XACT_ID, info);
+}
+
+/*
+ * Log the prepared XLogAbort record.
+ * See XactPrepareAbortRecord for more info.
+ */
+XLogRecPtr
+XactLogAbortRecord(void)
+{
+ Assert(CritSectionCount > 0);
+ return XLogInsertPrepared();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..b283563f6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -821,7 +821,7 @@ XLogInsertRecord(XLogRecData *rdata,
doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
if (doPageWrites &&
- (!prevDoPageWrites ||
+ (((!prevDoPageWrites) && ((flags & XLOG_NOPAGEWRITE_USED) != 0)) ||
(fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
{
/*
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 008612e032..134cf32bb4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -132,6 +132,17 @@ static int max_rdatas; /* allocated size */
static bool begininsert_called = false;
+/*
+ * Used in split creation / logging of the xlog record, when the caller
+ * wants to protect against >maxsized xlog records errors while inserting the
+ * record in a critical section.
+ */
+static bool prepareinsert_called = false;
+static XLogRecData *prepared_record = NULL;
+static XLogRecPtr preprec_fpw_lsn = InvalidXLogRecPtr;
+static bool preprec_topxid_included;
+static int preprec_num_fpw;
+
/* Memory context to hold the registered buffer and data references. */
static MemoryContext xloginsert_cxt;
@@ -233,6 +244,8 @@ XLogResetInsertion(void)
mainrdata_last = (XLogRecData *) &mainrdata_head;
curinsert_flags = 0;
begininsert_called = false;
+ prepareinsert_called = false;
+ prepared_record = NULL;
}
/*
@@ -433,6 +446,8 @@ void
XLogSetRecordFlags(uint8 flags)
{
Assert(begininsert_called);
+ /* don't modify the flags after preparing the record for insertion */
+ Assert(!prepareinsert_called);
curinsert_flags |= flags;
}
@@ -506,6 +521,107 @@ XLogInsert(RmgrId rmid, uint8 info)
return EndPos;
}
+/*
+ * Prepare to insert an XLOG record having the specified RMID and info bytes,
+ * with the body of the record being the data and buffer references registered
+ * earlier with XLogRegister* calls.
+ *
+ * The user must later call XLogInsertPrepared() to do the actual insertion,
+ * but once this function returns the user can be assured that the xlog record
+ * itself can be written with the assumed info.
+ */
+void
+XLogPrepareInsert(RmgrId rmid, uint8 info)
+{
+ XLogRecPtr RedoRecPtr;
+ bool doPageWrites;
+
+ /* XLogBeginInsert() must have been called. */
+ if (!begininsert_called)
+ elog(ERROR, "XLogBeginInsert was not called");
+ /* ... and you can't prepare the record twice */
+ if (prepareinsert_called)
+ elog(ERROR, "Cannot call XLogPrepareInsert twice");
+
+ prepareinsert_called = true;
+
+ Assert(CritSectionCount == 0);
+ /*
+ * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+ * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
+ */
+ if ((info & ~(XLR_RMGR_INFO_MASK |
+ XLR_SPECIAL_REL_UPDATE |
+ XLR_CHECK_CONSISTENCY)) != 0)
+ elog(PANIC, "invalid xlog info mask %02X", info);
+
+ TRACE_POSTGRESQL_WAL_INSERT(rmid, info);
+
+ /*
+ * In bootstrap mode, we don't actually log anything but XLOG resources;
+ * return a phony record pointer.
+ */
+ if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
+ {
+ prepared_record = NULL;
+ return;
+ }
+
+ preprec_topxid_included = false;
+
+ /*
+ * Get values needed to decide whether to do full-page writes. Since
+ * we don't yet have an insertion lock, these could change under us,
+ * but XLogInsertRecord will recheck them once it has a lock.
+ */
+ GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
+
+ prepared_record = XLogRecordAssemble(rmid, info, RedoRecPtr,
+ doPageWrites, &preprec_fpw_lsn,
+ &preprec_num_fpw, &preprec_topxid_included);
+}
+
+/*
+ * XLogInsertPrepared - write the previously prepared record into WAL.
+ *
+ * Unlike XLogInsert(), this does not retry insertions, but instead returns
+ * InvalidXLogRecPtr when it fails to immediately insert the record (e.g. due
+ * to changes in the FPW settings between prepare and ).
+ */
+XLogRecPtr
+XLogInsertPrepared(void)
+{
+ XLogRecPtr EndPos;
+
+ /*
+ * XLogBeginInsert() must have been called, and a record must have been
+ * prepared.
+ */
+ if (!begininsert_called)
+ elog(ERROR, "XLogBeginInsert was not called");
+ if (!prepareinsert_called)
+ elog(ERROR, "XLogPrepareInsert was not called");
+
+ /*
+ * In bootstrap mode, we don't actually log anything but XLOG resources;
+ * return a phony record pointer.
+ */
+ if (IsBootstrapProcessingMode() && prepared_record == NULL)
+ {
+ XLogResetInsertion();
+ EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */
+ return EndPos;
+ }
+
+ EndPos = XLogInsertRecord(prepared_record, preprec_fpw_lsn, curinsert_flags,
+ preprec_num_fpw, preprec_topxid_included);
+
+ Assert(!XLogRecPtrIsInvalid(EndPos));
+ XLogResetInsertion();
+
+ return EndPos;
+}
+
/*
* Assemble a WAL record from the registered data and buffers into an
* XLogRecData chain, ready for insertion with XLogInsertRecord().
@@ -548,6 +664,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
rdt_datas_last = &hdr_rdt;
hdr_rdt.data = hdr_scratch;
+ /* We're rebuilding the record, reset the internal flag */
+ curinsert_flags &= ~XLOG_NOPAGEWRITE_USED;
+
/*
* Enforce consistency checks for this record if user is looking for it.
* Do this before at the beginning of this routine to give the possibility
@@ -584,7 +703,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
else if (regbuf->flags & REGBUF_NO_IMAGE)
needs_backup = false;
else if (!doPageWrites)
+ {
+ curinsert_flags |= XLOG_NOPAGEWRITE_USED;
needs_backup = false;
+ }
else
{
/*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index bfe279cddf..0cc6a25cca 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -431,6 +431,15 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) tuple->t_data, tuple->t_len);
+ /*
+ * The data in the record can currently not exceed the maximum size of
+ * a heap tuple by much more than a constant offset, which in turn is
+ * limited to about BLCKSZ.
+ * If we ever change this log record's format, be sure to check that
+ * we can't exceed the maximum size of xlog record while inside this
+ * critical section, or we'd run the chance of PANICing during this
+ * insertion.
+ */
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
@@ -849,6 +858,15 @@ nextval_internal(Oid relid, bool check_permissions)
XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
+ /*
+ * The data in the record can currently not exceed the maximum size of
+ * a heap tuple by much more than a constant offset, which in turn is
+ * limited to about BLCKSZ.
+ * If we ever change this log record's format, be sure to check that
+ * we can't exceed the maximum size of xlog record while inside this
+ * critical section, or we'd run the chance of PANICing during this
+ * insertion.
+ */
recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
PageSetLSN(page, recptr);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 7d3b9446e6..7d6a3de0a7 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -425,6 +425,30 @@ typedef struct xl_xact_parsed_abort
TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
+/* used to hold the data of the commit record when using prepared xlog insert */
+typedef struct xl_xact_commit_fields {
+ xl_xact_commit xlrec;
+ xl_xact_xinfo xl_xinfo;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_subxacts xl_subxacts;
+ xl_xact_relfilelocators xl_relfilelocators;
+ xl_xact_stats_items xl_dropped_stats;
+ xl_xact_invals xl_invals;
+ xl_xact_twophase xl_twophase;
+ xl_xact_origin xl_origin;
+} xl_xact_commit_fields;
+
+/* used to hold the data of the abort record when using prepared xlog insert */
+typedef struct xl_xact_abort_fields {
+ xl_xact_abort xlrec;
+ xl_xact_xinfo xl_xinfo;
+ xl_xact_subxacts xl_subxacts;
+ xl_xact_relfilelocators xl_relfilelocators;
+ xl_xact_stats_items xl_dropped_stats;
+ xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
+} xl_xact_abort_fields;
/* ----------------
* extern definitions
@@ -494,24 +518,28 @@ extern void MarkSubxactTopXidLogged(void);
extern int xactGetCommittedChildren(TransactionId **ptr);
-extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
- int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
- int ndroppedstats,
- xl_xact_stats_item *droppedstats,
- int nmsgs, SharedInvalidationMessage *msgs,
- bool relcacheInval,
- int xactflags,
- TransactionId twophase_xid,
- const char *twophase_gid);
-
-extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
- int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
- int ndroppedstats,
- xl_xact_stats_item *droppedstats,
- int xactflags, TransactionId twophase_xid,
- const char *twophase_gid);
+extern void XactPrepareCommitRecord(TimestampTz commit_time,
+ int nsubxacts, TransactionId *subxacts,
+ int nrels, RelFileLocator *rels,
+ int ndroppedstats,
+ xl_xact_stats_item *droppedstats,
+ int nmsgs, SharedInvalidationMessage *msgs,
+ bool relcacheInval,
+ int xactflags,
+ TransactionId twophase_xid,
+ const char *twophase_gid,
+ xl_xact_commit_fields *recdata);
+extern XLogRecPtr XactLogCommitRecord(void);
+
+extern void XactPrepareAbortRecord(TimestampTz abort_time,
+ int nsubxacts, TransactionId *subxacts,
+ int nrels, RelFileLocator *rels,
+ int ndroppedstats,
+ xl_xact_stats_item *droppedstats,
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid,
+ xl_xact_abort_fields *recdata);
+extern XLogRecPtr XactLogAbortRecord(void);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c10a3c3679 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -150,6 +150,7 @@ extern PGDLLIMPORT bool XLOG_DEBUG;
*/
#define XLOG_INCLUDE_ORIGIN 0x01 /* include the replication origin */
#define XLOG_MARK_UNIMPORTANT 0x02 /* record not important for durability */
+#define XLOG_NOPAGEWRITE_USED 0x04 /* xlog record would've had FPI, if not for settings */
/* Checkpoint statistics */
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 31785dc578..9d2c8a0893 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -42,6 +42,9 @@
extern void XLogBeginInsert(void);
extern void XLogSetRecordFlags(uint8 flags);
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
+extern void XLogPrepareInsert(RmgrId rmid, uint8 info);
+extern XLogRecPtr XLogInsertPrepared(void);
+
extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
extern void XLogRegisterData(char *data, uint32 len);
extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0d576f7883..72bee48c5b 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -54,6 +54,17 @@ typedef struct XLogRecord
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4MiB of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize (1020 * 1024 * 1024)
+
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
--
2.39.0
v11-0002-Add-protections-in-xlog-record-APIs-against-over.patchapplication/octet-stream; name=v11-0002-Add-protections-in-xlog-record-APIs-against-over.patchDownload
From ab7f5bd145fe2d21667cb7982db71d3de287b416 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 5 Apr 2023 16:11:56 +0200
Subject: [PATCH v11 2/2] Add protections in xlog record APIs against overflows
Before this, it was possible for an extension to create malicious WAL
records that were too large to replay; or that would overflow the
xl_tot_len field, causing potential corruption in WAL record IO ops.
Emitting invalid records was also possible through
pg_logical_emit_message(), which allowed you to emit arbitrary amounts
of data up to 2GB, much higher than the replay limit of approximately
1GB.
This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read
any max-sized XLogRecords, such that the wal-generating backend will
fail when it attempts to insert unreadable records, instead of that
insertion succeeding but breaking any replication streams.
---
src/backend/access/transam/xloginsert.c | 49 ++++++++++++++++++++++---
src/include/access/xlogrecord.h | 11 ++++++
2 files changed, 54 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 134cf32bb4..6437234994 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -98,7 +98,7 @@ static int max_registered_block_id = 0; /* highest block_id + 1 currently
*/
static XLogRecData *mainrdata_head;
static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
-static uint32 mainrdata_len; /* total # of bytes in chain */
+static uint64 mainrdata_len; /* total # of bytes in chain */
/* flags for the in-progress insertion */
static uint8 curinsert_flags = 0;
@@ -368,7 +368,10 @@ XLogRegisterData(char *data, uint32 len)
Assert(begininsert_called);
if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use",
+ num_rdatas, max_rdatas)));
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -418,9 +421,16 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
* regbuf->rdata_len does not grow beyond what
* XLogRecordBlockHeader->data_length can hold.
*/
- if (num_rdatas >= max_rdatas ||
- regbuf->rdata_len + len > UINT16_MAX)
- elog(ERROR, "too much WAL data");
+ if (num_rdatas >= max_rdatas)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use",
+ num_rdatas, max_rdatas)));
+ if (regbuf->rdata_len + len > UINT16_MAX || len > UINT16_MAX)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("Registering more than max %u bytes total to block %u: current %uB, adding %uB",
+ UINT16_MAX, block_id, regbuf->rdata_len, len)));
rdata = &rdatas[num_rdatas++];
@@ -643,7 +653,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -994,6 +1004,20 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure we will
+ * not emit records larger than those sizes we advertise we support.
+ * This cap is based on DecodeXLogRecordRequiredSpace().
+ */
+ if (total_len >= XLogRecordMaxSize)
+ ereport(ERROR,
+ (errmsg_internal("oversized WAL record"),
+ errdetail_internal("WAL record would be %llu bytes (of max %u bytes); rmid %u flags %u",
+ (unsigned long long) total_len, XLogRecordMaxSize, rmid, info)));
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
@@ -1419,6 +1443,19 @@ log_newpage_range(Relation rel, ForkNumber forknum,
void
InitXLogInsert(void)
{
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Check that any records assembled can be decoded. This is capped
+ * based on what XLogReader would require at its maximum bound.
+ * This code path is called once per backend, more than enough for
+ * this check.
+ */
+ size_t max_required = DecodeXLogRecordRequiredSpace(XLogRecordMaxSize);
+
+ Assert(AllocSizeIsValid(max_required));
+#endif
+
/* Initialize the working areas */
if (xloginsert_cxt == NULL)
{
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 72bee48c5b..06cca575ea 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -65,6 +65,17 @@ typedef struct XLogRecord
*/
#define XLogRecordMaxSize (1020 * 1024 * 1024)
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4M of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize (1020 * 1024 * 1024)
+
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
--
2.39.0
On Wed, Apr 05, 2023 at 04:35:37PM +0200, Matthias van de Meent wrote:
I thought that the plan was to use int64 to skip checking for most
overflows and to do a single check at the end in XLogRecordAssemble,
so that the checking has minimal overhead in the performance-critical
log record assembling path and reduced load on the branch predictor.
And that's the reason why your v11-0002 is better and simpler than the
v10-0001 I posted a few days ago.
+ if (regbuf->rdata_len + len > UINT16_MAX || len > UINT16_MAX)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("Registering more than max %u bytes total to block %u: current %uB, adding %uB",
+ UINT16_MAX, block_id, regbuf->rdata_len, len)));
I was wondering for a few minutes about the second part of this
check.. But you are worried about the case where len is too large
that it would overflow rdata_len if calling XLogRegisterBufData() more
than once on the same block, if len is between
(UINT32_MAX-UINT16_MAX,UINT32_MAX) on the second call.
The extra errdetail_internal() could be tweaked a bit more, but I'm
also OK with your proposal, overall. One thing is "current %uB,
adding %uB" would be better using "bytes".
One more issue that Andres was suggesting we'd fix was to allow XLog
assembly separate from the actual XLog insertion:
Currently you can't pre-assemble a record outside a critical section
if the record must be inserted in a critical section, which makes e.g.
commit records problematic due to the potentially oversized data
resulting in ERRORs during record assembly. This would crash postgres
because commit xlog insertion happens in a critical section. Having a
pre-assembled record would greatly improve the ergonomics in that path
and reduce the length of the critical path.I think it was something along the lines of the attached; 0001
contains separated Commit/Abort record construction and insertion like
Andres suggested,
I am honestly not sure whether we should complicate xloginsert.c this
way, but we could look at that for v17.
0002 does the size checks with updated error messages.
0002 can also be done before 0001, so I'd like to get that part
applied on HEAD before the feature freeze and close this thread. If
there are any objections, please feel free..
--
Michael
On Thu, Apr 06, 2023 at 10:54:43AM +0900, Michael Paquier wrote:
0002 can also be done before 0001, so I'd like to get that part
applied on HEAD before the feature freeze and close this thread. If
there are any objections, please feel free..
I was doing a pre-commit review of the patch, and double-checked the
uses of mainrdata_len. And there is this part:
/* followed by main data, if any */
if (mainrdata_len > 0)
{
if (mainrdata_len > 255)
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
memcpy(scratch, &mainrdata_len, sizeof(uint32));
scratch += sizeof(uint32);
}
else
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
*(scratch++) = (uint8) mainrdata_len;
}
rdt_datas_last->next = mainrdata_head;
rdt_datas_last = mainrdata_last;
total_len += mainrdata_len;
}
rdt_datas_last->next = NULL;
So bumping mainrdata_len to uint64 is actually not entirely in line
with this code. Well, it will work because we'd still fail a couple
of lines down, but perhaps its readability should be improved so as
we have an extra check in this code path to make sure that
mainrdata_len is not higher than PG_UINT32_MAX, then use an
intermediate casted variable before saving the length in the record
data to make clear that the type of the main static length in
xloginsert.c is not the same as what a record has? The v10 I sent
previously blocked this possibility, but not v11.
--
Michael
On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
So bumping mainrdata_len to uint64 is actually not entirely in line
with this code. Well, it will work because we'd still fail a couple
of lines down, but perhaps its readability should be improved so as
we have an extra check in this code path to make sure that
mainrdata_len is not higher than PG_UINT32_MAX, then use an
intermediate casted variable before saving the length in the record
data to make clear that the type of the main static length in
xloginsert.c is not the same as what a record has? The v10 I sent
previously blocked this possibility, but not v11.
So, I was thinking about something like the attached tweaking this
point, the error details a bit, applying an indentation and writing a
commit message... Matthias?
--
Michael
Attachments:
v12-0001-Add-more-protections-in-WAL-record-APIs-against-.patchtext/x-diff; charset=us-asciiDownload
From 10eff110176a4988295118b0f17a2a868fb9b96e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 7 Apr 2023 08:19:24 +0900
Subject: [PATCH v12] Add more protections in WAL record APIs against overflows
This commit adds a limit to the size of an XLogRecord at 1020MiB, based
on a suggestion by Heikki Linnakangas. This counts for the overhead
needed by the xlogreader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace(). An assertion based on that
is added to make sure to detect that any additions in the XLogReader
facilities would not cause any overflows. If that's ever the case, the
upper bound would need to be adjusted.
Before this, it was possible for an external module to create WAL
records large enough to be assembled still not replayable, causing
failures when replaying such WAL records on standbys. One case
mentioned where this is possible is the in-core function
pg_logical_emit_message(), that allows to emit WAL records with an
arbitrary amount of data up to 2GB, much higher than the replay limit
of approximately 1GB (limit of a palloc).
This commit is a follow-up of ffd1b6b that has added similar protections
for the block-level data. Here, the checks are extended to the whole
record length. All the error messages related to overflow checks are
improved to provide more context about what is happening.
Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
---
src/include/access/xlogrecord.h | 11 +++++
src/backend/access/transam/xloginsert.c | 60 ++++++++++++++++++++++---
2 files changed, 64 insertions(+), 7 deletions(-)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0d576f7883..f355e08e1d 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -62,6 +62,17 @@ typedef struct XLogRecord
#define XLR_INFO_MASK 0x0F
#define XLR_RMGR_INFO_MASK 0xF0
+/*
+ * XLogReader needs to allocate all the data of a WAL record in a single
+ * chunk. This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4M of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize (1020 * 1024 * 1024)
+
/*
* If a WAL record modifies any relation files, in ways not covered by the
* usual block references, this flag is set. This is not used for anything
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 008612e032..b71e690c5b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -98,7 +98,7 @@ static int max_registered_block_id = 0; /* highest block_id + 1 currently
*/
static XLogRecData *mainrdata_head;
static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
-static uint32 mainrdata_len; /* total # of bytes in chain */
+static uint64 mainrdata_len; /* total # of bytes in chain */
/* flags for the in-progress insertion */
static uint8 curinsert_flags = 0;
@@ -355,7 +355,10 @@ XLogRegisterData(char *data, uint32 len)
Assert(begininsert_called);
if (num_rdatas >= max_rdatas)
- elog(ERROR, "too much WAL data");
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use.",
+ num_rdatas, max_rdatas)));
rdata = &rdatas[num_rdatas++];
rdata->data = data;
@@ -405,9 +408,16 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
* regbuf->rdata_len does not grow beyond what
* XLogRecordBlockHeader->data_length can hold.
*/
- if (num_rdatas >= max_rdatas ||
- regbuf->rdata_len + len > UINT16_MAX)
- elog(ERROR, "too much WAL data");
+ if (num_rdatas >= max_rdatas)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use.",
+ num_rdatas, max_rdatas)));
+ if (regbuf->rdata_len + len > UINT16_MAX || len > UINT16_MAX)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("Registering more than max %u bytes allowed to block %u: current %u bytes, adding %u bytes.",
+ UINT16_MAX, block_id, regbuf->rdata_len, len)));
rdata = &rdatas[num_rdatas++];
@@ -527,7 +537,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
{
XLogRecData *rdt;
- uint32 total_len = 0;
+ uint64 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
@@ -841,8 +851,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
{
if (mainrdata_len > 255)
{
+ uint32 mainrdata_len_4b;
+
+ if (mainrdata_len > PG_UINT32_MAX)
+ ereport(ERROR,
+ (errmsg_internal("too much WAL data"),
+ errdetail_internal("Main data length at %llu bytes for a max of %u bytes.",
+ (unsigned long long) mainrdata_len,
+ PG_UINT32_MAX)));
+
+ mainrdata_len_4b = (uint32) mainrdata_len;
*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
- memcpy(scratch, &mainrdata_len, sizeof(uint32));
+ memcpy(scratch, &mainrdata_len_4b, sizeof(uint32));
scratch += sizeof(uint32);
}
else
@@ -872,6 +892,20 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+ /*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure that we will
+ * not emit records larger than the sizes advertised to be supported.
+ * This cap is based on DecodeXLogRecordRequiredSpace().
+ */
+ if (total_len >= XLogRecordMaxSize)
+ ereport(ERROR,
+ (errmsg_internal("oversized WAL record"),
+ errdetail_internal("WAL record would be %llu bytes (of max %u bytes); rmid %u flags %u.",
+ (unsigned long long) total_len, XLogRecordMaxSize, rmid, info)));
+
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
@@ -1297,6 +1331,18 @@ log_newpage_range(Relation rel, ForkNumber forknum,
void
InitXLogInsert(void)
{
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Check that any records assembled can be decoded. This is capped based
+ * on what XLogReader would require at its maximum bound. This code path
+ * is called once per backend, more than enough for this check.
+ */
+ size_t max_required = DecodeXLogRecordRequiredSpace(XLogRecordMaxSize);
+
+ Assert(AllocSizeIsValid(max_required));
+#endif
+
/* Initialize the working areas */
if (xloginsert_cxt == NULL)
{
--
2.40.0
On Fri, 7 Apr 2023, 01:35 Michael Paquier, <michael@paquier.xyz> wrote:
On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
So bumping mainrdata_len to uint64 is actually not entirely in line
with this code. Well, it will work because we'd still fail a couple
of lines down, but perhaps its readability should be improved so as
we have an extra check in this code path to make sure that
mainrdata_len is not higher than PG_UINT32_MAX, then use an
intermediate casted variable before saving the length in the record
data to make clear that the type of the main static length in
xloginsert.c is not the same as what a record has? The v10 I sent
previously blocked this possibility, but not v11.
Yes, that was a bad oversight, which would've shown up in tests on a system
with an endianness that my computer doesn't have...
So, I was thinking about something like the attached tweaking this
point, the error details a bit, applying an indentation and writing a
commit message... Matthias?
That looks fine to me. Thanks for picking this up and fixing the issue.
Kind regards,
Matthias van de Meent
On Fri, Apr 07, 2023 at 01:50:00AM +0200, Matthias van de Meent wrote:
Yes, that was a bad oversight, which would've shown up in tests on a system
with an endianness that my computer doesn't have...
I don't think that we have many bigendian animals in the buildfarm,
either..
That looks fine to me. Thanks for picking this up and fixing the issue.
Okay, cool!
--
Michael
On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
Okay, cool!
Done this one with 8fcb32d.
--
Michael
On Fri, 7 Apr 2023 at 08:05, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
Okay, cool!
Done this one with 8fcb32d.
Thanks a lot! I'll post the separation of record construction and
write-out to xlog in a future thread for 17.
One remaining question: Considering that the changes and checks of
that commit are mostly internal to xloginsert.c (or xlog.c in older
releases), and that no special public-facing changes were made, would
it be safe to backport this to older releases?
PostgreSQL 15 specifically would benefit from this as it supports
external rmgrs which may generate WAL records and would benefit from
these additional checks, but all supported releases of PostgreSQL have
pg_logical_emit_message and are thus easily subject to the issue of
writing oversized WAL records and subsequent recovery- and replication
stream failures.
Kind regards,
Matthias van de Meent
On Sat, Apr 08, 2023 at 04:24:35PM +0200, Matthias van de Meent wrote:
Thanks a lot! I'll post the separation of record construction and
write-out to xlog in a future thread for 17.
Thanks! Creating a new thread makes sense.
One remaining question: Considering that the changes and checks of
that commit are mostly internal to xloginsert.c (or xlog.c in older
releases), and that no special public-facing changes were made, would
it be safe to backport this to older releases?
The routine changes done in ffd1b6b cannot be backpatched on ABI
grounds, still you would propose to have protection around
needs_data as well as the whole record length.
PostgreSQL 15 specifically would benefit from this as it supports
external rmgrs which may generate WAL records and would benefit from
these additional checks, but all supported releases of PostgreSQL have
pg_logical_emit_message and are thus easily subject to the issue of
writing oversized WAL records and subsequent recovery- and replication
stream failures.
Custom RMGRs are a good argument, though I don't really see an urgent
argument about doing something in REL_15_STABLE. For one, it would
mean more backpatching conflicts with ~14. Another argument is that
XLogRecordMaxSize is not an exact science, either. In ~15, a record
with a total size between XLogRecordMaxSize and
DecodeXLogRecordRequiredSpace(MaxAllocSize) would work, though it
would not in 16~ because we have the 4MB margin given as room for the
per-record allocation in the XLogReader. A record of such a size
would not be generated anymore after a minor release update of 15.3~
if we were to do something about that by May on REL_15_STABLE.
--
Michael