Re: Panic during xlog building with big values

Started by Maksim.Melnikov, 3 months ago; 2 messages

#1 Maksim.Melnikov
m.melnikov@postgrespro.ru
1 attachment

Hi, Andy, thanks for your review!

I've checked RecordTransactionCommit too, but I don't think it can fire a
similar error. The problem described above occurred because we used
external column storage without compression and with REPLICA IDENTITY
FULL.
To be honest, it's a degenerate case that can occur only on tuple
update/delete, when we need the full row to identify the updated/deleted
value; more info can be found in the documentation [1].

I've fixed the comments per your remarks, thanks. The patch is attached.

Also rebased the patch on commit d3ba50db48e66be8804b9edf093b0f921d625425.

[1]: https://www.postgresql.org/docs/current/logical-replication-publication.html

Best regards,
Maksim Melnikov

Attachments:

v3-0001-Pre-check-potential-XLogRecord-oversize.patch (text/x-patch, charset=UTF-8)
From 125b8b75f8db12cfb076709a914d7717060e2e51 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov <m.melnikov@postgrespro.ru>
Date: Fri, 15 Aug 2025 12:15:09 +0300
Subject: [PATCH v3] Pre-check potential XLogRecord oversize for heap_update
 and heap_insert.

Pre-check potential XLogRecord oversize. The XLogRecord will be created
later, and its size will be checked then. However, that check happens
within a critical section, and in the event of failure, a core dump
will be generated.
That doesn't seem good, so to avoid it we can calculate the approximate
xlog record size beforehand and check it.

Size prediction is based on the xlog update and xlog delete logic, and
may need revising if that logic changes. For now, the buf size is
limited by UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in
xloginsert).

Anyway, to accommodate some overhead, 1M is added to the predicted
value. It seems like that's enough for now.
---
 src/backend/access/heap/heapam.c        | 31 +++++++++++++++++++++++++
 src/backend/access/transam/xloginsert.c | 20 ++++++++++++++++
 src/include/access/xloginsert.h         |  1 +
 3 files changed, 52 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..bc9b115d279 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -61,6 +61,7 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 								  Buffer newbuf, HeapTuple oldtup,
 								  HeapTuple newtup, HeapTuple old_key_tuple,
 								  bool all_visible_cleared, bool new_all_visible_cleared);
+static void log_heap_precheck(Relation reln, HeapTuple tp);
 #ifdef USE_ASSERT_CHECKING
 static void check_lock_if_inplace_updateable_rel(Relation relation,
 												 ItemPointer otid,
@@ -9061,6 +9062,34 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	return recptr;
 }
 
+/*
+ * Pre-check potential XLogRecord oversize. The XLogRecord will be created
+ * later, and its size will be checked then. However, that check happens
+ * within a critical section, and in the event of failure, a core
+ * dump will be generated.
+ * It doesn't seem good, so to avoid this, we can calculate the approximate
+ * xlog record size here and check it.
+ *
+ * Size prediction is based on xlog update and xlog delete logic, and can
+ * be revised if it changes. For now, the buf size is limited by
+ * UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in xloginsert).
+ *
+ * Anyway, to accommodate some overhead, 1M is added to the predicted
+ * value. It seems like that's enough for now.
+ */
+static void
+log_heap_precheck(Relation reln, HeapTuple tp)
+{
+#define XLogRecordMaxOverhead ((uint32) (1024 * 1024))
+
+	if (tp && RelationIsLogicallyLogged(reln))
+	{
+		uint32		data_len = tp->t_len - SizeofHeapTupleHeader;
+
+		XLogPreCheckSize(data_len + XLogRecordMaxOverhead);
+	}
+}
+
 /*
  * Perform XLogInsert of an XLOG_HEAP2_NEW_CID record
  *
@@ -9178,6 +9207,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 			*copy = true;
 			tp = toast_flatten_tuple(tp, desc);
 		}
+		log_heap_precheck(relation, tp);
 		return tp;
 	}
 
@@ -9234,6 +9264,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 		heap_freetuple(oldtup);
 	}
 
+	log_heap_precheck(relation, key_tuple);
 	return key_tuple;
 }
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 496e0fa4ac6..bb770294c2d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1053,6 +1053,26 @@ XLogCheckBufferNeedsBackup(Buffer buffer)
 	return false;				/* buffer does not need to be backed up */
 }
 
+/*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure that we will
+ * not emit records larger than the sizes advertised to be supported.
+ */
+void
+XLogPreCheckSize(uint32 data_len)
+{
+	if (data_len > XLogRecordMaxSize)
+	{
+		ereport(ERROR,
+				(errmsg_internal("oversized WAL record"),
+				 errdetail_internal("WAL record would be %u bytes (of maximum "
+									"%u bytes).",
+									data_len, XLogRecordMaxSize)));
+	}
+}
+
 /*
  * Write a backup block if needed when we are setting a hint. Note that
  * this may be called for a variety of page types, not just heaps.
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index d6a71415d4f..7b40a824d9b 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -54,6 +54,7 @@ extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
 extern void XLogRegisterBufData(uint8 block_id, const void *data, uint32 len);
 extern void XLogResetInsertion(void);
 extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
+extern void XLogPreCheckSize(uint32 size);
 
 extern XLogRecPtr log_newpage(RelFileLocator *rlocator, ForkNumber forknum,
 							  BlockNumber blkno, Page page, bool page_std);
-- 
2.43.0

#2 Michael Paquier
michael@paquier.xyz
In reply to: Maksim.Melnikov (#1)

On Tue, Oct 14, 2025 at 10:08:12AM +0300, Maksim.Melnikov wrote:

> I've checked RecordTransactionCommit too, but I don't think it can fire
> similar error. Problem, that was described above, occurred because we used
> external column storage without compression and with REPLICA IDENTITY FULL.
> To be honest, it's degenerate case, that can occur only in case of tuple
> update/delete, when we need full row to identify updated/deleted value, more
> info can be found in doc [1].

"Degenerate" sounds like a pretty good term to define your test case.
So the issue is that the uncompressed TOAST blobs get so large that
the mainrdata_len computed with a single call of XLogRegisterData()
triggers the size restriction. The protections added in XLogInsert()
are doing their job here: the record generated by the UPDATE cannot be
replayed, failing on an allocation failure in the standby if one lifts
the size restriction in XLogInsert(). What's pretty "good" about your
case is that the first INSERT is large, but small enough that the
palloc() on the initial insertion does not fail, so it succeeds. Only
the second UPDATE would become large enough, still you are able to
bypass the allocation limits with a combination of the old and new
tuple data that need to be replicated because of the full replica
identity. Fun case, I'd say.

> I've fixed comments with yours remarks, thanks. Patch is attached.

I see what you are doing in your patch. ExtractReplicaIdentity() has
only two callers: heap_update() or heap_delete(). Both document that
this stuff happens before entering a critical section to avoid a PANIC
on allocation, but this does not account for the overhead required by a
WAL record because we don't know yet how large the record will be
(well, most of it is going to be the old tuple key anyway), as we may
have full pages, some of them compressed, or holes. Then your patch adds an
extra check depending on the size of the "old" key generated.

+static void
+log_heap_precheck(Relation reln, HeapTuple tp)
+{
+#define XLogRecordMaxOverhead ((uint32) (1024 * 1024))
+
+    if (tp && RelationIsLogicallyLogged(reln))
+    {
+        uint32        data_len = tp->t_len - SizeofHeapTupleHeader;
+
+        XLogPreCheckSize(data_len + XLogRecordMaxOverhead);
+    }
+}

This adds a size prediction of XLogRecordMaxOverhead on top of the
existing XLogRecordMaxSize, which is itself an estimation with a 4MB
allocation overhead allowed, so you are adding a second estimation
layer on top of the existing one based on how much the XLogReader
needs when processing a record. This is not optimal, and we cannot
have a precise number until we have computed all the elements that
build a WAL record.

Some numbers I've grabbed on the way, while looking at your case, for
reference:
- size of allocation at replay: 1073750016
- number of repeat values in the UPDATE: 1073741717
- size registered in XLogRegisterData(): 1073741746

A different way to think about the problem would be to rework the way
we flatten the tuple when an old tuple is extracted in full. For
example, if some attributes are external but not compressed, we could
also take the route to force some compression in the key extracted to
make it shorter and able to fit in a record all the time. External
but uncompressed data is not a very common case, so this may not
justify the extra implementation cost and complications in the tuple
flattening routines.

Perhaps the best answer is just to do nothing here.
--
Michael