Using per-transaction memory contexts for storing decoded tuples
Hi all,
We have several reports that logical decoding uses much more memory
than logical_decoding_work_mem[1][2][3]. For instance, in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because it used more
than 4GB of memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account for fragmentation,
block header size, or chunk header size in per-transaction memory
usage (i.e. txn->size), rb->size could be less than
logical_decoding_work_mem while the actual memory allocated in the
context is much higher than logical_decoding_work_mem.
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended for commit), which
adds memory usage reporting and includes a test. After running the
TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like the following:
LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
This means that logical decoding allocated about 1GB of memory even
though logical_decoding_work_mem was set to 256MB.
One idea to deal with this problem is to use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() drops below
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions, because a transaction whose decoded
tuple data is spread across memory context blocks could be evicted
only after all other, larger transactions are evicted (note that we
evict transactions from largest to smallest). Therefore, if we want
to do that, we would need to change the eviction algorithm, for
example to LSN order instead of transaction-size order. Furthermore,
the reorderbuffer data that is counted in txn->size (and therefore
rb->size too) is stored in different memory contexts depending on its
kind. For example, decoded tuples are stored in rb->tup_context,
ReorderBufferChange structs are stored in rb->change_context, and
snapshot data is stored in builder->context. So we would need to sort
out which data is stored in which memory context and which memory
contexts should be accounted for in the reorderbuffer's memory usage,
which could be a large amount of work.
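For reference, a minimal sketch of what this first idea might look
like (hypothetical only; the real limit check works on rb->size and
picks the largest transaction via ReorderBufferLargestTXN()):

/*
 * Hypothetical sketch of idea 1: keep evicting until the memory
 * actually allocated under the reorderbuffer context drops below the
 * limit.  Eviction is by oldest LSN here rather than by size; details
 * such as skipping already-serialized transactions are omitted.
 */
while (MemoryContextMemAllocated(rb->context, true) >=
       logical_decoding_work_mem * (Size) 1024)
{
    ReorderBufferTXN *txn = ReorderBufferGetOldestTXN(rb);

    if (txn == NULL)
        break;
    ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
}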
Another idea is to have a memory context for storing decoded tuples
per transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think this idea is simpler and worth considering, so I attached a PoC
patch, use_tup_context_per_tx_v1.patch. Since the decoded tuple data
is allocated individually when the corresponding WAL record is decoded
but released collectively (e.g., at transaction eviction), the bump
memory context fits this case best. One exception is that we
immediately free the decoded tuple data if the transaction is already
aborted; however, in that case, I think it is possible to skip the WAL
decoding in the first place.
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high,
since it has entries for both top-level transactions and
sub-transactions. To deal with that, the patch changes things so that
decoded tuples of a sub-transaction are stored in its top-level
transaction's tuple context. Since we always evict sub-transactions
and their top-level transaction at the same time, I think it would not
be a problem. Checking for performance degradation due to using many
memory contexts would still have to be done.
Even with this idea, we would still have a mismatch between the actual
amount of allocated memory and rb->size, but it would be much better
than today. And something like the first idea would be necessary to
address this mismatch, and we can discuss it in a separate thread.
Regards,
[1]: /messages/by-id/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
[2]: /messages/by-id/17974-f8c9d353a62f414d@postgresql.org
[3]: /messages/by-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252@DB9PR07MB7180.eurprd07.prod.outlook.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
use_tup_context_per_tx_v1.patch
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..3818efacb0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -932,9 +932,11 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
tuplelen = datalen - SizeOfHeapHeader;
change->data.tp.newtuple =
- ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+ ReorderBufferGetTupleBufForXid(ctx->reorder, XLogRecGetXid(r),
+ buf->origptr, tuplelen);
- DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ if (change->data.tp.newtuple)
+ DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
change->data.tp.clear_toast_afterwards = true;
@@ -984,9 +986,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
tuplelen = datalen - SizeOfHeapHeader;
change->data.tp.newtuple =
- ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+ ReorderBufferGetTupleBufForXid(ctx->reorder,
+ XLogRecGetXid(r),
+ buf->origptr,
+ tuplelen);
- DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+ if (change->data.tp.newtuple)
+ DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1000,9 +1006,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
tuplelen = datalen - SizeOfHeapHeader;
change->data.tp.oldtuple =
- ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+ ReorderBufferGetTupleBufForXid(ctx->reorder,
+ XLogRecGetXid(r),
+ buf->origptr,
+ tuplelen);
- DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
+ if (change->data.tp.oldtuple)
+ DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
change->data.tp.clear_toast_afterwards = true;
@@ -1055,10 +1065,14 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
Assert(XLogRecGetDataLen(r) > (SizeOfHeapDelete + SizeOfHeapHeader));
change->data.tp.oldtuple =
- ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
-
- DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
- datalen, change->data.tp.oldtuple);
+ ReorderBufferGetTupleBufForXid(ctx->reorder,
+ XLogRecGetXid(r),
+ buf->origptr,
+ tuplelen);
+
+ if (change->data.tp.oldtuple)
+ DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
+ datalen, change->data.tp.oldtuple);
}
change->data.tp.clear_toast_afterwards = true;
@@ -1164,7 +1178,10 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
datalen = xlhdr->datalen;
change->data.tp.newtuple =
- ReorderBufferGetTupleBuf(ctx->reorder, datalen);
+ ReorderBufferGetTupleBufForXid(ctx->reorder,
+ XLogRecGetXid(r),
+ buf->origptr,
+ datalen);
tuple = change->data.tp.newtuple;
header = tuple->t_data;
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 00a8327e77..65b80fc156 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -336,17 +336,6 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
- /*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
- */
- buffer->tup_context = GenerationContextCreate(new_ctx,
- "Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
-
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
hash_ctl.hcxt = buffer->context;
@@ -467,6 +456,10 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/* Reset the toast hash */
ReorderBufferToastReset(rb, txn);
+ /* Remove the memory context for storing decoded tuples */
+ if (txn->tup_context)
+ MemoryContextDelete(txn->tup_context);
+
pfree(txn);
}
@@ -504,17 +497,15 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
case REORDER_BUFFER_CHANGE_UPDATE:
case REORDER_BUFFER_CHANGE_DELETE:
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
+ /*
+ * Tuple buffers will be freed altogether when the tuple context
+ * is deleted.
+ */
if (change->data.tp.newtuple)
- {
- ReorderBufferReturnTupleBuf(change->data.tp.newtuple);
change->data.tp.newtuple = NULL;
- }
-
if (change->data.tp.oldtuple)
- {
- ReorderBufferReturnTupleBuf(change->data.tp.oldtuple);
change->data.tp.oldtuple = NULL;
- }
+
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
if (change->data.msg.prefix != NULL)
@@ -559,20 +550,57 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
* overhead).
*/
HeapTuple
-ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
+ReorderBufferGetTupleBufForTXN(ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ Size tuple_len)
{
HeapTuple tuple;
Size alloc_len;
- alloc_len = tuple_len + SizeofHeapTupleHeader;
+ if (txn->tup_context == NULL)
+ txn->tup_context = BumpContextCreate(rb->context,
+ "per-txn tuple context",
+ ALLOCSET_DEFAULT_SIZES);
- tuple = (HeapTuple) MemoryContextAlloc(rb->tup_context,
+ alloc_len = tuple_len + SizeofHeapTupleHeader;
+ tuple = (HeapTuple) MemoryContextAlloc(txn->tup_context,
HEAPTUPLESIZE + alloc_len);
tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
return tuple;
}
+/*
+ * Similar to ReorderBufferGetTupleBufForTXN but for the transaction
+ * of the given xid.
+ */
+HeapTuple
+ReorderBufferGetTupleBufForXid(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, Size tuple_len)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+ Assert(txn != NULL);
+
+ /*
+ * If the transaction is a sub-transaction, its decoded tuple data is
+ * stored in its top-level transaction's memory context. This reduces
+ * the number of tuple memory contexts to create.
+ *
+ * XXX: this assumes that sub-transactions are freed (when being
+ * serialized or being streamed) together with their top-level
+ * transaction.
+ */
+ txn = rbtxn_get_toptxn(txn);
+
+ if (txn->concurrent_abort)
+ return NULL;
+
+ return ReorderBufferGetTupleBufForTXN(rb, txn, tuple_len);
+}
+
/*
* Free a HeapTuple returned by ReorderBufferGetTupleBuf().
*/
@@ -678,6 +706,7 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
txn = ent->txn;
txn->first_lsn = lsn;
txn->restart_decoding_lsn = rb->current_restart_decoding_lsn;
+ txn->tup_context = NULL;
if (create_as_top)
{
@@ -3765,6 +3794,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
txn->nentries_mem = 0;
txn->txn_flags |= RBTXN_IS_SERIALIZED;
+ /* Reset the memory context for storing decoded tuples */
+ if (txn->tup_context != NULL)
+ MemoryContextReset(txn->tup_context);
+
if (fd != -1)
CloseTransientFile(fd);
}
@@ -4396,7 +4429,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
uint32 tuplelen = ((HeapTuple) data)->t_len;
change->data.tp.oldtuple =
- ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
+ ReorderBufferGetTupleBufForTXN(rb, txn,
+ tuplelen - SizeofHeapTupleHeader);
/* restore ->tuple */
memcpy(change->data.tp.oldtuple, data,
@@ -4421,7 +4455,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
sizeof(uint32));
change->data.tp.newtuple =
- ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
+ ReorderBufferGetTupleBufForTXN(rb, txn,
+ tuplelen - SizeofHeapTupleHeader);
/* restore ->tuple */
memcpy(change->data.tp.newtuple, data,
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..ab3350ca51 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -407,6 +407,9 @@ typedef struct ReorderBufferTXN
*/
pairingheap_node txn_node;
+ /* Memory context for storing decoded tuples */
+ MemoryContext tup_context;
+
/*
* Size of this transaction (changes currently in memory, in bytes).
*/
@@ -626,7 +629,6 @@ struct ReorderBuffer
*/
MemoryContext change_context;
MemoryContext txn_context;
- MemoryContext tup_context;
XLogRecPtr current_restart_decoding_lsn;
@@ -668,9 +670,14 @@ struct ReorderBuffer
extern ReorderBuffer *ReorderBufferAllocate(void);
extern void ReorderBufferFree(ReorderBuffer *rb);
-extern HeapTuple ReorderBufferGetTupleBuf(ReorderBuffer *rb,
- Size tuple_len);
extern void ReorderBufferReturnTupleBuf(HeapTuple tuple);
+extern HeapTuple ReorderBufferGetTupleBufForTXN(ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ Size tuple_len);
+extern HeapTuple ReorderBufferGetTupleBufForXid(ReorderBuffer *rb,
+ TransactionId xid,
+ XLogRecPtr lsn,
+ Size tuple_len);
extern ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *rb);
extern void ReorderBufferReturnChange(ReorderBuffer *rb,
rb_excessive_memory_reproducer.patch
diff --git a/contrib/test_decoding/t/002_rb_memory.pl b/contrib/test_decoding/t/002_rb_memory.pl
new file mode 100644
index 0000000000..f20c6fb181
--- /dev/null
+++ b/contrib/test_decoding/t/002_rb_memory.pl
@@ -0,0 +1,45 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init(allows_streaming => 'logical');
+$node->start;
+
+$node->safe_psql('postgres', 'create table test (a int, b text)');
+$node->safe_psql('postgres',
+ q[select pg_create_logical_replication_slot('s', 'test_decoding')]);
+my $psql1 = $node->background_psql('postgres', on_error_stop => 0);
+my $psql2 = $node->background_psql('postgres', on_error_stop => 0);
+
+$psql2->query_safe(q[begin]);
+for my $i (1 .. 300)
+{
+ $psql1->query_safe(q[
+begin;
+insert into test select c, repeat(md5(c::text), 10) from generate_series(1, 20000) c;
+commit;
+]);
+
+ $psql2->query_safe(q[insert into test select c, repeat(md5(c::text), 10) from generate_series(1, 150) c;]);
+}
+
+$psql2->query_safe(q[commit]);
+
+$psql1->quit;
+$psql2->quit;
+
+my $ret = $node->safe_psql('postgres',
+ q[
+set logical_decoding_work_mem to '256MB';
+select count(*) from pg_logical_slot_peek_changes('s', null, null);
+]);
+
+is($ret, 6045602, "test");
+
+$node->stop;
+
+done_testing();
+
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..26d03e58da 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3632,6 +3632,18 @@ static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;
+ static bool report_1gb_mem = true;
+
+ if (report_1gb_mem && MemoryContextMemAllocated(rb->tup_context, true) > (1024L * 1024L * 1024L))
+ {
+ elog(LOG, "reorderbuffer memory: logical_decoding_work_mem=%lu rb->size=%lu rb->tup_context=%lu rb->context=%lu",
+ logical_decoding_work_mem * 1024L,
+ rb->size,
+ MemoryContextMemAllocated(rb->tup_context, true),
+ MemoryContextMemAllocated(rb->context, true));
+
+ report_1gb_mem = false; /* report only once */
+ }
/*
* Bail out if debug_logical_replication_streaming is buffered and we
On 2024-09-12 07:32, Masahiko Sawada wrote:
Thanks a lot for working on this!
Hi all,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:
LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.
One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.
Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.
I haven't read the patch yet, but it seems a reasonable approach.
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context. We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.
Yeah, and I imagine there would be cases where the current
implementation shows better performance, such as when there are no long
transactions, but compared to unexpected memory bloat and OOM kill, I
think it's far more better.
Even with this idea, we would still have a mismatch between the actual
amount of allocated memory and rb->size, but it would be much better
than today. And something like the first idea would be necessary to
address this mismatch, and we can discuss it in a separate thread.
Regards,
[1]
/messages/by-id/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
[2]
/messages/by-id/17974-f8c9d353a62f414d@postgresql.org
[3]
/messages/by-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252@DB9PR07MB7180.eurprd07.prod.outlook.com
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released? Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:
LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.
One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.
Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context.
Won't that impact the calculation for ReorderBufferLargestTXN() which
can select either subtransaction or top-level xact?
We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.
The generation context has been introduced in commit a4ccc1ce which
claims that it has significantly reduced logical decoding's memory
usage and improved its performance. Won't changing it requires us to
validate all the cases which led us to use generation context in the
first place?
--
With Regards,
Amit Kapila.
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?
In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:
LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.
One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.
Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context.
Won't that impact the calculation for ReorderBufferLargestTXN() which
can select either subtransaction or top-level xact?
Yeah, I missed that we could evict only sub-transactions when choosing
the largest transaction. We need to find a better solution.
We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.
The generation context has been introduced in commit a4ccc1ce which
claims that it has significantly reduced logical decoding's memory
usage and improved its performance. Won't changing it requires us to
validate all the cases which led us to use generation context in the
first place?
Agreed. Will investigate the thread and check the cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
I appreciate your work on logical replication and am interested in the thread.
I've heard this issue from others, and this has been the barrier to using logical
replication. Please let me know if I can help with benchmarking, other
measurements, etc.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?
In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.
I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.
makes sense.
--
With Regards,
Amit Kapila.
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?
In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.
I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.
We haven't actually checked with the person who reported the problem,
so this is just a guess, but I think there were concurrent
transactions, including long-running INSERT transactions. For example,
suppose a transaction that inserts 10 million rows and many OLTP-like
(short) transactions are running at the same time. The scenario I
thought of was that one 8MB Generation Context Block contains 1MB of
the large insert transaction changes, and the other 7MB contains short
OLTP transaction changes. If there are just 256 such blocks, even
after all short-transactions have completed, the Generation Context
will have allocated 2GB of memory until we decode the commit record of
the large transaction, but rb->size will say 256MB.
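To put rough numbers on that scenario: each 8MB block then retains
only 1MB of live changes, so

allocated: 256 blocks * 8MB = 2048MB (what MemoryContextMemAllocated() reports)
accounted: 256 blocks * 1MB =  256MB (what rb->size reports)

i.e. roughly an 8x gap between rb->size and the memory actually held
by the generation context.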
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.
makes sense.
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Here are three code bases that I used:
* head: current head code.
* per-tx-bump: the proposed idea (with a slight change: each
sub-transaction and top-level transaction has its own bump memory
context to store decoded tuples).
* 8kb-mem-block: same as head except for changing the generation
memory block size from 8MB to 8kB.
And here are test cases and results:
1. Memory usage check
I've run the test that I shared before and checked the maximum amount
of memory allocated in the reorderbuffer context shown by
MemoryContextMemAllocated(). Here are results:
head: 2.1GB (while rb->size showing 43MB)
per-tx-bump: 50MB (while rb->size showing 43MB)
8kb-mem-block: 54MB (while rb->size showing 43MB)
I've confirmed that the excessive memory usage issue didn't happen in
the per-tx-bump case and the 8kb-mem-block cases.
2. Decoding many sub transactions
IIUC this kind of workload was a trigger for inventing the
Generation Context for logical decoding[1]. The single top-level
transaction has 1M sub-transactions each of which insert a tuple. Here
are results:
head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)
The head and 8kb-mem-block cases showed similar results, whereas I see
a bit of regression for per-tx-bump. I think this is because of the
overhead of creating and deleting a memory context for each
sub-transaction.
3. Decoding a big transaction
The next test case I did is to decode a single big transaction that
inserts 10M rows. I set logical_decoding_work_mem large enough to
avoid spilling behavior. Here are results:
head: 19859.113 ms (00:19.859)
per-tx-bump: 19422.308 ms (00:19.422)
8kb-mem-block: 19923.600 ms (00:19.924)
There were no big differences. FYI, I also checked the maximum memory
usage for this test case as well:
head: 1.53GB
per-tx-bump: 1.4GB
8kb-mem-block: 1.53GB
The per-tx-bump case used a bit less memory, probably thanks to the
bump memory contexts.
4. Decoding many short transactions.
The last test case I did is to decode a bunch of short pgbench
transactions (10k transactions). Here are results:
head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)
I can see a similar trend to test case #2 above.
Overall, reducing the generation context memory block size to 8kB
seems to be promising. And using a bump memory context per
transaction didn't bring as much performance improvement as I expected
in these cases.
Regards,
[1]: /messages/by-id/20160706185502.1426.28143@wrigleys.postgresql.org
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? A 1000x reduction seems
like quite a large jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
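For concreteness, a rough sketch of such a defragmentation pass might
look like the following (hypothetical code; the function and helper
names are invented, only the tuple data of DML changes is handled, and
spilled/streamed changes and TOAST reconstruction are ignored):

/*
 * Copy one tuple buffer into a fresh chunk and free the old one
 * (the palloc/memcpy/pfree step described above).
 */
static HeapTuple
defrag_tuple(ReorderBuffer *rb, HeapTuple tuple)
{
    Size        sz = HEAPTUPLESIZE + tuple->t_len;
    HeapTuple   copy = (HeapTuple) MemoryContextAlloc(rb->tup_context, sz);

    memcpy(copy, tuple, sz);
    copy->t_data = (HeapTupleHeader) ((char *) copy + HEAPTUPLESIZE);
    pfree(tuple);
    return copy;
}

/*
 * Re-pack all tuple data of one transaction so that its old,
 * fragmented blocks can eventually be freed.
 */
static void
ReorderBufferDefragTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->changes)
    {
        ReorderBufferChange *change =
            dlist_container(ReorderBufferChange, node, iter.cur);

        switch (change->action)
        {
            case REORDER_BUFFER_CHANGE_INSERT:
            case REORDER_BUFFER_CHANGE_UPDATE:
            case REORDER_BUFFER_CHANGE_DELETE:
                if (change->data.tp.newtuple)
                    change->data.tp.newtuple =
                        defrag_tuple(rb, change->data.tp.newtuple);
                if (change->data.tp.oldtuple)
                    change->data.tp.oldtuple =
                        defrag_tuple(rb, change->data.tp.oldtuple);
                break;
            default:
                break;
        }
    }
}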
Overall, reducing the generation context memory block size to 8kB
seems to be promising. And using the bump memory context per
transaction didn't bring performance improvement than I expected in
these cases.
Having a bump context per transaction would cause a malloc() every
time you create a new context and a free() each time you delete the
context when cleaning up the reorder buffer for the transaction.
GenerationContext has a "freeblock" field that it'll populate instead
of freeing a block. That block will be reused next time a new block is
required. For truly FIFO workloads that never need an oversized
block, all new blocks will come from the freeblock field once the
first block becomes unused. See the comments in GenerationFree(). I
expect this is why bump contexts are slower than the generation
context for your short transaction workload.
David
On 2024/09/19 8:53, Masahiko Sawada wrote:
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.
makes sense.
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Sounds good!
I believe the memory bloat issue in the reorder buffer should be
considered a bug. Since this solution isn’t too invasive, I think
it’s worth considering back-patching it to older versions.
Then, if we find a better approach, we can apply it to v18 or later.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Good idea.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem; however, we can use that or some other constant
to decide the point of defragmentation. The other point we need to
think about with this idea is whether we actually need any
defragmentation at all. This will depend on whether there are concurrent transactions
being decoded. This would require benchmarking to see the performance
impact.
--
With Regards,
Amit Kapila.
On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Good idea.
Agreed.
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction, resulting in more malloc() calls with smaller memory
block sizes. Here are the results (an average of 3 executions):
8kB: 19747.870 ms
16kB: 19780.025 ms
32kB: 19760.575 ms
64kB: 19772.387 ms
128kB: 19825.385 ms
256kB: 19781.118 ms
512kB: 19808.138 ms
1MB: 19757.640 ms
2MB: 19801.429 ms
4MB: 19673.996 ms
8MB: 19643.547 ms
Interestingly, there were no noticeable differences in the execution
time. I've checked the number of allocated memory blocks in each case,
and more blocks are allocated with smaller block sizes. For
example, when the logical decoding used the maximum amount of memory
(about 1.5GB), we allocated about 80k blocks in the 8kB memory block
size case and 80 blocks in the 8MB case.
The results could differ in other environments. I've attached the
patch that I used to change the memory block size via a GUC. It would
be appreciated if someone else could also run a similar test to see
the differences.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
Interesting idea.
After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem, however, we can use that or some other constant
to decide the point of defragmentation. The other point we need to
think in this idea is whether we actually need any defragmentation at
all. This will depend on whether there are concurrent transactions
being decoded. This would require benchmarking to see the performance
impact.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, defragmentation doesn't mean we can avoid evicting
transactions when logical_decoding_work_mem is full; it just means the
amount of allocated memory might be reduced.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
rb_mem_block_size.patch
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..1f76939baa 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -214,6 +214,7 @@ static const Size max_changes_in_memory = 4096; /* XXX for restore only */
/* GUC variable */
int debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED;
+int rb_mem_block_size = 8192;
/* ---------------------------------------
* primary reorderbuffer support routines
@@ -341,11 +342,12 @@ ReorderBufferAllocate(void)
* growing code. These values should likely be benchmarked and set to
* more suitable values.
*/
+ elog(LOG, "RB MEM BLOCK SIZE %ld", rb_mem_block_size * 1024L);
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ rb_mem_block_size * 1024L,
+ rb_mem_block_size * 1024L,
+ rb_mem_block_size * 1024L);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..7e2e492521 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2085,6 +2085,16 @@ struct config_bool ConfigureNamesBool[] =
struct config_int ConfigureNamesInt[] =
{
+ {
+ {"rb_mem_block_size", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop(""),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &rb_mem_block_size,
+ 8192, 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
{
{"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING,
gettext_noop("Sets the amount of time to wait before forcing a "
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..37bc37236a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -26,6 +26,7 @@
/* GUC variables */
extern PGDLLIMPORT int logical_decoding_work_mem;
extern PGDLLIMPORT int debug_logical_replication_streaming;
+extern PGDLLIMPORT int rb_mem_block_size;
/* possible values for debug_logical_replication_streaming */
typedef enum
Hi,
On Mon, Sep 16, 2024 at 10:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Hi,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
I appreciate your work on logical replication and am interested in the thread.
I've heard this issue from others, and this has been the barrier to using logical
replication. Please let me know if I can help with benchmarking, other
measurements, etc.
Thank you for your interest in this patch. I've just shared some
benchmark results (with a patch) that could be different depending on
the environment[1]. I would appreciate it if you could also do similar
tests and share the results.
Regards,
[1]: /messages/by-id/CAD21AoAaN4jaJ=W+WprHvc0cGCf80SkiFQhRc6R+X3-05HAFqw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation. You saw
bad results with 8MB blocks and good results with 8KB blocks. The
measure that's interesting here is the MemoryContextMemAllocated() for
the GenerationContext in question.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, it doesn't mean we can avoid evicting transactions due to
full of logical_decoding_work_mem, but just mean the amount of
allocated memory might have been reduced.
I had roughly imagined that you'd add an extra field to store
txn->size when the last fragmentation was done and only defrag the
transaction when the ReorderBufferTXN->size is, say, double the last
size when the changes were last defragmented. Of course, if the first
defragmentation was enough to drop MemoryContextMemAllocated() below
the concerning threshold above logical_decoding_work_mem then further
defrags wouldn't be done, at least, until such times as the
MemoryContextMemAllocated() became concerning again. If you didn't
want to use a full Size variable for the defragmentation size, you could
just store a uint8 to store which power of 2 ReorderBufferTXN->size
was when it was last defragmented. There are plenty of holes in that
struct that could be filled without enlarging the struct.
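A minimal standalone sketch of that power-of-2 bookkeeping idea (purely
illustrative; the struct and field names below are stand-ins and do not
appear in any posted patch):

/*
 * Hypothetical sketch only: remember which power of two txn->size had when
 * the changes were last defragmented, and consider defragmenting again only
 * once the transaction has at least doubled since then. DemoTXN is a
 * stand-in, not the real ReorderBufferTXN.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoTXN
{
    size_t  size;               /* accounted size of the changes */
    uint8_t last_defrag_log2;   /* floor(log2(size)) at the last defrag */
} DemoTXN;

static uint8_t
size_log2(size_t sz)
{
    uint8_t lg = 0;

    while (sz >>= 1)
        lg++;
    return lg;
}

static bool
needs_defrag(DemoTXN *txn)
{
    /* defrag only once the size has moved to a higher power-of-2 bucket */
    return size_log2(txn->size) > txn->last_defrag_log2;
}

static void
mark_defragged(DemoTXN *txn)
{
    /* after an actual defrag, record the new bucket so further defrags
     * are skipped until the transaction roughly doubles again */
    txn->last_defrag_log2 = size_log2(txn->size);
}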
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue. Unfortunately, AllocSet could
also suffer from fragmentation issues too if lots of chunks end up on
freelists that are never reused, so using another context type might
just create a fragmentation issue for a different workload.
David
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Good idea.
Agreed.
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
8kB: 19747.870 ms
16kB: 19780.025 ms
32kB: 19760.575 ms
64kB: 19772.387 ms
128kB: 19825.385 ms
256kB: 19781.118 ms
512kB: 19808.138 ms
1MB: 19757.640 ms
2MB: 19801.429 ms
4MB: 19673.996 ms
8MB: 19643.547 ms
Interestingly, there were no noticeable differences in the execution
time. I've checked the number of allocated memory blocks in each case
and more blocks are allocated in smaller block size cases. For
example, when the logical decoding used the maximum memory (about
1.5GB), we allocated about 80k blocks in 8kb memory block size case
and 80 blocks in 8MB memory block cases.
What exactly do these test results mean? Do you want to prove that
there is no regression by using smaller block sizes?
--
With Regards,
Amit Kapila.
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.
+1. That test will be helpful.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, it doesn't mean we can avoid evicting transactions due to
full of logical_decoding_work_mem, but just mean the amount of
allocated memory might have been reduced.
I had roughly imagined that you'd add an extra field to store
txn->size when the last fragmentation was done and only defrag the
transaction when the ReorderBufferTXN->size is, say, double the last
size when the changes were last defragmented. Of course, if the first
defragmentation was enough to drop MemoryContextMemAllocated() below
the concerning threshold above logical_decoding_work_mem then further
defrags wouldn't be done, at least, until such times as the
MemoryContextMemAllocated() became concerning again. If you didn't
want to a full Size variable for the defragmentation size, you could
just store a uint8 to store which power of 2 ReorderBufferTXN->size
was when it was last defragmented. There are plenty of holds in that
struct that could be filled without enlarging the struct.
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may cause some
regression in other cases where defrag wouldn't help.
--
With Regards,
Amit Kapila.
Dear Sawada-san,
Thank you for your interest in this patch. I've just shared some
benchmark results (with a patch) that could be different depending on
the environment[1]. I would be appreciated if you also do similar
tests and share the results.
Okay, I did similar tests; the attached script is the test runner. rb_mem_block_size
was changed from 8kB to 8MB. The table below shows the results in milliseconds.
Each cell is the average of 5 runs.
==========
8kB 12877.4
16kB 12829.1
32kB 11793.3
64kB 13134.4
128kB 13353.1
256kB 11664.0
512kB 12603.4
1MB 13443.8
2MB 12469.0
4MB 12651.4
8MB 12381.4
==========
The standard deviation of the measurements was 100-500 ms; there were no noticeable
differences on my env either.
Also, I've checked the statistics of the generation context, and confirmed that the
number of allocated blocks is about 1000x higher when the block size is reduced from
8MB to 8kB. The difference in actual used space comes mainly from the header of each
block: each block has attributes for management, so the total usage becomes larger
with the number of blocks. Here are the context stats for both cases:
8kB
Tuples: 724959232 total in 88496 blocks (10000000 chunks); 3328 free (0 chunks); 724955904 used
Grand total: 724959232 bytes in 88496 blocks; 3328 free (0 chunks); 724955904 used
8MB
Tuples: 721420288 total in 86 blocks (10000000 chunks); 1415344 free (0 chunks); 720004944 used
Grand total: 721420288 bytes in 86 blocks; 1415344 free (0 chunks); 720004944 used
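As a rough cross-check of that block-header overhead using the grand totals
quoted above (a standalone illustrative calculation, not part of any posted
patch or test script):

#include <stdio.h>

int
main(void)
{
    /* grand totals and block counts reported above for 8kB and 8MB blocks */
    long long total_8kb = 724959232LL;
    long long total_8mb = 721420288LL;
    long long blocks_8kb = 88496LL;
    long long blocks_8mb = 86LL;

    /* extra allocated bytes divided by extra blocks: apparent per-block overhead */
    long long per_block = (total_8kb - total_8mb) / (blocks_8kb - blocks_8mb);

    printf("~%lld bytes of overhead per block\n", per_block);   /* prints ~40 */
    return 0;
}

This works out to roughly 40 bytes per extra block, which is consistent with
per-block header bookkeeping being the source of the difference.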
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.
+1. That test will be helpful.
Sure. Here are results of peak memory usage in bytes reported by
MemoryContextMemAllocated() (when rb->size shows 43MB):
8kB: 52,371,328
16kB: 52,887,424
32kB: 53,428,096
64kB: 55,099,264
128kB: 86,163,328
256kB: 149,340,032
512kB: 273,334,144
1MB: 523,419,520
2MB: 1,021,493,120
4MB: 1,984,085,888
8MB: 2,130,886,528
Probably we can increase the size to 64kB?
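For reference, taking rb->size as exactly 43 MiB (an approximation, since only
"43MB" is reported), the 8MB-block case above amounts to

\[
\frac{2{,}130{,}886{,}528}{43 \times 2^{20}} \approx 47
\]

i.e. roughly 47 times the accounted transaction size, which illustrates how
severe the fragmentation can get with 8MB blocks.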
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
David
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
--
With Regards,
Amit Kapila.
On Fri, Sep 20, 2024 at 10:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.
+1. That test will be helpful.
Sure. Here are results of peak memory usage in bytes reported by
MemoryContextMemAllocated() (when rb->size shows 43MB):
8kB: 52,371,328
16kB: 52,887,424
32kB: 53,428,096
64kB: 55,099,264
128kB: 86,163,328
256kB: 149,340,032
512kB: 273,334,144
1MB: 523,419,520
2MB: 1,021,493,120
4MB: 1,984,085,888
8MB: 2,130,886,528
Probably we can increase the size to 64kB?
Yeah, but before deciding on a particular size, we need more testing
on different platforms as suggested by David.
--
With Regards,
Amit Kapila.
On Thu, Sep 19, 2024 at 10:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Good idea.
Agreed.
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
8kB: 19747.870 ms
16kB: 19780.025 ms
32kB: 19760.575 ms
64kB: 19772.387 ms
128kB: 19825.385 ms
256kB: 19781.118 ms
512kB: 19808.138 ms
1MB: 19757.640 ms
2MB: 19801.429 ms
4MB: 19673.996 ms
8MB: 19643.547 ms
Interestingly, there were no noticeable differences in the execution
time. I've checked the number of allocated memory blocks in each case
and more blocks are allocated in smaller block size cases. For
example, when the logical decoding used the maximum memory (about
1.5GB), we allocated about 80k blocks in 8kb memory block size case
and 80 blocks in 8MB memory block cases.
What exactly do these test results mean? Do you want to prove that
there is no regression by using smaller block sizes?
Yes, there was no noticeable performance regression at least in this
test scenario.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sun, Sep 22, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
+1. I'll do the same test on my Mac as well.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
I did similar tests on Windows. rb_mem_block_size was changed from 8kB
to 8MB. The table below shows the result (average of 5 runs) and standard
deviation (of 5 runs) for each block size.
===============================================
block-size | Average time (ms) | Standard Deviation (ms)
-------------------------------------------------------------------------------------
8kb | 12580.879 ms | 144.6923467
16kb | 12442.7256 ms | 94.02799006
32kb | 12370.7292 ms | 97.7958552
64kb | 11877.4888 ms | 222.2419142
128kb | 11828.8568 ms | 129.732941
256kb | 11801.086 ms | 20.60030913
512kb | 12361.4172 ms | 65.27390105
1MB | 12343.3732 ms | 80.84427202
2MB | 12357.675 ms | 79.40017604
4MB | 12395.8364 ms | 76.78273689
8MB | 11712.8862 ms | 50.74323039
==============================================
From the results, I think there is a small regression for small block size.
I ran the tests in git bash. I have also attached the test script.
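For reference, the gap between the two extremes in the table above is roughly

\[
\frac{12580.879 - 11712.886}{11712.886} \approx 0.074
\]

i.e. about 7% slower with 8kB blocks than with 8MB blocks in this Windows run.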
Thanks and Regards,
Shlok Kyal
Attachments:
On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
I did similar tests on Windows. rb_mem_block_size was changed from 8kB
to 8MB. Below table shows the result (average of 5 runs) and Standard
Deviation (of 5 runs) for each block-size.
===============================================
block-size | Average time (ms) | Standard Deviation (ms)
-------------------------------------------------------------------------------------
8kb | 12580.879 ms | 144.6923467
16kb | 12442.7256 ms | 94.02799006
32kb | 12370.7292 ms | 97.7958552
64kb | 11877.4888 ms | 222.2419142
128kb | 11828.8568 ms | 129.732941
256kb | 11801.086 ms | 20.60030913
512kb | 12361.4172 ms | 65.27390105
1MB | 12343.3732 ms | 80.84427202
2MB | 12357.675 ms | 79.40017604
4MB | 12395.8364 ms | 76.78273689
8MB | 11712.8862 ms | 50.74323039
==============================================
From the results, I think there is a small regression for small block size.
I ran the tests in git bash. I have also attached the test script.
Thank you for testing on Windows! I've run the same benchmark on Mac
(Sonoma 14.7, M1 Pro):
8kB: 4852.198 ms
16kB: 4822.733 ms
32kB: 4776.776 ms
64kB: 4851.433 ms
128kB: 4804.821 ms
256kB: 4781.778 ms
512kB: 4776.486 ms
1MB: 4783.456 ms
2MB: 4770.671 ms
4MB: 4785.800 ms
8MB: 4747.447 ms
I can see there is a small regression for small block sizes.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 27, 2024 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
I did similar tests on Windows. rb_mem_block_size was changed from 8kB
to 8MB. Below table shows the result (average of 5 runs) and Standard
Deviation (of 5 runs) for each block-size.
===============================================
block-size | Average time (ms) | Standard Deviation (ms)
-------------------------------------------------------------------------------------
8kb | 12580.879 ms | 144.6923467
16kb | 12442.7256 ms | 94.02799006
32kb | 12370.7292 ms | 97.7958552
64kb | 11877.4888 ms | 222.2419142
128kb | 11828.8568 ms | 129.732941
256kb | 11801.086 ms | 20.60030913
512kb | 12361.4172 ms | 65.27390105
1MB | 12343.3732 ms | 80.84427202
2MB | 12357.675 ms | 79.40017604
4MB | 12395.8364 ms | 76.78273689
8MB | 11712.8862 ms | 50.74323039
==============================================
From the results, I think there is a small regression for small block size.
I ran the tests in git bash. I have also attached the test script.
Thank you for testing on Windows! I've run the same benchmark on Mac
(Sonoma 14.7, M1 Pro):
8kB: 4852.198 ms
16kB: 4822.733 ms
32kB: 4776.776 ms
64kB: 4851.433 ms
128kB: 4804.821 ms
256kB: 4781.778 ms
512kB: 4776.486 ms
1MB: 4783.456 ms
2MB: 4770.671 ms
4MB: 4785.800 ms
8MB: 4747.447 ms
I can see there is a small regression for small block sizes.
So, decoding a large transaction with many smaller allocations can
have ~2.2% overhead with a smaller block size (say 8kB vs 8MB). In
real workloads, we will have fewer such large transactions or a mix of
small and large transactions. That will make the overhead much less
visible. Does this mean that we should invent some strategy to defrag
the memory at some point during decoding or use any other technique? I
don't find this overhead above the threshold to invent something
fancy. What do others think?
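For reference, the ~2.2% figure appears to come from the Mac numbers above:

\[
\frac{4852.198 - 4747.447}{4747.447} \approx 0.022
\]

(the Windows run above shows a somewhat larger gap between 8kB and 8MB).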
--
With Regards,
Amit Kapila.
On Tue, Oct 1, 2024 at 5:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 27, 2024 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
I did similar tests on Windows. rb_mem_block_size was changed from 8kB
to 8MB. Below table shows the result (average of 5 runs) and Standard
Deviation (of 5 runs) for each block-size.
===============================================
block-size | Average time (ms) | Standard Deviation (ms)
-------------------------------------------------------------------------------------
8kb | 12580.879 ms | 144.6923467
16kb | 12442.7256 ms | 94.02799006
32kb | 12370.7292 ms | 97.7958552
64kb | 11877.4888 ms | 222.2419142
128kb | 11828.8568 ms | 129.732941
256kb | 11801.086 ms | 20.60030913
512kb | 12361.4172 ms | 65.27390105
1MB | 12343.3732 ms | 80.84427202
2MB | 12357.675 ms | 79.40017604
4MB | 12395.8364 ms | 76.78273689
8MB | 11712.8862 ms | 50.74323039
==============================================
From the results, I think there is a small regression for small block size.
I ran the tests in git bash. I have also attached the test script.
Thank you for testing on Windows! I've run the same benchmark on Mac
(Sonoma 14.7, M1 Pro):
8kB: 4852.198 ms
16kB: 4822.733 ms
32kB: 4776.776 ms
64kB: 4851.433 ms
128kB: 4804.821 ms
256kB: 4781.778 ms
512kB: 4776.486 ms
1MB: 4783.456 ms
2MB: 4770.671 ms
4MB: 4785.800 ms
8MB: 4747.447 ms
I can see there is a small regression for small block sizes.
So, decoding a large transaction with many smaller allocations can
have ~2.2% overhead with a smaller block size (say 8Kb vs 8MB). In
real workloads, we will have fewer such large transactions or a mix of
small and large transactions. That will make the overhead much less
visible. Does this mean that we should invent some strategy to defrag
the memory at some point during decoding or use any other technique? I
don't find this overhead above the threshold to invent something
fancy. What do others think?
I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
branches (if we agree) and to revert the change in case something
happens.
BTW I've read the discussions for inventing the generational memory
context[1][2], but I could not find any discussion on the memory block
sizes. It seems that we have used 8MB memory blocks since the first patch.
[1]: /messages/by-id/20160706185502.1426.28143@wrigleys.postgresql.org
[2]: /messages/by-id/d15dff83-0b37-28ed-0809-95a5cc7292ad@2ndquadrant.com
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san, Amit,
So, decoding a large transaction with many smaller allocations can
have ~2.2% overhead with a smaller block size (say 8Kb vs 8MB). In
real workloads, we will have fewer such large transactions or a mix of
small and large transactions. That will make the overhead much less
visible. Does this mean that we should invent some strategy to defrag
the memory at some point during decoding or use any other technique? I
don't find this overhead above the threshold to invent something
fancy. What do others think?
I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
branches (if we agree) and to revert the change in case something
happens.
I also felt okay with that. Just to confirm - you will not push the rb_mem_block_size
patch and will just replace SLAB_LARGE_BLOCK_SIZE -> SLAB_DEFAULT_BLOCK_SIZE, right?
It seems that only reorderbuffer.c uses the LARGE macro, so it could be removed.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Oct 2, 2024 at 9:42 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san, Amit,
So, decoding a large transaction with many smaller allocations can
have ~2.2% overhead with a smaller block size (say 8Kb vs 8MB). In
real workloads, we will have fewer such large transactions or a mix of
small and large transactions. That will make the overhead much less
visible. Does this mean that we should invent some strategy to defrag
the memory at some point during decoding or use any other technique? I
don't find this overhead above the threshold to invent something
fancy. What do others think?
I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
branches (if we agree) and to revert the change in case something
happens.
I also felt okay. Just to confirm - you do not push rb_mem_block_size patch and
just replace SLAB_LARGE_BLOCK_SIZE -> SLAB_DEFAULT_BLOCK_SIZE, right?
Right.
It seems that
only reorderbuffer.c uses the LARGE macro so that it can be removed.
I'm going to keep the LARGE macro since extensions might be using it.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024/10/03 13:47, Masahiko Sawada wrote:
I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB).
+1
It's easy to backpatch to old
branches (if we agree)
+1
It seems that
only reorderbuffer.c uses the LARGE macro so that it can be removed.
I'm going to keep the LARGE macro since extensions might be using it.
Yes, for the back-patch. But in the master branch,
we basically don't need to maintain this kind of compatibility?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Oct 3, 2024 at 2:46 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2024/10/03 13:47, Masahiko Sawada wrote:
I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB).
+1
It's easy to backpatch to old
branches (if we agree)
+1
It seems that
only reorderbuffer.c uses the LARGE macro so that it can be removed.
I'm going to keep the LARGE macro since extensions might be using it.
Yes, for the back-patch. But in the master branch,
we basically don't need to maintain this kind of compatibility?
Yes, but as for this macro specifically, I thought that it might be
better to keep it, since it avoids breaking extensions unnecessarily
and it seems natural to have it as an option for the slab context.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024/10/04 3:32, Masahiko Sawada wrote:
Yes, but as for this macro specifically, I thought that it might be
better to keep it, since it avoids breaking extension unnecessarily
and it seems to be natural to have it as an option for slab context.
If the macro has value, I'm okay with leaving it as is.
Are you planning to post the patch?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Oct 10, 2024 at 8:04 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2024/10/04 3:32, Masahiko Sawada wrote:
Yes, but as for this macro specifically, I thought that it might be
better to keep it, since it avoids breaking extension unnecessarily
and it seems to be natural to have it as an option for slab context.
If the macro has value, I'm okay with leaving it as is.
Are you planning to post the patch?
Yes, I'll post the patch soon.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 10, 2024 at 8:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Oct 10, 2024 at 8:04 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2024/10/04 3:32, Masahiko Sawada wrote:
Yes, but as for this macro specifically, I thought that it might be
better to keep it, since it avoids breaking extension unnecessarily
and it seems to be natural to have it as an option for slab context.
If the macro has value, I'm okay with leaving it as is.
Are you planning to post the patch?
Yes, I'll post the patch soon.
Please find the attached patches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
REL17_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 5da814254be6325ae22d1cbd7d3f48f16d2a09f7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d1c4258844..4fc59fb567 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL13_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 41640b4db0fde9258e3fe703d4ff6600d3897f89 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb323a80ec..26c75a65f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -302,7 +302,7 @@ ReorderBufferAllocate(void)
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL12_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 442d1393de3b754e4d78378b975e1564a5cef5de Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d8345a75b6..1123f5858e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -264,7 +264,7 @@ ReorderBufferAllocate(void)
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL15_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 164ccd9cde7c76d66f45e1dbbd6c855dbe3bf0ab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3a68a393d2..3f4830ddd8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -335,9 +335,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
master_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 5549584d484b4820720d6310f9add0c254556ad4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..7969045a26 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL16_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From d75e331777461e9b10e6272d4e42238a6dd21da8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0dab0bb64e..7dd604e5a8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -338,9 +338,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL14_v1-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 244a761f4947daf164afbc9fd3e2b74bb707bc49 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v1] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 264c253a87..b6d6cd1651 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -330,7 +330,7 @@ ReorderBufferAllocate(void)
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the attached patches.
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?
One more thing we kept the max size as SLAB_DEFAULT_BLOCK_SIZE instead
of something like we do with ALLOCSET_DEFAULT_SIZES, so we can
probably write a comment as to why we choose to use the max_size same
as init_size. BTW, can we once try to use the max size as
SLAB_LARGE_BLOCK_SIZE? Can it lead to the same problem with concurrent
transactions where freeing larger blocks could be a problem, if so, we
can at least write a comment for future reference.
[1]:
/*
 * XXX the allocation sizes used below pre-date generation context's block
 * growing code. These values should likely be benchmarked and set to
 * more suitable values.
 */
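For readers following along, here is a minimal sketch (not taken from any attached patch) of the variant being asked about, i.e. keeping the small initial size but letting blocks grow up to SLAB_LARGE_BLOCK_SIZE. It is written against the three-argument form of GenerationContextCreate() shown in the master-branch diffs above; the per-argument comments describe the assumed roles of those arguments:

    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_DEFAULT_BLOCK_SIZE,  /* assumed: min context size */
                                                  SLAB_DEFAULT_BLOCK_SIZE,  /* assumed: initial block size */
                                                  SLAB_LARGE_BLOCK_SIZE);   /* assumed: max block size */

Whether this actually avoids the fragmentation is exactly the open question raised above; the test result later in the thread suggests it does not.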
--
With Regards,
Amit Kapila.
On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the attached patches.
Thank you for reviewing the patch!
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?
Agreed.
One more thing we kept the max size as SLAB_DEFAULT_BLOCK_SIZE instead
of something like we do with ALLOCSET_DEFAULT_SIZES, so we can
probably write a comment as to why we choose to use the max_size same
as init_size.
Agreed. I've updated the comment. Please review it.
BTW, can we once try to use the max size as
SLAB_LARGE_BLOCK_SIZE? Can it lead to the same problem with concurrent
transactions where freeing larger blocks could be a problem, if so, we
can at least write a comment for future reference.
I've tested with SLAB_LARGE_BLOCK_SIZE as the max size, but the severe
memory fragmentation issue still occurs. In the same scenario I used
before, the maximum amount of memory allocated during logical decoding
was 1.26GB with logical_decoding_work_mem set to '256MB'.
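To make the scale of the problem concrete, here is a small standalone illustration (not from the thread; the number of pinned blocks is an arbitrary assumption) of how much memory a long-running transaction can keep allocated when each of its surviving chunks lands in a different block, given that a Generation Context block is only freed once every chunk in it is freed:

    #include <stdio.h>

    int
    main(void)
    {
        size_t  large_block = 8 * 1024 * 1024;  /* SLAB_LARGE_BLOCK_SIZE: 8MB */
        size_t  small_block = 8 * 1024;         /* SLAB_DEFAULT_BLOCK_SIZE: 8kB */
        size_t  pinned_blocks = 128;            /* assumed: blocks each kept alive
                                                 * by a single chunk belonging to
                                                 * the long-running transaction */

        /* A whole block stays allocated while any chunk in it is still live. */
        printf("pinned with 8MB blocks: %zu MB\n",
               pinned_blocks * large_block / (1024 * 1024));
        printf("pinned with 8kB blocks: %zu kB\n",
               pinned_blocks * small_block / 1024);

        return 0;
    }

With 128 such blocks this prints 1024 MB versus 1024 kB: a single surviving tuple per 8MB block is enough to hold on to a gigabyte, which matches the order of magnitude of the reports in this thread.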
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
REL14_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 5c11c20dbfde3b7cfb79085f35cbf26d357c409e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 264c253a87..201ede6791 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -328,9 +328,16 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
master_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 8e577c3ecf0a4692f58fe88b29d1aa60da067f43 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..ae74d4fb2b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,15 +337,19 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue. One might think that we can
+ * use the max size as SLAB_LARGE_BLOCK_SIZE but the test also showed it
+ * doesn't help resolve the fragmentation issue.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL17_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 41f658e829145eac2fd9a5d80ae4163732b0587e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d1c4258844..413a79f2f6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,15 +337,19 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue. One might think that we can
+ * use the max size as SLAB_LARGE_BLOCK_SIZE but the test also showed it
+ * doesn't help resolve the fragmentation issue.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL16_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 4372a77ed2eb0bafe32ab8d4f79f5ca39ccf50b2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0dab0bb64e..4d58d18f8a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -332,15 +332,19 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue. One might think that we can
+ * use the max size as SLAB_LARGE_BLOCK_SIZE but the test also showed it
+ * doesn't help resolve the fragmentation issue.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL15_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 6713bfd53b59fc194b924f66c7f7ce891d3c47fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3a68a393d2..738c3a7e18 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -329,15 +329,19 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue. One might think that we can
+ * use the max size as SLAB_LARGE_BLOCK_SIZE but the test also showed it
+ * doesn't help resolve the fragmentation issue.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL12_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 156f068d396c3dd7b9d5cf652b77780cfcb431f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d8345a75b6..8ef33e0b95 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -262,9 +262,16 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL13_v2-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 6465e5f97897df425877247346315a220e307292 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v2] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb323a80ec..4eb283184c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -300,9 +300,16 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The tests showed
+ * that the default memory block size maintains logical decoding
+ * performance without fragmentation issue.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
On Tue, Oct 15, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the attached patches.
Thank you for reviewing the patch!
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?
Agreed.
Can we slightly tweak the comments as follows so that it doesn't sound
like a fix for a bug?
/* To minimize memory fragmentation caused by long-running
transactions with changes spanning multiple memory blocks, we use a
single fixed-size memory block for decoded tuple storage. The tests
showed that the default memory block size maintains logical decoding
performance without causing fragmentation due to concurrent
transactions. One might think that we can use the max size as
SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
the memory fragmentation.
Other than that the changes in the patch look good to me. Note that I
haven't tested the patch by myself but the test results shown by you
and others in the thread seem sufficient to proceed with this change.
--
With Regards,
Amit Kapila.
On Tue, Oct 15, 2024 at 9:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 15, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the attached patches.
Thank you for reviewing the patch!
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?
Agreed.
Can we slightly tweak the comments as follows so that it doesn't sound
like a fix for a bug?
/* To minimize memory fragmentation caused by long-running
transactions with changes spanning multiple memory blocks, we use a
single fixed-size memory block for decoded tuple storage. The tests
showed that the default memory block size maintains logical decoding
performance without causing fragmentation due to concurrent
transactions. One might think that we can use the max size as
SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
the memory fragmentation.
Agreed. I've incorporated your comment in the latest patches. I'm
going to push them today, barring any objections or further comments.
Other than that the changes in the patch look good to me. Note that I
haven't tested the patch by myself but the test results shown by you
and others in the thread seem sufficient to proceed with this change.
Understood. Thank you for the discussion and your help reviewing the patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
REL16_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From aa10081ec551c88c5108511ad623849133c9eacd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0dab0bb64e..930549948a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -332,15 +332,20 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions. One might think that we can use the max size as
+ * SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
+ * the memory fragmentation.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL15_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From d5d3deb8b6b0896623b3123c0d7e02fd5792a7bd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3a68a393d2..d5cde20a1c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -329,15 +329,20 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions. One might think that we can use the max size as
+ * SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
+ * the memory fragmentation.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL14_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 3d2c9b34ae4f45c12a64883d9258e4bd4c579970 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 264c253a87..64d9baa798 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -328,9 +328,17 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
master_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From f41619d871df6e5e95ff14b81a3a6f802089392f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 22bcf171ff..e3a5c7b660 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,15 +337,20 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions. One might think that we can use the max size as
+ * SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
+ * the memory fragmentation.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL17_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 020af4ec180b8d5373bb51312ebf4a09b744b20b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:34:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d1c4258844..9c742e96eb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,15 +337,20 @@ ReorderBufferAllocate(void)
sizeof(ReorderBufferTXN));
/*
- * XXX the allocation sizes used below pre-date generation context's block
- * growing code. These values should likely be benchmarked and set to
- * more suitable values.
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions. One might think that we can use the max size as
+ * SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
+ * the memory fragmentation.
*/
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE,
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
REL12_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 07b0284c11376127e3c344ebeab6ca8c9a8491f9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d8345a75b6..c3b000c19b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -262,9 +262,17 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
Attachment: REL13_v3-0001-Reduce-memory-block-size-for-decoded-tuple-storag.patch
From 9a58049aab0675cc55c62146f77fe23808c51ba6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.org>
Date: Thu, 10 Oct 2024 14:51:44 -0700
Subject: [PATCH v3] Reduce memory block size for decoded tuple storage to 8kB.
Commit a4ccc1cef introduced the Generation Context and modified the
logical decoding process to use a Generation Context with a fixed
block size of 8MB for storing tuple data decoded during logical
decoding (i.e., rb->tup_context). Several reports have indicated that
the logical decoding process can be terminated due to
out-of-memory (OOM) situations caused by excessive memory usage in
rb->tup_context.
This issue can occur when decoding a workload involving several
concurrent transactions, including a long-running transaction that
modifies tuples. By design, the Generation Context does not free a
memory block until all chunks within that block are
released. Consequently, if tuples modified by the long-running
transaction are stored across multiple memory blocks, these blocks
remain allocated until the long-running transaction completes, leading
to substantial memory fragmentation. The memory usage during logical
decoding, tracked by rb->size, does not account for memory
fragmentation, resulting in potentially much higher memory consumption
than the value of the logical_decoding_work_mem parameter.
Various improvement strategies were discussed in the relevant
thread. This change reduces the block size of the Generation Context
used in rb->tup_context from 8MB to 8kB. This modification
significantly decreases the likelihood of substantial memory
fragmentation occurring and is relatively straightforward to
backport. Performance testing across multiple platforms has confirmed
that this change will not introduce any performance degradation that
would impact actual operation.
Backport to all supported branches.
Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
---
src/backend/replication/logical/reorderbuffer.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb323a80ec..c7f8fa6216 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -300,9 +300,17 @@ ReorderBufferAllocate(void)
SLAB_DEFAULT_BLOCK_SIZE,
sizeof(ReorderBufferTXN));
+ /*
+ * To minimize memory fragmentation caused by long-running transactions
+ * with changes spanning multiple memory blocks, we use a single
+ * fixed-size memory block for decoded tuple storage. The performance
+ * testing showed that the default memory block size maintains logical
+ * decoding performance without causing fragmentation due to concurrent
+ * transactions.
+ */
buffer->tup_context = GenerationContextCreate(new_ctx,
"Tuples",
- SLAB_LARGE_BLOCK_SIZE);
+ SLAB_DEFAULT_BLOCK_SIZE);
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
--
2.39.3
On Wed, Oct 16, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Oct 15, 2024 at 9:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 15, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the attached patches.
Thank you for reviewing the patch!
@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
  */
 buffer->tup_context = GenerationContextCreate(new_ctx,
                                               "Tuples",
-                                              SLAB_LARGE_BLOCK_SIZE,
-                                              SLAB_LARGE_BLOCK_SIZE,
-                                              SLAB_LARGE_BLOCK_SIZE);
+                                              SLAB_DEFAULT_BLOCK_SIZE,
+                                              SLAB_DEFAULT_BLOCK_SIZE,
+                                              SLAB_DEFAULT_BLOCK_SIZE);
Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?
Agreed.
Can we slightly tweak the comments as follows so that it doesn't sound
like a fix for a bug?
/* To minimize memory fragmentation caused by long-running
transactions with changes spanning multiple memory blocks, we use a
single fixed-size memory block for decoded tuple storage. The tests
showed that the default memory block size maintains logical decoding
performance without causing fragmentation due to concurrent
transactions. One might think that we can use the max size as
SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
the memory fragmentation. */
Agreed. I've incorporated your comment in the latest patches. I'm
going to push them today, barring any objections or further comments.
Pushed. Thank you all for reviewing and testing the patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com