Report reorder buffer size
Hi All,
The question of how to tune logical_decoding_work_mem has come up a
few times in the context of customer cases. If it is too low, changes
are spilled to disk, slowing down replication and consuming disk
space. In the case of streamed transactions, it won't consume space on
the primary but will consume resources (transaction resources/disk
space) on the replica. If it is too high, it may cause memory
pressure, especially when there are many WAL senders and large amounts
of changes are frequently queued in their reorder buffers.
In order to tune logical_decoding_work_mem optimally, we need a
measure of the size of the changes belonging to the transactions
queued in the reorder buffer, whether those changes are held in
memory, spilled to disk, or sent downstream as part of an unfinished
streamed transaction. Currently there is no way to determine that.
pg_stat_replication_slots reports cumulative statistics about spilled
or streamed transactions. Those help to know whether the logical
decoding work memory is sufficient to hold all the changes, but they
do not provide a way to estimate an optimal value for it, since they
don't capture the state of the reorder buffer at a given point in
time.
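For reference, the existing cumulative counters can be queried like
this (these columns already exist); they show how much has been
spilled or streamed since the last stats reset, but nothing about the
current contents of the reorder buffer:

-- cumulative spill/stream statistics per slot
SELECT slot_name,
       spill_txns, spill_count, pg_size_pretty(spill_bytes) AS spilled,
       stream_txns, stream_count, pg_size_pretty(stream_bytes) AS streamed,
       total_txns, pg_size_pretty(total_bytes) AS decoded
FROM pg_stat_replication_slots;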
With the attached WIP patch, we can report, at a given point in time,
the total size of the reorder buffer, i.e. the sum of the sizes of all
the changes belonging to the transactions active in the reorder
buffer, whether those changes are in memory, on disk, or sent
downstream as part of an unfinished streamed transaction. By sampling
this value at regular intervals, one can observe the reorder buffer
trendline as shown in the plots attached. In the plots, the Y-axis is
the total size of the reorder buffer and the X-axis is the time of day.
rbsize_trendlines_pgbench.png shows the trendline when regular pgbench
alone is run at 3K TPS, and
rbsize_trendlines_large_transaction_spikes.png shows the trendline
when long and large transactions run in addition to pgbench.
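As a rough sketch of the sampling, the column proposed by the WIP
patch (rb_total_size, which does not exist in stock PostgreSQL) could
be polled like this, e.g. with psql's \watch:

-- rb_total_size is the column added by the WIP patch
SELECT now() AS sample_time,
       pid,
       application_name,
       pg_size_pretty(rb_total_size) AS reorder_buffer_size
FROM pg_stat_replication;
-- in psql: \watch 10   (repeat every 10 seconds to build the trendline)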
Using these trendlines, a user may decide to keep
logical_decoding_work_mem at the minimum and thus let all the large
transactions spill to disk or be streamed. Or they may decide to set
it to 100MB, 200MB or 500MB, reducing or eliminating the disk spill or
streaming. This makes it easy to quantify the trade-offs in tuning
logical_decoding_work_mem.
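Once a value is chosen, it can be applied in the usual way, for
example (role name illustrative):

ALTER SYSTEM SET logical_decoding_work_mem = '200MB';
SELECT pg_reload_conf();
-- it should also be possible to set it only for the role used by the
-- replication connections:
-- ALTER ROLE replication_user SET logical_decoding_work_mem = '200MB';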
Tracking the total size of the reorder buffer also helps to quantify
the load on a WAL sender. It is an additional metric that can be
useful to debug WAL buildup, slow replication, etc. For example, the
plot clearly and directly shows when the WAL sender started processing
the large transaction, when it finished, and so on.
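When debugging WAL buildup, this metric can be sampled together with
the WAL retained by each slot, for example:

-- WAL being held back by each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
FROM pg_replication_slots;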
The patch is not complete. There are several things still to be worked on:
1. Whether the metric should be reported in pg_stat_replication or
pg_stat_replication_slots. Given that it's a point-in-time metric, it
fits pg_stat_replication better, but that view does not report logical
decoding activity happening in backends other than WAL senders (which
is true of the other metrics in that view as well).
pg_stat_replication_slots, however, covers all users of logical
decoding. For now I have used pg_stat_replication, but it can be
changed based on the discussion here.
2. Is the size of the reorder buffer enough, or do we also want to
track the size of changes on disk and the size of changes sent
downstream as part of unfinished streamed transactions separately? And
the number of transactions being tracked by the reorder buffer? If we
are going to report that many statistics about the contents of the
reorder buffer, is it better to have a separate view,
pg_stat_reorder_buffer, for them?
3. The patch doesn't cover all the places where the reorder buffer's
total size should be maintained. I need to find all such places and
add the required accounting.
But before I work on those, I would like to hear others' opinions on
the usefulness and acceptability of the approach. FWIW, a TAP test
exercising the changes is also attached.
--
Best Wishes,
Ashutosh Bapat
Attachments:
0001-Report-size-of-reorder-buffer-contents-20250813.patch (text/x-patch)
From 57cf4fca23fc6481f79d94b9261a2a5c14c6ba9a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Date: Mon, 11 Aug 2025 11:02:58 +0530
Subject: [PATCH 1/2] Report size of reorder buffer contents
pg_stat_replication_slots reports cumulative statistics about spilled or
streamed transactions. Those help to know whether the logical decoding
work memory is sufficient to hold all the changes. They do not, however,
give an idea of the total size of changes belonging to the transactions
tracked by the reorder buffer at a given point in time. That is important
for setting the logical decoding work memory size so as to make optimal
use of memory, disk (in the case of spilled transactions) and resources
on the downstream (in the case of streamed transactions).
This patch reports the total size of the reorder buffer contents through
the pg_stat_replication view. The size includes the contents in memory,
spilled to disk, or sent downstream as part of an unfinished streamed
transaction. Effectively it accounts for the size of all changes belonging
to the transactions that are currently active in the reorder buffer. The
trendline of the total reorder buffer size should help users set
logical_decoding_work_mem after considering the various trade-offs.
It's arguable whether the counter should be reported in
pg_stat_replication_slots or pg_stat_replication.
pg_stat_replication_slots covers all replication slots, in turn covering
WAL senders as well as backends using replication slots, e.g. clients
receiving logical changes via SQL-callable functions. But it is a
cumulative statistics view, and the size of the reorder buffer is a
point-in-time statistic. Such statistics are covered by
pg_stat_replication, which doesn't report statistics for client backends
receiving logical changes via a SQL-callable function.
pg_stat_replication_slots statistics are updated at strategic events
during logical decoding. Updating the reorder buffer size at those
strategic points may not always suffice; it will not give an accurate
value at the time the view is queried. pg_stat_replication fetches the
real-time state of the WAL sender process and thus will report the
reorder buffer size more accurately, but a user may miss strategic
events when sampling pg_stat_replication.
Any serious user of logical replication who is looking to optimally size
logical_decoding_work_mem is likely to be using a WAL sender. Thus
pg_stat_replication makes more sense.
Author: Ashutosh Bapat
---
doc/src/sgml/monitoring.sgml | 14 ++++++++++++++
src/backend/catalog/system_views.sql | 3 ++-
src/backend/replication/logical/decode.c | 4 ++--
src/backend/replication/logical/reorderbuffer.c | 16 +++++++++++++++-
src/backend/replication/walsender.c | 15 ++++++++++++++-
src/include/catalog/pg_proc.dat | 6 +++---
src/include/replication/reorderbuffer.h | 16 +++++++++++++++-
src/include/replication/walsender_private.h | 7 +++++++
src/test/regress/expected/rules.out | 5 +++--
9 files changed, 75 insertions(+), 11 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..c40f5be9a59 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1464,6 +1464,20 @@ description | Waiting for a newly initialized WAL file to reach durable storage
Send time of last reply message received from standby server
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rb_total_size</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total size of the changes belonging to all the transactions queued in the
+ reorder buffer. This includes the size of changes queued in memory or
spilled to the disk or sent downstream as part of an unfinished
+ streamed transaction. It can be used to tune
+ <literal>logical_decoding_work_mem</literal> or estimate the logical
+ decoding load on the wal sender.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1b3c5a55882..ca31bf5375b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -932,7 +932,8 @@ CREATE VIEW pg_stat_replication AS
W.replay_lag,
W.sync_priority,
W.sync_state,
- W.reply_time
+ W.reply_time,
+ W.rb_total_size
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f0706e9..5fd8f1848ce 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -713,9 +713,9 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
{
for (i = 0; i < parsed->nsubxacts; i++)
{
- ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr, true);
}
- ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr, true);
return;
}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 34cf05668ae..0adec1fddb8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -392,6 +392,7 @@ ReorderBufferAllocate(void)
buffer->streamBytes = 0;
buffer->totalTxns = 0;
buffer->totalBytes = 0;
+ buffer->totalSize = 0;
buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
@@ -3167,7 +3168,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* to this xid might re-create the transaction incompletely.
*/
void
-ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn, bool upd_rb_total_size)
{
ReorderBufferTXN *txn;
@@ -3198,6 +3199,12 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
/* remove potential on-disk data, and deallocate */
ReorderBufferCleanupTXN(rb, txn);
+
+ if (upd_rb_total_size)
+ {
+ /* Update the total size of the reorder buffer */
+ rb->totalSize -= txn->rb_size;
+ }
}
/*
@@ -3403,6 +3410,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
txn->size += sz;
rb->size += sz;
+ /*
* TODO: update totalSize only if this is a new change being added for the
* first time - i.e. not one being read back from disk after spilling.
+ */
+ rb->totalSize += sz;
+ txn->rb_size += sz;
+
/* Update the total size in the top transaction. */
toptxn->total_size += sz;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0855bae3535..670f49f9286 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1495,6 +1495,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
/* Also update the sent position status in shared memory */
SpinLockAcquire(&MyWalSnd->mutex);
MyWalSnd->sentPtr = MyReplicationSlot->data.restart_lsn;
+ MyWalSnd->rb_total_size = logical_decoding_ctx->reorder->totalSize;
SpinLockRelease(&MyWalSnd->mutex);
replication_active = true;
@@ -3029,6 +3030,7 @@ InitWalSenderSlot(void)
walsnd->applyLag = -1;
walsnd->sync_standby_priority = 0;
walsnd->replyTime = 0;
+ walsnd->rb_total_size = 0;
/*
* The kind assignment is done here and not in StartReplication()
@@ -3545,6 +3547,7 @@ XLogSendLogical(void)
SpinLockAcquire(&walsnd->mutex);
walsnd->sentPtr = sentPtr;
+ walsnd->rb_total_size = logical_decoding_ctx->reorder->totalSize;
SpinLockRelease(&walsnd->mutex);
}
}
@@ -3963,7 +3966,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 12
+#define PG_STAT_GET_WAL_SENDERS_COLS 13
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
SyncRepStandbyData *sync_standbys;
int num_standbys;
@@ -3991,10 +3994,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
int pid;
WalSndState state;
TimestampTz replyTime;
+ int64 rb_total_size;
bool is_sync_standby;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
int j;
+ ReplicationKind replkind;
/* Collect data from shared memory */
SpinLockAcquire(&walsnd->mutex);
@@ -4014,6 +4019,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
applyLag = walsnd->applyLag;
priority = walsnd->sync_standby_priority;
replyTime = walsnd->replyTime;
+ rb_total_size = walsnd->rb_total_size;
+ replkind = walsnd->kind;
SpinLockRelease(&walsnd->mutex);
/*
@@ -4110,6 +4117,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
nulls[11] = true;
else
values[11] = TimestampTzGetDatum(replyTime);
+
+ /* Physical walsenders do not maintain a reorder buffer. */
+ if (replkind == REPLICATION_KIND_PHYSICAL)
+ nulls[12] = true;
+ else
+ values[12] = Int64GetDatum(rb_total_size);
}
tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 118d6da1ace..f44407a6aa2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5661,9 +5661,9 @@
proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
proretset => 't', provolatile => 's', proparallel => 'r',
prorettype => 'record', proargtypes => '',
- proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+ proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,rb_total_size}',
prosrc => 'pg_stat_get_wal_senders' },
{ oid => '3317', descr => 'statistics: information about WAL receiver',
proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index fa0745552f8..932ce87b8c2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -458,6 +458,12 @@ typedef struct ReorderBufferTXN
*/
Size size;
+ /*
+ * Size of this transaction (changes currently in memory, on disk or
+ * streamed in bytes).
+ */
+ Size rb_size;
+
/* Size of top-transaction including sub-transactions. */
Size total_size;
@@ -696,6 +702,14 @@ struct ReorderBuffer
*/
int64 totalTxns; /* total number of transactions sent */
int64 totalBytes; /* total amount of data decoded */
+
+ /*
+ * Tracks total size of changes in the reorder buffer including the
+ * changes spilled to disk or sent downstream as part of an unfinished
+ * streamed transaction. In other words, this tracks the total size of all
+ * the changes in reorder buffer if it would have been infinitely large.
+ */
+ int64 totalSize;
};
@@ -736,7 +750,7 @@ extern void ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
extern void ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
TimestampTz abort_time);
extern void ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid);
-extern void ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn);
+extern void ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn, bool upd_rb_total_size);
extern void ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn);
extern void ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index e98701038f5..0df7b6c1e75 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -75,6 +75,13 @@ typedef struct WalSnd
*/
TimestampTz replyTime;
+ /*
+ * Size of contents in reorder buffer including those spilled to the disk
+ * or sent downstream as part of an unfinished streamed transaction. This
+ * is the total size of all the changes in reorder buffer if it would have
+ * been infinitely large.
+ */
+ int64 rb_total_size;
ReplicationKind kind;
} WalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..6afddad4c59 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2127,9 +2127,10 @@ pg_stat_replication| SELECT s.pid,
w.replay_lag,
w.sync_priority,
w.sync_state,
- w.reply_time
+ w.reply_time,
+ w.rb_total_size
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, rb_total_size) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_replication_slots| SELECT s.slot_name,
s.spill_txns,
base-commit: 783cbb6d5e8bdf87d321286f210983c177ead967
--
2.34.1
Hi,
On Wed, Aug 13, 2025 at 06:09:47PM +0530, Ashutosh Bapat wrote:
Hi All,
The question of how to tune logical_decoding_work_mem has come up a
few times in the context of customer cases.
Same here.
If it is too low then the
changes are spilled to disk, slowing down replication and consuming
disk space. In case of streamed transactions, it won't consume space
on primary but will consume resources (transactional resources/disk
space) on replica. If decoding memory is too high, it may cause memory
pressure especially when there are many wal senders and large amounts
of changes queued to reorder buffers frequently.

In order to tune logical_decoding_work_mem optimally we need a measure
of the size of changes belonging to the transactions queued in the
reorder buffer, whether spilled to the disk or in memory or sent
downstream as part of an unfinished streamed transaction.
Fully agree, that's something I started to work on some time ago. Not only
to tune the logical_decoding_work_mem but also to help diagnose issues with high
memory consumption and/or OOM due to logical decoding.
Currently
there is no way to determine that. pg_stat_replication_slots reports
cumulative statistics about spilled or streamed transactions. It helps
to know whether the logical decoding work memory is sufficient or not
to hold all the changes. But it does not provide a way to estimate an
optimal value for it since it doesn't capture the state of the reorder
buffer at a given point in time.
Agree.
With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached. In the plots, the Y-axis is the total size of
reorder buffer and the X axis is time of day.
rbsize_trendlines_pgbench.png shows trendline when just regular
pgbench is run with 3K TPS and
rbsize_trendlines_large_transaction_spikes.png shows the trendline
when there are long and large transactions in addition to pgbench.
Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed.
Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.
Attached is what I prepared a couple of years ago. I did not share it
until now because it "only" logs information in the log file and I wanted
to do more. It's far from polished and I'm not sure it's 100% correct. I'm
sharing it just as food for thought about what information I thought could
be useful to log. It applies on top of yours.
Tracking total size of reorder buffer also helps to quantify load on a
wal sender.
Yeah, and memory being used.
It is an additional metric that can be useful to debug WAL
buildup, slow replication etc. For example, the plot clearly and
directly shows when the wal sender started processing the large
transaction, when it finished etc.
Agree that's useful info to have.
The patch is not complete. There are many things to be worked on
1. Whether the metrics should be reported in pg_stat_replication or
pg_stat_replication_slots. Given that it's a point-in-time metric, it
fits pg_stat_replication better.
I do think the same.
But that view does not report logical
decoding activity happening in backends other than the wal senders.
But that's true with the other metrics in that view as well.
pg_stat_replication_slots, however, covers all users of logical
decoding. For now I have used pg_stat_replication, but it can be
changed based on the discussion here.
I think pg_stat_replication is a good place to record "real time" activities.
Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).
pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot. Also maybe record the max value the reorder buffer reached and
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.
Also, should we log some of this information to the server log?
But before I work on those, I would like to know others' opinions,
usefulness and acceptance.
Thanks for working on it. I like the idea and also think that is an area where
more details/metrics should be provided.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Adding-Log-messages-related-to-ReorderBuffer-size.txt (text/plain)
From ad86daed90d25d247b7278f4d69dd45d5414152a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Thu, 14 Aug 2025 07:12:17 +0000
Subject: [PATCH v1] Adding Log messages related to ReorderBuffer size
Add log messages related to ReorderBuffer size:
1. Size of the buffer before logical decoding starts to create spilled files on disk
(should be > logical_decoding_work_mem but by how much?)
2. Size of the buffer when the logical decoder starts to read from the spilled files.
We report the size every 100MB (controlled by reorderbuffer_size_log) as well as
information about the xid, sub xid and LSNs concerned.
We could report messages at a frequency higher than every 100MB, but that would
increase the risk of losing information in case of OOM.
100MB should not produce too many writes to the log file (a 10GB ReorderBuffer
would produce "only" about 100 lines).
---
.../replication/logical/reorderbuffer.c | 24 +++++++++++++++++++
src/backend/utils/misc/guc_tables.c | 11 +++++++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/replication/reorderbuffer.h | 1 +
4 files changed, 37 insertions(+)
74.7% src/backend/replication/logical/
22.2% src/backend/utils/misc/
3.0% src/include/replication/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0adec1fddb8..a8a5fd84883 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -224,6 +224,9 @@ typedef struct ReorderBufferDiskChange
*/
int logical_decoding_work_mem;
static const Size max_changes_in_memory = 4096; /* XXX for restore only */
+int reorderbuffer_size_log;
+static Size max_logical_decoding_work_mem_used_before_serialize = 0;
+static Size reorderbuffer_size_log_modulo = 0;
/* GUC variable */
int debug_logical_replication_streaming = DEBUG_LOGICAL_REP_STREAMING_BUFFERED;
@@ -1221,6 +1224,8 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
XLogRecPtr end_lsn)
{
ReorderBufferTXN *subtxn;
+ reorderbuffer_size_log_modulo = 0;
+ max_logical_decoding_work_mem_used_before_serialize = 0;
subtxn = ReorderBufferTXNByXid(rb, subxid, false, NULL,
InvalidXLogRecPtr, false);
@@ -3906,6 +3911,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
rb->size < logical_decoding_work_mem * (Size) 1024)
return;
+ if (rb->size > max_logical_decoding_work_mem_used_before_serialize)
+ {
+ max_logical_decoding_work_mem_used_before_serialize = rb->size;
+ ereport(LOG,(errmsg("ReorderBuffer Size before creating spill file(s): %zu", rb->size),
+ errcontext("xid %u", rb->by_txn_last_xid)));
+ }
+
/*
* If debug_logical_replication_streaming is immediate, loop until there's
* no change. Otherwise, loop until we reach under the memory limit. One
@@ -4531,6 +4543,18 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
Assert(txn->first_lsn != InvalidXLogRecPtr);
Assert(txn->final_lsn != InvalidXLogRecPtr);
+ if (rb->size / (reorderbuffer_size_log * 1024L) > reorderbuffer_size_log_modulo)
+ {
+ reorderbuffer_size_log_modulo = rb->size / (reorderbuffer_size_log * 1024L);
+ ereport(LOG,(errmsg("ReorderBuffer Size while restoring from spilled files: %lu",
+ rb->size),
+ errcontext("Restoring top level xid %u, xid %u at first_lsn %X/%08X, final_lsn %X/%08X and end_lsn %X/%08X",
+ txn->toplevel_xid,
+ txn->xid,
+ (uint32) (txn->first_lsn >> 32), (uint32) txn->first_lsn,
+ (uint32) (txn->final_lsn >> 32), (uint32) txn->final_lsn,
+ (uint32) (txn->end_lsn >> 32), (uint32) txn->end_lsn)));
+ }
/* free current entries, so we have memory for more */
dlist_foreach_modify(cleanup_iter, &txn->changes)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..c84df6c844c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2603,6 +2603,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"reorderbuffer_size_log", PGC_USERSET, LOGGING_WHEN,
+ gettext_noop("Sets the frequency in bytes to display information while restoring from spilled files during logical decoding."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &reorderbuffer_size_log,
+ 102400, 10240, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
/*
* We use the hopefully-safely-small value of 100kB as the compiled-in
* default for max_stack_depth. InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a9d8293474a..48b7c1eaa77 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -145,6 +145,7 @@
#maintenance_work_mem = 64MB # min 64kB
#autovacuum_work_mem = -1 # min 64kB, or -1 to use maintenance_work_mem
#logical_decoding_work_mem = 64MB # min 64kB
+#reorderbuffer_size_log = 100MB # min 10MB
#max_stack_depth = 2MB # min 100kB
#shared_memory_type = mmap # the default is the first option
# supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 932ce87b8c2..6385b82ec7a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -26,6 +26,7 @@
/* GUC variables */
extern PGDLLIMPORT int logical_decoding_work_mem;
extern PGDLLIMPORT int debug_logical_replication_streaming;
+extern PGDLLIMPORT int reorderbuffer_size_log;
/* possible values for debug_logical_replication_streaming */
typedef enum
--
2.34.1
Hi Bertrand,
Thanks for your response. I am glad that you have found the proposal
to be generally useful.
On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed.

Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.
When replication is stuck or slow, or WAL builds up, a detailed look
at the reorder buffer's contents would be useful. That will need to be
done outside of pg_stat_replication or pg_stat_replication_slots. I
think it would be useful to provide a function which triggers a given
WAL sender to dump information about the contents of its reorder
buffer to the server log, or to send it to the requesting client.
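As a purely hypothetical sketch (no such function exists today; the
names and shape below are only illustrative), it could look something
like:

-- hypothetical: ask a specific WAL sender to report its reorder
-- buffer state (per-transaction xid, in-memory/on-disk size, LSNs)
-- SELECT * FROM pg_get_reorder_buffer_contents(walsender_pid);
-- or, to write it to the server log instead:
-- SELECT pg_log_reorder_buffer_contents(walsender_pid);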
I think pg_stat_replication is a good place to record "real time" activities.
Ok. Thanks for confirming.
Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).

pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot.
I think pg_stat_replication_slots::spill_count and
pg_stat_replication_slots::stream_count give that value.
Also maybe record the max value the reorder buffer reached and
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.
A function which dumps the information about reorder buffers can be
used along with the trendline for this purpose. For example, if a user
notices WAL accumulation and an upward trend in the metric exposed by
my patch, they can trigger a dump of reorder buffer contents at that
time and diagnose the problem and even fix it proactively.
Also, should we log some of information to the server log?
Information in server error logs is hard to access and analyse,
especially in cloud environments. Having a SQL interface is better.
But before I work on those, I would like to know others' opinions,
usefulness and acceptance.

Thanks for working on it. I like the idea and also think that is an area where
more details/metrics should be provided.
Thanks.
--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh,
On Thu, Aug 21, 2025 at 07:26:41PM +0530, Ashutosh Bapat wrote:
Hi Bertrand,
Thanks for your response. I am glad that you have found the proposal
to be generally useful.

On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed.

Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.

When the replication is stuck or is slow or WAL builds up a detailed
look at the reorder buffer's contents would be useful. It will need to
be done outside of pg_stat_replication or pg_stat_replication_slots. I
think it will be useful to provide a function which will trigger a
given WAL sender to dump the information about the contents of the
reorder buffer to the server error log or send them to the requesting
client.
Maybe we could add such functionalities in contrib/pg_logicalinspect?
Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).

pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot.

I think pg_stat_replication_slots::spill_count and
pg_stat_replication_slots::stream_count give that value.
Yes, as long as subtransactions are not involved. But if, say, I have one
transaction made of 1500 subtransactions, I'd get something like:
slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 1
So we don't know how many times logical_decoding_work_mem has been reached (
except by looking at total_txns).
But as soon as another transaction (that does not involve spill) is decoded:
slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 2
Then we don't know if logical_decoding_work_mem has been reached one or two
times.
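For what it's worth, here is a rough way to reproduce that behaviour
with test_decoding (assuming wal_level = logical and a small
logical_decoding_work_mem, e.g. 64kB, so that the transaction spills;
the exact counter values will vary):

SELECT pg_create_logical_replication_slot('logical_slot', 'test_decoding');

CREATE TABLE spill_test (id int, payload text);

DO $$
BEGIN
  FOR i IN 1..1500 LOOP
    -- each exception block runs in its own subtransaction
    BEGIN
      INSERT INTO spill_test
      SELECT g, repeat('x', 100) FROM generate_series(1, 100) g;
    EXCEPTION WHEN OTHERS THEN NULL;
    END;
  END LOOP;
END $$;

-- decode (and discard) the changes, then look at the counters
SELECT count(*) FROM pg_logical_slot_get_changes('logical_slot', NULL, NULL);
SELECT slot_name, spill_txns, spill_count, stream_count, total_txns
FROM pg_stat_replication_slots;

-- cleanup: SELECT pg_drop_replication_slot('logical_slot');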
Also maybe record the max value the reorder buffer reached and
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.

A function which dumps the information about reorder buffers can be
used along with the trendline for this purpose. For example, if a user
notices WAL accumulation and an upward trend in the metric exposed by
my patch, they can trigger a dump of reorder buffer contents at that
time and diagnose the problem and even fix it proactively.
Right. That would work for cases where the issue is occurring right now and
the system is still available (OOM not triggered for example).
Also, should we log some of information to the server log?
Information in server error logs is hard to access and analyse,
especially in cloud environment. Having a SQL construct is better.
That's right as long as the system is still available. I think a mix of log
and SQL API helps more use cases: real-time, historical, and when the system
is back to a usable state (if it had become unavailable).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Aug 22, 2025 at 11:43 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi Ashutosh,
On Thu, Aug 21, 2025 at 07:26:41PM +0530, Ashutosh Bapat wrote:
Hi Bertrand,
Thanks for your response. I am glad that you have found the proposal
to be generally useful.

On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed.

Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.

When the replication is stuck or is slow or WAL builds up a detailed
look at the reorder buffer's contents would be useful. It will need to
be done outside of pg_stat_replication or pg_stat_replication_slots. I
think it will be useful to provide a function which will trigger a
given WAL sender to dump the information about the contents of the
reorder buffer to the server error log or send them to the requesting
client.

Maybe we could add such functionalities in contrib/pg_logicalinspect?
I don't have an opinion about where to place this. But if we place it
in core, there are higher chances that it will be trusted and used.
Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).

pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot.

I think pg_stat_replication_slots::spill_count and
pg_stat_replication_slots::stream_count give that value.

Yes as long as subtrans are not involved. But if say I've one transaction made of
1500 subtrans, I'd get something like:

slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 1

So we don't know how many times logical_decoding_work_mem has been reached (
except by looking at total_txns).

But as soon as another transaction (that does not involve spill) is decoded:

slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 2

Then we don't know if logical_decoding_work_mem has been reached one or two
times.
I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.
Also maybe record the max value the reorder buffer reached and
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.

A function which dumps the information about reorder buffers can be
used along with the trendline for this purpose. For example, if a user
notices WAL accumulation and an upward trend in the metric exposed by
my patch, they can trigger a dump of reorder buffer contents at that
time and diagnose the problem and even fix it proactively.

Right. That would work for cases where the issue is occuring right now and
the system is still available (OOM not triggered for example).

Also, should we log some of information to the server log?

Information in server error logs is hard to access and analyse,
especially in cloud environment. Having a SQL construct is better.

That's right as long as the system is still available. I think a mix of log
and SQL API help more use cases: real time, historical and when the system is
back to a usable state (if it was not anymore).
The way I envision this is that users will sample the views
periodically or when certain thresholds (amount of WAL, size of
reorder buffers) are crossed. That way they can have historical data
in an easy-to-query manner when the actual incident occurs.
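One possible shape for that kind of sampling, using the rb_total_size
column proposed in this thread (not an existing column) and a plain
history table:

-- sketch: sample the proposed metric into a local history table,
-- run from cron or any scheduler every N seconds
CREATE TABLE IF NOT EXISTS rb_size_history (
    sample_time      timestamptz DEFAULT now(),
    pid              int,
    application_name text,
    rb_total_size    bigint
);

INSERT INTO rb_size_history (pid, application_name, rb_total_size)
SELECT pid, application_name, rb_total_size
FROM pg_stat_replication
WHERE rb_total_size IS NOT NULL;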
--
Best Wishes,
Ashutosh Bapat
Hi,
On Mon, Aug 25, 2025 at 01:48:42PM +0530, Ashutosh Bapat wrote:
On Fri, Aug 22, 2025 at 11:43 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Yes as long as subtrans are not involved. But if say I've one transaction made of
1500 subtrans, I'd get something like:

slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 1

So we don't know how many times logical_decoding_work_mem has been reached (
except by looking at total_txns).

But as soon as another transaction (that does not involve spill) is decoded:

slot_name | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
logical_slot | 1501 | 0 | 2

Then we don't know if logical_decoding_work_mem has been reached one or two
times.

I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.
Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
That could help to see if reaching logical_decoding_work_mem is rare or
frequent. If frequent, then maybe logical_decoding_work_mem needs to be adjusted.
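If the new counter ends up as, say, a mem_exceeded_count column in
pg_stat_replication_slots (the name is purely illustrative), the ratio
could be computed like this:

-- mem_exceeded_count is a hypothetical column name for the proposed counter
SELECT slot_name,
       total_txns,
       mem_exceeded_count,
       round(100.0 * mem_exceeded_count / NULLIF(total_txns, 0), 1)
           AS pct_txns_exceeding_limit
FROM pg_stat_replication_slots;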
That's right as long as the system is still available. I think a mix of log
and SQL API help more use cases: real time, historical and when the system is
back to a usable state (if it was not anymore).

The way I envision this is that users will sample the views
periodically or when certain thresholds (amount of WAL, size of
reorder buffers) are crossed. That way they can have historical data
in an easy-to-query manner when the actual incident occurs.
Yeah. OTOH, having this information in the log could also help users who
did not think about sampling the views and then hit an incident.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Hi All,
The question of how to tune logical_decoding_work_mem has come up a
few times in the context of customer cases. If it is too low then the
changes are spilled to disk, slowing down replication and consuming
disk space. In case of streamed transactions, it won't consume space
on primary but will consume resources (transactional resources/disk
space) on replica. If decoding memory is too high, it may cause memory
pressure especially when there are many wal senders and large amounts
of changes queued to reorder buffers frequently.
Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:
In order to tune logical_decoding_work_mem optimally we need a measure
of the size of changes belonging to the transactions queued in the
reorder buffer, whether spilled to the disk or in memory or sent
downstream as part of an unfinished streamed transaction. Currently
there is no way to determine that. pg_stat_replication_slots reports
cumulative statistics about spilled or streamed transactions. It helps
to know whether the logical decoding work memory is sufficient or not
to hold all the changes. But it does not provide a way to estimate an
optimal value for it since it doesn't capture the state of the reorder
buffer at a given point in time.

With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached.
An alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
sampling of the total memory usage might miss.
Tracking total size of reorder buffer also helps to quantify load on a
wal sender. It is an additional metric that can be useful to debug WAL
buildup, slow replication etc. For example, the plot clearly and
directly shows when the wal sender started processing the large
transaction, when it finished etc.
Agreed.
The patch is not complete. There are many things to be worked on
1. Whether the metrics should be reported in pg_stat_replication or
pg_stat_replication_slots. Given that it's a point-in-time metric, it
fits pg_stat_replication better. But that view does not report logical
decoding activity happening in backends other than the wal senders.
But that's true with the other metrics in that view as well.
pg_stat_replication_slots, however, covers all users of logical
decoding. For now I have used pg_stat_replication, but it can be
changed based on the discussion here.

2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?
For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
We need to note that the actual size of changes sent downstream
depends on the logical decoding plugin. For instance, we have table,
row, and column filters in logical replication. It might be worth
considering providing such statistics too. Statistics like the number
of changes filtered out by the table filters or the change-kind filter
(e.g., only publishing INSERTs) might be helpful for users to confirm
the effectiveness of the filters they set.
If we are going
to report so much statistics about the contents of the reorder buffer,
is it better to have a separate view pg_stat_reorder_buffer for the
same?
Given that logical decoding can also be used by regular backend
processes, I guess that such dynamic metrics would fit a system view
dedicated to logical decoding, say pg_stat_reorder_buffer or
pg_stat_logical_decoding.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.

Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
That could help to see if reaching logical_decoding_work_mem is rare or
frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.
We are discussing some improvements to pg_stat_replication_slots over
at [1]. Would you be able to create a patch reporting the number of
times logical_decoding_work_mem is reached, in the patchset being
discussed there?
[1]: /messages/by-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
Hi Masahiko,
Thanks for your inputs.
On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:
Glad that more people think there is a need for improving the reorder
buffer statistics.
With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached.

Alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
might be missed by sampling the total memory usage.
I think the trendline is important to make the right trade-offs when
setting logical_decoding_work_mem to the maximum is not possible. With
a high-watermark, the question is how long a high stays relevant. As
the load changes, a value that was once a high may become lower or
higher than the new high and soon become irrelevant. We could probably
reset the high-watermark every time the view is sampled, so that we
provide a trendline for the high-water mark as well. But I want to be
cautious about adding the complexity of tracking maxima accurately and
then maintaining it forever. If sampling is frequent enough, it will
usually capture maxima and minima well enough for practical purposes.
Users will need to treat the trendline as approximate anyway, since
the load will show slight variations over time.
Please share your idea of how the high-watermark should be reported.
2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?

For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?
We need to note that the actual size of changes sent downstream
actually depends on logical decoding plugins. For instance, we have
table, row, and column filters in logical replication cases. It might
be worth considering providing such statistics too. The statistics
like the number of changes filtered out by the table filters or the
change-kind filter (e.g., only publishing INSERTs etc) might be
helpful for users to confirm the effectiveness of the filters they
set.
I have proposed this in [1]. Please check.
If we are going
to report so much statistics about the contents of the reorder buffer,
is it better to have a separate view pg_stat_reorder_buffer for the
same?

Given logical decoding can be used also by regular backend processes,
I guess that such dynamic metrics would fit a system view dedicated to
logical decoding, say pg_stat_reorder_buffer or
pg_stat_logical_decoding.
Hmm. That means we will have to reserve an area in shared memory with
as many slots as the number of replication slots (since that is the
maximum number of reorder buffers that can exist in the system) and
use each one to maintain the stats. I haven't yet checked whether we
could use the existing statistics machinery for it, but it should be
doable.
--
Best Wishes,
Ashutosh Bapat
Hi,
On Tue, Aug 26, 2025 at 02:56:14PM +0530, Ashutosh Bapat wrote:
On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.

Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
That could help to see if reaching logical_decoding_work_mem is rare or
frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.

We are discussing some improvements to pg_stat_replication_slots over
at [1]. Would you be able to create a patch reporting the number of
times logical_decoding_work_mem is reached, in the patchset being
discussed there?

[1] /messages/by-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com
Thanks for pointing me to this thread, I'll look at it and propose a patch
to report the number of times logical_decoding_work_mem is reached.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
If we are going
to report so much statistics about the contents of the reorder buffer,
is it better to have a separate view pg_stat_reorder_buffer for the
same?

Given logical decoding can be used also by regular backend processes,
I guess that such dynamic metrics would fit a system view dedicated to
logical decoding, say pg_stat_reorder_buffer or
pg_stat_logical_decoding.

Hmm. That means we will have to reserve some area in the shared memory
with as many slots as the number of replication slots (since that is
the maximum number of reorder buffers that can be in the system) and
use each one to maintain the stats. I haven't yet checked whether we
could use the stats maintaining system for it, but it should be
doable.
I think the beauty of reporting the statistics only for WAL senders,
as is done in the patch, is that we can piggyback the stats update on
the existing logic to update WalSnd. If we want to do it for other
backends, it would add extra overhead to lock and update the stats
outside WalSnd. And as I said in my earlier email, WAL senders are
arguably the only serious users of logical decoding whose usage is
worth monitoring - that's why pg_stat_replication reports stats only
about WAL senders and not other backends. Even if we report stats
about only WAL senders, it should be possible to report them through a
view other than pg_stat_replication. What's your opinion about this?
--
Best Wishes,
Ashutosh Bapat
Hi,
On Tue, Aug 26, 2025 at 09:48:04AM +0000, Bertrand Drouvot wrote:
Hi,
On Tue, Aug 26, 2025 at 02:56:14PM +0530, Ashutosh Bapat wrote:
On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.

Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
That could help to see if reaching logical_decoding_work_mem is rare or
frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.

We are discussing some improvements to pg_stat_replication_slots over
at [1]. Would you be able to create a patch reporting the number of
times logical_decoding_work_mem is reached, in the patchset being
discussed there?

[1] /messages/by-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com
Thanks for pointing me to this thread, I'll look at it and propose a patch
to report the number of times logical_decoding_work_mem is reached.
I preferred to create a dedicated thread for it [1] (I think both may need dedicated
discussions).
[1]: /messages/by-id/aK6zNcmb+PpySFvG@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Hi Masahiko,
Thanks for your inputs.

On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:

Glad that more people think there is a need for improving the reorder
buffer statistics.

With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached.

Alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
might be missed by sampling the total memory usage.

I think the trendline is important to make right trade-offs in case
setting logical_decoding_work_mem to maximum is not possible. With a
high watermark the question is how long does a high remain high? As
the load changes, a high that was once may become lower or higher than
the new high and soon irrelevant. Probably we could reset the
high-watermark every time the view is sampled, so that we provide a
trendline for the high-water mark as well. But I want to be cautious
about adding that complexity of tracking maxima accurately and then
maintain it forever. If sampling is frequent enough usually it will
capture a maxima and minima good enough for practical purposes. The
users will need to consider the trendline as approximate anyway since
the load will show slight variations over the time.
Agreed on the importance of the trendline.
Please share your idea of reporting high-watermark.
I was thinking of this high-watermark idea since it doesn't require
any external system/tool to get the information for tuning
logical_decoding_work_mem. It would be easy to use when users want to
figure out maximum data usage for logical decoding of the particular
workload.
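As a sketch of what I have in mind (the column name mem_peak_bytes is
purely hypothetical, nothing like it exists yet), users could compare the
observed peak with the current setting in a single query, without any
external sampling tool:

    -- hypothetical high-watermark column on pg_stat_replication_slots
    SELECT slot_name,
           pg_size_pretty(mem_peak_bytes) AS peak_usage,
           current_setting('logical_decoding_work_mem') AS current_limit
    FROM pg_stat_replication_slots;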
2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?
For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?
For debugging or diagnosing purposes, this information might be
useful, but I'm not sure we need it for logical decoding sessions that
are currently running.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Aug 28, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Hi Masahiko,
Thanks for your inputs.
On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:
Glad that more people think there is a need for improving the reorder
buffer statistics.
With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached.
Alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
might be missed by sampling the total memory usage.
I think the trendline is important for making the right trade-offs in
case setting logical_decoding_work_mem to the maximum is not possible.
With a high watermark the question is how long a high stays relevant:
as the load changes, an old high may become lower or higher than the
new one and soon become irrelevant. We could probably reset the
high watermark every time the view is sampled, so that we provide a
trendline for the high watermark as well. But I want to be cautious
about adding the complexity of tracking maxima accurately and then
maintaining it forever. If sampling is frequent enough, it will
usually capture maxima and minima well enough for practical purposes.
Users will need to treat the trendline as approximate anyway, since
the load will show slight variations over time.
Agreed with the importance of trendline.
Please share your idea of reporting high-watermark.
I was thinking of this high-watermark idea since it doesn't require
any external system/tool to get the information for tuning
logical_decoding_work_mem. It would be easy to use when users want to
figure out maximum data usage for logical decoding of the particular
workload.
Ok. I am interested in knowing how you think we should keep track of
the high watermark.
Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
when do we do that? How often do we reset it?
Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.
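For example, the existing lag columns can already be trended that way with
nothing more than a psql loop (all the columns below exist today); a
point-in-time reorder buffer size column could be sampled the same way:

    SELECT now() AS sample_time,
           application_name,
           write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;
    \watch 30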
2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?
For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?
For debugging or diagnosing purposes, this information might be
useful, but I'm not sure we need it for logical decoding sessions that
are currently running.
I was not clear. Sorry. I meant reporting the number of
subtransactions and prepared transactions in the reorder buffer (not
the actual subtransactions or prepared transactions themselves). Do
you think that reporting just the number of transactions in the
reorder buffer is enough? Do you think that reporting the number of
subtransactions and the number of prepared transactions is not
required?
--
Best Wishes,
Ashutosh Bapat
On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Aug 28, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Hi Masahiko,
Thanks for your inputs.
On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:
Glad that more people think there is a need for improving the reorder
buffer statistics.
With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached.
Alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
might be missed by sampling the total memory usage.
I think the trendline is important for making the right trade-offs in
case setting logical_decoding_work_mem to the maximum is not possible.
With a high watermark the question is how long a high stays relevant:
as the load changes, an old high may become lower or higher than the
new one and soon become irrelevant. We could probably reset the
high watermark every time the view is sampled, so that we provide a
trendline for the high watermark as well. But I want to be cautious
about adding the complexity of tracking maxima accurately and then
maintaining it forever. If sampling is frequent enough, it will
usually capture maxima and minima well enough for practical purposes.
Users will need to treat the trendline as approximate anyway, since
the load will show slight variations over time.
Agreed with the importance of trendline.
Please share your idea of reporting high-watermark.
I was thinking of this high-watermark idea since it doesn't require
any external system/tool to get the information for tuning
logical_decoding_work_mem. It would be easy to use when users want to
figure out maximum data usage for logical decoding of the particular
workload.
Ok. I am interested in knowing how you think we should keep track of
the high watermark.
Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
when do we do that? How often do we reset it?
I think it ultimately depends on use cases but I imagine that users
can reset the high watermark value after adjusting
logical_decoding_work_mem, and see how the new value works, then
adjust the parameter and reset the value again, ... repeating.
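Something like the following loop, assuming a hypothetical per-slot
high-watermark column (called mem_peak_bytes here) that is cleared together
with the other slot statistics; ALTER SYSTEM, pg_reload_conf() and
pg_stat_reset_replication_slot() all exist today, and 'my_slot' is just an
example slot name:

    -- 1. adjust the limit and reload the configuration
    ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
    SELECT pg_reload_conf();

    -- 2. reset the slot statistics, including the hypothetical peak
    SELECT pg_stat_reset_replication_slot('my_slot');

    -- 3. after running the workload for a while, check the new peak
    SELECT slot_name, pg_size_pretty(mem_peak_bytes) AS peak_usage
    FROM pg_stat_replication_slots
    WHERE slot_name = 'my_slot';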
Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.
I think that high watermark value would be a new type of statistics we
collect. As far as I know pg_stat_statement collects similar values
but there is no precedent in the core.
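For reference, the min/max values I mean are the ones pg_stat_statements
(a contrib extension, not core) already exposes, e.g.:

    -- existing pg_stat_statements columns; the *_plan_time columns are
    -- populated only when pg_stat_statements.track_planning is enabled
    SELECT queryid, calls,
           min_plan_time, max_plan_time,
           min_exec_time, max_exec_time
    FROM pg_stat_statements
    ORDER BY max_exec_time DESC
    LIMIT 5;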
2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?
For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?
For debugging or diagnosing purposes, this information might be
useful, but I'm not sure we need it for logical decoding sessions that
are currently running.
I was not clear. Sorry. I meant reporting the number of
subtransactions and prepared transactions in the reorder buffer (not
the actual subtransactions or prepared transactions themselves). Do
you think that reporting just the number of transactions in the
reorder buffer is enough? Do you think that reporting the number of
subtransactions and the number of prepared transactions is not
required?
Given that collecting new statistics and updating them in real time
incurs costs, it's ultimately a question of rewards vs. costs. In terms
of additional costs, I guess tracking the number of prepared
transactions in the reorderbuffer would not be costly, but I'm not
sure about the number of subtransactions. I personally would avoid
collecting these statistics unless there are explicit use cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Ok. I am interested in knowing how you think we should keep track of
the high watermark.
Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
When do we do that? How often do we reset it?I think it ultimately depends on use cases but I imagine that users
can reset the high watermark value after adjusting
logical_decoding_work_mem, and see how the new value works, then
adjust the parameter and reset the value again, ... repeating.
Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.
I think that high watermark value would be a new type of statistics we
collect. As far as I know pg_stat_statement collects similar values
but there is no precedent in the core.
Thanks for pointing me to pg_stat_statements. I see we store
cumulative statistics like min and max plan and execution times for
queries there. We maintain cumulative statistics about logical
decoding/replication in pg_stat_replication_slots. I think the high
watermark fits there. It also has functionality to reset the
statistics which can be used to reset the high watermark along with
other statistics.
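Concretely, pg_stat_replication_slots already has cumulative counters such
as spill_bytes, stream_bytes and total_bytes plus a stats_reset timestamp,
and pg_stat_reset_replication_slot() clears them; a high-watermark column
(the name mem_peak_bytes below is hypothetical) would simply be reset along
with the rest:

    SELECT slot_name, spill_bytes, stream_bytes, total_bytes,
           mem_peak_bytes,        -- hypothetical, not implemented yet
           stats_reset
    FROM pg_stat_replication_slots;

    -- existing function; passing NULL resets statistics for all slots
    SELECT pg_stat_reset_replication_slot(NULL);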
2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer?
For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.
Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?
For debugging or diagnosing purposes, this information might be
useful, but I'm not sure we need it for logical decoding sessions that
are currently running.
I was not clear. Sorry. I meant reporting the number of
subtransactions and prepared transactions in the reorder buffer (not
the actual subtransactions or prepared transactions themselves). Do
you think that reporting just the number of transactions in the
reorder buffer is enough? Do you think that reporting the number of
subtransactions and the number of prepared transactions is not
required?
Given that collecting new statistics and updating them in real time
incurs costs, it's ultimately a question of rewards vs. costs. In terms
of additional costs, I guess tracking the number of prepared
transactions in the reorderbuffer would not be costly, but I'm not
sure about the number of subtransactions. I personally would avoid
collecting these statistics unless there are explicit use cases.
I agree. I think it's not clear whether statistics about the number of
transactions currently active in the reorder buffer are useful. So
let's leave it out; if needed, we can add it at a later point in time.
I will submit a complete patch soon.
--
Best Wishes,
Ashutosh Bapat
Hi,
On 2025-09-23 17:45:19 +0530, Ashutosh Bapat wrote:
On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Ok. I am interested in knowing how you think we should keep track of
the high watermark.
Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
when do we do that? How often do we reset it?
I think it ultimately depends on use cases but I imagine that users
can reset the high watermark value after adjusting
logical_decoding_work_mem, and see how the new value works, then
adjust the parameter and reset the value again, ... repeating.
Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.
I think that high watermark value would be a new type of statistics we
collect. As far as I know pg_stat_statement collects similar values
but there is no precedent in the core.
Thanks for pointing me to pg_stat_statements. I see we store
cumulative statistics like min and max plan and execution times for
queries there. We maintain cumulative statistics about logical
decoding/replication in pg_stat_replication_slots. I think the high
watermark fits there. It also has functionality to reset the
statistics which can be used to reset the high watermark along with
other statistics.
I think pg_stat_statement is an anti-example, if anything. It's been so
overloaded with barely useful fields that it got so slow that it starts to
cause contention even in just moderately busy workloads.
Let's make this a minimal change, instead of over-complicating this beyond
usefulness.
Greetings,
Andres Freund
On Tue, Sep 23, 2025 at 6:04 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-09-23 17:45:19 +0530, Ashutosh Bapat wrote:
On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Ok. I am interested in knowing how you think we should keep track of
the high watermark.
Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
when do we do that? How often do we reset it?
I think it ultimately depends on use cases but I imagine that users
can reset the high watermark value after adjusting
logical_decoding_work_mem, and see how the new value works, then
adjust the parameter and reset the value again, ... repeating.
Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.
I think that high watermark value would be a new type of statistics we
collect. As far as I know pg_stat_statement collects similar values
but there is no precedent in the core.
Thanks for pointing me to pg_stat_statements. I see we store
cumulative statistics like min and max plan and execution times for
queries there. We maintain cumulative statistics about logical
decoding/replication in pg_stat_replication_slots. I think the high
watermark fits there. It also has functionality to reset the
statistics which can be used to reset the high watermark along with
other statistics.
I think pg_stat_statement is an anti-example, if anything. It's been so
overloaded with barely useful fields that it got so slow that it starts to
cause contention even in just moderately busy workloads.
Let's make this a minimal change, instead of over-complicating this beyond
usefulness.
Thanks Bertrand, Masahiko and Andres for your thoughts.
Summarising the discussion so far, it seems we agree that reporting
the current size of the reorder buffer (including the spilled and
streamed transactions) in pg_stat_replication is useful.
Including the sizes of changes from spilled as well as streamed
transactions lets users see how the reorder buffer size would trend if
logical_decoding_work_mem were unlimited. If it's sampled frequently
enough, the trendline
can be used to find an optimal value of logical_decoding_work_mem
considering various trade-offs.
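For instance, once the column is available (reorder_buffer_size below is a
placeholder name pending the final patch), the trendline could be collected
with a plain psql loop writing samples into a table:

    CREATE TABLE IF NOT EXISTS rb_samples (ts timestamptz, pid int, rb_size bigint);
    -- hypothetical column in pg_stat_replication; \watch repeats the INSERT
    INSERT INTO rb_samples
    SELECT now(), pid, reorder_buffer_size
    FROM pg_stat_replication;
    \watch 10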
We did consider other metrics, like the number of transactions in the
reorder buffer and the high watermark of the reorder buffer size, but
opinions about their usefulness differ, so we are deferring them for now.
I will work on producing a complete patch next.
--
Best Wishes,
Ashutosh Bapat