Skip collecting decoded changes of already-aborted transactions

Started by Masahiko Sawada over 2 years ago, 84 messages
#1 Masahiko Sawada
sawada.mshk@gmail.com
1 attachment(s)

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

skip_decoding_already_aborted_txn.patch (application/octet-stream)
diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8..fe06e42c98 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,38 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4', 'test_decoding') s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- transaction is large enough to be serialized but aborted.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+ROLLBACK;
+RESET logical_decoding_work_mem;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4', NULL, NULL, 'skip-empty-xacts', '1');
+ count 
+-------
+     0
+(1 row)
+
+-- Check stats. Since the transaction is already aborted, we don't collect
+-- changes, so no data should be spilled.
+SELECT slot_name, spill_txns = 0 AS spill_txns, spill_count = 0 AS spill_count, total_txns > 0 AS total_txns, total_bytes > 0 AS total_bytes FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4';
+       slot_name        | spill_txns | spill_count | total_txns | total_bytes 
+------------------------+------------+-------------+------------+-------------
+ regression_slot_stats4 | t          | t           | f          | f
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index 0f21dcb8e0..bdba352f1a 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -24,12 +24,11 @@ SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
 INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
 TRUNCATE table stream_test;
 rollback to s1;
-INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 31) g(i);
 COMMIT;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
-                           data                           
-----------------------------------------------------------
- streaming message: transactional: 1 prefix: test, sz: 50
+                   data                   
+------------------------------------------
  opening a streamed block for transaction
  streaming change for transaction
  streaming change for transaction
@@ -51,9 +50,20 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  streaming change for transaction
  streaming change for transaction
  streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
  closing a streamed block for transaction
  committing streamed transaction
-(24 rows)
+(34 rows)
 
 -- streaming test for toast changes
 ALTER TABLE stream_test ALTER COLUMN data set storage external;
diff --git a/contrib/test_decoding/expected/twophase_stream.out b/contrib/test_decoding/expected/twophase_stream.out
index b08bb0e573..55d5c6085f 100644
--- a/contrib/test_decoding/expected/twophase_stream.out
+++ b/contrib/test_decoding/expected/twophase_stream.out
@@ -25,13 +25,12 @@ SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
 INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
 TRUNCATE table stream_test;
 ROLLBACK TO s1;
-INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 31) g(i);
 PREPARE TRANSACTION 'test1';
 -- should show the inserts after a ROLLBACK
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
-                           data                           
-----------------------------------------------------------
- streaming message: transactional: 1 prefix: test, sz: 50
+                   data                   
+------------------------------------------
  opening a streamed block for transaction
  streaming change for transaction
  streaming change for transaction
@@ -53,9 +52,20 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  streaming change for transaction
  streaming change for transaction
  streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
  closing a streamed block for transaction
  preparing streamed transaction 'test1'
-(24 rows)
+(34 rows)
 
 COMMIT PREPARED 'test1';
 --should show the COMMIT PREPARED and the other changes in the transaction
@@ -82,10 +92,9 @@ INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20
 PREPARE TRANSACTION 'test1_nodecode';
 -- should NOT show inserts after a ROLLBACK
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
-                           data                           
-----------------------------------------------------------
- streaming message: transactional: 1 prefix: test, sz: 50
-(1 row)
+ data 
+------
+(0 rows)
 
 COMMIT PREPARED 'test1_nodecode';
 -- should show the inserts but not show a COMMIT PREPARED but a COMMIT
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147..8a5819f63d 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,22 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4', 'test_decoding') s4;
+
+-- transaction is large enough to be serialized but aborted.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+ROLLBACK;
+
+RESET logical_decoding_work_mem;
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4', NULL, NULL, 'skip-empty-xacts', '1');
+
+-- Check stats. Since the transaction is already aborted, we don't collect
+-- changes, so no data should be spilled.
+SELECT slot_name, spill_txns = 0 AS spill_txns, spill_count = 0 AS spill_count, total_txns > 0 AS total_txns, total_bytes > 0 AS total_bytes FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 4feec62972..ceab9aa627 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -13,7 +13,7 @@ SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
 INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
 TRUNCATE table stream_test;
 rollback to s1;
-INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 31) g(i);
 COMMIT;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
diff --git a/contrib/test_decoding/sql/twophase_stream.sql b/contrib/test_decoding/sql/twophase_stream.sql
index 646076da20..f607aeae20 100644
--- a/contrib/test_decoding/sql/twophase_stream.sql
+++ b/contrib/test_decoding/sql/twophase_stream.sql
@@ -15,7 +15,7 @@ SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
 INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
 TRUNCATE table stream_test;
 ROLLBACK TO s1;
-INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 31) g(i);
 PREPARE TRANSACTION 'test1';
 -- should show the inserts after a ROLLBACK
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 26d252bd87..d58d123957 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -674,6 +674,9 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
 		txn->first_lsn = lsn;
 		txn->restart_decoding_lsn = rb->current_restart_decoding_lsn;
 
+		/* Check if the transaction already aborted */
+		txn->aborted = TransactionIdDidAbort(xid);
+
 		if (create_as_top)
 		{
 			dlist_push_tail(&rb->toplevel_by_lsn, &txn->node);
@@ -780,11 +783,15 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming
+	 * the previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
+	 *
+	 * XXX To pass some TAP tests, we don't skip decoding already-aborted
+	 * transaction changes if logical_replication_mode is immediate, for now.
 	 */
-	if (txn->concurrent_abort)
+	if (txn->concurrent_abort ||
+		(txn->aborted && logical_replication_mode != LOGICAL_REP_MODE_IMMEDIATE))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1b9db22acb..8879083d28 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -431,6 +431,12 @@ typedef struct ReorderBufferTXN
 	/* If we have detected concurrent abort then ignore future changes. */
 	bool		concurrent_abort;
 
+	/*
+	 * If the transaction is known to be already aborted then ignore
+	 * changes.
+	 */
+	bool		aborted;
+
 	/*
 	 * Private data pointer of the output plugin.
 	 */
#2 Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#1)
Re: Skip collecting decoded changes of already-aborted transactions

Hi,

On 2023-06-09 14:16:44 +0900, Masahiko Sawada wrote:

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

It's very easy to get uses of TransactionIdDidAbort() wrong. For one, it won't
return true when a transaction was implicitly aborted due to a crash /
restart. You're also supposed to use it only after a preceding
TransactionIdIsInProgress() call.

I'm not sure there are issues with not checking TransactionIdIsInProgress()
first in this case, but I'm also not sure there aren't.

A separate issue is that TransactionIdDidAbort() can end up being very slow if
a lot of transactions are in progress concurrently. As soon as the clog
buffers are extended all time is spent copying pages from the kernel
pagecache. I'd not at all be surprised if this change causes a substantial
slowdown in workloads with lots of small transactions, where most transactions
commit.

Greetings,

Andres Freund

#3 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#2)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Jun 11, 2023 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-06-09 14:16:44 +0900, Masahiko Sawada wrote:

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

Thank you for the comment.

It's very easy to get uses of TransactionIdDidAbort() wrong. For one, it won't
return true when a transaction was implicitly aborted due to a crash /
restart. You're also supposed to use it only after a preceding
TransactionIdIsInProgress() call.

I'm not sure there are issues with not checking TransactionIdIsInProgress()
first in this case, but I'm also not sure there aren't.

Yeah, it seems to be better to use !TransactionIdDidCommit() with a
preceding TransactionIdIsInProgress() check.

A separate issue is that TransactionIdDidAbort() can end up being very slow if
a lot of transactions are in progress concurrently. As soon as the clog
buffers are extended all time is spent copying pages from the kernel
pagecache. I'd not at all be surprised if this change causes a substantial
slowdown in workloads with lots of small transactions, where most transactions
commit.

Indeed. So it should check the transaction status less frequently. It
doesn't benefit much even if we can skip collecting decoded changes of
small transactions. Another idea is that we check the status of only
large transactions. That is, when the size of decoded changes of an
aborted transaction exceeds logical_decoding_work_mem, we mark it as
aborted, free its changes decoded so far, and skip further
collection.

Regards

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
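
The idea of checking the status only once a transaction's buffered changes reach the memory limit can be sketched as a toy model in C. This is a simplified illustration under stated assumptions, not the patch itself: ToyTxn, toy_queue_change() and the work_mem parameter are hypothetical names, the model tracks a single transaction rather than picking the largest one, and "spilling" is modeled as adding to a byte counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one buffered transaction; not PostgreSQL code. */
typedef struct ToyTxn
{
	bool		really_aborted;	/* what a status lookup would report */
	bool		marked_aborted;	/* set once a lookup has found the abort */
	size_t		size;			/* bytes of changes currently buffered */
	size_t		spilled;		/* bytes written out so far (modeled I/O) */
	int			status_lookups;	/* number of expensive status lookups */
} ToyTxn;

/*
 * Queue one change of change_size bytes. The transaction status is looked
 * up only when the buffered size reaches work_mem, so small transactions
 * never pay for a lookup.
 */
void
toy_queue_change(ToyTxn *txn, size_t change_size, size_t work_mem)
{
	assert(txn != NULL);

	/* Already known to be aborted: skip collecting the change. */
	if (txn->marked_aborted)
		return;

	txn->size += change_size;
	if (txn->size < work_mem)
		return;

	/* Limit reached: this is the only place the status is checked. */
	txn->status_lookups++;
	if (txn->really_aborted)
	{
		/* Discard what was buffered and ignore future changes. */
		txn->size = 0;
		txn->marked_aborted = true;
		return;
	}

	/* Not aborted: evict the buffered changes (modeled as a spill). */
	txn->spilled += txn->size;
	txn->size = 0;
}
```

For example, with work_mem set to 100 and ten 30-byte changes, an aborted transaction triggers one lookup and spills nothing, whereas a transaction that is not aborted triggers two lookups and spills 240 bytes. A transaction that stays under the limit triggers no lookup at all.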

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#3)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Jun 13, 2023 at 2:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jun 11, 2023 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:

A separate issue is that TransactionIdDidAbort() can end up being very slow if
a lot of transactions are in progress concurrently. As soon as the clog
buffers are extended all time is spent copying pages from the kernel
pagecache. I'd not at all be surprised if this change causes a substantial
slowdown in workloads with lots of small transactions, where most transactions
commit.

Indeed. So it should check the transaction status less frequently. It
doesn't benefit much even if we can skip collecting decoded changes of
small transactions. Another idea is that we check the status of only
large transactions. That is, when the size of decoded changes of an
aborted transaction exceeds logical_decoding_work_mem, we mark it as
aborted, free its changes decoded so far, and skip further
collection.

Your idea might work for large transactions but I have not come across
reports where this is reported as a problem. Do you see any such
reports and can we see how much is the benefit with large
transactions? Because we do have the handling of concurrent aborts
during sys table scans and that might help sometimes for large
transactions.

--
With Regards,
Amit Kapila.

#5 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#4)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Jun 15, 2023 at 7:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 13, 2023 at 2:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jun 11, 2023 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:

A separate issue is that TransactionIdDidAbort() can end up being very slow if
a lot of transactions are in progress concurrently. As soon as the clog
buffers are extended all time is spent copying pages from the kernel
pagecache. I'd not at all be surprised if this change causes a substantial
slowdown in workloads with lots of small transactions, where most transactions
commit.

Indeed. So it should check the transaction status less frequently. It
doesn't benefit much even if we can skip collecting decoded changes of
small transactions. Another idea is that we check the status of only
large transactions. That is, when the size of decoded changes of an
aborted transaction exceeds logical_decoding_work_mem, we mark it as
aborted, free its changes decoded so far, and skip further
collection.

Your idea might work for large transactions but I have not come across
reports where this is reported as a problem. Do you see any such
reports and can we see how much is the benefit with large
transactions? Because we do have the handling of concurrent aborts
during sys table scans and that might help sometimes for large
transactions.

I've heard there was a case where a user had 29 million deletes in a
single transaction with each one wrapped in a savepoint and rolled it
back, which led to 11TB of spill files. If decoding such a large
transaction fails for some reason (e.g., a full disk), it would try
decoding the same transaction again and again.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#5)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jun 21, 2023 at 8:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jun 15, 2023 at 7:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 13, 2023 at 2:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jun 11, 2023 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:

A separate issue is that TransactionIdDidAbort() can end up being very slow if
a lot of transactions are in progress concurrently. As soon as the clog
buffers are extended all time is spent copying pages from the kernel
pagecache. I'd not at all be surprised if this change causes a substantial
slowdown in workloads with lots of small transactions, where most transactions
commit.

Indeed. So it should check the transaction status less frequently. It
doesn't benefit much even if we can skip collecting decoded changes of
small transactions. Another idea is that we check the status of only
large transactions. That is, when the size of decoded changes of an
aborted transaction exceeds logical_decoding_work_mem, we mark it as
aborted, free its changes decoded so far, and skip further
collection.

Your idea might work for large transactions but I have not come across
reports where this is reported as a problem. Do you see any such
reports and can we see how much is the benefit with large
transactions? Because we do have the handling of concurrent aborts
during sys table scans and that might help sometimes for large
transactions.

I've heard there was a case where a user had 29 million deletes in a
single transaction with each one wrapped in a savepoint and rolled it
back, which led to 11TB of spill files. If decoding such a large
transaction fails for some reason (e.g., a full disk), it would try
decoding the same transaction again and again.

I was wondering why the existing handling of concurrent aborts doesn't
handle such a case, and it seems that we check for them only on catalog
access. However, in your case, the user is probably accessing the same
relation without any concurrent DDL on that table, so it would
just be a cache lookup for catalogs. Your idea of checking for aborts
every logical_decoding_work_mem should work for such cases.

--
With Regards,
Amit Kapila.

#7 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

+1 for the idea of checking the transaction status only when we need
to flush it to disk or send it downstream (if streaming of in-progress
transactions is enabled). Although this check is costly, since we plan
to do it only for large transactions, it is worth it if we can
occasionally avoid disk or network I/O for aborted transactions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dilip Kumar (#7)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

+1 for the idea of checking the transaction status only when we need
to flush it to disk or send it downstream (if streaming of in-progress
transactions is enabled). Although this check is costly, since we plan
to do it only for large transactions, it is worth it if we can
occasionally avoid disk or network I/O for aborted transactions.

Thanks.

I've attached the updated patch. With this patch, we check the
transaction status only for large transactions, at eviction time. For
regression test purposes, I disable this transaction status check when
logical_replication_mode is set to 'immediate'.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Skip-decoding-already-aborted-transactions.patch (application/octet-stream)
From c70daa35f2308ea195e177e30b54e2a613f78811 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 3 Jul 2023 10:28:00 +0900
Subject: [PATCH v2] Skip decoding already-aborted transactions.

Previously, we had a mechanism for detecting concurrent aborts for
streaming transactions. This commit enables us to check whether a
large transaction has already aborted so that we can ignore further
changes. This is helpful when the transaction is quite large and has
already been rolled back, since we can avoid disk or network I/O.

We do the check only for large transactions, at eviction time, since
checking CLOG is costly and could cause a slowdown with lots of small
transactions, where most transactions commit.

For testing purposes, we disable this check when
logical_replication_mode is set to "immediate".

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
 .../replication/logical/reorderbuffer.c       | 94 ++++++++++++++++---
 src/include/replication/reorderbuffer.h       | 13 ++-
 2 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 26d252bd87..387d2e9131 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -256,7 +257,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool streaming);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -780,11 +781,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming
+	 * the previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (txn->aborted)
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1603,9 +1604,12 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ * 'streaming' indicates that this function is called while streaming the transaction
+ * and we can mark the transaction as streamed.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool streaming)
 {
 	dlist_mutable_iter iter;
 
@@ -1624,7 +1628,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1658,7 +1662,8 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	if (streaming &&
+		(!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	if (txn_prepared)
@@ -1887,7 +1892,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2030,7 +2035,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2555,7 +2560,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2608,7 +2613,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+			curtxn->aborted = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2792,10 +2797,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Note, for the abort + streaming case a stream_prepare was already sent
+	 * within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (txn->aborted && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3561,6 +3566,59 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discard all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.
+ *
+ * If logical_replication_mode is set to "immediate", we disable this check
+ * for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * If logical_replication_mode is "immediate", we don't check the
+	 * transaction status so the caller always process this transaction.
+	 */
+	if (unlikely(logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE))
+		return false;
+
+	if (txn->aborted)
+		return true;
+
+	if (txn->committed)
+		return false;
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip
+		 * CLOG check next time, avoiding the pressure on CLOG lookup.
+		 */
+		txn->committed = true;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected
+	 * so far, and free all resources allocated for toast reconstruction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+	ReorderBufferToastReset(rb, txn);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	txn->aborted = true;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3613,6 +3671,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3628,6 +3689,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1b9db22acb..fae431ef95 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -428,8 +428,17 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
+	/*
+	 * True if the transaction committed. Then we skip transaction status
+	 * check for this transaction.
+	 */
+	bool		committed;
+
+	/*
+	 * True if the transaction (concurrently) aborted. Then we ignore
+	 * future changes.
+	 */
+	bool		aborted;
 
 	/*
 	 * Private data pointer of the output plugin.
-- 
2.31.1
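
As a rough illustration of the two ReorderBufferCheckMemoryLimit() hunks above, the sketch below models the eviction loop: while the buffer is over its limit, the largest transaction is picked, and an already-aborted one is dropped from memory instead of being spilled. MiniTXN, MiniBuffer, aborted_in_clog and the mini_* helpers are invented for this sketch; the boolean merely stands in for the real transaction status lookups, and none of this is the actual PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for ReorderBufferTXN and ReorderBuffer. */
typedef struct MiniTXN
{
	size_t		size;				/* bytes of changes held in memory */
	bool		aborted_in_clog;	/* what a status lookup would report */
	bool		skip_future;		/* set once the abort has been detected */
} MiniTXN;

typedef struct MiniBuffer
{
	MiniTXN    *txns;
	int			ntxns;
	size_t		size;			/* total bytes across all transactions */
	size_t		limit;			/* plays the role of logical_decoding_work_mem */
	int			spill_count;	/* how many times changes were written out */
} MiniBuffer;

/* Return the transaction holding the most memory, or NULL if all are empty. */
static MiniTXN *
mini_largest_txn(MiniBuffer *rb)
{
	MiniTXN    *largest = NULL;

	for (int i = 0; i < rb->ntxns; i++)
	{
		if (rb->txns[i].size > 0 &&
			(largest == NULL || rb->txns[i].size > largest->size))
			largest = &rb->txns[i];
	}
	return largest;
}

/* Evict transactions until the buffer is back under its limit. */
static void
mini_check_memory_limit(MiniBuffer *rb)
{
	while (rb->size >= rb->limit)
	{
		MiniTXN    *txn = mini_largest_txn(rb);

		if (txn == NULL)
			break;
		assert(rb->size >= txn->size);

		if (txn->aborted_in_clog)
			txn->skip_future = true;	/* discard: nothing is spilled */
		else
			rb->spill_count++;			/* would serialize or stream here */

		/* Either way the in-memory changes are released. */
		rb->size -= txn->size;
		txn->size = 0;
	}
}
```

With one 5000-byte aborted transaction and a limit of 1000, the loop finishes with spill_count still at 0, which is the kind of outcome the spill_txns = 0 check in the stats regression test looks for.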

#9 vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#8)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

+1 for the idea of checking the transaction status only when we need
to flush it to the disk or send it downstream (if streaming in
progress is enabled). Although this check is costly since we are
planning only for large transactions then it is worth it if we can
occasionally avoid disk or network I/O for the aborted transactions.

Thanks.

I've attached the updated patch. With this patch, we check the
transaction status for only large-transactions when eviction. For
regression test purposes, I disable this transaction status check when
logical_replication_mode is set to 'immediate'.

Maybe some changes are missing from the patch; it fails to compile with
the following error:
reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared (first use in this function)
 3584 | if (unlikely(logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE))
      |              ^~~~~~~~~~~~~~~~~~~~~~~~

Regards,
Vignesh

#10 vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#9)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, 3 Oct 2023 at 15:54, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

+1 for the idea of checking the transaction status only when we need
to flush it to the disk or send it downstream (if streaming in
progress is enabled). Although this check is costly since we are
planning only for large transactions then it is worth it if we can
occasionally avoid disk or network I/O for the aborted transactions.

Thanks.

I've attached the updated patch. With this patch, we check the
transaction status for only large-transactions when eviction. For
regression test purposes, I disable this transaction status check when
logical_replication_mode is set to 'immediate'.

Maybe some changes are missing from the patch; it fails to compile with
the following error:
reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared (first use in this function)
 3584 | if (unlikely(logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE))
      |              ^~~~~~~~~~~~~~~~~~~~~~~~

With no update to the thread and the compilation still failing, I'm
marking this as returned with feedback. Please feel free to resubmit
to the next CF when there is a new version of the patch.

Regards,
Vignesh

#11 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#10)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Feb 2, 2024 at 12:48 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 3 Oct 2023 at 15:54, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when the logical replication is behind or restarts, we don't need to
decode large transactions that already aborted, which helps improve
the decoding performance.

+1 for the idea of checking the transaction status only when we need
to flush it to the disk or send it downstream (if streaming in
progress is enabled). Although this check is costly since we are
planning only for large transactions then it is worth it if we can
occasionally avoid disk or network I/O for the aborted transactions.

Thanks.

I've attached the updated patch. With this patch, we check the
transaction status for only large-transactions when eviction. For
regression test purposes, I disable this transaction status check when
logical_replication_mode is set to 'immediate'.

Maybe some changes are missing from the patch; it fails to compile with
the following error:
reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared (first use in this function)
 3584 | if (unlikely(logical_replication_mode == LOGICAL_REP_MODE_IMMEDIATE))
      |              ^~~~~~~~~~~~~~~~~~~~~~~~

With no update to the thread and the compilation still failing I'm
marking this as returned with feedback. Please feel free to resubmit
to the next CF when there is a new version of the patch.

I've resumed working on this item and attached a new version of the patch.

I rebased the patch to the current HEAD and updated comments and
commit messages. The patch is straightforward and I'm somewhat
satisfied with it, but I'm thinking of adding some tests for it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/x-patch)
From c5c78f5d53d375f7a79b2561c551f7bb3ff57717 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 3 Jul 2023 10:28:00 +0900
Subject: [PATCH v3] Skip logical decoding of already-aborted transactions.

If we detect a concurrent abort of a streaming transaction, we discard
all changes and skip decoding further changes of the transaction. This
commit introduces a new check if a (streaming or non-streaming)
transaction is already aborted by CLOG lookup, enabling us to skip
decoding further changes of the transaction. This helps a lot in
logical decoding performance in a case where the transaction is large
and already rolled back since we can save disk or network I/O.

We do this new check for only large-transactions when eviction since
checking CLOG is costly and could cause a slowdown with lots of small
transactions, where most transactions commit.

Reviewed-by:
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 98 ++++++++++++++++---
 src/include/replication/reorderbuffer.h       | 13 ++-
 2 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbf0966182..f3284708bf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -256,7 +257,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool streaming);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -777,11 +778,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (txn->aborted)
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1600,9 +1601,12 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ *
+ * 'streaming_txn' indicates that the given transaction is a streaming transaction.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool streaming_txn)
 {
 	dlist_mutable_iter iter;
 
@@ -1621,7 +1625,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, streaming_txn);
 	}
 
 	/* cleanup changes in the txn */
@@ -1655,7 +1659,8 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	if (streaming_txn && (!txn_prepared) &&
+		(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	if (txn_prepared)
@@ -1884,7 +1889,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2027,7 +2032,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2552,7 +2557,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2605,7 +2610,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+			curtxn->aborted = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2789,10 +2794,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Note, for the abort + streaming case a stream_prepare was already sent
+	 * within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (txn->aborted && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3558,6 +3563,63 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discard all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.
+ *
+ * If logical_replication_mode is set to "immediate", we disable this check
+ * for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * If logical_replication_mode is "immediate", we don't check the
+	 * transaction status so the caller always process this transaction.
+	 */
+	if (debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+		return false;
+
+	/* Quick return if we've already knew the transaction status */
+	if (txn->aborted)
+		return true;
+
+	if (txn->committed)
+		return false;
+
+	/* Check the transaction status using CLOG lookup */
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		txn->committed = true;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+	ReorderBufferToastReset(rb, txn);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	txn->aborted = true;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3610,6 +3672,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3625,6 +3690,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..fe7874bc10 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -409,8 +409,17 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
+	/*
+	 * True if the transaction committed. Then we skip transaction status
+	 * check for this transaction.
+	 */
+	bool		committed;
+
+	/*
+	 * True if the transaction (concurrently) aborted. Then we ignore
+	 * future changes.
+	 */
+	bool		aborted;
 
 	/*
 	 * Private data pointer of the output plugin.
-- 
2.39.3
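
The order of checks in ReorderBufferCheckTXNAbort() above can be modelled in isolation. This is only a sketch under stated assumptions: the XactState enum and the real_state field are hypothetical stand-ins for TransactionIdIsInProgress() and TransactionIdDidCommit(), the debug GUC shortcut is left out, and lookups simply counts how often the simulated status lookup is consulted, to show that a cached result short-circuits it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for what procarray/CLOG would report. */
typedef enum XactState
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

/* Simplified stand-in for ReorderBufferTXN. */
typedef struct MiniTXN
{
	bool		committed;	/* cached result: known to have committed */
	bool		aborted;	/* cached result: known to have aborted */
	XactState	real_state; /* what the simulated lookup returns */
	int			lookups;	/* number of simulated status lookups */
	int			nchanges;	/* changes collected so far */
} MiniTXN;

/* Return true if the transaction is aborted and its changes were dropped. */
static bool
mini_check_txn_abort(MiniTXN *txn)
{
	assert(!(txn->committed && txn->aborted));

	/* Quick return if the status is already known. */
	if (txn->aborted)
		return true;
	if (txn->committed)
		return false;

	/* Otherwise pay for a status lookup. */
	txn->lookups++;

	if (txn->real_state == XACT_IN_PROGRESS)
		return false;			/* nothing to cache yet */

	if (txn->real_state == XACT_COMMITTED)
	{
		txn->committed = true;	/* skip the lookup next time */
		return false;
	}

	/* Aborted: drop what was collected and ignore future changes. */
	txn->nchanges = 0;
	txn->aborted = true;
	return true;
}
```

Calling mini_check_txn_abort() twice on a committed or aborted transaction performs a single lookup, whereas an in-progress transaction is looked up again on every call because its final status is not yet known.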

#12 Ajin Cherian
itsajin@gmail.com
In reply to: Masahiko Sawada (#11)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Mar 15, 2024 at 3:17 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

I resumed working on this item. I've attached the new version patch.

I rebased the patch to the current HEAD and updated comments and
commit messages. The patch is straightforward and I'm somewhat
satisfied with it, but I'm thinking of adding some tests for it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

I just had a look at the patch. It no longer applies because a recent
commit removed a header. Overall the patch looks fine, and I didn't find
any issues. Some cosmetic comments in
ReorderBufferCheckTXNAbort():
+ /* Quick return if we've already knew the transaction status */
+ if (txn->aborted)
+ return true;

knew/know

/*
+ * If logical_replication_mode is "immediate", we don't check the
+ * transaction status so the caller always process this transaction.
+ */
+ if (debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+ return false;

/process/processes

regards,
Ajin Cherian
Fujitsu Australia

#13 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#12)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Mar 15, 2024 at 1:21 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Mar 15, 2024 at 3:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I resumed working on this item. I've attached the new version patch.

I rebased the patch to the current HEAD and updated comments and
commit messages. The patch is straightforward and I'm somewhat
satisfied with it, but I'm thinking of adding some tests for it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

I just had a look at the patch, the patch no longer applies because of a removal of a header in a recent commit. Overall the patch looks fine, and I didn't find any issues. Some cosmetic comments:

Thank you for your review comments.

in ReorderBufferCheckTXNAbort()
+ /* Quick return if we've already knew the transaction status */
+ if (txn->aborted)
+ return true;

knew/know

Maybe it should be "known"?

/*
+ * If logical_replication_mode is "immediate", we don't check the
+ * transaction status so the caller always process this transaction.
+ */
+ if (debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+ return false;

/process/processes

Fixed.

In addition to these changes, I've made some changes to the latest
patch. Here is the summary:

- Use the txn_flags field to record the transaction status instead of
the two separate 'committed' and 'aborted' flags.
- Add regression tests.
- Update commit message.
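
To illustrate the first item in the list above, here is a minimal sketch of folding the two booleans into bits of txn_flags. The macro names follow the rbtxn_did_abort()/rbtxn_did_commit() accessors and RBTXN_COMMITTED/RBTXN_ABORTED flags used in the patch, but the bit values, the MiniTXN struct and mini_txn_set_status() are made up for this example and are not taken from reorderbuffer.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define RBTXN_COMMITTED 0x0001u
#define RBTXN_ABORTED   0x0002u

/* Stand-in for ReorderBufferTXN; only the field used here. */
typedef struct MiniTXN
{
	uint32_t	txn_flags;
} MiniTXN;

/* Has the transaction been recorded as committed / aborted? */
#define rbtxn_did_commit(txn) (((txn)->txn_flags & RBTXN_COMMITTED) != 0)
#define rbtxn_did_abort(txn)  (((txn)->txn_flags & RBTXN_ABORTED) != 0)

/* Record the final status once; the two bits are mutually exclusive. */
static void
mini_txn_set_status(MiniTXN *txn, bool committed)
{
	assert((txn->txn_flags & (RBTXN_COMMITTED | RBTXN_ABORTED)) == 0);
	txn->txn_flags |= committed ? RBTXN_COMMITTED : RBTXN_ABORTED;
}
```

Keeping both states in one flags word lets a single assertion verify that neither status has been recorded yet before one is set, much like the Assert added in ReorderBufferProcessTXN().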

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From 2de0bc1beafbb1852c64df3133f57fa2e2ff91a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 3 Jul 2023 10:28:00 +0900
Subject: [PATCH v4] Skip logical decoding of already-aborted transactions.

Currently, concurrent aborts are detected only during system catalog
scans while replaying a transaction. This commit introduces an
additional check to determine if a transaction is already aborted by a
CLOG lookup, so the logical decoding skips further change also when it
doesn't touch system catalogs.

This optimization enhances logical decoding performance, especially
for large transactions that have already been rolled back, as it
avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C,
Ajin Cherian
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/sql/stats.sql           |  22 +++-
 .../replication/logical/reorderbuffer.c       | 119 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  18 ++-
 3 files changed, 137 insertions(+), 22 deletions(-)

diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147..7e05f39fc5 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,27 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and skip to collect
+-- further changes. So the transaction should not be spilled at all.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+TRUNCATE table stats_test;
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+-- should show only ROLLBACK PREPARED.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 92cf39ff74..d91e93a011 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -101,6 +101,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -255,7 +256,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool mark_streamed);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -776,11 +777,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_did_abort(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1591,17 +1592,20 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
- * 'txn_prepared' indicates that we have decoded the transaction at prepare
- * time.
+ * If mark_streamed is true, we could mark the transaction as streamed.
+ *
+ * 'streaming_txn' indicates that the given transaction is a streaming transaction.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool mark_streamed)
 {
 	dlist_mutable_iter iter;
 
@@ -1620,7 +1624,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, mark_streamed);
 	}
 
 	/* cleanup changes in the txn */
@@ -1654,7 +1658,8 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	if (mark_streamed && (!txn_prepared) &&
+		(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	if (txn_prepared)
@@ -1883,7 +1888,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2026,7 +2031,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2551,7 +2556,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2604,7 +2609,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Update transaction status */
+			Assert((curtxn->txn_flags & (RBTXN_COMMITTED | RBTXN_ABORTED)) == 0);
+			curtxn->txn_flags |= RBTXN_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2766,6 +2774,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 					 char *gid)
 {
 	ReorderBufferTXN *txn;
+	bool		txn_aborted;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -2777,6 +2786,12 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	txn->txn_flags |= RBTXN_PREPARE;
 	txn->gid = pstrdup(gid);
 
+	/*
+	 * We remember whether the transaction is already aborted before the
+	 * replay in order to detect the concurrent abort below.
+	 */
+	txn_aborted = rbtxn_did_abort(txn);
+
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
@@ -2788,10 +2803,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!txn_aborted && rbtxn_did_abort(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3557,6 +3572,66 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discards all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.
+ *
+ * If logical_replication_mode is set to "immediate", we disable this check
+ * for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * If logical_replication_mode is "immediate", we don't check the
+	 * transaction status so the caller always processes this transaction.
+	 */
+	if (debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_did_abort(txn))
+		return true;
+	if (rbtxn_did_commit(txn))
+		return false;
+
+	/* Check the transaction status using CLOG lookup */
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		txn->txn_flags |= RBTXN_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 *
+	 * We don't mark the transaction as streamed since this function can be
+	 * called for non-streamed transactions too.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+	ReorderBufferToastReset(rb, txn);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	txn->txn_flags |= RBTXN_ABORTED;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3609,6 +3684,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3624,6 +3703,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..23c505f29b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -167,6 +167,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_COMMITTED				0x0200
+#define RBTXN_ABORTED				0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,6 +226,19 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Did this transaction committed? */
+#define rbtxn_did_commit(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMITTED) != 0 \
+)
+
+/* Did this transaction aborted? */
+#define rbtxn_did_abort(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ABORTED) != 0 \
+)
+
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -409,9 +424,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.39.3

#14 Ajin Cherian
itsajin@gmail.com
In reply to: Masahiko Sawada (#13)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

In addition to these changes, I've made some changes to the latest
patch. Here is the summary:

- Use txn_flags field to record the transaction status instead of two
'committed' and 'aborted' flags.
- Add regression tests.
- Update commit message.

Regards,

Hi Sawada-san,

Thanks for the updated patch. Some comments:

1.
+ * already aborted, we discards all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.

we discards/we discard

2. In function ReorderBufferCheckTXNAbort(): I haven't tested this but I
wonder how prepared transactions would be considered, they are neither
committed, nor in progress.

regards,
Ajin Cherian
Fujitsu Australia

#15 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#14)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Mar 27, 2024 at 8:49 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

In addition to these changes, I've made some changes to the latest
patch. Here is the summary:

- Use txn_flags field to record the transaction status instead of two
'committed' and 'aborted' flags.
- Add regression tests.
- Update commit message.

Regards,

Hi Sawada-san,

Thanks for the updated patch. Some comments:

Thank you for the review comments!

1.
+ * already aborted, we discards all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.

we discards/we discard

Will fix it.

2. In function ReorderBufferCheckTXNAbort(): I haven't tested this but I wonder how prepared transactions would be considered, they are neither committed, nor in progress.

A transaction that is prepared but not yet resolved is considered
in-progress.
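
To illustrate the point, here is a minimal standalone sketch, not code from the patch. The demo_* names and the DemoXactState enum are hypothetical stand-ins for what the real TransactionIdIsInProgress() and TransactionIdDidCommit() lookups would report. It only models the order of the checks, under the assumption stated above that a prepared, unresolved transaction counts as in-progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for what the real status lookups would report. */
typedef enum DemoXactState
{
	DEMO_XACT_RUNNING,
	DEMO_XACT_PREPARED,		/* PREPARE TRANSACTION done, not yet resolved */
	DEMO_XACT_COMMITTED,
	DEMO_XACT_ABORTED
} DemoXactState;

/* Models the in-progress test: a prepared transaction still counts. */
bool
demo_is_in_progress(DemoXactState state)
{
	return state == DEMO_XACT_RUNNING || state == DEMO_XACT_PREPARED;
}

/* Models the commit lookup, consulted only once not in progress. */
bool
demo_did_commit(DemoXactState state)
{
	return state == DEMO_XACT_COMMITTED;
}

/* Returns true only when the transaction is known to have aborted. */
bool
demo_check_txn_abort(DemoXactState state)
{
	if (demo_is_in_progress(state))
		return false;
	if (demo_did_commit(state))
		return false;
	return true;
}
```

In this model a prepared transaction leaves through the first branch, so it is never classified as aborted before it is resolved.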

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16 Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#13)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

Hi, here are some review comments for your patch v4-0001.

======
contrib/test_decoding/sql/stats.sql

1.
Huh? The test fails because the "expected results" file for these new
tests is missing from the patch.

======
.../replication/logical/reorderbuffer.c

2.
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

IIUC this new 'mark_streamed' parameter is more like a prerequisite
for the other conditions to decide to mark the tx as streamed -- i.e.
it is more like 'can_mark_streamed', so I felt the name should be
changed to be like that (everywhere it is used).

~~~

3. ReorderBufferTruncateTXN

- * 'txn_prepared' indicates that we have decoded the transaction at prepare
- * time.
+ * If mark_streamed is true, we could mark the transaction as streamed.
+ *
+ * 'streaming_txn' indicates that the given transaction is a
streaming transaction.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_streamed)

~

What's that new comment about 'streaming_txn' for? It seemed unrelated
to the patch code.

~~~

4.
/*
* Mark the transaction as streamed.
*
* The top-level transaction, is marked as streamed always, even if it
* does not contain any changes (that is, when all the changes are in
* subtransactions).
*
* For subtransactions, we only mark them as streamed when there are
* changes in them.
*
* We do it this way because of aborts - we don't want to send aborts for
* XIDs the downstream is not aware of. And of course, it always knows
* about the toplevel xact (we send the XID in all messages), but we never
* stream XIDs of empty subxacts.
*/
if (mark_streamed && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

~~

With the patch introduction of the new parameter, I felt this code
might be better if it was refactored as follows:

/* Mark the transaction as streamed, if appropriate. */
if (can_mark_streamed)
{
/*
... large comment
*/
if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;
}

~~~

5. ReorderBufferPrepare

- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!txn_aborted && rbtxn_did_abort(txn) && !rbtxn_is_streamed(txn))
  rb->prepare(rb, txn, txn->final_lsn);

~

Maybe I misunderstood this logic, but won't a "concurrent abort" cause
your new Assert added in ReorderBufferProcessTXN to fail?

+ /* Update transaction status */
+ Assert((curtxn->txn_flags & (RBTXN_COMMITTED | RBTXN_ABORTED)) == 0);

~~~

6. ReorderBufferCheckTXNAbort

+ /* Check the transaction status using CLOG lookup */
+ if (TransactionIdIsInProgress(txn->xid))
+ return false;
+
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ txn->txn_flags |= RBTXN_COMMITTED;
+ return false;
+ }

IIUC the purpose of the TransactionIdDidCommit() was to avoid the
overhead of calling the TransactionIdIsInProgress(). So, shouldn't the
order of these checks be swapped? Otherwise, there might be 1 extra
unnecessary call to TransactionIdIsInProgress() next time.

======
src/include/replication/reorderbuffer.h

7.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_COMMITTED 0x0200
+#define RBTXN_ABORTED 0x0400

For consistency with the existing bitmask names, I guess these should be named:
- RBTXN_COMMITTED --> RBTXN_IS_COMMITTED
- RBTXN_ABORTED --> RBTXN_IS_ABORTED

~~~

8.
Similarly, IMO the macros should have the same names as the bitmasks,
like the other nearby ones generally seem to.

rbtxn_did_commit --> rbtxn_is_committed
rbtxn_did_abort --> rbtxn_is_aborted

======

9.
Also, attached is a top-up patch for other cosmetic nitpicks:
- comment wording
- typos in comments
- excessive or missing blank lines
- etc.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

20240612_PS_nitpicks_for_v4.txt (text/plain; charset=US-ASCII)
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 7e05f39..a6a441d 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -54,14 +54,15 @@ COMMIT;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
 
 -- Execute a transaction that is prepared and aborted. We detect that the
--- transaction is aborted before spilling changes, and skip to collect
--- further changes. So the transaction should not be spilled at all.
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes. So, the transaction should not be spilled at all.
 BEGIN;
 INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
 TRUNCATE table stats_test;
 PREPARE TRANSACTION 'test1_abort';
 ROLLBACK PREPARED 'test1_abort';
--- should show only ROLLBACK PREAPRED.
+
+-- Should show only ROLLBACK PREPARED.
 SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- Check stats. We should not spill anything as the transaction is already
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 861ee1c..8b3e1b8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2802,7 +2802,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	txn->gid = pstrdup(gid);
 
 	/*
-	 * We remember whether the transaction is already aborted before the
+	 * Remember whether the transaction is already aborted before the
 	 * replay in order to detect the concurrent abort below.
 	 */
 	txn_aborted = rbtxn_did_abort(txn);
@@ -3608,18 +3608,20 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 
 /*
  * Check the transaction status of the given transaction. If the transaction
- * already aborted, we discards all changes accumulated so far and ignore
- * future changes, and return true. Otherwise return false.
+ * already aborted, we discard all changes accumulated so far, ignore future
+ * changes, and return true. Otherwise return false.
  *
- * If logical_replication_mode is set to "immediate", we disable this check
- * for regression tests.
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
  */
 static bool
 ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	/*
-	 * If logical_replication_mode is "immediate", we don't check the
-	 * transaction status so the caller always processes this transaction.
+	 * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+	 * check the transaction status, so the caller always processes this
+	 * transaction.
 	 */
 	if (debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
 		return false;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b0d381c..8eb0704 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -227,19 +227,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
-/* Did this transaction committed? */
+/* Is this transaction committed? */
 #define rbtxn_did_commit(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_COMMITTED) != 0 \
 )
 
-/* Did this transaction aborted? */
+/* Is this transaction aborted? */
 #define rbtxn_did_abort(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_ABORTED) != 0 \
 )
 
-
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
#17 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#16)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

Sorry for the late reply.

On Tue, Jun 11, 2024 at 7:41 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi, here are some review comments for your patch v4-0001.

Thank you for reviewing the patch!

======
contrib/test_decoding/sql/stats.sql

1.
Huh? The test fails because the "expected results" file for these new
tests is missing from the patch.

Fixed.

======
.../replication/logical/reorderbuffer.c

2.
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

IIUC this new 'mark_streamed' parameter is more like a prerequisite
for the other conditions to decide to mark the tx as streamed -- i.e.
it is more like 'can_mark_streamed', so I felt the name should be
changed to be like that (everywhere it is used).

Agreed. I think 'txn_streaming' sounds better and is consistent with
'txn_prepared'.

~~~

3. ReorderBufferTruncateTXN

- * 'txn_prepared' indicates that we have decoded the transaction at prepare
- * time.
+ * If mark_streamed is true, we could mark the transaction as streamed.
+ *
+ * 'streaming_txn' indicates that the given transaction is a
streaming transaction.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_streamed)

~

What's that new comment about 'streaming_txn' for? It seemed unrelated
to the patch code.

Removed.

~~~

4.
/*
* Mark the transaction as streamed.
*
* The top-level transaction, is marked as streamed always, even if it
* does not contain any changes (that is, when all the changes are in
* subtransactions).
*
* For subtransactions, we only mark them as streamed when there are
* changes in them.
*
* We do it this way because of aborts - we don't want to send aborts for
* XIDs the downstream is not aware of. And of course, it always knows
* about the toplevel xact (we send the XID in all messages), but we never
* stream XIDs of empty subxacts.
*/
if (mark_streamed && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

~~

With the patch introduction of the new parameter, I felt this code
might be better if it was refactored as follows:

/* Mark the transaction as streamed, if appropriate. */
if (can_mark_streamed)
{
/*
... large comment
*/
if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;
}

I think we don't necessarily need to make nested if statements just
for comments.
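
For reference, the two forms behave identically. The following is a small standalone sketch, not code from the patch: the ex_* function names are hypothetical, and the boolean and integer parameters stand in for txn_streaming, txn_prepared, rbtxn_is_toptxn(txn) and txn->nentries_mem. It compares the flat condition with the nested one over every input combination.

```c
#include <assert.h>
#include <stdbool.h>

/* Flat form, as in the patch hunk quoted above (names are stand-ins). */
bool
ex_mark_streamed_flat(bool txn_streaming, bool txn_prepared,
					  bool is_toptxn, int nentries_mem)
{
	return txn_streaming && !txn_prepared &&
		(is_toptxn || nentries_mem != 0);
}

/* Nested form, as suggested in the review comment. */
bool
ex_mark_streamed_nested(bool txn_streaming, bool txn_prepared,
						bool is_toptxn, int nentries_mem)
{
	if (txn_streaming)
	{
		if (!txn_prepared && (is_toptxn || nentries_mem != 0))
			return true;
	}
	return false;
}

/* Exhaustively compare both forms over all input combinations. */
bool
ex_forms_agree(void)
{
	for (int bits = 0; bits < 16; bits++)
	{
		bool		streaming = (bits & 1) != 0;
		bool		prepared = (bits & 2) != 0;
		bool		toptxn = (bits & 4) != 0;
		int			nentries = (bits & 8) ? 1 : 0;

		if (ex_mark_streamed_flat(streaming, prepared, toptxn, nentries) !=
			ex_mark_streamed_nested(streaming, prepared, toptxn, nentries))
			return false;
	}
	return true;
}
```

Since the outcome is the same either way, the choice between the two is purely about where the large comment reads best.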

~~~

5. ReorderBufferPrepare

- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!txn_aborted && rbtxn_did_abort(txn) && !rbtxn_is_streamed(txn))
rb->prepare(rb, txn, txn->final_lsn);

~

Maybe I misunderstood this logic, but won't a "concurrent abort" cause
your new Assert added in ReorderBufferProcessTXN to fail?

+ /* Update transaction status */
+ Assert((curtxn->txn_flags & (RBTXN_COMMITTED | RBTXN_ABORTED)) == 0);

I changed the txn_flags checks, which should address your concern.
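
In the attached v5 patch, the assertion in ReorderBufferProcessTXN only requires that RBTXN_IS_COMMITTED is not set before RBTXN_IS_ABORTED is OR-ed in. A rough standalone sketch of that behavior follows. It is not the patch code: the MOCK_* names and mock_mark_aborted() are hypothetical, and the bit values are assumed to match the ones shown in the v4 header hunk.

```c
#include <assert.h>

/* Values mirror the v4 hunk; the MOCK_ names are hypothetical. */
#define MOCK_TXN_IS_COMMITTED	0x0200
#define MOCK_TXN_IS_ABORTED		0x0400

/*
 * Record an abort in the flag word.  Only a transaction already marked
 * committed is rejected; one already marked aborted is accepted, and
 * setting the bit again leaves the flags unchanged.
 */
unsigned int
mock_mark_aborted(unsigned int txn_flags)
{
	assert((txn_flags & MOCK_TXN_IS_COMMITTED) == 0);
	return txn_flags | MOCK_TXN_IS_ABORTED;
}
```

With the assertion relaxed this way, reaching the concurrent-abort path for a transaction that is already flagged as aborted no longer trips it.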

~~~

6. ReorderBufferCheckTXNAbort

+ /* Check the transaction status using CLOG lookup */
+ if (TransactionIdIsInProgress(txn->xid))
+ return false;
+
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ txn->txn_flags |= RBTXN_COMMITTED;
+ return false;
+ }

IIUC the purpose of the TransactionIdDidCommit() was to avoid the
overhead of calling the TransactionIdIsInProgress(). So, shouldn't the
order of these checks be swapped? Otherwise, there might be 1 extra
unnecessary call to TransactionIdIsInProgress() next time.

I'm not sure I understand your comment. IIUC we should call
TransactionIdDidCommit() only after a preceding TransactionIdIsInProgress()
check. Also, once we have found that the transaction is committed, we no
longer check the transaction status in CLOG or call
TransactionIdIsInProgress().
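
The effect of remembering the status can be sketched with a small standalone model. This is not the patch code: the toy_* names, the ToyTxn struct and the call counters are hypothetical, the two lookups are mocked, and the flag values are assumed from the v4 header hunk. It counts how many mock lookups happen when the same transaction is checked repeatedly.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values mirror the v4 hunk; the TOY_ names are hypothetical. */
#define TOY_TXN_IS_COMMITTED	0x0200
#define TOY_TXN_IS_ABORTED		0x0400

typedef enum ToyXactState
{
	TOY_XACT_RUNNING,
	TOY_XACT_COMMITTED,
	TOY_XACT_ABORTED
} ToyXactState;

typedef struct ToyTxn
{
	unsigned int txn_flags;		/* cached status bits */
	ToyXactState actual;		/* what the mock lookups report */
} ToyTxn;

/* Counters so the effect of caching can be observed. */
int			toy_in_progress_calls = 0;
int			toy_clog_calls = 0;

bool
toy_check_txn_abort(ToyTxn *txn)
{
	/* Quick return if the status is already cached in the flags. */
	if (txn->txn_flags & TOY_TXN_IS_ABORTED)
		return true;
	if (txn->txn_flags & TOY_TXN_IS_COMMITTED)
		return false;

	/* Mock in-progress test, always done before the commit lookup. */
	toy_in_progress_calls++;
	if (txn->actual == TOY_XACT_RUNNING)
		return false;

	/* Mock commit lookup. */
	toy_clog_calls++;
	if (txn->actual == TOY_XACT_COMMITTED)
	{
		txn->txn_flags |= TOY_TXN_IS_COMMITTED;
		return false;
	}

	txn->txn_flags |= TOY_TXN_IS_ABORTED;
	return true;
}

/* Check the same transaction 'nchecks' times; return total mock lookups. */
int
toy_total_lookups(ToyXactState actual, int nchecks)
{
	ToyTxn		txn = {0, actual};

	toy_in_progress_calls = 0;
	toy_clog_calls = 0;
	for (int i = 0; i < nchecks; i++)
		(void) toy_check_txn_abort(&txn);
	return toy_in_progress_calls + toy_clog_calls;
}
```

In this model a committed or aborted transaction costs one in-progress test and one commit lookup in total, no matter how often it is checked afterwards. Only a transaction that is still running repeats the in-progress test on each check.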

======
src/include/replication/reorderbuffer.h

7.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_COMMITTED 0x0200
+#define RBTXN_ABORTED 0x0400

For consistency with the existing bitmask names, I guess these should be named:
- RBTXN_COMMITTED --> RBTXN_IS_COMMITTED
- RBTXN_ABORTED --> RBTXN_IS_ABORTED

Agreed and changed.

~~~

8.
Similarly, IMO the macros should have the same names as the bitmasks,
like the other nearby ones generally seem to.

rbtxn_did_commit --> rbtxn_is_committed
rbtxn_did_abort --> rbtxn_is_aborted

Changed.

======

9.
Also, attached is a top-up patch for other cosmetic nitpicks:
- comment wording
- typos in comments
- excessive or missing blank lines
- etc.

Applied your patch.

I've attached the updated patch. Will register it for the next commit fest.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From 3db3318c859c71ff563daba94026a3d961d85517 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v5] Skip logical decoding of already-aborted transactions.

Currently, concurrent aborts are detected only during system catalog
scans while replaying a transaction. This commit introduces an
additional check that determines, via a CLOG lookup, whether a
transaction has already aborted, so that logical decoding skips further
changes even when it doesn't touch system catalogs.

This optimization enhances logical decoding performance, especially
for large transactions that have already been rolled back, as it
avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  44 ++++++-
 contrib/test_decoding/expected/stream.out     |   4 +-
 contrib/test_decoding/sql/stats.sql           |  23 +++-
 contrib/test_decoding/sql/stream.sql          |   2 +-
 .../replication/logical/reorderbuffer.c       | 115 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  17 ++-
 6 files changed, 177 insertions(+), 28 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..6cab8e9d21e 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,48 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes. So, the transaction should not be spilled at all.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+TRUNCATE table stats_test;
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+-- Should show only ROLLBACK PREPARED.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+              data               
+---------------------------------
+ ROLLBACK PREPARED 'test1_abort'
+(1 row)
+
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txn | spill_count 
+---------------------------------+-----------+-------------
+ regression_slot_stats4_twophase | t         | t
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..0950f552c45 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -110,7 +110,7 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
 (17 rows)
 
 /*
- * Test concurrent abort with toast data. When streaming the second insertion, we
+ * Test concurrent abort with toast data. Before streaming the second insertion, we
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
@@ -125,7 +125,7 @@ COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
  count 
 -------
-     5
+     4
 (1 row)
 
 DROP TABLE stream_test;
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a6a441ddd31 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,28 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes. So, the transaction should not be spilled at all.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+TRUNCATE table stats_test;
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+
+-- Should show only ROLLBACK PREPARED.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..5d07cd583e4 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -45,7 +45,7 @@ toasted-123456789012345678901234567890123456789012345678901234567890123456789012
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
 
 /*
- * Test concurrent abort with toast data. When streaming the second insertion, we
+ * Test concurrent abort with toast data. Before streaming the second insertion, we
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..929f168e83c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool mark_streamed);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +794,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1621,20 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ * 'txn_streaming' indicates that the transaction is being streamed.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1681,7 +1685,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
 	/*
-	 * Mark the transaction as streamed.
+	 * Mark the transaction as streamed, if appropriate.
 	 *
 	 * The top-level transaction, is marked as streamed always, even if it
 	 * does not contain any changes (that is, when all the changes are in
@@ -1695,7 +1699,8 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	if (txn_streaming && (!txn_prepared) &&
+		(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
 	if (txn_prepared)
@@ -1924,7 +1929,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2067,7 +2072,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2600,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2653,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted */
+			Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2832,10 +2840,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3620,6 +3628,71 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discard all changes accumulated so far, ignore future
+ * changes, and return true. Otherwise return false.
+ *
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+	 * check the transaction status, so the caller always processes this
+	 * transaction.
+	 */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_is_aborted(txn))
+		return true;
+	if (rbtxn_is_committed(txn))
+		return false;
+
+	/* Check the transaction status using CLOG lookup */
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert((txn->txn_flags & RBTXN_IS_ABORTED) == 0);
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * We don't mark the transaction as streamed since this function can be
+	 * called for non-streamed transactions too.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+	ReorderBufferToastReset(rb, txn);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert((txn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3672,6 +3745,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3764,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#14)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Mar 27, 2024 at 4:49 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

In addition to these changes, I've made some changes to the latest
patch. Here is the summary:

- Use txn_flags field to record the transaction status instead of two
'committed' and 'aborted' flags.
- Add regression tests.
- Update commit message.
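
As a rough illustration of the first item above, here is a minimal standalone sketch of recording the transaction status as bits in a single flags word. The two bit values and the two test macros are copied from the patch; the FlagTXN struct and the demo_mark_* helpers are hypothetical stand-ins and are not part of reorderbuffer.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined by the patch in reorderbuffer.h. */
#define RBTXN_IS_COMMITTED 0x0200
#define RBTXN_IS_ABORTED   0x0400

/* Hypothetical stand-in for ReorderBufferTXN, reduced to its flags word. */
typedef struct FlagTXN
{
    uint32_t    txn_flags;
} FlagTXN;

/* Test macros, same shape as the ones the patch adds. */
#define rbtxn_is_committed(txn) (((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0)
#define rbtxn_is_aborted(txn)   (((txn)->txn_flags & RBTXN_IS_ABORTED) != 0)

/* Hypothetical helper: remember an abort; both status bits must never be set. */
static void
demo_mark_aborted(FlagTXN *txn)
{
    assert(!rbtxn_is_committed(txn));
    txn->txn_flags |= RBTXN_IS_ABORTED;
}

/* Hypothetical helper: remember a commit; other flag bits are left untouched. */
static void
demo_mark_committed(FlagTXN *txn)
{
    assert(!rbtxn_is_aborted(txn));
    txn->txn_flags |= RBTXN_IS_COMMITTED;
}
```

With a single flags word, the "neither bit set" state naturally means "status not yet known", which two separate booleans would express less compactly.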

Regards,

Hi Sawada-san,

Thanks for the updated patch. Some comments:

1.
+ * already aborted, we discards all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.

we discards/we discard

This comment is incorporated into the latest v5 patch I've just sent [1].

2. In function ReorderBufferCheckTXNAbort(): I haven't tested this, but I wonder how prepared transactions would be considered; they are neither committed nor in progress.

IIUC, prepared transactions are considered to be in progress.
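
To make the order of checks concrete, here is a minimal standalone model of the decision sequence in ReorderBufferCheckTXNAbort(). It is only a sketch: the StatusTXN struct, the DemoXactStatus enum, and demo_check_txn_abort() are hypothetical, and the status is read from a field instead of calling TransactionIdIsInProgress() and TransactionIdDidCommit(). It assumes, per the answer above, that a prepared transaction is reported as in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what the status lookups report for an XID. */
typedef enum DemoXactStatus
{
    DEMO_IN_PROGRESS,           /* assumed to include prepared transactions */
    DEMO_COMMITTED,
    DEMO_ABORTED
} DemoXactStatus;

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct StatusTXN
{
    DemoXactStatus actual;      /* what a lookup would report */
    bool        known_committed;    /* models RBTXN_IS_COMMITTED */
    bool        known_aborted;      /* models RBTXN_IS_ABORTED */
    int         lookups;        /* how many status lookups were performed */
} StatusTXN;

/* Returns true if the transaction is aborted and should be skipped. */
static bool
demo_check_txn_abort(StatusTXN *txn)
{
    /* Quick return if the status was remembered by an earlier call. */
    if (txn->known_aborted)
        return true;
    if (txn->known_committed)
        return false;

    txn->lookups++;

    /* In-progress is not a final state, so it is not remembered. */
    if (txn->actual == DEMO_IN_PROGRESS)
        return false;

    if (txn->actual == DEMO_COMMITTED)
    {
        txn->known_committed = true;
        return false;
    }

    /* Neither in progress nor committed: treat as aborted and remember it. */
    txn->known_aborted = true;
    return true;
}
```

In this model a prepared transaction falls into the first lookup branch, so it is neither discarded nor cached, and it is looked up again on the next call.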

Regards,

[1]: /messages/by-id/CAD21AoDJE-bLdxt9T_z1rw74RN=E0n0+esYU0eo+-_P32EbuVg@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#19Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#18)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San, here are some review comments for the patch v5-0001.

======
Commit message.

1.
This commit introduces an additional check to determine if a
transaction is already aborted by a CLOG lookup, so the logical
decoding skips further change also when it doesn't touch system
catalogs.

~

Is that wording backwards? Is it meant to say:

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the ...

======
contrib/test_decoding/sql/stats.sql

2
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
spill_count FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';

Why do the SELECT "= 0" like this, instead of just having zeros in the
"expected" results?

======
.../replication/logical/reorderbuffer.c

3.
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

That last parameter name ('mark_streamed') does not match the same
parameter name in this function's definition.

~~~

ReorderBufferTruncateTXN:

4.
if (txn_streaming && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

if (txn_prepared)
{
~

Since the following condition was already "if (txn_prepared)", would it
be better to remove the "(!txn_prepared)" here and instead just refactor
the code like:

if (txn_prepared)
{
...
}
else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
{
...
}

~~~

ReorderBufferProcessTXN:

5.
+
+ /* Remember the transaction is aborted */
+ Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+ curtxn->txn_flags |= RBTXN_IS_ABORTED;

Missing period on comment.

~~~

ReorderBufferCheckTXNAbort:

6.
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction.
+ */
+ if (unlikely(debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+ return false;
+

The wording of the sentence "This is to disable..." seemed a bit
confusing. Maybe this area can be simplified by doing the following.

6a.
Change the function comment to say more like below:

When the GUC 'debug_logical_replication_streaming' is set to
"immediate", we don't check the transaction status, meaning the caller
will always process this transaction. This mode is used by regression
tests to avoid unnecessary transaction status checking.

~

6b.
It is not necessary for this 2nd comment to repeat everything that was
already said in the function comment. A simpler comment here might be
all you need:

SUGGESTION:
Quick return for regression tests.

~~~

7.
Is it worth mentioning this skipping of the transaction status
check in the docs for this GUC? [1]

======
[1]: https://www.postgresql.org/docs/devel/runtime-config-developer.html

Kind Regards,
Peter Smith.
Fujitsu Australia.

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#19)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Nov 10, 2024 at 11:24 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San, here are some review comments for the patch v5-0001.

Thank you for reviewing the patch!

======
Commit message.

1.
This commit introduces an additional check to determine if a
transaction is already aborted by a CLOG lookup, so the logical
decoding skips further change also when it doesn't touch system
catalogs.

~

Is that wording backwards? Is it meant to say:

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the ...

Fixed.

======
contrib/test_decoding/sql/stats.sql

2
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
spill_count FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';

Why do the SELECT "= 0" like this, instead of just having zeros in the
"expected" results?

Indeed. I used "= 0" as other queries in the same file do, but it
makes sense to me to just have zeros in the expected file. That way,
failures would be a bit easier to investigate.

======
.../replication/logical/reorderbuffer.c

3.
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

That last parameter name ('mark_streamed') does not match the same
parameter name in this function's definition.

Fixed.

~~~

ReorderBufferTruncateTXN:

4.
if (txn_streaming && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

if (txn_prepared)
{
~

Since the following condition was already "if (txn_prepared)", would it
be better to remove the "(!txn_prepared)" here and instead just refactor
the code like:

if (txn_prepared)
{
...
}
else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
{
...
}

Good idea.
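
For reference, the equivalence of the two shapes can be checked with a small standalone sketch. Everything here is hypothetical scaffolding (the demo_* functions and DEMO_* result bits); only the two conditions are taken from the review comment. Each function reports which branches would run, so the two shapes can be compared over every input combination.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result bits describing which branch bodies ran. */
#define DEMO_MARKED_STREAMED    0x1
#define DEMO_PREPARED_CLEANUP   0x2

/* Shape before the refactoring: two independent if statements. */
static int
demo_before(bool txn_prepared, bool txn_streaming, bool is_toptxn,
            int nentries_mem)
{
    int         result = 0;

    if (txn_streaming && (!txn_prepared) &&
        (is_toptxn || (nentries_mem != 0)))
        result |= DEMO_MARKED_STREAMED;

    if (txn_prepared)
        result |= DEMO_PREPARED_CLEANUP;

    return result;
}

/* Shape after the refactoring: a single if / else-if chain. */
static int
demo_after(bool txn_prepared, bool txn_streaming, bool is_toptxn,
           int nentries_mem)
{
    int         result = 0;

    if (txn_prepared)
        result |= DEMO_PREPARED_CLEANUP;
    else if (txn_streaming && (is_toptxn || (nentries_mem != 0)))
        result |= DEMO_MARKED_STREAMED;

    return result;
}

/* Returns true if both shapes agree for every combination of inputs. */
static bool
demo_shapes_agree(void)
{
    for (int p = 0; p <= 1; p++)
        for (int s = 0; s <= 1; s++)
            for (int t = 0; t <= 1; t++)
                for (int n = 0; n <= 3; n += 3)
                    if (demo_before(p != 0, s != 0, t != 0, n) !=
                        demo_after(p != 0, s != 0, t != 0, n))
                        return false;
    return true;
}
```

The two branches are mutually exclusive in both shapes, so moving the streamed marking into the else-if arm does not change which transactions get marked.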

~~~

ReorderBufferProcessTXN:

5.
+
+ /* Remember the transaction is aborted */
+ Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+ curtxn->txn_flags |= RBTXN_IS_ABORTED;

Missing period on comment.

Fixed.

~~~

ReorderBufferCheckTXNAbort:

6.
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction.
+ */
+ if (unlikely(debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+ return false;
+

The wording of the sentence "This is to disable..." seemed a bit
confusing. Maybe this area can be simplified by doing the following.

6a.
Change the function comment to say more like below:

When the GUC 'debug_logical_replication_streaming' is set to
"immediate", we don't check the transaction status, meaning the caller
will always process this transaction. This mode is used by regression
tests to avoid unnecessary transaction status checking.

~

6b.
It is not necessary for this 2nd comment to repeat everything that was
already said in the function comment. A simpler comment here might be
all you need:

SUGGESTION:
Quick return for regression tests.

Agreed with the above two comments. Fixed.

~~~

7.
Is it worth mentioning this skipping of the transaction status
check in the docs for this GUC? [1]

If we mention this optimization in the docs, we would also have to
explain how it works. I think that is too much detail for the docs.

I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From ff9d7ecbf16a834f4877d71d9f1075fd2ecf927b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v6] Skip logical decoding of already-aborted transactions.

Previously, concurrent aborts were detected only during system catalog
scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  44 +++++-
 contrib/test_decoding/expected/stream.out     |   4 +-
 contrib/test_decoding/sql/stats.sql           |  23 ++-
 contrib/test_decoding/sql/stream.sql          |   2 +-
 .../replication/logical/reorderbuffer.c       | 144 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 ++-
 6 files changed, 190 insertions(+), 44 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..253236e3973 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,48 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes. So, the transaction should not be spilled at all.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+TRUNCATE table stats_test;
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+-- Should show only ROLLBACK PREPARED.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+              data               
+---------------------------------
+ ROLLBACK PREPARED 'test1_abort'
+(1 row)
+
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txn | spill_count 
+---------------------------------+-----------+-------------
+ regression_slot_stats4_twophase |         0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..0950f552c45 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -110,7 +110,7 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
 (17 rows)
 
 /*
- * Test concurrent abort with toast data. When streaming the second insertion, we
+ * Test concurrent abort with toast data. Before streaming the second insertion, we
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
@@ -125,7 +125,7 @@ COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
  count 
 -------
-     5
+     4
 (1 row)
 
 DROP TABLE stream_test;
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..77113cd1942 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,28 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes. So, the transaction should not be spilled at all.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);
+TRUNCATE table stats_test;
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+
+-- Should show only ROLLBACK PREPARED.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..5d07cd583e4 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -45,7 +45,7 @@ toasted-123456789012345678901234567890123456789012345678901234567890123456789012
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
 
 /*
- * Test concurrent abort with toast data. When streaming the second insertion, we
+ * Test concurrent abort with toast data. Before streaming the second insertion, we
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..d1b2ec9b638 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool txn_streaming);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +794,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1621,20 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ * 'txn_streaming' indicates that the transaction is being streamed.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1684,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1707,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed, if appropriate.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1924,7 +1929,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2067,7 +2072,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2600,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2653,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2832,10 +2840,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3620,6 +3628,68 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discard all changes accumulated so far, ignore future
+ * changes, and return true. Otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction. This mode is used by regression tests to avoid unnecessary
+ * transaction status checking.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_is_aborted(txn))
+		return true;
+	if (rbtxn_is_committed(txn))
+		return false;
+
+	/* Check the transaction status using CLOG lookup */
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert((txn->txn_flags & RBTXN_IS_ABORTED) == 0);
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * We don't mark the transaction as streamed since this function can be
+	 * called for non-streamed transactions too.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+	ReorderBufferToastReset(rb, txn);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert((txn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3672,6 +3742,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3761,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#21Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#20)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Nov 12, 2024 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the updated patch.

Hi, here are some review comments for the latest v6-0001.

======
contrib/test_decoding/sql/stats.sql

1.
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM
generate_series(1, 5000) g(i);

I didn't understand the meaning of "serialize-topbig--1". My guess is
it is a typo that was supposed to say "toobig".

Perhaps there should also be some comment to explain that this
"toobig" stuff was done deliberately like this to exceed
'logical_decoding_work_mem' because that would normally (if it was not
aborted) cause a spill to disk.

~~~

2.
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count
FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

Those aliases seem unnecessary: "spill_txns AS spill_txn" and
"spill_count AS spill_count"

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckTXNAbort:

3.
Other static functions are also declared at the top of this module.
For consistency, shouldn't this be the same?

~~~

4.
+ * We don't mark the transaction as streamed since this function can be
+ * called for non-streamed transactions too.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+ ReorderBufferToastReset(rb, txn);

Given the comment says "since this function can be called for
non-streamed transactions too", would it be easier to pass
rbtxn_is_streamed(txn) here instead of 'false', and then just remove
the comment?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#22vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#20)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, 11 Nov 2024 at 23:30, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Nov 10, 2024 at 11:24 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San, here are some review comments for the patch v5-0001.

Thank you for reviewing the patch!

======
Commit message.

1.
This commit introduces an additional check to determine if a
transaction is already aborted by a CLOG lookup, so the logical
decoding skips further change also when it doesn't touch system
catalogs.

~

Is that wording backwards? Is it meant to say:

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the ...

Fixed.

======
contrib/test_decoding/sql/stats.sql

2
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
spill_count FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';

Why do the SELECT "= 0" like this, instead of just having zeros in the
"expected" results?

Indeed. I used "=0" like other queries in the same file do, but it
makes sense to me just to have zeros in the expected file. That way,
it would make it a bit easier to investigate in case of failures.

======
.../replication/logical/reorderbuffer.c

3.
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

That last parameter name ('mark_streamed') does not match the same
parameter name in this function's definition.

Fixed.

~~~

ReorderBufferTruncateTXN:

4.
if (txn_streaming && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

if (txn_prepared)
{
~

Since the following condition was already "if (txn_prepared)", would it
be better to remove the "(!txn_prepared)" here and instead just refactor
the code like:

if (txn_prepared)
{
...
}
else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
{
...
}

Good idea.
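
The equivalence behind this refactoring can be checked with a small standalone model. This is only a sketch: TruncateOutcome, truncate_original(), truncate_refactored() and shapes_agree() are hypothetical names, and the model assumes the only observable effects are "mark as streamed" and "run the prepared-only cleanup".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of which parts of the truncate logic would run. */
typedef struct
{
	bool		marked_streamed;	/* RBTXN_IS_STREAMED would be set */
	bool		prepared_cleanup;	/* the "if (txn_prepared)" body would run */
} TruncateOutcome;

/* Original shape: two independent if statements. */
static TruncateOutcome
truncate_original(bool txn_prepared, bool txn_streaming,
				  bool is_toptxn, int nentries_mem)
{
	TruncateOutcome out = {false, false};

	if (txn_streaming && (!txn_prepared) &&
		(is_toptxn || (nentries_mem != 0)))
		out.marked_streamed = true;

	if (txn_prepared)
		out.prepared_cleanup = true;

	return out;
}

/* Suggested shape: a single if / else-if chain. */
static TruncateOutcome
truncate_refactored(bool txn_prepared, bool txn_streaming,
					bool is_toptxn, int nentries_mem)
{
	TruncateOutcome out = {false, false};

	if (txn_prepared)
		out.prepared_cleanup = true;
	else if (txn_streaming && (is_toptxn || (nentries_mem != 0)))
		out.marked_streamed = true;

	return out;
}

/* Returns true if both shapes agree for every combination of inputs. */
static bool
shapes_agree(void)
{
	for (int p = 0; p < 2; p++)
		for (int s = 0; s < 2; s++)
			for (int t = 0; t < 2; t++)
				for (int n = 0; n < 2; n++)
				{
					TruncateOutcome a = truncate_original(p, s, t, n);
					TruncateOutcome b = truncate_refactored(p, s, t, n);

					if (a.marked_streamed != b.marked_streamed ||
						a.prepared_cleanup != b.prepared_cleanup)
						return false;
				}
	return true;
}
```

Because the two branches are mutually exclusive on txn_prepared, the order in which they are written does not matter in this model, which is why the else-if form can drop the explicit (!txn_prepared) test.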

~~~

ReorderBufferProcessTXN:

5.
+
+ /* Remember the transaction is aborted */
+ Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+ curtxn->txn_flags |= RBTXN_IS_ABORTED;

Missing period on comment.

Fixed.

~~~

ReorderBufferCheckTXNAbort:

6.
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction.
+ */
+ if (unlikely(debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+ return false;
+

The wording of the sentence "This is to disable..." seemed a bit
confusing. Maybe this area can be simplified by doing the following.

6a.
Change the function comment to say more like below:

When the GUC 'debug_logical_replication_streaming' is set to
"immediate", we don't check the transaction status, meaning the caller
will always process this transaction. This mode is used by regression
tests to avoid unnecessary transaction status checking.

~

6b.
It is not necessary for this 2nd comment to repeat everything that was
already said in the function comment. A simpler comment here might be
all you need:

SUGGESTION:
Quick return for regression tests.

Agreed with the above two comments. Fixed.

~~~

7.
Is it worth mentioning this skipping of the transaction status
check in the docs for this GUC? [1]

If we want to mention this optimization in the docs, we have to
explain how the optimization works too. I think it's too detailed.

I've attached the updated patch.

Few minor suggestions:
1) Can we use rbtxn_is_committed here?
+                       /* Remember the transaction is aborted. */
+                       Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+                       curtxn->txn_flags |= RBTXN_IS_ABORTED;
2) Similarly here too:
+       /*
+        * Mark the transaction as aborted so we ignore future changes of this
+        * transaction.
+        */
+       Assert((txn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+       txn->txn_flags |= RBTXN_IS_ABORTED;
3) Can we use rbtxn_is_aborted here?
+               /*
+                * Remember the transaction is committed so that we
can skip CLOG
+                * check next time, avoiding the pressure on CLOG lookup.
+                */
+               Assert((txn->txn_flags & RBTXN_IS_ABORTED) == 0);

Regards,
Vignesh
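
For illustration, the macro-based assertions suggested above could look roughly like the following standalone sketch. The flag values and the two macros are taken from the patch; FlagTXN, demo_mark_aborted() and demo_mark_committed() are hypothetical stand-ins, and plain assert() is used in place of PostgreSQL's Assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the patch's reorderbuffer.h hunk. */
#define RBTXN_IS_COMMITTED	0x0200
#define RBTXN_IS_ABORTED	0x0400

/* Hypothetical stand-in for ReorderBufferTXN; only the flags are modeled. */
typedef struct
{
	uint32_t	txn_flags;
} FlagTXN;

/* Is this transaction committed? */
#define rbtxn_is_committed(txn) \
	(((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0)

/* Is this transaction aborted? */
#define rbtxn_is_aborted(txn) \
	(((txn)->txn_flags & RBTXN_IS_ABORTED) != 0)

/* Remember that the transaction is aborted. */
static void
demo_mark_aborted(FlagTXN *txn)
{
	/* Same check as (txn_flags & RBTXN_IS_COMMITTED) == 0, via the macro. */
	assert(!rbtxn_is_committed(txn));
	txn->txn_flags |= RBTXN_IS_ABORTED;
}

/* Remember that the transaction is committed. */
static void
demo_mark_committed(FlagTXN *txn)
{
	assert(!rbtxn_is_aborted(txn));
	txn->txn_flags |= RBTXN_IS_COMMITTED;
}
```

The behavior is unchanged; the macros only make the assertion read the same way as the other status tests in the file.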

#23Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#21)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Nov 11, 2024 at 5:40 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Nov 12, 2024 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the updated patch.

Hi, here are some review comments for the latest v6-0001.

======
contrib/test_decoding/sql/stats.sql

1.
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM
generate_series(1, 5000) g(i);

I didn't understand the meaning of "serialize-topbig--1". My guess is
it is a typo that was supposed to say "toobig".

Fixed. We have another place using 'topbig', but I think we can leave it.

Perhaps there should also be some comment to explain that this
"toobig" stuff was done deliberately like this to exceed
'logical_decoding_work_mem' because that would normally (if it was not
aborted) cause a spill to disk.

I think the existing comment already mentions that the transaction would
be spilled but actually is not:

+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.

~~~

2.
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count
FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

Those aliases seem unnecessary: "spill_txns AS spill_txn" and
"spill_count AS spill_count"

Fixed.

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckTXNAbort:

3.
Other static functions are also declared at the top of this module.
For consistency, shouldn't this be the same?

Agreed, added.

~~~

4.
+ * We don't mark the transaction as streamed since this function can be
+ * called for non-streamed transactions too.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+ ReorderBufferToastReset(rb, txn);

Given the comment says "since this function can be called for
non-streamed transactions too", would it be easier to pass
rbtxn_is_streamed(txn) here instead of 'false', and then just remove
the comment?

Agreed.

During further testing, I found some bugs in the previous version of the
patch, so the latest patch includes some additional changes beyond the
review comments I have received so far.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
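
The control flow of the pre-eviction status check discussed in this thread can be modeled in isolation as below. This is a simplified sketch, not the patch itself: StatusTXN, DemoXactStatus and demo_check_txn_abort() are hypothetical, the "actual" field stands in for the real in-progress and CLOG lookups, the buffered_changes counter stands in for truncating the transaction, and the debug_logical_replication_streaming quick return is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for what a real status lookup would report. */
typedef enum
{
	DEMO_IN_PROGRESS,
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoXactStatus;

/* Hypothetical, heavily reduced model of a buffered transaction. */
typedef struct
{
	DemoXactStatus actual;		/* result the lookup would return */
	bool		known_committed;	/* cached: lookup said committed */
	bool		known_aborted;	/* cached: lookup said aborted */
	int			buffered_changes;	/* changes collected so far */
	int			lookups;		/* number of status lookups performed */
} StatusTXN;

/*
 * Return true if the transaction is aborted, in which case its buffered
 * changes are discarded. Return false if it is committed or in progress.
 */
static bool
demo_check_txn_abort(StatusTXN *txn)
{
	/* Quick return if the status is already known. */
	if (txn->known_committed)
		return false;
	if (txn->known_aborted)
		return true;

	/* Otherwise perform the (modeled) status lookup. */
	txn->lookups++;

	if (txn->actual == DEMO_IN_PROGRESS)
		return false;			/* not cached: the status may still change */

	if (txn->actual == DEMO_COMMITTED)
	{
		/* Remember the result so the lookup is skipped next time. */
		assert(!txn->known_aborted);
		txn->known_committed = true;
		return false;
	}

	/* Aborted: discard what was collected and ignore future changes. */
	txn->buffered_changes = 0;
	assert(!txn->known_committed);
	txn->known_aborted = true;
	return true;
}
```

In this model a second call for the same transaction returns from the cached flag without another lookup, which mirrors the stated goal of keeping pressure off CLOG.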

Attachments:

v6-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From 9b12fbff5c08726ee50d94aceaf3bcff76e1b9ab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v6] Skip logical decoding of already-aborted transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  41 ++++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  19 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 168 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 +-
 6 files changed, 212 insertions(+), 45 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..1fe9c5f190a 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,45 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+-- Check if the transaction is not spilled as it's already aborted.
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..f2df0fe869c 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,24 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+
+-- Check if the transaction is not spilled as it's already aborted.
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..96cf4eef3f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,11 +260,12 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool txn_streaming);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
+static bool ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -793,11 +795,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1622,20 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ * 'txn_streaming' indicates that the transaction is being streamed.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1655,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1708,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed, if appropriate.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1924,7 +1930,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2067,7 +2073,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2601,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2654,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2810,6 +2819,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 					 char *gid)
 {
 	ReorderBufferTXN *txn;
+	bool		already_aborted;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -2824,6 +2834,12 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * Remember if the transaction is already aborted to check if we detect
+	 * that the transaction is concurrently aborted during the replay.
+	 */
+	already_aborted = rbtxn_is_aborted(txn);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2832,10 +2848,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3566,7 +3582,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3610,7 +3627,7 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 
 		if ((largest == NULL || txn->total_size > largest_size) &&
 			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			rbtxn_has_streamable_change(txn) && !(rbtxn_is_aborted(txn)))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3620,6 +3637,67 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 	return largest;
 }
 
+/*
+ * Check the transaction status of the given transaction. If the transaction
+ * already aborted, we discard all changes accumulated so far, ignore future
+ * changes, and return true. Otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+		return true;
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * Since we don't check the transaction status while replaying the
+	 * transaction, we don't need to reset toast reconstruction data here.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), rbtxn_is_streamed(txn));
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Check whether the logical_decoding_work_mem limit was reached, and if yes
  * pick the largest (sub)transaction at-a-time to evict and spill its changes to
@@ -3661,8 +3739,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3750,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+			{
+				/* All changes should be truncated */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3773,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if already aborted */
+			if (ReorderBufferCheckTXNAbort(rb, txn))
+			{
+				/* All changes should be truncated */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#22)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Nov 12, 2024 at 7:29 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, 11 Nov 2024 at 23:30, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Nov 10, 2024 at 11:24 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San, here are some review comments for the patch v5-0001.

Thank you for reviewing the patch!

======
Commit message.

1.
This commit introduces an additional check to determine if a
transaction is already aborted by a CLOG lookup, so the logical
decoding skips further change also when it doesn't touch system
catalogs.

~

Is that wording backwards? Is it meant to say:

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the ...

Fixed.

======
contrib/test_decoding/sql/stats.sql

2
+SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
spill_count FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';

Why do the SELECT "= 0" like this, instead of just having zeros in the
"expected" results?

Indeed. I used "=0" like other queries in the same file do, but it
makes sense to me just to have zeros in the expected file. That way,
it would make it a bit easier to investigate in case of failures.

======
.../replication/logical/reorderbuffer.c

3.
static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
- bool txn_prepared);
+ bool txn_prepared, bool mark_streamed);

That last parameter name ('mark_streamed') does not match the same
parameter name in this function's definition.

Fixed.

~~~

ReorderBufferTruncateTXN:

4.
if (txn_streaming && (!txn_prepared) &&
(rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

if (txn_prepared)
{
~

Since the following condition was already "if (txn_prepared)" would it
be better to remove the "(!txn_prepared)" here and instead just refactor
the code like:

if (txn_prepared)
{
...
}
else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
{
...
}

Good idea.

~~~

ReorderBufferProcessTXN:

5.
+
+ /* Remember the transaction is aborted */
+ Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+ curtxn->txn_flags |= RBTXN_IS_ABORTED;

Missing period on comment.

Fixed.

~~~

ReorderBufferCheckTXNAbort:

6.
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction. This is to disable this check for regression tests.
+ */
+static bool
+ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
+ * check the transaction status, so the caller always processes this
+ * transaction.
+ */
+ if (unlikely(debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+ return false;
+

The wording of the sentence "This is to disable..." seemed a bit
confusing. Maybe this area can be simplified by doing the following.

6a.
Change the function comment to say more like below:

When the GUC 'debug_logical_replication_streaming' is set to
"immediate", we don't check the transaction status, meaning the caller
will always process this transaction. This mode is used by regression
tests to avoid unnecessary transaction status checking.

~

6b.
It is not necessary for this 2nd comment to repeat everything that was
already said in the function comment. A simpler comment here might be
all you need:

SUGGESTION:
Quick return for regression tests.

Agreed with the above two comments. Fixed.

~~~

7.
Is it worth mentioning about this skipping of the transaction status
check in the docs for this GUC? [1]

If we want to mention this optimization in the docs, we have to
explain how the optimization works too. I think it's too detailed.

I've attached the updated patch.

Few minor suggestions:
1) Can we use rbtxn_is_committed here?
+                       /* Remember the transaction is aborted. */
+                       Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+                       curtxn->txn_flags |= RBTXN_IS_ABORTED;
2) Similarly here too:
+       /*
+        * Mark the transaction as aborted so we ignore future changes of this
+        * transaction.
+        */
+       Assert((txn->txn_flags & RBTXN_IS_COMMITTED) == 0);
+       txn->txn_flags |= RBTXN_IS_ABORTED;
3) Can we use rbtxn_is_aborted here?
+               /*
+                * Remember the transaction is committed so that we
can skip CLOG
+                * check next time, avoiding the pressure on CLOG lookup.
+                */
+               Assert((txn->txn_flags & RBTXN_IS_ABORTED) == 0);

Thank you for reviewing the patch!

These comments are incorporated into the latest v6 patch I just sent[1].

Regards,

[1]: /messages/by-id/CAD21AoDtMjbc8YCQiX1K8+RKeahcX2MLt3gwApm5BWGfv14i5A@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#25Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#23)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San,

Here are some more review comments for the latest (accidentally called
v6 again?) v6-0001 patch.

======
contrib/test_decoding/sql/stats.sql

1.
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.

You had replied (referring to the above comment):
I think we already mentioned the transaction is going to be spilled
but actually not.

~

Yes, spilling was already mentioned in the current comment but I felt
it assumes the reader is expected to know details of why it was going
to be spilled in the first place.

In other words, I thought the comment could include a bit more
explanatory background info:
(Also, it's not really "we detect" the abort -- it's the new postgres
code of this patch that detects it.)

SUGGESTION:
Execute a transaction that is prepared but then aborted. The INSERT
data exceeds the 'logical_decoding_work_mem' limit, which
normally would result in the transaction being spilled to disk, but
now when Postgres detects the abort it skips the spilling and also
skips collecting further changes.

~~~

2.
+-- Check if the transaction is not spilled as it's already aborted.
+SELECT count(*) FROM
pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM
pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

/Check if the transaction is not spilled/Verify that the transaction
was not spilled/

======
.../replication/logical/reorderbuffer.c

ReorderBufferResetTXN:

3.
  /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);

Looking at the calling code for ReorderBufferResetTXN it seems this
function can be called for streaming OR prepared. So is it OK here to be
passing hardwired 'true' as the txn_streaming parameter, or should
that be passing rbtxn_is_streamed(txn)?

~~~

ReorderBufferLargestStreamableTopTXN:

4.
  if ((largest == NULL || txn->total_size > largest_size) &&
  (txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
- rbtxn_has_streamable_change(txn))
+ rbtxn_has_streamable_change(txn) && !(rbtxn_is_aborted(txn)))
  {
  largest = txn;
  largest_size = txn->total_size;

I felt that this increasingly complicated code would be a lot easier
to understand if you just separate the conditions into: (a) the ones
that filter out transactions you don't care about; (b) the ones that
check for the largest size. For example,

SUGGESTION:
dlist_foreach(...)
{
...

/* Don't consider these kinds of transactions for eviction. */
if (rbtxn_has_partial_change(txn) ||
!rbtxn_has_streamable_change(txn) || rbtxn_is_aborted(txn))
continue;

/* Find the largest of the eviction candidates. */
if ((largest == NULL || txn->total_size > largest_size) &&
(txn->total_size > 0))
{
largest = txn;
largest_size = txn->total_size;
}
}

~~~

ReorderBufferCheckMemoryLimit:

5.
+ /* skip the transaction if already aborted */
+ if (ReorderBufferCheckTXNAbort(rb, txn))
+ {
+ /* All changes should be truncated */
+ Assert(txn->size == 0 && txn->total_size == 0);
+ continue;
+ }

The "discard all changes accumulated so far" side-effect happening
here is not very apparent from the function name. Maybe a better name
for ReorderBufferCheckTXNAbort() would be something like
'ReorderBufferCleanupIfAbortedTXN()'.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#25)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Nov 13, 2024 at 8:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San,

Here are some more review comments for the latest (accidentally called
v6 again?) v6-0001 patch.

Thank you for reviewing the patch! Indeed, the previous version should
have been v7.

======
contrib/test_decoding/sql/stats.sql

1.
+-- Execute a transaction that is prepared and aborted. We detect that the
+-- transaction is aborted before spilling changes, and then skip collecting
+-- further changes.

You had replied (referring to the above comment):
I think we already mentioned the transaction is going to be spilled
but actually not.

~

Yes, spilling was already mentioned in the current comment but I felt
it assumes the reader is expected to know details of why it was going
to be spilled in the first place.

TBH we expect the reader, typically patch authors and reviewers, to
know it.

In other words, I thought the comment could include a bit more
explanatory background info:
(Also, it's not really "we detect" the abort -- it's the new postgres
code of this patch that detects it.)

SUGGESTION:
Execute a transaction that is prepared but then aborted. The INSERT
data exceeds the 'logical_decoding_work_mem' limit, which
normally would result in the transaction being spilled to disk, but
now when Postgres detects the abort it skips the spilling and also
skips collecting further changes.

But I'm concerned this explanation might be too detailed, and it feels odd
to put such a comment on the newly added tests even though we're doing
similar tests in the same file. For instance, we have:

-- spilling the xact
BEGIN;
INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM
generate_series(1, 5000) g(i);
COMMIT;
SELECT count(*) FROM pg_logical_slot_peek_changes('regression_slot_stats1', NULL, NULL, 'skip-empty-xacts', '1');

How about rewording it to the following? I think it's better to
explain why we use a prepared transaction here:

+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.

~~~

2.
+-- Check if the transaction is not spilled as it's already aborted.
+SELECT count(*) FROM
pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM
pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

/Check if the transaction is not spilled/Verify that the transaction
was not spilled/

How about "Verify that the decoding doesn't spill already-aborted
transaction's changes."?

======
.../replication/logical/reorderbuffer.c

ReorderBufferResetTXN:

3.
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);

Looking at the calling code for ReorderBufferResetTXN it seems this
function can be called for streaming OR prepared. So is it OK here to be
passing hardwired 'true' as the txn_streaming parameter, or should
that be passing rbtxn_is_streamed(txn)?

I think it should pass 'true' because otherwise the transaction won't
be marked as streamed.

After more thought, I think the name 'txn_streaming' is the source
of confusion. The flag is actually used to decide whether or not the
given transaction can be marked as streamed, but should not indicate
whether the transaction is being streamed because this function can be
called while streaming. So I renamed it to 'mark_txn_streaming' and
updated the comment.
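
To make the role of 'mark_txn_streaming' concrete, here is a minimal
standalone sketch of the if / else-if decision at the end of the
truncation step. It is not the patch code: StreamTxn,
demo_mark_after_truncate() and the DEMO_IS_STREAMED value are
hypothetical stand-ins, and the real function also frees changes and
tuplecids, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bit; stands in for RBTXN_IS_STREAMED. */
#define DEMO_IS_STREAMED 0x0010u

/* Hypothetical, stripped-down stand-in for ReorderBufferTXN. */
typedef struct StreamTxn
{
	uint32_t	flags;
	bool		is_toptxn;		/* top-level transaction? */
	uint64_t	nentries_mem;	/* changes currently held in memory */
} StreamTxn;

/*
 * Decide whether a just-truncated transaction gets marked as streamed.
 * Returns true if the flag was set.
 */
bool
demo_mark_after_truncate(StreamTxn *txn, bool txn_prepared,
						 bool mark_txn_streaming)
{
	assert(txn != NULL);

	if (txn_prepared)
	{
		/* Decoded at PREPARE: never marked as streamed. */
		return false;
	}
	else if (mark_txn_streaming &&
			 (txn->is_toptxn || txn->nentries_mem != 0))
	{
		/*
		 * The caller allows marking, and this is either the top-level
		 * transaction or a subtransaction that actually had changes.
		 */
		txn->flags |= DEMO_IS_STREAMED;
		return true;
	}

	/* Caller did not ask for marking, or this is an empty subxact. */
	return false;
}
```

In this sketch, the abort path would pass false for 'mark_txn_streaming',
so a transaction truncated only because it aborted is never reported as
streamed, while the streaming paths pass true.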

~~~

ReorderBufferLargestStreamableTopTXN:

4.
if ((largest == NULL || txn->total_size > largest_size) &&
(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
- rbtxn_has_streamable_change(txn))
+ rbtxn_has_streamable_change(txn) && !(rbtxn_is_aborted(txn)))
{
largest = txn;
largest_size = txn->total_size;

I felt that this increasingly complicated code would be a lot easier
to understand if you just separate the conditions into: (a) the ones
that filter out transactions you don't care about; (b) the ones that
check for the largest size. For example,

SUGGESTION:
dlist_foreach(...)
{
...

/* Don't consider these kinds of transactions for eviction. */
if (rbtxn_has_partial_change(txn) ||
!rbtxn_has_streamable_change(txn) || rbtxn_is_aborted(txn))
continue;

/* Find the largest of the eviction candidates. */
if ((largest == NULL || txn->total_size > largest_size) &&
(txn->total_size > 0))
{
largest = txn;
largest_size = txn->total_size;
}
}

I like this idea.
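
For reference, the filter-then-compare shape suggested above could look
roughly like the following standalone sketch. It assumes a plain array
instead of the dlist the real code iterates over, and EvictTxn and
demo_largest_streamable() are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in carrying only the fields the selection needs. */
typedef struct EvictTxn
{
	size_t		total_size;				/* size including subtransactions */
	bool		has_partial_change;
	bool		has_streamable_change;
	bool		is_aborted;
} EvictTxn;

/*
 * Return the largest transaction that is eligible for streaming, or NULL
 * if no transaction qualifies.
 */
EvictTxn *
demo_largest_streamable(EvictTxn *txns, size_t ntxns)
{
	EvictTxn   *largest = NULL;
	size_t		largest_size = 0;

	assert(txns != NULL || ntxns == 0);

	for (size_t i = 0; i < ntxns; i++)
	{
		EvictTxn   *txn = &txns[i];

		/* Don't consider these kinds of transactions for eviction. */
		if (txn->has_partial_change ||
			!txn->has_streamable_change ||
			txn->is_aborted)
			continue;

		/* Find the largest of the eviction candidates. */
		if ((largest == NULL || txn->total_size > largest_size) &&
			txn->total_size > 0)
		{
			largest = txn;
			largest_size = txn->total_size;
		}
	}

	return largest;
}
```

Keeping the eligibility checks in one early 'continue' means a new
exclusion, such as the aborted check here, is added in one place without
touching the size comparison.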

~~~

ReorderBufferCheckMemoryLimit:

5.
+ /* skip the transaction if already aborted */
+ if (ReorderBufferCheckTXNAbort(rb, txn))
+ {
+ /* All changes should be truncated */
+ Assert(txn->size == 0 && txn->total_size == 0);
+ continue;
+ }

The "discard all changes accumulated so far" side-effect happening
here is not very apparent from the function name. Maybe a better name
for ReorderBufferCheckTXNAbort() would be something like
'ReorderBufferCleanupIfAbortedTXN()'.

Okay, since we use the term "Cleanup" for different meanings in
reorderbuffer.c (discarding all changes and deallocating the entry),
how about ReorderBufferTruncateTXNIfAborted()?
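
The behaviour that name is meant to convey (look up the status once,
cache it in the flags, and truncate as a side effect only on abort) can
be sketched in isolation as below. This is a simplified model, not the
patch: StatusTxn, DemoXactStatus and demo_truncate_if_aborted() are
hypothetical, the 'actual' field stands in for what the
TransactionIdIsInProgress() / TransactionIdDidCommit() checks would
report, and the debug_logical_replication_streaming shortcut is left
out. The two flag values match the ones the patch adds to
reorderbuffer.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit values as RBTXN_IS_COMMITTED / RBTXN_IS_ABORTED in the patch. */
#define DEMO_IS_COMMITTED 0x0200u
#define DEMO_IS_ABORTED   0x0400u

typedef enum
{
	DEMO_IN_PROGRESS,
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoXactStatus;

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct StatusTxn
{
	uint32_t	flags;
	size_t		size;		/* bytes of changes collected so far */
	DemoXactStatus actual;	/* what a status lookup would report */
	int			lookups;	/* number of simulated lookups performed */
} StatusTxn;

/* Return true if the transaction is aborted, discarding its changes. */
bool
demo_truncate_if_aborted(StatusTxn *txn)
{
	/* Quick return if the transaction status is already known. */
	if (txn->flags & DEMO_IS_COMMITTED)
		return false;
	if (txn->flags & DEMO_IS_ABORTED)
		return true;

	/* Otherwise, do the (simulated) status lookup. */
	txn->lookups++;

	/* Still running: nothing to cache, check again next time. */
	if (txn->actual == DEMO_IN_PROGRESS)
		return false;

	if (txn->actual == DEMO_COMMITTED)
	{
		/* Remember the result so later calls skip the lookup. */
		assert((txn->flags & DEMO_IS_ABORTED) == 0);
		txn->flags |= DEMO_IS_COMMITTED;
		return false;
	}

	/* Aborted: discard what was collected and remember the result. */
	txn->size = 0;
	assert((txn->flags & DEMO_IS_COMMITTED) == 0);
	txn->flags |= DEMO_IS_ABORTED;
	return true;
}
```

In this model a committed or aborted transaction costs one lookup no
matter how often it is picked for eviction, while an in-progress one is
looked up again on each call.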

I've attached the updated patch (v8).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From 45c16d4e36a893d439370c8e7cea22bb7ee9b6b0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v8] Skip logical decoding of already-aborted transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 187 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 +-
 6 files changed, 228 insertions(+), 50 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..e6d56619156 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..177fbe0965b 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..7d88e4105d1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool mark_txn_streaming);
+static bool ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +795,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1622,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.
+ *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool mark_txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1657,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, mark_txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1687,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1710,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed, if appropriate.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1752,6 +1760,68 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in txn->txn_flags
+ * so we can skip future changes and avoid CLOG lookups on the next call. Return
+ * true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+		return true;
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far,
+	 * and free all resources allocated for toast reconstruction. The full
+	 * cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * Since we don't check the transaction status while replaying the
+	 * transaction, we don't need to reset toast reconstruction data here.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false, false);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1924,7 +1994,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2054,10 +2124,10 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
- * abort of the streaming transaction.  This resets the TXN such that it
- * can be used to stream the remaining data of transaction being processed.
- * This can happen when the subtransaction is aborted and we still want to
- * continue processing the main or other subtransactions data.
+ * abort of the streaming (prepared) transaction.  This resets the TXN such
+ * that it can be used to stream the remaining data of transaction being
+ * processed. This can happen when the subtransaction is aborted and we
+ * still want to continue processing the main or other subtransactions data.
  */
 static void
 ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -2067,7 +2137,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2665,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2718,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2810,6 +2883,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 					 char *gid)
 {
 	ReorderBufferTXN *txn;
+	bool		already_aborted;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -2824,6 +2898,12 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * Remember if the transaction is already aborted to check if we detect
+	 * that the transaction is concurrently aborted during the replay.
+	 */
+	already_aborted = rbtxn_is_aborted(txn);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2832,10 +2912,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3566,7 +3646,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3689,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3748,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3759,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+			{
+				/* All changes should be discarded */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3782,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+			{
+				/* All changes should be discarded */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#27Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#26)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San,

Here are some review comments for patch v8-0001.

======
contrib/test_decoding/sql/stats.sql

1.
+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.

/to be spilled but not/to be spilled but will not be/

======
.../replication/logical/reorderbuffer.c

ReorderBufferTruncateTXN:

2.
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.
+ *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+ bool mark_txn_streaming)

I think the function comment should describe the parameters in the
same order that they appear in the function signature.

~~~

3.
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+ {
...
+ txn->txn_flags |= RBTXN_IS_STREAMED;
+ }

I guess it doesn't matter much, but for the sake of readability,
should the condition also be checking !rbtxn_is_streamed(txn) to avoid
overwriting the RBTXN_IS_STREAMED bit when it was set already?

~~~

ReorderBufferTruncateTXNIfAborted:

4.
+ /*
+ * The transaction aborted. We discard the changes we've collected so far,
+ * and free all resources allocated for toast reconstruction. The full
+ * cleanup will happen as part of decoding ABORT record of this
+ * transaction.
+ *
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, false, false);

4a.
The first part of the comment says "... and free all resources
allocated for toast reconstruction", but the second part says "we
don't need to reset toast reconstruction data here". Is that a
contradiction?

~

4b.
Shouldn't this call still be passing rbtxn_prepared(txn) as the 2nd
last param, like it used to?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#28Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#27)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Nov 14, 2024 at 7:07 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San,

Here are some review comments for patch v8-0001.

Thank you for the comments.

======
contrib/test_decoding/sql/stats.sql

1.
+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.

/to be spilled but not/to be spilled but will not be/

Fixed.

======
.../replication/logical/reorderbuffer.c

ReorderBufferTruncateTXN:

2.
/*
* Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
*
* We additionally remove tuplecids after decoding the transaction at prepare
* time as we only need to perform invalidation at rollback or commit prepared.
*
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.
+ *
* 'txn_prepared' indicates that we have decoded the transaction at prepare
* time.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+ bool mark_txn_streaming)

I think the function comment should describe the parameters in the
same order that they appear in the function signature.

Not sure it should be. We sometimes describe the overall idea of the
function first, using argument names where needed, and then describe
what the other arguments mean.

~~~

3.
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+ {
...
+ txn->txn_flags |= RBTXN_IS_STREAMED;
+ }

I guess it doesn't matter much, but for the sake of readability,
should the condition also be checking !rbtxn_is_streamed(txn) to avoid
overwriting the RBTXN_IS_STREAMED bit when it was set already?

Not sure it improves readability because it adds one more check there.
If it's important not to re-set RBTXN_IS_STREAMED, it makes sense to
have that check and describe it in the comment. But in this case, I think
we don't necessarily need to do that.
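
To illustrate why the extra check is optional, here is a minimal standalone sketch (not PostgreSQL code). It assumes the flag is a plain bit in an integer bitmask, as txn_flags is in the patch; the names DemoTxn, DEMO_IS_STREAMED, DEMO_HAS_CATALOG_CHANGES and both helper functions are hypothetical and exist only for this example. Setting a bit with |= is idempotent, so the guarded and unguarded variants leave the flags in the same state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for flag bits; values chosen only for the demo. */
#define DEMO_HAS_CATALOG_CHANGES 0x0001u
#define DEMO_IS_STREAMED         0x0010u

/* Hypothetical, heavily reduced stand-in for ReorderBufferTXN. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

/* Unguarded variant: always OR the bit in, as the patch does. */
void
demo_mark_streamed(DemoTxn *txn)
{
	txn->txn_flags |= DEMO_IS_STREAMED;
}

/* Guarded variant: only set the bit when it is not set yet. */
void
demo_mark_streamed_guarded(DemoTxn *txn)
{
	if ((txn->txn_flags & DEMO_IS_STREAMED) == 0)
		txn->txn_flags |= DEMO_IS_STREAMED;
}
```

Calling either helper any number of times yields the same txn_flags value, and unrelated bits are left untouched, so the guard would only matter if setting the flag had some other side effect.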

~~~

ReorderBufferTruncateTXNIfAborted:

4.
+ /*
+ * The transaction aborted. We discard the changes we've collected so far,
+ * and free all resources allocated for toast reconstruction. The full
+ * cleanup will happen as part of decoding ABORT record of this
+ * transaction.
+ *
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, false, false);

4a.
The first part of the comment says "... and free all resources
allocated for toast reconstruction", but the second part says "we
don't need to reset toast reconstruction data here". Is that a
contradiction?

Yes, the comment is out-of-date. Since this function is not called
while replaying the transaction, it should not have any toast
reconstruction data.

~

4b.
Shouldn't this call still be passing rbtxn_prepared(txn) as the 2nd
last param, like it used to?

Actually it's not necessary because it should always be false. But
thinking more, it seems to be better to use rbtxn_prepared(txn) since
it's consistent with other places and it's not necessary to put
assumptions there.
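
For readers following along, the status check and caching done by ReorderBufferTruncateTXNIfAborted() can be modelled roughly as below. This is a simplified standalone sketch, not the patch itself: the real function consults TransactionIdIsInProgress() and TransactionIdDidCommit() and calls ReorderBufferTruncateTXN(), whereas here the lookup is stubbed by a field on a hypothetical DemoTxn struct, and all DEMO_* names and the lookups counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits mirroring RBTXN_IS_COMMITTED / RBTXN_IS_ABORTED. */
#define DEMO_IS_COMMITTED 0x0200u
#define DEMO_IS_ABORTED   0x0400u

/* What the stubbed status lookup reports for a transaction. */
typedef enum DemoXactStatus
{
	DEMO_IN_PROGRESS,
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoXactStatus;

/* Hypothetical, heavily reduced stand-in for ReorderBufferTXN. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;		/* cached status bits */
	DemoXactStatus real_status; /* answer the stubbed lookup would give */
	int			nchanges;		/* changes collected so far */
	int			lookups;		/* number of (stubbed) status lookups */
} DemoTxn;

/*
 * Return true if the transaction is aborted, discarding its collected
 * changes the first time that is detected. A committed or aborted status
 * is cached in txn_flags so later calls skip the lookup.
 */
bool
demo_truncate_if_aborted(DemoTxn *txn)
{
	/* Quick return if the status is already cached. */
	if (txn->txn_flags & DEMO_IS_COMMITTED)
		return false;
	if (txn->txn_flags & DEMO_IS_ABORTED)
		return true;

	/* Otherwise do the (stubbed) status lookup. */
	txn->lookups++;

	/* Still running: nothing to cache, so the next call looks again. */
	if (txn->real_status == DEMO_IN_PROGRESS)
		return false;

	if (txn->real_status == DEMO_COMMITTED)
	{
		txn->txn_flags |= DEMO_IS_COMMITTED;
		return false;
	}

	/* Aborted: drop the collected changes and remember the status. */
	txn->nchanges = 0;
	txn->txn_flags |= DEMO_IS_ABORTED;
	return true;
}
```

In this model, an aborted or committed transaction costs one lookup no matter how often it is picked for eviction, while an in-progress transaction is looked up again on each call because its final status is not yet known.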

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v9-0001-Skip-logical-decoding-of-already-aborted-transact.patch (application/octet-stream)
From 0ff66671719ea1296ae14d8b9a6e500f795c5eaf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v9] Skip logical decoding of already-aborted transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 186 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 +-
 6 files changed, 227 insertions(+), 50 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..1771c713fd8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool mark_txn_streaming);
+static bool ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +795,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1622,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.
+ *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool mark_txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1657,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, mark_txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1687,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1710,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed, if appropriate.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1752,6 +1760,67 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in txn->txn_flags
+ * so we can skip future changes and avoid CLOG lookups on the next call. Return
+ * true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+		return true;
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far.
+	 * The full cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * Since we don't check the transaction status while replaying the
+	 * transaction, we don't need to reset toast reconstruction data here.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1924,7 +1993,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2054,10 +2123,10 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
- * abort of the streaming transaction.  This resets the TXN such that it
- * can be used to stream the remaining data of transaction being processed.
- * This can happen when the subtransaction is aborted and we still want to
- * continue processing the main or other subtransactions data.
+ * abort of the streaming (prepared) transaction.  This resets the TXN such
+ * that it can be used to stream the remaining data of transaction being
+ * processed. This can happen when the subtransaction is aborted and we
+ * still want to continue processing the main or other subtransactions data.
  */
 static void
 ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -2067,7 +2136,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2664,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2717,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2810,6 +2882,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 					 char *gid)
 {
 	ReorderBufferTXN *txn;
+	bool		already_aborted;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -2824,6 +2897,12 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * Remember if the transaction is already aborted to check if we detect
+	 * that the transaction is concurrently aborted during the replay.
+	 */
+	already_aborted = rbtxn_is_aborted(txn);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2832,10 +2911,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3566,7 +3645,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3688,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3747,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3758,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+			{
+				/* All changes should be discarded */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3781,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+			{
+				/* All changes should be discarded */
+				Assert(txn->size == 0 && txn->total_size == 0);
+				continue;
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#29vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#28)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, 15 Nov 2024 at 23:32, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Nov 14, 2024 at 7:07 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San,

Here are some review comments for patch v8-0001.

Thank you for the comments.

======
contrib/test_decoding/sql/stats.sql

1.
+-- The INSERT changes are large enough to be spilled but not, because the
+-- transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.

/to be spilled but not/to be spilled but will not be/

Fixed.

======
.../replication/logical/reorderbuffer.c

ReorderBufferTruncateTXN:

2.
/*
* Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
*
* We additionally remove tuplecids after decoding the transaction at prepare
* time as we only need to perform invalidation at rollback or commit prepared.
*
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.
+ *
* 'txn_prepared' indicates that we have decoded the transaction at prepare
* time.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_txn_streaming)

I think the function comment should describe the parameters in the
same order that they appear in the function signature.

Not sure it should be. We sometimes describe the overall idea of the
function first, using the argument names, and then describe what the
other arguments mean.

~~~

3.
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
(txn->nentries_mem != 0)))
+ {
...
+ txn->txn_flags |= RBTXN_IS_STREAMED;
+ }

I guess it doesn't matter much, but for the sake of readability,
should the condition also be checking !rbtxn_is_streamed(txn) to avoid
overwriting the RBTXN_IS_STREAMED bit when it was set already?

Not sure it improves readability because it adds one more check there.
If it's important not to re-set RBTXN_IS_STREAMED, it makes sense to
have that check and describe it in the comment. But in this case, I think
we don't necessarily need to do that.
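
As a standalone illustration of the point above (this is not code from the
patch): OR-ing a flag bit that is already set leaves the flags word
unchanged, so the guarded and unguarded forms end in the same state. The
DemoTxn struct, the demo_* helpers and the DEMO_IS_STREAMED bit value are
assumptions made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value, chosen only for this sketch. */
#define DEMO_IS_STREAMED 0x0010u

/* Hypothetical stand-in for a transaction that carries a flags word. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

bool
demo_is_streamed(const DemoTxn *txn)
{
	return (txn->txn_flags & DEMO_IS_STREAMED) != 0;
}

/* Unguarded form: set the bit whether or not it is already set. */
void
demo_mark_streamed(DemoTxn *txn)
{
	txn->txn_flags |= DEMO_IS_STREAMED;
}

/* Guarded form: one extra check, same resulting flags. */
void
demo_mark_streamed_guarded(DemoTxn *txn)
{
	if (!demo_is_streamed(txn))
		txn->txn_flags |= DEMO_IS_STREAMED;
}
```

Calling either helper twice on the same DemoTxn leaves all other bits
untouched, so the extra check only matters if re-setting the bit had some
side effect worth documenting.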

~~~

ReorderBufferTruncateTXNIfAborted:

4.
+ /*
+ * The transaction aborted. We discard the changes we've collected so far,
+ * and free all resources allocated for toast reconstruction. The full
+ * cleanup will happen as part of decoding ABORT record of this
+ * transaction.
+ *
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, false, false);

4a.
The first part of the comment says "... and free all resources
allocated for toast reconstruction", but the second part says "we
don't need to reset toast reconstruction data here". Is that a
contradiction?

Yes, the comment is out-of-date. Since this function is not called
while replaying the transaction, it should not have any toast
reconstruction data.

~

4b.
Shouldn't this call still be passing rbtxn_prepared(txn) as the 2nd
last param, like it used to?

Actually it's not necessary because it should always be false. But
thinking more, it seems to be better to use rbtxn_prepared(txn) since
it's consistent with other places and it's not necessary to put
assumptions there.

Few comments:
1) Should we have the Assert inside ReorderBufferTruncateTXNIfAborted
instead of having it at multiple callers? ReorderBufferResetTXN also
has the Assert inside the function after truncating the transaction:
@@ -3672,6 +3758,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);

+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }
+
                        ReorderBufferStreamTXN(rb, txn);
                }
                else
@@ -3687,6 +3781,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
                        Assert(txn->size > 0);
                        Assert(rb->size >= txn->size);
+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }

2) txn->txn_flags can be moved to the next line to keep it within 80
chars in this case:
* Check the transaction status by looking CLOG and discard all changes if
* the transaction is aborted. The transaction status is cached in
txn->txn_flags
* so we can skip future changes and avoid CLOG lookups on the next call. Return

3) Is there any scenario where the Assert can fail because the toast data is not reset?
+        * Since we don't check the transaction status while replaying the
+        * transaction, we don't need to reset toast reconstruction data here.
+        */
+       ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }
4) This can be changed to a single line comment:
+       /*
+        * Quick return if the transaction status is already known.
+        */
+       if (rbtxn_is_committed(txn))
+               return false;
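
For readers following comments 2) and 4), the caching behaviour that the
quoted function comment describes can be sketched in a standalone form.
This is only an illustration, not the patch itself: DemoTxn, the demo_*
functions and the real_status/lookups fields are hypothetical stand-ins
(real_status plays the role of the CLOG and lookups counts how often it is
consulted). Only the two flag values are taken from the reorderbuffer.h
hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as in the reorderbuffer.h hunk of the patch. */
#define DEMO_IS_COMMITTED 0x0200u
#define DEMO_IS_ABORTED   0x0400u

typedef enum
{
	DEMO_XACT_IN_PROGRESS,
	DEMO_XACT_COMMITTED,
	DEMO_XACT_ABORTED
} DemoXactStatus;

/* Hypothetical stand-in for a reorder buffer transaction. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
	size_t		size;			/* bytes of collected changes */
	DemoXactStatus real_status;	/* stand-in for what CLOG would report */
	int			lookups;		/* number of simulated status lookups */
} DemoTxn;

/* Simulated (expensive) status lookup. */
DemoXactStatus
demo_status_lookup(DemoTxn *txn)
{
	txn->lookups++;
	return txn->real_status;
}

/*
 * Return true if the transaction is aborted, discarding its collected
 * changes the first time that is noticed. A final status is cached in
 * txn_flags so later calls return without another lookup.
 */
bool
demo_truncate_if_aborted(DemoTxn *txn)
{
	/* Quick return if the transaction status is already known */
	if (txn->txn_flags & DEMO_IS_COMMITTED)
		return false;
	if (txn->txn_flags & DEMO_IS_ABORTED)
		return true;

	switch (demo_status_lookup(txn))
	{
		case DEMO_XACT_IN_PROGRESS:
			/* Not final yet, so nothing can be cached. */
			return false;
		case DEMO_XACT_COMMITTED:
			txn->txn_flags |= DEMO_IS_COMMITTED;
			return false;
		case DEMO_XACT_ABORTED:
			break;
	}

	/* Aborted: drop what was collected and remember the status. */
	txn->size = 0;
	txn->txn_flags |= DEMO_IS_ABORTED;
	return true;
}
```

In this sketch an in-progress transaction is looked up again on every
call, while a committed or aborted one costs exactly one lookup.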

Regards,
Vignesh

#30Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#28)
Re: Skip collecting decoded changes of already-aborted transactions

Hi, Here are my review comments for patch v9-0001.

These are only trivial nits for some code comments. Everything else
looked good to me.

======
.../replication/logical/reorderbuffer.c

ReorderBufferTruncateTXN:

1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.

/asked it/requested it/

/being true/as true/

~~~

ReorderBufferPrepare:

2.
+ /*
+ * Remember if the transaction is already aborted to check if we detect
+ * that the transaction is concurrently aborted during the replay.
+ */

SUGGESTION:
Remember if the transaction is already aborted so we can detect when
the transaction is concurrently aborted during the replay.
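
To illustrate what the comment is describing, here is a minimal standalone
sketch of the "remember the flag before replay, compare after replay"
pattern used in the ReorderBufferPrepare hunk. It is not the patch code:
DemoTxn, DemoReplayFn and the demo_* functions are hypothetical, and the
DEMO_IS_STREAMED bit value is an assumption. DEMO_IS_ABORTED uses the value
from the reorderbuffer.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IS_STREAMED 0x0010u	/* assumed value, for this sketch only */
#define DEMO_IS_ABORTED  0x0400u	/* value from the reorderbuffer.h hunk */

/* Hypothetical stand-in for a transaction that carries a flags word. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

/* Hypothetical replay step; it may set DEMO_IS_ABORTED on the transaction. */
typedef void (*DemoReplayFn) (DemoTxn *txn);

/*
 * Return true if a prepare callback would still be needed after replay,
 * i.e. the abort was first detected during the replay (a concurrent abort)
 * and the transaction was not streamed.
 */
bool
demo_needs_prepare_after_replay(DemoTxn *txn, DemoReplayFn replay)
{
	/* Remember the status before replay so a new abort can be told apart. */
	bool		already_aborted = (txn->txn_flags & DEMO_IS_ABORTED) != 0;

	replay(txn);

	return !already_aborted &&
		(txn->txn_flags & DEMO_IS_ABORTED) != 0 &&
		(txn->txn_flags & DEMO_IS_STREAMED) == 0;
}

/* Replay that notices a concurrent abort. */
void
demo_replay_detects_abort(DemoTxn *txn)
{
	txn->txn_flags |= DEMO_IS_ABORTED;
}

/* Replay that completes without noticing any abort. */
void
demo_replay_no_abort(DemoTxn *txn)
{
	(void) txn;
}
```

With a single aborted flag replacing the old concurrent_abort field, the
snapshot taken before replay is what distinguishes "aborted during replay"
from "already known to be aborted".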

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#31Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#29)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Nov 18, 2024 at 11:12 PM vignesh C <vignesh21@gmail.com> wrote:

Few comments:

Thank you for reviewing the patch!

1) Should we have the Assert inside ReorderBufferTruncateTXNIfAborted
instead of having it at multiple callers? ReorderBufferResetTXN also
has the Assert inside the function after truncating the transaction:
@@ -3672,6 +3758,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);

+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }
+
ReorderBufferStreamTXN(rb, txn);
}
else
@@ -3687,6 +3781,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->size > 0);
Assert(rb->size >= txn->size);
+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }

Moved.

2) txn->txn_flags can be moved to the next line to keep it within 80
chars in this case:
* Check the transaction status by looking CLOG and discard all changes if
* the transaction is aborted. The transaction status is cached in
txn->txn_flags
* so we can skip future changes and avoid CLOG lookups on the next call. Return

Fixed.

3) Is there any scenario where the Assert can fail because the toast data is not reset?
+        * Since we don't check the transaction status while replaying the
+        * transaction, we don't need to reset toast reconstruction data here.
+        */
+       ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }

IIUC we reconstruct TOAST data when replaying the transaction. On the
other hand, this function is called while adding a decoded change but
not when replaying the transaction. So we should not have any toast
reconstruction data at this point unless I'm missing something. Do you
have any scenario where we call ReorderBufferTruncateTXNIfAborted()
while a transaction has TOAST reconstruction data?

4) This can be changed to a single line comment:
+       /*
+        * Quick return if the transaction status is already known.
+        */
+       if (rbtxn_is_committed(txn))
+               return false;

Fixed.

I'll post the updated version of the patch soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#32Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#30)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Nov 24, 2024 at 8:50 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi, Here are my review comments for patch v9-0001.

These are only trivial nits for some code comments. Everything else
looked good to me.

======
.../replication/logical/reorderbuffer.c

ReorderBufferTruncateTXN:

1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * asked it by passing 'mark_txn_streaming' being true.

/asked it/requested it/

/being true/as true/

~~~

ReorderBufferPrepare:

2.
+ /*
+ * Remember if the transaction is already aborted to check if we detect
+ * that the transaction is concurrently aborted during the replay.
+ */

SUGGESTION:
Remember if the transaction is already aborted so we can detect when
the transaction is concurrently aborted during the replay.

Thank you for the suggestions.

I've attached a new version of the patch that incorporates all the comments I've received so far.

I think the patch is in good shape but I'm considering whether we
might want to call ReorderBufferToastReset() after truncating all
changes, in ReorderBufferTruncateTXNIfAborted() just in case. Will
investigate further.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v10-0001-Skip-logical-decoding-of-already-aborted-transac.patchapplication/octet-stream; name=v10-0001-Skip-logical-decoding-of-already-aborted-transac.patchDownload
From 9a3c5b3b4228270209e71599ccc5be2af0f5724a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v10] Skip logical decoding of already-aborted transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 +-
 6 files changed, 226 insertions(+), 50 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..418ab627408 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool mark_txn_streaming);
+static bool ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +795,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1622,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
+ * The given transaction is marked as streamed if appropriate and the caller
+ * requested it by passing 'mark_txn_streaming' as true.
+ *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool mark_txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1657,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, mark_txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1687,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1710,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed, if appropriate.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1752,6 +1760,74 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call. Return true if the transaction is aborted, otherwise return
+ * false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/* Quick return if the transaction status is already known */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far.
+	 * The full cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * Since we don't check the transaction status while replaying the
+	 * transaction, we don't need to reset toast reconstruction data here.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2054,10 +2130,10 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
- * abort of the streaming transaction.  This resets the TXN such that it
- * can be used to stream the remaining data of transaction being processed.
- * This can happen when the subtransaction is aborted and we still want to
- * continue processing the main or other subtransactions data.
+ * abort of the streaming (prepared) transaction.  This resets the TXN such
+ * that it can be used to stream the remaining data of transaction being
+ * processed. This can happen when the subtransaction is aborted and we
+ * still want to continue processing the main or other subtransactions data.
  */
 static void
 ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2671,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2724,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2810,6 +2889,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 					 char *gid)
 {
 	ReorderBufferTXN *txn;
+	bool		already_aborted;
 
 	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
 								false);
@@ -2824,6 +2904,12 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * Remember if the transaction is already aborted so we can detect when
+	 * the transaction is concurrently aborted during the replay.
+	 */
+	already_aborted = rbtxn_is_aborted(txn);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	 * when rollback prepared is decoded and sent, the downstream should be
 	 * able to rollback such a xact. See comments atop DecodePrepare.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3566,7 +3652,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3695,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3754,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3765,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3784,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6ad5a8cb9c5..e4c09c86c76 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#33vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#31)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, 26 Nov 2024 at 02:58, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 18, 2024 at 11:12 PM vignesh C <vignesh21@gmail.com> wrote:

Few comments:

Thank you for reviewing the patch!

1) Should we have the Assert inside ReorderBufferTruncateTXNIfAborted
instead of having it at multiple callers? ReorderBufferResetTXN also
has the Assert inside the function, after truncating the transaction:
@@ -3672,6 +3758,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);

+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }
+
ReorderBufferStreamTXN(rb, txn);
}
else
@@ -3687,6 +3781,14 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->size > 0);
Assert(rb->size >= txn->size);
+                       /* skip the transaction if aborted */
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }

Moved.

2) txn->txn_flags can be moved to the next line to keep it within 80
chars in this case:
* Check the transaction status by looking CLOG and discard all changes if
* the transaction is aborted. The transaction status is cached in
txn->txn_flags
* so we can skip future changes and avoid CLOG lookups on the next call. Return

Fixed.

3) Is there any scenario where the Assert can fail because the toast data is not reset:
+        * Since we don't check the transaction status while replaying the
+        * transaction, we don't need to reset toast reconstruction data here.
+        */
+       ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+                       if (ReorderBufferTruncateTXNIfAborted(rb, txn))
+                       {
+                               /* All changes should be discarded */
+                               Assert(txn->size == 0 && txn->total_size == 0);
+                               continue;
+                       }

IIUC we reconstruct TOAST data when replaying the transaction. On the
other hand, this function is called while adding a decoded change but
not when replaying the transaction. So we should not have any toast
reconstruction data at this point unless I'm missing something. Do you
have any scenario where we call ReorderBufferTruncateTXNIfAborted()
while a transaction has TOAST reconstruction data?

I have checked further regarding the toast and verified the population
of the toast hash. I agree with you on this. Overall, the patch
appears to be in good shape.

Regards,
Vignesh

#34Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#33)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Nov 26, 2024 at 10:01 PM vignesh C <vignesh21@gmail.com> wrote:

I have checked further regarding the toast and verified the population
of the toast hash. I agree with you on this. Overall, the patch
appears to be in good shape.

Thank you for the confirmation!

I thought we'd done performance tests with this patch, but Michael-san
pointed out that we haven't done any yet. So I've run benchmark tests in
two scenarios:

A. Skip decoding large aborted transactions.

1. Preparation (SQL commands)

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
begin;
insert into test select generate_series(1, 1_000_000);
commit;
begin;
insert into test select generate_series(1, 1_000_000);
rollback;
begin;
insert into test select generate_series(1, 1_000_000);
rollback;

2. Performance tests (results are w/o patch vs. w/ patch)

-- causes some spill/streamed transactions
set logical_decoding_work_mem to '64MB';

select 'non-streaming', count(*) from
pg_logical_slot_peek_changes('s', null, null, 'stream-changes',
'false');
-> 2636.208 ms vs. 2070.906 ms

select 'streaming', count(*) from pg_logical_slot_peek_changes('s',
null, null, 'stream-changes', 'true');
-> 910.579 ms vs. 653.574 ms

-- no spill/streamed transactions
set logical_decoding_work_mem to '5GB';

select 'non-streaming', count(*) from
pg_logical_slot_peek_changes('s', null, null, 'stream-changes',
'false');
-> 962.863 ms vs. 956.910 ms

select 'streaming', count(*) from pg_logical_slot_peek_changes('s',
null, null, 'stream-changes', 'true');
-> 973.426 ms vs. 973.033 ms

According to the results, skipping logical decoding of already-aborted
transactions improves performance when changes are spilled or streamed,
and makes no measurable difference when everything fits in memory.
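
To put the timings above on a common scale, the following is a small sketch that turns a "w/o patch vs. w/ patch" pair into the percentage of elapsed time saved. The helper name time_saved_pct is hypothetical and is not part of the patch or of PostgreSQL.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage of elapsed time saved when a run
 * goes from before_ms (w/o patch) to after_ms (w/ patch).
 */
double
time_saved_pct(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return (before_ms - after_ms) / before_ms * 100.0;
}
```

For the 64MB runs, time_saved_pct(2636.208, 2070.906) comes out at about 21.4 and time_saved_pct(910.579, 653.574) at about 28.2. Both 5GB runs give well under 1, which is within what one would expect from run-to-run noise.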

B. Decoding medium-size transactions to check overheads of CLOG lookups.

1. Preparation (shell script)

pgbench -i -s 1 postgres
psql -c "create table test (c int)"
psql -c "select pg_create_logical_replication_slot('s', 'test_decoding')"
echo "insert into test select generate_series(1, 100)" > /tmp/bench.sql
pgbench -t 10000 -c 10 -j 5 -f /tmp/bench.sql postgres

2. Performance tests

-- spill/streamed transactions
set logical_decoding_work_mem to '64kB';

select 'non-streaming', count(*) from
pg_logical_slot_peek_changes('s', null, null, 'stream-changes',
'false');
-> 7230.537 ms vs. 7154.322 ms

select 'streaming', count(*) from pg_logical_slot_peek_changes('s',
null, null, 'stream-changes', 'true');
-> 6702.438 ms vs. 6678.232 ms

Overall, I don't see any noticeable overhead from CLOG lookups.
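
One thing that may help keep the lookups cheap is that the patch caches a final status in txn->txn_flags, so a transaction known to be committed or aborted is not looked up again. Below is a minimal, self-contained model of that control flow. The two flag values are copied from the patch; ModelTXN, XactStatus, mock_clog_lookup and clog_lookups are hypothetical stand-ins for illustration and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the patch's reorderbuffer.h hunk. */
#define RBTXN_IS_COMMITTED	0x0200
#define RBTXN_IS_ABORTED	0x0400

/* Hypothetical: the three answers a status lookup can give. */
typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactStatus;

/* Hypothetical, heavily reduced model of ReorderBufferTXN. */
typedef struct
{
	uint32_t	txn_flags;		/* cached status bits */
	size_t		size;			/* bytes of collected changes */
	XactStatus	real_status;	/* what the mock CLOG answers */
} ModelTXN;

/* Counts how often the mock CLOG is consulted. */
int			clog_lookups = 0;

XactStatus
mock_clog_lookup(const ModelTXN *txn)
{
	clog_lookups++;
	return txn->real_status;
}

/* Returns true if the transaction is aborted, discarding its changes. */
bool
truncate_txn_if_aborted(ModelTXN *txn)
{
	XactStatus	status;

	/* Cached answers avoid consulting the mock CLOG again. */
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	status = mock_clog_lookup(txn);

	/* Not cached: an in-progress transaction can still change status. */
	if (status == XACT_IN_PROGRESS)
		return false;

	if (status == XACT_COMMITTED)
	{
		txn->txn_flags |= RBTXN_IS_COMMITTED;
		return false;
	}

	/* Aborted: drop what was collected and remember the outcome. */
	txn->size = 0;
	txn->txn_flags |= RBTXN_IS_ABORTED;
	return true;
}
```

In this model, calling truncate_txn_if_aborted() twice on an aborted or committed transaction performs only one lookup, while an in-progress transaction is looked up on every call.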

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#32)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached a new version patch that incorporates all comments I got so far.

Review comments:
===============
1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * requested it by passing 'mark_txn_streaming' as true.
+ *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_txn_streaming)
 {
...
  }
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
(txn->nentries_mem != 0)))
+ {
+ /*
+ * Mark the transaction as streamed, if appropriate.

The comments related to the above changes don't clarify in which cases
the 'mark_txn_streaming' should be set. Before this patch, it was
clear from the comments and code about the cases where we would decide
to mark it as streamed.
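
For reference, the marking decision in the patched ReorderBufferTruncateTXN() can be condensed into a single predicate. This is only a sketch: should_mark_streamed and check_call_patterns are hypothetical names, and the argument pairs in the usage function are read off the call sites in the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: would the patched ReorderBufferTruncateTXN() set
 * RBTXN_IS_STREAMED for a transaction with these properties?
 */
bool
should_mark_streamed(bool txn_prepared, bool mark_txn_streaming,
					 bool is_toptxn, uint64_t nentries_mem)
{
	/* Prepared transactions take the other branch and are not marked. */
	if (txn_prepared)
		return false;

	/* Top-level: always. Subtransaction: only with in-memory changes. */
	return mark_txn_streaming && (is_toptxn || nentries_mem != 0);
}

/* Usage sketch: the argument patterns used by the patch's callers. */
void
check_call_patterns(void)
{
	/* ReorderBufferStreamCommit(): (true, true) */
	assert(!should_mark_streamed(true, true, true, 10));

	/* ReorderBufferResetTXN(), not prepared: (false, true) */
	assert(should_mark_streamed(false, true, true, 0));

	/* ReorderBufferProcessTXN() while streaming: (false, true) */
	assert(should_mark_streamed(false, true, false, 3));

	/* ReorderBufferTruncateTXNIfAborted(): (..., false) */
	assert(!should_mark_streamed(false, false, true, 10));
}
```

Read this way, the only caller in the patch that passes false unconditionally is ReorderBufferTruncateTXNIfAborted(), where changes are discarded without having been sent downstream.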

2.
+ /*
+ * Mark the transaction as aborted so we ignore future changes of this
+ * transaction.

/so we ignore/so we can ignore/

3.
* Helper function for ReorderBufferProcessTXN to handle the concurrent
- * abort of the streaming transaction.  This resets the TXN such that it
- * can be used to stream the remaining data of transaction being processed.
- * This can happen when the subtransaction is aborted and we still want to
- * continue processing the main or other subtransactions data.
+ * abort of the streaming (prepared) transaction.
...

In the above comment, "... streaming (prepared)...", you added
prepared to imply that this function handles concurrent abort for both
in-progress and prepared transactions. Am I correct? If so, the
current change makes it less clear. If you see the comments at its
caller, they are clearer.

4.
+ /*
+ * Remember if the transaction is already aborted so we can detect when
+ * the transaction is concurrently aborted during the replay.
+ */
+ already_aborted = rbtxn_is_aborted(txn);
+
  ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
  txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
@@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
TransactionId xid,
  * when rollback prepared is decoded and sent, the downstream should be
  * able to rollback such a xact. See comments atop DecodePrepare.
  *
- * Note, for the concurrent_abort + streaming case a stream_prepare was
+ * Note, for the concurrent abort + streaming case a stream_prepare was
  * already sent within the ReorderBufferReplay call above.
  */
- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
  rb->prepare(rb, txn, txn->final_lsn);

It is not clear from the comments how the 'already_aborted' is
handled. I think after this patch we would have already truncated all
its changes. If so, why do we need to try to replay the changes of
such a xact?
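
The condition in the hunk above can be written out as a small predicate to make the three inputs explicit. This is a sketch with a hypothetical helper name, and it assumes that the aborted flag is only ever set and never cleared, which is how the posted patch appears to use it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the quoted condition: the prepare
 * callback is invoked after the replay only if the abort was first
 * noticed during the replay and the transaction was not streamed.
 */
bool
send_prepare_after_replay(bool already_aborted, bool aborted_after_replay,
						  bool streamed)
{
	/* Assumption of this sketch: the aborted flag is never cleared. */
	assert(!already_aborted || aborted_after_replay);

	return !already_aborted && aborted_after_replay && !streamed;
}
```

With these inputs, a transaction that was already marked aborted before the replay yields false, while one that becomes aborted during a non-streamed replay yields true.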

5.
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call. Return true if the transaction is aborted, otherwise return
+ * false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{

I think this function is being invoked to mark a sub-transaction as
aborted. It is better to explain in comments how it interacts with
sub-transactions, why it is okay to mark them as aborted, and how the
other parts of the system interact with it.

--
With Regards,
Amit Kapila.

#36Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#32)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Nov 26, 2024 at 3:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached a new version patch that incorporates all comments I got so far.

I think the patch is in good shape but I'm considering whether we
might want to call ReorderBufferToastReset() after truncating all
changes, in ReorderBufferTruncateTXNIfAborted() just in case. Will
investigate further.

There’s something that seems a bit odd to me. Consider the case where
the largest transaction(s) are aborted. If
ReorderBufferCanStartStreaming() returns true, the changes from this
transaction will only be discarded if it's a streamable transaction.
However, if ReorderBufferCanStartStreaming() is false, the changes
will be discarded regardless.

What seems strange to me in this patch is that truncating the changes of a
large aborted transaction depends on whether we need to stream or
spill, when the two should be completely independent IMHO. My
concern is that if the largest transaction is aborted but isn’t yet
streamable, we might end up picking the next transaction, which could
be much smaller. This smaller transaction might not help us stay
within the memory limit, and we could repeat this process for a few
more transactions. In contrast, it might be more efficient to simply
discard the large aborted transaction, even if it’s not streamable, to
avoid this issue.
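
The selection being discussed can be modelled in a few lines. The sketch below reproduces the filter and the size comparison from the patched ReorderBufferLargestStreamableTopTXN(); ModelTopTXN and largest_streamable_top_txn are hypothetical names, and a plain array stands in for the reorder buffer's transaction list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a top-level ReorderBufferTXN. */
typedef struct
{
	size_t		total_size;				/* size including subtransactions */
	bool		has_partial_change;
	bool		has_streamable_change;
	bool		is_aborted;				/* already flagged as aborted */
} ModelTopTXN;

/* Returns the largest eviction candidate, or NULL if there is none. */
const ModelTopTXN *
largest_streamable_top_txn(const ModelTopTXN *txns, size_t ntxns)
{
	const ModelTopTXN *largest = NULL;

	assert(txns != NULL || ntxns == 0);

	for (size_t i = 0; i < ntxns; i++)
	{
		const ModelTopTXN *txn = &txns[i];

		/* Don't consider these kinds of transactions for eviction. */
		if (txn->has_partial_change ||
			!txn->has_streamable_change ||
			txn->is_aborted)
			continue;

		/* Find the largest of the eviction candidates. */
		if (txn->total_size > 0 &&
			(largest == NULL || txn->total_size > largest->total_size))
			largest = txn;
	}

	return largest;
}
```

In this model, a large transaction that is flagged as aborted or has no streamable change is passed over, and a much smaller candidate can be returned instead. When nothing qualifies the result is NULL, which corresponds to the caller falling back to the spill branch.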

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#36)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Dec 10, 2024 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

There’s something that seems a bit odd to me. Consider the case where
the largest transaction(s) are aborted. If
ReorderBufferCanStartStreaming() returns true, the changes from this
transaction will only be discarded if it's a streamable transaction.
However, if ReorderBufferCanStartStreaming() is false, the changes
will be discarded regardless.

What seems strange to me in this patch is that truncating the changes of a
large aborted transaction depends on whether we need to stream or
spill, when the two should be completely independent IMHO. My
concern is that if the largest transaction is aborted but isn’t yet
streamable, we might end up picking the next transaction, which could
be much smaller. This smaller transaction might not help us stay
within the memory limit, and we could repeat this process for a few
more transactions. In contrast, it might be more efficient to simply
discard the large aborted transaction, even if it’s not streamable, to
avoid this issue.

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice?

--
With Regards,
Amit Kapila.

#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#35)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Dec 10, 2024 at 10:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

5.
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call. Return true if the transaction is aborted, otherwise return
+ * false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{

I think this function is being invoked to mark a sub-transaction as
aborted. It is better to explain in comments how it interacts with
sub-transactions, why it is okay to mark them as aborted, and how the
other parts of the system interact with it.

The current name suggests that the main purpose is to truncate the txn,
which is okay, but wouldn't it be better to name it along the lines of
ReorderBufferCheckAndTruncateAbortedTXN()?

In the following comment, can we move 'Return ...' to the next line to
make the return values from the function clear?
+ * next call. Return true if the transaction is aborted, otherwise return
+ * false.

--
With Regards,
Amit Kapila.

#39Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#37)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Dec 10, 2024 at 11:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice?

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

I’m not saying this is a very common scenario, but I do feel that the
logic behind truncating the largest transaction doesn’t seem entirely
consistent. However, maybe this isn't a major issue. We could justify
the current behavior by saying that before picking any transaction for
streaming or spilling, we first check whether it has been aborted.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#40Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dilip Kumar (#39)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

IIUC, transaction entries in the reorderbuffer get their base snapshot
before the first change is decoded (see SnapBuildProcessChange()). In
which case would a transaction have no base snapshot and yet hold the
largest amount of changes? A subtransaction entry could transfer its
base snapshot to its parent transaction entry, but such subtransactions
will be picked by ReorderBufferLargestTXN().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#41Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#40)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice the need?

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

IIUC the transaction entries in reorderbuffer have the base snapshot
before decoding the first change (see SnapBuildProcessChange()). In
which case the transaction doesn't have the base snapshot and has the
largest amount of changes? Subtransaction entries could transfer its
base snapshot to its parent transaction entry but such subtransactions
will be picked by ReorderBufferLargestTXN().

IIRC, there could be cases where the reorder buffer of a transaction
grows in size without the transaction having a base snapshot. I think
transactions doing DDLs and generating a lot of INVALIDATION messages
could fall into that category. That was one of the reasons why we were
using txns_by_base_snapshot_lsn inside
ReorderBufferLargestStreamableTopTXN().
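
To illustrate the concern, here is a minimal toy sketch (not the
reorderbuffer.c code) of picking the largest transaction with and
without the base-snapshot restriction. ToyTxn and toy_largest_txn() are
hypothetical names made up for this example; the real functions walk
different data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a reorder buffer transaction entry. */
typedef struct ToyTxn
{
	unsigned int xid;
	size_t		size;				/* bytes of decoded changes in memory */
	bool		has_base_snapshot;
} ToyTxn;

/*
 * Return the largest transaction in the array. When require_base_snapshot
 * is true, only entries with a base snapshot are considered, loosely
 * mimicking a "streamable only" search. Returns NULL if nothing qualifies.
 */
const ToyTxn *
toy_largest_txn(const ToyTxn *txns, size_t ntxns, bool require_base_snapshot)
{
	const ToyTxn *largest = NULL;

	assert(txns != NULL || ntxns == 0);

	for (size_t i = 0; i < ntxns; i++)
	{
		/* Skip entries that the restricted search cannot see */
		if (require_base_snapshot && !txns[i].has_base_snapshot)
			continue;

		if (largest == NULL || txns[i].size > largest->size)
			largest = &txns[i];
	}

	return largest;
}
```

With one large entry lacking a base snapshot and a few small entries
that have one, the restricted search returns one of the small entries,
while the unrestricted search returns the large one.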

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#41)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Dec 11, 2024 at 8:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice the need?

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

IIUC the transaction entries in reorderbuffer have the base snapshot
before decoding the first change (see SnapBuildProcessChange()). In
which case the transaction doesn't have the base snapshot and has the
largest amount of changes? Subtransaction entries could transfer its
base snapshot to its parent transaction entry but such subtransactions
will be picked by ReorderBufferLargestTXN().

IIRC, there could be cases where reorder buffers of transactions can
grow in size without having a base snapshot, I think transactions
doing DDLs and generating a lot of INVALIDATION messages could fall in
such a category.

Are we recording such changes in the reorder buffer? If so, can you
please share how? AFAICU, the main idea behind skipping aborts is to
avoid sending a lot of data to the client that later needs to be
discarded or cases where we spent resources/time spilling the changes
that later need to be discarded. In that vein, the current idea of the
patch where it truncates and skips aborted xacts before streaming or
spilling them sounds reasonable.
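
As a rough sketch of that idea (check the status right before eviction,
discard the changes if the transaction is aborted, and cache the result
so later calls skip the lookup), something like the following toy model
could be used. All names here (CheckTxn, toy_status_lookup,
toy_truncate_if_aborted, toy_lookup_odd_aborted) are hypothetical, and
the callback merely stands in for the CLOG lookup; this is not the
patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyXactStatus
{
	TOY_IN_PROGRESS,
	TOY_COMMITTED,
	TOY_ABORTED
} ToyXactStatus;

#define TOY_FLAG_COMMITTED	0x01
#define TOY_FLAG_ABORTED	0x02

/* Hypothetical, simplified transaction entry */
typedef struct CheckTxn
{
	unsigned int xid;
	unsigned int flags;			/* cached status bits */
	size_t		size;			/* bytes of collected changes */
} CheckTxn;

/* Hypothetical callback standing in for a transaction status lookup */
typedef ToyXactStatus (*toy_status_lookup) (unsigned int xid);

/* Example lookup: odd xids are aborted, even xids committed; counts calls */
int			toy_lookup_calls = 0;

ToyXactStatus
toy_lookup_odd_aborted(unsigned int xid)
{
	toy_lookup_calls++;
	return (xid % 2) ? TOY_ABORTED : TOY_COMMITTED;
}

/*
 * Return true if the transaction is aborted, discarding its changes the
 * first time this is discovered. The status is cached in txn->flags.
 */
bool
toy_truncate_if_aborted(CheckTxn *txn, toy_status_lookup lookup)
{
	/* Quick returns when the status is already cached */
	if (txn->flags & TOY_FLAG_COMMITTED)
		return false;
	if (txn->flags & TOY_FLAG_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	switch (lookup(txn->xid))
	{
		case TOY_IN_PROGRESS:
			return false;		/* unknown outcome, nothing to cache */
		case TOY_COMMITTED:
			txn->flags |= TOY_FLAG_COMMITTED;
			return false;
		case TOY_ABORTED:
			break;
	}

	/* Aborted: drop the collected changes and remember the status */
	txn->size = 0;
	txn->flags |= TOY_FLAG_ABORTED;
	return true;
}
```

Calling toy_truncate_if_aborted() twice on the same entry performs the
lookup only once; the second call is answered from the cached flag.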

--
With Regards,
Amit Kapila.

#43Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#42)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Dec 12, 2024 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 11, 2024 at 8:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice the need?

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

IIUC the transaction entries in reorderbuffer have the base snapshot
before decoding the first change (see SnapBuildProcessChange()). In
which case the transaction doesn't have the base snapshot and has the
largest amount of changes? Subtransaction entries could transfer its
base snapshot to its parent transaction entry but such subtransactions
will be picked by ReorderBufferLargestTXN().

IIRC, there could be cases where reorder buffers of transactions can
grow in size without having a base snapshot, I think transactions
doing DDLs and generating a lot of INVALIDATION messages could fall in
such a category.

Are we recording such changes in the reorder buffer? If so, can you
please share how?

xact_decode() does add XLOG_XACT_INVALIDATIONS to the reorder buffer,
and for such changes we don't call SnapBuildProcessChange(). That means
it is possible to collect such changes in the reorder buffer without
setting the base_snapshot.

AFAICU, the main idea behind skipping aborts is to
avoid sending a lot of data to the client that later needs to be
discarded or cases where we spent resources/time spilling the changes
that later need to be discarded. In that vein, the current idea of the
patch where it truncates and skips aborted xacts before streaming or
spilling them sounds reasonable.

I believe in one of my previous responses (a few emails above), I
agreed that it's a reasonable goal to check for aborted transactions
just before spilling or streaming, and if we detect an aborted
transaction, we can avoid streaming/spilling and simply discard the
changes. However, I wanted to make a point that if we have a large
aborted transaction without a base snapshot (assuming that's
possible), we might end up streaming many small transactions to stay
under the memory limit. Even though we try to stay within the limit,
we still might not succeed because the main issue is the large aborted
transaction, which doesn't have a base snapshot.

So, instead of streaming many small transactions, if we had selected
the largest transaction first and checked if it was aborted, we could
have avoided streaming all those smaller transactions. I agree this is
a hypothetical scenario and may not be worth optimizing, and that's
completely fair. I just wanted to clarify the point I raised when I
first started reviewing this patch.

I haven't tried it myself, but I believe this scenario could be
created by starting a transaction that performs multiple DDLs and then
ultimately gets aborted.
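
A toy simulation of this hypothetical scenario, assuming a simplified
eviction loop that repeatedly picks the largest eligible transaction
until memory is under the limit. EvictTxn and
toy_evict_until_under_limit() are made-up names for illustration and do
not correspond to the actual reorderbuffer.c logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified transaction entry */
typedef struct EvictTxn
{
	size_t		size;				/* bytes of changes in memory */
	bool		has_base_snapshot;
	bool		aborted;
} EvictTxn;

/*
 * Evict transactions, largest eligible first, until the total size is
 * within 'limit' or no eligible transaction is left. An aborted victim is
 * simply discarded; any other victim counts as streamed. Returns the number
 * of streamed transactions and stores the remaining memory in *remaining.
 */
size_t
toy_evict_until_under_limit(EvictTxn *txns, size_t ntxns, size_t limit,
							bool require_base_snapshot, size_t *remaining)
{
	size_t		total = 0;
	size_t		streamed = 0;

	assert(remaining != NULL);

	for (size_t i = 0; i < ntxns; i++)
		total += txns[i].size;

	while (total > limit)
	{
		EvictTxn   *victim = NULL;

		for (size_t i = 0; i < ntxns; i++)
		{
			if (txns[i].size == 0)
				continue;		/* already evicted */
			if (require_base_snapshot && !txns[i].has_base_snapshot)
				continue;		/* invisible to the restricted search */
			if (victim == NULL || txns[i].size > victim->size)
				victim = &txns[i];
		}

		if (victim == NULL)
			break;				/* nothing left to evict */

		if (!victim->aborted)
			streamed++;

		total -= victim->size;
		victim->size = 0;
	}

	*remaining = total;
	return streamed;
}
```

With one 6000-byte aborted transaction lacking a base snapshot, three
small transactions of 400, 300 and 200 bytes, and a limit of 1000, the
restricted search streams all three small transactions and still ends
above the limit. The unrestricted search discards the aborted one first
and streams nothing.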

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dilip Kumar (#43)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Dec 11, 2024 at 10:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Dec 12, 2024 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 11, 2024 at 8:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

If the largest transaction is non-streamable, won't the transaction
returned by ReorderBufferLargestTXN() in the other case already
suffice the need?

I see your point, but I don’t think it’s quite the same. When
ReorderBufferCanStartStreaming() is true, the function
ReorderBufferLargestStreamableTopTXN() looks for the largest
transaction among those that have a base_snapshot. So, if the largest
transaction is aborted but hasn’t yet received a base_snapshot, it
will instead select the largest transaction that does have a
base_snapshot, which could be significantly smaller than the largest
aborted transaction.

IIUC the transaction entries in reorderbuffer have the base snapshot
before decoding the first change (see SnapBuildProcessChange()). In
which case the transaction doesn't have the base snapshot and has the
largest amount of changes? Subtransaction entries could transfer its
base snapshot to its parent transaction entry but such subtransactions
will be picked by ReorderBufferLargestTXN().

IIRC, there could be cases where reorder buffers of transactions can
grow in size without having a base snapshot, I think transactions
doing DDLs and generating a lot of INVALIDATION messages could fall in
such a category.

Are we recording such changes in the reorder buffer? If so, can you
please share how?

xact_decode, do add the XLOG_XACT_INVALIDATIONS in the reorder buffer
and for such changes we don't call SnapBuildProcessChange() that means
it is possible to collect such changes in reorder buffer without
setting the base_snapshot

DDLs write not only XLOG_XACT_INVALIDATIONS but also system catalog
changes. I think that when decoding these system catalog changes, we
end up calling SnapBuildProcessChange(). I understand that decoding
XLOG_XACT_INVALIDATIONS doesn't call SnapBuildProcessChange() but
queues invalidation messages to the reorderbuffer, but I still don't
see in which cases a transaction entry would be quite big while
containing only invalidation messages.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#45Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#44)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Dec 13, 2024 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

DDLs write not only XLOG_XACT_INVALIDATIONS but also system catalog
changes. I think that when decoding these system catalog changes, we
end up calling SnapBuildProcessChange(). I understand that decoding
XLOG_XACT_INVALIDATIONS doesn't call SnapBuildProcessChange() but
queues invalidation messages to the reorderbuffer, but I still don't
understand cases where a transaction entry is quite big and has only a
lot of invalidation messages.

You are right that SnapBuildProcessChange() will be called when there
are changes in the system catalog. However, it is quite possible that
while the system catalog operation is being processed the snapbuild
state is not yet SNAPBUILD_FULL_SNAPSHOT, and that by the time we reach
XLOG_XACT_INVALIDATIONS some concurrent transaction has committed and
the snapbuild state has changed to SNAPBUILD_FULL_SNAPSHOT. That said,
I agree that such a transaction cannot really be very large, because it
can contain invalidation messages from at most a single DDL command. So
maybe we don't need to do anything special for them, and we can go
ahead with the approach you followed in the current patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#45)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Dec 15, 2024 at 10:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 13, 2024 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

DDLs write not only XLOG_XACT_INVALIDATIONS but also system catalog
changes. I think that when decoding these system catalog changes, we
end up calling SnapBuildProcessChange(). I understand that decoding
XLOG_XACT_INVALIDATIONS doesn't call SnapBuildProcessChange() but
queues invalidation messages to the reorderbuffer, but I still don't
understand cases where a transaction entry is quite big and has only a
lot of invalidation messages.

You are right that SnapBuildProcessChange() will be called when there
are changes in the system catalog. However it is very much possible
that when you are processing the system catalog operation the
snapbuild state is not yet SNAPBUILD_FULL_SNAPSHOT and by the time you
reach to XLOG_XACT_INVALIDATIONS some concurrent transaction get
committed and snapbuild state change to SNAPBUILD_FULL_SNAPSHOT.
However, I need to agree that such a transaction can not really be
very large because this can contain Invalidation messages at max from
a single DDL command so maybe we don't need to do anything special for
them and we can go ahead with the approach you followed in the current
patch.

Thanks, I also think we can proceed with the current approach. So, the
pending task is to address a few comments raised by me.

--
With Regards,
Amit Kapila.

#47Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#35)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Dec 9, 2024 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached a new version patch that incorporates all comments I got so far.

Review comments:

Thank you for reviewing the patch!

===============
1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * requested it by passing 'mark_txn_streaming' as true.
+ *
* 'txn_prepared' indicates that we have decoded the transaction at prepare
* time.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_txn_streaming)
{
...
}
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
(txn->nentries_mem != 0)))
+ {
+ /*
+ * Mark the transaction as streamed, if appropriate.

The comments related to the above changes don't clarify in which cases
the 'mark_txn_streaming' should be set. Before this patch, it was
clear from the comments and code about the cases where we would decide
to mark it as streamed.

I think we can rename it to txn_streaming for consistency with
txn_prepared. I've changed the comment for that.

2.
+ /*
+ * Mark the transaction as aborted so we ignore future changes of this
+ * transaction.

/so we ignore/so we can ignore/

Fixed.

3.
* Helper function for ReorderBufferProcessTXN to handle the concurrent
- * abort of the streaming transaction.  This resets the TXN such that it
- * can be used to stream the remaining data of transaction being processed.
- * This can happen when the subtransaction is aborted and we still want to
- * continue processing the main or other subtransactions data.
+ * abort of the streaming (prepared) transaction.
...

In the above comment, "... streaming (prepared)...", you added
prepared to imply that this function handles concurrent abort for both
in-progress and prepared transactions. Am I correct? If so, the
current change makes it less clear. If you see the comments at its
caller, they are clearer.

I think we don't need this change, as the patch doesn't change what
this function does or what the caller expects. So I've removed it.

4.
+ /*
+ * Remember if the transaction is already aborted so we can detect when
+ * the transaction is concurrently aborted during the replay.
+ */
+ already_aborted = rbtxn_is_aborted(txn);
+
ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
@@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
TransactionId xid,
* when rollback prepared is decoded and sent, the downstream should be
* able to rollback such a xact. See comments atop DecodePrepare.
*
- * Note, for the concurrent_abort + streaming case a stream_prepare was
+ * Note, for the concurrent abort + streaming case a stream_prepare was
* already sent within the ReorderBufferReplay call above.
*/
- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
rb->prepare(rb, txn, txn->final_lsn);

It is not clear from the comments how the 'already_aborted' is
handled. I think after this patch we would have already truncated all
its changes. If so, why do we need to try to replay the changes of
such a xact?

I used ReorderBufferReplay() for convenience; it sends begin_prepare()
and prepare() appropriately, handles streaming-prepared transactions,
and updates statistics etc. But as you pointed out, it would not be
necessary to set up a historical snapshot etc. I agree that we don't
need to try replaying such aborted transactions, but I'd like to
confirm that we don't really need to execute invalidation messages even
in aborted transactions.

5.
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call. Return true if the transaction is aborted, otherwise return
+ * false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferTruncateTXNIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{

I think this function is being invoked to mark a sub-transaction as
aborted. It is better to explain in comments how it interacts with
sub-transactions, why it is okay to mark them as aborted, and how the
other parts of the system interact with it.

This function can be called for both top-level transactions and
subtransactions. IIUC there is no major difference between calling it
for a top-level transaction and calling it for a subtransaction. What
interaction with subtransactions are you concerned about?
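
Regarding comment 4, the 'already_aborted' check can be pictured with a
small toy model: remember the abort flag before replay, and treat the
abort as concurrent only if the flag was set during replay. The names
below (ToyPreparedTxn, toy_replay, toy_needs_prepare_after_replay) are
hypothetical and the replay step is reduced to a boolean; this is only a
sketch of the condition quoted above, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TXN_IS_ABORTED	0x01
#define TOY_TXN_IS_STREAMED	0x02

/* Hypothetical, simplified prepared-transaction entry */
typedef struct ToyPreparedTxn
{
	unsigned int flags;
} ToyPreparedTxn;

/*
 * Stand-in for replaying the transaction. If an abort is noticed while
 * replaying, the transaction gets marked as aborted.
 */
void
toy_replay(ToyPreparedTxn *txn, bool abort_seen_during_replay)
{
	assert(txn != NULL);

	if (abort_seen_during_replay)
		txn->flags |= TOY_TXN_IS_ABORTED;
}

/*
 * Replay the transaction and report whether a separate prepare message is
 * still needed afterwards: only when the abort happened concurrently with
 * the replay (not before it) and the transaction was not streamed.
 */
bool
toy_needs_prepare_after_replay(ToyPreparedTxn *txn,
							   bool abort_seen_during_replay)
{
	/* Remember whether the abort was known before the replay started */
	bool		already_aborted = (txn->flags & TOY_TXN_IS_ABORTED) != 0;

	toy_replay(txn, abort_seen_during_replay);

	return !already_aborted &&
		(txn->flags & TOY_TXN_IS_ABORTED) != 0 &&
		(txn->flags & TOY_TXN_IS_STREAMED) == 0;
}
```

In this model an entry that was already marked aborted before the replay
never triggers the extra prepare, regardless of what happens during the
replay.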

I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v11-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From cc57d0cfee21404632ab849cf68119802f78b4e3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v11] Skip logical decoding of already-aborted transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup check to determine if
a transaction is already aborted, so the logical decoding skips
further change also when it doesn't touch system catalogs. This
optimization enhances logical decoding performance, especially for
large transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 204 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  17 +-
 6 files changed, 246 insertions(+), 49 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..9e624953741 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -259,7 +260,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-									 bool txn_prepared);
+									 bool txn_prepared, bool txn_streaming);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +795,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,17 +1622,21 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
  *
  * 'txn_prepared' indicates that we have decoded the transaction at prepare
  * time.
+ *
+ * 'txn_streaming' indicates that the transaction is a streaming transaction.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared,
+						 bool txn_streaming)
 {
 	dlist_mutable_iter iter;
 	Size		mem_freed = 0;
@@ -1650,7 +1656,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared, txn_streaming);
 	}
 
 	/* cleanup changes in the txn */
@@ -1680,24 +1686,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1721,6 +1709,25 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			ReorderBufferReturnChange(rb, change, true);
 		}
 	}
+	else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
+	{
+		/*
+		 * Mark the transaction as streamed.
+		 *
+		 * The top-level transaction, is marked as streamed always, even if it
+		 * does not contain any changes (that is, when all the changes are in
+		 * subtransactions).
+		 *
+		 * For subtransactions, we only mark them as streamed when there are
+		 * changes in them.
+		 *
+		 * We do it this way because of aborts - we don't want to send aborts
+		 * for XIDs the downstream is not aware of. And of course, it always
+		 * knows about the toplevel xact (we send the XID in all messages),
+		 * but we never stream XIDs of empty subxacts.
+		 */
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+	}
 
 	/*
 	 * Destroy the (relfilelocator, ctid) hashtable, so that we don't leak any
@@ -1752,6 +1759,75 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by looking up CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/* Quick return if the transaction status is already known */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far.
+	 * The full cleanup will happen as part of decoding ABORT record of this
+	 * transaction.
+	 *
+	 * Since we don't check the transaction status while replaying the
+	 * transaction, we don't need to reset toast reconstruction data here.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
 		 * just truncate txn by removing changes and tuplecids.
 		 */
-		ReorderBufferTruncateTXN(rb, txn, true);
+		ReorderBufferTruncateTXN(rb, txn, true, true);
 		/* Reset the CheckXidAlive */
 		CheckXidAlive = InvalidTransactionId;
 	}
@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2595,7 +2671,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), streaming);
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2648,7 +2724,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2824,18 +2903,46 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	/* The prepare info must have been updated in txn by now. */
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	/*
+	 * If the transaction is already known to be aborted, we can send a
+	 * prepare or a stream_prepare without replaying the transaction so that
+	 * later when rollback prepared is decoded and sent, the downstream should
+	 * be able to rollback such a xact. See comments atop DecodePrepare.
+	 */
+	if (rbtxn_is_aborted(txn))
+	{
+		if (rbtxn_is_streamed(txn))
+			rb->stream_prepare(rb, txn, txn->final_lsn);
+		else
+		{
+			rb->begin_prepare(rb, txn);
+			rb->prepare(rb, txn, txn->final_lsn);
+
+			/*
+			 * Update total transaction count. Ensure to not count the
+			 * streamed transaction multiple times.
+			 */
+			rb->totalTxns++;
+		}
+
+		/*
+		 * The TXN will fully be cleaned up when decoding either commit or
+		 * rollback prepared.
+		 */
+		return;
+	}
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
+	 * Send a prepare if we detected the concurrent abort while replaying the
+	 * non-streaming transaction.
 	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
+	 * Note, for the concurrent abort + streaming case a stream_prepare was
 	 * already sent within the ReorderBufferReplay call above.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
 		rb->prepare(rb, txn, txn->final_lsn);
 }
 
@@ -3566,7 +3673,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3716,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3775,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3786,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3805,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3bc365a7b0c..bb6c73ca269 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_IS_COMMITTED			0x0200
+#define RBTXN_IS_ABORTED			0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,6 +232,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +433,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5
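
The status caching that ReorderBufferCheckAndTruncateAbortedTXN() does in the patch above can be modeled in isolation. What follows is only a minimal sketch: the two flag bits are copied from the patch, but ToyTXN, ToyXactStatus, real_status, clog_lookups and toy_check_and_truncate_aborted() are made-up stand-ins, and the "lookup" is a field read rather than a real CLOG access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits as defined in the patch; everything else is a toy stand-in. */
#define RBTXN_IS_COMMITTED	0x0200
#define RBTXN_IS_ABORTED	0x0400

typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} ToyXactStatus;

typedef struct ToyTXN
{
	uint32_t	txn_flags;
	size_t		size;			/* bytes of changes collected so far */
	ToyXactStatus real_status;	/* what a status lookup would report */
	int			clog_lookups;	/* number of simulated lookups */
} ToyTXN;

/* Returns true if the transaction is aborted and its changes were dropped. */
static bool
toy_check_and_truncate_aborted(ToyTXN *txn)
{
	/* Quick return if the status is already cached in the flags */
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	/* Otherwise pay for one (simulated) lookup */
	txn->clog_lookups++;

	if (txn->real_status == XACT_IN_PROGRESS)
		return false;			/* nothing to cache yet */

	if (txn->real_status == XACT_COMMITTED)
	{
		txn->txn_flags |= RBTXN_IS_COMMITTED;
		return false;
	}

	/* Aborted: discard what we have and remember the outcome */
	txn->size = 0;
	txn->txn_flags |= RBTXN_IS_ABORTED;
	return true;
}
```

In this model a committed or aborted transaction costs one lookup no matter how often it is picked for eviction, while an in-progress one is looked up again each time, because its final status is not known yet.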

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#47)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Dec 19, 2024 at 7:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached a new version patch that incorporates all comments I got so far.

Review comments:

Thank you for reviewing the patch!

===============
1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * requested it by passing 'mark_txn_streaming' as true.
+ *
* 'txn_prepared' indicates that we have decoded the transaction at prepare
* time.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_txn_streaming)
{
...
}
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
(txn->nentries_mem != 0)))
+ {
+ /*
+ * Mark the transaction as streamed, if appropriate.

The comments related to the above changes don't clarify in which cases
the 'mark_txn_streaming' should be set. Before this patch, it was
clear from the comments and code about the cases where we would decide
to mark it as streamed.

I think we can rename it to txn_streaming for consistency with
txn_prepared. I've changed the comment for that.

@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
    ReorderBufferChange *specinsert)
 {
  /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  * full cleanup will happen as part of the COMMIT PREPAREDs, so now
  * just truncate txn by removing changes and tuplecids.
  */
- ReorderBufferTruncateTXN(rb, txn, true);
+ ReorderBufferTruncateTXN(rb, txn, true, true);

In both the above places, the patch unconditionally passes the
'txn_streaming' even for prepared transactions when it wouldn't be a
streaming xact. Inside the function, the patch handled that by first
checking whether the transaction is prepared (txn_prepared). So, the
logic will work but the function signature and the way its callers are
using make it difficult to use and extend in the future.

I think for the first case, we should get the streaming parameter in
ReorderBufferResetTXN(), and for the second case
ReorderBufferStreamCommit(), we should pass it as false because by
that time transaction is already streamed and prepared. We are
invoking it for cleanup. Even when we call ReorderBufferTruncateTXN()
from ReorderBufferCheckAndTruncateAbortedTXN(), it will be better to
write a comment at the caller about why we are passing this parameter
as false.

4.
+ /*
+ * Remember if the transaction is already aborted so we can detect when
+ * the transaction is concurrently aborted during the replay.
+ */
+ already_aborted = rbtxn_is_aborted(txn);
+
ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
@@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
TransactionId xid,
* when rollback prepared is decoded and sent, the downstream should be
* able to rollback such a xact. See comments atop DecodePrepare.
*
- * Note, for the concurrent_abort + streaming case a stream_prepare was
+ * Note, for the concurrent abort + streaming case a stream_prepare was
* already sent within the ReorderBufferReplay call above.
*/
- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
rb->prepare(rb, txn, txn->final_lsn);

It is not clear from the comments how the 'already_aborted' is
handled. I think after this patch we would have already truncated all
its changes. If so, why do we need to try to replay the changes of
such a xact?

I used ReorderBufferReplay() for convenience; it sends begin_prepare()
and prepare() appropriately, handles streaming-prepared transactions,
and updates statistics etc. But as you pointed out, it would not be
necessary to set up a historical snapshot etc. I agree that we don't
need to try replaying such aborted transactions but I'd like to
confirm we don't really need to execute invalidation messages even in
aborted transactions.

We need to execute invalidations if we have loaded any cache entries,
for example in the case of streaming. See comments in the function
ReorderBufferAbort(). However, I find both the current changes and the
previous patch a bit difficult to follow. How about if we instead
invent a flag like RBTXN_SENT_PREPARE or something like that and then
use that flag to decide whether to send prepare in
ReorderBufferPrepare(). Then add comments for the cases in which
prepare will be sent from ReorderBufferPrepare().

*
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);

Can we expect the size to be zero without resetting the toast data? In
ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
reduces the change size. So, won't that size still be accounted for in
txn?

--
With Regards,
Amit Kapila.

#49Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#48)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 19, 2024 at 7:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 9, 2024 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached a new version patch that incorporates all comments I got so far.

Review comments:

Thank you for reviewing the patch!

===============
1.
+ * The given transaction is marked as streamed if appropriate and the caller
+ * requested it by passing 'mark_txn_streaming' as true.
+ *
* 'txn_prepared' indicates that we have decoded the transaction at prepare
* time.
*/
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared,
+ bool mark_txn_streaming)
{
...
}
+ else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
(txn->nentries_mem != 0)))
+ {
+ /*
+ * Mark the transaction as streamed, if appropriate.

The comments related to the above changes don't clarify in which cases
the 'mark_txn_streaming' should be set. Before this patch, it was
clear from the comments and code about the cases where we would decide
to mark it as streamed.

I think we can rename it to txn_streaming for consistency with
txn_prepared. I've changed the comment for that.

@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
{
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
ReorderBufferTXN *txn)
* full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuplecids.
*/
- ReorderBufferTruncateTXN(rb, txn, true);
+ ReorderBufferTruncateTXN(rb, txn, true, true);

In both the above places, the patch unconditionally passes the
'txn_streaming' even for prepared transactions when it wouldn't be a
streaming xact. Inside the function, the patch handled that by first
checking whether the transaction is prepared (txn_prepared). So, the
logic will work but the function signature and the way its callers are
using make it difficult to use and extend in the future.

Valid concern.

I think for the first case, we should get the streaming parameter in
ReorderBufferResetTXN(),

I think we cannot pass 'rbtxn_is_streamed(txn)' to
ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
is called to handle the concurrent abort of the streaming transaction
but the transaction might not have been marked as streamed at that
time. Since ReorderBufferTruncateTXN() is responsible for both
discarding changes and marking the transaction as streamed, we need to
unconditionally pass txn_streaming = true in this case.

and for the second case
ReorderBufferStreamCommit(), we should pass it as false because by
that time transaction is already streamed and prepared. We are
invoking it for cleanup.

Agreed.

Even when we call ReorderBufferTruncateTXN()
from ReorderBufferCheckAndTruncateAbortedTXN(), it will be better to
write a comment at the caller about why we are passing this parameter
as false.

Agreed.

On second thoughts, I think the confusion related to txn_streaming
came from the fact that ReorderBufferTruncateTXN() does both
discarding changes and marking the transaction as streamed. If we make
the function do just discarding changes, we don't need to introduce
the txn_streaming function argument. Instead, we need to have a
separate function to mark the transaction as streamed and call it
before ReorderBufferTruncateTXN() where appropriate. And
ReorderBufferCheckAndTruncateAbortedTXN() just calls
ReorderBufferTruncateTXN().
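
To make the proposed split concrete, here is a minimal sketch, assuming a heavily simplified transaction struct. ToyTXN, TOY_RBTXN_IS_STREAMED, toy_maybe_mark_streamed() and toy_truncate() are hypothetical names for illustration, not the functions in the patch; only the marking condition (top-level always, subtransactions only when they have changes) is taken from the existing comment in ReorderBufferTruncateTXN().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RBTXN_IS_STREAMED	0x0010	/* illustrative bit value */

typedef struct ToyTXN
{
	uint32_t	txn_flags;
	bool		is_toptxn;
	size_t		nentries_mem;	/* changes currently held in memory */
	size_t		size;			/* bytes accounted to this transaction */
} ToyTXN;

/*
 * Mark the transaction as streamed: the top-level transaction always,
 * subtransactions only when they contain changes.
 */
static void
toy_maybe_mark_streamed(ToyTXN *txn)
{
	if (txn->is_toptxn || txn->nentries_mem != 0)
		txn->txn_flags |= TOY_RBTXN_IS_STREAMED;
}

/* Only discards changes; it no longer decides anything about streaming. */
static void
toy_truncate(ToyTXN *txn)
{
	txn->nentries_mem = 0;
	txn->size = 0;
}
```

A streaming caller would call toy_maybe_mark_streamed() and then toy_truncate(), in that order, since the marking looks at nentries_mem. The already-aborted path would call only toy_truncate(), so no flag argument is needed.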

4.
+ /*
+ * Remember if the transaction is already aborted so we can detect when
+ * the transaction is concurrently aborted during the replay.
+ */
+ already_aborted = rbtxn_is_aborted(txn);
+
ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
@@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
TransactionId xid,
* when rollback prepared is decoded and sent, the downstream should be
* able to rollback such a xact. See comments atop DecodePrepare.
*
- * Note, for the concurrent_abort + streaming case a stream_prepare was
+ * Note, for the concurrent abort + streaming case a stream_prepare was
* already sent within the ReorderBufferReplay call above.
*/
- if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+ if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
rb->prepare(rb, txn, txn->final_lsn);

It is not clear from the comments how the 'already_aborted' is
handled. I think after this patch we would have already truncated all
its changes. If so, why do we need to try to replay the changes of
such a xact?

I used ReorderBufferReplay() for convenience; it sends begin_prepare()
and prepare() appropriately, handles streaming-prepared transactions,
and updates statistics etc. But as you pointed out, it would not be
necessary to set up a historical snapshot etc. I agree that we don't
need to try replaying such aborted transactions but I'd like to
confirm we don't really need to execute invalidation messages even in
aborted transactions.

We need to execute invalidations if we have loaded any cache entries,
for example in the case of streaming. See comments in the function
ReorderBufferAbort(). However, I find both the current changes and the
previous patch a bit difficult to follow. How about if we instead
invent a flag like RBTXN_SENT_PREPARE or something like that and then
use that flag to decide whether to send prepare in
ReorderBufferPrepare(). Then add comments for the cases in which
prepare will be sent from ReorderBufferPrepare().

The idea of using RBTXN_SENT_PREPARE sounds good to me. I'll use it.
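
Roughly, the flag-based approach could look like the following sketch. Everything here is hypothetical: the bit value, the ToyTXN struct and the toy_* helpers are placeholders, and the replay is reduced to "either it reached the prepare or it stopped early because of a concurrent abort".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RBTXN_SENT_PREPARE	0x0800	/* hypothetical bit value */

typedef struct ToyTXN
{
	uint32_t	txn_flags;
	int			prepares_sent;	/* what the downstream would observe */
} ToyTXN;

/* Send a prepare and remember that we did. */
static void
toy_send_prepare(ToyTXN *txn)
{
	txn->prepares_sent++;
	txn->txn_flags |= TOY_RBTXN_SENT_PREPARE;
}

/* Stand-in for the replay: it may stop before reaching the prepare. */
static void
toy_replay(ToyTXN *txn, bool aborted_midway)
{
	if (aborted_midway)
		return;
	toy_send_prepare(txn);
}

/* Stand-in for the prepare path: send a prepare only if none was sent. */
static void
toy_prepare(ToyTXN *txn, bool aborted_midway)
{
	toy_replay(txn, aborted_midway);

	if ((txn->txn_flags & TOY_RBTXN_SENT_PREPARE) == 0)
		toy_send_prepare(txn);
}
```

The point of the flag is that the decision at the end depends only on whether a prepare already went out, not on remembering why the replay stopped.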

*
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);

Can we expect the size to be zero without resetting the toast data? In
ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
reduces the change size. So, won't that size still be accounted for in
txn?

IIUC the toast reconstruction data is created only while replaying the
transaction but the ReorderBufferCheckAndTruncateAbortedTXN() is not
called during that. So I think any toast data should not be
accumulated at that time.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#49)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Dec 20, 2024 at 12:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
{
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
ReorderBufferTXN *txn)
* full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuplecids.
*/
- ReorderBufferTruncateTXN(rb, txn, true);
+ ReorderBufferTruncateTXN(rb, txn, true, true);

In both the above places, the patch unconditionally passes the
'txn_streaming' even for prepared transactions when it wouldn't be a
streaming xact. Inside the function, the patch handled that by first
checking whether the transaction is prepared (txn_prepared). So, the
logic will work but the function signature and the way its callers are
using make it difficult to use and extend in the future.

Valid concern.

I think for the first case, we should get the streaming parameter in
ReorderBufferResetTXN(),

I think we cannot pass 'rbtxn_is_streamed(txn)' to
ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
is called to handle the concurrent abort of the streaming transaction
but the transaction might not have been marked as streamed at that
time. Since ReorderBufferTruncateTXN() is responsible for both
discarding changes and marking the transaction as streamed, we need to
unconditionally pass txn_streaming = true in this case.

Can't we use 'stream_started' variable available at the call site of
ReorderBufferResetTXN() for our purpose?

On second thoughts, I think the confusion related to txn_streaming
came from the fact that ReorderBufferTruncateTXN() does both
discarding changes and marking the transaction as streamed. If we make
the function do just discarding changes, we don't need to introduce
the txn_streaming function argument. Instead, we need to have a
separate function to mark the transaction as streamed and call it
before ReorderBufferTruncateTXN() where appropriate. And
ReorderBufferCheckAndTruncateAbortedTXN() just calls
ReorderBufferTruncateTXN().

That sounds good to me. IIRC, initially, ReorderBufferTruncateTXN()
was used to truncate changes only for streaming transactions. Later,
it evolved for prepared xacts and now for xacts where we explicitly
detect whether they are aborted. So, I think it makes sense to improve
it by following your suggestion.

*
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);

Can we expect the size to be zero without resetting the toast data? In
ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
reduces the change size. So, won't that size still be accounted for in
txn?

IIUC the toast reconstruction data is created only while replaying the
transaction but the ReorderBufferCheckAndTruncateAbortedTXN() is not
called during that. So I think any toast data should not be
accumulated at that time.

How about the case where in the first pass, we streamed the
transaction partially, where it has reconstructed toast data, and
then, in the second pass, when memory becomes full, the reorder buffer
contains some partial data, due to which it tries to spill the data
and finds that the transaction is aborted? I could be wrong here
because I haven't tried to test this code path, but I see that it is
theoretically possible.

--
With Regards,
Amit Kapila.

#51Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#50)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Dec 19, 2024 at 9:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Dec 20, 2024 at 12:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
{
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
ReorderBufferTXN *txn)
* full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuplecids.
*/
- ReorderBufferTruncateTXN(rb, txn, true);
+ ReorderBufferTruncateTXN(rb, txn, true, true);

In both the above places, the patch unconditionally passes the
'txn_streaming' even for prepared transactions when it wouldn't be a
streaming xact. Inside the function, the patch handled that by first
checking whether the transaction is prepared (txn_prepared). So, the
logic will work but the function signature and the way its callers are
using make it difficult to use and extend in the future.

Valid concern.

I think for the first case, we should get the streaming parameter in
ReorderBufferResetTXN(),

I think we cannot pass 'rbtxn_is_streamed(txn)' to
ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
is called to handle the concurrent abort of the streaming transaction
but the transaction might not have been marked as streamed at that
time. Since ReorderBufferTruncateTXN() is responsible for both
discarding changes and marking the transaction as streamed, we need to
unconditionally pass txn_streaming = true in this case.

Can't we use 'stream_started' variable available at the call site of
ReorderBufferResetTXN() for our purpose?

Right, we can use it.

On second thoughts, I think the confusion related to txn_streaming
came from the fact that ReorderBufferTruncateTXN() does both
discarding changes and marking the transaction as streamed. If we make
the function do just discarding changes, we don't need to introduce
the txn_streaming function argument. Instead, we need to have a
separate function to mark the transaction as streamed and call it
before ReorderBufferTruncateTXN() where appropriate. And
ReorderBufferCheckAndTruncateAbortedTXN() just calls
ReorderBufferTruncateTXN().

That sounds good to me. IIRC, initially, ReorderBufferTruncateTXN()
was used to truncate changes only for streaming transactions. Later,
it evolved for prepared xacts and now for xacts where we explicitly
detect whether they are aborted. So, I think it makes sense to improve
it by following your suggestion.

I've changed the patch accordingly.

*
+ * Since we don't check the transaction status while replaying the
+ * transaction, we don't need to reset toast reconstruction data here.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);

Can we expect the size to be zero without resetting the toast data? In
ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
reduces the change size. So, won't that size still be accounted for in
txn?

IIUC the toast reconstruction data is created only while replaying the
transaction but the ReorderBufferCheckAndTruncateAbortedTXN() is not
called during that. So I think any toast data should not be
accumulated at that time.

How about the case where in the first pass, we streamed the
transaction partially, where it has reconstructed toast data, and
then, in the second pass, when memory becomes full, the reorder buffer
contains some partial data, due to which it tries to spill the data
and finds that the transaction is aborted? I could be wrong here
because I haven't tried to test this code path, but I see that it is
theoretically possible.

Yeah, it seems possible. I've changed the patch to reset toast data as well.
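
The accounting issue can be shown with a toy model. This is a sketch under stated assumptions, not the reorderbuffer code: it assumes that chunks moved into the toast reconstruction data stay counted in the transaction size until the toast data is reset, and ToyTXN, queued_bytes, toast_bytes and the toy_* functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyTXN
{
	size_t		size;			/* total bytes accounted to the txn */
	size_t		queued_bytes;	/* bytes still in the change queue */
	size_t		toast_bytes;	/* bytes held as toast reconstruction data */
} ToyTXN;

/* A partial replay moves toast chunks out of the queue; size is unchanged. */
static void
toy_move_to_toast(ToyTXN *txn, size_t nbytes)
{
	assert(nbytes <= txn->queued_bytes);
	txn->queued_bytes -= nbytes;
	txn->toast_bytes += nbytes;
}

/* Truncation only sees what is still in the change queue. */
static void
toy_truncate(ToyTXN *txn)
{
	txn->size -= txn->queued_bytes;
	txn->queued_bytes = 0;
}

/* Resetting the toast data releases the remaining accounted bytes. */
static void
toy_toast_reset(ToyTXN *txn)
{
	txn->size -= txn->toast_bytes;
	txn->toast_bytes = 0;
}
```

In this model, truncating after a partial replay leaves size equal to the toast bytes, so an assertion that size is zero only holds once the toast data has been reset as well.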

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v12-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From eda56d0738faa8a13c8461b51bc4e8e5daa0bdb7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v12] Skip logical decoding of already-aborted transactions.

Previously, concurrent transaction aborts were detected only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup to determine whether a
transaction is already aborted, so that logical decoding skips further
changes even when it doesn't touch system catalogs. This optimization
enhances logical decoding performance, especially for large
transactions that have already been rolled back, as it avoids
unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 180 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  24 ++-
 6 files changed, 233 insertions(+), 45 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e3a5c7b660c..af7c4815780 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,73 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/* Quick return if the transaction status is already known */
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far
+	 * and toast reconstruction data. The full cleanup will happen as part of
+	 * decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1918,6 +1972,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * detected. See DecodePrepare for more information.
 		 */
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2107,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the toplevel xact (we send the XID in all messages), but we never
+	 * stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2622,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2677,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2733,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction is streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2920,14 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not yet. It happens if we detected the concurrent
+	 * abort while replaying the non-streaming transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3657,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3700,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3759,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3770,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3789,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3bc365a7b0c..f8cb7e38556 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -230,12 +233,30 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
 /* Is this a top-level transaction? */
 #define rbtxn_is_toptxn(txn) \
 ( \
@@ -419,9 +440,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#52Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#51)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San.

Here are some review comments for the patch v12-0001.

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

1.
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */

Typo "by looking CLOG".

It should be something like "by CLOG lookup".

~~~

2.
+ /* Quick return if the transaction status is already known */
+ if (rbtxn_is_committed(txn))
+ return false;
+ if (rbtxn_is_aborted(txn))
+ {
+ /* Already-aborted transactions should not have any changes */
+ Assert(txn->size == 0);
+
+ return true;
+ }
+

Consider changing that 2nd 'if' to be 'else if', because then that
will make it more obvious that the earlier single line comment "Quick
return if...", in fact applies to both these conditions.

Alternatively, make that a block comment and add some blank lines like:

+ /*
+    * Quick returns if the transaction status is already known.
+    */
+
+ if (rbtxn_is_committed(txn))
+ return false;
+
+ if (rbtxn_is_aborted(txn))
+ {
+ /* Already-aborted transactions should not have any changes */
+ Assert(txn->size == 0);
+
+ return true;
+ }

~~~

3.
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ Assert(!rbtxn_is_aborted(txn));
+ txn->txn_flags |= RBTXN_IS_COMMITTED;
+ return false;
+ }
+
+ /*
+ * The transaction aborted. We discard the changes we've collected so far
+ * and toast reconstruction data. The full cleanup will happen as part of
+ * decoding ABORT record of this transaction.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferToastReset(rb, txn);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);
+
+ /*
+ * Mark the transaction as aborted so we can ignore future changes of this
+ * transaction.
+ */
+ Assert(!rbtxn_is_committed(txn));
+ txn->txn_flags |= RBTXN_IS_ABORTED;
+
+ return true;
+}

3a.
That whole last part related to "The transaction aborted", might be
clearer if the whole chunk of code was in an 'else' block from the
previous "if (TransactionIdDidCommit(txn->xid))".

~

3b.
"toast" is an acronym so it should be written in uppercase IMO.

~

3c.
The "and toast reconstruction data" seems to be missing a word/s. (??)
- "... and also discard TOAST reconstruction data"
- "... and reset TOAST reconstruction data"

~~~

ReorderBufferMaybeMarkTXNStreamed:

4.
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * The top-level transaction, is marked as streamed always, even if it
+ * does not contain any changes (that is, when all the changes are in
+ * subtransactions).
+ *
+ * For subtransactions, we only mark them as streamed when there are
+ * changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts for
+ * XIDs the downstream is not aware of. And of course, it always knows
+ * about the toplevel xact (we send the XID in all messages), but we never
+ * stream XIDs of empty subxacts.
+ */
+ if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;
+}

/the toplevel xact/the top-level xact/

~~~

5.
  /*
- * We send the prepare for the concurrently aborted xacts so that later
- * when rollback prepared is decoded and sent, the downstream should be
- * able to rollback such a xact. See comments atop DecodePrepare.
- *
- * Note, for the concurrent_abort + streaming case a stream_prepare was
- * already sent within the ReorderBufferReplay call above.
+ * Send a prepare if not yet. It happens if we detected the concurrent
+ * abort while replaying the non-streaming transaction.
  */

The first sentence "if not yet" seems incomplete/missing words.

SUGGESTION
Send a prepare if not already done so. This might occur if we had
detected a concurrent abort while replaying the non-streaming
transaction.

======
src/include/replication/reorderbuffer.h

6.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.

I feel there is now also some introduced ambiguity with these macros:

/* Has this transaction been prepared? */
#define rbtxn_prepared(txn) \
( \
((txn)->txn_flags & RBTXN_PREPARE) != 0 \
)

+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)

e.g. It's also not clear from the comments what is the distinction
between the existing macro comment "Has this transaction been
prepared?" and the new macro comment "Has a prepare or stream_prepare
already been sent?".

Indeed, I was wondering if some of the places currently calling
"rbtxn_prepared(txn)" should now strictly be calling
"rbtxn_sent_prepare(txn)" macro instead?

IMO some minor renaming of the existing constants (and also their
associated macros) might help to make all this more coherent. For
example, perhaps like:

#define RBTXN_IS_PREPARE_NEEDED 0x0040
#define RBTXN_IS_PREPARE_SKIPPED 0x0080
#define RBTXN_IS_PREPARE_SENT 0x0200

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#52)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Jan 7, 2025 at 7:22 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
src/include/replication/reorderbuffer.h

6.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.

I feel there is now also some introduced ambiguity with these macros:

/* Has this transaction been prepared? */
#define rbtxn_prepared(txn) \
( \
((txn)->txn_flags & RBTXN_PREPARE) != 0 \
)

+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)

e.g. It's also not clear from the comments what is the distinction
between the existing macro comment "Has this transaction been
prepared?" and the new macro comment "Has a prepare or stream_prepare
already been sent?".

Indeed, I was wondering if some of the places currently calling
"rbtxn_prepared(txn)" should now strictly be calling
"rbtxn_sent_prepare(txn)" macro instead?

Right, I think after this change, it appears we should try to rename
the existing constants. One place where we can consider using the new
macro is the current usage of rbtxn_prepared() in
SnapBuildDistributeNewCatalogSnapshot().

IMO some minor renaming of the existing constants (and also their
associated macros) might help to make all this more coherent. For
example, perhaps like:

#define RBTXN_IS_PREPARE_NEEDED 0x0040

The other option could be RBTXN_IS_PREPARE_REQUESTED.

--
With Regards,
Amit Kapila.

#54Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#53)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Jan 13, 2025 at 3:07 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 7, 2025 at 7:22 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
src/include/replication/reorderbuffer.h

6.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.

I feel there is now also some introduced ambiguity with these macros:

/* Has this transaction been prepared? */
#define rbtxn_prepared(txn) \
( \
((txn)->txn_flags & RBTXN_PREPARE) != 0 \
)

+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)

e.g. It's also not clear from the comments what is the distinction
between the existing macro comment "Has this transaction been
prepared?" and the new macro comment "Has a prepare or stream_prepare
already been sent?".

Indeed, I was wondering if some of the places currently calling
"rbtxn_prepared(txn)" should now strictly be calling
"rbtxn_sent_prepare(txn)" macro instead?

Right, I think after this change, it appears we should try to rename
the existing constants. One place where we can consider using the new
macro is the current usage of rbtxn_prepared() in
SnapBuildDistributeNewCatalogSnapshot().

I think that RBTXN_PREPARE would mean that the transaction needs to be
prepared but it doesn't mean that a prepare or a stream_prepare has
already been sent. And RBTXN_SENT_PREPARE adds some internal details
about whether a prepare or a stream_prepare has actually been sent.
IIUC RBTXN_SENT_PREPARE is used only briefly, within
ReorderBufferPrepare(). So code outside of reorderbuffer.c, such as
snapbuild.c, doesn't need to care about RBTXN_SENT_PREPARE.
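
As a rough illustration of the distinction, here is a simplified sketch. Only the two flag values are taken from the v12 patch; the struct, the helper names, and the abort_detected parameter are hypothetical, and the real replay logic is far more involved. RBTXN_PREPARE records that the transaction is a prepared one, while RBTXN_SENT_PREPARE records only whether the callback has already been invoked, so that the prepare is sent exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the v12 patch; everything else is hypothetical. */
#define RBTXN_PREPARE		0x0040
#define RBTXN_SENT_PREPARE	0x0200

typedef struct PrepTxn
{
	unsigned	txn_flags;
	int			prepares_sent;	/* counts simulated prepare callbacks */
} PrepTxn;

/* Simulates invoking the output plugin's prepare (or stream_prepare) callback. */
static void
send_prepare(PrepTxn *txn)
{
	txn->prepares_sent++;
	txn->txn_flags |= RBTXN_SENT_PREPARE;
}

/* Simulates replay: the callback is reached unless an abort is detected first. */
static void
replay(PrepTxn *txn, bool abort_detected)
{
	if (!abort_detected)
		send_prepare(txn);
}

/* Mirrors the tail of ReorderBufferPrepare() in v12: send only if not yet sent. */
static void
prepare(PrepTxn *txn, bool abort_detected)
{
	txn->txn_flags |= RBTXN_PREPARE;
	replay(txn, abort_detected);

	if ((txn->txn_flags & RBTXN_SENT_PREPARE) == 0)
		send_prepare(txn);
}
```

In both the normal case and the case where an abort is detected during replay, the sketch sends the prepare exactly once, and RBTXN_PREPARE is set regardless of when the callback was sent.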

IMO some minor renaming of the existing constants (and also their
associated macros) might help to make all this more coherent. For
example, perhaps like:

#define RBTXN_IS_PREPARE_NEEDED 0x0040

The other option could be RBTXN_IS_PREPARE_REQUESTED.

I'm a bit concerned that these names suggest a state in which the
transaction needs to be prepared but the prepare has not been done yet.
But rbtxn_prepared() is widely used to check whether the transaction is
a prepared transaction, regardless of whether a prepare or a
stream_prepare has actually been sent. How about RBTXN_IS_PREPARED_TXN
and rbtxn_is_prepared_txn()? I think that would indicate well that the
transaction needs to be processed as a prepared transaction.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#55Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#52)
2 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Jan 6, 2025 at 5:52 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San.

Here are some review comments for the patch v12-0001.

Thank you for reviewing the patch!

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

1.
+/*
+ * Check the transaction status by looking CLOG and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */

Typo "by looking CLOG".

It should be something like "by CLOG lookup".

Fixed.

~~~

2.
+ /* Quick return if the transaction status is already known */
+ if (rbtxn_is_committed(txn))
+ return false;
+ if (rbtxn_is_aborted(txn))
+ {
+ /* Already-aborted transactions should not have any changes */
+ Assert(txn->size == 0);
+
+ return true;
+ }
+

Consider changing that 2nd 'if' to be 'else if', because then that
will make it more obvious that the earlier single line comment "Quick
return if...", in fact applies to both these conditions.

Alternatively, make that a block comment and add some blank lines like:

+ /*
+    * Quick returns if the transaction status is already known.
+    */
+
+ if (rbtxn_is_committed(txn))
+ return false;
+
+ if (rbtxn_is_aborted(txn))
+ {
+ /* Already-aborted transactions should not have any changes */
+ Assert(txn->size == 0);
+
+ return true;
+ }

I used a block comment.

~~~

3.
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ Assert(!rbtxn_is_aborted(txn));
+ txn->txn_flags |= RBTXN_IS_COMMITTED;
+ return false;
+ }
+
+ /*
+ * The transaction aborted. We discard the changes we've collected so far
+ * and toast reconstruction data. The full cleanup will happen as part of
+ * decoding ABORT record of this transaction.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferToastReset(rb, txn);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);
+
+ /*
+ * Mark the transaction as aborted so we can ignore future changes of this
+ * transaction.
+ */
+ Assert(!rbtxn_is_committed(txn));
+ txn->txn_flags |= RBTXN_IS_ABORTED;
+
+ return true;
+}

3a.
That whole last part related to "The transaction aborted", might be
clearer if the whole chunk of code was in an 'else' block from the
previous "if (TransactionIdDidCommit(txn->xid))".

I'm not sure it increases the readability. It makes sense to me that
we return false in the 'if (TransactionIdDidCommit(txn->xid))' block.
If we add the 'else' block, the reader might be confused as to why we
have an 'else' block even though the 'if' block ends with a return. We
could add a local variable for the result and return it at the end of
the function, but I'm not sure that would improve readability either.
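
For illustration, the early-return shape under discussion can be sketched as below. This is a simplified model rather than the patch itself: DemoTXN, DemoXidStatus and demo_clog_lookup() are hypothetical stand-ins for ReorderBufferTXN and the CLOG checks, and truncation is reduced to zeroing a size counter. Only the two flag values are taken from the quoted header diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as quoted from reorderbuffer.h in this thread. */
#define RBTXN_IS_COMMITTED 0x0400
#define RBTXN_IS_ABORTED   0x0800

/* Hypothetical model of what a CLOG lookup can report. */
typedef enum DemoXidStatus
{
    DEMO_XID_IN_PROGRESS,
    DEMO_XID_COMMITTED,
    DEMO_XID_ABORTED
} DemoXidStatus;

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
    uint32_t      txn_flags;
    size_t        size;         /* bytes of collected changes */
    DemoXidStatus clog_status;  /* what the mock lookup will return */
    int           lookups;      /* counts mock CLOG lookups */
} DemoTXN;

/* Mock lookup, counted so the caching effect is observable. */
static DemoXidStatus
demo_clog_lookup(DemoTXN *txn)
{
    txn->lookups++;
    return txn->clog_status;
}

/* Return true if the transaction is aborted, discarding its changes. */
static bool
demo_check_and_truncate_aborted(DemoTXN *txn)
{
    DemoXidStatus status;

    /* Quick returns if the transaction status is already known. */
    if (txn->txn_flags & RBTXN_IS_COMMITTED)
        return false;
    if (txn->txn_flags & RBTXN_IS_ABORTED)
    {
        assert(txn->size == 0);
        return true;
    }

    status = demo_clog_lookup(txn);

    /* Still running: nothing can be cached yet. */
    if (status == DEMO_XID_IN_PROGRESS)
        return false;

    if (status == DEMO_XID_COMMITTED)
    {
        /* Cache the result so later calls skip the lookup. */
        txn->txn_flags |= RBTXN_IS_COMMITTED;
        return false;
    }

    /* Aborted: discard collected changes and remember the status. */
    txn->size = 0;
    txn->txn_flags |= RBTXN_IS_ABORTED;
    return true;
}
```

Each branch ends in its own return, so the aborted path needs no 'else'; the trade-off is that the reader has to notice that everything after the committed check applies only to aborted transactions.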

~

3b.
"toast" is an acronym so it should be written in uppercase IMO.

~

Hmm, it seems we don't use the uppercase "TOAST" at all, at least in
reorderbuffer.c. I would prefer to keep it consistent with the others.

3c.
The "and toast reconstruction data" seems to be missing a word/s. (??)
- "... and also discard TOAST reconstruction data"
- "... and reset TOAST reconstruction data"

I don't understand this comment. What words are you suggesting adding
to these sentences?

~~~

ReorderBufferMaybeMarkTXNStreamed:

4.
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /*
+ * The top-level transaction, is marked as streamed always, even if it
+ * does not contain any changes (that is, when all the changes are in
+ * subtransactions).
+ *
+ * For subtransactions, we only mark them as streamed when there are
+ * changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts for
+ * XIDs the downstream is not aware of. And of course, it always knows
+ * about the toplevel xact (we send the XID in all messages), but we never
+ * stream XIDs of empty subxacts.
+ */
+ if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;
+}

/the toplevel xact/the top-level xact/

Fixed.
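
The marking rule described in the quoted comment can be sketched as follows. This is an illustrative model only: DemoTXN, DEMO_IS_STREAMED (with an arbitrary value) and demo_maybe_mark_streamed() are hypothetical names; only the condition itself mirrors the quoted function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag value; only the rule below is taken from the patch. */
#define DEMO_IS_STREAMED 0x0010

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
    uint32_t txn_flags;
    bool     is_toplevel;   /* stand-in for rbtxn_is_toptxn() */
    uint64_t nentries_mem;  /* changes currently held in memory */
} DemoTXN;

/*
 * Top-level transactions are always marked as streamed; subtransactions
 * are marked only when they actually hold changes, so that no abort is
 * later sent for an XID the downstream has never seen.
 */
static void
demo_maybe_mark_streamed(DemoTXN *txn)
{
    assert(txn != NULL);

    if (txn->is_toplevel || txn->nentries_mem != 0)
        txn->txn_flags |= DEMO_IS_STREAMED;
}
```

An empty subtransaction therefore stays unmarked, while a top-level transaction whose changes all live in subtransactions is still marked.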

~~~

5.
/*
- * We send the prepare for the concurrently aborted xacts so that later
- * when rollback prepared is decoded and sent, the downstream should be
- * able to rollback such a xact. See comments atop DecodePrepare.
- *
- * Note, for the concurrent_abort + streaming case a stream_prepare was
- * already sent within the ReorderBufferReplay call above.
+ * Send a prepare if not yet. It happens if we detected the concurrent
+ * abort while replaying the non-streaming transaction.
*/

The first sentence "if not yet" seems incomplete/missing words.

SUGGESTION
Send a prepare if not already done so. This might occur if we had
detected a concurrent abort while replaying the non-streaming
transaction.

Fixed.

======
src/include/replication/reorderbuffer.h

6.
#define RBTXN_PREPARE 0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.

I feel there is now also some introduced ambiguity with these macros:

/* Has this transaction been prepared? */
#define rbtxn_prepared(txn) \
( \
((txn)->txn_flags & RBTXN_PREPARE) != 0 \
)

+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)

e.g. It's also not clear from the comments what is the distinction
between the existing macro comment "Has this transaction been
prepared?" and the new macro comment "Has a prepare or stream_prepare
already been sent?".

Indeed, I was wondering if some of the places currently calling
"rbtxn_prepared(txn)" should now strictly be calling
"rbtxn_sent_prepare(txn)" macro instead?

IMO some minor renaming of the existing constants (and also their
associated macros) might help to make all this more coherent. For
example, perhaps like:

#define RBTXN_IS_PREPARE_NEEDED 0x0040
#define RBTXN_IS_PREPARE_SKIPPED 0x0080
#define RBTXN_IS_PREPARE_SENT 0x0200

Fair point. I've clarified the comments for macros. As for renaming
the existing constants and associated macros, I sent my thoughts in an
email[1] and implemented it in a separate patch (the 0002 patch).

Regards,

[1]: /messages/by-id/CAD21AoBgxqFVKq1yf+NR2dHBt47xtkFQ=JtxwcAv1PSjTahoPw@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v13-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From 439535e8c55eba7fd94b2c798dbd40e817a941cb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v13 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit introduces an additional CLOG lookup to check the
transaction status, so the logical decoding skips further change also
when it doesn't touch system catalogs if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction.

Reviewed-by: Andres Freund, Amit Kapila, Dilip Kumar, Vignesh C
Reviewed-by: Ajin Cherian, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 245 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 79b60df7cf0..39110a2fc70 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard the changes we've collected so far
+	 * and toast reconstruction data. The full cleanup will happen as part of
+	 * decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2681,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2737,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction is streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2924,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3662,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3705,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3764,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3775,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3794,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..0ce688ef909 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this transaction a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

v13-0002-Rename-RBTXN_XXX-constants-for-better-consistenc.patch (application/octet-stream)
From 2ffb87e93fc4075d9d10dbf637c92fb596c39978 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 13 Jan 2025 10:35:17 -0800
Subject: [PATCH v13 2/2] Rename RBTXN_XXX constants for better consistency.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/replication/logical/proto.c       |  2 +-
 .../replication/logical/reorderbuffer.c       | 24 +++++++++----------
 src/backend/replication/logical/snapbuild.c   |  2 +-
 src/include/replication/reorderbuffer.h       |  6 ++---
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bef350714db..628228e1228 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -163,7 +163,7 @@ logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
 	 * which case we expect to have a valid GID.
 	 */
 	Assert(txn->gid != NULL);
-	Assert(rbtxn_prepared(txn));
+	Assert(rbtxn_is_prepared_txn(txn));
 	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 39110a2fc70..07203529c1d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,7 +1793,7 @@ ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn
 	 * and toast reconstruction data. The full cleanup will happen as part of
 	 * decoding ABORT record of this transaction.
 	 */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared_txn(txn));
 	ReorderBufferToastReset(rb, txn);
 
 	/* All changes should be discarded */
@@ -1968,7 +1968,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	if (rbtxn_prepared(txn))
+	if (rbtxn_is_prepared_txn(txn))
 	{
 		/*
 		 * Note, we send stream prepare even if a concurrent abort is
@@ -2150,7 +2150,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared_txn(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2238,7 +2238,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (!streaming)
 		{
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared_txn(txn))
 				rb->begin_prepare(rb, txn);
 			else
 				rb->begin(rb, txn);
@@ -2280,7 +2280,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * required for the cases when we decode the changes before the
 			 * COMMIT record is processed.
 			 */
-			if (streaming || rbtxn_prepared(change->txn))
+			if (streaming || rbtxn_is_prepared_txn(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2625,7 +2625,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
 			 * regular ones).
 			 */
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared_txn(txn))
 			{
 				rb->prepare(rb, txn, commit_lsn);
 				txn->txn_flags |= RBTXN_SENT_PREPARE;
@@ -2679,12 +2679,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * For 4, as the entire txn has been decoded, we can fully clean up
 		 * the TXN reorder buffer.
 		 */
-		if (streaming || rbtxn_prepared(txn))
+		if (streaming || rbtxn_is_prepared_txn(txn))
 		{
 			if (streaming)
 				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared_txn(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2728,7 +2728,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * during a two-phase commit.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
-			(stream_started || rbtxn_prepared(txn)))
+			(stream_started || rbtxn_is_prepared_txn(txn)))
 		{
 			/* curtxn must be set for streaming or prepared transactions */
 			Assert(curtxn);
@@ -2815,7 +2815,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 		 * Removing this txn before a commit might result in the computation
 		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
 		 */
-		if (!rbtxn_prepared(txn))
+		if (!rbtxn_is_prepared_txn(txn))
 			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
@@ -2914,7 +2914,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_PREPARE;
+	txn->txn_flags |= RBTXN_IS_PREPARED_TXN;
 	txn->gid = pstrdup(gid);
 
 	/* The prepare info must have been updated in txn by now. */
@@ -2975,7 +2975,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 */
 	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
-		txn->txn_flags |= RBTXN_PREPARE;
+		txn->txn_flags |= RBTXN_IS_PREPARED_TXN;
 
 		/*
 		 * The prepare info must have been updated in txn even if we skip
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de318..9f33e51f21d 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -761,7 +761,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * We don't need to add snapshot to prepared transactions as they
 		 * should not see the new catalog contents.
 		 */
-		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+		if (rbtxn_is_prepared_txn(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0ce688ef909..36d4a752bcb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,7 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SERIALIZED_CLEAR 	0x0008
 #define RBTXN_IS_STREAMED         	0x0010
 #define RBTXN_HAS_PARTIAL_CHANGE  	0x0020
-#define RBTXN_PREPARE             	0x0040
+#define RBTXN_IS_PREPARED_TXN 		0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
 #define RBTXN_SENT_PREPARE			0x0200
@@ -234,9 +234,9 @@ typedef struct ReorderBufferChange
  * committed. To check whether a prepare or a stream_prepare has already
  * been sent for this transaction, we need to use rbtxn_sent_prepare().
  */
-#define rbtxn_prepared(txn) \
+#define rbtxn_is_prepared_txn(txn) \
 ( \
-	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+	((txn)->txn_flags & RBTXN_IS_PREPARED_TXN) != 0 \
 )
 
 /* Has a prepare or stream_prepare already been sent? */
-- 
2.43.5

#56Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#55)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San. Here are some cosmetic review comments for the patch v13-0001.

======
Commit message

1.
This commit introduces an additional CLOG lookup to check the
transaction status, so the logical decoding skips further change also
when it doesn't touch system catalogs if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

~

That first sentence seems confusing. How about:

This commit adds a CLOG lookup to check the transaction status,
allowing logical decoding to skip changes for non-system catalogs if
the transaction is already aborted.

On Tue, Jan 14, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 6, 2025 at 5:52 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San.

Here are some review comments for the patch v12-0001.

Thank you for reviewing the patch!

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

~~~

3.
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ Assert(!rbtxn_is_aborted(txn));
+ txn->txn_flags |= RBTXN_IS_COMMITTED;
+ return false;
+ }
+
+ /*
+ * The transaction aborted. We discard the changes we've collected so far
+ * and toast reconstruction data. The full cleanup will happen as part of
+ * decoding ABORT record of this transaction.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferToastReset(rb, txn);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);
+
+ /*
+ * Mark the transaction as aborted so we can ignore future changes of this
+ * transaction.
+ */
+ Assert(!rbtxn_is_committed(txn));
+ txn->txn_flags |= RBTXN_IS_ABORTED;
+
+ return true;
+}

3a.
That whole last part related to "The transaction aborted", might be
clearer if the whole chunk of code was in an 'else' block from the
previous "if (TransactionIdDidCommit(txn->xid))".

I'm not sure it improves readability. It makes good sense to me that
we return false in the 'if (TransactionIdDidCommit(txn->xid))' block.
If we add an 'else' block, the reader might be confused because the
'else' follows an 'if' block that already returns. We could add a
local variable for the result and return it at the end of the
function, but I'm not sure that would improve readability either.

2.
I think adding a local variable is overkill but OTOH introducing
“else” clarifies that the following code can only be reached when the
transaction is aborted. E.g. You don’t even need to read the previous
code block and see the “return false” to know that. Anyway, it’s
probably just a personal preference.

3c.
The "and toast reconstruction data" seems to be missing a word/s. (??)
- "... and also discard TOAST reconstruction data"
- "... and reset TOAST reconstruction data"

I don't understand this comment. What words are you suggesting adding
to these sentences?

3.
I meant something like:

BEFORE
We discard the changes we've collected so far and toast reconstruction data.

SUGGESTION
We discard both the changes collected so far and the TOAST reconstruction data.

======
src/include/replication/reorderbuffer.h

4.
-/* Has this transaction been prepared? */
+/*
+ * Is this transaction a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */

/Is this transaction a prepared transaction?/Is this a prepared transaction?/

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#57Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#55)
Re: Skip collecting decoded changes of already-aborted transactions

Hi Sawada-San.

Some review comments for patch v13-0002.

======

I think the v12 ambiguity of RBTXN_PREPARE versus RBTXN_SENT_PREPARE
was mostly addressed already by the improved comments for the macros
in patch 0001.

Meanwhile, patch v13-0002 says it is renaming constants for better
consistency, but I don't think it went far enough.

For example, better name consistency would be achieved by changing
*all* of the constants related to prepared transactions:

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_IS_PREPARED_SKIPPED 0x0080
#define RBTXN_IS_PREPARED_SENT 0x0200

where:

RBTXN_IS_PREPARED. This means it's a prepared transaction. (but we
can't tell from this if it is skipped or sent).

RBTXN_IS_PREPARED_SKIPPED. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and it's being skipped.

RBTXN_IS_PREPARED_SENT. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and we've sent it.

~

A note about RBTXN_IS_PREPARED. Since all of these constants are
clearly about transactions (e.g. "TXN" in prefix "RBTXN_"), I felt
patch 0002 calling this RBTXN_IS_PREPARED_TXN just seemed like adding
a redundant _TXN. e.g. we don't say RBTXN_IS_COMMITTED_TXN etc.
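
For illustration only, here is a minimal sketch of how the three
constants and matching accessor macros might look under the naming
suggested above. The bit values are the ones listed in the proposal.
Everything else is hypothetical and not taken from any patch version:
the DemoTXN struct, the lower-case macro names, and the helper
demo_mark_prepare_sent(). The helper also assumes, following the
descriptions above, that a "sent" prepare always carries
RBTXN_IS_PREPARED as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits under the proposed names (values as listed above). */
#define RBTXN_IS_PREPARED          0x0040
#define RBTXN_IS_PREPARED_SKIPPED  0x0080
#define RBTXN_IS_PREPARED_SENT     0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags field. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Hypothetical accessor macros that follow the same naming scheme. */
#define rbtxn_is_prepared(txn) \
	(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0)
#define rbtxn_is_prepared_skipped(txn) \
	(((txn)->txn_flags & RBTXN_IS_PREPARED_SKIPPED) != 0)
#define rbtxn_is_prepared_sent(txn) \
	(((txn)->txn_flags & RBTXN_IS_PREPARED_SENT) != 0)

/*
 * Hypothetical helper: record that a prepare was sent.  Assumes that the
 * "sent" state implies the transaction is a prepared one, so both bits
 * are set together.
 */
static void
demo_mark_prepare_sent(DemoTXN *txn)
{
	/* A prepare must not be sent twice. */
	assert(!rbtxn_is_prepared_sent(txn));
	txn->txn_flags |= RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SENT;
}
```

With names like these, every prepare-related flag sorts under the
common RBTXN_IS_PREPARED prefix.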

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#57)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Jan 14, 2025 at 7:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San.

Some review comments for patch v13-0002.

======

I think the v12 ambiguity of RBTXN_PREPARE versus RBTXN_SENT_PREPARE
was mostly addressed already by the improved comments for the macros
in patch 0001.

Meanwhile, patch v13-0002 says it is renaming constants for better
consistency, but I don't think it went far enough.

For example, better name consistency would be achieved by changing
*all* of the constants related to prepared transactions:

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_IS_PREPARED_SKIPPED 0x0080
#define RBTXN_IS_PREPARED_SENT 0x0200

where:

RBTXN_IS_PREPARED. This means it's a prepared transaction. (but we
can't tell from this if it is skipped or sent).

RBTXN_IS_PREPARED_SKIPPED. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and it's being skipped.

RBTXN_IS_PREPARED_SENT. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and we've sent it.

The first one (RBTXN_IS_PREPARED) sounds like an improvement over what
we have now. I am not convinced about the other two.

~

A note about RBTXN_IS_PREPARED. Since all of these constants are
clearly about transactions (e.g. "TXN" in prefix "RBTXN_"), I felt
patch 0002 calling this RBTXN_IS_PREPARED_TXN just seemed like adding
a redundant _TXN. e.g. we don't say RBTXN_IS_COMMITTED_TXN etc.

+1. I felt the same.

--
With Regards,
Amit Kapila.

#59Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#56)
1 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Jan 13, 2025 at 5:36 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San. Here are some cosmetic review comments for the patch v13-0001.

Thank you for reviewing the patch.

======
Commit message

1.
This commit introduces an additional CLOG lookup to check the
transaction status, so the logical decoding skips further change also
when it doesn't touch system catalogs if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

~

That first sentence seems confusing. How about:

This commit adds a CLOG lookup to check the transaction status,
allowing logical decoding to skip changes for non-system catalogs if
the transaction is already aborted.

I'm concerned that the proposed sentence doesn't explain the change
well enough. I think the commit message needs to mention that we now
have more opportunities to detect transaction aborts, in addition to
the existing check when touching system catalogs while replaying a
transaction in streaming mode.
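
As a rough picture of that extra check, here is a simplified,
self-contained model. It is not the patch code: DemoTXN, DemoStatus,
the DEMO_* flag values, and demo_clog_lookup() are hypothetical
stand-ins for ReorderBufferTXN, the transaction status, the txn_flags
bits, and the TransactionIdIsInProgress()/TransactionIdDidCommit()
calls. It only shows the idea of looking up the status once, caching
the result in the flags, and dropping changes of a transaction known
to be aborted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values a simulated CLOG lookup can report. */
typedef enum DemoStatus
{
	DEMO_IN_PROGRESS,
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoStatus;

/* Hypothetical cached-status bits (not the real RBTXN_* values). */
#define DEMO_IS_COMMITTED	0x0001
#define DEMO_IS_ABORTED		0x0002

typedef struct DemoTXN
{
	uint32_t	flags;			/* cached status bits */
	size_t		size;			/* bytes of changes collected so far */
	DemoStatus	clog_status;	/* what the simulated CLOG reports */
	int			lookups;		/* number of simulated CLOG lookups */
} DemoTXN;

/* Simulated CLOG lookup; counts calls so caching can be observed. */
static DemoStatus
demo_clog_lookup(DemoTXN *txn)
{
	txn->lookups++;
	return txn->clog_status;
}

/* Return true if the transaction is aborted, discarding its changes. */
static bool
demo_check_and_truncate_aborted(DemoTXN *txn)
{
	DemoStatus	status;

	/* Quick return if the status is already cached. */
	if (txn->flags & DEMO_IS_COMMITTED)
		return false;
	if (txn->flags & DEMO_IS_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	status = demo_clog_lookup(txn);
	if (status == DEMO_IN_PROGRESS)
		return false;			/* outcome unknown, nothing to cache */
	if (status == DEMO_COMMITTED)
	{
		txn->flags |= DEMO_IS_COMMITTED;
		return false;
	}

	/* Aborted: discard collected changes and remember the outcome. */
	txn->size = 0;
	txn->flags |= DEMO_IS_ABORTED;
	return true;
}

/* Queue a change; returns false if it was dropped for an aborted txn. */
static bool
demo_queue_change(DemoTXN *txn, size_t change_size)
{
	if (txn->flags & DEMO_IS_ABORTED)
		return false;
	txn->size += change_size;
	return true;
}
```

In this model a second call for the same transaction does not repeat
the lookup, and later changes of an aborted transaction are not
accumulated at all.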

On Tue, Jan 14, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 6, 2025 at 5:52 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San.

Here are some review comments for the patch v12-0001.

Thank you for reviewing the patch!

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

~~~

3.
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ /*
+ * Remember the transaction is committed so that we can skip CLOG
+ * check next time, avoiding the pressure on CLOG lookup.
+ */
+ Assert(!rbtxn_is_aborted(txn));
+ txn->txn_flags |= RBTXN_IS_COMMITTED;
+ return false;
+ }
+
+ /*
+ * The transaction aborted. We discard the changes we've collected so far
+ * and toast reconstruction data. The full cleanup will happen as part of
+ * decoding ABORT record of this transaction.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferToastReset(rb, txn);
+
+ /* All changes should be discarded */
+ Assert(txn->size == 0);
+
+ /*
+ * Mark the transaction as aborted so we can ignore future changes of this
+ * transaction.
+ */
+ Assert(!rbtxn_is_committed(txn));
+ txn->txn_flags |= RBTXN_IS_ABORTED;
+
+ return true;
+}

3a.
That whole last part related to "The transaction aborted", might be
clearer if the whole chunk of code was in an 'else' block from the
previous "if (TransactionIdDidCommit(txn->xid))".

I'm not sure it improves readability. It makes good sense to me that
we return false in the 'if (TransactionIdDidCommit(txn->xid))' block.
If we add an 'else' block, the reader might be confused because the
'else' follows an 'if' block that already returns. We could add a
local variable for the result and return it at the end of the
function, but I'm not sure that would improve readability either.

2.
I think adding a local variable is overkill but OTOH introducing
“else” clarifies that the following code can only be reached when the
transaction is aborted. E.g. You don’t even need to read the previous
code block and see the “return false” to know that. Anyway, it’s
probably just a personal preference.

I prefer to reduce blocks where possible.
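
To make the two styles under discussion concrete, here is a small
stand-alone sketch. It is only an illustration: DemoTXN, DemoOutcome,
and both function names are hypothetical, and the bodies are reduced
to flag updates. The two functions behave identically; they differ
only in whether the abort path sits after the 'if' or inside an
'else'.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of a finished transaction. */
typedef enum DemoOutcome
{
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoOutcome;

/* Hypothetical stand-in for the cached status flags. */
typedef struct DemoTXN
{
	bool		committed_flag;
	bool		aborted_flag;
} DemoTXN;

/* Early-return style: the abort path follows the 'if' without 'else'. */
static bool
demo_check_early_return(DemoTXN *txn, DemoOutcome outcome)
{
	if (outcome == DEMO_COMMITTED)
	{
		txn->committed_flag = true;
		return false;
	}

	/* Only reachable when the transaction aborted. */
	assert(!txn->committed_flag);
	txn->aborted_flag = true;
	return true;
}

/* Explicit-else style: the abort path is nested one level deeper. */
static bool
demo_check_with_else(DemoTXN *txn, DemoOutcome outcome)
{
	if (outcome == DEMO_COMMITTED)
	{
		txn->committed_flag = true;
		return false;
	}
	else
	{
		assert(!txn->committed_flag);
		txn->aborted_flag = true;
		return true;
	}
}
```

The first form has one block fewer, while the second makes the
"aborted only" condition visible without reading the preceding
return.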

3c.
The "and toast reconstruction data" seems to be missing a word/s. (??)
- "... and also discard TOAST reconstruction data"
- "... and reset TOAST reconstruction data"

I don't understand this comment. What words are you suggesting adding
to these sentences?

3.
I meant something like:

BEFORE
We discard the changes we've collected so far and toast reconstruction data.

SUGGESTION
We discard both the changes collected so far and the TOAST reconstruction data.

Thanks, fixed.

======
src/include/replication/reorderbuffer.h

4.
-/* Has this transaction been prepared? */
+/*
+ * Is this transaction a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */

/Is this transaction a prepared transaction?/Is this a prepared transaction?/

Fixed.

I've attached the updated patch (0001 only). I'll submit the updated
0002 patch once we reach consensus on the names.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v14-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From 0d37d211aa4ac8238e7d4c393e1c6b875085f348 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v14 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction
status, allowing the logical decoding to skip changes also when it
doesn't touch system catalogs, if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. The performance benchmark
results showed no noticeable performance regression due to
CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 245 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 79b60df7cf0..8278e6f2223 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard both the changes collected so far
+	 * and the toast reconstruction data. The full cleanup will happen as part
+	 * of decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2681,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2737,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction as streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2924,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3662,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3705,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3764,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3775,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3794,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..9d9ac2f0830 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#60Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#58)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Jan 13, 2025 at 8:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 14, 2025 at 7:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Sawada-San.

Some review comments for patch v13-0002.

======

I think the v12 ambiguity of RBTXN_PREPARE versus RBTXN_SENT_PREPARE
was mostly addressed already by the improved comments for the macros
in patch 0001.

Meanwhile, patch v13-0002 says it is renaming constants for better
consistency, but I don't think it went far enough.

For example, better name consistency would be achieved by changing
*all* of the constants related to prepared transactions:

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_IS_PREPARED_SKIPPED 0x0080
#define RBTXN_IS_PREPARED_SENT 0x0200

where:

RBTXN_IS_PREPARED. This means it's a prepared transaction. (but we
can't tell from this if it is skipped or sent).

RBTXN_IS_PREPARED_SKIPPED. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and it's being skipped.

RBTXN_IS_PREPARED_SENT. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and we've sent it.

The first one (RBTXN_IS_PREPARED) sounds like an improvement over what
we have now. I am not convinced about the other two.

I agree with the above usage; it's more consistent to set
RBTXN_IS_PREPARED also for a skipped prepared transaction. But I'm not
sure it's better to have the RBTXN_IS_PREPARED prefix for all
constants.

~

A note about RBTXN_IS_PREPARED. Since all of these constants are
clearly about transactions (e.g. "TXN" in prefix "RBTXN_"), I felt
patch 0002 calling this RBTXN_IS_PREPARED_TXN just seemed like adding
a redundant _TXN. e.g. we don't say RBTXN_IS_COMMITTED_TXN etc.

+1. I felt the same.

I followed RBTXN_IS_SUBXACT (I think TXN and XACT have the same
meaning) but that's a fair point.

It seems we agreed on RBTXN_IS_PREPARED and rbtxn_is_prepared().
Adding 'IS' seems to clarify the transaction having this flag *is* a
prepared transaction. The other two constants, RBTXN_SENT_PREPARE and
RBTXN_SKIPPED_PREPARE, seem fine to me. I find that the proposed
names don't increase the consistency much. Thoughts?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#61Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#60)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 15, 2025 at 3:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems we agreed on RBTXN_IS_PREPARED and rbtxn_is_prepared().
Adding 'IS' seems to clarify the transaction having this flag *is* a
prepared transaction. The other two constants, RBTXN_SENT_PREPARE and
RBTXN_SKIPPED_PREPARE, seem fine to me.

Agreed.

I find that the proposed
names don't increase the consistency much. Thoughts?

I also think so.

--
With Regards,
Amit Kapila.

#62Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#61)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 15, 2025 at 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 15, 2025 at 3:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems we agreed on RBTXN_IS_PREPARED and rbtxn_is_prepared().
Adding 'IS' seems to clarify the transaction having this flag *is* a
prepared transaction. The other two constants, RBTXN_SENT_PREPARE and
RBTXN_SKIPPED_PREPARE, seem fine to me.

Agreed.

I find that the proposed
names don't increase the consistency much. Thoughts?

I also think so.

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

But if I am outvoted that's OK. It is not a big deal.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#63Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#62)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Jan 15, 2025 at 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 15, 2025 at 3:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems we agreed on RBTXN_IS_PREPARED and rbtxn_is_prepared().
Adding 'IS' seems to clarify the transaction having this flag *is* a
prepared transaction. The other two constants, RBTXN_SENT_PREPARE and
RBTXN_SKIPPED_PREPARE, seem fine to me.

Agreed.

I find that the proposed
names don't increase the consistency much. Thoughts?

I also think so.

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, having RBTXN_IS_PREPARED prefix for all related flags, improves
the consistency in terms of names, but I'm not sure this is the right
direction. For example, RBTXN_IS_PREPARED_SKIPPED is quite confusing
to me. I think this name implies "this is a prepared transaction but
is skipped", but I don't think it conveys the meaning well. In
addition, if we also set the RBTXN_IS_PREPARED flag for skipped
prepared transactions, we would end up writing:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to write:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and then rename them again later for other
reasons, over and over.
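
To make the flag combination above concrete, here is a minimal sketch.
The bit values are the ones shown in the reorderbuffer.h hunks in this
thread (RBTXN_IS_PREPARED reuses the 0x0040 bit of RBTXN_PREPARE).
DemoTXN and the demo_* helpers are hypothetical stand-ins for
illustration only; they are not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as they appear in the reorderbuffer.h hunks in this thread. */
#define RBTXN_IS_PREPARED		0x0040	/* transaction type: prepared */
#define RBTXN_SKIPPED_PREPARE	0x0080	/* state: its prepare was skipped */
#define RBTXN_SENT_PREPARE		0x0200	/* state: its prepare was sent */

/* Hypothetical stand-in for ReorderBufferTXN; only the flags word is modeled. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

static bool
demo_is_prepared(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_IS_PREPARED) != 0;
}

static bool
demo_skip_prepared(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_SKIPPED_PREPARE) != 0;
}

static bool
demo_sent_prepare(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_SENT_PREPARE) != 0;
}

/* A skipped prepared transaction carries both the type and the state bit. */
static void
demo_skip_prepare(DemoTXN *txn)
{
	txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);
}

/* Recording that the prepare was sent only makes sense for a prepared txn. */
static void
demo_mark_prepare_sent(DemoTXN *txn)
{
	assert(demo_is_prepared(txn));
	txn->txn_flags |= RBTXN_SENT_PREPARE;
}
```

With this layout the "what kind of transaction is it" question and the
"what has happened to its prepare" question are answered by separate
bits, so a caller can test the type without caring about the state.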

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#64Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#63)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Jan 17, 2025 at 11:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, having RBTXN_IS_PREPARED prefix for all related flags, improves
the consistency in terms of names, but I'm not sure this is the right
direction. For example, RBTXN_IS_PREPARED_SKIPPED is quite confusing
to me. I think this name implies "this is a prepared transaction but
is skipped", but I don't think it conveys the meaning well. In
addition, if we also set the RBTXN_IS_PREPARED flag for skipped
prepared transactions, we would end up writing:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to write:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and then rename them again later for other
reasons, over and over.

Sounds reasonable. We agree on just changing RBTXN_PREPARE to
RBTXN_IS_PREPARED and its corresponding macro. The next step is to
update the patch to reflect the same.

--
With Regards,
Amit Kapila.

#65Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#64)
2 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Jan 19, 2025 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 17, 2025 at 11:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, having RBTXN_IS_PREPARED prefix for all related flags, improves
the consistency in terms of names, but I'm not sure this is the right
direction. For example, RBTXN_IS_PREPARED_SKIPPED is quite confusing
to me. I think this name implies "this is a prepared transaction but
is skipped", but I don't think it conveys the meaning well. In
addition, if we also set the RBTXN_IS_PREPARED flag for skipped
prepared transactions, we would end up writing:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to write:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and then rename them again later for other
reasons, over and over.

Sounds reasonable. We agree on just changing RBTXN_PREPARE to
RBTXN_IS_PREPARED and its corresponding macro. The next step is to
update the patch to reflect the same.

Right. I've attached the updated patches.
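
As a rough illustration of the check that 0001 performs before a large
transaction is streamed or spilled, here is a simplified model. The two
flag values are taken from the reorderbuffer.h hunk; everything else
(DemoTXN, DemoXactStatus, the lookup callback, and the idea of caching
the verdict in the flags) is an assumption made for this sketch and is
not the patch code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk of v15-0001. */
#define RBTXN_IS_COMMITTED	0x0400
#define RBTXN_IS_ABORTED	0x0800

/* Hypothetical three-way status; stands in for the real status lookups. */
typedef enum DemoXactStatus
{
	DEMO_XACT_IN_PROGRESS,
	DEMO_XACT_COMMITTED,
	DEMO_XACT_ABORTED
} DemoXactStatus;

/* Hypothetical, heavily reduced stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
	uint32_t	xid;
	uint32_t	txn_flags;
	size_t		size;			/* bytes of decoded changes held in memory */
	int			lookups;		/* number of status lookups performed */
} DemoTXN;

typedef DemoXactStatus (*demo_status_fn) (uint32_t xid);

/*
 * Return true if the transaction is known to be aborted, discarding its
 * buffered changes; return false if it still has to be streamed or spilled.
 */
static bool
demo_check_and_truncate_aborted(DemoTXN *txn, demo_status_fn lookup)
{
	/* Assumption: a cached verdict lets us skip the lookup entirely. */
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	txn->lookups++;
	switch (lookup(txn->xid))
	{
		case DEMO_XACT_IN_PROGRESS:
			return false;		/* undecided, so check again next time */
		case DEMO_XACT_COMMITTED:
			txn->txn_flags |= RBTXN_IS_COMMITTED;
			return false;
		case DEMO_XACT_ABORTED:
			break;
	}

	/* Aborted: remember it and drop the changes instead of evicting them. */
	txn->txn_flags |= RBTXN_IS_ABORTED;
	txn->size = 0;
	return true;
}

/* Fixed-answer lookups, for demonstration only. */
static DemoXactStatus
demo_always_aborted(uint32_t xid)
{
	(void) xid;
	return DEMO_XACT_ABORTED;
}

static DemoXactStatus
demo_always_committed(uint32_t xid)
{
	(void) xid;
	return DEMO_XACT_COMMITTED;
}
```

In this model the lookup cost is paid at most once per decided
transaction, and only at the point where the transaction would
otherwise be evicted.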

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v15-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patchapplication/octet-stream; name=v15-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patchDownload
From 6cadfba13fc785bb615f43928654ec35bba5100b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 13 Jan 2025 10:35:17 -0800
Subject: [PATCH v15 2/2] Rename RBTXN_PREPARE to RBTXN_IS_PREPARED for better
 clarification.

Previously, RBTXN_PREPARE flag and rbtxn_prepared macro could be
misinterpreted as either indicating the transaction type (e.g. a
prepared transaction or a normal transaction) or its current
state (e.g. skipped or its prepare message is sent), especially after
commit XXX introduced the RBTXN_SENT_PREPARE flag and the
rbtxn_sent_prepare macro.

The RBTXN_PREPARE flag (and its corresponding macro) has been renamed
to RBTXN_IS_PREPARED to explicitly indicate the transaction
type. Accordingly, this commit also sets the RBTXN_IS_PREPARED flag on
a prepared transaction that has been skipped, which previously had
only the RBTXN_SKIPPED_PREPARE flag.

Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
---
 src/backend/replication/logical/proto.c       |  2 +-
 .../replication/logical/reorderbuffer.c       | 26 +++++++++----------
 src/backend/replication/logical/snapbuild.c   |  2 +-
 src/include/replication/reorderbuffer.h       |  6 ++---
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bef350714db..61b5283a2e1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -163,7 +163,7 @@ logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
 	 * which case we expect to have a valid GID.
 	 */
 	Assert(txn->gid != NULL);
-	Assert(rbtxn_prepared(txn));
+	Assert(rbtxn_is_prepared(txn));
 	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8278e6f2223..f8ba7c5b156 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,7 +1793,7 @@ ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn
 	 * and the toast reconstruction data. The full cleanup will happen as part
 	 * of decoding ABORT record of this transaction.
 	 */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 	ReorderBufferToastReset(rb, txn);
 
 	/* All changes should be discarded */
@@ -1968,7 +1968,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	if (rbtxn_prepared(txn))
+	if (rbtxn_is_prepared(txn))
 	{
 		/*
 		 * Note, we send stream prepare even if a concurrent abort is
@@ -2150,7 +2150,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2238,7 +2238,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (!streaming)
 		{
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 				rb->begin_prepare(rb, txn);
 			else
 				rb->begin(rb, txn);
@@ -2280,7 +2280,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * required for the cases when we decode the changes before the
 			 * COMMIT record is processed.
 			 */
-			if (streaming || rbtxn_prepared(change->txn))
+			if (streaming || rbtxn_is_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2625,7 +2625,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
 			 * regular ones).
 			 */
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 			{
 				rb->prepare(rb, txn, commit_lsn);
 				txn->txn_flags |= RBTXN_SENT_PREPARE;
@@ -2679,12 +2679,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * For 4, as the entire txn has been decoded, we can fully clean up
 		 * the TXN reorder buffer.
 		 */
-		if (streaming || rbtxn_prepared(txn))
+		if (streaming || rbtxn_is_prepared(txn))
 		{
 			if (streaming)
 				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2728,7 +2728,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * during a two-phase commit.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
-			(stream_started || rbtxn_prepared(txn)))
+			(stream_started || rbtxn_is_prepared(txn)))
 		{
 			/* curtxn must be set for streaming or prepared transactions */
 			Assert(curtxn);
@@ -2815,7 +2815,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 		 * Removing this txn before a commit might result in the computation
 		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
 		 */
-		if (!rbtxn_prepared(txn))
+		if (!rbtxn_is_prepared(txn))
 			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
@@ -2893,7 +2893,7 @@ ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+	txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);
 }
 
 /*
@@ -2914,7 +2914,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_PREPARE;
+	txn->txn_flags |= RBTXN_IS_PREPARED;
 	txn->gid = pstrdup(gid);
 
 	/* The prepare info must have been updated in txn by now. */
@@ -2975,7 +2975,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 */
 	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
-		txn->txn_flags |= RBTXN_PREPARE;
+		txn->txn_flags |= RBTXN_IS_PREPARED;
 
 		/*
 		 * The prepare info must have been updated in txn even if we skip
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de318..9b764c6c40b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -761,7 +761,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * We don't need to add snapshot to prepared transactions as they
 		 * should not see the new catalog contents.
 		 */
-		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+		if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
 			continue;
 
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d9ac2f0830..27d134198e3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,7 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SERIALIZED_CLEAR 	0x0008
 #define RBTXN_IS_STREAMED         	0x0010
 #define RBTXN_HAS_PARTIAL_CHANGE  	0x0020
-#define RBTXN_PREPARE             	0x0040
+#define RBTXN_IS_PREPARED 			0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
 #define RBTXN_SENT_PREPARE			0x0200
@@ -234,9 +234,9 @@ typedef struct ReorderBufferChange
  * committed. To check whether a prepare or a stream_prepare has already
  * been sent for this transaction, we need to use rbtxn_sent_prepare().
  */
-#define rbtxn_prepared(txn) \
+#define rbtxn_is_prepared(txn) \
 ( \
-	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+	((txn)->txn_flags & RBTXN_IS_PREPARED) != 0 \
 )
 
 /* Has a prepare or stream_prepare already been sent? */
-- 
2.43.5

v15-0001-Skip-logical-decoding-of-already-aborted-transac.patchapplication/octet-stream; name=v15-0001-Skip-logical-decoding-of-already-aborted-transac.patchDownload
From 9f0dacbcdc3dc02c32d90d0d42b6c163e509e1fe Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v15 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction
status, allowing the logical decoding to skip changes also when it
doesn't touch system catalogs, if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. The performance benchmark
results showed no noticeable performance regression due to the
CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 245 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 79b60df7cf0..8278e6f2223 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard both the changes collected so far
+	 * and the toast reconstruction data. The full cleanup will happen as part
+	 * of decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2681,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2737,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction as streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2924,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3662,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3705,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3764,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3775,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3794,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..9d9ac2f0830 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5
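
For readers skimming the patch, the status-caching behaviour of ReorderBufferCheckAndTruncateAbortedTXN() can be approximated by the standalone sketch below. This is not PostgreSQL code: ToyTXN, XactStatus, toy_lookup() and toy_check_and_truncate_aborted() are hypothetical stand-ins, and the simulated lookup replaces the real TransactionIdIsInProgress()/TransactionIdDidCommit() calls. Only the two flag values are copied from the patch, and the debug_logical_replication_streaming shortcut is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the patch's reorderbuffer.h hunk. */
#define RBTXN_IS_COMMITTED	0x0400
#define RBTXN_IS_ABORTED	0x0800

/* Hypothetical stand-in for what the procarray/CLOG would report. */
typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactStatus;

/* Hypothetical, trimmed-down stand-in for ReorderBufferTXN. */
typedef struct ToyTXN
{
	uint32_t	txn_flags;
	size_t		size;			/* bytes of buffered changes */
	XactStatus	real_status;	/* answer a status lookup would give */
	int			lookups;		/* number of simulated status lookups */
} ToyTXN;

/* Simulated status lookup; counts calls so the caching is observable. */
static XactStatus
toy_lookup(ToyTXN *txn)
{
	txn->lookups++;
	return txn->real_status;
}

/*
 * Return true if the transaction is aborted, discarding its buffered
 * changes the first time that is discovered. A final status (committed
 * or aborted) is cached in txn_flags so later calls skip the lookup.
 */
static bool
toy_check_and_truncate_aborted(ToyTXN *txn)
{
	/* Quick returns when the final status is already cached. */
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
	{
		assert(txn->size == 0);
		return true;
	}

	switch (toy_lookup(txn))
	{
		case XACT_IN_PROGRESS:
			/* Not final yet, so nothing can be cached. */
			return false;
		case XACT_COMMITTED:
			txn->txn_flags |= RBTXN_IS_COMMITTED;
			return false;
		case XACT_ABORTED:
			break;
	}

	/* Aborted: drop the collected changes and remember the status. */
	txn->size = 0;
	txn->txn_flags |= RBTXN_IS_ABORTED;
	return true;
}
```

In this sketch an in-progress transaction is looked up again on every call, while a committed or aborted one costs a single lookup. That matches the intent stated in the patch comments of avoiding repeated CLOG lookups.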

#66Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#65)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 22, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jan 19, 2025 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 17, 2025 at 11:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, using an RBTXN_IS_PREPARED prefix for all related flags, improves
naming consistency, but I'm not sure this is the right direction. For
example, RBTXN_IS_PREPARED_SKIPPED is quite confusing to me. I think
the name is meant to imply "this is a prepared transaction but is
skipped", but I don't think it conveys that meaning well. In addition,
if we also set the RBTXN_IS_PREPARED flag for skipped prepared
transactions, we would end up doing something like:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to do:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and later rename them again and again for other
reasons.

Sounds reasonable. We agree with just changing RBTXN_PREPARE to
RBTXN_IS_PREPARED and its corresponding macro. The next step is to
update the patch to reflect the same.

Right. I've attached the updated patches.

Some review comments for v15-0002.

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

======

I'm not trying to be pedantic, but there seems to be something strange
about the combination usage of these PREPARE constants, which raises
lots of questions for me...

For example.
I had thought RBTXN_SKIPPED_PREPARE meant it is a prepared tx AND it is skipped
I had thought RBTXN_SENT_PREPARE meant it is a prepared tx AND it is sent

So I was surprised that the patch makes this change:
- txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+ txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

because, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* mean it
is a prepared transaction then why does that constant even have
"PREPARE" in its name at all instead of just being called
RBTXN_SKIPPED?

e.g., either of these makes sense to me:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED);
txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

But this combination seemed odd:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
  continue;

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to ensure that the
RBTXN_IS_PREPARED bit is also set as part of the RBTXN_SKIPPED_PREPARE
and RBTXN_SENT_PREPARE constants, removing the ambiguity about how
exactly to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

and make appropriate macro changes

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) == RBTXN_SKIPPED_PREPARE) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) == RBTXN_SENT_PREPARE) \
)

Thoughts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia
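
The composite-constant idea can be tried out in isolation. The sketch below is only an illustration under assumptions: the toy_* helper names are hypothetical, plain functions are used instead of the rbtxn_* macros, and the numeric values are the ones quoted in the message. It also shows why such tests need the inner parentheses: in C, "==" and "!=" bind more tightly than "&".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Composite values: the two sub-flags embed the IS_PREPARED bit. */
#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	(0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE		(0x0200 | RBTXN_IS_PREPARED)

/* Is the transaction a prepared one? */
static bool
toy_is_prepared(uint32_t flags)
{
	return (flags & RBTXN_IS_PREPARED) != 0;
}

/* True only if every bit of the composite constant is present. */
static bool
toy_skip_prepared(uint32_t flags)
{
	return (flags & RBTXN_SKIPPED_PREPARE) == RBTXN_SKIPPED_PREPARE;
}

/* True only if every bit of the composite constant is present. */
static bool
toy_sent_prepare(uint32_t flags)
{
	return (flags & RBTXN_SENT_PREPARE) == RBTXN_SENT_PREPARE;
}

/*
 * What "flags & MASK == MASK" computes when the inner parentheses are
 * missing: the comparison is evaluated first and yields 1, so the whole
 * expression degenerates to "flags & 1".
 */
static bool
toy_skip_prepared_as_parsed(uint32_t flags)
{
	return (flags & 1u) != 0;
}
```

With these definitions, setting RBTXN_SKIPPED_PREPARE alone is enough for toy_is_prepared() to report true. A cost of this design is that code clearing one of the composite flags would have to take care not to clear the shared RBTXN_IS_PREPARED bit as well.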

#67Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#66)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Jan 22, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jan 19, 2025 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 17, 2025 at 11:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, using an RBTXN_IS_PREPARED prefix for all related flags, improves
naming consistency, but I'm not sure this is the right direction. For
example, RBTXN_IS_PREPARED_SKIPPED is quite confusing to me. I think
the name is meant to imply "this is a prepared transaction but is
skipped", but I don't think it conveys that meaning well. In addition,
if we also set the RBTXN_IS_PREPARED flag for skipped prepared
transactions, we would end up doing something like:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to do:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and later rename them again and again for other
reasons.

Sounds reasonable. We agree with just changing RBTXN_PREPARE to
RBTXN_IS_PREPARED and its corresponding macro. The next step is to
update the patch to reflect the same.

Right. I've attached the updated patches.

Some review comments for v15-0002.

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

======

I'm not trying to be pedantic, but there seems to be something strange
about the combination usage of these PREPARE constants, which raises
lots of questions for me...

For example.
I had thought RBTXN_SKIPPED_PREPARE meant it is a prepared tx AND it is skipped
I had thought RBTXN_SENT_PREPARE meant it is a prepared tx AND it is sent

So I was surprised that the patch makes this change:
- txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+ txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

because, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* mean it
is a prepared transaction then why does that constant even have
"PREPARE" in its name at all instead of just being called
RBTXN_SKIPPED?

e.g., either of these makes sense to me:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED);
txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

But this combination seemed odd:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to ensure that the
RBTXN_IS_PREPARED bit is also set as part of the RBTXN_SKIPPED_PREPARE
and RBTXN_SENT_PREPARE constants, removing the ambiguity about how
exactly to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Will that address your concern? Does anyone else have an opinion on this matter?

--
With Regards,
Amit Kapila.
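
The suggestion above, checking at the point where a flag is set, could look roughly like the sketch below. It is a standalone illustration, not the patch: the toy_mark_* helpers are hypothetical, the standard assert() stands in for PostgreSQL's Assert(), and the flag values are the non-composite ones quoted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	0x0080
#define RBTXN_SENT_PREPARE		0x0200

/*
 * Record that a prepare (or stream_prepare) was sent. The caller must
 * already have marked the transaction as prepared.
 */
static void
toy_mark_sent_prepare(uint32_t *txn_flags)
{
	assert((*txn_flags & RBTXN_IS_PREPARED) != 0);
	*txn_flags |= RBTXN_SENT_PREPARE;
}

/*
 * Record that the prepare was skipped. Same precondition: the
 * transaction must already be marked as prepared.
 */
static void
toy_mark_skipped_prepare(uint32_t *txn_flags)
{
	assert((*txn_flags & RBTXN_IS_PREPARED) != 0);
	*txn_flags |= RBTXN_SKIPPED_PREPARE;
}
```

This keeps the flag values independent bits and moves the invariant into assert-enabled builds. It relies on every code path setting RBTXN_IS_PREPARED before calling either helper.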

#68Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#67)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Jan 22, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jan 19, 2025 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 17, 2025 at 11:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 15, 2025 at 4:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

My thoughts are that any consistency improvement is a step in the
right direction so even "don't increase the consistency much" is still
better than nothing.

I agree that doing something is better than nothing. The proposed
idea, using an RBTXN_IS_PREPARED prefix for all related flags, improves
naming consistency, but I'm not sure this is the right direction. For
example, RBTXN_IS_PREPARED_SKIPPED is quite confusing to me. I think
the name is meant to imply "this is a prepared transaction but is
skipped", but I don't think it conveys that meaning well. In addition,
if we also set the RBTXN_IS_PREPARED flag for skipped prepared
transactions, we would end up doing something like:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_IS_PREPARED_SKIPPED);

which seems quite redundant. It makes more sense to me to do:

txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

I'd like to avoid a situation where we rename these flags just for
naming consistency and later rename them again and again for other
reasons.

Sounds reasonable. We agree with just changing RBTXN_PREPARE to
RBTXN_IS_PREPARED and its corresponding macro. The next step is to
update the patch to reflect the same.

Right. I've attached the updated patches.

Some review comments for v15-0002.

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

======

I'm not trying to be pedantic, but there seems to be something strange
about the combination usage of these PREPARE constants, which raises
lots of questions for me...

For example.
I had thought RBTXN_SKIPPED_PREPARE meant it is a prepared tx AND it is skipped
I had thought RBTXN_SENT_PREPARE meant it is a prepared tx AND it is sent

So I was surprised that the patch makes this change:
- txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+ txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

because, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* mean it
is a prepared transaction then why does that constant even have
"PREPARE" in its name at all instead of just being called
RBTXN_SKIPPED?

e.g., either of these makes sense to me:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED);
txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

But this combination seemed odd:
txn->txn_flags |= (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE);

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to ensure that the
RBTXN_IS_PREPARED bit is also set as part of the RBTXN_SKIPPED_PREPARE
and RBTXN_SENT_PREPARE constants, removing the ambiguity about how
exactly to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Will that address your concern? Does anyone else have an opinion on this matter?

Yes, that would be OK, but we should also add some clarifying comments
in "reorderbuffer.h", like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for RBTXN_IS_PREPARED transactions */

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#69Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#68)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 22, 2025 at 7:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

Will fix.

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

Right. We no longer need to check rbtxn_skip_prepared() here.

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to ensure that the
RBTXN_IS_PREPARED bit is also set as part of the RBTXN_SKIPPED_PREPARE
and RBTXN_SENT_PREPARE constants, removing the ambiguity about how
exactly to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Since the patch already does "txn->txn_flags |= (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE);", it's already ensured, no?

I think we need to add both flags in ReorderBufferSkipPrepare(),
because there is a case where a transaction might not be marked as
RBTXN_IS_PREPARED here.

Will that address your concern? Does anyone else have an opinion on this matter?

Yes, that would be OK, but we should also add some clarifying comments
in "reorderbuffer.h", like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for RBTXN_IS_PREPARED transactions */

I think the same is true for RBTXN_IS_SERIALIZED and
RBTXN_IS_SERIALIZED_CLEAR; RBTXN_IS_SERIALIZED_CLEAR can only be set
for RBTXN_IS_SERIALIZED transactions. Should we add some comments to
them too? I'm concerned about having too much explanation if we add
descriptions to the flags while already having comments for the
corresponding macros.

Another way to ensure that is to convert these macros to inline
functions and add an Assert() there, but it seems overkill.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
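
The inline-function alternative mentioned above might look something like the following sketch. It is an assumption-laden illustration rather than proposed code: ToyTXN is a hypothetical one-field stand-in for ReorderBufferTXN, the toy_ prefix marks names that do not exist in PostgreSQL, and the standard assert() replaces Assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	0x0080
#define RBTXN_SENT_PREPARE		0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct ToyTXN
{
	uint32_t	txn_flags;
} ToyTXN;

/* Has the prepare been skipped? Only valid for prepared transactions. */
static inline bool
toy_rbtxn_skip_prepared(const ToyTXN *txn)
{
	bool		skipped = (txn->txn_flags & RBTXN_SKIPPED_PREPARE) != 0;

	/* The sub-flag must never appear without RBTXN_IS_PREPARED. */
	assert(!skipped || (txn->txn_flags & RBTXN_IS_PREPARED) != 0);
	return skipped;
}

/* Has a prepare or stream_prepare already been sent? */
static inline bool
toy_rbtxn_sent_prepare(const ToyTXN *txn)
{
	bool		sent = (txn->txn_flags & RBTXN_SENT_PREPARE) != 0;

	/* The sub-flag must never appear without RBTXN_IS_PREPARED. */
	assert(!sent || (txn->txn_flags & RBTXN_IS_PREPARED) != 0);
	return sent;
}
```

Compared with comments on the flag definitions, this checks the invariant wherever the flags are read, in assert-enabled builds. The trade-off is the one raised in the message: every such macro would need to be converted.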

#70Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#69)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Jan 24, 2025 at 12:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 22, 2025 at 7:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

Will fix.

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

Right. We no longer need to check rbtxn_skip_prepared() here.

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to ensure that the
RBTXN_IS_PREPARED bit is also set as part of the RBTXN_SKIPPED_PREPARE
and RBTXN_SENT_PREPARE constants, removing the ambiguity about how
exactly to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Since the patch already does "txn->txn_flags |= (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE);", it's already ensured, no?

I meant that we should add an assert to ensure the same.

I think we need to add both flags in ReorderBufferSkipPrepare(),
because there is a case where a transaction might not be marked as
RBTXN_IS_PREPARED here.

Are you talking about the case when it is invoked from
DecodePrepare()? I thought we would set the flag in that code path.

Will that address your concern? Does anyone else have an opinion on this matter?

Yes, that would be OK, but we should also add some clarifying comments
in "reorderbuffer.h" like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for RBTXN_IS_PREPARED transactions */

I think the same is true for RBTXN_IS_SERIALIZED and
RBTXN_IS_SERIALIZED_CLEAR; RBTXN_IS_SERIALIZED_CLEAR can only be set
for RBTXN_IS_SERIALIZED transactions. Should we add some comments to
them too? But I'm concerned about having too much explanation if we
add descriptions to the flags while already having comments for the
corresponding macros.

Yeah, I am fine either way, especially if we decide to add asserts for
RBTXN_IS_PREPARED when we set those flags.

Another way to ensure that is to convert these macros to inline
functions and add an Assert() there, but it seems overkill.
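
For illustration, the inline-function variant could look roughly like the sketch below. It is a standalone approximation and not code from the patch: `SketchTXN` and `sketch_rbtxn_skip_prepared()` are hypothetical stand-ins for ReorderBufferTXN and rbtxn_skip_prepared(), and the standard `assert()` stands in for PostgreSQL's Assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RBTXN_IS_PREPARED     0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here */
typedef struct SketchTXN
{
    uint32_t txn_flags;
} SketchTXN;

/*
 * Inline-function form of the "prepare skipped?" check. The assertion
 * encodes the invariant discussed above: RBTXN_SKIPPED_PREPARE may only
 * be set together with RBTXN_IS_PREPARED.
 */
static inline bool
sketch_rbtxn_skip_prepared(const SketchTXN *txn)
{
    bool skipped = (txn->txn_flags & RBTXN_SKIPPED_PREPARE) != 0;

    assert(!skipped || (txn->txn_flags & RBTXN_IS_PREPARED) != 0);

    return skipped;
}
```

The trade-off is that every caller pays for the check in assert-enabled builds, whereas asserting only where the flags are set checks the invariant once per transition.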

True, but that would ensure we don't make any coding mistakes, which is
what Peter wants to achieve by writing additional comments; asserting is
probably a better way.

--
With Regards,
Amit Kapila.

#71 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#70)
2 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Sun, Jan 26, 2025 at 10:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 24, 2025 at 12:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 22, 2025 at 7:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

Will fix.

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

Right. We no longer need to check rbtxn_skip_prepared() here.

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to enforce that the
RBTXN_IS_PREPARED bitflag is also set for the RBTXN_SKIPPED_PREPARE and
RBTXN_SENT_PREPARE constants, removing the ambiguity about how exactly
to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Since the patch already does "txn->txn_flags |= (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE);", it's already ensured, no?

I meant that we should add an assert to ensure the same.

I think we need to add both flags in ReorderBufferSkipPrepare(),
because there is a case where a transaction might not be marked as
RBTXN_IS_PREPARED here.

Are you talking about the case when it is invoked from
DecodePrepare()?

Yes. IIUC ReorderBufferSkipPrepare() is called only from DecodePrepare().

I thought we would set the flag in that code path.

I agree that it makes sense to add the flag before calling
ReorderBufferSkipPrepare().

Will that address your concern? Does anyone else have an opinion on this matter?

Yes, that would be OK, but we should also add some clarifying comments
in "reorderbuffer.h" like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for RBTXN_IS_PREPARED transactions */

I think the same is true for RBTXN_IS_SERIALIZED and
RBTXN_IS_SERIALIZED_CLEAR; RBTXN_IS_SERIALIZED_CLEAR can only be set
for RBTXN_IS_SERIALIZED transactions. Should we add some comments to
them too? But I'm concerned about having too much explanation if we
add descriptions to the flags while already having comments for the
corresponding macros.

Yeah, I am fine either way, especially if we decide to add asserts for
RBTXN_IS_PREPARED when we set those flags.

Another way to ensure that is to convert these macros to inline
functions and add an Assert() there, but it seems overkill.

True, but that would ensure we don't make any coding mistakes, which is
what Peter wants to achieve by writing additional comments; asserting is
probably a better way.

I've attached the updated patch. In the 0002 patch, I've marked the
transaction as a prepared transaction in
ReorderBufferRememberPrepareInfo() so that all prepared transactions
that have a ReorderBufferTXN entry at that time can be marked properly.
And I've put some Assertions to ensure that all prepared transaction
related flags have been set properly. Thoughts?
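
To make the intended flag ordering easier to follow, here is a standalone sketch of the transitions described above. It is a simplification under stated assumptions, not the patch itself: `SketchTXN` and the `sketch_*` functions are hypothetical stand-ins for ReorderBufferTXN, ReorderBufferRememberPrepareInfo(), ReorderBufferSkipPrepare() and ReorderBufferPrepare(); `assert()` stands in for Assert(); and the replay step that sits between the checks and the sending of the prepare is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define RBTXN_IS_PREPARED     0x0040
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_SENT_PREPARE    0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here */
typedef struct SketchTXN
{
    uint32_t txn_flags;
} SketchTXN;

/* Step 1: recording the prepare info marks the transaction as prepared */
static void
sketch_remember_prepare_info(SketchTXN *txn)
{
    /* none of the prepare-related flags may be set yet */
    assert((txn->txn_flags &
            (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE |
             RBTXN_SENT_PREPARE)) == 0);
    txn->txn_flags |= RBTXN_IS_PREPARED;
}

/* Step 2a: skipping the prepare requires an already-marked transaction */
static void
sketch_skip_prepare(SketchTXN *txn)
{
    assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
    txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
}

/* Step 2b: sending the prepare requires "prepared, not skipped, not sent" */
static void
sketch_prepare(SketchTXN *txn)
{
    assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
    assert((txn->txn_flags &
            (RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)) == 0);

    /* replaying the changes is omitted in this sketch */
    txn->txn_flags |= RBTXN_SENT_PREPARE;
}
```

Calling `sketch_skip_prepare()` or `sketch_prepare()` on a transaction that never went through `sketch_remember_prepare_info()` trips an assertion, which is the kind of coding mistake the asserts are meant to catch.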

Nothing has changed in the 0001 patch since the previous version.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v16-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patch (application/octet-stream)
From c846dea7369c953ce62fa68c6b6cb0e961a9d5ee Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 13 Jan 2025 10:35:17 -0800
Subject: [PATCH v16 2/2] Rename RBTXN_PREPARE to RBTXN_IS_PREPARE for better
 clarification.

Previously, RBTXN_PREPARE flag and rbtxn_prepared macro could be
misinterpreted as either indicating the transaction type (e.g. a
prepared transaction or a normal transaction) or its current
state (e.g. skipped or its prepare message is sent), especially after
commit XXX introduced the RBTXN_SENT_PREPARE flag and the
rbtxn_sent_prepare macro.

The RBTXN_PREPARE flag (and its corresponding macro) have been renamed
to RBTXN_IS_PREPARED to explicitly indicate the transaction
type. Therefore, this commit also adds the RBTXN_IS_PREPARED flag
to the transaction that is a prepared transaction and has been
skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag.

Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
---
 src/backend/replication/logical/proto.c       |  2 +-
 .../replication/logical/reorderbuffer.c       | 46 ++++++++++++-------
 src/backend/replication/logical/snapbuild.c   |  2 +-
 src/include/replication/reorderbuffer.h       |  6 +--
 4 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bef350714db..61b5283a2e1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -163,7 +163,7 @@ logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
 	 * which case we expect to have a valid GID.
 	 */
 	Assert(txn->gid != NULL);
-	Assert(rbtxn_prepared(txn));
+	Assert(rbtxn_is_prepared(txn));
 	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8278e6f2223..92d2d7e6c69 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,7 +1793,7 @@ ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn
 	 * and the toast reconstruction data. The full cleanup will happen as part
 	 * of decoding ABORT record of this transaction.
 	 */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 	ReorderBufferToastReset(rb, txn);
 
 	/* All changes should be discarded */
@@ -1968,7 +1968,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	if (rbtxn_prepared(txn))
+	if (rbtxn_is_prepared(txn))
 	{
 		/*
 		 * Note, we send stream prepare even if a concurrent abort is
@@ -2150,7 +2150,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2238,7 +2238,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (!streaming)
 		{
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 				rb->begin_prepare(rb, txn);
 			else
 				rb->begin(rb, txn);
@@ -2280,7 +2280,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * required for the cases when we decode the changes before the
 			 * COMMIT record is processed.
 			 */
-			if (streaming || rbtxn_prepared(change->txn))
+			if (streaming || rbtxn_is_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2625,7 +2625,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
 			 * regular ones).
 			 */
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 			{
 				rb->prepare(rb, txn, commit_lsn);
 				txn->txn_flags |= RBTXN_SENT_PREPARE;
@@ -2679,12 +2679,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * For 4, as the entire txn has been decoded, we can fully clean up
 		 * the TXN reorder buffer.
 		 */
-		if (streaming || rbtxn_prepared(txn))
+		if (streaming || rbtxn_is_prepared(txn))
 		{
 			if (streaming)
 				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2728,7 +2728,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * during a two-phase commit.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
-			(stream_started || rbtxn_prepared(txn)))
+			(stream_started || rbtxn_is_prepared(txn)))
 		{
 			/* curtxn must be set for streaming or prepared transactions */
 			Assert(curtxn);
@@ -2815,7 +2815,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 		 * Removing this txn before a commit might result in the computation
 		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
 		 */
-		if (!rbtxn_prepared(txn))
+		if (!rbtxn_is_prepared(txn))
 			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
@@ -2852,7 +2852,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Record the prepare information for a transaction.
+ * Record the prepare information for a transaction. Also, mark the transaction
+ * as a prepared transaction.
  */
 bool
 ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
@@ -2878,6 +2879,11 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
+	/* Mark this transaction as a prepared transaction */
+	Assert((txn->txn_flags &
+			(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)) == 0);
+	txn->txn_flags |= RBTXN_IS_PREPARED;
+
 	return true;
 }
 
@@ -2893,6 +2899,9 @@ ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return;
 
+	/* txn must have been marked as a prepared transaction */
+	Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+
 	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
 }
 
@@ -2914,12 +2923,17 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_PREPARE;
-	txn->gid = pstrdup(gid);
-
-	/* The prepare info must have been updated in txn by now. */
+	/*
+	 * txn must have been marked as a prepared transaction and must have
+	 * neither been skipped nor sent a prepare. Also, the prepare info must
+	 * have been updated in it by now.
+	 */
+	Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+	Assert((txn->txn_flags & (RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)) == 0);
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	txn->gid = pstrdup(gid);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2975,7 +2989,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 */
 	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
-		txn->txn_flags |= RBTXN_PREPARE;
+		txn->txn_flags |= RBTXN_IS_PREPARED;
 
 		/*
 		 * The prepare info must have been updated in txn even if we skip
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de318..05687fd75e5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -761,7 +761,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * We don't need to add snapshot to prepared transactions as they
 		 * should not see the new catalog contents.
 		 */
-		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+		if (rbtxn_is_prepared(txn))
 			continue;
 
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d9ac2f0830..27d134198e3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,7 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SERIALIZED_CLEAR 	0x0008
 #define RBTXN_IS_STREAMED         	0x0010
 #define RBTXN_HAS_PARTIAL_CHANGE  	0x0020
-#define RBTXN_PREPARE             	0x0040
+#define RBTXN_IS_PREPARED 			0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
 #define RBTXN_SENT_PREPARE			0x0200
@@ -234,9 +234,9 @@ typedef struct ReorderBufferChange
  * committed. To check whether a prepare or a stream_prepare has already
  * been sent for this transaction, we need to use rbtxn_sent_prepare().
  */
-#define rbtxn_prepared(txn) \
+#define rbtxn_is_prepared(txn) \
 ( \
-	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+	((txn)->txn_flags & RBTXN_IS_PREPARED) != 0 \
 )
 
 /* Has a prepare or stream_prepare already been sent? */
-- 
2.43.5

v16-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From 9f0dacbcdc3dc02c32d90d0d42b6c163e509e1fe Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v16 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction
status, allowing the logical decoding to skip changes also when it
doesn't touch system catalogs, if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. The performance benchmark
results showed there is no noticeable performance regression due to
CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 245 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 79b60df7cf0..8278e6f2223 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard both the changes collected so far
+	 * and the toast reconstruction data. The full cleanup will happen as part
+	 * of decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2681,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2737,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction is streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2924,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3662,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3705,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3764,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3775,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3794,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..9d9ac2f0830 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

#72Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#71)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Jan 28, 2025 at 4:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jan 26, 2025 at 10:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 24, 2025 at 12:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 22, 2025 at 7:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

Will fix.

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

Right. We no longer need to check rbtxn_skip_prepared() here.

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to enforce that the
RBTXN_IS_PREPARED bitflag is also set for the RBTXN_SKIPPED_PREPARE and
RBTXN_SENT_PREPARE constants, removing the ambiguity about how exactly
to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)
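
A minimal sketch of one consequence of the composite constants proposed above, assuming the illustrative bit values from that proposal: once a flag constant also carries the RBTXN_IS_PREPARED bit, a "!= 0" style test is true as soon as either bit is set, so the check would have to compare against the whole mask. The helper names below are hypothetical and are not part of reorderbuffer.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only, mirroring the composite-constant proposal. */
#define RBTXN_IS_PREPARED      0x0040
#define RBTXN_SKIPPED_PREPARE  (0x0080 | RBTXN_IS_PREPARED)

/* Hypothetical helper: true if ANY bit of the composite mask is set. */
static bool
skip_prepared_any_bit(uint32_t flags)
{
	return (flags & RBTXN_SKIPPED_PREPARE) != 0;
}

/* Hypothetical helper: true only if ALL bits of the composite mask are set. */
static bool
skip_prepared_all_bits(uint32_t flags)
{
	return (flags & RBTXN_SKIPPED_PREPARE) == RBTXN_SKIPPED_PREPARE;
}
```

With only RBTXN_IS_PREPARED set, the first helper reports a skipped prepare while the second does not, which is why composite constants would also need the macros to be rewritten.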

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Since the patch already does "txn->txn_flags |= (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE);", it's already ensured, no?

I mean to say that we add assert to ensure the same.

I think we need to add both flags in ReorderBufferSkipPrepare(),
because there is a case where a transaction might not be marked as
RBTXN_IS_PREPARED here.

Are you talking about the case when it is invoked from
DecodePrepare()?

Yes. IIUC ReorderBufferSkipPrepare() is called only from DecodePrepare().

I thought we would set the flag in that code path.

I agree that it makes sense to add the flag before calling
ReorderBufferSkipPrepare().

Will that address your concern? Does anyone else have an opinion on this matter?

Yes that would be OK, but should also add some clarifying comments in
the "reorderbuffer.h" like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set
for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for
RBTXN_IS_PREPARED transactions */

I think the same is true for RBTXN_IS_SERIALIZED and
RBTXN_IS_SERIALIZED_CLEAR; RBTXN_IS_SERIALIZED_CLEAR can only be set
for RBTXN_IS_SERIALIZED transaction. Should we add some comments to
them too? But I'm concerned about having too much explanation if we
add descriptions to flags too while already having comments for
corresponding macros.

Hmm. That RBTXN_IS_SERIALIZED / RBTXN_IS_SERIALIZED_CLEAR pair is used
differently -- it seems more tricky because RBTXN_IS_SERIALIZED flag
is turned OFF again when RBTXN_IS_SERIALIZED_CLEAR is turned ON.
(Whereas setting SKIPPED_PREPARE and SENT_PREPARE will never turn off
the tx type IS_PREPARED)

To be honest, I didn't understand the "CLEAR" part of that name. It
seems more like it should've been called something like
RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
whatever instead of something that appears to be saying "has the
RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
to over-comment everything but OTOH currently there is no way really
to understand what these flags mean without looking through all the
code to try to figure them out from the usage.

My recurring gripe about these flags is simply that their meanings and
how to use them should be apparent just by looking at reorderbuffer.h
and not having to guess anything or look at how they get used in the
code. It doesn't matter if that is achieved by better constant names,
by more comments or by enhanced macros/functions with asserts but
currently just looking at that file still leaves the reader with lots
of unanswered questions.

Yeah, I am fine either way, especially if we decide to add asserts for
RBTXN_IS_PREPARED when we set those flags.

Another way to ensure that is to convert these macros to inline
functions and add an Assert() there, but it seems overkill.

True, but that would ensure, we won't make any coding mistakes which
Peter wants to ensure by writing additional comments but asserting is
probably a better way.

Maybe I misunderstood, but I thought Amit's reply there meant that
rewriting the macros as inline functions with asserts would be a good
way to ensure no coding mistakes. Yet, the macros are still unchanged
in v16-0002.

I've attached the updated patch. In the 0002 patch, I've marked the
transaction as a prepared transaction in
ReorderBufferRememberPrepareInfo() so that all prepared transactions
that have a ReorderBufferTXN entry at that time can be marked properly.
And I've put some Assertions to ensure that all prepared transaction
related flags have been set properly. Thoughts?

Here are a couple of other review comments for patch v16-0002

======
Commit message

1.
The RBTXN_PREPARE flag (and its corresponding macro) have been renamed
to RBTXN_IS_PREPARE to explicitly indicate the transaction
type. Therefore, this commit also adds the RBTXN_IS_PREAPRE flag also
to the transaction that is a prepared transaction and has been
skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag.

Instead of fixing the "RBTXN_IS_PREAPRE" typo, it looks like a new
problem (double "also" in the same sentence) was added in v16.

======
.../replication/logical/reorderbuffer.c

2.
if ((txn->final_lsn < two_phase_at) && is_commit)
{
- txn->txn_flags |= RBTXN_PREPARE;
+ txn->txn_flags |= RBTXN_IS_PREPARED;

Won't this flag already be set? The next code
comment ("The prepare info must have been updated ...") made me think
so.

But if it does need to be assigned here, then why are there not the
same assertions about existing IS_PREPARED, SKIPPED and SENT as they
were in the other place where this flag was set?

======
Kind Regards,
Peter Smith.
Fujitsu Australia.

#73Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#72)
2 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Jan 27, 2025 at 7:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Jan 28, 2025 at 4:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Jan 26, 2025 at 10:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 24, 2025 at 12:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 22, 2025 at 7:35 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jan 23, 2025 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 22, 2025 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote:

======
Commit message

typo /RBTXN_IS_PREAPRE/RBTXN_IS_PREPARE/

Will fix.

Also, this code (below) seems to be treating those macros as
unrelated, but IIUC we know that rbtxn_skip_prepared(txn) is not
possible unless rbtxn_is_prepared(txn) is true.

- if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+ if (rbtxn_is_prepared(txn) || rbtxn_skip_prepared(txn))
continue;

Right. We no longer need to check rbtxn_skip_prepared() here.

~~

Furthermore, if we cannot infer that RBTXN_SKIPPED_PREPARE *must* also
be a prepared transaction, then why aren't the macros changed to match
that interpretation?

e.g.

/* prepare for this transaction skipped? */
#define rbtxn_skip_prepared(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0) \
)

/* Has a prepare or stream_prepare already been sent? */
#define rbtxn_sent_prepare(txn) \
( \
(((txn)->txn_flags & RBTXN_IS_PREPARED) != 0) && \
(((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0) \
)

~~~

I think a way to fix all this might be to enforce that the
RBTXN_IS_PREPARED bitflag is also set for the RBTXN_SKIPPED_PREPARE and
RBTXN_SENT_PREPARE constants, removing the ambiguity about how exactly
to interpret those two constants.

e.g. something like

#define RBTXN_IS_PREPARED 0x0040
#define RBTXN_SKIPPED_PREPARE (0x0080 | RBTXN_IS_PREPARED)
#define RBTXN_SENT_PREPARE (0x0200 | RBTXN_IS_PREPARED)

I think the better way would be to ensure that where we set
RBTXN_SENT_PREPARE or RBTXN_SKIPPED_PREPARE, the transaction is a
prepared one (RBTXN_IS_PREPARED must be already set). It should be
already the case for RBTXN_SENT_PREPARE but we can ensure the same for
RBTXN_SKIPPED_PREPARE as well.

Since the patch already does "txn->txn_flags |= (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE);", it's already ensured, no?

I mean to say that we add assert to ensure the same.

I think we need to add both flags in ReorderBufferSkipPrepare(),
because there is a case where a transaction might not be marked as
RBTXN_IS_PREPARED here.

Are you talking about the case when it is invoked from
DecodePrepare()?

Yes. IIUC ReorderBufferSkipPrepare() is called only from DecodePrepare().

I thought we would set the flag in that code path.

I agree that it makes sense to add the flag before calling
ReorderBufferSkipPrepare().

Will that address your concern? Does anyone else have an opinion on this matter?

Yes that would be OK, but should also add some clarifying comments in
the "reorderbuffer.h" like:

#define RBTXN_SKIPPED_PREPARE 0x0080 /* this flag can only be set
for RBTXN_IS_PREPARED transactions */
#define RBTXN_SENT_PREPARE 0x0200 /* this flag can only be set for
RBTXN_IS_PREPARED transactions */

I think the same is true for RBTXN_IS_SERIALIZED and
RBTXN_IS_SERIALIZED_CLEAR; RBTXN_IS_SERIALIZED_CLEAR can only be set
for RBTXN_IS_SERIALIZED transaction. Should we add some comments to
them too? But I'm concerned about having too much explanation if we
add descriptions to flags too while already having comments for
corresponding macros.

Hmm. That RBTXN_IS_SERIALIZED / RBTXN_IS_SERIALIZED_CLEAR pair is used
differently -- it seems more tricky because RBTXN_IS_SERIALIZED flag
is turned OFF again when RBTXN_IS_SERIALIZED_CLEAR is turned ON.
(Whereas setting SKIPPED_PREPARE and SENT_PREPARE will never turn off
the tx type IS_PREPARED)

You're right.
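
As a rough sketch of the difference described above (the bit values are assumed for illustration and the helper names are hypothetical, not taken from reorderbuffer.c): marking a transaction as "serialized clear" replaces the serialized bit, whereas marking a prepare as sent only adds a bit on top of the prepared one.

```c
#include <stdint.h>

/* Bit values are assumed for illustration; check reorderbuffer.h. */
#define RBTXN_IS_SERIALIZED        0x0004
#define RBTXN_IS_SERIALIZED_CLEAR  0x0008
#define RBTXN_IS_PREPARED          0x0040
#define RBTXN_SENT_PREPARE         0x0200

/* Hypothetical helper: IS_SERIALIZED is turned off, _CLEAR is turned on. */
static uint32_t
mark_serialized_clear(uint32_t flags)
{
	flags &= ~(uint32_t) RBTXN_IS_SERIALIZED;
	flags |= RBTXN_IS_SERIALIZED_CLEAR;
	return flags;
}

/* Hypothetical helper: SENT_PREPARE is added; IS_PREPARED stays set. */
static uint32_t
mark_sent_prepare(uint32_t flags)
{
	return flags | RBTXN_SENT_PREPARE;
}
```

So a comment of the form "can only be set for X transactions" would describe the prepare flags accurately, but for the serialized pair it would have to say that X is cleared at the same time.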

To be honest, I didn't understand the "CLEAR" part of that name. It
seems more like it should've been called something like
RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
whatever instead of something that appears to be saying "has the
RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
to over-comment everything but OTOH currently there is no way really
to understand what these flags mean without looking through all the
code to try to figure them out from the usage.

My recurring gripe about these flags is simply that their meanings and
how to use them should be apparent just by looking at reorderbuffer.h
and not having to guess anything or look at how they get used in the
code. It doesn't matter if that is achieved by better constant names,
by more comments or by enhanced macros/functions with asserts but
currently just looking at that file still leaves the reader with lots
of unanswered questions.

I see your point. IIUC we have comments about what the checks on the
flags mean, but no description of the relationships among the flags.
I think we can start a new thread for clarifying these flags and their
usage. We can also discuss renaming RBTXN_IS_SERIALIZED[_CLEAR] there
too.

Yeah, I am fine either way, especially if we decide to add asserts for
RBTXN_IS_PREPARED when we set those flags.

Another way to ensure that is to convert these macros to inline
functions and add an Assert() there, but it seems overkill.

True, but that would ensure, we won't make any coding mistakes which
Peter wants to ensure by writing additional comments but asserting is
probably a better way.

Maybe I misunderstood, but I thought Amit's reply there meant that
rewriting the macros as inline functions with asserts would be a good
way to ensure no coding mistakes. Yet, the macros are still unchanged
in v16-0002.

I forgot to mention: while converting all macros to inline functions
is a good idea, adding assertions in a few key places also makes the
code reasonably robust. The prepared-transaction flags are currently
used only in specific cases, so I think what the patch does also makes
sense.
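
For comparison, a minimal sketch of the inline-function-with-assert alternative discussed above. It is standalone C using the standard assert() rather than PostgreSQL's Assert(), and DemoTXN and both function names are hypothetical stand-ins, not what the patch implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the patch hunks quoted in this thread. */
#define RBTXN_IS_PREPARED   0x0040
#define RBTXN_SENT_PREPARE  0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags field. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Check whether a prepare (or stream_prepare) was already sent. */
static inline bool
demo_sent_prepare(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_SENT_PREPARE) != 0;
}

/* Set SENT_PREPARE, asserting the invariants at the point of the write. */
static inline void
demo_mark_sent_prepare(DemoTXN *txn)
{
	assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);	/* must be prepared */
	assert(!demo_sent_prepare(txn));	/* must not be sent twice */
	txn->txn_flags |= RBTXN_SENT_PREPARE;
}
```

Placing the assertions in a setter checks the invariant wherever the flag is written, while the approach taken in the patch adds assertions only at the specific call sites that set the flags.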

I've attached the updated patch. In the 0002 patch, I've marked the
transaction as a prepared transaction in
ReorderBufferRememberPrepareInfo() so that all prepared transactions
that have a ReorderBufferTXN entry at that time can be marked properly.
And I've put some Assertions to ensure that all prepared transaction
related flags have been set properly. Thoughts?

Here are a couple of other review comments for patch v16-0002

Thank you for the comments!

======
Commit message

1.
The RBTXN_PREPARE flag (and its corresponding macro) have been renamed
to RBTXN_IS_PREPARE to explicitly indicate the transaction
type. Therefore, this commit also adds the RBTXN_IS_PREAPRE flag also
to the transaction that is a prepared transaction and has been
skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag.

Instead of fixing the "RBTXN_IS_PREAPRE" typo, it looks like a new
problem (double "also" in the same sentence) was added in v16.

Fixed.

======
.../replication/logical/reorderbuffer.c

2.
if ((txn->final_lsn < two_phase_at) && is_commit)
{
- txn->txn_flags |= RBTXN_PREPARE;
+ txn->txn_flags |= RBTXN_IS_PREPARED;

Won't this flag already be set? The next code
comment ("The prepare info must have been updated ...") made me think
so.

Good point. The transaction must have both flags: RBTXN_IS_PREPARED
and RBTXN_SKIPPED_PREPARE, unless I'm missing something.

I've attached the updated patches.

BTW, if there are no comments on the 0001 patch, I'm going to push it this week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v17-0001-Skip-logical-decoding-of-already-aborted-transac.patchapplication/octet-stream; name=v17-0001-Skip-logical-decoding-of-already-aborted-transac.patchDownload
From 9f0dacbcdc3dc02c32d90d0d42b6c163e509e1fe Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v17 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction
status, allowing the logical decoding to skip changes also when it
doesn't touch system catalogs, if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. The performance benchmark
results showed there is no noticeable performance regression due to
CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 185 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 245 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 79b60df7cf0..8278e6f2223 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard both the changes collected so far
+	 * and the toast reconstruction data. The full cleanup will happen as part
+	 * of decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2681,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2737,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction as streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2924,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3662,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3705,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3764,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3775,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3794,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..9d9ac2f0830 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

v17-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patchapplication/octet-stream; name=v17-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patchDownload
From 2cb4cc7fc10ec99a10b37a98d16509af7ef9d90d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 13 Jan 2025 10:35:17 -0800
Subject: [PATCH v17 2/2] Rename RBTXN_PREPARE to RBTXN_IS_PREPARED for better
 clarification.

Previously, the RBTXN_PREPARE flag and the rbtxn_prepared macro could
be misinterpreted as indicating either the transaction type (e.g. a
prepared transaction or a normal transaction) or its current state
(e.g. skipped, or its prepare message has been sent), especially after
commit XXX introduced the RBTXN_SENT_PREPARE flag and the
rbtxn_sent_prepare macro.

The RBTXN_PREPARE flag and the rbtxn_prepared macro have been renamed
to RBTXN_IS_PREPARED and rbtxn_is_prepared to explicitly indicate the
transaction type. Accordingly, this commit also sets the
RBTXN_IS_PREPARED flag on a prepared transaction that has been
skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag.

Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
---
 src/backend/replication/logical/proto.c       |  2 +-
 .../replication/logical/reorderbuffer.c       | 53 ++++++++++++-------
 src/backend/replication/logical/snapbuild.c   |  2 +-
 src/include/replication/reorderbuffer.h       |  6 +--
 4 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bef350714db..61b5283a2e1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -163,7 +163,7 @@ logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
 	 * which case we expect to have a valid GID.
 	 */
 	Assert(txn->gid != NULL);
-	Assert(rbtxn_prepared(txn));
+	Assert(rbtxn_is_prepared(txn));
 	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8278e6f2223..b93c85c7a6c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,7 +1793,7 @@ ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn
 	 * and the toast reconstruction data. The full cleanup will happen as part
 	 * of decoding ABORT record of this transaction.
 	 */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 	ReorderBufferToastReset(rb, txn);
 
 	/* All changes should be discarded */
@@ -1968,7 +1968,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	if (rbtxn_prepared(txn))
+	if (rbtxn_is_prepared(txn))
 	{
 		/*
 		 * Note, we send stream prepare even if a concurrent abort is
@@ -2150,7 +2150,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2238,7 +2238,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (!streaming)
 		{
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 				rb->begin_prepare(rb, txn);
 			else
 				rb->begin(rb, txn);
@@ -2280,7 +2280,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * required for the cases when we decode the changes before the
 			 * COMMIT record is processed.
 			 */
-			if (streaming || rbtxn_prepared(change->txn))
+			if (streaming || rbtxn_is_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2625,7 +2625,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
 			 * regular ones).
 			 */
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 			{
 				rb->prepare(rb, txn, commit_lsn);
 				txn->txn_flags |= RBTXN_SENT_PREPARE;
@@ -2679,12 +2679,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * For 4, as the entire txn has been decoded, we can fully clean up
 		 * the TXN reorder buffer.
 		 */
-		if (streaming || rbtxn_prepared(txn))
+		if (streaming || rbtxn_is_prepared(txn))
 		{
 			if (streaming)
 				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2728,7 +2728,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * during a two-phase commit.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
-			(stream_started || rbtxn_prepared(txn)))
+			(stream_started || rbtxn_is_prepared(txn)))
 		{
 			/* curtxn must be set for streaming or prepared transactions */
 			Assert(curtxn);
@@ -2815,7 +2815,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 		 * Removing this txn before a commit might result in the computation
 		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
 		 */
-		if (!rbtxn_prepared(txn))
+		if (!rbtxn_is_prepared(txn))
 			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
@@ -2852,7 +2852,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Record the prepare information for a transaction.
+ * Record the prepare information for a transaction. Also, mark the transaction
+ * as a prepared transaction.
  */
 bool
 ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
@@ -2878,6 +2879,11 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
+	/* Mark this transaction as a prepared transaction */
+	Assert((txn->txn_flags &
+			(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)) == 0);
+	txn->txn_flags |= RBTXN_IS_PREPARED;
+
 	return true;
 }
 
@@ -2893,6 +2899,9 @@ ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return;
 
+	/* txn must have been marked as a prepared transaction */
+	Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+
 	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
 }
 
@@ -2914,12 +2923,17 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_PREPARE;
-	txn->gid = pstrdup(gid);
-
-	/* The prepare info must have been updated in txn by now. */
+	/*
+	 * txn must have been marked as a prepared transaction and must have
+	 * neither been skipped nor sent a prepare. Also, the prepare info must
+	 * have been updated in it by now.
+	 */
+	Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+	Assert((txn->txn_flags & (RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)) == 0);
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	txn->gid = pstrdup(gid);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2975,12 +2989,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 */
 	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
-		txn->txn_flags |= RBTXN_PREPARE;
-
 		/*
-		 * The prepare info must have been updated in txn even if we skip
-		 * prepare.
+		 * txn must have been marked as a prepared transaction and skipped but
+		 * not sent a prepare. Also, the prepare info must have been updated
+		 * in txn even if we skip prepare.
 		 */
+		Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 0);
+		Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
 		Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 		/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de318..05687fd75e5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -761,7 +761,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * We don't need to add snapshot to prepared transactions as they
 		 * should not see the new catalog contents.
 		 */
-		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+		if (rbtxn_is_prepared(txn))
 			continue;
 
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d9ac2f0830..27d134198e3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,7 +170,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SERIALIZED_CLEAR 	0x0008
 #define RBTXN_IS_STREAMED         	0x0010
 #define RBTXN_HAS_PARTIAL_CHANGE  	0x0020
-#define RBTXN_PREPARE             	0x0040
+#define RBTXN_IS_PREPARED 			0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
 #define RBTXN_SENT_PREPARE			0x0200
@@ -234,9 +234,9 @@ typedef struct ReorderBufferChange
  * committed. To check whether a prepare or a stream_prepare has already
  * been sent for this transaction, we need to use rbtxn_sent_prepare().
  */
-#define rbtxn_prepared(txn) \
+#define rbtxn_is_prepared(txn) \
 ( \
-	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+	((txn)->txn_flags & RBTXN_IS_PREPARED) != 0 \
 )
 
 /* Has a prepare or stream_prepare already been sent? */
-- 
2.43.5

#74Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#73)
Re: Skip collecting decoded changes of already-aborted transactions

Some comments for patch v17-0001.

======
Commit message.

1.
typo /noticeble/noticeable/

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

2.
It seemed tricky that the only place that sets the RBTXN_IS_COMMITTED
flag is the function ReorderBufferCheckAndTruncateAbortedTXN, because
neither the function name nor the function comment gives any
indication that it has this side effect.

~~~

ReorderBufferProcessTXN:

3.
  if (rbtxn_prepared(txn))
+ {
  rb->prepare(rb, txn, commit_lsn);
+ txn->txn_flags |= RBTXN_SENT_PREPARE;
+ }

In ReorderBufferStreamCommit there is an assertion that we are not
trying to do another prepare() if the _SENT_PREPARE flag is already
set. Should this code have a similar assert?

======
src/include/replication/reorderbuffer.h

4.
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

IIUC, unlike _SENT_PREPARE, the _IS_COMMITTED and _IS_ABORTED flags
are not quite the same as saying rb->commit() or rb->abort() was
called. Instead, those flags are only set some time later by the
ReorderBufferCheckAndTruncateAbortedTXN() function, based on the
commit log status.

The lag between the commit/abort happening and these flags getting set
seems unintuitive. Should they be named differently -- e.g. maybe
RBTXN_IS_CLOG_COMMITTED, RBTXN_IS_CLOG_ABORTED instead?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#75Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#73)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Jan 28, 2025 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 27, 2025 at 7:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

To be honest, I didn't understand the "CLEAR" part of that name. It
seems more like it should've been called something like
RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
whatever instead of something that appears to be saying "has the
RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
to over-comment everything but OTOH currently there is no way really
to understand what these flags mean without looking through all the
code to try to figure them out from the usage.

My recurring gripe about these flags is simply that their meanings and
how to use them should be apparent just by looking at reorderbuffer.h
and not having to guess anything or look at how they get used in the
code. It doesn't matter if that is achieved by better constant names,
by more comments or by enhanced macros/functions with asserts but
currently just looking at that file still leaves the reader with lots
of unanswered questions.

I see your point. IIUC we have comments about what the checks with
the flags mean, but no description of the relationship among the
flags. I think we can start a new thread for clarifying these flags
and their usage. We can also discuss renaming
RBTXN_IS_SERIALIZED[_CLEAR] there too.

OK.

======

Some comments for patch v17-0002.

======
.../replication/logical/reorderbuffer.c

ReorderBufferSkipPrepare:

1.
+ /* txn must have been marked as a prepared transaction */
+ Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+
  txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

Should this also assert that the _SENT_PREPARE flag is not set, since
we cannot skip the prepare if we have already sent it?

~~~

ReorderBufferFinishPrepared:

2.

- txn->txn_flags |= RBTXN_PREPARE;
-
  /*
- * The prepare info must have been updated in txn even if we skip
- * prepare.
+ * txn must have been marked as a prepared transaction and skipped but
+ * not sent a prepare. Also, the prepare info must have been updated
+ * in txn even if we skip prepare.
  */
+ Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 0);
+ Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
  Assert(txn->final_lsn != InvalidXLogRecPtr);

2a.
If it must have been prepared *and* skipped (as the comment says) then
the first assert should be written as:
Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE))
== (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));

or easier to just have 2 asserts:
Assert(txn->txn_flags & RBTXN_IS_PREPARED);
Assert(txn->txn_flags & RBTXN_SKIPPED_PREPARE);
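
To make the difference between the two forms concrete, here is a
minimal standalone sketch (not part of the patch). The flag values are
copied from the v17 reorderbuffer.h hunk; the helper names
flags_any_set() and flags_all_set() are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the v17 reorderbuffer.h hunk. */
#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	0x0080

/* Hypothetical helper: true if at least one bit of mask is set. */
static bool
flags_any_set(uint32_t flags, uint32_t mask)
{
	return (flags & mask) != 0;
}

/* Hypothetical helper: true only if every bit of mask is set. */
static bool
flags_all_set(uint32_t flags, uint32_t mask)
{
	return (flags & mask) == mask;
}
```

With only RBTXN_IS_PREPARED set, flags_any_set() returns true while
flags_all_set() returns false, so the "!= 0" form would not catch a
transaction that was never marked as skipped.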

~

2b.
later in the same function there is code:

if (is_commit)
rb->commit_prepared(rb, txn, commit_lsn);
else
rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);

So it is OK to do a commit_prepared/rollback_prepared even though no
prepare() has been sent?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#76Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#74)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 29, 2025 at 9:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

Some comments for patch v17-0001.

Thank you for reviewing the patch!

======
Commit message.

1.
typo /noticeble/noticeable/

Fixed.

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

2.
It seemed tricky that the only place that is setting the
RBTXN_IS_COMMITTED flag is the function
ReorderBufferCheckAndTruncateAbortedTXN because neither the function
name nor the function comment gives any indication that it should be
having this side effect

Hmm, it doesn't seem so tricky to me that a function with the name
ReorderBufferCheckAndTruncateAbortedTXN() checks the transaction
status to truncate an aborted transaction and caches the transaction
status as a side effect.

~~~

ReorderBufferProcessTXN:

3.
if (rbtxn_prepared(txn))
+ {
rb->prepare(rb, txn, commit_lsn);
+ txn->txn_flags |= RBTXN_SENT_PREPARE;
+ }

In ReorderBufferStreamCommit there is an assertion that we are not
trying to do another prepare() if the _SENT_PREPARE flag is already
set. Should this code have a similar assert?

We can have a similar assert there but why do you think it's needed there?

======
src/include/replication/reorderbuffer.h

4.
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

IIUC, unlike the _SENT_PREPARE, those _IS_COMMITTED and _IS_ABORTED
flags are not quite the same as saying rb->commit() or rb->abort() was
called. But, those flags are only set some time later by
ReorderBufferCheckAndTruncateAbortedTXN() function based on the commit
log status.

The lag between the commit/abort happening and these flag getting set
seems unintuitive. Should they be named differently -- e.g. maybe
RBTXN_IS_CLOG_COMMITTED, RBTXN_IS_CLOG_ABORTED instead?

I'm not sure these names are better.

In the logical decoding context, we neither commit nor roll back
transactions decoded from WAL records, as the transaction outcomes
come only from WAL records. So I guess it's easy to grasp that
RBTXN_IS_COMMITTED means "this is a committed transaction" rather than
"we committed the transaction". I think this is the same reasoning as
behind renaming RBTXN_PREPARE to RBTXN_IS_PREPARED. Similarly, although
we have rb->commit() and rb->abort(), I would not think of them as
committing or aborting the transaction. So the lag between
->commit()/abort() being called and these flags getting set is not
confusing (at least for me). I think we can leave these names as they
are, and if we need to remember whether a commit message has been
sent, we could add a flag like RBTXN_SENT_COMMIT.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#77Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#76)
Re: Skip collecting decoded changes of already-aborted transactions

On Fri, Jan 31, 2025 at 11:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 29, 2025 at 9:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

2.
It seemed tricky that the only place that is setting the
RBTXN_IS_COMMITTED flag is the function
ReorderBufferCheckAndTruncateAbortedTXN because neither the function
name nor the function comment gives any indication that it should be
having this side effect

Hmm, it doesn't seem so tricky to me that a function with the name
ReorderBufferCheckAndTruncateAbortedTXN() checks the transaction
status to truncate an aborted transaction and caches the transaction
status as a side effect.

I was coming at this from a different perspective, asking myself the
question "When can I know the RBTXN_IS_COMMITTED bit setting?" -- aka
rbtxn_is_committed()?

AFAICT it turns out we can only have confidence in that result when we
know ReorderBufferCheckAndTruncateAbortedTXN was already called for
this tx. But this happens only when ReorderBufferCheckMemoryLimit()
gets called. So, these bitflags are getting set as a side effect of
calling unrelated functions (e.g. the fact that we can't test whether
a tx was aborted/committed unless ReorderBufferCheckMemoryLimit is
called seemed unusual to me). I don't know what the solution is; maybe
some more comments would be enough.
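
The concern can be sketched in standalone C as a simplified model (it
is not the patch code). An unset flag here means "not looked up yet",
not "not committed". Only the two flag values come from the patch; the
MockTxn struct, the clog_lookup() stub and check_and_cache_status()
are hypothetical stand-ins for the real structures and CLOG calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the v17 reorderbuffer.h hunk. */
#define RBTXN_IS_COMMITTED	0x0400
#define RBTXN_IS_ABORTED	0x0800

typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactStatus;

/* Hypothetical, cut-down stand-in for ReorderBufferTXN. */
typedef struct
{
	uint32_t	txn_flags;
	XactStatus	actual;		/* what a real CLOG lookup would report */
	int			lookups;	/* number of simulated CLOG lookups */
} MockTxn;

/* Hypothetical stub standing in for the real CLOG lookup. */
static XactStatus
clog_lookup(MockTxn *txn)
{
	txn->lookups++;
	return txn->actual;
}

/*
 * Return true if the transaction is known to be aborted.  A final
 * outcome (committed or aborted) is cached in txn_flags, so later
 * calls skip the lookup.  Nothing is cached while in progress.
 */
static bool
check_and_cache_status(MockTxn *txn)
{
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
		return true;

	switch (clog_lookup(txn))
	{
		case XACT_IN_PROGRESS:
			return false;
		case XACT_COMMITTED:
			txn->txn_flags |= RBTXN_IS_COMMITTED;
			return false;
		case XACT_ABORTED:
			txn->txn_flags |= RBTXN_IS_ABORTED;
			return true;
	}
	return false;
}
```

In this model, a committed transaction reports neither flag until
check_and_cache_status() has run once, which is the behaviour that a
reader of the flag definitions alone would not expect.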

~~~

ReorderBufferProcessTXN:

3.
if (rbtxn_prepared(txn))
+ {
rb->prepare(rb, txn, commit_lsn);
+ txn->txn_flags |= RBTXN_SENT_PREPARE;
+ }

In ReorderBufferStreamCommit there is an assertion that we are not
trying to do another prepare() if the _SENT_PREPARE flag is already
set. Should this code have a similar assert?

We can have a similar assert there but why do you think it's needed there?

No particular reason, other than for consistency to have similar
assertions everywhere that the RBTXN_SENT_PREPARE flag is set.
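
One way to get that consistency would be to funnel every call site
through a single helper. The following is only a sketch under stated
assumptions: mark_sent_prepare() and the cut-down PrepTxn struct are
hypothetical and do not exist in the patch; only the flag value is
taken from it.

```c
#include <assert.h>
#include <stdint.h>

/* Flag value as it appears in the v17 reorderbuffer.h hunk. */
#define RBTXN_SENT_PREPARE	0x0200

/* Hypothetical, cut-down stand-in for ReorderBufferTXN. */
typedef struct
{
	uint32_t	txn_flags;
} PrepTxn;

/*
 * Hypothetical helper: record that a prepare (or stream_prepare) was
 * sent, asserting that it has not been sent before.
 */
static void
mark_sent_prepare(PrepTxn *txn)
{
	assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
	txn->txn_flags |= RBTXN_SENT_PREPARE;
}
```

Each place that currently ORs RBTXN_SENT_PREPARE into txn_flags would
then get the same check without repeating the Assert by hand.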

======
src/include/replication/reorderbuffer.h

4.
+#define RBTXN_SENT_PREPARE 0x0200
+#define RBTXN_IS_COMMITTED 0x0400
+#define RBTXN_IS_ABORTED 0x0800

IIUC, unlike the _SENT_PREPARE, those _IS_COMMITTED and _IS_ABORTED
flags are not quite the same as saying rb->commit() or rb->abort() was
called. But, those flags are only set some time later by
ReorderBufferCheckAndTruncateAbortedTXN() function based on the commit
log status.

The lag between the commit/abort happening and these flag getting set
seems unintuitive. Should they be named differently -- e.g. maybe
RBTXN_IS_CLOG_COMMITTED, RBTXN_IS_CLOG_ABORTED instead?

I'm not sure these names are better.

In logical decoding context, we neither commit nor rollback
transactions decoded from WAL records as the transaction outcomes come
only from WAL records. So I guess it's easy-to-grasp that
RBTXN_IS_COMMITTED means "this is a committed transaction" but not "we
committed the transaction". I think this is a similar understanding as
what we're trying to rename RBTXN_PREPARE to RBTXN_IS_PREPARE.
Similarly, we have rb->commit() and rb->abort(), I would not think
like we're committing or aborting the transaction. So the lag between
the ->commit()/abort() happening and these flags getting set is not
confusing (at least for me). I think we can leave these names as they
are, and if we need to remember if a commit message has been sent, we
would be able to have a flag like RBTXN_SENT_COMMIT.

OK.
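
As a rough illustration of the "cached status" reading of these flags, the sketch below sets the bits only when the status is looked up, not when the commit or abort actually happens. It loosely mirrors the shape of ReorderBufferCheckAndTruncateAbortedTXN() in the patch, but DemoTxn, DemoXidStatus, demo_lookup_status() and demo_check_aborted() are hypothetical, and the real CLOG/procarray lookups are replaced by a field in the struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as proposed in the patch. */
#define RBTXN_IS_COMMITTED 0x0400
#define RBTXN_IS_ABORTED   0x0800

/* Hypothetical stand-in for the transaction status kept in CLOG. */
typedef enum DemoXidStatus
{
	DEMO_IN_PROGRESS,
	DEMO_COMMITTED,
	DEMO_ABORTED
} DemoXidStatus;

/* Hypothetical, reduced stand-in for ReorderBufferTXN. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
	DemoXidStatus actual_status;	/* what a real lookup would report */
	int			lookups;			/* number of simulated lookups */
} DemoTxn;

/* Simulated status lookup; counts calls so that caching is observable. */
static DemoXidStatus
demo_lookup_status(DemoTxn *txn)
{
	txn->lookups++;
	return txn->actual_status;
}

/* Return true if the transaction is known to be aborted. */
static bool
demo_check_aborted(DemoTxn *txn)
{
	/* Quick return if the status was cached by an earlier call. */
	if (txn->txn_flags & RBTXN_IS_COMMITTED)
		return false;
	if (txn->txn_flags & RBTXN_IS_ABORTED)
		return true;

	switch (demo_lookup_status(txn))
	{
		case DEMO_IN_PROGRESS:
			return false;		/* outcome unknown, cache nothing */
		case DEMO_COMMITTED:
			txn->txn_flags |= RBTXN_IS_COMMITTED;
			return false;
		case DEMO_ABORTED:
			txn->txn_flags |= RBTXN_IS_ABORTED;
			return true;
	}
	return false;
}
```

In this sketch the flag means "we have learned that this transaction aborted (or committed)", which is why it can lag behind the actual outcome.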

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#78 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#75)
2 attachment(s)
Re: Skip collecting decoded changes of already-aborted transactions

On Wed, Jan 29, 2025 at 11:12 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Jan 28, 2025 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 27, 2025 at 7:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

To be honest, I didn't understand the "CLEAR" part of that name. It
seems more like it should've been called something like
RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
whatever instead of something that appears to be saying "has the
RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
to over-comment everything but OTOH currently there is no way really
to understand what these flags mean without looking through all the
code to try to figure them out from the usage.

My recurring gripe about these flags is simply that their meanings and
how to use them should be apparent just by looking at reorderbuffer.h
and not having to guess anything or look at how they get used in the
code. It doesn't matter if that is achieved by better constant names,
by more comments or by enhanced macros/functions with asserts but
currently just looking at that file still leaves the reader with lots
of unanswered questions.

I see your point. IIUC we have comments about what the checks with
the flags mean, but no description of the relationship among the
flags. I think we can start a new thread for clarifying these flags
and their usage. We can discuss renaming
RBTXN_IS_SERIALIZED[_CLEAR] there too.

OK.

======

Some comments for patch v17-0002.

Thank you for reviewing the patch.

======
.../replication/logical/reorderbuffer.c

ReorderBufferSkipPrepare:

1.
+ /* txn must have been marked as a prepared transaction */
+ Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+
txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

Should this also be asserting that the _SENT_PREPARE flag is false,
because we cannot be skipping it if we already sent the prepare?

~~~

ReorderBufferFinishPrepared:

2.

- txn->txn_flags |= RBTXN_PREPARE;
-
/*
- * The prepare info must have been updated in txn even if we skip
- * prepare.
+ * txn must have been marked as a prepared transaction and skipped but
+ * not sent a prepare. Also, the prepare info must have been updated
+ * in txn even if we skip prepare.
*/
+ Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 0);
+ Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
Assert(txn->final_lsn != InvalidXLogRecPtr);

2a.
If it must have been prepared *and* skipped (as the comment says) then
the first assert should be written as:
Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE))
== (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));

or easier to just have 2 asserts:
Assert(txn->txn_flags & RBTXN_IS_PREPARED);
Assert(txn->txn_flags & RBTXN_SKIPPED_PREPARE);

Agreed with all the above comments. Since checking the
prepared-transaction-related flags is getting complicated, I've
introduced RBTXN_PREPARE_STATUS_FLAGS so that we can check the desired
prepared transaction status easily.
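
To make the difference in 2a concrete, here is a small sketch contrasting an "any of these bits" test with an "all of these bits" test, plus a mask-based exact check. The values 0x0080 and 0x0200 come from patch 0001. RBTXN_IS_PREPARED is assumed to reuse the 0x0040 bit of RBTXN_PREPARE, and DEMO_PREPARE_STATUS_FLAGS is only a guess at what RBTXN_PREPARE_STATUS_FLAGS might contain, since its definition is not quoted in this message. The demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RBTXN_IS_PREPARED     0x0040	/* assumed: same bit as RBTXN_PREPARE */
#define RBTXN_SKIPPED_PREPARE 0x0080
#define RBTXN_SENT_PREPARE    0x0200

/* Assumed contents of the status mask; not taken from the patch. */
#define DEMO_PREPARE_STATUS_FLAGS \
	(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)

/* True if at least one bit of mask is set (what "!= 0" checks). */
static bool
demo_any_of(uint32_t flags, uint32_t mask)
{
	return (flags & mask) != 0;
}

/* True only if every bit of mask is set (what "== mask" checks). */
static bool
demo_all_of(uint32_t flags, uint32_t mask)
{
	return (flags & mask) == mask;
}

/*
 * True if, among the prepare-status bits, exactly the expected ones are
 * set. Bits outside the mask are ignored.
 */
static bool
demo_prepare_status_is(uint32_t flags, uint32_t expected)
{
	return (flags & DEMO_PREPARE_STATUS_FLAGS) == expected;
}
```

The exact check covers both asserts from ReorderBufferFinishPrepared at once: it requires the prepared and skipped bits while also rejecting a transaction whose prepare was already sent.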

~

2b.
later in the same function there is code:

if (is_commit)
rb->commit_prepared(rb, txn, commit_lsn);
else
rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);

So it is OK to do a commit_prepared/rollback_prepared even though no
prepare() has been sent?

IIUC ReorderBufferReplay() is responsible for sending a prepare
message in this case. See the comment around there:

/*
* By this time the txn has the prepare record information and it is
* important to use that so that downstream gets the accurate
* information. If instead, we have passed commit information here
* then downstream can behave as it has already replayed commit
* prepared after the restart.
*/

I've attached the updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v18-0001-Skip-logical-decoding-of-already-aborted-transac.patch (application/octet-stream)
From cb595551629e445d1c5f1425978a477fb069b196 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 29 Oct 2024 13:21:18 -0700
Subject: [PATCH v18 1/2] Skip logical decoding of already-aborted
 transactions.

Previously, transaction aborts were detected concurrently only during
system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction
status, allowing the logical decoding to skip changes also when it
doesn't touch system catalogs, if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small
transactions (most of which commit), the CLOG lookup is performed only
for large transactions before eviction. The performance benchmark
results showed there is no noticeable performance regression due to
CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 contrib/test_decoding/expected/stats.out      |  42 +++-
 contrib/test_decoding/expected/stream.out     |   6 +
 contrib/test_decoding/sql/stats.sql           |  20 +-
 contrib/test_decoding/sql/stream.sql          |   6 +
 .../replication/logical/reorderbuffer.c       | 186 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  32 ++-
 6 files changed, 246 insertions(+), 46 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index 78d36429c8a..de6dc416130 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -138,12 +138,46 @@ SELECT slot_name FROM pg_stat_replication_slots;
 (3 rows)
 
 COMMIT;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+ ?column? 
+----------
+ init
+(1 row)
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ count 
+-------
+     1
+(1 row)
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+            slot_name            | spill_txns | spill_count 
+---------------------------------+------------+-------------
+ regression_slot_stats4_twophase |          0 |           0
+(1 row)
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
- pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
---------------------------+--------------------------+--------------------------
-                          |                          | 
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
+ pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot | pg_drop_replication_slot 
+--------------------------+--------------------------+--------------------------+--------------------------
+                          |                          |                          | 
 (1 row)
 
diff --git a/contrib/test_decoding/expected/stream.out b/contrib/test_decoding/expected/stream.out
index a76f77601e2..9879e02ca84 100644
--- a/contrib/test_decoding/expected/stream.out
+++ b/contrib/test_decoding/expected/stream.out
@@ -114,7 +114,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -128,6 +133,7 @@ SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
      5
 (1 row)
 
+RESET debug_logical_replication_streaming;
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/stats.sql b/contrib/test_decoding/sql/stats.sql
index 630371f147a..a022fe1bf07 100644
--- a/contrib/test_decoding/sql/stats.sql
+++ b/contrib/test_decoding/sql/stats.sql
@@ -50,7 +50,25 @@ SELECT slot_name FROM pg_stat_replication_slots;
 SELECT slot_name FROM pg_stat_replication_slots;
 COMMIT;
 
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_stats4_twophase', 'test_decoding', false, true) s4;
+
+-- The INSERT changes are large enough to be spilled but will not be, because
+-- the transaction is aborted. The logical decoding skips collecting further
+-- changes too. The transaction is prepared to make sure the decoding processes
+-- the aborted transaction.
+BEGIN;
+INSERT INTO stats_test SELECT 'serialize-toobig--1:'||g.i FROM generate_series(1, 5000) g(i);
+PREPARE TRANSACTION 'test1_abort';
+ROLLBACK PREPARED 'test1_abort';
+SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot_stats4_twophase', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Verify that the decoding doesn't spill already-aborted transaction's changes.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns, spill_count FROM pg_stat_replication_slots WHERE slot_name = 'regression_slot_stats4_twophase';
+
 DROP TABLE stats_test;
 SELECT pg_drop_replication_slot('regression_slot_stats1'),
     pg_drop_replication_slot('regression_slot_stats2'),
-    pg_drop_replication_slot('regression_slot_stats3');
+    pg_drop_replication_slot('regression_slot_stats3'),
+    pg_drop_replication_slot('regression_slot_stats4_twophase');
diff --git a/contrib/test_decoding/sql/stream.sql b/contrib/test_decoding/sql/stream.sql
index 7f43f0c2ab7..f1269403e0a 100644
--- a/contrib/test_decoding/sql/stream.sql
+++ b/contrib/test_decoding/sql/stream.sql
@@ -49,7 +49,12 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'incl
  * detect that the subtransaction was aborted, and reset the transaction while having
  * the TOAST changes in memory, resulting in deallocating both decoded changes and
  * TOAST reconstruction data. Memory usage counters must be updated correctly.
+ *
+ * Set debug_logical_replication_streaming to 'immediate' to disable the transaction
+ * status check happening before streaming the second insertion, so we can detect a
+ * concurrent abort while streaming.
  */
+SET debug_logical_replication_streaming = immediate;
 BEGIN;
 INSERT INTO stream_test SELECT repeat(string_agg(to_char(g.i, 'FM0000'), ''), 50) FROM generate_series(1, 500) g(i);
 ALTER TABLE stream_test ADD COLUMN i INT;
@@ -58,6 +63,7 @@ INSERT INTO stream_test(data, i) SELECT repeat(string_agg(to_char(g.i, 'FM0000')
 ROLLBACK TO s1;
 COMMIT;
 SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+RESET debug_logical_replication_streaming;
 
 DROP TABLE stream_test;
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 10a37667a51..ed5a2946dc1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -260,6 +261,8 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
+static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -793,11 +796,11 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 	/*
-	 * While streaming the previous changes we have detected that the
-	 * transaction is aborted.  So there is no point in collecting further
-	 * changes for it.
+	 * If we have detected that the transaction is aborted while streaming the
+	 * previous changes or by checking its CLOG, there is no point in
+	 * collecting further changes for it.
 	 */
-	if (txn->concurrent_abort)
+	if (rbtxn_is_aborted(txn))
 	{
 		/*
 		 * We don't need to update memory accounting for this change as we
@@ -1620,8 +1623,9 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 /*
  * Discard changes from a transaction (and subtransactions), either after
- * streaming or decoding them at PREPARE. Keep the remaining info -
- * transactions, tuplecids, invalidations and snapshots.
+ * streaming, decoding them at PREPARE, or detecting the transaction abort.
+ * Keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.
  *
  * We additionally remove tuplecids after decoding the transaction at prepare
  * time as we only need to perform invalidation at rollback or commit prepared.
@@ -1650,6 +1654,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
+		ReorderBufferMaybeMarkTXNStreamed(rb, subtxn);
 		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
@@ -1680,24 +1685,6 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	/* Update the memory counter */
 	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, mem_freed);
 
-	/*
-	 * Mark the transaction as streamed.
-	 *
-	 * The top-level transaction, is marked as streamed always, even if it
-	 * does not contain any changes (that is, when all the changes are in
-	 * subtransactions).
-	 *
-	 * For subtransactions, we only mark them as streamed when there are
-	 * changes in them.
-	 *
-	 * We do it this way because of aborts - we don't want to send aborts for
-	 * XIDs the downstream is not aware of. And of course, it always knows
-	 * about the toplevel xact (we send the XID in all messages), but we never
-	 * stream XIDs of empty subxacts.
-	 */
-	if ((!txn_prepared) && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
-		txn->txn_flags |= RBTXN_IS_STREAMED;
-
 	if (txn_prepared)
 	{
 		/*
@@ -1752,6 +1739,76 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 	txn->nentries = 0;
 }
 
+/*
+ * Check the transaction status by CLOG lookup and discard all changes if
+ * the transaction is aborted. The transaction status is cached in
+ * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
+ * next call.
+ *
+ * Return true if the transaction is aborted, otherwise return false.
+ *
+ * When the 'debug_logical_replication_streaming' is set to "immediate", we
+ * don't check the transaction status, meaning the caller will always process
+ * this transaction.
+ */
+static bool
+ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/* Quick return for regression tests */
+	if (unlikely(debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
+		return false;
+
+	/*
+	 * Quick return if the transaction status is already known.
+	 */
+
+	if (rbtxn_is_committed(txn))
+		return false;
+	if (rbtxn_is_aborted(txn))
+	{
+		/* Already-aborted transactions should not have any changes */
+		Assert(txn->size == 0);
+
+		return true;
+	}
+
+	/* Otherwise, check the transaction status using CLOG lookup */
+
+	if (TransactionIdIsInProgress(txn->xid))
+		return false;
+
+	if (TransactionIdDidCommit(txn->xid))
+	{
+		/*
+		 * Remember the transaction is committed so that we can skip CLOG
+		 * check next time, avoiding the pressure on CLOG lookup.
+		 */
+		Assert(!rbtxn_is_aborted(txn));
+		txn->txn_flags |= RBTXN_IS_COMMITTED;
+		return false;
+	}
+
+	/*
+	 * The transaction aborted. We discard both the changes collected so far
+	 * and the toast reconstruction data. The full cleanup will happen as part
+	 * of decoding ABORT record of this transaction.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferToastReset(rb, txn);
+
+	/* All changes should be discarded */
+	Assert(txn->size == 0);
+
+	/*
+	 * Mark the transaction as aborted so we can ignore future changes of this
+	 * transaction.
+	 */
+	Assert(!rbtxn_is_committed(txn));
+	txn->txn_flags |= RBTXN_IS_ABORTED;
+
+	return true;
+}
+
 /*
  * Build a hash with a (relfilelocator, ctid) -> (cmin, cmax) mapping for use by
  * HeapTupleSatisfiesHistoricMVCC.
@@ -1917,7 +1974,9 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * Note, we send stream prepare even if a concurrent abort is
 		 * detected. See DecodePrepare for more information.
 		 */
+		Assert(!rbtxn_sent_prepare(txn));
 		rb->stream_prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
 
 		/*
 		 * This is a PREPARED transaction, part of a two-phase commit. The
@@ -2052,6 +2111,30 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 												  txn, command_id);
 }
 
+/*
+ * Mark the given transaction as streamed if it's a top-level transaction
+ * or has changes.
+ */
+static void
+ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	/*
+	 * The top-level transaction, is marked as streamed always, even if it
+	 * does not contain any changes (that is, when all the changes are in
+	 * subtransactions).
+	 *
+	 * For subtransactions, we only mark them as streamed when there are
+	 * changes in them.
+	 *
+	 * We do it this way because of aborts - we don't want to send aborts for
+	 * XIDs the downstream is not aware of. And of course, it always knows
+	 * about the top-level xact (we send the XID in all messages), but we
+	 * never stream XIDs of empty subxacts.
+	 */
+	if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
+		txn->txn_flags |= RBTXN_IS_STREAMED;
+}
+
 /*
  * Helper function for ReorderBufferProcessTXN to handle the concurrent
  * abort of the streaming transaction.  This resets the TXN such that it
@@ -2543,7 +2626,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * regular ones).
 			 */
 			if (rbtxn_prepared(txn))
+			{
+				Assert(!rbtxn_sent_prepare(txn));
 				rb->prepare(rb, txn, commit_lsn);
+				txn->txn_flags |= RBTXN_SENT_PREPARE;
+			}
 			else
 				rb->commit(rb, txn, commit_lsn);
 		}
@@ -2595,6 +2682,9 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming || rbtxn_prepared(txn))
 		{
+			if (streaming)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
+
 			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
@@ -2648,7 +2738,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			FlushErrorState();
 			FreeErrorData(errdata);
 			errdata = NULL;
-			curtxn->concurrent_abort = true;
+
+			/* Remember the transaction is aborted. */
+			Assert(!rbtxn_is_committed(curtxn));
+			curtxn->txn_flags |= RBTXN_IS_ABORTED;
+
+			/* Mark the transaction is streamed if appropriate */
+			if (stream_started)
+				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
@@ -2828,15 +2925,15 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
-	 * We send the prepare for the concurrently aborted xacts so that later
-	 * when rollback prepared is decoded and sent, the downstream should be
-	 * able to rollback such a xact. See comments atop DecodePrepare.
-	 *
-	 * Note, for the concurrent_abort + streaming case a stream_prepare was
-	 * already sent within the ReorderBufferReplay call above.
+	 * Send a prepare if not already done so. This might occur if we have
+	 * detected a concurrent abort while replaying the non-streaming
+	 * transaction.
 	 */
-	if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
+	if (!rbtxn_sent_prepare(txn))
+	{
 		rb->prepare(rb, txn, txn->final_lsn);
+		txn->txn_flags |= RBTXN_SENT_PREPARE;
+	}
 }
 
 /*
@@ -3566,7 +3663,8 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
 }
 
 /*
- * Find the largest streamable toplevel transaction to evict (by streaming).
+ * Find the largest streamable (and non-aborted) toplevel transaction to evict
+ * (by streaming).
  *
  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
  * should give us the same transaction (because we don't update memory account
@@ -3608,9 +3706,15 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
 		/* base_snapshot must be set */
 		Assert(txn->base_snapshot != NULL);
 
+		/* Don't consider these kinds of transactions for eviction. */
+		if (rbtxn_has_partial_change(txn) ||
+			!rbtxn_has_streamable_change(txn) ||
+			rbtxn_is_aborted(txn))
+			continue;
+
+		/* Find the largest of the eviction candidates. */
 		if ((largest == NULL || txn->total_size > largest_size) &&
-			(txn->total_size > 0) && !(rbtxn_has_partial_change(txn)) &&
-			rbtxn_has_streamable_change(txn))
+			(txn->total_size > 0))
 		{
 			largest = txn;
 			largest_size = txn->total_size;
@@ -3661,8 +3765,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			rb->size > 0))
 	{
 		/*
-		 * Pick the largest transaction and evict it from memory by streaming,
-		 * if possible.  Otherwise, spill to disk.
+		 * Pick the largest non-aborted transaction and evict it from memory
+		 * by streaming, if possible.  Otherwise, spill to disk.
 		 */
 		if (ReorderBufferCanStartStreaming(rb) &&
 			(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
@@ -3672,6 +3776,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->total_size > 0);
 			Assert(rb->size >= txn->total_size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferStreamTXN(rb, txn);
 		}
 		else
@@ -3687,6 +3795,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			Assert(txn->size > 0);
 			Assert(rb->size >= txn->size);
 
+			/* skip the transaction if aborted */
+			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
+				continue;
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a669658b3f1..9d9ac2f0830 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -173,6 +173,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_PREPARE             	0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
+#define RBTXN_SENT_PREPARE			0x0200
+#define RBTXN_IS_COMMITTED			0x0400
+#define RBTXN_IS_ABORTED			0x0800
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -224,12 +227,36 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
-/* Has this transaction been prepared? */
+/*
+ * Is this a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */
 #define rbtxn_prepared(txn) \
 ( \
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has a prepare or stream_prepare already been sent? */
+#define rbtxn_sent_prepare(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
+)
+
+/* Is this transaction committed? */
+#define rbtxn_is_committed(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0 \
+)
+
+/* Is this transaction aborted? */
+#define rbtxn_is_aborted(txn) \
+( \
+	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
+)
+
 /* prepare for this transaction skipped? */
 #define rbtxn_skip_prepared(txn) \
 ( \
@@ -419,9 +446,6 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
-	/* If we have detected concurrent abort then ignore future changes. */
-	bool		concurrent_abort;
-
 	/*
 	 * Private data pointer of the output plugin.
 	 */
-- 
2.43.5

v18-0002-Rename-RBTXN_PREPARE-to-RBTXN_IS_PREPARE-for-bet.patch (application/octet-stream)
From 76a08c6daaf2502f5c0946a1e2133ddcf35e14e5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 13 Jan 2025 10:35:17 -0800
Subject: [PATCH v18 2/2] Rename RBTXN_PREPARE to RBTXN_IS_PREPARE for better
 clarification.

Previously, RBTXN_PREPARE flag and rbtxn_prepared macro could be
misinterpreted as either indicating the transaction type (e.g. a
prepared transaction or a normal transaction) or its current
state (e.g. skipped or its prepare message is sent), especially after
commit XXX introduced the RBTXN_SENT_PREPARE flag and the
rbtxn_sent_prepare macro.

The RBTXN_PREPARE flag (and its corresponding macro) have been renamed
to RBTXN_IS_PREPARE to explicitly indicate the transaction
type. Therefore, this commit also adds the RBTXN_IS_PREPARE flag to
the transaction that is a prepared transaction and has been skipped,
which previously had only the RBTXN_SKIPPED_PREPARE flag.

Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
---
 src/backend/replication/logical/proto.c       |  2 +-
 .../replication/logical/reorderbuffer.c       | 50 ++++++++++++-------
 src/backend/replication/logical/snapbuild.c   |  2 +-
 src/include/replication/reorderbuffer.h       |  8 +--
 4 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index dc72b7c8f77..1a352b542dc 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -164,7 +164,7 @@ logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
 	 * which case we expect to have a valid GID.
 	 */
 	Assert(txn->gid != NULL);
-	Assert(rbtxn_prepared(txn));
+	Assert(rbtxn_is_prepared(txn));
 	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ed5a2946dc1..71502e241c7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,7 +1793,7 @@ ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn
 	 * and the toast reconstruction data. The full cleanup will happen as part
 	 * of decoding ABORT record of this transaction.
 	 */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 	ReorderBufferToastReset(rb, txn);
 
 	/* All changes should be discarded */
@@ -1968,7 +1968,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	if (rbtxn_prepared(txn))
+	if (rbtxn_is_prepared(txn))
 	{
 		/*
 		 * Note, we send stream prepare even if a concurrent abort is
@@ -2150,7 +2150,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2238,7 +2238,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (!streaming)
 		{
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 				rb->begin_prepare(rb, txn);
 			else
 				rb->begin(rb, txn);
@@ -2280,7 +2280,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * required for the cases when we decode the changes before the
 			 * COMMIT record is processed.
 			 */
-			if (streaming || rbtxn_prepared(change->txn))
+			if (streaming || rbtxn_is_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2625,7 +2625,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
 			 * regular ones).
 			 */
-			if (rbtxn_prepared(txn))
+			if (rbtxn_is_prepared(txn))
 			{
 				Assert(!rbtxn_sent_prepare(txn));
 				rb->prepare(rb, txn, commit_lsn);
@@ -2680,12 +2680,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * For 4, as the entire txn has been decoded, we can fully clean up
 		 * the TXN reorder buffer.
 		 */
-		if (streaming || rbtxn_prepared(txn))
+		if (streaming || rbtxn_is_prepared(txn))
 		{
 			if (streaming)
 				ReorderBufferMaybeMarkTXNStreamed(rb, txn);
 
-			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_is_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2729,7 +2729,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * during a two-phase commit.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
-			(stream_started || rbtxn_prepared(txn)))
+			(stream_started || rbtxn_is_prepared(txn)))
 		{
 			/* curtxn must be set for streaming or prepared transactions */
 			Assert(curtxn);
@@ -2816,7 +2816,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 		 * Removing this txn before a commit might result in the computation
 		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
 		 */
-		if (!rbtxn_prepared(txn))
+		if (!rbtxn_is_prepared(txn))
 			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
@@ -2853,7 +2853,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Record the prepare information for a transaction.
+ * Record the prepare information for a transaction. Also, mark the transaction
+ * as a prepared transaction.
  */
 bool
 ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
@@ -2879,6 +2880,10 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
+	/* Mark this transaction as a prepared transaction */
+	Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_FLAGS) == 0);
+	txn->txn_flags |= RBTXN_IS_PREPARED;
+
 	return true;
 }
 
@@ -2894,6 +2899,8 @@ ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return;
 
+	/* txn must have been marked as a prepared transaction */
+	Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_FLAGS) == RBTXN_IS_PREPARED);
 	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
 }
 
@@ -2915,12 +2922,16 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	if (txn == NULL)
 		return;
 
-	txn->txn_flags |= RBTXN_PREPARE;
-	txn->gid = pstrdup(gid);
-
-	/* The prepare info must have been updated in txn by now. */
+	/*
+	 * txn must have been marked as a prepared transaction and must have
+	 * neither been skipped nor sent a prepare. Also, the prepare info must
+	 * have been updated in it by now.
+	 */
+	Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_FLAGS) == RBTXN_IS_PREPARED);
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
+	txn->gid = pstrdup(gid);
+
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
 						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
@@ -2976,12 +2987,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 */
 	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
-		txn->txn_flags |= RBTXN_PREPARE;
-
 		/*
-		 * The prepare info must have been updated in txn even if we skip
-		 * prepare.
+		 * txn must have been marked as a prepared transaction and skipped but
+		 * not sent a prepare. Also, the prepare info must have been updated
+		 * in txn even if we skip prepare.
 		 */
+		Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_FLAGS) ==
+			   (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));
 		Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 		/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de318..05687fd75e5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -761,7 +761,7 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		 * We don't need to add snapshot to prepared transactions as they
 		 * should not see the new catalog contents.
 		 */
-		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+		if (rbtxn_is_prepared(txn))
 			continue;
 
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9d9ac2f0830..042d17f4f01 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -170,13 +170,15 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_SERIALIZED_CLEAR 	0x0008
 #define RBTXN_IS_STREAMED         	0x0010
 #define RBTXN_HAS_PARTIAL_CHANGE  	0x0020
-#define RBTXN_PREPARE             	0x0040
+#define RBTXN_IS_PREPARED 			0x0040
 #define RBTXN_SKIPPED_PREPARE	  	0x0080
 #define RBTXN_HAS_STREAMABLE_CHANGE	0x0100
 #define RBTXN_SENT_PREPARE			0x0200
 #define RBTXN_IS_COMMITTED			0x0400
 #define RBTXN_IS_ABORTED			0x0800
 
+#define RBTXN_PREPARE_STATUS_FLAGS	(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)
+
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
 ( \
@@ -234,9 +236,9 @@ typedef struct ReorderBufferChange
  * committed. To check whether a prepare or a stream_prepare has already
  * been sent for this transaction, we need to use rbtxn_sent_prepare().
  */
-#define rbtxn_prepared(txn) \
+#define rbtxn_is_prepared(txn) \
 ( \
-	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+	((txn)->txn_flags & RBTXN_IS_PREPARED) != 0 \
 )
 
 /* Has a prepare or stream_prepare already been sent? */
-- 
2.43.5

#79Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#77)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Jan 30, 2025 at 7:07 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Jan 31, 2025 at 11:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 29, 2025 at 9:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

======
.../replication/logical/reorderbuffer.c

ReorderBufferCheckAndTruncateAbortedTXN:

2.
It seemed tricky that the only place that is setting the
RBTXN_IS_COMMITTED flag is the function
ReorderBufferCheckAndTruncateAbortedTXN because neither the function
name nor the function comment gives any indication that it should be
having this side effect

Hmm, it doesn't seem so tricky to me that a function with the name
ReorderBufferCheckAndTruncateAbortedTXN() checks the transaction
status to truncate an aborted transaction and caches the transaction
status as a side effect.

I was coming at this from a different perspective, asking myself the
question "When can I know the RBTXN_IS_COMMITTED bit setting?" -- aka
rbtxn_is_committed()?

AFAICT it turns out we can only have confidence in that result when we
know ReorderBufferCheckAndTruncateAbortedTXN was called already for
this tx. But this happens only when ReorderBufferCheckMemoryLimit()
gets called. So, these bitflags are getting set as a side-effect of
calling unrelated functions. (e.g. the fact we can't test if a tx was
aborted/committed unless ReorderBufferCheckMemoryLimit is called
seemed unusual to me).

I'm not sure if ReorderBufferCheckMemoryLimit() is an unrelated
function because the whole idea (also mentioned in the commit message)
is that we check the transaction status only for large transactions to
avoid CLOG lookup overheads. TBH I'm not sure why readers expect these
transaction status flags to always be set. Also in the function
comment we have:

* the transaction is aborted. The transaction status is cached in
* txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
* next call.

which describes the side-effect of the function that it caches the
transaction status.

I don't know what the solution is; maybe some
more comments would be enough.

I'm not sure how we can improve the comment TBH.

~~~

ReorderBufferProcessTXN:

3.
if (rbtxn_prepared(txn))
+ {
rb->prepare(rb, txn, commit_lsn);
+ txn->txn_flags |= RBTXN_SENT_PREPARE;
+ }

In ReorderBufferStreamCommit there is an assertion that we are not
trying to do another prepare() if the _SENT_PREPARE flag is already
set. Should this code have a similar assert?

We can have a similar assert there but why do you think it's needed there?

No particular reason, other than for consistency to have similar
assertions everywhere that the RBTXN_SENT_PREPARE flag is set.

Okay, addressed in the v18 patch I've just sent[1].

Regards,

[1]: /messages/by-id/CAD21AoDmYZtLnPLuiERT6Cibv1Gf1DwDjzBevtqKYn0ZzMQqBQ@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#80Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#78)
Re: Skip collecting decoded changes of already-aborted transactions

On Mon, Feb 3, 2025 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 29, 2025 at 11:12 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Jan 28, 2025 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 27, 2025 at 7:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

To be honest, I didn't understand the "CLEAR" part of that name. It
seems more like it should've been called something like
RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
whatever instead of something that appears to be saying "has the
RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
to over-comment everything but OTOH currently there is no way really
to understand what these flags mean without looking through all the
code to try to figure them out from the usage.

My recurring gripe about these flags is simply that their meanings and
how to use them should be apparent just by looking at reorderbuffer.h
and not having to guess anything or look at how they get used in the
code. It doesn't matter if that is achieved by better constant names,
by more comments or by enhanced macros/functions with asserts but
currently just looking at that file still leaves the reader with lots
of unanswered questions.

I see your point. IIUC we have comments about what each flag check
means, but no description of the relationship among the flags. I think
we can start a new thread for clarifying these flags and their usage.
We can also discuss renaming RBTXN_IS_SERIALIZED[_CLEAR] there too.

OK.

======

Some comments for patch v17-0002.

Thank you for reviewing the patch.

======
.../replication/logical/reorderbuffer.c

ReorderBufferSkipPrepare:

1.
+ /* txn must have been marked as a prepared transaction */
+ Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
+
txn->txn_flags |= RBTXN_SKIPPED_PREPARE;

Should this also be asserting that the _SENT_PREPARE flag is false,
because we cannot be skipping it if we already sent the prepare?

~~~

ReorderBufferFinishPrepared:

2.

- txn->txn_flags |= RBTXN_PREPARE;
-
/*
- * The prepare info must have been updated in txn even if we skip
- * prepare.
+ * txn must have been marked as a prepared transaction and skipped but
+ * not sent a prepare. Also, the prepare info must have been updated
+ * in txn even if we skip prepare.
*/
+ Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 0);
+ Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
Assert(txn->final_lsn != InvalidXLogRecPtr);

2a.
If it must have been prepared *and* skipped (as the comment says) then
the first assert should be written as:
Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE))
== (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));

or easier to just have 2 asserts:
Assert(txn->txn_flags & RBTXN_IS_PREPARED);
Assert(txn->txn_flags & RBTXN_SKIPPED_PREPARE);
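
The distinction in 2a can be shown in isolation. The sketch below contrasts the two tests: comparing the masked value with zero passes when any one of the bits is set, while comparing it with the mask itself passes only when all of them are set. The helper names flags_any() and flags_all() are hypothetical and exist only for this illustration. The two flag values are the ones from the reorderbuffer.h hunk in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the reorderbuffer.h hunk in this thread. */
#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	0x0080

/* True if at least one bit of mask is set in flags ("any"). */
static bool
flags_any(uint32_t flags, uint32_t mask)
{
	return (flags & mask) != 0;
}

/* True only if every bit of mask is set in flags ("all"). */
static bool
flags_all(uint32_t flags, uint32_t mask)
{
	return (flags & mask) == mask;
}
```

With mask = RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE, a transaction that is prepared but not skipped satisfies flags_any() and fails flags_all(), so an assertion written with != 0 would not catch that state.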

Agreed with all the above comments. Since checking
prepared-transaction-related-flags is getting complicated I've
introduced RBTXN_PREPARE_STATUS_FLAGS so that we can check the desired
prepared transaction status easily.
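
As a rough illustration of how such a combined mask lets one comparison express "exactly these prepare-related bits and no others", consider the sketch below. The flag values and the RBTXN_PREPARE_STATUS_FLAGS definition are copied from the v18 hunk quoted in this thread. The helper prepare_status_is() is a hypothetical name that does not appear in the patch, which writes the comparison inline inside Assert().

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the reorderbuffer.h hunk in this thread. */
#define RBTXN_IS_STREAMED		0x0010
#define RBTXN_IS_PREPARED		0x0040
#define RBTXN_SKIPPED_PREPARE	0x0080
#define RBTXN_SENT_PREPARE		0x0200

#define RBTXN_PREPARE_STATUS_FLAGS \
	(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)

/*
 * Hypothetical helper: true if the prepare-related bits of txn_flags are
 * exactly 'expected'.  Bits outside the mask are ignored.
 */
static bool
prepare_status_is(uint32_t txn_flags, uint32_t expected)
{
	return (txn_flags & RBTXN_PREPARE_STATUS_FLAGS) == expected;
}
```

For example, the check in ReorderBufferPrepare() corresponds to prepare_status_is(flags, RBTXN_IS_PREPARED), which fails if either the skipped or the sent bit is also set, and the one in ReorderBufferFinishPrepared() corresponds to prepare_status_is(flags, RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE).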

~

2b.
later in the same function there is code:

if (is_commit)
rb->commit_prepared(rb, txn, commit_lsn);
else
rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);

So it is OK to do a commit_prepared/rollback_prepared even though no
prepare() has been sent?

IIUC ReorderBufferReplay() is responsible for sending a prepare
message in this case. See the comment around there:

/*
* By this time the txn has the prepare record information and it is
* important to use that so that downstream gets the accurate
* information. If instead, we have passed commit information here
* then downstream can behave as it has already replayed commit
* prepared after the restart.
*/

I've attached the updated patches.

If there are no further comments on v18 patches, I'm going to push
them tomorrow.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#81Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#78)
Re: Skip collecting decoded changes of already-aborted transactions

Hi. Here are some minor comments for the v18* patch set.

//////////

Patch v18-0001

1.1. Commit message

A previously reported typo still exists:

/noticeble/noticeable/

//////////

Patch v18-0002

2.1
+#define RBTXN_PREPARE_STATUS_FLAGS (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)
+

AFAICT bitmasks like this are more commonly named with a _MASK suffix.

How about something like:
- RBTXN_PREPARE_MASK
- RBTXN_PREPARE_STATUS_MASK
- RBTXN_PREPARE_FLAGS_MASK

==========
Kind Regards,
Peter Smith.
Fujitsu Australia

#82Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#81)
Re: Skip collecting decoded changes of already-aborted transactions

On Tue, Feb 11, 2025 at 9:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi. Here are some minor comments for the v18* patch set.

//////////

Patch v18-0001

1.1. Commit message

A previously reported typo still exists:

/noticeble/noticeable/

//////////

Patch v18-0002

2.1
+#define RBTXN_PREPARE_STATUS_FLAGS (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)
+

AFAICT bitmasks like this are more commonly named with a _MASK suffix.

How about something like:
- RBTXN_PREPARE_MASK
- RBTXN_PREPARE_STATUS_MASK
- RBTXN_PREPARE_FLAGS_MASK

Pushed both patches after addressing the above comments. Thank you for
your review!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#83Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#82)
1 attachment(s)
RE: Skip collecting decoded changes of already-aborted transactions

Dear hackers,

I hope I'm in the correct thread. In abfb296, rbtxn_skip_prepared() was removed from
SnapBuildDistributeNewCatalogSnapshot(). ISTM that was the only caller of the macro.

Is it kept intentionally for external projects? Or can it be removed as in the attached patch?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

remove_func.diffs (application/octet-stream)
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7eb..8c6a2b954e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -259,12 +259,6 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_ABORTED) != 0 \
 )
 
-/* prepare for this transaction skipped? */
-#define rbtxn_skip_prepared(txn) \
-( \
-	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
-)
-
 /* Is this a top-level transaction? */
 #define rbtxn_is_toptxn(txn) \
 ( \
#84Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#83)
Re: Skip collecting decoded changes of already-aborted transactions

On Thu, Mar 13, 2025 at 10:04 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear hackers,

I hope I'm in the correct thread. In abfb296, rbtxn_skip_prepared() was removed from
SnapBuildDistributeNewCatalogSnapshot(). ISTM that was the only caller of the macro.

Is it kept intentionally for external projects? Or can it be removed as in the attached patch?

I think we can keep it, since every RBTXN_xxx flag has a corresponding
macro and the comments on these macros help somewhat in understanding
what each flag indicates.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com