Fix 035_standby_logical_decoding.pl race conditions

Started by Bertrand Drouvot · 11 months ago · 47 messages
#1Bertrand Drouvot
bertranddrouvot.pg@gmail.com
2 attachment(s)

Hi hackers,

Please find attached a patch to $SUBJECT.

In rare circumstances (and on slow machines) it is possible that an xl_running_xacts
record is emitted and that the catalog_xmin of a logical slot on the standby advances
past the conflict point. In that case, no conflict is reported and the test
fails. This has been observed several times; the last discussion can be found
in [1].

To keep the race condition from occurring, this commit adds an injection point that
prevents the catalog_xmin of a logical slot from advancing past the conflict point.

While working on this patch, some adjustments to the injection points module turned
out to be necessary (they are proposed in 0001):

- Add the ability to wakeup() and detach() while ensuring that no process can
start waiting in between. This is done with a new injection_points_wakeup_detach()
function that holds the spinlock for the whole duration.

- If the walsender is waiting on the injection point and the logical slot
is conflicting, then the walsender process is killed and so it is not able to
"empty" its injection slot. The next injection_wait() should therefore reuse this
slot (instead of using an empty one). injection_wait() has been modified that way
in 0001.

With 0001 in place, we can make use of an injection point in
LogicalConfirmReceivedLocation() and update 035_standby_logical_decoding.pl to
prevent the catalog_xmin of a logical slot from advancing past the conflict point.
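
Condensed, the test-side flow added by 0002 looks like this (names as in the
attached patch; error handling omitted):

```
# On the standby: make the walsender wait before it can update catalog_xmin
# (injection point added in LogicalConfirmReceivedLocation() by 0002).
$node_standby->safe_psql('testdb',
    "SELECT injection_points_attach('before-confirm-xmin-location', 'wait');");

# On the primary: generate the rows to remove, then the VACUUM that must conflict.
$node_primary->safe_psql('testdb', qq[$sql]);
$node_primary->safe_psql(
    'testdb', qq[VACUUM $vac_option verbose $to_vac;
                 INSERT INTO flush_wal DEFAULT VALUES;]);

# Wake up and detach in one step, so no process can start waiting in between.
$node_standby->safe_psql('testdb',
    "SELECT injection_points_wakeup_detach('before-confirm-xmin-location');");
```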

Remarks:

R1. The issue still remains in v16 though (as injection points are available since
v17).
R2. 0001 should probably bump the injection_points module version to 1.1, but shouldn't
that also have been done in d28cd3e7b21c?

[1]: /messages/by-id/386386.1737736935@sss.pgh.pa.us

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Add-injection_points_wakeup_detach-and-modify-inj.patch (text/x-diff; charset=us-ascii)
From 5008207f28c68360ec3d466852697797994bb330 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 10 Feb 2025 13:19:54 +0000
Subject: [PATCH v1 1/2] Add injection_points_wakeup_detach() and modify
 injection_wait()

This commit adds:

- injection_points_wakeup_detach() to be able to wakeup() and detach() while
ensuring that no process can wait in between (holding the spinlock during the
whole duration).
- A check in injection_wait() to search if an existing injection slot with the
same name already exists (If so, reuse it).
---
 .../injection_points--1.0.sql                 | 10 +++
 .../injection_points/injection_points.c       | 65 ++++++++++++++++---
 2 files changed, 65 insertions(+), 10 deletions(-)
 100.0% src/test/modules/injection_points/

diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql
index 5d83f08811b..b4ae67fd97b 100644
--- a/src/test/modules/injection_points/injection_points--1.0.sql
+++ b/src/test/modules/injection_points/injection_points--1.0.sql
@@ -75,6 +75,16 @@ RETURNS void
 AS 'MODULE_PATHNAME', 'injection_points_detach'
 LANGUAGE C STRICT PARALLEL UNSAFE;
 
+--
+-- injection_points_wakeup_detach()
+--
+-- Wakes up and detaches the current action, if any, from the given injection point.
+--
+CREATE FUNCTION injection_points_wakeup_detach(IN point_name TEXT)
+RETURNS void
+AS 'MODULE_PATHNAME', 'injection_points_wakeup_detach'
+LANGUAGE C STRICT PARALLEL UNSAFE;
+
 --
 -- injection_points_stats_numcalls()
 --
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index ad528d77752..97397449dfb 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -93,6 +93,9 @@ typedef struct InjectionPointSharedState
 /* Pointer to shared-memory state. */
 static InjectionPointSharedState *inj_state = NULL;
 
+static Datum injection_points_wakeup_internal(FunctionCallInfo fcinfo, bool lock,
+											  bool have_to_wait);
+
 extern PGDLLEXPORT void injection_error(const char *name,
 										const void *private_data);
 extern PGDLLEXPORT void injection_notice(const char *name,
@@ -294,19 +297,30 @@ injection_wait(const char *name, const void *private_data)
 	SpinLockAcquire(&inj_state->lock);
 	for (int i = 0; i < INJ_MAX_WAIT; i++)
 	{
-		if (inj_state->name[i][0] == '\0')
+		/*
+		 * It might be that a waiting process has been killed before being
+		 * able to reset inj_state->name[i][0] to '\0', so checking if there
+		 * is a slot with the same name.
+		 */
+		if (strcmp(name, inj_state->name[i]) == 0)
 		{
 			index = i;
-			strlcpy(inj_state->name[i], name, INJ_NAME_MAXLEN);
-			old_wait_counts = inj_state->wait_counts[i];
 			break;
 		}
+		else if (inj_state->name[i][0] == '\0' && index < 0)
+			index = i;
 	}
+
 	SpinLockRelease(&inj_state->lock);
 
 	if (index < 0)
 		elog(ERROR, "could not find free slot for wait of injection point %s ",
 			 name);
+	else
+	{
+		strlcpy(inj_state->name[index], name, INJ_NAME_MAXLEN);
+		old_wait_counts = inj_state->wait_counts[index];
+	}
 
 	/* And sleep.. */
 	ConditionVariablePrepareToSleep(&inj_state->wait_point);
@@ -427,10 +441,12 @@ injection_points_cached(PG_FUNCTION_ARGS)
 
 /*
  * SQL function for waking up an injection point waiting in injection_wait().
+ * If "lock" is true then the function handles the locking.
+ * If "have_to_wait" is true then the function returns an error if no process
+ * is waiting.
  */
-PG_FUNCTION_INFO_V1(injection_points_wakeup);
-Datum
-injection_points_wakeup(PG_FUNCTION_ARGS)
+static Datum
+injection_points_wakeup_internal(FunctionCallInfo fcinfo, bool lock, bool have_to_wait)
 {
 	char	   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
 	int			index = -1;
@@ -439,7 +455,8 @@ injection_points_wakeup(PG_FUNCTION_ARGS)
 		injection_init_shmem();
 
 	/* First bump the wait counter for the injection point to wake up */
-	SpinLockAcquire(&inj_state->lock);
+	if (lock)
+		SpinLockAcquire(&inj_state->lock);
 	for (int i = 0; i < INJ_MAX_WAIT; i++)
 	{
 		if (strcmp(name, inj_state->name[i]) == 0)
@@ -450,17 +467,29 @@ injection_points_wakeup(PG_FUNCTION_ARGS)
 	}
 	if (index < 0)
 	{
-		SpinLockRelease(&inj_state->lock);
-		elog(ERROR, "could not find injection point %s to wake up", name);
+		if (lock)
+			SpinLockRelease(&inj_state->lock);
+		if (have_to_wait)
+			elog(ERROR, "could not find injection point %s to wake up", name);
+		else
+			PG_RETURN_VOID();
 	}
 	inj_state->wait_counts[index]++;
-	SpinLockRelease(&inj_state->lock);
+	if (lock)
+		SpinLockRelease(&inj_state->lock);
 
 	/* And broadcast the change to the waiters */
 	ConditionVariableBroadcast(&inj_state->wait_point);
 	PG_RETURN_VOID();
 }
 
+PG_FUNCTION_INFO_V1(injection_points_wakeup);
+Datum
+injection_points_wakeup(PG_FUNCTION_ARGS)
+{
+	return injection_points_wakeup_internal(fcinfo, true, true);
+}
+
 /*
  * injection_points_set_local
  *
@@ -516,6 +545,22 @@ injection_points_detach(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+/*
+ * SQL function for waking up and dropping an injection point.
+ */
+PG_FUNCTION_INFO_V1(injection_points_wakeup_detach);
+Datum
+injection_points_wakeup_detach(PG_FUNCTION_ARGS)
+{
+	if (inj_state == NULL)
+		injection_init_shmem();
+
+	SpinLockAcquire(&inj_state->lock);
+	injection_points_wakeup_internal(fcinfo, false, false);
+	injection_points_detach(fcinfo);
+	SpinLockRelease(&inj_state->lock);
+	PG_RETURN_VOID();
+}
 
 void
 _PG_init(void)
-- 
2.34.1

v1-0002-Fix-race-conditions-in-035_standby_logical_decodi.patch (text/x-diff; charset=us-ascii)
From df29ebe3121e3b924f9e0fe40b05e55dad2bd4c8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 10 Feb 2025 13:36:48 +0000
Subject: [PATCH v1 2/2] Fix race conditions in 035_standby_logical_decoding.pl

In rare circumstances (and on slow machines) it is possible that an xl_running_xacts
record is emitted and that the catalog_xmin of a logical slot advances past the conflict
point. In that case no conflict is reported and the test fails.

This commit adds a new injection point to prevent the catalog_xmin from advancing
past the conflict point.
---
 src/backend/replication/logical/logical.c     |  3 +++
 .../t/035_standby_logical_decoding.pl         | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)
  11.7% src/backend/replication/logical/
  88.2% src/test/recovery/t/

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8ea846bfc3b..578837bfc1c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -41,6 +41,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 
@@ -1826,6 +1827,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		bool		updated_xmin = false;
 		bool		updated_restart = false;
 
+		INJECTION_POINT("before-confirm-xmin-location");
+
 		SpinLockAcquire(&MyReplicationSlot->mutex);
 
 		MyReplicationSlot->data.confirmed_flush = lsn;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 505e85d1eb6..d6b8d28a7e0 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -256,6 +261,10 @@ sub wait_until_vacuum_can_remove
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
 
+	# Ensure catalog_xmin can not advance
+	$node_standby->safe_psql('testdb',
+		"SELECT injection_points_attach('before-confirm-xmin-location', 'wait');");
+
 	# Launch our sql.
 	$node_primary->safe_psql('testdb', qq[$sql]);
 
@@ -269,6 +278,10 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	# Unlock the catalog_xmin update (if any)
+	$node_standby->safe_psql('testdb',
+		"SELECT injection_points_wakeup_detach('before-confirm-xmin-location');");
 }
 
 ########################
@@ -490,6 +503,12 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
+# Wait until the extension has been created on the standby
+$node_primary->wait_for_replay_catchup($node_standby);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-- 
2.34.1

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#1)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Please find attached a patch to $SUBJECT.

In rare circumstances (and on slow machines) it is possible that an xl_running_xacts
record is emitted and that the catalog_xmin of a logical slot on the standby advances
past the conflict point. In that case, no conflict is reported and the test
fails. This has been observed several times; the last discussion can be found
in [1].

Is my understanding correct that the bgwriter on the primary node has created
an xl_running_xacts record, then that record is replicated to the standby, and
while decoding it (xl_running_xacts) on the standby via the active_slot, we
advanced the catalog_xmin of the active_slot? If this happens, then the
replay of the vacuum record on the standby won't be able to invalidate the
active slot, right?

So, if the above is correct, the reason for generating extra
xl_running_xacts on primary is Vacuum followed by Insert on primary
via below part of test:
$node_primary->safe_psql(
    'testdb', qq[VACUUM $vac_option verbose $to_vac;
                 INSERT INTO flush_wal DEFAULT VALUES;]);

To keep the race condition from occurring, this commit adds an injection point that
prevents the catalog_xmin of a logical slot from advancing past the conflict point.

While working on this patch, some adjustments to the injection points module turned
out to be necessary (they are proposed in 0001):

- Add the ability to wakeup() and detach() while ensuring that no process can
start waiting in between. This is done with a new injection_points_wakeup_detach()
function that holds the spinlock for the whole duration.

- If the walsender is waiting on the injection point and the logical slot
is conflicting, then the walsender process is killed and so it is not able to
"empty" its injection slot. The next injection_wait() should therefore reuse this
slot (instead of using an empty one). injection_wait() has been modified that way
in 0001.

With 0001 in place, we can make use of an injection point in
LogicalConfirmReceivedLocation() and update 035_standby_logical_decoding.pl to
prevent the catalog_xmin of a logical slot from advancing past the conflict point.

Remarks:

R1. The issue still remains in v16 though (as injection points are available since
v17).

This is not an idle case, because the test would still keep failing
intermittently on 16. I am wondering: what if we start a transaction
before the vacuum and do some DML in it, but don't commit that xact until
the active_slot test is finished? Then even the extra logging of
xl_running_xacts shouldn't advance xmin during decoding. This is
because the reorder buffer may still point to the xmin from before the vacuum.
See the following code:

SnapBuildProcessRunningXacts()
....
    xmin = ReorderBufferGetOldestXmin(builder->reorder);
    if (xmin == InvalidTransactionId)
        xmin = running->oldestRunningXid;
    elog(DEBUG3, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u",
         builder->xmin, builder->xmax, running->oldestRunningXid, xmin);
    LogicalIncreaseXminForSlot(lsn, xmin);
...

Note that I have not tested this case, so I could be wrong. But if
possible, we should try to find some solution which could be
backpatched to 16 as well.
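
To make the idea concrete, here is a minimal sketch of what I have in mind
(untested, and assuming the BackgroundPsql TAP helper and the test's existing
conflict_test table):

```
# Hold a transaction open on the primary so that ReorderBufferGetOldestXmin()
# keeps returning a pre-vacuum xmin while the active_slot checks run.
my $bg_psql = $node_primary->background_psql('testdb');
$bg_psql->query_safe(
    q[BEGIN;
      INSERT INTO conflict_test(x, y) VALUES (-1, 'keep xmin back');]);

# ... run the VACUUM on the primary and the conflict checks on the standby ...

# Finish the transaction only once the active_slot checks are done.
$bg_psql->query_safe('COMMIT;');
$bg_psql->quit;
```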

--
With Regards,
Amit Kapila.

#3Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Wed, Mar 19, 2025 at 12:12:19PM +0530, Amit Kapila wrote:

On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Please find attached a patch to $SUBJECT.

In rare circumstances (and on slow machines) it is possible that an xl_running_xacts
record is emitted and that the catalog_xmin of a logical slot on the standby advances
past the conflict point. In that case, no conflict is reported and the test
fails. This has been observed several times; the last discussion can be found
in [1].

Thanks for looking at it!

Is my understanding correct that the bgwriter on the primary node has created
an xl_running_xacts record, then that record is replicated to the standby, and
while decoding it (xl_running_xacts) on the standby via the active_slot, we
advanced the catalog_xmin of the active_slot? If this happens, then the
replay of the vacuum record on the standby won't be able to invalidate the
active slot, right?

Yes, that's also my understanding. It's also easy to "simulate" by adding
a checkpoint on the primary and a long enough sleep after we launched our sql in
wait_until_vacuum_can_remove().
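
For reference, the kind of change to wait_until_vacuum_can_remove() that I used to
simulate it looks like this (the 20s value is arbitrary; the same hack appears in
the attached test_prepared_txn.txt):

```
# Launch our sql.
$node_primary->safe_psql('testdb', qq[$sql]);

# Force an extra xl_running_xacts and leave the standby enough time to decode it,
# so the slot's catalog_xmin advances past the upcoming conflict point.
$node_primary->safe_psql('testdb', 'CHECKPOINT');
sleep(20);
```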

So, if the above is correct, the reason for generating extra
xl_running_xacts on primary is Vacuum followed by Insert on primary
via below part of test:
$node_primary->safe_psql(
'testdb', qq[VACUUM $vac_option verbose $to_vac;
INSERT INTO flush_wal DEFAULT VALUES;]);

I'm not sure; I think an xl_running_xacts record could also be generated (for example by
the checkpointer) before the vacuum (should the system be slow enough).

Remarks:

R1. The issue still remains in v16 though (as injection points are available since
v17).

This is not an idle case, because the test would still keep failing
intermittently on 16.

I do agree.

I am wondering: what if we start a transaction
before the vacuum and do some DML in it, but don't commit that xact until
the active_slot test is finished? Then even the extra logging of
xl_running_xacts shouldn't advance xmin during decoding.

I'm not sure, as I think an xl_running_xacts record could still be generated after
we execute "our sql", meaning:

"
$node_primary->safe_psql('testdb', qq[$sql]);
"

and before we launch the new DML. In that case I guess the issue could still
happen.

OTOH, if we create the new DML "before" we launch "our sql", then the test
would also fail for both active and inactive slots, because that would not
invalidate any slots.

I did observe the above with the attached changes (just changing the PREPARE
TRANSACTION location).

we should try to find some solution which could be
backpatched to 16 as well.

I agree, but I'm not sure it's doable, as it looks to me that we would have to prevent
the catalog_xmin from advancing past the conflict point while still
generating a conflict. Will try to give it another thought.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

test_prepared_txn.txt (text/plain; charset=us-ascii)
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..edd8009fbce 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -258,10 +258,20 @@ sub wait_until_vacuum_can_remove
 	# Launch our sql.
 	$node_primary->safe_psql('testdb', qq[$sql]);
 
+	$node_primary->safe_psql('testdb',"CHECKPOINT");
+	sleep(20);
+
+	$node_primary->safe_psql(
+	    'testdb', "
+		BEGIN;
+		PREPARE TRANSACTION 'prevent_slot_advance_v1';
+		INSERT INTO prevent_slot_advance VALUES (1);"
+	);
+
 	# Wait until we get a newer horizon.
-	$node_primary->poll_query_until('testdb',
-		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
-	) or die "new snapshot does not have a newer horizon";
+	#$node_primary->poll_query_until('testdb',
+	#	"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
+	#) or die "new snapshot does not have a newer horizon";
 
 	# Launch the vacuum command and insert into flush_wal (see CREATE
 	# TABLE flush_wal for the reason).
@@ -281,7 +291,9 @@ wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
 autovacuum = off
+max_prepared_transactions = 10
 });
+
 $node_primary->dump_info;
 $node_primary->start;
 
@@ -305,6 +317,7 @@ $node_primary->backup($backup_name);
 # Some tests need to wait for VACUUM to be replayed. But vacuum does not flush
 # WAL. An insert into flush_wal outside transaction does guarantee a flush.
 $node_primary->psql('testdb', q[CREATE TABLE flush_wal();]);
+$node_primary->psql('testdb', q[CREATE TABLE prevent_slot_advance(a int);]);
 
 #######################
 # Initialize standby node
@@ -565,6 +578,8 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 'prevent_slot_advance_v1'");
+
 # Attempting to alter an invalidated slot should result in an error
 ($result, $stdout, $stderr) = $node_standby->psql(
 	'postgres',
@@ -664,6 +679,8 @@ check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 'prevent_slot_advance_v1'");
+
 $handle =
   make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
 
@@ -699,6 +716,8 @@ check_for_invalidation('shared_row_removal_', $logstart,
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 'prevent_slot_advance_v1'");
+
 $handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
 	\$stderr);
 
@@ -737,6 +756,8 @@ ok( !$node_standby->log_contains(
 	'activeslot slot invalidation is not logged with vacuum on conflict_test'
 );
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 'prevent_slot_advance_v1'");
+
 # Verify that pg_stat_database_conflicts.confl_active_logicalslot has not been updated
 ok( $node_standby->poll_query_until(
 		'postgres',
#4Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bertrand Drouvot (#3)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand,

I'm also working on the thread to resolve the random failure.

Yes, that's also my understanding. It's also easy to "simulate" by adding
a checkpoint on the primary and a long enough sleep after we launched our sql in
wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

So, if the above is correct, the reason for generating extra
xl_running_xacts on primary is Vacuum followed by Insert on primary
via below part of test:
$node_primary->safe_psql(
'testdb', qq[VACUUM $vac_option verbose $to_vac;
INSERT INTO flush_wal DEFAULT VALUES;]);

I'm not sure; I think an xl_running_xacts record could also be generated (for example by
the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added `CHECKPOINT` and a sleep after the user SQLs,
I got the below ordering. This means that the RUNNING_XACTS record is generated before
the prune record triggered by the vacuum.
```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```

I'm not sure, as I think an xl_running_xacts record could still be generated after
we execute "our sql", meaning:

"
$node_primary->safe_psql('testdb', qq[$sql]);
"

and before we launch the new DML. In that case I guess the issue could still
happen.

OTOH If we create the new DML "before" we launch "our sql" then the test
would also fail for both active and inactive slots because that would not
invalidate any slots.

I did observe the above with the attached changes (just changing the PREPARE
TRANSACTION location).

I've also tried the idea with a live transaction via background_psql(),
but I got the same result. The test could still fail when a RUNNING_XACTS record was
generated before the transaction started.

I agree, but I'm not sure it's doable, as it looks to me that we would have to prevent
the catalog_xmin from advancing past the conflict point while still
generating a conflict. Will try to give it another thought.

One primitive idea of mine was to stop the walsender/pg_recvlogical process for a while.
A SIGSTOP signal for pg_recvlogical may do the trick, but ISTM it would not work on Windows.
See 019_replslot_limit.pl.
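
A rough sketch of what I had in mind, borrowing the kill 'STOP' trick from
019_replslot_limit.pl (the slot name is just the one from scenario 1, and as said,
this would not be usable on Windows):

```
# Pause the walsender serving the active slot on the standby while the conflict
# is generated, so it cannot confirm a newer flush location in the meantime.
my $senderpid = $node_standby->safe_psql('postgres',
    "SELECT active_pid FROM pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot';");
kill 'STOP', $senderpid;

# ... replay the VACUUM on the standby and run the invalidation checks ...

kill 'CONT', $senderpid;
```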

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#5Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#4)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san,

On Fri, Mar 21, 2025 at 12:28:10PM +0000, Hayato Kuroda (Fujitsu) wrote:

I'm also working on the thread to resolve the random failure.

Thanks for looking at it!

I've also tried the idea with the living transaction via background_psql(),
but I got the same result. The test could fail when RUNNING_XACTS record was
generated before the transaction starts.

and thanks for testing and confirming too.

A SIGSTOP signal for pg_recvlogical may do the trick,

Yeah, but would we be "really" testing an "active" slot?

In the end we want to produce an invalidation that may or may not happen in a real
environment. The corner case is in the test, not an issue of the feature to
fix.

So, I'm not sure I like the idea that much, but thinking out loud: I wonder if
we could bypass the "active" slot checks in 16 and 17 and use injection points as
proposed as of 18 (as we need the injection points changes proposed in 0001
up-thread). Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bertrand Drouvot (#5)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand,

A SIGSTOP signal for pg_recvlogical may do the trick,

Yeah, but would we be "really" testing an "active" slot?

Yeah, this is also a debatable point.

In the end we want to produce an invalidation that may or may not happen in a real
environment. The corner case is in the test, not an issue of the feature to
fix.

I also think this is a test issue, not a codebase issue.

So, I'm not sure I like the idea that much, but thinking out loud: I wonder if
we could bypass the "active" slot checks in 16 and 17 and use injection points as
proposed as of 18 (as we need the injection points changes proposed in 0001
up-thread). Thoughts?

I do not have any other idea either. I checked that your patch set could solve the issue.

Comments for the patch:
I'm not sure whether the new API is really needed. Isn't it enough to use both
injection_points_wakeup() and injection_points_detach()? This approach does not
require bumping the version, and can be backported to PG17.

Also, another check whether the extension can be installed for the node is required.
Please see 041_checkpoint_at_promote.pl.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#6)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san,

On Mon, Mar 24, 2025 at 04:54:21AM +0000, Hayato Kuroda (Fujitsu) wrote:

So, I'm not sure I like the idea that much, but thinking out loud: I wonder if
we could bypass the "active" slot checks in 16 and 17 and use injection points as
proposed as of 18 (as we need the injection points changes proposed in 0001
up-thread). Thoughts?

I do not have any other idea either. I checked that your patch set could solve the issue.

Thanks for looking at it!

Comments for the patch:
I'm not sure whether new API is really needed. Isn't it enough to use both
injection_points_wakeup() and injection_points_detach()?

I think that the proposed changes are needed as they fix 2 issues that I hit
while working on 0002:

1. ensure that no process can wait in between wakeup() and detach().

2. If the walsender is waiting on the injection point and the logical slot
is conflicting, then the walsender process is killed and so it is not able to
"empty" it's injection slot. So the next injection_wait() should reuse this slot
(instead of using an empty one).

Also, another check whether the extension can be installed for the node is required.
Please see 041_checkpoint_at_promote.pl.

Indeed, I can see the comment "# Check if the extension injection_points is available, as
it may be possible that this script is run with installcheck, where the module
would not be installed by default" in 041_checkpoint_at_promote.pl.

Thanks! I think that makes sense and will add it in the proposed patch (early
next week).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#5)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Fri, Mar 21, 2025 at 9:48 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

So, I'm not sure I like the idea that much, but thinking out loud: I wonder if
we could bypass the "active" slot checks in 16 and 17 and use injection points as
proposed as of 18 (as we need the injection points changes proposed in 0001
up-thread). Thoughts?

The key point is that snapshotConflictHorizon should always be greater
than or equal to oldestRunningXid for this test to pass. The challenge
is that vacuum LOGs the safest xid to be removed as
snapshotConflictHorizon, which I think will always be at least one
less than oldestRunningXid. So, we can't make it pass unless we
ensure that no running_xact record gets logged after the last
successful transaction (in this case the SQL passed to the function
wait_until_vacuum_can_remove) and until the vacuum is replayed on
the standby. I see that even check_for_invalidation('pruning_', $logstart,
'with on-access pruning'); failed [1].

Seeing all these failures, I wonder whether we can reliably test
active slots apart from wal_level change test (aka Scenario 6:
incorrect wal_level on primary.). Sure, we can try by having some
injection point kind of tests, but is it really worth because, anyway
the active slots won't get invalidated in the scenarios for row
removal we are testing in this case. The other possibility is to add a
developer-level debug_disable_running_xact GUC to test this and
similar cases, or can't we have an injection point to control logging
this WAL record? I have seen the need to control logging running_xact
record in other cases as well.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-19%2007%3A08%3A16

--
With Regards,
Amit Kapila.

#9Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#8)
1 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit, Bertrand,

Seeing all these failures, I wonder whether we can reliably test
active slots apart from wal_level change test (aka Scenario 6:
incorrect wal_level on primary.). Sure, we can try by having some
injection point kind of tests, but is it really worth because, anyway
the active slots won't get invalidated in the scenarios for row
removal we are testing in this case. The other possibility is to add a
developer-level debug_disable_running_xact GUC to test this and
similar cases, or can't we have an injection point to control logging
this WAL record? I have seen the need to control logging running_xact
record in other cases as well.

Based on the idea of controlling the generation of RUNNING_XACTS, I prototyped a patch.
When the new injection point is attached on the instance, all processes skip
logging the record. This does not need to extend the injection_points module.
I tested with the reproducer and it passed on my environment.
Sadly, IS_INJECTION_POINT_ATTACHED() was introduced in PG18, so the patch
cannot be backported to PG17 as-is.

How do you feel?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Use-injection_point-to-stabilize-035_standby_logical.patch (application/octet-stream)
From 1f439e0c6cadc952eecbcded2d2d249d9fec9d36 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 14:19:50 +0900
Subject: [PATCH] Use injection_point to stabilize 035_standby_logical_decoding

---
 src/backend/storage/ipc/standby.c             | 16 ++++++++
 .../t/035_standby_logical_decoding.pl         | 39 +++++++++++++++----
 2 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..35056eee67b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -31,6 +31,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/hsearch.h"
+#include "utils/injection_point.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -1287,6 +1288,21 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+	/* For testing slot invalidation due to the conflict */
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))
+	{
+		/*
+		 * In 035_standby_logical_decoding.pl, RUNNING_XACTS could move slots's
+		 * xmin forward and cause random failures. Skip generating to avoid it.
+		 *
+		 * XXX What value should we return here? Originally this returns the
+		 * inserted location of RUNNING_XACT record. Based on that, here
+		 * returns the latest insert location for now.
+		 */
+		return GetInsertRecPtr();
+	}
+#endif
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..1a721744ef0 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -251,6 +256,11 @@ sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
 
+	# Note that from this point the checkpointer and bgwriter will wait before
+	# they write RUNNING_XACT record.
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_attach('log-running-xacts', 'wait');");
+
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
@@ -258,6 +268,10 @@ sub wait_until_vacuum_can_remove
 	# Launch our sql.
 	$node_primary->safe_psql('testdb', qq[$sql]);
 
+	# XXX If the instance does not attach 'log-running-xacts', the bgwriter
+	# pocess would generate RUNNING_XACTS record, so that the test would fail.
+	sleep(20);
+
 	# Wait until we get a newer horizon.
 	$node_primary->poll_query_until('testdb',
 		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
@@ -268,6 +282,12 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	$node_primary->wait_for_replay_catchup($node_standby);
+
+	# Resume working processes
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_detach('log-running-xacts');");
 }
 
 ########################
@@ -285,6 +305,14 @@ autovacuum = off
 $node_primary->dump_info;
 $node_primary->start;
 
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 
 $node_primary->safe_psql('testdb',
@@ -528,6 +556,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -557,8 +588,6 @@ wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
 								 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 
@@ -656,8 +685,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
 							 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 
@@ -690,8 +717,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE ROLE create_trash;
 							 DROP ROLE create_trash;', 'pg_authid');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid');
@@ -724,8 +749,6 @@ wait_until_vacuum_can_remove(
 							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
 							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # message should not be issued
 ok( !$node_standby->log_contains(
 		"invalidating obsolete slot \"no_conflict_inactiveslot\"", $logstart),
-- 
2.43.5

#10Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#8)
1 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit, Bertrand,

Seeing all these failures, I wonder whether we can reliably test
active slots apart from wal_level change test (aka Scenario 6:
incorrect wal_level on primary.).

Hmm, agreed. We do not have a good solution to stabilize the tests, at least for now.
I've created a patch for PG16 which avoids using active slots in scenarios 1, 2, 3,
and 5, as attached. The other tests still use active slots:

* Scenario 6 invalidates slots due to the incorrect wal_level, so it is retained.
* The 'behaves_ok_' testcase, scenario 4 and the 'Test standby promotion...' testcase
won't invalidate slots, so they are retained.
* 'DROP DATABASE should drop...' invalidates slots, but it is not related to the
xmin horizon, so it is retained.

The patch targets only PG16, but one can be created for PG17 as well, if needed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Avoid-using-active-slots-in-035_standby_logical_deco.patch (application/octet-stream)
From 8e4375d389bfbf68a3e8d19f4d69a0183c994b5b Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH] Avoid using active slots in 035_standby_logical_decoding

---
 .../t/035_standby_logical_decoding.pl         | 164 +++++++++---------
 1 file changed, 82 insertions(+), 82 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..9ab08d75fd3 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $needs_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -191,22 +201,22 @@ sub check_slots_conflicting_status
 	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +225,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +235,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -262,6 +276,10 @@ sub wait_until_vacuum_can_remove
 	# Launch our sql.
 	$node_primary->safe_psql('testdb', qq[$sql]);
 
+	# XXX: Reproducer - must be removed before being pushed
+	$node_primary->safe_psql('testdb', 'CHECKPOINT');
+	sleep(20);
+
 	# Wait until we get a newer horizon.
 	$node_primary->poll_query_until('testdb',
 		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
@@ -389,7 +407,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -536,11 +554,13 @@ $node_subscriber->stop;
 # Scenario 1: hot_standby_feedback off and vacuum FULL
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -550,19 +570,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -580,7 +592,7 @@ check_slots_conflicting_status(1);
 
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
-	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_inactiveslot' and conflicting is true;"
 );
 
 chomp($restart_lsn);
@@ -615,14 +627,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -632,32 +646,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -668,29 +676,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -733,13 +735,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -754,17 +758,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -777,10 +777,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -798,7 +798,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -830,10 +830,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -897,14 +897,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#9)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Mar 26, 2025 at 1:17 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Seeing all these failures, I wonder whether we can reliably test
active slots apart from wal_level change test (aka Scenario 6:
incorrect wal_level on primary.). Sure, we can try by having some
injection point kind of tests, but is it really worth because, anyway
the active slots won't get invalidated in the scenarios for row
removal we are testing in this case. The other possibility is to add a
developer-level debug_disable_running_xact GUC to test this and
similar cases, or can't we have an injection point to control logging
this WAL record? I have seen the need to control logging running_xact
record in other cases as well.

Based on the idea which controls generating RUNNING_XACTS, I prototyped a patch.
When the instance is attached the new injection point, all processes would skip
logging the record. This does not need to extend injection_point module.

Right, I think this is a better idea. I have a few comments:
1.
+ /*
+ * In 035_standby_logical_decoding.pl, RUNNING_XACTS could move slots's
+ * xmin forward and cause random failures.

No need to use test file name in code comments.

2. The comments atop wait_until_vacuum_can_remove can be changed to
indicate that we will avoid logging running_xact with the help of
injection points.

3.
+ # Note that from this point the checkpointer and bgwriter will wait before
+ # they write RUNNING_XACT record.
+ $node_primary->safe_psql('testdb',
+ "SELECT injection_points_attach('log-running-xacts', 'wait');");

Isn't it better to use 'error' as the second parameter as we don't
want to wait at this injection point?

4.
+ # XXX If the instance does not attach 'log-running-xacts', the bgwriter
+ # pocess would generate RUNNING_XACTS record, so that the test would fail.
+ sleep(20);

I think it is better to make a separate patch (as a first patch) for
this so that it can be used as a reproducer. I suggest using a
checkpoint, as in one of Bertrand's patches, to ensure that the
issue reproduces in every environment.

Sadly IS_INJECTION_POINT_ATTACHED() was introduced for PG18 so that the patch
could not backport for PG17 as-is.

We could use the 'wait' mode API in PG17, as used in one of the tests
(injection_points_attach('heap_update-before-pin', 'wait');), but I
think it may be better to just give up on testing active slots in the
back-branches, because new development happens on HEAD anyway and we
want to ensure that no breakage happens there.

--
With Regards,
Amit Kapila.

#12Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#11)
4 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit,

Right, I think this is a better idea. I have a few comments:
1.
+ /*
+ * In 035_standby_logical_decoding.pl, RUNNING_XACTS could move slots's
+ * xmin forward and cause random failures.

No need to use test file name in code comments.

Fixed.

2. The comments atop wait_until_vacuum_can_remove can be changed to
indicate that we will avoid logging running_xact with the help of
injection points.

Comments were updated for the master. In back-branches, they were removed
because the risk was removed.

3.
+ # Note that from this point the checkpointer and bgwriter will wait before
+ # they write RUNNING_XACT record.
+ $node_primary->safe_psql('testdb',
+ "SELECT injection_points_attach('log-running-xacts', 'wait');");

Isn't it better to use 'error' as the second parameter as we don't
want to wait at this injection point?

Right, and the comment atop it was updated.

4.
+ # XXX If the instance does not attach 'log-running-xacts', the bgwriter
+ # pocess would generate RUNNING_XACTS record, so that the test would fail.
+ sleep(20);

I think it is better to make a separate patch (as a first patch) for
this so that it can be used as a reproducer. I suggest using a
checkpoint, as in one of Bertrand's patches, to ensure that the issue
reproduces in every environment.

The reproducer was separated out into the .txt file.

Sadly, IS_INJECTION_POINT_ATTACHED() was introduced for PG18, so the patch
cannot be backported to PG17 as-is.

We can use the 'wait' mode API in PG17, as used in one of the existing tests
(injection_points_attach('heap_update-before-pin', 'wait');), but I
think it may be better to just skip testing active slots in the
back-branches because, anyway, new development happens on HEAD and we
want to ensure that no breakage happens there.

OK. I've attached a patch for PG17 as well. Commit messages for them were also
updated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

reproducer.txt (text/plain)
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index d68a8f9b828..71c3ad896d5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -273,6 +273,9 @@ sub wait_until_vacuum_can_remove
 	# Launch our sql.
 	$node_primary->safe_psql('testdb', qq[$sql]);
 
+	$node_primary->safe_psql('testdb', 'CHECKPOINT');
+	sleep(20);
+
 	# Wait until we get a newer horizon.
 	$node_primary->poll_query_until('testdb',
 		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
PG17-v2-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From ac405310591719db0c37356e4487181bed8b3f71 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH vPG17] Stabilize 035_standby_logical_decoding.pl by using the
 injection_points.

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples that are still needed by the slots. The
problem is that xl_running_xacts records are sometimes generated while testing;
they advance the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 201 ++++++++----------
 1 file changed, 91 insertions(+), 110 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..d68a8f9b828 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $needs_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -171,42 +181,46 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check reason for conflict in pg_replication_slots.
 sub check_slots_conflict_reason
 {
-	my ($slot_prefix, $reason) = @_;
+	my ($slot_prefix, $reason, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
-	$res = $node_standby->safe_psql(
-		'postgres', qq(
-			 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
-	);
-
-	is($res, "$reason", "$active_slot reason for conflict is $reason");
-
 	$res = $node_standby->safe_psql(
 		'postgres', qq(
 			 select invalidation_reason from pg_replication_slots where slot_name = '$inactive_slot' and conflicting;)
 	);
 
 	is($res, "$reason", "$inactive_slot reason for conflict is $reason");
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$res = $node_standby->safe_psql(
+			'postgres', qq(
+				select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
+		);
+
+		is($res, "$reason", "$active_slot reason for conflict is $reason");
+	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +229,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,31 +239,29 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
 # launch a VACUUM.  $vac_option is the set of options to be passed to the
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
-#
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +400,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -539,21 +550,19 @@ $node_subscriber->stop;
 # active slot is invalidated.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # Ensure that replication slot stats are not empty before triggering the
 # conflict.
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,27 +571,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -602,7 +595,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
 	'postgres',
 	"SELECT restart_lsn FROM pg_replication_slots
-		WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+		WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );
 
 chomp($restart_lsn);
@@ -634,14 +627,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -651,32 +646,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -687,29 +676,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -758,13 +741,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -779,17 +764,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -802,10 +783,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -823,7 +804,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
@@ -855,10 +836,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -922,14 +903,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

PG16-v2-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From 1f22028e50c2929c24fe2e60e09ac6d1448e6e71 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH vPG16_2] Stabilize 035_standby_logical_decoding.pl by using
 the injection_points.

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples that are still needed by the slots. The
problem is that xl_running_xacts records are sometimes generated while testing;
they advance the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 167 ++++++++----------
 1 file changed, 78 insertions(+), 89 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..c5899907d37 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $needs_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -191,22 +201,22 @@ sub check_slots_conflicting_status
 	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +225,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,31 +235,29 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
 # launch a VACUUM.  $vac_option is the set of options to be passed to the
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
-#
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +396,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -536,11 +543,13 @@ $node_subscriber->stop;
 # Scenario 1: hot_standby_feedback off and vacuum FULL
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -550,19 +559,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -580,7 +581,7 @@ check_slots_conflicting_status(1);
 
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
-	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_inactiveslot' and conflicting is true;"
 );
 
 chomp($restart_lsn);
@@ -615,14 +616,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -632,32 +635,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -668,29 +665,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -733,13 +724,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -754,17 +747,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -777,10 +766,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -798,7 +787,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -830,10 +819,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -897,14 +886,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

v2-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From 89e336611da6f70e2e57d6093b35ed385021e8ad Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 14:19:50 +0900
Subject: [PATCH v2] Stabilize 035_standby_logical_decoding.pl by using the
 injection_points.

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples that are still needed by the slots. The
problem is that xl_running_xacts records are sometimes generated while testing;
they advance the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip generating the record when the new injection point is
attached to the instance.

This failure has been possible ever since logical decoding was allowed on a
standby server. But the injection_points interface we use exists only on
master, so we do not backpatch.
---
 src/backend/storage/ipc/standby.c             | 16 +++++++
 .../t/035_standby_logical_decoding.pl         | 45 +++++++++++++------
 2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..fd175147a70 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -31,6 +31,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/hsearch.h"
+#include "utils/injection_point.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -1287,6 +1288,21 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+	/* For testing slot invalidation due to the conflict */
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))
+	{
+		/*
+		 * RUNNING_XACTS could move slots's xmin forward and cause random
+		 * failures in some tests. Skip generating to avoid it.
+		 *
+		 * XXX What value should we return here? Originally this returns the
+		 * inserted location of RUNNING_XACT record. Based on that, here
+		 * returns the latest insert location for now.
+		 */
+		return GetInsertRecPtr();
+	}
+#endif
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..b0312e7c118 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -241,16 +246,19 @@ sub check_for_invalidation
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
 #
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# Note that injection_point is used to avoid the seeing a xl_running_xacts
+# that would advance an active replication slot's catalog_xmin. Advancing
+# the active replication slot's catalog_xmin would break some tests that
+# expect the active slot to conflict with the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
 
+	# Note that from this point the checkpointer and bgwriter will skip writing
+	# xl_running_xacts record.
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_attach('log-running-xacts', 'error');");
+
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
@@ -268,6 +276,12 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	$node_primary->wait_for_replay_catchup($node_standby);
+
+	# Resume generating the xl_running_xacts record
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_detach('log-running-xacts');");
 }
 
 ########################
@@ -285,6 +299,14 @@ autovacuum = off
 $node_primary->dump_info;
 $node_primary->start;
 
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 
 $node_primary->safe_psql('testdb',
@@ -528,6 +550,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -557,8 +582,6 @@ wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
 								 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 
@@ -656,8 +679,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
 							 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 
@@ -690,8 +711,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE ROLE create_trash;
 							 DROP ROLE create_trash;', 'pg_authid');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid');
@@ -724,8 +743,6 @@ wait_until_vacuum_can_remove(
 							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
 							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # message should not be issued
 ok( !$node_standby->log_contains(
 		"invalidating obsolete slot \"no_conflict_inactiveslot\"", $logstart),
-- 
2.43.5

#13Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san and Amit,

On Fri, Mar 28, 2025 at 09:02:29AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Amit,

Right, I think this is a better idea.

I like it too and the bonus point is that this injection point can be used
in more tests (more use cases).

A few comments:

==== About v2-0001-Stabilize

=== 1

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

=== 2 (Nit)

/* For testing slot invalidation due to the conflict */

Not sure "due to the conflict" is needed.

==== About PG17-v2-0001

=== 3

The commit message still mentions injection point.

=== 4

-# Note that pg_current_snapshot() is used to get the horizon. It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin. Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.

I'd be tempted to not remove this comment but reword it a bit instead. Something
like?

# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon. We ensure that active replication slots are not
# created for tests that might produce this race condition though.

=== 5

The invalidation checks for active slots are kept for the wal_level case. Also,
the active slots are still created to test that logical decoding on the standby
behaves correctly when no conflict is expected, and for the promotion test.

The above makes sense to me.

=== 6 (Nit)

In drop_logical_slots(), s/needs_active_slot/drop_active_slot/?

=== 7 (Nit)

In check_slots_conflict_reason(), s/needs_active_slot/checks_active_slot/?

==== About PG16-v2-0001

Same as for PG17-v2-0001.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bertrand Drouvot (#13)
3 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand,

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

Fixed.

=== 2 (Nit)

/* For testing slot invalidation due to the conflict */

Not sure "due to the conflict" is needed.

OK, removed.

==== About PG17-v2-0001

=== 3

The commit message still mentions injection point.

Oh, removed.

=== 4

-# Note that pg_current_snapshot() is used to get the horizon. It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin. Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.

I'd be tempted to not remove this comment but reword it a bit instead. Something
like?

# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon. We ensure that active replication slots are not
# created for tests that might produce this race condition though.

Added.

=== 6 (Nit)

In drop_logical_slots(), s/needs_active_slot/drop_active_slot/?

Fixed.

=== 7 (Nit)

In check_slots_conflict_reason(), s/needs_active_slot/checks_active_slot/?

Fixed.

==== About PG16-v2-0001

Same as for PG17-v2-0001.

I applied all the needed changes.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From a6ded0068211e0a304eb803763f632665831a419 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 14:19:50 +0900
Subject: [PATCH v3] Stabilize 035_standby_logical_decoding.pl by using the
 injection_points.

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples that are still needed by the slots. The
problem is that xl_running_xacts records are sometimes generated while testing;
they advance the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip generating the record when the new injection point is
attached to the instance.

This failure has been possible ever since logical decoding was allowed on a
standby server. But the injection_points interface we use exists only on
master, so we do not backpatch.
---
 src/backend/storage/ipc/standby.c             | 16 +++++++
 .../t/035_standby_logical_decoding.pl         | 45 +++++++++++++------
 2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..0e621e9996a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -31,6 +31,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/hsearch.h"
+#include "utils/injection_point.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -1287,6 +1288,21 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+	/* For testing slot invalidation */
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))
+	{
+		/*
+		 * RUNNING_XACTS could move slots's xmin forward and cause random
+		 * failures in some tests. Skip generating to avoid it.
+		 *
+		 * XXX What value should we return here? Originally this returns the
+		 * inserted location of RUNNING_XACT record. Based on that, here
+		 * returns the latest insert location for now.
+		 */
+		return GetInsertRecPtr();
+	}
+#endif
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..93b11a4de1e 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -241,16 +246,19 @@ sub check_for_invalidation
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
 #
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# Note that injection_point is used to avoid the seeing the xl_running_xacts
+# that would advance an active replication slot's catalog_xmin. Advancing
+# the active replication slot's catalog_xmin would break some tests that
+# expect the active slot to conflict with the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
 
+	# Note that from this point the checkpointer and bgwriter will skip writing
+	# xl_running_xacts record.
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_attach('log-running-xacts', 'error');");
+
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
@@ -268,6 +276,12 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	$node_primary->wait_for_replay_catchup($node_standby);
+
+	# Resume generating the xl_running_xacts record
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_detach('log-running-xacts');");
 }
 
 ########################
@@ -285,6 +299,14 @@ autovacuum = off
 $node_primary->dump_info;
 $node_primary->start;
 
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 
 $node_primary->safe_psql('testdb',
@@ -528,6 +550,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -557,8 +582,6 @@ wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
 								 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 
@@ -656,8 +679,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
 							 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 
@@ -690,8 +711,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE ROLE create_trash;
 							 DROP ROLE create_trash;', 'pg_authid');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid');
@@ -724,8 +743,6 @@ wait_until_vacuum_can_remove(
 							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
 							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # message should not be issued
 ok( !$node_standby->log_contains(
 		"invalidating obsolete slot \"no_conflict_inactiveslot\"", $logstart),
-- 
2.43.5

PG16-v3-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From e42d953093251eb008a1322918e5f20d4bd41f2d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH vPG16-3] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples that are still needed by the slots. The
problem is that xl_running_xacts records are sometimes generated while testing;
they advance the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 165 +++++++++---------
 1 file changed, 81 insertions(+), 84 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..54e9aa76621 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $drop_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($drop_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -191,22 +201,22 @@ sub check_slots_conflicting_status
 	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +225,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +235,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -247,10 +261,11 @@ sub check_for_invalidation
 #
 # Note that pg_current_snapshot() is used to get the horizon.  It does
 # not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that active replication slots are not
+# created for tests that might produce this race condition though. 
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +404,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -536,11 +551,13 @@ $node_subscriber->stop;
 # Scenario 1: hot_standby_feedback off and vacuum FULL
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -550,19 +567,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -580,7 +589,7 @@ check_slots_conflicting_status(1);
 
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
-	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_inactiveslot' and conflicting is true;"
 );
 
 chomp($restart_lsn);
@@ -615,14 +624,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -632,32 +643,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -668,29 +673,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -733,13 +732,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -754,17 +755,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -777,10 +774,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -798,7 +795,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -830,10 +827,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -897,14 +894,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

PG17-v3-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchapplication/octet-stream; name=PG17-v3-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchDownload
From 5e585fbf44f95ed684f9aacb0b8c2ba46267d012 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH vPG17-3] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is that
xl_running_xacts records are sometimes generated while the test runs; they can
advance the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 199 +++++++++---------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..af29e075eaa 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $drop_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($drop_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -171,42 +181,46 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check reason for conflict in pg_replication_slots.
 sub check_slots_conflict_reason
 {
-	my ($slot_prefix, $reason) = @_;
+	my ($slot_prefix, $reason, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
-	$res = $node_standby->safe_psql(
-		'postgres', qq(
-			 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
-	);
-
-	is($res, "$reason", "$active_slot reason for conflict is $reason");
-
 	$res = $node_standby->safe_psql(
 		'postgres', qq(
 			 select invalidation_reason from pg_replication_slots where slot_name = '$inactive_slot' and conflicting;)
 	);
 
 	is($res, "$reason", "$inactive_slot reason for conflict is $reason");
+
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$res = $node_standby->safe_psql(
+			'postgres', qq(
+				select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
+		);
+
+		is($res, "$reason", "$active_slot reason for conflict is $reason");
+	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +229,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +239,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -247,10 +265,11 @@ sub check_for_invalidation
 #
 # Note that pg_current_snapshot() is used to get the horizon.  It does
 # not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that active replication slots are not
+# created for tests that might produce this race condition though. 
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +408,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -539,21 +558,19 @@ $node_subscriber->stop;
 # active slot is invalidated.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # Ensure that replication slot stats are not empty before triggering the
 # conflict.
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,27 +579,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -602,7 +603,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
 	'postgres',
 	"SELECT restart_lsn FROM pg_replication_slots
-		WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+		WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );
 
 chomp($restart_lsn);
@@ -634,14 +635,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -651,32 +654,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -687,29 +684,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -758,13 +749,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -779,17 +772,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -802,10 +791,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -823,7 +812,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
@@ -855,10 +844,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -922,14 +911,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

#15Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san,

On Tue, Apr 01, 2025 at 01:22:49AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand,

Thanks for the updated patch!

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

Fixed.

Hmm, I'm not sure, as I can still see:

+# Note that injection_point is used to avoid the seeing the xl_running_xacts

=== 1

+                * XXX What value should we return here? Originally this returns the
+                * inserted location of RUNNING_XACT record. Based on that, here
+                * returns the latest insert location for now.
+                */
+               return GetInsertRecPtr();

Looking at the LogStandbySnapshot() callers that use the output lsn, i.e.:

pg_log_standby_snapshot()
BackgroundWriterMain()
ReplicationSlotReserveWal()

It looks ok to me to use GetInsertRecPtr().

But if we "really" want to produce a "new" WAL record, what about using
LogLogicalMessage()? It could also be used for debugging purposes. Bonus point:
it does not need wal_level to be set to logical. Thoughts?
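
For illustration, the SQL-level wrapper could be called directly from the TAP test
to emit such a record and grab its LSN; a minimal sketch (not part of the attached
patches, and the prefix/payload below are made up):

```perl
# Sketch only: pg_logical_emit_message(transactional, prefix, content) writes a
# logical-message WAL record and returns its LSN, without requiring
# wal_level = logical, which could also help when debugging these tests.
my $msg_lsn = $node_primary->safe_psql('postgres',
	qq[SELECT pg_logical_emit_message(false, 'debug', 'force a new WAL record');]);
note("logical message emitted at LSN $msg_lsn");
```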

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#15)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Tue, Apr 1, 2025 at 2:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Kuroda-san,

On Tue, Apr 01, 2025 at 01:22:49AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand,

Thanks for the updated patch!

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

Fixed.

hmm, not sure as I still can see:

+# Note that injection_point is used to avoid the seeing the xl_running_xacts

=== 1

+                * XXX What value should we return here? Originally this returns the
+                * inserted location of RUNNING_XACT record. Based on that, here
+                * returns the latest insert location for now.
+                */
+               return GetInsertRecPtr();

Looking at the LogStandbySnapshot() callers that use the output lsn, i.e.:

pg_log_standby_snapshot()
BackgroundWriterMain()
ReplicationSlotReserveWal()

It looks ok to me to use GetInsertRecPtr().

+1.

But if we "really" want to produce a "new" WAL record, what about using
LogLogicalMessage()?

We are using injection points for testing purposes, which means the
caller is aware of skipping the running_xacts record during the test
run. So, there doesn't seem to be any reason to do anything extra.

--
With Regards,
Amit Kapila.

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Tue, Apr 1, 2025 at 6:53 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

With respect to 0001, can't this problem happen for the following case as well?
# Recovery conflict: Invalidate conflicting slots, including in-use slots
# Scenario 5: conflict due to on-access pruning.

You have not added any injection point for the above case. Isn't it
possible that, if a running_xacts record is logged concurrently with the
pruning record, it could advance the active slot's catalog_xmin on the
standby, and the same failure would occur in this case as well?

--
With Regards,
Amit Kapila.

#18Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bertrand Drouvot (#15)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand,

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

Fixed.

Sorry, I misunderstood your comment and fixed it incorrectly. I will address this in the next version.

=== 1

+                * XXX What value should we return here? Originally this returns the
+                * inserted location of RUNNING_XACT record. Based on that, here
+                * returns the latest insert location for now.
+                */
+               return GetInsertRecPtr();

Looking at the LogStandbySnapshot() callers that use the output lsn, i.e.:

pg_log_standby_snapshot()
BackgroundWriterMain()
ReplicationSlotReserveWal()

It looks ok to me to use GetInsertRecPtr().

But if we "really" want to produce a "new" WAL record, what about using
LogLogicalMessage()? It could also be used for debugging purposes. Bonus point:
it does not need wal_level to be set to logical. Thoughts?

Right. Similarly, the SQL function pg_logical_emit_message() is sometimes used for
testing purposes, e.g. by advance_wal() and emit_wal() in Cluster.pm. Even so, we
have not found a use-case for it here yet, so I would like to keep the current
approach for now and update it based on future needs.

I'll investigate another point [1] and then post a new version.

[1]: /messages/by-id/CAA4eK1+x5-eOn5+MW6FiUjB_1bBCH8jCCARC1uMrx6erZ3J73w@mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#19Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#16)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Tue, Apr 01, 2025 at 04:55:06PM +0530, Amit Kapila wrote:

On Tue, Apr 1, 2025 at 2:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

But if we "really" want to produce a "new" WAL record, what about using
LogLogicalMessage()?

We are using injection points for testing purposes, which means the
caller is aware of skipping the running_xacts record during the test
run. So, there doesn't seem to be any reason to do anything extra.

Agreed, the idea was to provide extra debugging info for the tests. We can just
keep it in mind should the need arise.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#20Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#17)
3 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit, Bertrand,

You have not added any injection point for the above case. Isn't it
possible that, if a running_xacts record is logged concurrently with the
pruning record, it could advance the active slot's catalog_xmin on the
standby, and the same failure would occur in this case as well?

I think this timing failure can happen. Reproducer:

```
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', 'CHECKPOINT');
+sleep(20);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
```

And here is my theory...

First, a new table is created with a small fillfactor. After three UPDATEs,
the page becomes full. At the fourth UPDATE (let's say txn4), the backend
process performs on-access page pruning and a PRUNE_ON_ACCESS record is generated.
It requests standbys to discard tuples from before the third UPDATE (say txn3),
so the slot can be invalidated.
However, if a RUNNING_XACTS record is generated between txn3 and txn4, its
oldestRunningXact would be txn4's xid, and the catalog_xmin of the standby
slot would be advanced up to that. The subsequent PRUNE_ON_ACCESS record still
points to txn3, so slot invalidation won't happen in this case.
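
To make the theory concrete, here is a rough sketch of the kind of setup involved
(the fillfactor and column width below are illustrative, not necessarily the
test's exact values):

```perl
# Illustrative sketch only: a low fillfactor means the page fills up after a
# few UPDATEs, so the next UPDATE has to prune dead tuples on access and emits
# the pruning record that should conflict with the standby slot's catalog_xmin.
$node_primary->safe_psql('testdb', qq[
	CREATE TABLE prun(id integer, s char(1500))
		WITH (fillfactor = 75, user_catalog_table = true);
	INSERT INTO prun VALUES (1, 'A');]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);   # txn1
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);   # txn2
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);   # txn3
# If a RUNNING_XACTS record is logged at this point, the standby slot's
# catalog_xmin can advance to txn4's xid, and the pruning below no longer
# conflicts with it.
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);   # txn4: on-access pruning
```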

Based on that, I've updated the patch to use injection_points for scenario 5 as
well. Of course, the PG16/17 patches won't use the active slot for that scenario.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

PG16-v4-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchapplication/octet-stream; name=PG16-v4-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchDownload
From 564bea902f3ae85dca414b2537f35d4eca2c2d7d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v4-PG16] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is that
xl_running_xacts records are sometimes generated while the test runs; they can
advance the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 165 +++++++++---------
 1 file changed, 81 insertions(+), 84 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..ca967fb625f 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $drop_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($drop_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -191,22 +201,22 @@ sub check_slots_conflicting_status
 	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +225,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +235,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -247,10 +261,11 @@ sub check_for_invalidation
 #
 # Note that pg_current_snapshot() is used to get the horizon.  It does
 # not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that active replication slots are not
+# created for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +404,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -536,11 +551,13 @@ $node_subscriber->stop;
 # Scenario 1: hot_standby_feedback off and vacuum FULL
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -550,19 +567,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -580,7 +589,7 @@ check_slots_conflicting_status(1);
 
 # Get the restart_lsn from an invalidated slot
 my $restart_lsn = $node_standby->safe_psql('postgres',
-	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_activeslot' and conflicting is true;"
+	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 'vacuum_full_inactiveslot' and conflicting is true;"
 );
 
 chomp($restart_lsn);
@@ -615,14 +624,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -632,32 +643,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -668,29 +673,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -733,13 +732,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -754,17 +755,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -777,10 +774,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -798,7 +795,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -830,10 +827,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -897,14 +894,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

PG17-v4-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchapplication/octet-stream; name=PG17-v4-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchDownload
From 5b2b4218433dd52ff4dc19b03e54b37b036bccb2 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v4-PG17] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is that
xl_running_xacts records are sometimes generated while the test runs; they can
advance the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip using the active slots for some test cases.
---
 .../t/035_standby_logical_decoding.pl         | 199 +++++++++---------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..da5ad1b78f2 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,37 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
-	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
+	my ($slot_prefix, $drop_active_slot) = @_;
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	if ($drop_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+			qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -171,42 +181,46 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check reason for conflict in pg_replication_slots.
 sub check_slots_conflict_reason
 {
-	my ($slot_prefix, $reason) = @_;
+	my ($slot_prefix, $reason, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
-	$res = $node_standby->safe_psql(
-		'postgres', qq(
-			 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
-	);
-
-	is($res, "$reason", "$active_slot reason for conflict is $reason");
-
 	$res = $node_standby->safe_psql(
 		'postgres', qq(
 			 select invalidation_reason from pg_replication_slots where slot_name = '$inactive_slot' and conflicting;)
 	);
 
 	is($res, "$reason", "$inactive_slot reason for conflict is $reason");
+
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$res = $node_standby->safe_psql(
+			'postgres', qq(
+				select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
+		);
+
+		is($res, "$reason", "$active_slot reason for conflict is $reason");
+	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
-
-	# drop the logical slots
-	drop_logical_slots($previous_slot_prefix);
+	my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -215,9 +229,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +239,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -247,10 +265,11 @@ sub check_for_invalidation
 #
 # Note that pg_current_snapshot() is used to get the horizon.  It does
 # not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that active replication slots are not
+# created for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +408,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -539,21 +558,19 @@ $node_subscriber->stop;
 # active slot is invalidated.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('behaves_ok_', 1);
+
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 0, 1, 0);
 
 # Ensure that replication slot stats are not empty before triggering the
 # conflict.
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,27 +579,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -602,7 +603,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
 	'postgres',
 	"SELECT restart_lsn FROM pg_replication_slots
-		WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+		WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );
 
 chomp($restart_lsn);
@@ -634,14 +635,16 @@ ok(!-f "$standby_walfile",
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('vacuum_full_', 0);
+
 # get the position to search from in the standby logfile
 my $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to create/drop a relation and
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
-reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('row_removal_', 0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -651,32 +654,26 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict on a shared catalog table is to
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
-reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -687,29 +684,23 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('shared_row_removal_', 0);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
-reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
-	'no_conflict_', 0, 1);
+reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 0, 1);
 
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
@@ -758,13 +749,15 @@ $node_standby->restart;
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
+# drop the logical slots used by previous tests
+drop_logical_slots('no_conflict_', 1);
+
 # get the position to search from in the standby logfile
 $logstart = -s $node_standby->logfile;
 
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
-reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+reactive_slots_change_hfs_and_wait_for_xmins('pruning_', 0, 0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -779,17 +772,13 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -802,10 +791,10 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $logstart = -s $node_standby->logfile;
 
 # drop the logical slots
-drop_logical_slots('pruning_');
+drop_logical_slots('pruning_', 0);
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -823,7 +812,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
@@ -855,10 +844,10 @@ check_pg_recvlogical_stderr($handle,
 ##################################################
 
 # drop the logical slots
-drop_logical_slots('wal_level_');
+drop_logical_slots('wal_level_', 1);
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -922,14 +911,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

v4-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patch (application/octet-stream)
From f67f9fea45c6a3319aa05c8c1feac133178fd9e6 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 14:19:50 +0900
Subject: [PATCH v4] Stabilize 035_standby_logical_decoding.pl by using the
 injection_points.

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is
that xl_running_xacts records are sometimes generated while testing, which
advances the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip generating the record while a new injection point is
attached.

This failure has been possible since logical decoding was allowed on standby
servers, but the injection_points interface we use exists only on master, so
we do not backpatch.
---
 src/backend/storage/ipc/standby.c             | 16 ++++++
 .../t/035_standby_logical_decoding.pl         | 57 ++++++++++++++-----
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..0e621e9996a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -31,6 +31,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/hsearch.h"
+#include "utils/injection_point.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -1287,6 +1288,21 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+	/* For testing slot invalidation */
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))
+	{
+		/*
+		 * RUNNING_XACTS could move slots's xmin forward and cause random
+		 * failures in some tests. Skip generating to avoid it.
+		 *
+		 * XXX What value should we return here? Originally this returns the
+		 * inserted location of RUNNING_XACT record. Based on that, here
+		 * returns the latest insert location for now.
+		 */
+		return GetInsertRecPtr();
+	}
+#endif
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..96dd0340e8d 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -241,16 +246,19 @@ sub check_for_invalidation
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
 #
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# Note that injection_point is used to avoid seeing a xl_running_xacts that
+# would advance an active replication slot's catalog_xmin. Advancing the active
+# replication slot's catalog_xmin would break some tests that expect the active
+# slot to conflict with the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
 
+	# Note that from this point the checkpointer and bgwriter will skip writing
+	# xl_running_xacts record.
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_attach('log-running-xacts', 'error');");
+
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
@@ -268,6 +276,12 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	$node_primary->wait_for_replay_catchup($node_standby);
+
+	# Resume generating the xl_running_xacts record
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_detach('log-running-xacts');");
 }
 
 ########################
@@ -285,6 +299,14 @@ autovacuum = off
 $node_primary->dump_info;
 $node_primary->start;
 
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 
 $node_primary->safe_psql('testdb',
@@ -528,6 +550,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -557,8 +582,6 @@ wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
 								 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 
@@ -656,8 +679,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
 							 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 
@@ -690,8 +711,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE ROLE create_trash;
 							 DROP ROLE create_trash;', 'pg_authid');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid');
@@ -724,8 +743,6 @@ wait_until_vacuum_can_remove(
 							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
 							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # message should not be issued
 ok( !$node_standby->log_contains(
 		"invalidating obsolete slot \"no_conflict_inactiveslot\"", $logstart),
@@ -773,6 +790,14 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
 	0);
 
+# Injection_point is used to avoid seeing an xl_running_xacts even here. In
+# scenario 5, we verify the case that the backend process detects the page has
+# enough tuples; thus, page pruning happens. If the record is generated just
+# before doing on-pruning, the catalog_xmin of the active slot would be
+# updated; hence, the conflict would not occur.
+$node_primary->safe_psql('testdb',
+	"SELECT injection_points_attach('log-running-xacts', 'error');");
+
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE prun(id integer, s char(2000)) WITH (fillfactor = 75, user_catalog_table = true);]
@@ -785,6 +810,10 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
+# Resume generating the xl_running_xacts record
+$node_primary->safe_psql('testdb',
+	"SELECT injection_points_detach('log-running-xacts');");
+
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
 
-- 
2.43.5

#21Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#20)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san,

On Wed, Apr 02, 2025 at 07:16:25AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Amit, Bertrand,

You have not added any injection point for the above case. Isn't it
possible that, if a running_xact record is logged concurrently with the
pruning record, it could advance the catalog_xmin of the active slot on the
standby, and the same failure could occur in this case as well?

I confirmed that the timing failure can happen. Reproducer:

```
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', 'CHECKPOINT');
+sleep(20);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
```

Yeah, I was going to provide the exact same reproducer and then saw your email.

Based on that, I've updated the patch to use injection_points for scenario 5.
Of course, the PG16/17 patches won't use the active slot for that scenario.

Thanks for the updated patch!

As for v4-0001:

=== 1

+# would advance an active replication slot's catalog_xmin

s/would/could/? I mean the system also needs to be "slow" enough (so the
sleep() in the reproducer)

=== 2

+# Injection_point is used to avoid seeing an xl_running_xacts even here. In
+# scenario 5, we verify the case that the backend process detects the page has
+# enough tuples; thus, page pruning happens. If the record is generated just
+# before doing on-pruning, the catalog_xmin of the active slot would be
+# updated; hence, the conflict would not occur.

Not sure we need to explain what scenario 5 does here, but that does not hurt
if you feel the need.

Also maybe mention the last update in the comment and add some nuance (like
proposed in === 1), something like?

"
# Injection_point is used to avoid seeing a xl_running_xacts here. Indeed,
# if it is generated between the last 2 updates then the catalog_xmin of the active
# slot could be updated; hence, the conflict could not occur.
"

Apart from that, the tests look good to me and all the problematic scenarios
are covered.

As for PG17-v4-0001:

=== 3

-# seeing a xl_running_xacts that would advance an active replication slot's
+# seeing the xl_running_xacts that would advance an active replication slot's

why?

=== 4

It looks like check_slots_conflict_reason() is not called with
checks_active_slot as an argument.

=== 5

I think that we could remove the need for the drop_active_slot parameter in
drop_logical_slots() and just check whether an active slot exists (and drop it
if so). That said, I'm not sure it's worth going that far for backpatching.
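
A rough sketch of what I have in mind (untested, the helper name is made up,
just to illustrate the idea):

```
# Hypothetical helper: drop the active slot only if it was created, so
# drop_logical_slots() would not need a drop_active_slot parameter.
sub drop_active_slot_if_exists
{
	my ($slot_prefix) = @_;
	my $active_slot = $slot_prefix . 'activeslot';

	# pg_replication_slots tells us whether the active slot exists
	my $exists = $node_standby->safe_psql('postgres',
		qq[SELECT count(*) FROM pg_replication_slots WHERE slot_name = '$active_slot']);

	if ($exists ne '0')
	{
		$node_standby->psql('postgres',
			qq[SELECT pg_drop_replication_slot('$active_slot')]);
	}
}
```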

As for PG16-v4:

=== 6

Same as === 3 and === 5 (=== 4 does not apply as check_slots_conflict_reason()
does not exist).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#21)
1 attachment(s)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 2, 2025 at 2:06 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Kuroda-san,

On Wed, Apr 02, 2025 at 07:16:25AM +0000, Hayato Kuroda (Fujitsu) wrote:

As for v4-0001:

=== 1

+# would advance an active replication slot's catalog_xmin

s/would/could/? I mean the system also needs to be "slow" enough (so the
sleep() in the reproducer)

=== 2

+# Injection_point is used to avoid seeing an xl_running_xacts even here. In
+# scenario 5, we verify the case that the backend process detects the page has
+# enough tuples; thus, page pruning happens. If the record is generated just
+# before doing on-pruning, the catalog_xmin of the active slot would be
+# updated; hence, the conflict would not occur.

Not sure we need to explain what scenario 5 does here, but that does not hurt
if you feel the need.

Also maybe mention the last update in the comment and add some nuance (like
proposed in === 1), something like?

"
# Injection_point is used to avoid seeing a xl_running_xacts here. Indeed,
# if it is generated between the last 2 updates then the catalog_xmin of the active
# slot could be updated; hence, the conflict could not occur.
"

I have changed it based on your suggestions and made a few other
changes in the comments. Please see attached.

*
+  if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))

It is better to name the injection point as skip-log-running-xacts as
that will be appropriate based on its usage.

--
With Regards,
Amit Kapila.

Attachments:

v4-0001-amit.1.patch.txt (text/plain)
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 0e621e9996a..7fa8d9247e0 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1288,21 +1288,17 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
-	/* For testing slot invalidation */
 #ifdef USE_INJECTION_POINTS
-	if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))
+	if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
 	{
 		/*
-		 * RUNNING_XACTS could move slots's xmin forward and cause random
-		 * failures in some tests. Skip generating to avoid it.
-		 *
-		 * XXX What value should we return here? Originally this returns the
-		 * inserted location of RUNNING_XACT record. Based on that, here
-		 * returns the latest insert location for now.
+		 * This record could move slot's xmin forward during decoding, leading
+		 * to unpredictable results, so skip it when requested by the test.
 		 */
 		return GetInsertRecPtr();
 	}
 #endif
+
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 96dd0340e8d..39a8797a7cb 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -246,10 +246,10 @@ sub check_for_invalidation
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
 #
-# Note that injection_point is used to avoid seeing a xl_running_xacts that
-# would advance an active replication slot's catalog_xmin. Advancing the active
-# replication slot's catalog_xmin would break some tests that expect the active
-# slot to conflict with the catalog xmin horizon.
+# Note that the injection_point avoids seeing a xl_running_xacts that could
+# advance an active replication slot's catalog_xmin. Advancing the active
+# replication slot's catalog_xmin would break some tests that expect the
+# active slot to conflict with the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -257,7 +257,7 @@ sub wait_until_vacuum_can_remove
 	# Note that from this point the checkpointer and bgwriter will skip writing
 	# xl_running_xacts record.
 	$node_primary->safe_psql('testdb',
-		"SELECT injection_points_attach('log-running-xacts', 'error');");
+		"SELECT injection_points_attach('skip-log-running-xacts', 'error');");
 
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
@@ -281,7 +281,7 @@ sub wait_until_vacuum_can_remove
 
 	# Resume generating the xl_running_xacts record
 	$node_primary->safe_psql('testdb',
-		"SELECT injection_points_detach('log-running-xacts');");
+		"SELECT injection_points_detach('skip-log-running-xacts');");
 }
 
 ########################
@@ -790,13 +790,12 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
 	0);
 
-# Injection_point is used to avoid seeing an xl_running_xacts even here. In
-# scenario 5, we verify the case that the backend process detects the page has
-# enough tuples; thus, page pruning happens. If the record is generated just
-# before doing on-pruning, the catalog_xmin of the active slot would be
-# updated; hence, the conflict would not occur.
+# Injection_point avoids seeing an xl_running_xacts even here. This is required
+# because if it is generated between the last two updates, then the catalog_xmin
+# of the active slot could be updated, and hence, the conflict won't occur. See
+# comments atop wait_until_vacuum_can_remove.
 $node_primary->safe_psql('testdb',
-	"SELECT injection_points_attach('log-running-xacts', 'error');");
+	"SELECT injection_points_attach('skip-log-running-xacts', 'error');");
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -812,7 +811,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Resume generating the xl_running_xacts record
 $node_primary->safe_psql('testdb',
-	"SELECT injection_points_detach('log-running-xacts');");
+	"SELECT injection_points_detach('skip-log-running-xacts');");
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#21)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 2, 2025 at 2:06 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

As for PG17-v4-0001:

=== 4

It looks like check_slots_conflict_reason() is not called with
checks_active_slot as an argument.

=== 5

I think that we could remove the need for the drop_active_slot parameter in
drop_logical_slots() and just check whether an active slot exists (and drop it
if so). That said, I'm not sure it's worth going that far for backpatching.

The other idea to simplify the changes for backbranches:
sub reactive_slots_change_hfs_and_wait_for_xmins
{
...
+  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;

  # create the logical slots
-  create_logical_slots($node_standby, $slot_prefix);
+  create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
...
+  if ($needs_active_slot)
+  {
+    $handle =
+      make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+  }

What if this function doesn't take the input parameter needs_active_slot
and instead drops the call to make_slot_active? We would then call
make_slot_active only at the required places. This should make the
changes much smaller because, after that, we don't need changes
related to drop and create. Sure, in some cases we will test two
inactive slots instead of one, but I guess that is the price to
keep the tests simple and closer to HEAD.
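
For illustration only (not tested, not part of any patch), a scenario that
still needs an active slot would then activate it explicitly at the call
site, e.g.:

```
reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
	'no_conflict_', 0, 1);

# only the scenarios that really need an active slot do this
$handle =
  make_slot_active($node_standby, 'no_conflict_', 1, \$stdout, \$stderr);
```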

--
With Regards,
Amit Kapila.

#24Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#22)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Wed, Apr 02, 2025 at 03:04:07PM +0530, Amit Kapila wrote:

I have changed it based on your suggestions and made a few other
changes in the comments. Please see attached.

Thanks!

*
+  if (IS_INJECTION_POINT_ATTACHED("log-running-xacts"))

It is better to name the injection point as skip-log-running-xacts as
that will be appropriate based on its usage.

Agree.

+# Note that the injection_point avoids seeing a xl_running_xacts that could
and
+# Injection_point avoids seeing an xl_running_xacts even here. This is required

s/an xl_running_xacts/a xl_running_xacts/? in the second one? Also I'm not sure
"even here" is needed.

Apart from the above, LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#25Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#23)
2 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit, Bertrand,

The other idea to simplify the changes for backbranches:
sub reactive_slots_change_hfs_and_wait_for_xmins
{
...
+  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;

  # create the logical slots
-  create_logical_slots($node_standby, $slot_prefix);
+  create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
...
+  if ($needs_active_slot)
+  {
+    $handle =
+      make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+  }

What if this function doesn't take the input parameter needs_active_slot
and instead drops the call to make_slot_active? We would then call
make_slot_active only at the required places. This should make the
changes much smaller because, after that, we don't need changes
related to drop and create. Sure, in some cases we will test two
inactive slots instead of one, but I guess that is the price to
keep the tests simple and closer to HEAD.

Actually, I could not decide which one is better, so let me share both drafts.
V5-PG17-1 uses the previous approach, and v5-PG17-2 uses the newly proposed one.
Bertrand, which one do you like?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v5-PG17-1-0001-Stabilize-035_standby_logical_decoding.pl.patch (application/octet-stream)
From 91c6bb335e4aa09e7e631f1cd8786459d6675b93 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG17-1] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is
that xl_running_xacts records are sometimes generated while testing, which
advances the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip using the active slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 171 +++++++++---------
 1 file changed, 86 insertions(+), 85 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..1f3ae86c556 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -44,27 +44,50 @@ sub wait_for_xmins
 # Create the required logical slots on standby.
 sub create_logical_slots
 {
-	my ($node, $slot_prefix) = @_;
+	my ($node, $slot_prefix, $needs_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 	$node->create_logical_slot_on_standby($node_primary, qq($inactive_slot),
 		'testdb');
-	$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
-		'testdb');
+
+	if ($needs_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node->create_logical_slot_on_standby($node_primary, qq($active_slot),
+			'testdb');
+	}
+}
+
+# Checks the existence of the active slot. Returns the name if found, otherwise
+# undef.
+sub active_slot_exists
+{
+	my ($slot_prefix) = @_;
+
+	my $active_slot = $slot_prefix . 'activeslot';
+	my $active_slot_info = $node_standby->slot($active_slot);
+
+	return $active_slot_info->{'plugin'} eq '' ? undef : $active_slot;
 }
 
 # Drop the logical slots on standby.
 sub drop_logical_slots
 {
 	my ($slot_prefix) = @_;
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	$node_standby->psql('postgres',
 		qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
-	$node_standby->psql('postgres',
-		qq[SELECT pg_drop_replication_slot('$active_slot')]);
+
+	# Drops the active slot as well, if exists
+	if (active_slot_exists($slot_prefix))
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$node_standby->psql('postgres',
+				qq[SELECT pg_drop_replication_slot('$active_slot')]);
+	}
 }
 
 # Acquire one of the standby logical slots created by create_logical_slots().
@@ -173,40 +196,47 @@ sub check_slots_conflict_reason
 {
 	my ($slot_prefix, $reason) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
-	$res = $node_standby->safe_psql(
-		'postgres', qq(
-			 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
-	);
-
-	is($res, "$reason", "$active_slot reason for conflict is $reason");
-
 	$res = $node_standby->safe_psql(
 		'postgres', qq(
 			 select invalidation_reason from pg_replication_slots where slot_name = '$inactive_slot' and conflicting;)
 	);
 
 	is($res, "$reason", "$inactive_slot reason for conflict is $reason");
+
+	if (active_slot_exists($slot_prefix))
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$res = $node_standby->safe_psql(
+			'postgres', qq(
+				select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
+		);
+
+		is($res, "$reason", "$active_slot reason for conflict is $reason");
+	}
 }
 
-# Drop the slots, re-create them, change hot_standby_feedback,
-# check xmin and catalog_xmin values, make slot active and reset stat.
+# Create slots, change hot_standby_feedback, check xmin and catalog_xmin
+# values, make slot active and reset stat.
 sub reactive_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
+	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;
 
 	# drop the logical slots
 	drop_logical_slots($previous_slot_prefix);
 
 	# create the logical slots
-	create_logical_slots($node_standby, $slot_prefix);
+	create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($needs_active_slot)
+	{
+		$handle =
+		  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -217,7 +247,6 @@ sub check_for_invalidation
 {
 	my ($slot_prefix, $log_start, $test_name) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +255,23 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if (active_slot_exists($slot_prefix))
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +284,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that active replication slots are not
+# created for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -389,7 +424,7 @@ $node_standby->safe_psql('postgres',
 ##################################################
 
 # create the logical slots
-create_logical_slots($node_standby, 'behaves_ok_');
+create_logical_slots($node_standby, 'behaves_ok_', 1);
 
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE decoding_test(x integer, y text);]);
@@ -543,17 +578,13 @@ $node_subscriber->stop;
 # launch a vacuum full on pg_class with hot_standby_feedback turned off on
 # the standby.
 reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
-	0, 1);
+	0, 1, 0);
 
 # Ensure that replication slot stats are not empty before triggering the
 # conflict.
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -567,22 +598,6 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
-$handle =
-  make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -602,7 +617,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
 	'postgres',
 	"SELECT restart_lsn FROM pg_replication_slots
-		WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+		WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );
 
 chomp($restart_lsn);
@@ -641,7 +656,7 @@ my $logstart = -s $node_standby->logfile;
 # launch a vacuum on pg_class with hot_standby_feedback turned off on the
 # standby.
 reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
-	0, 1);
+	0, 1, 0);
 
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
@@ -656,14 +671,6 @@ check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
-$handle =
-  make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
@@ -676,7 +683,7 @@ $logstart = -s $node_standby->logfile;
 # create/drop a role and launch a vacuum on pg_authid with
 # hot_standby_feedback turned off on the standby.
 reactive_slots_change_hfs_and_wait_for_xmins('row_removal_',
-	'shared_row_removal_', 0, 1);
+	'shared_row_removal_', 0, 1, 0);
 
 # Trigger the conflict
 wait_until_vacuum_can_remove(
@@ -692,14 +699,6 @@ check_for_invalidation('shared_row_removal_', $logstart,
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
-$handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
-	\$stderr);
-
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
@@ -747,6 +746,12 @@ is( $node_standby->safe_psql(
 	'f',
 	'Logical slots are reported as non conflicting');
 
+my $tmp = $node_standby->safe_psql(
+		'postgres',
+		q[select slot_name, conflicting from pg_replication_slots
+			where slot_type = 'logical']);
+print $tmp;
+
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 0);
 
@@ -764,7 +769,7 @@ $logstart = -s $node_standby->logfile;
 # One way to produce recovery conflict is to trigger an on-access pruning
 # on a relation marked as user_catalog_table.
 reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
-	0);
+	0, 0);
 
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
@@ -786,10 +791,6 @@ check_slots_conflict_reason('pruning_', 'rows_removed');
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -805,7 +806,7 @@ $logstart = -s $node_standby->logfile;
 drop_logical_slots('pruning_');
 
 # create the logical slots
-create_logical_slots($node_standby, 'wal_level_');
+create_logical_slots($node_standby, 'wal_level_', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 1, \$stdout, \$stderr);
@@ -858,7 +859,7 @@ check_pg_recvlogical_stderr($handle,
 drop_logical_slots('wal_level_');
 
 # create the logical slots
-create_logical_slots($node_standby, 'drop_db_');
+create_logical_slots($node_standby, 'drop_db_', 1);
 
 $handle = make_slot_active($node_standby, 'drop_db_', 1, \$stdout, \$stderr);
 
@@ -922,14 +923,14 @@ $node_cascading_standby->append_conf(
 $node_cascading_standby->start;
 
 # create the logical slots
-create_logical_slots($node_standby, 'promotion_');
+create_logical_slots($node_standby, 'promotion_', 1);
 
 # Wait for the cascading standby to catchup before creating the slots
 $node_standby->wait_for_replay_catchup($node_cascading_standby,
 	$node_primary);
 
 # create the logical slots on the cascading standby too
-create_logical_slots($node_cascading_standby, 'promotion_');
+create_logical_slots($node_cascading_standby, 'promotion_', 1);
 
 # Make slots actives
 $handle =
-- 
2.43.5

v5-PG17-2-0001-Stabilize-035_standby_logical_decoding.pl.patch (application/octet-stream)
From ea258ce0f4d2f416f0d33af974abe87219458511 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG17-2] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on the standby server by running VACUUM on
the primary and discarding tuples still needed by the slots. The problem is
that xl_running_xacts records are sometimes generated while testing, which
advances the catalog_xmin so that the invalidation might not happen in some
cases.

The fix is to skip using the active slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 66 +++++++++----------
 1 file changed, 31 insertions(+), 35 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..2b7ae4b74e9 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,9 +212,8 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
 
 	# message should be issued
@@ -226,18 +222,24 @@ sub check_for_invalidation
 			$log_start),
 		"inactiveslot slot invalidation is logged $test_name");
 
-	ok( $node_standby->log_contains(
-			"invalidating obsolete replication slot \"$active_slot\"",
-			$log_start),
-		"activeslot slot invalidation is logged $test_name");
-
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		ok( $node_standby->log_contains(
+				"invalidating obsolete replication slot \"$active_slot\"",
+				$log_start),
+			"activeslot slot invalidation is logged $test_name");
+
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +252,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that replication slots are not activated
+# for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -550,10 +553,6 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,19 +561,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
@@ -651,7 +642,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
@@ -687,7 +678,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
@@ -711,6 +702,11 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# This scenario won't produce the race condition by a xl_running_xacts, so
+# activate the slot. See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
+	\$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -779,7 +775,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
@@ -823,7 +819,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
-- 
2.43.5

#26Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#25)
1 attachment(s)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 02, 2025 at 12:13:52PM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Amit, Bertrand,

The other idea to simplify the changes for backbranches:
sub reactive_slots_change_hfs_and_wait_for_xmins
{
...
+  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;

  # create the logical slots
-  create_logical_slots($node_standby, $slot_prefix);
+  create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
...
+  if ($needs_active_slot)
+  {
+    $handle =
+      make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+  }

What if this function doesn't take the input parameter needs_active_slot
and instead drops the call to make_slot_active? We would then call
make_slot_active only at the required places. This should make the
changes much smaller because, after that, we don't need changes
related to drop and create. Sure, in some cases we will test two
inactive slots instead of one, but I guess that is the price to
keep the tests simple and closer to HEAD.

Actually, I could not decide which one is better, so let me share both drafts.

Thanks!

V5-PG17-1 uses the previous approach, and v5-PG17-2 uses new proposed one.
Bertrand, which one do you like?

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

Also I think that we should change this part:

"
 # Verify that invalidated logical slots do not lead to retaining WAL.
@@ -602,7 +610,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
        'postgres',
        "SELECT restart_lsn FROM pg_replication_slots
-               WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+               WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );

" to be on the safe side of thing.

What do you think of the attached (to apply on top of v5-PG17-2)?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-PG17-2-0001-bertrand.patch.txttext/plain; charset=us-asciiDownload
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 2b7ae4b74e9..d1be179fed6 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -171,24 +171,28 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check reason for conflict in pg_replication_slots.
 sub check_slots_conflict_reason
 {
-	my ($slot_prefix, $reason) = @_;
+	my ($slot_prefix, $reason, $checks_active_slot) = @_;
 
-	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
-
-	$res = $node_standby->safe_psql(
-		'postgres', qq(
-			 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
-	);
-
-	is($res, "$reason", "$active_slot reason for conflict is $reason");
-
 	$res = $node_standby->safe_psql(
 		'postgres', qq(
 			 select invalidation_reason from pg_replication_slots where slot_name = '$inactive_slot' and conflicting;)
 	);
 
 	is($res, "$reason", "$inactive_slot reason for conflict is $reason");
+
+	if ($checks_active_slot)
+	{
+		my $active_slot = $slot_prefix . 'activeslot';
+
+		$res = $node_standby->safe_psql(
+			'postgres', qq(
+				 select invalidation_reason from pg_replication_slots where slot_name = '$active_slot' and conflicting;)
+		);
+
+		is($res, "$reason", "$active_slot reason for conflict is $reason");
+
+	}
 }
 
 # Drop the slots, re-create them, change hot_standby_feedback,
@@ -205,6 +209,9 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
+	$handle =
+	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -252,8 +259,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon. We ensure that replication slots are not activated
-# for tests that might produce this race condition though.
+# the catalog xmin horizon. We ensure to not test for invalidations in such
+# cases.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -553,6 +560,10 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
+$node_standby->poll_query_until('testdb',
+	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
+) or die "replication slot stats of vacuum_full_activeslot not updated";
+
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -564,16 +575,19 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
-check_slots_conflict_reason('vacuum_full_', 'rows_removed');
+check_slots_conflict_reason('vacuum_full_', 'rows_removed', 0);
+
+# Ensure that replication slot stats are not removed after invalidation.
+is( $node_standby->safe_psql(
+		'testdb',
+		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
+	),
+	't',
+	'replication slot stats not removed after invalidation');
 
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"vacuum_full_activeslot\""
-);
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -583,7 +597,7 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 $node_standby->restart;
 
 # Verify reason for conflict is retained across a restart.
-check_slots_conflict_reason('vacuum_full_', 'rows_removed');
+check_slots_conflict_reason('vacuum_full_', 'rows_removed', 0);
 
 ##################################################
 # Verify that invalidated logical slots do not lead to retaining WAL.
@@ -593,7 +607,7 @@ check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 my $restart_lsn = $node_standby->safe_psql(
 	'postgres',
 	"SELECT restart_lsn FROM pg_replication_slots
-		WHERE slot_name = 'vacuum_full_activeslot' AND conflicting;"
+		WHERE slot_name = 'vacuum_full_inactiveslot' AND conflicting;"
 );
 
 chomp($restart_lsn);
@@ -645,16 +659,11 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
-check_slots_conflict_reason('row_removal_', 'rows_removed');
+check_slots_conflict_reason('row_removal_', 'rows_removed', 0);
 
 $handle =
   make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a shared catalog table
 # Scenario 3: conflict due to row removal with hot_standby_feedback off.
@@ -681,16 +690,11 @@ check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
-check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
+check_slots_conflict_reason('shared_row_removal_', 'rows_removed', 0);
 
 $handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
 	\$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"shared_row_removal_activeslot\""
-);
-
 ##################################################
 # Recovery conflict: Same as Scenario 2 but on a non catalog table
 # Scenario 4: No conflict expected.
@@ -702,11 +706,6 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
-# This scenario won't produce the race condition by a xl_running_xacts, so
-# activate the slot. See comments atop wait_until_vacuum_can_remove().
-make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
-	\$stderr);
-
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -778,14 +777,10 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
-check_slots_conflict_reason('pruning_', 'rows_removed');
+check_slots_conflict_reason('pruning_', 'rows_removed', 0);
 
 $handle = make_slot_active($node_standby, 'pruning_', 0, \$stdout, \$stderr);
 
-# We are not able to read from the slot as it has been invalidated
-check_pg_recvlogical_stderr($handle,
-	"can no longer get changes from replication slot \"pruning_activeslot\"");
-
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
@@ -822,7 +817,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
-check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
+check_slots_conflict_reason('wal_level_', 'wal_level_insufficient', 1);
 
 $handle =
   make_slot_active($node_standby, 'wal_level_', 0, \$stdout, \$stderr);
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#26)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 2, 2025 at 8:30 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

On Wed, Apr 02, 2025 at 12:13:52PM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Amit, Bertrand,

The other idea to simplify the changes for backbranches:
sub reactive_slots_change_hfs_and_wait_for_xmins
{
...
+  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_;

# create the logical slots
-  create_logical_slots($node_standby, $slot_prefix);
+  create_logical_slots($node_standby, $slot_prefix, $needs_active_slot);
...
+  if ($needs_active_slot)
+  {
+    $handle =
+      make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+  }

What if this function doesn't take input parameter needs_active_slot
and rather removes the call to make_slot_active? We will call
make_slot_active only at the required places. This should make the
changes much less because after that, we don't need to make changes
related to drop and create. Sure, in some cases, we will test two
inactive slots instead of one, but I guess that would be the price to
keep the tests simple and more like HEAD.

Actually, I could not decide which one is better, so let me share both drafts.

Thanks!

V5-PG17-1 uses the previous approach, and v5-PG17-2 uses new proposed one.
Bertrand, which one do you like?

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason, without any
advantage. I feel Kuroda-San's second patch is simple; we have fewer
chances to make mistakes, and it is easier to maintain in the future
as well.

--
With Regards,
Amit Kapila.

#28Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#27)
3 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand, Amit,

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason, without any
advantage. I feel Kuroda-San's second patch is simple; we have fewer
chances to make mistakes, and it is easier to maintain in the future
as well.

I have a concern that Bertrand's patch could introduce another timing
issue. E.g., if the activated slots are not invalidated, they keep being
active when we try to drop them, so the drop might fail. I did not reproduce
this, but something like this can happen if we activate the slots.

The attached patch reflects the conclusion of these discussions: slots are
created but are seldom activated.

The naming of the patches is a bit different, but please ignore that...

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v5-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v5-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From c69b5b2d0b53c28ac99705b2c1507be24658104b Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG16] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on standby server, by running VACUUM on
primary and discarding needed tuples for slots. The problem is that
xl_running_xacts records are sotimetimes generated while testing, it advances
the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip using the active slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 41 +++++++++++--------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..1cf58f453f5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that replication slots are not activated
+# for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -550,7 +552,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -632,7 +634,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -668,7 +670,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -692,6 +694,11 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# This scenario won't produce the race condition by a xl_running_xacts, so
+# activate the slot. See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
+	\$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -754,7 +761,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -798,7 +805,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
-- 
2.43.5

v5-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v5-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From 5dce899cb4856908ea41a8817fa71135b505c0c2 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG17] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on standby server, by running VACUUM on
primary and discarding needed tuples for slots. The problem is that
xl_running_xacts records are sotimetimes generated while testing, it advances
the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip activating slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 53 +++++++++----------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..752e31960ea 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that replication slots are not activated
+# for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -550,10 +552,6 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,19 +560,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
@@ -651,7 +641,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
@@ -687,7 +677,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
@@ -711,6 +701,11 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# This scenario won't produce the race condition by a xl_running_xacts, so
+# activate the slot. See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
+	\$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -779,7 +774,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
@@ -823,7 +818,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
-- 
2.43.5

0001-Fix-invalid-referring-of-hash-ref-for-replication-sl.patchapplication/octet-stream; name=0001-Fix-invalid-referring-of-hash-ref-for-replication-sl.patchDownload
From 47249c139c7fe7671d02657c6fb5f9bed128af14 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 3 Apr 2025 12:12:12 +0900
Subject: [PATCH] Fix invalid referring of hash-ref for replication slots

hash-ref gerenated by slot() did not have key 'slot_name', but some codes
referred it. Fix it by referring 'plugin' instead.
---
 src/test/recovery/t/006_logical_decoding.pl           | 8 ++++----
 src/test/recovery/t/010_logical_decoding_timelines.pl | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index a5678bc4dc4..2137c4e5e30 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -161,8 +161,8 @@ SKIP:
 	is($node_primary->psql('postgres', 'DROP DATABASE otherdb'),
 		3, 'dropping a DB with active logical slots fails');
 	$pg_recvlogical->kill_kill;
-	is($node_primary->slot('otherdb_slot')->{'slot_name'},
-		undef, 'logical slot still exists');
+	is($node_primary->slot('otherdb_slot')->{'plugin'},
+		'test_decoding', 'logical slot still exists');
 }
 
 $node_primary->poll_query_until('otherdb',
@@ -171,8 +171,8 @@ $node_primary->poll_query_until('otherdb',
 
 is($node_primary->psql('postgres', 'DROP DATABASE otherdb'),
 	0, 'dropping a DB with inactive logical slots succeeds');
-is($node_primary->slot('otherdb_slot')->{'slot_name'},
-	undef, 'logical slot was actually dropped with DB');
+is($node_primary->slot('otherdb_slot')->{'plugin'},
+	'', 'logical slot was actually dropped with DB');
 
 # Test logical slot advancing and its durability.
 # Passing failover=true (last arg) should not have any impact on advancing.
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index 08615f1fca8..0199ae95abf 100644
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ b/src/test/recovery/t/010_logical_decoding_timelines.pl
@@ -94,8 +94,8 @@ is( $node_replica->safe_psql(
 		'postgres', q[SELECT 1 FROM pg_database WHERE datname = 'dropme']),
 	'',
 	'dropped DB dropme on standby');
-is($node_primary->slot('dropme_slot')->{'slot_name'},
-	undef, 'logical slot was actually dropped on standby');
+is($node_primary->slot('dropme_slot')->{'plugin'},
+	'', 'logical slot was actually dropped on standby');
 
 # Back to testing failover...
 $node_primary->safe_psql('postgres',
-- 
2.43.5

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#28)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Thu, Apr 3, 2025 at 11:04 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Bertrand, Amit,

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason, without any
advantage. I feel Kuroda-San's second patch is simple; we have fewer
chances to make mistakes, and it is easier to maintain in the future
as well.

I have a concern that Bertrand's patch could introduce another timing
issue. E.g., if the activated slots are not invalidated, they keep being
active when we try to drop them, so the drop might fail. I did not reproduce
this, but something like this can happen if we activate the slots.

The attached patch reflects the conclusion of these discussions: slots are
created but are seldom activated.

The naming of the patches is a bit different, but please ignore that...

Isn't patch 0001-Fix-invalid-referring-of-hash-ref-for-replication-sl
unrelated to this thread? Or am I missing something?

--
With Regards,
Amit Kapila.

#30Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#29)
3 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Isn't patch 0001-Fix-invalid-referring-of-hash-ref-for-replication-sl
unrelated to this thread? Or am I missing something?

I attached the wrong set by mistake; PSA the correct one. Sorry for the
inconvenience.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v5-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchapplication/octet-stream; name=v5-0001-Stabilize-035_standby_logical_decoding.pl-by-usin.patchDownload
From fbb3658c17dbf5bd4fdcee1803fda6a40d3839a4 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 14:19:50 +0900
Subject: [PATCH v5] Stabilize 035_standby_logical_decoding.pl by using the
 injection_points.

This test tries to invalidate slots on standby server, by running VACUUM on
primary and discarding needed tuples for slots. The problem is that
xl_running_xacts records are sotimetimes generated while testing, it advances
the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip generating the record when the instance attached to a new
injection point.

This failure can happen since logical decoding is allowed on the standby server.
But the interface of injection_points we used exists only on master, so we do
not backpatch.
---
 src/backend/storage/ipc/standby.c             | 12 ++++
 .../t/035_standby_logical_decoding.pl         | 56 ++++++++++++++-----
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..7fa8d9247e0 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -31,6 +31,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
 #include "utils/hsearch.h"
+#include "utils/injection_point.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -1287,6 +1288,17 @@ LogStandbySnapshot(void)
 
 	Assert(XLogStandbyInfoActive());
 
+#ifdef USE_INJECTION_POINTS
+	if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
+	{
+		/*
+		 * This record could move slot's xmin forward during decoding, leading
+		 * to unpredictable results, so skip it when requested by the test.
+		 */
+		return GetInsertRecPtr();
+	}
+#endif
+
 	/*
 	 * Get details of any AccessExclusiveLocks being held at the moment.
 	 */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..52ebd24f7f1 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
 my ($stdout, $stderr, $cascading_stdout, $cascading_stderr, $handle);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
@@ -241,16 +246,19 @@ sub check_for_invalidation
 # VACUUM command, $sql the sql to launch before triggering the vacuum and
 # $to_vac the relation to vacuum.
 #
-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# Note that the injection_point avoids seeing a xl_running_xacts that could
+# advance an active replication slot's catalog_xmin. Advancing the active
+# replication slot's catalog_xmin would break some tests that expect the
+# active slot to conflict with the catalog xmin horizon.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
 
+	# Note that from this point the checkpointer and bgwriter will skip writing
+	# xl_running_xacts record.
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_attach('skip-log-running-xacts', 'error');");
+
 	# Get the current xid horizon,
 	my $xid_horizon = $node_primary->safe_psql('testdb',
 		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
@@ -268,6 +276,12 @@ sub wait_until_vacuum_can_remove
 	$node_primary->safe_psql(
 		'testdb', qq[VACUUM $vac_option verbose $to_vac;
 										  INSERT INTO flush_wal DEFAULT VALUES;]);
+
+	$node_primary->wait_for_replay_catchup($node_standby);
+
+	# Resume generating the xl_running_xacts record
+	$node_primary->safe_psql('testdb',
+		"SELECT injection_points_detach('skip-log-running-xacts');");
 }
 
 ########################
@@ -285,6 +299,14 @@ autovacuum = off
 $node_primary->dump_info;
 $node_primary->start;
 
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
 $node_primary->psql('postgres', q[CREATE DATABASE testdb]);
 
 $node_primary->safe_psql('testdb',
@@ -528,6 +550,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+# Create the injection_points extension
+$node_primary->safe_psql('testdb', 'CREATE EXTENSION injection_points;');
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -557,8 +582,6 @@ wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
 								 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 
@@ -656,8 +679,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
 							 DROP TABLE conflict_test;', 'pg_class');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
 
@@ -690,8 +711,6 @@ wait_until_vacuum_can_remove(
 	'', 'CREATE ROLE create_trash;
 							 DROP ROLE create_trash;', 'pg_authid');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
 	'with vacuum on pg_authid');
@@ -724,8 +743,6 @@ wait_until_vacuum_can_remove(
 							 INSERT INTO conflict_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;
 							 UPDATE conflict_test set x=1, y=1;', 'conflict_test');
 
-$node_primary->wait_for_replay_catchup($node_standby);
-
 # message should not be issued
 ok( !$node_standby->log_contains(
 		"invalidating obsolete slot \"no_conflict_inactiveslot\"", $logstart),
@@ -773,6 +790,13 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('no_conflict_', 'pruning_', 0,
 	0);
 
+# Injection_point avoids seeing a xl_running_xacts. This is required because if
+# it is generated between the last two updates, then the catalog_xmin of the
+# active slot could be updated, and hence, the conflict won't occur. See
+# comments atop wait_until_vacuum_can_remove.
+$node_primary->safe_psql('testdb',
+	"SELECT injection_points_attach('skip-log-running-xacts', 'error');");
+
 # This should trigger the conflict
 $node_primary->safe_psql('testdb',
 	qq[CREATE TABLE prun(id integer, s char(2000)) WITH (fillfactor = 75, user_catalog_table = true);]
@@ -785,6 +809,10 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
+# Resume generating the xl_running_xacts record
+$node_primary->safe_psql('testdb',
+	"SELECT injection_points_detach('skip-log-running-xacts');");
+
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
 
-- 
2.43.5

v5-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v5-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From c69b5b2d0b53c28ac99705b2c1507be24658104b Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG16] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on standby server, by running VACUUM on
primary and discarding needed tuples for slots. The problem is that
xl_running_xacts records are sotimetimes generated while testing, it advances
the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip using the active slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 41 +++++++++++--------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8120dfc2132..1cf58f453f5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that replication slots are not activated
+# for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -550,7 +552,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -632,7 +634,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -668,7 +670,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -692,6 +694,11 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# This scenario won't produce the race condition by a xl_running_xacts, so
+# activate the slot. See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
+	\$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -754,7 +761,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -798,7 +805,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
-- 
2.43.5

v5-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v5-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From 5dce899cb4856908ea41a8817fa71135b505c0c2 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 26 Mar 2025 19:03:50 +0900
Subject: [PATCH v5-PG17] Stabilize 035_standby_logical_decoding.pl

This test tries to invalidate slots on standby server, by running VACUUM on
primary and discarding needed tuples for slots. The problem is that
xl_running_xacts records are sotimetimes generated while testing, it advances
the catalog_xmin so that the invalidation might not happen in some cases.

The fix is to skip activating slots for some testcases.
---
 .../t/035_standby_logical_decoding.pl         | 53 +++++++++----------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index aeb79f51e71..752e31960ea 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,8 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon. We ensure that replication slots are not activated
+# for tests that might produce this race condition though.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -550,10 +552,6 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,19 +560,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
@@ -651,7 +641,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
@@ -687,7 +677,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
@@ -711,6 +701,11 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# This scenario won't produce the race condition by a xl_running_xacts, so
+# activate the slot. See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout,
+	\$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -779,7 +774,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
@@ -823,7 +818,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
-- 
2.43.5

#31Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#28)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand, Amit,

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason without any
advantage.

I think that there is an advantage. The pros are:

- the test would be closer to HEAD from a behavioural point of view
- it's very rare to hit the corner cases, so the test would behave the same
as on HEAD most of the time (and when it does not, that would not hurt as the
checks are not done)
- Kuroda-San's patch removes the check 'or die "replication slot stats of vacuum_full_activeslot not updated"',
while keeping the slot active would allow us to keep it (whether the slot ends
up invalidated or not). But more on that in comment === 1 below.

I feel Kuroda-San's second patch is simple; we have fewer chances to make
mistakes, and it is easier to maintain in the future as well.

Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
fine with Kuroda-San's patch if both of you feel that it's better.

I have a concern that Bertrand's patch could introduce another timing
issue. E.g., if the activated slots are not invalidated, they keep being
active when we try to drop them, so the drop might fail.

Yeah, but the drop is done with "$node_standby->psql" so that the test does not
produce an error. It would produce an error should we use "$node_standby->safe_psql"
instead.
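
For illustration, here is a minimal sketch of that difference (the slot name
and error handling below are just examples, not taken from the test):

"
# psql() returns the exit status (and stdout/stderr in list context), so a
# failed drop of a still-active slot does not abort the test:
my ($ret, $out, $err) = $node_standby->psql('postgres',
	qq[SELECT pg_drop_replication_slot('no_conflict_activeslot')]);
note "drop failed: $err" if $ret != 0;

# safe_psql() would die on the same error and make the test fail:
# $node_standby->safe_psql('postgres',
#	qq[SELECT pg_drop_replication_slot('no_conflict_activeslot')]);
"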

I did not reproduce this but
something like this can happen if we activate slots.

You can see it that way (+ reproducer.txt):

"
+       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
+       note "BDT: $bdt";
+
        $node_standby->psql('postgres',
                qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
"

You'd see the slot being active and the "$node_standby->psql" not reporting
any error.

The attached patch reflects the conclusion of these discussions: slots are
created but are seldom activated.

Thanks for the patch!

=== 1

-$node_standby->poll_query_until('testdb',
- qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
# This should trigger the conflict
wait_until_vacuum_can_remove(

I wonder if we could not keep this test and make the slot active for the
vacuum full case. Looking at drongo's failure in [1], there is no occurrence
of "vacuum full", and that's probably linked to Andres's explanation in [2]:

"
a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing after it started (due
to the logged AEL at the start of VACFULL).
"

meaning that the active slot is invalidated even if the catalog_xmin moves
forward due to xl_running_xacts.

[1]: /messages/by-id/386386.1737736935@sss.pgh.pa.us
[2]: /messages/by-id/zqypkuvtihtd2zbmwdfmcceujg4fuakrhojmjkxpp7t4udqkty@couhenc7dsor

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#31)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand, Amit,

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we
should keep the slots active and only avoid doing the checks for them (if they
are invalidated, that's fine; if they are not, that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason without any
advantage.

I think that there is an advantage. The pros are:

- the test would be closer to HEAD from a behavioural point of view
- it's very rare to hit the corner cases, so the test would behave the same
as on HEAD most of the time (and when it does not, that would not hurt as the
checks are not done)
- Kuroda-San's patch removes the check 'or die "replication slot stats of vacuum_full_activeslot not updated"',
while keeping the slot active would allow us to keep it (whether the slot ends
up invalidated or not). But more on that in comment === 1 below.

I feel Kuroda-San's second patch is simple; we have fewer chances to make
mistakes, and it is easier to maintain in the future as well.

Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
fine with Kuroda-San's patch if both of you feel that it's better.

I have a concern that Bertrand's patch could introduce another timing
issue. E.g., if the activated slots are not invalidated, they keep being
active when we try to drop them, so the drop might fail.

Yeah, but the drop is done with "$node_standby->psql" so that the test does not
produce an error. It would produce an error should we use "$node_standby->safe_psql"
instead.

I did not reproduce this but
something like this can happen if we activate slots.

You can see it that way (+ reproducer.txt):

"
+       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
+       note "BDT: $bdt";
+
$node_standby->psql('postgres',
qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
"

You'd see the slot being active and the "$node_standby->psql" not reporting
any error.

Hmm, but adding some additional smarts also makes this test less easy
to backpatch. I see your points related to the benefits, but I still
mildly prefer to go with the lesser-changes approach for the backbranches
patch. Normally, we don't enhance backbranch code without making
equivalent changes in HEAD, so there is less chance of adding new bugs
only in backbranches.

--
With Regards,
Amit Kapila.

#33Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Bertrand Drouvot (#31)
1 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Bertrand,

I wonder if we could not keep this test and make the slot active for the
vacuum full case. Looking at drongo's failure in [1], there is no occurence
of "vacuum full" and that's probably linked to Andres's explanation in [2]:

"
a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing after it started (due
to the logged AEL at the start of VACFULL).
"

I had been debugging and found a case where VACUUM FULL also has a timing issue.
This means that we cannot keep the test case.

PSA the reproducer for PG17. IIUC this can happen even in PG16.
Here is what I think happens:

1. Run a CHECKPOINT and wait some time in wait_until_vacuum_can_remove().
This ensures that a RUNNING_XACTS record can be generated and catalog_xmin can
be advanced after the user SQLs.
2. Assume that another RUNNING_XACTS record is generated *WHILE* doing a VACUUM
FULL. This can be done by the periodic checkpoint or by the reproducer.
3. The logical walsender detects the RUNNING_XACTS record.
Note that this must happen before the startup process tries to invalidate the slot.
4. Some time later the walsender receives the ack and advances the catalog_xmin.
Note again that this must happen before the startup process tries to invalidate the slot.
5. The startup process detects the PRUNE_ON_ACCESS record and tries to invalidate the
slot. However, the catalog_xmin has been advanced, so the invalidation
cannot be done.

Analysis
========
While analyzing this workload, I found that VACUUM FULL can generate four
PRUNE_ON_ACCESS records. More precisely, the first two records are generated while
clustering the table, and the others while updating pg_database.datfrozenxid.
Interestingly, the latter records are generated after the transaction is finished;
the VACUUM FULL command itself ends the txn once (in vacuum_rel) and then
continues working. Without the delay in the test code, the first PRUNE record leads
to the invalidation of the slot, and with the delay the fourth PRUNE does. Per my
analysis, snapshotConflictHorizon is the xid at which the first PRUNE records exist.

Based on this fact, I considered that catalog_xmin can be advanced in between the
transactional and non-transactional PRUNE records. RequestCheckpoint() is added to
generate the RUNNING_XACTS record in between them.

Many thanks to Amit for supporting me off-list in reproducing the issue.

Best regards,
Hayato Kuroda
Fujitsu LIMITED

Attachments:

repro_pg17.diffsapplication/octet-stream; name=repro_pg17.diffsDownload
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 79fe26a9325..ae4580ea7e1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -46,6 +46,7 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
+#include "postmaster/bgwriter.h"
 #include "postmaster/interrupt.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -685,6 +686,12 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
 		StartTransactionCommand();
 	}
 
+	/*
+	 * REPRO: Request chckpointer to wakeup to ensure xl_running_xacts exists
+	 * before the PRUNE_ON_ACCESS
+	 */
+	RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+
 	if ((params->options & VACOPT_VACUUM) &&
 		!(params->options & VACOPT_SKIP_DATABASE_STATS))
 	{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a1d4768623f..7ad4e61ed8c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1570,6 +1570,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			break;
 		}
 
+		/*
+		 * REPRO: wait sometime to avoid slot invalidation before the logical
+		 * walsender decodes xl_running_xacts and catalog_xmin is advanced.
+		 */
+		sleep(2);
+
 		/*
 		 * Check if the slot needs to be invalidated. If it needs to be
 		 * invalidated, and is not currently acquired, acquire it and mark it
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 4eca17885d6..71a18891d44 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -267,6 +267,11 @@ sub wait_until_vacuum_can_remove
 		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
 	) or die "new snapshot does not have a newer horizon";
 
+	# REPRO: do CHECKPOINT and wait sometime to generate xl_running_xacts
+	# records
+	$node_primary->safe_psql('testdb', qq[CHECKPOINT]);
+	sleep(20);
+
 	# Launch the vacuum command and insert into flush_wal (see CREATE
 	# TABLE flush_wal for the reason).
 	$node_primary->safe_psql(
#34Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#33)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi Kuroda-san,

On Mon, Apr 07, 2025 at 06:15:13AM +0000, Hayato Kuroda (Fujitsu) wrote:

I had been debugging and found a case where VACUUM FULL also has a timing issue.
This means that we cannot keep the test case.

PSA the reproducer for PG17. IIUC this can happen even in PG16.
Here is what I think happens:

1. Run a CHECKPOINT and wait some time in wait_until_vacuum_can_remove().
This ensures that a RUNNING_XACTS record can be generated and catalog_xmin can
be advanced after the user SQLs.
2. Assume that another RUNNING_XACTS record is generated *WHILE* doing a VACUUM
FULL. This can be done by the periodic checkpoint or by the reproducer.
3. The logical walsender detects the RUNNING_XACTS record.
Note that this must happen before the startup process tries to invalidate the slot.
4. Some time later the walsender receives the ack and advances the catalog_xmin.
Note again that this must happen before the startup process tries to invalidate the slot.
5. The startup process detects the PRUNE_ON_ACCESS record and tries to invalidate the
slot. However, the catalog_xmin has been advanced, so the invalidation
cannot be done.

Thanks for the testing and explanation! I did apply your repro and I'm able to
see the test failing (with an active slot). The scenario is less likely
to happen (as compared to the non VACUUM FULL cases) and that's why it was not
visible in drongo's reports in [1]. So yeah, let's do as you suggested and not
make the slot active for the vacuum full case either.

[1]: /messages/by-id/386386.1737736935@sss.pgh.pa.us

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#32)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand, Amit,

I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we should
keep the slots active and only avoid doing the checks for them (if they are
invalidated that's fine, if they are not that's fine too).

I don't mind doing that, but there is no benefit in making slots
active unless we can validate them. And we will end up adding some
more checks, as in function check_slots_conflict_reason without any
advantage.

I think that there is advantage. The pros are:

- the test would be closer to HEAD from a behavioural point of view
- it's very rare to hit the corner cases: so the test would behave the same
as on HEAD most of the time (and when it does not, that would not hurt as the
checks are not done)
- Kuroda-San's patch removes the 'or die "replication slot stats of vacuum_full_activeslot not updated"'
check, while keeping the slot active would allow us to keep it (whether the slot
gets invalidated or not). But more on that in comment === 1 below.

I feel Kuroda-San's second patch is simple, and we have
fewer chances to make mistakes; it is easy to maintain in the future as
well.

Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
fine with Kuroda-San's patch if both of you feel that it's better.

I have concerns that Bertrand's patch could introduce another timing
issue. E.g., if the activated slots are not invalidated, the slots to be dropped
would still be active, so the drop might fail.

Yeah, but the drop is done with "$node_standby->psql" so that the test does not
produce an error. It would produce an error should we use "$node_standby->safe_psql"
instead.

I did not reproduce this but
something like this can happen if we activate slots.

You can see it that way (+ reproducer.txt):

"
+       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
+       note "BDT: $bdt";
+
$node_standby->psql('postgres',
qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
"

You'd see the slot being active and the "$node_standby->psql" not reporting
any error.

Hmm, but adding some additional smarts also makes this test less easy
to backpatch. I see your points related to the benefits, but I still
mildly prefer to go with the lesser-changes approach for the backbranches
patch. Normally, we don't enhance backbranch code without making
equivalent changes in HEAD, so there is less chance of adding new bugs
only in backbranches.

Bertrand, do you agree with the fewer changes approach (where active
slots won't be tested) for backbranches? I think now that we have
established that the vacuum full test is also prone to failure due to
a race condition in the test, this is the only remaining open point.

--
With Regards,
Amit Kapila.

#36Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Amit Kapila (#35)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Mon, Apr 07, 2025 at 03:16:07PM +0530, Amit Kapila wrote:

On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm, but adding some additional smarts also makes this test less easy
to backpatch. I see your points related to the benefits, but I still
mildly prefer to go with the lesser changes approach for backbranches
patch. Normally, we don't enhance backbranches code without making
equivalent changes in HEAD, so adding some new bugs only in
backbranches has a lesser chance.

Bertrand, do you agree with the fewer changes approach (where active
slots won't be tested) for backbranches? I think now that we have
established that the vacuum full test is also prone to failure due to
a race condition in the test, this is the only remaining open point.

Yeah, that's all good on my side; let's keep it that way and not make the slot
active.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#30)
1 attachment(s)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Thu, Apr 3, 2025 at 11:29 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

I have changed quite a few comments and the commit message for the PG17
patch in the attached. Can you update the PG16 patch based on this and
also use the same commit message as in the attached for all three
patches?

--
With Regards,
Amit Kapila.

Attachments:

v6-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v6-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From a0e08bae8eeea2e177abdc371b18fd5280728ac0 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 7 Apr 2025 17:58:23 +0530
Subject: [PATCH v6] Stabilize 035_standby_logical_decoding.pl.

Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.

We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).
---
 .../t/035_standby_logical_decoding.pl         | 64 +++++++++----------
 1 file changed, 29 insertions(+), 35 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 4eca17885d6..58c4402e80e 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,11 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon.  Even with the above precaution, there is a risk
+# of xl_running_xacts record being logged and replayed before the VACUUM
+# command, leading to the test failure.  So, we ensured that replication slots
+# are not activated for tests that can invalidate slots due to 'rows_removed'
+# conflict reason.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -532,11 +537,8 @@ $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-#
-# In passing, ensure that replication slot stats are not removed when the
-# active slot is invalidated.
 ##################################################
 
 # One way to produce recovery conflict is to create/drop a relation and
@@ -550,10 +552,6 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,19 +560,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
@@ -639,7 +629,7 @@ ok(!-f "$standby_walfile",
 	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
@@ -660,7 +650,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
@@ -696,7 +686,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
@@ -720,6 +710,10 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# As this scenario is not expected to produce any conflict, so activate the slot.
+# See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout, \$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -763,7 +757,7 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 0);
 $node_standby->restart;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
@@ -788,7 +782,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
@@ -832,7 +826,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
-- 
2.28.0.windows.1

#38Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#37)
2 attachment(s)
RE: Fix 035_standby_logical_decoding.pl race conditions

Dear Amit,

I have changed quite a few comments and the commit message for the PG17
patch in the attached. Can you update the PG16 patch based on this and
also use the same commit message as in the attached for all three
patches?

Your patch looks good to me and it passes on my env. PSA patches for PG16.
The patch for PG17 is not changed, just renamed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v6-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v6-PG17-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From 7ae736c680cb20432d1eaed1a6bb033c21675253 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 7 Apr 2025 17:58:23 +0530
Subject: [PATCH v6-PG17] Stabilize 035_standby_logical_decoding.pl.

Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.

We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).
---
 .../t/035_standby_logical_decoding.pl         | 64 +++++++++----------
 1 file changed, 29 insertions(+), 35 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 4eca17885d6..58c4402e80e 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,11 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon.  Even with the above precaution, there is a risk
+# of xl_running_xacts record being logged and replayed before the VACUUM
+# command, leading to the test failure.  So, we ensured that replication slots
+# are not activated for tests that can invalidate slots due to 'rows_removed'
+# conflict reason.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -532,11 +537,8 @@ $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
-#
-# In passing, ensure that replication slot stats are not removed when the
-# active slot is invalidated.
 ##################################################
 
 # One way to produce recovery conflict is to create/drop a relation and
@@ -550,10 +552,6 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 $node_primary->safe_psql('testdb',
 	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]);
 
-$node_standby->poll_query_until('testdb',
-	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-) or die "replication slot stats of vacuum_full_activeslot not updated";
-
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -562,19 +560,11 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
-# Ensure that replication slot stats are not removed after invalidation.
-is( $node_standby->safe_psql(
-		'testdb',
-		qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
-	),
-	't',
-	'replication slot stats not removed after invalidation');
-
 $handle =
   make_slot_active($node_standby, 'vacuum_full_', 0, \$stdout, \$stderr);
 
@@ -639,7 +629,7 @@ ok(!-f "$standby_walfile",
 	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
@@ -660,7 +650,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
@@ -696,7 +686,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
@@ -720,6 +710,10 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# As this scenario is not expected to produce any conflict, so activate the slot.
+# See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout, \$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -763,7 +757,7 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 0);
 $node_standby->restart;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
@@ -788,7 +782,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('pruning_', 'rows_removed');
@@ -832,7 +826,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify reason for conflict is 'wal_level_insufficient' in pg_replication_slots
 check_slots_conflict_reason('wal_level_', 'wal_level_insufficient');
-- 
2.43.5

v6-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchapplication/octet-stream; name=v6-PG16-0001-Stabilize-035_standby_logical_decoding.pl.patchDownload
From a81a3b933bb1c2464f9a6442f2a4833d904cf628 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 7 Apr 2025 17:58:23 +0530
Subject: [PATCH v6-PG16] Stabilize 035_standby_logical_decoding.pl.

Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.

We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).
---
 .../t/035_standby_logical_decoding.pl         | 49 +++++++++++--------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 82ad7ce0c2b..360ed826094 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -205,9 +205,6 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
-
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
 }
@@ -215,7 +212,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $checks_active_slot) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -231,13 +228,17 @@ sub check_for_invalidation
 			$log_start),
 		"activeslot slot invalidation is logged $test_name");
 
-	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
-			't'),
-		'confl_active_logicalslot updated'
-	) or die "Timed out waiting confl_active_logicalslot to be updated";
+	if ($checks_active_slot)
+	{
+		# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
+		# been updated
+		ok( $node_standby->poll_query_until(
+				'postgres',
+				"select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'",
+				't'),
+			'confl_active_logicalslot updated'
+		) or die "Timed out waiting confl_active_logicalslot to be updated";
+	}
 }
 
 # Launch $sql query, wait for a new snapshot that has a newer horizon and
@@ -250,7 +251,11 @@ sub check_for_invalidation
 # seeing a xl_running_xacts that would advance an active replication slot's
 # catalog_xmin.  Advancing the active replication slot's catalog_xmin
 # would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.
+# the catalog xmin horizon.  Even with the above precaution, there is a risk
+# of xl_running_xacts record being logged and replayed before the VACUUM
+# command, leading to the test failure.  So, we ensured that replication slots
+# are not activated for tests that can invalidate slots due to 'rows_removed'
+# conflict reason.
 sub wait_until_vacuum_can_remove
 {
 	my ($vac_option, $sql, $to_vac) = @_;
@@ -532,7 +537,7 @@ $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
 ##################################################
 
@@ -550,7 +555,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
+check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -620,7 +625,7 @@ ok(!-f "$standby_walfile",
 	"invalidated logical slots do not lead to retaining WAL");
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 2: conflict due to row removal with hot_standby_feedback off.
 ##################################################
 
@@ -641,7 +646,7 @@ wait_until_vacuum_can_remove(
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
+check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -677,7 +682,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 check_for_invalidation('shared_row_removal_', $logstart,
-	'with vacuum on pg_authid');
+	'with vacuum on pg_authid', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -701,6 +706,10 @@ $logstart = -s $node_standby->logfile;
 reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
 	'no_conflict_', 0, 1);
 
+# As this scenario is not expected to produce any conflict, so activate the slot.
+# See comments atop wait_until_vacuum_can_remove().
+make_slot_active($node_standby, 'no_conflict_', 1, \$stdout, \$stderr);
+
 # This should not trigger a conflict
 wait_until_vacuum_can_remove(
 	'', 'CREATE TABLE conflict_test(x integer, y text);
@@ -738,7 +747,7 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 0);
 $node_standby->restart;
 
 ##################################################
-# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Recovery conflict: Invalidate conflicting slots
 # Scenario 5: conflict due to on-access pruning.
 ##################################################
 
@@ -763,7 +772,7 @@ $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('pruning_', $logstart, 'with on-access pruning');
+check_for_invalidation('pruning_', $logstart, 'with on-access pruning', 0);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
@@ -807,7 +816,7 @@ $node_primary->restart;
 $node_primary->wait_for_replay_catchup($node_standby);
 
 # Check invalidation in the logfile and in pg_stat_database_conflicts
-check_for_invalidation('wal_level_', $logstart, 'due to wal_level');
+check_for_invalidation('wal_level_', $logstart, 'due to wal_level', 1);
 
 # Verify slots are reported as conflicting in pg_replication_slots
 check_slots_conflicting_status(1);
-- 
2.43.5

#39Michael Paquier
michael@paquier.xyz
In reply to: Hayato Kuroda (Fujitsu) (#38)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Tue, Apr 08, 2025 at 02:00:35AM +0000, Hayato Kuroda (Fujitsu) wrote:

Your patch looks good to me and it could pass on my env. PSA patches for PG16.
Patch for PG17 is not changed, just renamed.

@@ -1287,6 +1288,17 @@ LogStandbySnapshot(void)

Assert(XLogStandbyInfoActive());

+#ifdef USE_INJECTION_POINTS
+    if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
+    {
+        /*
+         * This record could move slot's xmin forward during decoding, leading
+         * to unpredictable results, so skip it when requested by the test.
+         */
+        return GetInsertRecPtr();
+    }
+#endif

I have unfortunately not been able to pay much attention to this
thread, but using an injection point as a trick to disable the
generation of these random standby snapshot records is an interesting
approach to stabilize the test, and it should make it faster as well.
Nice.
--
Michael

#40Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#39)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Tue, Apr 08, 2025 at 02:13:20PM +0900, Michael Paquier wrote:

On Tue, Apr 08, 2025 at 02:00:35AM +0000, Hayato Kuroda (Fujitsu) wrote:

Your patch looks good to me and it could pass on my env. PSA patches for PG16.
Patch for PG17 is not changed, just renamed.

@@ -1287,6 +1288,17 @@ LogStandbySnapshot(void)

Assert(XLogStandbyInfoActive());

+#ifdef USE_INJECTION_POINTS
+    if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
+    {
+        /*
+         * This record could move slot's xmin forward during decoding, leading
+         * to unpredictable results, so skip it when requested by the test.
+         */
+        return GetInsertRecPtr();
+    }
+#endif

I have unfortunately not been able to pay much attention to this
thread, but using an injection point as a trick to disable the
generation of these random standby snapshot records is an interesting
approach to stabilize the test, and it should make it faster as well.
Nice.

Yeah. That said, I still think it could be useful to implement what has been
proposed in v1-0001 in [1], i.e.:

- A new injection_points_wakeup_detach() function that holds the spinlock
during the whole duration to ensure that no process can wait in between the
wakeup and the detach.

- injection_wait() should try to reuse an existing slot (if any) before trying
to use an empty one.

This is not needed here anymore (as we're using another injection point than the
one initially proposed) but I'll open a dedicated thread for that for 19 when
the timing is appropriate.

[1]: /messages/by-id/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#41Andres Freund
andres@anarazel.de
In reply to: Hayato Kuroda (Fujitsu) (#38)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On 2025-04-08 02:00:35 +0000, Hayato Kuroda (Fujitsu) wrote:

I have changed quite a few comments and commit message for the PG17
patch in the attached. Can you update PG16 patch based on this and
also use the same commit message as used in attached for all the three
patches?

Your patch looks good to me and it could pass on my env. PSA patches for PG16.
Patch for PG17 is not changed, just renamed.

Thanks all for working together to fix this. These test failures were
really rather painful!

Now we just need to fix the issue that causes random CI failures on windows
and the one that causes similar, but different, random failures on macos...

Greetings,

Andres Freund

#42Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#40)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Tue, Apr 08, 2025 at 06:19:02AM +0000, Bertrand Drouvot wrote:

- A new injection_points_wakeup_detach() function that is holding the spinlock
during the whole duration to ensure that no process can wait in between the
wakeup and the detach.

That would not be a correct spinlock use. injection_points_detach() and
injection_points_wakeup_internal() do much more than what we
can do while holding a spinlock, including both
Postgres-specific calls as well as system calls. strcmp() and
strlcpy() are still OK-ish, even as system calls, as they work
directly on the strings given in input.
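
As a rough illustration of the rule (the variable and field names below are
made up for the example, they are not the actual injection_points.c ones):
only short, no-fail, straight-line work may happen while the lock is held, and
anything that can error out, allocate or sleep has to run after the release:

"
    bool    found;

    /* Only trivial, no-fail operations while the spinlock is held. */
    SpinLockAcquire(&state->lock);
    found = (strcmp(state->slot_name, name) == 0);
    if (found)
        state->slot_name[0] = '\0';
    SpinLockRelease(&state->lock);

    /* Waking up waiters (and any error reporting) happens outside. */
    if (found)
        ConditionVariableBroadcast(&state->wait_point);
"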
--
Michael

#43Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#42)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Tue, Apr 08, 2025 at 03:27:41PM +0900, Michael Paquier wrote:

On Tue, Apr 08, 2025 at 06:19:02AM +0000, Bertrand Drouvot wrote:

- A new injection_points_wakeup_detach() function that is holding the spinlock
during the whole duration to ensure that no process can wait in between the
wakeup and the detach.

That would not a correct spinlock use. injection_points_detach() and
injection_points_wakeup_internal() do much more actions than what we
can internally do while holding a spinlock,

Fully agree. We will need to find another way to prevent a process from waiting
between the wakeup and the detach. I'll open a dedicated thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#44Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#43)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Tue, Apr 08, 2025 at 06:42:53AM +0000, Bertrand Drouvot wrote:

Fully agree. Will need to find another way to prevent a process to wait between the
wakeup and the detach. I'll open a dedicated thread.

By the way, there is a small thing that's itching me a bit about the
change done in LogStandbySnapshot() for 105b2cb33617. Could it be
useful for debugging to add an elog(DEBUG1) with the LSN returned by
GetInsertRecPtr() when taking the short path? We don't have any
visibility into when the shortcut path is taken, which seems annoying in
the long term if we use the injection point skip-log-running-xacts for
other tests, and I suspect that there will be some, as the standby
snapshots can be really annoying in tests where we want a predictable
set of WAL records when wal_level is "replica" or "logical".
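
Something like the following is what I have in mind for the short-path branch
(an untested sketch only; the message wording is made up):

"
#ifdef USE_INJECTION_POINTS
    if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
    {
        XLogRecPtr  recptr = GetInsertRecPtr();

        /* Leave a trace of the short path so runs can be compared in the logs. */
        elog(DEBUG1, "skipping xl_running_xacts, insert position %X/%X",
             LSN_FORMAT_ARGS(recptr));

        return recptr;
    }
#endif
"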
--
Michael

#45Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#44)
Re: Fix 035_standby_logical_decoding.pl race conditions

Hi,

On Wed, Apr 09, 2025 at 12:03:06PM +0900, Michael Paquier wrote:

On Tue, Apr 08, 2025 at 06:42:53AM +0000, Bertrand Drouvot wrote:

Fully agree. Will need to find another way to prevent a process to wait between the
wakeup and the detach. I'll open a dedicated thread.

By the way, there is a small thing that's itching me a bit about the
change done in LogStandbySnapshot() for 105b2cb33617. Could it be
useful for debugging to add an elog(DEBUG1) with the LSN returned by
GetInsertRecPtr() when taking the short path? We don't have any
visibility when the shortcut path is taken, which seems annoying in
the long term if we use the injection point skip-log-running-xacts for
other tests, and I suspect that there will be some as the standby
snapshots can be really annoying in tests where we want a predictable
set of WAL records when wal_level is "replica" or "logical".

Yeah, I also think it would be good to have some way to debug this. That's
why I did propose to generate a WAL record by making use of LogLogicalMessage()
instead of GetInsertRecPtr() (see [1]). We could emit "bypassing xl_running_xacts"
or such, and as a bonus that would advance the record pointer.
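
For instance, something along these lines in the short-path branch (a rough,
untested sketch; it assumes HEAD's LogLogicalMessage() signature with the
"flush" argument, and the prefix and message text are made up):

"
#ifdef USE_INJECTION_POINTS
    if (IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts"))
    {
        /*
         * Emit a small non-transactional logical message instead of the
         * xl_running_xacts record: it leaves a trace in the WAL and the
         * returned pointer still advances.
         */
        return LogLogicalMessage("standby", "bypassing xl_running_xacts",
                                 sizeof("bypassing xl_running_xacts"),
                                 false, false);
    }
#endif
"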

Adding an elog(DEBUG1) could make sense too.

[1]: /messages/by-id/Z+uko5kbw/ek/h0F@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Bertrand Drouvot (#45)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 9, 2025 at 11:24 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

On Wed, Apr 09, 2025 at 12:03:06PM +0900, Michael Paquier wrote:

On Tue, Apr 08, 2025 at 06:42:53AM +0000, Bertrand Drouvot wrote:

Fully agree. Will need to find another way to prevent a process to wait between the
wakeup and the detach. I'll open a dedicated thread.

By the way, there is a small thing that's itching me a bit about the
change done in LogStandbySnapshot() for 105b2cb33617. Could it be
useful for debugging to add an elog(DEBUG1) with the LSN returned by
GetInsertRecPtr() when taking the short path? We don't have any
visibility when the shortcut path is taken, which seems annoying in
the long term if we use the injection point skip-log-running-xacts for
other tests, and I suspect that there will be some as the standby
snapshots can be really annoying in tests where we want a predictable
set of WAL records when wal_level is "replica" or "logical".

Yeah, I also think that would be good to have some way to debug this.

I can't think of a good reason to have this DEBUG1 as there is no
predictability of it getting generated even with tests using an
injection point. OTOH, I don't have any objections to it if you would
like to proceed with this.

--
With Regards,
Amit Kapila.

#47Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#46)
Re: Fix 035_standby_logical_decoding.pl race conditions

On Wed, Apr 09, 2025 at 12:07:31PM +0530, Amit Kapila wrote:

On Wed, Apr 9, 2025 at 11:24 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
I can't think of a good reason to have this DEBUG1 as there is no
predictability of it getting generated even with tests using an
injection point. OTOH, I don't have any objections to it if you would
like to proceed with this.

The non-predictability of the event is my reason, as it can be useful
to know this information when grepping for specific patterns in the
logs when comparing failed and successful runs. In short, I'd
like to think that we are OK here; still, this information is free to
have and it could be useful if we still have problems. A custom
message WAL record is overdoing it a bit, IMO; an elog() with the
LSN returned should be enough.
--
Michael