Random subscription 021_twophase test failure on kestrel

Started by vignesh C8 months ago5 messages
#1vignesh C
vignesh21@gmail.com
2 attachment(s)

Hi,

The 021_twophase test has failed on Kestrel at [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-05-22%2021%3A19%3A22 with the following error:
# Failed test 'should be no prepared transactions on subscriber'
# at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/021_twophase.pl
line 438.
# got: '1'
# expected: '0'
# Looks like you failed 1 test of 30.

This failure is caused by a prepared transaction that was not properly
committed due to replication lag on one of the subscriptions. The test
involves two subscriptions: tap_sub and tap_sub_copy. After committing
the prepared transaction 'mygid', the test only waits for tap_sub_copy
to catch up:
node_publisher->wait_for_catchup($appname_copy);

However, tap_sub is dropped before ensuring it has replayed the commit
of 'mygid' prepared transaction, leading to a leftover prepared
transaction on the subscriber:
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");

When the test later checks for the number of prepared transactions, it
fails because tap_sub had not finished applying the commit:
# at line 438
# got: '1'
# expected: '0'

This issue can be consistently reproduced by injecting a delay (e.g.,
3 seconds) in tap_sub's walsender while decoding the commit of
'mygid'. A patch to demonstrate this behavior is provided at
021_two_phase_test_failure_reproduce.patch. The test can be fixed by
explicitly waiting for both subscriptions to catch up before dropping
either. A patch implementing this fix is attached.

Thanks Amit for the offline discussion and sharing your thoughts on the same.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-05-22%2021%3A19%3A22

Regards,
Vignesh

Attachments:

0001-Fix-random-021_twophase-test-failure.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-random-021_twophase-test-failure.patchDownload
From 4f75511e9866ef29f79789e5be0a31f0ee5786e7 Mon Sep 17 00:00:00 2001
From: Vignesh <vignesh21@gmail.com>
Date: Fri, 23 May 2025 20:40:40 +0530
Subject: [PATCH] Fix random 021_twophase test failure

This issue occurred because the test dropped the subscription
tap_sub before confirming it had replayed the commit for the
previously prepared transaction 'mygid'. The test only waited
for the second subscription, tap_sub_copy, to catch up. As a
result, tap_sub had not yet applied the commit, leaving the
prepared transaction uncommitted on the subscriber.
---
 src/test/subscription/t/021_twophase.pl | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 61c427aed21..b8e4242d1f1 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -373,7 +373,14 @@ $result =
   $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
 is($result, qq(6), 'publisher inserted data');
 
+# Wait for both subscribers to catchup
 $node_publisher->wait_for_catchup($appname_copy);
+$node_publisher->wait_for_catchup($appname);
+
+# Make sure there are no prepared transactions on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'should be no prepared transactions on subscriber');
 
 $result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
-- 
2.43.0

test_failure_fix.patchtext/x-patch; charset=US-ASCII; name=test_failure_fix.patchDownload
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f0706e9..981a8e45003 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -38,6 +38,7 @@
 #include "replication/message.h"
 #include "replication/reorderbuffer.h"
 #include "replication/snapbuild.h"
+#include "replication/slot.h"
 #include "storage/standbydefs.h"
 
 /* individual record(group)'s handlers */
@@ -682,6 +683,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
 	int			i;
 
+	if (parsed->twophase_xid && strcmp(parsed->twophase_gid, "mygid") == 0 &&
+		strcmp(NameStr(MyReplicationSlot->data.name), "tap_sub") == 0)
+		sleep(3);
+
 	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
 		origin_lsn = parsed->origin_lsn;
#2Michael Paquier
michael@paquier.xyz
In reply to: vignesh C (#1)
Re: Random subscription 021_twophase test failure on kestrel

On Fri, May 23, 2025 at 08:55:27PM +0530, vignesh C wrote:

This issue can be consistently reproduced by injecting a delay (e.g.,
3 seconds) in tap_sub's walsender while decoding the commit of
'mygid'. A patch to demonstrate this behavior is provided at
021_two_phase_test_failure_reproduce.patch. The test can be fixed by
explicitly waiting for both subscriptions to catch up before dropping
either. A patch implementing this fix is attached.

+    if (parsed->twophase_xid && strcmp(parsed->twophase_gid, "mygid") == 0 &&
+        strcmp(NameStr(MyReplicationSlot->data.name), "tap_sub") == 0)
+        sleep(3);
+

Smart filtering to prove your point.

+# Wait for both subscribers to catchup
$node_publisher->wait_for_catchup($appname_copy);
+$node_publisher->wait_for_catchup($appname);
+
+# Make sure there are no prepared transactions on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+    "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'should be no prepared transactions on subscriber');

Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging. I'll take care of that, if there are no objections.
--
Michael

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#2)
Re: Random subscription 021_twophase test failure on kestrel

On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:

Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.

+1.

--
With Regards,
Amit Kapila.

#4Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#3)
Re: Random subscription 021_twophase test failure on kestrel

On Sat, May 24, 2025 at 11:27:05AM +0530, Amit Kapila wrote:

On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:

Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.

+1.

Applied down to v15.
--
Michael

#5vignesh C
vignesh21@gmail.com
In reply to: Michael Paquier (#4)
Re: Random subscription 021_twophase test failure on kestrel

On Mon, 26 May 2025 at 13:59, Michael Paquier <michael@paquier.xyz> wrote:

On Sat, May 24, 2025 at 11:27:05AM +0530, Amit Kapila wrote:

On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:

Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.

+1.

Applied down to v15.

Thanks for committing this. The buildfarm runs have been successful so
far; I’ll continue monitoring them over the next few days.

Regards,
Vignesh