Random subscription 021_twophase test failure on kestrel
Hi,
The 021_twophase test has failed on Kestrel at [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-05-22%2021%3A19%3A22 with the following error:
# Failed test 'should be no prepared transactions on subscriber'
# at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/021_twophase.pl
line 438.
# got: '1'
# expected: '0'
# Looks like you failed 1 test of 30.
This failure is caused by a prepared transaction that was not properly
committed due to replication lag on one of the subscriptions. The test
involves two subscriptions: tap_sub and tap_sub_copy. After committing
the prepared transaction 'mygid', the test only waits for tap_sub_copy
to catch up:
node_publisher->wait_for_catchup($appname_copy);
However, tap_sub is dropped before ensuring it has replayed the commit
of 'mygid' prepared transaction, leading to a leftover prepared
transaction on the subscriber:
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
When the test later checks for the number of prepared transactions, it
fails because tap_sub had not finished applying the commit:
# at line 438
# got: '1'
# expected: '0'
This issue can be consistently reproduced by injecting a delay (e.g.,
3 seconds) in tap_sub's walsender while decoding the commit of
'mygid'. A patch to demonstrate this behavior is provided at
021_two_phase_test_failure_reproduce.patch. The test can be fixed by
explicitly waiting for both subscriptions to catch up before dropping
either. A patch implementing this fix is attached.
Thanks Amit for the offline discussion and sharing your thoughts on the same.
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-05-22%2021%3A19%3A22
Regards,
Vignesh
On Fri, May 23, 2025 at 08:55:27PM +0530, vignesh C wrote:
This issue can be consistently reproduced by injecting a delay (e.g.,
3 seconds) in tap_sub's walsender while decoding the commit of
'mygid'. A patch to demonstrate this behavior is provided at
021_two_phase_test_failure_reproduce.patch. The test can be fixed by
explicitly waiting for both subscriptions to catch up before dropping
either. A patch implementing this fix is attached.
+ if (parsed->twophase_xid && strcmp(parsed->twophase_gid, "mygid") == 0 &&
+ strcmp(NameStr(MyReplicationSlot->data.name), "tap_sub") == 0)
+ sleep(3);
+
Smart filtering to prove your point.
+# Wait for both subscribers to catchup $node_publisher->wait_for_catchup($appname_copy); +$node_publisher->wait_for_catchup($appname); + +# Make sure there are no prepared transactions on the subscriber +$result = $node_subscriber->safe_psql('postgres', + "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(0), 'should be no prepared transactions on subscriber');
Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging. I'll take care of that, if there are no objections.
--
Michael
On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:
Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.
+1.
--
With Regards,
Amit Kapila.
On Sat, May 24, 2025 at 11:27:05AM +0530, Amit Kapila wrote:
On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:
Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.+1.
Applied down to v15.
--
Michael
On Mon, 26 May 2025 at 13:59, Michael Paquier <michael@paquier.xyz> wrote:
On Sat, May 24, 2025 at 11:27:05AM +0530, Amit Kapila wrote:
On Sat, May 24, 2025 at 6:07 AM Michael Paquier <michael@paquier.xyz> wrote:
Yes, agreed that your suggested fix looks sensible with an extra check
for pg_prepared_xacts on the subscriber side that can be useful for
debugging.+1.
Applied down to v15.
Thanks for committing this. The buildfarm runs have been successful so
far; I’ll continue monitoring them over the next few days.
Regards,
Vignesh