Test failures of 100_bugs.pl

Started by Andres Freund almost 3 years ago, 5 messages
#1 Andres Freund
andres@anarazel.de

Hi,

On cfbot, the buildfarm, and locally I have seen 100_bugs.pl fail
occasionally. Just rarely enough that I never got around to looking into it
for real.

Just now there was another failure on master:
https://cirrus-ci.com/task/5279589287591936

[01:00:49.441] ok 1 - index predicates do not cause crash
[01:00:49.441] ok 2 - update to temporary table without replica identity with FOR ALL TABLES publication
[01:00:49.441] ok 3 - update to unlogged table without replica identity with FOR ALL TABLES publication
[01:00:49.441] # test failed
[01:00:49.441] --- stderr ---
[01:00:49.441] # poll_query_until timed out executing this query:
[01:00:49.441] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
[01:00:49.441] # expecting this output:
[01:00:49.441] # t
[01:00:49.441] # last actual query output:
[01:00:49.441] # f
[01:00:49.441] # with stderr:
[01:00:49.441] # Tests were run but no plan was declared and done_testing() was not seen.
[01:00:49.441] # Looks like your test exited with 29 just after 3.
[01:00:49.441]
[01:00:49.441] (test program exited with status code 29)

the regress log:
https://api.cirrus-ci.com/v1/artifact/task/5279589287591936/testrun/build-32/testrun/subscription/100_bugs/log/regress_log_100_bugs
and twoways' log:
https://api.cirrus-ci.com/v1/artifact/task/5279589287591936/testrun/build-32/testrun/subscription/100_bugs/log/100_bugs_twoways.log

We see t2 added to the publication:
2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2

And that *then* "t" was synchronized:
2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished

and then that the refresh was issued:
2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION

And we see a walsender starting and the query to get the new tables being executed:
2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename
, t.attnames
FROM pg_catalog.pg_publication_tables t
WHERE t.pubname IN ('testpub')

And that's it, the rest of the time is just polling.

Perhaps wait_for_subscription_sync() should dump the set of rel states to make
something like this more debuggable?
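
A minimal sketch of such a dump, written here as a plain query one could also
run by hand on the stuck subscriber (only the standard pg_subscription /
pg_subscription_rel catalog columns are assumed; the exact shape is just an
illustration):

-- which relations of which subscription are in which sync state
SELECT s.subname, sr.srrelid::regclass AS relation, sr.srsubstate
  FROM pg_subscription_rel sr
       JOIN pg_subscription s ON (s.oid = sr.srsubid)
 ORDER BY 1, 2;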

The fact that the synchronization for t finished just before the refresh makes
me wonder if a wakeup or a cache invalidation got lost?

Greetings,

Andres Freund

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#1)
Re: Test failures of 100_bugs.pl

On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:

> On cfbot, the buildfarm, and locally I have seen 100_bugs.pl fail
> occasionally. Just rarely enough that I never got around to looking into it
> for real.
>
> ...
>
> We see t2 added to the publication:
> 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
>
> And that *then* "t" was synchronized:
> 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished
>
> and then that the refresh was issued:
> 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
>
> And we see a walsender starting and the query to get the new tables being executed:
> 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename
> , t.attnames
> FROM pg_catalog.pg_publication_tables t
> WHERE t.pubname IN ('testpub')
>
> And that's it, the rest of the time is just polling.
>
> Perhaps wait_for_subscription_sync() should dump the set of rel states to make
> something like this more debuggable?
>
> The fact that the synchronization for t finished just before the refresh makes
> me wonder if a wakeup or a cache invalidation got lost?

From the LOGs, the only conclusion one could draw is a lost invalidation,
because the nap time of the apply worker is 1s, so it should process the
invalidation during the time we are polling. Also, the rel should be
added to pg_subscription_rel, because the test is still polling for
rels to be in 'ready' or 'done' state.
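
To double-check that part by hand on the stuck node (database d2 in this
test), a query along these lines would show whether t2 is present in
pg_subscription_rel at all and in which state; only a sketch using the
standard catalog columns:

SELECT sr.srsubstate, sr.srsublsn
  FROM pg_subscription_rel sr
 WHERE sr.srrelid = 't2'::regclass;

Zero rows would mean the REFRESH never registered t2, while a row that stays
in state 'i' (init) would suggest a table sync worker never got launched for
it.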

I think we can do three things to debug this: (a) as you suggested, dump the
rel state in wait_for_subscription_sync; (b) add a DEBUG log in
invalidate_syncing_table_states() to ensure that the invalidation has been
processed; (c) print rel states and relids from table_states_not_ready
in process_syncing_tables_for_apply() to see if t2 has ever appeared
in that list.
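
For (b) and (c) the extra messages only show up if the subscriber logs at
DEBUG1, so the investigation would also need the log level raised for the
window of interest; sketched here as plain SQL (a TAP test would do the
equivalent via append_conf plus reload):

ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();
-- ... reproduce the scenario ...
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();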

--
With Regards,
Amit Kapila.

#3 Yu Shi (Fujitsu)
shiy.fnst@fujitsu.com
In reply to: Amit Kapila (#2)
1 attachment(s)
RE: Test failures of 100_bugs.pl

On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On cfbot, the buildfarm, and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> > ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename
> > , t.attnames
> > FROM pg_catalog.pg_publication_tables t
> > WHERE t.pubname IN ('testpub')
> >
> > And that's it, the rest of the time is just polling.
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to make
> > something like this more debuggable?
> >
> > The fact that the synchronization for t finished just before the refresh makes
> > me wonder if a wakeup or a cache invalidation got lost?
>
> From the LOGs, the only conclusion one could draw is a lost invalidation,
> because the nap time of the apply worker is 1s, so it should process the
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel, because the test is still polling for
> rels to be in 'ready' or 'done' state.
>
> I think we can do three things to debug this: (a) as you suggested, dump the
> rel state in wait_for_subscription_sync; (b) add a DEBUG log in
> invalidate_syncing_table_states() to ensure that the invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.

There was a similar buildfarm failure on francolin recently [1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.

--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--
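
One way to see on the subscriber whether a table synchronization worker was
ever launched for the new table is to look at pg_stat_subscription while the
test is stuck; only a sketch (the view shows just the currently running
workers, and relid is NULL for the main apply worker):

SELECT subname, pid, relid::regclass AS relation, received_lsn
  FROM pg_stat_subscription;

If only the apply worker's row shows up there, that would match the theory
that the sync worker for test_mismatching_types was never started.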

I wrote a patch to dump the rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in the code later.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09

Regards,
Shi Yu

Attachments:

v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch (application/octet-stream)
From 2cf7b69211b597e27b67599ea62e3a3fee70f4b6 Mon Sep 17 00:00:00 2001
From: Shi Yu <shiy.fnst@fujitsu.com>
Date: Thu, 20 Apr 2023 10:47:09 +0800
Subject: [PATCH v1] dump rel state in wait_for_subscription_sync

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..8130a8ac4a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2774,8 +2774,24 @@ sub wait_for_subscription_sync
 	print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
 	my $query =
 	    qq[SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');];
-	$self->poll_query_until($dbname, $query)
-	  or croak "timed out waiting for subscriber to synchronize data";
+	if (!$self->poll_query_until($dbname, $query))
+	{
+		my $result = $self->safe_psql(
+			$dbname, qq(
+			SELECT
+				PS.subname AS subname,
+				N.nspname AS schemaname,
+				C.relname AS relname,
+				PSR.srsubstate AS substate
+			FROM pg_subscription_rel AS PSR
+				JOIN pg_subscription PS ON (PS.oid = PSR.srsubid)
+				JOIN pg_class C ON (C.oid = PSR.srrelid)
+				JOIN pg_namespace N ON (N.oid = C.relnamespace);
+		));
+		print "### Subscription rel state\n";
+		print "$result\n";
+		croak "timed out waiting for subscriber to synchronize data";
+	}
 
 	# Then, wait for the replication to catchup if required.
 	if (defined($publisher))
-- 
2.31.1

#4 Yu Shi (Fujitsu)
shiy.fnst@fujitsu.com
In reply to: Yu Shi (Fujitsu) (#3)
2 attachment(s)
RE: Test failures of 100_bugs.pl

On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:

> I wrote a patch to dump the rel state in wait_for_subscription_sync() as suggested.
> Please see the attached patch.
> I will try to add some debug logs in the code later.

Please see the attached v2 patch.

I added some debug logs when invalidating syncing table states and when updating
the table_states_not_ready list. I also adjusted the log message level in the tests
that failed before.

Regards,
Shi Yu

Attachments:

v2-0001-dump-rel-state-in-wait_for_subscription_sync.patch (application/octet-stream)
From fc083fa6320605c54ae0d7b11ce5102c13977491 Mon Sep 17 00:00:00 2001
From: Shi Yu <shiy.fnst@fujitsu.com>
Date: Mon, 24 Apr 2023 17:25:56 +0800
Subject: [PATCH v2 1/2] dump rel state in wait_for_subscription_sync

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..89cfc0f8a0 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2774,8 +2774,25 @@ sub wait_for_subscription_sync
 	print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
 	my $query =
 	    qq[SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');];
-	$self->poll_query_until($dbname, $query)
-	  or croak "timed out waiting for subscriber to synchronize data";
+	if (!$self->poll_query_until($dbname, $query))
+	{
+		my $result = $self->safe_psql(
+			$dbname, qq(
+			SELECT
+				PS.subname,
+				N.nspname,
+				C.relname,
+				PSR.srrelid,
+				PSR.srsubstate
+			FROM pg_subscription_rel AS PSR
+				JOIN pg_subscription PS ON (PS.oid = PSR.srsubid)
+				JOIN pg_class C ON (C.oid = PSR.srrelid)
+				JOIN pg_namespace N ON (N.oid = C.relnamespace);
+		));
+		print "### Subscription rel state\n";
+		print "$result\n";
+		croak "timed out waiting for subscriber to synchronize data";
+	}
 
 	# Then, wait for the replication to catchup if required.
 	if (defined($publisher))
-- 
2.30.0.windows.2

v2-0002-Add-logs-to-help-investigate-subscription-test-fa.patch (application/octet-stream)
From 0359554d9e1fa808991b72757b1f11cfbcc96075 Mon Sep 17 00:00:00 2001
From: Shi Yu <shiy.fnst@fujitsu.com>
Date: Mon, 24 Apr 2023 17:26:36 +0800
Subject: [PATCH v2 2/2] Add logs to help investigate subscription test failure

---
 src/backend/replication/logical/tablesync.c | 3 +++
 src/test/subscription/t/014_binary.pl       | 8 ++++++++
 src/test/subscription/t/100_bugs.pl         | 9 +++++++++
 3 files changed, 20 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0c71ae9ba7..82f7474d5f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -272,6 +272,7 @@ void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
 	table_states_valid = false;
+	elog(DEBUG1, "invalidating syncing table states");
 }
 
 /*
@@ -1549,6 +1550,8 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(DEBUG1, "not ready subscription table %u, state: %c",
+				 rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
diff --git a/src/test/subscription/t/014_binary.pl b/src/test/subscription/t/014_binary.pl
index feefbe734e..e012b9ed3a 100644
--- a/src/test/subscription/t/014_binary.pl
+++ b/src/test/subscription/t/014_binary.pl
@@ -257,6 +257,10 @@ $node_publisher->safe_psql(
 # Check the subscriber log from now on.
 $offset = -s $node_subscriber->logfile;
 
+# Adjust log_min_messages, this helps when the test fails.
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->reload;
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
     CREATE TABLE public.test_mismatching_types (
@@ -283,6 +287,10 @@ $node_publisher->wait_for_log(
 
 $node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub');
 
+# Reset the log_min_messages to default.
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = warning");
+$node_subscriber->reload;
+
 # Check the synced data on the subscriber
 $result = $node_subscriber->safe_psql('postgres',
 	'SELECT a FROM test_mismatching_types ORDER BY a;');
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index b832ddcf63..78cf555a9a 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -151,6 +151,11 @@ $node_twoways->safe_psql(
 	INSERT INTO t2 SELECT * FROM generate_series(1, $rows);
 	});
 $node_twoways->safe_psql('d1', 'ALTER PUBLICATION testpub ADD TABLE t2');
+
+# Adjust log_min_messages, this helps when the test fails.
+$node_twoways->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_twoways->reload;
+
 $node_twoways->safe_psql('d2',
 	'ALTER SUBSCRIPTION testsub REFRESH PUBLICATION');
 
@@ -159,6 +164,10 @@ $node_twoways->safe_psql('d2',
 # verify that tables are synced.
 $node_twoways->wait_for_subscription_sync($node_twoways, 'testsub', 'd2');
 
+# Reset the log_min_messages to default.
+$node_twoways->append_conf('postgresql.conf', "log_min_messages = warning");
+$node_twoways->reload;
+
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t"),
 	$rows * 2, "2x$rows rows in t");
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t2"),
-- 
2.30.0.windows.2

#5 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Yu Shi (Fujitsu) (#4)
RE: Test failures of 100_bugs.pl

On Monday, April 24, 2023 5:50 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:

> On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:
> >
> > I wrote a patch to dump the rel state in wait_for_subscription_sync() as suggested.
> > Please see the attached patch.
> > I will try to add some debug logs in the code later.
>
> Please see the attached v2 patch.
>
> I added some debug logs when invalidating syncing table states and when updating
> the table_states_not_ready list. I also adjusted the log message level in the tests
> that failed before.

Just for reference.

I think a similar issue has been analyzed in another thread [1], and the reason seems
clear: the table state cache invalidation got lost due to a race condition.
The fix is also being discussed there.

[1]: /messages/by-id/711a6afe-edb7-1211-cc27-1bef8239eec7@gmail.com

Best Regards,
Hou zj