Parallel tests publication and subscription might fail due to concurrent tuple update

Started by Alexander Lakhinabout 1 year ago4 messages
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

A recent desman failure [1]https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=desman&dt=2024-12-09%2018%3A33%3A49&stg=check with the following diagnostics:
# parallel group (2 tests):  subscription publication
not ok 157   + publication                              2251 ms
ok 158       + subscription                              415 ms

--- /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/expected/publication.out 2024-12-09 
18:34:02.939762233 +0000
+++ /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/results/publication.out 2024-12-09 
18:44:48.582958859 +0000
@@ -1193,23 +1193,29 @@
  ERROR:  permission denied for database regression
  SET ROLE regress_publication_user;
  GRANT CREATE ON DATABASE regression TO regress_publication_user2;
+ERROR:  tuple concurrently updated
  SET ROLE regress_publication_user2;
  SET client_min_messages = 'ERROR';
  CREATE PUBLICATION testpub2;  -- ok
+ERROR:  permission denied for database regression

and postmaster.log containing:
2024-12-09 18:44:46.753 UTC [1345157:903] pg_regress/publication STATEMENT:  CREATE PUBLICATION testpub2;
2024-12-09 18:44:46.753 UTC [1345158:287] pg_regress/subscription LOG:  statement: REVOKE CREATE ON DATABASE REGRESSION
FROM regress_subscription_user3;
2024-12-09 18:44:46.754 UTC [1345157:904] pg_regress/publication LOG:  statement: SET ROLE regress_publication_user;
2024-12-09 18:44:46.754 UTC [1345157:905] pg_regress/publication LOG:  statement: GRANT CREATE ON DATABASE regression TO
regress_publication_user2;
2024-12-09 18:44:46.754 UTC [1345157:906] pg_regress/publication ERROR:  tuple concurrently updated
2024-12-09 18:44:46.754 UTC [1345157:907] pg_regress/publication STATEMENT:  GRANT CREATE ON DATABASE regression TO
regress_publication_user2;

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.

I've managed to reproduce the error with:
sed -E "s/(REVOKE CREATE ON DATABASE REGRESSION FROM regress_subscription_user3;)/$(printf '\\1%.0s' {1..2000})/" -i.bak \
  src/test/regress/sql/subscription.sql src/test/regress/expected/subscription.out
sed -E "s/(GRANT CREATE ON DATABASE regression TO regress_publication_user2;)/$(printf '\\1%.0s' {1..1000})/" -i.bak \
  src/test/regress/sql/publication.sql src/test/regress/expected/publication.out
sed -E "s/(test: publication subscription$)/$(printf '\\1\\n%.0s' {1..10})/" -i.bak src/test/regress/parallel_schedule

This makes `make check` fail like below:
# parallel group (2 tests):  subscription publication
ok 170       + publication                               202 ms
ok 171       + subscription                              100 ms
# parallel group (2 tests):  subscription publication
ok 172       + publication                               198 ms
not ok 173   + subscription                              107 ms
# parallel group (2 tests):  subscription publication
ok 174       + publication                               204 ms
ok 175       + subscription                              100 ms

src/test/regress/regression.diffs contains:
+ERROR:  tuple concurrently updated

This issue is reproduced starting from commit c3afe8cf5 (dated 2023-03-30),
which added "REVOKE CREATE ON DATABASE REGRESSION ..." into the subscription
test.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=desman&dt=2024-12-09%2018%3A33%3A49&stg=check

Best regards,
Alexander

#2Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Alexander Lakhin (#1)
1 attachment(s)
Re: Parallel tests publication and subscription might fail due to concurrent tuple update

On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.

This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1]/messages/by-id/HE1PR8303MB0075BF78AF1BE904050DA16BF7729@HE1PR8303MB0075.EURPRD83.prod.outlook.com, except that this is for GRANT/REVOKE it
seems.

The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.

[1]: /messages/by-id/HE1PR8303MB0075BF78AF1BE904050DA16BF7729@HE1PR8303MB0075.EURPRD83.prod.outlook.com

Attachments:

v1-0001-Don-t-run-flaky-tests-in-parallel-with-eachother.patchapplication/octet-stream; name=v1-0001-Don-t-run-flaky-tests-in-parallel-with-eachother.patchDownload
From c10cfbf3aac9a0f202e4290f694f853f57578050 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <postgres@jeltef.nl>
Date: Sun, 15 Dec 2024 11:58:09 +0100
Subject: [PATCH v1] Don't run flaky tests in parallel with eachother

---
 src/test/regress/parallel_schedule | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..db1493ee116 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -91,8 +91,10 @@ test: select_parallel
 test: write_parallel
 test: vacuum_parallel
 
-# no relation related tests can be put in this group
-test: publication subscription
+# These test are flaky when run in parallel with eachother, see:
+# https://www.postgresql.org/message-id/flat/18dcfb7f-5deb-4487-ae22-a2c16839519a%40gmail.com
+test: publication
+test: subscription
 
 # ----------
 # Another group of parallel tests

base-commit: 56499315a74f97d220373e396903c389d85c8933
-- 
2.34.1

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#2)
Re: Parallel tests publication and subscription might fail due to concurrent tuple update

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.

This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
seems.

The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.

I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year. What's more, that run was on v16 not master.

So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar. But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.

Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition. It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.

So on the whole I want to reject this. We can reconsider if we
see more such failures, of course.

regards, tom lane

#4vignesh C
vignesh21@gmail.com
In reply to: Tom Lane (#3)
Re: Parallel tests publication and subscription might fail due to concurrent tuple update

On Mon, 17 Mar 2025 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:

shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.

This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
seems.

The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.

I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year. What's more, that run was on v16 not master.

So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar. But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.

Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition. It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.

So on the whole I want to reject this. We can reconsider if we
see more such failures, of course.

I suggest we close the commitfest entry at [1]https://commitfest.postgresql.org/patch/5459/ and create a new one if
we encounter this buildfarm failure again.
[1]: https://commitfest.postgresql.org/patch/5459/

Regards,
Vignesh