Parallel tests publication and subscription might fail due to concurrent tuple update
Hello hackers,
A recent desman failure [1]https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=desman&dt=2024-12-09%2018%3A33%3A49&stg=check with the following diagnostics:
# parallel group (2 tests): subscription publication
not ok 157 + publication 2251 ms
ok 158 + subscription 415 ms
--- /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/expected/publication.out 2024-12-09
18:34:02.939762233 +0000
+++ /home/fedora/17-desman/buildroot/REL_16_STABLE/pgsql.build/src/test/regress/results/publication.out 2024-12-09
18:44:48.582958859 +0000
@@ -1193,23 +1193,29 @@
ERROR: permission denied for database regression
SET ROLE regress_publication_user;
GRANT CREATE ON DATABASE regression TO regress_publication_user2;
+ERROR: tuple concurrently updated
SET ROLE regress_publication_user2;
SET client_min_messages = 'ERROR';
CREATE PUBLICATION testpub2; -- ok
+ERROR: permission denied for database regression
and postmaster.log containing:
2024-12-09 18:44:46.753 UTC [1345157:903] pg_regress/publication STATEMENT: CREATE PUBLICATION testpub2;
2024-12-09 18:44:46.753 UTC [1345158:287] pg_regress/subscription LOG: statement: REVOKE CREATE ON DATABASE REGRESSION
FROM regress_subscription_user3;
2024-12-09 18:44:46.754 UTC [1345157:904] pg_regress/publication LOG: statement: SET ROLE regress_publication_user;
2024-12-09 18:44:46.754 UTC [1345157:905] pg_regress/publication LOG: statement: GRANT CREATE ON DATABASE regression TO
regress_publication_user2;
2024-12-09 18:44:46.754 UTC [1345157:906] pg_regress/publication ERROR: tuple concurrently updated
2024-12-09 18:44:46.754 UTC [1345157:907] pg_regress/publication STATEMENT: GRANT CREATE ON DATABASE regression TO
regress_publication_user2;
shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.
I've managed to reproduce the error with:
sed -E "s/(REVOKE CREATE ON DATABASE REGRESSION FROM regress_subscription_user3;)/$(printf '\\1%.0s' {1..2000})/" -i.bak \
src/test/regress/sql/subscription.sql src/test/regress/expected/subscription.out
sed -E "s/(GRANT CREATE ON DATABASE regression TO regress_publication_user2;)/$(printf '\\1%.0s' {1..1000})/" -i.bak \
src/test/regress/sql/publication.sql src/test/regress/expected/publication.out
sed -E "s/(test: publication subscription$)/$(printf '\\1\\n%.0s' {1..10})/" -i.bak src/test/regress/parallel_schedule
This makes `make check` fail like below:
# parallel group (2 tests): subscription publication
ok 170 + publication 202 ms
ok 171 + subscription 100 ms
# parallel group (2 tests): subscription publication
ok 172 + publication 198 ms
not ok 173 + subscription 107 ms
# parallel group (2 tests): subscription publication
ok 174 + publication 204 ms
ok 175 + subscription 100 ms
src/test/regress/regression.diffs contains:
+ERROR: tuple concurrently updated
This issue is reproduced starting from commit c3afe8cf5 (dated 2023-03-30),
which added "REVOKE CREATE ON DATABASE REGRESSION ..." into the subscription
test.
Best regards,
Alexander
On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.
This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1]/messages/by-id/HE1PR8303MB0075BF78AF1BE904050DA16BF7729@HE1PR8303MB0075.EURPRD83.prod.outlook.com, except that this is for GRANT/REVOKE it
seems.
The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.
[1]: /messages/by-id/HE1PR8303MB0075BF78AF1BE904050DA16BF7729@HE1PR8303MB0075.EURPRD83.prod.outlook.com
Attachments:
v1-0001-Don-t-run-flaky-tests-in-parallel-with-eachother.patchapplication/octet-stream; name=v1-0001-Don-t-run-flaky-tests-in-parallel-with-eachother.patchDownload
From c10cfbf3aac9a0f202e4290f694f853f57578050 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <postgres@jeltef.nl>
Date: Sun, 15 Dec 2024 11:58:09 +0100
Subject: [PATCH v1] Don't run flaky tests in parallel with eachother
---
src/test/regress/parallel_schedule | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..db1493ee116 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -91,8 +91,10 @@ test: select_parallel
test: write_parallel
test: vacuum_parallel
-# no relation related tests can be put in this group
-test: publication subscription
+# These test are flaky when run in parallel with eachother, see:
+# https://www.postgresql.org/message-id/flat/18dcfb7f-5deb-4487-ae22-a2c16839519a%40gmail.com
+test: publication
+test: subscription
# ----------
# Another group of parallel tests
base-commit: 56499315a74f97d220373e396903c389d85c8933
--
2.34.1
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.
This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
seems.
The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.
I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year. What's more, that run was on v16 not master.
So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar. But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.
Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition. It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.
So on the whole I want to reject this. We can reconsider if we
see more such failures, of course.
regards, tom lane
On Mon, 17 Mar 2025 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
On Sun, 15 Dec 2024 at 10:00, Alexander Lakhin <exclusion@gmail.com> wrote:
shows that the subscription and publication tests are not concurrent-safe,
because modifying the same pg_database entry might fail with the "tuple
concurrently updated" error.This seems related to this thread about concurrency issues in
ALTER/DROP SUBSCRIPTION[1], except that this is for GRANT/REVOKE it
seems.The easiest way to address the flakiness of this test though is
probably to just don't run these tests in in parallel. See attached.I grepped through the buildfarm logs and discovered that desman's
run of 2024-12-09 18:33:49 is the *only* such failure recorded
in the last year. What's more, that run was on v16 not master.So now I'm inclined to think that "do nothing" is the right answer.
It would be kind of sad to lose all parallelism for these two
tests, and one-failure-per-year is surely below our noise threshold.
(Mind you, I'd love to be in a position where that sort of failure
rate does make it onto our radar. But we're not there today.)
The fact that it's only been seen on v16 may well mean that subsequent
changes in one or the other test have further reduced the failure
probability, too.Also, we'd be unlikely to remember to undo this change if anyone
ever fixes the GRANT/REVOKE race condition. It seems possible that
someone will get annoyed enough with that to make it happen, because
we've seen related field complaints.So on the whole I want to reject this. We can reconsider if we
see more such failures, of course.
I suggest we close the commitfest entry at [1]https://commitfest.postgresql.org/patch/5459/ and create a new one if
we encounter this buildfarm failure again.
[1]: https://commitfest.postgresql.org/patch/5459/
Regards,
Vignesh