A failure in prepared_xacts test

Started by Richard Guoalmost 2 years ago25 messages
Jump to latest
#1Richard Guo
guofenglinux@gmail.com

Yesterday I noticed a failure on cirrus-ci for the 'Right Semi Join'
patch. The failure can be found at [1]https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs, and it looks like:

--- /tmp/cirrus-ci-build/src/test/regress/expected/prepared_xacts.out
2024-04-27 00:41:25.831297000 +0000
+++
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/prepared_xacts.out
  2024-04-27 00:45:50.261369000 +0000
@@ -83,8 +83,9 @@
 SELECT gid FROM pg_prepared_xacts;
  gid
 ------
+ gxid
  foo3
-(1 row)
+(2 rows)

Upon closer look, it seems that this issue is not caused by the patch
about 'Right Semi Join', because this query, which initially included
two left joins, can actually be reduced to a function scan after
removing these two useless left joins. It seems that no semi-joins
would be involved.

EXPLAIN SELECT gid FROM pg_prepared_xacts;
QUERY PLAN
----------------------------------------------------------------------------
Function Scan on pg_prepared_xact p (cost=0.00..10.00 rows=1000 width=32)
(1 row)

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?

[1]: https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs

Thanks
Richard

#2Michael Paquier
michael@paquier.xyz
In reply to: Richard Guo (#1)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 09:12:48AM +0800, Richard Guo wrote:

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

The only location in the whole tree where we want to do predictible
scans of pg_prepared_xacts is prepared_xacts.sql, so rather than
playing with 2PC transactions across a bunch of tests, I think that we
should do two things, both touching prepared_xacts.sql:
- The 2PC transactions run in the main regression test suite should
use names that would be unlikely used elsewhere.
- Limit the scans of pg_prepared_xacts on these name patterns to avoid
interferences.

See for example the attached with both expected outputs updated
depending on the value set for max_prepared_transactions in the
backend. There may be an argument in back-patching that, but I don't
recall seeing this failure in the CI, so perhaps that's not worth
bothering with. What do you think?
--
Michael

Attachments:

2pc-tests.patchtext/x-diff; charset=us-asciiDownload+105-98
#3Alexander Lakhin
exclusion@gmail.com
In reply to: Richard Guo (#1)
Re: A failure in prepared_xacts test

Hello Richard,

29.04.2024 04:12, Richard Guo wrote:

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears.  Does
it suggest that the test case is flaky?

I think this could be caused by the ecpg test twophase executed
simultaneously with the test prepared_xacts thanks to meson's jobs
parallelization.

Best regards,
Alexander

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#2)
Re: A failure in prepared_xacts test

Michael Paquier <michael@paquier.xyz> writes:

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

regards, tom lane

#5Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#4)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 01:11:00AM -0400, Tom Lane wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I don't disagree with your point, still I'm not sure that this can be
made entirely bullet-proof. Anyway, I think that we should still
improve this test and make it more robust for parallel operations:
installcheck fails equally on HEAD if there is a prepared transaction
on the backend where the tests run, and that seems like a bad idea to
me to rely on cluster-wide scans for what should be a "local" test.
--
Michael

#6Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#4)
Re: A failure in prepared_xacts test

Hello Tom and Michael,

29.04.2024 08:11, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

Yes, I'm afraid of the same. For example, the test failure [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&amp;dt=2024-04-17%2016%3A33%3A23 is of that
ilk, I guess.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&amp;dt=2024-04-17%2016%3A33%3A23

Best regards,
Alexander

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#5)
Re: A failure in prepared_xacts test

Michael Paquier <michael@paquier.xyz> writes:

I don't disagree with your point, still I'm not sure that this can be
made entirely bullet-proof. Anyway, I think that we should still
improve this test and make it more robust for parallel operations:
installcheck fails equally on HEAD if there is a prepared transaction
on the backend where the tests run, and that seems like a bad idea to
me to rely on cluster-wide scans for what should be a "local" test.

True, it's antithetical to the point of an "installcheck" test if
unrelated actions in another database can break it. So I'm fine
with tightening up prepared_xacts's query. I just wonder how far
we want to try to carry this.

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

regards, tom lane

#8Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#7)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.
--
Michael

#9Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#4)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I have the same concern. I suspect that the scan of pg_prepared_xacts
is not the only test that could cause problems when running in parallel
to other tests from the entire source tree.

Thanks
Richard

#10Richard Guo
guofenglinux@gmail.com
In reply to: Michael Paquier (#8)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 2:58 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

Thanks
Richard

#11Michael Paquier
michael@paquier.xyz
In reply to: Richard Guo (#10)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 05:11:19PM +0800, Richard Guo wrote:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests run in isolation of the rest with their own clusters
initialized from a copy initdb'd (rather than initdb because that's
much cheaper), so these scans are OK left alone.
--
Michael

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#10)
Re: A failure in prepared_xacts test

Richard Guo <guofenglinux@gmail.com> writes:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

regards, tom lane

#13Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#12)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 09:45:16AM -0400, Tom Lane wrote:

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

There are a few more 2PC transactions in test_decoding (no
installcheck), temp.sql, test_extensions.sql and pg_stat_statements's
utility.sql (no installcheck) but their GIDs are not that bad.
twophase_stream.sql has a GID "test1", which is kind of generic, but
it won't run in parallel. At the end, only addressing the
prepared_xacts.sql and the ECPG bits looked enough to me, so I've
tweaked these with 7e61e4cc7cfc and called it a day.

I'd be curious about any discussion involving the structure of the
meson tests.
--
Michael

#14Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#12)
Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 9:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

Thank you for the explanation. I wasn't aware of this before.

Thanks
Richard

#15Richard Guo
guofenglinux@gmail.com
In reply to: Michael Paquier (#13)
Re: A failure in prepared_xacts test

On Tue, Apr 30, 2024 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:

I'd be curious about any discussion involving the structure of the
meson tests.

+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]/messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com. I haven't looked into it though.

[1]: /messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com
/messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com

Thanks
Richard

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#15)
Re: A failure in prepared_xacts test

Richard Guo <guofenglinux@gmail.com> writes:

+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]. I haven't looked into it though.
[1] /messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com

The mechanism there is pretty obvious: a plancache flush happened
at just the wrong (right?) time and caused the output to change,
as indeed the comment acknowledges:

-- currently, this fails due to cached plan for "r.f1 + 1" expression
-- (but if debug_discard_caches is on, it will succeed)

I wonder if we shouldn't just remove that test case as being
too unstable -- especially since it's not proving much anyway.

regards, tom lane

#17Jingxian Li
aqktjcm@qq.com
In reply to: Richard Guo (#1)
[PATCH] Fix bug when calling strncmp in check_authmethod_valid

Hi all,

Attached is a patch that fixes bug when calling strncmp function, in
which case the third argument (authmethod - strchr(authmethod, ' '))
may be negative, which is not as expected..

With Regards,
Jingxian Li.

Attachments:

v1-0001-Fix-bug-when-calling-strncmp-in-check_authmethod_valid.patchapplication/octet-stream; name=v1-0001-Fix-bug-when-calling-strncmp-in-check_authmethod_valid.patchDownload+1-2
#18Richard Guo
guofenglinux@gmail.com
In reply to: Jingxian Li (#17)
Re: [PATCH] Fix bug when calling strncmp in check_authmethod_valid

On Tue, Apr 30, 2024 at 10:41 AM Jingxian Li <aqktjcm@qq.com> wrote:

Attached is a patch that fixes bug when calling strncmp function, in
which case the third argument (authmethod - strchr(authmethod, ' '))
may be negative, which is not as expected..

Nice catch. I think you're right from a quick glance.

Thanks
Richard

#19Daniel Gustafsson
daniel@yesql.se
In reply to: Jingxian Li (#17)
Re: [PATCH] Fix bug when calling strncmp in check_authmethod_valid

On 30 Apr 2024, at 04:41, Jingxian Li <aqktjcm@qq.com> wrote:

Attached is a patch that fixes bug when calling strncmp function, in
which case the third argument (authmethod - strchr(authmethod, ' '))
may be negative, which is not as expected..

The calculation is indeed incorrect, but the lack of complaints of it being
broken made me wonder why this exist in the first place. This dates back to
e7029b212755, just shy of 2 decades old, which added --auth with support for
strings with auth-options to ident and pam like --auth 'pam <servicename>' and
'ident sameuser'. Support for options to ident was removed in 01c1a12a5bb4 but
options to pam is still supported (although not documented), but was AFAICT
broken in commit 8a02339e9ba3 some 12 years ago with this strncmp().

-  if (strncmp(authmethod, *p, (authmethod - strchr(authmethod, ' '))) == 0)
+  if (strncmp(authmethod, *p, (strchr(authmethod, ' ') - authmethod)) == 0)

This with compare "pam postgresql" with "pam" and not "pam " so the length
should be "(strchr(authmethod, ' ') - authmethod + 1)" since "pam " is a method
separate from "pam" in auth_methods_{host|local}. We don't want to allow "md5
" as that's not a method in the array of valid methods.

But, since it's been broken in all supported versions of postgres and has
AFAICT never been documented to exist, should we fix it or just remove it? We
don't support auth-options for any other methods, like clientcert to cert for
example. If we fix it we should also document that it works IMHO.

--
Daniel Gustafsson

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#4)
Re: A failure in prepared_xacts test

On 29.04.24 07:11, Tom Lane wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I don't think there is anything fundamentally different in the
parallelism setups of the make-based and the meson-based tests. There
are just different implementation details that might affect the likely
orderings and groupings.

#21Jingxian Li
aqktjcm@qq.com
In reply to: Richard Guo (#1)
#22Daniel Gustafsson
daniel@yesql.se
In reply to: Jingxian Li (#21)
#23Aleksander Alekseev
aleksander@timescale.com
In reply to: Daniel Gustafsson (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Aleksander Alekseev (#23)
#25Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Paquier (#24)