A failure in prepared_xacts test

michael@paquier.xyz

almost 2 years ago

In reply to: Richard Guo (#1)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 09:12:48AM +0800, Richard Guo wrote:

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

The only location in the whole tree where we want to do predictible
scans of pg_prepared_xacts is prepared_xacts.sql, so rather than
playing with 2PC transactions across a bunch of tests, I think that we
should do two things, both touching prepared_xacts.sql:
- The 2PC transactions run in the main regression test suite should
use names that would be unlikely used elsewhere.
- Limit the scans of pg_prepared_xacts on these name patterns to avoid
interferences.

See for example the attached with both expected outputs updated
depending on the value set for max_prepared_transactions in the
backend. There may be an argument in back-patching that, but I don't
recall seeing this failure in the CI, so perhaps that's not worth
bothering with. What do you think?
--
Michael

Alexander Lakhin

exclusion@gmail.com

almost 2 years ago

In reply to: Richard Guo (#1)

Re: A failure in prepared_xacts test

Hello Richard,

29.04.2024 04:12, Richard Guo wrote:

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?

I think this could be caused by the ecpg test twophase executed
simultaneously with the test prepared_xacts thanks to meson's jobs
parallelization.

Best regards,
Alexander

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Michael Paquier (#2)

Re: A failure in prepared_xacts test

Michael Paquier <michael@paquier.xyz> writes:

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-17%2016%3A33%3A23

michael@paquier.xyz

almost 2 years ago

In reply to: Tom Lane (#4)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 01:11:00AM -0400, Tom Lane wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I don't disagree with your point, still I'm not sure that this can be
made entirely bullet-proof. Anyway, I think that we should still
improve this test and make it more robust for parallel operations:
installcheck fails equally on HEAD if there is a prepared transaction
on the backend where the tests run, and that seems like a bad idea to
me to rely on cluster-wide scans for what should be a "local" test.
--
Michael

Alexander Lakhin

exclusion@gmail.com

almost 2 years ago

In reply to: Tom Lane (#4)

Re: A failure in prepared_xacts test

Hello Tom and Michael,

29.04.2024 08:11, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

Yes, I'm afraid of the same. For example, the test failure [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-17%2016%3A33%3A23 is of that
ilk, I guess.

Best regards,
Alexander

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Michael Paquier (#5)

Re: A failure in prepared_xacts test

Michael Paquier <michael@paquier.xyz> writes:

I don't disagree with your point, still I'm not sure that this can be
made entirely bullet-proof. Anyway, I think that we should still
improve this test and make it more robust for parallel operations:
installcheck fails equally on HEAD if there is a prepared transaction
on the backend where the tests run, and that seems like a bad idea to
me to rely on cluster-wide scans for what should be a "local" test.

True, it's antithetical to the point of an "installcheck" test if
unrelated actions in another database can break it. So I'm fine
with tightening up prepared_xacts's query. I just wonder how far
we want to try to carry this.

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

regards, tom lane

michael@paquier.xyz

almost 2 years ago

In reply to: Tom Lane (#7)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.
--
Michael

guofenglinux@gmail.com

almost 2 years ago

In reply to: Tom Lane (#4)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I have the same concern. I suspect that the scan of pg_prepared_xacts
is not the only test that could cause problems when running in parallel
to other tests from the entire source tree.

Thanks
Richard

#10

guofenglinux@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#8)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 2:58 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

Thanks
Richard

#11

michael@paquier.xyz

almost 2 years ago

In reply to: Richard Guo (#10)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 05:11:19PM +0800, Richard Guo wrote:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests run in isolation of the rest with their own clusters
initialized from a copy initdb'd (rather than initdb because that's
much cheaper), so these scans are OK left alone.
--
Michael

#12

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Richard Guo (#10)

Re: A failure in prepared_xacts test

Richard Guo <guofenglinux@gmail.com> writes:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

regards, tom lane

#13

michael@paquier.xyz

almost 2 years ago

In reply to: Tom Lane (#12)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 09:45:16AM -0400, Tom Lane wrote:

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

There are a few more 2PC transactions in test_decoding (no
installcheck), temp.sql, test_extensions.sql and pg_stat_statements's
utility.sql (no installcheck) but their GIDs are not that bad.
twophase_stream.sql has a GID "test1", which is kind of generic, but
it won't run in parallel. At the end, only addressing the
prepared_xacts.sql and the ECPG bits looked enough to me, so I've
tweaked these with 7e61e4cc7cfc and called it a day.

I'd be curious about any discussion involving the structure of the
meson tests.
--
Michael

#14

guofenglinux@gmail.com

almost 2 years ago

In reply to: Tom Lane (#12)

Re: A failure in prepared_xacts test

On Mon, Apr 29, 2024 at 9:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

Thank you for the explanation. I wasn't aware of this before.

Thanks
Richard

#15

[1]: /messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com
/messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com

guofenglinux@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#13)

Re: A failure in prepared_xacts test

On Tue, Apr 30, 2024 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:

I'd be curious about any discussion involving the structure of the
meson tests.

+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]/messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com. I haven't looked into it though.

Thanks
Richard

#16

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Richard Guo (#15)

Re: A failure in prepared_xacts test

Richard Guo <guofenglinux@gmail.com> writes:

+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]. I haven't looked into it though.
[1] /messages/by-id/cbf0156f-5aa1-91db-5802-82435dda03e6@gmail.com

The mechanism there is pretty obvious: a plancache flush happened
at just the wrong (right?) time and caused the output to change,
as indeed the comment acknowledges:

-- currently, this fails due to cached plan for "r.f1 + 1" expression
-- (but if debug_discard_caches is on, it will succeed)

I wonder if we shouldn't just remove that test case as being
too unstable -- especially since it's not proving much anyway.

regards, tom lane

#17

Jingxian Li

aqktjcm@qq.com

almost 2 years ago

In reply to: Richard Guo (#1)

[PATCH] Fix bug when calling strncmp in check_authmethod_valid

Hi all,

Attached is a patch that fixes bug when calling strncmp function, in
which case the third argument (authmethod - strchr(authmethod, ' '))
may be negative, which is not as expected..

With Regards,
Jingxian Li.

#18