xid_wraparound tests intermittent failure.

Started by Andrew Dunstanover 1 year ago26 messages
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

In the latest iteration the failure looks like this

stderr:
# poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
#   SELECT *
#   FROM pg_database
#   WHERE age(datfrozenxid) >
current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 1.

(test program exited with status code 29)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

Summary of Failures:

295/295 postgresql:xid_wraparound / xid_wraparound/001_emergency_vacuum
ERROR           211.76s   exit status 29

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: xid_wraparound tests intermittent failure.

Andrew Dunstan <andrew@dunslane.net> writes:

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&amp;br=HEAD

perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:

'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},

One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)

Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: xid_wraparound tests intermittent failure.

On 2024-07-21 Su 1:34 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&amp;br=HEAD

perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:

'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},

One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)

Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?

Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux

It's a VM running on UTM/Apple Silicon

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#2)
Re: xid_wraparound tests intermittent failure.

Hello,

21.07.2024 20:34, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&amp;br=HEAD

I think this failure is counted at [1]https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing. Please look at the linked message
[2]: /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2]: /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com

Best regards,
Alexander

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: xid_wraparound tests intermittent failure.

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-21 Su 1:34 PM, Tom Lane wrote:

Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?

Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon

Hmm, doesn't sound like that ought to be slow.

I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:

$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]

Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys

Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.

regards, tom lane

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#4)
Re: xid_wraparound tests intermittent failure.

On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:

https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures

This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Alexander Lakhin (#4)
Re: xid_wraparound tests intermittent failure.

On 2024-07-21 Su 4:08 PM, Alexander Lakhin wrote:

Hello,

21.07.2024 20:34, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running
in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default.  (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.)  AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&amp;br=HEAD

I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.

[1]
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2]
/messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com

It's sad nothing has happened abut this for 2 months.

There's no point in having unreliable tests. What's not 100% clear to me
is whether this failure indicates a badly formulated test or the test is
correct and has identified an underlying bug.

Regarding the point in [2] about the test being run twice in buildfarm
clients, I think we should mark the module as NO_INSTALLCHECK in the
Makefile.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#8Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#6)
Re: xid_wraparound tests intermittent failure.

On Sun, Jul 21, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:

https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures

This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.

Wow, that's an incredible wiki page.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#5)
Re: xid_wraparound tests intermittent failure.

On Sun, Jul 21, 2024 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-21 Su 1:34 PM, Tom Lane wrote:

Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?

Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon

Hmm, doesn't sound like that ought to be slow.

I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:

$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]

Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys

Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.

Since the server writes a lot of logs during the xid_wraparound test,
"slow disk" could also be a reason.

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C? I thought
the following link is the server logs but since it seems there were no
autovacuum logs I suspected there is another log file:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&amp;dt=2024-07-20%2020%3A35%3A39&amp;stg=testmodules-install-check-C&amp;raw=1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#9)
Re: xid_wraparound tests intermittent failure.

Masahiko Sawada <sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)

regards, tom lane

#11Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#10)
Re: xid_wraparound tests intermittent failure.

On Mon, Jul 22, 2024 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

During the xid_wraparound test in testmodules-install-check-C two
clusters are running at the same time. This fact could make the
xid_wraparound test slower by any chance.

(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)

+1 not running it twice.

There are test modules that have only TAP tests and are not marked as
NO_INSTALLCHECK, for example test_custom_rmgrs. Probably we don't want
to run these tests twice too?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#12Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
Re: xid_wraparound tests intermittent failure.

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada<sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

I tested the theory that it was some other concurrent tests causing the
issue, but that didn't wash. Here's what I did:

    for f in `seq 1 100`
      do echo iteration = $f
      meson test --suite xid_wraparound || break
    done

It took until iteration 6 to get an error. I don't think my Ubuntu
instance is especially slow. e.g. "meson compile" normally takes a
handful of seconds. Maybe concurrent tests make it more likely, but they
can't be the only cause.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#12)
Re: xid_wraparound tests intermittent failure.

On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada <sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:

for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done

It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.

Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#12)
Re: xid_wraparound tests intermittent failure.

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada<sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.

regards, tom lane

#15Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#13)
Re: xid_wraparound tests intermittent failure.

On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:

On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan<andrew@dunslane.net> wrote:

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada<sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:

for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done

It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.

Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.

See
<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2&gt;

The failure logs are from a run where both tests 1 and 2 failed.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#14)
Re: xid_wraparound tests intermittent failure.

On 2024-07-22 Mo 10:11 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada<sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.

Yeah, I agree that's perplexing. That step doesn't run with "make -j
nn", so it's a bit hard to see why it should get different results from
one run rather than the other.  The only thing that's different is that
there's another postgres instance running. Maybe that's just enough to
slow the test down? After all, this is an RPi.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#17Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#15)
Re: xid_wraparound tests intermittent failure.

On Tue, Jul 23, 2024 at 3:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:

On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:

Masahiko Sawada <sawada.mshk@gmail.com> writes:

Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?

Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?

It's not deterministic.

I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:

for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done

It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.

Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.

See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2&gt;

The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin
mentioned[1]/messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,

[1]: /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#18Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#17)
Re: xid_wraparound tests intermittent failure.

On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:

See<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2&gt;

The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,

[1]/messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com

OK, do you want to propose a patch?

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#18)
Re: xid_wraparound tests intermittent failure.

On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:

See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2&gt;

The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,

[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com

OK, do you want to propose a patch?

Yes, I'll prepare and share it soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#19)
Re: xid_wraparound tests intermittent failure.

On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:

See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2&gt;

The failure logs are from a run where both tests 1 and 2 failed.

Thank you for sharing the logs.

I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,

[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com

OK, do you want to propose a patch?

Yes, I'll prepare and share it soon.

I've attached the patch. Could you please test if the patch fixes the
instability you observed?

Since we turn off autovacuum on all three tests and we wait for
autovacuum to complete processing databases, these tests potentially
have a similar (but lower) risk. So I modified these tests to turn it
on so we can ensure the autovacuum runs periodically.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

stabilize_xid_wraparound_test.patchapplication/octet-stream; name=stabilize_xid_wraparound_test.patchDownload+21-21
#21Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#20)
#22Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#21)
#23Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#22)
#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#23)
#25Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#24)
#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#25)