xid_wraparound tests intermittent failure.
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
In the latest iteration the failure looks like this
stderr:
# poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 1.
(test program exited with status code 29)
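(For readers unfamiliar with the TAP harness: poll_query_until reruns the query until the expected output appears or the test timeout elapses. In outline, and only as a sketch of the real PostgreSQL::Test::Cluster logic, with run_query standing in for the actual psql call:)

```perl
use Time::HiRes qw(usleep);

# Simplified sketch of poll_query_until; run_query is a hypothetical
# helper standing in for the real psql invocation.
sub poll_query_until_sketch
{
	my ($query, $expected) = @_;
	my $max_attempts = 10 * ($ENV{PG_TEST_TIMEOUT_DEFAULT} // 180);
	while ($max_attempts-- > 0)
	{
		return 1 if run_query($query) eq $expected;    # here: 't'
		usleep(100_000);    # wait 100ms and retry
	}
	return 0;    # timed out; the diagnostics above are then emitted
}
```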
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Summary of Failures:
295/295 postgresql:xid_wraparound / xid_wraparound/001_emergency_vacuum
ERROR 211.76s exit status 29
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:
'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},
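For anyone wanting to reproduce this outside the buildfarm, the same knobs work in a local environment (a sketch; the 360-second value copies perentie's setting, the default being 180):

```shell
# Opt in to the extra suite and stretch the per-query timeout,
# mirroring perentie's build_env above.
export PG_TEST_EXTRA=xid_wraparound
export PG_TEST_TIMEOUT_DEFAULT=360
```

With those set, `meson test --suite xid_wraparound` (or running the tests in src/test/modules/xid_wraparound) should pick the suite up.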
One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
regards, tom lane
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:
'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},
One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hello,
21.07.2024 20:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.
[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2]: /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com
Best regards,
Alexander
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
Hmm, doesn't sound like that ought to be slow.
I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:
$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]
Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys
Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.
regards, tom lane
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.
On 2024-07-21 Su 4:08 PM, Alexander Lakhin wrote:
Hello,
21.07.2024 20:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running
in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.
[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2] /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com
It's sad nothing has happened about this for 2 months.
There's no point in having unreliable tests. What's not 100% clear to me
is whether this failure indicates a badly formulated test or the test is
correct and has identified an underlying bug.
Regarding the point in [2] about the test being run twice in buildfarm
clients, I think we should mark the module as NO_INSTALLCHECK in the
Makefile.
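A sketch of what that would look like in the module's Makefile (not the actual file contents; the real Makefile defines more variables):

```makefile
# src/test/modules/xid_wraparound/Makefile (sketch)
TAP_TESTS = 1

# Prevent the buildfarm's testmodules-install-check step from running
# this module a second time against the installed instance.
NO_INSTALLCHECK = 1
```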
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Sun, Jul 21, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.
Wow, that's an incredible wiki page.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Jul 21, 2024 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
Hmm, doesn't sound like that ought to be slow.
I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:
$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]
Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys
Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.
Since the server writes a lot of logs during the xid_wraparound test,
"slow disk" could also be a reason.
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C? I thought
the following link was the server logs, but since there seem to be no
autovacuum log entries, I suspect there is another log file:
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)
regards, tom lane
On Mon, Jul 22, 2024 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
During the xid_wraparound test in testmodules-install-check-C, two
clusters are running at the same time, which could make the
xid_wraparound test slower.
(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)
+1 not running it twice.
There are test modules that have only TAP tests and are not marked as
NO_INSTALLCHECK, for example test_custom_rmgrs. Presumably we don't
want to run those tests twice either?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the
issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu
instance is especially slow. e.g. "meson compile" normally takes a
handful of seconds. Maybe concurrent tests make it more likely, but they
can't be the only cause.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.
regards, tom lane
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan<andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
See
<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-07-22 Mo 10:11 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.
Yeah, I agree that's perplexing. That step doesn't run with "make -j
nn", so it's a bit hard to see why it should get different results from
one run rather than the other. The only thing that's different is that
there's another postgres instance running. Maybe that's just enough to
slow the test down? After all, this is an RPi.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jul 23, 2024 at 3:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
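Concretely (a sketch of the idea, not the eventual patch; the table name is just borrowed from the existing tests):

```sql
-- Keep the autovacuum launcher running cluster-wide (autovacuum = on,
-- the default), but opt the test table out via its reloption, so the
-- only vacuums that reach it are anti-wraparound ones.
CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
```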
Regards,
[1]: /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
I've attached the patch. Could you please test if the patch fixes the
instability you observed?
Since we turn off autovacuum on all three tests and we wait for
autovacuum to complete processing databases, these tests potentially
have a similar (but lower) risk. So I modified these tests to turn it
on so we can ensure the autovacuum runs periodically.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
stabilize_xid_wraparound_test.patch (application/octet-stream)
diff --git a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
index 37550b67a4..2692b35f34 100644
--- a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
+++ b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
@@ -18,7 +18,6 @@ my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -27,23 +26,25 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create tables for a few different test scenarios
+# Create tables for a few different test scenarios. We disable autovacuum
+# on these tables to run it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large(data) SELECT generate_series(1,30000);
-CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large_trunc(data) SELECT generate_series(1,30000);
-CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small(data) SELECT generate_series(1,15000);
-CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small_trunc(data) SELECT generate_series(1,15000);
-
-CREATE TABLE autovacuum_disabled(id serial primary key, data text) WITH (autovacuum_enabled=false);
-INSERT INTO autovacuum_disabled(data) SELECT generate_series(1,1000);
]);
# Bump the query timeout to avoid false negatives on slow test systems.
@@ -63,7 +64,6 @@ $background_psql->query_safe(
DELETE FROM large_trunc WHERE id > 10000;
DELETE FROM small WHERE id % 2 = 0;
DELETE FROM small_trunc WHERE id > 1000;
- DELETE FROM autovacuum_disabled WHERE id % 2 = 0;
]);
# Consume 2 billion XIDs, to get us very close to wraparound
@@ -107,20 +107,18 @@ $ret = $node->safe_psql(
'postgres', qq[
SELECT relname, age(relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
FROM pg_class
-WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc', 'autovacuum_disabled')
+WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc')
ORDER BY 1
]);
-is( $ret, "autovacuum_disabled|f
-large|f
+is( $ret, "large|f
large_trunc|f
small|f
small_trunc|f", "all tables are vacuumed");
# Check if vacuum failsafe was triggered for each table.
my $log_contents = slurp_file($node->logfile, $log_offset);
-foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc',
- 'autovacuum_disabled')
+foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc')
{
like(
$log_contents,
diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index c02c287167..3ac080f18b 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -27,17 +27,18 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only to prevent wraparound
+autovacuum = off
autovacuum_naptime = 1s
log_autovacuum_min_duration = 0
]);
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run it only
+# to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('start');
]);
diff --git a/src/test/modules/xid_wraparound/t/003_wraparounds.pl b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
index 88063b4b52..bf9ec038a0 100644
--- a/src/test/modules/xid_wraparound/t/003_wraparounds.pl
+++ b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
@@ -21,7 +21,7 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
+autovacuum = off
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -30,10 +30,11 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run
+# it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('beginning');
]);
On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada<sawada.mshk@gmail.com> wrote:
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan<andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
I've attached the patch. Could you please test if the patch fixes the
instability you observed?

Since we turn off autovacuum on all three tests and we wait for
autovacuum to complete processing databases, these tests potentially
have a similar (but lower) risk. So I modified these tests to turn it
on so we can ensure the autovacuum runs periodically.
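The per-table approach described above can be sketched in SQL. The `autovacuum_enabled` storage parameter is the real reloption the patch uses; the table name here is illustrative:

```sql
-- Leave the global autovacuum GUC at its default (on) so workers are
-- launched periodically.  Opt the test table out of ordinary autovacuum
-- with the per-table reloption; anti-wraparound (freeze) vacuums are
-- still forced regardless of this setting.
CREATE TABLE wraparound_example(t text) WITH (autovacuum_enabled = off);

-- The same reloption can be toggled on an existing table:
ALTER TABLE wraparound_example SET (autovacuum_enabled = on);
```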
I assume you actually meant to remove the "autovacuum = off" in
003_wraparound.pl. With that change in your patch I retried my test, but
on iteration 100 out of 100 it failed on test 002_limits.pl.
You can see the logs at
<https://f001.backblazeb2.com/file/net-dunslane-public/002_limits-failure-log.tar.bz2>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:
I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.
I think we need to remove the "autovacuum = off" also in 002_limits.pl
as it waits for autovacuum to process both template0 and template1
databases. Just to be clear, the failure happened even without
"autovacuum = off"?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
I think we need to remove the "autovacuum = off" also in 002_limits.pl
as it waits for autovacuum to process both template0 and template1
databases. Just to be clear, the failure happened even without
"autovacuum = off"?
The attached patch, a slight modification of yours, removes "autovacuum
= off" for all three tests; with that change, a set of 200 runs was clean
for me.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
xid_wraparound-test-fix.patch (text/x-patch)
diff --git a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
index 37550b67a4..2692b35f34 100644
--- a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
+++ b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
@@ -18,7 +18,6 @@ my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -27,23 +26,25 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create tables for a few different test scenarios
+# Create tables for a few different test scenarios. We disable autovacuum
+# on these tables to run it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large(data) SELECT generate_series(1,30000);
-CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large_trunc(data) SELECT generate_series(1,30000);
-CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small(data) SELECT generate_series(1,15000);
-CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small_trunc(data) SELECT generate_series(1,15000);
-
-CREATE TABLE autovacuum_disabled(id serial primary key, data text) WITH (autovacuum_enabled=false);
-INSERT INTO autovacuum_disabled(data) SELECT generate_series(1,1000);
]);
# Bump the query timeout to avoid false negatives on slow test systems.
@@ -63,7 +64,6 @@ $background_psql->query_safe(
DELETE FROM large_trunc WHERE id > 10000;
DELETE FROM small WHERE id % 2 = 0;
DELETE FROM small_trunc WHERE id > 1000;
- DELETE FROM autovacuum_disabled WHERE id % 2 = 0;
]);
# Consume 2 billion XIDs, to get us very close to wraparound
@@ -107,20 +107,18 @@ $ret = $node->safe_psql(
'postgres', qq[
SELECT relname, age(relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
FROM pg_class
-WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc', 'autovacuum_disabled')
+WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc')
ORDER BY 1
]);
-is( $ret, "autovacuum_disabled|f
-large|f
+is( $ret, "large|f
large_trunc|f
small|f
small_trunc|f", "all tables are vacuumed");
# Check if vacuum failsafe was triggered for each table.
my $log_contents = slurp_file($node->logfile, $log_offset);
-foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc',
- 'autovacuum_disabled')
+foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc')
{
like(
$log_contents,
diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index c02c287167..aca3fa1514 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -27,17 +27,17 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only to prevent wraparound
autovacuum_naptime = 1s
log_autovacuum_min_duration = 0
]);
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run it only
+# to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('start');
]);
diff --git a/src/test/modules/xid_wraparound/t/003_wraparounds.pl b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
index 88063b4b52..3eaa46a94d 100644
--- a/src/test/modules/xid_wraparound/t/003_wraparounds.pl
+++ b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
@@ -21,7 +21,6 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -30,10 +29,11 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run
+# it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('beginning');
]);
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
The attached patch, a slight modification of yours, removes "autovacuum
= off" for all three tests; with that change, a set of 200 runs was clean
for me.
Oh, I missed that I had left "autovacuum = off" in the 002 test for some
reason. Thank you for attaching the patch; it looks good to me.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
Oh, I missed that I had left "autovacuum = off" in the 002 test.
Thank you for attaching the patch; it looks good to me.
Thanks, pushed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jul 30, 2024 at 3:29 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Thanks, pushed.
Thanks!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com