xid_wraparound tests intermittent failure.
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
In the latest iteration the failure looks like this
stderr:
# poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 1.
(test program exited with status code 29)
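(For readers unfamiliar with the TAP harness: poll_query_until reruns the query until the expected output appears or the test timeout elapses. In outline, and only as a sketch of the real PostgreSQL::Test::Cluster logic, with run_query standing in for the actual psql call:)

```perl
use Time::HiRes qw(usleep);

# Simplified sketch of poll_query_until; run_query is a hypothetical
# helper standing in for the real psql invocation.
sub poll_query_until_sketch
{
	my ($query, $expected) = @_;
	my $max_attempts = 10 * ($ENV{PG_TEST_TIMEOUT_DEFAULT} // 180);
	while ($max_attempts-- > 0)
	{
		return 1 if run_query($query) eq $expected;    # here: 't'
		usleep(100_000);    # wait 100ms and retry
	}
	return 0;    # timed out; the diagnostics above are then emitted
}
```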
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Summary of Failures:
295/295 postgresql:xid_wraparound / xid_wraparound/001_emergency_vacuum
ERROR 211.76s exit status 29
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:
'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},
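For anyone wanting to reproduce this outside the buildfarm, the same knobs work in a local environment (a sketch; the 360-second value copies perentie's setting, the default being 180):

```shell
# Opt in to the extra suite and stretch the per-query timeout,
# mirroring perentie's build_env above.
export PG_TEST_EXTRA=xid_wraparound
export PG_TEST_TIMEOUT_DEFAULT=360
```

With those set, `meson test --suite xid_wraparound` (or running the tests in src/test/modules/xid_wraparound) should pick the suite up.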
One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
regards, tom lane
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
perentie doesn't seem to be having a problem, but I will bet that
part of the reason is it's running with cranked-up timeouts:
'build_env' => {
'PG_TEST_EXTRA' => 'xid_wraparound',
'PG_TEST_TIMEOUT_DEFAULT' => '360'
},
One thing that seems quite interesting is that the test seems to
take about 10 minutes when successful on dodo, but when it fails
it's twice that. Why the instability? (Perhaps dodo has highly
variable background load, and the thing simply times out in some
runs but not others?)
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hello,
21.07.2024 20:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.
[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2]: /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com
Best regards,
Alexander
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
Hmm, doesn't sound like that ought to be slow.
I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:
$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]
Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys
Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.
regards, tom lane
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.
On 2024-07-21 Su 4:08 PM, Alexander Lakhin wrote:
Hello,
21.07.2024 20:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running
in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.
I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default. (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.) AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD
I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.
[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing
[2] /messages/by-id/5811175c-1a31-4869-032f-7af5e3e4506a@gmail.com
It's sad nothing has happened about this for 2 months.
There's no point in having unreliable tests. What's not 100% clear to me
is whether this failure indicates a badly formulated test or the test is
correct and has identified an underlying bug.
Regarding the point in [2] about the test being run twice in buildfarm
clients, I think we should mark the module as NO_INSTALLCHECK in the
Makefile.
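A sketch of what that would look like in the module's Makefile (not the actual file contents; the real Makefile defines more variables):

```makefile
# src/test/modules/xid_wraparound/Makefile (sketch)
TAP_TESTS = 1

# Prevent the buildfarm's testmodules-install-check step from running
# this module a second time against the installed instance.
NO_INSTALLCHECK = 1
```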
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Sun, Jul 21, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Jul 22, 2024 at 8:08 AM Alexander Lakhin <exclusion@gmail.com> wrote:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
This is great. Thanks for collating all this info here! And of
course all the research and reports behind it.
Wow, that's an incredible wiki page.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Jul 21, 2024 at 2:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-21 Su 1:34 PM, Tom Lane wrote:
Locally, I've not managed to reproduce the failure yet; so perhaps
there is some platform dependency. What are you testing on?
Linux ub22arm 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:08:40 UTC
2024 aarch64 aarch64 aarch64 GNU/Linux
It's a VM running on UTM/Apple Silicon
Hmm, doesn't sound like that ought to be slow.
I did manage to reproduce dodo's failures by running xid_wraparound
manually on mamba's very slow host:
$ time make -s installcheck PROVE_FLAGS=--timer
# +++ tap install-check in src/test/modules/xid_wraparound +++
[13:37:49] t/001_emergency_vacuum.pl .. 1/? # poll_query_until timed out executing this query:
#
# SELECT NOT EXISTS (
# SELECT *
# FROM pg_database
# WHERE age(datfrozenxid) > current_setting('autovacuum_freeze_max_age')::int)
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 4 just after 1.
[13:37:49] t/001_emergency_vacuum.pl .. Dubious, test returned 4 (wstat 1024, 0x400)
All 1 subtests passed
[14:06:51] t/002_limits.pl ............ 2/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
[14:06:51] t/002_limits.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
[14:31:16] t/003_wraparounds.pl ....... ok 7564763 ms ( 0.00 usr 0.01 sys + 13.82 cusr 9.26 csys = 23.09 CPU)
[16:37:21]
Test Summary Report
-------------------
t/001_emergency_vacuum.pl (Wstat: 1024 (exited 4) Tests: 1 Failed: 0)
Non-zero exit status: 4
Parse errors: No plan found in TAP output
t/002_limits.pl (Wstat: 7424 (exited 29) Tests: 2 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
Files=3, Tests=4, 10772 wallclock secs ( 0.15 usr 0.06 sys + 58.50 cusr 59.88 csys = 118.59 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:442: installcheck] Error 1
10772.99 real 59.34 user 60.14 sys
Each of those two failures looks just like something that dodo has
shown at one time or another. So it's at least plausible that
"slow machine" is the whole explanation. I'm still wondering
though if there's some effect that causes the test's runtime to
be unstable in itself, sometimes leading to timeouts.
Since the server writes a lot of logs during the xid_wraparound test,
"slow disk" could also be a reason.
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C? I thought
the following link was the server logs, but since there seem to be no
autovacuum log entries, I suspect there is another log file:
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)
regards, tom lane
On Mon, Jul 22, 2024 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
During the xid_wraparound test in testmodules-install-check-C, two
clusters are running at the same time, which could make the
xid_wraparound test slower.
(I agree with the comment that we shouldn't be running this test
twice, but that's a separate matter.)
+1 not running it twice.
There are test modules that have only TAP tests and are not marked as
NO_INSTALLCHECK, for example test_custom_rmgrs. Presumably we don't
want to run those tests twice either?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the
issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu
instance is especially slow. e.g. "meson compile" normally takes a
handful of seconds. Maybe concurrent tests make it more likely, but they
can't be the only cause.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.
regards, tom lane
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan<andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
See
<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-07-22 Mo 10:11 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada<sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
Perhaps. I tried "make check" on mamba's host and got exactly the
same failures as with "make installcheck", which counts in favor of
dodo's results being just luck. Still, dodo has now shown 11 failures
in "make installcheck" and zero in "make check", so it's getting hard
to credit that there's no difference.
Yeah, I agree that's perplexing. That step doesn't run with "make -j
nn", so it's a bit hard to see why it should get different results from
one run rather than the other. The only thing that's different is that
there's another postgres instance running. Maybe that's just enough to
slow the test down? After all, this is an RPi.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jul 23, 2024 at 3:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 9:29 PM, Masahiko Sawada wrote:
On Mon, Jul 22, 2024 at 12:53 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-22 Mo 12:46 PM, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
Looking at dodo's failures, it seems that while it passes
module-xid_wraparound-check, all failures happened only during
testmodules-install-check-C. Can we check the server logs written
during xid_wraparound test in testmodules-install-check-C?
Oooh, that is indeed an interesting observation. There are enough
examples now that it's hard to dismiss it as chance, but why would
the two runs be different?
It's not deterministic.
I tested the theory that it was some other concurrent tests causing the issue, but that didn't wash. Here's what I did:
for f in `seq 1 100`
do echo iteration = $f
meson test --suite xid_wraparound || break
done
It took until iteration 6 to get an error. I don't think my Ubuntu instance is especially slow. e.g. "meson compile" normally takes a handful of seconds. Maybe concurrent tests make it more likely, but they can't be the only cause.
Could you provide server logs in both OK and NG tests? I want to see
if there's a difference in the rate at which tables are vacuumed.
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
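Concretely (a sketch of the idea, not the eventual patch; the table name is just borrowed from the existing tests):

```sql
-- Keep the autovacuum launcher running cluster-wide (autovacuum = on,
-- the default), but opt the test table out via its reloption, so the
-- only vacuums that reach it are anti-wraparound ones.
CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
```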
Regards,
[1]: /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See <https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think that aligns with real
use cases.
Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
I've attached the patch. Could you please test if the patch fixes the
instability you observed?
Since we turn off autovacuum on all three tests and we wait for
autovacuum to complete processing databases, these tests potentially
have a similar (but lower) risk. So I modified these tests to turn it
on so we can ensure the autovacuum runs periodically.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
stabilize_xid_wraparound_test.patch (application/octet-stream)
diff --git a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
index 37550b67a4..2692b35f34 100644
--- a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
+++ b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
@@ -18,7 +18,6 @@ my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -27,23 +26,25 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create tables for a few different test scenarios
+# Create tables for a few different test scenarios. We disable autovacuum
+# on these tables to run it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large(data) SELECT generate_series(1,30000);
-CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large_trunc(data) SELECT generate_series(1,30000);
-CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small(data) SELECT generate_series(1,15000);
-CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small_trunc(data) SELECT generate_series(1,15000);
-
-CREATE TABLE autovacuum_disabled(id serial primary key, data text) WITH (autovacuum_enabled=false);
-INSERT INTO autovacuum_disabled(data) SELECT generate_series(1,1000);
]);
# Bump the query timeout to avoid false negatives on slow test systems.
@@ -63,7 +64,6 @@ $background_psql->query_safe(
DELETE FROM large_trunc WHERE id > 10000;
DELETE FROM small WHERE id % 2 = 0;
DELETE FROM small_trunc WHERE id > 1000;
- DELETE FROM autovacuum_disabled WHERE id % 2 = 0;
]);
# Consume 2 billion XIDs, to get us very close to wraparound
@@ -107,20 +107,18 @@ $ret = $node->safe_psql(
'postgres', qq[
SELECT relname, age(relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
FROM pg_class
-WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc', 'autovacuum_disabled')
+WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc')
ORDER BY 1
]);
-is( $ret, "autovacuum_disabled|f
-large|f
+is( $ret, "large|f
large_trunc|f
small|f
small_trunc|f", "all tables are vacuumed");
# Check if vacuum failsafe was triggered for each table.
my $log_contents = slurp_file($node->logfile, $log_offset);
-foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc',
- 'autovacuum_disabled')
+foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc')
{
like(
$log_contents,
diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index c02c287167..3ac080f18b 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -27,17 +27,18 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only to prevent wraparound
+autovacuum = off
autovacuum_naptime = 1s
log_autovacuum_min_duration = 0
]);
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run it only
+# to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('start');
]);
diff --git a/src/test/modules/xid_wraparound/t/003_wraparounds.pl b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
index 88063b4b52..bf9ec038a0 100644
--- a/src/test/modules/xid_wraparound/t/003_wraparounds.pl
+++ b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
@@ -21,7 +21,7 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
+autovacuum = off
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -30,10 +30,11 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run
+# it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('beginning');
]);
On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada<sawada.mshk@gmail.com> wrote:
On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan<andrew@dunslane.net> wrote:
On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
See<https://bitbucket.org/adunstan/rotfang-fdw/downloads/xid-wraparound-result.tar.bz2>
The failure logs are from a run where both tests 1 and 2 failed.
Thank you for sharing the logs.
I think that the problem seems to match what Alexander Lakhin
mentioned[1]. Probably we can fix such a race condition somehow but
I'm not sure it's worth it as setting autovacuum = off and
autovacuum_max_workers = 1 (or a low number) is an extremely rare
case. I think it would be better to stabilize these tests. One idea is
to turn the autovacuum GUC parameter on while setting
autovacuum_enabled = off for each table. That way, we can ensure that
autovacuum workers are launched. And I think it seems to align real
use cases.

Regards,
[1] /messages/by-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6@gmail.com
OK, do you want to propose a patch?
Yes, I'll prepare and share it soon.
I've attached the patch. Could you please test if the patch fixes the
instability you observed?

Since we turn off autovacuum on all three tests and we wait for
autovacuum to complete processing databases, these tests potentially
have a similar (but lower) risk. So I modified these tests to turn it
on so we can ensure the autovacuum runs periodically.
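The per-table approach described above can be sketched in SQL. The `autovacuum_enabled` storage parameter is the real reloption the patch uses; the table name here is illustrative:

```sql
-- Leave the global autovacuum GUC at its default (on) so workers are
-- launched periodically.  Opt the test table out of ordinary autovacuum
-- with the per-table reloption; anti-wraparound (freeze) vacuums are
-- still forced regardless of this setting.
CREATE TABLE wraparound_example(t text) WITH (autovacuum_enabled = off);

-- The same reloption can be toggled on an existing table:
ALTER TABLE wraparound_example SET (autovacuum_enabled = on);
```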
I assume you actually meant to remove the "autovacuum = off" in
003_wraparound.pl. With that change in your patch I retried my test, but
on iteration 100 out of 100 it failed on test 002_limits.pl.
You can see the logs at
<https://f001.backblazeb2.com/file/net-dunslane-public/002_limits-failure-log.tar.bz2>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:
I assume you actually meant to remove the "autovacuum = off" in 003_wraparound.pl. With that change in your patch I retried my test, but on iteration 100 out of 100 it failed on test 002_limits.pl.
I think we need to remove the "autovacuum = off" also in 002_limits.pl
as it waits for autovacuum to process both template0 and template1
databases. Just to be clear, the failure happened even without
"autovacuum = off"?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
I think we need to remove the "autovacuum = off" also in 002_limits.pl
as it waits for autovacuum to process both template0 and template1
databases. Just to be clear, the failure happened even without
"autovacuum = off"?
The attached patch, a slight modification of yours, removes "autovacuum
= off" for all three tests; with that change, a set of 200 runs was clean
for me.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
xid_wraparound-test-fix.patch (text/x-patch)
diff --git a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
index 37550b67a4..2692b35f34 100644
--- a/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
+++ b/src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
@@ -18,7 +18,6 @@ my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -27,23 +26,25 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create tables for a few different test scenarios
+# Create tables for a few different test scenarios. We disable autovacuum
+# on these tables to run it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large(data) SELECT generate_series(1,30000);
-CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE large_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO large_trunc(data) SELECT generate_series(1,30000);
-CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small(data) SELECT generate_series(1,15000);
-CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10));
+CREATE TABLE small_trunc(id serial primary key, data text, filler text default repeat(random()::text, 10))
+ WITH (autovacuum_enabled = off);
INSERT INTO small_trunc(data) SELECT generate_series(1,15000);
-
-CREATE TABLE autovacuum_disabled(id serial primary key, data text) WITH (autovacuum_enabled=false);
-INSERT INTO autovacuum_disabled(data) SELECT generate_series(1,1000);
]);
# Bump the query timeout to avoid false negatives on slow test systems.
@@ -63,7 +64,6 @@ $background_psql->query_safe(
DELETE FROM large_trunc WHERE id > 10000;
DELETE FROM small WHERE id % 2 = 0;
DELETE FROM small_trunc WHERE id > 1000;
- DELETE FROM autovacuum_disabled WHERE id % 2 = 0;
]);
# Consume 2 billion XIDs, to get us very close to wraparound
@@ -107,20 +107,18 @@ $ret = $node->safe_psql(
'postgres', qq[
SELECT relname, age(relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
FROM pg_class
-WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc', 'autovacuum_disabled')
+WHERE relname IN ('large', 'large_trunc', 'small', 'small_trunc')
ORDER BY 1
]);
-is( $ret, "autovacuum_disabled|f
-large|f
+is( $ret, "large|f
large_trunc|f
small|f
small_trunc|f", "all tables are vacuumed");
# Check if vacuum failsafe was triggered for each table.
my $log_contents = slurp_file($node->logfile, $log_offset);
-foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc',
- 'autovacuum_disabled')
+foreach my $tablename ('large', 'large_trunc', 'small', 'small_trunc')
{
like(
$log_contents,
diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index c02c287167..aca3fa1514 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -27,17 +27,17 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only to prevent wraparound
autovacuum_naptime = 1s
log_autovacuum_min_duration = 0
]);
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run it only
+# to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('start');
]);
diff --git a/src/test/modules/xid_wraparound/t/003_wraparounds.pl b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
index 88063b4b52..3eaa46a94d 100644
--- a/src/test/modules/xid_wraparound/t/003_wraparounds.pl
+++ b/src/test/modules/xid_wraparound/t/003_wraparounds.pl
@@ -21,7 +21,6 @@ my $node = PostgreSQL::Test::Cluster->new('wraparound');
$node->init;
$node->append_conf(
'postgresql.conf', qq[
-autovacuum = off # run autovacuum only when to anti wraparound
autovacuum_naptime = 1s
# so it's easier to verify the order of operations
autovacuum_max_workers = 1
@@ -30,10 +29,11 @@ log_autovacuum_min_duration = 0
$node->start;
$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
-# Create a test table
+# Create a test table. We disable autovacuum on the table to run
+# it only to prevent wraparound.
$node->safe_psql(
'postgres', qq[
-CREATE TABLE wraparoundtest(t text);
+CREATE TABLE wraparoundtest(t text) WITH (autovacuum_enabled = off);
INSERT INTO wraparoundtest VALUES ('beginning');
]);
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
The attached patch, a slight modification of yours, removes "autovacuum
= off" for all three tests; with that change, a set of 200 runs was clean
for me.
Oh, I missed that I had left "autovacuum = off" in the 002 test for some
reason. Thank you for attaching the patch; it looks good to me.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-07-29 Mo 5:25 PM, Masahiko Sawada wrote:
Oh, I missed that I had left "autovacuum = off" in the 002 test.
Thank you for attaching the patch; it looks good to me.
Thanks, pushed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jul 30, 2024 at 3:29 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Thanks, pushed.
Thanks!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com