pgsql: Test replay of regression tests, attempt II.

Started by Thomas Munroover 4 years ago18 messagescomitters

thomas.munro@gmail.com

over 4 years ago

Test replay of regression tests, attempt II.

See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319. The
only change this time is the order of the arguments passed to
pg_regress. The previously version broke in the build farm environment
due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
which had to do something similar).

Discussion: /messages/by-id/CA+hUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk+Heskxg@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/f47ed79cc8a0cfa154dc7f01faaf59822552363f

Modified Files
--------------
doc/src/sgml/regress.sgml | 11 +++++
src/test/recovery/Makefile | 6 ++-
src/test/recovery/t/027_stream_regress.pl | 79 +++++++++++++++++++++++++++++++
src/tools/msvc/vcregress.pl | 2 +
4 files changed, 97 insertions(+), 1 deletion(-)

Michael Paquier

michael@paquier.xyz

over 4 years ago

In reply to: Thomas Munro (#1)

Re: pgsql: Test replay of regression tests, attempt II.

On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:

Test replay of regression tests, attempt II.

See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319. The
only change this time is the order of the arguments passed to
pg_regress. The previously version broke in the build farm environment
due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
which had to do something similar).

Discussion: /messages/by-id/CA+hUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk+Heskxg@mail.gmail.com

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

Ad the pg_regress command has failed one of the tests:
reloptions ... FAILED 1377 ms

No idea what's going on here, but I guess that we'd better show up the
contents of regression.diffs in the TAP logs if it exists.
--
Michael

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Michael Paquier (#2)

Re: pgsql: Test replay of regression tests, attempt II.

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:

Test replay of regression tests, attempt II.

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD
Ad the pg_regress command has failed one of the tests:
reloptions ... FAILED 1377 ms
No idea what's going on here, but I guess that we'd better show up the
contents of regression.diffs in the TAP logs if it exists.

I'm quite appalled that this patch has apparently added an entire new run
of the core regression tests to the standard check-world/buildfarm cycle.
Was any thought taken for the developer-time and buildfarm-load impact
that will have? Neither of those resources is free.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Tom Lane (#3)

Re: pgsql: Test replay of regression tests, attempt II.

I wrote:

I'm quite appalled that this patch has apparently added an entire new run
of the core regression tests to the standard check-world/buildfarm cycle.

Not only that, but it leaves junk that is neither cleaned by "make clean"
nor .gitignored:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
(use "git add <file>..." to include in what will be committed)
primary.dump
results/
standby.dump

$ make clean
rm -rf tmp_check

These files should have been put into the tmp_check directory...
that's what it's for.

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Tom Lane (#4)

Re: pgsql: Test replay of regression tests, attempt II.

On Tue, Jan 18, 2022 at 6:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I'm quite appalled that this patch has apparently added an entire new run
of the core regression tests to the standard check-world/buildfarm cycle.

I wonder if there is a good way to share the resulting data directory
with the pg_upgrade test.

These files should have been put into the tmp_check directory...
that's what it's for.

Yeah, I mentioned this problem on the thread and was about to update
the clean target to remove them, but this sounds like a better idea.
Will fix.

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Michael Paquier (#2)

Re: pgsql: Test replay of regression tests, attempt II.

On Mon, Jan 17, 2022 at 8:02 PM Michael Paquier <michael@paquier.xyz> wrote:

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

Ad the pg_regress command has failed one of the tests:
reloptions ... FAILED 1377 ms

Also spurfowl and curculio.

No idea what's going on here, but I guess that we'd better show up the
contents of regression.diffs in the TAP logs if it exists.

Hmm. Yeah.

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Michael Paquier (#2)

Re: pgsql: Test replay of regression tests, attempt II.

Hi,

On 2022-01-17 16:02:34 +0900, Michael Paquier wrote:

On Mon, Jan 17, 2022 at 03:35:48AM +0000, Thomas Munro wrote:

Test replay of regression tests, attempt II.

See commit message for 123828a7fa563025d0ceee10cf1b2a253cd05319. The
only change this time is the order of the arguments passed to
pg_regress. The previously version broke in the build farm environment
due to the contents of EXTRA_REGRESS_OPTS (see also commit 8cade04c
which had to do something similar).

Discussion: /messages/by-id/CA+hUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk+Heskxg@mail.gmail.com

Butterflyfish has just failed in this new test:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=butterflyfish&br=HEAD

Ad the pg_regress command has failed one of the tests:
reloptions ... FAILED 1377 ms

No idea what's going on here, but I guess that we'd better show up the
contents of regression.diffs in the TAP logs if it exists.

This also failed on CI, which collected the regression.diffs:
https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/regress_diffs/src/test/recovery/regression.diffs
https://cirrus-ci.com/task/6460573058596864

diff -w -U3 c:/cirrus/src/test/recovery/../regress/expected/reloptions.out c:/cirrus/src/test/recovery/results/reloptions.out
--- c:/cirrus/src/test/recovery/../regress/expected/reloptions.out	2022-01-17 07:08:52.779337800 +0000
+++ c:/cirrus/src/test/recovery/results/reloptions.out	2022-01-17 07:24:46.785394900 +0000
@@ -133,7 +133,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
- t
+ f
 (1 row)

and the server log:
https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log

I'm not entirely clear on what the regression test is trying to test here.

It's plausible the standby was a bit behind at that point and prevented
VACUUM from truncating via hot_standby_feeback?

Greetings,

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Thomas Munro (#6)

Re: pgsql: Test replay of regression tests, attempt II.

Hi,

On 2022-01-18 07:47:48 +1300, Thomas Munro wrote:

No idea what's going on here, but I guess that we'd better show up the
contents of regression.diffs in the TAP logs if it exists.

Hmm. Yeah.

It already is in the regress_log_*, but I agree, it'd be good to add it to the
normal output in the failure case.
https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/tap/src/test/recovery/tmp_check/log/regress_log_027_stream_regress

I think system_or_bail() might be the wrong tool here. If I understand
Test::More's BAIL_OUT() correctly [1]https://perldoc.perl.org/Test::More#Test-control "Indicates to the harness that things are going so badly all testing should terminate. This includes the running of any additional test scripts.", it'll prevent other tests from running as
well. Which doesn't seem quite right for a failure like this?

Kinda seems like nearly all of our uses of system_or_bail() are bad?

Greetings,

Andres Freund

[1]: https://perldoc.perl.org/Test::More#Test-control "Indicates to the harness that things are going so badly all testing should terminate. This includes the running of any additional test scripts."
"Indicates to the harness that things are going so badly all testing
should terminate. This includes the running of any additional test scripts."

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Andres Freund (#7)

Re: pgsql: Test replay of regression tests, attempt II.

On Tue, Jan 18, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:

diff -w -U3 c:/cirrus/src/test/recovery/../regress/expected/reloptions.out c:/cirrus/src/test/recovery/results/reloptions.out
--- c:/cirrus/src/test/recovery/../regress/expected/reloptions.out      2022-01-17 07:08:52.779337800 +0000
+++ c:/cirrus/src/test/recovery/results/reloptions.out  2022-01-17 07:24:46.785394900 +0000
@@ -133,7 +133,7 @@
SELECT pg_relation_size('reloptions_test') = 0;
?column?
----------
- t
+ f
(1 row)
and the server log:
https://api.cirrus-ci.com/v1/artifact/task/6460573058596864/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log

I'm not entirely clear on what the regression test is trying to test here.

It's plausible the standby was a bit behind at that point and prevented
VACUUM from truncating via hot_standby_feeback?

I don't think hot_standby_feedback is enabled. There is some history
of flakiness here, recently adjusted by the following commit. I'm not
entirely sure what replication has to do with it... perhaps it's just
messing with CPU scheduling.

commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
Author: Michael Paquier <michael@paquier.xyz>
Date: Fri Apr 2 09:44:42 2021 +0900

Improve stability of test with vacuum_truncate in reloptions.sql

Hmm... looking at that commit and the referenced discussion, I'm not
sure why that commit added only FREEZE and not also
DISABLE_PAGE_SKIPPING, following the example of:

commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Jan 20 11:49:29 2021 -0500

Disable vacuum page skipping in selected test cases.

#10

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#9)

Re: pgsql: Test replay of regression tests, attempt II.

On Tue, Jan 18, 2022 at 11:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:

commit fe246d1c111d43fd60a1b0afff25ed09b7ae11eb
Author: Michael Paquier <michael@paquier.xyz>
Date: Fri Apr 2 09:44:42 2021 +0900

Improve stability of test with vacuum_truncate in reloptions.sql

Hmm... looking at that commit and the referenced discussion, I'm not
sure why that commit added only FREEZE and not also
DISABLE_PAGE_SKIPPING, following the example of:

commit c2dc1a79767a0f947e1145f82eb65dfe4360d25f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Jan 20 11:49:29 2021 -0500

Disable vacuum page skipping in selected test cases.

Alright, I've pushed a change like that. Let's see if that clears it
up. (From the curious coincidences department: the earlier
complaint[1]/messages/by-id/87tuotr2hh.fsf@ars-thinkpad also involved some kind of replication (Postgres Pro
multimaster, about which I know nothing).)

[1]: /messages/by-id/87tuotr2hh.fsf@ars-thinkpad

#11

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Thomas Munro (#10)

Re: pgsql: Test replay of regression tests, attempt II.

Thomas Munro <thomas.munro@gmail.com> writes:

Alright, I've pushed a change like that. Let's see if that clears it
up.

Nope:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57

That reloptions test has been there awhile, and we weren't seeing
issues with it before. What about the replication environment
would cause VACUUM to behave differently? Maybe it thinks there
are non-removable tuples because the walsender is holding back
global xmin?

regards, tom lane

#12

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Tom Lane (#11)

Re: pgsql: Test replay of regression tests, attempt II.

On 2022-01-18 15:15:44 -0500, Tom Lane wrote:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-01-18%2019%3A50%3A57

That reloptions test has been there awhile, and we weren't seeing
issues with it before. What about the replication environment
would cause VACUUM to behave differently? Maybe it thinks there
are non-removable tuples because the walsender is holding back
global xmin?

hot_standby_feedback isn't enabled, so it shouldn't...

I wonder if it's just a question of the different configuration used by
Cluster.pm, rather than the fact that a replica is actually connected.

Locally I see the following additions to the configuration for the primary in
027_stream_regres.pl

fsync = off
restart_after_crash = off
log_line_prefix = '%m [%p] %q%a '
log_statement = all
log_replication_commands = on
wal_retrieve_retry_interval = '500ms'
stats_temp_directory = 'pg_stat_tmp'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_log_hints = on
hot_standby = on
shared_buffers = 1MB
max_connections 25
max_wal_size = 128MB
port = 63698
unix_socket_directories = '/tmp/098ygO1bAF'
listen_addresses = ''
max_prepared_transactions = 10

That's an extremely small shared_buffers for running the regression tests, it'd not
be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
ends up skipping over a page because of page contention?

Also, it's odd that there's "max_connections 25" without an equal sign. I'd
kind of expected that to cause an error....

Greetings,

Andres Freund

#13

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Andres Freund (#12)

Re: pgsql: Test replay of regression tests, attempt II.

Andres Freund <andres@anarazel.de> writes:

That's an extremely small shared_buffers for running the regression tests, it'd not
be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
ends up skipping over a page because of page contention?

Hmm, good thought. I tried running the test with even smaller
shared_buffers, but could not make the reloptions test fall over for
me. But this theory implies a strong timing dependency, so it might
still only happen on particular machines. (If anyone else tries it:
below about 400kB, other tests start failing with "no free unpinned
buffers" and the like.)

Also, it's odd that there's "max_connections 25" without an equal sign. I'd
kind of expected that to cause an error....

I see that guc.c intentionally allows the equal sign to be optional.
Too lazy to check if that's documented. It is weird that this one
test appears to be intentionally testing that, though, because it's
sure got nada to do with the expressed purpose of the test.

regards, tom lane

#14

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Tom Lane (#13)

Re: pgsql: Test replay of regression tests, attempt II.

On Wed, Jan 19, 2022 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Also, it's odd that there's "max_connections 25" without an equal sign. I'd
kind of expected that to cause an error....

I see that guc.c intentionally allows the equal sign to be optional.
Too lazy to check if that's documented. It is weird that this one
test appears to be intentionally testing that, though, because it's
sure got nada to do with the expressed purpose of the test.

That's coming from here:

$node_primary->adjust_conf('postgresql.conf', 'max_connections', '25', 1);

That's passed to:

=item $node->adjust_conf(filename, setting, value, skip_equals)

That 1 was unintentional (and I don't see anything else using this
facility), but it doesn't seem to hurt anything since it's parsed just
fine. I'll take it out in passing in a future commit once we figure
out what's up with reloptions...

#15

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Tom Lane (#13)

Re: pgsql: Test replay of regression tests, attempt II.

Hi,

On 2022-01-18 17:19:06 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

That's an extremely small shared_buffers for running the regression tests, it'd not
be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
ends up skipping over a page because of page contention?

Hmm, good thought. I tried running the test with even smaller
shared_buffers, but could not make the reloptions test fall over for
me. But this theory implies a strong timing dependency, so it might
still only happen on particular machines. (If anyone else tries it:
below about 400kB, other tests start failing with "no free unpinned
buffers" and the like.)

I ran the test in a loop for 200+ times now, without reproducing the
problem. Rorqual runs on a shared machine though, so it's quite possible that
IO will be slower, and thus triggering the issue.

I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
that'd show information about the number of pages with tuples etc. But I don't
currently see a way of that causing the regression tests to fail.

Even if I set client_min_messages=error, the messages still get sent to the
client, because elevel == INFO is special cased in
should_output_to_client(). And I don't see a way of redirecting the output of
common.c:NoticeProcessor() in psql either.

Greetings,

Andres Freund

#16

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Andres Freund (#15)

Re: pgsql: Test replay of regression tests, attempt II.

On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-01-18 17:19:06 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

That's an extremely small shared_buffers for running the regression tests, it'd not
be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
ends up skipping over a page because of page contention?

Hmm, good thought. I tried running the test with even smaller
shared_buffers, but could not make the reloptions test fall over for
me. But this theory implies a strong timing dependency, so it might
still only happen on particular machines. (If anyone else tries it:
below about 400kB, other tests start failing with "no free unpinned
buffers" and the like.)

I ran the test in a loop for 200+ times now, without reproducing the
problem. Rorqual runs on a shared machine though, so it's quite possible that
IO will be slower, and thus triggering the issue.

I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
that'd show information about the number of pages with tuples etc. But I don't
currently see a way of that causing the regression tests to fail.

Even if I set client_min_messages=error, the messages still get sent to the
client, because elevel == INFO is special cased in
should_output_to_client(). And I don't see a way of redirecting the output of
common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
verbose = (params->options & VACOPT_VERBOSE) != 0;
instrument = (verbose || (IsAutoVacuumWorkerProcess() &&

params->log_min_duration >= 0));
+ instrument = true;
if (instrument)
{
pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
there, with the smoking gun:

pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:

LockBuffer(buf, BUFFER_LOCK_SHARE);
if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
{
UnlockReleaseBuffer(buf);
vacrel->scanned_pages++;
vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
continue;
}

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.

[1]: https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log

#17

Andres Freund

andres@anarazel.de

over 4 years ago

In reply to: Thomas Munro (#16)

Re: pgsql: Test replay of regression tests, attempt II.

Hi,

On 2022-01-20 17:23:30 +1300, Thomas Munro wrote:

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:

Phew, finally.

Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.

The most likely source for such a pin is bgwriter, I think?

It's not actually *immediately* obviously why the page doesn't require
wraparound, as VACUUM FREEZE should use a very aggressive one. The reason is
presumably that there are concurrent backends holding back the horizon. Note
that the oldest xmin in the log below (15356 for both vacuums) is older than
the xids from the intentionally failed insert (15465).

I wonder if the easiest way to make this test reliable would be to make the
table a temporary one? That now uses very aggressive horizons, there's no
bgwriter that could pin the page, etc.

Regards,

Andres

2022-01-20 03:58:31.966 UTC [32584][client backend] [pg_regress/reloptions][4/5416:0] LOG: statement: INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] ERROR: null value in column "i" of relation "reloptions_test" violates not-null constraint
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] DETAIL: Failing row contains (null, null).
2022-01-20 03:58:31.967 UTC [32584][client backend] [pg_regress/reloptions][4/5416:15465] STATEMENT: INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);

2022-01-20 03:58:31.970 UTC [32584][client backend] [pg_regress/reloptions][4/5417:0] LOG: statement: VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] LOG: automatic aggressive vacuum of table "regression.public.reloptions_test": index scans: 0
pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 372.024 MB/s, avg write rate: 0.000 MB/s
buffer usage: 3 hits, 6 misses, 0 dirtied
WAL usage: 1 records, 0 full page images, 237 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5418:0] STATEMENT: VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] LOG: automatic aggressive vacuum of table "regression.pg_toast.pg_toast_35786": index scans: 0
pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 15356
avg read rate: 790.028 MB/s, avg write rate: 0.000 MB/s
buffer usage: 13 hits, 9 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-20 03:58:31.973 UTC [32584][client backend] [pg_regress/reloptions][4/5419:0] STATEMENT: VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;

#18

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Andres Freund (#17)

Re: pgsql: Test replay of regression tests, attempt II.

On Thu, Jan 20, 2022 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if the easiest way to make this test reliable would be to make the
table a temporary one? That now uses very aggressive horizons, there's no
bgwriter that could pin the page, etc.

Good idea, thanks. I pushed that minimal change.

Skipping over some other unrelated recoveryCheck failures showing in
the BF today[1]/messages/by-id/CA+hUKGKV6fOHvfiPt8=dOKzvswjAyLoFoJF1iQXMNpi7+hD1JQ@mail.gmail.com, next up we have another kind of failure on these
Linux sparc animals:

2022-01-19 16:39:48.768 CET [9703:4] DETAIL: Last completed
transaction was at log time 2022-01-19 16:39:48.669624+01.
2022-01-19 16:39:54.629 CET [9703:5] LOG: restartpoint starting: wal
2022-01-19 16:39:55.180 CET [9705:5] LOG: incorrect resource manager
data checksum in record at 0/AD445A8

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-19%2015%3A44%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2022-01-19%2015%3A14%3A06

That would have kept me busy for a long time, if Noah hadn't recently
diagnosed an ext4-on-sparc bug where you sometimes read bogus zeroes
from a file that is being concurrently written[2]/messages/by-id/20220116210241.GC756210@rfd.leadboat.com, which of course
breaks streaming replication and thus this test. To turn that green, I
guess we'll need to switch to another filesystem, or wait for a
patched kernel.

[1]: /messages/by-id/CA+hUKGKV6fOHvfiPt8=dOKzvswjAyLoFoJF1iQXMNpi7+hD1JQ@mail.gmail.com
[2]: /messages/by-id/20220116210241.GC756210@rfd.leadboat.com