snapper and skink and fairywren (oh my!)

Started by Robert Haas · almost 4 years ago · 6 messages
#1Robert Haas
robertmhaas@gmail.com

I was thinking about whether it made sense to try to commit anything
and decided it would be a good idea to check how the buildfarm looks
first. It doesn't look great.

1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
latest run as of this writing shows this:

[19:09:50] t/026_overwrite_contrecord.pl ........ ok 43136 ms
# poll_query_until timed out executing this query:
# SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#
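
For reference, this kind of wait usually comes from wait_for_catchup() in
the TAP test framework, which polls a pg_stat_replication query much like
the one above. A minimal sketch of the pattern, using illustrative node
and backup names rather than the failing test's exact setup:

  # Minimal sketch of a primary/standby TAP test whose catch-up wait
  # issues a pg_stat_replication poll like the one quoted above.
  use strict;
  use warnings;
  use PostgreSQL::Test::Cluster;
  use Test::More;

  my $primary = PostgreSQL::Test::Cluster->new('primary');
  $primary->init(allows_streaming => 1);
  $primary->start;
  $primary->backup('my_backup');

  my $standby = PostgreSQL::Test::Cluster->new('standby_1');
  $standby->init_from_backup($primary, 'my_backup', has_streaming => 1);
  $standby->start;

  # Polls pg_stat_replication on the primary until the standby's
  # replay_lsn reaches the target LSN, or gives up with the
  # "poll_query_until timed out" diagnostic seen above.
  $primary->wait_for_catchup($standby, 'replay', $primary->lsn('insert'));

  ok(1, 'standby_1 caught up');
  done_testing();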

2. skink failed the last run in this MiscCheck phase. I had no idea
what this phase was, because the name isn't very descriptive. It seems
that it runs some stuff in contrib and some stuff in src/test/modules,
which seems a bit confusing. Anyway, the failure here is:

test oldest_xmin ... FAILED 5533 ms

To find the failure, the regression test suite suggests looking at
file contrib/test_decoding/output_iso/regression.diffs or
regression.out. But neither file is in the buildfarm results so far as
I can see. It does have the postmaster log but I can't tell what's
gone wrong from looking at that. In fact I'm not really sure it's the
correct log file, because oldest_xmin.spec uses a slot called
"isolation_slot" and test_decoding/log/postmaster.log refers only to
"regression_slot", so it seems like this postmaster.log file might
cover only the pg_regress tests and not the results from the isolation
tester.

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:

[16:30:02] t/002_standby.pl .... ok 13354 ms ( 0.06 usr 0.00 sys +
1.11 cusr 3.20 csys = 4.37 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
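
For reference, this is the standard standby-side catch-up idiom in the TAP
tests; a minimal sketch, assuming $primary and $standby are
PostgreSQL::Test::Cluster nodes set up as usual (names are illustrative):

  # Take the primary's current LSN and poll the standby until replay has
  # passed it; poll_query_until() emits the timeout diagnostic above if
  # the standby never catches up.
  my $primary_lsn =
    $primary->safe_psql('postgres', 'SELECT pg_current_wal_lsn()');
  my $caughtup_query =
    "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
  $standby->poll_query_until('postgres', $caughtup_query)
    or die "standby never caught up to $primary_lsn";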

I don't know what is causing any of these failures, and I don't know
if there's already some discussion elsewhere that I've missed, but
maybe this email will be helpful to someone. I also noticed that 2 out
of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
pg_hba_file_rules for authentication method cert" as the only new
commit since the last run, and it's hardly believable that that commit
would have broken this. Nor do I see any other recent changes that
look like likely culprits. Apologies in advance if any of this is my
fault.

--
Robert Haas
EDB: http://www.enterprisedb.com

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#1)
Re: snapper and skink and fairywren (oh my!)

On 1/26/22 16:17, Robert Haas wrote:

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:

[16:30:02] t/002_standby.pl .... ok 13354 ms ( 0.06 usr 0.00 sys +
1.11 cusr 3.20 csys = 4.37 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f

I don't know what is causing any of these failures, and I don't know
if there's already some discussion elsewhere that I've missed, but
maybe this email will be helpful to someone. I also noticed that 2 out
of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
pg_hba_file_rules for authentication method cert" as the only new
commit since the last run, and it's hardly believable that that commit
would have broken this. Nor do I see any other recent changes that
look like likely culprits. Apologies in advance if any of this is my
fault.

Intermittent failures tend to falsely implicate the latest set of
commits. These failures started happening regularly about 49 days ago.
So we need to look at what exactly happened around then. See
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>

It's very unlikely any of this is your fault. In any case, intermittent
failures are very hard to nail down.

I should add that both drongo and fairywren run on a single
AWS/EC2/Windows2019 instance. I haven't updated the OS in a while, so
I'm doing that to see if matters improve. FWIW, I just reproduced the
error on a similar instance where I'm testing some stuff for Andres'
project to make TAP tests more portable.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrew Dunstan (#2)
Re: snapper and skink and fairywren (oh my!)

At Wed, 26 Jan 2022 18:45:53 -0500, Andrew Dunstan <andrew@dunslane.net> wrote in

It's very unlikely any of this is your fault. In any case, intermittent
failures are very hard to nail down.

The primary started at 2022-01-26 16:30:06.613, then accepted a connection
from the standby at 2022-01-26 16:30:09.911.

P: 2022-01-26 16:30:06.613 UTC [74668:1] LOG: starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 64-bit
S: 2022-01-26 16:30:09.637 UTC [72728:1] LOG: starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 64-bit
P: 2022-01-26 16:30:09.911 UTC [74096:3] [unknown] LOG: replication connection authorized: user=pgrunner application_name=standby
S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

After this the primary restarts.

P: 2022-01-26 16:30:11.832 UTC [74668:7] LOG: database system is shut down
P: 2022-01-26 16:30:12.010 UTC [72140:1] LOG: starting PostgreSQL 15devel on x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 64-bit

But the standby does not notice the disconnection and keeps sitting on the
stale replication connection while the test continues its poll_query_until().
The query should finally have succeeded after a reconnection. The following
log lines are shown with nothing omitted between them.

S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
S: 2022-01-26 16:30:12.825 UTC [72760:1] [unknown] LOG: connection received: host=127.0.0.1 port=60769
S: 2022-01-26 16:30:12.830 UTC [72760:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_003_standby_2_standby_data/pgdata/pg_hba.conf:2)
S: 2022-01-26 16:30:12.830 UTC [72760:3] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=003_standby_2.pl
S: 2022-01-26 16:30:12.838 UTC [72760:4] 003_standby_2.pl LOG: statement: SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()

Since the standby doesn't receive WAL from the restarted server, the
poll_query_until() never returns.

I'm not sure why the standby's walreceiver kept running for such a long time without noticing that the primary had gone down.
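
As a hedged aside (not something the existing test does), one way to get
more information when this happens would be to dump the standby's
pg_stat_wal_receiver view on timeout, reusing the $standby and
$caughtup_query names from the sketch earlier in the thread:

  # Hypothetical diagnostic only: show whether the standby's WAL receiver
  # still claims to be streaming from the old, now-restarted primary at
  # the moment the catch-up wait gives up.
  unless ($standby->poll_query_until('postgres', $caughtup_query))
  {
      my $wr = $standby->safe_psql('postgres',
          'SELECT status, sender_host, sender_port, latest_end_lsn'
            . ' FROM pg_stat_wal_receiver');
      diag "standby WAL receiver state at timeout: $wr";
  }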

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#3)
Re: snapper and skink and fairywren (oh my!)

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

I'm not sure why the standby's walreceiver kept running for such a long time without noticing that the primary had gone down.

Isn't this precisely the problem that made us revert the "graceful
disconnection" patch [1]? Munro seems to have a theory about why that
broke things, but in any case it's not Haas' fault ;-)

regards, tom lane

[1]: /messages/by-id/CA+hUKG+OeoETZQ=Qw5Ub5h3tmwQhBmDA=nuNO3KG=zWfUypFAw@mail.gmail.com

#5Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#1)
Re: snapper and skink and fairywren (oh my!)

On Wed, Jan 26, 2022 at 04:17:19PM -0500, Robert Haas wrote:

1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
latest run as of this writing shows this:

[19:09:50] t/026_overwrite_contrecord.pl ........ ok 43136 ms
# poll_query_until timed out executing this query:
# SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#

I expect commit ce6d793 will end the failures on snapper, kittiwake, and
tadarida. I apologize for letting the damage last for months.

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#2)
Re: snapper and skink and fairywren (oh my!)

On 1/26/22 18:45, Andrew Dunstan wrote:

On 1/26/22 16:17, Robert Haas wrote:

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:

Intermittent failures tend to falsely implicate the latest set of
commits. These failures started happening regularly about 49 days ago.
So we need to look at what exactly happened around then. See
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&stage=module-commit_tsCheck&filter=Submit>

Specifically, the issue is discussed here:
</messages/by-id/CA+hUKG+OeoETZQ=Qw5Ub5h3tmwQhBmDA=nuNO3KG=zWfUypFAw@mail.gmail.com>
Tom has already reverted the offending changes on the back branches.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com