Stream replication test fails on cfbot/Windows Server 2019

Started by Pavel Borisov · almost 4 years ago · 6 messages
#1 Pavel Borisov
pashkin.elfe@gmail.com

Hi, hackers!

I've noticed that on my branch with amcheck improvements, cfbot fails the
stream replication test on Windows Server 2019:
https://cirrus-ci.com/task/5353503093686272

I don't see how this is related to the changes in my patch. Furthermore, it
also fails on another CF branch:
https://cirrus-ci.com/task/4599128897355776

Is this a known cfbot problem? Do I need to do anything to my amcheck CF
branch mentioned above for it to become green on cfbot eventually?

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

#2 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Pavel Borisov (#1)
Re: Stream replication test fails on cfbot/Windows Server 2019

Hello.

Looks like logical replication is also affected:

[08:26:35.599] # poll_query_until timed out executing this query:
[08:26:35.599] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
srsubstate NOT IN ('r', 's');
[08:26:35.599] # expecting this output:
[08:26:35.599] # t
[08:26:35.599] # last actual query output:
[08:26:35.599] # f

https://cirrus-ci.com/task/6532060239101952
https://cirrus-ci.com/task/4755551606276096
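
For context, that output comes from poll_query_until(), which the
subscription TAP tests use to wait until every table of a subscription has
finished its initial sync. A minimal sketch of such a wait (the node and
database names are illustrative):

# Wait until all relations of the subscription are in the 'synchronized'
# (s) or 'ready' (r) state; poll_query_until() re-runs the query until its
# output matches the expected value ('t' by default), or gives up and
# returns false once the timeout expires.
my $synced_query =
  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";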

Best regards,
Michail.

#3 Julien Rouhaud
rjuju123@gmail.com
In reply to: Michail Nikolaev (#2)
Re: Stream replication test fails on cfbot/Windows Server 2019

On Wed, Jan 12, 2022 at 01:51:24PM +0300, Michail Nikolaev wrote:

Hello.

Looks like logical replication is also affected:

[08:26:35.599] # poll_query_until timed out executing this query:
[08:26:35.599] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
srsubstate NOT IN ('r', 's');
[08:26:35.599] # expecting this output:
[08:26:35.599] # t
[08:26:35.599] # last actual query output:
[08:26:35.599] # f

https://cirrus-ci.com/task/6532060239101952
https://cirrus-ci.com/task/4755551606276096

Indeed, and yet CI on the postgres tree doesn't exhibit any problem:
https://cirrus-ci.com/github/postgres/postgres

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#3)
Re: Stream replication test fails on cfbot/Windows Server 2019

On Thu, Jan 13, 2022 at 12:24 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Jan 12, 2022 at 01:51:24PM +0300, Michail Nikolaev wrote:

https://cirrus-ci.com/task/6532060239101952
https://cirrus-ci.com/task/4755551606276096

For the record, cfbot only started running the recovery tests on
Windows a couple of weeks ago (when the new improved .cirrus.yml
landed in the tree). I don't know if it's significant that Pavel's
patch is failing every time:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464

... while the one mentioned by Michail has lower-frequency random failures:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2979

Indeed, and yet CI on the postgres tree doesn't exhibit any problem:
https://cirrus-ci.com/github/postgres/postgres

(It's very cool that we have that turned on now!) That has run ~35
times (once per commit) and never failed. Across all branches, cfbot
triggers over 100 builds a day, so something like 1400 builds since we
started running the recovery tests on Windows; it's not a fair
comparison: there are plenty more chances for random/timing-based
failures to show up.

I don't know how many different kinds of flakiness we're suffering
from on Windows. Could these cases be explained by the FD_CLOSE
problem + timing differences?

#5 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#4)
Re: Stream replication test fails on cfbot/Windows Server 2019

Hi,

On 2022-01-13 12:40:00 +1300, Thomas Munro wrote:

On Thu, Jan 13, 2022 at 12:24 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Jan 12, 2022 at 01:51:24PM +0300, Michail Nikolaev wrote:

https://cirrus-ci.com/task/6532060239101952
https://cirrus-ci.com/task/4755551606276096

For the record, cfbot only started running the recovery tests on
Windows a couple of weeks ago (when the new improved .cirrus.yml
landed in the tree). I don't know if it's significant that Pavel's
patch is failing every time:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464

I think we need to make wait_for_catchup() log something more useful. It's
pretty hard to debug failures involving it right now.

E.g. the logs for https://cirrus-ci.com/task/5353503093686272 tell us that

Waiting for replication conn standby_1's replay_lsn to pass '0/3023208' on primary
# poll_query_until timed out executing this query:
# SELECT '0/3023208' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/001_stream_rep.pl line 50.

and there are lots of instances of that query in the primary's logs. But
nowhere do we see what the actual replication progress is. So we don't know
whether the problem is just that there's a record that doesn't need to be
flushed to disk, or something more fundamental.

I think that instead of croak("timed out waiting for catchup") we should make
wait_for_catchup() query all columns of pg_stat_replication on the primary and
report those. And perhaps also report the result of
SELECT * FROM pg_control_recovery(), pg_control_checkpoint();
on the standby?
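
A rough sketch of that timeout path (not a worked-out patch; $standby_name
and $query stand for the values wait_for_catchup() already builds from its
arguments, and the standby-side report would still need a handle on the
standby node):

# On timeout, dump the primary's view of replication before failing, so the
# log shows how far the walsender got, or that it isn't connected at all.
if (!$self->poll_query_until('postgres', $query))
{
	print "# pg_stat_replication on primary:\n";
	print "# "
	  . $self->safe_psql('postgres',
		"SELECT * FROM pg_stat_replication WHERE application_name = '$standby_name'")
	  . "\n";
	croak "timed out waiting for catchup";
}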

I don't know how many different kinds of flakiness we're suffering
from on Windows. Could these cases be explained by the FD_CLOSE
problem + timing differences?

Maybe. There's certainly something odd going on:

https://api.cirrus-ci.com/v1/artifact/task/5353503093686272/log/src/test/recovery/tmp_check/log/001_stream_rep_primary.log
https://api.cirrus-ci.com/v1/artifact/task/5353503093686272/log/src/test/recovery/tmp_check/log/001_stream_rep_standby_1.log

standby_1:
2022-01-12 08:21:36.543 GMT [8584][walreceiver] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

primary:
2022-01-12 08:21:38.855 GMT [6276][postmaster] LOG: received fast shutdown request
2022-01-12 08:21:39.146 GMT [6276][postmaster] LOG: database system is shut down
2022-01-12 08:21:50.235 GMT [5524][postmaster] LOG: starting PostgreSQL 15devel, compiled by Visual C++ build 1929, 64-bit
2022-01-12 08:21:50.417 GMT [5524][postmaster] LOG: database system is ready to accept connections

standby_1:
2022-01-12 08:21:53.469 GMT [5108][walsender] [standby_2][2/0:0] LOG: received replication command: START_REPLICATION 0/3000000 TIMELINE 1
2022-01-12 08:28:33.949 GMT [6484][postmaster] LOG: database system is shut down

afaict standby_1's walreceiver never realized that the primary stopped?

More evidence of that is that the above "last actual query output:" shows
nothing at all, rather than 'f', for
SELECT '0/3023208' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
i.e. pg_stat_replication on the restarted primary no longer has any row for
application_name = 'standby_1', so the standby never re-established its
replication connection.

I wonder if it's relevant that the .cirrus file uses "unix sockets" on
Windows as well, to avoid port conflicts (otherwise I saw frequent spurious
test failures due to port conflicts).
# Avoids port conflicts between concurrent tap test runs
PG_TEST_USE_UNIX_SOCKETS: 1

It's not particularly hard to imagine that either our "Windows unix socket"
support still has some bugs, or that Windows' implementation of unix sockets
is borked.
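
For those not familiar with the setting: roughly speaking, it switches the
TAP framework on Windows from TCP on 127.0.0.1 (where concurrent runs can
race for the same port) to a Unix-domain socket in a private directory. An
illustrative sketch of the effect, not the actual PostgresNode.pm code
($socket_dir and $free_port are made-up names):

if ($ENV{PG_TEST_USE_UNIX_SOCKETS})
{
	# Listen only on a Unix-domain socket in a per-run temporary directory,
	# so two concurrent test runs can never collide on a port number.
	$node->append_conf('postgresql.conf', "listen_addresses = ''");
	$node->append_conf('postgresql.conf',
		"unix_socket_directories = '$socket_dir'");
}
else
{
	# TCP on localhost with a port picked when the node is created, which
	# can occasionally conflict between concurrent runs.
	$node->append_conf('postgresql.conf', "listen_addresses = '127.0.0.1'");
	$node->append_conf('postgresql.conf', "port = $free_port");
}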

Greetings,

Andres Freund

#6 Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Thomas Munro (#4)
Re: Stream replication test fails on cfbot/Windows Server 2019

For the record, cfbot only started running the recovery tests on
Windows a couple of weeks ago (when the new improved .cirrus.yml
landed in the tree). I don't know if it's significant that Pavel's
patch is failing every time:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464

... while one mentioned by Michail has lower frequency random failures:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2979

Thomas, it's not exactly so. The patch
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464
failed for the described reason only once.
The previous redness was due to compiler warnings in earlier patch versions.

Furthermore, I've sent an updated contrib patch with very minor improvements
(completely unrelated to stream replication), and now the test passes.
I suppose there's some intermittent problem with the Windows cfbot
infrastructure which arises randomly and affects whichever patches happen to
be under test at the time. I have a feeling that all patches pushed around
yesterday morning were affected, and that now everything works fine.

It's a pity that I still have no idea about the source of the problems,
beyond having occasionally noticed cfbot behaving differently.
--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com