Why is src/test/modules/committs/t/002_standby.pl flaky?

Started by Thomas Munro over 4 years ago · 55 messages · hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

There's a wait for replay that is open coded (instead of using the
wait_for_catchup() routine), and sometimes the second of two such
waits at line 51 (in master) times out after 3 minutes with "standby
never caught up". It's happening on three particular Windows boxes,
but once also happened on the AIX box "tern".

 branch        | animal    | count
---------------+-----------+-------
 HEAD          | drongo    |     1
 HEAD          | fairywren |     8
 REL_10_STABLE | drongo    |     3
 REL_10_STABLE | fairywren |    10
 REL_10_STABLE | jacana    |     3
 REL_11_STABLE | drongo    |     1
 REL_11_STABLE | fairywren |     4
 REL_11_STABLE | jacana    |     3
 REL_12_STABLE | drongo    |     2
 REL_12_STABLE | fairywren |     5
 REL_12_STABLE | jacana    |     1
 REL_12_STABLE | tern      |     1
 REL_13_STABLE | fairywren |     3
 REL_14_STABLE | drongo    |     2
 REL_14_STABLE | fairywren |     6

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-30%2014:42:30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-30%2013:13:22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-30%2006:03:07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-22%2011:37:37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-22%2010:46:07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-22%2009:03:06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-17%2004:59:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-17%2003:59:51
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-16%2004:37:58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-15%2009:57:14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-15%2002:38:43
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-14%2020:42:15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-14%2012:08:41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-14%2000:35:32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-13%2023:40:11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-13%2022:47:25
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-09%2006:59:10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-09%2006:04:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-09%2001:36:09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-08%2019:20:35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-08%2018:04:28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-08%2014:12:32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-12-08%2011:15:58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-12-08%2004:04:22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-12-03%2017:31:49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-11-11%2015:58:55
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-10-02%2022:00:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-09-09%2005:16:43
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-08-24%2004:45:09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17%2010:57:49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-06-12%2016:05:32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-02-07%2012:59:43
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-03-24%2012:49:50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-02-01%2018:00:27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-02-01%2017:26:27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-01-30%2023:49:49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-22%2014:19:02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-12-13%2000:12:11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-12-09%2006:02:05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-12-06%2003:07:42
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-11-02%2014:41:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-25%2013:12:08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-24%2013:12:41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-23%2023:10:00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-23%2018:00:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-22%2015:05:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-18%2013:29:49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-16%2014:54:46
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-15%2014:21:11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-14%2013:15:07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-13%2014:19:41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-10-12%2016:32:06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2019-10-10%2013:12:09

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#1)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On 12/30/21 15:01, Thomas Munro wrote:

Hi,

There's a wait for replay that is open coded (instead of using the
wait_for_catchup() routine), and sometimes the second of two such
waits at line 51 (in master) times out after 3 minutes with "standby
never caught up". It's happening on three particular Windows boxes,
but once also happened on the AIX box "tern".

 branch        | animal    | count
---------------+-----------+-------
 HEAD          | drongo    |     1
 HEAD          | fairywren |     8
 REL_10_STABLE | drongo    |     3
 REL_10_STABLE | fairywren |    10
 REL_10_STABLE | jacana    |     3
 REL_11_STABLE | drongo    |     1
 REL_11_STABLE | fairywren |     4
 REL_11_STABLE | jacana    |     3
 REL_12_STABLE | drongo    |     2
 REL_12_STABLE | fairywren |     5
 REL_12_STABLE | jacana    |     1
 REL_12_STABLE | tern      |     1
 REL_13_STABLE | fairywren |     3
 REL_14_STABLE | drongo    |     2
 REL_14_STABLE | fairywren |     6

FYI, drongo and fairywren are run on the same AWS/EC2 Windows Server
2019 instance. Nothing else runs on it.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Andrew Dunstan <andrew@dunslane.net> writes:

On 12/30/21 15:01, Thomas Munro wrote:

There's a wait for replay that is open coded (instead of using the
wait_for_catchup() routine), and sometimes the second of two such
waits at line 51 (in master) times out after 3 minutes with "standby
never caught up". It's happening on three particular Windows boxes,
but once also happened on the AIX box "tern".

FYI, drongo and fairywren are run on the same AWS/EC2 Windows Server
2019 instance. Nothing else runs on it.

I spent a little time looking into this just now. There are similar
failures in both 002_standby.pl and 003_standby_2.pl, which is
unsurprising because there are essentially-identical test sequences
in both. What I've realized is that the issue is triggered by
this sequence:

$standby->start;
...
$primary->restart;
$primary->safe_psql('postgres', 'checkpoint');
my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
    qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

(the failing poll_query_until is at line 51 in 002_standby.pl, or
line 37 in 003_standby_2.pl). That is, we have forced a primary
restart since the standby first connected to the primary, and
now we have to wait for the standby to reconnect and catch up.

*These two tests seem to be the only TAP tests that do that*.
So I think there's not really anything specific to commit_ts testing
involved, it's just a dearth of primary restarts elsewhere.

Looking at the logs in the failing cases, there's no evidence
that the standby has even detected the primary's disconnection,
which explains why it hasn't attempted to reconnect. For
example, in the most recent HEAD failure,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-03%2018%3A04%3A41

the standby reports successful connection:

2022-01-03 18:58:04.920 UTC [179700:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

(which we can also see in the primary's log), but after that
there's no log traffic at all except the test script's vain
checks of pg_last_wal_replay_lsn(). In the same animal's
immediately preceding successful run,

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-01-03%2015%3A04%3A41&stg=module-commit_ts-check

we see:

2022-01-03 15:59:24.186 UTC [176664:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2022-01-03 15:59:25.003 UTC [176664:2] LOG: replication terminated by primary server
2022-01-03 15:59:25.003 UTC [176664:3] DETAIL: End of WAL reached on timeline 1 at 0/3030CB8.
2022-01-03 15:59:25.003 UTC [176664:4] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2022-01-03 15:59:25.005 UTC [177092:5] LOG: invalid record length at 0/3030CB8: wanted 24, got 0
...
2022-01-03 15:59:25.564 UTC [177580:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

So for some reason, on these machines detection of walsender-initiated
connection close is unreliable ... or maybe, the walsender didn't close
the connection, but is somehow still hanging around? Don't have much idea
where to dig beyond that, but maybe someone else will. I wonder in
particular if this could be related to our recent discussions about
whether to use shutdown(2) on Windows --- could we need to do the
equivalent of 6051857fc/ed52c3707 on walsender connections?
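
For reference, the server-side close path those commits touch is
essentially the following (a from-memory sketch for illustration, not
the actual commit text):

#ifdef WIN32
	/*
	 * Sketch: gracefully shut down the send direction before closing,
	 * so the peer sees an orderly FIN rather than a reset when the
	 * backend exits.
	 */
	if (MyProcPort != NULL && MyProcPort->sock != PGINVALID_SOCKET)
	{
		shutdown(MyProcPort->sock, SD_SEND);	/* flush and send FIN */
		closesocket(MyProcPort->sock);	/* then release the handle */
		MyProcPort->sock = PGINVALID_SOCKET;
	}
#endif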

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

I wrote:

So for some reason, on these machines detection of walsender-initiated
connection close is unreliable ... or maybe, the walsender didn't close
the connection, but is somehow still hanging around? Don't have much idea
where to dig beyond that, but maybe someone else will. I wonder in
particular if this could be related to our recent discussions about
whether to use shutdown(2) on Windows --- could we need to do the
equivalent of 6051857fc/ed52c3707 on walsender connections?

... wait a minute. After some more study of the buildfarm logs,
it was brought home to me that these failures started happening
just after 6051857fc went in:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&branch=&member=&stage=module-commit_tsCheck&filter=Submit

The oldest matching failure is jacana's on 2021-12-03.
(The above sweep finds an unrelated-looking failure on 2021-11-11,
but no others before 6051857fc went in on 2021-12-02. Also, it
looks likely that ed52c3707 on 2021-12-07 made the failure more
probable, because jacana's is the only matching failure before 12-07.)

So I'm now thinking it's highly likely that those commits are
causing it somehow, but how?

regards, tom lane

#5 Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#4)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Hello Tom,
09.01.2022 04:17, Tom Lane wrote:

So for some reason, on these machines detection of walsender-initiated
connection close is unreliable ... or maybe, the walsender didn't close
the connection, but is somehow still hanging around? Don't have much idea
where to dig beyond that, but maybe someone else will. I wonder in
particular if this could be related to our recent discussions about
whether to use shutdown(2) on Windows --- could we need to do the
equivalent of 6051857fc/ed52c3707 on walsender connections?

... wait a minute. After some more study of the buildfarm logs,
it was brought home to me that these failures started happening
just after 6051857fc went in:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=90&branch=&member=&stage=module-commit_tsCheck&filter=Submit

The oldest matching failure is jacana's on 2021-12-03.
(The above sweep finds an unrelated-looking failure on 2021-11-11,
but no others before 6051857fc went in on 2021-12-02. Also, it
looks likely that ed52c3707 on 2021-12-07 made the failure more
probable, because jacana's is the only matching failure before 12-07.)

So I'm now thinking it's highly likely that those commits are
causing it somehow, but how?

I've managed to reproduce this failure too.
Removing "shutdown(MyProcPort->sock, SD_SEND);" doesn't help here, so
the culprit is exactly "closesocket(MyProcPort->sock);".
I've added `system("netstat -ano");` before die() in 002_standby.pl and see:
# Postmaster PID for node "primary" is 944
  Proto  Local Address          Foreign Address        State           PID
...
  TCP    127.0.0.1:58545        127.0.0.1:61995        FIN_WAIT_2      944
...
  TCP    127.0.0.1:61995        127.0.0.1:58545        CLOSE_WAIT      1352

(Replacing SD_SEND with SD_BOTH doesn't change the behaviour.)

Looking at libpqwalreceiver.c:
        /* Now that we've consumed some input, try again */
        rawlen = PQgetCopyData(conn->streamConn, &conn->recvBuf, 1);
here we get -1 on the primary disconnection.
Then we get COMMAND_OK here:
        res = libpqrcv_PQgetResult(conn->streamConn);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
and finally just hang at:
            /* Verify that there are no more results. */
            res = libpqrcv_PQgetResult(conn->streamConn);
until the standby gets interrupted by the TAP test. (That call can also
return NULL and then the test completes successfully.)
Going down through the call chain, I see that at the end of it
WaitForMultipleObjects() hangs while waiting for the primary connection
socket event. So it looks like the socket, that is closed by the
primary, can get into a state unsuitable for WaitForMultipleObjects().
I tried to check the socket state with the WSAPoll() function and
discovered that it returns POLLHUP for the "problematic" socket.
The following draft addition in latch.c:
int
WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
                  long timeout, uint32 wait_event_info)
{
    int         ret = 0;
    int         rc;
    WaitEvent   event;

#ifdef WIN32
    if (wakeEvents & WL_SOCKET_MASK)
    {
        WSAPOLLFD   pollfd;

        pollfd.fd = sock;
        pollfd.events = POLLRDNORM | POLLWRNORM;
        pollfd.revents = 0;

        /* Probe the socket directly; a peer-closed socket reports POLLHUP. */
        rc = WSAPoll(&pollfd, 1, 0);
        if (rc == 1 && (pollfd.revents & POLLHUP))
        {
            elog(WARNING, "WaitLatchOrSocket: A stream-oriented connection was either disconnected or aborted.");
            return WL_SOCKET_MASK;
        }
    }
#endif

makes the test 002_standby.pl pass (100 of 100 iterations, while without
the fix I get failures roughly on every third). I'm not sure where to
place this check; maybe it's better to move it up to
libpqrcv_PQgetResult() to minimize its footprint, or to find a less
Windows-specific approach, but I'd prefer a client-side fix anyway, as
graceful closing of a socket by a server seems a legitimate action.

Best regards,
Alexander

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#5)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Alexander Lakhin <exclusion@gmail.com> writes:

09.01.2022 04:17, Tom Lane wrote:

... wait a minute. After some more study of the buildfarm logs,
it was brought home to me that these failures started happening
just after 6051857fc went in:

I've managed to reproduce this failure too.
Removing "shutdown(MyProcPort->sock, SD_SEND);" doesn't help here, so
the culprit is exactly "closesocket(MyProcPort->sock);".

Ugh. Did you try removing the closesocket and keeping shutdown?
I don't recall if we tried that combination before.

... I'm not sure where to
place this check; maybe it's better to move it up to
libpqrcv_PQgetResult() to minimize its footprint, or to find a less
Windows-specific approach, but I'd prefer a client-side fix anyway, as
graceful closing of a socket by a server seems a legitimate action.

What concerns me here is whether this implies that other clients
(libpq, jdbc, etc) are going to need changes as well. Maybe
libpq is okay, because we've not seen failures of the isolation
tests that use pg_cancel_backend(), but still it's worrisome.
I'm not entirely sure whether the isolationtester would notice
that a connection that should have died didn't.

regards, tom lane

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#5)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Mon, Jan 10, 2022 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Going down through the call chain, I see that at the end of it
WaitForMultipleObjects() hangs while waiting for the primary connection
socket event. So it looks like the socket, that is closed by the
primary, can get into a state unsuitable for WaitForMultipleObjects().

I wonder if FD_CLOSE is edge-triggered, and it's already told us once.
I think that's what these Python Twisted guys are saying:

https://stackoverflow.com/questions/7598936/how-can-a-disconnected-tcp-socket-be-reliably-detected-using-msgwaitformultipleo

I tried to check the socket state with the WSAPoll() function and
discovered that it returns POLLHUP for the "problematic" socket.

Good discovery. I guess if the above theory is right, there's a
memory somewhere that makes this level-triggered as expected by users
of poll().
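
To illustrate the idea, here is a minimal sketch of such a "memory" (an
assumption about the shape of a fix, not the patch attached to the next
message): remember FD_CLOSE once it's reported, and treat the socket as
permanently readable afterwards.

#ifdef WIN32
/* Assumes WSAEventSelect(sock, ev, FD_READ | FD_CLOSE) was already done. */
static bool fd_close_seen = false;	/* real code would keep this per socket */

static int
wait_socket_readable(SOCKET sock, WSAEVENT ev)
{
	WSANETWORKEVENTS ne;

	if (fd_close_seen)
		return 1;				/* report EOF forever after */

	WaitForSingleObject(ev, INFINITE);
	WSAEnumNetworkEvents(sock, ev, &ne);
	if (ne.lNetworkEvents & FD_CLOSE)
		fd_close_seen = true;	/* latch the edge into a level */
	return (ne.lNetworkEvents & (FD_READ | FD_CLOSE)) ? 1 : 0;
}
#endif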

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Mon, Jan 10, 2022 at 8:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Jan 10, 2022 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Going down through the call chain, I see that at the end of it
WaitForMultipleObjects() hangs while waiting for the primary connection
socket event. So it looks like the socket, that is closed by the
primary, can get into a state unsuitable for WaitForMultipleObjects().

I wonder if FD_CLOSE is edge-triggered, and it's already told us once.

Can you reproduce it with this patch?

Attachments:

0001-Make-Windows-FD_CLOSE-reporting-sticky.patch (text/x-patch, +17 -1)
#9 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#8)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

10.01.2022 05:00, Thomas Munro wrote:

On Mon, Jan 10, 2022 at 8:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Jan 10, 2022 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Going down through the call chain, I see that at the end of it
WaitForMultipleObjects() hangs while waiting for the primary connection
socket event. So it looks like the socket, that is closed by the
primary, can get into a state unsuitable for WaitForMultipleObjects().

I wonder if FD_CLOSE is edge-triggered, and it's already told us once.

Can you reproduce it with this patch?

Unfortunately, this fix (with the correction "(cur_event &
WL_SOCKET_MASK)" -> "(cur_event->events & WL_SOCKET_MASK)") doesn't work,
because we have two separate calls to libpqrcv_PQgetResult():

Then we get COMMAND_OK here:
        res = libpqrcv_PQgetResult(conn->streamConn);
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
and finally just hang at:
            /* Verify that there are no more results. */
            res = libpqrcv_PQgetResult(conn->streamConn);

The libpqrcv_PQgetResult function, in turn, invokes WaitLatchOrSocket(),
where the WaitEvents are defined locally, so the closed flag is set on
the first invocation but expected to be checked on the second.

I've managed to reproduce this failure too.
Removing "shutdown(MyProcPort->sock, SD_SEND);" doesn't help here, so
the culprit is exactly "closesocket(MyProcPort->sock);".

Ugh. Did you try removing the closesocket and keeping shutdown?
I don't recall if we tried that combination before.

Even with shutdown() only I still observe WaitForMultipleObjects()
hanging (and WSAPoll() returns POLLHUP for the socket).

As to your concern regarding other clients, I suspect that this issue is
caused by libpqwalreceiver's specific call pattern, and maybe other
clients just don't do that. I need some more time to analyze this.

Best regards,
Alexander

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#9)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Mon, Jan 10, 2022 at 8:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

The libpqrcv_PQgetResult function, in turn, invokes WaitLatchOrSocket(),
where the WaitEvents are defined locally, so the closed flag is set on
the first invocation but expected to be checked on the second.

D'oh, right. There's also a WaitLatchOrSocket call in walreceiver.c.
We'd need a long-lived WaitEventSet common across all of these sites,
which is hard here (because the socket might change under you, as
discussed in other threads that introduced long-lived WaitEventSets to
other places but not here).
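
Roughly, the shape would be something like this (a sketch only, using
the existing latch.c API; the socket-might-change problem mentioned
above is exactly what it doesn't solve):

/* Created once and reused, so per-socket state such as a seen FD_CLOSE
 * can survive across waits. */
static WaitEventSet *wes = NULL;

if (wes == NULL)
{
	wes = CreateWaitEventSet(CurrentMemoryContext, 2);
	AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(wes, WL_SOCKET_READABLE,
					  PQsocket(conn->streamConn), NULL, NULL);
}
(void) WaitEventSetWait(wes, timeout, &event, 1, wait_event_info);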

/me wonders if it's possible that graceful FD_CLOSE is reported only
once, but abortive/error FD_CLOSE is reported multiple times...

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Mon, Jan 10, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Jan 10, 2022 at 8:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

The libpqrcv_PQgetResult function, in turn, invokes WaitLatchOrSocket(),
where the WaitEvents are defined locally, so the closed flag is set on
the first invocation but expected to be checked on the second.

D'oh, right. There's also a WaitLatchOrSocket call in walreceiver.c.
We'd need a long-lived WaitEventSet common across all of these sites,
which is hard here (because the socket might change under you, as
discussed in other threads that introduced long-lived WaitEventSets to
other places but not here).

This is super quick-and-dirty code (and doesn't handle some errors or
socket changes correctly), but does it detect the closed socket?

Attachments:

v2-0001-Make-Windows-FD_CLOSE-reporting-sticky.patch (text/x-patch, +17 -1)
v2-0002-XXX-Quick-hack-to-use-persistent-WaitEventSet-in-.patch (text/x-patch, +75 -43)
#12 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#11)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

10.01.2022 12:40, Thomas Munro wrote:

This is super quick-and-dirty code (and doesn't handle some errors or
socket changes correctly), but does it detect the closed socket?

Yes, it fixes the behaviour and makes the 002_standby test pass (100 of
100 iterations). I've yet to find out whether the other
WaitLatchOrSocket users (e.g. postgres_fdw) can suffer from the
disconnected socket state, but this approach definitely works for
walreceiver.

Best regards,
Alexander

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#12)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Tue, Jan 11, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

10.01.2022 12:40, Thomas Munro wrote:

This is super quick-and-dirty code (and doesn't handle some errors or
socket changes correctly), but does it detect the closed socket?

Yes, it fixes the behaviour and makes the 002_standby test pass (100 of
100 iterations).

Thanks for testing. That result does seem to confirm the hypothesis
that FD_CLOSE is reported only once for the socket on graceful
shutdown (that is, it's edge-triggered and incidentally you won't get
FD_READ), so you need to keep track of it carefully. Another
observation is that your WSAPoll() test appears to be
returning POLLHUP where at least Linux, FreeBSD and Solaris would not:
a socket that is only half shut down (the primary shut down its end
gracefully, but walreceiver did not), so I suspect Windows' POLLHUP
might have POLLRDHUP semantics.

I've yet to find out whether the other
WaitLatchOrSocket users (e.g. postgres_fdw) can suffer from the
disconnected socket state, but this approach definitely works for
walreceiver.

I see where you're going: there might be safe call sequences and
unsafe call sequences, and maybe walreceiver is asking for trouble by
double-polling. I'm not sure about that; I got the impression
recently that it's possible to get FD_CLOSE while you still have
buffered data to read, so then the next recv() will return > 0 and
then we don't have any state left anywhere to remember that we saw
FD_CLOSE, even if you're careful to poll and read in the ideal
sequence. I could be wrong, and it would be nice if there is an easy
fix along those lines... The documentation around FD_CLOSE is
unclear.

I do plan to make a higher quality patch like the one I showed
(material from earlier unfinished work[1] that needs a bit more
infrastructure), but to me that's new feature/efficiency work, not
something we'd want to back-patch.

Hmm, one thing I'm still unclear on: did this problem really start
with 6051857fc/ed52c3707? My initial email in this thread lists
similar failures going back further, doesn't it? (And what's tern
doing mixed up in this mess?)

[1]: /messages/by-id/CA+hUKGJPaygh-6WHEd0FnH89GrkTpVyN_ew9ckv3+nwjmLcSeg@mail.gmail.com

#14 Andrew Dunstan
andrew@dunslane.net
In reply to: Thomas Munro (#13)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On 1/10/22 15:52, Thomas Munro wrote:

Hmm, one thing I'm still unclear on: did this problem really start
with 6051857fc/ed52c3707? My initial email in this thread lists
similar failures going back further, doesn't it? (And what's tern
doing mixed up in this mess?)

Your list contains at least some false positives, e.g.
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-22%2014:19:02>,
which has a different script failing.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#13)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Thomas Munro <thomas.munro@gmail.com> writes:

Hmm, one thing I'm still unclear on: did this problem really start
with 6051857fc/ed52c3707? My initial email in this thread lists
similar failures going back further, doesn't it? (And what's tern
doing mixed up in this mess?)

Well, those earlier ones may be committs failures, but a lot of
them contain different-looking symptoms, e.g. pg_ctl failures.

tern's failure at
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49
does look similar, but we can see in its log that the standby
*did* notice the primary disconnection and then reconnect:

2021-07-17 16:29:08.248 UTC [17498380:2] LOG: replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL: End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG: invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL: could not connect to the primary server: FATAL: the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

So I'm not sure what happened there, but it's not an instance
of this problem. One thing that looks a bit suspicious is
this in the primary's log:

2021-07-17 16:26:47.832 UTC [12386550:1] LOG: using stale statistics instead of current ones because stats collector is not responding

which makes me wonder if the timeout is down to out-of-date
pg_stats data. The loop in 002_standby.pl doesn't appear to
depend on the stats collector:

my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
    qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

but maybe I'm missing the connection.

Apropos of that, it's worth noting that wait_for_catchup *is*
dependent on up-to-date stats, and here's a recent run where
it sure looks like the timeout cause is AWOL stats collector:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34

I wonder if we should refactor wait_for_catchup to probe the
standby directly instead of relying on the upstream's view.

regards, tom lane

#16 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#13)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

10.01.2022 23:52, Thomas Munro wrote:

I've yet to find out whether the other
WaitLatchOrSocket users (e.g. postgres_fdw) can suffer from the
disconnected socket state, but this approach definitely works for
walreceiver.

I see where you're going: there might be safe call sequences and
unsafe call sequences, and maybe walreceiver is asking for trouble by
double-polling. I'm not sure about that; I got the impression
recently that it's possible to get FD_CLOSE while you still have
buffered data to read, so then the next recv() will return > 0 and
then we don't have any state left anywhere to remember that we saw
FD_CLOSE, even if you're careful to poll and read in the ideal
sequence. I could be wrong, and it would be nice if there is an easy
fix along those lines... The documentation around FD_CLOSE is
unclear.

I had no strong opinion regarding unsafe sequences, though initially I
suspected that exactly the second libpqrcv_PQgetResult call could cause
the issue. But after digging into WaitLatchOrSocket I'm inclined to put
the fix deeper, to satisfy all possible callers.
On the other hand, I've shared Tom's concerns regarding other clients,
which can get stuck on WaitForMultipleObjects() just as walreceiver
does, and hoped that only walreceiver suffers from a graceful server
socket closing.
So to get these doubts cleared, I've made a simple test for postgres_fdw
(please look at the attachment; you can put it into
contrib/postgres_fdw/t and run `vcregress taptest contrib\postgres_fdw`).
This test shows for me:
===
...
t/001_disconnection.pl .. # 12:13:39.481084 executing query...
# 12:13:43.245277 result:       0
# 0|0

# 12:13:43.246342 executing query...
# 12:13:46.525924 result:       0
# 0|0

# 12:13:46.527097 executing query...
# 12:13:47.745176 result:       3
#
# psql:<stdin>:1: WARNING:  no connection to the server
# psql:<stdin>:1: ERROR:  FATAL:  terminating connection due to administrator command
# server closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# CONTEXT:  remote SQL command: FETCH 100 FROM c1
# 12:13:47.794612 executing query...
# 12:13:51.073318 result:       0
# 0|0

# 12:13:51.074347 executing query...
===

With the simple logging added to connection.c:
                /* Sleep until there's something to do */
elog(LOG, "pgfdw_get_result before WaitLatchOrSocket");
                wc = WaitLatchOrSocket(MyLatch,
                                       WL_LATCH_SET | WL_SOCKET_READABLE |
                                       WL_EXIT_ON_PM_DEATH,
                                       PQsocket(conn),
                                       -1L, PG_WAIT_EXTENSION);
elog(LOG, "pgfdw_get_result after WaitLatchOrSocket");

I see in 001_disconnection_local.log:
...
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG:  pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG:  pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-11 15:14:36.976 MSK|||61dd74ac.840|DEBUG:  autovacuum: processing database "postgres"
2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|LOG:  pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|LOG:  pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-11 15:15:37.006 MSK|||61dd74e9.9e8|DEBUG:  autovacuum: processing database "postgres"
2022-01-11 15:16:37.116 MSK|||61dd7525.ad0|DEBUG:  autovacuum: processing database "postgres"
2022-01-11 15:17:37.225 MSK|||61dd7561.6a0|DEBUG:  autovacuum: processing database "postgres"
2022-01-11 15:18:36.916 MSK|||61dd7470.704|LOG:  checkpoint starting: time
...
2022-01-11 15:36:38.225 MSK|||61dd79d6.2a0|DEBUG:  autovacuum: processing database "postgres"
...

So here we get similar hanging on WaitLatchOrSocket().
Just to make sure that it's indeed the same issue, I've removed the
socket shutdown & close, and the test ran to the end (several times). Argh.

Best regards,
Alexander

Attachments:

001_disconnection.pl (application/x-perl)
#17 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#16)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Wed, Jan 12, 2022 at 4:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

So here we get similar hanging on WaitLatchOrSocket().
Just to make sure that it's indeed the same issue, I've removed the
socket shutdown & close, and the test ran to the end (several times). Argh.

Ouch. I think our options at this point are:
1. Revert 6051857fc (and put it back when we have a working
long-lived WES as I showed). This is not very satisfying, now that we
understand the bug, because even without that change I guess you must
be able to reach the hanging condition by using Windows postgres_fdw
to talk to a non-Windows server (ie a normal TCP stack with graceful
shutdown/linger on process exit).
2. Put your poll() check into the READABLE side. There's some
precedent for that sort of kludge on the WRITEABLE side (and a
rejection of the fragile idea that clients of latch.c should only
perform "safe" sequences):

/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
* failed with WSAEWOULDBLOCK. While our caller might well have made
* such a call, we cannot assume that here. Therefore, if waiting for
* write-ready, force the issue by doing a dummy send(). If the dummy
* send() succeeds, assume that the socket is in fact write-ready, and
* return immediately. Also, if it fails with something other than
* WSAEWOULDBLOCK, return a write-ready indication to let our caller
* deal with the error condition.
*/
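
For concreteness, option 2 might look something like this on the
read-ready side (a sketch, assuming WSAPoll() reports POLLHUP
persistently for a peer-closed socket, per Alexander's observation; not
an actual patch):

if (cur_event->events & WL_SOCKET_READABLE)
{
	WSAPOLLFD	pollfd;

	pollfd.fd = cur_event->fd;
	pollfd.events = POLLRDNORM;
	pollfd.revents = 0;

	/* A closed peer must be reported even if FD_CLOSE was consumed. */
	if (WSAPoll(&pollfd, 1, 0) == 1 &&
		(pollfd.revents & (POLLHUP | POLLERR)))
	{
		occurred_events->events = WL_SOCKET_READABLE;
		occurred_events->fd = cur_event->fd;
		return 1;
	}
}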

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#17)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

Thomas Munro <thomas.munro@gmail.com> writes:

Ouch. I think our options at this point are:
1. Revert 6051857fc (and put it back when we have a working
long-lived WES as I showed). This is not very satisfying, now that we
understand the bug, because even without that change I guess you must
be able to reach the hanging condition by using Windows postgres_fdw
to talk to a non-Windows server (ie a normal TCP stack with graceful
shutdown/linger on process exit).

It'd be worth checking, perhaps. One thing I've been wondering all
along is how much of this behavior is specific to the local-loopback
case where Windows can see both ends of the connection. You'd think
that they couldn't long get away with such blatant violations of the
TCP specs when talking to external servers, because the failures
would be visible to everyone with a web browser.

regards, tom lane

#19 Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#18)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

11.01.2022 23:16, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Ouch. I think our options at this point are:
1. Revert 6051857fc (and put it back when we have a working
long-lived WES as I showed). This is not very satisfying, now that we
understand the bug, because even without that change I guess you must
be able to reach the hanging condition by using Windows postgres_fdw
to talk to a non-Windows server (ie a normal TCP stack with graceful
shutdown/linger on process exit).

It'd be worth checking, perhaps. One thing I've been wondering all
along is how much of this behavior is specific to the local-loopback
case where Windows can see both ends of the connection. You'd think
that they couldn't long get away with such blatant violations of the
TCP specs when talking to external servers, because the failures
would be visible to everyone with a web browser.

I've split my test (both parts attached) and run it on two virtual
machines with clean builds from master (ac7c8075) on both (just the
debugging output added to connection.c). I'm providing probably
redundant info (see also the attached screenshot) just to make sure
that I didn't make a mistake.
An excerpt from 001_disconnection1_local.log:
...
2022-01-12 09:29:48.099 MSK|Administrator|postgres|61de755a.a54|LOG:  pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.099 MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG:  pgfdw_get_result: after WaitLatchOrSocket
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG:  pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|LOG:  pgfdw_get_result: after WaitLatchOrSocket
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|ERROR:  FATAL:  terminating connection due to administrator command
    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|CONTEXT:  remote SQL command: FETCH 100 FROM c1
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|WARNING:  no connection to the server
2022-01-12 09:29:48.100 MSK|Administrator|postgres|61de755a.a54|CONTEXT:  remote SQL command: ABORT TRANSACTION
2022-01-12 09:29:48.107 MSK|Administrator|postgres|61de755a.a54|LOG:  disconnection: session time: 0:00:01.577 user=Administrator database=postgres host=127.0.0.1 port=49752
2022-01-12 09:29:48.257 MSK|[unknown]|[unknown]|61de755c.a4c|LOG:  connection received: host=127.0.0.1 port=49754
2022-01-12 09:29:48.261 MSK|Administrator|postgres|61de755c.a4c|LOG:  connection authenticated: identity="WIN-FCPSOVMM1JC\Administrator" method=sspi (C:/src/postgrespro/contrib/postgres_fdw/tmp_check/t_001_disconnection1_local_data/pgdata/pg_hba.conf:2)
2022-01-12 09:29:48.261 MSK|Administrator|postgres|61de755c.a4c|LOG:  connection authorized: user=Administrator database=postgres application_name=001_disconnection1.pl
2022-01-12 09:29:48.263 MSK|Administrator|postgres|61de755c.a4c|LOG:  statement: SELECT * FROM large WHERE a = fx2(a)
2022-01-12 09:29:48.285 MSK|Administrator|postgres|61de755c.a4c|LOG:  pgfdw_get_result: before WaitLatchOrSocket
2022-01-12 09:29:48.285 MSK|Administrator|postgres|61de755c.a4c|STATEMENT:  SELECT * FROM large WHERE a = fx2(a)
...

By the look of things, you are right and this is a localhost-only issue.
I've rechecked that the test 001_disconnection.pl (local-loopback
version) hangs on both machines while 001_disconnection1.pl performs
successfully in both directions. I'm not sure whether the Windows client
and non-Windows server or reverse combinations are of interest in light
of the above.

Best regards,
Alexander

Attachments:

001_disconnection1.pl (application/x-perl)
001_restart.pl (application/x-perl)
Screenshot_20220112_093004.png (image/png)
#20 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#19)
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

On Wed, Jan 12, 2022 at 8:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

By the look of things, you are right and this is a localhost-only issue.

But can't that be explained by timing races? You change some stuff
around and it becomes less likely that a FIN arrives in a
super narrow window, which I'm guessing looks something like: recv ->
EWOULDBLOCK, [receive FIN], wait -> FD_CLOSE, wait [hangs]. Note that
it's not happening on several Windows BF animals, and the ones it is
happening on do it only every few weeks.

Here's a draft attempt at a fix. First I tried to use recv(fd, &c, 1,
MSG_PEEK) == 0 to detect EOF, which seemed to me to be a reasonable
enough candidate, but somehow it corrupts the stream (!?), so I used
Alexander's POLLHUP idea, except I pushed it down to a more principled
place IMHO. Then I suppressed it after the initial check because then
the logic from my earlier patch takes over, so stuff like FeBeWaitSet
doesn't suffer from extra calls, only these two paths that haven't
been converted to long-lived WESes yet. Does this pass the test?

I wonder if this POLLHUP technique is reliable enough (I know that
wouldn't work on other systems[1], which is why I was trying to make
MSG_PEEK work...).
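
For the record, the MSG_PEEK probe was shaped roughly like this (a
sketch of the abandoned approach, not the attached patch):

static bool
peer_sent_eof(SOCKET sock)
{
	char		c;
	int			rc = recv(sock, &c, 1, MSG_PEEK);	/* peek, don't consume */

	if (rc == 0)
		return true;			/* orderly shutdown by the peer */
	if (rc == SOCKET_ERROR && WSAGetLastError() != WSAEWOULDBLOCK)
		return true;			/* hard error: also treat as closed */
	return false;				/* data available, or nothing yet */
}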

What about environment variable PG_TEST_USE_UNIX_SOCKETS=1, does it
reproduce with that set, and does the patch fix it? I'm hoping that
explains some Windows CI failures from a nearby thread[2].

[1]: https://illumos.topicbox.com/groups/developer/T5576767e764aa26a-Maf8f3460c2866513b0ac51bf
[2]: /messages/by-id/CALT9ZEG=C=JSypzt2gz6NoNtx-ew2tYHbwiOfY_xNo+yBY_=jw@mail.gmail.com

Attachments:

0001-Fix-handling-of-socket-FD_CLOSE-events-on-Windows.patch (text/x-patch, +42 -1)
#21 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#20)
#22 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#20)
#23 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#20)
#24 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#22)
#25 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#24)
#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#24)
#27 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#26)
#28 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#26)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#26)
#30 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#28)
#31 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#25)
#32 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#30)
#33 Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#28)
#34 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#29)
#35 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#34)
#36 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#35)
#37 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#36)
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#37)
#39 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#38)
#40 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#39)
#41 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#40)
#42 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#37)
#43 Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#42)
#44 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#43)
#45 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#44)
#46 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#15)
#47 Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#46)
#48 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#47)
#49 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#48)
#50 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#49)
#51 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#50)
#52 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#51)
#53 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#51)
#54 Peter Smith
smithpb2250@gmail.com
In reply to: Alexander Lakhin (#52)
#55 vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#54)