streaming replication master can fail to shut down

Started by Nick Cleaton about 10 years ago · 9 messages · bugs
#1 Nick Cleaton
nick@cleaton.net

I've found that the master server does not finish shutting down in
response to SIGINT to the postmaster when a streaming replica is
connected, if and only if all of the following hold:

* The master server is a specific new bit of kit, which has slightly
higher performance CPUs than the replica and is much more idle.
* The master is configured with archive_mode=on and a non-empty archive command.
* I do not have "strace -p ..." running on the master against the wal
sender servicing the slow replica.

While it's failing to shut down, about 10 megabytes per second of
network traffic is exchanged between master and replica, and the wal
receiver on the replica is at 100% CPU whereas the wal sender on the
master is around 70% to 80%. The wal receiver has about 4MiB of queued
tcp input, the wal sender has a few bytes. This appears to continue
indefinitely; I've tried up to 30 minutes.

If I "strace -p ... >/dev/null 2>&1" the wal sender during the blocked
shutdown, it seems to slow it down enough (wal sender's cpu usage
drops to about 50%) that after about 10 to 20 seconds the shutdown
completes cleanly and the replica logs "end of wal".

I've seen this only with one specific server acting as the master, so
I cannot absolutely rule out a hardware issue of some kind.

This was originally observed with 9.4.6, but I've now reproduced it
with 9.5.1 using the pgdg packages under Ubuntu:

postgresql-9.5 9.5.1-1.pgdg14.04+1
postgresql-client-9.5 9.5.1-1.pgdg14.04+1
postgresql-contrib-9.5 9.5.1-1.pgdg14.04+1

Under 9.5.1 I've observed this by simply installing the packages and
then following https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_7_Steps
with only the addition of archive_mode=on and
archive_command=/bin/true on the master.
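
For reference, those two settings amount to the following
postgresql.conf fragment (a minimal sketch matching the report;
/bin/true reports success without copying the segment anywhere, so it
is only suitable for testing):

```
archive_mode = on
archive_command = '/bin/true'
```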

CPUs on the master: dual Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz,
hyperthreading off.

CPUs on the replica: dual Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz,
hyperthreading off.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Nick Cleaton
nick@cleaton.net
In reply to: Nick Cleaton (#1)
Re: streaming replication master can fail to shut down

On 9 March 2016 at 16:17, Nick Cleaton <nick@cleaton.net> wrote:

> I've found that the master server does not finish shutting down in
> response to SIGINT to the postmaster when a streaming replica is
> connected, if and only if all of the following hold:
>
> * The master server is a specific new bit of kit, which has slightly
> higher performance CPUs than the replica and is much more idle.
> * The master is configured with archive_mode=on and a non-empty archive command.
> * I do not have "strace -p ..." running on the master against the wal
> sender servicing the slow replica.

To debug the replication conversation during a failing master shutdown
with my 9.5.1 test installation, I've now also set ssl=false in the
server config and then:

* start a tcpdump
* start the master
* start the replica
* sleep 60
* kill -INT the postmaster on the master

A libpcap dump of the first 100k packets exchanged is at
http://nick.cleaton.net/protodump-100k-nossl.xz (800k compressed)

Highlights from pgshark: at master shutdown time there's:

Packet: t=1457611872.617496, session=17578864249730
PGSQL: type=XLogData, B -> F
XLOG DATA len=2896

... and everything following that is a storm of:

Packet: t=1457611872.761098, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

Packet: t=1457611872.761221, session=17578864249730
PGSQL: type=StandbyStatusUpdate, F -> B
STANDBY STATUS UPDATE

Packet: t=1457611872.761363, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

There were a total of 25486 StandbyStatusUpdate messages and 64769
PrimaryKeepalive messages during the 0.6 seconds between the start of
master shutdown and tcpdump hitting the 100k packet limit.

Relevant top and netstat output about 10 minutes into the blocked shutdown:

           PID USER     PR NI   VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
master:   6167 postgres 20  0 251632 3020 1372 S  74.7  0.0 0:10.64 postgres: wal sender process postgres 10.122.82.98(49730) streaming 0/11000098
replica: 28211 postgres 20  0 263100 4580 1964 R 100.0  0.0 0:21.26 postgres: wal receiver process streaming 0/11000000

         Proto  Recv-Q  Send-Q Local Address      Foreign Address    State
master:  tcp         0 2553544 10.122.83.27:5432  10.122.82.98:49730 ESTABLISHED
replica: tcp   4284291     390 10.122.82.98:49730 10.122.83.27:5432  ESTABLISHED


#3 Nick Cleaton
nick@cleaton.net
In reply to: Nick Cleaton (#2)
Re: streaming replication master can fail to shut down

On 10 March 2016 at 13:13, Nick Cleaton <nick@cleaton.net> wrote:

> ... and everything following that is a storm of:
>
> Packet: t=1457611872.761098, session=17578864249730
> PGSQL: type=PrimaryKeepalive, B -> F
> PRIMARY KEEP-ALIVE
>
> Packet: t=1457611872.761221, session=17578864249730
> PGSQL: type=StandbyStatusUpdate, F -> B
> STANDBY STATUS UPDATE
>
> Packet: t=1457611872.761363, session=17578864249730
> PGSQL: type=PrimaryKeepalive, B -> F
> PRIMARY KEEP-ALIVE

This patch is enough to eliminate the problem on my hardware; in this
test the server sent only 29 keepalives during the shutdown:
http://nick.cleaton.net/protodump-100k-nossl-patched.xz (47k)

Attachments:

* no-keep-flood.patch (text/x-patch; charset=US-ASCII), +3/-1
#4 Andres Freund
andres@anarazel.de
In reply to: Nick Cleaton (#3)
Re: streaming replication master can fail to shut down

Hi,

On 2016-03-11 14:12:37 +0000, Nick Cleaton wrote:

> This patch is enough to eliminate the problem on my hardware
>
> diff -Nurd postgresql-9.5.1.orig/src/backend/replication/walsender.c postgresql-9.5.1/src/backend/replication/walsender.c
> --- postgresql-9.5.1.orig/src/backend/replication/walsender.c	2016-02-08 21:12:28.000000000 +0000
> +++ postgresql-9.5.1/src/backend/replication/walsender.c	2016-03-11 11:56:41.121361222 +0000
> @@ -2502,8 +2502,10 @@
>  		proc_exit(0);
>  	}
> -	if (!waiting_for_ping_response)
> +	if (!waiting_for_ping_response) {
>  		WalSndKeepalive(true);
> +		waiting_for_ping_response = true;
> +	}
>  }

That looks reasonable (besides the non-postgres paren placement). Will
commit & backpatch (to 9.4, where it looks like the bug has been
introduced).
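
To see why that one assignment matters, here is a toy model of the
walsender's shutdown loop (a sketch only, not PostgreSQL code; the
function and parameter names are invented for illustration). Without
recording that a ping is already outstanding, every wakeup of the
sender emits another keepalive, which is the flood visible in the
pgshark trace:

```python
# Toy model of the shutdown loop: during shutdown the walsender pings the
# standby until the standby confirms it has flushed all WAL.

def keepalives_sent(flag_set_after_ping, wakeups=100_000):
    waiting_for_ping_response = False
    sent = 0
    for _ in range(wakeups):
        standby_caught_up = False  # in this model the standby never catches up
        if standby_caught_up:
            break                  # real code would proc_exit(0) here
        if not waiting_for_ping_response:
            sent += 1              # WalSndKeepalive(true): ask standby to reply
            if flag_set_after_ping:
                # the one-line fix: remember that a reply is already owed
                waiting_for_ping_response = True
    return sent

print(keepalives_sent(flag_set_after_ping=False))  # buggy: one ping per wakeup
print(keepalives_sent(flag_set_after_ping=True))   # fixed: a single ping
```

With the flag never set, the sender pings at loop speed rather than at
round-trip speed, so each StandbyStatusUpdate reply just wakes it up to
ping again.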

> in this test the server sent only 29 keepalives during the shutdown:
> http://nick.cleaton.net/protodump-100k-nossl-patched.xz (47k)

I guess you have a fair amount of WAL traffic, and the receiver was
behind a good bit?

Greetings,

Andres Freund


#5 Nick Cleaton
nick@cleaton.net
In reply to: Andres Freund (#4)
Re: streaming replication master can fail to shut down

On 28 April 2016 at 19:14, Andres Freund <andres@anarazel.de> wrote:

>> in this test the server sent only 29 keepalives during the shutdown:
>> http://nick.cleaton.net/protodump-100k-nossl-patched.xz (47k)
>
> I guess you have a fair amount of WAL traffic, and the receiver was
> behind a good bit?

No, IIRC this was on the test cluster that I installed for the purpose
of replicating the problem under 9.5; it was essentially idle.


#6 Andres Freund
andres@anarazel.de
In reply to: Nick Cleaton (#5)
Re: streaming replication master can fail to shut down

On 2016-04-28 20:04:21 +0100, Nick Cleaton wrote:

> On 28 April 2016 at 19:14, Andres Freund <andres@anarazel.de> wrote:
>
>>> in this test the server sent only 29 keepalives during the shutdown:
>>> http://nick.cleaton.net/protodump-100k-nossl-patched.xz (47k)
>>
>> I guess you have a fair amount of WAL traffic, and the receiver was
>> behind a good bit?
>
> No, IIRC this was on the test cluster that I installed for the purpose
> of replicating the problem under 9.5; it was essentially idle.

The reason I'm asking is that so far I can't really replicate the
issue. It's pretty clear that waiting_for_ping_response = true; is
needed, but I'm suspicious that that's not all.

Was your standby on a separate machine? What kind of latency?

Andres


#7 Nick Cleaton
nick@cleaton.net
In reply to: Andres Freund (#6)
Re: streaming replication master can fail to shut down

On 29 April 2016 at 04:38, Andres Freund <andres@anarazel.de> wrote:

>>> I guess you have a fair amount of WAL traffic, and the receiver was
>>> behind a good bit?
>>
>> No, IIRC this was on the test cluster that I installed for the purpose
>> of replicating the problem under 9.5; it was essentially idle.
>
> The reason I'm asking is that I so far can't really replicate the issue
> so far. It's pretty clear that waiting_for_ping_response = true; is
> needed, but I'm suspicious that that's not all.
>
> Was your standby on a separate machine?

Yes, I've only seen it happen when the standby was on a machine with
slower CPU cores than the primary. All my attempts to replicate it on
a single machine by trying to slow down the wal receiver have failed.
I'm fairly convinced it's some sort of race that depends on wal sender
+ network being faster than wal receiver.

> What kind of latency?

1G switches.

root@XXX:~# ping XXX
PING XXX (XXX) 56(84) bytes of data.
64 bytes from XXX: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from XXX: icmp_seq=2 ttl=64 time=0.223 ms
64 bytes from XXX: icmp_seq=3 ttl=64 time=0.122 ms
64 bytes from XXX: icmp_seq=4 ttl=64 time=0.126 ms
64 bytes from XXX: icmp_seq=5 ttl=64 time=0.149 ms


#8 Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#6)
Re: streaming replication master can fail to shut down

On Apr 29, 2016 05:38, "Andres Freund" <andres@anarazel.de> wrote:

> On 2016-04-28 20:04:21 +0100, Nick Cleaton wrote:
>
>> On 28 April 2016 at 19:14, Andres Freund <andres@anarazel.de> wrote:
>>
>>>> in this test the server sent only 29 keepalives during the shutdown:
>>>> http://nick.cleaton.net/protodump-100k-nossl-patched.xz (47k)
>>>
>>> I guess you have a fair amount of WAL traffic, and the receiver was
>>> behind a good bit?
>>
>> No, IIRC this was on the test cluster that I installed for the purpose
>> of replicating the problem under 9.5; it was essentially idle.
>
> The reason I'm asking is that I so far can't really replicate the issue
> so far. It's pretty clear that waiting_for_ping_response = true; is
> needed, but I'm suspicious that that's not all.
>
> Was your standby on a separate machine? What kind of latency?

I had no problem reproducing it on my mostly idle laptop. Not enough
to cause a proper slowdown, but when I instrumented the receiver it
saw the multiple packets every single time.

/Magnus

#9 Andres Freund
andres@anarazel.de
In reply to: Nick Cleaton (#7)
Re: streaming replication master can fail to shut down

Hi,

I pushed a fix for this to 9.4, 9.5 and master yesterday. I'm not
convinced it's all that needs to be fixed, particularly for Magnus'
report.

On 2016-04-29 08:05:51 +0100, Nick Cleaton wrote:

> On 29 April 2016 at 04:38, Andres Freund <andres@anarazel.de> wrote:
>
>>>> I guess you have a fair amount of WAL traffic, and the receiver was
>>>> behind a good bit?
>>>
>>> No, IIRC this was on the test cluster that I installed for the purpose
>>> of replicating the problem under 9.5; it was essentially idle.
>>
>> The reason I'm asking is that I so far can't really replicate the issue
>> so far. It's pretty clear that waiting_for_ping_response = true; is
>> needed, but I'm suspicious that that's not all.
>>
>> Was your standby on a separate machine?
>
> Yes, I've only seen it happen when the standby was on a machine with
> slower CPU cores than the primary. All my attempts to replicate it on
> a single machine by trying to slow down the wal receiver have failed.
> I'm fairly convinced it's some sort of race that depends on wal sender
> + network being faster than wal receiver.

Yes, that's kind of what I'm expecting. You'll only hit that branch if
there's outstanding data to be replicated but the message has already
been handed to the OS (!pq_is_send_pending()). Locally that's just a
small data volume, but over an actual network on a longer-lived
connection it can be a lot more.

Andres
