Repeated requests for feedback in logical standby

Started by Magnus Haganderabout 10 years ago9 messagesbugs
Jump to latest
#1Magnus Hagander
magnus@hagander.net

I've been running some tests, and come across a weird behavior on shutdown
with logical replication, 9.5.2. It seems that on shutdown, a lot of extra
requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a
server with a bit more power but still very low activity, the number is
simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a
single psql
* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake up
* The server sends a whole bunch of requests for status, which
pg_recvlogical responds to
* Eventually things shut down (on the proper server, I've had to kill
things because it takes too long - I'm unsure if it ever shuts down)

Log output from pg_recvlogical:

$ ./pg_recvlogical -v -p 5495 -S trivialslot -o include_transaction=on -v
-d postgres --start -f - -P test_decoding 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'
[2016-04-12 17:14:31] pg_recvlogical: starting log streaming at 0/0 (slot
trivialslot)
[2016-04-12 17:14:31] pg_recvlogical: streaming initiated
[2016-04-12 17:14:31] pg_recvlogical: confirming write up to 0/0, flush to
0/0 (slot trivialslot)
[2016-04-12 17:14:41] pg_recvlogical: confirming write up to 0/18F9DE88,
flush to 0/18F9DE88 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DEF8,
flush to 0/18F9DEF8 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: disconnected; waiting 5 seconds to
try again

In this example, the server has:
2016-04-12 17:14:45 CEST LOG: received fast shutdown request
2016-04-12 17:14:45 CEST LOG: aborting any active transactions
2016-04-12 17:14:45 CEST LOG: autovacuum launcher shutting down
2016-04-12 17:14:45 CEST LOG: shutting down
2016-04-12 17:14:45 CEST LOG: database system is shut down

Note how it took a while (6 seconds) for pg_recvlogical to pick up that
anything was happening - looks like something might not have been flushed?

When running on the server where there's more data it was very clear though
- if I break the walsender with gdb, pg_recvlogical stops spitting out
debugging messages, and as soon as I "cont" in gdb it starts spitting them
out again, so there's a clear connection between the two. And it takes a
*lot* more than a few seconds.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#2Nick Cleaton
nick@cleaton.net
In reply to: Magnus Hagander (#1)
Re: Repeated requests for feedback in logical standby

On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:

I've been running some tests, and come across a weird behavior on shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot of extra requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a server with a bit more power but still very low activity, the number is simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a single psql
* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake up
* The server sends a whole bunch of requests for status, which pg_recvlogical responds to
* Eventually things shut down (on the proper server, I've had to kill things because it takes too long - I'm unsure if it ever shuts down)

That sounds similar to something I've seen with wal-based replication:
/messages/by-id/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Nick

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Magnus Hagander
magnus@hagander.net
In reply to: Nick Cleaton (#2)
Re: Repeated requests for feedback in logical standby

On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick@cleaton.net> wrote:

On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:

I've been running some tests, and come across a weird behavior on

shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot
of extra requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a

server with a bit more power but still very low activity, the number is
simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a

single psql

* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake up
* The server sends a whole bunch of requests for status, which

pg_recvlogical responds to

* Eventually things shut down (on the proper server, I've had to kill

things because it takes too long - I'm unsure if it ever shuts down)

That sounds similar to something I've seen with wal-based replication:

/messages/by-id/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Agreed, that sounds very similar.

I don't know the code enough to say whether that's a proper fix though, or
if it might have other sideeffects.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#4Magnus Hagander
magnus@hagander.net
In reply to: Magnus Hagander (#3)
Re: Repeated requests for feedback in logical standby

On Thu, Apr 14, 2016 at 12:30 PM, Magnus Hagander <magnus@hagander.net>
wrote:

On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick@cleaton.net> wrote:

On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:

I've been running some tests, and come across a weird behavior on

shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot
of extra requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a

server with a bit more power but still very low activity, the number is
simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a

single psql

* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake

up

* The server sends a whole bunch of requests for status, which

pg_recvlogical responds to

* Eventually things shut down (on the proper server, I've had to kill

things because it takes too long - I'm unsure if it ever shuts down)

That sounds similar to something I've seen with wal-based replication:

/messages/by-id/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Agreed, that sounds very similar.

I don't know the code enough to say whether that's a proper fix though, or
if it might have other sideeffects.

I should add that yes, this patch does seem to remove the problem for me as
well. But I can't confirm if it breaks something else. Hopefully someone
who has worked more on that code can?

I also still see the problem that "pg_ctrl stop -m fast" on the master does
*nothing* until the client sends one message. I tried increasing the
feedback interval to 2 minutes, and that increased the shutdown time to 2
minutes as well. Which can't be right - clearly the shutdown is failing to
actually make the walsender send the request and shut down. This happens
both with and without the suggested patch.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#5Marko Tiikkaja
marko@joh.to
In reply to: Magnus Hagander (#4)
Re: Repeated requests for feedback in logical standby

On 14/04/16 12:51, Magnus Hagander wrote:

I should add that yes, this patch does seem to remove the problem for me as
well. But I can't confirm if it breaks something else.

Did you also see whether it fixes the problem on the bigger box which
seemed to not shut down at all?

.m

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Magnus Hagander
magnus@hagander.net
In reply to: Marko Tiikkaja (#5)
Re: Repeated requests for feedback in logical standby

On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko@joh.to> wrote:

On 14/04/16 12:51, Magnus Hagander wrote:

I should add that yes, this patch does seem to remove the problem for me
as
well. But I can't confirm if it breaks something else.

Did you also see whether it fixes the problem on the bigger box which
seemed to not shut down at all?

Not yet, but those symptoms sound even more like those of Nick. I have it
fixing two different installations that I have locally here though.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#7Nick Cleaton
nick@cleaton.net
In reply to: Magnus Hagander (#3)
Re: Repeated requests for feedback in logical standby

On 14 April 2016 at 11:30, Magnus Hagander <magnus@hagander.net> wrote:

/messages/by-id/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Agreed, that sounds very similar.

I don't know the code enough to say whether that's a proper fix though, or
if it might have other sideeffects.

Certainly. My "this fixes it for me" was intended more as a piece of
diagnostic information than as a proposed change to the code.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Michael Paquier
michael@paquier.xyz
In reply to: Nick Cleaton (#7)
Re: Repeated requests for feedback in logical standby

On Thu, Apr 14, 2016 at 8:54 PM, Nick Cleaton <nick@cleaton.net> wrote:

Certainly. My "this fixes it for me" was intended more as a piece of
diagnostic information than as a proposed change to the code.

As I am just bumping on it. For the archive's sake, a fix has been
pushed as 5945501 (commit log references this thread as well).
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Magnus Hagander
magnus@hagander.net
In reply to: Magnus Hagander (#6)
Re: Repeated requests for feedback in logical standby

On Thu, Apr 14, 2016 at 1:21 PM, Magnus Hagander <magnus@hagander.net>
wrote:

On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko@joh.to> wrote:

On 14/04/16 12:51, Magnus Hagander wrote:

I should add that yes, this patch does seem to remove the problem for me
as
well. But I can't confirm if it breaks something else.

Did you also see whether it fixes the problem on the bigger box which
seemed to not shut down at all?

Not yet, but those symptoms sound even more like those of Nick. I have it
fixing two different installations that I have locally here though.

Finally got around to test this properly. And no, it does not work - 9.5.3
with logical replication. I captured a couple of stack traces, and they all
look like this:

#0 0x00007f8fa933e6b0 in __read_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1 0x00005641a49aded3 in read (__nbytes=8192, __buf=0x5641a53d9090,
__fd=6) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 XLogRead (buf=buf@entry=0x5641a53d9090 "\207\320\002",
startptr=startptr@entry=1056964608, count=count@entry=8192)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2124
#3 0x00005641a49af0db in logical_read_xlog_page (state=<optimized out>,
targetPagePtr=1056964608, reqLen=<optimized out>, targetRecPtr=<optimized
out>,
cur_page=0x5641a53d9090 "\207\320\002", pageTLI=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:772
#4 0x00005641a480f4d2 in ReadPageInternal (state=state@entry=0x5641a53d6b08,
pageptr=pageptr@entry=1062813696, reqLen=reqLen@entry=360)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:524
#5 0x00005641a480fed1 in XLogReadRecord (state=0x5641a53d6b08,
RecPtr=1062814032, errormsg=errormsg@entry=0x7ffe3bd8ee48)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:243
#6 0x00005641a49ae17b in XLogSendLogical () at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2437
#7 0x00005641a49afc63 in WalSndLoop (send_data=send_data@entry=0x5641a49ae150
<XLogSendLogical>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1852
#8 0x00005641a49b0aa2 in StartLogicalReplication (cmd=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1015
#9 exec_replication_command (cmd_string=cmd_string@entry=0x5641a53a3c50
"START_REPLICATION SLOT \"nineone\" LOGICAL 0/3F5244F8
(\"include_transaction\" 'on')")
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1344
#10 0x00005641a49ed8b4 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x5641a5335118, dbname=0x5641a5335030 "db",
username=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/tcop/postgres.c:4028
#11 0x00005641a47925b8 in BackendRun (port=0x5641a5354bb0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:4239
#12 BackendStartup (port=0x5641a5354bb0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:3913
#13 ServerLoop () at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1684
#14 0x00005641a4993e7b in PostmasterMain (argc=5, argv=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1292
#15 0x00005641a47932b2 in main (argc=5, argv=0x5641a53341f0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/main/main.c:228

They all come out at approximately the same place there.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/