how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?

Started by Rui Hai Jiang, over 9 years ago, 4 messages

ruihaij@gmail.com

over 9 years ago

Hello,

I have one Primary server and one Standby server. They are doing streaming
replication well.

I did some testing. I broke the network connection between them for a few
minutes, and then restored the network. I found that both the WAL sender and
WAL receiver were stopped and then restarted.

I wonder how the WAL receiver process is stopped and restarted. I have checked
the code hoping to find the answer, but I don't have any clue.

Could anyone help?

Thanks,
Rui Hai

craig@2ndquadrant.com

over 9 years ago

In reply to: Rui Hai Jiang (#1)

Re: how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?

On 22 June 2016 at 23:52, Rui Hai Jiang <ruihaij@gmail.com> wrote:

> Hello,
>
> I have one Primary server and one Standby server. They are doing streaming
> replication well.
>
> I did some testing. I broke the network connection between them for a few
> minutes, and then restored the network. I found that both the WAL sender and
> WAL receiver were stopped and then restarted.
>
> I wonder how the WAL receiver process is stopped and restarted. I have checked
> the code hoping to find the answer, but I don't have any clue.

If TCP keepalives are enabled, the TCP connection will break when the
keepalives stop arriving.

If wal_receiver_timeout is enabled, the walreceiver will notice that it
hasn't received any data from the walsender and assume it went away.

If the OS notices that the socket went away - say, it got a TCP RST from
the remote peer as it shut down cleanly - it'll close the walreceiver
socket and the walreceiver will quit.

Otherwise it won't notice and will wait indefinitely.
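
For illustration, here is a minimal sketch of the first mechanism: turning on
TCP keepalives for a socket in C. It is not PostgreSQL's code. The function
name enable_keepalive and the timing values are hypothetical, and
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific option names.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from PostgreSQL): enable TCP keepalives on fd.
 *   idle_s  = seconds of silence before the first probe is sent
 *   intvl_s = seconds between probes
 *   cnt     = unanswered probes before the kernel drops the connection
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int
enable_keepalive(int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

    /* Linux-specific tuning knobs; other systems use different names. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;

    return 0;
}
```

With these settings the kernel reports the connection as broken after roughly
idle_s + intvl_s * cnt seconds of silence from the peer. In PostgreSQL the
corresponding knobs are the tcp_keepalives_* server settings and the
keepalives_* libpq connection parameters.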

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

ruihaij@gmail.com

over 9 years ago

In reply to: Craig Ringer (#2)

Re: how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?

Thank you Craig for your suggestion.

I followed the clue and spent the whole day digging into the code.

Finally I figured out how the WAL receiver exits and restarts.

Question-1. How the WAL receiver process exits
===============================================
When the network connection is broken, the WAL receiver can no longer
communicate with the WAL sender. Once the WAL receiver has received nothing
from the WAL sender for longer than wal_receiver_timeout, it exits by
calling "ereport(ERROR,...)".

In the WAL receiver, calling ereport(ERROR,...) causes the process to exit,
but calling ereport(LOG,...) doesn't.
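
The difference between the two levels can be sketched with a simplified model.
It assumes that an ERROR raised in a process with no error-recovery handler
installed is promoted to FATAL, and that the WAL receiver is such a process.
The enum and function names are hypothetical, not PostgreSQL's.

```c
#include <stdbool.h>

/* Hypothetical severity levels mirroring LOG, ERROR and FATAL. */
typedef enum { LVL_LOG, LVL_ERROR, LVL_FATAL } Level;

/* With no handler available to catch it, an ERROR is treated as FATAL. */
Level
effective_level(Level requested, bool has_error_handler)
{
    if (requested == LVL_ERROR && !has_error_handler)
        return LVL_FATAL;
    return requested;
}

/* In this model only a FATAL (or promoted ERROR) report ends the process. */
bool
report_exits_process(Level requested, bool has_error_handler)
{
    return effective_level(requested, has_error_handler) == LVL_FATAL;
}
```

Under this model, report_exits_process(LVL_ERROR, false) is true, while
LVL_LOG never ends the process, and an ERROR in a process that does have a
handler is caught instead.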

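WalReceiverMain(void)
{
    /* Wait up to NAPTIME_PER_CYCLE for data from the WAL sender. */
    len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
    if (len != 0)
    {
        /* Something was received; processing omitted in this excerpt. */
    }
    else
    {
        /* Nothing received in this cycle: check for a timeout. */
        if (wal_receiver_timeout > 0)
        {
            if (now >= timeout)
                ereport(ERROR,
                        (errmsg("terminating walreceiver due to timeout")));
        }
    }
}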
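
The timeout test in the excerpt above can be written as a small self-contained
sketch. The helper names are hypothetical, and it assumes the deadline is the
time of the last received message plus wal_receiver_timeout, with all times in
milliseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: has the receiver been silent for too long?
 * timeout_ms <= 0 disables the check, mirroring the
 * "wal_receiver_timeout > 0" test in the excerpt.
 */
bool
receiver_timed_out(int64_t last_recv_ms, int64_t now_ms, int64_t timeout_ms)
{
    if (timeout_ms <= 0)
        return false;
    return now_ms >= last_recv_ms + timeout_ms;
}

/*
 * Simulate idle polling cycles of nap_ms each, starting right after the
 * last message was received. Returns the number of cycles after which the
 * timeout fires, or -1 if it does not fire within max_cycles.
 */
int
cycles_until_exit(int64_t nap_ms, int64_t timeout_ms, int max_cycles)
{
    int64_t now_ms = 0;

    for (int cycle = 1; cycle <= max_cycles; cycle++)
    {
        now_ms += nap_ms;       /* one wait with nothing received */
        if (receiver_timed_out(0, now_ms, timeout_ms))
            return cycle;
    }
    return -1;
}
```

For example, with a 100 ms nap and a 60000 ms timeout the simulated receiver
gives up after 600 idle cycles; with the timeout set to 0 it never does.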

Question-2. How the WAL receiver process starts again
=====================================================

On the Standby side, the startup process is responsible for recovery
processing. If streaming replication is configured and the startup process
finds that the WAL receiver process is not running, it notifies the
Postmaster to start the WAL receiver process. Note: This is also how the WAL
receiver process starts for the first time!

(1) The startup process notifies the Postmaster to start the WAL receiver by
sending it a SIGUSR1.

RequestXLogStreaming()
{
    if (launch)
        SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
}

SendPostmasterSignal(PMSignalReason reason)
{
    /* Wake the Postmaster; the reason tells it what is being requested. */
    kill(PostmasterPid, SIGUSR1);
}

(2) The Postmaster receives SIGUSR1 and starts the WAL receiver process.

sigusr1_handler(SIGNAL_ARGS)
{
    /* Checks on the signal reason and server state omitted in this excerpt. */
    WalReceiverPID = StartWalReceiver();
}

Please let me know if my understanding is incorrect.

thanks,
Rui Hai

robertmhaas@gmail.com

over 9 years ago

In reply to: Rui Hai Jiang (#3)

Re: how is the WAL receiver process stopped and restarted when the network connection is broken and then restored?

On Thu, Jun 23, 2016 at 10:56 AM, Rui Hai Jiang <ruihaij@gmail.com> wrote:

> Please let me know if my understanding is incorrect.

I think you've got it about right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)