BUG #7534: walreceiver takes long time to detect n/w breakdown

Started by Noname over 13 years ago, 80 messages
#1Noname
amit.kapila@huawei.com

The following bug has been logged on the website:

Bug reference: 7534
Logged by: Amit Kapila
Email address: amit.kapila@huawei.com
PostgreSQL version: 9.2.0
Operating system: Suse 10
Description:

1. Both master and standby machines are connected normally.
2. Then run the command "ifconfig ip down" to bring down the network card of
the master and the standby.

Observation
The master detects the broken connection, but the standby does not, and it
keeps showing the channel as connected for a long time.

#2Magnus Hagander
magnus@hagander.net
In reply to: Noname (#1)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wed, Sep 12, 2012 at 1:54 PM, <amit.kapila@huawei.com> wrote:

The following bug has been logged on the website:

Bug reference: 7534
Logged by: Amit Kapila
Email address: amit.kapila@huawei.com
PostgreSQL version: 9.2.0
Operating system: Suse 10
Description:

1. Both master and standby machine are connected normally,
2. then you use the command: ifconfig ip down; make the network card of
master and standby down,

Observation
master can detect connect abnormal, but the standby can't detect connect
abnormal and show a connected channel long time.

The master will detect it quicker, because it will get an error when
it tries to send something.

But the standby should detect it either when sending the feedback
message (what's your wal_receiver_status_interval set to?) or when
the kernel does (have you configured the tcp keepalive on the slave
somehow?)

Oh, and what do you actually mean by "long time"?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

The following bug has been logged on the website:

Bug reference: 7534
Logged by: Amit Kapila
Email address: amit.kapila@huawei.com
PostgreSQL version: 9.2.0
Operating system: Suse 10
Description:

1. Both master and standby machine are connected normally,
2. then you use the command: ifconfig ip down; make the network card of
master and standby down,

Observation
master can detect connect abnormal, but the standby can't detect connect
abnormal and show a connected channel long time.

What about setting keepalives_xxx libpq parameters?
http://www.postgresql.org/docs/devel/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS

Keepalives are not a perfect solution for the termination of connection, but
it would help to a certain extent. If you need something like a
walreceiver version of replication_timeout, such a feature has not been
implemented yet. Please feel free to implement that!
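
For reference, here is a minimal sketch of what the keepalives_idle, keepalives_interval and keepalives_count libpq parameters roughly correspond to at the socket level on Linux. The function name enable_tcp_keepalive is made up for this illustration, the TCP_KEEP* options are not portable, and this is not libpq's actual code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable TCP keepalive on an existing TCP socket.
 *   idle_s     - seconds of inactivity before the first probe
 *   interval_s - seconds between unanswered probes
 *   count      - unanswered probes before the kernel drops the connection
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * Linux-specific: TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.
 */
int
enable_tcp_keepalive(int sock, int idle_s, int interval_s, int count)
{
    int on = 1;

    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

With values such as 30, 5 and 3, an idle dead connection would be expected to be dropped after roughly idle_s + interval_s * count seconds. That estimate only covers an idle connection; it says nothing about a connection that still has unsent or unacknowledged data queued.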

Regards,

--
Fujii Masao

#4Amit Kapila
amit.kapila@huawei.com
In reply to: Magnus Hagander (#2)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wednesday, September 12, 2012 10:12 PM Magnus Hagander wrote:
On Wed, Sep 12, 2012 at 1:54 PM, <amit.kapila@huawei.com> wrote:

The master will detect it quicker, because it will get an error when
it tries to send something.

But the standby should detect it either when sending the feedback
message (what's your wal_receiver_status_interval set to?) or when
the kernel does (have you configured the tcp keepalive on the slave
somehow?)

wal_receiver_status_interval is 10s (we have not changed it, so this is the
default).
We have tried using TCP keepalive as well, but it might not be able to detect
the failure, because the receiver keeps trying to send its status anyway.
The send socket call from XLogWalRcvSendReply() fails only after it has been
called many times: internally, send() keeps accumulating data until the
socket's internal buffer is full, even though the other side has not
received it.
Also, the walsender fails because of the replication_timeout parameter, not
because of a send failure.
So in my opinion, the foolproof solution would be to have a mechanism in
walreceiver similar to walsender's replication_timeout.
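
The buffering behaviour described above can be reproduced in isolation. The following is only a sketch: it uses an AF_UNIX socketpair as a stand-in for the replication TCP connection, and the function name count_sends_without_reader is invented for the example. It shows that send() keeps reporting success as long as the local socket buffer has room, even though the peer never reads anything.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: count how many send() calls succeed on a connected
 * stream socket whose peer never calls recv().
 * Returns the number of successful sends, or -1 on setup failure.
 * On return, *last_errno holds the errno of the first failed send().
 */
long
count_sends_without_reader(int *last_errno)
{
    int   fds[2];
    int   flags;
    char  msg[64];
    long  ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* Non-blocking, so a full buffer yields an error instead of hanging. */
    flags = fcntl(fds[0], F_GETFL, 0);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(msg, 'x', sizeof(msg));
    for (;;)
    {
        /* The peer (fds[1]) never reads; the kernel just queues the data. */
        ssize_t n = send(fds[0], msg, sizeof(msg), 0);

        if (n < 0)
        {
            *last_errno = errno;
            break;
        }
        ok++;
    }

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

On a real TCP connection over a dead link, the same effect would presumably mean that the small status messages are accepted locally for a long time, so the failure only surfaces once the kernel itself gives up on the connection.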

Oh, and what do you actually mean by "long time"?

15-20 mins.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#5Amit Kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#3)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

What about setting keepalives_xxx libpq parameters?
http://www.postgresql.org/docs/devel/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS

Keepalives are not a perfect solution for the termination of connection, but
it would help to a certain extent.

We have tried enabling keepalive, but it didn't work, maybe because
walreceiver is trying to send receiver status.
Sending that fails only after many attempts.

If you need something like walreceiver-version of replication_timeout,
such feature has not been implemented yet. Please feel free to implement that!

I would like to implement such a feature for walreceiver, but I am unsure
whether to use the same configuration parameter (replication_timeout) for
walreceiver as for the master, or to introduce a new configuration parameter
(receiver_replication_timeout).

The only reason to have different timeout parameters for walsender and
walreceiver is the case of a standby that runs both a walreceiver and a
walsender (to send logs to a cascaded standby); in that case somebody might
want different timeout values for the two.
OTOH, having too many parameters will create confusion. My opinion is to
have one timeout parameter for both walsender and walreceiver.

Let me know your suggestions/opinions on this.

Note: I am cc'ing pgsql-hackers, as this will be a feature request.

With Regards,
Amit Kapila.

#6Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

What about setting keepalives_xxx libpq parameters?
http://www.postgresql.org/docs/devel/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS

Keepalives are not a perfect solution for the termination of connection, but
it would help to a certain extent.

We have tried enabling keepalive, but it didn't work, maybe because
walreceiver is trying to send receiver status.
Sending that fails only after many attempts.

If you need something like walreceiver-version of replication_timeout,
such feature has not been implemented yet. Please feel free to implement that!

I would like to implement such a feature for walreceiver, but I am unsure
whether to use the same configuration parameter (replication_timeout) for
walreceiver as for the master, or to introduce a new configuration parameter
(receiver_replication_timeout).

I like the latter. I believe some users want to set different timeout values,
for example, when the master and standby servers are placed in
the same room, but the cascaded standby is placed on another continent.

Regards,

--
Fujii Masao

#7Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#6)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

I would like to implement such a feature for walreceiver, but I am unsure
whether to use the same configuration parameter (replication_timeout) for
walreceiver as for the master, or to introduce a new configuration parameter
(receiver_replication_timeout).

I like the latter. I believe some users want to set different timeout values,
for example, when the master and standby servers are placed in
the same room, but the cascaded standby is placed on another continent.

Thank you for your suggestion. As you suggested, I have implemented a separate timeout parameter for walreceiver.
The main changes are:
1. Introduce a new configuration parameter, wal_receiver_replication_timeout, for walreceiver.
2. In WalReceiverMain(), exit the walreceiver if there has been no communication for wal_receiver_replication_timeout.
This is the same as the walsender functionality.
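
Stripped of the PostgreSQL specifics, the check in change 2 comes down to a single comparison. The sketch below is a standalone approximation, not the patch itself: it uses plain millisecond counters instead of TimestampTz, and the name receive_timeout_expired is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical standalone form of the walreceiver check:
 * has at least timeout_ms elapsed since the last message arrived?
 * A timeout of 0 (or less) disables the check, as in the patch.
 */
bool
receive_timeout_expired(int64_t last_recv_ms, int64_t now_ms, int timeout_ms)
{
    if (timeout_ms <= 0)
        return false;           /* feature disabled */

    /* Same comparison as the patch: now >= last receive + timeout. */
    return now_ms >= last_recv_ms + timeout_ms;
}
```

The caller is expected to reset last_recv_ms whenever anything at all arrives from the master, so only a completely silent connection trips the check.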

As this is a feature, I am submitting the attached patch to the upcoming CommitFest.

Suggestions/Comments?

With Regards,
Amit Kapila.

Attachments:

replication_timeout.patch (text/plain)
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 62,67 **** walrcv_connect_type walrcv_connect = NULL;
--- 62,69 ----
  walrcv_receive_type walrcv_receive = NULL;
  walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
+ int			wal_receiver_replication_timeout = 60 * 1000;	/* maximum time to receive one
+ 												 * WAL data message */
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
***************
*** 174,179 **** WalReceiverMain(void)
--- 176,184 ----
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
  
+ 	TimestampTz last_recv_timestamp;
+ 	TimestampTz timeout = 0;
+ 
  	/*
  	 * WalRcv should be set up already (if we are a backend, we inherit this
  	 * by fork() or EXEC_BACKEND mechanism from the postmaster).
***************
*** 282,287 **** WalReceiverMain(void)
--- 287,295 ----
  	MemSet(&reply_message, 0, sizeof(reply_message));
  	MemSet(&feedback_message, 0, sizeof(feedback_message));
  
+ 	/* Initialize the last recv timestamp */
+ 	last_recv_timestamp = GetCurrentTimestamp();
+ 
  	/* Loop until end-of-streaming or error */
  	for (;;)
  	{
***************
*** 316,327 **** WalReceiverMain(void)
--- 324,343 ----
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
+ 			/* Something is received from master, so reset last receive time*/
+ 			last_recv_timestamp = GetCurrentTimestamp();
+ 			
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
+ 			{
+ 				/* Something is received from master, so reset last receive time*/
+ 				last_recv_timestamp = GetCurrentTimestamp();
+ 				
  				XLogWalRcvProcessMsg(type, buf, len);
+ 			}
  
  			/* Let the master know that we received some data. */
  			XLogWalRcvSendReply();
***************
*** 334,339 **** WalReceiverMain(void)
--- 350,369 ----
  		}
  		else
  		{
+ 			/* Check if time since last receive from master has reached the configured limit.
+ 			 * No need to check if it is disabled by giving value as 0 */
+ 			if (wal_receiver_replication_timeout > 0)
+ 			{
+ 				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
+ 														  wal_receiver_replication_timeout);
+ 
+ 				if (GetCurrentTimestamp() >= timeout)
+ 				{
+ 					ereport(ERROR,
+ 						(errmsg("Could not receive any message from WalSender for configured timeout period")));
+ 				}
+ 			}
+ 		
  			/*
  			 * We didn't receive anything new, but send a status update to the
  			 * master anyway, to report any progress in applying WAL.
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2382,2398 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"wal_receiver_replication_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ 			gettext_noop("Sets the maximum wait time to receive data from master."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&wal_receiver_replication_timeout,
+ 		60 * 1000, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 		
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 237,242 ****
--- 237,244 ----
  					# 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
+ #wal_receiver_replication_timeout = 60s	# in milliseconds; 0 disables; time 
+ 					# till receiver waits for communication from master.
  
  
  #------------------------------------------------------------------------------
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 19,24 ****
--- 19,25 ----
  
  extern int	wal_receiver_status_interval;
  extern bool hot_standby_feedback;
+ extern int wal_receiver_replication_timeout;
  
  /*
   * MAXCONNINFO: maximum size of a connection string.
#8Fujii Masao
masao.fujii@gmail.com
In reply to: Amit kapila (#7)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

Thank you for your suggestion. As you suggested, I have implemented a separate timeout parameter for walreceiver.
The main changes are:
1. Introduce a new configuration parameter, wal_receiver_replication_timeout, for walreceiver.
2. In WalReceiverMain(), exit the walreceiver if there has been no communication for wal_receiver_replication_timeout.
This is the same as the walsender functionality.

As this is a feature, I am submitting the attached patch to the upcoming CommitFest.

Suggestions/Comments?

You also need to change walsender so that it periodically sends the heartbeat
message, like walreceiver does each wal_receiver_status_interval. Otherwise,
walreceiver will detect the timeout wrongly whenever there is no traffic in the
master.
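
To make the false-timeout concern concrete, here is a small simulation sketch. It models an idle but healthy link in whole milliseconds; the function name idle_link_times_out and the model itself are assumptions made for illustration, not walsender or walreceiver code.

```c
#include <stdbool.h>

/*
 * Hypothetical simulation of an idle primary for duration_ms.
 * The sender emits a heartbeat every heartbeat_ms (0 = never); the receiver
 * gives up once timeout_ms pass without receiving any message.
 * Returns true if the receiver times out even though the link is healthy.
 */
bool
idle_link_times_out(int timeout_ms, int heartbeat_ms, int duration_ms)
{
    int last_recv_ms = 0;
    int next_heartbeat_ms = heartbeat_ms;

    for (int now_ms = 1; now_ms <= duration_ms; now_ms++)
    {
        if (heartbeat_ms > 0 && now_ms >= next_heartbeat_ms)
        {
            last_recv_ms = now_ms;      /* heartbeat arrives, clock resets */
            next_heartbeat_ms += heartbeat_ms;
        }

        if (timeout_ms > 0 && now_ms - last_recv_ms >= timeout_ms)
            return true;                /* spurious timeout */
    }
    return false;
}
```

In this model a 60 s timeout with no heartbeat fires on an idle link, a 10 s heartbeat prevents it, and a timeout shorter than the heartbeat interval fires regardless.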

Regards,

--
Fujii Masao

#9Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#8)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Saturday, September 15, 2012 11:27 AM Fujii Masao wrote:
On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

You also need to change walsender so that it periodically sends the heartbeat
message, like walreceiver does each wal_receiver_status_interval. Otherwise,
walreceiver will detect the timeout wrongly whenever there is no traffic in the
master.

Won't the current keepalive message from walsender suffice for that need?

With Regards,
Amit Kapila.

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Amit kapila (#9)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Saturday, September 15, 2012 11:27 AM Fujii Masao wrote:
On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

You also need to change walsender so that it periodically sends the heartbeat
message, like walreceiver does each wal_receiver_status_interval. Otherwise,
walreceiver will detect the timeout wrongly whenever there is no traffic in the
master.

Won't the current keepalive message from walsender suffice for that need?

No. Though the keepalive interval should be smaller than the timeout,
IIRC there is no way to specify the keepalive interval now.

Regards,

--
Fujii Masao

#11Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#10)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Sunday, September 16, 2012 12:14 AM Fujii Masao wrote:
On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Saturday, September 15, 2012 11:27 AM Fujii Masao wrote:
On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

You also need to change walsender so that it periodically sends the heartbeat
message, like walreceiver does each wal_receiver_status_interval. Otherwise,
walreceiver will detect the timeout wrongly whenever there is no traffic in the
master.

Won't the current keepalive message from walsender suffice for that need?

No. Though the keepalive interval should be smaller than the timeout,
IIRC there is no way to specify the keepalive interval now.

Currently, AFAICS from the code, an idle system should send a keepalive after 10s, which is the hardcoded value of sleeptime.
You are right that if it's not configurable and somebody configures replication_timeout lower than 10s, the logic will fail.

So is it okay to add a new config parameter, similar to wal_receiver_status_interval, and map it directly to sleeptime in the current code?
There will be no need for any new heartbeat message; the existing keepalive will suffice for that purpose.

With Regards,
Amit Kapila.

#12Amit Kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#10)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Sunday, September 16, 2012 12:14 AM Fujii Masao wrote:
On Sat, Sep 15, 2012 at 4:26 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Saturday, September 15, 2012 11:27 AM Fujii Masao wrote:
On Fri, Sep 14, 2012 at 10:01 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Thursday, September 13, 2012 10:57 PM Fujii Masao
On Thu, Sep 13, 2012 at 1:22 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, September 12, 2012 10:15 PM Fujii Masao
On Wed, Sep 12, 2012 at 8:54 PM, <amit.kapila@huawei.com> wrote:

You also need to change walsender so that it periodically sends the heartbeat
message, like walreceiver does each wal_receiver_status_interval. Otherwise,
walreceiver will detect the timeout wrongly whenever there is no traffic in the
master.

Won't the current keepalive message from walsender suffice for that need?

No. Though the keepalive interval should be smaller than the timeout,
IIRC there is no way to specify the keepalive interval now.

To define the behavior correctly, I see two options now:

Approach-1:
Document that both timeout parameters (sender and receiver) should be
greater than wal_receiver_status_interval.
If both are greater, then I think it might never time out due to idleness.

Approach-2:
Provide a variable wal_send_status_interval, such that if it is 0, the
current behavior prevails, and if it is non-zero, a KeepAlive
message is sent after at most that interval.
The modified code of WALSendLoop will be as follows:

    TimestampTz timeout = 0;
    long        sleeptime = 10000;  /* 10 s */
    int         wakeEvents;

    /*
     * sleeptime should be equal to wal send interval if it is not zero,
     * otherwise default as 10 sec
     */
    if (wal_send_status_interval > 0)
    {
        sleeptime = wal_send_status_interval;
    }

    wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
        WL_SOCKET_READABLE | WL_TIMEOUT;

    if (pq_is_send_pending())
        wakeEvents |= WL_SOCKET_WRITEABLE;
    else if (wal_send_status_interval > 0)
    {
        WalSndKeepalive(output_message);
        /* Try to flush pending output to the client */
        if (pq_flush_if_writable() != 0)
            break;
    }

    /* Determine time until replication timeout */
    if (replication_timeout > 0)
    {
        timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                              replication_timeout);

        if (wal_send_status_interval <= 0)
        {
            sleeptime = 1 + (replication_timeout / 10);
        }
    }

    /* Sleep until something happens or replication timeout */
    WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
                      MyProcPort->sock, sleeptime);

    /*
     * Check for replication timeout. Note we ignore the corner case
     * possibility that the client replied just as we reached the
     * timeout ... he's supposed to reply *before* that.
     */
    if (replication_timeout > 0 &&
        GetCurrentTimestamp() >= timeout)
    {
        /*
         * Since typically expiration of replication timeout means
         * communication problem, we don't send the error message to
         * the standby.
         */
        ereport(COMMERROR,
                (errmsg("terminating walsender process due to replication timeout")));
        break;
    }
}
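
The sleep-time selection in the snippet above can be isolated as a pure function, which makes the three cases easier to see. This is only a sketch: walsender_sleeptime_ms is a hypothetical name, and it treats wal_send_status_interval as milliseconds because the snippet assigns it directly to sleeptime.

```c
/*
 * Hypothetical extraction of the sleep-time logic from the proposed loop.
 * Both arguments are in milliseconds; 0 means "disabled".
 *   - interval set:           sleep for the interval
 *   - only timeout set:       sleep for 1 + timeout/10
 *   - neither set:            fall back to the 10 s default
 */
long
walsender_sleeptime_ms(int wal_send_status_interval, int replication_timeout)
{
    long sleeptime = 10000;     /* 10 s default */

    if (wal_send_status_interval > 0)
        sleeptime = wal_send_status_interval;
    else if (replication_timeout > 0)
        sleeptime = 1 + (replication_timeout / 10);

    return sleeptime;
}
```

For example, an interval of 2000 with a timeout of 60000 gives 2000, while an interval of 0 with the same timeout gives 6001.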

Which way do you think is better, or do you have any other idea for handling this?

With Regards,
Amit Kapila.

#13Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

To define the behavior correctly, according to me there are 2 options now:

Approach-1 :
Document that both(sender and receiver) the timeout parameters should be
greater than wal_receiver_status_interval.
If both are greater, then I think it might never timeout due to Idle.

In this approach, keepalive messages are sent each wal_receiver_status_interval?

Approach-2 :
Provide a variable wal_send_status_interval, such that if this is 0, then
the current behavior would prevail and if its non-zero then KeepAlive
message would be send maximum after that time.
The modified code of WALSendLoop will be as follows:

<snip>

Which way you think is better or you have any other idea to handle.

I think #2 is better because it's more intuitive to a user.

Regards,

--
Fujii Masao

#14Amit Kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#13)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tuesday, September 18, 2012 6:03 PM Fujii Masao wrote:
On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

To define the behavior correctly, according to me there are 2 options now:

Approach-1 :
Document that both(sender and receiver) the timeout parameters should be
greater than wal_receiver_status_interval.
If both are greater, then I think it might never timeout due to Idle.

In this approach, keepalive messages are sent each wal_receiver_status_interval?

wal_receiver_status_interval or sleeptime, whichever is smaller.

Approach-2 :
Provide a variable wal_send_status_interval, such that if this is 0, then
the current behavior would prevail and if its non-zero then KeepAlive
message would be send maximum after that time.
The modified code of WALSendLoop will be as follows:

<snip>

Which way you think is better or you have any other idea to handle.

I think #2 is better because it's more intuitive to a user.

I shall update the Patch as per Approach-2 and upload the same.

With Regards,
Amit Kapila.

#15Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#13)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tuesday, September 18, 2012 6:02 PM Fujii Masao wrote:
On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

Approach-2 :
Provide a variable wal_send_status_interval, such that if this is 0, then
the current behavior would prevail and if its non-zero then KeepAlive
message would be send maximum after that time.
The modified code of WALSendLoop will be as follows:

<snip>

Which way you think is better or you have any other idea to handle.

I think #2 is better because it's more intuitive to a user.

Please find a patch attached for implementation of Approach-2.

With Regards,
Amit Kapila.

Attachments:

replication_timeout_patch_v2.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2175,2180 **** SET ENABLE_SEQSCAN TO OFF;
--- 2175,2204 ----
         </para>
        </listitem>
       </varlistentry>
+ 	 
+ 	 <varlistentry id="guc-wal-receiver-replication-timeout" xreflabel="wal_receiver_replication_timeout">
+       <term><varname>wal_receiver_replication_timeout</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>wal_receiver_replication_timeout</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Terminate replication connections that are inactive longer
+         than the specified number of milliseconds. This is useful for
+         the receiving standby server to detect a primary node crash or network outage.
+         A value of zero disables the timeout mechanism.  This parameter
+         can only be set in
+         the <filename>postgresql.conf</> file or on the server command line.
+         The default value is 60 seconds.
+        </para>
+        <para>
+         To prevent connections from being terminated prematurely,
+         <xref linkend="guc-wal-send-status-interval">
+         must be enabled on the primary, and its value must be less than the
+         value of <varname>wal_receiver_replication_timeout</>.
+        </para>
+       </listitem>
+      </varlistentry>
  
       </variablelist>
      </sect2>
***************
*** 2397,2402 **** SET ENABLE_SEQSCAN TO OFF;
--- 2421,2450 ----
        </para>
        </listitem>
       </varlistentry>
+ 	 
+ 	 <varlistentry id="guc-wal-send-status-interval" xreflabel="wal_send_status_interval">
+       <term><varname>wal_send_status_interval</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>wal_send_status_interval</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+       <para>
+        Specifies the minimum frequency for the WAL sender
+        process on the primary to send heart-beat message to the standby.
+        This parameter's value is the maximum interval, in seconds, between heart-beats.  
+ 	   Updates are sent each time it receives response from standby, or at least as
+        often as specified by this parameter.  Setting this parameter to zero
+        disables status updates completely.  This parameter can only be set in
+        the <filename>postgresql.conf</> file or on the server command line.
+        The default value is 10 seconds.
+       </para>
+       <para>
+        When <xref linkend="guc-wal-receiver-replication-timeout"> is enabled on a receiving server,
+        <varname>wal_send_status_interval</> must be enabled, and its value
+        must be less than the value of <varname>wal_receiver_replication_timeout</>.
+       </para>
+       </listitem>
+      </varlistentry>	 
  
       <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
        <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 62,67 **** walrcv_connect_type walrcv_connect = NULL;
--- 62,69 ----
  walrcv_receive_type walrcv_receive = NULL;
  walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
+ int			wal_receiver_replication_timeout = 60 * 1000;	/* maximum time to receive one
+ 												 * WAL data message */
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
***************
*** 174,179 **** WalReceiverMain(void)
--- 176,184 ----
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
  
+ 	TimestampTz last_recv_timestamp;
+ 	TimestampTz timeout = 0;
+ 
  	/*
  	 * WalRcv should be set up already (if we are a backend, we inherit this
  	 * by fork() or EXEC_BACKEND mechanism from the postmaster).
***************
*** 282,287 **** WalReceiverMain(void)
--- 287,295 ----
  	MemSet(&reply_message, 0, sizeof(reply_message));
  	MemSet(&feedback_message, 0, sizeof(feedback_message));
  
+ 	/* Initialize the last recv timestamp */
+ 	last_recv_timestamp = GetCurrentTimestamp();
+ 
  	/* Loop until end-of-streaming or error */
  	for (;;)
  	{
***************
*** 316,327 **** WalReceiverMain(void)
--- 324,343 ----
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
+ 			/* Something is received from master, so reset last receive time*/
+ 			last_recv_timestamp = GetCurrentTimestamp();
+ 			
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
+ 			{
+ 				/* Something is received from master, so reset last receive time*/
+ 				last_recv_timestamp = GetCurrentTimestamp();
+ 				
  				XLogWalRcvProcessMsg(type, buf, len);
+ 			}
  
  			/* Let the master know that we received some data. */
  			XLogWalRcvSendReply();
***************
*** 334,339 **** WalReceiverMain(void)
--- 350,369 ----
  		}
  		else
  		{
+ 			/* Check if time since last receive from master has reached the configured limit
+ 			 * No need to check if it is disabled by giving value as 0*/
+ 			if (wal_receiver_replication_timeout > 0)
+ 			{
+ 				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
+ 														  wal_receiver_replication_timeout);
+ 
+ 				if (GetCurrentTimestamp() >= timeout)
+ 				{
+ 					ereport(ERROR,
+ 						(errmsg("Could not receive any message from WalSender for configured timeout period")));
+ 				}
+ 			}
+ 		
  			/*
  			 * We didn't receive anything new, but send a status update to the
  			 * master anyway, to report any progress in applying WAL.
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 82,87 **** bool		am_cascading_walsender = false;		/* Am I cascading WAL to
--- 82,89 ----
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
  int			replication_timeout = 60 * 1000;	/* maximum time to send one
  												 * WAL data message */
+ int			wal_send_status_interval = 10 * 1000; /* send replies at least this often to standby */
+ 												 	
  /*
   * State for WalSndWakeupRequest
   */
***************
*** 832,843 **** WalSndLoop(void)
  			long		sleeptime = 10000;		/* 10 s */
  			int			wakeEvents;
  
  			wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
  				WL_SOCKET_READABLE | WL_TIMEOUT;
  
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (MyWalSnd->sendKeepalive)
  			{
  				WalSndKeepalive(output_message);
  				/* Try to flush pending output to the client */
--- 834,855 ----
  			long		sleeptime = 10000;		/* 10 s */
  			int			wakeEvents;
  
+ 			/* sleeptime should be equal to wal send interval if it is greater than zero*/
+ 			if (wal_send_status_interval > 0)
+ 			{
+ 				sleeptime = wal_send_status_interval*1000;
+ 			}			
+ 
  			wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
  				WL_SOCKET_READABLE | WL_TIMEOUT;
  
+ 			/* 
+ 			  * send keepalive message if sendkeepalive is enabled or WAL send status 
+ 			  * interval is greater than zero.
+ 			  */
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (MyWalSnd->sendKeepalive || (wal_send_status_interval > 0))
  			{
  				WalSndKeepalive(output_message);
  				/* Try to flush pending output to the client */
***************
*** 850,856 **** WalSndLoop(void)
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  													  replication_timeout);
! 				sleeptime = 1 + (replication_timeout / 10);
  			}
  
  			/* Sleep until something happens or replication timeout */
--- 862,871 ----
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  													  replication_timeout);
! 				if (wal_send_status_interval <= 0)
! 				{
! 					sleeptime = 1 + (replication_timeout / 10);
! 				}
  			}
  
  			/* Sleep until something happens or replication timeout */
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1596,1601 **** static struct config_int ConfigureNamesInt[] =
--- 1596,1612 ----
  	},
  
  	{
+ 		{"wal_receiver_replication_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ 			gettext_noop("Sets the maximum wait time to receive data from master."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&wal_receiver_replication_timeout,
+ 		60 * 1000, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
  			gettext_noop("Sets the maximum number of concurrent connections."),
  			NULL
***************
*** 2030,2035 **** static struct config_int ConfigureNamesInt[] =
--- 2041,2057 ----
  	},
  
  	{
+ 		{"wal_send_status_interval", PGC_SIGHUP, REPLICATION_SENDING,
+ 			gettext_noop("Sets the maximum interval between WAL sender reports to the standby."),
+ 			NULL,
+ 			GUC_UNIT_S
+ 		},
+ 		&wal_send_status_interval,
+ 		10, 0, INT_MAX / 1000,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_delay", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the delay in microseconds between transaction commit and "
  						 "flushing WAL to disk."),
***************
*** 2381,2387 **** static struct config_int ConfigureNamesInt[] =
  		1024, 100, 102400,
  		NULL, NULL, NULL
  	},
! 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
--- 2403,2409 ----
  		1024, 100, 102400,
  		NULL, NULL, NULL
  	},
! 		
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
  				# comma-separated list of application_name
  				# from standby(s); '*' = all
  #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
+ #wal_send_status_interval = 10s	# send replies at least this often to standby
+ 					#  in seconds; 0 disables					
  
  # - Standby Servers -
  
***************
*** 234,242 ****
  					# when reading streaming WAL;
  					# -1 allows indefinite delay
  #wal_receiver_status_interval = 10s	# send replies at least this often
! 					# 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
  
  
  #------------------------------------------------------------------------------
--- 236,246 ----
  					# when reading streaming WAL;
  					# -1 allows indefinite delay
  #wal_receiver_status_interval = 10s	# send replies at least this often
! 					# in seconds; 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
+ #wal_receiver_replication_timeout = 60s	# in milliseconds; 0 disables; time 
+ 					# till receiver waits for communication from master.
  
  
  #------------------------------------------------------------------------------
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 19,24 ****
--- 19,25 ----
  
  extern int	wal_receiver_status_interval;
  extern bool hot_standby_feedback;
+ extern int wal_receiver_replication_timeout;
  
  /*
   * MAXCONNINFO: maximum size of a connection string.
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 26,31 **** extern bool wake_wal_senders;
--- 26,33 ----
  /* user-settable parameters */
  extern int	max_wal_senders;
  extern int	replication_timeout;
+ extern int	wal_send_status_interval;
+ 
  
  extern void WalSenderMain(void) __attribute__((noreturn));
  extern void WalSndSignals(void);
#16Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit kapila (#15)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 21.09.2012 14:18, Amit kapila wrote:

On Tuesday, September 18, 2012 6:02 PM Fujii Masao wrote:
On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila<amit.kapila@huawei.com> wrote:

Approach-2 :
Provide a variable wal_send_status_interval, such that if it is 0, then
the current behavior prevails, and if it is non-zero then a KeepAlive
message is sent at least once within that interval.
The modified code of WalSndLoop will be as follows:

<snip>

Which way do you think is better, or do you have any other idea for handling this?

I think #2 is better because it's more intuitive to a user.

Please find a patch attached for implementation of Approach-2.

Hmm, I think we need to step back a bit. I've never liked the way
replication_timeout works, where it's the user's responsibility to set
wal_receiver_status_interval < replication_timeout. It's not very
user-friendly. I'd rather not copy that same design to this walreceiver
timeout. If there's two different timeouts like that, it's even worse,
because it's easy to confuse the two.

So let's think how this should ideally work from a user's point of view.
I think there should be just two settings: walsender_timeout and
walreceiver_timeout. walsender_timeout specifies how long a walsender
will keep a connection open if it doesn't hear from the walreceiver, and
walreceiver_timeout is the same for walreceiver. The system should
figure out itself how often to send keepalive messages so that those
timeouts are not reached.

In walsender, after half of walsender_timeout has elapsed and we haven't
received anything from the client, the walsender process should send a
"ping" message to the client. Whenever the client receives a Ping, it
replies. The walreceiver does the same; when half of walreceiver_timeout
has elapsed, send a Ping message to the server. Each Ping-Pong roundtrip
resets the timer in both ends, regardless of which side initiated it, so
if e.g walsender_timeout < walreceiver_timeout, the client will never
have to initiate a Ping message, because walsender will always reach the
walsender_timeout/2 point first and initiate the heartbeat message.
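
The scheme described above can be sketched as a single decision function that either end could call on each wakeup. Everything here is an assumption made for illustration: the enum, the function name, the millisecond clock and the `ping_sent` flag are hypothetical and do not come from the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
	HB_WAIT,					/* nothing to do yet */
	HB_SEND_PING,				/* half the timeout elapsed: ask the peer to reply */
	HB_TIMED_OUT				/* full timeout elapsed with no message: give up */
} HeartbeatAction;

/*
 * Decide what to do given the time of the last message received from the
 * peer.  timeout_ms == 0 disables the mechanism.  ping_sent is nonzero if a
 * Ping has already been sent since the last message arrived.
 */
static HeartbeatAction
heartbeat_action(int64_t now_ms, int64_t last_recv_ms,
				 int64_t timeout_ms, int ping_sent)
{
	int64_t		idle_ms = now_ms - last_recv_ms;

	assert(timeout_ms >= 0 && idle_ms >= 0);

	if (timeout_ms == 0)
		return HB_WAIT;
	if (idle_ms >= timeout_ms)
		return HB_TIMED_OUT;
	if (idle_ms >= timeout_ms / 2 && !ping_sent)
		return HB_SEND_PING;
	return HB_WAIT;
}
```

For example, with a 40 s timeout on one side and 60 s on the other, after 20 s of silence the first side returns HB_SEND_PING while the second still returns HB_WAIT, so only the side with the shorter timeout initiates the heartbeat.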

The Ping/Pong messages don't necessarily need to be new message types,
we can use the message types we currently have, perhaps with an
additional flag attached to them, to request the other side to reply
immediately.

- Heikki

#17Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Oct 1, 2012 at 6:38 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Hmm, I think we need to step back a bit. I've never liked the way
replication_timeout works, where it's the user's responsibility to set
wal_receiver_status_interval < replication_timeout. It's not very
user-friendly. I'd rather not copy that same design to this walreceiver
timeout. If there's two different timeouts like that, it's even worse,
because it's easy to confuse the two.

I agree, but also note that wal_receiver_status_interval serves
another user-visible purpose as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Oct 1, 2012 at 7:38 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Hmm, I think we need to step back a bit. I've never liked the way
replication_timeout works, where it's the user's responsibility to set
wal_receiver_status_interval < replication_timeout. It's not very
user-friendly. I'd rather not copy that same design to this walreceiver
timeout. If there's two different timeouts like that, it's even worse,
because it's easy to confuse the two.

Agreed.

I'd like to specify the replication timeout like we do TCP keepalives, i.e.,
what about introducing something like following parameters?

walsender_keepalives_idle
walsender_keepalives_interval
walsender_keepalives_count
walreceiver_keepalives_idle
walreceiver_keepalives_interval
walreceiver_keepalives_count

I believe many users are basically familiar with TCP keepalives and how to
specify them. So I think that this approach would be intuitive to users. Also
this approach includes your proposal. If you specify

walsender_keepalives_idle = walsender_timeout / 2
walsender_keepalives_interval = -1 (disable; Ping is never sent
again if there is no reply after first Ping is sent)
walsender_keepalives_count = 1

the replication timeout works as you proposed. Of course, the downside
of this approach is that the number of parameters for the replication timeout
increases from two (replication_timeout and wal_receiver_status_interval) to six,
and those parameters are confusingly similar to the existing tcp_keepalives
parameters, which might confuse users further. One idea to solve this problem is
to use the existing tcp_keepalives parameter values for the replication timeout.

Regards,

--
Fujii Masao

#19Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#18)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Oct 1, 2012 at 12:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I believe many users are basically familiar with TCP keepalives and how to
specify them. So I think that this approach would be intuitive to users.

My experience is that many users are unfamiliar with TCP keepalives
and that when given the options they tend to do it wrong. I think a
simpler system would be better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#19)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

Excerpts from Robert Haas's message of lun oct 01 21:02:54 -0300 2012:

On Mon, Oct 1, 2012 at 12:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I believe many users are basically familiar with TCP keepalives and how to
specify them. So I think that this approach would be intuitive to users.

My experience is that many users are unfamiliar with TCP keepalives
and that when given the options they tend to do it wrong. I think a
simpler system would be better.

+1

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Amit kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#16)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, October 01, 2012 4:08 PM Heikki Linnakangas wrote:
On 21.09.2012 14:18, Amit kapila wrote:

On Tuesday, September 18, 2012 6:02 PM Fujii Masao wrote:
On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila<amit.kapila@huawei.com> wrote:

Approach-2 :
Provide a variable wal_send_status_interval, such that if it is 0, then
the current behavior prevails, and if it is non-zero then a KeepAlive
message is sent at least once within that interval.
The modified code of WalSndLoop will be as follows:

<snip>

Which way do you think is better, or do you have any other idea for handling this?

I think #2 is better because it's more intuitive to a user.

Please find a patch attached for implementation of Approach-2.

So let's think how this should ideally work from a user's point of view.
I think there should be just two settings: walsender_timeout and
walreceiver_timeout. walsender_timeout specifies how long a walsender
will keep a connection open if it doesn't hear from the walreceiver, and
walreceiver_timeout is the same for walreceiver. The system should
figure out itself how often to send keepalive messages so that those
timeouts are not reached.

This implies that we should remove wal_receiver_status_interval. Currently it is also used
for the reply message to data sent by the sender, which reports the point up to which the receiver has flushed. So if we remove this variable,
the receiver might start sending that message sooner than required.
Is that behavior okay?

In walsender, after half of walsender_timeout has elapsed and we haven't
received anything from the client, the walsender process should send a
"ping" message to the client. Whenever the client receives a Ping, it
replies. The walreceiver does the same; when half of walreceiver_timeout
has elapsed, send a Ping message to the server. Each Ping-Pong roundtrip
resets the timer in both ends, regardless of which side initiated it, so
if e.g walsender_timeout < walreceiver_timeout, the client will never
have to initiate a Ping message, because walsender will always reach the
walsender_timeout/2 point first and initiate the heartbeat message.

Just to clarify, walsender should reset its timer after it gets a reply from the receiver to the message it sent.
walreceiver should reset its timer after sending the reply to a heartbeat message.
Similarly, the timers will be reset when the receiver sends the heartbeat message.

The Ping/Pong messages don't necessarily need to be new message types,
we can use the message types we currently have, perhaps with an
additional flag attached to them, to request the other side to reply
immediately.

Can't we decide whether to send the reply immediately based on the message type, since these message types will be unique?

To clarify my understanding:
1. The heartbeat message from the walsender side will be the keepalive message ('k'), and from the walreceiver side it will be the Hot Standby feedback message ('h').
2. The reply message from the walreceiver side will be the current reply message ('r').
3. Currently there is no reply-type message from walsender, so do we need to introduce a new message for it, or can we use an existing message?
If new, do we need to send any additional information along with it? If existing, can we use the keepalive message itself as the reply message, with an additional byte
to indicate that it is a reply?

With Regards,
Amit Kapila.

#22Amit kapila
amit.kapila@huawei.com
In reply to: Robert Haas (#17)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, October 01, 2012 8:36 PM Robert Haas wrote:
On Mon, Oct 1, 2012 at 6:38 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Hmm, I think we need to step back a bit. I've never liked the way
replication_timeout works, where it's the user's responsibility to set
wal_receiver_status_interval < replication_timeout. It's not very
user-friendly. I'd rather not copy that same design to this walreceiver
timeout. If there's two different timeouts like that, it's even worse,
because it's easy to confuse the two.

I agree, but also note that wal_receiver_status_interval serves
another user-visible purpose as well.

By the above, do you mean that wal_receiver_status_interval is used for the reply to data sent by the server, indicating the point up to which the receiver has flushed data, or something else?

With Regards,
Amit Kapila.

#23Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit kapila (#21)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 02.10.2012 10:36, Amit kapila wrote:

On Monday, October 01, 2012 4:08 PM Heikki Linnakangas wrote:

So let's think how this should ideally work from a user's point of view.
I think there should be just two settings: walsender_timeout and
walreceiver_timeout. walsender_timeout specifies how long a walsender
will keep a connection open if it doesn't hear from the walreceiver, and
walreceiver_timeout is the same for walreceiver. The system should
figure out itself how often to send keepalive messages so that those
timeouts are not reached.

This implies that we should remove wal_receiver_status_interval. Currently it is also used
for the reply message to data sent by the sender, which reports the point up to which the receiver has flushed. So if we remove this variable,
the receiver might start sending that message sooner than required.
Is that behavior okay?

I guess we should keep that setting, then, so that you can get status
updates more often than would be required for heartbeat purposes.

In walsender, after half of walsender_timeout has elapsed and we haven't
received anything from the client, the walsender process should send a
"ping" message to the client. Whenever the client receives a Ping, it
replies. The walreceiver does the same; when half of walreceiver_timeout
has elapsed, send a Ping message to the server. Each Ping-Pong roundtrip
resets the timer in both ends, regardless of which side initiated it, so
if e.g. walsender_timeout < walreceiver_timeout, the client will never
have to initiate a Ping message, because walsender will always reach the
walsender_timeout/2 point first and initiate the heartbeat message.

Just to clarify, walsender should reset its timer after it gets a reply from the receiver to the message it sent.

Right.

walreceiver should reset its timer after sending the reply to a heartbeat message.
Similarly, the timers will be reset when the receiver sends the heartbeat message.

walreceiver should reset the timer when it *receives* any message from
walsender. If it sends the reply right away, I guess that's the same
thing, but I'd phrase it so that it's the reception of a message from
the other end that resets the timer.
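
As a minimal sketch of that phrasing, the receiver's timer can be modelled as a timestamp that only message reception updates, with the timeout checked separately. The struct and function names are hypothetical and the clock is a plain millisecond counter rather than PostgreSQL's TimestampTz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	int64_t		last_recv_ms;	/* when the last message from the peer arrived */
	int64_t		timeout_ms;		/* 0 disables the timeout */
} RecvTimer;

/* Called whenever any message arrives from the other end. */
static void
recv_timer_on_message(RecvTimer *t, int64_t now_ms)
{
	t->last_recv_ms = now_ms;
}

/* True once the peer has been silent for the whole timeout period. */
static bool
recv_timer_expired(const RecvTimer *t, int64_t now_ms)
{
	assert(t->timeout_ms >= 0);
	return t->timeout_ms > 0 && now_ms - t->last_recv_ms >= t->timeout_ms;
}
```

Sending a reply does not touch the timer in this sketch; if the reply goes out right after a message is received, the two events coincide anyway.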

The Ping/Pong messages don't necessarily need to be new message types,
we can use the message types we currently have, perhaps with an
additional flag attached to them, to request the other side to reply
immediately.

Can't we decide whether to send the reply immediately based on the message type, since these message types will be unique?

To clarify my understanding:
1. The heartbeat message from the walsender side will be the keepalive message ('k'), and from the walreceiver side it will be the Hot Standby feedback message ('h').
2. The reply message from the walreceiver side will be the current reply message ('r').

Yep. I wonder why we need separate message types for Hot Standby Feedback
'h' and Reply 'r', though. Seems it would be simpler to have just one
message type that includes all the fields from both messages.

3. Currently there is no reply-type message from walsender, so do we need to introduce a new message for it, or can we use an existing message?
If new, do we need to send any additional information along with it? If existing, can we use the keepalive message itself as the reply message, with an additional byte
to indicate that it is a reply?

Hmm, I think I'd prefer to use the existing Keepalive message 'k', with
an additional flag.

- Heikki

#24Amit kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#23)
1 attachment(s)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tuesday, October 02, 2012 1:56 PM Heikki Linnakangas wrote:
On 02.10.2012 10:36, Amit kapila wrote:

On Monday, October 01, 2012 4:08 PM Heikki Linnakangas wrote:

So let's think how this should ideally work from a user's point of view.
I think there should be just two settings: walsender_timeout and
walreceiver_timeout. walsender_timeout specifies how long a walsender
will keep a connection open if it doesn't hear from the walreceiver, and
walreceiver_timeout is the same for walreceiver. The system should

The Ping/Pong messages don't necessarily need to be new message types,
we can use the message types we currently have, perhaps with an
additional flag attached to them, to request the other side to reply
immediately.

Can't we decide whether to send the reply immediately based on the message type, since these message types will be unique?

To clarify my understanding:
1. The heartbeat message from the walsender side will be the keepalive message ('k'), and from the walreceiver side it will be the Hot Standby feedback message ('h').
2. The reply message from the walreceiver side will be the current reply message ('r').

Yep. I wonder why we need separate message types for Hot Standby Feedback
'h' and Reply 'r', though. Seems it would be simpler to have just one
message type that includes all the fields from both messages.

I have moved the contents of Hot Standby Feedback 'h' into Reply 'r' and now use 'h' for heart-beat purposes.

3. Currently there is no reply-type message from walsender, so do we need to introduce a new message for it, or can we use an existing message?
If new, do we need to send any additional information along with it? If existing, can we use the keepalive message itself as the reply message, with an additional byte
to indicate that it is a reply?

Hmm, I think I'd prefer to use the existing Keepalive message 'k', with an additional flag.

Okay. I have done that in the patch.

Thank you for the suggestions.
I have addressed them in the patch attached to this mail.

The following changes were made to support replication timeout in the sender as well as the receiver:

1. A new configuration parameter, wal_receiver_timeout, is added to detect a timeout in the receiver task.
2. The existing parameter replication_timeout is renamed to wal_sender_timeout.
3. The PrimaryKeepaliveMessage structure gains one more field to indicate whether the keep-alive is of type 'r' (i.e.
reply) or 'h' (i.e. heart-beat).
4. The sender now sends a keep-alive message to the standby if it has been idle for at least half of wal_sender_timeout.
In this case it sends a keep-alive of type 'h'.
5. Once the standby receives a keep-alive, it must send an immediate reply to the primary to indicate that the connection is alive.
6. The Reply message (which sends the WAL offset) and the Feedback message (which sends the oldest transaction) are merged into a single Reply message.
The structure StandbyReplyMessage therefore gains two more fields, xmin and epoch, and the StandbyHSFeedbackMessage
structure loses its xmin and epoch fields (as these are moved to StandbyReplyMessage).
7. Because of the changes in step 6, once the receiver task receives some data from the primary it sends only the Reply message.
8. The same Reply message is sent in step 5 and step 7, but in step 5 the reply is sent immediately, whereas in step 7 the reply is sent
only if wal_receiver_status_interval has elapsed (this part is the same as before).
9. Similarly to the sender, if the receiver finds itself idle for at least half of the configured wal_receiver_timeout, it sends the
hot-standby heartbeat. This heart-beat has been modified to send only sendTime.
10. Once the sender task receives a heart-beat message from the standby, it sends back the reply immediately. In this case the keep-alive message is
sent with type 'r'.
11. If no message is received from the standby even after wal_sender_timeout, the sender task treats it as a network break.
12. If no message is received from the primary even after wal_receiver_timeout, the receiver task treats it as a network break.
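
The message changes in items 3, 5, 6 and 10 can be pictured with the simplified types below. This is a sketch only: the struct names, field names and field widths are hypothetical and are not the definitions used in the patch, and it assumes that only a heart-beat ('h') keep-alive demands an immediate answer, which the list does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Keep-alive kinds, following the 'h' / 'r' naming used in the list. */
#define KEEPALIVE_HEARTBEAT 'h'
#define KEEPALIVE_REPLY     'r'

/* Hypothetical keep-alive from the primary, with the extra kind field (item 3). */
typedef struct
{
	int64_t		wal_end;		/* current end of WAL on the primary */
	int64_t		send_time;		/* primary's clock when the message was sent */
	char		kind;			/* KEEPALIVE_HEARTBEAT or KEEPALIVE_REPLY */
} KeepaliveSketch;

/* Hypothetical merged standby reply: WAL positions plus xmin and epoch (item 6). */
typedef struct
{
	int64_t		write_pos;
	int64_t		flush_pos;
	int64_t		apply_pos;
	int64_t		send_time;
	uint32_t	xmin;
	uint32_t	epoch;
} ReplySketch;

/*
 * A heart-beat asks the peer to answer at once; a keep-alive that is itself
 * a reply does not, which avoids an endless exchange of answers.
 */
static bool
must_reply_immediately(const KeepaliveSketch *msg)
{
	assert(msg->kind == KEEPALIVE_HEARTBEAT || msg->kind == KEEPALIVE_REPLY);
	return msg->kind == KEEPALIVE_HEARTBEAT;
}
```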

With Regards,
Amit Kapila.

Attachments:

replication_timeout_patch_v3.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2152,2161 **** SET ENABLE_SEQSCAN TO OFF;
         </listitem>
        </varlistentry>
  
!      <varlistentry id="guc-replication-timeout" xreflabel="replication_timeout">
!       <term><varname>replication_timeout</varname> (<type>integer</type>)</term>
        <indexterm>
!        <primary><varname>replication_timeout</> configuration parameter</primary>
        </indexterm>
        <listitem>
         <para>
--- 2152,2161 ----
         </listitem>
        </varlistentry>
  
!      <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout">
!       <term><varname>wal_sender_timeout</varname> (<type>integer</type>)</term>
        <indexterm>
!        <primary><varname>wal_sender_timeout</> configuration parameter</primary>
        </indexterm>
        <listitem>
         <para>
***************
*** 2167,2177 **** SET ENABLE_SEQSCAN TO OFF;
          the <filename>postgresql.conf</> file or on the server command line.
          The default value is 60 seconds.
         </para>
         <para>
!         To prevent connections from being terminated prematurely,
!         <xref linkend="guc-wal-receiver-status-interval">
!         must be enabled on the standby, and its value must be less than the
!         value of <varname>replication_timeout</>.
         </para>
        </listitem>
       </varlistentry>
--- 2167,2189 ----
          the <filename>postgresql.conf</> file or on the server command line.
          The default value is 60 seconds.
         </para>
+       </listitem>
+      </varlistentry>
+ 	 
+ 	 <varlistentry id="guc-wal-receiver-timeout" xreflabel="wal_receiver_timeout">
+       <term><varname>wal_receiver_timeout</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>wal_receiver_timeout</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
         <para>
!         Terminate replication connections that are inactive longer
!         than the specified number of milliseconds. This is useful for
!         the receiving standby server to detect a primary node crash or network outage.
!         A value of zero disables the timeout mechanism.  This parameter
!         can only be set in
!         the <filename>postgresql.conf</> file or on the server command line.
!         The default value is 60 seconds.
         </para>
        </listitem>
       </varlistentry>
***************
*** 2390,2400 **** SET ENABLE_SEQSCAN TO OFF;
         the <filename>postgresql.conf</> file or on the server command line.
         The default value is 10 seconds.
        </para>
-       <para>
-        When <xref linkend="guc-replication-timeout"> is enabled on a sending server,
-        <varname>wal_receiver_status_interval</> must be enabled, and its value
-        must be less than the value of <varname>replication_timeout</>.
-       </para>
        </listitem>
       </varlistentry>
  
--- 2402,2407 ----
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 38,43 ****
--- 38,44 ----
  #include <signal.h>
  #include <unistd.h>
  
+ #include "access/transam.h"
  #include "access/xlog_internal.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
***************
*** 62,67 **** walrcv_connect_type walrcv_connect = NULL;
--- 63,70 ----
  walrcv_receive_type walrcv_receive = NULL;
  walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
+ int			wal_receiver_timeout = 60 * 1000;	/* maximum time to receive one
+ 												 * WAL data message */
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
***************
*** 121,127 **** static void WalRcvDie(int code, Datum arg);
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(bool dying);
! static void XLogWalRcvSendReply(void);
  static void XLogWalRcvSendHSFeedback(void);
  static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
  
--- 124,130 ----
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(bool dying);
! static void XLogWalRcvSendReply(bool sendImmediate);
  static void XLogWalRcvSendHSFeedback(void);
  static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
  
***************
*** 174,179 **** WalReceiverMain(void)
--- 177,185 ----
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
  
+ 	TimestampTz last_recv_timestamp;
+ 	TimestampTz timeout = 0;
+ 
  	/*
  	 * WalRcv should be set up already (if we are a backend, we inherit this
  	 * by fork() or EXEC_BACKEND mechanism from the postmaster).
***************
*** 282,287 **** WalReceiverMain(void)
--- 288,296 ----
  	MemSet(&reply_message, 0, sizeof(reply_message));
  	MemSet(&feedback_message, 0, sizeof(feedback_message));
  
+ 	/* Initialize the last recv timestamp */
+ 	last_recv_timestamp = GetCurrentTimestamp();
+ 
  	/* Loop until end-of-streaming or error */
  	for (;;)
  	{
***************
*** 316,330 **** WalReceiverMain(void)
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
  				XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Let the master know that we received some data. */
! 			XLogWalRcvSendReply();
  
  			/*
  			 * If we've written some records, flush them to disk and let the
--- 325,347 ----
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
+ 			/* Something is received from master, so reset last receive time*/
+ 			last_recv_timestamp = GetCurrentTimestamp();
+ 			
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
+ 			{
+ 				/* Something is received from master, so reset last receive time*/
+ 				last_recv_timestamp = GetCurrentTimestamp();
+ 				
  				XLogWalRcvProcessMsg(type, buf, len);
+ 			}
  
  			/* Let the master know that we received some data. */
! 			XLogWalRcvSendReply(false);
  
  			/*
  			 * If we've written some records, flush them to disk and let the
***************
*** 334,345 **** WalReceiverMain(void)
  		}
  		else
  		{
! 			/*
! 			 * We didn't receive anything new, but send a status update to the
! 			 * master anyway, to report any progress in applying WAL.
! 			 */
! 			XLogWalRcvSendReply();
! 			XLogWalRcvSendHSFeedback();
  		}
  	}
  }
--- 351,380 ----
  		}
  		else
  		{
! 			/* Check if time since last receive from standby has reached the configured limit
! 			 * No need to check if it is disabled by giving value as 0*/
! 			if (wal_receiver_timeout > 0)
! 			{
! 				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
! 														  wal_receiver_timeout);
! 
! 				if (GetCurrentTimestamp() >= timeout)
! 				{
! 					ereport(ERROR,
! 						(errmsg("Could not receive any message from WalSender for configured timeout period")));
! 				}
! 
! 				/*
! 				 * We didn't receive anything new, for half of receiver replication timeout.
! 				 */
! 				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
! 														  (wal_receiver_timeout/2));
! 
! 				if (GetCurrentTimestamp() >= timeout)
! 				{
! 					XLogWalRcvSendHSFeedback();
! 				}								
! 			}		
  		}
  	}
  }
***************
*** 460,465 **** XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 495,506 ----
  				memcpy(&keepalive, buf, sizeof(PrimaryKeepaliveMessage));
  
  				ProcessWalSndrMessage(keepalive.walEnd, keepalive.sendTime);
+ 
+ 				/* For heart-beat message from primary, send an immediate reply*/
+ 				if (keepalive.msgType != 'r')
+ 				{
+ 					XLogWalRcvSendReply(true);
+ 				}				
  				break;
  			}
  		default:
***************
*** 610,636 **** XLogWalRcvFlush(bool dying)
  		/* Also let the master know that we made some progress */
  		if (!dying)
  		{
! 			XLogWalRcvSendReply();
! 			XLogWalRcvSendHSFeedback();
  		}
  	}
  }
  
  /*
!  * Send reply message to primary, indicating our current XLOG positions and
!  * the current time.
   */
  static void
! XLogWalRcvSendReply(void)
  {
  	char		buf[sizeof(StandbyReplyMessage) + 1];
  	TimestampTz now;
  
  	/*
  	 * If the user doesn't want status to be reported to the master, be sure
  	 * to exit before doing anything at all.
  	 */
! 	if (wal_receiver_status_interval <= 0)
  		return;
  
  	/* Get current timestamp. */
--- 651,684 ----
  		/* Also let the master know that we made some progress */
  		if (!dying)
  		{
! 			XLogWalRcvSendReply(false);
  		}
  	}
  }
  
  /*
!  * Send reply message to primary, indicating our current XLOG positions, oldest
!  * xmin and the current time.
!  * The parameter sendImmediate is used to decide whether the reply has to be 
!  *  send immediately or after wal_receiver_status_interval.
!  * If the reply is getting sent because of heart-beat from primary, then this param
!  * should be true otherwise false.
   */
  static void
! XLogWalRcvSendReply(bool sendImmediate)
  {
  	char		buf[sizeof(StandbyReplyMessage) + 1];
  	TimestampTz now;
+ 	TransactionId nextXid;
+ 	uint32		nextEpoch;
+ 	TransactionId xmin;
+ 	
  
  	/*
  	 * If the user doesn't want status to be reported to the master, be sure
  	 * to exit before doing anything at all.
  	 */
! 	if (!sendImmediate && wal_receiver_status_interval <= 0)
  		return;
  
  	/* Get current timestamp. */
***************
*** 645,651 **** XLogWalRcvSendReply(void)
  	 * this is only for reporting purposes and only on idle systems, that's
  	 * probably OK.
  	 */
! 	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  									   wal_receiver_status_interval * 1000))
--- 693,700 ----
  	 * this is only for reporting purposes and only on idle systems, that's
  	 * probably OK.
  	 */
! 	if (!sendImmediate
! 		&& XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  									   wal_receiver_status_interval * 1000))
***************
*** 654,660 **** XLogWalRcvSendReply(void)
  	/* Construct a new message */
  	reply_message.write = LogstreamResult.Write;
  	reply_message.flush = LogstreamResult.Flush;
! 	reply_message.apply = GetXLogReplayRecPtr(NULL);
  	reply_message.sendTime = now;
  
  	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
--- 703,709 ----
  	/* Construct a new message */
  	reply_message.write = LogstreamResult.Write;
  	reply_message.flush = LogstreamResult.Flush;
! 	reply_message.apply = GetXLogReplayRecPtr(NULL);	
  	reply_message.sendTime = now;
  
  	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
***************
*** 662,667 **** XLogWalRcvSendReply(void)
--- 711,749 ----
  		 (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
  		 (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
  
+ 	if (hot_standby_feedback && HotStandbyActive())
+ 	{
+ 		/*
+ 		 * Make the expensive call to get the oldest xmin once we are certain
+ 		 * everything else has been checked.
+ 		 */
+ 		xmin = GetOldestXmin(true, false);
+ 
+ 		/*
+ 		 * Get epoch and adjust if nextXid and oldestXmin are different sides of
+ 		 * the epoch boundary.
+ 		 */
+ 		GetNextXidAndEpoch(&nextXid, &nextEpoch);
+ 		if (nextXid < xmin)
+ 			nextEpoch--;
+ 
+ 		/*
+ 		 * Always send feedback message.
+ 		 */
+ 		reply_message.xmin = xmin;
+ 		reply_message.epoch = nextEpoch;
+ 
+ 		elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
+ 			 reply_message.xmin,
+ 			 reply_message.epoch);	
+ 	}
+ 	else
+ 	{
+ 		/* Mark xmin invalid so that while primary node process this message it
+ 		  * can find out that xmin and epoch are not sent as part of reply*/
+ 		reply_message.xmin = InvalidTransactionId;
+ 	}
+ 
  	/* Prepend with the message type and send it. */
  	buf[0] = 'r';
  	memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
***************
*** 681,734 **** XLogWalRcvSendHSFeedback(void)
  	uint32		nextEpoch;
  	TransactionId xmin;
  
- 	/*
- 	 * If the user doesn't want status to be reported to the master, be sure
- 	 * to exit before doing anything at all.
- 	 */
- 	if (wal_receiver_status_interval <= 0 || !hot_standby_feedback)
- 		return;
- 
  	/* Get current timestamp. */
  	now = GetCurrentTimestamp();
  
  	/*
- 	 * Send feedback at most once per wal_receiver_status_interval.
- 	 */
- 	if (!TimestampDifferenceExceeds(feedback_message.sendTime, now,
- 									wal_receiver_status_interval * 1000))
- 		return;
- 
- 	/*
- 	 * If Hot Standby is not yet active there is nothing to send. Check this
- 	 * after the interval has expired to reduce number of calls.
- 	 */
- 	if (!HotStandbyActive())
- 		return;
- 
- 	/*
- 	 * Make the expensive call to get the oldest xmin once we are certain
- 	 * everything else has been checked.
- 	 */
- 	xmin = GetOldestXmin(true, false);
- 
- 	/*
- 	 * Get epoch and adjust if nextXid and oldestXmin are different sides of
- 	 * the epoch boundary.
- 	 */
- 	GetNextXidAndEpoch(&nextXid, &nextEpoch);
- 	if (nextXid < xmin)
- 		nextEpoch--;
- 
- 	/*
  	 * Always send feedback message.
  	 */
  	feedback_message.sendTime = now;
- 	feedback_message.xmin = xmin;
- 	feedback_message.epoch = nextEpoch;
- 
- 	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
- 		 feedback_message.xmin,
- 		 feedback_message.epoch);
  
  	/* Prepend with the message type and send it. */
  	buf[0] = 'h';
--- 763,775 ----
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 80,86 **** bool		am_cascading_walsender = false;		/* Am I cascading WAL to
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
! int			replication_timeout = 60 * 1000;	/* maximum time to send one
  												 * WAL data message */
  /*
   * State for WalSndWakeupRequest
--- 80,86 ----
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
! int			wal_sender_timeout = 60 * 1000;	/* maximum time to send one
  												 * WAL data message */
  /*
   * State for WalSndWakeupRequest
***************
*** 132,142 **** static void WalSndKill(int code, Datum arg);
  static void XLogSend(char *msgbuf, bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
! static void ProcessStandbyMessage(void);
  static void ProcessStandbyReplyMessage(void);
! static void ProcessStandbyHSFeedbackMessage(void);
! static void ProcessRepliesIfAny(void);
! static void WalSndKeepalive(char *msgbuf);
  
  
  /* Main entry point for walsender process */
--- 132,142 ----
  static void XLogSend(char *msgbuf, bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
! static void ProcessStandbyMessage(char *output_message);
  static void ProcessStandbyReplyMessage(void);
! static void ProcessStandbyTransInfo(StandbyReplyMessage reply);
! static void ProcessRepliesIfAny(char *output_message);
! static void WalSndKeepalive(char *msgbuf, bool isReply);
  
  
  /* Main entry point for walsender process */
***************
*** 507,513 **** HandleReplicationCommand(const char *cmd_string)
   * Check if the remote end has closed the connection.
   */
  static void
! ProcessRepliesIfAny(void)
  {
  	unsigned char firstchar;
  	int			r;
--- 507,513 ----
   * Check if the remote end has closed the connection.
   */
  static void
! ProcessRepliesIfAny(char *output_message)
  {
  	unsigned char firstchar;
  	int			r;
***************
*** 537,543 **** ProcessRepliesIfAny(void)
  				 * 'd' means a standby reply wrapped in a CopyData packet.
  				 */
  			case 'd':
! 				ProcessStandbyMessage();
  				received = true;
  				break;
  
--- 537,543 ----
  				 * 'd' means a standby reply wrapped in a CopyData packet.
  				 */
  			case 'd':
! 				ProcessStandbyMessage(output_message);
  				received = true;
  				break;
  
***************
*** 566,572 **** ProcessRepliesIfAny(void)
   * Process a status update message received from standby.
   */
  static void
! ProcessStandbyMessage(void)
  {
  	char		msgtype;
  
--- 566,572 ----
   * Process a status update message received from standby.
   */
  static void
! ProcessStandbyMessage(char *output_message)
  {
  	char		msgtype;
  
***************
*** 595,601 **** ProcessStandbyMessage(void)
  			break;
  
  		case 'h':
! 			ProcessStandbyHSFeedbackMessage();
  			break;
  
  		default:
--- 595,605 ----
  			break;
  
  		case 'h':
! 			/* 
! 			  * Once this message is received, we should send an immediate reply 
! 			  * to standby node in order to indicate connection is alive
! 			  */
! 			WalSndKeepalive(output_message, true);
  			break;
  
  		default:
***************
*** 607,613 **** ProcessStandbyMessage(void)
  }
  
  /*
!  * Regular reply from standby advising of WAL positions on standby server.
   */
  static void
  ProcessStandbyReplyMessage(void)
--- 611,617 ----
  }
  
  /*
!  * Regular reply from standby advising of WAL positions and transaction on standby server.
   */
  static void
  ProcessStandbyReplyMessage(void)
***************
*** 636,659 **** ProcessStandbyReplyMessage(void)
  		SpinLockRelease(&walsnd->mutex);
  	}
  
  	if (!am_cascading_walsender)
  		SyncRepReleaseWaiters();
  }
  
  /*
!  * Hot Standby feedback
   */
  static void
! ProcessStandbyHSFeedbackMessage(void)
  {
- 	StandbyHSFeedbackMessage reply;
  	TransactionId nextXid;
  	uint32		nextEpoch;
  
- 	/* Decipher the reply message */
- 	pq_copymsgbytes(&reply_message, (char *) &reply,
- 					sizeof(StandbyHSFeedbackMessage));
- 
  	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
  		 reply.xmin,
  		 reply.epoch);
--- 640,660 ----
  		SpinLockRelease(&walsnd->mutex);
  	}
  
+ 	ProcessStandbyTransInfo(reply);
+ 
  	if (!am_cascading_walsender)
  		SyncRepReleaseWaiters();
  }
  
  /*
!  * Stores the transaction info received from standby
   */
  static void
! ProcessStandbyTransInfo(StandbyReplyMessage reply)
  {
  	TransactionId nextXid;
  	uint32		nextEpoch;
  
  	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
  		 reply.xmin,
  		 reply.epoch);
***************
*** 765,771 **** WalSndLoop(void)
  		}
  
  		/* Check for input from the client */
! 		ProcessRepliesIfAny();
  
  		/*
  		 * If we don't have any pending data in the output buffer, try to send
--- 766,772 ----
  		}
  
  		/* Check for input from the client */
! 		ProcessRepliesIfAny(output_message);
  
  		/*
  		 * If we don't have any pending data in the output buffer, try to send
***************
*** 835,856 **** WalSndLoop(void)
  			wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
  				WL_SOCKET_READABLE | WL_TIMEOUT;
  
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (MyWalSnd->sendKeepalive)
  			{
! 				WalSndKeepalive(output_message);
  				/* Try to flush pending output to the client */
  				if (pq_flush_if_writable() != 0)
  					break;
  			}
  
  			/* Determine time until replication timeout */
! 			if (replication_timeout > 0)
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
! 													  replication_timeout);
! 				sleeptime = 1 + (replication_timeout / 10);
  			}
  
  			/* Sleep until something happens or replication timeout */
--- 836,867 ----
  			wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
  				WL_SOCKET_READABLE | WL_TIMEOUT;
  
+ 			/*
+ 			  * Check if half of wal_sender_timeout has lapsed without receiving any reply from standby
+ 			  * then send a keep-alive message to standby to detect standby status
+ 			  */
+ 			timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ 												  (wal_sender_timeout/2));
+ 			/* 
+ 			  * send keepalive message if sendkeepalive is enabled or WAL send status 
+ 			  * interval is greater than zero.
+ 			  */
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (MyWalSnd->sendKeepalive || (wal_sender_timeout > 0 && GetCurrentTimestamp() >= timeout))
  			{
! 				WalSndKeepalive(output_message, false);
  				/* Try to flush pending output to the client */
  				if (pq_flush_if_writable() != 0)
  					break;
  			}
  
  			/* Determine time until replication timeout */
! 			if (wal_sender_timeout > 0)
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
! 													  wal_sender_timeout);
! 				sleeptime = 1 + (wal_sender_timeout / 10);
  			}
  
  			/* Sleep until something happens or replication timeout */
***************
*** 862,868 **** WalSndLoop(void)
  			 * possibility that the client replied just as we reached the
  			 * timeout ... he's supposed to reply *before* that.
  			 */
! 			if (replication_timeout > 0 &&
  				GetCurrentTimestamp() >= timeout)
  			{
  				/*
--- 873,879 ----
  			 * possibility that the client replied just as we reached the
  			 * timeout ... he's supposed to reply *before* that.
  			 */
! 			if (wal_sender_timeout > 0 &&
  				GetCurrentTimestamp() >= timeout)
  			{
  				/*
***************
*** 1617,1630 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
  	return (Datum) 0;
  }
  
  static void
! WalSndKeepalive(char *msgbuf)
  {
  	PrimaryKeepaliveMessage keepalive_message;
  
  	/* Construct a new message */
  	keepalive_message.walEnd = sentPtr;
  	keepalive_message.sendTime = GetCurrentTimestamp();
  
  	elog(DEBUG2, "sending replication keepalive");
  
--- 1628,1655 ----
  	return (Datum) 0;
  }
  
+ /* 
+   * This function is used to send keepalive message to standby.
+   * Depending on the parameter isReply value as true or false,
+   * it sends reply or heart-beat message to standby respectively
+   */
+  
  static void
! WalSndKeepalive(char *msgbuf, bool isReply)
  {
  	PrimaryKeepaliveMessage keepalive_message;
  
  	/* Construct a new message */
  	keepalive_message.walEnd = sentPtr;
  	keepalive_message.sendTime = GetCurrentTimestamp();
+ 	if (isReply)
+ 	{
+ 		keepalive_message.msgType = 'r';
+ 	}
+ 	else
+ 	{
+ 		keepalive_message.msgType = 'h';
+ 	}
  
  	elog(DEBUG2, "sending replication keepalive");
  
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1596,1601 **** static struct config_int ConfigureNamesInt[] =
--- 1596,1612 ----
  	},
  
  	{
+ 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ 			gettext_noop("Sets the maximum wait time to receive data from master."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&wal_receiver_timeout,
+ 		60 * 1000, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
  			gettext_noop("Sets the maximum number of concurrent connections."),
  			NULL
***************
*** 2019,2030 **** static struct config_int ConfigureNamesInt[] =
  	},
  
  	{
! 		{"replication_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
  			GUC_UNIT_MS
  		},
! 		&replication_timeout,
  		60 * 1000, 0, INT_MAX,
  		NULL, NULL, NULL
  	},
--- 2030,2041 ----
  	},
  
  	{
! 		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
  			GUC_UNIT_MS
  		},
! 		&wal_sender_timeout,
  		60 * 1000, 0, INT_MAX,
  		NULL, NULL, NULL
  	},
***************
*** 2381,2387 **** static struct config_int ConfigureNamesInt[] =
  		1024, 100, 102400,
  		NULL, NULL, NULL
  	},
! 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
--- 2392,2398 ----
  		1024, 100, 102400,
  		NULL, NULL, NULL
  	},
! 		
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 210,216 ****
  #max_wal_senders = 0		# max number of walsender processes
  				# (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
! #replication_timeout = 60s	# in milliseconds; 0 disables
  
  # - Master Server -
  
--- 210,216 ----
  #max_wal_senders = 0		# max number of walsender processes
  				# (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
! #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
  # - Master Server -
  
***************
*** 234,242 ****
  					# when reading streaming WAL;
  					# -1 allows indefinite delay
  #wal_receiver_status_interval = 10s	# send replies at least this often
! 					# 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
  
  
  #------------------------------------------------------------------------------
--- 234,244 ----
  					# when reading streaming WAL;
  					# -1 allows indefinite delay
  #wal_receiver_status_interval = 10s	# send replies at least this often
! 					# in seconds; 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
+ #wal_receiver_timeout = 60s	# in milliseconds; 0 disables; time 
+ 					# till receiver waits for communication from master.
  
  
  #------------------------------------------------------------------------------
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 27,32 **** typedef struct
--- 27,37 ----
  
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
+ 
+ 	/* This is the message type. If its type is r i.e. reply then it overrides the keepalive
+ 	 * and it serves the purpose of reply for heart-beat message from standby.
+ 	 */
+ 	char msgType; 
  } WalSndrMessage;
  
  
***************
*** 78,83 **** typedef struct
--- 83,102 ----
  	XLogRecPtr	flush;
  	XLogRecPtr	apply;
  
+ 	/* 
+ 	  * Transaction information (Earlier these field were used to be with HS Feedback 
+ 	  * message but now both of the messages are combined so combining structure
+ 	  * members also
+ 	  */
+ 	/*
+ 	 * The current xmin and epoch from the standby, for Hot Standby feedback.
+ 	 * This may be invalid if the standby-side does not support feedback, or
+ 	 * Hot Standby is not yet available.
+ 	 */	  
+ 	TransactionId xmin;
+ 	uint32		epoch;
+ 	  
+ 
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
  } StandbyReplyMessage;
***************
*** 90,103 **** typedef struct
   */
  typedef struct
  {
- 	/*
- 	 * The current xmin and epoch from the standby, for Hot Standby feedback.
- 	 * This may be invalid if the standby-side does not support feedback, or
- 	 * Hot Standby is not yet available.
- 	 */
- 	TransactionId xmin;
- 	uint32		epoch;
- 
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
  } StandbyHSFeedbackMessage;
--- 109,114 ----
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 19,24 ****
--- 19,25 ----
  
  extern int	wal_receiver_status_interval;
  extern bool hot_standby_feedback;
+ extern int wal_receiver_timeout;
  
  /*
   * MAXCONNINFO: maximum size of a connection string.
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 25,31 **** extern bool wake_wal_senders;
  
  /* user-settable parameters */
  extern int	max_wal_senders;
! extern int	replication_timeout;
  
  extern void WalSenderMain(void) __attribute__((noreturn));
  extern void WalSndSignals(void);
--- 25,31 ----
  
  /* user-settable parameters */
  extern int	max_wal_senders;
! extern int	wal_sender_timeout;
  
  extern void WalSenderMain(void) __attribute__((noreturn));
  extern void WalSndSignals(void);
#25Amit Kapila
amit.kapila@huawei.com
In reply to: Amit kapila (#24)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

-----Original Message-----
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Amit kapila
Sent: Thursday, October 04, 2012 3:43 PM
To: Heikki Linnakangas
Cc: Fujii Masao; pgsql-bugs@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tuesday, October 02, 2012 1:56 PM Heikki Linnakangas wrote:
On 02.10.2012 10:36, Amit kapila wrote:

On Monday, October 01, 2012 4:08 PM Heikki Linnakangas wrote:

So let's think how this should ideally work from a user's point of view.

I think there should be just two settings: walsender_timeout and
walreceiver_timeout. walsender_timeout specifies how long a
walsender will keep a connection open if it doesn't hear from the

Thank you for the suggestions.
I have addressed them in the patch attached to this mail.

The following changes were made to support a replication timeout in the sender
as well as the receiver:

Testing Done for the Patch
--------------------------------
1. Verified the values of the new configuration parameter and the renamed
configuration parameter using the SHOW command (both SHOW for the specific
parameter and SHOW ALL).
2. Verified the new configuration parameter in --describe-config.
3. Verified the new name of the existing parameter replication_timeout in
--describe-config.
4. Start the primary and standby nodes with the default timeouts and leave
them idle for some time.
The connection should not error out with a network-break error.
5. a. Start the primary and standby nodes with the default timeouts, then
bring down the network.
b. Both sender and receiver should detect the network breakdown at almost
the same time.
c. Once the network is up again, the connection should be re-established
successfully.
6. a. Start the primary and standby nodes with wal_sender_timeout less than
wal_receiver_timeout, then bring down the network.
b. The sender should detect the network breakdown before the receiver
task.
c. Once the network is up again, the connection should be re-established
successfully.
7. a. Start the primary and standby nodes with wal_receiver_timeout less than
wal_sender_timeout, then bring down the network.
b. The receiver should detect the network breakdown before the sender
task.
c. Once the network is up again, the connection should be re-established
successfully.
8. a. In test case 6, change the value of wal_receiver_status_interval to
more than wal_receiver_timeout, and hence more than
wal_sender_timeout.
b. Then bring the network down.
c. The sender task should detect the network breakdown once
wal_sender_timeout has elapsed.
d. Once the network is up again, the connection should be re-established
successfully.
The intent of this test is to check that detection of a network break via
wal_sender_timeout does not depend on wal_receiver_status_interval.

All the above tests passed.
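
To make the expectation in the last test concrete, here is a rough sketch of the sender-side timing, written as a millisecond-step simulation. It is not PostgreSQL code: the function sender_detect_time_ms and all of its parameters are hypothetical names used only for illustration. It assumes, as in the attached patch, that the sender sends a keepalive once half of wal_sender_timeout has passed without a reply, and that a reachable standby answers that keepalive immediately.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical simulation (not PostgreSQL code).
 *
 * Returns the time in ms at which the sender gives up on the connection,
 * or -1 if it never does within horizon_ms.
 *
 * sender_timeout_ms   stands in for wal_sender_timeout
 * status_interval_ms  stands in for wal_receiver_status_interval (0 disables)
 * network_down_at_ms  time at which the network breaks (-1: never)
 */
long
sender_detect_time_ms(long sender_timeout_ms, long status_interval_ms,
                      long network_down_at_ms, long horizon_ms)
{
    long last_reply_ms = 0;
    long now_ms;

    assert(sender_timeout_ms > 0);

    for (now_ms = 0; now_ms <= horizon_ms; now_ms++)
    {
        bool network_up = (network_down_at_ms < 0 ||
                           now_ms < network_down_at_ms);

        /* Full timeout without any reply: terminate the connection. */
        if (now_ms - last_reply_ms >= sender_timeout_ms)
            return now_ms;

        /* Periodic status reply from the standby, if enabled and reachable. */
        if (network_up && status_interval_ms > 0 && now_ms > 0 &&
            now_ms % status_interval_ms == 0)
            last_reply_ms = now_ms;

        /*
         * Half the timeout without a reply: the sender sends a keepalive.
         * Assumption: a reachable standby replies to it immediately.
         */
        if (network_up && now_ms - last_reply_ms >= sender_timeout_ms / 2)
            last_reply_ms = now_ms;
    }

    return -1;
}
```

With a 1000 ms timeout and a status interval far larger than the timeout, the simulated sender never times out on a healthy network, and after a break at 5000 ms it gives up within one full timeout of the last reply it saw.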

With Regards,
Amit Kapila.

#26Robert Haas
robertmhaas@gmail.com
In reply to: Amit kapila (#24)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thu, Oct 4, 2012 at 6:12 AM, Amit kapila <amit.kapila@huawei.com> wrote:

1. One new configuration parameter wal_receiver_timeout is added to detect timeout at receiver task.
2. Existing parameter replication_timeout is renamed to wal_sender_timeout.

-1 from me on a backward compatibility break here. I don't know what
else to call the new GUC (replication_server_timeout?) but I'm not
excited about breaking existing conf files, nor do I particularly like
the proposed new names.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Amit Kapila
amit.kapila@huawei.com
In reply to: Robert Haas (#26)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, October 08, 2012 7:38 PM Robert Haas wrote:
On Thu, Oct 4, 2012 at 6:12 AM, Amit kapila <amit.kapila@huawei.com> wrote:

1. One new configuration parameter wal_receiver_timeout is added to detect timeout at receiver task.
2. Existing parameter replication_timeout is renamed to wal_sender_timeout.

-1 from me on a backward compatibility break here. I don't know what
else to call the new GUC (replication_server_timeout?) but I'm not
excited about breaking existing conf files, nor do I particularly like
the proposed new names.

How about the following:
1. replication_client_timeout -- shouldn't it be "client", as the new
configuration parameter is for the WAL receiver?
2. replication_standby_timeout

If we introduce a new parameter for the WAL receiver, wouldn't
replication_timeout be confusing for the user?

With Regards,
Amit Kapila.

#28Robert Haas
robertmhaas@gmail.com
In reply to: Noname (#1)
Re: [HACKERS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Oct 8, 2012 at 10:42 AM, Amit Kapila <amit.kapila@huawei.com> wrote:

How about following:
1. replication_client_timeout -- shouldn't it be client as new configuration
is for wal receiver
2. replication_standby_timeout

ISTM that the client and the standby are the same thing.

If we introduce a new parameter for wal receiver, wouldn't
replication_timeout be confusing for user?

Maybe. I actually don't think that I understand what problem we're
trying to solve here. If the connection between the master and the
standby is lost, shouldn't the standby realize that it's no longer
receiving keepalives from the master and terminate the connection? I
thought I had tested this at some point and it was working, so either
it's subsequently gotten broken again or the scenario you're talking
about is different in some way that I don't currently understand.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Amit Kapila
amit.kapila@huawei.com
In reply to: Robert Haas (#28)
Re: [HACKERS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tuesday, October 09, 2012 6:00 PM Robert Haas wrote:

On Mon, Oct 8, 2012 at 10:42 AM, Amit Kapila <amit.kapila@huawei.com>
wrote:

How about following:
1. replication_client_timeout -- shouldn't it be client as new configuration
is for wal receiver
2. replication_standby_timeout

ISTM that the client and the standby are the same thing.

Yes, they are the same, but one of them (replication_standby_timeout) may be
easier for users to understand.

If we introduce a new parameter for wal receiver, wouldn't
replication_timeout be confusing for user?

Maybe.

I actually don't think that I understand what problem we're
trying to solve here. If the connection between the master and the
standby is lost, shouldn't the standby realize that it's no longer
receiving keepalives from the master and terminate the connection?

For the wal receiver, a keepalive is just another kind of message. When it
finds that it has not received any message, it sends a reply/feedback message
to the master once wal_receiver_status_interval has elapsed.
So the wal receiver sends a reply after every wal_receiver_status_interval,
but the socket send still doesn't fail. It fails only after many send calls,
probably because send() keeps accumulating data until the socket's internal
buffer is full, even though the other side has not received the data.
That is why we decided to introduce a timeout parameter in the wal receiver,
similar to the one we currently have in walsender.
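
The buffering behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not walreceiver code: it assumes a POSIX system and uses a Unix-domain socketpair() as a stand-in for the TCP replication connection, and the function name count_sends_before_buffer_full is hypothetical. The peer never calls recv(), yet send() keeps succeeding until the kernel buffer is full.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count how many send() calls succeed on a connected socket whose peer
 * never calls recv(). Returns -1 on setup failure or unexpected error.
 *
 * Assumption: a Unix-domain socket pair stands in for the TCP connection
 * between walreceiver and walsender; real TCP buffer sizes will differ.
 */
long
count_sends_before_buffer_full(size_t msg_size)
{
    int         fds[2];
    char        msg[256];
    long        sent = 0;
    int         flags;

    assert(msg_size > 0 && msg_size <= sizeof(msg));
    memset(msg, 'r', sizeof(msg));

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* Non-blocking, so a full buffer reports EAGAIN instead of hanging. */
    flags = fcntl(fds[0], F_GETFL, 0);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (;;)
    {
        ssize_t     n = send(fds[0], msg, msg_size, 0);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;          /* kernel buffer is finally full */
            sent = -1;          /* unexpected error */
            break;
        }
        sent++;                 /* accepted by the kernel, not by the peer */
    }

    close(fds[0]);
    close(fds[1]);
    return sent;
}
```

Every successful call only means the kernel accepted the bytes, so a sender that writes one small status message per interval can go a long time before it sees any error.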

I
thought I had tested this at some point and it was working, so either
it's subsequently gotten broken again or the scenario you're talking
about is different in some way that I don't currently understand.

The standby takes much longer, around 15 minutes, to detect the failure,
whereas the master detects it much sooner, in 2-3 minutes, and even the
master detects it mainly because of the timeout functionality in the wal
sender.

With Regards,
Amit Kapila.

#30Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit kapila (#24)
1 attachment(s)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 04.10.2012 13:12, Amit kapila wrote:

Following changes are done to support replication timeout in sender as well as receiver:

1. One new configuration parameter wal_receiver_timeout is added to detect timeout at receiver task.
2. Existing parameter replication_timeout is renamed to wal_sender_timeout.

Ok. The other option would be to have just one GUC, I'm open to
bikeshedding on this one. On one hand, there's no reason the timeouts
have to be the same, so it would be nice to have separate settings, but on
the other hand, I can't imagine a case where a single setting wouldn't
work just as well.

3. Now PrimaryKeepaliveMessage structure is modified to add one more field to indicate whether keep-alive is of type 'r' (i.e.
reply) or 'h' (i.e. heart-beat).
4. Now the keep-alive message from sender will be sent to standby if it was idle for more than or equal to half of wal_sender_timeout.
In this case it will send keep-alive of type 'h'.
5. Once the standby receives a keep-alive, it needs to send an immediate reply to primary to indicate connection is alive.
6. Now Reply message to send wal offset and Feedback message to send oldest transaction are merged into single Reply message.
So now the structure StandbyReplyMessage is changed to add two more fields as xmin and epoch. Also StandbyHSFeedbackMessage
structure is changed to remove xmin and epoch fields (as these are moved to StandbyReplyMessage).
7. Because of changes as in step-6, once receiver task receives some data from primary then it will only send Reply Message.

Oh I see. That's not what I meant by combining the keep-alive and hs
feedback messages, I imagined that the heartbeats would *also* use the
same message type. Ie. there would be only a single message type from
standby to primary, used for:

1. updating the receive/apply pointer
2. HS feedback
3. for pinging the server when wal_receiver_timeout is approaching
4. to reply to pings from the server.

Since we didn't quite achieve that, it seems best to leave out this merging
of reply and HS feedback message types, to keep the patch small. We
might still want to do that, but better do that as a separate patch.

8. Same Reply message is sent in step-5 and step-7, but in case of step-5 the reply is sent immediately, whereas in case of step-7 the reply is sent
if wal_receiver_status_interval has lapsed (this part is same as earlier).
9. Similar to sender, if receiver finds itself idle for more than or equal to half of configured wal_receiver_timeout, then it will send the
hot-standby heartbeat. This heart-beat has been modified to send only sendTime.
10. Once sender task receives heart-beat message from standby then it sends back the reply immediately. In this case the keep-alive message is
sent of type 'r'.
11. If even after wal_sender_timeout no message received from standby then it will be considered as network break at sender task.
12. If even after wal_receiver_timeout no message received from primary then it will be considered as network break at receiver task.

Attached is an updated patch. I reverted the merging of message types
and fixed a bunch of cosmetic issues. There was one bug: in the main
loop of walreceiver, you send the "ping" message on every wakeup after
enough time has passed since last reception. That means that if the
server doesn't reply promptly, you send a new ping message every 100 ms
(NAPTIME_PER_CYCLE), until it gets a reply. Walsender had the same
issue, but it was not quite as severe there because the naptime was
longer. Fixed that.

How does this look now?

- Heikki

Attachments:

replication_timeout_patch_v4_heikki.patch (text/x-diff)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2236,2245 **** include 'filename'
         </listitem>
        </varlistentry>
  
!      <varlistentry id="guc-replication-timeout" xreflabel="replication_timeout">
!       <term><varname>replication_timeout</varname> (<type>integer</type>)</term>
        <indexterm>
!        <primary><varname>replication_timeout</> configuration parameter</primary>
        </indexterm>
        <listitem>
         <para>
--- 2236,2245 ----
         </listitem>
        </varlistentry>
  
!      <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout">
!       <term><varname>wal_sender_timeout</varname> (<type>integer</type>)</term>
        <indexterm>
!        <primary><varname>wal_sender_timeout</> configuration parameter</primary>
        </indexterm>
        <listitem>
         <para>
***************
*** 2251,2262 **** include 'filename'
          the <filename>postgresql.conf</> file or on the server command line.
          The default value is 60 seconds.
         </para>
-        <para>
-         To prevent connections from being terminated prematurely,
-         <xref linkend="guc-wal-receiver-status-interval">
-         must be enabled on the standby, and its value must be less than the
-         value of <varname>replication_timeout</>.
-        </para>
        </listitem>
       </varlistentry>
  
--- 2251,2256 ----
***************
*** 2474,2484 **** include 'filename'
         the <filename>postgresql.conf</> file or on the server command line.
         The default value is 10 seconds.
        </para>
-       <para>
-        When <xref linkend="guc-replication-timeout"> is enabled on a sending server,
-        <varname>wal_receiver_status_interval</> must be enabled, and its value
-        must be less than the value of <varname>replication_timeout</>.
-       </para>
        </listitem>
       </varlistentry>
  
--- 2468,2473 ----
***************
*** 2507,2512 **** include 'filename'
--- 2496,2520 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-wal-receiver-timeout" xreflabel="wal_receiver_timeout">
+       <term><varname>wal_receiver_timeout</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>wal_receiver_timeout</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Terminate replication connections that are inactive longer
+         than the specified number of milliseconds. This is useful for
+         the receiving standby server to detect a primary node crash or network
+         outage.
+         A value of zero disables the timeout mechanism.  This parameter
+         can only be set in
+         the <filename>postgresql.conf</> file or on the server command line.
+         The default value is 60 seconds.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       </variablelist>
      </sect2>
     </sect1>
*** a/doc/src/sgml/release-9.1.sgml
--- b/doc/src/sgml/release-9.1.sgml
***************
*** 3322,3328 ****
       <listitem>
        <para>
         Add
!        <link linkend="guc-replication-timeout"><varname>replication_timeout</></link>
         setting (Fujii Masao, Heikki Linnakangas)
        </para>
  
--- 3322,3328 ----
       <listitem>
        <para>
         Add
!        <varname>replication_timeout</>
         setting (Fujii Masao, Heikki Linnakangas)
        </para>
  
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 55,60 ****
--- 55,61 ----
  
  /* GUC variables */
  int			wal_receiver_status_interval;
+ int			wal_receiver_timeout;
  bool		hot_standby_feedback;
  
  /* libpqreceiver hooks to these when loaded */
***************
*** 121,127 **** static void WalRcvDie(int code, Datum arg);
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(bool dying);
! static void XLogWalRcvSendReply(void);
  static void XLogWalRcvSendHSFeedback(void);
  static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
  
--- 122,128 ----
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(bool dying);
! static void XLogWalRcvSendReply(bool force, bool requestReply);
  static void XLogWalRcvSendHSFeedback(void);
  static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
  
***************
*** 170,178 **** WalReceiverMain(void)
  {
  	char		conninfo[MAXCONNINFO];
  	XLogRecPtr	startpoint;
- 
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
  
  	/*
  	 * WalRcv should be set up already (if we are a backend, we inherit this
--- 171,180 ----
  {
  	char		conninfo[MAXCONNINFO];
  	XLogRecPtr	startpoint;
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
+ 	TimestampTz last_recv_timestamp;
+ 	bool		ping_sent;
  
  	/*
  	 * WalRcv should be set up already (if we are a backend, we inherit this
***************
*** 282,287 **** WalReceiverMain(void)
--- 284,293 ----
  	MemSet(&reply_message, 0, sizeof(reply_message));
  	MemSet(&feedback_message, 0, sizeof(feedback_message));
  
+ 	/* Initialize the last recv timestamp */
+ 	last_recv_timestamp = GetCurrentTimestamp();
+ 	ping_sent = false;
+ 
  	/* Loop until end-of-streaming or error */
  	for (;;)
  	{
***************
*** 316,330 **** WalReceiverMain(void)
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
  				XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Let the master know that we received some data. */
! 			XLogWalRcvSendReply();
  
  			/*
  			 * If we've written some records, flush them to disk and let the
--- 322,344 ----
  		/* Wait a while for data to arrive */
  		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
  		{
+ 			/* Something was received from master, so reset timeout */
+ 			last_recv_timestamp = GetCurrentTimestamp();
+ 			ping_sent = false;
+ 
  			/* Accept the received data, and process it */
  			XLogWalRcvProcessMsg(type, buf, len);
  
  			/* Receive any more data we can without sleeping */
  			while (walrcv_receive(0, &type, &buf, &len))
+ 			{
+ 				last_recv_timestamp = GetCurrentTimestamp();
+ 				ping_sent = false;
  				XLogWalRcvProcessMsg(type, buf, len);
+ 			}
  
  			/* Let the master know that we received some data. */
! 			XLogWalRcvSendReply(false, false);
  
  			/*
  			 * If we've written some records, flush them to disk and let the
***************
*** 335,344 **** WalReceiverMain(void)
  		else
  		{
  			/*
! 			 * We didn't receive anything new, but send a status update to the
! 			 * master anyway, to report any progress in applying WAL.
  			 */
! 			XLogWalRcvSendReply();
  			XLogWalRcvSendHSFeedback();
  		}
  	}
--- 349,396 ----
  		else
  		{
  			/*
! 			 * We didn't receive anything new. If we haven't heard anything
! 			 * from the server for more than wal_receiver_timeout / 2,
! 			 * ping the server. Also, if it's been longer than
! 			 * wal_receiver_status_interval since the last update we sent,
! 			 * send a status update to the master anyway, to report any
! 			 * progress in applying WAL.
! 			 */
! 			bool requestReply = false;
! 
! 			/*
! 			 * Check if time since last receive from master has reached the
! 			 * configured limit.
  			 */
! 			if (wal_receiver_timeout > 0)
! 			{
! 				TimestampTz now = GetCurrentTimestamp();
! 				TimestampTz timeout;
! 
! 				timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
! 													  wal_receiver_timeout);
! 
! 				if (now >= timeout)
! 					ereport(ERROR,
! 							(errmsg("terminating walreceiver due to timeout")));
! 
! 				/*
! 				 * We didn't receive anything new, for half of receiver
! 				 * replication timeout. Ping the server.
! 				 */
! 				if (!ping_sent)
! 				{
! 					timeout = TimestampTzPlusMilliseconds(last_recv_timestamp,
! 														  (wal_receiver_timeout/2));
! 					if (now >= timeout)
! 					{
! 						requestReply = true;
! 						ping_sent = true;
! 					}
! 				}
! 			}
! 
! 			XLogWalRcvSendReply(requestReply, requestReply);
  			XLogWalRcvSendHSFeedback();
  		}
  	}
***************
*** 460,465 **** XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 512,521 ----
  				memcpy(&keepalive, buf, sizeof(PrimaryKeepaliveMessage));
  
  				ProcessWalSndrMessage(keepalive.walEnd, keepalive.sendTime);
+ 
+ 				/* If the primary requested a reply, send one immediately */
+ 				if (keepalive.replyRequested)
+ 					XLogWalRcvSendReply(true, false);
  				break;
  			}
  		default:
***************
*** 609,627 **** XLogWalRcvFlush(bool dying)
  
  		/* Also let the master know that we made some progress */
  		if (!dying)
! 		{
! 			XLogWalRcvSendReply();
! 			XLogWalRcvSendHSFeedback();
! 		}
  	}
  }
  
  /*
!  * Send reply message to primary, indicating our current XLOG positions and
!  * the current time.
   */
  static void
! XLogWalRcvSendReply(void)
  {
  	char		buf[sizeof(StandbyReplyMessage) + 1];
  	TimestampTz now;
--- 665,688 ----
  
  		/* Also let the master know that we made some progress */
  		if (!dying)
! 			XLogWalRcvSendReply(false, false);
  	}
  }
  
  /*
!  * Send reply message to primary, indicating our current XLOG positions, oldest
!  * xmin and the current time.
!  *
!  * If 'force' is not true, the message is not sent unless enough time has
!  * passed since last status update to reach wal_receiver_status_interval (or
!  * if wal_receiver_status_interval is disabled altogether).
!  *
!  * If 'requestReply' is true, requests the server to reply immediately upon
!  * receiving this message. This is used for heartbeats, when approaching
!  * wal_receiver_timeout.
   */
  static void
! XLogWalRcvSendReply(bool force, bool requestReply)
  {
  	char		buf[sizeof(StandbyReplyMessage) + 1];
  	TimestampTz now;
***************
*** 630,636 **** XLogWalRcvSendReply(void)
  	 * If the user doesn't want status to be reported to the master, be sure
  	 * to exit before doing anything at all.
  	 */
! 	if (wal_receiver_status_interval <= 0)
  		return;
  
  	/* Get current timestamp. */
--- 691,697 ----
  	 * If the user doesn't want status to be reported to the master, be sure
  	 * to exit before doing anything at all.
  	 */
! 	if (!force && wal_receiver_status_interval <= 0)
  		return;
  
  	/* Get current timestamp. */
***************
*** 645,651 **** XLogWalRcvSendReply(void)
  	 * this is only for reporting purposes and only on idle systems, that's
  	 * probably OK.
  	 */
! 	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  									   wal_receiver_status_interval * 1000))
--- 706,713 ----
  	 * this is only for reporting purposes and only on idle systems, that's
  	 * probably OK.
  	 */
! 	if (!force
! 		&& XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  									   wal_receiver_status_interval * 1000))
***************
*** 656,661 **** XLogWalRcvSendReply(void)
--- 718,724 ----
  	reply_message.flush = LogstreamResult.Flush;
  	reply_message.apply = GetXLogReplayRecPtr(NULL);
  	reply_message.sendTime = now;
+ 	reply_message.replyRequested = requestReply;
  
  	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
  		 (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 82,88 **** static bool	replication_started = false; /* Started streaming yet? */
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
! int			replication_timeout = 60 * 1000;	/* maximum time to send one
  												 * WAL data message */
  /*
   * State for WalSndWakeupRequest
--- 82,88 ----
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
! int			wal_sender_timeout = 60 * 1000;	/* maximum time to send one
  												 * WAL data message */
  /*
   * State for WalSndWakeupRequest
***************
*** 103,117 **** static uint32 sendOff = 0;
   */
  static XLogRecPtr sentPtr = 0;
  
  /*
!  * Buffer for processing reply messages.
   */
! static StringInfoData reply_message;
  
  /*
   * Timestamp of the last receipt of the reply from the standby.
   */
  static TimestampTz last_reply_timestamp;
  
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
--- 103,122 ----
   */
  static XLogRecPtr sentPtr = 0;
  
+ /* Buffer for processing reply messages. */
+ static StringInfoData reply_message;
  /*
!  * Buffer for constructing outgoing messages
!  * (1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE bytes)
   */
! static char *output_message;
  
  /*
   * Timestamp of the last receipt of the reply from the standby.
   */
  static TimestampTz last_reply_timestamp;
+ /* Have we sent a heartbeat message asking for reply, since last reply? */
+ static bool	ping_sent = false;
  
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
***************
*** 126,139 **** static void WalSndLastCycleHandler(SIGNAL_ARGS);
  static void WalSndLoop(void) __attribute__((noreturn));
  static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
! static void XLogSend(char *msgbuf, bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
  static void ProcessStandbyMessage(void);
  static void ProcessStandbyReplyMessage(void);
  static void ProcessStandbyHSFeedbackMessage(void);
  static void ProcessRepliesIfAny(void);
! static void WalSndKeepalive(char *msgbuf);
  
  
  /* Initialize walsender process before entering the main command loop */
--- 131,144 ----
  static void WalSndLoop(void) __attribute__((noreturn));
  static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
! static void XLogSend(bool *caughtup);
  static void IdentifySystem(void);
  static void StartReplication(StartReplicationCmd *cmd);
  static void ProcessStandbyMessage(void);
  static void ProcessStandbyReplyMessage(void);
  static void ProcessStandbyHSFeedbackMessage(void);
  static void ProcessRepliesIfAny(void);
! static void WalSndKeepalive(bool requestReply);
  
  
  /* Initialize walsender process before entering the main command loop */
***************
*** 465,471 **** ProcessRepliesIfAny(void)
--- 470,479 ----
  	 * Save the last reply timestamp if we've received at least one reply.
  	 */
  	if (received)
+ 	{
  		last_reply_timestamp = GetCurrentTimestamp();
+ 		ping_sent = false;
+ 	}
  }
  
  /*
***************
*** 527,532 **** ProcessStandbyReplyMessage(void)
--- 535,544 ----
  		 (uint32) (reply.flush >> 32), (uint32) reply.flush,
  		 (uint32) (reply.apply >> 32), (uint32) reply.apply);
  
+ 	/* Send a reply if the standby requested one. */
+ 	if (reply.replyRequested)
+ 		WalSndKeepalive(false);
+ 
  	/*
  	 * Update shared state for this WalSender process based on reply data from
  	 * standby.
***************
*** 620,626 **** ProcessStandbyHSFeedbackMessage(void)
  static void
  WalSndLoop(void)
  {
- 	char	   *output_message;
  	bool		caughtup = false;
  
  	/*
--- 632,637 ----
***************
*** 638,643 **** WalSndLoop(void)
--- 649,655 ----
  
  	/* Initialize the last reply timestamp */
  	last_reply_timestamp = GetCurrentTimestamp();
+ 	ping_sent = false;
  
  	/* Loop forever, unless we get an error */
  	for (;;)
***************
*** 672,678 **** WalSndLoop(void)
  		 * caught up.
  		 */
  		if (!pq_is_send_pending())
! 			XLogSend(output_message, &caughtup);
  		else
  			caughtup = false;
  
--- 684,690 ----
  		 * caught up.
  		 */
  		if (!pq_is_send_pending())
! 			XLogSend(&caughtup);
  		else
  			caughtup = false;
  
***************
*** 708,714 **** WalSndLoop(void)
  			if (walsender_ready_to_stop)
  			{
  				/* ... let's just be real sure we're caught up ... */
! 				XLogSend(output_message, &caughtup);
  				if (caughtup && !pq_is_send_pending())
  				{
  					/* Inform the standby that XLOG streaming is done */
--- 720,726 ----
  			if (walsender_ready_to_stop)
  			{
  				/* ... let's just be real sure we're caught up ... */
! 				XLogSend(&caughtup);
  				if (caughtup && !pq_is_send_pending())
  				{
  					/* Inform the standby that XLOG streaming is done */
***************
*** 738,760 **** WalSndLoop(void)
  
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (MyWalSnd->sendKeepalive)
  			{
! 				WalSndKeepalive(output_message);
! 				/* Try to flush pending output to the client */
! 				if (pq_flush_if_writable() != 0)
! 					break;
  			}
  
  			/* Determine time until replication timeout */
! 			if (replication_timeout > 0)
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
! 													  replication_timeout);
! 				sleeptime = 1 + (replication_timeout / 10);
  			}
  
! 			/* Sleep until something happens or replication timeout */
  			ImmediateInterruptOK = true;
  			CHECK_FOR_INTERRUPTS();
  			WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
--- 750,783 ----
  
  			if (pq_is_send_pending())
  				wakeEvents |= WL_SOCKET_WRITEABLE;
! 			else if (wal_sender_timeout > 0 && !ping_sent)
  			{
! 				/*
! 				 * If half of wal_sender_timeout has lapsed without receiving
! 				 * any reply from standby, send a keep-alive message to standby
! 				 * requesting an immediate reply.
! 				 */
! 				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
! 													  wal_sender_timeout / 2);
! 				if (GetCurrentTimestamp() >= timeout)
! 				{
! 					WalSndKeepalive(true);
! 					ping_sent = true;
! 					/* Try to flush pending output to the client */
! 					if (pq_flush_if_writable() != 0)
! 						break;
! 				}
  			}
  
  			/* Determine time until replication timeout */
! 			if (wal_sender_timeout > 0)
  			{
  				timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
! 													  wal_sender_timeout);
! 				sleeptime = 1 + (wal_sender_timeout / 10);
  			}
  
! 			/* Sleep until something happens or we time out */
  			ImmediateInterruptOK = true;
  			CHECK_FOR_INTERRUPTS();
  			WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
***************
*** 766,773 **** WalSndLoop(void)
  			 * possibility that the client replied just as we reached the
  			 * timeout ... he's supposed to reply *before* that.
  			 */
! 			if (replication_timeout > 0 &&
! 				GetCurrentTimestamp() >= timeout)
  			{
  				/*
  				 * Since typically expiration of replication timeout means
--- 789,795 ----
  			 * possibility that the client replied just as we reached the
  			 * timeout ... he's supposed to reply *before* that.
  			 */
! 			if (wal_sender_timeout > 0 && GetCurrentTimestamp() >= timeout)
  			{
  				/*
  				 * Since typically expiration of replication timeout means
***************
*** 1016,1030 **** retry:
   * but not yet sent to the client, and buffer it in the libpq output
   * buffer.
   *
-  * msgbuf is a work area in which the output message is constructed.  It's
-  * passed in just so we can avoid re-palloc'ing the buffer on each cycle.
-  * It must be of size 1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE.
-  *
   * If there is no unsent WAL remaining, *caughtup is set to true, otherwise
   * *caughtup is set to false.
   */
  static void
! XLogSend(char *msgbuf, bool *caughtup)
  {
  	XLogRecPtr	SendRqstPtr;
  	XLogRecPtr	startptr;
--- 1038,1048 ----
   * but not yet sent to the client, and buffer it in the libpq output
   * buffer.
   *
   * If there is no unsent WAL remaining, *caughtup is set to true, otherwise
   * *caughtup is set to false.
   */
  static void
! XLogSend(bool *caughtup)
  {
  	XLogRecPtr	SendRqstPtr;
  	XLogRecPtr	startptr;
***************
*** 1107,1119 **** XLogSend(char *msgbuf, bool *caughtup)
  	/*
  	 * OK to read and send the slice.
  	 */
! 	msgbuf[0] = 'w';
  
  	/*
  	 * Read the log directly into the output buffer to avoid extra memcpy
  	 * calls.
  	 */
! 	XLogRead(msgbuf + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
  
  	/*
  	 * We fill the message header last so that the send timestamp is taken as
--- 1125,1137 ----
  	/*
  	 * OK to read and send the slice.
  	 */
! 	output_message[0] = 'w';
  
  	/*
  	 * Read the log directly into the output buffer to avoid extra memcpy
  	 * calls.
  	 */
! 	XLogRead(output_message + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
  
  	/*
  	 * We fill the message header last so that the send timestamp is taken as
***************
*** 1123,1131 **** XLogSend(char *msgbuf, bool *caughtup)
  	msghdr.walEnd = SendRqstPtr;
  	msghdr.sendTime = GetCurrentTimestamp();
  
! 	memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
  
! 	pq_putmessage_noblock('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
  
  	sentPtr = endptr;
  
--- 1141,1149 ----
  	msghdr.walEnd = SendRqstPtr;
  	msghdr.sendTime = GetCurrentTimestamp();
  
! 	memcpy(output_message + 1, &msghdr, sizeof(WalDataMessageHeader));
  
! 	pq_putmessage_noblock('d', output_message, 1 + sizeof(WalDataMessageHeader) + nbytes);
  
  	sentPtr = endptr;
  
***************
*** 1492,1512 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
  	return (Datum) 0;
  }
  
  static void
! WalSndKeepalive(char *msgbuf)
  {
  	PrimaryKeepaliveMessage keepalive_message;
  
  	/* Construct a new message */
  	keepalive_message.walEnd = sentPtr;
  	keepalive_message.sendTime = GetCurrentTimestamp();
  
  	elog(DEBUG2, "sending replication keepalive");
  
  	/* Prepend with the message type and send it. */
! 	msgbuf[0] = 'k';
! 	memcpy(msgbuf + 1, &keepalive_message, sizeof(PrimaryKeepaliveMessage));
! 	pq_putmessage_noblock('d', msgbuf, sizeof(PrimaryKeepaliveMessage) + 1);
  }
  
  /*
--- 1510,1536 ----
  	return (Datum) 0;
  }
  
+ /*
+   * This function is used to send keepalive message to standby.
+   * If requestReply is set, sets a flag in the message requesting the standby
+   * to send a message back to us, for heartbeat purposes.
+   */
  static void
! WalSndKeepalive(bool requestReply)
  {
  	PrimaryKeepaliveMessage keepalive_message;
  
  	/* Construct a new message */
  	keepalive_message.walEnd = sentPtr;
  	keepalive_message.sendTime = GetCurrentTimestamp();
+ 	keepalive_message.replyRequested = requestReply;
  
  	elog(DEBUG2, "sending replication keepalive");
  
  	/* Prepend with the message type and send it. */
! 	output_message[0] = 'k';
! 	memcpy(output_message + 1, &keepalive_message, sizeof(PrimaryKeepaliveMessage));
! 	pq_putmessage_noblock('d', output_message, sizeof(PrimaryKeepaliveMessage) + 1);
  }
  
  /*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1596,1601 **** static struct config_int ConfigureNamesInt[] =
--- 1596,1612 ----
  	},
  
  	{
+ 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ 			gettext_noop("Sets the maximum wait time to receive data from master."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&wal_receiver_timeout,
+ 		60 * 1000, 0, INT_MAX,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
  			gettext_noop("Sets the maximum number of concurrent connections."),
  			NULL
***************
*** 2019,2030 **** static struct config_int ConfigureNamesInt[] =
  	},
  
  	{
! 		{"replication_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
  			GUC_UNIT_MS
  		},
! 		&replication_timeout,
  		60 * 1000, 0, INT_MAX,
  		NULL, NULL, NULL
  	},
--- 2030,2041 ----
  	},
  
  	{
! 		{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
  			gettext_noop("Sets the maximum time to wait for WAL replication."),
  			NULL,
  			GUC_UNIT_MS
  		},
! 		&wal_sender_timeout,
  		60 * 1000, 0, INT_MAX,
  		NULL, NULL, NULL
  	},
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 210,216 ****
  #max_wal_senders = 0		# max number of walsender processes
  				# (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
! #replication_timeout = 60s	# in milliseconds; 0 disables
  
  # - Master Server -
  
--- 210,216 ----
  #max_wal_senders = 0		# max number of walsender processes
  				# (change requires restart)
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
! #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
  # - Master Server -
  
***************
*** 237,242 ****
--- 237,245 ----
  					# 0 disables
  #hot_standby_feedback = off		# send info from standby to prevent
  					# query conflicts
+ #wal_receiver_timeout = 60s		# time that receiver waits for
+ 					# communication from master
+ 					# in milliseconds; 0 disables
  
  
  #------------------------------------------------------------------------------
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 27,32 **** typedef struct
--- 27,38 ----
  
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
+ 
+ 	/*
+ 	 * If replyRequested is set, the client should reply immediately to this
+ 	 * message, to avoid a timeout disconnect.
+ 	 */
+ 	bool		replyRequested;
  } WalSndrMessage;
  
  
***************
*** 80,85 **** typedef struct
--- 86,97 ----
  
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
+ 
+ 	/*
+ 	 * If replyRequested is set, the server should reply immediately to this
+ 	 * message, to avoid a timeout disconnect.
+ 	 */
+ 	bool		replyRequested;
  } StandbyReplyMessage;
  
  /*
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 17,23 ****
--- 17,25 ----
  #include "storage/spin.h"
  #include "pgtime.h"
  
+ /* user-settable parameters */
  extern int	wal_receiver_status_interval;
+ extern int	wal_receiver_timeout;
  extern bool hot_standby_feedback;
  
  /*
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 24,30 **** extern bool wake_wal_senders;
  
  /* user-settable parameters */
  extern int	max_wal_senders;
! extern int	replication_timeout;
  
  extern void InitWalSender(void);
  extern void exec_replication_command(const char *query_string);
--- 24,30 ----
  
  /* user-settable parameters */
  extern int	max_wal_senders;
! extern int	wal_sender_timeout;
  
  extern void InitWalSender(void);
  extern void exec_replication_command(const char *query_string);
*** a/src/include/replication/walsender_private.h
--- b/src/include/replication/walsender_private.h
***************
*** 37,43 **** typedef struct WalSnd
  	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
  	bool		needreload;		/* does currently-open file need to be
  								 * reloaded? */
- 	bool		sendKeepalive;	/* do we send keepalives on this connection? */
  
  	/*
  	 * The xlog locations that have been written, flushed, and applied by
--- 37,42 ----
#31Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#30)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wednesday, October 10, 2012 9:15 PM Heikki Linnakangas wrote:

On 04.10.2012 13:12, Amit kapila wrote:

Following changes are done to support replication timeout in sender as
well as receiver:

1. One new configuration parameter wal_receiver_timeout is added to
detect timeout at receiver task.

2. Existing parameter replication_timeout is renamed to
wal_sender_timeout.

Ok. The other option would be to have just one GUC, I'm open to
bikeshedding on this one. On one hand, there's no reason the timeouts
have to be the same, so it would be nice to have separate settings, but on
the other hand, I can't imagine a case where a single setting wouldn't
work just as well.

I think they need to be separate in a case like the following:

1. M1 (Master), S1 (Standby 1), S2 (Standby 2).
2. S1 is a standby for M1, and S2 is a standby for S1; basically a simple
case of cascaded replication.
3. M1 and S1 are on a local network, but S2 is at a geographically
different location (that is, the n/w between M1 and S1 is fast, while
the n/w between S1 and S2 is very slow).
4. In this case, the user might want to configure different timeouts for
the sender and the receiver on S1.

Attached is an updated patch. I reverted the merging of message types
and fixed a bunch of cosmetic issues. There was one bug: in the main
loop of walreceiver, you send the "ping" message on every wakeup after
enough time has passed since last reception. That means that if the
server doesn't reply promptly, you send a new ping message every 100 ms
(NAPTIME_PER_CYCLE), until it gets a reply. Walsender had the same
issue, but it was not quite as severe there because the naptime was
longer. Fixed that.
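
To illustrate the repeated-ping problem described above, here is a minimal sketch of one way to send at most one ping per quiet period. It is not taken from the patch: the LivenessState struct, the ping_sent flag, and the choice of half the timeout as the ping threshold are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness state for one connection; all names are illustrative. */
typedef struct
{
	int64_t		last_recv_ms;	/* time the last message arrived from the peer */
	int64_t		timeout_ms;		/* 0 disables the check */
	bool		ping_sent;		/* a ping is outstanding; do not send another */
} LivenessState;

typedef enum
{
	ACT_NONE,
	ACT_SEND_PING,
	ACT_TIMEOUT
} LivenessAction;

/* Called on every wakeup of the main loop (for example every 100 ms). */
static LivenessAction
liveness_check(LivenessState *st, int64_t now_ms)
{
	if (st->timeout_ms <= 0)
		return ACT_NONE;

	/* Nothing received for the whole timeout: give up on the connection. */
	if (now_ms - st->last_recv_ms >= st->timeout_ms)
		return ACT_TIMEOUT;

	/* Half the timeout has passed: ping once, then wait for a reply. */
	if (!st->ping_sent && now_ms - st->last_recv_ms >= st->timeout_ms / 2)
	{
		st->ping_sent = true;
		return ACT_SEND_PING;
	}

	return ACT_NONE;
}

/* Called whenever any message arrives from the peer. */
static void
liveness_received(LivenessState *st, int64_t now_ms)
{
	st->last_recv_ms = now_ms;
	st->ping_sent = false;
}
```

Without the ping_sent flag, every wakeup after the threshold would return ACT_SEND_PING again, which is the behaviour reported as a bug above.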

Thanks.

How does this look now?

The Patch is fine and test results are also fine.

With Regards,
Amit Kapila.

#32Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#31)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 11.10.2012 13:17, Amit Kapila wrote:

How does this look now?

The Patch is fine and test results are also fine.

Ok, thanks. Committed.

- Heikki

#33Amit kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#32)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, October 11, 2012 8:22 PM Heikki Linnakangas wrote:
On 11.10.2012 13:17, Amit Kapila wrote:

How does this look now?

The Patch is fine and test results are also fine.

Ok, thanks. Committed.

Thank you very much.

With Regards,
Amit Kapila.

#34Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#32)
1 attachment(s)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 11.10.2012 13:17, Amit Kapila wrote:

How does this look now?

The Patch is fine and test results are also fine.

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Is it worth adding the same mechanism (send back the reply immediately
if walsender requests a reply) into pg_basebackup and pg_receivexlog?

Regards,

--
Fujii Masao

Attachments:

typo.patch (application/octet-stream)
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 674,680 **** XLogWalRcvFlush(bool dying)
   * xmin and the current time.
   *
   * If 'force' is not set, the message is only sent if enough time has
!  * passed since last status update to reach wal_receiver_status_internal.
   * If wal_receiver_status_interval is disabled altogether and 'force' is
   * false, this is a no-op.
   *
--- 674,680 ----
   * xmin and the current time.
   *
   * If 'force' is not set, the message is only sent if enough time has
!  * passed since last status update to reach wal_receiver_status_interval.
   * If wal_receiver_status_interval is disabled altogether and 'force' is
   * false, this is a no-op.
   *
#35Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#34)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 13.10.2012 19:35, Fujii Masao wrote:

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

Thanks, fixed.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see
http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.

I'll look into doing that..
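
As a rough illustration of the idea above, the sketch below builds a keepalive message field by field in network byte order instead of copying a raw struct. It is a standalone example, not backend code: put_be64, get_be64 and build_keepalive are hypothetical helpers, and the field layout (type byte 'k', 64-bit end of WAL, 64-bit send time, one reply-requested byte) is assumed here for the purpose of the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a 64-bit value in network byte order (most significant byte first). */
static void
put_be64(unsigned char *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (unsigned char) (v >> (56 - 8 * i));
}

/* Read a 64-bit value stored in network byte order. */
static uint64_t
get_be64(const unsigned char *src)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | src[i];
	return v;
}

/* Type byte + walEnd + sendTime + replyRequested; no padding anywhere. */
#define KEEPALIVE_LEN (1 + 8 + 8 + 1)

/* Hypothetical helper: serialize a keepalive into buf, return its length. */
static size_t
build_keepalive(unsigned char *buf, uint64_t wal_end, int64_t send_time,
				int reply_requested)
{
	buf[0] = 'k';
	put_be64(buf + 1, wal_end);
	put_be64(buf + 9, (uint64_t) send_time);
	buf[17] = reply_requested ? 1 : 0;
	return KEEPALIVE_LEN;
}
```

Because every field is written at a fixed offset with a fixed byte order, the message is always 18 bytes and decodes identically regardless of the sender's endianness or struct padding.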

Is it worth adding the same mechanism (send back the reply immediately
if walsender requests a reply) into pg_basebackup and pg_receivexlog?

Good catch. Yes, they should be taught about this too. I'll look into
doing that too.

- Heikki

#36Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#35)
1 attachment(s)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 15.10.2012 13:13, Heikki Linnakangas wrote:

On 13.10.2012 19:35, Fujii Masao wrote:

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see
http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want to
e.g. use pg_receivexlog to back up a server that runs on a different
platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.
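
For reference, the int64 timestamp format counts microseconds since midnight on 2000-01-01. The sketch below shows the conversion from Unix time under that definition; unix_to_pg_usecs and the two constants are hypothetical names for this example, and the offset of 946684800 seconds is 10957 days (1970-01-01 to 2000-01-01) times 86400.

```c
#include <stdint.h>

/* Seconds between the Unix epoch (1970-01-01) and 2000-01-01: 10957 days. */
#define UNIX_TO_PG_EPOCH_SECS	INT64_C(946684800)
#define USECS_PER_SEC_I64		INT64_C(1000000)

/*
 * Hypothetical helper: convert a Unix time (whole seconds plus microseconds)
 * into microseconds since midnight on 2000-01-01.
 */
static int64_t
unix_to_pg_usecs(int64_t unix_secs, int32_t usecs)
{
	return (unix_secs - UNIX_TO_PG_EPOCH_SECS) * USECS_PER_SEC_I64 + usecs;
}
```

Times before 2000-01-01 simply come out negative, so the same signed 64-bit field covers both directions.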

Please review if you have the time..

- Heikki

Attachments:

make-replication-protocol-arch-independent.patch (text/x-diff)
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3d72a16..5a32517 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1366,7 +1366,8 @@ The commands accepted in walsender mode are:
       WAL data is sent as a series of CopyData messages.  (This allows
       other information to be intermixed; in particular the server can send
       an ErrorResponse message if it encounters a failure after beginning
-      to stream.)  The payload in each CopyData message follows this format:
+      to stream.)  The payload of each CopyData message from server to the
+      client contains a message of one of the following formats:
      </para>
 
      <para>
@@ -1390,34 +1391,32 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The starting point of the WAL data in this message, given in
-          XLogRecPtr format.
+          The starting point of the WAL data in this message.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The current end of WAL on the server, given in
-          XLogRecPtr format.
+          The current end of WAL on the server.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The server's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
       </para>
       </listitem>
       </varlistentry>
@@ -1445,25 +1444,12 @@ The commands accepted in walsender mode are:
        continuation records can be sent in different CopyData messages.
      </para>
      <para>
-       Note that all fields within the WAL data and the above-described header
-       will be in the sending server's native format.  Endianness, and the
-       format for the timestamp, are unpredictable unless the receiver has
-       verified that the sender's system identifier matches its own
-       <filename>pg_control</> contents.
-     </para>
-     <para>
        If the WAL sender process is terminated normally (during postmaster
        shutdown), it will send a CommandComplete message before exiting.
        This might not happen during an abnormal shutdown, of course.
      </para>
 
      <para>
-       The receiving process can send replies back to the sender at any time,
-       using one of the following message formats (also in the payload of a
-       CopyData message):
-     </para>
-
-     <para>
       <variablelist>
       <varlistentry>
       <term>
@@ -1495,12 +1481,23 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The server's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte1
+      </term>
+      <listitem>
+      <para>
+          1 means that the client should reply to this message as soon as
+          possible, to avoid a timeout disconnect. 0 otherwise.
       </para>
       </listitem>
       </varlistentry>
@@ -1512,6 +1509,12 @@ The commands accepted in walsender mode are:
      </para>
 
      <para>
+       The receiving process can send replies back to the sender at any time,
+       using one of the following message formats (also in the payload of a
+       CopyData message):
+     </para>
+
+     <para>
       <variablelist>
       <varlistentry>
       <term>
@@ -1532,45 +1535,56 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
           The location of the last WAL byte + 1 received and written to disk
-          in the standby, in XLogRecPtr format.
+          in the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
           The location of the last WAL byte + 1 flushed to disk in
-          the standby, in XLogRecPtr format.
+          the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The location of the last WAL byte + 1 applied in the standby, in
-          XLogRecPtr format.
+          The location of the last WAL byte + 1 applied in the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
+      </term>
+      <listitem>
+      <para>
+          The client's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte1
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          If 1, the client requests the server to reply to this message
+          immediately. This can be used to ping the server, to test if
+          the connection is still healthy.
       </para>
       </listitem>
       </varlistentry>
@@ -1602,28 +1616,29 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The client's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte4
+          Int32
       </term>
       <listitem>
       <para>
-          The standby's current xmin.
+          The standby's current xmin. This may be 0, if the standby does not
+          support feedback, or is not yet in Hot Standby state.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte4
+          Int32
       </term>
       <listitem>
       <para>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b1accdc..b41bbcc 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -39,9 +39,9 @@
 #include <unistd.h>
 
 #include "access/xlog_internal.h"
+#include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "replication/walprotocol.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -93,8 +93,7 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
-static StandbyReplyMessage reply_message;
-static StandbyHSFeedbackMessage feedback_message;
+static StringInfoData	reply_message;
 
 /*
  * About SIGTERM handling:
@@ -279,10 +278,9 @@ WalReceiverMain(void)
 	walrcv_connect(conninfo, startpoint);
 	DisableWalRcvImmediateExit();
 
-	/* Initialize LogstreamResult, reply_message and feedback_message */
+	/* Initialize LogstreamResult and reply_message */
 	LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
-	MemSet(&reply_message, 0, sizeof(reply_message));
-	MemSet(&feedback_message, 0, sizeof(feedback_message));
+	initStringInfo(&reply_message);
 
 	/* Initialize the last recv timestamp */
 	last_recv_timestamp = GetCurrentTimestamp();
@@ -480,41 +478,52 @@ WalRcvQuickDieHandler(SIGNAL_ARGS)
 static void
 XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 {
+	StringInfoData tmpbuf;
+	int			hdrlen;
+	XLogRecPtr	dataStart;
+	XLogRecPtr	walEnd;
+	TimestampTz	sendTime;
+	bool		replyRequested;
+
+	initStringInfo(&tmpbuf);
+
 	switch (type)
 	{
 		case 'w':				/* WAL records */
 			{
-				WalDataMessageHeader msghdr;
-
-				if (len < sizeof(WalDataMessageHeader))
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64);
+				if (len < hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
 							 errmsg_internal("invalid WAL message received from primary")));
-				/* memcpy is required here for alignment reasons */
-				memcpy(&msghdr, buf, sizeof(WalDataMessageHeader));
+				appendBinaryStringInfo(&tmpbuf, buf, hdrlen);
 
-				ProcessWalSndrMessage(msghdr.walEnd, msghdr.sendTime);
+				dataStart = pq_getmsgint64(&tmpbuf);
+				walEnd = pq_getmsgint64(&tmpbuf);
+				sendTime = IntegerTimestampToTimestampTz(pq_getmsgint64(&tmpbuf));
+				ProcessWalSndrMessage(walEnd, sendTime);
 
-				buf += sizeof(WalDataMessageHeader);
-				len -= sizeof(WalDataMessageHeader);
-				XLogWalRcvWrite(buf, len, msghdr.dataStart);
+				buf += hdrlen;
+				len -= hdrlen;
+				XLogWalRcvWrite(buf, len, dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
 			{
-				PrimaryKeepaliveMessage keepalive;
-
-				if (len != sizeof(PrimaryKeepaliveMessage))
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
 							 errmsg_internal("invalid keepalive message received from primary")));
-				/* memcpy is required here for alignment reasons */
-				memcpy(&keepalive, buf, sizeof(PrimaryKeepaliveMessage));
+				appendBinaryStringInfo(&tmpbuf, buf, hdrlen);
+				walEnd = pq_getmsgint64(&tmpbuf);
+				sendTime = IntegerTimestampToTimestampTz(pq_getmsgint64(&tmpbuf));
+				replyRequested = pq_getmsgbyte(&tmpbuf);
 
-				ProcessWalSndrMessage(keepalive.walEnd, keepalive.sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime);
 
 				/* If the primary requested a reply, send one immediately */
-				if (keepalive.replyRequested)
+				if (replyRequested)
 					XLogWalRcvSendReply(true, false);
 				break;
 			}
@@ -524,6 +533,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 					 errmsg_internal("invalid replication message type %d",
 									 type)));
 	}
+
+	pfree(tmpbuf.data);
 }
 
 /*
@@ -685,7 +696,10 @@ XLogWalRcvFlush(bool dying)
 static void
 XLogWalRcvSendReply(bool force, bool requestReply)
 {
-	char		buf[sizeof(StandbyReplyMessage) + 1];
+	static XLogRecPtr writePtr = 0;
+	static XLogRecPtr flushPtr = 0;
+	XLogRecPtr	applyPtr;
+	static TimestampTz sendTime = 0;
 	TimestampTz now;
 
 	/*
@@ -708,28 +722,33 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	 * probably OK.
 	 */
 	if (!force
-		&& XLByteEQ(reply_message.write, LogstreamResult.Write)
-		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
-		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
+		&& XLByteEQ(writePtr, LogstreamResult.Write)
+		&& XLByteEQ(flushPtr, LogstreamResult.Flush)
+		&& !TimestampDifferenceExceeds(sendTime, now,
 									   wal_receiver_status_interval * 1000))
 		return;
 
 	/* Construct a new message */
-	reply_message.write = LogstreamResult.Write;
-	reply_message.flush = LogstreamResult.Flush;
-	reply_message.apply = GetXLogReplayRecPtr(NULL);
-	reply_message.sendTime = now;
-	reply_message.replyRequested = requestReply;
+	writePtr = LogstreamResult.Write;
+	flushPtr = LogstreamResult.Flush;
+	applyPtr = GetXLogReplayRecPtr(NULL);
+	sendTime = now;
+
+	resetStringInfo(&reply_message);
+	pq_sendbyte(&reply_message, 'r');
+	pq_sendint64(&reply_message, writePtr);
+	pq_sendint64(&reply_message, flushPtr);
+	pq_sendint64(&reply_message, applyPtr);
+	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
-		 (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
-		 (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);
 
 	/* Prepend with the message type and send it. */
-	buf[0] = 'r';
-	memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
-	walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
+	walrcv_send(reply_message.data, reply_message.len);
 }
 
 /*
@@ -739,11 +758,11 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 static void
 XLogWalRcvSendHSFeedback(void)
 {
-	char		buf[sizeof(StandbyHSFeedbackMessage) + 1];
 	TimestampTz now;
 	TransactionId nextXid;
 	uint32		nextEpoch;
 	TransactionId xmin;
+	static TimestampTz sendTime = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -758,7 +777,7 @@ XLogWalRcvSendHSFeedback(void)
 	/*
 	 * Send feedback at most once per wal_receiver_status_interval.
 	 */
-	if (!TimestampDifferenceExceeds(feedback_message.sendTime, now,
+	if (!TimestampDifferenceExceeds(sendTime, now,
 									wal_receiver_status_interval * 1000))
 		return;
 
@@ -786,22 +805,24 @@ XLogWalRcvSendHSFeedback(void)
 	/*
 	 * Always send feedback message.
 	 */
-	feedback_message.sendTime = now;
-	feedback_message.xmin = xmin;
-	feedback_message.epoch = nextEpoch;
+	sendTime = now;
 
 	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 feedback_message.xmin,
-		 feedback_message.epoch);
-
-	/* Prepend with the message type and send it. */
-	buf[0] = 'h';
-	memcpy(&buf[1], &feedback_message, sizeof(StandbyHSFeedbackMessage));
-	walrcv_send(buf, sizeof(StandbyHSFeedbackMessage) + 1);
+		 xmin, nextEpoch);
+
+	/* Construct the message and send it. */
+	resetStringInfo(&reply_message);
+	pq_sendbyte(&reply_message, 'h');
+	pq_sendint(&reply_message, xmin, 4);
+	pq_sendint(&reply_message, nextEpoch, 4);
+	walrcv_send(reply_message.data, reply_message.len);
 }
 
 /*
- * Keep track of important messages from primary.
+ * Update shared memory status upon receiving a message from primary.
+ *
+ * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
+ * message, reported by primary.
  */
 static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2af38f1..adeb461 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,7 +48,6 @@
 #include "nodes/replnodes.h"
 #include "replication/basebackup.h"
 #include "replication/syncrep.h"
-#include "replication/walprotocol.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
@@ -66,6 +65,16 @@
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/*
+ * Maximum data payload in a WAL data message.	Must be >= XLOG_BLCKSZ.
+ *
+ * We don't have a good idea of what a good value would be; there's some
+ * overhead per message in both walsender and walreceiver, but on the other
+ * hand sending large batches makes walsender less responsive to signals
+ * because signals are checked only between messages.  128kB (with
+ * default 8k blocks) seems like a reasonable guess for now.
+ */
+#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
@@ -103,13 +112,10 @@ static uint32 sendOff = 0;
  */
 static XLogRecPtr sentPtr = 0;
 
-/* Buffer for processing reply messages. */
+/* Buffers for constructing outgoing messages and processing reply messages. */
+static StringInfoData output_message;
 static StringInfoData reply_message;
-/*
- * Buffer for constructing outgoing messages.
- * (1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE bytes)
- */
-static char *output_message;
+static StringInfoData tmpbuf;
 
 /*
  * Timestamp of the last receipt of the reply from the standby.
@@ -526,17 +532,24 @@ ProcessStandbyMessage(void)
 static void
 ProcessStandbyReplyMessage(void)
 {
-	StandbyReplyMessage reply;
+	XLogRecPtr	writePtr,
+				flushPtr,
+				applyPtr;
+	bool		replyRequested;
 
-	pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
+	writePtr = pq_getmsgint64(&reply_message);
+	flushPtr = pq_getmsgint64(&reply_message);
+	applyPtr = pq_getmsgint64(&reply_message);
+	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
+	replyRequested = pq_getmsgbyte(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply.write >> 32), (uint32) reply.write,
-		 (uint32) (reply.flush >> 32), (uint32) reply.flush,
-		 (uint32) (reply.apply >> 32), (uint32) reply.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);
 
 	/* Send a reply if the standby requested one. */
-	if (reply.replyRequested)
+	if (replyRequested)
 		WalSndKeepalive(false);
 
 	/*
@@ -548,9 +561,9 @@ ProcessStandbyReplyMessage(void)
 		volatile WalSnd *walsnd = MyWalSnd;
 
 		SpinLockAcquire(&walsnd->mutex);
-		walsnd->write = reply.write;
-		walsnd->flush = reply.flush;
-		walsnd->apply = reply.apply;
+		walsnd->write = writePtr;
+		walsnd->flush = flushPtr;
+		walsnd->apply = applyPtr;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -564,20 +577,21 @@ ProcessStandbyReplyMessage(void)
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	StandbyHSFeedbackMessage reply;
 	TransactionId nextXid;
 	uint32		nextEpoch;
+	TransactionId feedbackXmin;
+	uint32		feedbackEpoch;
 
 	/* Decipher the reply message */
-	pq_copymsgbytes(&reply_message, (char *) &reply,
-					sizeof(StandbyHSFeedbackMessage));
+	feedbackXmin = pq_getmsgint(&reply_message, 4);
+	feedbackEpoch = pq_getmsgint(&reply_message, 4);
 
 	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
-		 reply.xmin,
-		 reply.epoch);
+		 feedbackXmin,
+		 feedbackEpoch);
 
 	/* Ignore invalid xmin (can't actually happen with current walreceiver) */
-	if (!TransactionIdIsNormal(reply.xmin))
+	if (!TransactionIdIsNormal(feedbackXmin))
 		return;
 
 	/*
@@ -589,18 +603,18 @@ ProcessStandbyHSFeedbackMessage(void)
 	 */
 	GetNextXidAndEpoch(&nextXid, &nextEpoch);
 
-	if (reply.xmin <= nextXid)
+	if (feedbackXmin <= nextXid)
 	{
-		if (reply.epoch != nextEpoch)
+		if (feedbackEpoch != nextEpoch)
 			return;
 	}
 	else
 	{
-		if (reply.epoch + 1 != nextEpoch)
+		if (feedbackEpoch + 1 != nextEpoch)
 			return;
 	}
 
-	if (!TransactionIdPrecedesOrEquals(reply.xmin, nextXid))
+	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
 		return;					/* epoch OK, but it's wrapped around */
 
 	/*
@@ -610,9 +624,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * cleanup conflicts on the standby server.
 	 *
 	 * There is a small window for a race condition here: although we just
-	 * checked that reply.xmin precedes nextXid, the nextXid could have gotten
+	 * checked that feedbackXmin precedes nextXid, the nextXid could have gotten
 	 * advanced between our fetching it and applying the xmin below, perhaps
-	 * far enough to make reply.xmin wrap around.  In that case the xmin we
+	 * far enough to make feedbackXmin wrap around.  In that case the xmin we
 	 * set here would be "in the future" and have no effect.  No point in
 	 * worrying about this since it's too late to save the desired data
 	 * anyway.	Assuming that the standby sends us an increasing sequence of
@@ -625,7 +639,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * safe, and if we're moving it backwards, well, the data is at risk
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 */
-	MyPgXact->xmin = reply.xmin;
+	MyPgXact->xmin = feedbackXmin;
 }
 
 /* Main loop of walsender process that streams the WAL over Copy messages. */
@@ -635,17 +649,12 @@ WalSndLoop(void)
 	bool		caughtup = false;
 
 	/*
-	 * Allocate buffer that will be used for each output message.  We do this
-	 * just once to reduce palloc overhead.  The buffer must be made large
-	 * enough for maximum-sized messages.
-	 */
-	output_message = palloc(1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE);
-
-	/*
-	 * Allocate buffer that will be used for processing reply messages.  As
-	 * above, do this just once to reduce palloc overhead.
+	 * Allocate buffers that will be used for each outgoing and incoming
+	 * message.  We do this just once to reduce palloc overhead.
 	 */
+	initStringInfo(&output_message);
 	initStringInfo(&reply_message);
+	initStringInfo(&tmpbuf);
 
 	/* Initialize the last reply timestamp */
 	last_reply_timestamp = GetCurrentTimestamp();
@@ -1048,7 +1057,6 @@ XLogSend(bool *caughtup)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
-	WalDataMessageHeader msghdr;
 
 	/*
 	 * Attempt to send all data that's already been written out and fsync'd to
@@ -1125,25 +1133,32 @@ XLogSend(bool *caughtup)
 	/*
 	 * OK to read and send the slice.
 	 */
-	output_message[0] = 'w';
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'w');
+
+	pq_sendint64(&output_message, startptr);		/* dataStart */
+	pq_sendint64(&output_message, SendRqstPtr);	/* walEnd */
+	pq_sendint64(&output_message, 0);			/* sendtime, filled in later */
 
 	/*
 	 * Read the log directly into the output buffer to avoid extra memcpy
 	 * calls.
 	 */
-	XLogRead(output_message + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
+	enlargeStringInfo(&output_message, nbytes);
+	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+	output_message.len += nbytes;
+	output_message.data[output_message.len] = '\0';
 
 	/*
-	 * We fill the message header last so that the send timestamp is taken as
-	 * late as possible.
+	 * Fill the send timestamp last, so that it is taken as late as possible.
 	 */
-	msghdr.dataStart = startptr;
-	msghdr.walEnd = SendRqstPtr;
-	msghdr.sendTime = GetCurrentTimestamp();
-
-	memcpy(output_message + 1, &msghdr, sizeof(WalDataMessageHeader));
+	resetStringInfo(&tmpbuf);
+	pq_sendint64(&tmpbuf, GetCurrentIntegerTimestamp());
+	Assert(tmpbuf.len == sizeof(int64));
+	memcpy(&output_message.data[1 + sizeof(int64) + sizeof(int64)],
+		   tmpbuf.data, sizeof(int64));
 
-	pq_putmessage_noblock('d', output_message, 1 + sizeof(WalDataMessageHeader) + nbytes);
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
 
 	sentPtr = endptr;
 
@@ -1518,19 +1533,17 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 static void
 WalSndKeepalive(bool requestReply)
 {
-	PrimaryKeepaliveMessage keepalive_message;
-
-	/* Construct a new message */
-	keepalive_message.walEnd = sentPtr;
-	keepalive_message.sendTime = GetCurrentTimestamp();
-	keepalive_message.replyRequested = requestReply;
-
 	elog(DEBUG2, "sending replication keepalive");
 
-	/* Prepend with the message type and send it. */
-	output_message[0] = 'k';
-	memcpy(output_message + 1, &keepalive_message, sizeof(PrimaryKeepaliveMessage));
-	pq_putmessage_noblock('d', output_message, sizeof(PrimaryKeepaliveMessage) + 1);
+	/* construct the message... */
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, sentPtr);		/* walEnd */
+	pq_sendint64(&output_message, GetCurrentIntegerTimestamp()); /* sendTime */
+	pq_sendbyte(&output_message, requestReply ? 1 : 0);	/* replyRequested */
+
+	/* ... and send it wrapped in CopyData */
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
 }
 
 /*
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 50ef897..823bddf 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1286,6 +1286,50 @@ GetCurrentTimestamp(void)
 }
 
 /*
+ * GetCurrentIntegerTimestamp -- get the current operating system time as int64
+ *
+ * Result is the number of microseconds since the Postgres epoch. If compiled
+ * with --enable-integer-datetimes, this is identical to GetCurrentTimestamp(),
+ * and is implemented as a macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+GetCurrentIntegerTimestamp(void)
+{
+	int64 result;
+	struct timeval tp;
+
+	gettimeofday(&tp, NULL);
+
+	result = (int64) tp.tv_sec -
+		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
+
+	result = (result * USECS_PER_SEC) + tp.tv_usec;
+
+	return result;
+}
+#endif
+
+/*
+ * IntegerTimestampToTimestampTz -- convert an integer timestamp to native format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+TimestampTz
+IntegerTimestampToTimestampTz(int64 timestamp)
+{
+	TimestampTz result;
+
+	result = timestamp / USECS_PER_SEC;
+	result += (timestamp % USECS_PER_SEC) / 1000000.0;
+
+	return result;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 404ff91..041e97b 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -21,7 +21,6 @@
 #include "postgres.h"
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
-#include "replication/walprotocol.h"
 #include "utils/datetime.h"
 #include "utils/timestamp.h"
 
@@ -33,14 +32,13 @@
 #include <sys/types.h>
 #include <unistd.h>
 
-
-/* Size of the streaming replication protocol headers */
-#define STREAMING_HEADER_SIZE (1+sizeof(WalDataMessageHeader))
-#define STREAMING_KEEPALIVE_SIZE (1+sizeof(PrimaryKeepaliveMessage))
-
 /* fd for currently open WAL file */
 static int	walfile = -1;
 
+static void sendint64(int64 i, char *buf);
+static int64 recvint64(char *buf);
+static bool sendFeedback(XLogRecPtr blockpos, int64 now);
+
 
 /*
  * Open a new WAL file in the specified directory. Store the name
@@ -191,35 +189,32 @@ close_walfile(char *basedir, char *walname, bool segment_complete)
  * Local version of GetCurrentTimestamp(), since we are not linked with
  * backend code.
  */
-static TimestampTz
+static int64
 localGetCurrentTimestamp(void)
 {
-	TimestampTz result;
+	int64 result;
 	struct timeval tp;
 
 	gettimeofday(&tp, NULL);
 
-	result = (TimestampTz) tp.tv_sec -
+	result = (int64) tp.tv_sec -
 		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
 
-#ifdef HAVE_INT64_TIMESTAMP
 	result = (result * USECS_PER_SEC) + tp.tv_usec;
-#else
-	result = result + (tp.tv_usec / 1000000.0);
-#endif
 
 	return result;
 }
 
 /*
  * Local version of TimestampDifference(), since we are not
- * linked with backend code.
+ * linked with backend code. We always use integer timestamps, regardless
+ * of server setting.
  */
 static void
-localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
+localTimestampDifference(int64 start_time, int64 stop_time,
 						 long *secs, int *microsecs)
 {
-	TimestampTz diff = stop_time - start_time;
+	int64 diff = stop_time - start_time;
 
 	if (diff <= 0)
 	{
@@ -228,13 +223,8 @@ localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 	}
 	else
 	{
-#ifdef HAVE_INT64_TIMESTAMP
 		*secs = (long) (diff / USECS_PER_SEC);
 		*microsecs = (int) (diff % USECS_PER_SEC);
-#else
-		*secs = (long) diff;
-		*microsecs = (int) ((diff - *secs) * 1000000.0);
-#endif
 	}
 }
 
@@ -243,17 +233,13 @@ localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  * linked with backend code.
  */
 static bool
-localTimestampDifferenceExceeds(TimestampTz start_time,
-								TimestampTz stop_time,
+localTimestampDifferenceExceeds(int64 start_time,
+								int64 stop_time,
 								int msec)
 {
-	TimestampTz diff = stop_time - start_time;
+	int64 diff = stop_time - start_time;
 
-#ifdef HAVE_INT64_TIMESTAMP
 	return (diff >= msec * INT64CONST(1000));
-#else
-	return (diff * 1000.0 >= msec);
-#endif
 }
 
 /*
@@ -382,24 +368,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 											standby_message_timeout))
 		{
 			/* Time to send feedback! */
-			char		replybuf[sizeof(StandbyReplyMessage) + 1];
-			StandbyReplyMessage *replymsg;
-
-			replymsg = (StandbyReplyMessage *) (replybuf + 1);
-			replymsg->write = blockpos;
-			replymsg->flush = InvalidXLogRecPtr;
-			replymsg->apply = InvalidXLogRecPtr;
-			replymsg->sendTime = now;
-			replybuf[0] = 'r';
-
-			if (PQputCopyData(conn, replybuf, sizeof(replybuf)) <= 0 ||
-				PQflush(conn))
-			{
-				fprintf(stderr, _("%s: could not send feedback packet: %s"),
-						progname, PQerrorMessage(conn));
+			if (!sendFeedback(blockpos, now))
 				goto error;
-			}
-
 			last_status = now;
 		}
 
@@ -419,12 +389,11 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			FD_SET(PQsocket(conn), &input_mask);
 			if (standby_message_timeout)
 			{
-				TimestampTz targettime;
+				int64		targettime;
 				long		secs;
 				int			usecs;
 
-				targettime = TimestampTzPlusMilliseconds(last_status,
-												standby_message_timeout - 1);
+				targettime = last_status + (standby_message_timeout - 1) * ((int64) 1000);
 				localTimestampDifference(now,
 										 targettime,
 										 &secs,
@@ -474,19 +443,13 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 					progname, PQerrorMessage(conn));
 			goto error;
 		}
+
 		if (copybuf[0] == 'k')
 		{
 			/*
 			 * keepalive message, sent in 9.2 and newer. We just ignore this
 			 * message completely, but need to skip past it in the stream.
 			 */
-			if (r != STREAMING_KEEPALIVE_SIZE)
-			{
-				fprintf(stderr,
-						_("%s: keepalive message has incorrect size %d\n"),
-						progname, r);
-				goto error;
-			}
 			continue;
 		}
 		else if (copybuf[0] != 'w')
@@ -495,15 +458,19 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 					progname, copybuf[0]);
 			goto error;
 		}
+
+/* Size of the streaming replication protocol headers */
+#define STREAMING_HEADER_SIZE (1 /* msgtype */ + 8 /* dataStart */ + 8 /* walEnd*/ + 8 /* sendTime */)
+
 		if (r < STREAMING_HEADER_SIZE + 1)
 		{
 			fprintf(stderr, _("%s: streaming header too small: %d\n"),
 					progname, r);
 			goto error;
 		}
+		blockpos = recvint64(&copybuf[1]);
 
 		/* Extract WAL location for this block */
-		memcpy(&blockpos, copybuf + 1, 8);
 		xlogoff = blockpos % XLOG_SEG_SIZE;
 
 		/*
@@ -641,3 +608,73 @@ error:
 	walfile = -1;
 	return false;
 }
+
+/*
+ * Converts an int64 to network byte order.
+ */
+static void
+sendint64(int64 i, char *buf)
+{
+	uint32		n32;
+
+	/* High order half first, since we're doing MSB-first */
+	n32 = (uint32) (i >> 32);
+	n32 = htonl(n32);
+	memcpy(&buf[0], &n32, 4);
+
+	/* Now the low order half */
+	n32 = (uint32) i;
+	n32 = htonl(n32);
+	memcpy(&buf[4], &n32, 4);
+}
+
+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+recvint64(char *buf)
+{
+	int64		result;
+	uint32		h32;
+	uint32		l32;
+
+	memcpy(&h32, buf, 4);
+	memcpy(&l32, buf + 4, 4);
+	h32 = ntohl(h32);
+	l32 = ntohl(l32);
+
+	result = h32;
+	result <<= 32;
+	result |= l32;
+
+	return result;
+}
+
+/*
+ * Send a Standby Status Update message to server.
+ */
+static bool
+sendFeedback(XLogRecPtr blockpos, int64 now)
+{
+	char		replybuf[100];
+	int 		len = 0;
+
+	replybuf[len++] = 'r';
+	sendint64(blockpos, &replybuf[len]);			/* write */
+	len += 8;
+	sendint64(InvalidXLogRecPtr, &replybuf[len]);	/* flush */
+	len += 8;
+	sendint64(InvalidXLogRecPtr, &replybuf[len]);	/* apply */
+	len += 8;
+	sendint64(now, &replybuf[len]);					/* sendTime */
+	len += 8;
+
+	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
+	{
+		fprintf(stderr, _("%s: could not send feedback packet: %s"),
+				progname, PQerrorMessage(conn));
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
deleted file mode 100644
index 396d006..0000000
--- a/src/include/replication/walprotocol.h
+++ /dev/null
@@ -1,128 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * walprotocol.h
- *	  Definitions relevant to the streaming WAL transmission protocol.
- *
- * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
- *
- * src/include/replication/walprotocol.h
- *
- *-------------------------------------------------------------------------
- */
-#ifndef _WALPROTOCOL_H
-#define _WALPROTOCOL_H
-
-#include "access/xlogdefs.h"
-#include "datatype/timestamp.h"
-
-
-/*
- * All messages from WalSender must contain these fields to allow us to
- * correctly calculate the replication delay.
- */
-typedef struct
-{
-	/* Current end of WAL on the sender */
-	XLogRecPtr	walEnd;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-
-	/*
-	 * If replyRequested is set, the client should reply immediately to this
-	 * message, to avoid a timeout disconnect.
-	 */
-	bool		replyRequested;
-} WalSndrMessage;
-
-
-/*
- * Header for a WAL data message (message type 'w').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * The header is followed by actual WAL data.  Note that the data length is
- * not specified in the header --- it's just whatever remains in the message.
- *
- * walEnd and sendTime are not essential data, but are provided in case
- * the receiver wants to adjust its behavior depending on how far behind
- * it is.
- */
-typedef struct
-{
-	/* WAL start location of the data included in this message */
-	XLogRecPtr	dataStart;
-
-	/* Current end of WAL on the sender */
-	XLogRecPtr	walEnd;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-} WalDataMessageHeader;
-
-/*
- * Keepalive message from primary (message type 'k'). (lowercase k)
- * This is wrapped within a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef WalSndrMessage PrimaryKeepaliveMessage;
-
-/*
- * Reply message from standby (message type 'r').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef struct
-{
-	/*
-	 * The xlog locations that have been written, flushed, and applied by
-	 * standby-side. These may be invalid if the standby-side is unable to or
-	 * chooses not to report these.
-	 */
-	XLogRecPtr	write;
-	XLogRecPtr	flush;
-	XLogRecPtr	apply;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-
-	/*
-	 * If replyRequested is set, the server should reply immediately to this
-	 * message, to avoid a timeout disconnect.
-	 */
-	bool		replyRequested;
-} StandbyReplyMessage;
-
-/*
- * Hot Standby feedback from standby (message type 'h').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef struct
-{
-	/*
-	 * The current xmin and epoch from the standby, for Hot Standby feedback.
-	 * This may be invalid if the standby-side does not support feedback, or
-	 * Hot Standby is not yet available.
-	 */
-	TransactionId xmin;
-	uint32		epoch;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-} StandbyHSFeedbackMessage;
-
-/*
- * Maximum data payload in a WAL data message.	Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages.  128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
-#endif   /* _WALPROTOCOL_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index e7cdb41..e4ca8c1 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -206,6 +206,13 @@ extern Datum generate_series_timestamptz(PG_FUNCTION_ARGS);
 /* Internal routines (not fmgr-callable) */
 
 extern TimestampTz GetCurrentTimestamp(void);
+#ifdef HAVE_INT64_TIMESTAMP
+#define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
+#define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#else
+extern int64 GetCurrentIntegerTimestamp(void);
+extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+#endif
 
 extern void TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 					long *secs, int *microsecs);
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#36)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

On 13.10.2012 19:35, Fujii Masao wrote:

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want to e.g.
use pg_receivexlog to back up a server that runs on a different platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.
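
As a rough illustration of the wire format described above, here is a minimal, standalone sketch of encoding an int64 most-significant-byte first and of expressing a timestamp as microseconds since 2000-01-01. The helper names (put_int64_be, get_int64_be, unix_to_wire_timestamp) and the two macros are made up for this example and are not part of the patch; the epoch offset is the 10957 days between 1970-01-01 and 2000-01-01.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds between the Unix epoch (1970-01-01) and the Postgres epoch (2000-01-01). */
#define PG_EPOCH_OFFSET_SECS INT64_C(946684800)
#define USECS_PER_SECOND INT64_C(1000000)

/* Hypothetical helper: write a 64-bit integer to buf, most significant byte first. */
static void
put_int64_be(int64_t value, unsigned char *buf)
{
	uint64_t	u = (uint64_t) value;
	int			i;

	for (i = 0; i < 8; i++)
		buf[i] = (unsigned char) (u >> (56 - 8 * i));
}

/* Hypothetical helper: read a 64-bit integer stored most significant byte first. */
static int64_t
get_int64_be(const unsigned char *buf)
{
	uint64_t	u = 0;
	int			i;

	for (i = 0; i < 8; i++)
		u = (u << 8) | buf[i];

	/* Assumes the usual two's complement conversion back to signed. */
	return (int64_t) u;
}

/* Hypothetical helper: Unix seconds + microseconds -> microseconds since 2000-01-01. */
static int64_t
unix_to_wire_timestamp(int64_t unix_secs, int32_t usecs)
{
	return (unix_secs - PG_EPOCH_OFFSET_SECS) * USECS_PER_SECOND + usecs;
}
```

Because the bytes are written one at a time by shifting, the result does not depend on the host's endianness or struct padding, which is the property the protocol change is after.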

Please review if you have the time..

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

$ pg_receivexlog -D hoge
pg_receivexlog: unexpected termination of replication stream: ERROR:
no data left in message

pg_basebackup -X stream caused the same error.

$ pg_basebackup -D hoge -X stream -c fast
pg_basebackup: could not send feedback packet: no COPY in progress
pg_basebackup: child process exited with error 1

In walreceiver.c, tmpbuf is allocated for every XLogWalRcvProcessMsg() call.
Shouldn't it be allocated just once and reused until the end, to reduce
palloc overhead?

+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);

Shouldn't these be macros, to avoid calculation overhead?

+	/* Construct the message and send it. */
+	resetStringInfo(&reply_message);
+	pq_sendbyte(&reply_message, 'h');
+	pq_sendint(&reply_message, xmin, 4);
+	pq_sendint(&reply_message, nextEpoch, 4);
+	walrcv_send(reply_message.data, reply_message.len);

You seem to have forgotten to send the sendTime.
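
For reference, a minimal standalone sketch of the hot standby feedback payload with the timestamp included, assuming the layout is: one type byte 'h', an Int64 send time, an Int32 xmin and an Int32 epoch, all in network byte order. The names put_be, build_hs_feedback and HS_FEEDBACK_LEN are invented for this example; the real code would use pq_sendbyte, pq_sendint64 and pq_sendint on a StringInfo instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* type byte + sendTime + xmin + epoch (assumed layout, see lead-in) */
#define HS_FEEDBACK_LEN (1 + 8 + 4 + 4)

/* Hypothetical helper: store the low nbytes of value, most significant byte first. */
static size_t
put_be(unsigned char *buf, uint64_t value, size_t nbytes)
{
	size_t		i;

	for (i = 0; i < nbytes; i++)
		buf[i] = (unsigned char) (value >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Hypothetical helper: build the 'h' message into buf and return its length. */
static size_t
build_hs_feedback(unsigned char *buf, int64_t send_time,
				  uint32_t xmin, uint32_t epoch)
{
	size_t		len = 0;

	buf[len++] = 'h';
	len += put_be(buf + len, (uint64_t) send_time, 8);	/* sendTime */
	len += put_be(buf + len, xmin, 4);					/* xmin */
	len += put_be(buf + len, epoch, 4);					/* epoch */
	return len;
}
```

If the sender omits the 8-byte sendTime, the message is only 9 bytes long and a receiver that reads the fields in this order would run off the end of the message.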

Regards,

--
Fujii Masao

#38Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#37)
1 attachment(s)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want to e.g.
use pg_receivexlog to back up a server that runs on a different platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.

Please review if you have the time..

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

Yeah, clearly I didn't test this nearly enough...

I fixed the bugs you bumped into, new version attached.

+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);

Shouldn't these be macros, to avoid calculation overhead?

The compiler will calculate this at compile time; it's going to be a
constant at runtime.
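
A small standalone sketch of why that holds: a sum of sizeof expressions over fixed-size types is an integer constant expression in C, so it can be used where only compile-time constants are allowed, such as enumerator values and array sizes. The names below are made up for the illustration, and int64_t stands in for PostgreSQL's int64.

```c
#include <assert.h>
#include <stdint.h>

/* Header sizes written as sizeof sums, mirroring the two hdrlen expressions. */
enum
{
	WAL_DATA_HDRLEN = sizeof(int64_t) + sizeof(int64_t) + sizeof(int64_t),
	KEEPALIVE_HDRLEN = sizeof(int64_t) + sizeof(int64_t) + sizeof(char)
};

/*
 * These typedefs compile only if the sums are compile-time constants with
 * the expected values; a mismatch would give a negative array size.
 */
typedef char wal_data_hdrlen_is_24[(WAL_DATA_HDRLEN == 24) ? 1 : -1];
typedef char keepalive_hdrlen_is_17[(KEEPALIVE_HDRLEN == 17) ? 1 : -1];
```

Since the file compiles at all, the values were already known to the compiler, so a macro would not save any work at runtime.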

- Heikki

Attachments:

make-replication-protocol-arch-independent-2.patch (text/x-diff)
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3d72a16..5a32517 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1366,7 +1366,8 @@ The commands accepted in walsender mode are:
       WAL data is sent as a series of CopyData messages.  (This allows
       other information to be intermixed; in particular the server can send
       an ErrorResponse message if it encounters a failure after beginning
-      to stream.)  The payload in each CopyData message follows this format:
+      to stream.)  The payload of each CopyData message from the server to
+      the client contains a message in one of the following formats:
      </para>
 
      <para>
@@ -1390,34 +1391,32 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The starting point of the WAL data in this message, given in
-          XLogRecPtr format.
+          The starting point of the WAL data in this message.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The current end of WAL on the server, given in
-          XLogRecPtr format.
+          The current end of WAL on the server.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The server's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
       </para>
       </listitem>
       </varlistentry>
@@ -1445,25 +1444,12 @@ The commands accepted in walsender mode are:
        continuation records can be sent in different CopyData messages.
      </para>
      <para>
-       Note that all fields within the WAL data and the above-described header
-       will be in the sending server's native format.  Endianness, and the
-       format for the timestamp, are unpredictable unless the receiver has
-       verified that the sender's system identifier matches its own
-       <filename>pg_control</> contents.
-     </para>
-     <para>
        If the WAL sender process is terminated normally (during postmaster
        shutdown), it will send a CommandComplete message before exiting.
        This might not happen during an abnormal shutdown, of course.
      </para>
 
      <para>
-       The receiving process can send replies back to the sender at any time,
-       using one of the following message formats (also in the payload of a
-       CopyData message):
-     </para>
-
-     <para>
       <variablelist>
       <varlistentry>
       <term>
@@ -1495,12 +1481,23 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The server's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte1
+      </term>
+      <listitem>
+      <para>
+          1 means that the client should reply to this message as soon as
+          possible, to avoid a timeout disconnect. 0 otherwise.
       </para>
       </listitem>
       </varlistentry>
@@ -1512,6 +1509,12 @@ The commands accepted in walsender mode are:
      </para>
 
      <para>
+       The receiving process can send replies back to the sender at any time,
+       using one of the following message formats (also in the payload of a
+       CopyData message):
+     </para>
+
+     <para>
       <variablelist>
       <varlistentry>
       <term>
@@ -1532,45 +1535,56 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
           The location of the last WAL byte + 1 received and written to disk
-          in the standby, in XLogRecPtr format.
+          in the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
           The location of the last WAL byte + 1 flushed to disk in
-          the standby, in XLogRecPtr format.
+          the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The location of the last WAL byte + 1 applied in the standby, in
-          XLogRecPtr format.
+          The location of the last WAL byte + 1 applied in the standby.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
+      </term>
+      <listitem>
+      <para>
+          The client's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte1
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          If 1, the client requests the server to reply to this message
+          immediately. This can be used to ping the server, to test if
+          the connection is still healthy.
       </para>
       </listitem>
       </varlistentry>
@@ -1602,28 +1616,29 @@ The commands accepted in walsender mode are:
       </varlistentry>
       <varlistentry>
       <term>
-          Byte8
+          Int64
       </term>
       <listitem>
       <para>
-          The server's system clock at the time of transmission,
-          given in TimestampTz format.
+          The client's system clock at the time of transmission, as
+          microseconds since midnight on 2000-01-01.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte4
+          Int32
       </term>
       <listitem>
       <para>
-          The standby's current xmin.
+          The standby's current xmin. This may be 0, if the standby does not
+          support feedback, or is not yet in Hot Standby state.
       </para>
       </listitem>
       </varlistentry>
       <varlistentry>
       <term>
-          Byte4
+          Int32
       </term>
       <listitem>
       <para>
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b1accdc..19625c5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -39,9 +39,9 @@
 #include <unistd.h>
 
 #include "access/xlog_internal.h"
+#include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "replication/walprotocol.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -93,8 +93,8 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
-static StandbyReplyMessage reply_message;
-static StandbyHSFeedbackMessage feedback_message;
+static StringInfoData	reply_message;
+static StringInfoData	incoming_message;
 
 /*
  * About SIGTERM handling:
@@ -279,10 +279,10 @@ WalReceiverMain(void)
 	walrcv_connect(conninfo, startpoint);
 	DisableWalRcvImmediateExit();
 
-	/* Initialize LogstreamResult, reply_message and feedback_message */
+	/* Initialize LogstreamResult and buffers for processing messages */
 	LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
-	MemSet(&reply_message, 0, sizeof(reply_message));
-	MemSet(&feedback_message, 0, sizeof(feedback_message));
+	initStringInfo(&reply_message);
+	initStringInfo(&incoming_message);
 
 	/* Initialize the last recv timestamp */
 	last_recv_timestamp = GetCurrentTimestamp();
@@ -480,41 +480,53 @@ WalRcvQuickDieHandler(SIGNAL_ARGS)
 static void
 XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 {
+	int			hdrlen;
+	XLogRecPtr	dataStart;
+	XLogRecPtr	walEnd;
+	TimestampTz	sendTime;
+	bool		replyRequested;
+
+	resetStringInfo(&incoming_message);
+
 	switch (type)
 	{
 		case 'w':				/* WAL records */
 			{
-				WalDataMessageHeader msghdr;
-
-				if (len < sizeof(WalDataMessageHeader))
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64);
+				if (len < hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
 							 errmsg_internal("invalid WAL message received from primary")));
-				/* memcpy is required here for alignment reasons */
-				memcpy(&msghdr, buf, sizeof(WalDataMessageHeader));
+				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
-				ProcessWalSndrMessage(msghdr.walEnd, msghdr.sendTime);
+				dataStart = pq_getmsgint64(&incoming_message);
+				walEnd = pq_getmsgint64(&incoming_message);
+				sendTime = IntegerTimestampToTimestampTz(
+					pq_getmsgint64(&incoming_message));
+				ProcessWalSndrMessage(walEnd, sendTime);
 
-				buf += sizeof(WalDataMessageHeader);
-				len -= sizeof(WalDataMessageHeader);
-				XLogWalRcvWrite(buf, len, msghdr.dataStart);
+				buf += hdrlen;
+				len -= hdrlen;
+				XLogWalRcvWrite(buf, len, dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
 			{
-				PrimaryKeepaliveMessage keepalive;
-
-				if (len != sizeof(PrimaryKeepaliveMessage))
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
 							 errmsg_internal("invalid keepalive message received from primary")));
-				/* memcpy is required here for alignment reasons */
-				memcpy(&keepalive, buf, sizeof(PrimaryKeepaliveMessage));
+				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
+				walEnd = pq_getmsgint64(&incoming_message);
+				sendTime = IntegerTimestampToTimestampTz(
+					pq_getmsgint64(&incoming_message));
+				replyRequested = pq_getmsgbyte(&incoming_message);
 
-				ProcessWalSndrMessage(keepalive.walEnd, keepalive.sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime);
 
 				/* If the primary requested a reply, send one immediately */
-				if (keepalive.replyRequested)
+				if (replyRequested)
 					XLogWalRcvSendReply(true, false);
 				break;
 			}
@@ -685,7 +697,10 @@ XLogWalRcvFlush(bool dying)
 static void
 XLogWalRcvSendReply(bool force, bool requestReply)
 {
-	char		buf[sizeof(StandbyReplyMessage) + 1];
+	static XLogRecPtr writePtr = 0;
+	static XLogRecPtr flushPtr = 0;
+	XLogRecPtr	applyPtr;
+	static TimestampTz sendTime = 0;
 	TimestampTz now;
 
 	/*
@@ -708,28 +723,33 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	 * probably OK.
 	 */
 	if (!force
-		&& XLByteEQ(reply_message.write, LogstreamResult.Write)
-		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
-		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
+		&& XLByteEQ(writePtr, LogstreamResult.Write)
+		&& XLByteEQ(flushPtr, LogstreamResult.Flush)
+		&& !TimestampDifferenceExceeds(sendTime, now,
 									   wal_receiver_status_interval * 1000))
 		return;
 
 	/* Construct a new message */
-	reply_message.write = LogstreamResult.Write;
-	reply_message.flush = LogstreamResult.Flush;
-	reply_message.apply = GetXLogReplayRecPtr(NULL);
-	reply_message.sendTime = now;
-	reply_message.replyRequested = requestReply;
+	writePtr = LogstreamResult.Write;
+	flushPtr = LogstreamResult.Flush;
+	applyPtr = GetXLogReplayRecPtr(NULL);
+	sendTime = now;
+
+	resetStringInfo(&reply_message);
+	pq_sendbyte(&reply_message, 'r');
+	pq_sendint64(&reply_message, writePtr);
+	pq_sendint64(&reply_message, flushPtr);
+	pq_sendint64(&reply_message, applyPtr);
+	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
-		 (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
-		 (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);
 
 	/* Prepend with the message type and send it. */
-	buf[0] = 'r';
-	memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
-	walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
+	walrcv_send(reply_message.data, reply_message.len);
 }
 
 /*
@@ -739,11 +759,11 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 static void
 XLogWalRcvSendHSFeedback(void)
 {
-	char		buf[sizeof(StandbyHSFeedbackMessage) + 1];
 	TimestampTz now;
 	TransactionId nextXid;
 	uint32		nextEpoch;
 	TransactionId xmin;
+	static TimestampTz sendTime = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -758,7 +778,7 @@ XLogWalRcvSendHSFeedback(void)
 	/*
 	 * Send feedback at most once per wal_receiver_status_interval.
 	 */
-	if (!TimestampDifferenceExceeds(feedback_message.sendTime, now,
+	if (!TimestampDifferenceExceeds(sendTime, now,
 									wal_receiver_status_interval * 1000))
 		return;
 
@@ -786,22 +806,25 @@ XLogWalRcvSendHSFeedback(void)
 	/*
 	 * Always send feedback message.
 	 */
-	feedback_message.sendTime = now;
-	feedback_message.xmin = xmin;
-	feedback_message.epoch = nextEpoch;
+	sendTime = now;
 
 	elog(DEBUG2, "sending hot standby feedback xmin %u epoch %u",
-		 feedback_message.xmin,
-		 feedback_message.epoch);
-
-	/* Prepend with the message type and send it. */
-	buf[0] = 'h';
-	memcpy(&buf[1], &feedback_message, sizeof(StandbyHSFeedbackMessage));
-	walrcv_send(buf, sizeof(StandbyHSFeedbackMessage) + 1);
+		 xmin, nextEpoch);
+
+	/* Construct the message and send it. */
+	resetStringInfo(&reply_message);
+	pq_sendbyte(&reply_message, 'h');
+	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint(&reply_message, xmin, 4);
+	pq_sendint(&reply_message, nextEpoch, 4);
+	walrcv_send(reply_message.data, reply_message.len);
 }
 
 /*
- * Keep track of important messages from primary.
+ * Update shared memory status upon receiving a message from primary.
+ *
+ * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
+ * message, reported by primary.
  */
 static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2af38f1..adeb461 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -48,7 +48,6 @@
 #include "nodes/replnodes.h"
 #include "replication/basebackup.h"
 #include "replication/syncrep.h"
-#include "replication/walprotocol.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
@@ -66,6 +65,16 @@
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/*
+ * Maximum data payload in a WAL data message.	Must be >= XLOG_BLCKSZ.
+ *
+ * We don't have a good idea of what a good value would be; there's some
+ * overhead per message in both walsender and walreceiver, but on the other
+ * hand sending large batches makes walsender less responsive to signals
+ * because signals are checked only between messages.  128kB (with
+ * default 8k blocks) seems like a reasonable guess for now.
+ */
+#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
@@ -103,13 +112,10 @@ static uint32 sendOff = 0;
  */
 static XLogRecPtr sentPtr = 0;
 
-/* Buffer for processing reply messages. */
+/* Buffers for constructing outgoing messages and processing reply messages. */
+static StringInfoData output_message;
 static StringInfoData reply_message;
-/*
- * Buffer for constructing outgoing messages.
- * (1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE bytes)
- */
-static char *output_message;
+static StringInfoData tmpbuf;
 
 /*
  * Timestamp of the last receipt of the reply from the standby.
@@ -526,17 +532,24 @@ ProcessStandbyMessage(void)
 static void
 ProcessStandbyReplyMessage(void)
 {
-	StandbyReplyMessage reply;
+	XLogRecPtr	writePtr,
+				flushPtr,
+				applyPtr;
+	bool		replyRequested;
 
-	pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
+	writePtr = pq_getmsgint64(&reply_message);
+	flushPtr = pq_getmsgint64(&reply_message);
+	applyPtr = pq_getmsgint64(&reply_message);
+	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
+	replyRequested = pq_getmsgbyte(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply.write >> 32), (uint32) reply.write,
-		 (uint32) (reply.flush >> 32), (uint32) reply.flush,
-		 (uint32) (reply.apply >> 32), (uint32) reply.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);
 
 	/* Send a reply if the standby requested one. */
-	if (reply.replyRequested)
+	if (replyRequested)
 		WalSndKeepalive(false);
 
 	/*
@@ -548,9 +561,9 @@ ProcessStandbyReplyMessage(void)
 		volatile WalSnd *walsnd = MyWalSnd;
 
 		SpinLockAcquire(&walsnd->mutex);
-		walsnd->write = reply.write;
-		walsnd->flush = reply.flush;
-		walsnd->apply = reply.apply;
+		walsnd->write = writePtr;
+		walsnd->flush = flushPtr;
+		walsnd->apply = applyPtr;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -564,20 +577,21 @@ ProcessStandbyReplyMessage(void)
 static void
 ProcessStandbyHSFeedbackMessage(void)
 {
-	StandbyHSFeedbackMessage reply;
 	TransactionId nextXid;
 	uint32		nextEpoch;
+	TransactionId feedbackXmin;
+	uint32		feedbackEpoch;
 
 	/* Decipher the reply message */
-	pq_copymsgbytes(&reply_message, (char *) &reply,
-					sizeof(StandbyHSFeedbackMessage));
+	feedbackXmin = pq_getmsgint(&reply_message, 4);
+	feedbackEpoch = pq_getmsgint(&reply_message, 4);
 
 	elog(DEBUG2, "hot standby feedback xmin %u epoch %u",
-		 reply.xmin,
-		 reply.epoch);
+		 feedbackXmin,
+		 feedbackEpoch);
 
 	/* Ignore invalid xmin (can't actually happen with current walreceiver) */
-	if (!TransactionIdIsNormal(reply.xmin))
+	if (!TransactionIdIsNormal(feedbackXmin))
 		return;
 
 	/*
@@ -589,18 +603,18 @@ ProcessStandbyHSFeedbackMessage(void)
 	 */
 	GetNextXidAndEpoch(&nextXid, &nextEpoch);
 
-	if (reply.xmin <= nextXid)
+	if (feedbackXmin <= nextXid)
 	{
-		if (reply.epoch != nextEpoch)
+		if (feedbackEpoch != nextEpoch)
 			return;
 	}
 	else
 	{
-		if (reply.epoch + 1 != nextEpoch)
+		if (feedbackEpoch + 1 != nextEpoch)
 			return;
 	}
 
-	if (!TransactionIdPrecedesOrEquals(reply.xmin, nextXid))
+	if (!TransactionIdPrecedesOrEquals(feedbackXmin, nextXid))
 		return;					/* epoch OK, but it's wrapped around */
 
 	/*
@@ -610,9 +624,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * cleanup conflicts on the standby server.
 	 *
 	 * There is a small window for a race condition here: although we just
-	 * checked that reply.xmin precedes nextXid, the nextXid could have gotten
+	 * checked that feedbackXmin precedes nextXid, the nextXid could have gotten
 	 * advanced between our fetching it and applying the xmin below, perhaps
-	 * far enough to make reply.xmin wrap around.  In that case the xmin we
+	 * far enough to make feedbackXmin wrap around.  In that case the xmin we
 	 * set here would be "in the future" and have no effect.  No point in
 	 * worrying about this since it's too late to save the desired data
 	 * anyway.	Assuming that the standby sends us an increasing sequence of
@@ -625,7 +639,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * safe, and if we're moving it backwards, well, the data is at risk
 	 * already since a VACUUM could have just finished calling GetOldestXmin.)
 	 */
-	MyPgXact->xmin = reply.xmin;
+	MyPgXact->xmin = feedbackXmin;
 }
 
 /* Main loop of walsender process that streams the WAL over Copy messages. */
@@ -635,17 +649,12 @@ WalSndLoop(void)
 	bool		caughtup = false;
 
 	/*
-	 * Allocate buffer that will be used for each output message.  We do this
-	 * just once to reduce palloc overhead.  The buffer must be made large
-	 * enough for maximum-sized messages.
-	 */
-	output_message = palloc(1 + sizeof(WalDataMessageHeader) + MAX_SEND_SIZE);
-
-	/*
-	 * Allocate buffer that will be used for processing reply messages.  As
-	 * above, do this just once to reduce palloc overhead.
+	 * Allocate buffers that will be used for each outgoing and incoming
+	 * message.  We do this just once to reduce palloc overhead.
 	 */
+	initStringInfo(&output_message);
 	initStringInfo(&reply_message);
+	initStringInfo(&tmpbuf);
 
 	/* Initialize the last reply timestamp */
 	last_reply_timestamp = GetCurrentTimestamp();
@@ -1048,7 +1057,6 @@ XLogSend(bool *caughtup)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
-	WalDataMessageHeader msghdr;
 
 	/*
 	 * Attempt to send all data that's already been written out and fsync'd to
@@ -1125,25 +1133,32 @@ XLogSend(bool *caughtup)
 	/*
 	 * OK to read and send the slice.
 	 */
-	output_message[0] = 'w';
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'w');
+
+	pq_sendint64(&output_message, startptr);		/* dataStart */
+	pq_sendint64(&output_message, SendRqstPtr);	/* walEnd */
+	pq_sendint64(&output_message, 0);			/* sendtime, filled in later */
 
 	/*
 	 * Read the log directly into the output buffer to avoid extra memcpy
 	 * calls.
 	 */
-	XLogRead(output_message + 1 + sizeof(WalDataMessageHeader), startptr, nbytes);
+	enlargeStringInfo(&output_message, nbytes);
+	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+	output_message.len += nbytes;
+	output_message.data[output_message.len] = '\0';
 
 	/*
-	 * We fill the message header last so that the send timestamp is taken as
-	 * late as possible.
+	 * Fill the send timestamp last, so that it is taken as late as possible.
 	 */
-	msghdr.dataStart = startptr;
-	msghdr.walEnd = SendRqstPtr;
-	msghdr.sendTime = GetCurrentTimestamp();
-
-	memcpy(output_message + 1, &msghdr, sizeof(WalDataMessageHeader));
+	resetStringInfo(&tmpbuf);
+	pq_sendint64(&tmpbuf, GetCurrentIntegerTimestamp());
+	Assert(tmpbuf.len == sizeof(int64));
+	memcpy(&output_message.data[1 + sizeof(int64) + sizeof(int64)],
+		   tmpbuf.data, sizeof(int64));
 
-	pq_putmessage_noblock('d', output_message, 1 + sizeof(WalDataMessageHeader) + nbytes);
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
 
 	sentPtr = endptr;
 
@@ -1518,19 +1533,17 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 static void
 WalSndKeepalive(bool requestReply)
 {
-	PrimaryKeepaliveMessage keepalive_message;
-
-	/* Construct a new message */
-	keepalive_message.walEnd = sentPtr;
-	keepalive_message.sendTime = GetCurrentTimestamp();
-	keepalive_message.replyRequested = requestReply;
-
 	elog(DEBUG2, "sending replication keepalive");
 
-	/* Prepend with the message type and send it. */
-	output_message[0] = 'k';
-	memcpy(output_message + 1, &keepalive_message, sizeof(PrimaryKeepaliveMessage));
-	pq_putmessage_noblock('d', output_message, sizeof(PrimaryKeepaliveMessage) + 1);
+	/* construct the message... */
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, sentPtr);		/* walEnd */
+	pq_sendint64(&output_message, GetCurrentIntegerTimestamp()); /* sendTime */
+	pq_sendbyte(&output_message, requestReply ? 1 : 0);	/* replyRequested */
+
+	/* ... and send it wrapped in CopyData */
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
 }
 
 /*
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 50ef897..823bddf 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1286,6 +1286,50 @@ GetCurrentTimestamp(void)
 }
 
 /*
+ * GetCurrentIntegerTimestamp -- get the current operating system time as int64
+ *
+ * Result is the number of microseconds since the Postgres epoch. If compiled
+ * with --enable-integer-datetimes, this is identical to GetCurrentTimestamp(),
+ * and is implemented as a macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+GetCurrentIntegerTimestamp(void)
+{
+	int64 result;
+	struct timeval tp;
+
+	gettimeofday(&tp, NULL);
+
+	result = (int64) tp.tv_sec -
+		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
+
+	result = (result * USECS_PER_SEC) + tp.tv_usec;
+
+	return result;
+}
+#endif
+
+/*
+ * IntegerTimestampToTimestampTz -- convert an integer timestamp to native format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+TimestampTz
+IntegerTimestampToTimestampTz(int64 timestamp)
+{
+	TimestampTz result;
+
+	result = timestamp / USECS_PER_SEC;
+	result += (timestamp % USECS_PER_SEC) / 1000000.0;
+
+	return result;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 404ff91..97ab226 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -21,7 +21,6 @@
 #include "postgres.h"
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
-#include "replication/walprotocol.h"
 #include "utils/datetime.h"
 #include "utils/timestamp.h"
 
@@ -33,14 +32,13 @@
 #include <sys/types.h>
 #include <unistd.h>
 
-
-/* Size of the streaming replication protocol headers */
-#define STREAMING_HEADER_SIZE (1+sizeof(WalDataMessageHeader))
-#define STREAMING_KEEPALIVE_SIZE (1+sizeof(PrimaryKeepaliveMessage))
-
 /* fd for currently open WAL file */
 static int	walfile = -1;
 
+static void sendint64(int64 i, char *buf);
+static int64 recvint64(char *buf);
+static bool sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now);
+
 
 /*
  * Open a new WAL file in the specified directory. Store the name
@@ -191,35 +189,32 @@ close_walfile(char *basedir, char *walname, bool segment_complete)
  * Local version of GetCurrentTimestamp(), since we are not linked with
  * backend code.
  */
-static TimestampTz
+static int64
 localGetCurrentTimestamp(void)
 {
-	TimestampTz result;
+	int64 result;
 	struct timeval tp;
 
 	gettimeofday(&tp, NULL);
 
-	result = (TimestampTz) tp.tv_sec -
+	result = (int64) tp.tv_sec -
 		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
 
-#ifdef HAVE_INT64_TIMESTAMP
 	result = (result * USECS_PER_SEC) + tp.tv_usec;
-#else
-	result = result + (tp.tv_usec / 1000000.0);
-#endif
 
 	return result;
 }
 
 /*
  * Local version of TimestampDifference(), since we are not
- * linked with backend code.
+ * linked with backend code. We always use integer timestamps, regardless
+ * of server setting.
  */
 static void
-localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
+localTimestampDifference(int64 start_time, int64 stop_time,
 						 long *secs, int *microsecs)
 {
-	TimestampTz diff = stop_time - start_time;
+	int64 diff = stop_time - start_time;
 
 	if (diff <= 0)
 	{
@@ -228,13 +223,8 @@ localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 	}
 	else
 	{
-#ifdef HAVE_INT64_TIMESTAMP
 		*secs = (long) (diff / USECS_PER_SEC);
 		*microsecs = (int) (diff % USECS_PER_SEC);
-#else
-		*secs = (long) diff;
-		*microsecs = (int) ((diff - *secs) * 1000000.0);
-#endif
 	}
 }
 
@@ -243,17 +233,13 @@ localTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  * linked with backend code.
  */
 static bool
-localTimestampDifferenceExceeds(TimestampTz start_time,
-								TimestampTz stop_time,
+localTimestampDifferenceExceeds(int64 start_time,
+								int64 stop_time,
 								int msec)
 {
-	TimestampTz diff = stop_time - start_time;
+	int64 diff = stop_time - start_time;
 
-#ifdef HAVE_INT64_TIMESTAMP
 	return (diff >= msec * INT64CONST(1000));
-#else
-	return (diff * 1000.0 >= msec);
-#endif
 }
 
 /*
@@ -382,24 +368,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 											standby_message_timeout))
 		{
 			/* Time to send feedback! */
-			char		replybuf[sizeof(StandbyReplyMessage) + 1];
-			StandbyReplyMessage *replymsg;
-
-			replymsg = (StandbyReplyMessage *) (replybuf + 1);
-			replymsg->write = blockpos;
-			replymsg->flush = InvalidXLogRecPtr;
-			replymsg->apply = InvalidXLogRecPtr;
-			replymsg->sendTime = now;
-			replybuf[0] = 'r';
-
-			if (PQputCopyData(conn, replybuf, sizeof(replybuf)) <= 0 ||
-				PQflush(conn))
-			{
-				fprintf(stderr, _("%s: could not send feedback packet: %s"),
-						progname, PQerrorMessage(conn));
+			if (!sendFeedback(conn, blockpos, now))
 				goto error;
-			}
-
 			last_status = now;
 		}
 
@@ -419,12 +389,11 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			FD_SET(PQsocket(conn), &input_mask);
 			if (standby_message_timeout)
 			{
-				TimestampTz targettime;
+				int64		targettime;
 				long		secs;
 				int			usecs;
 
-				targettime = TimestampTzPlusMilliseconds(last_status,
-												standby_message_timeout - 1);
+				targettime = last_status + (standby_message_timeout - 1) * ((int64) 1000);
 				localTimestampDifference(now,
 										 targettime,
 										 &secs,
@@ -474,19 +443,13 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 					progname, PQerrorMessage(conn));
 			goto error;
 		}
+
 		if (copybuf[0] == 'k')
 		{
 			/*
 			 * keepalive message, sent in 9.2 and newer. We just ignore this
 			 * message completely, but need to skip past it in the stream.
 			 */
-			if (r != STREAMING_KEEPALIVE_SIZE)
-			{
-				fprintf(stderr,
-						_("%s: keepalive message has incorrect size %d\n"),
-						progname, r);
-				goto error;
-			}
 			continue;
 		}
 		else if (copybuf[0] != 'w')
@@ -495,15 +458,19 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 					progname, copybuf[0]);
 			goto error;
 		}
+
+/* Size of the streaming replication protocol headers */
+#define STREAMING_HEADER_SIZE (1 /* msgtype */ + 8 /* dataStart */ + 8 /* walEnd */ + 8 /* sendTime */)
+
 		if (r < STREAMING_HEADER_SIZE + 1)
 		{
 			fprintf(stderr, _("%s: streaming header too small: %d\n"),
 					progname, r);
 			goto error;
 		}
+		blockpos = recvint64(&copybuf[1]);
 
 		/* Extract WAL location for this block */
-		memcpy(&blockpos, copybuf + 1, 8);
 		xlogoff = blockpos % XLOG_SEG_SIZE;
 
 		/*
@@ -641,3 +608,74 @@ error:
 	walfile = -1;
 	return false;
 }
+
+/*
+ * Converts an int64 to network byte order.
+ */
+static void
+sendint64(int64 i, char *buf)
+{
+	uint32		n32;
+
+	/* High order half first, since we're doing MSB-first */
+	n32 = (uint32) (i >> 32);
+	n32 = htonl(n32);
+	memcpy(&buf[0], &n32, 4);
+
+	/* Now the low order half */
+	n32 = (uint32) i;
+	n32 = htonl(n32);
+	memcpy(&buf[4], &n32, 4);
+}
+
+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+recvint64(char *buf)
+{
+	int64		result;
+	uint32		h32;
+	uint32		l32;
+
+	memcpy(&h32, buf, 4);
+	memcpy(&l32, buf + 4, 4);
+	h32 = ntohl(h32);
+	l32 = ntohl(l32);
+
+	result = h32;
+	result <<= 32;
+	result |= l32;
+
+	return result;
+}
+
+/*
+ * Send a Standby Status Update message to server.
+ */
+static bool
+sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now)
+{
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	int 		len = 0;
+
+	replybuf[len++] = 'r';
+	sendint64(blockpos, &replybuf[len]);			/* write */
+	len += 8;
+	sendint64(InvalidXLogRecPtr, &replybuf[len]);	/* flush */
+	len += 8;
+	sendint64(InvalidXLogRecPtr, &replybuf[len]);	/* apply */
+	len += 8;
+	sendint64(now, &replybuf[len]);					/* sendTime */
+	len += 8;
+	replybuf[len++] = 0;							/* replyRequested */
+
+	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
+	{
+		fprintf(stderr, _("%s: could not send feedback packet: %s"),
+				progname, PQerrorMessage(conn));
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
deleted file mode 100644
index 396d006..0000000
--- a/src/include/replication/walprotocol.h
+++ /dev/null
@@ -1,128 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * walprotocol.h
- *	  Definitions relevant to the streaming WAL transmission protocol.
- *
- * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
- *
- * src/include/replication/walprotocol.h
- *
- *-------------------------------------------------------------------------
- */
-#ifndef _WALPROTOCOL_H
-#define _WALPROTOCOL_H
-
-#include "access/xlogdefs.h"
-#include "datatype/timestamp.h"
-
-
-/*
- * All messages from WalSender must contain these fields to allow us to
- * correctly calculate the replication delay.
- */
-typedef struct
-{
-	/* Current end of WAL on the sender */
-	XLogRecPtr	walEnd;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-
-	/*
-	 * If replyRequested is set, the client should reply immediately to this
-	 * message, to avoid a timeout disconnect.
-	 */
-	bool		replyRequested;
-} WalSndrMessage;
-
-
-/*
- * Header for a WAL data message (message type 'w').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * The header is followed by actual WAL data.  Note that the data length is
- * not specified in the header --- it's just whatever remains in the message.
- *
- * walEnd and sendTime are not essential data, but are provided in case
- * the receiver wants to adjust its behavior depending on how far behind
- * it is.
- */
-typedef struct
-{
-	/* WAL start location of the data included in this message */
-	XLogRecPtr	dataStart;
-
-	/* Current end of WAL on the sender */
-	XLogRecPtr	walEnd;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-} WalDataMessageHeader;
-
-/*
- * Keepalive message from primary (message type 'k'). (lowercase k)
- * This is wrapped within a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef WalSndrMessage PrimaryKeepaliveMessage;
-
-/*
- * Reply message from standby (message type 'r').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef struct
-{
-	/*
-	 * The xlog locations that have been written, flushed, and applied by
-	 * standby-side. These may be invalid if the standby-side is unable to or
-	 * chooses not to report these.
-	 */
-	XLogRecPtr	write;
-	XLogRecPtr	flush;
-	XLogRecPtr	apply;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-
-	/*
-	 * If replyRequested is set, the server should reply immediately to this
-	 * message, to avoid a timeout disconnect.
-	 */
-	bool		replyRequested;
-} StandbyReplyMessage;
-
-/*
- * Hot Standby feedback from standby (message type 'h').  This is wrapped within
- * a CopyData message at the FE/BE protocol level.
- *
- * Note that the data length is not specified here.
- */
-typedef struct
-{
-	/*
-	 * The current xmin and epoch from the standby, for Hot Standby feedback.
-	 * This may be invalid if the standby-side does not support feedback, or
-	 * Hot Standby is not yet available.
-	 */
-	TransactionId xmin;
-	uint32		epoch;
-
-	/* Sender's system clock at the time of transmission */
-	TimestampTz sendTime;
-} StandbyHSFeedbackMessage;
-
-/*
- * Maximum data payload in a WAL data message.	Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages.  128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
-#endif   /* _WALPROTOCOL_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index e7cdb41..e4ca8c1 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -206,6 +206,13 @@ extern Datum generate_series_timestamptz(PG_FUNCTION_ARGS);
 /* Internal routines (not fmgr-callable) */
 
 extern TimestampTz GetCurrentTimestamp(void);
+#ifdef HAVE_INT64_TIMESTAMP
+#define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
+#define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#else
+extern int64 GetCurrentIntegerTimestamp(void);
+extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+#endif
 
 extern void TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 					long *secs, int *microsecs);
#39Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#35)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
On 13.10.2012 19:35, Fujii Masao wrote:

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

Thanks, fixed.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Is it worth adding the same mechanism (send back the reply immediately
if walsender request a reply) into pg_basebackup and pg_receivexlog?

Good catch. Yes, they should be taught about this too. I'll look into
doing that too.

If you have not started and you have no objection, I can pick this up and
complete it.

For both pg_basebackup and pg_receivexlog, we need to take a timeout
parameter from the user on the command line, as there is no conf file here.
The new option can be -t (the parameter name can be recvtimeout).

The main changes will be in ReceiveXlogStream(), which is a common function
for both pg_basebackup and pg_receivexlog. The handling will be done in the
same way as we have done in walreceiver.
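
As a rough illustration of the timeout handling proposed above, the check could look like the following minimal sketch. The function name `recv_timeout_expired` and its parameters are hypothetical and do not appear in the patch; it assumes int64 microsecond timestamps, as used by the local timestamp helpers in receivelog.c, and a timeout given in milliseconds through the proposed (not yet existing) -t option.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: returns true when nothing has
 * been received from the server for at least timeout_msec milliseconds.
 * Timestamps are int64 microseconds. A timeout of 0 (or less) disables the
 * check, which is an assumption made for this sketch.
 */
bool
recv_timeout_expired(int64_t last_recv_usec, int64_t now_usec, int timeout_msec)
{
	if (timeout_msec <= 0)
		return false;

	/* Widen before multiplying so large timeouts cannot overflow int. */
	return (now_usec - last_recv_usec) >= (int64_t) timeout_msec * 1000;
}
```

A receive loop would presumably call this after each wait on the socket and treat a true result as a broken connection.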

Suggestions/Comments?

With Regards,
Amit Kapila.

#40Amit Kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#39)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wednesday, October 17, 2012 5:16 PM Amit Kapila wrote:

On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
On 13.10.2012 19:35, Fujii Masao wrote:

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

Thanks, fixed.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Is it worth adding the same mechanism (send back the reply immediately
if walsender request a reply) into pg_basebackup and pg_receivexlog?

Good catch. Yes, they should be taught about this too. I'll look into
doing that too.

If you have not started and you have no objection, I can pick this up and
complete it.

For both pg_basebackup and pg_receivexlog, we need to take a timeout
parameter from the user on the command line, as there is no conf file here.
The new option can be -t (the parameter name can be recvtimeout).

The main changes will be in ReceiveXlogStream(), which is a common function
for both pg_basebackup and pg_receivexlog. The handling will be done in the
same way as we have done in walreceiver.

Some other functions in pg_basebackup that receive the data files also need
similar handling.

With Regards,
Amit Kapila.

#41Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wed, Oct 17, 2012 at 8:46 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
On 13.10.2012 19:35, Fujii Masao wrote:

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

Thanks, fixed.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Is it worth adding the same mechanism (send back the reply immediately
if walsender request a reply) into pg_basebackup and pg_receivexlog?

Good catch. Yes, they should be taught about this too. I'll look into
doing that too.

If you have not started and you have no objection, I can pick this up and
complete it.

For both pg_basebackup and pg_receivexlog, we need to take a timeout
parameter from the user on the command line, as there is no conf file here.
The new option can be -t (the parameter name can be recvtimeout).

The main changes will be in ReceiveXlogStream(), which is a common function
for both pg_basebackup and pg_receivexlog. The handling will be done in the
same way as we have done in walreceiver.

Suggestions/Comments?

Before implementing the timeout parameter, I think it's better to change
both the pg_basebackup background process and pg_receivexlog so that they
send back a reply message immediately when they receive a keepalive
message requesting a reply. Currently, they always ignore such keepalive
messages, so their status interval parameter (-s) must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
the pg_basebackup background process and pg_receivexlog.
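
To make the suggestion concrete, here is a minimal sketch of how a client could decide whether a keepalive asks for an immediate reply. It assumes the keepalive layout from the patch under review in this thread ('k', then walEnd and sendTime as 8 bytes each, then a one-byte replyRequested flag); the function name `keepalive_requests_reply` and the macro `KEEPALIVE_MSG_LEN` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumed keepalive layout, taken from the patch in this thread:
 * 'k' (1 byte), walEnd (8 bytes), sendTime (8 bytes), replyRequested (1 byte)
 */
#define KEEPALIVE_MSG_LEN (1 + 8 + 8 + 1)

/*
 * Hypothetical helper: returns true if copybuf holds a complete keepalive
 * message whose replyRequested flag is set.
 */
bool
keepalive_requests_reply(const char *copybuf, size_t len)
{
	if (len < KEEPALIVE_MSG_LEN || copybuf[0] != 'k')
		return false;

	/* The flag is the last byte, after the type byte and two int64 fields. */
	return copybuf[1 + 8 + 8] != 0;
}
```

When this returns true, the receive loop would send its status update right away instead of waiting for the -s interval to elapse.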

Regards,

--
Fujii Masao

#42Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#38)
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tue, Oct 16, 2012 at 9:31 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want to,
e.g., use pg_receivexlog to back up a server that runs on a different
platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.
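
The idea of an architecture-independent encoding can be illustrated with a small sketch that writes and reads an int64 in network byte order (most significant byte first) using only shifts, so the result does not depend on the host's endianness or struct padding. The names `put_int64_be` and `get_int64_be` are hypothetical; the patch itself uses pq_sendint64() in the backend and sendint64()/recvint64() in receivelog.c.

```c
#include <stdint.h>

/* Hypothetical helper: store value into buf[0..7], most significant byte first. */
void
put_int64_be(int64_t value, unsigned char *buf)
{
	uint64_t	u = (uint64_t) value;
	int			i;

	for (i = 0; i < 8; i++)
		buf[i] = (unsigned char) (u >> (56 - 8 * i));
}

/* Hypothetical helper: read an int64 stored most significant byte first. */
int64_t
get_int64_be(const unsigned char *buf)
{
	uint64_t	u = 0;
	int			i;

	for (i = 0; i < 8; i++)
		u = (u << 8) | buf[i];

	return (int64_t) u;
}
```

Because each byte is placed explicitly, a little-endian client and a big-endian server would agree on the same eight bytes for a given WAL position or timestamp.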

Please review if you have the time.

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

Yeah, clearly I didn't test this nearly enough...

I fixed the bugs you bumped into, new version attached.

Thanks for updating the patch!

Should we remove the integer_datetimes check done by the pg_basebackup
background process and pg_receivexlog? Currently, they always check
it, and they fail if the setting differs between the client and the
server. Thanks to the patch, ISTM this check is no longer
required.

+ pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());

In XLogWalRcvSendReply() and XLogWalRcvSendHSFeedback(),
GetCurrentTimestamp() is called twice. I think that we can skip the
latter call if integer datetimes are enabled, because the return values of
GetCurrentTimestamp() and GetCurrentIntegerTimestamp() are in the
same format. It's worth reducing the number of GetCurrentTimestamp()
calls, I think.
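
One way to read the clock only once, sketched here purely as an illustration, is to derive the int64 wire timestamp from the value already obtained. In an integer-datetimes build that value is already in the wire format. For a float-timestamp build, where the timestamp is a double holding seconds, a conversion like the one below would be needed; this goes beyond what is proposed above, and the function name `float_timestamp_to_integer` is hypothetical.

```c
#include <stdint.h>

#define USECS_PER_SEC 1000000

/*
 * Hypothetical helper: convert a float timestamp (seconds since the
 * Postgres epoch, stored as a double) to int64 microseconds, rounding to
 * the nearest microsecond.
 */
int64_t
float_timestamp_to_integer(double seconds)
{
	double		usecs = seconds * USECS_PER_SEC;

	/* Round half away from zero before truncating to an integer. */
	return (int64_t) (usecs + (usecs >= 0 ? 0.5 : -0.5));
}
```

Whether the saved system call is worth the extra conversion in float builds is a judgment call; in integer builds the reuse is free.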

 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
-		 (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
-		 (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
-		 (uint32) (reply.write >> 32), (uint32) reply.write,
-		 (uint32) (reply.flush >> 32), (uint32) reply.flush,
-		 (uint32) (reply.apply >> 32), (uint32) reply.apply);
+		 (uint32) (writePtr >> 32), (uint32) writePtr,
+		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
+		 (uint32) (applyPtr >> 32), (uint32) applyPtr);

Isn't it worth logging not only the WAL locations but also the replyRequested
flag in these debug messages?

The rest of the patch looks good to me.

+                               hdrlen = sizeof(int64) + sizeof(int64) +
sizeof(int64);
+                               hdrlen = sizeof(int64) + sizeof(int64) +
sizeof(char);

Should these be macros, to avoid calculation overhead?

The compiler will calculate this at compile time; it's going to be a
constant at runtime.

Yes, you're right.
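
To illustrate the point about compile-time evaluation, here is a small sketch. An enumerator must be an integer constant expression, so the fact that the sizeof sums below are accepted there shows that no arithmetic is left for run time. The constant names, the header_length() helper, and the mapping of the two sums to 'w' and 'k' messages are assumptions for this example; int64_t stands in for PostgreSQL's int64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * The two header lengths from the patch, written as sizeof sums.
 * Enumerators must be integer constant expressions, so the compiler
 * has to fold these sums at compile time.
 */
enum
{
    WAL_DATA_HDRLEN = sizeof(int64_t) + sizeof(int64_t) + sizeof(int64_t),
    KEEPALIVE_HDRLEN = sizeof(int64_t) + sizeof(int64_t) + sizeof(char)
};

/* Checked by the compiler, not at run time (C11). */
_Static_assert(WAL_DATA_HDRLEN == 24, "three 8-byte fields");
_Static_assert(KEEPALIVE_HDRLEN == 17, "two 8-byte fields plus one byte");

/* Hypothetical helper: header length for a message type ('w' or 'k'). */
static size_t
header_length(char msgtype)
{
    assert(msgtype == 'w' || msgtype == 'k');
    return (msgtype == 'w') ? WAL_DATA_HDRLEN : KEEPALIVE_HDRLEN;
}
```

A macro would therefore not make the code any faster; it would only give the constant a name.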

Regards,

--
Fujii Masao

#43Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#41)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
On Wed, Oct 17, 2012 at 8:46 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
On 13.10.2012 19:35, Fujii Masao wrote:

On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ok, thanks. Committed.

I found one typo. The attached patch fixes that typo.

Thanks, fixed.

ISTM you need to update the protocol.sgml because you added
the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.

Is it worth adding the same mechanism (send back the reply immediately
if walsender requests a reply) into pg_basebackup and pg_receivexlog?

Good catch. Yes, they should be taught about this too. I'll look into
doing that too.

If you have not started and you have no objection, I can pick this up and
complete it.

For both (pg_basebackup and pg_receivexlog), we need to get a timeout
parameter from the user on the command line, as there is no conf file here.
The new option can be -t (the parameter name can be recvtimeout).

The main changes will be in the function ReceiveXlogStream(), which is
common to both pg_basebackup and pg_receivexlog. Handling will be done in
the same way as we have done in walreceiver.

Suggestions/Comments?

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function, ReceiveXlogStream(), so a single change addresses the issue for both.

Now, further to this, on introducing a timeout in pg_basebackup and pg_receivexlog:
We can have a mechanism similar to the walreceiver timeout while streaming data from the server, but the same logic cannot be used in case the network goes down while fetching the other database files from the server.
The reason is that, to receive the data files, PQgetCopyData() is called in synchronous mode, so it waits indefinitely until it gets some data.
To solve this issue, I can think of the following options:
1. Make this call asynchronous as well (but I am not sure about the impact of this).
2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite wait), pass some finite time. This time can be received as a command-line argument
from the respective utility and set in the PGconn structure.
In order to have a timeout value in PGconn, we can either:
a. Add a new parameter in PGconn to indicate the receive timeout.
b. Use the existing parameter connect_timeout for the receive timeout as well, but this may lead to confusion.
3. Any other better option?
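
A minimal sketch of what a finite wait could look like, using plain POSIX poll() on a file descriptor. The helper wait_readable() is hypothetical and is not a libpq function; it only illustrates replacing an infinite wait (timeout -1) with a caller-supplied limit.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd becomes readable or timeout_ms
 * elapses. Returns 1 if readable, 0 on timeout, -1 on error.
 * Note: if poll() is interrupted by a signal, the wait is restarted
 * with the full timeout, so the total wait can exceed timeout_ms.
 */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int         rc;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return (rc > 0) ? 1 : 0;
}
```

With libpq, a client could presumably pass PQsocket(conn) as the descriptor and, once it is readable, call PQconsumeInput() followed by PQgetCopyData() in asynchronous mode; a return of 0 from the helper would then be treated as a receive timeout.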

Apart from the above issue, there is a possibility that if the network goes down at connect time, the utility might hang, because connect_timeout is NULL by default and connectDBComplete will wait indefinitely for the connection to succeed.
So shall we have a separate command-line argument for this too, or is there another way you would suggest?

Suggestions/Comments?

With Regards,
Amit Kapila.

Attachments:

pg_basebackup_keepalive_reply.patch (application/octet-stream)
*** a/src/bin/pg_basebackup/receivelog.c
--- b/src/bin/pg_basebackup/receivelog.c
***************
*** 256,261 **** localTimestampDifferenceExceeds(TimestampTz start_time,
--- 256,290 ----
  #endif
  }
  
+ /* Send a reply to the sender task.
+   * replyRequested indicates whether an immediate reply is expected
+   * from the sender.
+   */
+ static bool sendReplyToSender(PGconn *conn, int64 now, bool replyRequested, XLogRecPtr blockpos)
+ {
+ 	/* Time to send feedback! */
+ 	char		replybuf[sizeof(StandbyReplyMessage) + 1];
+ 	StandbyReplyMessage *replymsg;
+ 
+ 	replymsg = (StandbyReplyMessage *) (replybuf + 1);
+ 	replymsg->write = blockpos;
+ 	replymsg->flush = InvalidXLogRecPtr;
+ 	replymsg->apply = InvalidXLogRecPtr;
+ 	replymsg->sendTime = now;
+ 	replymsg->replyRequested = replyRequested;
+ 	replybuf[0] = 'r';
+ 
+ 	if (PQputCopyData(conn, replybuf, sizeof(replybuf)) <= 0 ||
+ 		PQflush(conn))
+ 	{
+ 		fprintf(stderr, _("%s: could not send feedback packet: %s"),
+ 				progname, PQerrorMessage(conn));
+ 		return false;
+ 	}
+ 
+ 	return true;
+ }
+ 
  /*
   * Receive a log stream starting at the specified position.
   *
***************
*** 291,296 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 320,326 ----
  	char	   *copybuf = NULL;
  	int64		last_status = -1;
  	XLogRecPtr	blockpos = InvalidXLogRecPtr;
+ 	PrimaryKeepaliveMessage keepalive;
  
  	if (sysidentifier != NULL)
  	{
***************
*** 381,402 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  			localTimestampDifferenceExceeds(last_status, now,
  											standby_message_timeout))
  		{
! 			/* Time to send feedback! */
! 			char		replybuf[sizeof(StandbyReplyMessage) + 1];
! 			StandbyReplyMessage *replymsg;
! 
! 			replymsg = (StandbyReplyMessage *) (replybuf + 1);
! 			replymsg->write = blockpos;
! 			replymsg->flush = InvalidXLogRecPtr;
! 			replymsg->apply = InvalidXLogRecPtr;
! 			replymsg->sendTime = now;
! 			replybuf[0] = 'r';
! 
! 			if (PQputCopyData(conn, replybuf, sizeof(replybuf)) <= 0 ||
! 				PQflush(conn))
  			{
- 				fprintf(stderr, _("%s: could not send feedback packet: %s"),
- 						progname, PQerrorMessage(conn));
  				goto error;
  			}
  
--- 411,419 ----
  			localTimestampDifferenceExceeds(last_status, now,
  											standby_message_timeout))
  		{
! 			/* Sending false here, because no reply is expected from server */
! 			if (sendReplyToSender(conn, now, false, blockpos) == false)
  			{
  				goto error;
  			}
  
***************
*** 487,492 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 504,527 ----
  						progname, r);
  				goto error;
  			}
+ 
+ 			/* copy the keepalive payload, skipping the 'k' message-type byte */
+ 			memcpy(&keepalive, copybuf + 1, sizeof(PrimaryKeepaliveMessage));
+ 
+ 			/* If as part of keepalive message from sender, an immediate reply is requested 
+ 			 * then send the same to sender.
+ 			 */
+ 			if (keepalive.replyRequested)
+ 			{
+ 				now = localGetCurrentTimestamp();
+ 				if (sendReplyToSender(conn, now, false, blockpos) == false)
+ 				{
+ 					goto error;
+ 				}
+ 
+ 				last_status = now;			
+ 			}
+ 			
  			continue;
  		}
  		else if (copybuf[0] != 'w')
#44Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#38)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On 16.10.2012 15:31, Heikki Linnakangas wrote:

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),

I think there's consensus that 9.3 would be a good time to do that
as we changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want
to, e.g., use pg_receivexlog to back up a server that runs on a different
platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.

Please review if you have the time..

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

Yeah, clearly I didn't test this near enough...

I fixed the bugs you bumped into, new version attached.

Committed this now, after fixing a few more bugs that came up during
testing. Next, I'll take a look at the patch you sent for adding
timeouts to pg_basebackup and pg_receivexlog
(http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs)

- Heikki

#45Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit kapila (#43)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function, ReceiveXlogStream(), so a single change addresses the issue for both.

Thanks, committed this one after shuffling it around the changes I
committed yesterday. I also updated the docs to not claim that -s option
is required to avoid timeout disconnects anymore.
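
For readers following along, here is a minimal sketch of how a client might decode a keepalive once the protocol no longer ships raw structs. The layout assumed here (a 'k' type byte, an 8-byte walEnd, an 8-byte sendTime, then a 1-byte replyRequested flag, integers most significant byte first) and the names KeepaliveInfo and parse_keepalive() are assumptions for this example, not code from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded keepalive fields (hypothetical struct for this sketch). */
typedef struct KeepaliveInfo
{
    int64_t     wal_end;
    int64_t     send_time;
    bool        reply_requested;
} KeepaliveInfo;

/* Read a 64-bit value stored most significant byte first. */
static int64_t
read_int64_be(const unsigned char *p)
{
    uint64_t    u = 0;
    int         i;

    for (i = 0; i < 8; i++)
        u = (u << 8) | p[i];
    return (int64_t) u;
}

/*
 * Parse a keepalive message of the assumed layout. Returns true and
 * fills *out on success; returns false if the message is too short
 * or is not a keepalive.
 */
static bool
parse_keepalive(const unsigned char *msg, size_t len, KeepaliveInfo *out)
{
    assert(msg != NULL && out != NULL);

    if (len < 1 + 8 + 8 + 1 || msg[0] != 'k')
        return false;

    out->wal_end = read_int64_be(msg + 1);
    out->send_time = read_int64_be(msg + 9);
    out->reply_requested = (msg[17] != 0);
    return true;
}
```

A receive loop would send its status reply right away when reply_requested is set, rather than waiting for the next status interval.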

- Heikki

#46Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#45)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned
by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function,
ReceiveXlogStream(), so a single change addresses the issue for both.

Thanks, committed this one after shuffling it around the changes I
committed yesterday. I also updated the docs to not claim that -s option
is required to avoid timeout disconnects anymore.

Thank you.
However, I think the issue is still not completely solved.
pg_basebackup/pg_receivexlog can still take a long time to
detect a network break, as they have no timeout concept. To address that, I
sent a proposal, which is described at the end of the following mail:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs

Do you think there is any need to introduce such a mechanism in
pg_basebackup/pg_receivexlog?

With Regards,
Amit Kapila.

#47Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#44)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thu, Nov 8, 2012 at 2:22 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 16.10.2012 15:31, Heikki Linnakangas wrote:

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),

I think there's consensus that 9.3 would be a good time to do that
as we changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want
to, e.g., use pg_receivexlog to back up a server that runs on a different
platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.

Please review if you have the time..

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

Yeah, clearly I didn't test this near enough...

I fixed the bugs you bumped into, new version attached.

Committed this now, after fixing a few more bugs that came up during
testing.

As I suggested upthread, pg_basebackup and pg_receivexlog no longer
need to check integer_datetimes before establishing the connection,
thanks to this commit. If this is right, the attached patch should be applied.
The patch just removes the check of integer_datetimes by pg_basebackup
and pg_receivexlog.

Regards,

--
Fujii Masao

Attachments:

dont_check_integer_datetimes_v1.patch (application/octet-stream)
*** a/src/bin/pg_basebackup/streamutil.c
--- b/src/bin/pg_basebackup/streamutil.c
***************
*** 83,89 **** GetConnection(void)
  	const char **keywords;
  	const char **values;
  	char	   *password = NULL;
- 	const char *tmpparam;
  
  	if (dbhost)
  		argcount++;
--- 83,88 ----
***************
*** 180,212 **** GetConnection(void)
  		free(values);
  		free(keywords);
  
- 		/*
- 		 * Ensure we have the same value of integer timestamps as the server
- 		 * we are connecting to.
- 		 */
- 		tmpparam = PQparameterStatus(tmpconn, "integer_datetimes");
- 		if (!tmpparam)
- 		{
- 			fprintf(stderr,
- 					_("%s: could not determine server setting for integer_datetimes\n"),
- 					progname);
- 			PQfinish(tmpconn);
- 			exit(1);
- 		}
- 
- #ifdef HAVE_INT64_TIMESTAMP
- 		if (strcmp(tmpparam, "on") != 0)
- #else
- 		if (strcmp(tmpparam, "off") != 0)
- #endif
- 		{
- 			fprintf(stderr,
- 			 _("%s: integer_datetimes compile flag does not match server\n"),
- 					progname);
- 			PQfinish(tmpconn);
- 			exit(1);
- 		}
- 
  		/* Store the password for next run */
  		if (password)
  			dbpassword = password;
--- 179,184 ----
#48Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#47)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Fri, Nov 9, 2012 at 1:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Nov 8, 2012 at 2:22 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 16.10.2012 15:31, Heikki Linnakangas wrote:

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:

Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),

I think there's consensus that 9.3 would be a good time to do that
as we changed the XLogRecPtr format anyway.

This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want
to, e.g., use pg_receivexlog to back up a server that runs on a different
platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.

Please review if you have the time..

Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.

Yeah, clearly I didn't test this near enough...

I fixed the bugs you bumped into, new version attached.

Committed this now, after fixing a few more bugs that came up during
testing.

As I suggested upthread, pg_basebackup and pg_receivexlog no longer
need to check integer_datetimes before establishing the connection,
thanks to this commit. If this is right, the attached patch should be applied.
The patch just removes the check of integer_datetimes by pg_basebackup
and pg_receivexlog.

Another comment that I made upthread is:

--------
In XLogWalRcvSendReply() and XLogWalRcvSendHSFeedback(),
GetCurrentTimestamp() is called twice. I think that we can skip the
latter call if integer datetimes are enabled, because the return values of
GetCurrentTimestamp() and GetCurrentIntegerTimestamp() are in the
same format. It's worth reducing the number of GetCurrentTimestamp()
calls, I think.
--------

Attached patch removes redundant GetCurrentTimestamp() call
from XLogWalRcvSendReply() and XLogWalRcvSendHSFeedback(),
if --enable-integer-datetimes.

Regards,

--
Fujii Masao

Attachments:

reduce_get_current_timestamp_v1.patch (application/octet-stream)
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 745,751 **** XLogWalRcvSendReply(bool force, bool requestReply)
--- 745,755 ----
  	pq_sendint64(&reply_message, writePtr);
  	pq_sendint64(&reply_message, flushPtr);
  	pq_sendint64(&reply_message, applyPtr);
+ #ifdef HAVE_INT64_TIMESTAMP
+ 	pq_sendint64(&reply_message, now);
+ #else
  	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+ #endif
  	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
  
  	/* Send it */
***************
*** 816,822 **** XLogWalRcvSendHSFeedback(void)
--- 820,830 ----
  	/* Construct the the message and send it. */
  	resetStringInfo(&reply_message);
  	pq_sendbyte(&reply_message, 'h');
+ #ifdef HAVE_INT64_TIMESTAMP
+ 	pq_sendint64(&reply_message, now);
+ #else
  	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+ #endif
  	pq_sendint(&reply_message, xmin, 4);
  	pq_sendint(&reply_message, nextEpoch, 4);
  	walrcv_send(reply_message.data, reply_message.len);
#49Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned
by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function,
ReceiveXlogStream(), so a single change addresses the issue for both.

Thanks, committed this one after shuffling it around the changes I
committed yesterday. I also updated the docs to not claim that -s option
is required to avoid timeout disconnects anymore.

Thank you.
However, I think the issue is still not completely solved.
pg_basebackup/pg_receivexlog can still take a long time to
detect a network break, as they have no timeout concept. To address that, I
sent a proposal, which is described at the end of the following mail:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs

Do you think there is any need to introduce such a mechanism in
pg_basebackup/pg_receivexlog?

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

Regards,

--
Fujii Masao

#50Amit Kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#49)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned
by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function,
ReceiveXlogStream(), so a single change addresses the issue for both.

Thanks, committed this one after shuffling it around the changes I
committed yesterday. I also updated the docs to not claim that -s option
is required to avoid timeout disconnects anymore.

Thank you.
However, I think the issue is still not completely solved.
pg_basebackup/pg_receivexlog can still take a long time to
detect a network break, as they have no timeout concept. To address that, I
sent a proposal, which is described at the end of the following mail:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs

Do you think there is any need to introduce such a mechanism in
pg_basebackup/pg_receivexlog?

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

By background process, you mean ReceiveXlogStream?
For both.

I think for the background process, it can be done in a way similar to what we
have done for walreceiver.
But I have some doubts about how to do it for the main process:

Logic similar to walreceiver cannot be used in case the network goes down
while fetching the other database files from the server.
The reason is that, to receive the data files, PQgetCopyData() is
called in synchronous mode, so it waits indefinitely until it
gets some data.
To solve this issue, I can think of the following options:
1. Make this call asynchronous as well (but I am not sure about the impact of this).
2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite
wait), pass some finite time. This time can be received as a command-line
argument from the respective utility and set in the PGconn structure.
In order to have a timeout value in PGconn, we can either:
a. Add a new parameter in PGconn to indicate the receive timeout.
b. Use the existing parameter connect_timeout for the receive timeout
as well, but this may lead to confusion.
3. Any other better option?

Apart from the above issue, there is a possibility that if the network goes
down at connect time, the utility might hang, because connect_timeout is NULL
by default and connectDBComplete will wait indefinitely for the
connection to succeed.
So shall we have a separate command-line argument for this too, or is there
another way you would suggest?

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

Yes, it's useful, but for walsender the main problem is that it uses a blocking
send call to send the data.
I have tried using the tcp_keepalive settings, but the send call doesn't return
in case of a network break.
The only way I could get it to return is to
change the value in /proc/sys/net/ipv4/tcp_retries2, using
the command
echo "8" > /proc/sys/net/ipv4/tcp_retries2
As per the recommendation, its value should be at least 8 (equivalent to 100
sec).

Do you have any idea how it can be achieved?

With Regards,
Amit Kapila.

#51Fujii Masao
masao.fujii@gmail.com
In reply to: Noname (#1)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that they
send back the reply message immediately when they receive the keepalive
message requesting the reply. Currently, they always ignore such keepalive
messages, so the status interval parameter (-s) in them must always be set to
a value less than the replication timeout. We can avoid this troublesome
parameter setting by introducing the same logic as walreceiver into both
pg_basebackup background process and pg_receivexlog.

Please find the patch attached to address the modification mentioned
by you (send immediate reply for keepalive).
Both pg_basebackup and pg_receivexlog use the same function,
ReceiveXlogStream(), so a single change addresses the issue for both.

Thanks, committed this one after shuffling it around the changes I
committed yesterday. I also updated the docs to not claim that -s option
is required to avoid timeout disconnects anymore.

Thank you.
However, I think the issue is still not completely solved.
pg_basebackup/pg_receivexlog can still take a long time to
detect a network break, as they have no timeout concept. To address that, I
sent a proposal, which is described at the end of the following mail:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs

Do you think there is any need to introduce such a mechanism in
pg_basebackup/pg_receivexlog?

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

By background process, you mean ReceiveXlogStream?
For both.

I think for background process, it can be done in a way similar to what we
have done for walreceiver.

Yes.

But I have some doubts about how to do it for the main process:

Logic similar to walreceiver cannot be used in case the network goes down
while fetching the other database files from the server.
The reason is that, to receive the data files, PQgetCopyData() is
called in synchronous mode, so it waits indefinitely until it
gets some data.
To solve this issue, I can think of the following options:
1. Make this call asynchronous as well (but I am not sure about the impact of this).

+1

Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in the similar way to walreceiver's.

2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite
wait), pass some finite time. This time can be received as a command-line
argument from the respective utility and set in the PGconn structure.
In order to have a timeout value in PGconn, we can either:
a. Add a new parameter in PGconn to indicate the receive timeout.
b. Use the existing parameter connect_timeout for the receive timeout
as well, but this may lead to confusion.
3. Any other better option?

Apart from the above issue, there is a possibility that if the network goes
down at connect time, the utility might hang, because connect_timeout is NULL
by default and connectDBComplete will wait indefinitely for the
connection to succeed.
So shall we have a separate command-line argument for this too, or is there
another way you would suggest?

Yes, I think that we should add something like --conninfo option to
pg_basebackup and pg_receivexlog. We can easily set not only
connect_timeout but also sslmode, application_name, ... by using such
option accepting conninfo string.

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

Yes, it's useful, but for walsender the main problem is that it uses a blocking
send call to send the data.
I have tried using the tcp_keepalive settings, but the send call doesn't return
in case of a network break.
The only way I could get it to return is to
change the value in /proc/sys/net/ipv4/tcp_retries2, using
the command
echo "8" > /proc/sys/net/ipv4/tcp_retries2
As per the recommendation, its value should be at least 8 (equivalent to 100
sec).

Do you have any idea how it can be achieved?

What about using pq_putmessage_noblock()?
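
The general idea behind a non-blocking send can be sketched with plain POSIX sockets. This is not pq_putmessage_noblock() and not walsender code; set_nonblocking() and send_all_with_timeout() are hypothetical helpers that only show how a sender can give up after a bounded wait instead of sitting in a blocking send() until the kernel's retransmission limit is reached.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
static int
set_nonblocking(int fd)
{
    int         flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/*
 * Hypothetical helper: send len bytes on a non-blocking socket.
 * Whenever the socket buffer is full, wait up to timeout_ms for it to
 * become writable again. Returns 0 when everything was sent, 1 if a
 * wait timed out, -1 on any other error. The timeout applies to each
 * wait, not to the call as a whole.
 */
static int
send_all_with_timeout(int fd, const char *data, size_t len, int timeout_ms)
{
    size_t      sent = 0;

    assert(fd >= 0 && data != NULL);

    while (sent < len)
    {
        ssize_t     n = send(fd, data + sent, len - sent, 0);

        if (n > 0)
        {
            sent += (size_t) n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        {
            struct pollfd pfd;
            int         rc;

            pfd.fd = fd;
            pfd.events = POLLOUT;
            pfd.revents = 0;

            rc = poll(&pfd, 1, timeout_ms);
            if (rc == 0)
                return 1;       /* peer did not drain data in time */
            if (rc < 0 && errno != EINTR)
                return -1;
            continue;           /* writable again (or interrupted): retry */
        }
        return -1;
    }
    return 0;
}
```

A caller that gets 1 back can treat the connection as dead and close it, which is roughly the behaviour a replication timeout during base backup would need.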

Regards,

--
Fujii Masao

#52Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#51)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better
to change both pg_basebackup background process and pg_receivexlog so that

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

Yes, it's useful, but for walsender the main problem is that it uses a blocking
send call to send the data.
I have tried using the tcp_keepalive settings, but the send call doesn't return
in case of a network break.
The only way I could get it to return is to
change the value in /proc/sys/net/ipv4/tcp_retries2, using
the command

echo "8" > /proc/sys/net/ipv4/tcp_retries2

As per the recommendation, its value should be at least 8 (equivalent to 100
sec).

Do you have any idea how it can be achieved?

What about using pq_putmessage_noblock()?

I will try this, but do you know why blocking mode was used to send the files in the first place?
I ask because I am a little worried that this change might break some design assumption made when the sending of files was implemented as blocking.

With Regards,
Amit Kapila.

#53Fujii Masao
masao.fujii@gmail.com
In reply to: Amit kapila (#52)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tue, Nov 13, 2012 at 1:06 PM, Amit kapila <amit.kapila@huawei.com> wrote:

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Before implementing the timeout parameter, I think that it's better to change
both pg_basebackup background process and pg_receivexlog so that

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

Yes, it's useful, but for walsender the main problem is that it uses a blocking
send call to send the data.
I have tried using tcp_keepalive settings, but the send call doesn't return
in case of a network break.
The only way I could get it to return is to
change the value in the file /proc/sys/net/ipv4/tcp_retries2 by using
the command

echo "8" > /proc/sys/net/ipv4/tcp_retries2

As per the recommendation, its value should be at least 8 (equivalent to 100
sec)

Do you have any idea how it can be achieved?

What about using pq_putmessage_noblock()?

I will try this, but do you know why blocking mode was used to send the files in the first place?
I am asking because I am a little worried that the change might break some design assumption that was made when the file sending was implemented as blocking.

I'm afraid I don't know why. I guess that using non-blocking mode complicates
the code, so in the first version of pg_basebackup the blocking mode
was adopted.

Regards,

--
Fujii Masao

#54Amit kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#51)
1 attachment(s)
Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

By background process, you mean ReceiveXlogStream?
For both.

I think for background process, it can be done in a way similar to what we
have done for walreceiver.

Yes.

But I have some doubts about how to do it for the main process:

Logic similar to walreceiver's cannot be used in case the network goes down while
the other database files are being fetched from the server.
The reason is that, to receive the data files, PQgetCopyData() is
called in synchronous mode, so it keeps waiting indefinitely until it
gets some data.
To solve this issue, I can think of the following options:
1. Making this call asynchronous as well (but not sure about the impact of this).

+1

Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in the similar way to walreceiver's.

2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite
wait), we can pass some finite time. This time can be received as a command
line argument
from the respective utility and set in the PGconn structure.

Yes, I think that we should add something like --conninfo option to
pg_basebackup
and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
application_name, ... by using such option accepting conninfo string.

I have prepared the attached patch to make pg_basebackup and pg_receivexlog non-blocking.
To do so, I had to add new command line parameters to pg_basebackup and pg_receivexlog.
For now I have added two command line arguments:
a. "-r" for pg_basebackup and pg_receivexlog to take the receive timeout value. The default value for this parameter is 60 sec.
b. "-t" for pg_basebackup and pg_receivexlog to take the initial connection timeout value. The default is to wait indefinitely.
We can change this to accept --conninfo as well.
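
As an illustration of how a seconds value such as the one given to "-r" could be validated before it is scaled to milliseconds, here is a minimal sketch. It is not code from the attached patch: the helper name parse_timeout_ms is hypothetical, and the choice to reject trailing garbage and values that would overflow an int is an assumption about the desired behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): parse a timeout given in
 * seconds and store it in milliseconds.  Returns false for non-numeric
 * input, trailing garbage, negative values, or values that would overflow
 * an int once multiplied by 1000.
 */
static bool
parse_timeout_ms(const char *arg, int *out_ms)
{
    char *end;
    long  secs;

    assert(arg != NULL && out_ms != NULL);

    errno = 0;
    secs = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return false;           /* out of range for long, empty, or not a number */
    if (secs < 0 || secs > INT_MAX / 1000)
        return false;           /* negative, or too large to scale safely */

    *out_ms = (int) secs * 1000;
    return true;
}
```

A caller in the option loop would report "invalid recv timeout" and exit when the helper returns false, instead of relying on atoi(), which cannot distinguish "0" from garbage.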

I feel that, apart from the above, the remaining problem is the call to PQgetResult():
1. Wherever a query is sent from BaseBackup, it calls PQgetResult() to receive the result of the query.
As PQgetResult() is a blocking function (it calls pqWait, which can hang), if the network goes down before the query itself is sent,
there will never be any result, so it will keep hanging in PQgetResult().
IMO, it can be solved in one of the following ways:
a. Create a corresponding non-blocking function. But PQgetResult() is also called from inside some
other libpq functions (PQexec->PQexecFinish->PQgetResult), so it can be a little tricky to solve it this way.
b. Add a receive_timeout variable to the PGconn structure and use it in pqWait as the timeout whenever it is set.
c. Any other better way?
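
For reference, the waiting part of such a non-blocking variant can be sketched on a plain file descriptor. This is only a sketch under assumptions, not libpq code: the helper names monotonic_ms and read_with_timeout are hypothetical, and the return convention (-2 for timeout) is made up for the example. With libpq, the descriptor would come from PQsocket(conn), and a caller that sent the query with PQsendQuery() would alternate between a bounded wait like this one, PQconsumeInput() and PQisBusy() before finally calling PQgetResult().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds on a monotonic clock, so wall-clock changes cannot distort the wait. */
static long long
monotonic_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: read from fd, but give up once timeout_ms has passed
 * without the descriptor becoming readable.
 * Returns the number of bytes read (0 means EOF), -1 on error, -2 on timeout.
 */
static ssize_t
read_with_timeout(int fd, char *buf, size_t len, int timeout_ms)
{
    long long deadline;

    assert(buf != NULL && timeout_ms >= 0);
    deadline = monotonic_ms() + timeout_ms;

    for (;;)
    {
        long long     remaining = deadline - monotonic_ms();
        struct pollfd pfd = {.fd = fd, .events = POLLIN};
        int           rc;

        if (remaining < 0)
            remaining = 0;

        rc = poll(&pfd, 1, (int) remaining);
        if (rc > 0)
            return read(fd, buf, len);  /* readable, hung up, or in error */
        if (rc == 0)
            return -2;                  /* nothing arrived before the deadline */
        if (errno != EINTR)
            return -1;
        /* interrupted by a signal: retry with whatever time is left */
    }
}
```

The point of the design is that the caller, not the library, decides how long to wait, which is what option (b) would achieve from inside pqWait.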

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

What about using pq_putmessage_noblock()?

I think maybe some more functions also need to be made non-blocking. I am still evaluating.

I will add the attached patch to the commitfest if you don't have any objections.

More Suggestions/Comments?

With Regards,
Amit Kapila.

Attachments:

noblock_basebackup_and_receivexlog.patch (application/octet-stream)
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
***************
*** 47,52 **** bool		includewal = false;
--- 47,56 ----
  bool		streamwal = false;
  bool		fastcheckpoint = false;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int		standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char		*standby_connect_timeout = NULL;		
+ 
+ #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
  /* Progress counters */
  static uint64 totalsize;
***************
*** 125,130 **** usage(void)
--- 129,138 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));	
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
***************
*** 237,244 **** LogStreamerMain(logstreamer_param *param)
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout,
! 						   true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
--- 245,252 ----
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout, 
! 						   standby_recv_timeout, true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
***************
*** 290,297 **** StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
  	}
  #endif
  
! 	/* Get a second connection */
! 	param->bgconn = GetConnection();
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 298,306 ----
  	}
  #endif
  
! 	/* Get a second connection. Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.*/
! 	param->bgconn = GetConnection(standby_connect_timeout);
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
***************
*** 467,472 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 476,482 ----
  	char		filename[MAXPGPATH];
  	char	   *copybuf = NULL;
  	FILE	   *tarfile = NULL;
+ 	int64   last_recv_timestamp;
  
  #ifdef HAVE_LIBZ
  	gzFile		ztarfile = NULL;
***************
*** 584,592 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 594,605 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64		now;
  
  		if (copybuf != NULL)
  		{
***************
*** 594,600 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  		if (r == -1)
  		{
  			/*
--- 607,668 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than the specified timeout, so that we can send a
! 			 * response back to the client.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
! 
! 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
! 			if (r == 0 || (r < 0 && errno == EINTR))
! 			{
! 				/*
! 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
! 				 */
! 				if (standby_recv_timeout > 0)
! 				{
! 					now = localGetCurrentTimestamp();
! 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
! 					{
! 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
! 																			progname);
! 						disconnect_and_exit(1);
! 					}
! 				}
! 				
! 				continue;
! 			}
! 			else if (r < 0)
! 			{
! 				fprintf(stderr, _("%s: select() failed: %s\n"),
! 						progname, strerror(errno));
! 				disconnect_and_exit(1);
! 			}
! 			/* Else there is actually data on the socket */
! 			if (PQconsumeInput(conn) == 0)
! 			{
! 				fprintf(stderr,
! 						_("%s: could not receive data from WAL Sender: %s"),
! 						progname, PQerrorMessage(conn));
! 				disconnect_and_exit(1);
! 			}
! 
! 			/* Set the last reply timestamp */
! 			last_recv_timestamp = localGetCurrentTimestamp();
! 
! 			/* Some data is received, so go back read them in buffer*/			
! 			continue;
! 		}		
  		if (r == -1)
  		{
  			/*
***************
*** 665,670 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 733,741 ----
  			disconnect_and_exit(1);
  		}
  
+ 		/* Set the last reply timestamp */
+ 		last_recv_timestamp = localGetCurrentTimestamp();	
+ 
  #ifdef HAVE_LIBZ
  		if (ztarfile != NULL)
  		{
***************
*** 714,719 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
--- 785,791 ----
  	int			current_padding = 0;
  	char	   *copybuf = NULL;
  	FILE	   *file = NULL;
+ 	int64  last_recv_timestamp;
  
  	if (PQgetisnull(res, rownum, 0))
  		strcpy(current_path, basedir);
***************
*** 731,739 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
--- 803,815 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64    		now;
+ 
  
  		if (copybuf != NULL)
  		{
***************
*** 741,748 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  
  		if (r == -1)
  		{
  			/*
--- 817,878 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than the specified timeout, so that we can send a
! 			 * response back to the client.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
  
+ 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
+ 			if (r == 0 || (r < 0 && errno == EINTR))
+ 			{
+ 				/*
+ 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
+ 				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
+ 																			progname);
+ 						disconnect_and_exit(1);
+ 					}
+ 				}
+ 				
+ 				continue;
+ 			}
+ 			else if (r < 0)
+ 			{
+ 				fprintf(stderr, _("%s: select() failed: %s\n"),
+ 						progname, strerror(errno));
+ 				disconnect_and_exit(1);
+ 			}
+ 			/* Else there is actually data on the socket */
+ 			if (PQconsumeInput(conn) == 0)
+ 			{
+ 				fprintf(stderr,
+ 						_("%s: could not receive data from WAL Sender: %s"),
+ 						progname, PQerrorMessage(conn));
+ 				disconnect_and_exit(1);
+ 			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 
+ 			/* Some data is received, so go back read them in buffer*/			
+ 			continue;
+ 		}
  		if (r == -1)
  		{
  			/*
***************
*** 755,765 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
  		}
  		else if (r == -2)
  		{
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 
  		if (file == NULL)
  		{
  			int			filemode;
--- 885,898 ----
  		}
  		else if (r == -2)
  		{
+ 		    fprintf(stderr, "\n");
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 		
! 		/* Set the last reply timestamp */
! 		last_recv_timestamp = localGetCurrentTimestamp();
  		if (file == NULL)
  		{
  			int			filemode;
***************
*** 953,961 **** BaseBackup(void)
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 1086,1094 ----
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server. Sending connect_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
***************
*** 1254,1259 **** main(int argc, char **argv)
--- 1387,1394 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},	
  		{"verbose", no_argument, NULL, 'v'},
  		{"progress", no_argument, NULL, 'P'},
  		{NULL, 0, NULL, 0}
***************
*** 1280,1286 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:xX:l:zZ:c:h:p:U:s:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 1415,1421 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:xX:l:zZ:c:h:p:U:s:r:t:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
***************
*** 1392,1397 **** main(int argc, char **argv)
--- 1527,1552 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 
+ 				break;			
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
  			case 'v':
  				verbose++;
  				break;
*** a/src/bin/pg_basebackup/pg_receivexlog.c
--- b/src/bin/pg_basebackup/pg_receivexlog.c
***************
*** 41,46 **** char	   *basedir = NULL;
--- 41,48 ----
  int			verbose = 0;
  int			noloop = 0;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int			standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char			*standby_connect_timeout = NULL;		
  volatile bool time_to_abort = false;
  
  
***************
*** 69,74 **** usage(void)
--- 71,80 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));		
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
***************
*** 224,232 **** StreamLog(void)
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
--- 230,239 ----
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server, Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
***************
*** 280,286 **** StreamLog(void)
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout, false);
  
  	PQfinish(conn);
  }
--- 287,294 ----
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout,
! 					  standby_recv_timeout, false);
  
  	PQfinish(conn);
  }
***************
*** 312,317 **** main(int argc, char **argv)
--- 320,327 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},				
  		{"verbose", no_argument, NULL, 'v'},
  		{NULL, 0, NULL, 0}
  	};
***************
*** 336,342 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 346,352 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:r:t:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
***************
*** 374,379 **** main(int argc, char **argv)
--- 384,409 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				break;
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
+ 				
  			case 'n':
  				noloop = 1;
  				break;
*** a/src/bin/pg_basebackup/receivelog.c
--- b/src/bin/pg_basebackup/receivelog.c
***************
*** 190,196 **** close_walfile(char *basedir, char *walname, bool segment_complete)
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! static int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
--- 190,196 ----
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
***************
*** 210,216 **** localGetCurrentTimestamp(void)
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! static void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
--- 210,216 ----
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
***************
*** 232,238 **** localTimestampDifference(int64 start_time, int64 stop_time,
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! static bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
--- 232,238 ----
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
***************
*** 342,348 **** bool
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
--- 342,349 ----
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, 
! 				  int standby_recv_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
***************
*** 350,355 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 351,358 ----
  	char	   *copybuf = NULL;
  	int64		last_status = -1;
  	XLogRecPtr	blockpos = InvalidXLogRecPtr;
+ 	int64	 last_recv_timestamp;
+ 	bool		ping_sent = false;
  
  	if (sysidentifier != NULL)
  	{
***************
*** 403,408 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 406,415 ----
  	}
  	PQclear(res);
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
+ 	ping_sent = false;
+ 	
  	/*
  	 * Receive the actual xlog data
  	 */
***************
*** 486,495 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Continue the loop and either
! 				 * deliver a status packet to the server or just go back into
  				 * blocking.
  				 */
  				continue;
  			}
  			else if (r < 0)
--- 493,531 ----
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
! 				 * and then either deliver a status packet to the server or just go back into
  				 * blocking.
  				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating XLogStream receiver due to timeout\n"),
+ 																			progname);
+ 						goto error;
+ 					}
+ 
+ 					/*
+ 					 * We didn't receive anything new, for half of receiver
+ 					 * replication timeout. Ping the server, if not already done.
+ 					 */
+ 					if (!ping_sent)
+ 					{
+ 						if (localTimestampDifferenceExceeds(last_recv_timestamp, now, (standby_recv_timeout/2)))
+ 						{
+ 							if (!sendFeedback(conn, blockpos, now, true))
+ 							{
+ 								goto error;
+ 							}
+ 
+ 							last_status = now;
+ 							ping_sent = true;
+ 						}
+ 					}		
+ 				}			
+ 				
  				continue;
  			}
  			else if (r < 0)
***************
*** 506,511 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 542,552 ----
  						progname, PQerrorMessage(conn));
  				goto error;
  			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 			ping_sent = false;
+ 			
  			continue;
  		}
  		if (r == -1)
***************
*** 518,524 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  			goto error;
  		}
  
! 		/* Check the message type. */
  		if (copybuf[0] == 'k')
  		{
  			int		pos;
--- 559,568 ----
  			goto error;
  		}
  
! 		/* Set the last reply timestamp */
! 		last_recv_timestamp = localGetCurrentTimestamp();
! 		ping_sent = false;
! 		
  		if (copybuf[0] == 'k')
  		{
  			int		pos;
*** a/src/bin/pg_basebackup/receivelog.h
--- b/src/bin/pg_basebackup/receivelog.h
***************
*** 7,16 ****
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 				  XLogRecPtr startpos,
! 				  uint32 timeline,
! 				  char *sysidentifier,
! 				  char *basedir,
! 				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout,
! 				  bool rename_partial);
--- 7,24 ----
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 							  XLogRecPtr startpos,
! 							  uint32 timeline,
! 							  char *sysidentifier,
! 							  char *basedir,
! 							  stream_stop_callback stream_stop,
! 							  int standby_message_timeout,
! 							  int standby_recv_timeout,
! 							  bool rename_partial);
! 
! extern int64 localGetCurrentTimestamp(void);
! extern void  localTimestampDifference(int64 start_time, int64 stop_time,
! 								 long *secs, int *microsecs);
! extern bool  localTimestampDifferenceExceeds(int64 start_time,
! 								int64 stop_time,
! 								int msec);
*** a/src/bin/pg_basebackup/streamutil.c
--- b/src/bin/pg_basebackup/streamutil.c
***************
*** 72,80 **** pg_malloc0(size_t size)
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
   */
  PGconn *
! GetConnection(void)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
--- 72,82 ----
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
+  * Set conn_timeout to PGconn structure if their value 
+  * is not NULL.
   */
  PGconn *
! GetConnection(char *conn_timeout)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
***************
*** 91,96 **** GetConnection(void)
--- 93,100 ----
  		argcount++;
  	if (dbport)
  		argcount++;
+ 	if (conn_timeout)
+ 		argcount++;
  
  	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
  	values = pg_malloc0((argcount + 1) * sizeof(*values));
***************
*** 120,125 **** GetConnection(void)
--- 124,135 ----
  		values[i] = dbport;
  		i++;
  	}
+ 	if (conn_timeout != NULL)
+ 	{
+ 		keywords[i] = "connect_timeout";
+ 		values[i] = conn_timeout;
+ 		i++;
+ 	}
  
  	while (true)
  	{
*** a/src/bin/pg_basebackup/streamutil.h
--- b/src/bin/pg_basebackup/streamutil.h
***************
*** 19,22 **** extern PGconn *conn;
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! extern PGconn *GetConnection(void);
--- 19,22 ----
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! PGconn	   *GetConnection(char *conn_timeout);
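
The timeout and half-timeout ping checks that the patch adds to ReceiveXlogStream() can be read as a small decision function. The following is an illustrative sketch only, not code from the patch: the names RecvAction and recv_timeout_action are made up here, timestamps are assumed to be in microseconds (as localGetCurrentTimestamp() is assumed to return), and the comparison is assumed to be inclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: what the receive loop should do after an idle wakeup. */
typedef enum
{
    RECV_KEEP_WAITING,          /* nothing to do yet, go back to waiting */
    RECV_SEND_PING,             /* idle for half the timeout: ask the server to reply */
    RECV_TIMED_OUT              /* idle for the full timeout: give up */
} RecvAction;

/*
 * Decide what to do when select() returned without data.
 * last_recv_us and now_us are timestamps in microseconds; timeout_ms is the
 * receive timeout in milliseconds, where 0 disables the check.
 */
static RecvAction
recv_timeout_action(int64_t last_recv_us, int64_t now_us,
                    int timeout_ms, bool ping_sent)
{
    int64_t idle_ms;

    assert(timeout_ms >= 0);
    if (timeout_ms == 0)
        return RECV_KEEP_WAITING;

    idle_ms = (now_us - last_recv_us) / 1000;
    if (idle_ms >= timeout_ms)
        return RECV_TIMED_OUT;
    if (!ping_sent && idle_ms >= timeout_ms / 2)
        return RECV_SEND_PING;
    return RECV_KEEP_WAITING;
}
```

With the default of 60 seconds, this yields one ping after 30 idle seconds and termination after 60, and any received data resets both the timestamp and the ping flag.
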
#55Amit Kapila
amit.kapila@huawei.com
In reply to: Amit kapila (#54)
Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Thursday, November 15, 2012 7:29 PM Amit kapila wrote:

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Are you planning to introduce the timeout mechanism in pg_basebackup

I feel that, apart from the above, the remaining problem is the call to
PQgetResult():
1. Wherever a query is sent from BaseBackup, it
calls PQgetResult() to receive the result of the query.
As PQgetResult() is a blocking function (it calls pqWait, which can
hang), if the network goes down before the query itself is sent,
there will never be any result, so it will keep hanging in
PQgetResult().
IMO, it can be solved in one of the following ways:
a. Create a corresponding non-blocking function. But PQgetResult() is
also called from inside some
other libpq functions (PQexec->PQexecFinish->PQgetResult), so it can
be a little tricky to solve it this way.
b. Add a receive_timeout variable to the PGconn structure and use it in
pqWait as the timeout whenever it is set.
c. Any other better way?

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

What about using pq_putmessage_noblock()?

I think maybe some more functions also need to be made non-blocking. I
am still evaluating.

I have done the analysis, and it seems that the APIs below also need
non-blocking equivalents; otherwise the same problem can happen, as they are also
used in this flow.
a. pq_endmessage
b. EndCommand
c. pq_puttextmessage
d. pq_putemptymessage
e. ReadyForQuery - needed because walsender and a normal
backend are now the same.
f. ReadCommand - needed because walsender and a normal backend
are now the same. The solution for it can be tricky, as pq_getbyte is not called
from a first-level function.
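
To make the idea of a "noblock equivalent with a timeout" concrete, here is a generic sketch of a send loop on a non-blocking descriptor. It is not backend code and does not use the pq_* functions listed above: the names set_nonblocking and write_all_with_timeout are hypothetical, the timeout is assumed to mean "no progress for this many milliseconds", and SIGPIPE handling is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode. Returns 0 on success. */
static int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: write len bytes to a non-blocking fd.
 * Returns 0 when everything was written, -1 on error, and -2 if the peer
 * accepted no data for timeout_ms milliseconds.
 */
static int
write_all_with_timeout(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t sent = 0;

    assert(buf != NULL && timeout_ms >= 0);
    while (sent < len)
    {
        ssize_t       n = write(fd, buf + sent, len - sent);
        struct pollfd pfd = {.fd = fd, .events = POLLOUT};
        int           rc;

        if (n > 0)
        {
            sent += (size_t) n; /* progress was made, try to send the rest */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* The kernel buffer is full: wait for space, but not forever. */
        rc = poll(&pfd, 1, timeout_ms);
        if (rc == 0)
            return -2;
        if (rc < 0 && errno != EINTR)
            return -1;
    }
    return 0;
}
```

Unlike a blocking send, this returns control after a bounded time even when the network is broken and the kernel keeps retransmitting, so no change to tcp_retries2 is needed for the sender to notice.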

Suggestions/Thoughts?

With Regards,
Amit Kapila.

#56Boszormenyi Zoltan
zb@cybertec.at
In reply to: Amit kapila (#54)
Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

Hi,

On 2012-11-15 14:59, Amit kapila wrote:

On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:

On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:

On 19.10.2012 14:42, Amit kapila wrote:

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:

Are you planning to introduce the timeout mechanism in pg_basebackup
main process? Or background process? It's useful to implement both.

By background process, you mean ReceiveXlogStream?
For both.
I think for background process, it can be done in a way similar to what we
have done for walreceiver.

Yes.

But I have some doubts about how to do it for the main process:
Logic similar to walreceiver's cannot be used in case the network goes down while
the other database files are being fetched from the server.
The reason is that, to receive the data files, PQgetCopyData() is
called in synchronous mode, so it keeps waiting indefinitely until it
gets some data.
To solve this issue, I can think of the following options:
1. Making this call asynchronous as well (but not sure about the impact of this).

+1
Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in the similar way to walreceiver's.

2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite
wait), we can pass some finite time. This time can be received as a command
line argument from the respective utility and set in the PGconn structure.

Yes, I think that we should add something like --conninfo option to
pg_basebackup
and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
application_name, ... by using such option accepting conninfo string.
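
The asynchronous approach in option 1 boils down to a bounded wait on the socket before each read attempt. Below is a minimal sketch of that bounded wait using only select(); wait_readable() is a hypothetical helper, not a libpq function. A real caller would pass PQsocket(conn) as the descriptor.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd is readable or timeout_ms elapses.
 * Returns 1 if the descriptor is readable, 0 on timeout or if a signal
 * interrupted the wait, and -1 on any other error.
 */
int
wait_readable(int fd, int timeout_ms)
{
	fd_set		input_mask;
	struct timeval timeout;
	int			r;

	FD_ZERO(&input_mask);
	FD_SET(fd, &input_mask);
	timeout.tv_sec = timeout_ms / 1000;
	timeout.tv_usec = (timeout_ms % 1000) * 1000;

	r = select(fd + 1, &input_mask, NULL, NULL, &timeout);
	if (r > 0)
		return 1;				/* data (or EOF) can be read now */
	if (r == 0 || errno == EINTR)
		return 0;				/* caller checks its own deadline and loops */
	return -1;
}
```

Because the wait is short and finite, the caller gets control back regularly and can decide for itself when the peer has been silent for too long.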

I have prepared the attached patch to make pg_basebackup and pg_receivexlog non-blocking.
To do so, I had to add new command line parameters to pg_basebackup and pg_receivexlog.
For now I have added two command line arguments:
a. "-r" for pg_basebackup and pg_receivexlog, to take the receive timeout value. The default value for this parameter is 60 sec.
b. "-t" for pg_basebackup and pg_receivexlog, to take the initial connection timeout value. The default is to wait indefinitely.
We can change this to accept --conninfo as well.

I feel that, apart from the above, the remaining problem is the function call PQgetResult():
1. Wherever a query is sent from BaseBackup, it calls PQgetResult() to receive the result of the query.
As PQgetResult() is a blocking function (it calls pqWait, which can hang), if the network goes down before the query is sent,
there will never be a result, so it will keep hanging in PQgetResult().
IMO, it can be solved in the following ways:
a. Create a corresponding non-blocking function. But this function is called from inside some
other libpq functions (PQexec->PQexecFinish->PQgetResult), so it can be a little tricky to solve it this way.
b. Add a receive_timeout variable to the PGconn structure and use it in pqWait as the timeout whenever it is set.
c. Any other better way?
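
Option (b) can be pictured as follows. This is only a sketch under stated assumptions: demo_conn stands in for the relevant part of a connection object and demo_wait_for_input() for an internal wait routine; neither is the real PGconn or pqWait, and the field name receive_timeout_ms is invented for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant part of a connection object. */
typedef struct
{
	int			sock;				/* socket descriptor */
	int			receive_timeout_ms; /* <= 0 means "wait forever" */
} demo_conn;

/*
 * Wait for input on the connection, honouring the per-connection timeout.
 * Returns 1 if a read would not block (data, EOF or socket error pending),
 * 0 if the configured timeout expired, and -1 if poll() itself failed.
 */
int
demo_wait_for_input(const demo_conn *conn)
{
	struct pollfd pfd;
	int			timeout;
	int			r;

	/* Keep today's behaviour (infinite wait) unless a timeout was set. */
	timeout = (conn->receive_timeout_ms > 0) ? conn->receive_timeout_ms : -1;

	pfd.fd = conn->sock;
	pfd.events = POLLIN;
	pfd.revents = 0;

	do
	{
		/* Note: restarting after a signal restarts the full timeout. */
		r = poll(&pfd, 1, timeout);
	} while (r < 0 && errno == EINTR);

	if (r < 0)
		return -1;
	return (r > 0) ? 1 : 0;
}
```

The point of this design is that existing callers keep the infinite wait by default, while a utility that sets the field gets a bounded wait everywhere the routine is used, including nested calls such as the PQexec path mentioned in (a).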

BTW, IIRC the walsender has no timeout mechanism during sending
backup data to pg_basebackup. So it's also useful to implement the
timeout mechanism for the walsender during backup.

What about using pq_putmessage_noblock()?

I think some more functions may also need to be made non-blocking. I am still evaluating.

I will upload the attached patch to the commitfest if you don't have any objections.

More Suggestions/Comments?

With Regards,
Amit Kapila.

I am reviewing your patch.

* Is the patch in context diff format <http://en.wikipedia.org/wiki/Diff#Context_format>?

Yes.

* Does it apply cleanly to the current git master?

Not quite cleanly but it doesn't produce rejects or fuzz, only offset warnings:

[zozo@localhost postgresql]$ cat ../noblock_basebackup_and_receivexlog.patch | patch -p1
patching file src/bin/pg_basebackup/pg_basebackup.c
Hunk #1 succeeded at 41 (offset -6 lines).
Hunk #2 succeeded at 123 (offset -6 lines).
Hunk #3 succeeded at 239 (offset -6 lines).
Hunk #4 succeeded at 292 (offset -6 lines).
Hunk #5 succeeded at 470 (offset -6 lines).
Hunk #6 succeeded at 588 (offset -6 lines).
Hunk #7 succeeded at 601 (offset -6 lines).
Hunk #8 succeeded at 727 (offset -6 lines).
Hunk #9 succeeded at 779 (offset -6 lines).
Hunk #10 succeeded at 797 (offset -6 lines).
Hunk #11 succeeded at 811 (offset -6 lines).
Hunk #12 succeeded at 879 (offset -6 lines).
Hunk #13 succeeded at 1080 (offset -6 lines).
Hunk #14 succeeded at 1381 (offset -6 lines).
Hunk #15 succeeded at 1409 (offset -6 lines).
Hunk #16 succeeded at 1521 (offset -6 lines).
patching file src/bin/pg_basebackup/pg_receivexlog.c
Hunk #1 succeeded at 35 (offset -6 lines).
Hunk #2 succeeded at 65 (offset -6 lines).
Hunk #3 succeeded at 224 (offset -6 lines).
Hunk #4 succeeded at 281 (offset -6 lines).
Hunk #5 succeeded at 314 (offset -6 lines).
Hunk #6 succeeded at 341 (offset -5 lines).
Hunk #7 succeeded at 379 (offset -5 lines).
patching file src/bin/pg_basebackup/receivelog.c
Hunk #1 succeeded at 181 (offset -9 lines).
Hunk #2 succeeded at 201 (offset -9 lines).
Hunk #3 succeeded at 223 (offset -9 lines).
Hunk #4 succeeded at 333 (offset -9 lines).
Hunk #5 succeeded at 342 (offset -9 lines).
Hunk #6 succeeded at 397 (offset -9 lines).
Hunk #7 succeeded at 484 (offset -9 lines).
Hunk #8 succeeded at 533 (offset -9 lines).
Hunk #9 succeeded at 550 (offset -9 lines).
patching file src/bin/pg_basebackup/receivelog.h
patching file src/bin/pg_basebackup/streamutil.c
Hunk #1 succeeded at 66 (offset -6 lines).
Hunk #2 succeeded at 87 (offset -6 lines).
Hunk #3 succeeded at 118 (offset -6 lines).
patching file src/bin/pg_basebackup/streamutil.h

* Does it include reasonable tests, necessary doc patches, etc?

The test cases are not applicable. There is no test framework for
testing network outage in "make check".

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

* Does the patch actually implement that?

It seems so, the patch adds the connect_timeout parameter to
the connection options and uses PQgetCopyData(..., 1) to get
the data asynchronously and uses select(2) to watch for incoming
data.
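
The timeout decision the patch makes each time select() returns without data can be sketched in isolation as below. The names are hypothetical stand-ins (the patch itself uses localTimestampDifferenceExceeds(), last_recv_timestamp and ping_sent), timestamps are assumed to be in microseconds, and the "ping after half the timeout" step mirrors what the ReceiveXlogStream hunk of the attached patch does.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
	RECV_KEEP_WAITING,			/* nothing to do yet, go back to waiting */
	RECV_SEND_PING,				/* half the timeout passed: request a reply */
	RECV_TIMED_OUT				/* full timeout passed: give up */
} recv_action;

/* True if at least msec milliseconds lie between two microsecond timestamps. */
bool
demo_difference_exceeds(int64_t start_us, int64_t stop_us, int msec)
{
	return (stop_us - start_us) >= (int64_t) msec * 1000;
}

/*
 * Decide what the receive loop should do after a wait that produced no data.
 * A recv_timeout_ms of 0 or less disables the timeout entirely.
 */
recv_action
demo_check_recv_timeout(int64_t last_recv_us, int64_t now_us,
						int recv_timeout_ms, bool ping_sent)
{
	if (recv_timeout_ms <= 0)
		return RECV_KEEP_WAITING;
	if (demo_difference_exceeds(last_recv_us, now_us, recv_timeout_ms))
		return RECV_TIMED_OUT;
	if (!ping_sent &&
		demo_difference_exceeds(last_recv_us, now_us, recv_timeout_ms / 2))
		return RECV_SEND_PING;
	return RECV_KEEP_WAITING;
}
```

With a 2-second timeout, for example, one second of silence yields RECV_SEND_PING once, and two seconds of silence yields RECV_TIMED_OUT.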

* Do we want that?

It can speed up detecting network breakdown so yes.

* Do we already have it?

No.

* Does it follow SQL spec, or the community-agreed behavior?

There's no such SQL spec. The behaviour is desired.

* Does it include pg_dump support (if applicable)?

Not applicable.

* Are there dangers?

The patch author researched more functions that need
to be extended in a nonblocking way.
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00863.php

* Have all the bases been covered?

For pg_basebackup/pg_receivexlog (for PQgetCopyData and
PQconnect), yes.

Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch.

* Does the feature work as advertised?

Yes.

I tested it between two machines and pulled the ethernet
plug while pg_basebackup was running. With "-r 2", pg_basebackup
detected the timeout after 2 seconds. Without the patch, I lost
patience after two minutes and pressed Ctrl-C in pg_basebackup.

I also tested pg_receivexlog and it also noticed the network error
in the specified timeout.

* Are there corner cases the author has failed to consider?

As far as I can see in the client-side libpq code flow, no.

* Are there any assertion failures or crashes?

Not applicable, the patch is for client applications.

* Does the patch slow down simple tests?

No.

* If it claims to improve performance, does it?

Not applicable, not a performance patch. But it really
improves detecting network breakdown.

* Does it slow down other things?

No.

* Does it follow the project coding guidelines
<http://developer.postgresql.org/pgdocs/postgres/source.html>?

Yes.

* Are there portability issues?

No. It introduces atoi() and select() as new calls, these are portable.
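
One caveat on atoi(): it is portable, but it cannot report malformed input (a string such as "abc" simply yields 0), and multiplying its result by 1000 can overflow for very large values. The following is a minimal sketch of a stricter alternative based on strtol(). It is not what the patch does, and parse_timeout_seconds() is a hypothetical helper.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a non-negative timeout given in seconds and
 * store it in milliseconds. Returns false for empty input, trailing
 * garbage, negative values, or values whose millisecond form would not
 * fit in an int; *out_ms is left untouched in that case.
 */
bool
parse_timeout_seconds(const char *arg, int *out_ms)
{
	char	   *end;
	long		secs;

	errno = 0;
	secs = strtol(arg, &end, 10);
	if (end == arg || *end != '\0' || errno == ERANGE)
		return false;			/* no digits, trailing junk, or out of range */
	if (secs < 0 || secs > INT_MAX / 1000)
		return false;			/* negative, or secs * 1000 would overflow */

	*out_ms = (int) secs * 1000;
	return true;
}
```

In an option-parsing switch, a false return would map to the same "invalid recv timeout" error message the patch already prints.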

* Will it work on Windows/BSD etc?

It should.

* Are the comments sufficient and accurate?

This chunk below removes a comment which seems obvious enough
so it's not needed:

***************
*** 518,524 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
                         goto error;
                 }

!               /* Check the message type. */
                 if (copybuf[0] == 'k')
                 {
                         int             pos;
--- 559,568 ----
                         goto error;
                 }

!               /* Set the last reply timestamp */
!               last_recv_timestamp = localGetCurrentTimestamp();
!               ping_sent = false;
!
                 if (copybuf[0] == 'k')
                 {
                         int             pos;
***************

Other comments are sufficient and accurate.

* Does it do what it says, correctly?

This question is redundant with the above "Does the feature work as advertised?"
So yes.

* Does it produce compiler warnings?

No.

* Can you make it crash?

No.

* Is everything done in a way that fits together coherently with other features/modules?

Yes.

* Are there interdependencies that can cause problems?

No.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

#57Hari Babu
haribabu.kommi@huawei.com
In reply to: Boszormenyi Zoltan (#56)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:

I am reviewing your patch.
• Is the patch in context diff format?
Yes.

Thanks for reviewing the patch.

• Does it apply cleanly to the current git master?
Not quite cleanly but it doesn't produce rejects or fuzz, only offset warnings:

Will rebase the patch to head.

• Does it include reasonable tests, necessary doc patches, etc?
The test cases are not applicable. There is no test framework for
testing network outage in "make check".

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.

Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch.

I also think it is better to handle it as a separate patch for walsender.

• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***************
*** 518,524 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
                        goto error;
                }
  
!               /* Check the message type. */
                if (copybuf[0] == 'k')
                {
                        int             pos;
--- 559,568 ----
                        goto error;
                }
  
!               /* Set the last reply timestamp */
!               last_recv_timestamp = localGetCurrentTimestamp();
!               ping_sent = false;
!               
                if (copybuf[0] == 'k')
                {
                        int             pos;
***************

Other comments are sufficient and accurate.

I will fix and update the patch.

Please let me know if anything apart from the above needs to be taken care of.

Regards,
Hari babu.


#58Hari Babu
haribabu.kommi@huawei.com
In reply to: Noname (#1)
1 attachment(s)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On January 02, 2013 12:41 PM Hari Babu wrote:

On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:

I am reviewing your patch.
• Is the patch in context diff format?
Yes.

Thanks for reviewing the patch.

• Does it apply cleanly to the current git master?
Not quite cleanly but it doesn't produce rejects or fuzz, only offset warnings:

Will rebase the patch to head.

• Does it include reasonable tests, necessary doc patches, etc?
The test cases are not applicable. There is no test framework for
testing network outage in "make check".

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.

Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch.

I also think it is better to handle it as a separate patch for walsender.

• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***************
*** 518,524 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
                        goto error;
                }
  
!               /* Check the message type. */
                if (copybuf[0] == 'k')
                {
                        int             pos;
--- 559,568 ----
                        goto error;
                }
  
!               /* Set the last reply timestamp */
!               last_recv_timestamp = localGetCurrentTimestamp();
!               ping_sent = false;
!               
                if (copybuf[0] == 'k')
                {
                        int             pos;
***************

Other comments are sufficient and accurate.

I will fix and update the patch.

The attached V2 patch in the mail handles all the review comments identified
above.

Regards,
Hari babu.

Attachments:

pg_basebkup_recvxlog_noblock_comm_v2.patch (application/octet-stream)
*** a/doc/src/sgml/ref/pg_basebackup.sgml
--- b/doc/src/sgml/ref/pg_basebackup.sgml
***************
*** 386,391 **** PostgreSQL documentation
--- 386,411 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that receiver waits for communication from server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>     
+ 
+      <varlistentry>
+       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that client wait for connection to establish with server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>          
+ 
+      <varlistentry>
        <term><option>-U <replaceable>username</replaceable></option></term>
        <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
        <listitem>
*** a/doc/src/sgml/ref/pg_receivexlog.sgml
--- b/doc/src/sgml/ref/pg_receivexlog.sgml
***************
*** 164,169 **** PostgreSQL documentation
--- 164,189 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that receiver waits for communication from server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>     
+ 
+      <varlistentry>
+       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that client wait for connection to establish with server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>  
+      
+      <varlistentry>
        <term><option>-U <replaceable>username</replaceable></option></term>
        <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
        <listitem>
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
***************
*** 41,46 **** bool		includewal = false;
--- 41,50 ----
  bool		streamwal = false;
  bool		fastcheckpoint = false;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int		standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char		*standby_connect_timeout = NULL;		
+ 
+ #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
  /* Progress counters */
  static uint64 totalsize;
***************
*** 119,124 **** usage(void)
--- 123,132 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));	
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
***************
*** 231,238 **** LogStreamerMain(logstreamer_param *param)
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout,
! 						   true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
--- 239,246 ----
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout, 
! 						   standby_recv_timeout, true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
***************
*** 284,291 **** StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
  	}
  #endif
  
! 	/* Get a second connection */
! 	param->bgconn = GetConnection();
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 292,300 ----
  	}
  #endif
  
! 	/* Get a second connection. Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.*/
! 	param->bgconn = GetConnection(standby_connect_timeout);
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
***************
*** 461,466 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 470,476 ----
  	char		filename[MAXPGPATH];
  	char	   *copybuf = NULL;
  	FILE	   *tarfile = NULL;
+ 	int64   last_recv_timestamp;
  
  #ifdef HAVE_LIBZ
  	gzFile		ztarfile = NULL;
***************
*** 578,586 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 588,599 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64		now;
  
  		if (copybuf != NULL)
  		{
***************
*** 588,594 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  		if (r == -1)
  		{
  			/*
--- 601,662 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than the specified timeout, so that we can send a
! 			 * response back to the client.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
! 
! 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
! 			if (r == 0 || (r < 0 && errno == EINTR))
! 			{
! 				/*
! 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
! 				 */
! 				if (standby_recv_timeout > 0)
! 				{
! 					now = localGetCurrentTimestamp();
! 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
! 					{
! 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
! 																			progname);
! 						disconnect_and_exit(1);
! 					}
! 				}
! 				
! 				continue;
! 			}
! 			else if (r < 0)
! 			{
! 				fprintf(stderr, _("%s: select() failed: %s\n"),
! 						progname, strerror(errno));
! 				disconnect_and_exit(1);
! 			}
! 			/* Else there is actually data on the socket */
! 			if (PQconsumeInput(conn) == 0)
! 			{
! 				fprintf(stderr,
! 						_("%s: could not receive data from WAL Sender: %s"),
! 						progname, PQerrorMessage(conn));
! 				disconnect_and_exit(1);
! 			}
! 
! 			/* Set the last reply timestamp */
! 			last_recv_timestamp = localGetCurrentTimestamp();
! 
! 			/* Some data is received, so go back read them in buffer*/			
! 			continue;
! 		}		
  		if (r == -1)
  		{
  			/*
***************
*** 659,664 **** ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
--- 727,735 ----
  			disconnect_and_exit(1);
  		}
  
+ 		/* Set the last reply timestamp */
+ 		last_recv_timestamp = localGetCurrentTimestamp();	
+ 
  #ifdef HAVE_LIBZ
  		if (ztarfile != NULL)
  		{
***************
*** 708,713 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
--- 779,785 ----
  	int			current_padding = 0;
  	char	   *copybuf = NULL;
  	FILE	   *file = NULL;
+ 	int64  last_recv_timestamp;
  
  	if (PQgetisnull(res, rownum, 0))
  		strcpy(current_path, basedir);
***************
*** 725,733 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
--- 797,809 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64    		now;
+ 
  
  		if (copybuf != NULL)
  		{
***************
*** 735,742 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  
  		if (r == -1)
  		{
  			/*
--- 811,872 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than the specified timeout, so that we can send a
! 			 * response back to the client.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
  
+ 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
+ 			if (r == 0 || (r < 0 && errno == EINTR))
+ 			{
+ 				/*
+ 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
+ 				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
+ 																			progname);
+ 						disconnect_and_exit(1);
+ 					}
+ 				}
+ 				
+ 				continue;
+ 			}
+ 			else if (r < 0)
+ 			{
+ 				fprintf(stderr, _("%s: select() failed: %s\n"),
+ 						progname, strerror(errno));
+ 				disconnect_and_exit(1);
+ 			}
+ 			/* Else there is actually data on the socket */
+ 			if (PQconsumeInput(conn) == 0)
+ 			{
+ 				fprintf(stderr,
+ 						_("%s: could not receive data from WAL Sender: %s"),
+ 						progname, PQerrorMessage(conn));
+ 				disconnect_and_exit(1);
+ 			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 
+ 			/* Some data is received, so go back read them in buffer*/			
+ 			continue;
+ 		}
  		if (r == -1)
  		{
  			/*
***************
*** 749,759 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
  		}
  		else if (r == -2)
  		{
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 
  		if (file == NULL)
  		{
  			int			filemode;
--- 879,892 ----
  		}
  		else if (r == -2)
  		{
+ 		    fprintf(stderr, "\n");
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 		
! 		/* Set the last reply timestamp */
! 		last_recv_timestamp = localGetCurrentTimestamp();
  		if (file == NULL)
  		{
  			int			filemode;
***************
*** 947,955 **** BaseBackup(void)
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 1080,1088 ----
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server. Sending connect_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
***************
*** 1248,1253 **** main(int argc, char **argv)
--- 1381,1388 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},	
  		{"verbose", no_argument, NULL, 'v'},
  		{"progress", no_argument, NULL, 'P'},
  		{NULL, 0, NULL, 0}
***************
*** 1274,1280 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:xX:l:zZ:c:h:p:U:s:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 1409,1415 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:xX:l:zZ:c:h:p:U:s:r:t:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
***************
*** 1386,1391 **** main(int argc, char **argv)
--- 1521,1546 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 
+ 				break;			
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
  			case 'v':
  				verbose++;
  				break;
*** a/src/bin/pg_basebackup/pg_receivexlog.c
--- b/src/bin/pg_basebackup/pg_receivexlog.c
***************
*** 35,40 **** char	   *basedir = NULL;
--- 35,42 ----
  int			verbose = 0;
  int			noloop = 0;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int			standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char			*standby_connect_timeout = NULL;		
  volatile bool time_to_abort = false;
  
  
***************
*** 63,68 **** usage(void)
--- 65,74 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));		
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
***************
*** 218,226 **** StreamLog(void)
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
--- 224,233 ----
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server, Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
***************
*** 274,280 **** StreamLog(void)
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout, false);
  
  	PQfinish(conn);
  }
--- 281,288 ----
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout,
! 					  standby_recv_timeout, false);
  
  	PQfinish(conn);
  }
***************
*** 306,311 **** main(int argc, char **argv)
--- 314,321 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},				
  		{"verbose", no_argument, NULL, 'v'},
  		{NULL, 0, NULL, 0}
  	};
***************
*** 331,337 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 341,347 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:r:t:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
***************
*** 369,374 **** main(int argc, char **argv)
--- 379,404 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				break;
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
+ 				
  			case 'n':
  				noloop = 1;
  				break;
*** a/src/bin/pg_basebackup/receivelog.c
--- b/src/bin/pg_basebackup/receivelog.c
***************
*** 181,187 **** close_walfile(char *basedir, char *walname, bool segment_complete)
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! static int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
--- 181,187 ----
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
***************
*** 201,207 **** localGetCurrentTimestamp(void)
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! static void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
--- 201,207 ----
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
***************
*** 223,229 **** localTimestampDifference(int64 start_time, int64 stop_time,
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! static bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
--- 223,229 ----
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
***************
*** 333,339 **** bool
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
--- 333,340 ----
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, 
! 				  int standby_recv_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
***************
*** 341,346 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 342,349 ----
  	char	   *copybuf = NULL;
  	int64		last_status = -1;
  	XLogRecPtr	blockpos = InvalidXLogRecPtr;
+ 	int64	 last_recv_timestamp;
+ 	bool		ping_sent = false;
  
  	if (sysidentifier != NULL)
  	{
***************
*** 394,399 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 397,406 ----
  	}
  	PQclear(res);
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
+ 	ping_sent = false;
+ 	
  	/*
  	 * Receive the actual xlog data
  	 */
***************
*** 477,486 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Continue the loop and either
! 				 * deliver a status packet to the server or just go back into
  				 * blocking.
  				 */
  				continue;
  			}
  			else if (r < 0)
--- 484,522 ----
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Before continuing the loop, check for the
! 				 * receive timeout, then either deliver a status packet to the server or go back into
  				 * blocking.
  				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating XLogStream receiver due to timeout\n"),
+ 																			progname);
+ 						goto error;
+ 					}
+ 
+ 					/*
+ 					 * We didn't receive anything new, for half of receiver
+ 					 * replication timeout. Ping the server, if not already done.
+ 					 */
+ 					if (!ping_sent)
+ 					{
+ 						if (localTimestampDifferenceExceeds(last_recv_timestamp, now, (standby_recv_timeout/2)))
+ 						{
+ 							if (!sendFeedback(conn, blockpos, now, true))
+ 							{
+ 								goto error;
+ 							}
+ 
+ 							last_status = now;
+ 							ping_sent = true;
+ 						}
+ 					}		
+ 				}			
+ 				
  				continue;
  			}
  			else if (r < 0)
***************
*** 497,502 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 533,543 ----
  						progname, PQerrorMessage(conn));
  				goto error;
  			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 			ping_sent = false;
+ 			
  			continue;
  		}
  		if (r == -1)
***************
*** 509,514 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
--- 550,559 ----
  			goto error;
  		}
  
+ 		/* Set the last reply timestamp */
+ 		last_recv_timestamp = localGetCurrentTimestamp();
+ 		ping_sent = false;
+ 
  		/* Check the message type. */
  		if (copybuf[0] == 'k')
  		{
*** a/src/bin/pg_basebackup/receivelog.h
--- b/src/bin/pg_basebackup/receivelog.h
***************
*** 7,16 ****
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 				  XLogRecPtr startpos,
! 				  uint32 timeline,
! 				  char *sysidentifier,
! 				  char *basedir,
! 				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout,
! 				  bool rename_partial);
--- 7,24 ----
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 							  XLogRecPtr startpos,
! 							  uint32 timeline,
! 							  char *sysidentifier,
! 							  char *basedir,
! 							  stream_stop_callback stream_stop,
! 							  int standby_message_timeout,
! 							  int standby_recv_timeout,
! 							  bool rename_partial);
! 
! extern int64 localGetCurrentTimestamp(void);
! extern void  localTimestampDifference(int64 start_time, int64 stop_time,
! 								 long *secs, int *microsecs);
! extern bool  localTimestampDifferenceExceeds(int64 start_time,
! 								int64 stop_time,
! 								int msec);
*** a/src/bin/pg_basebackup/streamutil.c
--- b/src/bin/pg_basebackup/streamutil.c
***************
*** 66,74 **** pg_malloc0(size_t size)
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
   */
  PGconn *
! GetConnection(void)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
--- 66,76 ----
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
+  * If conn_timeout is not NULL, it is passed to libpq as the
+  * connect_timeout connection parameter.
   */
  PGconn *
! GetConnection(char *conn_timeout)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
***************
*** 85,90 **** GetConnection(void)
--- 87,94 ----
  		argcount++;
  	if (dbport)
  		argcount++;
+ 	if (conn_timeout)
+ 		argcount++;
  
  	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
  	values = pg_malloc0((argcount + 1) * sizeof(*values));
***************
*** 114,119 **** GetConnection(void)
--- 118,129 ----
  		values[i] = dbport;
  		i++;
  	}
+ 	if (conn_timeout != NULL)
+ 	{
+ 		keywords[i] = "connect_timeout";
+ 		values[i] = conn_timeout;
+ 		i++;
+ 	}
  
  	while (true)
  	{
*** a/src/bin/pg_basebackup/streamutil.h
--- b/src/bin/pg_basebackup/streamutil.h
***************
*** 19,22 **** extern PGconn *conn;
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! extern PGconn *GetConnection(void);
--- 19,22 ----
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! PGconn	   *GetConnection(char *conn_timeout);
#59Boszormenyi Zoltan
zb@cybertec.at
In reply to: Hari Babu (#58)
1 attachment(s)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On 2013-01-04 13:43, Hari Babu wrote:

On January 02, 2013 12:41 PM Hari Babu wrote:

On January 01, 2013 10:19 PM Boszormenyi Zoltan wrote:

I am reviewing your patch.
• Is the patch in context diff format?
Yes.

Thanks for reviewing the patch.

• Does it apply cleanly to the current git master?
Not quite cleanly, but it doesn't produce rejects or fuzz, only offset warnings:

Will rebase the patch to head.

• Does it include reasonable tests, necessary doc patches, etc?
The test cases are not applicable. There is no test framework for
testing network outage in "make check".

There are no documentation patches for the new --recvtimeout=INTERVAL
and --conntimeout=INTERVAL options for either pg_basebackup or
pg_receivexlog.

I will add the documentation for the same.
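
Both options take an interval in seconds, and the patch converts the receive timeout to milliseconds with atoi(optarg)*1000. As a side note, atoi() cannot report malformed input, and the multiplication can overflow for large values. A stricter parser could look like the sketch below. The function name parse_timeout_seconds is hypothetical and is not part of the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a non-negative number of seconds and store it
 * in *timeout_ms as milliseconds. Returns false on malformed input, on a
 * negative value, or when the millisecond value would not fit in an int.
 */
static bool
parse_timeout_seconds(const char *arg, int *timeout_ms)
{
	char	   *end;
	long		secs;

	if (arg == NULL || *arg == '\0')
		return false;

	errno = 0;
	secs = strtol(arg, &end, 10);

	/* Reject out-of-range values and trailing garbage such as "10s". */
	if (errno != 0 || *end != '\0')
		return false;

	/* Reject negatives and values whose millisecond form overflows int. */
	if (secs < 0 || secs > INT_MAX / 1000)
		return false;

	*timeout_ms = (int) secs * 1000;
	return true;
}
```

With such a helper, the option handling for -r would reduce to one call followed by the existing "invalid recv timeout" error message when it returns false.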

Per the previous comment, no. But those are for the backend
to notice network breakdowns and as such, they need a
separate patch.

I also think it is better to handle it as a separate patch for walsender.

• Are the comments sufficient and accurate?
This chunk below removes a comment which seems obvious enough
so it's not needed:
***************
*** 518,524 **** ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,

goto error;
}

!               /* Check the message type. */
if (copybuf[0] == 'k')
{
int             pos;
--- 559,568 ----
goto error;
}

! /* Set the last reply timestamp */
! last_recv_timestamp = localGetCurrentTimestamp();
! ping_sent = false;
!
if (copybuf[0] == 'k')
{
int pos;
***************

Other comments are sufficient and accurate.

I will fix and update the patch.

The attached V2 patch in the mail handles all the review comments identified
above.

Regards,
Hari babu.

Since my other patch against pg_basebackup is now committed,
this patch no longer applies cleanly: two hunks are rejected.
The fixed-up patch is attached.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/

Attachments:

pg_basebkup_recvxlog_noblock_comm_v3.patch (text/x-patch)
diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
*** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-05 17:34:30.742135371 +0100
--- postgresql/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-07 15:11:40.787007890 +0100
*************** PostgreSQL documentation
*** 400,405 ****
--- 400,425 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         Time that the receiver waits for communication from the server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>     
+ 
+      <varlistentry>
+       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         Time that the client waits for the connection to the server to be established (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>          
+ 
+      <varlistentry>
        <term><option>-U <replaceable>username</replaceable></option></term>
        <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
        <listitem>
diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_receivexlog.sgml postgresql/doc/src/sgml/ref/pg_receivexlog.sgml
*** postgresql.orig/doc/src/sgml/ref/pg_receivexlog.sgml	2012-11-08 13:13:04.152630639 +0100
--- postgresql/doc/src/sgml/ref/pg_receivexlog.sgml	2013-01-07 15:11:40.788007898 +0100
*************** PostgreSQL documentation
*** 164,169 ****
--- 164,189 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         Time that the receiver waits for communication from the server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>     
+ 
+      <varlistentry>
+       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         Time that the client waits for the connection to the server to be established (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>  
+      
+      <varlistentry>
        <term><option>-U <replaceable>username</replaceable></option></term>
        <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
        <listitem>
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c postgresql/src/bin/pg_basebackup/pg_basebackup.c
*** postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c	2013-01-05 17:34:30.778135625 +0100
--- postgresql/src/bin/pg_basebackup/pg_basebackup.c	2013-01-07 15:16:24.610037886 +0100
*************** bool		streamwal = false;
*** 45,50 ****
--- 45,54 ----
  bool		fastcheckpoint = false;
  bool		writerecoveryconf = false;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int		standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char		*standby_connect_timeout = NULL;		
+ 
+ #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
  
  /* Progress counters */
  static uint64 totalsize;
*************** usage(void)
*** 130,135 ****
--- 134,143 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client waits for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));	
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
*************** LogStreamerMain(logstreamer_param *param
*** 242,249 ****
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout,
! 						   true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
--- 250,257 ----
  {
  	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
  						   param->sysidentifier, param->xlogdir,
! 						   reached_end_position, standby_message_timeout, 
! 						   standby_recv_timeout, true))
  
  		/*
  		 * Any errors will already have been reported in the function process,
*************** StartLogStreamer(char *startpos, uint32
*** 295,302 ****
  	}
  #endif
  
! 	/* Get a second connection */
! 	param->bgconn = GetConnection();
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 303,311 ----
  	}
  #endif
  
! 	/* Get a second connection. Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.*/
! 	param->bgconn = GetConnection(standby_connect_timeout);
  	if (!param->bgconn)
  		/* Error message already written in GetConnection() */
  		exit(1);
*************** ReceiveTarFile(PGconn *conn, PGresult *r
*** 511,516 ****
--- 520,526 ----
  	char		filename[MAXPGPATH];
  	char	   *copybuf = NULL;
  	FILE	   *tarfile = NULL;
+ 	int64		last_recv_timestamp;
  	char		tarhdr[512];
  	bool		basetablespace = PQgetisnull(res, rownum, 0);
  	bool		in_tarhdr = true;
*************** ReceiveTarFile(PGconn *conn, PGresult *r
*** 634,642 ****
--- 644,655 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64		now;
  
  		if (copybuf != NULL)
  		{
*************** ReceiveTarFile(PGconn *conn, PGresult *r
*** 644,650 ****
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  		if (r == -1)
  		{
  			/*
--- 657,718 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than NAPTIME_PER_CYCLE, so that we can check the
! 			 * receive timeout between waits.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
! 
! 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
! 			if (r == 0 || (r < 0 && errno == EINTR))
! 			{
! 				/*
! 				 * Got a timeout or signal. Before continuing the loop, check for the receive timeout.
! 				 */
! 				if (standby_recv_timeout > 0)
! 				{
! 					now = localGetCurrentTimestamp();
! 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
! 					{
! 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
! 																			progname);
! 						disconnect_and_exit(1);
! 					}
! 				}
! 				
! 				continue;
! 			}
! 			else if (r < 0)
! 			{
! 				fprintf(stderr, _("%s: select() failed: %s\n"),
! 						progname, strerror(errno));
! 				disconnect_and_exit(1);
! 			}
! 			/* Else there is actually data on the socket */
! 			if (PQconsumeInput(conn) == 0)
! 			{
! 				fprintf(stderr,
! 						_("%s: could not receive data from WAL Sender: %s"),
! 						progname, PQerrorMessage(conn));
! 				disconnect_and_exit(1);
! 			}
! 
! 			/* Set the last reply timestamp */
! 			last_recv_timestamp = localGetCurrentTimestamp();
! 
! 			/* Some data was received, so go back and read it into the buffer */
! 			continue;
! 		}		
  		if (r == -1)
  		{
  			/*
*************** ReceiveTarFile(PGconn *conn, PGresult *r
*** 680,685 ****
--- 748,756 ----
  			/* 2 * 512 bytes empty data at end of file */
  			WRITE_TAR_DATA(zerobuf, sizeof(zerobuf));
  
+ 		/* Set the last reply timestamp */
+ 		last_recv_timestamp = localGetCurrentTimestamp();	
+ 
  #ifdef HAVE_LIBZ
  			if (ztarfile != NULL)
  			{
*************** ReceiveAndUnpackTarFile(PGconn *conn, PG
*** 860,865 ****
--- 931,937 ----
  	bool		basetablespace = PQgetisnull(res, rownum, 0);
  	char	   *copybuf = NULL;
  	FILE	   *file = NULL;
+ 	int64  last_recv_timestamp;
  
  	if (basetablespace)
  		strcpy(current_path, basedir);
*************** ReceiveAndUnpackTarFile(PGconn *conn, PG
*** 877,885 ****
--- 949,961 ----
  		disconnect_and_exit(1);
  	}
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
  	while (1)
  	{
  		int			r;
+ 		int64    		now;
+ 
  
  		if (copybuf != NULL)
  		{
*************** ReceiveAndUnpackTarFile(PGconn *conn, PG
*** 887,894 ****
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 0);
  
  		if (r == -1)
  		{
  			/*
--- 963,1024 ----
  			copybuf = NULL;
  		}
  
! 		r = PQgetCopyData(conn, &copybuf, 1);
! 		if (r == 0)
! 		{
! 			/*
! 			 * In async mode, and no data available. We block on reading but
! 			 * not more than NAPTIME_PER_CYCLE, so that we can check the
! 			 * receive timeout between waits.
! 			 */
! 			fd_set		input_mask;
! 			struct timeval timeout;
! 
! 			FD_ZERO(&input_mask);
! 			FD_SET(PQsocket(conn), &input_mask);
! 			timeout.tv_sec = 0; 
! 			timeout.tv_usec = NAPTIME_PER_CYCLE*1000;
  
+ 			r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &timeout);
+ 			if (r == 0 || (r < 0 && errno == EINTR))
+ 			{
+ 				/*
+ 				 * Got a timeout or signal. Before continuing the loop, check for the receive timeout.
+ 				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),
+ 																			progname);
+ 						disconnect_and_exit(1);
+ 					}
+ 				}
+ 				
+ 				continue;
+ 			}
+ 			else if (r < 0)
+ 			{
+ 				fprintf(stderr, _("%s: select() failed: %s\n"),
+ 						progname, strerror(errno));
+ 				disconnect_and_exit(1);
+ 			}
+ 			/* Else there is actually data on the socket */
+ 			if (PQconsumeInput(conn) == 0)
+ 			{
+ 				fprintf(stderr,
+ 						_("%s: could not receive data from WAL Sender: %s"),
+ 						progname, PQerrorMessage(conn));
+ 				disconnect_and_exit(1);
+ 			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 
+ 			/* Some data was received, so go back and read it into the buffer */
+ 			continue;
+ 		}
  		if (r == -1)
  		{
  			/*
*************** ReceiveAndUnpackTarFile(PGconn *conn, PG
*** 901,911 ****
  		}
  		else if (r == -2)
  		{
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 
  		if (file == NULL)
  		{
  			int			filemode;
--- 1031,1044 ----
  		}
  		else if (r == -2)
  		{
+ 		    fprintf(stderr, "\n");
  			fprintf(stderr, _("%s: could not read COPY data: %s"),
  					progname, PQerrorMessage(conn));
  			disconnect_and_exit(1);
  		}
! 		
! 		/* Set the last reply timestamp */
! 		last_recv_timestamp = localGetCurrentTimestamp();
  		if (file == NULL)
  		{
  			int			filemode;
*************** BaseBackup(void)
*** 1213,1221 ****
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
--- 1346,1354 ----
  	char		xlogend[64];
  
  	/*
! 	 * Connect in replication mode to the server. Sending connect_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		exit(1);
*************** main(int argc, char **argv)
*** 1524,1529 ****
--- 1657,1664 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},	
  		{"verbose", no_argument, NULL, 'v'},
  		{"progress", no_argument, NULL, 'P'},
  		{NULL, 0, NULL, 0}
*************** main(int argc, char **argv)
*** 1550,1556 ****
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:RxX:l:zZ:c:h:p:U:s:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 1685,1691 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:RxX:l:zZ:c:h:p:U:s:r:t:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
*************** main(int argc, char **argv)
*** 1665,1670 ****
--- 1800,1825 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 
+ 				break;			
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
  			case 'v':
  				verbose++;
  				break;
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_receivexlog.c postgresql/src/bin/pg_basebackup/pg_receivexlog.c
*** postgresql.orig/src/bin/pg_basebackup/pg_receivexlog.c	2013-01-02 09:19:03.856521815 +0100
--- postgresql/src/bin/pg_basebackup/pg_receivexlog.c	2013-01-07 15:11:40.792007931 +0100
*************** char	   *basedir = NULL;
*** 35,40 ****
--- 35,42 ----
  int			verbose = 0;
  int			noloop = 0;
  int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int			standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char			*standby_connect_timeout = NULL;		
  volatile bool time_to_abort = false;
  
  
*************** usage(void)
*** 63,68 ****
--- 65,74 ----
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
  			 "                         time between status packets sent to server (in seconds)\n"));
+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client waits for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));		
  	printf(_("  -U, --username=NAME    connect as specified database user\n"));
  	printf(_("  -w, --no-password      never prompt for password\n"));
  	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
*************** StreamLog(void)
*** 218,226 ****
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server
  	 */
! 	conn = GetConnection();
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
--- 224,233 ----
  				lo;
  
  	/*
! 	 * Connect in replication mode to the server, Sending connect_timeout
! 	 * as configured, there is no need for rw_timeout.
  	 */
! 	conn = GetConnection(standby_connect_timeout);
  	if (!conn)
  		/* Error message already written in GetConnection() */
  		return;
*************** StreamLog(void)
*** 274,280 ****
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout, false);
  
  	PQfinish(conn);
  }
--- 281,288 ----
  				timeline);
  
  	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
! 					  stop_streaming, standby_message_timeout,
! 					  standby_recv_timeout, false);
  
  	PQfinish(conn);
  }
*************** main(int argc, char **argv)
*** 306,311 ****
--- 314,321 ----
  		{"no-password", no_argument, NULL, 'w'},
  		{"password", no_argument, NULL, 'W'},
  		{"status-interval", required_argument, NULL, 's'},
+ 		{"recvtimeout", required_argument, NULL, 'r'},		
+ 		{"conntimeout", required_argument, NULL, 't'},				
  		{"verbose", no_argument, NULL, 'v'},
  		{NULL, 0, NULL, 0}
  	};
*************** main(int argc, char **argv)
*** 331,337 ****
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 341,347 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:r:t:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
*************** main(int argc, char **argv)
*** 369,374 ****
--- 379,404 ----
  					exit(1);
  				}
  				break;
+ 			case 'r':
+ 				standby_recv_timeout = atoi(optarg)*1000;
+ 				if (standby_recv_timeout < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid recv timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				break;
+ 			case 't':
+ 				if (atoi(optarg) < 0)
+ 				{
+ 					fprintf(stderr, _("%s: invalid connect timeout \"%s\"\n"),
+ 							progname, optarg);
+ 					exit(1);
+ 				}
+ 				
+ 				standby_connect_timeout = pg_strdup(optarg);
+ 				break;								
+ 				
  			case 'n':
  				noloop = 1;
  				break;
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/receivelog.c postgresql/src/bin/pg_basebackup/receivelog.c
*** postgresql.orig/src/bin/pg_basebackup/receivelog.c	2013-01-02 09:19:03.856521815 +0100
--- postgresql/src/bin/pg_basebackup/receivelog.c	2013-01-07 15:11:40.792007931 +0100
*************** close_walfile(char *basedir, char *walna
*** 181,187 ****
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! static int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
--- 181,187 ----
   * backend code. The protocol always uses integer timestamps, regardless of
   * server setting.
   */
! int64
  localGetCurrentTimestamp(void)
  {
  	int64 result;
*************** localGetCurrentTimestamp(void)
*** 201,207 ****
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! static void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
--- 201,207 ----
   * Local version of TimestampDifference(), since we are not linked with
   * backend code.
   */
! void
  localTimestampDifference(int64 start_time, int64 stop_time,
  						 long *secs, int *microsecs)
  {
*************** localTimestampDifference(int64 start_tim
*** 223,229 ****
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! static bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
--- 223,229 ----
   * Local version of TimestampDifferenceExceeds(), since we are not
   * linked with backend code.
   */
! bool
  localTimestampDifferenceExceeds(int64 start_time,
  								int64 stop_time,
  								int msec)
*************** bool
*** 333,339 ****
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
--- 333,340 ----
  ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
  				  char *sysidentifier, char *basedir,
  				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout, 
! 				  int standby_recv_timeout, bool rename_partial)
  {
  	char		query[128];
  	char		current_walfile_name[MAXPGPATH];
*************** ReceiveXlogStream(PGconn *conn, XLogRecP
*** 341,346 ****
--- 342,349 ----
  	char	   *copybuf = NULL;
  	int64		last_status = -1;
  	XLogRecPtr	blockpos = InvalidXLogRecPtr;
+ 	int64	 last_recv_timestamp;
+ 	bool		ping_sent = false;
  
  	if (sysidentifier != NULL)
  	{
*************** ReceiveXlogStream(PGconn *conn, XLogRecP
*** 394,399 ****
--- 397,406 ----
  	}
  	PQclear(res);
  
+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();
+ 	ping_sent = false;
+ 	
  	/*
  	 * Receive the actual xlog data
  	 */
*************** ReceiveXlogStream(PGconn *conn, XLogRecP
*** 477,486 ****
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Continue the loop and either
! 				 * deliver a status packet to the server or just go back into
  				 * blocking.
  				 */
  				continue;
  			}
  			else if (r < 0)
--- 484,522 ----
  			if (r == 0 || (r < 0 && errno == EINTR))
  			{
  				/*
! 				 * Got a timeout or signal. Before continuing the loop, check for the
! 				 * receive timeout, then either deliver a status packet to the server or go back into
  				 * blocking.
  				 */
+ 				if (standby_recv_timeout > 0)
+ 				{
+ 					now = localGetCurrentTimestamp();
+ 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
+ 					{
+ 						fprintf(stderr, _("%s: terminating XLogStream receiver due to timeout\n"),
+ 																			progname);
+ 						goto error;
+ 					}
+ 
+ 					/*
+ 					 * We didn't receive anything new, for half of receiver
+ 					 * replication timeout. Ping the server, if not already done.
+ 					 */
+ 					if (!ping_sent)
+ 					{
+ 						if (localTimestampDifferenceExceeds(last_recv_timestamp, now, (standby_recv_timeout/2)))
+ 						{
+ 							if (!sendFeedback(conn, blockpos, now, true))
+ 							{
+ 								goto error;
+ 							}
+ 
+ 							last_status = now;
+ 							ping_sent = true;
+ 						}
+ 					}		
+ 				}			
+ 				
  				continue;
  			}
  			else if (r < 0)
*************** ReceiveXlogStream(PGconn *conn, XLogRecP
*** 497,502 ****
--- 533,543 ----
  						progname, PQerrorMessage(conn));
  				goto error;
  			}
+ 
+ 			/* Set the last reply timestamp */
+ 			last_recv_timestamp = localGetCurrentTimestamp();
+ 			ping_sent = false;
+ 			
  			continue;
  		}
  		if (r == -1)
*************** ReceiveXlogStream(PGconn *conn, XLogRecP
*** 509,514 ****
--- 550,559 ----
  			goto error;
  		}
  
+ 		/* Set the last reply timestamp */
+ 		last_recv_timestamp = localGetCurrentTimestamp();
+ 		ping_sent = false;
+ 
  		/* Check the message type. */
  		if (copybuf[0] == 'k')
  		{
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/receivelog.h postgresql/src/bin/pg_basebackup/receivelog.h
*** postgresql.orig/src/bin/pg_basebackup/receivelog.h	2012-06-11 06:22:48.200921787 +0200
--- postgresql/src/bin/pg_basebackup/receivelog.h	2013-01-07 15:11:40.793007938 +0100
***************
*** 7,16 ****
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 				  XLogRecPtr startpos,
! 				  uint32 timeline,
! 				  char *sysidentifier,
! 				  char *basedir,
! 				  stream_stop_callback stream_stop,
! 				  int standby_message_timeout,
! 				  bool rename_partial);
--- 7,24 ----
  typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
  
  extern bool ReceiveXlogStream(PGconn *conn,
! 							  XLogRecPtr startpos,
! 							  uint32 timeline,
! 							  char *sysidentifier,
! 							  char *basedir,
! 							  stream_stop_callback stream_stop,
! 							  int standby_message_timeout,
! 							  int standby_recv_timeout,
! 							  bool rename_partial);
! 
! extern int64 localGetCurrentTimestamp(void);
! extern void  localTimestampDifference(int64 start_time, int64 stop_time,
! 								 long *secs, int *microsecs);
! extern bool  localTimestampDifferenceExceeds(int64 start_time,
! 								int64 stop_time,
! 								int msec);
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/streamutil.c postgresql/src/bin/pg_basebackup/streamutil.c
*** postgresql.orig/src/bin/pg_basebackup/streamutil.c	2013-01-02 09:19:03.856521815 +0100
--- postgresql/src/bin/pg_basebackup/streamutil.c	2013-01-07 15:11:40.793007938 +0100
*************** pg_malloc0(size_t size)
*** 66,74 ****
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
   */
  PGconn *
! GetConnection(void)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
--- 66,76 ----
   * Connect to the server. Returns a valid PGconn pointer if connected,
   * or NULL on non-permanent error. On permanent error, the function will
   * call exit(1) directly.
+  * Set conn_timeout to PGconn structure if their value 
+  * is not NULL.
   */
  PGconn *
! GetConnection(char *conn_timeout)
  {
  	PGconn	   *tmpconn;
  	int			argcount = 4;	/* dbname, replication, fallback_app_name,
*************** GetConnection(void)
*** 85,90 ****
--- 87,94 ----
  		argcount++;
  	if (dbport)
  		argcount++;
+ 	if (conn_timeout)
+ 		argcount++;
  
  	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
  	values = pg_malloc0((argcount + 1) * sizeof(*values));
*************** GetConnection(void)
*** 114,119 ****
--- 118,129 ----
  		values[i] = dbport;
  		i++;
  	}
+ 	if (conn_timeout != NULL)
+ 	{
+ 		keywords[i] = "connect_timeout";
+ 		values[i] = conn_timeout;
+ 		i++;
+ 	}
  
  	while (true)
  	{
diff -dcrpN postgresql.orig/src/bin/pg_basebackup/streamutil.h postgresql/src/bin/pg_basebackup/streamutil.h
*** postgresql.orig/src/bin/pg_basebackup/streamutil.h	2012-10-03 10:40:48.299207401 +0200
--- postgresql/src/bin/pg_basebackup/streamutil.h	2013-01-07 15:11:40.794007945 +0100
*************** extern PGconn *conn;
*** 19,22 ****
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! extern PGconn *GetConnection(void);
--- 19,22 ----
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
  
! PGconn	   *GetConnection(char *conn_timeout);
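
The keepalive handling in the receivelog.c hunks above can be read as a three-way decision made each time the wait for data ends with nothing received. What follows is a minimal sketch under that reading, not the patch's code: `recv_timeout_action`, the `RecvAction` enum and the millisecond timestamps are illustrative assumptions; only the half-timeout ping and the `ping_sent` flag are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the patch does this inline in ReceiveXlogStream(). */
typedef enum
{
	RECV_KEEP_WAITING,			/* nothing to do yet */
	RECV_SEND_PING,				/* send feedback and request a reply */
	RECV_TIMED_OUT				/* give up on the connection */
} RecvAction;

/*
 * Decide what to do when no data has arrived.  Times are in milliseconds.
 * timeout_ms <= 0 disables the check, mirroring the patch's
 * "standby_recv_timeout > 0" test.
 */
RecvAction
recv_timeout_action(int64_t last_recv_ms, int64_t now_ms,
					int timeout_ms, bool ping_sent)
{
	int64_t		idle_ms = now_ms - last_recv_ms;

	if (timeout_ms <= 0)
		return RECV_KEEP_WAITING;
	if (idle_ms >= timeout_ms)
		return RECV_TIMED_OUT;
	/* Ping once after half the timeout, as the patch does. */
	if (!ping_sent && idle_ms >= timeout_ms / 2)
		return RECV_SEND_PING;
	return RECV_KEEP_WAITING;
}
```

Any received data would reset `last_recv_ms` and clear `ping_sent`, which is what the "Set the last reply timestamp" hunks do.
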
#60Hari Babu
haribabu.kommi@huawei.com
In reply to: Boszormenyi Zoltan (#59)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On January 07, 2013 7:53 PM Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Patch is verified. Thanks for rebasing the patch.

Regards,
Hari babu.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Boszormenyi Zoltan (#59)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

Hi.

This patch was marked "Needs review" with no reviewers in the ongoing
CF, so I decided to take a look at it. I see that Zoltan has posted a
review, so I've added him to the list.

But I took a look at the latest patch in any case. Here are some
comments, mostly cosmetic ones.

diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
*** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-05 17:34:30.742135371 +0100
--- postgresql/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-07 15:11:40.787007890 +0100
*************** PostgreSQL documentation
*** 400,405 ****
--- 400,425 ----
</varlistentry>
<varlistentry>
+       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that receiver waits for communication from server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>     

I would reword this as "The maximum time (in seconds) to wait for data
from the server (default: wait forever)".

+      <varlistentry>
+       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
+       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
+       <listitem>
+        <para>
+         time that client wait for connection to establish with server (in seconds).
+        </para>
+       </listitem>
+      </varlistentry>          

Likewise, "The maximum time (in seconds) to wait for a connection to the
server to succeed (default: wait forever)".

Same thing in pg_receivexlog.sgml. Also, there's trailing whitespace in
various places in these files (and elsewhere in the patch), which should
be fixed.

diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c postgresql/src/bin/pg_basebackup/pg_basebackup.c
*** postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c	2013-01-05 17:34:30.778135625 +0100
--- postgresql/src/bin/pg_basebackup/pg_basebackup.c	2013-01-07 15:16:24.610037886 +0100
*************** bool		streamwal = false;
*** 45,50 ****
--- 45,54 ----
bool		fastcheckpoint = false;
bool		writerecoveryconf = false;
int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
+ int		standby_recv_timeout = 60*1000;		/* 60 sec = default */
+ char		*standby_connect_timeout = NULL;		

I don't really like standby_recv_timeout being an int and
standby_connect_timeout being a char *. I understand that it's so that
it can be assigned to "values[i]" in GetConnection(), but that reason is
very distant, and not obvious from this code at all.

That said, I don't know if it's really worth bothering with.
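
To make that distant reason concrete: `PQconnectdbParams()` takes parallel arrays of keyword and value strings, so every setting has to be passed as text, numeric or not. The sketch below only builds such arrays and does not call libpq. `build_conn_params` and the fixed array size are hypothetical, and the parameter set is simplified compared with the real GetConnection(); `connect_timeout` is the libpq keyword the patch uses.

```c
#include <stddef.h>

#define MAX_CONN_PARAMS 8

/*
 * Fill NULL-terminated keyword/value arrays in the layout expected by
 * PQconnectdbParams().  Optional settings are skipped when NULL.
 * Returns the number of pairs stored.
 */
int
build_conn_params(const char *host, const char *port,
				  const char *conn_timeout,
				  const char *keywords[MAX_CONN_PARAMS],
				  const char *values[MAX_CONN_PARAMS])
{
	int			i = 0;

	keywords[i] = "dbname";
	values[i] = "replication";
	i++;
	if (host != NULL)
	{
		keywords[i] = "host";
		values[i] = host;
		i++;
	}
	if (port != NULL)
	{
		keywords[i] = "port";
		values[i] = port;
		i++;
	}
	if (conn_timeout != NULL)
	{
		keywords[i] = "connect_timeout";	/* value is text, e.g. "10" */
		values[i] = conn_timeout;
		i++;
	}
	keywords[i] = NULL;
	values[i] = NULL;
	return i;
}
```
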

+ #define NAPTIME_PER_CYCLE 100 /* max sleep time between cycles (100ms) */

This probably needs a better comment. Why are we sleeping between
cycles? What cycles?

+ 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
+ 		   "                             server (in seconds)\n"));
+ 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
+ 		   "                             with server (in seconds)\n"));	

Same comments about wording apply, but perhaps there's no need to
mention the default.

! if (r == 0 || (r < 0 && errno == EINTR))
! {
! /*
! * Got a timeout or signal. Before Continuing the loop, check for timeout.
! */
! if (standby_recv_timeout > 0)
! {
! now = localGetCurrentTimestamp();

I'd make "now" local to this block, and get rid of the comment. The two
"if"s are perfectly clear. This applies to the same pattern in other
places in the patch as well.
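
A sketch of the suggested shape, with `now` declared inside the only block that uses it. This is not the patch's code: `timestamp_difference_exceeds` and `current_timestamp_us` stand in for the patch's `localTimestampDifferenceExceeds()` and `localGetCurrentTimestamp()`, and treating timestamps as microseconds since the Unix epoch is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Current time as microseconds since the Unix epoch (assumed unit). */
int64_t
current_timestamp_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* True if at least msec milliseconds separate the two timestamps. */
bool
timestamp_difference_exceeds(int64_t start_us, int64_t stop_us, int msec)
{
	return (stop_us - start_us) >= (int64_t) msec * 1000;
}

/* Called after the wait returned because of a timeout or EINTR. */
bool
recv_timed_out(int64_t last_recv_us, int recv_timeout_ms)
{
	if (recv_timeout_ms > 0)
	{
		int64_t		now = current_timestamp_us();

		if (timestamp_difference_exceeds(last_recv_us, now, recv_timeout_ms))
			return true;
	}
	return false;
}
```
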

! if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
! {
! fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),

Better wording? "DB File receive" is confusing. Even something like
"Closing connection due to read timeout" would be better. Or perhaps
you can make it like the following message, slightly lower:

! if (PQconsumeInput(conn) == 0)
! {
! fprintf(stderr,
! _("%s: could not receive data from WAL Sender: %s"),
! progname, PQerrorMessage(conn));

…and in the former case, say "read timeout" instead of PQerrorMessage().

! /* Set the last reply timestamp */
! last_recv_timestamp = localGetCurrentTimestamp();
!
! /* Some data is received, so go back read them in buffer*/
! continue;

No need for these comments.

+ 	/* Set the last reply timestamp */
+ 	last_recv_timestamp = localGetCurrentTimestamp();

Likewise (in various places).

/*
! * Connect in replication mode to the server, Sending connect_timeout
! * as configured, there is no need for rw_timeout.
*/
! conn = GetConnection(standby_connect_timeout);

This comment is pretty confusing.

* Connect to the server. Returns a valid PGconn pointer if connected,
* or NULL on non-permanent error. On permanent error, the function will
* call exit(1) directly.
+  * Set conn_timeout to PGconn structure if their value 
+  * is not NULL.
*/
PGconn *
! GetConnection(char *conn_timeout)

And this comment is just wrong.

The patch looks OK otherwise. Zoltan indicated that his tests were
successful, so I didn't retest. Marking "Waiting on author" again.

-- Abhijit

#62Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Boszormenyi Zoltan (#59)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Now that I look at this from a high-level perspective, why are we only
worried about timeouts in the Copy-mode and when connecting? The initial
checkpoint could take a long time too, and if the server turns into a
black hole while the checkpoint is running, pg_basebackup will still
hang. Then again, a short timeout on that phase would be a bad idea,
because the checkpoint can indeed take a long time.

In streaming replication, the keep-alive messages carry additional
information, the timestamps and WAL locations, so a keepalive makes
sense at that level. But otherwise, aren't we just trying to reimplement
TCP keepalives? TCP keepalives are not perfect, but if we want to have
an application level timeout, it should be implemented in the FE/BE
protocol.

I don't think we need to do anything specific to pg_basebackup. The user
can simply specify TCP keepalive settings in the connection string, like
with any libpq program.
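
As a rough illustration of that suggestion, the sketch below assembles a conninfo string carrying libpq's TCP keepalive parameters (`keepalives`, `keepalives_idle`, `keepalives_interval`, `keepalives_count`). The helper `format_keepalive_conninfo` is hypothetical; a program linked against libpq would hand the resulting string to `PQconnectdb()`.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write a conninfo string with TCP keepalive settings into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int
format_keepalive_conninfo(char *buf, size_t buflen, const char *host,
						  int idle_secs, int interval_secs, int count)
{
	int			n = snprintf(buf, buflen,
							 "host=%s keepalives=1 keepalives_idle=%d "
							 "keepalives_interval=%d keepalives_count=%d",
							 host, idle_secs, interval_secs, count);

	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

With idle=30, interval=10 and count=3, a dead peer would be noticed after roughly 30 + 3 * 10 = 60 seconds on platforms that honor all three settings.
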

- Heikki

#63Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#62)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Now that I look at this from a high-level perspective, why are we only
worried about timeouts in the Copy-mode and when connecting? The initial
checkpoint could take a long time too, and if the server turns into a
black hole while the checkpoint is running, pg_basebackup will still
hang. Then again, a short timeout on that phase would be a bad idea,
because the checkpoint can indeed take a long time.

True, but IMO, if somebody wants to take a base backup, he should do that when
the server is not loaded.

In streaming replication, the keep-alive messages carry additional
information, the timestamps and WAL locations, so a keepalive makes
sense at that level. But otherwise, aren't we just trying to reimplement
TCP keepalives? TCP keepalives are not perfect, but if we want to have
an application level timeout, it should be implemented in the FE/BE
protocol.

I don't think we need to do anything specific to pg_basebackup. The user
can simply specify TCP keepalive settings in the connection string, like
with any libpq program.

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I think specifying TCP settings is very cumbersome for most users; that's
the reason most standard interfaces (ODBC/JDBC) have such an application-level
timeout mechanism.

By implementing it in the FE/BE protocol (do you mean making such
non-blocking behavior part of libpq, or something else?), it might be generic
and usable by others as well, but it might need a few interface changes.

IMHO, if such low-impact changes to pg_basebackup make it sensitive to
network failures, the current approach can also be considered.

With Regards,
Amit Kapila.

#64Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#63)
Passing connection string to pg_basebackup

On 18.01.2013 08:50, Amit Kapila wrote:

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I was going to say you can just use "keepalives_idle=30" in the
connection string. But there's no way to pass a connection string to
pg_basebackup on the command line! The usual way to pass a connection
string is to pass it as the database name, and PQconnect will expand it,
but that doesn't work with pg_basebackup because it hardcodes the
database name as "replication". D'oh.

You could still use environment variables and a service file to do it,
but it's certainly more cumbersome. It clearly should be possible to
pass a full connection string to pg_basebackup, that's an obvious oversight.
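
The expansion mentioned above hinges on a simple test: when dbname expansion is enabled, libpq parses the dbname value as a connection string if it contains an `=` sign or starts with a connection URI prefix. The function below is a standalone approximation of that rule for illustration, not libpq's source, and `looks_like_conninfo` is a made-up name.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Approximate the rule libpq applies to a dbname value when expansion
 * is enabled: a URI prefix or an '=' sign marks a connection string.
 */
bool
looks_like_conninfo(const char *dbname)
{
	if (strncmp(dbname, "postgresql://", 13) == 0 ||
		strncmp(dbname, "postgres://", 11) == 0)
		return true;
	return strchr(dbname, '=') != NULL;
}
```

Because pg_basebackup always passes the literal "replication" as the database name, there is never anything for such a test to match, which is why the usual trick does not work there.
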

- Heikki

#65Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#64)
Re: Passing connection string to pg_basebackup

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I was going to say you can just use "keepalives_idle=30" in the
connection string. But there's no way to pass a connection string to
pg_basebackup on the command line! The usual way to pass a connection
string is to pass it as the database name, and PQconnect will expand it,
but that doesn't work with pg_basebackup because it hardcodes the
database name as "replication". D'oh.

You could still use environment variables and a service file to do it,
but it's certainly more cumbersome. It clearly should be possible to
pass a full connection string to pg_basebackup, that's an obvious oversight.

So to solve this problem, the following can be done:
1. Support a connection string in pg_basebackup, so that keepalives or
connect_timeout can be specified
2. Support recv_timeout separately, for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

With Regards,
Amit Kapila.

#66Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#65)
Re: Passing connection string to pg_basebackup

On 18.01.2013 13:41, Amit Kapila wrote:

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I was going to say you can just use "keepalives_idle=30" in the
connection string. But there's no way to pass a connection string to
pg_basebackup on the command line! The usual way to pass a connection
string is to pass it as the database name, and PQconnect will expand it,
but that doesn't work with pg_basebackup because it hardcodes the
database name as "replication". D'oh.

You could still use environment variables and a service file to do it,
but it's certainly more cumbersome. It clearly should be possible to
pass a full connection string to pg_basebackup, that's an obvious oversight.

So to solve this problem, the following can be done:
1. Support a connection string in pg_basebackup, so that keepalives or
connect_timeout can be specified
2. Support recv_timeout separately, for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

Right. Let's do just 1 for now. A general application-level, non-TCP,
keepalive message at the libpq level might be a good idea, but that's a
much larger patch, definitely not 9.3 material.

- Heikki

#67Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#66)
Re: Passing connection string to pg_basebackup

On Friday, January 18, 2013 5:35 PM Heikki Linnakangas wrote:

On 18.01.2013 13:41, Amit Kapila wrote:

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I was going to say you can just use "keepalives_idle=30" in the
connection string. But there's no way to pass a connection string to
pg_basebackup on the command line! The usual way to pass a connection
string is to pass it as the database name, and PQconnect will expand it,
but that doesn't work with pg_basebackup because it hardcodes the
database name as "replication". D'oh.

You could still use environment variables and a service file to do it,
but it's certainly more cumbersome. It clearly should be possible to
pass a full connection string to pg_basebackup, that's an obvious oversight.

So to solve this problem, the following can be done:
1. Support a connection string in pg_basebackup, so that keepalives or
connect_timeout can be specified
2. Support recv_timeout separately, for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

Right. Let's do just 1 for now.

I shall treat this as a review comment, update the patch, and move it
from CF-3 to the current CF.

A general application-level, non-TCP,
keepalive message at the libpq level might be a good idea, but that's a
much larger patch, definitely not 9.3 material.

With Regards,
Amit Kapila.

#68Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Heikki Linnakangas (#64)
Re: Passing connection string to pg_basebackup

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

You could still use environment variables and a service file to do it, but
it's certainly more cumbersome. It clearly should be possible to pass a full
connection string to pg_basebackup, that's an obvious oversight.

FWIW, +1. I would consider it a bugfix (backpatch, etc).

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#69Magnus Hagander
magnus@hagander.net
In reply to: Heikki Linnakangas (#66)
Re: Passing connection string to pg_basebackup

On Fri, Jan 18, 2013 at 1:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 18.01.2013 13:41, Amit Kapila wrote:

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

I was going to say you can just use "keepalives_idle=30" in the
connection string. But there's no way to pass a connection string to
pg_basebackup on the command line! The usual way to pass a connection
string is to pass it as the database name, and PQconnect will expand it,
but that doesn't work with pg_basebackup because it hardcodes the
database name as "replication". D'oh.

You could still use environment variables and a service file to do it,
but it's certainly more cumbersome. It clearly should be possible to
pass a full connection string to pg_basebackup, that's an obvious oversight.

So to solve this problem, the following can be done:
1. Support a connection string in pg_basebackup, so that keepalives or
connect_timeout can be specified
2. Support recv_timeout separately, for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

Right. Let's do just 1 for now. A general application-level, non-TCP,
keepalive message at the libpq level might be a good idea, but that's a much
larger patch, definitely not 9.3 material.

+1 for doing 1 now. But actually, I think we can just keep it that way
in the future as well. If you need to specify these fairly advanced
options, using a connection string really isn't a problem.

I think it would be more worthwhile to go through the rest of the
tools in bin/ and make sure they *all* support connection strings.
And, an important point, do it the same way.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#70Magnus Hagander
magnus@hagander.net
In reply to: Dimitri Fontaine (#68)
Re: Passing connection string to pg_basebackup

On Fri, Jan 18, 2013 at 2:43 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

You could still use environment variables and a service file to do it, but
it's certainly more cumbersome. It clearly should be possible to pass a full
connection string to pg_basebackup, that's an obvious oversight.

FWIW, +1. I would consider it a bugfix (backpatch, etc).

While it's a feature I'd very much like to see, I really don't think
you can consider it a bugfix. It's functionality that was left out -
it's not like we tried to implement it and it didn't work. We pushed
the whole implementation to "next version" (and then forgot about
actually putting it in the next version, until now)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#71Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Magnus Hagander (#70)
Re: Passing connection string to pg_basebackup

Magnus Hagander <magnus@hagander.net> writes:

FWIW, +1. I would consider it a bugfix (backpatch, etc).

While it's a feature I'd very much like to see, I really don't think
you can consider it a bugfix. It's functionality that was left out -
it's not like we tried to implement it and it didn't work. We pushed
the whole implementation to "next version" (and then forgot about
actually putting it in the next version, until now)

Thanks for reminding me about that, I completely forgot about all that.

On the other hand, discrepancies in between command line arguments
processing in our tools are already not helping our users (even if
pg_dump -d seems to have been fixed along the years); so much so that
I'm having a hard time finding any upside into having a different set of
command line argument capabilities for the same tool depending on the
major version.

We are not talking about a new feature per se, but exposing a feature
that about every other command line tool we ship has. So I think I'm
standing on my position that it should get backpatched as a "fix".

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#71)
Re: Passing connection string to pg_basebackup

Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:

On the other hand, discrepancies in between command line arguments
processing in our tools are already not helping our users (even if
pg_dump -d seems to have been fixed along the years); so much so that
I'm having a hard time finding any upside into having a different set of
command line argument capabilities for the same tool depending on the
major version.

We are not talking about a new feature per se, but exposing a feature
that about every other command line tool we ship has. So I think I'm
standing on my position that it should get backpatched as a "fix".

I don't think that argument holds any water at all. There would still
be differences in command line argument capabilities out there ---
they'd just be between minor versions not major ones. That's not any
easier for people to deal with. And what will you say to someone whose
application got broken by a minor-version update?

If this feature were all that critical someone would have noticed its
lack before now, anyway. So I can't get excited about back-patching.

regards, tom lane

#73Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#72)
Re: Passing connection string to pg_basebackup

Tom Lane <tgl@sss.pgh.pa.us> writes:

I don't think that argument holds any water at all. There would still
be differences in command line argument capabilities out there ---
they'd just be between minor versions not major ones. That's not any
easier for people to deal with. And what will you say to someone whose
application got broken by a minor-version update?

Fair enough, I suppose.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#74Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#72)
Re: Passing connection string to pg_basebackup

On Sat, Jan 19, 2013 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:

On the other hand, discrepancies in between command line arguments
processing in our tools are already not helping our users (even if
pg_dump -d seems to have been fixed along the years); so much so that
I'm having a hard time finding any upside into having a different set of
command line argument capabilities for the same tool depending on the
major version.

We are not talking about a new feature per se, but exposing a feature
that about every other command line tool we ship has. So I think I'm
standing on my position that it should get backpatched as a "fix".

I don't think that argument holds any water at all. There would still
be differences in command line argument capabilities out there ---
they'd just be between minor versions not major ones. That's not any
easier for people to deal with. And what will you say to someone whose
application got broken by a minor-version update?

I heartily agree. I can say from firsthand experience that when minor
releases break things for customers (and they do), the customers get
*really* cranky. Based on recent experience, I think we should be
tightening our standards for what gets back-patched, not loosening
them. (No, I don't have a specific example off-hand, sorry.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#75Kevin Grittner
kgrittn@mail.com
In reply to: Robert Haas (#74)
Re: Passing connection string to pg_basebackup

Robert Haas wrote:

I heartily agree. I can say from firsthand experience that when minor
releases break things for customers (and they do), the customers get
*really* cranky. Based on recent experience, I think we should be
tightening our standards for what gets back-patched, not loosening
them.

+1

Any change in a minor release which causes working production code
to break very quickly and seriously erodes confidence in the
ability to apply a minor release without extensive (and expensive)
testing. When that confidence erodes, users stay on old minor
releases for extended periods -- often until they hit one of the
bugs which was fixed in a minor release.

We need to be very conservative about back-patching any changes in
user-visible behavior.

-Kevin

#76Magnus Hagander
magnus@hagander.net
In reply to: Noname (#1)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Now that I look at this from a high-level perspective, why are we only
worried about timeouts in the Copy-mode and when connecting? The initial
checkpoint could take a long time too, and if the server turns into a
black hole while the checkpoint is running, pg_basebackup will still
hang. Then again, a short timeout on that phase would be a bad idea,
because the checkpoint can indeed take a long time.

True, but IMO, if somebody wants to take a base backup, he should do that when
the server is not loaded.

A lot of installations don't have such an option, because there is no
time when the server is not loaded.

In streaming replication, the keep-alive messages carry additional
information, the timestamps and WAL locations, so a keepalive makes
sense at that level. But otherwise, aren't we just trying to reimplement
TCP keepalives? TCP keepalives are not perfect, but if we want to have
an application level timeout, it should be implemented in the FE/BE
protocol.

I don't think we need to do anything specific to pg_basebackup. The user
can simply specify TCP keepalive settings in the connection string, like
with any libpq program.

I think currently user has no way to specify TCP keepalive settings from
pg_basebackup, please let me know if there is any such existing way?

You can set it through environment variables. As was discussed
elsewhere, it would be good to have the ability to do it natively to
pg_basebackup as well.

I think specifying TCP settings is very cumbersome for most users; that's
the reason most standard interfaces (ODBC/JDBC) have such an application-level
timeout mechanism.

By implementing it in the FE/BE protocol (do you mean making such
non-blocking behavior part of libpq, or something else?), it might be generic
and usable by others as well, but it might need a few interface changes.

If it's specifying them that is cumbersome, then that's the part we
should fix, rather than modifying the protocol, no?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#77Amit Kapila
amit.kapila@huawei.com
In reply to: Magnus Hagander (#76)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Monday, January 21, 2013 6:22 PM Magnus Hagander

On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Now that I look at this from a high-level perspective, why are we only
worried about timeouts in the Copy-mode and when connecting? The initial
checkpoint could take a long time too, and if the server turns into a
black hole while the checkpoint is running, pg_basebackup will still
hang. Then again, a short timeout on that phase would be a bad idea,
because the checkpoint can indeed take a long time.

True, but IMO, if somebody wants to take a basebackup, he should do that when
the server is not loaded.

A lot of installations don't have such an option, because there is no
time when the server is not loaded.

Good to know.
I have always heard that customers run background maintenance activities
(Reindex, Vacuum Full, etc.) when the server is less loaded.
For example:
a. Billing applications in telecom are relatively less loaded at night.
b. Databases used for Sensex transactions are relatively free
once the market is closed.
c. Banking solutions, because transactions are done mostly in the daytime.

There will be many cases where the database server is loaded all the time;
if you can give some examples, it will be good learning for me.

In streaming replication, the keep-alive messages carry additional
information, the timestamps and WAL locations, so a keepalive makes
sense at that level. But otherwise, aren't we just trying to reimplement
TCP keepalives? TCP keepalives are not perfect, but if we want to have
an application level timeout, it should be implemented in the FE/BE
protocol.

I don't think we need to do anything specific to pg_basebackup. The user
can simply specify TCP keepalive settings in the connection string, like
with any libpq program.

I think the user currently has no way to specify TCP keepalive settings from
pg_basebackup; please let me know if there is an existing way.

You can set it through environment variables. As was discussed
elsewhere, it would be good to have the ability to do it natively to
pg_basebackup as well.

Sure, we are already modifying the existing patch to support a connection
string in pg_basebackup and pg_receivexlog.

I think specifying TCP settings is very cumbersome for most users; that's
the reason most standard interfaces (ODBC/JDBC) have such an application-level
timeout mechanism.

By implementing it in the FE/BE protocol (do you mean making such
non-blocking behavior part of libpq, or something else?), it might be generic
and usable by others as well, but it might need a few interface changes.

If it's specifying them that is cumbersome, then that's the part we
should fix, rather than modifying the protocol, no?

That can be done as part of point 2 of the initial proposal
(2. Support recv_timeout separately to provide a way for users who are not
comfortable with TCP keepalives).

There are 2 ways to achieve this:
1. Change the FE/BE protocol - I am not sure exactly how this can be done,
but as per Heikki this is the better way of implementing it.
2. Make the socket non-blocking in pg_basebackup.

The advantage of Approach-1 is that if it is addressed in the lower
layers (libpq), then all other apps (pg_basebackup, etc.) can
use it, with no need to handle it separately in each application.

Since the changes in Approach-1 seem to be invasive, we decided to do it
later.
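
To make the second approach concrete, here is a minimal sketch of an application-level receive timeout. It waits with poll() before calling recv() instead of switching the socket to non-blocking mode, which is a substitution made for brevity. The function name recv_with_timeout and its return codes are hypothetical and do not come from any of the patches in this thread.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for data on sock, then read it.
 * Returns the number of bytes read (0 means the peer closed the connection),
 * -1 on error, or -2 if nothing arrived before the timeout.
 */
ssize_t
recv_with_timeout(int sock, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int         rc;

    pfd.fd = sock;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait (this restarts the full timeout) */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return -2;              /* the peer stayed silent for timeout_ms */

    return recv(sock, buf, len, 0);
}
```

A caller would treat -2 as "the server has gone quiet" and decide whether to give up. As noted earlier in the thread, a short timeout is unsuitable for phases such as the initial checkpoint, where a long silence is legitimate.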

With Regards,
Amit Kapila.


#78Hari Babu
haribabu.kommi@huawei.com
In reply to: Magnus Hagander (#69)
Re: Passing connection string to pg_basebackup

On Saturday, January 19, 2013 5:49 PM Magnus Hagander wrote:

On Fri, Jan 18, 2013 at 1:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 18.01.2013 13:41, Amit Kapila wrote:

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

To solve this problem, the following can be done:
1. Support a connection string in pg_basebackup and mention keepalives or
connection_timeout
2. Support recv_timeout separately to provide a way for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

Right. Let's do just 1 for now. A general application level, non-TCP,
keepalive message at the libpq level might be a good idea, but that's a much
larger patch, definitely not 9.3 material.

+1 for doing 1 now. But actually, I think we can just keep it that way
in the future as well. If you need to specify these fairly advanced
options, using a connection string really isn't a problem.

I think it would be more worthwhile to go through the rest of the
tools in bin/ and make sure they *all* support connection strings.
And, an important point, do it the same way.

Presently I am trying to implement option 1 by adding an extra command-line
option, -C "connection_string", to pg_basebackup and pg_receivexlog.
This option can be used with all the tools in the bin folder.

There is no plan to remove the existing command-line options of the tools as
of now.

To handle both kinds of options, we can follow one of these approaches:

1. To keep the code simple, if the user does not provide the
"connection_string" option, a connection string is formed internally from the
existing command-line options and used for further processing.

2. The connection_string and the existing command-line options are handled
separately.

I feel approach 1 is better. Please provide your suggestions on the same.
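
Approach 1 could be sketched roughly as follows: start from the user's connection string, if any, and append the individual options after it. The names conninfo_append and build_conninfo are hypothetical. The sketch assumes the option values contain no spaces or quotes, so quoting is left out. It checks every append for truncation because snprintf-style functions return the length they wanted to write, not the length actually written.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: append formatted text to buf at offset *used.
 * Returns 0 on success, -1 if the text would not fit; on failure buf is
 * left NUL-terminated at its previous length.
 */
int
conninfo_append(char *buf, size_t buflen, size_t *used, const char *fmt, ...)
{
    va_list     ap;
    int         n;

    if (*used >= buflen)
        return -1;

    va_start(ap, fmt);
    n = vsnprintf(buf + *used, buflen - *used, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t) n >= buflen - *used)
    {
        buf[*used] = '\0';      /* discard the partial write */
        return -1;
    }
    *used += (size_t) n;
    return 0;
}

/*
 * Hypothetical sketch of approach 1: the connection string (may be NULL)
 * comes first, the individual command-line options follow it.
 */
int
build_conninfo(char *buf, size_t buflen, const char *connstr,
               const char *host, const char *user, const char *port)
{
    size_t      used = 0;

    if (buflen == 0)
        return -1;
    buf[0] = '\0';

    if (connstr && conninfo_append(buf, buflen, &used, "%s ", connstr) < 0)
        return -1;
    if (host && conninfo_append(buf, buflen, &used, "host=%s ", host) < 0)
        return -1;
    if (user && conninfo_append(buf, buflen, &used, "user=%s ", user) < 0)
        return -1;
    if (port && conninfo_append(buf, buflen, &used, "port=%s ", port) < 0)
        return -1;
    return 0;
}
```

With this layout a single code path builds the string whether or not -C was given, which is the simplification approach 1 is after.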

Regards,
Hari babu.


#79Hari Babu
haribabu.kommi@huawei.com
In reply to: Noname (#1)
1 attachment(s)
Re: Passing connection string to pg_basebackup

On Tue, Jan 22, 2013 3:27 PM Hari Babu wrote:

On Saturday, January 19, 2013 5:49 PM Magnus Hagander wrote:

On Fri, Jan 18, 2013 at 1:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 18.01.2013 13:41, Amit Kapila wrote:

On Friday, January 18, 2013 3:46 PM Heikki Linnakangas wrote:

On 18.01.2013 08:50, Amit Kapila wrote:

To solve this problem, the following can be done:
1. Support a connection string in pg_basebackup and mention keepalives or
connection_timeout
2. Support recv_timeout separately to provide a way for users who are not
comfortable with TCP keepalives

a. 1 can be done alone
b. 2 can be done alone
c. both 1 and 2.

Right. Let's do just 1 for now. A general application level, non-TCP,
keepalive message at the libpq level might be a good idea, but that's a much
larger patch, definitely not 9.3 material.

+1 for doing 1 now. But actually, I think we can just keep it that way
in the future as well. If you need to specify these fairly advanced
options, using a connection string really isn't a problem.

I think it would be more worthwhile to go through the rest of the
tools in bin/ and make sure they *all* support connection strings.
And, an important point, do it the same way.

Presently I am trying to implement option 1 by adding an extra command-line
option, -C "connection_string", to pg_basebackup and pg_receivexlog.
This option can be used with all the tools in the bin folder.

There is no plan to remove the existing command-line options of the tools as
of now.

To handle both kinds of options, we can follow one of these approaches:

1. To keep the code simple, if the user does not provide the
"connection_string" option, a connection string is formed internally from the
existing command-line options and used for further processing.

2. The connection_string and the existing command-line options are handled
separately.

I feel approach 1 is better. Please provide your suggestions on the same.

Here is the patch which handles taking a connection string as an argument
to pg_basebackup and pg_receivexlog.

Description of changes:

1. A new command-line option, -C "connection-string", is added for passing the
connection string.
2. The "PQconnectdb" function is used for connecting to the server instead of
the existing "PQconnectdbParams" function.
3. The existing command-line parameters are assembled into a string and passed
to the "PQconnectdb" function.
4. If the user provides both a connection string and the existing command-line
options, the command-line options are given higher priority.
5. The "conninfo_parse" function is modified to handle a single quote in the
password provided as input.
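
On the quoting in point 5: the libpq documentation describes single-quoted values in a connection string as needing a backslash before each embedded single quote or backslash. A minimal sketch of a helper applying that rule is shown below. The name quote_conninfo_value is hypothetical and this is not the code used in the patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy src into dst as a single-quoted conninfo value,
 * writing a backslash before each single quote and backslash.
 * Returns 0 on success, -1 if dst is too small (dst is then set to "").
 */
int
quote_conninfo_value(char *dst, size_t dstlen, const char *src)
{
    size_t      j = 0;

    if (dstlen < 3)             /* need room for two quotes and the NUL */
    {
        if (dstlen > 0)
            dst[0] = '\0';
        return -1;
    }

    dst[j++] = '\'';
    for (; *src; src++)
    {
        size_t      need = (*src == '\'' || *src == '\\') ? 2 : 1;

        /* keep room for the closing quote and the NUL terminator */
        if (j + need + 2 > dstlen)
        {
            dst[0] = '\0';
            return -1;
        }
        if (need == 2)
            dst[j++] = '\\';
        dst[j++] = *src;
    }
    dst[j++] = '\'';
    dst[j] = '\0';
    return 0;
}
```

For example, the password pa'ss would be emitted as 'pa\'ss', so it could follow password= in the assembled string.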

Please provide your suggestions.

Regards,
Hari babu.

Attachments:

pg_basebkup_recvxlog_conn_string_v1.patch (application/octet-stream)
*** a/doc/src/sgml/ref/pg_basebackup.sgml
--- b/doc/src/sgml/ref/pg_basebackup.sgml
***************
*** 357,362 **** PostgreSQL documentation
--- 357,375 ----
     <para>
      The following command-line options control the database connection parameters.
  
+ 	<variablelist>
+      <varlistentry>
+       <term><option>-C <replaceable class="parameter">"OPTIONS"</replaceable></option></term>
+       <term><option>--connection-string=<replaceable class="parameter">"OPTIONS"</replaceable></option></term>
+       <listitem>
+        <para>
+         Specifies connection string options, used for connecting to server.
+         These option can be used along with other user supplied options.
+         In case of conflicting option user supplied option is choosen.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
      <variablelist>
       <varlistentry>
        <term><option>-h <replaceable class="parameter">host</replaceable></option></term>
*** a/doc/src/sgml/ref/pg_receivexlog.sgml
--- b/doc/src/sgml/ref/pg_receivexlog.sgml
***************
*** 121,126 **** PostgreSQL documentation
--- 121,139 ----
     <para>
      The following command-line options control the database connection parameters.
  
+ 	<variablelist>
+      <varlistentry>
+       <term><option>-C <replaceable class="parameter">"OPTIONS"</replaceable></option></term>
+       <term><option>--connection-string=<replaceable class="parameter">"OPTIONS"</replaceable></option></term>
+       <listitem>
+        <para>
+         Specifies connection string options, used for connecting to server.
+         These option can be used along with other user supplied options.
+         In case of conflicting option user supplied option is choosen.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
      <variablelist>
       <varlistentry>
        <term><option>-h <replaceable class="parameter">host</replaceable></option></term>
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
***************
*** 126,131 **** usage(void)
--- 126,132 ----
  	printf(_("  -V, --version          output version information, then exit\n"));
  	printf(_("  -?, --help             show this help, then exit\n"));
  	printf(_("\nConnection options:\n"));
+ 	printf(_("  -C, --connection-string=\"OPTIONS\" Connect to server using OPTIONS\n"));
  	printf(_("  -h, --host=HOSTNAME    database server host or socket directory\n"));
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
***************
*** 1104,1125 **** ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
  }
  
  /*
-  * Escape single quotes used in connection parameters
-  */
- static char *
- escape_quotes(const char *src)
- {
- 	char	   *result = escape_single_quotes_ascii(src);
- 
- 	if (!result)
- 	{
- 		fprintf(stderr, _("%s: out of memory\n"), progname);
- 		exit(1);
- 	}
- 	return result;
- }
- 
- /*
   * Create a recovery.conf file in memory using a PQExpBuffer
   */
  static void
--- 1105,1110 ----
***************
*** 1532,1537 **** main(int argc, char **argv)
--- 1517,1523 ----
  		{"pgdata", required_argument, NULL, 'D'},
  		{"format", required_argument, NULL, 'F'},
  		{"checkpoint", required_argument, NULL, 'c'},
+ 		{"connection-string", required_argument, NULL, 'C'},
  		{"write-recovery-conf", no_argument, NULL, 'R'},
  		{"xlog", no_argument, NULL, 'x'},
  		{"xlog-method", required_argument, NULL, 'X'},
***************
*** 1570,1576 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:RxX:l:zZ:c:h:p:U:s:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
--- 1556,1562 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:F:RxX:l:zZ:C:c:h:p:U:s:wWvP",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
***************
*** 1661,1666 **** main(int argc, char **argv)
--- 1647,1655 ----
  					exit(1);
  				}
  				break;
+ 			case 'C':
+ 				connection_string = pg_strdup(optarg);
+ 				break;
  			case 'h':
  				dbhost = pg_strdup(optarg);
  				break;
*** a/src/bin/pg_basebackup/pg_receivexlog.c
--- b/src/bin/pg_basebackup/pg_receivexlog.c
***************
*** 58,63 **** usage(void)
--- 58,64 ----
  	printf(_("  -V, --version          output version information, then exit\n"));
  	printf(_("  -?, --help             show this help, then exit\n"));
  	printf(_("\nConnection options:\n"));
+ 	printf(_("  -C, --connection-string=\"OPTIONS\" Connect to server using OPTIONS\n"));
  	printf(_("  -h, --host=HOSTNAME    database server host or socket directory\n"));
  	printf(_("  -p, --port=PORT        database server port number\n"));
  	printf(_("  -s, --status-interval=INTERVAL\n"
***************
*** 305,310 **** main(int argc, char **argv)
--- 306,312 ----
  	static struct option long_options[] = {
  		{"help", no_argument, NULL, '?'},
  		{"version", no_argument, NULL, 'V'},
+ 		{"connection-string", required_argument, NULL, 'C'},
  		{"directory", required_argument, NULL, 'D'},
  		{"host", required_argument, NULL, 'h'},
  		{"port", required_argument, NULL, 'p'},
***************
*** 338,348 **** main(int argc, char **argv)
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "D:h:p:U:s:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
  		{
  			case 'D':
  				basedir = pg_strdup(optarg);
  				break;
--- 340,353 ----
  		}
  	}
  
! 	while ((c = getopt_long(argc, argv, "C:D:h:p:U:s:nwWv",
  							long_options, &option_index)) != -1)
  	{
  		switch (c)
  		{
+ 			case 'C':
+ 				connection_string = pg_strdup(optarg);
+ 				break;
  			case 'D':
  				basedir = pg_strdup(optarg);
  				break;
*** a/src/bin/pg_basebackup/streamutil.c
--- b/src/bin/pg_basebackup/streamutil.c
***************
*** 24,29 **** char	   *dbport = NULL;
--- 24,47 ----
  int			dbgetpassword = 0;	/* 0=auto, -1=never, 1=always */
  static char *dbpassword = NULL;
  PGconn	   *conn = NULL;
+ char	   *connection_string = NULL;
+ 
+ 
+ /*
+  * Escape single quotes used in connection parameters
+  */
+ char *
+ escape_quotes(const char *src)
+ {
+ 	char	   *result = escape_single_quotes_ascii(src);
+ 
+ 	if (!result)
+ 	{
+ 		fprintf(stderr, _("%s: out of memory\n"), progname);
+ 		exit(1);
+ 	}
+ 	return result;
+ }
  
  /*
   * strdup() and malloc() replacements that print an error and exit
***************
*** 71,119 **** PGconn *
  GetConnection(void)
  {
  	PGconn	   *tmpconn;
- 	int			argcount = 4;	/* dbname, replication, fallback_app_name,
- 								 * password */
- 	int			i;
- 	const char **keywords;
- 	const char **values;
  	char	   *password = NULL;
  	const char *tmpparam;
! 
! 	if (dbhost)
! 		argcount++;
! 	if (dbuser)
! 		argcount++;
! 	if (dbport)
! 		argcount++;
! 
! 	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
! 	values = pg_malloc0((argcount + 1) * sizeof(*values));
! 
! 	keywords[0] = "dbname";
! 	values[0] = "replication";
! 	keywords[1] = "replication";
! 	values[1] = "true";
! 	keywords[2] = "fallback_application_name";
! 	values[2] = progname;
! 	i = 3;
  	if (dbhost)
! 	{
! 		keywords[i] = "host";
! 		values[i] = dbhost;
! 		i++;
! 	}
  	if (dbuser)
! 	{
! 		keywords[i] = "user";
! 		values[i] = dbuser;
! 		i++;
! 	}
  	if (dbport)
! 	{
! 		keywords[i] = "port";
! 		values[i] = dbport;
! 		i++;
! 	}
  
  	while (true)
  	{
--- 89,122 ----
  GetConnection(void)
  {
  	PGconn	   *tmpconn;
  	char	   *password = NULL;
  	const char *tmpparam;
! 	char		conninfo[MAXCONNINFO];
! 	int			conn_len = 0;
! 	char		*conninfo_ptr = conninfo;
! 
! 	/*
! 	 * Connect using deliberately undocumented parameter: replication. The
! 	 * database name is ignored by the server in replication mode.
! 	 */
! 	conn_len += snprintf(conninfo_ptr, (sizeof(conninfo) - conn_len),
! 	"%s dbname=replication replication=true fallback_application_name='%s' ",
! 	connection_string ? connection_string : "", progname);
! 
! 	/*
! 	 * Extra options provided are added at the end of connection string.
! 	 */
  	if (dbhost)
! 		conn_len += snprintf((conninfo_ptr + conn_len), (sizeof(conninfo) - conn_len),
! 					"host='%s' ", dbhost);
! 
  	if (dbuser)
! 		conn_len += snprintf((conninfo_ptr + conn_len), (sizeof(conninfo) - conn_len),
! 					"user=%s ", dbuser);
! 
  	if (dbport)
! 		conn_len += snprintf((conninfo_ptr + conn_len), (sizeof(conninfo) - conn_len),
! 					"port=%s ", dbport);
  
  	while (true)
  	{
***************
*** 127,144 **** GetConnection(void)
  			 * meaning this is the call for a second session to the same
  			 * database, so just forcibly reuse that password.
  			 */
! 			keywords[argcount - 1] = "password";
! 			values[argcount - 1] = dbpassword;
  			dbgetpassword = -1; /* Don't try again if this fails */
  		}
  		else if (dbgetpassword == 1)
  		{
  			password = simple_prompt(_("Password: "), 100, false);
! 			keywords[argcount - 1] = "password";
! 			values[argcount - 1] = password;
  		}
  
! 		tmpconn = PQconnectdbParams(keywords, values, true);
  
  		/*
  		 * If there is too little memory even to allocate the PGconn object
--- 130,147 ----
  			 * meaning this is the call for a second session to the same
  			 * database, so just forcibly reuse that password.
  			 */
! 			snprintf((conninfo_ptr + conn_len), (sizeof(conninfo) - conn_len),
! 						"password='%s' ", escape_quotes(dbpassword));
  			dbgetpassword = -1; /* Don't try again if this fails */
  		}
  		else if (dbgetpassword == 1)
  		{
  			password = simple_prompt(_("Password: "), 100, false);
! 			snprintf((conninfo_ptr + conn_len), (sizeof(conninfo) - conn_len),
! 						"password='%s' ", escape_quotes(password));
  		}
  
! 		tmpconn = PQconnectdb(conninfo);
  
  		/*
  		 * If there is too little memory even to allocate the PGconn object
***************
*** 165,179 **** GetConnection(void)
  			fprintf(stderr, _("%s: could not connect to server: %s\n"),
  					progname, PQerrorMessage(tmpconn));
  			PQfinish(tmpconn);
- 			free(values);
- 			free(keywords);
  			return NULL;
  		}
  
- 		/* Connection ok! */
- 		free(values);
- 		free(keywords);
- 
  		/*
  		 * Ensure we have the same value of integer timestamps as the server
  		 * we are connecting to.
--- 168,176 ----
*** a/src/bin/pg_basebackup/streamutil.h
--- b/src/bin/pg_basebackup/streamutil.h
***************
*** 5,10 **** extern char *dbhost;
--- 5,11 ----
  extern char *dbuser;
  extern char *dbport;
  extern int	dbgetpassword;
+ extern char	*connection_string;
  
  /* Connection kept global so we can disconnect easily */
  extern PGconn *conn;
***************
*** 15,20 **** extern PGconn *conn;
--- 16,22 ----
  	exit(code);									\
  	}
  
+ extern char *escape_quotes(const char *src);
  
  extern char *pg_strdup(const char *s);
  extern void *pg_malloc0(size_t size);
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 200,205 ****
--- 200,211 ----
  #endif
  
  /*
+  * MAXCONNINFO: maximum size of a connection string.
+  */
+ #define MAXCONNINFO		1024
+ 
+ 
+ /*
   *------------------------------------------------------------------------
   * The following symbols are for enabling debugging code, not for
   * controlling user-visible features or resource limits.
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 23,35 **** extern int	wal_receiver_status_interval;
  extern int	wal_receiver_timeout;
  extern bool hot_standby_feedback;
  
- /*
-  * MAXCONNINFO: maximum size of a connection string.
-  *
-  * XXX: Should this move to pg_config_manual.h?
-  */
- #define MAXCONNINFO		1024
- 
  /* Can we allow the standby to accept replication connection from another standby? */
  #define AllowCascadeReplication() (EnableHotStandby && max_wal_senders > 0)
  
--- 23,28 ----
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 4216,4224 **** conninfo_parse(const char *conninfo, PQExpBuffer errorMessage,
  				}
  				if (*cp == '\'')
  				{
- 					*cp2 = '\0';
  					cp++;
! 					break;
  				}
  				*cp2++ = *cp++;
  			}
--- 4216,4227 ----
  				}
  				if (*cp == '\'')
  				{
  					cp++;
! 					if (*cp != '\'')
! 					{
! 						*cp2 = '\0';
! 						break;
! 					}
  				}
  				*cp2++ = *cp++;
  			}
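
The last hunk above changes how a quoted value ends: a single quote immediately followed by another single quote is copied as one literal quote instead of closing the value. Below is a simplified standalone sketch of that rule. It is not libpq's actual code; the function name read_quoted_value and its interface are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified reader for a quoted value. "in" points just past
 * the opening quote. The unescaped value is copied to out. Returns a pointer
 * just past the closing quote, or NULL if the value is unterminated or out
 * is too small.
 */
const char *
read_quoted_value(const char *in, char *out, size_t outlen)
{
    size_t      j = 0;

    for (;;)
    {
        char        c = *in;

        if (c == '\0')
            return NULL;        /* unterminated quoted string */

        if (c == '\\')
        {
            in++;               /* backslash: take the next char literally */
            if (*in == '\0')
                return NULL;
            c = *in;
        }
        else if (c == '\'')
        {
            in++;
            if (*in != '\'')    /* a lone quote closes the value */
            {
                if (j >= outlen)
                    return NULL;
                out[j] = '\0';
                return in;
            }
            c = '\'';           /* a doubled quote is one literal quote */
        }

        if (j + 1 >= outlen)
            return NULL;
        out[j++] = c;
        in++;
    }
}
```

With this rule both it''s and it\'s read back as it's, so a value escaped by doubling quotes survives the round trip.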
#80Magnus Hagander
magnus@hagander.net
In reply to: Noname (#1)
Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

On Tue, Jan 22, 2013 at 7:31 AM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Monday, January 21, 2013 6:22 PM Magnus Hagander

On Fri, Jan 18, 2013 at 7:50 AM, Amit Kapila <amit.kapila@huawei.com>
wrote:

On Wednesday, January 16, 2013 4:02 PM Heikki Linnakangas wrote:

On 07.01.2013 16:23, Boszormenyi Zoltan wrote:

Since my other patch against pg_basebackup is now committed,
this patch doesn't apply cleanly, patch rejects 2 hunks.
The fixed up patch is attached.

Now that I look at this from a high-level perspective, why are we only
worried about timeouts in the Copy-mode and when connecting? The initial
checkpoint could take a long time too, and if the server turns into a
black hole while the checkpoint is running, pg_basebackup will still
hang. Then again, a short timeout on that phase would be a bad idea,
because the checkpoint can indeed take a long time.

True, but IMO, if somebody wants to take a basebackup, he should do that when
the server is not loaded.

A lot of installations don't have such an option, because there is no
time when the server is not loaded.

Good to know.
I have always heard that customers run background maintenance activities
(Reindex, Vacuum Full, etc.) when the server is less loaded.
For example:
a. Billing applications in telecom are relatively less loaded at night.

That assumes there is a nighttime. If you're operating in enough
timezones, that won't happen.

b. Databases used for Sensex transactions are relatively free
once the market is closed.
c. Banking solutions, because transactions are done mostly in the daytime.

True. But those are definitely very very narrow usecases ;)

Don't get me wrong. There are a *lot* of people who have nighttimes to
do maintenance in. They are the lucky ones :) But we can't assume this
scenario.

There will be many cases where the database server is loaded all the time;
if you can give some examples, it will be good learning for me.

Most internet based businesses that do business in multiple countries.
Or really, any business that has customers in multiple timezones
across the world. And even more to the point, any business whose
*customers* have customers in multiple timezones across the world,
provided they are services-based.

In streaming replication, the keep-alive messages carry additional
information, the timestamps and WAL locations, so a keepalive makes
sense at that level. But otherwise, aren't we just trying to reimplement
TCP keepalives? TCP keepalives are not perfect, but if we want to have
an application level timeout, it should be implemented in the FE/BE
protocol.

I don't think we need to do anything specific to pg_basebackup. The user
can simply specify TCP keepalive settings in the connection string, like
with any libpq program.

I think the user currently has no way to specify TCP keepalive settings from
pg_basebackup; please let me know if there is an existing way.

You can set it through environment variables. As was discussed
elsewhere, it would be good to have the ability to do it natively to
pg_basebackup as well.

Sure, we are already modifying the existing patch to support a connection
string in pg_basebackup and pg_receivexlog.

Good.

I think specifying TCP settings is very cumbersome for most users; that's
the reason most standard interfaces (ODBC/JDBC) have such an application-level
timeout mechanism.

By implementing it in the FE/BE protocol (do you mean making such
non-blocking behavior part of libpq, or something else?), it might be generic
and usable by others as well, but it might need a few interface changes.

If it's specifying them that is cumbersome, then that's the part we
should fix, rather than modifying the protocol, no?

That can be done as part of point 2 of the initial proposal
(2. Support recv_timeout separately to provide a way for users who are not
comfortable with TCP keepalives).

Looking at the bigger picture, we should in that case support those on
*all* our frontend applications, and not just pg_basebackup. To me, it
makes more sense to just say "use the connection string method to
connect when you need to set these parameters". There are always going
to be some parameters that require that.

There are 2 ways to achieve this:
1. Change the FE/BE protocol - I am not sure exactly how this can be done,
but as per Heikki this is the better way of implementing it.
2. Make the socket non-blocking in pg_basebackup.

The advantage of Approach-1 is that if it is addressed in the lower
layers (libpq), then all other apps (pg_basebackup, etc.) can
use it, with no need to handle it separately in each application.

Since the changes in Approach-1 seem to be invasive, we decided to do it
later.

Ok - I haven't really been following the thread, but that doesn't seem
unreasonable. The thing I was objecting to is putting in special
parameters to pg_basebackup to deal with it, rather than just
implementing the connection string option, which is consistent with
other tools and will give us other parameters as well for free.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
