pg_basebackup may fail to send feedbacks.

Started by Kyotaro HORIGUCHIalmost 11 years ago18 messages

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

2 attachment(s)

Hello, I received an report that pg_basebackup with "-X stream"
always exits with the following error.

pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The walsender had been terminated by replication timeout a bit
before the above message.

LOG: terminating walsender process due to replication timeout

====

I digged into this and found that a accumulation of delay in
receiving xlog stream can cause such an error. This hardly occurs
on an ordinary environment but this would be caused by temporary
(or permanent) heavy load on the receiver-side machine. A virtual
machine environment could raises the chance, I suppose.

In HandleCopyStream(), the feedbacks are sent only after breaks
of the xlog stream, so continuous flow of xlog stream prevents it
from being sent with expected intervals. walsender sends
keepalive message for such a case, but it arrives with a long
delay being caught by the congestion of the stream, but this is
not a problem because the keepalive is intended to be sent while
idle.

I think that the status feedback should be sent whenever
standby_message_timeout has elapsed just after (or before) an
incoming message is processed. The seemingly most straightforward
way to fix this is breaking the innner-loop (while(r != 0)) if
the time to feedback comes as the attached patch #1
(0001-Make-sure-to-send-...).

What do you thing about this?

regards,

=====
- How to reproduce the situation.

As mentioned before, this hardly occurs on ordinary
environment. But simulating the heavy load by inserting a delay
in HandleCopyStream() effectively let the error occur.

With the attached patch #2(insert_interval.diff), the following
operation let us see the situation. The walsender reports timeout
although the stream processing is delayed but running steadily.

postgresql.conf:

wal_level = hot_standby
max_wal_senders = 2
wal_sender_timeout = 20s # * 2 of default of standby_message_timeout

Terminal1$ psql postgres
postgres=# CREATE TABLE t1 (a text);
postgres=# ALTER TABLE t1 ALTER COLUMN a SET STORAGE EXTERNAL;
postgres=# INSERT INTO t1 (SELECT repeat('x', 10000) FROM generate_series(0, 99999)); -- about 1GB
postgres=# ^D

Terminal1$ pgbench -i postgres
Terminal1$ pgbench -T 600 -h localhost postgres

Terminal2$ pg_basebackupo -r 32k -X stream -D data -h localhost

... in about a couple of minutes on my environment..

Terminal1:

LOG: terminating walsender process due to replication timeout

... after another couple of minutes.

Terminal2:

pg_basebackup: could not receive data from WAL stream: server closed...

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Make-sure-to-send-feedback-at-desired-timing.patchtext/x-patch; charset=us-asciiDownload

>From 0df5eb2229d0e417e8aa52d83ea6b30681bdeef9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 2 Feb 2015 12:49:45 +0900
Subject: [PATCH] Make sure to send feedback at desired timing.

Continuous stream due to heavy-load on client side can prevent
feedbacks to be sent with expected interval, and it results in a
replication timeout on walsender. Exiting from the fast-path loop when
the time comes to make sure the feedback to be sent with expected
intervals.
---
 src/bin/pg_basebackup/receivelog.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..fb6738d 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -925,9 +925,15 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			}
 
 			/*
-			 * Process the received data, and any subsequent data we
-			 * can read without blocking.
+			 * Process the received data, and any subsequent data we can read
+			 * without blocking except when the time to feedback comes.
 			 */
+			now = feGetCurrentTimestamp();
+			if (standby_message_timeout > 0 &&
+				feTimestampDifferenceExceeds(last_status, now,
+											 standby_message_timeout))
+				break;
+
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
 	}
-- 
2.1.0.GIT

insert_interval.difftext/x-patch; charset=us-asciiDownload

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..69be40a 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -928,6 +928,7 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			 * Process the received data, and any subsequent data we
 			 * can read without blocking.
 			 */
+			usleep(100 * 1000);
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
 	}

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#1)

Re: pg_basebackup may fail to send feedbacks.

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

regards,

At Mon, 02 Feb 2015 13:48:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150202.134834.100672846.horiguchi.kyotaro@lab.ntt.co.jp>

Hello, I received an report that pg_basebackup with "-X stream"
always exits with the following error.

pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The walsender had been terminated by replication timeout a bit
before the above message.

LOG: terminating walsender process due to replication timeout

====

I digged into this and found that a accumulation of delay in
receiving xlog stream can cause such an error. This hardly occurs
on an ordinary environment but this would be caused by temporary
(or permanent) heavy load on the receiver-side machine. A virtual
machine environment could raises the chance, I suppose.

In HandleCopyStream(), the feedbacks are sent only after breaks
of the xlog stream, so continuous flow of xlog stream prevents it
from being sent with expected intervals. walsender sends
keepalive message for such a case, but it arrives with a long
delay being caught by the congestion of the stream, but this is
not a problem because the keepalive is intended to be sent while
idle.

I think that the status feedback should be sent whenever
standby_message_timeout has elapsed just after (or before) an
incoming message is processed. The seemingly most straightforward
way to fix this is breaking the innner-loop (while(r != 0)) if
the time to feedback comes as the attached patch #1
(0001-Make-sure-to-send-...).

What do you thing about this?

regards,

=====
- How to reproduce the situation.

As mentioned before, this hardly occurs on ordinary
environment. But simulating the heavy load by inserting a delay
in HandleCopyStream() effectively let the error occur.

With the attached patch #2(insert_interval.diff), the following
operation let us see the situation. The walsender reports timeout
although the stream processing is delayed but running steadily.

postgresql.conf:

wal_level = hot_standby
max_wal_senders = 2
wal_sender_timeout = 20s # * 2 of default of standby_message_timeout

Terminal1$ psql postgres
postgres=# CREATE TABLE t1 (a text);
postgres=# ALTER TABLE t1 ALTER COLUMN a SET STORAGE EXTERNAL;
postgres=# INSERT INTO t1 (SELECT repeat('x', 10000) FROM generate_series(0, 99999)); -- about 1GB
postgres=# ^D

Terminal1$ pgbench -i postgres
Terminal1$ pgbench -T 600 -h localhost postgres

Terminal2$ pg_basebackupo -r 32k -X stream -D data -h localhost

... in about a couple of minutes on my environment..

Terminal1:

LOG: terminating walsender process due to replication timeout

... after another couple of minutes.

Terminal2:

pg_basebackup: could not receive data from WAL stream: server closed...

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: pg_basebackup may fail to send feedbacks.

On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

Seems walreceiver has the same problem. No?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#3)

Re: pg_basebackup may fail to send feedbacks.

Hello,

At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com>

On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

Seems walreceiver has the same problem. No?

pg_receivexlog.c would have the same problem since it uses the
same function with pg_basebackup.c.

The correspondent of HandleCopyStream in wansender is
WalReceiverMain, and it doesn't seem to have the same kind of
loop shown below. It seems to surely send feedback per one
record.

| r = stream_reader();
| while (r > 0)
| {
| ... wal record processing stuff without sending feedback..
| r = stream_reader();
| }

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#4)

Re: pg_basebackup may fail to send feedbacks.

On Thu, Feb 5, 2015 at 10:20 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com>

On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

Seems walreceiver has the same problem. No?

pg_receivexlog.c would have the same problem since it uses the
same function with pg_basebackup.c.

The correspondent of HandleCopyStream in wansender is
WalReceiverMain, and it doesn't seem to have the same kind of
loop shown below. It seems to surely send feedback per one
record.

| r = stream_reader();
| while (r > 0)
| {
| ... wal record processing stuff without sending feedback..
| r = stream_reader();
| }

WalReceiverMain() has the similar code as follows.

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
for (;;)
{
if (len > 0)
{
....
len = walrcv_receive(0, &buf);
}
}

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#5)

Re: pg_basebackup may fail to send feedbacks.

Sorry, I misunderstood that.

At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com>

On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

Seems walreceiver has the same problem. No?

pg_receivexlog.c would have the same problem since it uses the
same function with pg_basebackup.c.

The correspondent of HandleCopyStream in wansender is
WalReceiverMain, and it doesn't seem to have the same kind of
loop shown below. It seems to surely send feedback per one
record.

| r = stream_reader();
| while (r > 0)
| {
| ... wal record processing stuff without sending feedback..
| r = stream_reader();
| }

WalReceiverMain() has the similar code as follows.

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
for (;;)
{
if (len > 0)
{
....
len = walrcv_receive(0, &buf);
}
}

The loop seems a bit different but eventually the same about this
discussion.

408> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
409> if (len != 0)
410> {
415> for (;;)
416> {
417> if (len > 0)
418> {
425> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
426> }
427-438> else {break;}
439> len = walrcv_receive(0, &buf);
440> }
441> }

XLogWalRcvProcessMsg doesn't send feedback when processing
'w'=WAL record packet. So the same thing as pg_basebackup and
pg_receivexlog will occur on walsender.

Exiting the for(;;) loop by TimestampDifferenceExceeds just
before line 439 is an equivalent measure to I poposed for
receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
is seemingly simpler (but I feel a bit uncomfortable for the
latter)

Or other measures?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#6)

Re: pg_basebackup may fail to send feedbacks.

On Fri, Feb 6, 2015 at 3:22 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Sorry, I misunderstood that.

At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGudGCMnHZinkd37i+JijDkruEcrea1NCRs1MMtE3rOFQ@mail.gmail.com>

On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm very sorry for confused report. The problem found in 9.4.0
and the diagnosis was mistakenly done on master.

9.4.0 has no problem of feedback delay caused by slow xlog
receiving on pg_basebackup mentioned in the previous mail. But
the current master still has this problem.

Seems walreceiver has the same problem. No?

pg_receivexlog.c would have the same problem since it uses the
same function with pg_basebackup.c.

The correspondent of HandleCopyStream in wansender is
WalReceiverMain, and it doesn't seem to have the same kind of
loop shown below. It seems to surely send feedback per one
record.

| r = stream_reader();
| while (r > 0)
| {
| ... wal record processing stuff without sending feedback..
| r = stream_reader();
| }

WalReceiverMain() has the similar code as follows.

len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
if (len != 0)
{
for (;;)
{
if (len > 0)
{
....
len = walrcv_receive(0, &buf);
}
}

The loop seems a bit different but eventually the same about this
discussion.

408> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
409> if (len != 0)
410> {
415> for (;;)
416> {
417> if (len > 0)
418> {
425> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
426> }
427-438> else {break;}
439> len = walrcv_receive(0, &buf);
440> }
441> }

XLogWalRcvProcessMsg doesn't send feedback when processing
'w'=WAL record packet. So the same thing as pg_basebackup and
pg_receivexlog will occur on walsender.

Exiting the for(;;) loop by TimestampDifferenceExceeds just
before line 439 is an equivalent measure to I poposed for
receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
is seemingly simpler (but I feel a bit uncomfortable for the
latter)

I'm concerned about the performance degradation by calling
getimeofday() per a record.

Or other measures?

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called per at least
8 records. I'm not sure what number is good, though...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#7)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hello,

The attached patch is fix for walreciever not using gettimeofday,
and fix for receivelog using it.

XLogWalRcvProcessMsg doesn't send feedback when processing
'w'=WAL record packet. So the same thing as pg_basebackup and
pg_receivexlog will occur on walsender.

Exiting the for(;;) loop by TimestampDifferenceExceeds just
before line 439 is an equivalent measure to I poposed for
receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
is seemingly simpler (but I feel a bit uncomfortable for the
latter)

I'm concerned about the performance degradation by calling
getimeofday() per a record.

Or other measures?

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called per at least
8 records. I'm not sure what number is good, though...

At the beginning of the "while(len > 0){if (len > 0){" block,
last_recv_timestamp is always kept up to date (by using
gettimeotda():). So we can use the variable instead of
gettimeofday() in my previous proposal.

The start time of the timeout could be last_recv_timestamp at the
beginning of the while loop, since it is a bit earlier than the
actual time at the point.

The another solution would be using
RegisterTimeout/enable_timeout_after and it seemed to be work for
me. It can throw out the time counting stuff in the loop we are
talking about and that of XLogWalRcvSendHSFeedback and
XLogWalRcvSendReply, but it might be a bit too large for the
gain.

Considering pg_basebackup/receivexlog, the loop in receivelog.c
does not maintain the time value within it, so I think we are
forced to use feGetCurrentTimeStamp if not using SIGALRM. The wal
reading function simply gets the data from the buffer in memory
for most calls so the gettimeofday for each iteration could slow
the process significantly. SIGALRM seems to be valuable for the
case.

Thoughts?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Make-sure-to-send-feedback-at-desired-timing.patchtext/x-patch; charset=us-asciiDownload

>From c9017f7c55de864bb3459f6f927803577e94c5eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 2 Feb 2015 12:49:45 +0900
Subject: [PATCH] Make sure to send feedback at desired timing.

Continuous stream due to heavy-load on client side can prevent
feedbacks to be sent with expected interval, and it results in a
replication timeout on walsender. Exiting from the fast-path loop when
the time comes to make sure the feedback to be sent with expected
intervals.
---
 src/backend/replication/walreceiver.c | 54 +++++++++++++++++++++--------------
 src/bin/pg_basebackup/receivelog.c    | 10 +++++--
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bfbc02f..d77dc91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,35 +409,45 @@ WalReceiverMain(void)
 				if (len != 0)
 				{
 					/*
+					 * The feedback interval cannot be longer than
+					 * wal_receiver_status_interval. last_recv_timestamp is a
+					 * bit earlier than the actual time here so this is
+					 * available for the start time of the timeout in the loop
+					 * below.
+					 */
+					TimestampTz last_feedback = last_recv_timestamp;
+
+					/*
 					 * Process the received data, and any subsequent data we
 					 * can read without blocking.
 					 */
-					for (;;)
+					while (len > 0)
 					{
-						if (len > 0)
-						{
-							/*
-							 * Something was received from master, so reset
-							 * timeout
-							 */
-							last_recv_timestamp = GetCurrentTimestamp();
-							ping_sent = false;
-							XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
-						}
-						else if (len == 0)
-							break;
-						else if (len < 0)
-						{
-							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-							endofwal = true;
+						/*
+						 * Something was received from master, so reset
+						 * timeout
+						 */
+						last_recv_timestamp = GetCurrentTimestamp();
+						ping_sent = false;
+						XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
+
+						/* Exit this loop if the time to reply has come */
+						if (TimestampDifferenceExceeds(last_feedback, last_recv_timestamp,
+													  wal_receiver_status_interval * 1000))
 							break;
-						}
+
 						len = walrcv_receive(0, &buf);
 					}
+					if (len < 0)
+					{
+						ereport(LOG,
+								(errmsg("replication terminated by primary server"),
+								 errdetail("End of WAL reached on timeline %u at %X/%X.",
+										   startpointTLI,
+										   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
+						endofwal = true;
+						break;
+					}
 
 					/* Let the master know that we received some data. */
 					XLogWalRcvSendReply(false, false);
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..fb6738d 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -925,9 +925,15 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			}
 
 			/*
-			 * Process the received data, and any subsequent data we
-			 * can read without blocking.
+			 * Process the received data, and any subsequent data we can read
+			 * without blocking except when the time to feedback comes.
 			 */
+			now = feGetCurrentTimestamp();
+			if (standby_message_timeout > 0 &&
+				feTimestampDifferenceExceeds(last_status, now,
+											 standby_message_timeout))
+				break;
+
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
 	}
-- 
2.1.0.GIT

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#8)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hi,

15 19:48:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150210.194823.219136034.horiguchi.kyotaro@lab.ntt.co.jp>

Considering pg_basebackup/receivexlog, the loop in receivelog.c
does not maintain the time value within it, so I think we are
forced to use feGetCurrentTimeStamp if not using SIGALRM. The wal
reading function simply gets the data from the buffer in memory
for most calls so the gettimeofday for each iteration could slow
the process significantly. SIGALRM seems to be valuable for the
case.

As a very fancy and ugly sample, the attached patch does this,
although, I'm a bit at a loss how to make this signal things to
be more sober..

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-receivelog.c-sigalrm.patchtext/x-patch; charset=us-asciiDownload

>From 2dc0123c22902069986d25552902899b5ac537b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 10 Feb 2015 21:30:12 +0900
Subject: [PATCH] receivelog.c sigalrm

---
 src/bin/pg_basebackup/receivelog.c | 44 ++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..d2e4c1a 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -16,6 +16,8 @@
 
 #include <sys/stat.h>
 #include <unistd.h>
+#include <signal.h>
+#include <sys/time.h>
 
 /* local includes */
 #include "receivelog.h"
@@ -801,6 +803,13 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 	return true;
 }
 
+static bool sigalrm_fired = false;
+static void
+SigAlrmHandler(int s)
+{
+	sigalrm_fired = true;
+}
+
 /*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
@@ -818,6 +827,10 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 	char	   *copybuf = NULL;
 	int64		last_status = -1;
 	XLogRecPtr	blockpos = startpos;
+	sighandler_t oldhandler;
+
+	oldhandler = pqsignal(SIGALRM, SigAlrmHandler);
+	Assert(oldhandler == SIG_DFL);
 
 	still_sending = true;
 
@@ -879,6 +892,22 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		sleeptime = CalculateCopyStreamSleeptime(now, standby_message_timeout,
 												 last_status);
 
+		{
+			struct itimerval itv;
+			long secs;
+			int  usecs;
+			int err;
+
+			MemSet(&itv, 0, sizeof(itv));
+			itv.it_value.tv_sec = standby_message_timeout / 1000;
+			itv.it_value.tv_usec = (standby_message_timeout % 1000) * 1000;
+			sigalrm_fired = false;
+			if (setitimer(ITIMER_REAL, &itv, NULL) != 0)
+			{
+				fprintf(stderr, "could not set timer: %m");
+				goto error;
+			}
+		}
 		r = CopyStreamReceive(conn, sleeptime, &copybuf);
 		while (r != 0)
 		{
@@ -924,15 +953,22 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 				goto error;
 			}
 
-			/*
-			 * Process the received data, and any subsequent data we
-			 * can read without blocking.
-			 */
+			if (sigalrm_fired)
+				break;
+
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
+		if (!sigalrm_fired)
+		{
+			struct itimerval itv;
+			MemSet(&itv, 0, sizeof(itv));
+			setitimer(ITIMER_REAL, &itv, NULL);
+		}
 	}
 
+
 error:
+	pqsignal(SIGALRM, SIG_IGN);
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
 	return NULL;
-- 
2.1.0.GIT

#10

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#8)

Re: pg_basebackup may fail to send feedbacks.

On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

The attached patch is fix for walreciever not using gettimeofday,
and fix for receivelog using it.

XLogWalRcvProcessMsg doesn't send feedback when processing
'w'=WAL record packet. So the same thing as pg_basebackup and
pg_receivexlog will occur on walsender.

Exiting the for(;;) loop by TimestampDifferenceExceeds just
before line 439 is an equivalent measure to I poposed for
receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
is seemingly simpler (but I feel a bit uncomfortable for the
latter)

I'm concerned about the performance degradation by calling
getimeofday() per a record.

Or other measures?

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called per at least
8 records. I'm not sure what number is good, though...

At the beginning of the "while(len > 0){if (len > 0){" block,
last_recv_timestamp is always kept up to date (by using
gettimeotda():). So we can use the variable instead of
gettimeofday() in my previous proposal.

Yes, but something like last_recv_timestamp doesn't exist in
REL9_1_STABLE. So you don't think that your proposed fix
should be back-patched to all supported versions. Right?

The start time of the timeout could be last_recv_timestamp at the
beginning of the while loop, since it is a bit earlier than the
actual time at the point.

Sounds strange to me. I think that last_recv_timestamp should be
compared with the time when the last feedback message was sent,
e.g., maybe you can expose sendTime in XLogWalRcvSendReplay()
as global variable, and use it to compare with last_recv_timestamp.

When we get out of the WAL receive loop due to the timeout check
which your patch added, XLogWalRcvFlush() is always executed.
I don't think that current WAL file needs to be flushed at that time.

The another solution would be using
RegisterTimeout/enable_timeout_after and it seemed to be work for
me. It can throw out the time counting stuff in the loop we are
talking about and that of XLogWalRcvSendHSFeedback and
XLogWalRcvSendReply, but it might be a bit too large for the
gain.

Yes, sounds overkill.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#10)

3 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Thank you for the comment. Three new patches are attached. I
forgot to give a revision number on the previous patch, but I
think this is the 2nd version.

1. walrcv_reply_fix_94_v2.patch
Walreceiver patch applyable on master/REL9_4_STBLE/REL9_3_STABLE

2. walrcv_reply_fix_92_v2.patch
Walreceiver patch applyable on REL9_2_STABLE

3. walrcv_reply_fix_91_v2.patch
Walreceiver patch applyable on REL9_1_STABLE

At Sat, 14 Feb 2015 03:01:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHZ_4yWyokA+5ks9s_698adjH8+0haMCSC9WYFh37GY7g@mail.gmail.com>

On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called per at least
8 records. I'm not sure what number is good, though...

At the beginning of the "while(len > 0){if (len > 0){" block,
last_recv_timestamp is always kept up to date (by using
gettimeotda():). So we can use the variable instead of
gettimeofday() in my previous proposal.

Yes, but something like last_recv_timestamp doesn't exist in
REL9_1_STABLE. So you don't think that your proposed fix
should be back-patched to all supported versions. Right?

Back to 9.3 has the loop with the same structure. 9.2 is in a bit
differenct shape but looks to have the same issue. 9.1 and 9.0
has the same staps with 9.2 but 9.0 doesn't has wal sender
timeout and 9.1 does not have a variable having current time.

9.3 and later: the previous patch works, but revised as your
comment.

9.2: The time of the last reply is stored in the file-scope
variable reply_message, and the current time is stored in
walrcv->lastMsgReceiptTime. The timeout is determined using
theses variables.

9.1: walrcv doesn't have lastMsgReceiptTime. The current time
should be acquired explicitly in the innermost loop. I tried
calling gettimeofday() once per several loops. The skip count
is determined by anticipated worst duration to process a wal
record and wal_receiver_status_interval. However, I doubt the
necessity to do so.. The value of the worst duration is simply
a random guess.

9.0: No point to do this kind of fix.

The start time of the timeout could be last_recv_timestamp at the
beginning of the while loop, since it is a bit earlier than the
actual time at the point.

Sounds strange to me. I think that last_recv_timestamp should be
compared with the time when the last feedback message was sent,
e.g., maybe you can expose sendTime in XLogWalRcvSendReplay()
as global variable, and use it to compare with last_recv_timestamp.

You're right to do exactly right thing, but, as you mentioned,
exposing sendTime is requied to do so. The solution I proposed is
the second-best assuming that wal_sender_timeout is recommended
to be longer enough (several times longer) than
wal_receiver_status_interval. If no one minds to expose sendTime,
it is the best solution. The attached patch does it.

# The added comment in the patch was wrong, though.

When we get out of the WAL receive loop due to the timeout check
which your patch added, XLogWalRcvFlush() is always executed.
I don't think that current WAL file needs to be flushed at that time.

Thank you for pointing it out. Run XLogWalRcvSendReply(force =
true) immediately instead of breaking from the loop.

The another solution would be using
RegisterTimeout/enable_timeout_after and it seemed to be work for
me. It can throw out the time counting stuff in the loop we are
talking about and that of XLogWalRcvSendHSFeedback and
XLogWalRcvSendReply, but it might be a bit too large for the
gain.

Yes, sounds overkill.

However, gettimeofday() for each recv is also a overkill for most
cases. I'll post the patches for receivelog.c based on the patch
for 9.1 wal receiver.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

walrcv_reply_fix_94_v2.patch.patchtext/x-patch; charset=us-asciiDownload

>From 184f72ac34e7b787527dfa8ed76c1fbd2d970407 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 10 Feb 2015 14:56:23 +0900
Subject: [PATCH] Make walreceiver to keep regular reply message even on heavy
 load v2.

Wal receiver cannot send receiver reply message while it is receiving
continuous WAL stream caused by heavy load or something else. This
patch makes wal receiver to send reply message even on such a
situation.
---
 src/backend/replication/walreceiver.c | 59 ++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 26 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..43d218d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -91,6 +91,12 @@ static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
 
 /*
+ * The time when the last wal receiver reply was sent. This is used to escape
+ * from receiving loop so that replay messages are kept regulary.
+ */
+static TimestampTz walRcvReplySendTime = 0;
+
+/*
  * Flags set by interrupt handlers of walreceiver for later service in the
  * main loop.
  */
@@ -423,32 +429,34 @@ WalReceiverMain(void)
 					 * Process the received data, and any subsequent data we
 					 * can read without blocking.
 					 */
-					for (;;)
+					while (len > 0)
 					{
-						if (len > 0)
-						{
-							/*
-							 * Something was received from master, so reset
-							 * timeout
-							 */
-							last_recv_timestamp = GetCurrentTimestamp();
-							ping_sent = false;
-							XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
-						}
-						else if (len == 0)
-							break;
-						else if (len < 0)
-						{
-							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-							endofwal = true;
-							break;
-						}
+						/*
+						 * Something was received from master, so reset
+						 * timeout
+						 */
+						last_recv_timestamp = GetCurrentTimestamp();
+						ping_sent = false;
+						XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
+
+						/* Send reply if needed */
+						if (TimestampDifferenceExceeds(walRcvReplySendTime,
+											last_recv_timestamp,
+											wal_receiver_status_interval * 1000))
+							XLogWalRcvSendReply(true, false);
+
 						len = walrcv_receive(0, &buf);
 					}
+					if (len < 0)
+					{
+						ereport(LOG,
+								(errmsg("replication terminated by primary server"),
+								 errdetail("End of WAL reached on timeline %u at %X/%X.",
+										   startpointTLI,
+										   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
+						endofwal = true;
+						break;
+					}
 
 					/* Let the master know that we received some data. */
 					XLogWalRcvSendReply(false, false);
@@ -1040,7 +1048,6 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
-	static TimestampTz sendTime = 0;
 	TimestampTz now;
 
 	/*
@@ -1065,10 +1072,10 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	if (!force
 		&& writePtr == LogstreamResult.Write
 		&& flushPtr == LogstreamResult.Flush
-		&& !TimestampDifferenceExceeds(sendTime, now,
+		&& !TimestampDifferenceExceeds(walRcvReplySendTime, now,
 									   wal_receiver_status_interval * 1000))
 		return;
-	sendTime = now;
+	walRcvReplySendTime = now;
 
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
-- 
2.1.0.GIT

walrcv_reply_fix_92_v2.patchtext/x-patch; charset=us-asciiDownload

>From ed60223555ae6b30c203276f1cf8890711e0c267 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 17 Feb 2015 19:00:07 +0900
Subject: [PATCH] Make walreceiver to keep regular reply message even on heavy
 load v2.

Wal receiver cannot send receiver reply message while it is receiving
continuous WAL stream caused by heavy load or something else. This
patch makes wal receiver to send reply message even on such a
situation.
---
 src/backend/replication/walreceiver.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 753316e..407e70b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -319,13 +319,22 @@ WalReceiverMain(void)
 		/* Wait a while for data to arrive */
 		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
 		{
-			/* Accept the received data, and process it */
-			XLogWalRcvProcessMsg(type, buf, len);
-
 			/* Receive any more data we can without sleeping */
-			while (walrcv_receive(0, &type, &buf, &len))
+			do
+			{
+				/* Accept the received data, and process it */
 				XLogWalRcvProcessMsg(type, buf, len);
 
+				/*
+				 * walrecv->lastMsgReceiptTime has the time as of
+				 * XLogWalRcvProcessMsg just above.
+				 */
+				if (TimestampDifferenceExceeds(reply_message.sendTime,
+											   walrcv->lastMsgReceiptTime,
+											   wal_receiver_status_interval * 1000))
+					XLogWalRcvSendReply();
+			} while (walrcv_receive(0, &type, &buf, &len));
+
 			/* Let the master know that we received some data. */
 			XLogWalRcvSendReply();
 
-- 
2.1.0.GIT

walrcv_reply_fix_91_v2.patchtext/x-patch; charset=us-asciiDownload

>From 64dbbd49719656eb3fbeb7fd8582a0fb952d54e9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 17 Feb 2015 19:00:07 +0900
Subject: [PATCH] Make walreceiver to keep regular reply message even on heavy
 load v2.

Wal receiver cannot send receiver reply message while it is receiving
continuous WAL stream caused by heavy load or something else. This
patch makes wal receiver to send reply message even on such a
situation.
---
 src/backend/replication/walreceiver.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3dfc3d6..a3e1fd9 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -67,6 +67,10 @@ walrcv_send_type walrcv_send = NULL;
 walrcv_disconnect_type walrcv_disconnect = NULL;
 
 #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
+#define WAL_PROCESS_WORST_DURATION 1 /* Anticipated worst duration to process
+									  * wal record in seconds. This is used to
+									  * calculate how often to check the time
+									  * to send reply message */
 
 /*
  * These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -316,13 +320,34 @@ WalReceiverMain(void)
 		/* Wait a while for data to arrive */
 		if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
 		{
-			/* Accept the received data, and process it */
-			XLogWalRcvProcessMsg(type, buf, len);
+			/*
+			 * Check timeout once per several loops in order to avoid exccess
+			 * calling of gettimeofday().
+			 */
+			int check_interval = 
+				wal_receiver_status_interval / WAL_PROCESS_WORST_DURATION;
+			int count = 0;
 
 			/* Receive any more data we can without sleeping */
-			while (walrcv_receive(0, &type, &buf, &len))
+			do
+			{
+				/* Accept the received data, and process it */
 				XLogWalRcvProcessMsg(type, buf, len);
 
+				/*
+				 * walrecv->lastMsgReceiptTime has the time as of
+				 * XLogWalRcvProcessMsg just above.
+				 */
+				if (count++ >= check_interval)
+				{
+					if (TimestampDifferenceExceeds(reply_message.sendTime,
+										GetCurrentTimestamp(),
+										wal_receiver_status_interval * 1000))
+						XLogWalRcvSendReply();
+					count = 0;
+				}
+			} while (walrcv_receive(0, &type, &buf, &len));
+
 			/* Let the master know that we received some data. */
 			XLogWalRcvSendReply();
 
-- 
2.1.0.GIT

#12

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#11)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hello, this is the last patch for pg_basebackup/pg_receivexlog on
master (9.5). Preor versions don't have this issue.

4. basebackup_reply_fix_mst_v2.patch
receivelog.c patch applyable on master.

This is based on the same design with
walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday().

regards,

At Tue, 17 Feb 2015 19:45:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150217.194519.58137941.horiguchi.kyotaro@lab.ntt.co.jp>

Thank you for the comment. Three new patches are attached. I
forgot to give a revision number on the previous patch, but I
think this is the 2nd version.

1. walrcv_reply_fix_94_v2.patch
Walreceiver patch applyable on master/REL9_4_STBLE/REL9_3_STABLE

2. walrcv_reply_fix_92_v2.patch
Walreceiver patch applyable on REL9_2_STABLE

3. walrcv_reply_fix_91_v2.patch
Walreceiver patch applyable on REL9_1_STABLE

At Sat, 14 Feb 2015 03:01:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHZ_4yWyokA+5ks9s_698adjH8+0haMCSC9WYFh37GY7g@mail.gmail.com>

On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Introduce the maximum number of records that we can receive and
process between feedbacks? For example, we can change
XLogWalRcvSendHSFeedback() so that it's called per at least
8 records. I'm not sure what number is good, though...

At the beginning of the "while(len > 0){if (len > 0){" block,
last_recv_timestamp is always kept up to date (by using
gettimeotda():). So we can use the variable instead of
gettimeofday() in my previous proposal.

Yes, but something like last_recv_timestamp doesn't exist in
REL9_1_STABLE. So you don't think that your proposed fix
should be back-patched to all supported versions. Right?

Back to 9.3 has the loop with the same structure. 9.2 is in a bit
differenct shape but looks to have the same issue. 9.1 and 9.0
has the same staps with 9.2 but 9.0 doesn't has wal sender
timeout and 9.1 does not have a variable having current time.

9.3 and later: the previous patch works, but revised as your
comment.

9.2: The time of the last reply is stored in the file-scope
variable reply_message, and the current time is stored in
walrcv->lastMsgReceiptTime. The timeout is determined using
theses variables.

9.1: walrcv doesn't have lastMsgReceiptTime. The current time
should be acquired explicitly in the innermost loop. I tried
calling gettimeofday() once per several loops. The skip count
is determined by anticipated worst duration to process a wal
record and wal_receiver_status_interval. However, I doubt the
necessity to do so.. The value of the worst duration is simply
a random guess.

9.0: No point to do this kind of fix.

The start time of the timeout could be last_recv_timestamp at the
beginning of the while loop, since it is a bit earlier than the
actual time at the point.

Sounds strange to me. I think that last_recv_timestamp should be
compared with the time when the last feedback message was sent,
e.g., maybe you can expose sendTime in XLogWalRcvSendReplay()
as global variable, and use it to compare with last_recv_timestamp.

You're right to do exactly right thing, but, as you mentioned,
exposing sendTime is requied to do so. The solution I proposed is
the second-best assuming that wal_sender_timeout is recommended
to be longer enough (several times longer) than
wal_receiver_status_interval. If no one minds to expose sendTime,
it is the best solution. The attached patch does it.

# The added comment in the patch was wrong, though.

When we get out of the WAL receive loop due to the timeout check
which your patch added, XLogWalRcvFlush() is always executed.
I don't think that current WAL file needs to be flushed at that time.

Thank you for pointing it out. Run XLogWalRcvSendReply(force =
true) immediately instead of breaking from the loop.

The another solution would be using
RegisterTimeout/enable_timeout_after and it seemed to be work for
me. It can throw out the time counting stuff in the loop we are
talking about and that of XLogWalRcvSendHSFeedback and
XLogWalRcvSendReply, but it might be a bit too large for the
gain.

Yes, sounds overkill.

However, gettimeofday() for each recv is also a overkill for most
cases. I'll post the patches for receivelog.c based on the patch
for 9.1 wal receiver.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

basebackup_reply_fix_mst_v2.patchtext/x-patch; charset=us-asciiDownload

>From cfe01eadd4d8567f4410bccb8334c52fc897c002 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 18 Feb 2015 12:30:58 +0900
Subject: [PATCH] Make pg_bascbackup and pg_receivexlog to keep sending WAL
 receive feedback regularly on heavy load v2.

pg_basebackup and pg_receivexlog fail to send receiver reply message
while they are receiving continuous WAL stream caused by heavy load or
something else. This patch makes them to send reply message even on
such a situation.
---
 src/bin/pg_basebackup/receivelog.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..453a047 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -24,6 +24,10 @@
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
 
+#define WAL_PROCESS_WORST_DURATION 1 /* Anticipated worst duration to process
+									  * wal record in seconds. This is used to
+									  * calculate how often to check the time
+									  * to send reply message */
 
 /* fd and filename for currently open WAL file */
 static int	walfile = -1;
@@ -826,6 +830,9 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		int			r;
 		int64		now;
 		long		sleeptime;
+		int			reply_timeout_check_interval = 
+			standby_message_timeout / 1000 / WAL_PROCESS_WORST_DURATION;
+		int			loop_count = 0;
 
 		/*
 		 * Check if we should continue streaming, or abort at this point.
@@ -879,6 +886,7 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		sleeptime = CalculateCopyStreamSleeptime(now, standby_message_timeout,
 												 last_status);
 
+		loop_count = 0;
 		r = CopyStreamReceive(conn, sleeptime, &copybuf);
 		while (r != 0)
 		{
@@ -925,6 +933,23 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			}
 
 			/*
+			 * Keep sending feedbacks regularly. We check it once per several
+			 * iterations for performance reason.
+			 */
+			if (loop_count++ >= reply_timeout_check_interval)
+			{
+				now = feGetCurrentTimestamp();
+				if (feTimestampDifferenceExceeds(last_status, now,
+												 standby_message_timeout))
+				{
+					if (!sendFeedback(conn, blockpos, now, false))
+						goto error;
+					last_status = now;
+				}
+				loop_count = 0;
+			}
+
+			/*
 			 * Process the received data, and any subsequent data we
 			 * can read without blocking.
 			 */
-- 
2.1.0.GIT

#13

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#12)

Re: pg_basebackup may fail to send feedbacks.

On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, this is the last patch for pg_basebackup/pg_receivexlog on
master (9.5). Preor versions don't have this issue.

4. basebackup_reply_fix_mst_v2.patch
receivelog.c patch applyable on master.

This is based on the same design with
walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday().

Thanks for updating the patches! But I'm still not sure if the idea depending
on the frequent calls of gettimeofday() for each WAL receive is good or not.
Some users may complain about the performance impact by such frequent calls
and we may want to get rid of them from walreceiver loop in the future.
If we adopt your idea now, I'm afraid that it would tie our hands in that case.

How much impact can such frequent calls of gettimeofday() have on replication
performance? If it's not negligible, probably we should remove them at first
and find out another idea to fix the problem you pointed. ISTM that it's not so
difficult to remove them. Thought? Do you have any numbers which can prove
that such frequent gettimeofday() has only ignorable impact on the performance?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#13)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hello,

At Thu, 19 Feb 2015 19:22:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGLFLaFrCYcuikkVefNaoEL448TLSJ9oPsvb17v3foZHA@mail.gmail.com>

On Wed, Feb 18, 2015 at 5:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, this is the last patch for pg_basebackup/pg_receivexlog on
master (9.5). Preor versions don't have this issue.

4. basebackup_reply_fix_mst_v2.patch
receivelog.c patch applyable on master.

This is based on the same design with
walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday().

Thanks for updating the patches! But I'm still not sure if the idea depending
on the frequent calls of gettimeofday() for each WAL receive is good or not.

Neither do I. Nowadays, linux on AMD64/x64 environment has no
problem even if gettimeofday() called frequently, but Windows
seems to have a problem and I don't know about other platforms.

One possible timing source is LSN.

if ((blockpos - last_blockpos) / BLKSZ > 0)
{
now = feGetCurrentTimestamp();
if (feTimestampDifferenceExceeds(last_status, now,

if (!sendFeedback(conn, blockpos, now, false))
}
}

last_blockpos = blockpos;

But once per PAGESZ can easily be more frequent than once per 10
records and XLOG_SEG_SIZE seems too big. However I don't see any
bases to determine the frequency between them nor other than the
time itself.

SIGALRM seems to me to be more preferable to keep the main jobe
as fast as possible than introducing a code with no reasonable
basis.

Some users may complain about the performance impact by such frequent calls
and we may want to get rid of them from walreceiver loop in the future.
If we adopt your idea now, I'm afraid that it would tie our hands in that case.

How much impact can such frequent calls of gettimeofday() have on replication
performance? If it's not negligible, probably we should remove them at first
and find out another idea to fix the problem you pointed. ISTM that it's not so
difficult to remove them. Thought? Do you have any numbers which can prove
that such frequent gettimeofday() has only ignorable impact on the performance?

The attached patch is 'the more sober' version of SIGLARM patch.

I'll search for the another way after this.

regards,

Attachments:

basebackup_reply_fix_mst_v3_SIGALRM.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..c55af83 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -16,6 +16,8 @@
 
 #include <sys/stat.h>
 #include <unistd.h>
+#include <signal.h>
+#include <sys/time.h>
 
 /* local includes */
 #include "receivelog.h"
@@ -33,6 +35,12 @@ static XLogRecPtr lastFlushPosition = InvalidXLogRecPtr;
 
 static bool still_sending = true;		/* feedback still needs to be sent? */
 
+static bool standby_message_timeout_active = false;   /* timeout is active?  */
+static bool standby_message_timeout_expired = false;  /* timeout expired?  */
+
+static void SigAlrmHandler(int s);
+static int set_standby_message_timeout(long duration);
+static int cancel_standby_message_timeout(void);
 static PGresult *HandleCopyStream(PGconn *conn, XLogRecPtr startpos,
 				 uint32 timeline, char *basedir,
 			   stream_stop_callback stream_stop, int standby_message_timeout,
@@ -60,6 +68,14 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
 
+
+static void
+SigAlrmHandler(int s)
+{
+	standby_message_timeout_expired = true;
+}
+
+
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
 {
@@ -802,6 +818,65 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * set_standby_message_timeout: Set standby message timeout.
+ *
+ * duration is the timeout duration in milliseconds.
+ *
+ * Returns 0 on success and -1 on error.
+ */
+static int
+set_standby_message_timeout(long duration)
+{
+	struct itimerval itv;
+
+	/* negative duration means no standby_message_timeout. */
+	if (duration < 0) return 0;
+
+	MemSet(&itv, 0, sizeof(itv));
+	itv.it_value.tv_sec = duration / 1000;
+	itv.it_value.tv_usec = (duration % 1000) * 1000;
+	standby_message_timeout_active = true;
+	standby_message_timeout_expired = false;
+	if (setitimer(ITIMER_REAL, &itv, NULL) != 0)
+	{
+		fprintf(stderr, "could not set timer: %m");
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * cancel_standby_message_timeout: cancel standby message timeout if active.
+ *
+ * Returns 0 on success and -1 on error.
+ */
+static int
+cancel_standby_message_timeout(void)
+{
+	struct itimerval itv;
+
+	if (!standby_message_timeout_active)
+		return;
+
+	standby_message_timeout_active = false;
+
+	if (standby_message_timeout_expired)
+	{
+		standby_message_timeout_expired = false;
+		return 0;
+	}
+	MemSet(&itv, 0, sizeof(itv));
+	if (setitimer(ITIMER_REAL, &itv, NULL) != 0)
+	{
+		fprintf(stderr, "could not set timer: %m");
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
@@ -818,6 +893,15 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 	char	   *copybuf = NULL;
 	int64		last_status = -1;
 	XLogRecPtr	blockpos = startpos;
+	sighandler_t	oldhandler;
+
+	oldhandler = pqsignal(SIGALRM, SigAlrmHandler);
+
+	if (oldhandler != SigAlrmHandler)
+	{
+		/* SIGALRM should not be used here  */
+		Assert(oldhandler == SIG_DFL);
+	}
 
 	still_sending = true;
 
@@ -878,23 +962,9 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		 */
 		sleeptime = CalculateCopyStreamSleeptime(now, standby_message_timeout,
 												 last_status);
-
 		r = CopyStreamReceive(conn, sleeptime, &copybuf);
-		while (r != 0)
+		while (r > 0)
 		{
-			if (r == -1)
-				goto error;
-			if (r == -2)
-			{
-				PGresult	*res = HandleEndOfCopyStream(conn, copybuf, blockpos,
-														 basedir, partial_suffix,
-														 stoppos, mark_done);
-				if (res == NULL)
-					goto error;
-				else
-					return res;
-			}
-
 			/* Check the message type. */
 			if (copybuf[0] == 'k')
 			{
@@ -925,14 +995,62 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			}
 
 			/*
+			  If standby_message_timeout has expired, send feed back and then
+			 * set timeout again.
+			 */
+			if (standby_message_timeout_expired)
+			{
+				if (!sendFeedback(conn, blockpos, now, false))
+					goto error;
+				last_status = feGetCurrentTimestamp();
+
+				/* Set next timeout */
+				sleeptime = CalculateCopyStreamSleeptime(now, standby_message_timeout,
+														 last_status);
+				if (set_standby_message_timeout(sleeptime) < 0)
+					goto error;
+			}
+
+			/*
 			 * Process the received data, and any subsequent data we
 			 * can read without blocking.
 			 */
 			r = CopyStreamReceive(conn, 0, &copybuf);
+
+			/* If continuous input has come, set timeout for standby
+			 * message. This fires a bit later than exact desired time but it
+			 * would be earlier enough because standby_message_timeout is
+			 * ususally set enough smaller than wal_sender_timeout.
+			 */
+			if (r > 0 && !standby_message_timeout_active)
+			{
+				if (set_standby_message_timeout(sleeptime) < 0)
+					goto error;
+
+			}
 		}
-	}
+		if (cancel_standby_message_timeout() < 0)
+			goto error;
 
+		if (r == -1)
+			goto error;
+		if (r == -2)
+		{
+			PGresult	*res = HandleEndOfCopyStream(conn, copybuf, blockpos,
+													 basedir, partial_suffix,
+													 stoppos, mark_done);
+			if (res == NULL)
+				goto error;
+			else
+				return res;
+		}
+
+	}
+	
 error:
+	cancel_standby_message_timeout();
+	pqsignal(SIGALRM, oldhandler);
+
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
 	return NULL;

#15

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#14)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hello, the attached is the v4 patch that checks feedback timing
every WAL segments boundary.

At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150220.172914.241732690.horiguchi.kyotaro@lab.ntt.co.jp>

Some users may complain about the performance impact by such
frequent calls and we may want to get rid of them from
walreceiver loop in the future. If we adopt your idea now,
I'm afraid that it would tie our hands in that case.

How much impact can such frequent calls of gettimeofday()
have on replication performance? If it's not negligible,
probably we should remove them at first and find out another
idea to fix the problem you pointed. ISTM that it's not so
difficult to remove them. Thought? Do you have any numbers
which can prove that such frequent gettimeofday() has only
ignorable impact on the performance?

The attached patch is 'the more sober' version of SIGLARM patch.

I said that checking whether to send feedback every boundary
between WAL segments seemed too long but after some rethinking, I
changed my mind.

- The most large possible delay source in the busy-receive loop
is fsyncing at closing WAL segment file just written, this can
take several seconds. Freezing longer than the timeout
interval could not be saved and is not worth saving anyway.

- 16M bytes-disk-writes intervals between gettimeofday() seems
to be gentle enough even on platforms where gettimeofday() is
rather heavy.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

basebackup_reply_fix_mst_v4_WALSEG.patchtext/x-patch; charset=us-asciiDownload

>From 945e18713af86a357a7ac24ff5cd855e1f79a927 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 24 Feb 2015 17:52:01 +0900
Subject: [PATCH] Make effort to send feedback regulary on heavy load.

pg_basebackup and pg_receivexlog might be forced to omit sending
feedback for long time by continuous replication stream caused by
possible heavy load on receiver side. Keep alives from the server
could be delayed on such a situation. This patch let them make efforts
to send feedback on such a situation. On every boundary between WAL
segments, send feedback if so the time has come just after syncing and
closing the segment just finished.
---
 src/bin/pg_basebackup/receivelog.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..5d15b11 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -45,7 +45,8 @@ static bool ProcessKeepaliveMsg(PGconn *conn, char *copybuf, int len,
 static bool ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 							   XLogRecPtr *blockpos, uint32 timeline,
 							   char *basedir, stream_stop_callback stream_stop,
-							   char *partial_suffix, bool mark_done);
+							   char *partial_suffix, bool mark_done,
+							   int standby_message_timeout, int64 *last_status);
 static PGresult *HandleEndOfCopyStream(PGconn *conn, char *copybuf,
 									   XLogRecPtr blockpos, char *basedir, char *partial_suffix,
 									   XLogRecPtr *stoppos, bool mark_done);
@@ -906,7 +907,8 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			{
 				if (!ProcessXLogDataMsg(conn, copybuf, r, &blockpos,
 										timeline, basedir, stream_stop,
-										partial_suffix, mark_done))
+										partial_suffix, mark_done,
+										standby_message_timeout, &last_status))
 					goto error;
 
 				/*
@@ -1115,7 +1117,8 @@ static bool
 ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 				   XLogRecPtr *blockpos, uint32 timeline,
 				   char *basedir, stream_stop_callback stream_stop,
-				   char *partial_suffix, bool mark_done)
+				   char *partial_suffix, bool mark_done,
+				   int standby_message_timeout, int64 *last_status)
 {
 	int			xlogoff;
 	int			bytes_left;
@@ -1223,12 +1226,29 @@ ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 		/* Did we reach the end of a WAL segment? */
 		if (*blockpos % XLOG_SEG_SIZE == 0)
 		{
+			int64 now;
 			if (!close_walfile(basedir, partial_suffix, *blockpos, mark_done))
 				/* Error message written in close_walfile() */
 				return false;
 
 			xlogoff = 0;
 
+			/*
+			 * Continuous input stream might cause long duration after the
+			 * previous feedback. Here is a good point to check if the time to
+			 * feedback has come since the fsync done in close_walfile() might
+			 * have taken long time.
+			 */
+			now = feGetCurrentTimestamp();
+			if (standby_message_timeout > 0 &&
+				feTimestampDifferenceExceeds(*last_status, now,
+											 standby_message_timeout))
+			{
+				if (!sendFeedback(conn, *blockpos, now, false))
+					return false;
+				*last_status = now;
+			}
+
 			if (still_sending && stream_stop(*blockpos, timeline, true))
 			{
 				if (PQputCopyEnd(conn, NULL) <= 0 || PQflush(conn))
-- 
2.1.0.GIT

#16

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#15)

Re: pg_basebackup may fail to send feedbacks.

On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, the attached is the v4 patch that checks feedback timing
every WAL segments boundary.

At Fri, 20 Feb 2015 17:29:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20150220.172914.241732690.horiguchi.kyotaro@lab.ntt.co.jp>

Some users may complain about the performance impact by such
frequent calls and we may want to get rid of them from
walreceiver loop in the future. If we adopt your idea now,
I'm afraid that it would tie our hands in that case.

How much impact can such frequent calls of gettimeofday()
have on replication performance? If it's not negligible,
probably we should remove them at first and find out another
idea to fix the problem you pointed. ISTM that it's not so
difficult to remove them. Thought? Do you have any numbers
which can prove that such frequent gettimeofday() has only
ignorable impact on the performance?

The attached patch is 'the more sober' version of SIGLARM patch.

I said that checking whether to send feedback every boundary
between WAL segments seemed too long but after some rethinking, I
changed my mind.

- The most large possible delay source in the busy-receive loop
is fsyncing at closing WAL segment file just written, this can
take several seconds. Freezing longer than the timeout
interval could not be saved and is not worth saving anyway.

- 16M bytes-disk-writes intervals between gettimeofday() seems
to be gentle enough even on platforms where gettimeofday() is
rather heavy.

Sounds reasonable to me.

So we don't need to address the problem in walreceiver side so proactively
because it tries to send the feedback every after flushing the WAL records.
IOW, the problem you observed is less likely to happen. Right?

+            now = feGetCurrentTimestamp();
+            if (standby_message_timeout > 0 &&

Minor comment: should feGetCurrentTimestamp() be called after the check of
standby_message_timeout > 0, to avoid unnecessary calls of that?

 ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
                    XLogRecPtr *blockpos, uint32 timeline,
                    char *basedir, stream_stop_callback stream_stop,
-                   char *partial_suffix, bool mark_done)
+                   char *partial_suffix, bool mark_done,
+                   int standby_message_timeout, int64 *last_status)

Maybe it's time to refactor this ugly coding (i.e., currently many arguments
need to be given to each functions. Looks ugly)...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 11 years ago

In reply to: Fujii Masao (#16)

1 attachment(s)

Re: pg_basebackup may fail to send feedbacks.

Hi, the attached is the v5 patch.

- Do feGetCurrentTimestamp() only when necessary.
- Rebased to current master

At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG1tJHpG03oZgwoKxt5wYD5v4S3HuTgSx7RotBhHnjwJw@mail.gmail.com>

On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, the attached is the v4 patch that checks feedback timing
every WAL segments boundary.

I said that checking whether to send feedback every boundary
between WAL segments seemed too long but after some rethinking, I
changed my mind.

- The most large possible delay source in the busy-receive loop
is fsyncing at closing WAL segment file just written, this can
take several seconds. Freezing longer than the timeout
interval could not be saved and is not worth saving anyway.

- 16M bytes-disk-writes intervals between gettimeofday() seems
to be gentle enough even on platforms where gettimeofday() is
rather heavy.

Sounds reasonable to me.

So we don't need to address the problem in walreceiver side so proactively
because it tries to send the feedback every after flushing the WAL records.
IOW, the problem you observed is less likely to happen. Right?
+            now = feGetCurrentTimestamp();
+            if (standby_message_timeout > 0 &&

Surely it would hardly happen, especially on ordinary
configuration.

However, the continuous receiving of the replication stream is a
quite normal behavior even if hardly happens. So the receiver
should guarantee the feedbacks to be sent by logic as long as it
is working normally, as long as the code for the special case
won't be too large and won't take too long time:).

The current walreceiver doesn't look trying to send feedbacks
after flushing every wal records. HandleCopyStream will
restlessly process the records in a gapless replication stream,
sending feedback only when keepalive packet arrives. It is the
fact that the response to the keepalive would help greatly but it
is not what the keepalives are for. It is intended to be used to
confirm if a silent receiver is still alive.

Even with this fix, the case that one write or flush takes longer
time than the feedback interval cannot be saved, but it would be
ok since it should be regarded as disorder.

Minor comment: should feGetCurrentTimestamp() be called after the check of
standby_message_timeout > 0, to avoid unnecessary calls of that?

Ah, you're right. I'll fixed it.

ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
XLogRecPtr *blockpos, uint32 timeline,
char *basedir, stream_stop_callback stream_stop,
-                   char *partial_suffix, bool mark_done)
+                   char *partial_suffix, bool mark_done,
+                   int standby_message_timeout, int64 *last_status)

Maybe it's time to refactor this ugly coding (i.e., currently many arguments
need to be given to each functions. Looks ugly)...

I'm increasing the ugliness:(

XLog stuff seems to need to share many states widely to work. But
the parameter list of the function looks to be bearable to this
extent, to me:).

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

basebackup_reply_fix_mst_v5_WALSEG.patchtext/x-patch; charset=us-asciiDownload

>From ef7b04c9ddf351ca99736d9ec9fa1954383cd124 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 24 Feb 2015 17:52:01 +0900
Subject: [PATCH] Make effort to send feedback regulary on heavy load.

pg_basebackup and pg_receivexlog might be forced to omit sending
feedback for long time by continuous replication stream caused by
possible heavy load on receiver side. Keep alives from the server
could be delayed on such a situation. This patch let them make efforts
to send feedback on such a situation. On every boundary between WAL
segments, send feedback if so the time has come just after syncing and
closing the segment just finished.
---
 src/bin/pg_basebackup/receivelog.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..df51f9d 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -45,7 +45,8 @@ static bool ProcessKeepaliveMsg(PGconn *conn, char *copybuf, int len,
 static bool ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 							   XLogRecPtr *blockpos, uint32 timeline,
 							   char *basedir, stream_stop_callback stream_stop,
-							   char *partial_suffix, bool mark_done);
+							   char *partial_suffix, bool mark_done,
+							   int standby_message_timeout, int64 *last_status);
 static PGresult *HandleEndOfCopyStream(PGconn *conn, char *copybuf,
 									   XLogRecPtr blockpos, char *basedir, char *partial_suffix,
 									   XLogRecPtr *stoppos, bool mark_done);
@@ -906,7 +907,8 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			{
 				if (!ProcessXLogDataMsg(conn, copybuf, r, &blockpos,
 										timeline, basedir, stream_stop,
-										partial_suffix, mark_done))
+										partial_suffix, mark_done,
+										standby_message_timeout, &last_status))
 					goto error;
 
 				/*
@@ -1115,7 +1117,8 @@ static bool
 ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 				   XLogRecPtr *blockpos, uint32 timeline,
 				   char *basedir, stream_stop_callback stream_stop,
-				   char *partial_suffix, bool mark_done)
+				   char *partial_suffix, bool mark_done,
+				   int standby_message_timeout, int64 *last_status)
 {
 	int			xlogoff;
 	int			bytes_left;
@@ -1223,12 +1226,31 @@ ProcessXLogDataMsg(PGconn *conn, char *copybuf, int len,
 		/* Did we reach the end of a WAL segment? */
 		if (*blockpos % XLOG_SEG_SIZE == 0)
 		{
+			int64 now;
 			if (!close_walfile(basedir, partial_suffix, *blockpos, mark_done))
 				/* Error message written in close_walfile() */
 				return false;
 
 			xlogoff = 0;
 
+			/*
+			 * Continuous input stream might cause long duration after the
+			 * previous feedback. Here is a good point to check if the time to
+			 * feedback has come because the fsync done in close_walfile()
+			 * might have taken long time.
+			 */
+			if (standby_message_timeout > 0)
+			{
+				now = feGetCurrentTimestamp();
+				if(feTimestampDifferenceExceeds(*last_status, now,
+												standby_message_timeout))
+				{
+					if (!sendFeedback(conn, *blockpos, now, false))
+						return false;
+					*last_status = now;
+				}
+			}
+
 			if (still_sending && stream_stop(*blockpos, timeline, true))
 			{
 				if (PQputCopyEnd(conn, NULL) <= 0 || PQflush(conn))
-- 
2.1.0.GIT

#18

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Kyotaro HORIGUCHI (#17)

Re: pg_basebackup may fail to send feedbacks.

On Tue, Mar 10, 2015 at 5:29 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hi, the attached is the v5 patch.

- Do feGetCurrentTimestamp() only when necessary.
- Rebased to current master

At Mon, 2 Mar 2015 20:21:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG1tJHpG03oZgwoKxt5wYD5v4S3HuTgSx7RotBhHnjwJw@mail.gmail.com>

On Tue, Feb 24, 2015 at 6:44 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, the attached is the v4 patch that checks feedback timing
every WAL segments boundary.

..
I said that checking whether to send feedback every boundary
between WAL segments seemed too long but after some rethinking, I
changed my mind.

- The most large possible delay source in the busy-receive loop
is fsyncing at closing WAL segment file just written, this can
take several seconds. Freezing longer than the timeout
interval could not be saved and is not worth saving anyway.

- 16M bytes-disk-writes intervals between gettimeofday() seems
to be gentle enough even on platforms where gettimeofday() is
rather heavy.

Sounds reasonable to me.

So we don't need to address the problem in walreceiver side so proactively
because it tries to send the feedback every after flushing the WAL records.
IOW, the problem you observed is less likely to happen. Right?
+            now = feGetCurrentTimestamp();
+            if (standby_message_timeout > 0 &&
Surely it would hardly happen, especially on ordinary
configuration.

However, the continuous receiving of the replication stream is a
quite normal behavior even if hardly happens. So the receiver
should guarantee the feedbacks to be sent by logic as long as it
is working normally, as long as the code for the special case
won't be too large and won't take too long time:).

The current walreceiver doesn't look trying to send feedbacks
after flushing every wal records. HandleCopyStream will
restlessly process the records in a gapless replication stream,
sending feedback only when keepalive packet arrives. It is the
fact that the response to the keepalive would help greatly but it
is not what the keepalives are for. It is intended to be used to
confirm if a silent receiver is still alive.

Even with this fix, the case that one write or flush takes longer
time than the feedback interval cannot be saved, but it would be
ok since it should be regarded as disorder.

Minor comment: should feGetCurrentTimestamp() be called after the check of
standby_message_timeout > 0, to avoid unnecessary calls of that?

Ah, you're right. I'll fixed it.

Even if the timeout has not elapsed yet, if synchronous mode is true, we should
send feedback here?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers