CREATE REPLICATION SLOT fails on a timeout

Started by Steve Singerover 11 years ago6 messages

steve@ssinger.info

over 11 years ago

1 attachment(s)

I am finding that my logical walsender connections are being terminated
due to a timeout on the CREATE REPLICATION SLOT command. with
"terminating walsender process due to replication timeout"

Below is the stack trace when this happens

#3 0x000000000067df28 in WalSndCheckTimeOut
(now=now@entry=453585463823871) at walsender.c:1748
#4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
#5 logical_read_xlog_page (state=<optimized out>,
targetPagePtr=691724288, reqLen=<optimized out>,
targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005",
pageTLI=<optimized out>) at walsender.c:741
#6 0x00000000004f41bf in ReadPageInternal (state=state@entry=0x17aa140,
pageptr=pageptr@entry=691724288,
reqLen=reqLen@entry=3600) at xlogreader.c:525
#7 0x00000000004f454a in XLogReadRecord (state=0x17aa140,
RecPtr=691727856, RecPtr@entry=0,
errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228
#8 0x0000000000675e5c in DecodingContextFindStartpoint
(ctx=ctx@entry=0x17a0358) at logical.c:460
#9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command (
cmd_string=cmd_string@entry=0x17f1478 "CREATE_REPLICATION_SLOT
\"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"")
at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x177db50, dbname=0x177db30 "test1",

(gdb) p last_reply_timestamp
$1 = 0

I propose the attached patch sets last_reply_timestamp to now() it
starts processing a command. Since receiving a command is hearing
something from the client.

Steve

Attachments:

timeout_init.difftext/x-patch; name=timeout_init.diffDownload

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..56a2f10
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** exec_replication_command(const char *cmd
*** 1276,1281 ****
--- 1276,1282 ----
  								  parse_rc))));
  
  	cmd_node = replication_parse_result;
+ 	last_reply_timestamp = GetCurrentTimestamp();
  
  	switch (cmd_node->type)
  	{

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Steve Singer (#1)

Re: CREATE REPLICATION SLOT fails on a timeout

Hi,

On 2014-05-16 16:37:16 -0400, Steve Singer wrote:

I am finding that my logical walsender connections are being terminated due
to a timeout on the CREATE REPLICATION SLOT command. with "terminating
walsender process due to replication timeout"

Below is the stack trace when this happens

#3 0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871)
at walsender.c:1748
#4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
...
#9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command () at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x177db50, dbname=0x177db30 "test1",

(gdb) p last_reply_timestamp
$1 = 0

I propose the attached patch sets last_reply_timestamp to now() it starts
processing a command. Since receiving a command is hearing something from
the client.

Hm. Yes, that's a problem.

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..56a2f10
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** exec_replication_command(const char *cmd
*** 1276,1281 ****
--- 1276,1282 ----
parse_rc))));

cmd_node = replication_parse_result;
+ last_reply_timestamp = GetCurrentTimestamp();

switch (cmd_node->type)
{

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Steve Singer

steve@ssinger.info

over 11 years ago

In reply to: Andres Freund (#2)

1 attachment(s)

Re: CREATE REPLICATION SLOT fails on a timeout

On 05/16/2014 04:43 PM, Andres Freund wrote:

Hi,

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund

That makes sense.
A patch that does that is attached.

Steve

Attachments:

timeout_nonzero.difftext/x-patch; name=timeout_nonzero.diffDownload

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..d23f06b
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** WalSndCheckTimeOut(TimestampTz now)
*** 1738,1744 ****
  	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  										  wal_sender_timeout);
  
! 	if (wal_sender_timeout > 0 && now >= timeout)
  	{
  		/*
  		 * Since typically expiration of replication timeout means
--- 1738,1744 ----
  	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  										  wal_sender_timeout);
  
! 	if (last_reply_timestamp > 0 && wal_sender_timeout > 0 && now >= timeout)
  	{
  		/*
  		 * Since typically expiration of replication timeout means

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Steve Singer (#3)

1 attachment(s)

Re: CREATE REPLICATION SLOT fails on a timeout

On 2014-05-16 17:02:33 -0400, Steve Singer wrote:

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

That makes sense.
A patch that does that is attached.

I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...

Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.

Thanks for the bugreport!

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Don-t-pay-heed-to-wal_sender_timeout-while-creating-.patchtext/x-patch; charset=us-asciiDownload

>From 324d74e16dd8ee2a0fa977d92a269fd43a746196 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 17 May 2014 01:22:01 +0200
Subject: [PATCH] Don't pay heed to wal_sender_timeout while creating a
 decoding slot.

Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for
futher WAL using WalSndWaitForWal(). That used to always respect
wal_sender_timeout and kill the session when waiting long enough
because no feedback/ping messages can be sent while the slot is still
being created.
Add the notion that last_reply_timestamp = 0 means that the walsender
currently doesn't need timeout processing and add checks for it in a
couple of places.

Bugreport and initial patch by Steve Singer, revised by Andres Freund.
---
 src/backend/replication/walsender.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5c11d68..90394ce 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -148,9 +148,10 @@ static StringInfoData reply_message;
 static StringInfoData tmpbuf;
 
 /*
- * Timestamp of the last receipt of the reply from the standby.
+ * Timestamp of the last receipt of the reply from the standby. Set to 0 if a
+ * the process currently shouldn't be killed by wal_sender_timeout.
  */
-static TimestampTz last_reply_timestamp;
+static TimestampTz last_reply_timestamp = 0;
 
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
@@ -796,6 +797,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										logical_read_xlog_page,
 										WalSndPrepareWrite, WalSndWriteData);
 
+		/*
+		 * Signal that we don't need the timeout mechanism. We're just
+		 * creating the replication slot and don't yet accept feedback
+		 * messages or send keepalives. As we possibly need to wait for
+		 * further WAL the walsender would possibly be killed too soon.
+		 */
+		last_reply_timestamp = 0;
+
 		/* build initial snapshot, might take a while */
 		DecodingContextFindStartpoint(ctx);
 
@@ -1693,7 +1702,7 @@ WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;		/* 10 s */
 
-	if (wal_sender_timeout > 0)
+	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
@@ -1735,6 +1744,10 @@ WalSndCheckTimeOut(TimestampTz now)
 {
 	TimestampTz timeout;
 
+	/* don't bail out if we're doing something that doesn't require timeouts */
+	if (last_reply_timestamp <= 0)
+		return;
+
 	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
 										  wal_sender_timeout);
 
@@ -2879,7 +2892,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 {
 	TimestampTz ping_time;
 
-	if (wal_sender_timeout <= 0)
+	/*
+	 * Don't send keepalive messages if timeouts are globally disabled or
+	 * we're doing something not partaking in timeouts.
+	 */
+	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
 		return;
 
 	if (waiting_for_ping_response)
-- 
2.0.0.rc2.4.g1dc51c6.dirty

Noname

andres@anarazel.de

over 11 years ago

In reply to: Andres Freund (#4)

Re: CREATE REPLICATION SLOT fails on a timeout

Hi,

On 2014-05-17 01:34:25 +0200, Andres Freund wrote:

On 2014-05-16 17:02:33 -0400, Steve Singer wrote:

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

That makes sense.
A patch that does that is attached.

I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...

Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.

Thanks for the bugreport!

Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Steve Singer

steve@ssinger.info

over 11 years ago

In reply to: Noname (#5)

Re: CREATE REPLICATION SLOT fails on a timeout

On 05/28/2014 06:41 PM, Andres Freund wrote:

Hi,

Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!

The fix seems to work (I am no longer getting the timeout on slot creation)

Thanks

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers